I haven't done the "Ask a Jedi" thing for a while, but it is time for it to return. For those who don't remember what this is - it is simply me responding to user questions on the blog. As a general warning, if you email me with a question, I may respond on the blog instead of email, so be sure to check both. (Although I never share the full name of the person who wrote me.)
So with that said, Kevin had this interesting question:
What is the best (or a best) way to organize user data? I want to try to keep each user's content as separate as possible. In the past I've given each user their own folder when the account was created, and it worked well. It also prevents file name conflicts with other users who want to use the same filename. Is it at all practical to create a user folder for each user? I know the server can handle hundreds of millions of files/folders.

Well, normally I wouldn't make one folder per user. I'd use CFFILE's ability to make a unique file name on upload. That would mean you could put all files in one folder and the names won't conflict. CFFILE gives you access to the original name, so in the database you simply store both.
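That approach might look something like this - a sketch only, where the destination path, datasource name, and table/column names are all assumptions, not anything from Kevin's app:

```cfm
<!--- Sketch: upload with a unique server-side name; destination,
      datasource, and table/column names here are assumptions --->
<cffile action="upload"
        filefield="userFile"
        destination="C:\uploads\"
        nameConflict="makeUnique">

<!--- CFFILE exposes both the original and the saved name --->
<cfquery datasource="myDSN">
    INSERT INTO userFiles (userID, originalName, serverName)
    VALUES (
        <cfqueryparam value="#session.userID#" cfsqltype="cf_sql_integer">,
        <cfqueryparam value="#cffile.clientFile#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#cffile.serverFile#" cfsqltype="cf_sql_varchar">
    )
</cfquery>
```

With both names in the database, you can always serve the file back under the name the user expects.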
Now - since you have a lot of users, you may run into an issue. The OS does have an upper limit on files per folder. I forget the exact number - it may be something like 25k files per folder. Let's just use that as an estimate. At 25k, obviously you could hit the limit quickly.
So in that case you may want to use subfolders for each user. One option would be to simply use the user primary key (if it is a number). So your upload folder could be:
/uploads/9/
for userID=9. However - the same upper limit applies to folders as well.
Another option would be to create a catalog based on username. By catalog I mean a series of folders based on the letters in the username. So the username cfjedimaster would have this folder:
/uploads/c/cfjedimaster
or
/uploads/c/cf/cfjedimaster
This will reduce the number of entries per directory. Obviously you want to decide on a scheme ahead of time, since changing the system later on could be a problem.
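Building the catalog path is only a few lines of code. A sketch, where the root folder is an assumption:

```cfm
<!--- Sketch: derive the catalog path from the username;
      the root folder is an assumption --->
<cfset rootDir = "C:\uploads\">
<cfset username = lCase(form.username)>
<cfset userDir = rootDir & left(username, 1) & "\" & username>
<!--- e.g. username "cfjedimaster" gives C:\uploads\c\cfjedimaster --->
<cfif not directoryExists(userDir)>
    <cfdirectory action="create" directory="#userDir#">
</cfif>
```

Lowercasing the username first keeps "Bob" and "bob" from landing in different branches on a case-sensitive filesystem.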
You didn't ask - but here are a few other things I'd watch out for. Number one is security. If you are letting folks upload files, you want to be really careful about what they are allowed to send. You probably do not want to allow uploads of ColdFusion files. And don't just watch out for CFM files. Do you know if your server is configured to run PHP? You would want to block that as well.
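One defensive approach is a whitelist of allowed extensions rather than a blacklist of dangerous ones, so anything you didn't anticipate is blocked by default. A sketch, where the allowed list is an assumption:

```cfm
<!--- Sketch: whitelist check after a cffile upload;
      the allowed list is an assumption --->
<cfset allowedExtensions = "jpg,gif,png,pdf,txt,zip">
<cfif not listFindNoCase(allowedExtensions, cffile.serverFileExt)>
    <!--- Remove the offending file and stop processing --->
    <cffile action="delete" file="#cffile.serverDirectory#\#cffile.serverFile#">
    <cfthrow message="That file type is not allowed.">
</cfif>
```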
I'd also strongly recommend watching your disk space. This could be done with a simple scheduled task that sends an email with the current free space.
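A sketch of such a task - the path, threshold, and email addresses are assumptions, and java.io.File's getFreeSpace() requires Java 6 or later:

```cfm
<!--- Sketch for a scheduled task; path, threshold, and addresses
      are assumptions. getFreeSpace() requires Java 6+. --->
<cfset uploadVolume = createObject("java", "java.io.File").init("C:\uploads")>
<cfset freeGB = uploadVolume.getFreeSpace() / (1024 ^ 3)>
<cfif freeGB lt 10>
    <cfmail to="admin@example.com" from="server@example.com"
            subject="Low disk space: #numberFormat(freeGB, '0.0')# GB free">
        The uploads volume is running low on space.
    </cfmail>
</cfif>
```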
Archived Comments
I would also recommend putting the uploaded files in a folder on the server that is not in the webroot. You can use cffile and cfcontent to serve up the files.
This removes the ability to directly invoke a file from outside your website.
As for naming conventions, I like the user name as the folder. This is typical of unix/linux based systems.
I have used the current date for the subfolders, like \username\12-28-2006.
It gives you a little more organization.
I like using the user id for subfolders for a few reasons, like Ray pointed out. I'd never thought to use a catalog, but I also never thought about the upper folder limit!
Anyhow, another couple of reasons to organize by user name/id: obviously several users could have text.txt or whatever without making you rename it to text89.txt - which, if someone downloads it, would no longer match the name the author says to open (in the case of multiple files all self-referencing, like in a suite of documents).
Another cool usage is for admin reporting. Using cfdirectory/cffile etc you can report what users are potentially abusing file uploads by getting counts and estimated sizing. Contrarily you can see what users are not taking advantage of it (which may meet a need depending on the application) and report that to an admin as well.
A last use is if you maybe had a user control panel they could one click see a list of every file they've ever uploaded by one cfdirectory then loop. There would be no database interaction really required (depending on your application).
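That control-panel listing could be a single cfdirectory call. A sketch, with the user's folder path assumed:

```cfm
<!--- Sketch: one directory call lists everything a user has uploaded;
      the folder path is an assumption --->
<cfdirectory action="list" directory="C:\uploads\c\cfjedimaster"
             name="qFiles" type="file" sort="name asc">
<cfoutput query="qFiles">
    #name# - #size# bytes, last modified #dateLastModified#<br>
</cfoutput>
```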
I have built a number of varying-scale document library/file library style applications over the last few years. I use Ray's "catalog" approach for storing the files. I also use the "accept" attribute of CFFILE, along with some other error-checking code I have written, to ensure that the files are of only a given type (.txt, .zip, .pdf).
I have locked it down to those types, as most of the Office (MS Office) documents can execute code embedded in themselves.
I am working on a way to call the command-line AVG virus scanner to scan a given file when the upload completes. The file would upload to a quarantine area, be scanned, and then be moved to the actual catalog area on a successfully completed scan.
To get past the issue where the user gets a different filename than expected (as noted in DK's comment), we use code like this, and the browser saves the file under the right name.
<cfheader name="content-disposition" value="attachment; filename=#qFileData.OrigFileName#">
<cfcontent type="#qFileData.MIMEType#" file="#qFileData.ServerFileName#">
I worked for a company that set up folders based on username, and every time someone updated their username we had to create a new directory, move all their files, and then delete the old folder - which was a pain. So I would recommend using the primary key of the table, or hashing/generating a UUID and storing the folder name in the database.
Joe, why didn't you just rename the directory? One line of code.
On an NTFS volume there is no per-folder limit - you can have as many files as you like in a single folder, up to the total of 4,294,967,295 files per NTFS volume.
Cool TJ. Although - I'd probably NOT put even close to that many files in one folder. I'd imagine a simple cfdirectory call would take forever.
LOL, I'd not recommend it either. By a simple calculation based on 10ms per file write, it would take close to 2 years to write that many files to a disk. And that is assuming your server will only take 10ms to write each file :)
I recently had a project that created and read hundreds of thousands of cached files. Putting them all in one directory results in VERY long cfdirectory call times - at about 10k files it starts to bog down. Surprisingly, it was still only a handful of seconds for 200k+ files. I broke it down similar to how Ray describes, and with each of the roughly 11k folders holding far fewer files, the speeds were back to what I would consider normal. I opted to have my folder structure use the name of the catalog data and not a numerical ID, mostly for search engine optimization, as the text would help with Google and whatnot.
I'm not so sure that if I had to deal with thousands of files I'd want to use cfdirectory for each user request anyway. I'd likely set up a directory watcher gateway (Ray has a post here somewhere) and make the gateway add each new file to a table of files, or delete them as necessary. Then users would just be querying a table, which is many, many times faster.
re: Joe, why didn't you just rename the directory? One line of code.
I actually oversimplified the example and didn't realize that it undercut my point; the folder structure was actually the first letter of the username followed by the username, like so:
"/userFiles/j/joe"
Which was silly from the get-go.
@ blowtoad - does that mean you are also storing in a db somewhere the original file name along with an entry on the uploaded file?
I have about 10 million documents in my system, and I use something like this: \cabinetName\YYYY\MM\DD\filename.blah
I just store who owns each document in the database instead of in the filesystem, and I store the file size in the DB as well, so I can simply query total size by user.
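A date-cabinet path like that can be assembled at upload time. A sketch, where the drive and cabinet name are assumptions:

```cfm
<!--- Sketch: build a \cabinetName\YYYY\MM\DD\ path;
      "D:\docs" and "invoices" are assumptions --->
<cfset d = now()>
<cfset docPath = "D:\docs\invoices\#year(d)#\#numberFormat(month(d), '00')#\#numberFormat(day(d), '00')#\">
<cfif not directoryExists(docPath)>
    <cfdirectory action="create" directory="#docPath#">
</cfif>
<!--- With size in the DB, per-user usage is a simple aggregate, e.g.:
      SELECT userID, SUM(fileSize) AS totalBytes
      FROM documents GROUP BY userID --->
```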
A benefit of a directory per user is that, if necessary, you can provide a subset of users with FTP access if they need to upload hundreds of files, etc. It is also easier to look at the directory structure and have it be meaningful. For the same reason I tend to upload files keeping their names, just doing a rename on nameConflict - anything else makes it harder to pull that information together, which to me makes the system a little more fragile.
I also prefer to use a business key (such as username) rather than something like a user ID, as again the directory system itself is more meaningful without the database, and writing a script to rename the directory when a username changes is no big deal. The only issue with this is if you make the directories web accessible - a user changes their username and then all their URLs break.
Simply put, using directories to manage user content should be the obvious decision. Not only do you benefit from simple organization, but you also take advantage of the OS's ability to manage those directories in special ways, such as symbolic links, mount points, NT junctions, etc.
In an ideal situation, such as a VPS or dedicated environment, two locations should be used: one referenced by the application containing nothing more than junctions/symbolic links to the actual data, which is stored on a separate volume. This way you can change your layout - say, add a new drive - simply by updating the links.
THANK YOU!
First, thanks to Ray and everyone's comments, I can now move forward with confidence that what I am doing is at least accepted by you guys as a good way to do it. That makes me feel better!
I have in the past used the cataloged approach, like Ray talked about (/a/alex/, /b/bob/), and it was pretty straightforward to implement. I only had about 3K users - so I was always wondering if it was a safe technique. As a rule, I do not allow username changes in my applications, so that avoids the moving-files issue. Too bad CFDIRECTORY doesn't have a MOVE action. A directory or file move on the same volume does not move the data, just updates the file system records of where that file is located - and is fast.
I don't think that I need to worry about cfdirectory results taking a long time. I can't see a reason to use cfdirectory in the front end or members area back end. I know the path for each user based on data stored in the DB, so I call it directly if/when needed. All the file info is stored in the DB also.
About storing the files off webroot, great idea. Again something that I have done in the past to control hot linking.
Well, Thanks again!
@kevin -
We implemented the "outside of webroot" storage for uploaded files as part of our team's "Best Practices" book.
Also, yes, CFDIRECTORY is WAY too slow for use for filename searching/listing/etc. We (like you) use the DB to handle that SORT of thing (pun intended).
As for the username-file/directory move scenario, I am with you on that score as well. We simply do not allow the users to "change" their username once their account is created.
As with anything, a little foresight in the design phase can save a multitude of heartache in the usage phase :-)
Hi,
In my experience I have never created directories based on the users themselves. I have always created a general storage structure in conjunction with a database. I did not store the files in the db, but merely stored in the db where each file was. The reasoning behind it was that I did not want a "power user" who uploaded 10x what everyone else did to mess anything up. My approach allowed me to control the number of files in each directory so that there was no overload on the system while trying to read the contents of the directories. It also allowed me to use the database to get directory information instead of using cfdirectory to examine the directories.
Just my 2 cents.
--Dave
Exactly: The db contains locations of files and file info. Cfdirectory will never be needed except in maintenance scenarios.
Breaking the site into subfolders by user name will prevent a single HUGE directory, and allow each user to use any filename.