I haven't done the "Ask a Jedi" thing for a while, but it is time for it to return. For those who don't remember what this is - it is simply me responding to user questions on the blog. As a general warning, if you email me with a question, I may respond on the blog instead of email, so be sure to check both. (Although I never share the full name of the person who wrote me.)
So with that said, Kevin had this interesting question:
What is the best (or a best) way to organize user data? I want to try to keep each user's content as separate as possible. In the past I've given each user their own folder when the account was created, and it worked well. It also prevents file name conflicts with other users who want to use the same filename. Is it at all practical to create a user folder for each user? I know the server can handle hundreds of millions of files/folders.

Well, normally I wouldn't make one folder per user. I'd use CFFILE's ability to make a unique file name on upload. That would mean you could put all files in one folder and the names won't conflict. CFFILE gives you access to the original name, so in the database you simply store both.
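That approach might look something like this - a sketch only, where the destination path, datasource name, and table/column names are all assumptions, not anything from Kevin's app:

```cfm
<!--- Sketch: upload with a unique server-side name; destination,
      datasource, and table/column names here are assumptions --->
<cffile action="upload"
        filefield="userFile"
        destination="C:\uploads\"
        nameConflict="makeUnique">

<!--- CFFILE exposes both the original and the saved name --->
<cfquery datasource="myDSN">
    INSERT INTO userFiles (userID, originalName, serverName)
    VALUES (
        <cfqueryparam value="#session.userID#" cfsqltype="cf_sql_integer">,
        <cfqueryparam value="#cffile.clientFile#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#cffile.serverFile#" cfsqltype="cf_sql_varchar">
    )
</cfquery>
```

With both names in the database, you can always serve the file back under the name the user expects.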
Now - since you have a lot of users, you may run into an issue. The OS does have an upper limit on files per folder. I forget the exact number - it may be something like 25k files per folder. Let's just use that as an estimate. At 25k, obviously you could hit the limit quickly.
So in that case you may want to use subfolders for each user. One option would be to simply use the user primary key (if it is a number). So your upload folder could be:
/uploads/9/
for userID=9. However - the same upper limit applies to folders as well.
Another option would be to create a catalog based on username. By catalog I mean a series of folders based on the letters in the username. So the username cfjedimaster would have this folder:
/uploads/c/cfjedimaster
or
/uploads/c/cf/cfjedimaster
This will reduce the number of entries per directory. Obviously you want to decide on a scheme ahead of time, since changing the system later on could be a problem.
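Building the catalog path is only a few lines of code. A sketch, where the root folder is an assumption:

```cfm
<!--- Sketch: derive the catalog path from the username;
      the root folder is an assumption --->
<cfset rootDir = "C:\uploads\">
<cfset username = lCase(form.username)>
<cfset userDir = rootDir & left(username, 1) & "\" & username>
<!--- e.g. username "cfjedimaster" gives C:\uploads\c\cfjedimaster --->
<cfif not directoryExists(userDir)>
    <cfdirectory action="create" directory="#userDir#">
</cfif>
```

Lowercasing the username first keeps "Bob" and "bob" from landing in different branches on a case-sensitive filesystem.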
You didn't ask - but here are a few other things I'd watch out for. Number one is security. If you are letting folks upload files, you want to be really careful about what they are allowed to send. You probably do not want to allow uploads of ColdFusion files. And don't just watch out for CFM files. Do you know if your server is configured to run PHP? You would want to block that as well.
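One defensive approach is a whitelist of allowed extensions rather than a blacklist of dangerous ones, so anything you didn't anticipate is blocked by default. A sketch, where the allowed list is an assumption:

```cfm
<!--- Sketch: whitelist check after a cffile upload;
      the allowed list is an assumption --->
<cfset allowedExtensions = "jpg,gif,png,pdf,txt,zip">
<cfif not listFindNoCase(allowedExtensions, cffile.serverFileExt)>
    <!--- Remove the offending file and stop processing --->
    <cffile action="delete" file="#cffile.serverDirectory#\#cffile.serverFile#">
    <cfthrow message="That file type is not allowed.">
</cfif>
```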
I'd also strongly recommend watching your disk space. This could be done with a simple scheduled task that sends an email with the current free space.
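A sketch of such a task - the path, threshold, and email addresses are assumptions, and java.io.File's getFreeSpace() requires Java 6 or later:

```cfm
<!--- Sketch for a scheduled task; path, threshold, and addresses
      are assumptions. getFreeSpace() requires Java 6+. --->
<cfset uploadVolume = createObject("java", "java.io.File").init("C:\uploads")>
<cfset freeGB = uploadVolume.getFreeSpace() / (1024 ^ 3)>
<cfif freeGB lt 10>
    <cfmail to="admin@example.com" from="server@example.com"
            subject="Low disk space: #numberFormat(freeGB, '0.0')# GB free">
        The uploads volume is running low on space.
    </cfmail>
</cfif>
```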
Archived Comments
I would also recommend putting the uploaded files in a folder on the server that is not in the webroot. You can use cffile and cfcontent to serve up the files.
This removes the ability to directly invoke a file from outside your website.
As for naming conventions, I like the user name as the folder. This is typical of unix/linux based systems.
I have used the current date for the subfolders, like \username\12-28-2006.
It gives you a little more organization.
I like using the user id for subfolders for a few reasons, like Ray pointed out. I'd never thought to use a catalog, but I also never thought about the upper folder limit!
Anyhow, another couple of reasons to organize by user name/id: obviously several users could have text.txt or whatever without making you rename it to text89.txt - which, if someone downloads it, would no longer match the name the author says to open (in the case of multiple files all self-referencing, like in a suite of documents).
Another cool usage is for admin reporting. Using cfdirectory/cffile etc you can report what users are potentially abusing file uploads by getting counts and estimated sizing. Contrarily you can see what users are not taking advantage of it (which may meet a need depending on the application) and report that to an admin as well.
A last use is if you maybe had a user control panel they could one click see a list of every file they've ever uploaded by one cfdirectory then loop. There would be no database interaction really required (depending on your application).
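That control-panel listing could be a single cfdirectory call. A sketch, with the user's folder path assumed:

```cfm
<!--- Sketch: one directory call lists everything a user has uploaded;
      the folder path is an assumption --->
<cfdirectory action="list" directory="C:\uploads\c\cfjedimaster"
             name="qFiles" type="file" sort="name asc">
<cfoutput query="qFiles">
    #name# - #size# bytes, last modified #dateLastModified#<br>
</cfoutput>
```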
I have built a number of varying-scale document library/file library style applications over the last few years. I use Ray's "catalog" approach for storing the files. I also use the "accept" attribute of CFFILE, along with some other error-checking code I have written, to ensure that the files are of only a given type (.txt, .zip, .pdf).
I have locked it down to those types, as most of the Office (MS Office) documents can execute code embedded in themselves.
I am working on a way to call the command-line AVG virus scanner to scan a given file when the upload completes. The file would upload to a quarantine area, be scanned, and then be moved to the actual catalog area on a successfully completed scan.
To get past the issue where the user gets a different filename than expected (as noted in DK's comment), we use code like this, and the browser saves the file under the right name.
<cfheader name="content-disposition" value="attachment; filename=#qFileData.OrigFileName#">
<cfcontent type="#qFileData.MIMEType#" file="#qFileData.ServerFileName#">
I worked for a company that set up folders based on username, and every time someone updated their username we had to create a new directory, move all their files, and then delete the old folder - which was a pain. So I would recommend using the primary key of the table, or hashing/generating a UUID and storing the folder name in the database.
Joe, why didn't you just rename the directory? One line of code.
On an NTFS volume there is no per-folder limit - you can have as many files as you like in a single folder, up to the total of 4,294,967,295 files per NTFS volume.
Cool TJ. Although - I'd probably NOT put even close to that many files in one folder. I'd imagine a simple cfdirectory call would take forever.
LOL, I'd not recommend it either. By a simple calculation based on 10ms per file write, it would take close to 2 years to write that many files to a disk. And that is assuming your server will only take 10ms to write each file :)
I recently had a project that created and read hundreds of thousands of cached files. Putting them all in one directory results in VERY long cfdirectory call times - at about 10k files it starts to bog down. Surprisingly, it was still only a handful of seconds for 200k+ files. I broke it down similar to how Ray describes, and with each of the roughly 11k folders holding far fewer files, the speeds were back to what I would consider normal. I opted to have my folder structure use the name of the catalog data and not a numerical ID, mostly for search engine optimization, as the text would help with Google and whatnot.
I'm not so sure that if I had to deal with thousands of files I'd want to use cfdirectory for each user request anyway. I'd likely set up a directory watcher gateway (Ray has a post here somewhere) and make the gateway add each new file to a table of files, or delete them as necessary. Then users would just be querying a table, which is many, many times faster.
re: Joe, why didn't you just rename the directory? One line of code.
I actually oversimplified the example and didn't realize that it undercut my point; the folder structure was actually the first letter of the username followed by the username, like so:
"/userFiles/j/joe"
Which was silly from the get-go.
@ blowtoad - does that mean you are also storing in a db somewhere the original file name along with an entry on the uploaded file?
I have about 10 million documents in my system, and I use something like this: \cabinetName\YYYY\MM\DD\filename.blah
I just store who owns each document in the database instead of in the filesystem, and I store the file size in the DB as well, so I can simply query total size by user.
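A date-cabinet path like that can be assembled at upload time. A sketch, where the drive and cabinet name are assumptions:

```cfm
<!--- Sketch: build a \cabinetName\YYYY\MM\DD\ path;
      "D:\docs" and "invoices" are assumptions --->
<cfset d = now()>
<cfset docPath = "D:\docs\invoices\#year(d)#\#numberFormat(month(d), '00')#\#numberFormat(day(d), '00')#\">
<cfif not directoryExists(docPath)>
    <cfdirectory action="create" directory="#docPath#">
</cfif>
<!--- With size in the DB, per-user usage is a simple aggregate, e.g.:
      SELECT userID, SUM(fileSize) AS totalBytes
      FROM documents GROUP BY userID --->
```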
A benefit of a directory per user is that, if necessary, you can provide a subset of users with FTP access if they need to upload hundreds of files, etc. It is also easier to look at the directory structure and have it be meaningful. For the same reason I tend to upload files keeping their names, just doing a rename on nameConflict - anything else makes it harder to pull that information together, which to me makes the system a little more fragile.
I also prefer to use a business key (such as username) rather than something like a user ID, as again the directory system itself is more meaningful without the database, and writing a script to rename the directory when a username changes is no big deal. The only issue with this is if you make the directories web accessible - a user changes their username and then all their URLs break.
Simply put, using directories to manage user content should be the obvious decision. Not only do you benefit from simple organization, but you also take advantage of the OS's ability to manage those directories in special ways, such as symbolic links, mount points, NT junctions, etc.
In an ideal situation, such as a VPS or dedicated environment, two locations should be used: one referenced by the application containing nothing more than junctions/symbolic links to the actual data, which is stored on a separate volume. This way you can change your layout - say, add a new drive - simply by updating the links.
THANK YOU!
First, thanks to Ray and everyone's comments, I can now move forward with confidence that what I am doing is at least accepted by you guys as a good way to do it. That makes me feel better!
I have in the past used the cataloged approach, like Ray talked about (/a/alex/, /b/bob/), and it was pretty straightforward to implement. I only had about 3K users - so I was always wondering if it was a safe technique. As a rule, I do not allow username changes in my applications, so that avoids the moving-files issue. Too bad CFDIRECTORY doesn't have a MOVE action. A directory or file move on the same volume does not move the data, just updates the file system records of where that file is located - and is fast.
I don't think that I need to worry about cfdirectory results taking a long time. I can't see a reason to use cfdirectory in the front end or members area back end. I know the path for each user based on data stored in the DB, so I call it directly if/when needed. All the file info is stored in the DB also.
About storing the files off webroot, great idea. Again something that I have done in the past to control hot linking.
Well, Thanks again!
@kevin -
We implemented the "outside of webroot" storage for uploaded files as part of our team's "Best Practices" book.
Also, yes, CFDIRECTORY is WAY too slow for use for filename searching/listing/etc. We (like you) use the DB to handle that SORT of thing (pun intended).
As for the username-file/directory move scenario, I am with you on that score as well. We simply do not allow the users to "change" their username once their account is created.
As with anything, a little foresight in the design phase can save a multitude of heartache in the usage phase :-)
Hi,
In my experience I have never created directories based on the users themselves. I have always created a general storage structure in conjunction with a database. I did not store the files in the db, but merely stored in the db where each file was. The reasoning behind it was that I did not want a "power user" who uploaded 10x what everyone else did to mess anything up. My approach allowed me to control the number of files in each directory so that there was no overload on the system while trying to read the contents of the directories. It also allowed me to use the database to get directory information instead of using cfdirectory to examine the directories.
Just my 2 cents.
--Dave
Exactly: The db contains locations of files and file info. Cfdirectory will never be needed except in maintenance scenarios.
Breaking the site into subfolders by user name will prevent a single HUGE directory, and allow each user to use any filename.