Jeff sent me an interesting question last Friday involving writing out large amounts of data to a text file in ColdFusion. He had to read in thousands of files and append them to a single file, and he was curious about what he could do to speed up the process. I wasn't really sure what to suggest, outside of making sure he used cfsetting requesttimeout to give his script time to process, but he wrote back and said he had some success using Java to write out the file data. This led me to do a bit of digging myself. I know that the new file functions (added in ColdFusion 8) make use of higher performing code behind the scenes. For example, if you use cffile to read in a multi-gigabyte file, then ColdFusion has to store all of that data in RAM. But if you make use of fileOpen and fileReadLine, you can suck in parts of the file at a time. Shoot, you can even use fileSeek (in ColdFusion 9) to jump ahead. All of this works very well, but it is focused on the read side of the equation. How about writing? I whipped up a simple test to see the different ways I could write to a file and how differently the approaches would perform.
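The read-side difference described above can be illustrated outside of CFML. The following is a minimal Java sketch, not ColdFusion's actual implementation: the class and method names are hypothetical, and it only assumes that line-at-a-time reading behaves like a buffered reader that holds one line in memory at a time instead of the whole file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: names are illustrative, not part of ColdFusion.
class LineStreamingSketch {

    // Loads the entire file into a single String before doing anything with it.
    static long countCharsAllAtOnce(Path file) throws IOException {
        String everything = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return everything.length();
    }

    // Reads one line at a time, so only the current line is held in memory.
    static long countLinesStreaming(Path file) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }
}
```

The first method needs memory proportional to the file size; the second needs memory proportional to the longest line.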
I began my test script by ensuring it would have enough time to run:
<cfsetting requesttimeout="999">
Next I output some whitespace junk. I'm going to be using cfflush and discovered that Chrome, like Internet Explorer, likes to get "enough" content before it renders anything.
<cfoutput>#repeatString(" ", 250)#</cfoutput><cfflush>
Here is my first test:
<cfset string = repeatString(createUUID(), 10)>
<cfset theFile = expandPath("./data.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>

<cfset thisTick = getTickCount()>
<cfloop index="x" from="1" to="200000">
	<cffile action="append" file="#theFile#" output="#string#">
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cfset finalTick = getTickCount() - thisTick>

<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>
I created a string based on a UUID repeated 10 times. I set my file name and then loop from 1 to 200,000, using the append form of cffile to write data to the file. That little cfif condition in there is just a simple way for me to monitor the progress of my test: by outputting a hash mark every one thousand iterations, I can get an idea of how quickly my test is running. I wrap the meat of this with a pair of getTickCount() calls so I can time the process.
This test took 70,222 ms to run.
Ok, so how about using the new(ish) file functions? Here's my next test.
<cfset theFile = expandPath("./data2.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>

<cfset thisTick = getTickCount()>
<cfset fileOb = fileOpen(theFile, "append")>
<cfloop index="x" from="1" to="200000">
	<cfset fileWriteLine(fileOb, string)>
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cfset fileClose(fileOb)>
<cfset finalTick = getTickCount() - thisTick>

<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>
I create a file object opened in append mode, use fileWriteLine to append my text, and finally close the file object. So how did this perform?
This test took 1,622 ms to run.
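One plausible explanation for the gap is that each cffile append has to open and close the file, while fileOpen keeps a single handle open for the whole loop. That is an assumption, not something the timings prove. A minimal Java sketch of the two patterns, with hypothetical class and method names:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; not how ColdFusion implements cffile or fileWriteLine.
class AppendPatterns {

    // Pattern 1: open, append, and close the file on every iteration.
    static void appendReopening(Path file, String line, int count) throws IOException {
        byte[] bytes = (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < count; i++) {
            Files.write(file, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    // Pattern 2: open the file once, write every line through a buffer, close once.
    static void appendWithOneHandle(Path file, String line, int count) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < count; i++) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}
```

Both methods produce the same file. The second avoids the per-iteration open and close and lets the buffer batch many small writes into fewer system calls.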
Bit faster, eh? Then I tried something else. I thought - what would happen if I built up a large string and just wrote to the file system once? I knew that normal string concatenation wouldn't work, as string operations in general aren't very performant, so I used a Java StringBuilder instead.
<cfset theFile = expandPath("./data3.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>

<cfset thisTick = getTickCount()>
<cfset s = createObject("java","java.lang.StringBuilder")>
<cfset newString = string & chr(13)>
<cfloop index="x" from="1" to="200000">
	<cfset s.append(newString)>
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cffile action="write" file="#theFile#" output="#s.toString()#">
<cfset finalTick = getTickCount() - thisTick>

<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>
This test took 1,658 ms to run.
Now that's pretty interesting. In every iteration of my test, the StringBuilder version was always very close to the fileWriteLine version. Always slower, but not by enough to really matter. The main difference, though, is that I've got one variable taking up a large amount of RAM. In theory, this could take all the RAM available to the JVM. (Keep in mind the JVM is not an area I'm strong in. This is where I typically send people to Mike Brunt. ;)
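To put a rough number on that RAM concern: a ColdFusion UUID is 35 characters, so each appended string is 351 characters including the chr(13), and 200,000 appends come to about 70 million characters. The Java sketch below is back-of-the-envelope arithmetic only. It assumes two bytes per character (UTF-16 storage, as on older JVMs), which gives roughly 134 MB, and it ignores the StringBuilder's spare capacity and the extra copy that toString() makes, so real usage would likely be higher. The class name is hypothetical.

```java
// Hypothetical estimate of the in-memory size of the StringBuilder test.
class BufferEstimate {

    // Characters held after `iterations` appends of a string of `charsPerAppend` chars.
    static long totalChars(long iterations, long charsPerAppend) {
        return iterations * charsPerAppend;
    }

    // Assumes 2 bytes per char (UTF-16); newer JVMs may store Latin-1 text more compactly.
    static long approxBytes(long chars) {
        return chars * 2;
    }

    public static void main(String[] args) {
        long charsPerAppend = 35 * 10 + 1; // UUID repeated 10 times, plus chr(13)
        long chars = totalChars(200_000, charsPerAppend);
        long bytes = approxBytes(chars);
        System.out.println(chars + " chars, about " + (bytes / (1024 * 1024)) + " MB");
    }
}
```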
I'll include the entire test script below, but the tests verify what I expected. The newer file functions work much better for both reading and writing. Any comments?
<cfsetting requesttimeout="999">
<cfoutput>#repeatString(" ", 250)#</cfoutput><cfflush>

<cfset string = repeatString(createUUID(), 10)>
<cfset theFile = expandPath("./data.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>

<cfset thisTick = getTickCount()>
<cfloop index="x" from="1" to="200000">
	<cffile action="append" file="#theFile#" output="#string#">
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cfset finalTick = getTickCount() - thisTick>

<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>

<hr>

<cfset theFile = expandPath("./data2.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>

<cfset thisTick = getTickCount()>
<cfset fileOb = fileOpen(theFile, "append")>
<cfloop index="x" from="1" to="200000">
	<cfset fileWriteLine(fileOb, string)>
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cfset fileClose(fileOb)>
<cfset finalTick = getTickCount() - thisTick>

<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>

<hr>

<cfset theFile = expandPath("./data3.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>

<cfset thisTick = getTickCount()>
<cfset s = createObject("java","java.lang.StringBuilder")>
<cfset newString = string & chr(13)>
<cfloop index="x" from="1" to="200000">
	<cfset s.append(newString)>
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cffile action="write" file="#theFile#" output="#s.toString()#">
<cfset finalTick = getTickCount() - thisTick>

<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>
Archived Comments
Depending on the requirements, you might be able to drop to the OS and just use the OS to merge the files together--which is going to be way faster than using CF.
In Windows, you can do something like the following from a command prompt:
copy /b *.txt big-old-file.txt
Or:
copy /b file1.txt +file2.txt +file3.txt merged1-3.txt
Now obviously if you have to do some parsing to the files before merging them, this may not work for you. But if the goal is to just merge a ton of individual files into a single file, you can do this very easily using command line tools.
As for your solutions, if the outputted file is going to be very large, I'm not a fan of trying to create everything in memory--because it can cause issues w/the JVM. I'm a much bigger fan of appending the data to a file and writing the data in controlled chunks. It's more work, but will work reliably even if you need to merge really large text files (that you don't want to read all in memory at one time.)
You can use java.io.LineNumberReader (which your http://www.cflib.org/udf/Fi... UDF uses) to pull in chunks from an external file, then append those chunks to another file.
This could be overkill for this project, but I thought I'd mention it.
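A minimal Java sketch of that chunked approach, assuming plain text inputs: it reads each source file through java.io.LineNumberReader and appends a fixed number of lines at a time to the target, so only one chunk is in memory at once. The class name, method names, and chunk-size parameter are hypothetical, and this is not the code of the UDF mentioned above.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.LineNumberReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class ChunkedMerge {

    // Appends every line of each source file to target, chunkSize lines at a time
    // (chunkSize is expected to be at least 1). Returns the total number of lines copied.
    static long appendInChunks(List<Path> sources, Path target, int chunkSize) throws IOException {
        long copied = 0;
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Path source : sources) {
                try (LineNumberReader in = new LineNumberReader(
                        Files.newBufferedReader(source, StandardCharsets.UTF_8))) {
                    List<String> chunk = new ArrayList<>();
                    String line;
                    while ((line = in.readLine()) != null) {
                        chunk.add(line);
                        if (chunk.size() == chunkSize) {
                            writeChunk(out, chunk);
                        }
                    }
                    writeChunk(out, chunk);          // leftover lines, if any
                    copied += in.getLineNumber();    // lines read from this source
                }
            }
        }
        return copied;
    }

    // Writes the buffered lines and clears the chunk for reuse.
    private static void writeChunk(BufferedWriter out, List<String> chunk) throws IOException {
        for (String line : chunk) {
            out.write(line);
            out.newLine();
        }
        chunk.clear();
    }
}
```

Line endings in the output follow the platform default, and a final line without a terminator in a source file still gets one in the target.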
Interesting point Dan, and while "we" ("we" being the collective CF bloggers, speakers, etc) tend to mention not writing OS specific code, this is an example where for practical reasons it may make sense. So question - do you feel like running a speed test?
As to your 3rd idea (lineNumberReader), wouldn't you agree that UDF does not make sense anymore in CF8+? Certainly it is convenient (to go to line N), but since fileReadLine should be using similar code under the hood then it should perform well.
Very interesting Ray. In tests I've run comparing string concatenation to cfsavecontent and the java string buffer, cfsavecontent has been the fastest of all three methods. For string concatenation, I've seen times like 27 seconds vs. about 160ms or less for cfsavecontent, while using the java string buffer was around 420 ms. A caveat, this test was run using cf7.02 on a WinXP box. Let me know if you'd like a copy of the code and I can send it along.
Here are the results of one test (iteration of 50000 in each case):
String Concatenation
string & string: 28297ms
String Length: 650000
CFSaveContent
cfsavecontent: 156ms
String Length: 650001
Using Java String Buffer
java string buffer: 422ms
String Length: 650001
----
Here's the code:
<cfset runtime = CreateObject("java", "java.lang.Runtime").getRuntime()>
<cfset total_memory = runtime.totalMemory()>
<cfset runtime.gc()>
<cfset memory_before = (total_memory-runtime.freeMemory()) / 1024 / 1024>
<cfoutput>Memory Before: #round(memory_before)# Megs<br>
<cfset string1 = "">
<cftimer label="string & string" type="inline">
	<cfloop from="1" to="50000" index="i">
		<cfset string1 = string1 & "Hello World. ">
	</cfloop>
</cftimer>
<cfoutput>
String Length: #len(string1)#</cfoutput><br>
<cfset memory_after = (total_memory-runtime.freeMemory()) / 1024 / 1024>
Memory After: #round(memory_after)# Megs -- Increase of #round(memory_after - memory_before)# Megs<br>
<br />
</cfoutput>
<cfset runtime.gc()>
<cfset memory_before = (total_memory-runtime.freeMemory()) / 1024 / 1024>
<cfoutput>Memory Before: #round(memory_before)# Megs<br>
</cfoutput>
<cftimer label="cfsavecontent" type="inline">
<cfsavecontent variable="string2">
<cfloop from="1" to="50000" index="i">
<cfoutput>Hello World.</cfoutput>
</cfloop>
</cfsavecontent>
</cftimer>
<cfoutput>String Length: #len(string2)#</cfoutput><br>
<cfset memory_after = (total_memory-runtime.freeMemory()) / 1024 / 1024>
<cfoutput>Memory After: #round(memory_after)# Megs -- Increase of #round(memory_after - memory_before)# Megs<br></cfoutput>
regards,
larry
Just an addendum to my previous note. I decided to compare string concatenation with cfsavecontent using CF9 on a Windows 7 box. Here are the results:
Memory Before: 208 Megs
string & string: 19718ms String Length: 650000
Memory After: 252 Megs -- Increase of 43 Megs
Memory Before: 31 Megs
cfsavecontent: 32ms String Length: 650001
Memory After: 38 Megs -- Increase of 7 Megs
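For comparison, the same effect can be seen in plain Java. This is a rough sketch with hypothetical names, not a rigorous benchmark: repeated concatenation copies the whole string on every pass, while a StringBuilder appends into a growing buffer. Timings will vary by machine and JVM, so none are quoted here, and the iteration count in main is deliberately lower than the 50,000 used above to keep the run short.

```java
// Hypothetical micro-comparison; not a rigorous benchmark.
class ConcatComparison {

    // Builds the result with repeated concatenation (copies the string each time).
    static String withConcatenation(String piece, int iterations) {
        String result = "";
        for (int i = 0; i < iterations; i++) {
            result = result + piece;
        }
        return result;
    }

    // Builds the same result with a StringBuilder.
    static String withBuilder(String piece, int iterations) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < iterations; i++) {
            sb.append(piece);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int iterations = 10_000;

        long start = System.nanoTime();
        int len1 = withConcatenation("Hello World. ", iterations).length();
        long concatMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        int len2 = withBuilder("Hello World. ", iterations).length();
        long builderMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("concatenation: " + concatMs + " ms, length " + len1);
        System.out.println("StringBuilder: " + builderMs + " ms, length " + len2);
    }
}
```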
Hello Ray,
I found this entry by Ben Nadel very helpful. Writing Enormous Files Based On Massive Record Sets In ColdFusion @ http://bit.ly/gGqVJJ .
I added some comments at the bottom on how to change his process if you need to specify an encoding type like UTF-8.
Based on your tests above, I did this and it took 790 ms:
<cfscript>
thisTick = getTickCount();
// file to write
fileToWrite = expandPath('./test.txt');
// create the file
theFile = createObject("java","java.io.File").init(javaCast("string",fileToWrite));
// create my java file writer
fileWriter = createObject("java","java.io.BufferedWriter").
init(createObject("java","java.io.OutputStreamWriter").
init(createObject("java","java.io.FileOutputStream").
init(theFile),"UTF8"));
// create string object
stringToWrite = createObject("java","java.lang.StringBuffer").Init();
for(x=1; x<=200000; x++){
stringToWrite.Append(x & chr(13) & chr(10));
}
// write line
fileWriter.write(stringToWrite.toString());
// Flush the buffered output stream to make sure there is no straggling buffer data
fileWriter.flush();
// close the writer so the file handle is released
fileWriter.close();
// final count
finalTick = getTickCount() - thisTick;
// write output
writeOutput('<p>Took #finalTick# ms to write.');
</cfscript>
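A variation on the approach above, as a hedged Java sketch: instead of collecting everything in a StringBuffer first, each line goes straight through the BufferedWriter, and try-with-resources closes the writer even if a write fails. The class and method names are hypothetical, and no timing is claimed for it.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
class Utf8LineWriter {

    // Writes the numbers 1..count, one per line with CRLF endings, encoded as UTF-8.
    // An existing file at this path is overwritten.
    static void writeNumbers(Path file, int count) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (int x = 1; x <= count; x++) {
                writer.write(Integer.toString(x));
                writer.write("\r\n"); // same as chr(13) & chr(10) in the CFML version
            }
        } // closing the writer flushes any buffered data
    }
}
```

Because nothing is accumulated in memory, the footprint stays at the size of the writer's buffer regardless of how many lines are written.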
I would do this in my RDBMS. In MySQL, I would just use Navicat to spit the file out.
Ben's post goes down to Java - which would not be necessary in CF8 or higher I'd say. Interesting that yours ran even faster though - almost by half.
Sounds like what's really needed here, is a <cffile action="concatenate" input="#fileList#" file="#outputFile#" />. Especially if you don't need to do any modifications to the data, and are just cat'ing the files together.
One of the fastest ways to do this, would be to use the java nio package (in most cases should outperform using buffered streams), and I imagine that the underlying implementation of such a feature, would look something like this (which you should be able to "createObject"ify to do essentially the same thing, if you don't want to compile and load the java class.):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.util.Arrays;
import java.util.List;
public class FileUtils {
    public FileUtils() {}

    public static void concatenate(String target, String inList) throws IOException {
        FileChannel out = new FileOutputStream(target).getChannel();
        FileChannel in = null;
        try {
            List<String> inputs = Arrays.asList(inList.split(","));
            for (String inputFile : inputs) {
                try {
                    in = new FileInputStream(inputFile).getChannel();
                    // transferTo may copy fewer bytes than requested, so loop until done
                    long position = 0;
                    long size = in.size();
                    while (position < size) {
                        position += in.transferTo(position, size - position, out);
                    }
                }
                finally {
                    try {
                        if (in != null)
                            in.close();
                    }
                    catch (Exception e) {}
                }
            }
        }
        finally {
            try { out.close(); } catch (Exception e) {}
        }
    }
}
Appreciate your article, was very useful.