Yesterday in the IRC channel someone asked if there was a way to count the number of times each unique word appears in a string. While it was obvious that this could be done manually (see below), no one knew of a more elegant solution. Can anyone think of one? Here is the solution I used; it definitely falls into the "manual" (and probably slow) category.
First I made my string:
<cfsavecontent variable="string">
This is a paragraph with some text in it. Certain words will be repeated, and other words
will not be repeated. The question is though, how much can I write before I begin to sound
like a complete and utter idiot. Let's call that the "Paris Point". At the Paris Point, any
further words sound like gibberish and are completely worthless.
</cfsavecontent>
I then used some regex to get an array of words:
<cfset words = reMatch("[[:word:]]+", string)>
Next I created a structure:
<cfset wordCount = structNew()>
And then looped over the array and inserted the words into the structure:
<cfloop index="word" array="#words#">
<cfif structKeyExists(wordCount, word)>
<cfset wordCount[word]++>
<cfelse>
<cfset wordCount[word] = 1>
</cfif>
</cfloop>
Note that this is inherently case-insensitive, because ColdFusion struct keys are case-insensitive, which I think is a good thing. At this point we are done, but I added some display code as well:
<cfset sorted = structSort(wordCount, "numeric", "desc")>
<table border="1" width="400">
<tr>
<th width="50%">Word</th>
<th>Count</th>
</tr>
<cfloop index="word" array="#sorted#">
<cfoutput>
<tr>
<td>#word#</td>
<td>#wordCount[word]#</td>
</tr>
</cfoutput>
</cfloop>
</table>
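For anyone who prefers script syntax, the steps above can be pulled together into one function. This is only a sketch: the name countWords and its text argument are made up for illustration, and it assumes ColdFusion 8 or later, since that is where reMatch() showed up.

```cfml
<cfscript>
// countWords is a hypothetical helper name, not something from the post.
function countWords(text) {
    // Same regex as above: runs of letters, digits and underscores.
    var words = reMatch("[[:word:]]+", arguments.text);
    var counts = structNew();
    var i = 0;
    var w = "";

    for (i = 1; i <= arrayLen(words); i++) {
        w = words[i];
        // Struct keys are case-insensitive, so "The" and "the" share one counter.
        if (structKeyExists(counts, w)) {
            counts[w] = counts[w] + 1;
        } else {
            counts[w] = 1;
        }
    }
    return counts;
}

// "string" is the variable built with cfsavecontent earlier in the post.
wordCount = countWords(string);
sorted = structSort(wordCount, "numeric", "desc");
</cfscript>
```

The sorted array can then be fed to the same display loop as before.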
Archived Comments
If "Paris Point" becomes part of the daily lexicon, you can officially coin it. Nice code work too.
REMatch() makes me happy :)
Probably not faster, but you could create a query with a single column and use qoq to get the count with a group by.
Couldn't you do something like
#ListLen(string, " #Chr(13)##Chr(10)#")#
(it seems to work with the string variable you posted)
Gareth, that counts the words. We need a count of how many times each word appears. I.e., the string has "the" ten times, and so on.
Whoops, unique instances...
Let me try that again :)
<cfset new_string = ListSort(REReplaceNoCase(LCase(string), "[^a-z ]", "", "ALL"), "Text", "Asc", " #Chr(13)##Chr(10)#")>
<cfscript>
// had to use this as CF does not allow lookbehind in regular expressions, but JAVA does
obj = createobject("java","java.util.regex.Pattern"); // create pattern searching object
x = obj.compile("(?<=[ ]|^)([^ ]*)([ ]\1)+(?=[ ]|$)"); // compile the regular expression for use
new_string = x.matcher(new_string).replaceAll("$1"); // remove all duplicates
</cfscript>
#ListLen(new_string, " ")#
OK, I'm going to stop now :)
I got a total of the unique words, but not a count of the number of duplicate words (that's what I get for trying to write code for one thing while checking out the blogs in another tab :) )
Issue: the word "Let's" gets broken into "Let" and "s" because of your RE. Solution? Still working on it... ;)
@Todd,
I ran into that same problem during one of Ray's Friday Puzzlers... trust me - don't try to figure it out, your brain will only end up hurting. Here's why, these are all single "words":
hatin'
let's
sweet-ass
cf.objective()
O'connell
... if you can write an algorithm to use all those "non-word" characters as parts of words, well then, you are the man!
Although, one could argue that sweet-ass wouldn't be so bad as two words. hatin' is slang, and would become hatin, which is ok.
I think if you could just get single quotes to work, you would get most "real" words.
I wonder - maybe switch from [[:word:]] to
(any non alpha except single quote)(alpha,1 or more)(optional ' if followed by alpha)(any non alpha except single quote)
Then again - another solution? Remove '. You end up with words like "lets", which could be confused with "Ray lets Paris call him", but it would be better than let and s as words.
But, isn't the next word signified by a space? So everything between a space, is a word?
Yeah, stripping out the single quotes is probably the easiest thing to do. Least amount of damage for the best results.
So why not just change from "[[:word:]]" to "[a-zA-Z0-9]+'[a-zA-Z0-9]+|[a-zA-Z0-9]+"
That seems to do the trick for let's lets. But it still doesn't make the counting any more elegant.
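A quick way to check that pattern is to run it through reMatch() on a short test sentence. The sample sentence and variable names here are made up for illustration; only the regex comes from the comment above.

```cfml
<!--- Hypothetical test sentence containing both "Let's" and "lets" --->
<cfset sample = "Let's see what Ray lets Paris do.">

<!--- First alternative catches word'word; second catches plain words --->
<cfset words = reMatch("[a-zA-Z0-9]+'[a-zA-Z0-9]+|[a-zA-Z0-9]+", sample)>

<!--- Should show Let's, see, what, Ray, lets, Paris, do as seven separate elements --->
<cfdump var="#words#">
```

The order of the two alternatives matters: if the plain-word branch came first, it would match "Let" and stop at the apostrophe.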
That's pretty cool there Ron.
There's a CF IRC channel floating around somewhere? Anyone feel like sharing the info? :)
The one I use is #coldfusion on Dalnet.
This is better:
(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)])+
It matches method chains like myarray.dedup().sort()
And just to match Ben's "hatin'" example:
"(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)\-\'])+"
Don't forget (like I did) to throw in the \'\- into the last non-capturing group.
Ben does that meet your needs?
This is probably how I would do it:
<cfset string = reReplace(string,'(\.|"(?=\w))','','all') />
<cfset wordAry = listToArray(string,'#chr(10)##chr(13)##chr(32)#') />
<cfset wordQry = queryNew('word','VarChar') />
<cfloop from="1" to="#arrayLen(wordAry)#" index="i">
<cfset queryAddRow(wordQry) />
<cfset querySetCell(wordQry,'word',reReplace(wordAry[i],'[",]$','')) />
</cfloop>
<cfquery dbtype="query" name="uniqueWords">
SELECT word, count(*) as wordCount FROM wordQry group by word order by wordCount desc
</cfquery>
<cfdump var="#uniqueWords#">
this seems to work for me:
<cfset words = arrayToList(string.split('\s'))>
<cfset wordCount = structNew()>
<cfloop index="word" list="#words#">
<cfset wordCount[word] = ListValueCountNoCase(words, word)>
</cfloop>
No, that's not right. It's including punctuation as part of the word. So I tried with the "hatin'" list and got this working:
<cfset words = arrayToList(string.split("\.?[[^()]\s&&([""()][\s])]"))>
Wouldn't this work?
<cfset wordcount = structNew()/>
<cfloop list="#string#" delimiters=' ,"' index="word">
<cfset word = replaceList(word,"',.","")/>
<cfif structKeyExists(wordCount, word)>
<cfset wordCount[word] = wordCount[word] + 1/>
<cfelse>
<cfset wordCount[word] = 1/>
</cfif>
</cfloop>
Why not just do:
<cfset myString = "blah blah blah bldfadsff fd ">
<cfset mCounter = stringToArray(myString,"a")>
<cfset numberOfAs = arraylen(mcounter)>
?
Unless I'm missing something in an earlier comment, there is no strintToArray() function in ColdFusion...
Make that stringToArray()...
Nice article, and thanks guys for the different ways of doing this.
Wanted to note: the sort on this ("textnocase") needs to be "numeric","desc" otherwise you're not getting your top numbers right (ie, textnocase sort would look like 4,3,20,17).
Great code on this as a first step to making a word cloud, looping it on DB-pulled text fields.
Oops. Thanks D. Fixed in the code above.
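To see why the sort type matters, here is a small sketch with made-up counts (the numbers are the ones from the comment above, the words attached to them are invented). structSort() compares the values as strings under "textnocase" and as numbers under "numeric".

```cfml
<!--- Hypothetical counts, chosen to match the 4,3,20,17 example --->
<cfset counts = structNew()>
<cfset counts["the"] = 20>
<cfset counts["point"] = 17>
<cfset counts["paris"] = 4>
<cfset counts["words"] = 3>

<!--- Values compared as text: "4" > "3" > "20" > "17" --->
<cfset byText = structSort(counts, "textnocase", "desc")>

<!--- Values compared as numbers: 20 > 17 > 4 > 3 --->
<cfset byNumber = structSort(counts, "numeric", "desc")>

<cfoutput>
text sort: #arrayToList(byText)#<br>
numeric sort: #arrayToList(byNumber)#
</cfoutput>
```

With the text sort the key for 4 comes out first; with the numeric sort the key for 20 does, which is what a "top words" list wants.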
Doesn't seem to be returning carriage returns for me. Any fixes?
What do you mean? Why would it return carriage returns? It returns the number of words.
How about ListValueCount(list, value [, delimiters ])
I was trying to find out how many HRs I had in a text string in a DB column (which would show how many entries I recorded for the view history of a certain page), and this seemed to do the trick, like this:
<cfset histcount = ListValueCount(list.history, "hr", "<,>")>
What value would you use though?
I solved it like this, using getToken. I am assuming there is at least one space between any two words. Does anybody see any issue with that?
<cfsavecontent variable="string">
This is a paragraph with some text in it. Certain words will be repeated, and other words
will not be repeated. The question is though, how much can I write before I begin to sound
like a complete and utter idiot. Let's call that the "Paris Point". At the Paris Point, any
further words sound like gibberish and are completely worthless.
</cfsavecontent>
<cfset word="let's" />
<cfset i=1 />
<cfset countOfword=0 />
<cfloop condition="getToken(string,i,' ') neq ''">
<cfif getToken(string,i,' ') eq word><cfset countOfword=countOfword+1 /></cfif>
<cfset i=i+1 />
</cfloop>
<cfoutput>#countOfword#</cfoutput>
Back to the point about how to include "-" and O'Conner use [:print:] instead of [:word:]. Works wonders for me!
Don't you love these posts when you need an answer in a hurry? Thanks Ray.
I love it when stuff works 6 years later. ;)
Hi guys,
I tried the [:word:] solution, but it is counting 'blue-eyed' as 2 words rather than 1.
And 'doesn’t' is taken as 2 words. Is there any way to tell CF to let - and ' go by?
Any help would be more than appreciated :)
Regards,
Awais
Would this help? First Google result for "regular expression word hyphen" https://stackoverflow.com/q...
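Separately from whatever the linked answer suggests, one possible pattern is to allow a hyphen or apostrophe only when it sits between two alphanumeric runs. This is a sketch, not a tested fix for every case: the sample sentence and variable names are made up, chr(8217) is the curly apostrophe from the 'doesn’t' example, and [[:alnum:]] is swapped in for [[:word:]], so underscores no longer count as word characters.

```cfml
<!--- Hypothetical sample using a curly apostrophe (chr(8217)) and a hyphenated word --->
<cfset sample = "A blue-eyed girl doesn#chr(8217)#t care">

<!--- One or more alphanumerics, then any number of (joiner + alphanumerics) groups.
      The joiner is a hyphen, a straight apostrophe or a curly apostrophe. --->
<cfset pattern = "[[:alnum:]]+([-'#chr(8217)#][[:alnum:]]+)*">

<cfset words = reMatch(pattern, sample)>
<cfdump var="#words#">
```

The intent is that blue-eyed and doesn’t each come back as a single element, while a trailing apostrophe as in hatin' is still dropped because nothing alphanumeric follows it.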