Posted in ColdFusion | Posted on 08-02-2007 | 7,000 views
Yesterday in the IRC channel someone asked if there was a way to count the number of times each unique word appears in a string. While it was obvious that this could be done manually (see below), no one knew of a more elegant solution. Can anyone think of one? Here is the solution I used and it definitely falls into the "manual" (and probably slow) category.
First I made my string:
ColdFISH is developed by Jason Delmore. Source code and license information available at coldfish.riaforge.org
<cfsavecontent variable="string">
This is a paragraph with some text in it. Certain words will be repeated, and other words
will not be repeated. The question is though, how much can I write before I begin to sound
like a complete and utter idiot. Let's call that the "Paris Point". At the Paris Point, any
further words sound like gibberish and are completely worthless.
</cfsavecontent>
I then used some regex to get an array of words:
<cfset words = reMatch("[[:word:]]+", string)>
Next I created a structure:
<cfset wordCount = structNew()>
And then looped over the array and inserted the words into the structure:
<cfloop index="word" array="#words#">
    <cfif structKeyExists(wordCount, word)>
        <cfset wordCount[word]++>
    <cfelse>
        <cfset wordCount[word] = 1>
    </cfif>
</cfloop>
Note that this will be inherently case-insensitive, because ColdFusion struct keys ignore case, which I think is a good thing. At this point we are done, but I added some display code as well:
<cfset sorted = structSort(wordCount, "numeric", "desc")>

<table border="1" width="400">
<tr>
    <th width="50%">Word</th>
    <th>Count</th>
</tr>

<cfloop index="word" array="#sorted#">
    <cfoutput>
    <tr>
        <td>#word#</td>
        <td>#wordCount[word]#</td>
    </tr>
    </cfoutput>
</cfloop>
</table>
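The counting steps above could be bundled into a single function. This is only a minimal sketch, assuming ColdFusion 8 or later (for reMatch); the name countWords and the sample sentence are made up for illustration and are not part of the original code.

```cfml
<cfscript>
// Hypothetical helper: returns a struct mapping each word to its count.
function countWords(text) {
    var counts = structNew();
    var words = reMatch("[[:word:]]+", text);
    var i = 0;
    var key = "";

    for (i = 1; i <= arrayLen(words); i++) {
        key = words[i];
        if (structKeyExists(counts, key)) {
            counts[key] = counts[key] + 1;
        } else {
            counts[key] = 1;
        }
    }
    return counts;
}

// Struct keys ignore case, so "The" and "the" share one entry.
result = countWords("The cat saw the other cat");
writeOutput(result["the"]); // outputs 2
</cfscript>
```

The returned struct can be passed straight to structSort(result, "numeric", "desc") for display.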
#ListLen(string, " #Chr(13)##Chr(10)#")#
(it seems to work with the string variable you posted)
Let me try that again :)
<cfset new_string = ListSort(REReplaceNoCase(LCase(string), "[^a-z ]", "", "ALL"), "Text", "Asc", " #Chr(13)##Chr(10)#")>
<cfscript>
// had to use this because CF does not allow lookbehind in regular expressions, but Java does
obj = createobject("java","java.util.regex.Pattern"); // create pattern searching object
x = obj.compile("(?<=[ ]|^)([^ ]*)([ ]\1)+(?=[ ]|$)"); // compile the regular expression for use
new_string = x.matcher(new_string).replaceAll("$1"); // remove all duplicates
</cfscript>
#ListLen(new_string, " ")#
I got a total of the unique words, but not a count of the number of duplicate words (that's what I get for trying to write code for one thing while checking out the blogs in another tab :) )
I ran into that same problem during one of Ray's Friday Puzzlers... trust me - don't try to figure it out, your brain will only end up hurting. Here's why: these are all single "words":
hatin'
let's
sweet-ass
cf.objective()
O'connell
... if you can write an algorithm to use all those "non-word" characters as parts of words, well then, you are the man!
I think if you could just get single quotes to work, you would get most "real" words.
I wonder - maybe switch from [[:word:]] to
(any non alpha except single quote)(alpha,1 or more)(optional ' if followed by alpha)(any non alpha except single quote)
Then again - another solution? Remove '. You end up with words like "lets", which could be confused with "Ray lets Paris call him", but it would be better than let and s as words.
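As a rough illustration of the apostrophe idea, here is a small sketch comparing [[:word:]]+ with a pattern that allows one apostrophe inside a run of letters. The pattern and the sample sentence are my own assumptions, not something tested against the puzzler words above, and it deliberately ignores hyphens, dots, and trailing apostrophes (hatin' would still come back as hatin).

```cfml
<cfscript>
sample = "Let's call that the Paris Point. Ray lets Paris call him.";

// [[:word:]]+ stops at the apostrophe, so Let's becomes two matches: Let and s
plain = reMatch("[[:word:]]+", sample);

// Letters, then optionally one apostrophe followed by more letters,
// so Let's stays a single match and lets is matched separately
withApostrophes = reMatch("[A-Za-z]+('[A-Za-z]+)?", sample);

writeOutput(arrayToList(plain, " | ") & "<br>");
writeOutput(arrayToList(withApostrophes, " | "));
</cfscript>
```

Either array can then be fed into the same struct-counting loop from the post.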
That seems to do the trick for let's lets. But it still doesn't make the counting any more elegant.
(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)])+
It matches method chains like myarray.dedup().sort()
"(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)\-\'])+"
Don't forget (like I did) to throw in the \'\- into the last non-capturing group.
Ben does that meet your needs?
<cfset string = reReplace(string,'(\.|"(?=\w))','','all') />
<cfset wordAry = listToArray(string,'#chr(10)##chr(13)##chr(32)#') />
<cfset wordQry = queryNew('word','VarChar') />
<cfloop from="1" to="#arrayLen(wordAry)#" index="i">
    <cfset queryAddRow(wordQry) />
    <cfset querySetCell(wordQry,'word',reReplace(wordAry[i],'[",]$','')) />
</cfloop>
<cfquery dbtype="query" name="uniqueWords">
    SELECT word, count(*) as wordCount FROM wordQry group by word order by wordCount desc
</cfquery>
<cfdump var="#uniqueWords#">
<cfset words = arrayToList(string.split('\s'))>
<cfset wordCount = structNew()>
<cfloop index="word" list="#words#">
<cfset wordCount[word] = ListValueCountNoCase(words, word)>
</cfloop>
<cfset words = arrayToList(string.split("\.?[[^()]\s&&([""()][\s])]"))>
<cfset wordcount = structNew()/>
<cfloop list="#string#" delimiters=' ,"' index="word">
    <cfset word = replaceList(word,"',.","")/>
    <cfif structKeyExists(wordCount, word)>
        <cfset wordCount[word] = wordCount[word] + 1/>
    <cfelse>
        <cfset wordCount[word] = 1/>
    </cfif>
</cfloop>
<cfset myString = "blah blah blah bldfadsff fd ">
<cfset mCounter = stringToArray(myString,"a")>
<cfset numberOfAs = arraylen(mcounter)>
?
Wanted to note: the sort on this ("textnocase") needs to be "numeric","desc", otherwise you're not getting your top numbers right (i.e., a textnocase sort would look like 4,3,20,17).
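To make the difference concrete, here is a small sketch with made-up counts chosen to match the 4, 3, 20, 17 example. The words and numbers are hypothetical; only structSort and its sort types come from the post.

```cfml
<cfscript>
// Hypothetical counts, not taken from the sample paragraph
counts = structNew();
counts["and"] = 4;
counts["the"] = 20;
counts["is"] = 3;
counts["words"] = 17;

// Values compared as numbers: 20, 17, 4, 3
byNumber = structSort(counts, "numeric", "desc");

// Values compared as text, character by character: 4, 3, 20, 17
byText = structSort(counts, "textnocase", "desc");

writeOutput(arrayToList(byNumber) & "<br>"); // the,words,and,is
writeOutput(arrayToList(byText));            // and,is,the,words
</cfscript>
```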
Great code on this as a first step to making a word cloud, looping it on DB-pulled text fields.
I was trying to find out how many HRs I had in a text string in a DB column (which would show how many entries I recorded for the view history of a certain page), and this seemed to do the trick, like this:
<cfset histcount = ListValueCount(list.history, "hr", "<,>")>
<cfsavecontent variable="string">
This is a paragraph with some text in it. Certain words will be repeated, and other words
will not be repeated. The question is though, how much can I write before I begin to sound
like a complete and utter idiot. Let's call that the "Paris Point". At the Paris Point, any
further words sound like gibberish and are completely worthless.
</cfsavecontent>
<cfset word = "let's" />
<cfset i = 1 />
<cfset countOfword = 0 />
<!--- Pound signs are not needed inside condition or cfif expressions --->
<cfloop condition="getToken(string, i, ' ') neq ''">
    <cfif getToken(string, i, ' ') eq word>
        <cfset countOfword = countOfword + 1 />
    </cfif>
    <cfset i = i + 1 />
</cfloop>
<cfoutput>#countOfword#</cfoutput>