Count Words Using Regular Expressions
Saturday, June 28th, 2008
I had to create a program that counts the words in a string. The program needs to count every word in the string except:
1. HTML code
2. Dividers
3. Extra spaces and new lines
I used regular expressions to filter the string. For this purpose I created two functions:
RemoveExtraSpaces - Removes extra spaces from a string, leaving only one space between words.
CountWords - Returns the number of words in a string. It calls RemoveExtraSpaces and uses a regular expression to filter out HTML code, new lines and dividers.
// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Removes extra spaces. The regular expression looks for runs of
/// two or more whitespace characters and replaces each run with one space.
/// </summary>
/// <param name="s">The string that we want to check</param>
/// <returns>The fixed string</returns>
private string RemoveExtraSpaces(string s)
{
    Regex findExtraSpace = new Regex(@"\s{2,}");
    return findExtraSpace.Replace(s, " ");
}
/// <summary>
/// Returns the number of words in a string that are
/// separated by spaces
/// </summary>
/// <param name="strText">The text that we want to check</param>
/// <returns>The number of words</returns>
public int CountWords(string strText)
{
    string exp = "#;#";
    // The expression looks for HTML tags, new lines and the divider
    // that we defined above (escaped so it is always matched literally)
    Regex filter = new Regex("<[^>]+>|" + Regex.Escape(exp) + "|\r\n|\n");
    // Replace each match with a space so that it is not counted and
    // the words on either side of it are not merged into one word
    strText = filter.Replace(strText, " ");
    // Remove the extra spaces, including any at the start or the end
    strText = RemoveExtraSpaces(strText).Trim();
    // Count the words in the string by splitting them wherever a
    // space is found; an empty string contains no words
    return strText.Length == 0 ? 0 : strText.Split(' ').Length;
}
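To try the idea out quickly, here is a small self-contained sketch. It is a variation on the functions above, not the same code: instead of collapsing spaces and splitting, it counts runs of non-whitespace characters with Regex.Matches. The class name WordCountDemo, the method name CountWordsByMatching and the sample string are made up for this example, and the "#;#" divider is hard-coded.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCountDemo
{
    // Hypothetical helper (not part of the original functions):
    // replaces HTML tags, the "#;#" divider and line breaks with a space,
    // then counts every run of non-whitespace characters as one word.
    public static int CountWordsByMatching(string text)
    {
        string cleaned = Regex.Replace(text, "<[^>]+>|#;#|\r\n|\n", " ");
        return Regex.Matches(cleaned, @"\S+").Count;
    }

    public static void Main()
    {
        // Made-up sample: two tags, extra spaces, one divider, one line break
        string sample = "<p>Hello   world</p>#;#second\r\nline";
        Console.WriteLine(CountWordsByMatching(sample));
    }
}
```

For this sample the program prints 4 (Hello, world, second, line). Because matching on \S+ ignores leading, trailing and repeated whitespace, this variant does not need a separate step for removing extra spaces.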