Tuesday 11 July 2006

Regex - The Website Accessibility Saviour

While trying to make Microsoft Content Management Server (MCMS) based websites accessible, I've had to employ several funky tricks, including HTTP modules, overriding page rendering, inherited pages/master pages and the workhorse of the entire system: the regular expression.
Most of making MCMS accessible has been catching the resulting HTML before it gets to the user, finding x and replacing it with y. This works for a lot of things (e.g. removing the Border=0 attribute that Microsoft insist on adding to any image). The situation with links was quite different. I needed to find and remove target="_blank" wherever it appeared, but also add onclick JavaScript to launch the page in a new window. Why would I want to do this? The target attribute is not valid in either the HTML 4.01 Strict or the XHTML 1.0 Strict specification, and we needed to open some websites in another window because they used some dodgy bad practice that stopped you using the back button in your browser.
So... the goal was to get from:

<a href="http://www.mysite.com" target="_blank">
to

<a href="http://www.mysite.com" onclick='window.open("http://www.mysite.com");return false;'>
The key is System.Text.RegularExpressions.Regex, and more specifically the Replace method.

Regex.Replace(string sOrig, string sRegex, MatchEvaluator oMatchEvaluator, RegexOptions Options)

Example below:


m_sXHTML = Regex.Replace(m_sXHTML,
    "<[aA]\\s([^>]*href=\"([^>^\"]+)\"[^>]*target=[\"]?_blank[\"]?[^>]*)>",
    new MatchEvaluator(ConvertTargetToJavascript), RegexOptions.IgnoreCase);
The two keys are the regular expression and the Match evaluator delegate. I'll tackle the expression first.

"<[aA]\\s([^>]*href=\"([^>^\"]+)\"[^>]*target=[\"]?_blank[\"]?[^>]*)>"

Looks fun eh? There is a well defined syntax for regular expressions, which I don't intend to go into detail on here, but I will cover how this one works, breaking it down into chunks.

[aA]
Match the character "a" or "A". (With RegexOptions.IgnoreCase set, a plain "a" would do the same job.)

\\s
Followed by a whitespace character ("\s" is the regex syntax; as this sits inside a C# string literal we need to escape the escape character!)

[^>]*
Anything but a ">", for zero or more occurrences (*).

[^>^\"]+
Anything but ">" or a double quote, for one or more occurrences (+). Only the leading "^" negates the class; the second "^" is taken literally, so this also excludes a caret, which is harmless here.

[\"]?
Zero or one double quote (?).

This gives us a match for any anchor tag that contains the target="_blank" or target=_blank attribute after the href attribute. If we write a similar regex for when it occurs before the href, then we have covered all bases.
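As a rough sketch of that "similar regex", the same pieces can be rearranged so that target comes first and href second. The class name TargetBeforeHref and the method Find are made up for this illustration, and the pattern is my own rearrangement rather than anything taken from the live site.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TargetBeforeHref
{
    // Assumed companion pattern: identical building blocks to the original,
    // but target=_blank must appear before the href attribute.
    public const string Pattern =
        "<a\\s([^>]*target=[\"]?_blank[\"]?[^>]*href=\"([^>\"]+)\"[^>]*)>";

    // Returns the first match (check Success before using the groups).
    public static Match Find(string html)
    {
        return Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Match m = Find("<a target=\"_blank\" href=\"http://www.mysite.com\">");
        Console.WriteLine(m.Success);         // True
        Console.WriteLine(m.Groups[2].Value); // http://www.mysite.com
    }
}
```

The group numbering is kept the same as in the original pattern (group 1 for all attributes, group 2 for the href value), so one match evaluator should be able to serve both expressions.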

What I have not mentioned so far are the round brackets throughout the regex. These define groups of matched text that can be used by our match evaluator.

([^>^\"]+)
The value of the href attribute.

([^>]*href=\"([^>^\"]+)\"[^>]*target=[\"]?_blank[\"]?[^>]*)
All the attribute information for this tag.
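
To see what each group actually captures, a small console program along these lines can help. It uses the pattern from this post as-is; the class name GroupDemo and the helper Groups are hypothetical names chosen for the sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // The pattern from the post, unchanged.
    public const string Pattern =
        "<[aA]\\s([^>]*href=\"([^>^\"]+)\"[^>]*target=[\"]?_blank[\"]?[^>]*)>";

    // Returns groups 0, 1 and 2 of the first match (empty strings if no match).
    public static string[] Groups(string html)
    {
        Match m = Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
        return new string[] { m.Groups[0].Value, m.Groups[1].Value, m.Groups[2].Value };
    }

    public static void Main()
    {
        string[] g = Groups("<a href=\"http://www.mysite.com\" target=\"_blank\">");
        Console.WriteLine(g[0]); // <a href="http://www.mysite.com" target="_blank">
        Console.WriteLine(g[1]); // href="http://www.mysite.com" target="_blank"
        Console.WriteLine(g[2]); // http://www.mysite.com
    }
}
```

Groups are numbered by the position of their opening bracket, which is why the outer group comes out as 1 and the href value nested inside it as 2.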

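All we need now is to provide the delegate method, called "ConvertTargetToJavascript" in the example:

private string ConvertTargetToJavascript(Match m)
{
    // Group 0 is the whole match; handy to inspect while debugging.
    string sTest = m.Groups[0].ToString();
    // Assumed body: drop target from the attributes (group 1) and append an
    // onclick that opens the href (group 2), giving the output shown earlier.
    string sAttributes = Regex.Replace(m.Groups[1].ToString(),
        "\\s*target=[\"]?_blank[\"]?", "", RegexOptions.IgnoreCase);
    string sResult = "<a " + sAttributes +
        " onclick='window.open(\"" + m.Groups[2].ToString() + "\");return false;'>";
    return sResult;
}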


Group 0 is always the whole match. The groups that follow it depend on the round brackets used in the regex. In the example Group 1 will be all attribute information, and Group 2 will be the value of the href. The string result returned by the match evaluator will replace the original match.
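
Putting the pieces together, a self-contained console version might look something like the sketch below. The pattern and the name ConvertTargetToJavascript come from this post; the LinkRewriter class, the Rewrite method and the body of the evaluator are assumptions of mine, written to produce the target output described at the top rather than copied from the production code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkRewriter
{
    // The pattern from the post: href first, target=_blank somewhere after it.
    private const string Pattern =
        "<[aA]\\s([^>]*href=\"([^>^\"]+)\"[^>]*target=[\"]?_blank[\"]?[^>]*)>";

    // Assumed evaluator body: strip target from group 1, then add an onclick
    // that opens the href value (group 2) in a new window.
    private static string ConvertTargetToJavascript(Match m)
    {
        string attributes = Regex.Replace(m.Groups[1].Value,
            "\\s*target=[\"]?_blank[\"]?", "", RegexOptions.IgnoreCase);
        string href = m.Groups[2].Value;
        return "<a " + attributes +
            " onclick='window.open(\"" + href + "\");return false;'>";
    }

    // Rewrites every matching anchor tag; other markup passes through untouched.
    public static string Rewrite(string html)
    {
        return Regex.Replace(html, Pattern,
            new MatchEvaluator(ConvertTargetToJavascript), RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html = "<p><a href=\"http://www.mysite.com\" target=\"_blank\">My site</a></p>";
        Console.WriteLine(Rewrite(html));
        // <p><a href="http://www.mysite.com" onclick='window.open("http://www.mysite.com");return false;'>My site</a></p>
    }
}
```

Because the evaluator only runs on tags the pattern matched, links without target="_blank" are left exactly as they were.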
All too easy... once you get your head around regex syntax.