Detecting invalid HTML with JavaScript

January 23, 2012 javascript jquery

(This post is more than 2 years old.)

As a blogger, I write quite a few blog posts. I hate RTEs (Rich Text Editors), so I typically write whatever HTML I need by hand. Normally this isn't a big deal. My blogware handles paragraphs and code formatting, so I mostly just worry about bold and italics. However, because I'm entering HTML manually, there's always a chance I could screw up. I've got a Preview feature on my blog, but I rarely use it.

For a while now I've wondered if there was some way to detect bad HTML via JavaScript. I decided today to take a crack at it using some simple regex. I figured if we could detect all tags, maybe we could use a simple counter to keep track of opening and closing tags. Obviously that's not terribly precise, but for the types of mistakes I make, it would actually work out OK most of the time. I worked on it a bit and came up with the following little demo:

<html>
<head>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.min.js"></script>
<script>
$(document).ready(function() {

    $("#testBtn").click(function(e) {
        //clear any message left over from a previous test
        $("#status").text("");
        var code = $.trim($("#code").val());
        if(code == '') return;

        var regex = /<.*?>/g;
        var matches = code.match(regex);
        //match() returns null, not an empty array, when nothing matches
        if(!matches) return;
        var tags = {};
        $.each(matches, function(idx,itm) {
            console.log("Raw tag: "+itm);
            //if the tag is, <..../>, it's self closing
            if (itm.slice(-2) != "/>") {
                //strip out any attributes (split on any whitespace, not just spaces)
                var tag = itm.replace(/[<>]/g, "").split(/\s+/)[0];
                console.log("Tag : " + tag);
                //start or end tag?
                if (tag.charAt(0) != "/") {
                    if (tags.hasOwnProperty(tag))
                        tags[tag]++;
                    else
                        tags[tag] = 1;
                }
                else {
                    var realTag = tag.slice(1);
                    console.log("Real tag is -" + realTag);
                    if (tags.hasOwnProperty(realTag))
                        tags[realTag]--;
                    else
                        tags[realTag] = -1;
                }
            }
        });
        console.dir(tags);
        //any tag whose count isn't zero was opened or closed once too often
        var possibles = [];
        for (var tag in tags) {
            if(tags[tag] != 0) possibles.push(tag);
        }
        if (possibles.length) {
            $("#status").text("There appear to be some hanging tags in your textarea: "+possibles.join(","));
        }
    });
});
</script>
</head>
<body>
<div id="status"></div>

<form>
<textarea name="code" id="code" cols="70" rows="30"></textarea><br/>
<input type="button" id="testBtn" value="Test">
</form>
</body>
</html>

Basically, I used a simple regex to find any HTML tag:

var regex = /<.*?>/g;
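
To see what this pattern actually picks up, here is a small standalone sketch. The helper name `findTags` and the sample strings are made up for illustration and are not part of the demo. The `?` makes `.*` lazy, so each match stops at the first `>` instead of running on to the last one in the text. Also worth knowing: `match` returns `null`, not an empty array, when nothing matches.

```javascript
// Hypothetical helper for illustration: returns every "<...>" chunk in a string.
function findTags(text) {
    // Lazy quantifier: each match ends at the first ">" after a "<".
    var matches = text.match(/<.*?>/g);
    // String.prototype.match returns null (not []) when there are no matches.
    return matches || [];
}

console.log(findTags('This is <b>bold</b> and <i class="x">italic'));
// three matches: <b>, </b> and <i class="x">
console.log(findTags('no markup here'));
// an empty array
```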

From that, I loop over the matches and figure out a) the real tag name (ignoring attributes, for example) and b) whether it is an opening or a closing tag. I keep a simple counter per tag name, incrementing it for an opening tag and decrementing it for a closing one. Any tag whose counter isn't zero at the end gets reported. I also try to support self-closing tags like <p/> by skipping them.
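
The counting step can also be pulled out into a plain function with no jQuery or DOM involved, which makes it easy to try in a console. This is only a sketch of the same idea: the names `countTags` and `hangingTags` are hypothetical, and it uses `Object.create(null)` as the counter object so tag names can't collide with built-in object properties.

```javascript
// Hypothetical jQuery-free version of the counting step, for illustration only.
function countTags(html) {
    var matches = html.match(/<.*?>/g) || [];
    // No prototype, so a tag called "constructor" can't clash with anything.
    var tags = Object.create(null);
    matches.forEach(function (itm) {
        // Skip self-closing tags such as <p/> or <br />.
        if (itm.slice(-2) === "/>") return;
        // Drop the angle brackets and any attributes.
        var tag = itm.replace(/[<>]/g, "").split(/\s+/)[0];
        var closing = tag.charAt(0) === "/";
        var name = closing ? tag.slice(1) : tag;
        // +1 for an opening tag, -1 for a closing tag.
        tags[name] = (tags[name] || 0) + (closing ? -1 : 1);
    });
    return tags;
}

// Returns the names of tags whose opens and closes don't balance out.
function hangingTags(html) {
    var tags = countTags(html);
    return Object.keys(tags).filter(function (name) {
        return tags[name] !== 0;
    });
}

console.log(hangingTags("<b>bold</b> and <i>italic")); // ["i"]
console.log(hangingTags("<p/><b>fine</b>"));           // []
console.log(hangingTags("oops</b>"));                  // ["b"]
```

Because this only counts, it can't tell you where the problem is or catch tags closed in the wrong order. A `<br>` written without the trailing slash will also show up as hanging, since nothing ever closes it.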

It's not the most scientific method, but it seems to work well in my testing. Check it out at the demo below.