Detecting invalid HTML with JavaScript

This post is more than 2 years old.

As a blogger, I write quite a few blog posts. I hate RTEs (Rich Text Editors), so I typically write most of the HTML I need by hand. Normally this isn't a big deal. My blogware can handle paragraphs and code formatting, so I typically just worry about bold and italics. However, because I'm entering HTML manually, there's always a chance I could screw up. I've got a Preview feature on my blog, but I rarely use it.

For a while now I've wondered if there was some way to detect bad HTML via JavaScript. I decided today to take a crack at it using some simple regex. I figured if we could detect all tags, maybe we could use a simple counter to keep track of opening and closing tags. Obviously that's not terribly precise, but for the types of mistakes I make, it would actually work out OK most of the time. I worked on it a bit and came up with the following little demo:

<html>
<head>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.min.js"></script>
<script>
$(document).ready(function() {

$("#testBtn").click(function(e) {
	var code = $.trim($("#code").val());
	if(code == '') return;
	
	var regex = /<.*?>/g;
	var matches = code.match(regex);
	//match() returns null, not an empty array, when nothing matches
	if(!matches) return;
	
	var tags = {};
	
	$.each(matches, function(idx,itm) {
		console.log("Raw tag: "+itm);

		//if the tag ends in />, e.g. <br/>, it's self-closing and can be skipped
		if (itm.slice(-2) != "/>") {
		
			//strip the angle brackets, then drop any attributes
			var tag = itm.replace(/[<>]/g, "").split(" ")[0];
			console.log("Tag : " + tag);
			//start or end tag?
			if (tag.charAt(0) != "/") {
				if (tags.hasOwnProperty(tag)) 
					tags[tag]++;
				else 
					tags[tag] = 1;
			}
			else {
				var realTag = tag.substr(1, tag.length);
				console.log("Real tag is -" + realTag);
				if (tags.hasOwnProperty(realTag)) 
					tags[realTag]--;
				else 
					tags[realTag] = -1;
			}
		}
	});

	console.dir(tags);
	
	var possibles = [];
	for (var tag in tags) {
		if(tags[tag] != 0) possibles.push(tag);
	}
	if (possibles.length) {
		$("#status").text("There appear to be some hanging tags in your textarea: "+possibles.join(","));
	}
	else {
		//clear any message left over from an earlier test
		$("#status").text("");
	}
});

});
</script>
</head>

<body>

<div id="status"></div>

<form>
<textarea name="code" id="code" cols="70" rows="30"></textarea><br/>
<input type="button" id="testBtn" value="Test">
</form>
</body>
</html>

Basically, I used a simple regex to find any HTML tag:

var regex = /<.*?>/g;
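
To see what that pattern picks up, here is a small check that can be run in a browser console or in Node. The sample string is made up for illustration. The lazy `.*?` stops at the first `>`, so each tag comes back as its own match. Since `.` does not match a newline here, a tag split across two lines would be missed.

```javascript
// Sample input is hypothetical, just for illustration.
var sample = 'Some <b>bold</b> text, an <a href="x.html">unclosed link, and a <br/>.';
var regex = /<.*?>/g;

// match() returns null (not an empty array) when nothing matches,
// so fall back to an empty array.
var found = sample.match(regex) || [];

console.log(found); // [ '<b>', '</b>', '<a href="x.html">', '<br/>' ]
```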

From that, I loop over the matches and figure out a) the real tag name (so I ignore attributes, for example) and b) whether it is an opening or a closing tag. I keep a simple numeric counter per tag, incrementing it for an opening tag and decrementing it for a closing one. Any tag whose counter doesn't end up at zero gets reported. I also try to support self-closing tags like <p/>.
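
The same counting idea can be pulled out into a plain function with no jQuery or DOM involved, which makes it easier to try on sample strings. This is only a sketch of the approach described above. The name `findHangingTags` is hypothetical and not part of the demo.

```javascript
// Hypothetical helper: same counting idea as the demo, minus jQuery and the DOM.
function findHangingTags(html) {
  var matches = html.match(/<.*?>/g) || [];
  // No prototype, so tag names can never collide with inherited properties.
  var counts = Object.create(null);

  matches.forEach(function (raw) {
    // Self-closing tags such as <br/> don't affect the balance.
    if (raw.slice(-2) === "/>") return;

    // Drop the angle brackets, then keep only the name before any attributes.
    var name = raw.replace(/[<>]/g, "").split(" ")[0];
    var closing = name.charAt(0) === "/";
    if (closing) name = name.slice(1);

    // +1 for an opening tag, -1 for a closing tag.
    counts[name] = (counts[name] || 0) + (closing ? -1 : 1);
  });

  // Any tag whose count is not zero is a possible problem.
  return Object.keys(counts).filter(function (name) {
    return counts[name] !== 0;
  });
}

console.log(findHangingTags("<p>This is <b>bold and <i>italic</i>.</p>")); // [ 'b' ]
console.log(findHangingTags("<p>Fine</p><br/>")); // []
```

Because it only counts, it flags a missing `</b>` but would not notice tags closed in the wrong order, such as `<b><i>text</b></i>`.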

It's not the most scientific method, but it seems to work well in my testing.
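
One place a counter like this can misfire is on things that never get a closing tag: void elements written without the slash (`<br>`, `<img ...>`) and HTML comments. A possible refinement, offered only as a sketch, is to skip those before counting. The `VOID_TAGS` list and the `shouldSkip` helper below are hypothetical names and are not part of the demo.

```javascript
// Hypothetical refinement: HTML void elements, which never take a closing tag.
var VOID_TAGS = ["area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"];

function shouldSkip(raw) {
  // Comments (<!-- ... -->) and doctype declarations start with "<!".
  if (raw.indexOf("<!") === 0) return true;
  // Explicitly self-closed tags such as <br/>.
  if (raw.slice(-2) === "/>") return true;
  // Strip brackets and slashes, then keep the name before any attributes.
  var name = raw.replace(/[<>\/]/g, "").split(/\s+/)[0].toLowerCase();
  return VOID_TAGS.indexOf(name) !== -1;
}

console.log(shouldSkip("<br>"));                // true
console.log(shouldSkip('<img src="cat.png">')); // true
console.log(shouldSkip("<!-- note -->"));       // true
console.log(shouldSkip("<b>"));                 // false
```

This still isn't a parser. A comment that itself contains tags would be matched tag by tag by the regex, so it could still throw the count off.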


About Raymond Camden

Raymond is a senior developer evangelist for Adobe. He focuses on document services, JavaScript, and enterprise cat demos.

Lafayette, LA https://www.raymondcamden.com

Archived Comments

Comment 1 by Brian posted on 1/24/2012 at 1:32 AM

Doesn't like comments, [but at least you handle them as "hanging tags" -- what does that do to your counter?] but then who does??? :)

Comment 2 by Al Everett posted on 1/25/2012 at 12:55 AM

You probably don't want to parse HTML with regex.

http://stackoverflow.com/qu...

Comment 3 by Raymond Camden posted on 1/25/2012 at 12:59 AM

I disagree. I'm not trying to build an HTML parser here. Rather, I'm trying to flag possible errors. Given that I'm OK with it not being 100% perfect, if it finds half my mistakes, then it's a success, I'd say.

Comment 4 by Al Everett posted on 1/25/2012 at 2:07 AM

Well, you should read the top-rated answer on that question anyway. It's very insightful.

Comment 5 by Raymond Camden posted on 1/25/2012 at 2:08 AM

I did. :)