Articles

HTML Parser in Java

01 May 2012

This is a simple HTML parser written in Java. You could also call it an XML parser that supports HTML elements. It handles HTML with correct syntax as well as HTML with minor syntax errors. I tried to make it easy to use.

Usage example:

Parser parser = new Parser();
parser.ignoreUnmatchedClosingTag = true;
parser.ignoreUnmatchedOpeningTagLayers = Integer.MAX_VALUE;
parser.ignoreDuplicateID = true;
parser.ignoreBRUnmatchedOpenAndChild = true;
NodeRoot root = parser.parse(
"<?XML someXML ?>" +
"<!HTML Single Tag>" +
"<!-- HTML Comment-->" +
"<p><a href=\"24k.com.sg\" id=\"test\"><invalid unclosed tag></invalidClosingTag></a></p>");
Node nodeA = root.nodesById.get("test");
if (nodeA instanceof NodeReal && "A".equalsIgnoreCase(nodeA.name)) {
  // The instanceof check above makes this cast safe.
  System.out.println(((NodeReal) nodeA).getFirstAttrWithNameReal("HREF").value);
}
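
To illustrate what tolerating an unmatched closing tag can mean, here is a minimal, self-contained sketch. It is not the parser from the attachments: the class LenientTagSketch, the method unmatchedClosings, and the token format ("p" opens, "/p" closes) are all assumptions made for this example, and the real ignoreUnmatchedClosingTag flag may behave differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

// Illustrative only: a tiny tag balancer, NOT the attached parser.
final class LenientTagSketch {

    /**
     * Walks tag tokens ("p" opens, "/p" closes) and returns the names of
     * closing tags that match no open element. When ignoreUnmatchedClosing
     * is false, the first such tag raises an exception instead.
     */
    static List<String> unmatchedClosings(List<String> tokens, boolean ignoreUnmatchedClosing) {
        Deque<String> open = new ArrayDeque<>();    // stack of currently open element names
        List<String> skipped = new ArrayList<>();
        for (String token : tokens) {
            if (!token.startsWith("/")) {
                open.push(token.toLowerCase(Locale.ROOT));
                continue;
            }
            String name = token.substring(1).toLowerCase(Locale.ROOT);
            if (open.contains(name)) {
                // Pop up to and including the match; elements above it were left unclosed.
                while (!open.pop().equals(name)) {
                    // keep popping
                }
            } else if (ignoreUnmatchedClosing) {
                skipped.add(name);                  // tolerate the stray closing tag
            } else {
                throw new IllegalArgumentException("Unmatched closing tag: " + name);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Mirrors the tag sequence of the last line of the usage example above.
        List<String> tokens = Arrays.asList("p", "a", "invalid", "/invalidClosingTag", "/a", "/p");
        System.out.println(unmatchedClosings(tokens, true));
    }
}
```

In this sketch the stray closing tag is recorded and skipped in lenient mode, while the unclosed opening tag is dropped implicitly when its parent closes.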

Structure description (the 10 terms named below match the 10 classes in the source code):

Parser is the logic, which can be configured to ignore certain errors in HTML.
Node has 4 types: Text node, which is purely text; Root node, which is not a real node and holds the list of IDs in the HTML; Real node, which is a normal HTML (XML) node; and Comment node, which contains only text. A node can have attributes and children. A Real node has a Real node type; the 3 types correspond to nodes like "<?XML ... ?>", "<!DOCTYPE ... >", and normal "<...></...>".
Attribute has 2 types: Boolean attribute, which means true if it exists; and Real attribute, which has a value.
All public methods are generally useful. Please take a look at all of them before using the parser; they will be helpful.
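
The structure described above can be sketched as a small class hierarchy. This is only a rough model written for this article, not the attached source: every name prefixed with Sketch, the enum constants, and the method firstRealAttribute are hypothetical, and the real classes (Parser, Node, NodeRoot, NodeReal and the others) may be laid out differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the description above; NOT the attached source code.
abstract class SketchAttribute {
    final String name;
    SketchAttribute(String name) { this.name = name; }
}

// Boolean attribute: its presence alone means true.
final class SketchBooleanAttribute extends SketchAttribute {
    SketchBooleanAttribute(String name) { super(name); }
}

// Real attribute: carries a value, e.g. href="24k.com.sg".
final class SketchRealAttribute extends SketchAttribute {
    final String value;
    SketchRealAttribute(String name, String value) { super(name); this.value = value; }
}

// Any node can have attributes and children.
abstract class SketchNode {
    final List<SketchAttribute> attributes = new ArrayList<>();
    final List<SketchNode> children = new ArrayList<>();
}

final class SketchTextNode extends SketchNode {
    final String text;
    SketchTextNode(String text) { this.text = text; }
}

final class SketchCommentNode extends SketchNode {
    final String text;
    SketchCommentNode(String text) { this.text = text; }
}

// The 3 real node types: <?XML ... ?>, <!DOCTYPE ... >, and normal <...></...>.
enum SketchRealNodeType { PROCESSING_INSTRUCTION, DECLARATION, ELEMENT }

final class SketchRealNode extends SketchNode {
    final String name;
    final SketchRealNodeType type;
    SketchRealNode(String name, SketchRealNodeType type) { this.name = name; this.type = type; }

    // Returns the first valued attribute with the given name (case-insensitive), or null.
    SketchRealAttribute firstRealAttribute(String attrName) {
        for (SketchAttribute attr : attributes) {
            if (attr instanceof SketchRealAttribute && attr.name.equalsIgnoreCase(attrName)) {
                return (SketchRealAttribute) attr;
            }
        }
        return null;
    }
}

// Root node: not a real node; it also indexes nodes by their id attribute.
final class SketchRootNode extends SketchNode {
    final Map<String, SketchNode> nodesById = new LinkedHashMap<>();
}

final class ModelSketchDemo {
    // Builds by hand the tree for <p><a href="24k.com.sg" id="test">link</a></p>.
    static SketchRootNode buildExample() {
        SketchRootNode root = new SketchRootNode();
        SketchRealNode p = new SketchRealNode("p", SketchRealNodeType.ELEMENT);
        SketchRealNode a = new SketchRealNode("a", SketchRealNodeType.ELEMENT);
        a.attributes.add(new SketchRealAttribute("href", "24k.com.sg"));
        a.attributes.add(new SketchRealAttribute("id", "test"));
        a.children.add(new SketchTextNode("link"));
        p.children.add(a);
        root.children.add(p);
        root.nodesById.put("test", a);
        return root;
    }

    public static void main(String[] args) {
        SketchNode found = buildExample().nodesById.get("test");
        if (found instanceof SketchRealNode) {
            System.out.println(((SketchRealNode) found).firstRealAttribute("HREF").value);
        }
    }
}
```

The sketch builds the tree by hand instead of parsing, so it only shows how the node and attribute types relate to each other.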

The source code is in the attachments. I hope it helps someone.
