Web Development Blog

HTML Parser in Java
01:52, 01 May, 2012
by David

This is a simple HTML parser written in Java. You could also call it an XML parser that supports HTML elements. It handles HTML with correct syntax as well as HTML with minor syntax errors. I tried to make it easy to use.

Usage example:

Parser parser = new Parser();
// Tolerate common markup errors instead of failing on them.
parser.ignoreUnmatchedClosingTag = true;
parser.ignoreUnmatchedOpeningTagLayers = Integer.MAX_VALUE;
parser.ignoreDuplicateID = true;
parser.ignoreBRUnmatchedOpenAndChild = true;
NodeRoot root = parser.parse(
        "<?XML someXML ?>" +
        "<!HTML Single Tag>" +
        "<!-- HTML Comment-->" +
        "<p><a href=\"24k.com.sg\" id=\"test\"><invalid unclosed tag></invalidClosingTag></a></p>");
// Look up the <a> element by its id and print its href attribute.
Node nodeA = root.nodesById.get("test");
if (nodeA instanceof NodeReal && "A".equalsIgnoreCase(nodeA.name)) {
    System.out.println(nodeA.getFirstAttrWithNameReal("HREF").value);
}
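
To give a feel for what ignoring an unmatched closing tag can mean, here is a minimal sketch. It is not the attached parser: the class LenientTagMatcher and its match method are hypothetical, and I am only assuming that its flag plays a similar role to ignoreUnmatchedClosingTag above. It works on a list of tag names rather than on real HTML text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the attached parser: one way a lenient parser
// could tolerate a closing tag that has no matching opening tag.
class LenientTagMatcher {
    // Assumed to play a role similar to the post's ignoreUnmatchedClosingTag.
    boolean ignoreUnmatchedClosingTag = true;

    // Takes tag events such as "p", "a", "/a", "/p" and returns the names
    // of the elements that were closed, in closing order.
    List<String> match(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        List<String> closed = new ArrayList<>();
        for (String event : events) {
            if (!event.startsWith("/")) {
                // Opening tag: remember it until a closing tag arrives.
                open.push(event.toLowerCase(Locale.ROOT));
                continue;
            }
            String name = event.substring(1).toLowerCase(Locale.ROOT);
            if (!open.contains(name)) {
                if (ignoreUnmatchedClosingTag) {
                    continue; // drop the stray closing tag
                }
                throw new IllegalStateException("Unmatched closing tag: " + name);
            }
            // Close everything opened after the match (this implicitly
            // closes unclosed tags), then the matching element itself.
            String top;
            do {
                top = open.pop();
                closed.add(top);
            } while (!top.equals(name));
        }
        return closed;
    }
}
```

Feeding it the events p, a, invalid, /invalidClosingTag, /a, /p (roughly the last line of the example above) skips the stray closing tag and closes invalid, a and p in that order. With the flag set to false, the same stray tag throws an IllegalStateException.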

Structure description (the 10 underlined words below match the 10 classes in the source code):

Parser holds the parsing logic and can be configured to ignore certain errors in HTML.
Node has 4 types: Text node, which is purely text; Root node, which is not a real node and holds the list of IDs in the HTML; Real node, which is a normal HTML (XML) node; and Comment node, which contains purely text. A node can have attributes and children. A Real node has a Real node type, and the 3 types correspond to nodes like "<?XML ... ?>", "<!DOCTYPE ... >" and the normal "<...></...>".
Attribute has 2 types: Boolean attribute, which means true if it exists; Real attribute, which has a value.
All public methods are generally useful. Please look through them before using the parser.
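
The structure above could be modelled roughly as follows. This is only a sketch of the description, not the attached source: every class, field and method name here (NodeModelSketch, RealType, firstRealAttr, sample and so on) is made up for illustration and may differ from the real code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the structure described above; names are
// illustrative and may differ from the attached source.
class NodeModelSketch {
    // Assumed names for the three kinds of real node:
    // "<?XML ... ?>", "<!DOCTYPE ... >" and the normal "<...></...>".
    enum RealType { PROCESSING_INSTRUCTION, DECLARATION, NORMAL }

    static class Attr {
        final String name;
        Attr(String name) { this.name = name; }
    }

    // Boolean attribute: being present means true, so it has no value.
    static class AttrBoolean extends Attr {
        AttrBoolean(String name) { super(name); }
    }

    // Real attribute: carries a value.
    static class AttrReal extends Attr {
        final String value;
        AttrReal(String name, String value) { super(name); this.value = value; }
    }

    static class Node {
        final List<Attr> attrs = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        // First value-carrying attribute with this name (case-insensitive),
        // or null if there is none.
        AttrReal firstRealAttr(String attrName) {
            for (Attr a : attrs) {
                if (a instanceof AttrReal && a.name.equalsIgnoreCase(attrName)) {
                    return (AttrReal) a;
                }
            }
            return null;
        }
    }

    static class NodeText extends Node {
        final String text;
        NodeText(String text) { this.text = text; }
    }

    static class NodeComment extends Node {
        final String text;
        NodeComment(String text) { this.text = text; }
    }

    static class NodeReal extends Node {
        final String name;
        final RealType type;
        NodeReal(String name, RealType type) { this.name = name; this.type = type; }
    }

    // Not a real element: holds the top-level nodes and the ID index.
    static class NodeRoot extends Node {
        final Map<String, Node> nodesById = new HashMap<>();
    }

    // Builds the tree for <p><a href="24k.com.sg" id="test"></a></p> by hand.
    static NodeRoot sample() {
        NodeRoot root = new NodeRoot();
        NodeReal p = new NodeReal("p", RealType.NORMAL);
        NodeReal a = new NodeReal("a", RealType.NORMAL);
        a.attrs.add(new AttrReal("href", "24k.com.sg"));
        a.attrs.add(new AttrReal("id", "test"));
        p.children.add(a);
        root.children.add(p);
        root.nodesById.put("test", a);
        return root;
    }
}
```

With this model, NodeModelSketch.sample().nodesById.get("test").firstRealAttr("HREF").value gives back 24k.com.sg, which mirrors the lookup in the usage example.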

The source code is in the attachments. I hope it helps someone.

Attachments: