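The HTMLParser API is one of the more powerful HTML parsers available for Java. Download the HTML Parser library from its project site.
The sample HTML used here contains a table with rows like these: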
| Autocad Designer | RAMTESA | United Arab Emirates | 31-May-2011 |
| OIL AND GAS DOWNSTREAM SALES ENGINE | RAMTESA | United Arab Emirates | 31-May-2011 |
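To try the parser you need an HTML file containing rows like the ones above. The following is only a sketch of one way to generate such a file. The markup is an assumption: the original page source is not shown here, so the `class="job"` attribute value, the `SampleHtmlWriter` class, and the default `test.html` file name are all hypothetical. What matters is that each `tr` carries some `class` attribute and holds four `td` cells.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that writes a small sample file for the parser to read.
class SampleHtmlWriter {

    // Builds a minimal HTML table; the class name "job" is an assumed value.
    static String buildHtml() {
        String[][] rows = {
            {"Autocad Designer", "RAMTESA", "United Arab Emirates", "31-May-2011"},
            {"OIL AND GAS DOWNSTREAM SALES ENGINE", "RAMTESA", "United Arab Emirates", "31-May-2011"},
        };
        StringBuilder sb = new StringBuilder("<html><body><table>\n");
        for (String[] row : rows) {
            sb.append("<tr class=\"job\">");
            for (String cell : row) {
                sb.append("<td>").append(cell).append("</td>");
            }
            sb.append("</tr>\n");
        }
        return sb.append("</table></body></html>\n").toString();
    }

    public static void main(String[] args) throws IOException {
        // Write to the path given as the first argument, or test.html by default.
        Path out = Path.of(args.length > 0 ? args[0] : "test.html");
        Files.writeString(out, buildHtml()); // UTF-8 by default (Java 11+)
    }
}
```

Run it once and point the parser at the generated file.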
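Here is the Java class that parses the table rows shown above. It selects every tr element that has a class attribute and prints the text of its first four td cells.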
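import java.io.File;
import java.io.FileInputStream;

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.SimpleNodeIterator;

public class HtmlParserMain {
    public static void main(String[] args) throws Exception {
        // Open the file; try-with-resources closes the stream when we are done
        try (FileInputStream fis = new FileInputStream(new File("C:/Akila/test.html"))) {
            // Create the parser
            Parser parser = new Parser(new Lexer(new Page(fis, "UTF-8")));
            // Select specific nodes: here, only tr tags that have a class attribute
            NodeFilter filter =
                new AndFilter(
                    new TagNameFilter("tr"),
                    new HasAttributeFilter("class"));
            // Parse the HTML, keeping only the nodes that match the filter
            NodeList list = parser.parse(filter);
            // Get the iterator
            SimpleNodeIterator iterator = list.elements();
            // Iterate over the table rows
            while (iterator.hasMoreNodes()) {
                // Get the TR node
                TagNode node = (TagNode) iterator.nextNode();
                NodeList tdList = new NodeList();
                // Collect the TD nodes inside this row
                node.collectInto(tdList, new TagNameFilter("td"));
                // Print the first four cells; assumes every row has at least four
                System.out.print(tdList.elementAt(0).toPlainTextString() + " : ");
                System.out.print(tdList.elementAt(1).toPlainTextString() + " : ");
                System.out.print(tdList.elementAt(2).toPlainTextString() + " : ");
                System.out.println(tdList.elementAt(3).toPlainTextString());
            }
        }
        // If you want to parse the HTML directly from the web
        // Parser parse = new Parser("http://www.google.com");
    }
}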
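The loop above reads cells 0 to 3 directly, so a matching row with fewer than four td cells would fail with an index error. Below is a minimal sketch of a more defensive formatting step, assuming you first copy each cell's toPlainTextString() result into a List<String>. RowFormatter and formatRow are hypothetical names used for illustration; they are not part of the HTMLParser library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the HTMLParser library.
class RowFormatter {

    // Joins the first `expected` cell texts with " : ".
    // Missing cells are replaced by empty strings, so short rows do not throw.
    static String formatRow(List<String> cells, int expected) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < expected; i++) {
            out.add(i < cells.size() ? cells.get(i).trim() : "");
        }
        return String.join(" : ", out);
    }

    public static void main(String[] args) {
        // A complete row, using the sample data from this post
        System.out.println(formatRow(
            List.of("Autocad Designer", "RAMTESA", "United Arab Emirates", "31-May-2011"), 4));
        // A short row: the missing cells are printed as empty fields
        System.out.println(formatRow(List.of("Only one cell"), 4));
    }
}
```

Inside the while loop you could build the list from tdList (for example by looping from 0 to tdList.size()) and print formatRow(cells, 4) instead of the four separate print calls.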