Java Ninja Chronicles By Norris Shelton

Things I learned in the pursuit of code

I haven’t had to parse HTML in a while, but some co-workers were discussing two problems that both came down to HTML parsing. Dave Petersheim had already introduced jsoup into our project for just that purpose.

jsoup: Java HTML Parser

<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.4.1</version>
</dependency>

Dave used it to parse through an HTML fragment, looking for a text node to serve as a summary. Here is how he did it.

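import org.apache.commons.lang.StringUtils; // assumption: the post does not say which StringUtils is used
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
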
/**
 * Extracts the first line of text from the supplied html.
 * @param content the HTML (or HTML fragment) to search
 * @return The first line of text or empty string if one could not be found
 */
static String extractFirstText(String content) {
    Document doc = Jsoup.parse(content);
 
    for (org.jsoup.nodes.Element element : doc.getAllElements()) {
        if (element.hasText() && !StringUtils.isEmpty(element.ownText())) {
            return element.ownText();
        }
    }
 
    return "";
}
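To see what a method like this returns, here is a minimal sketch that runs the same loop against a few inline fragments. It assumes a reasonably recent jsoup on the classpath. The class name FirstTextDemo, the method name firstOwnText, and the sample HTML are made up for illustration, and the StringUtils check is swapped for a plain-Java one so the sketch has no extra dependency.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FirstTextDemo {

    // Same idea as extractFirstText above: return the first non-empty
    // "own" text found while walking every element in document order.
    // (Hypothetical helper; uses String.isEmpty instead of StringUtils.)
    static String firstOwnText(String html) {
        Document doc = Jsoup.parse(html);
        for (Element element : doc.getAllElements()) {
            String own = element.ownText().trim();
            if (!own.isEmpty()) {
                return own;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // The <div> has no text of its own, so the first <p> supplies the result.
        System.out.println(firstOwnText("<div><p>First line</p><p>Second line</p></div>"));

        // ownText() leaves out text that belongs to child elements such as <b>.
        System.out.println(firstOwnText("<p>Hello <b>there</b> now!</p>"));

        // A fragment with no text at all falls through to the empty string.
        System.out.println(firstOwnText("<div><img src=\"a.png\"></div>").isEmpty());
    }
}
```

The second call is the case to watch: because ownText() skips the text inside the nested b element, the summary reads "Hello now!" rather than the full sentence. If the full text is wanted, text() would be the call to reach for instead.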

I also ran across a blog post by Tom Czarniecki where he had a small JUnit snippet.

String html = response.getContentAsString();  
Document document = Jsoup.parse(html);  
  
Elements elements = document.select("#errorRef");  
assertThat(elements.size(), equalTo(1));  
  
assertThat(elements.first().text(), equalTo(errorRef));
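That snippet depends on a response object and an errorRef value from its surrounding test. As a rough standalone sketch of the same id lookup, the following parses an inline HTML string instead. The class name ErrorRefDemo, the method errorRefOf, and the sample markup are hypothetical; the jsoup calls (Jsoup.parse, select, first, text) are the same ones used above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ErrorRefDemo {

    // Returns the text of the first element with id="errorRef",
    // or null when the page contains no such element.
    static String errorRefOf(String html) {
        Document document = Jsoup.parse(html);
        Elements elements = document.select("#errorRef");
        Element first = elements.first(); // first() returns null for an empty result
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        // Made-up error page used only for this illustration.
        String page = "<html><body><p>Something went wrong.</p>"
                + "<span id=\"errorRef\">ERR-1234</span></body></html>";

        System.out.println(errorRefOf(page));
        System.out.println(errorRefOf("<p>No reference here</p>"));
    }
}
```

The null check matters outside a test: in the JUnit version the size assertion fails first if nothing matches, but plain code that calls first().text() on an empty result would throw a NullPointerException.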

While looking at the jsoup cookbook, I found that it supports CSS-like (or jQuery-like) selectors.

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png

Element masthead = doc.select("div.masthead").first(); // div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

Their documentation says the select method is available on a Document, an Element, or an Elements collection.
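A small sketch of what that means in practice: the same selector can be scoped to the whole page, to one element, or to an earlier result set. The class name SelectScopeDemo, its method names, and the sample markup are invented for illustration, and the sketch assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectScopeDemo {

    // Made-up page: one link in the masthead, two links and one
    // href-less anchor in the content area.
    static final String HTML =
            "<div class=\"masthead\"><a href=\"/home\">Home</a></div>"
          + "<div class=\"content\"><a href=\"/one\">One</a>"
          + "<a href=\"/two\">Two</a><a name=\"x\">Anchor</a></div>";

    // Document.select searches the whole page.
    static int linksInPage(String html) {
        return Jsoup.parse(html).select("a[href]").size();
    }

    // Element.select searches only inside that one element.
    static int linksInMasthead(String html) {
        Element masthead = Jsoup.parse(html).select("div.masthead").first();
        return masthead == null ? 0 : masthead.select("a[href]").size();
    }

    // Elements.select searches inside the elements of an earlier result.
    static int linksInContent(String html) {
        Elements contentDivs = Jsoup.parse(html).select("div.content");
        return contentDivs.select("a[href]").size();
    }

    public static void main(String[] args) {
        System.out.println("page: " + linksInPage(HTML));
        System.out.println("masthead: " + linksInMasthead(HTML));
        System.out.println("content: " + linksInContent(HTML));
    }
}
```

Narrowing the scope first and selecting again tends to read better than one long combined selector when the outer element is needed for something else as well.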

I ran across an HTML parser question on stackprinter.appspot.com which asked “What are the pros and cons of the leading Java HTML parsers?”

It had an excellent response by BalusC:
[+15] [2010-07-01 00:00:32] BalusC [ACCEPTED]
General

Almost all known HTML parsers implements W3C DOM API [1] (part of JAXP, Java API for XML processing) and gives you a org.w3c.dom.Document [2] back. The major differences are usually to be found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-wellformed HTML (“tagsoup”), like JTidy [3], HtmlCleaner [4] and TagSoup [5]. You usually use this kind of HTML parsers to “tidy” the HTML source so that you can traverse it “the usual way” using the W3C DOM and JAXP API.

The only ones which jumps out are HtmlUnit [6] and Jsoup [7].

HtmlUnit

HtmlUnit [8] provides a completely own API which gives you the possibility to act like a webbrowser programmatically. I.e. enter form values, click elements, invoke JavaScript, etcetera. It’s much more than alone a HTML parser. It’s a real “GUI-less webbrowser” and HTML unit testing tool.

Jsoup

Jsoup [9] also provides a completely own API. It gives you the possibility to select elements using jQuery [10]-like CSS selectors [11] and provides a very nice API to traverse the HTML DOM tree. It’s in my opinion a real revolution. Ones who have worked with org.w3c.dom.Document knows what a hell of pain it is to traverse the DOM to get the elements of interest using verbose NodeList [12] and Node [13] API’s. True, XPath [14] makes the life easier, but still, it’s another learning curve and it can end up to be pretty verbose.

Here’s an example which uses a “plain” W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath since without it, the code needed to gather the information of interest would otherwise grow up 10 times as big).

String url = "http://stackoverflow.com/questions/3152138";
Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
XPath xpath = XPathFactory.newInstance().newXPath();

Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
System.out.println("Question: " + question.getFirstChild().getNodeValue());

NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < answerers.getLength(); i++) {
    System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
}

And here's an example how to do exactly the same with Jsoup:

String url = "http://stackoverflow.com/questions/3152138";
Document document = Jsoup.connect(url).get();

String question = document.select("#question .post-text p").first().text();
System.out.println("Question: " + question);

Elements answerers = document.select("#answers .user-details a");
for (Element answerer : answerers) {
    System.out.println("Answerer: " + answerer.text());
}

Do you see the difference? For me as being a webdeveloper with a decade of experience, Jsoup was easy to grasp thanks to the support for CSS selectors which I am already familiar with.

Summary

The pro's and cons of each should be obvious enough. If you just want to use a XML based tool to traverse it, then just go for the first mentioned group of parsers. Which one to choose depends on the features it provides and the robustness of the library (how often is it updated/maintained/fixed?). There are pretty a lot [15] of them. My personal preference of them is JTidy (HtmlCleaner is also nice, it was the best choice until JTidy finally updated their API last year after years of absence). If you like to unit test the HTML, then HtmlUnit is the way to go. If you like to extract specific data from the HTML, then Jsoup is the way to go.
[1] http://java.sun.com/javase/6/docs/api/org/w3c/dom/package-summary.html
[2] http://java.sun.com/javase/6/docs/api/org/w3c/dom/Document.html
[3] http://jtidy.sourceforge.net/
[4] http://htmlcleaner.sourceforge.net/
[5] http://home.ccil.org/~cowan/XML/tagsoup/
[6] http://htmlunit.sourceforge.net/
[7] http://jsoup.org/
[8] http://htmlunit.sourceforge.net/
[9] http://jsoup.org/
[10] http://jquery.com
[11] http://www.w3.org/TR/css3-selectors/
[12] http://java.sun.com/javase/6/docs/api/org/w3c/dom/NodeList.html
[13] http://java.sun.com/javase/6/docs/api/org/w3c/dom/Node.html
[14] http://java.sun.com/javase/6/docs/api/javax/xml/xpath/XPath.html
[15] http://java-source.net/open-source/html-parsers

(2) Wow, great answer. Thanks! - Avi Flax
You're welcome. - BalusC

January 27th, 2011

Posted In: Java
