Parsing HTML step by step

The following snippet shows what happens in step 4 of the PDF creation process in more detail.

HtmlPipelineContext htmlContext = new HtmlPipelineContext();
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
CSSResolver cssResolver =
    XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
Pipeline<?> pipeline =
    new CssResolverPipeline(cssResolver,
        new HtmlPipeline(htmlContext,
            new PdfWriterPipeline(document, writer)));
XMLWorker worker = new XMLWorker(pipeline, true);
XMLParser p = new XMLParser(worker);
p.parse(HTMLParsingProcess.class.getResourceAsStream("/html/walden.html"));

See HTMLParsingProcess and the resulting PDF walden3.pdf.

Let's do a bottom-up examination of this snippet.

HTML input

As you can see, we parse the HTML as an InputStream. We could also have used a Reader object to read the HTML file.
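
One reason to prefer a Reader is that it makes the character encoding explicit. The sketch below uses only the JDK: the class name ReaderInputSketch, its two helper methods, and the sample markup are made up for illustration, and a byte array stands in for the HTML file. The resulting Reader would then be handed to the parser in place of the InputStream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of XML Worker.
final class ReaderInputSketch {

    // Wraps a byte stream in a Reader that decodes it as UTF-8.
    static Reader asUtf8Reader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Drains a Reader into a String; used here only to show the decoded text.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes standing in for an HTML file on disk or on the classpath.
        byte[] html = "<p>Caf\u00e9 Walden</p>".getBytes(StandardCharsets.UTF_8);
        try (Reader reader = asUtf8Reader(new ByteArrayInputStream(html))) {
            System.out.println(readAll(reader));
        }
    }
}
```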

XMLParser

The XMLParser class expects an implementation of the XMLParserListener interface. XMLWorker is such an implementation. Another implementation (ParserListenerWriter) was written for debugging purposes.
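
The relation between the parser and its listener can be pictured with a much-simplified model. Everything in this sketch is hypothetical: the MiniListener interface, its three callbacks, and the LoggingListener class are stand-ins chosen for illustration, not the actual XMLParserListener API. It only shows the idea that a parser reports events to a listener, and that a debugging listener can simply write those events out.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a parser listener; NOT the real XMLParserListener.
interface MiniListener {
    void startElement(String tag);
    void text(String content);
    void endElement(String tag);
}

// Debugging listener in the spirit of ParserListenerWriter: it records events as text.
final class LoggingListener implements MiniListener {
    private final StringBuilder log = new StringBuilder();

    @Override public void startElement(String tag) { log.append("start:").append(tag).append('\n'); }
    @Override public void text(String content)     { log.append("text:").append(content).append('\n'); }
    @Override public void endElement(String tag)   { log.append("end:").append(tag).append('\n'); }

    String log() { return log.toString(); }
}

final class ListenerSketch {

    // Stands in for the parser: replays pre-tokenized events to the listener.
    // Each event is {kind, value}, where kind is "start", "text" or "end".
    static void replay(List<String[]> events, MiniListener listener) {
        for (String[] e : events) {
            switch (e[0]) {
                case "start": listener.startElement(e[1]); break;
                case "text":  listener.text(e[1]); break;
                case "end":   listener.endElement(e[1]); break;
                default: throw new IllegalArgumentException("unknown event " + e[0]);
            }
        }
    }

    public static void main(String[] args) {
        LoggingListener listener = new LoggingListener();
        replay(Arrays.asList(
                new String[] {"start", "p"},
                new String[] {"text", "Walden"},
                new String[] {"end", "p"}), listener);
        System.out.print(listener.log());
    }
}
```

Swapping LoggingListener for a listener that does real work mirrors the choice between ParserListenerWriter and XMLWorker described above.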

XMLWorker

The XMLWorker constructor expects two parameters: a Pipeline<?> and a boolean indicating whether or not the XML should be treated as HTML. If true, all tags will be converted to lowercase and whitespace used to indent the HTML syntax will be ignored. Internally, XMLWorker creates Tag objects that are processed using implementations of the TagProcessor interface (for instance com.itextpdf.tool.xml.html.Anchor is the tag processor for the <a>-tag).

Pipeline<?>

In this case, we're parsing XHTML and CSS to PDF; we define the Pipeline<?> as a chain of three Pipeline implementations:

  1. a CssResolverPipeline,
  2. an HtmlPipeline, and
  3. a PdfWriterPipeline.

You create the first pipeline passing the second one as a parameter; the second pipeline is instantiated passing the third as a parameter; and so on.

Pipeline<?> pipeline =
    new CssResolverPipeline(cssResolver,
        new HtmlPipeline(htmlContext,
            new PdfWriterPipeline(document, writer)));

The PdfWriterPipeline marks the end of the pipeline: it creates the PDF document.
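
The nesting of constructors can look odd at first, so here is a stripped-down model of the same idea. The Stage class and its three subclasses are hypothetical and are not the real Pipeline<?> types; the sketch only assumes that each stage does its own work and then hands the tag to the stage it was constructed with, which is why the chain is built from the last stage outward.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a chained pipeline; NOT the real Pipeline<?> interface.
abstract class Stage {
    private final Stage next; // null marks the end of the chain

    Stage(Stage next) {
        this.next = next;
    }

    // Records that this stage saw the tag, then passes it on if there is a next stage.
    final void process(String tag, List<String> trace) {
        trace.add(name() + "(" + tag + ")");
        if (next != null) {
            next.process(tag, trace);
        }
    }

    abstract String name();
}

final class CssStage extends Stage {
    CssStage(Stage next) { super(next); }
    @Override String name() { return "css"; }
}

final class HtmlStage extends Stage {
    HtmlStage(Stage next) { super(next); }
    @Override String name() { return "html"; }
}

final class WriterStage extends Stage {
    WriterStage() { super(null); } // last stage: nothing follows it
    @Override String name() { return "writer"; }
}

final class PipelineChainSketch {

    // Builds the chain in the same nested style as the snippet above and runs one tag through it.
    static List<String> run(String tag) {
        Stage chain = new CssStage(new HtmlStage(new WriterStage()));
        List<String> trace = new ArrayList<>();
        chain.process(tag, trace);
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run("p")); // prints [css(p), html(p), writer(p)]
    }
}
```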

CssResolverPipeline

The style of your HTML document is probably defined using Cascading Style Sheets (CSS). The CssResolverPipeline is responsible for adding the correct CSS properties to each Tag that is created by XMLWorker. Without a CssResolverPipeline, the document would be parsed without style. The CssResolverPipeline constructor needs a CSSResolver instance. The getDefaultCssResolver() method in the XMLWorkerHelper class provides a default CSSResolver:

CSSResolver cssResolver = XMLWorkerHelper.getInstance().getDefaultCssResolver(true);

The boolean parameter indicates whether or not the default.css (shipped with XML Worker) should be added to the resolver.
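
To make the effect of that boolean concrete, here is a toy model of a resolver. MiniCssResolver is hypothetical and far simpler than the real CSSResolver: it has no selectors or specificity, and the h1 rule is a made-up stand-in for the contents of default.css. The sketch also assumes the usual cascade in which rules added later override the defaults.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of a CSS resolver; NOT the real CSSResolver interface.
final class MiniCssResolver {
    // tag name -> (property -> value)
    private final Map<String, Map<String, String>> rules = new LinkedHashMap<>();

    // The flag mirrors the boolean described above: should the defaults be loaded?
    MiniCssResolver(boolean addDefaultCss) {
        if (addDefaultCss) {
            // Made-up default rule, standing in for the contents of default.css.
            addRule("h1", "font-size", "2em");
        }
    }

    // A later rule for the same tag and property replaces the earlier one.
    void addRule(String tag, String property, String value) {
        rules.computeIfAbsent(tag, k -> new LinkedHashMap<>()).put(property, value);
    }

    // Returns a copy of the properties that apply to the given tag (possibly empty).
    Map<String, String> resolve(String tag) {
        return new LinkedHashMap<>(rules.getOrDefault(tag, Collections.<String, String>emptyMap()));
    }

    public static void main(String[] args) {
        MiniCssResolver withDefaults = new MiniCssResolver(true);
        withDefaults.addRule("h1", "color", "navy"); // a rule from the document's own CSS
        System.out.println(withDefaults.resolve("h1"));              // {font-size=2em, color=navy}
        System.out.println(new MiniCssResolver(false).resolve("h1")); // {}
    }
}
```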

HtmlPipeline

Next in line is the HtmlPipeline. Its constructor expects an HtmlPipelineContext.

HtmlPipelineContext htmlContext = new HtmlPipelineContext();
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());

Using the setTagFactory() method of the HtmlPipelineContext, you can configure how the HtmlPipeline should interpret the tags encountered by the parser. We've created a default implementation of the TagProcessorFactory interface for parsing HTML. It can be obtained using the getHtmlTagProcessorFactory() method in the Tags class.
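
The role of such a factory can be sketched as a lookup from tag names to processors. MiniTagFactory below is hypothetical and is not the real TagProcessorFactory interface; the two registered tags, the string-to-string processors, and the pass-through behavior for unknown tags are choices made for this sketch only. Tag names are lower-cased on lookup, in line with the HTML mode of XMLWorker described earlier.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Simplified model of a tag processor factory; NOT the real TagProcessorFactory.
final class MiniTagFactory {
    // tag name -> processor turning the tag's text content into some output
    private final Map<String, Function<String, String>> processors = new HashMap<>();

    void register(String tag, Function<String, String> processor) {
        processors.put(tag.toLowerCase(Locale.ROOT), processor);
    }

    // Looks up the processor for a tag. Unknown tags pass their content through
    // unchanged (an assumption of this sketch, not documented library behavior).
    Function<String, String> processorFor(String tag) {
        return processors.getOrDefault(tag.toLowerCase(Locale.ROOT), Function.identity());
    }

    // A tiny made-up "HTML" configuration with two tags.
    static MiniTagFactory htmlDefaults() {
        MiniTagFactory factory = new MiniTagFactory();
        factory.register("a", content -> "link[" + content + "]");
        factory.register("p", content -> "paragraph[" + content + "]");
        return factory;
    }

    public static void main(String[] args) {
        MiniTagFactory factory = htmlDefaults();
        System.out.println(factory.processorFor("A").apply("home"));   // link[home]
        System.out.println(factory.processorFor("svg").apply("shape")); // shape
    }
}
```

Replacing the factory changes how every tag is interpreted without touching the rest of the chain, which is the point of configuring it on the context.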

If you want to parse other types of XML, you'll need to write your own Pipeline implementations, for instance an SvgPipeline.

PdfWriterPipeline

This is the end of the pipeline. The PdfWriterPipeline constructor expects the Document and PdfWriter instances you created in steps 1 and 2 of the PDF creation process.

In some cases, using the default configuration won't be sufficient, and you'll need to configure XML Worker yourself. This is the case if you want to parse HTML with images and links.