Parsing HTML step by step
The following snippet shows what happens in step 4 of the PDF creation process in more detail.
HtmlPipelineContext htmlContext = new HtmlPipelineContext();
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
CSSResolver cssResolver = XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
Pipeline<?> pipeline = new CssResolverPipeline(cssResolver,
    new HtmlPipeline(htmlContext,
        new PdfWriterPipeline(document, writer)));
XMLWorker worker = new XMLWorker(pipeline, true);
XMLParser p = new XMLParser(worker);
p.parse(HTMLParsingProcess.class.getResourceAsStream("/html/walden.html"));
See HTMLParsingProcess and the resulting PDF, walden3.pdf.
Let's do a bottom-up examination of this snippet.
HTML input
As you can see, we parse the HTML as an InputStream. We could also have used a Reader object to read the HTML file.
The XMLParser class expects an implementation of the XMLParserListener interface. XMLWorker is such an implementation. Another implementation (ParserListenerWriter) was written for debugging purposes.
The XMLWorker constructor expects two parameters: a Pipeline<?> and a boolean indicating whether or not the XML should be treated as HTML. If true, all tags will be converted to lowercase and whitespace used to indent the HTML syntax will be ignored.
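To make the effect of that boolean concrete, here is a minimal stand-alone sketch of the parser/listener split. It does not use the XML Worker classes: MiniListener, RecordingListener and ListenerDemo are hypothetical names, and the real XMLParserListener interface has more callbacks than the two modeled here. The sketch only mimics the two behaviors described above (lowercasing tag names and skipping indentation whitespace).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, reduced listener; NOT the XML Worker XMLParserListener interface.
interface MiniListener {
    void startElement(String tag);
    void text(String text);
}

// Records the events it receives, optionally treating the input as HTML.
class RecordingListener implements MiniListener {
    final List<String> events = new ArrayList<>();
    private final boolean parseHtml;

    RecordingListener(boolean parseHtml) {
        this.parseHtml = parseHtml;
    }

    @Override
    public void startElement(String tag) {
        // In HTML mode, tag names are normalized to lowercase.
        events.add("start:" + (parseHtml ? tag.toLowerCase(Locale.ROOT) : tag));
    }

    @Override
    public void text(String text) {
        // In HTML mode, text that consists only of whitespace is dropped.
        if (parseHtml && text.trim().isEmpty()) {
            return;
        }
        events.add("text:" + text);
    }
}

class ListenerDemo {
    // Feeds the same hand-written event sequence to a listener in either mode.
    static List<String> feed(boolean parseHtml) {
        RecordingListener listener = new RecordingListener(parseHtml);
        listener.startElement("BODY");
        listener.text("\n    ");
        listener.startElement("P");
        listener.text("Walden");
        return listener.events;
    }

    public static void main(String[] args) {
        System.out.println(feed(true));
        System.out.println(feed(false));
    }
}
```

With parseHtml set to true, the listener records three events with lowercase tag names; with false, it records all four events exactly as they were fed in.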
Internally, XMLWorker creates Tag objects that are processed using implementations of the TagProcessor interface (for instance, com.itextpdf.tool.xml.html.Anchor is the tag processor for the <a> tag).
In this case, we're parsing XHTML and CSS to PDF; we define the Pipeline<?> as a chain of three Pipeline implementations: CssResolverPipeline, HtmlPipeline, and PdfWriterPipeline. You create the first pipeline by passing the second one as a constructor parameter; the second pipeline is instantiated by passing the third as a parameter; and so on.
Pipeline<?> pipeline = new CssResolverPipeline(cssResolver, new HtmlPipeline(htmlContext, new PdfWriterPipeline(document, writer)));
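The nesting can be easier to follow in a stripped-down model. The sketch below is not the XML Worker API: Stage, StyleStage, LayoutStage, OutputStage and ChainDemo are hypothetical names, and each stage merely logs that it saw a tag. It shows only the structural idea: every stage receives its successor in the constructor, and a tag travels from the outermost stage to the innermost one.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the pipeline idea; NOT the XML Worker Pipeline interface.
abstract class Stage {
    private final Stage next; // null marks the end of the chain

    Stage(Stage next) {
        this.next = next;
    }

    // Handle one tag, then hand it to the next stage, if there is one.
    final void process(String tag, List<String> log) {
        log.add(name() + " saw <" + tag + ">");
        if (next != null) {
            next.process(tag, log);
        }
    }

    abstract String name();
}

class StyleStage extends Stage {
    StyleStage(Stage next) { super(next); }
    @Override String name() { return "style"; }
}

class LayoutStage extends Stage {
    LayoutStage(Stage next) { super(next); }
    @Override String name() { return "layout"; }
}

class OutputStage extends Stage {
    OutputStage() { super(null); } // last stage: nothing follows it
    @Override String name() { return "output"; }
}

class ChainDemo {
    static List<String> run(String tag) {
        // Built inside-out, in the same shape as the nested constructors above.
        Stage chain = new StyleStage(new LayoutStage(new OutputStage()));
        List<String> log = new ArrayList<>();
        chain.process(tag, log);
        return log;
    }

    public static void main(String[] args) {
        run("p").forEach(System.out::println);
    }
}
```

Although the chain is constructed from the last stage outward, processing runs in reading order: style, then layout, then output.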
The PdfWriterPipeline marks the end of the pipeline: it creates the PDF document.
The style of your HTML document is probably defined using Cascading Style Sheets (CSS). The CssResolverPipeline is responsible for adding the correct CSS properties to each Tag that is created by XMLWorker. Without a CssResolverPipeline, the document would be parsed without style.
The CssResolverPipeline constructor needs a CSSResolver instance. The getDefaultCssResolver() method in the XMLWorkerHelper class provides a default CSSResolver:
CSSResolver cssResolver = XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
The boolean parameter indicates whether or not the default.css (shipped with XML Worker) should be added to the resolver.
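As a rough illustration of what such a flag changes, consider the following stand-alone model. It is not the XML Worker resolver: MiniCssResolver and CssDemo are hypothetical names, the two default properties are invented placeholders rather than the contents of default.css, and the rule that a later declaration replaces an earlier one with the same name is an assumption of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, heavily simplified resolver; NOT the XML Worker CSSResolver.
class MiniCssResolver {
    private final Map<String, String> rules = new TreeMap<>();

    MiniCssResolver(boolean addDefaultCss) {
        if (addDefaultCss) {
            // Invented defaults standing in for a bundled stylesheet.
            rules.put("font-size", "12pt");
            rules.put("margin-top", "1em");
        }
    }

    // Assumption of this sketch: a later declaration replaces an earlier one.
    void addCss(String property, String value) {
        rules.put(property, value);
    }

    // Returns a copy so callers cannot modify the resolver's state.
    Map<String, String> resolve() {
        return new TreeMap<>(rules);
    }
}

class CssDemo {
    public static void main(String[] args) {
        MiniCssResolver withDefaults = new MiniCssResolver(true);
        withDefaults.addCss("font-size", "10pt");
        System.out.println(withDefaults.resolve());

        MiniCssResolver withoutDefaults = new MiniCssResolver(false);
        withoutDefaults.addCss("font-size", "10pt");
        System.out.println(withoutDefaults.resolve());
    }
}
```

In this model, a resolver created with true starts from the built-in properties and lets the document's own declarations refine them, while a resolver created with false knows only what the document declares.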
Next in line is the HtmlPipeline. Its constructor expects an HtmlPipelineContext.
HtmlPipelineContext htmlContext = new HtmlPipelineContext();
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
Using the setTagFactory() method of the HtmlPipelineContext, you can configure how the HtmlPipeline should interpret the tags encountered by the parser. We've created a default implementation of the TagProcessorFactory interface for parsing HTML. It can be obtained using the getHtmlTagProcessorFactory() method in the Tags class.
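The role of such a factory can be sketched as a lookup from tag name to processor. The model below is not the XML Worker API: SimpleTagProcessor, SimpleTagFactory and TagFactoryDemo are hypothetical names, the processors return strings instead of PDF content, and passing unknown tags through unchanged is an assumption made only for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical processor; NOT the XML Worker TagProcessor interface.
interface SimpleTagProcessor {
    String render(String content);
}

// Hypothetical factory that maps lowercase tag names to processors.
class SimpleTagFactory {
    private final Map<String, SimpleTagProcessor> processors = new HashMap<>();

    void register(String tagName, SimpleTagProcessor processor) {
        processors.put(tagName.toLowerCase(Locale.ROOT), processor);
    }

    // Unknown tags pass their content through unchanged (assumption of this sketch).
    SimpleTagProcessor getProcessor(String tagName) {
        return processors.getOrDefault(tagName.toLowerCase(Locale.ROOT), content -> content);
    }

    // A ready-made configuration, in the spirit of a default HTML factory.
    static SimpleTagFactory htmlDefaults() {
        SimpleTagFactory factory = new SimpleTagFactory();
        factory.register("a", content -> "[link: " + content + "]");
        factory.register("p", content -> content + "\n");
        return factory;
    }
}

class TagFactoryDemo {
    public static void main(String[] args) {
        SimpleTagFactory factory = SimpleTagFactory.htmlDefaults();
        System.out.println(factory.getProcessor("A").render("Walden"));
    }
}
```

Swapping in a factory with different registrations changes how each tag is interpreted without touching the parser or the other stages, which is the kind of configuration point setTagFactory() offers.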
If you want to parse other types of XML, you'll need to write your own Pipeline implementations, for instance an SvgPipeline.
The PdfWriterPipeline is the end of the pipeline. Its constructor expects the Document and the PdfWriter instance you created in steps 1 and 2 of the PDF creation process.
In some cases, using the default configuration won't be sufficient, and you'll need to configure XML Worker yourself. This is the case if you want to parse HTML with images and links.