I'm just starting with XML and XSLT. I'm trying to read some HTML.
Here are some pertinent details. The code is based on the first Transformer example in the J2EE/XML tutorial, and I'm using Java 5.
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(false);
factory.setValidating(false);
try {
    File f = new File(argv[0]);
    DocumentBuilder builder =
        factory.newDocumentBuilder();
    document = builder.parse(f);
The document I'm feeding it starts with:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
I ran the original HTML through JTidy.
The problem is that I'm getting this:
[Fatal Error] loose.dtd:31:3: The declaration for the entity "HTML.Version" must end with '>'.
** Parsing error, line 31, uri http://www.w3.org/TR/html4/loose.dtd
The declaration for the entity "HTML.Version" must end with '>'.
org.xml.sax.SAXParseException: The declaration for the entity "HTML.Version" must end with '>'.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at scraper.Main.main(Main.java:54)
So it's obviously trying to read the DTD, even though I called setValidating(false).
I'm mostly after well-formed HTML/XML (which is why I ran it through JTidy), rather than any particular "valid" XML.
Is there a way to read this so that the parser ignores the actual DTD and simply returns me a DOM of the document?
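For illustration, here is a minimal sketch of one approach that may work with the stock JAXP parser: installing an EntityResolver that hands back an empty stream for every external entity, so the parser never goes out to fetch loose.dtd. The class name DtdSkippingParser and the sample markup are hypothetical, and this assumes the input is already well-formed (for example, JTidy output).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the tutorial code.
public class DtdSkippingParser {

    /** Parses well-formed markup without fetching any external DTD. */
    public static Document parse(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(false);
        factory.setValidating(false);

        DocumentBuilder builder = factory.newDocumentBuilder();
        // Answer every external entity request (including the DOCTYPE's
        // system identifier) with an empty document instead of a download.
        builder.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));
            }
        });
        return builder.parse(in);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input that uses the same DOCTYPE as the question.
        String doc =
            "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"\n"
            + "    \"http://www.w3.org/TR/html4/loose.dtd\">\n"
            + "<html><head><title>t</title></head>"
            + "<body><p>hello</p></body></html>";
        Document d = parse(new ByteArrayInputStream(doc.getBytes("UTF-8")));
        System.out.println(d.getDocumentElement().getNodeName());
    }
}
```

One caveat with this sketch: because the DTD is replaced by nothing, named entities that it would have declared (such as &nbsp;) are no longer defined, so a document that still contains them will probably fail to parse. Xerces also documents a feature, http://apache.org/xml/features/nonvalidating/load-external-dtd, that can be set to false through factory.setFeature; whether the parser bundled with a given JDK honors it is an assumption to check.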
Or can someone suggest a better "real world" HTML reader I can use? I want to be able to visit the document via DOM as well as feed it to XSLT. The easiest approach I've found so far is to pipe the HTML through JTidy to clean it up, then parse the result.
There's the CyberNeko HTML parser, but it seems too tightly integrated with Apache's Xerces (rather than the "stock" JDK 5 parser), so that seemed like a more difficult integration given just a JAR file for the application.
Any help appreciated.