Java EE (Java Enterprise Edition) General Discussion

Error reading W3C DTD while parsing HTML with JDK 5

843834, Dec 18 2005 (edited Dec 18 2005)

I'm just starting with XML and XSLT. I'm trying to read some HTML.

Here are some pertinent details, based on the first Transformer example in the J2EE/XML tutorial. I'm using Java 5.

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(false);
        factory.setValidating(false);

        try {
            File f = new File(argv[0]);
            DocumentBuilder builder = factory.newDocumentBuilder();
            document = builder.parse(f);

The document I'm feeding it starts with:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

I ran the original HTML through JTidy.

The problem is that I'm getting this:

[Fatal Error] loose.dtd:31:3: The declaration for the entity "HTML.Version" must end with '>'.

** Parsing error, line 31, uri http://www.w3.org/TR/html4/loose.dtd
  The declaration for the entity "HTML.Version" must end with '>'.
org.xml.sax.SAXParseException: The declaration for the entity "HTML.Version" must end with '>'.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
        at scraper.Main.main(Main.java:54)

Now, it's obviously trying to read the DTD, even though I called setValidating(false).

I'm mostly after "well formed" HTML/XML (which is why I ran it through JTidy), rather than any particular "valid" XML.

Is there a way to read this so that the parser ignores the actual DTD and simply returns me a DOM of the document?
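One approach that is commonly suggested for this situation is to install an EntityResolver that resolves every external entity, the DTD included, to an empty stream, so the parser never fetches loose.dtd. The following is only a minimal sketch under that assumption. The class name NoDtdParse and the helper parseIgnoringDtd are made up for illustration, and the sample document is invented. One caveat: with an empty DTD, named entities such as &nbsp; are no longer declared, so the input has to use numeric character references instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original program.
class NoDtdParse {

    // Parses the given markup into a DOM without loading any external DTD.
    static Document parseIgnoringDtd(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(false);
        factory.setValidating(false);

        DocumentBuilder builder = factory.newDocumentBuilder();

        // Answer every external entity request (including the DOCTYPE's
        // system identifier) with an empty stream instead of fetching it.
        builder.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));
            }
        });

        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Invented sample input that carries the same DOCTYPE as the question.
        String xml =
              "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"\n"
            + "\"http://www.w3.org/TR/html4/loose.dtd\">\n"
            + "<html><head><title>t</title></head><body><p>hello</p></body></html>";

        Document doc = parseIgnoringDtd(xml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The same resolver can be set on the builder from the snippet above before calling builder.parse(f); the string-based input here is just to keep the sketch self-contained.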

Or, can someone suggest a better "real world" HTML reader I can use? I want to be able to visit the document via DOM as well as feed it to XSLT. The easiest thing I found seems to be piping it through JTidy to clean it up, then parsing the result.
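For the "visit via DOM, then feed it to XSLT" part, the standard javax.xml.transform API accepts a DOM tree directly through DOMSource. The sketch below assumes an already cleaned-up, well-formed document. The class DomToXslt, its helpers, and the tiny stylesheet (which prints the text of every p element) are invented for illustration. The builder is made namespace-aware here because XSLT processing is defined in terms of namespaces; that is a choice of this sketch, not something taken from the tutorial code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original program.
class DomToXslt {

    // Invented stylesheet: writes the text of each <p> element on its own line.
    static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='//p'>"
        + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Builds a DOM from well-formed markup (no DOCTYPE in this sketch).
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Runs the stylesheet over the DOM and returns the result as a string.
    static String transform(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>a</p><p>b</p></body></html>");

        // The same Document can still be walked through the DOM API.
        System.out.println(doc.getElementsByTagName("p").getLength());

        // ...and handed to XSLT without re-parsing.
        System.out.print(transform(doc));
    }
}
```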

There's the CyberNeko HTML parser, but that seems too tightly integrated with Apache's Xerces (rather than "stock" JDK 5), so it seemed a more difficult integration given just a JAR file for the application.

Any help appreciated.

Locked on Jan 15 2006

Added on Dec 18 2005

#java-technology-xml