Java EE (Java Enterprise Edition) General Discussion

How to parse XML document with default namespace with JDOM XPath

htran_888, Nov 4 2008 (edited Nov 6 2008)
Hi All,

I am having difficulty parsing an HTML document that declares a default namespace, using Saxon and the TagSoup parser. The relevant content of the document is as follows:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
……..
</head>
<body>
    <div id="container">
        <div id="content">
            <table class="sresults">
                <tr>
                    <td>
                        <a href="http://www.abc.com/areas" title="Hollywood, CA">hollywood</a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title="San Jose, CA">san jose</a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title="San Francisco, CA">san francisco</a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title="San Diego, CA">San diego</a>
                    </td>
              </tr>
……….
</body>
</html>

Below are the relevant code snippets, illustrating how I have attempted to retrieve the contents (the value of <a>):

             import java.io.*;
             import java.util.*;
             import org.jdom.*;
             import org.jdom.input.SAXBuilder;
             import org.jdom.xpath.*;
             import org.saxpath.*;
             import org.ccil.cowan.tagsoup.Parser;

( 1 )       FileReader frInHtml = new FileReader("C:\\Tmp\\ABC.html");
( 2 )       BufferedReader brInHtml = new BufferedReader(frInHtml);
( 3 ) //    SAXBuilder saxBuilder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
( 4 )       SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
( 5 )       org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);
( 6 )       XPath xpath =  XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
( 7 )       xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");   // bind prefix "ns" to the XHTML namespace
( 8 )       java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));
( 9 )       Iterator iterator = list.iterator();
( 10 )     while (iterator.hasNext())
( 11 )     {
( 12 )            Object object = iterator.next();
( 13 ) //         if (object instanceof Element)
( 14 ) //               System.out.println(((Element)object).getTextNormalize());
( 15 )             if (object instanceof Content)
( 16 )                   System.out.println(((Content)object).getValue());
( 17 )     }
….
This program works on the same document when the default namespace is removed, in which case the "ns" prefix in the XPath statements (lines 6-7) is not needed either. In the past I also successfully retrieved the content of <a> from the same document, without the default namespace, using "org.apache.xerces.parsers.SAXParser".
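
To make the namespace behaviour described above concrete, here is a minimal sketch that reproduces it outside JDOM. It is an illustration only and swaps in the JDK's built-in JAXP classes (javax.xml.parsers and javax.xml.xpath) instead of JDOM, Saxon, or TagSoup, so that it runs without extra jars. The class name NamespaceXPathDemo, the helper select, and the trimmed SAMPLE string are hypothetical and not part of the original program. The point it shows is that in XPath 1.0 an unprefixed name only matches elements in no namespace, so a document with xmlns="http://www.w3.org/1999/xhtml" needs either a bound prefix or a local-name() test.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical, trimmed-down stand-in for the posted document (no DOCTYPE).
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
      + "<div id=\"container\"><div id=\"content\"><table class=\"sresults\"><tr>"
      + "<td><a href=\"http://www.abc.com/areas\" title=\"Hollywood, CA\">hollywood</a></td>"
      + "<td><a href=\"http://www.abc.com/areas\" title=\"San Jose, CA\">san jose</a></td>"
      + "</tr></table></div></div></body></html>";

    // Evaluates an XPath 1.0 expression and returns the text of each selected node.
    // The prefix "ns" is bound to the XHTML namespace, mirroring addNamespace in the post.
    static List<String> select(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this, namespaces are not reported
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "ns" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                    ? Collections.singletonList("ns").iterator()
                    : Collections.<String>emptyList().iterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed step: selects nothing, because <a> is in the XHTML namespace.
        System.out.println(select(SAMPLE, "//a"));
        // Prefixed step: selects both links.
        System.out.println(select(SAMPLE, "//ns:table[@class='sresults']/ns:tr/ns:td/ns:a"));
        // Namespace-agnostic alternative that needs no prefix binding.
        System.out.println(select(SAMPLE, "//*[local-name()='a']"));
    }
}
```

In this sketch the unprefixed query returns an empty list while the prefixed and local-name() queries both return the two link texts. Whether the JDOM program fails for the same reason depends on what tree TagSoup actually builds, which this sketch does not test.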

I would like to achieve the following objectives if possible:

( i ) Exclude the DTD and the namespace in order to simplify the parsing process. How could this be done?
( ii ) If this is not possible, how should the namespace be included in the XPath statements (lines 6-7) so that the value of <a> is picked up correctly?
( iii ) Would changing from "org.apache.xerces.parsers.SAXParser" to "org.ccil.cowan.tagsoup.Parser" make any difference as far as using XPath is concerned?
( iv ) Failing to exclude the DTD, how can the lookup of a PUBLIC DTD be changed to a local SYSTEM one, so that a local DTD is used for reference?
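
For objectives (i) and (iv), one common approach with SAX-based parsers is an org.xml.sax.EntityResolver that intercepts the DTD lookup. The following is a minimal sketch under stated assumptions, not a tested fix for the program above: it uses the JDK's built-in DocumentBuilder instead of JDOM so that it is self-contained, and the class name LocalDtdDemo, the method parseWithoutRemoteDtd, the field lastPublicId, and the local file path in the comment are all hypothetical. JDOM 1.x's SAXBuilder is understood to accept the same kind of resolver through setEntityResolver, but that combination is not exercised here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdDemo {
    static final String XHTML_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";

    // Hypothetical, trimmed-down document carrying the same DOCTYPE as the post.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
      + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
      + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
      + "<head><title>t</title></head><body><p>ok</p></body></html>";

    // Records the public ID most recently passed to the resolver (for illustration).
    static String lastPublicId;

    static Document parseWithoutRemoteDtd(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false); // well-formedness only, no DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();

        builder.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                lastPublicId = publicId;
                if (XHTML_PUBLIC_ID.equals(publicId)) {
                    // Objective (i): substitute an empty DTD so nothing is fetched.
                    // Objective (iv): return a local copy instead, for example
                    //   new InputSource("file:///C:/Tmp/xhtml1-transitional.dtd")
                    // (hypothetical path; the DTD's .ent files would need to sit beside it).
                    return new InputSource(new StringReader(""));
                }
                return null; // fall back to the parser's default resolution
            }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWithoutRemoteDtd(SAMPLE);
        System.out.println(doc.getDocumentElement().getNamespaceURI());
        System.out.println("resolver saw public ID: " + lastPublicId);
    }
}
```

One trade-off of the empty-DTD variant: entity references that the XHTML DTD would normally define (such as &nbsp;) become undefined, so a document that uses them would need the local-copy variant instead. Note also that skipping the DTD does not remove the namespace; the elements remain in http://www.w3.org/1999/xhtml because that comes from the xmlns attribute, not from the DTD.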

I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon 6.5.5, and TagSoup 1.2 on Windows XP.

Any assistance would be appreciated.

Thanks in advance,

Jack
Locked on Dec 4 2008
Added on Nov 4 2008