Java EE (Java Enterprise Edition) General Discussion

Parsing HTML Table using Java XPath

htran_888, Jun 23 2008 (edited Jul 3 2008)
http://www.onjava.com/lpt/a/5554

Hi All,

I am trying to extract the content of the following nested HTML Patient table using the XPath API (javax.xml.xpath) in JDK 6, without success:
<table border="0" cellpadding="0" cellspacing="0" width="782" id="main-content">
<tr>
        <td valign="top" class="top">                        
                <table border="0" cellpadding="0" cellspacing="0">                        
  
                        <tr>
                                <td valign="top" class="top">    
 
<!-- un-delay results 14/10/2004  ..................................  --->

        <div class="greyBorder">
        <table border="0" cellspacing="0" cellpadding="2" width="100%">
                <tr>
                        <td class="propType">&nbsp;</td>
                        <td class="propType"><b>Patient</b></td>
                        <td class="propType"><b>Firstname</b></td>
                        <td class="propType"><b>Surname</b></td>
                        <td class="propType" align="right"><b>Date of birth</b></td>
                        <td class="propType">Sex</td>
                </tr>
                      
                        <tr class="smallnarrow">
                                
                                <td class="even" width="10" align="left"></td>
                                <td class="even" style="vertical-align: middle;">Clinic</td>        
                                <td class="even" style="vertical-align: middle;">John</td>                
                                <td class="even" style="vertical-align: middle;">Smith</td> 
                                <td class="even" align="right" style="vertical-align: middle;">10/02/1940</td>
                                <td class="even" width="10" style="vertical-align: middle;">M</td>
                        </tr>
        </table>
        </div>
    
        <div style="margin-top:10px;">        
         <br> <br>

        <br>
        </div>

        <div align="center" style="margin-bottom: 20px;">
        .........
</td></tr></table></td></tr></table>
Below is the content of XPathEvaluator.java used to extract/parse the above HTML file:
import javax.xml.xpath.*;
import java.io.*;
import org.w3c.dom.*;
import org.xml.sax.InputSource;

public class XPathEvaluator {

    public void evaluateDocument(File xmlDocument) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();

            // Select the outer table by id; this only matches if it is the root element.
            String expression = "/table[@id='main-content']";
            InputSource inputSource = new InputSource(new FileInputStream(xmlDocument));
            NodeList shows = (NodeList) xPath.evaluate(expression, inputSource, XPathConstants.NODESET);

            for (int i = 0; i < shows.getLength(); i++) {
                Element show = (Element) shows.item(i);
                System.out.println("The value of show.getTagName(): " + show.getTagName());
                System.out.println("The value of show.getTextContent(): " + show.getTextContent());
            }
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it.
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            // Also thrown when the input cannot be parsed as well-formed XML.
            e.printStackTrace();
        }
    }

    public static void main(String[] argv) {
        XPathEvaluator evaluator = new XPathEvaluator();
        File xmlDocument = new File("C:/Temp/HTMLTable.txt");
        evaluator.evaluateDocument(xmlDocument);
    }
}
This code works on catalogue.xml from the tutorial at http://www.onjava.com/lpt/a/5554, but it produces the following error when parsing the above HTML file:

[Fatal Error] :1:78: White spaces are required between publicId and systemId.
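
The error position (line 1) suggests the parser fails on the very first line of the file, before it reaches any table markup. The sketch below reproduces that kind of failure under an assumption: that the real file starts with an HTML 4 DOCTYPE carrying a public identifier but no system identifier, which XML does not allow. The excerpt above does not show the first line, so the DOCTYPE string, the class name DoctypeErrorDemo and the helper parseError are all hypothetical and for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DoctypeErrorDemo {

    // ASSUMPTION: the real file begins with a DOCTYPE like this one
    // (public id only, no system id). It is not shown in the post.
    static final String HTML_INPUT =
        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n"
        + "<html><body><table id=\"main-content\"></table></body></html>";

    // The same table without any DOCTYPE: well-formed XML.
    static final String XML_INPUT = "<table id=\"main-content\"></table>";

    /** Returns null if the text parses as XML, otherwise the parser's error message. */
    static String parseError(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            return e.getMessage();
        } catch (ParserConfigurationException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("HTML with DOCTYPE: " + parseError(HTML_INPUT));
        System.out.println("Bare table:        " + parseError(XML_INPUT));
    }
}
```

If the assumption holds, the first call reports a parse error and the second returns null, which would mean the XPath expression itself is not the problem: the input is simply not XML.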

Am I using the wrong tool? I have used the htmlparser in the past but could not achieve the same objective.
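
For comparison, here is a minimal sketch of what the XPath API can do once the markup is well-formed XML. It assumes a hand-cleaned copy of the Patient table (the entity &nbsp; replaced by &#160;, all tags closed, no DOCTYPE); the class name PatientTableXPath, the constant XHTML and the helper cellTexts are hypothetical names, not part of the original code. Real-world HTML would first need to be converted by a tolerant HTML parser (TagSoup and JTidy are two commonly mentioned options), which this sketch does not cover.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PatientTableXPath {

    // ASSUMPTION: a well-formed, trimmed-down copy of the table from the post.
    static final String XHTML =
        "<table id=\"main-content\"><tr><td><div class=\"greyBorder\"><table>"
        + "<tr><td class=\"propType\">&#160;</td><td class=\"propType\"><b>Patient</b></td>"
        + "<td class=\"propType\"><b>Firstname</b></td><td class=\"propType\"><b>Surname</b></td>"
        + "<td class=\"propType\"><b>Date of birth</b></td><td class=\"propType\">Sex</td></tr>"
        + "<tr class=\"smallnarrow\"><td class=\"even\"></td><td class=\"even\">Clinic</td>"
        + "<td class=\"even\">John</td><td class=\"even\">Smith</td>"
        + "<td class=\"even\">10/02/1940</td><td class=\"even\">M</td></tr>"
        + "</table></div></td></tr></table>";

    /** Evaluates the expression against the markup and returns the trimmed text of each matched node. */
    static List<String> cellTexts(String markup, String expression) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(
                expression, new InputSource(new StringReader(markup)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            // Covers both a bad expression and input that is not well-formed XML.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // "//" searches at any depth, so the nested data row is found without spelling out the full path.
        System.out.println(cellTexts(XHTML, "//tr[@class='smallnarrow']/td"));
    }
}
```

With this input the data row yields six cells: an empty one, then Clinic, John, Smith, 10/02/1940 and M. Note that the expression uses // rather than a path anchored at /table, because the patient row sits several levels below the outer table.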

Any suggestion would be appreciated.

Thanks,

Jack
Post Details
Locked on Jul 31 2008
Added on Jun 23 2008