Hi All,
I am trying to parse an XML file with a DOM parser. I have tried two different approaches, and one of them fails with this error:
java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence
The other one works fine.
My XML file does not contain an XML declaration (the <?xml ...?> header), and I don't know which encoding the parser assumes by default in that case.
Way 1 (gives the error):
org.apache.xerces.parsers.DOMParser parser=new org.apache.xerces.parsers.DOMParser();
org.xml.sax.InputSource isrc = new org.xml.sax.InputSource(in); // where in is the ByteArrayInputStream for XML
parser.parse(isrc);
document=parser.getDocument();
This throws java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence
Way 2 (works fine):
import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation("LS");
LSParser builder = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
builder.getDomConfig().setParameter("cdata-sections", false);
LSInput lsi = impl.createLSInput();
lsi.setByteStream(in); // where in is InputStream for XML
Document doc = builder.parse(lsi);
I have the following libraries on my classpath:
xalan.jar
xercesImpl.jar
xmlParserAPIs.jar
I want to know what is wrong with Way 1. I think the error is caused by a character encoding mismatch, but I don't understand why the same input works with Way 2.
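To test the encoding suspicion, a minimal sketch like the following could check whether the raw bytes of the file are well-formed UTF-8 at all. The class name `Utf8Check` and the method `isValidUtf8` are made up for illustration and are not part of Xerces or the DOM API. It only uses the standard `java.nio.charset` decoder, configured to report malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for diagnosis only; not part of any XML library.
class Utf8Check {

    // Returns true if the whole byte array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // At least one byte sequence is not legal UTF-8.
            return false;
        }
    }
}
```

If `Utf8Check.isValidUtf8(bytes)` returns false for the same bytes that go into the ByteArrayInputStream, the file is probably stored in some other encoding (for example ISO-8859-1), which would be consistent with the exception from Way 1.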
I don't know much about these XML APIs; I found the second approach on the net.
Can anyone please explain the basic difference between these two ways of parsing XML, and which one is preferred (especially when the XML may contain UTF-8 characters)?
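For reference, here is a hedged sketch of one thing that could be tried with the InputSource approach: stating the encoding explicitly with `InputSource.setEncoding`, so the parser does not have to guess when there is no XML declaration. This is an assumption-laden example, not a confirmed fix. The class `ExplicitEncodingParse` and its `parse` method are hypothetical names, the real encoding of my file is unknown (ISO-8859-1 is only used as an example), and the sketch uses the JDK's built-in `DocumentBuilder` instead of `org.apache.xerces.parsers.DOMParser` so that it runs without extra jars. The same `org.xml.sax.InputSource` object can also be passed to `DOMParser.parse`.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; the names are for illustration only.
class ExplicitEncodingParse {

    // Parses the bytes as XML. If encoding is non-null it is set on the
    // InputSource; otherwise the parser falls back to its own detection.
    static Document parse(byte[] xml, String encoding) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource source = new InputSource(new ByteArrayInputStream(xml));
        if (encoding != null) {
            source.setEncoding(encoding); // e.g. "ISO-8859-1" (assumed, not known)
        }
        return builder.parse(source);
    }
}
```

A call such as `ExplicitEncodingParse.parse(bytes, "ISO-8859-1")` would then read the byte stream as ISO-8859-1, while passing `null` leaves the choice of encoding to the parser.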