Unicode character entities not resolved by XML parser
843834, May 7 2009 (edited May 8 2009)
Hi,
I am parsing XML documents in which Unicode characters such as umlauts appear encoded as entity references (e.g. ü). As a result, elements that should carry only a single text node end up littered with several child nodes: text nodes for the plain parts and entity reference nodes for the encoded characters. I was expecting the XML parser to be smart enough to resolve those characters itself (which I guess it does, so I must be doing something wrong).
Here is how I parse those documents:
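import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setCoalescing(true);               // merge CDATA sections into adjacent text nodes
dbf.setExpandEntityReferences(true);   // ask the parser to expand entity references
DocumentBuilder builder = dbf.newDocumentBuilder();
Document document = builder.parse(xml);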
xml is an InputStream.
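To make the symptom easier to inspect, here is a minimal, self-contained sketch that parses with the same factory settings and lists the direct children of the root element. The class name ChildNodeDump, the helper describeChildren and the sample document are made up for illustration; they are not from my real code. It only shows what the parser produced and does not fix anything.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

public class ChildNodeDump {

    // Parses with the same factory settings as in the post.
    static Document parse(InputStream xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setCoalescing(true);
            dbf.setExpandEntityReferences(true);
            DocumentBuilder builder = dbf.newDocumentBuilder();
            return builder.parse(xml);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    // Lists the direct children of a node, one descriptive line per child.
    static List<String> describeChildren(Node parent) {
        List<String> out = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            switch (c.getNodeType()) {
                case Node.TEXT_NODE:
                    out.add("text: " + c.getNodeValue());
                    break;
                case Node.ENTITY_REFERENCE_NODE:
                    out.add("entity-ref: " + c.getNodeName());
                    break;
                default:
                    out.add("other(" + c.getNodeType() + "): " + c.getNodeName());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample; "&#252;" is the numeric character reference for u-umlaut.
        String sample = "<name>M&#252;ller</name>";
        InputStream xml = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        Document document = parse(xml);
        for (String line : describeChildren(document.getDocumentElement())) {
            System.out.println(line);
        }
    }
}
```

If more than one line is printed for an element that should hold a single string, that element has the fragmented children described above.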
I peeked at the builder in the debugger, and it's a Xerces implementation.
What do I have to do to be able to just call node.getNodeValue() and receive the whole string when that node contains escaped characters?
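One thing that might work, as an assumption I have not verified against this exact Xerces build, is to read the text from the element with DOM Level 3's getTextContent() instead of calling getNodeValue() on a single text child. According to the DOM specification it concatenates the text of all descendants, including the children of entity reference nodes. The sketch below uses a made-up class WholeTextDemo, a hypothetical helper textOf, and a sample document that declares its own uuml entity in an internal DTD.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

public class WholeTextDemo {

    // Parses the given XML source and returns the full text of the root element.
    static String textOf(String xmlSource) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setCoalescing(true);
            dbf.setExpandEntityReferences(true);
            Document document = dbf.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xmlSource.getBytes(StandardCharsets.UTF_8)));
            Element root = document.getDocumentElement();
            // getTextContent() is called on the element, not on one of its text children,
            // so it does not matter how many child nodes the parser created.
            return root.getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: one named entity (declared in the internal DTD)
        // and one numeric character reference.
        String sample = "<!DOCTYPE name [<!ENTITY uuml \"&#252;\">]>"
                + "<name>M&uuml;ller &#228;</name>";
        System.out.println(textOf(sample));
    }
}
```

This sidesteps the fragmentation rather than explaining it, so it does not tell me why the child nodes are split in the first place.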
Thanks in advance,
Matthias
PS: I tried posting examples of an actual XML snippet with entity refs for unicode chars, but the forum keeps resolving them into actual characters. Oh if only that cursed XML parser would do the same thing! :)