Java EE (Java Enterprise Edition) General Discussion

Unicode character entities not resolved by XML parser

843834May 7 2009 — edited May 8 2009

Hi,

I am parsing XML documents in which Unicode characters such as umlauts appear encoded as character entities (e.g. ü). As a result, elements that should carry only a single text node end up littered with text nodes for the plain parts and entity reference nodes for the encoded characters. I was expecting the XML parser to be smart enough to resolve those characters itself (which I guess it does, so I must be doing something wrong).

Here is how I parse those documents:

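import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setCoalescing(true);              // convert CDATA sections to text and merge them with adjacent text nodes
dbf.setExpandEntityReferences(true);  // ask the parser to expand entity reference nodes

DocumentBuilder builder = dbf.newDocumentBuilder();
Document document = builder.parse(xml);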

xml is an InputStream.

I peeked at the builder in the debugger, and it's a Xerces implementation.

What do I have to do to be able to just call node.getNodeValue() and receive the whole string when that node contains escaped characters?
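
To make the setup concrete, here is a minimal, self-contained sketch of the same parser configuration. Several things in it are assumptions for illustration only: the class name EntityTextDemo, the helper elementText, and the sample document, which uses a numeric character reference (&#252;) plus a named entity declared in an internal DTD subset. It reads the element's text through getTextContent() (DOM Level 3), which concatenates the text of all descendants, so it should not matter whether the element ends up holding one text node or several. This is one possible way to get at the whole string, not necessarily the right fix for the documents described above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; the name and the sample XML are illustrative only.
class EntityTextDemo {

    // Parses the given XML text with the same factory settings as in the post
    // and returns the full text content of the document element.
    static String elementText(String xmlText) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setCoalescing(true);
        dbf.setExpandEntityReferences(true);

        DocumentBuilder builder = dbf.newDocumentBuilder();
        InputStream xml = new ByteArrayInputStream(xmlText.getBytes(StandardCharsets.UTF_8));
        Document document = builder.parse(xml);

        // getTextContent() joins the text of all child nodes, including the
        // expanded content of any entity reference nodes left in the tree.
        return document.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample: one named entity declared in the DTD, one numeric reference.
        String sample = "<!DOCTYPE name [<!ENTITY uuml \"&#252;\">]>"
                + "<name>M&uuml;ller &#252;ber</name>";
        System.out.println(elementText(sample));
    }
}
```

Calling element.normalize() before walking the children is another option for merging adjacent text nodes, although it does not remove entity reference nodes if the parser left any in the tree.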

Thanks in advance,
Matthias

PS: I tried posting examples of an actual XML snippet with entity refs for Unicode chars, but the forum keeps resolving them into actual characters. Oh, if only that cursed XML parser would do the same thing! :)

Locked on Jun 5 2009

#java-technology-xml
