Java EE (Java Enterprise Edition) General Discussion

URL encoding - UTF-8 Combining Diacritical Marks

843840, Aug 20 2008 (edited Aug 21 2008)
Hello, folks.

I work on a webapp where users save records from other sites, like this:
the user searches another site for a record, copies and pastes the record (text) into a TextArea, and
saves it.
So, here's the problem I'm facing: some records come with Unicode Combining Diacritical Marks.
For instance, if a record contains the word "Ciência" (Science), instead of using the single precomposed char
'ê' (code point 234), the record uses 2 chars: 'e' + a combining circumflex (101 + 770). When the form is posted, the
browser encodes the word like this: Cie%26%23770%3Bncia (Cie& #770;ncia - without the space).
In other words, it's encoding the char 770 as an HTML entity (& #770;),
and that entity is what gets saved to our database.
Does anyone know a way, perhaps using a Filter, to correctly decode this entity in the Servlet,
without going through the hassle of hardcoding all the possible entities?

Thanks for your time,

Danniel Nascimento.

Edit: Added spaces to the HTML entities so they are not rendered by your browsers.

Edited by: Danniel_Willian on Aug 20, 2008 12:02 PM
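
One way to approach the decoding asked about above, without a hardcoded entity table, is to match numeric character references generically and then recompose the result. The following is only a minimal sketch: the class name EntityDecoder and the method decode are hypothetical (not part of any servlet API), and it assumes the entities arrive in numeric form (decimal or hexadecimal), as in the posted example.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class EntityDecoder {
    // Matches decimal (&#770;) and hexadecimal (&#x302;) numeric character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    // Replaces numeric character references with the characters they denote,
    // then normalizes to NFC so that 'e' + U+0302 becomes the single char U+00EA.
    static String decode(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int cp = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Leave the reference untouched if it is not a valid code point.
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return Normalizer.normalize(sb, Normalizer.Form.NFC);
    }
}
```

In a Filter, this could presumably be applied by wrapping the request in an HttpServletRequestWrapper whose getParameter passes each value through decode. Note that this also rewrites references the user typed literally, so it is a trade-off. Browsers typically fall back to numeric references when the form's encoding cannot represent a character, so serving and submitting the page as UTF-8 may avoid the entities in the first place; the NFC step would still be needed to turn 'e' + 770 into the single char 234.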