URL encoding - UTF-8 Combining Diacritical Marks
843840, Aug 20 2008 (edited Aug 21 2008)
Hello, folks.
I work with a webapp where the user saves records from other sites, like this:
User searches other sites for the record, copy/pastes the record (text) into a TextArea and
saves the record.
So, here's the problem I'm facing: some records come with Unicode Combining Diacritical Marks.
For instance, if a record contains the word "Ciência" (Science), instead of using the single
precomposed char 'ê' (code point 234, U+00EA), the record uses 2 chars: 'e' + the combining
circumflex (101 + 770, i.e. U+0065 + U+0302). When the form is posted, the
browser encodes the word like this: Cie%26%23770%3Bncia (Cie& #770;ncia - without the space).
In other words, it's encoding char 770 as an HTML numeric character reference (& #770;),
which is what gets saved to our database.
Does anyone know a way, perhaps using a Filter, to correctly decode this entity in the Servlet,
without going through the hassle of hardcoding all the possible entities?
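
To make the question concrete, here is a minimal sketch of the kind of decoding being asked about. It assumes only numeric references (decimal or hexadecimal) need handling, so no entity table is hardcoded. The class name NumericEntityDecoder and its methods are hypothetical, not part of the Servlet API. It replaces each reference with the character it denotes and then applies NFC normalization via java.text.Normalizer (Java 6+), so that 'e' + U+0302 becomes the single char 'ê'.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the name and structure are illustrative assumptions.
class NumericEntityDecoder {

    // Matches hexadecimal (&#x302;) and decimal (&#770;) numeric character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    // Replaces every numeric character reference with the character it denotes.
    static String decodeNumericRefs(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Leave the reference untouched if it is outside the Unicode range.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Decodes the references, then composes base letter + combining mark (NFC).
    static String clean(String input) {
        return Normalizer.normalize(decodeNumericRefs(input), Normalizer.Form.NFC);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The value from the post, as the browser sent it.
        String posted = URLDecoder.decode("Cie%26%23770%3Bncia", "UTF-8");
        String cleaned = clean(posted);
        System.out.println(posted.length() + " chars before, "
                + cleaned.length() + " chars after");
    }
}
```

In a Filter, a method like clean could be applied to parameter values through an HttpServletRequestWrapper. Two caveats apply: text the user literally typed as a numeric reference would be decoded too, and decoded values may contain characters such as '<', so they still need escaping when rendered. Browsers typically substitute such references when the form's submission charset cannot represent a character, so serving and accepting the form as UTF-8 may avoid the entity in the first place.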
Thanks for your time,
Danniel Nascimento.
Edit: Added spaces to the HTML entities so they are not rendered by your browsers.
Edited by: Danniel_Willian on Aug 20, 2008 12:02 PM