I have a webapp that needs to be retrofitted to support multiple character sets including Chinese (both simplified and traditional), and Japanese (SJIS).
I have it mostly working (all DB access and display use UTF-8) except for the retrieval of form data from a POST request. (A GET request is turned into a URI, and the W3C recommends that non-ASCII characters in a URI be encoded as UTF-8. See http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars)
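For reference, this is a minimal sketch of what UTF-8 percent-encoding of a parameter value looks like on the Java side, using only java.net.URLEncoder and URLDecoder. The class and method names (Utf8UriExample, encode, decode) are made up for illustration and are not part of my webapp.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
public class Utf8UriExample {

    // Percent-encodes a value using its UTF-8 byte representation.
    public static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reverses encode(): interprets %XX sequences as UTF-8 bytes.
    public static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+5988 repeated twice, written with escapes so the source file encoding does not matter.
        String original = "\u5988\u5988";
        String encoded = encode(original);
        System.out.println(encoded);                           // %E5%A6%88%E5%A6%88
        System.out.println(decode(encoded).equals(original));  // true
    }
}
```

Each character above U+007F becomes several %XX bytes, which is the form I would like the browser to send for POST data as well.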
I have tried forcing the form to send UTF-8 characters by:
<form accept-charset="UTF-8" ...>
This didn't send the data in UTF-8 format in Internet Explorer.
I then tried to use the JavaScript escape() function to encode the characters before sending them to my servlet.
This caused an exception:
java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u5"
The ECMA-262 spec describes the algorithm that is to be used for escape() (see http://www.ecma-international.org/publications/files/ecma-st/Ecma-262.pdf, Annex B.2.1).
If the ordinal value of the character (represented as a 16-bit unsigned int) is above 255, then escape() is supposed to represent it as %uwxyz, where wxyz are four hexadecimal digits. (For example: %u5988%u5988%u8428%u9A6C)
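One possible workaround on the server side would be to decode the escape() output by hand, since java.net.URLDecoder only understands %XX sequences and rejects %uXXXX. The following is a rough sketch, assuming the raw escaped string reaches the servlet unchanged; the class name JsUnescape and its methods are hypothetical, not an existing library API.

```java
// Hypothetical helper: reverses JavaScript escape() output (%uXXXX and %XX sequences).
public class JsUnescape {

    public static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%') {
                // %uXXXX: one 16-bit code unit written as four hex digits.
                if (i + 6 <= s.length() && s.charAt(i + 1) == 'u' && isHex(s, i + 2, 4)) {
                    out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                    i += 6;
                    continue;
                }
                // %XX: a code unit of 255 or below written as two hex digits.
                if (i + 3 <= s.length() && isHex(s, i + 1, 2)) {
                    out.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                    i += 3;
                    continue;
                }
            }
            // Anything else, including a malformed '%', is copied through unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // True if the len characters starting at start are all hex digits.
    // Callers guarantee that start + len does not exceed s.length().
    private static boolean isHex(String s, int start, int len) {
        for (int k = start; k < start + len; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String decoded = unescape("%u5988%u5988%u8428%u9A6C");
        System.out.println(decoded.equals("\u5988\u5988\u8428\u9A6C")); // true
    }
}
```

This works on UTF-16 code units, the same way escape() does, so it does not depend on the page's character set. It still means adding JavaScript to every form, though, which is what I am hoping to avoid.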
I would rather not have to worry about the character set, and would like to force the browser to do the conversion to UTF-8.
Any ideas?
Thanks.