Hello,
I have a problem with OutputStreamWriter's encoding of Japanese characters into UTF-8. If you have any ideas, please let me know! This is what is going on:
    public static String convert2UTF8(String iso2022Str) {
        String utf8Str = "";
        try {
            // convert string to byte array stream
            ByteArrayInputStream is = new ByteArrayInputStream(iso2022Str.getBytes());
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            // decode iso2022Str byte stream with iso-2022-jp
            InputStreamReader in = new InputStreamReader(is, "ISO2022JP");
            // reencode to utf-8
            OutputStreamWriter out = new OutputStreamWriter(os, "UTF-8");
            // get each character c from the input stream (will be in unicode) and write to output stream
            int c;
            while ((c = in.read()) != -1) out.write(c);
            out.flush();
            // get the utf-8 encoded output byte stream as string
            utf8Str = os.toString();
            is.close();
            os.close();
            in.close();
            out.close();
        } catch (UnsupportedEncodingException e1) {
            return e1.toString();
        } catch (IOException e2) {
            return e2.toString();
        }
        return utf8Str;
    }
I am passing a string received from a database query to this function, and the string it returns is saved in an XML file. When I open the XML file in my browser, some Japanese characters are converted correctly, but others, particularly hiragana, come up as ???. For example:
屋台骨田家は時間目離れ拠り所那覇市矢田亜希子ナタハアサカラマ楢葉さマヤア
shows up as this:
屋�?�骨田家�?�時間目離れ拠り所那覇市矢田亜希�?ナタ�?アサカラマ楢葉�?�マヤア
(sorry that's absolute nonsense in Japanese but it was just an example)
To note:
- I am specifying the UTF-8 encoding in my XML header
- my OS, browser, etc. are all set to support Japanese characters (to the best of my knowledge)
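For reference, a minimal sketch of writing an XML file so that the bytes match the encoding declared in the header. The post does not show how the file is actually written, so the class name XmlWriteSketch, the method writeXml, and the <value> element are all made up for illustration, and the text is assumed to need no XML escaping.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class XmlWriteSketch {
    // Writes a one-element XML document whose bytes match the declared encoding.
    // Assumption: text contains no characters that need XML escaping (<, >, &).
    static void writeXml(OutputStream target, String text) {
        try (Writer w = new OutputStreamWriter(target, StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write("<value>" + text + "</value>\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // An in-memory stream stands in for a FileOutputStream here.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeXml(buf, "\u3042"); // hiragana A
        System.out.println(buf.size() + " bytes written");
    }
}
```

The point of the sketch is only that the header says UTF-8 and the writer is explicitly told UTF-8 too, instead of relying on the platform default.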
Also, I ran a test with a string, looking at its characters' hex values at several points and comparing them with ISO-2022-JP, Unicode, and UTF-8 mapping tables. Basically:
- if I don't use this function at all and write the original ISO-2022-JP string to an XML file, it IS ISO-2022-JP
- I also looked at the hex values of "c" being read from the InputStreamReader here:
while((c=in.read())!=-1) out.write(c);
and have verified (using a character value mapping table) that in a problem string, all characters are still being properly converted from ISO-2022-JP to Unicode
- I checked another table (http://www.utf8-chartable.de/) for the Unicode values received, and all of them have valid mappings to a UTF-8 value
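The hex checks above can be done with something like the following. This is only a sketch: the class name HexDump and the two helper methods are made up for illustration, and the sample string is not one of the problem strings.

```java
import java.nio.charset.StandardCharsets;

class HexDump {
    // Returns the UTF-16 code units of s as space-separated hex, e.g. "3042 30A2".
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Returns the given bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "\u3042\u30A2"; // hiragana A, katakana A
        System.out.println(codeUnits(sample));                            // 3042 30A2
        System.out.println(hex(sample.getBytes(StandardCharsets.UTF_8))); // E3 81 82 E3 82 A2
    }
}
```

The first line can be compared against a Unicode table and the second against a UTF-8 table such as the one linked above.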
So it appears that when characters are written to the OutputStreamWriter, not all of them get mapped from Unicode to UTF-8, even though their Unicode values are correct and UTF-8 equivalents exist. Instead they come out as (hex) EF BF BD 3F EF BF BD. From my understanding, EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character, and 3F is a plain "?", so this is basically "I don't know what to do with this one".
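To double-check that reading of the bytes, a small sketch like this may help. It decodes the observed sequence as UTF-8, and separately encodes one hiragana character through an OutputStreamWriter and looks at the raw bytes via toByteArray(), without turning them back into a String on the way. The class name ReplacementCheck, the method names, and the sample character are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ReplacementCheck {
    // Decodes the byte sequence seen in the broken output as UTF-8.
    static String decodeObserved() {
        byte[] observed = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD, 0x3F,
                           (byte) 0xEF, (byte) 0xBF, (byte) 0xBD};
        return new String(observed, StandardCharsets.UTF_8);
    }

    // Encodes text through an OutputStreamWriter and returns the raw bytes,
    // without converting them back to a String.
    static byte[] encodeUtf8(String text) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(os, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return os.toByteArray();
    }

    public static void main(String[] args) {
        // Print each decoded character as a code point.
        String decoded = decodeObserved();
        for (int i = 0; i < decoded.length(); i++) {
            System.out.printf("U+%04X%n", (int) decoded.charAt(i));
        }
        // Print the raw UTF-8 bytes produced for hiragana A (U+3042).
        for (byte b : encodeUtf8("\u3042")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

If the raw bytes for a problem character come out right here, the writer itself is probably not where the characters get lost. In that case the conversions to and from String that take no charset argument (getBytes() and os.toString() in the function above) might be worth a look, since both use the platform default encoding. That is only a guess, as the default encoding on the machine in question is not known.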
The characters that are not working are most hiragana (though not all) and a few kanji. I have yet to find a pattern or relationship between the characters that cannot be converted.
If I am missing something, or someone has a clue, please let me know. Also, I am developing in Eclipse but really don't know much about it beyond setting up a project, editing it and hitting build/run, so it is possible that I have missed some needed configuration.
Thank you!!