My Oracle instance's database character set is WE8ISO8859P1.
However, some legacy tables in my system consistently hold character data that is actually Windows-1252 (CP1252).
Due to infrastructural dependencies we cannot change the character set of the instance.
We want to copy the data out of such a column into an NVARCHAR2 column with the correct Unicode code points for the CP1252 characters, and are seeking a SQL-based conversion algorithm to achieve this.
My initial attempt was to use the 'from character set' argument to the CONVERT function - but this function does not appear to behave as I would expect.
Example:
- chr(128) is the code point for the Euro sign (€) in CP1252.
- I call the CONVERT function, overriding the default source character set with CP1252 (WE8MSWIN1252).
- I specify AL16UTF16 as the target character set, since that is the character set of my NVARCHAR2 columns.
select dump(convert(chr(128),'AL16UTF16','WE8MSWIN1252'), 1016) from dual;
I get this in response:
>> Typ=1 Len=2 CharacterSet=WE8ISO8859P1: 20,ac
We see here that Oracle has correctly converted character 128 (decimal), which is the Euro, to the correct UTF-16 bytes, i.e. 20,ac (hex).
However, note that the system still thinks the character set of the string is WE8ISO8859P1; moreover, it counts the value as two characters (Len=2) instead of a single 16-bit character occupying two bytes.
When I now insert this into the NVARCHAR2 column it gets inserted as two NCHARs, 0x0020 and 0x00AC - which is completely wrong.
It seems that CONVERT converts correctly at the byte level, but because the resulting bytes are still interpreted as the database character set, I can do nothing useful with the string.
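To make the failure concrete, here is a minimal reproduction (the scratch table is my own; the final dump shows the mis-converted result of the insertion behaviour described above):

create table t_demo (nval nvarchar2(10));

insert into t_demo
select convert(chr(128), 'AL16UTF16', 'WE8MSWIN1252') from dual;

select dump(nval, 1016) from t_demo;

>> Typ=1 Len=4 CharacterSet=AL16UTF16: 0,20,0,ac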
So, to summarise:
- I have an 8859-1 database
- with a column containing CP1252 data
- I want to copy this data into an NVARCHAR2 column with the correct Unicode code points for the CP1252 characters.
Does Oracle provide functionality that allows me to achieve this?
By the way, I can correct the encoding of the data in the VARCHAR2 column using a trick in Java when I read the data out of the database (see below). However, this doesn't help me inside the database, where I want to put a trigger in place that copies the data out of the VARCHAR2 (CP1252) column into the NVARCHAR2 column.
public static String fromMiscoded1252toUnicode(String cp1252)
{
    try {
        // ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto code points,
        // so this recovers the original bytes of the string losslessly...
        byte[] b = cp1252.getBytes("ISO8859-1");
        // ...and they can then be decoded as what they really are: CP1252.
        return new String(b, "windows-1252");
    } catch (java.io.UnsupportedEncodingException e) {
        System.err.println(e);
        return null;
    }
}
Possibly I could call this Java function from SQL, but I would prefer not to have to invoke Java, and am hoping that Oracle can handle character set conversions properly.
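If I did go that route, I believe it would mean loading the class into the database (e.g. with loadjava) and publishing it through a call spec along these lines (the class name here is a placeholder):

create or replace function from_miscoded_1252 (p_val in varchar2)
return nvarchar2
as language java
name 'CharsetFixer.fromMiscoded1252toUnicode(java.lang.String) return java.lang.String';
/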
One concern I have about invoking Java is that the data in question may get mangled by Oracle as it passes into or out of that function, for similar reasons to why the SQL approaches above and below fail.
If this is the case then I will have no recourse but to write an external feed job using JDBC, where I would, for many reasons, have preferred to use a trigger.
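For reference, the trigger I have in mind is nothing more elaborate than this sketch, where the table and column names are placeholders and fix_1252 stands in for whatever conversion function answers this question:

create or replace trigger trg_fix_1252
before insert or update on legacy_tab
for each row
begin
  -- fix_1252 is the hypothetical conversion this question is asking for
  :new.unicode_col := fix_1252(:new.legacy_col);
end;
/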
Other attempts ...
Using to_nchar fails because it doesn't allow me to specify the 'from' character set.
As a result it interprets chr(128) as 8859-1, and since there is no mapping from chr(128) in 8859-1 to Unicode, we get the 'replacement character' 0xfffd instead. (The outer CONVERT below then re-expands those two bytes, ff and fd, as if each were a CP1252 character, which is why the dump shows four bytes.)
select dump(convert(to_nchar(chr(128)),'AL16UTF16','WE8MSWIN1252'), 1016) from dual;
Typ=1 Len=4 CharacterSet=AL16UTF16: 0,ff,0,fd
This one fails for similar reasons to the above ...
select dump( translate(chr(128) using NCHAR_CS), 1016) from dual;
Typ=1 Len=2 CharacterSet=AL16UTF16: ff,fd
Is there an equivalent of CONVERT (or any function that allows me to specify the 'from' character set) that works on NCHAR data?
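One avenue I have been wondering about, but haven't yet verified on my instance, is to go through RAW so that no character set tag is attached to the intermediate bytes; if utl_i18n.raw_to_nchar honours its source character set argument, something like the following might be the shape of an answer:

select dump(utl_i18n.raw_to_nchar(utl_raw.cast_to_raw(chr(128)), 'WE8MSWIN1252'), 1016) from dual;

-- hoped-for result: Typ=1 Len=2 CharacterSet=AL16UTF16: 20,ac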