Thread: UTF-8 vs. UTF-16 vs. CESU-8


Replies: 4 - Pages: 1 - Last Post: Dec 12, 2005 5:53 AM - Last Post By: S. Wolicki, Ora...
user468962

Posts: 1
Registered: 11/25/05
UTF-8 vs. UTF-16 vs. CESU-8
Posted: Nov 25, 2005 1:47 PM
According to Oracle documentation, the Oracle character set UTF8 follows the CESU-8 encoding scheme rather than the UTF-8 standard.

According to Unicode.org, the CESU-8 encoding scheme for Unicode is identical to UTF-8 except for its representation of supplementary characters. As a consequence, a binary collation of data encoded in CESU-8 is identical to the binary collation of the same data encoded in UTF-16, so for all practical purposes UTF-8 and UTF-16 yield comparable results. Yet, despite this assurance, Unicode.org states, and I quote: "This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU) that is intended for internal use within systems processing Unicode in order to provide an ASCII-compatible 8-bit encoding that is similar to UTF-8 but preserves UTF-16 binary collation. It is not intended nor recommended as an encoding used for open information exchange."
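To make the collation point concrete, here is a minimal Python sketch. Python has no built-in CESU-8 codec, so `cesu8_encode` is a hypothetical helper written for this illustration: it encodes each UTF-16 code unit separately, which is how CESU-8 is described above. The two sample characters are arbitrary choices, one from the Basic Multilingual Plane and one supplementary.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8 by encoding each
    UTF-16 code unit on its own (surrogates become 3 bytes each)."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" allows the 3-byte form for surrogate code units
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp = "\uFFFD"        # a BMP character near the top of the plane
supp = "\U00010400"   # a supplementary character


def binary_order(encode):
    """Sort the two sample characters by their encoded bytes."""
    return sorted([bmp, supp], key=encode)


utf8_order = binary_order(lambda s: s.encode("utf-8"))
utf16_order = binary_order(lambda s: s.encode("utf-16-be"))
cesu8_order = binary_order(cesu8_encode)

print(cesu8_encode(supp).hex(" "))   # six bytes: two encoded surrogates
print(utf8_order == utf16_order)     # standard UTF-8 sorts differently
print(cesu8_order == utf16_order)    # CESU-8 matches UTF-16
```

In this sketch the supplementary character sorts after the BMP character in standard UTF-8, but before it in both UTF-16 and CESU-8, because the surrogate code units (D800 to DFFF) are numerically lower than U+FFFD.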

Given the Unicode.org position and Oracle's statement that "Starting with the next major functional release after Oracle Database 10g Release 2, the choice for the database character set will be limited to this list of recommended character sets for new system deployment", the only "universal" character set on the list is AL32UTF8 (Unicode 4.0 UTF-8 Universal character set).

We have three questions:
1-Do you expect Oracle's UTF-8 to remain as CESU-8?
2-Since we must support some 12 different languages and want to do so in a single database, UTF-8 is our only option. However, we must disseminate our content to various exchanges, so must we label our data as CESU-8, or can we allow it to be auto-detected?
3-We assume that Oracle uses UTF-8 as its database character set within its own internal databases as well as within the Oracle application suite. In those cases, when content is disseminated, is that content labeled CESU-8? What is Oracle's position?

While this may seem a trivial issue, we believe that AL32UTF8 is the database character set we must use to meet our needs. We are concerned about possible long-term implications, however, and hence are asking for your opinion of the long-term viability of AL32UTF8, given the unicode.org statement that Oracle UTF-8 is not really "unicode".

Thanks.
barry.trute@ora...

Posts: 241
Registered: 01/10/01
Re: UTF-8 vs. UTF-16 vs. CESU-8
Posted: Nov 28, 2005 4:36 PM   in response to: user468962
You can find a white paper on Oracle Unicode support here: http://www.oracle.com/technology/tech/globalization/pdf/TWP_AppDev_Unicode_10gR2.pdf
Oracle's recommendation for Unicode support, especially when dealing with supplementary characters, is AL32UTF8. Note that AL32UTF8 does not use the CESU-8 encoding scheme for supplementary characters and is the UTF-8 character set that Oracle updates to comply with new versions of the UTF-8 standard. If you currently have the UTF8 character set for your database, I would recommend migrating to AL32UTF8 to ensure the best compatibility.
henryho_hk

Posts: 43
Registered: 07/29/00
Re: UTF-8 vs. UTF-16 vs. CESU-8
Posted: Nov 30, 2005 6:46 PM   in response to: barry.trute@ora...
Just curious. What is the meaning of "AL32" and "AL16"?
S. Wolicki, Ora...

Posts: 855
Registered: 01/10/01
Re: UTF-8 vs. UTF-16 vs. CESU-8
Posted: Dec 12, 2005 5:35 AM   in response to: henryho_hk
AL32 = All Languages, 32 bits maximum character width
AL16 = All Languages, 16 bits maximum character width

Best regards,
Sergiusz

S. Wolicki, Ora...

Posts: 855
Registered: 01/10/01
Re: UTF-8 vs. UTF-16 vs. CESU-8
Posted: Dec 12, 2005 5:53 AM   in response to: user468962
## 1-Do you expect Oracle's UTF-8 to remain as CESU-8?

Do not confuse UTF-8 with UTF8. UTF-8 is a term defined by Unicode. UTF8 is the name of a character set in Oracle.

Oracle's UTF8 will remain Unicode's CESU-8 with the Unicode 3.0 repertoire of characters. There are no plans to change this.

Oracle's AL32UTF8 is Unicode's UTF-8 and will be enhanced if the character repertoire of Unicode is enhanced (Oracle10gR2 uses the Unicode 4.0 repertoire).
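The practical difference between the two encodings shows up when a strict UTF-8 decoder meets a CESU-8 encoded supplementary character. The Python sketch below is only an illustration: the byte strings are the two encodings of U+10400, and `is_strict_utf8` and `cesu8_decode` are hypothetical helper names, not part of any Oracle or Python API. The decoder assumes well-formed CESU-8 input.

```python
# U+10400 encoded both ways
cesu8_bytes = bytes.fromhex("eda081edb080")   # surrogate pair, 3 bytes each
utf8_bytes = "\U00010400".encode("utf-8")     # single 4-byte sequence


def is_strict_utf8(data: bytes) -> bool:
    """Return True if a standard UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def cesu8_decode(data: bytes) -> str:
    """Hypothetical helper: decode CESU-8, assuming well-formed input.
    Surrogates are first read as individual code units, then paired
    up by a round trip through UTF-16."""
    units = data.decode("utf-8", "surrogatepass")
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


print(utf8_bytes.hex(" "))            # four bytes
print(is_strict_utf8(utf8_bytes))     # accepted
print(is_strict_utf8(cesu8_bytes))    # rejected: encoded surrogates
print(cesu8_decode(cesu8_bytes) == "\U00010400")
```

For text that contains only BMP characters the two encodings produce identical bytes, so the difference only matters once supplementary characters are present.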

## 2-Since we must support some 12 different languages and we want to do so
## in a single database UTF-8 is our only option, however, we must disseminate
## our content to various exchanges and so must we label our data as CESU-8
## or can we allow it to be auto-detected?

If you use AL32UTF8 as the database character set (recommended for all environments that use Oracle9i or newer software only), then you should mark the data as 'utf-8' (if we talk about MIME tags).

If you use UTF8 as the database character set (recommended only if 8.0 or 8i clients or databases exist in the environment), you should use either 'utf-8' or 'cesu-8'. If your database contains no surrogate pairs, which is usually the case, use 'utf-8'. If you have surrogates, then theoretically you should use 'cesu-8'. But, as your receiving applications may not recognize this MIME tag (it is not widely known), you may have to use 'utf-8' instead.
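As a rough sketch of the labeling rule just described, the Python below scans raw bytes taken from a UTF8 (CESU-8) database for encoded surrogates and picks a MIME charset label. The function names and the `receiver_knows_cesu8` flag are assumptions made for this illustration. The detection relies on the fact that in well-formed UTF-8 the lead byte 0xED is followed by a byte in the range 0x80 to 0x9F, while an encoded surrogate has 0xED followed by 0xA0 to 0xBF.

```python
import re

# 0xED followed by 0xA0..0xBF marks an encoded surrogate (U+D800..U+DFFF)
_ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf]")


def has_encoded_surrogates(data: bytes) -> bool:
    """Return True if the byte string contains CESU-8 style surrogates."""
    return _ENCODED_SURROGATE.search(data) is not None


def suggest_label(data: bytes, receiver_knows_cesu8: bool = False) -> str:
    """Pick a MIME charset label for bytes from a UTF8 database.
    receiver_knows_cesu8 is a hypothetical flag for whether the
    receiving application recognizes the 'cesu-8' tag."""
    if has_encoded_surrogates(data) and receiver_knows_cesu8:
        return "cesu-8"
    # No surrogates, or a receiver that would not understand 'cesu-8'
    return "utf-8"


bmp_only = "héllo €".encode("utf-8")
with_surrogates = bytes.fromhex("eda081edb080")   # U+10400 as CESU-8

print(suggest_label(bmp_only))
print(suggest_label(with_surrogates, receiver_knows_cesu8=True))
print(suggest_label(with_surrogates))
```

The last call falls back to 'utf-8' even though surrogates are present, which mirrors the compromise described above and means a strict receiver may still reject that data.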

## 3-We assume that Oracle uses UTF-8 as its database character set
## within its own internal databases as well as within the Oracle application
## suite. In those cases when content is disseminated is that content labeled
## CESU-8? What is Oracle's position?

As far as I know, we usually assume that there are no surrogates in the database and we use 'utf-8'. But strictly speaking, if the database is UTF8 and not AL32UTF8, 'cesu-8' would be the correct tag. Unfortunately, many applications may be unable to recognize it.

Best regards,

Sergiusz
