Skip to Main Content

SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Removing diacritics (accent marks) from latin-based text

Scott SwankMay 17 2018 — edited May 18 2018

I need to remove diacritics from text. There are two basic ways I'm aware of:

with data(text) as (

    select 'Mañana' from dual union all

    select 'aperçu' from dual union all

    select 'TRÈS' from dual union all

    select 'marché' from dual union all

    select 'Hôtel' from dual

)

select text,

    nlssort(text, 'nls_sort=binary_ai') as sort_bytes,

    rtrim(utl_raw.cast_to_varchar2(nlssort(text, 'nls_sort=binary_ai')), chr(0)) AS sort_chars,

    decompose(text) as decomp,

    regexp_replace(decompose(text), unistr('[\0300-\036F]'), NULL) AS decomp_chars

from data;

The first is to get the "accent insensitive" binary sort bytes for the text, use utl_raw to convert those back to text, and then strip the trailing ascii nul that's appended by nslsort().

The second is to decompose the characters with an a diacritic into base characters followed by standalone diacritics. Then, because standalone diacritics fall within the unicode range 0300-036F they can easily be removed.

The second option takes 1/2 the cpu (1M executions in an average of 11.7 seconds vs. 26.6 seconds on our dev instance) and it preserves case, where the sort-byte based option yields on lowercase text. So option 2 seems a clear favorite.

So... am I missing anything of note? Are there other, better options?

Here's my ad hoc performance script.

select sn.name, st.value as cpu_before

from v$statname  sn

    inner join v$mystat st on (st.statistic# = sn.statistic#)

where sn.name = 'CPU used by this session';

begin

    for i in 1..200000

    loop

        for x in (  with data(text) as (

                        select 'Mañana' from dual union all

                        select 'aperçu' from dual union all

                        select 'TRÈS' from dual union all

                        select 'marché' from dual union all

                        select 'Hôtel' from dual

                    )

                    select text,

--                        rtrim(utl_raw.cast_to_varchar2(nlssort(text, 'nls_sort=binary_ai')), chr(0)) AS sort_chars

                        regexp_replace(decompose(text), unistr('[\0300-\036F]'), NULL) AS decomp_chars

                    from data

        )

        loop

            null;

        end loop;

    end loop;

end;

/

select sn.name, st.value as cpu_after

from v$statname  sn

    inner join v$mystat st on (st.statistic# = sn.statistic#)

where sn.name = 'CPU used by this session';

Comments
Locked Post
New comments cannot be posted to this locked post.
Post Details
Locked on Jun 15 2018
Added on May 17 2018
34 comments
12,684 views