I need to remove diacritics from text. There are two basic ways I'm aware of:
with data(text) as (
select 'Mañana' from dual union all
select 'aperçu' from dual union all
select 'TRÈS' from dual union all
select 'marché' from dual union all
select 'Hôtel' from dual
)
select text,
nlssort(text, 'nls_sort=binary_ai') as sort_bytes,
rtrim(utl_raw.cast_to_varchar2(nlssort(text, 'nls_sort=binary_ai')), chr(0)) AS sort_chars,
decompose(text) as decomp,
regexp_replace(decompose(text), unistr('[\0300-\036F]'), NULL) AS decomp_chars
from data;
The first is to get the "accent insensitive" binary sort bytes for the text, use utl_raw to convert those back to text, and then strip the trailing ascii nul that's appended by nslsort().
The second is to decompose the characters with an a diacritic into base characters followed by standalone diacritics. Then, because standalone diacritics fall within the unicode range 0300-036F they can easily be removed.
The second option takes 1/2 the cpu (1M executions in an average of 11.7 seconds vs. 26.6 seconds on our dev instance) and it preserves case, where the sort-byte based option yields on lowercase text. So option 2 seems a clear favorite.
So... am I missing anything of note? Are there other, better options?
Here's my ad hoc performance script.
select sn.name, st.value as cpu_before
from v$statname sn
inner join v$mystat st on (st.statistic# = sn.statistic#)
where sn.name = 'CPU used by this session';
begin
for i in 1..200000
loop
for x in ( with data(text) as (
select 'Mañana' from dual union all
select 'aperçu' from dual union all
select 'TRÈS' from dual union all
select 'marché' from dual union all
select 'Hôtel' from dual
)
select text,
-- rtrim(utl_raw.cast_to_varchar2(nlssort(text, 'nls_sort=binary_ai')), chr(0)) AS sort_chars
regexp_replace(decompose(text), unistr('[\0300-\036F]'), NULL) AS decomp_chars
from data
)
loop
null;
end loop;
end loop;
end;
/
select sn.name, st.value as cpu_after
from v$statname sn
inner join v$mystat st on (st.statistic# = sn.statistic#)
where sn.name = 'CPU used by this session';