Interest in this topic was spurred by this recent thread in this forum: ORA-00932: Distinct of a column
Suppose we have a table with 1000 rows, with columns ID (number, primary key) and TXT (CLOB). We need to write a query showing all pairs of ID, let's call then ID1 and ID2, with ID1 < ID2, with identical corresponding TXT values.
Oracle doesn't allow us to compare CLOB with the equality test (T1.TXT = T2.TXT). Instead, we can use DBMS_LOB.COMPARE for the same task. However, as we will see below, that is quite slow.
So, the idea I suggested in the earlier thread (which I doubt very strongly is an original idea) is to hash the CLOB's first and compare the hashes. If the hashes are different, then we already know the CLOBs are different too, and we don't need to use DBMS_LOB.COMPARE. Of course, if the hashes are equal, we can't conclude that the CLOB's are equal, and we still need to use COMPARE.
So, adding the hash test will speed things up (we hope) when the CLOBs are distinct, and it will definitely slow them down when the CLOBs are equal. The question is, by how much. Of course, it is possible that the COMPARE function already does a hash test, or something similar, first, to save time - in which case the whole idea is going to fail miserably. Alas, as my tests show, that's not the case.
Here are the results: When the CLOBs are all equal (so the hash test is never helpful, it's pure deadweight), the execution times with or without the extra hash test are indistinguishable, a little over 15 seconds on my system. However, when all the CLOB are distinct, the query without the hashing pre-test completes in the same 15 seconds, but the query with the hash test completes in 0.1 seconds. 150 times faster.
After I give the readers the opportunity to chime in, I plan to create a request in the Ideas forum, to ask Oracle to consider adding such a hash test in the implementation of COMPARE itself. That is, unless someone can point out in this thread why that's not a good idea.
One tangential thing I noticed as I ran my tests - and I find surprising - is that when I tested on 1000 distinct CLOBs, with COMPARE alone, it didn't matter if the differences between the input strings were at the beginning of the strings (within the first four characters) or at the end. This suggests that COMPARE doesn't work in the most obvious way, which is to compare the two strings, byte by byte, starting from the beginning, and stopping as soon as a difference was found. Either that, or COMPARE does work in that way, but the overhead (to call the function, perhaps to allocate memory dynamically etc.), rather than the actual comparison byte by byte, takes up the vast majority of execution time.
SIGNIFICANT EDIT (After the first three replies, below)
Jonathan Lewis, in Reply 1, pointed out a fatal flaw in my thinking about this. Testing on a self-join is very specific, and misleading; the much better test would compare pairs of CLOBs in isolation from each other. I ran THAT test (see Reply 4), and indeed, adding the hash step DOESN'T HELP AT ALL. The technique I describe in this "original post" in the thread is still very useful - when used with a join where we must compare CLOBs; it is not, as I was thinking incorrectly, a general improvement for the application of DBMS_LOB.COMPARE.
END EDIT.
So - here are the tests.
Set up the table; populate it with 1000 identical CLOBs of length 4004; test with the two queries below; then return to the setup portion, truncate the table, populate with 1000 distinct CLOBs, repeat the queries.
Note what I am selecting in the queries: SUM(ID2 - ID1) over the rows where the CLOBs are equal (and ID1 < ID2); this is just so that the output is a single row. The queries still need to perform all the CLOB comparisons to get the answer. Also note that I used DBMS_CRYPTO.HASH, rather than ORA_HASH; ORA_HASH does not work on CLOB.
I only show the query with the hash test included. To test without it, simply comment out that line of code.
One last thing - I didn't test on VARCHAR2 strings of length, say, greater than 3000, to see if a hash test can speed up comparisons in that case too. (Nor did I try to see if in that case, the differences being at the beginning of the string vs. at the end makes any difference for straight equality comparison.) I may add to this thread later, when I get a chance to do those tests.
drop table t purge;
create table t (id number primary key, txt clob);
insert into t (id, txt)
select level, to_clob('abcd') || rpad('x', 4000, 'x')
from dual
connect by level <= 1000
;
commit;
truncate table t;
insert into t (id, txt)
select level, to_clob(level) || rpad('x', 4000, 'x')
from dual
connect by level <= 1000
;
commit;
select sum(t2.id - t1.id) as result
from t t1 inner join t t2 on t1.id < t2.id
and dbms_lob.compare(t1.txt, t2.txt) = 0
and dbms_crypto.hash(t1.txt, 1) = dbms_crypto.hash(t2.txt, 1) -- comment this out to test without hashing
;