In a recent thread, I made the following statement (paraphrased):
If you have a table with two columns x and y, and if you want to find all the x for which there are two or more distinct values of y, one common way to do that is to group by x and filter by
HAVING COUNT(DISTINCT y) > 1
but I think that is wasteful. I don't think Oracle is smart enough to stop as soon as an x has at least two distinct y; I believe (I don't know how to check that) that Oracle will keep counting the distinct y in each group, wasting time for no good reason. This is why I prefer a different test:
HAVING MIN(y) != MAX(y)
I may very well be wrong, but to my simple mind, computing MIN and MAX wastes fewer resources than a full COUNT (DISTINCT).
I don't know if this is common knowledge, or if it's "knowledge" at all. In any case, I thought to myself - even if it is, it would be best to know "how much better", rather than just "which is better".
So, I came up with the following test. In this specific case, it's not close; the MIN/MAX approach is five times faster. So, it's not a trivial matter, or a "preference" - we simply should not use COUNT(DISTINCT) for such use cases, unless someone can explain a good reason to use it. One may point out that sometimes execution time is not material, or that for a different data arrangement the difference may be smaller; even then, though, why not use MIN != MAX anyway, simply to reinforce a good practice? Are there any cases when COUNT(DISTINCT) is better?
I am putting this in the public domain, for reference, but also to invite your comments.
Here is how I set up the test. I create a table with two columns of data type NUMBER, x and y. x ranges from 1 to 1,000,000 and for each x, there are ten values of y. When x is divisible by 10, all the y values are equal; in all other cases, all the y values are distinct. I gathered statistics on the table, and then I tested the two queries. The setup shown below is from SQL Developer; however, I ran the tests themselves in SQL*Plus since I find it easier to copy the output and format it here. I flushed the buffer cache before each execution (although the I/O time is minimal compared to the actual computation, so that's not really important in this case).
Table setup:
drop table tbl purge;
create table tbl (x number, y number);
insert into tbl
select a.lvl, case mod(a.lvl, 10) when 0 then 1 else b.lvl end
from (select level as lvl from dual connect by level <= 1000000) a
cross join (select level as lvl from dual connect by level <= 10) b
;
commit;
exec dbms_stats.gather_table_stats('MATHGUY', 'TBL', estimate_percent => 100)
The tests: (The explain plans and statistics shed some light on what Oracle is doing in each query, and why the times are different)
SQL> alter system flush buffer_cache;
System altered.
Elapsed: 00:00:00.13
select x
from tbl
group by x
having count(distinct y) > 1;
900000 rows selected.
Elapsed: 00:00:28.05
Execution Plan
----------------------------------------------------------
Plan hash value: 3406339728
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50000 | 878K| | 34363 (2)| 00:00:02 |
|* 1 | FILTER | | | | | | |
| 2 | HASH GROUP BY | | 50000 | 878K| | 34363 (2)| 00:00:02 |
| 3 | VIEW | VM_NWVW_1 | 7071K| 121M| | 34363 (2)| 00:00:02 |
| 4 | HASH GROUP BY | | 7071K| 53M| 153M| 34363 (2)| 00:00:02 |
| 5 | TABLE ACCESS FULL| TBL | 10M| 76M| | 4998 (2)| 00:00:01 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(COUNT("$vm_col_1")>1)
Statistics
----------------------------------------------------------
113 recursive calls
0 db block gets
17975 consistent gets
31945 physical reads
0 redo size
17031353 bytes sent via SQL*Net to client
660597 bytes received via SQL*Net from client
60001 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
900000 rows processed
SQL> alter system flush buffer_cache;
System altered.
Elapsed: 00:00:00.12
select x
from tbl
group by x
having count(distinct y) > 1;
900000 rows selected.
Elapsed: 00:00:29.54
SQL> alter system flush buffer_cache;
System altered.
Elapsed: 00:00:00.09
select x
from tbl
group by x
having min(y) != max(y);
900000 rows selected.
Elapsed: 00:00:05.62
Execution Plan
----------------------------------------------------------
Plan hash value: 2838255862
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50000 | 390K| 5298 (7)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | HASH GROUP BY | | 50000 | 390K| 5298 (7)| 00:00:01 |
| 3 | TABLE ACCESS FULL| TBL | 10M| 76M| 4998 (2)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(MIN("Y")<>MAX("Y"))
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
17975 consistent gets
17964 physical reads
0 redo size
17031353 bytes sent via SQL*Net to client
660597 bytes received via SQL*Net from client
60001 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
900000 rows processed
SQL> alter system flush buffer_cache;
System altered.
Elapsed: 00:00:00.06
select x
from tbl
group by x
having min(y) != max(y);
900000 rows selected.
Elapsed: 00:00:05.42