SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

having count(distinct col) > 1 vs having min(col) != max(col)

mathguyMay 23 2020 — edited Jun 26 2023

In a recent thread, I made the following statement (paraphrased):

If you have a table with two columns x and y, and if you want to find all the x for which there are two or more distinct values of y, one common way to do that is to group by x and filter by

HAVING COUNT(DISTINCT y) > 1

but I think that is wasteful. I don't think Oracle is smart enough to stop as soon as an x has at least two distinct y; I believe (I don't know how to check that) that Oracle will keep counting the distinct y in each group, wasting time for no good reason. This is why I prefer a different test:

HAVING MIN(y) != MAX(y)

I may very well be wrong, but to my simple mind, computing MIN and MAX wastes fewer resources than a full COUNT (DISTINCT).

I don't know if this is common knowledge, or if it's "knowledge" at all. In any case, I thought to myself - even if it is, it would be best to know "how much better", rather than just "which is better".

So, I came up with the following test. In this specific case, it's not close; the MIN/MAX approach is five times faster. So, it's not a trivial matter, or a "preference" - we simply should not use COUNT(DISTINCT) for such use cases, unless someone can explain a good reason to use it. One may point out that sometimes execution time is not material, or that for a different data arrangement the difference may be smaller; even then, though, why not use MIN != MAX anyway, simply to reinforce a good practice? Are there any cases when COUNT(DISTINCT) is better?

I am putting this in the public domain, for reference, but also to invite your comments.

Here is how I set up the test. I create a table with two columns of data type NUMBER, x and y. x ranges from 1 to 1,000,000 and for each x, there are ten values of y. When x is divisible by 10, all the y values are equal; in all other cases, all the y values are distinct. I gathered statistics on the table, and then I tested the two queries. The setup shown below is from SQL Developer; however, I ran the tests themselves in SQL*Plus since I find it easier to copy the output and format it here. I flushed the buffer cache before each execution (although the I/O time is minimal compared to the actual computation, so that's not really important in this case).

Table setup:

drop table tbl purge;
create table tbl (x number, y number);
insert into tbl
  select      a.lvl, case mod(a.lvl, 10) when 0 then 1 else b.lvl end
  from        (select level as lvl from dual connect by level <= 1000000) a
  cross join  (select level as lvl from dual connect by level <= 10) b
;
commit;
exec dbms_stats.gather_table_stats('MATHGUY', 'TBL', estimate_percent => 100)

The tests: (The explain plans and statistics shed some light on what Oracle is doing in each query, and why the times are different)

SQL> alter system flush buffer_cache;

System altered.
Elapsed: 00:00:00.13

select x
from   tbl
group  by x
having count(distinct y) > 1;

900000 rows selected.
Elapsed: 00:00:28.05

Execution Plan
----------------------------------------------------------
Plan hash value: 3406339728
-------------------------------------------------------------------------------------------
| Id  | Operation             | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |           | 50000 |   878K|       | 34363   (2)| 00:00:02 |
|*  1 |  FILTER               |           |       |       |       |            |          |
|   2 |   HASH GROUP BY       |           | 50000 |   878K|       | 34363   (2)| 00:00:02 |
|   3 |    VIEW               | VM_NWVW_1 |  7071K|   121M|       | 34363   (2)| 00:00:02 |
|   4 |     HASH GROUP BY     |           |  7071K|    53M|   153M| 34363   (2)| 00:00:02 |
|   5 |      TABLE ACCESS FULL| TBL       |    10M|    76M|       |  4998   (2)| 00:00:01 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COUNT("$vm_col_1")>1)
Statistics
----------------------------------------------------------
        113  recursive calls
          0  db block gets
      17975  consistent gets
      31945  physical reads
          0  redo size
   17031353  bytes sent via SQL*Net to client
     660597  bytes received via SQL*Net from client
      60001  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     900000  rows processed


SQL> alter system flush buffer_cache;
System altered.
Elapsed: 00:00:00.12

select x
from   tbl
group  by x
having count(distinct y) > 1;

900000 rows selected.
Elapsed: 00:00:29.54


SQL> alter system flush buffer_cache;

System altered.
Elapsed: 00:00:00.09

select x
from   tbl
group  by x
having min(y) != max(y);

900000 rows selected.
Elapsed: 00:00:05.62

Execution Plan
----------------------------------------------------------
Plan hash value: 2838255862
-----------------------------------------------------------------------------
| Id  | Operation           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |       | 50000 |   390K|  5298   (7)| 00:00:01 |
|*  1 |  FILTER             |       |       |       |            |          |
|   2 |   HASH GROUP BY     |       | 50000 |   390K|  5298   (7)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| TBL   |    10M|    76M|  4998   (2)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(MIN("Y")<>MAX("Y"))
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      17975  consistent gets
      17964  physical reads
          0  redo size
   17031353  bytes sent via SQL*Net to client
     660597  bytes received via SQL*Net from client
      60001  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     900000  rows processed


SQL> alter system flush buffer_cache;

System altered.
Elapsed: 00:00:00.06

select x
from   tbl
group  by x
having min(y) != max(y);

900000 rows selected.
Elapsed: 00:00:05.42

Added on May 23 2020

3 comments

1,478 views