I have a table MY_TABLE with columns STATUS and MESSAGE_ID. STATUS column has an index on it. The actual data distribution now is the following:
select status, count(*)
from my_table
group by status
order by 2 desc;
STATUS COUNT(*)
-------------------------
STATUS_7 10580209
STATUS_6 23902
STATUS_5 1504
STATUS_4 48
STATUS_3 17
STATUS_2 8
STATUS_1 5
As you can see, 99.99% of table rows have one of 2 most frequent statuses.
I need to select rows with other 5 statuses:
select message_id
from my_table
where status in ('STATUS_1', 'STATUS_2', 'STATUS_3', 'STATUS_4', 'STATUS_5');
This query is executed in 0.05 seconds and has a good plan:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------
Plan hash value: 1564383627
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 20 | 320 | 8 (0)| 00:00:01 |
| 1 | INLIST ITERATOR | | | | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE | 20 | 320 | 8 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | STATUS_INDEX | 20 | | 7 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$1
2 - SEL$1 / MY_TABLE@SEL$1
3 - SEL$1 / MY_TABLE@SEL$1
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("STATUS"='STATUS_1' OR "STATUS"='STATUS_2' OR
"STATUS"='STATUS_3' OR "STATUS"='STATUS_4' OR
"STATUS"='STATUS_5')
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - "MESSAGE_ID"[NUMBER,22]
2 - "MESSAGE_ID"[NUMBER,22]
3 - "MY_TABLE".ROWID[ROWID,10]
The problem is that the list of statuses may change, and the real business requirement is “to select all rows except status 6 and 7”:
select message_id
from my_table
where status not in ('STATUS_6', 'STATUS_7');
This query is executed 6 seconds (120 time longer) and oracle makes here full table scan:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
Plan hash value: 150318439
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 23834 | 372K| 27112 (2)| 00:00:02 |
|* 1 | TABLE ACCESS FULL| MY_TABLE | 23834 | 372K| 27112 (2)| 00:00:02 |
------------------------------------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$1 / MY_TABLE@SEL$1
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("STATUS"<>'STATUS_6' AND "STATUS"<>'STATUS_7')
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - "MESSAGE_ID"[NUMBER,22]
I have tied to write a query which produces list of desired statuses:
select distinct status
from my_table
where status not in ('STATUS_6', 'STATUS_7');
This query also works fast - again ~0.05 seconds.
But when I use this as a subquery for the first one -
select t.message_id
from my_table t
where t.status in (select distinct x.status
from my_table x
where x.status not in ('STATUS_6', 'STATUS_7'))
It still works slow (3 seconds). The plan:
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------
Plan hash value: 2846122266
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 23834 | 605K| 75182 (2)| 00:00:03 |
| 1 | NESTED LOOPS SEMI | | 23834 | 605K| 75182 (2)| 00:00:03 |
|* 2 | TABLE ACCESS FULL| MY_TABLE | 23834 | 372K| 27112 (2)| 00:00:02 |
|* 3 | INDEX RANGE SCAN | STATUS_INDEX | 23834 | 232K| 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$5DA710D3
2 - SEL$5DA710D3 / T@SEL$1
3 - SEL$5DA710D3 / X@SEL$2
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("T"."STATUS"<>'STATUS_6' AND "T"."STATUS"<>'STATUS_7')
3 - access("T"."STATUS"="X"."STATUS")
filter("X"."STATUS"<>'STATUS_6' AND "X"."STATUS"<>'STATUS_7')
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - (#keys=0) "T"."MESSAGE_ID"[NUMBER,22]
2 - "T"."MESSAGE_ID"[NUMBER,22], "T"."STATUS"[VARCHAR2,50]
So, what I don't understand here, is why Oracle can't find a faster solution? When I do it in two steps (get list of statuses, get rows with statuses), it takes 0.05 second for each step, but when I put these two steps in one query, Oracle can't find this way.
I have tried to gather statistics, write the query in different ways, use hints (materialized, index, and some others), but nothing makes it better, only worse.
To reproduce the situation, you can use the following:
create table test_table(id number primary key, value varchar2(10));
insert into test_table(id, value)
select rownum, 'VAL' || floor(log(10, dbms_random.value(1, 1000000)))
from dual
connect by level < 1000000;
create index test_table_idx on test_table(value);
begin
dbms_stats.gather_table_stats (
ownname => 'YOUR_USERNAME',
tabname => 'TEST_TABLE',
method_opt => 'FOR ALL');
end;
/