I am opening a new discussion based on Andrew's last reply to my other discussion Why don't the optimizer use the new index for this query?
Andrew Sayer wrote:
Not only do the stats need to be there, they must also represent the data. Consider I have a column status that has two possible values 0 and 1, simple statistics will tell the optimizer I have two different values here so if I filter on status = 1 then expect half the table back. Now let's assume that actually almost all of the rows have status=0, so when I query select * FROM table where status=1, I'm only going to get a few rows back, in this situation an index would be perfect. However, without there being some data to tell the optimizer that there is skew in the data set, Oracle will still assume I'm going to get half the table and probably thinks an index is a terrible idea for that. This situation can be addressed with histograms.
In order to test this, I did a full test on my 11g XE DB.
SQL> SELECT * FROM v$version;
BANNER
---------------------------------------------------------------------------
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
SQL> SELECT name, value FROM v$parameter WHERE UPPER(name) = UPPER('db_block_size');
NAME VALUE
------------------------------ ------------------------------
db_block_size 8192
SQL> SELECT name, value FROM v$parameter WHERE UPPER(name) = UPPER('db_file_multiblock_read_count');
NAME VALUE
------------------------------ ------------------------------
db_file_multiblock_read_count 111
SQL> CREATE TABLE emps
(
empid NUMBER PRIMARY KEY,
empname VARCHAR2(4000),
active VARCHAR2(1),
CONSTRAINT emp_active_check CHECK (active IN ('Y', 'N'))
);
Table created.
I now insert 50,000 records to the table. The EMPNAME column has very big values because of RPAD.
DECLARE
mod1 number;
BEGIN
FOR i IN 1..50000 LOOP
mod1 := MOD(i,2);
INSERT INTO emps VALUES (i, i || ' ' || RPAD ('a', 3500, 'z') || ' ' || i, DECODE(mod1, 1, 'Y', 'N'));
END LOOP;
END;
SQL> COMMIT;
Commit complete.
The values for column ACTIVE is evenly distributed. Y,N,Y,N,Y........ like that
SQL> SELECT active, COUNT(*) FROM emps GROUP BY active;
A COUNT(*)
- ----------
Y 25000
N 25000
Table size is quite big: 200 MB.
SQL> select segment_name,segment_type,bytes/1024/1024 MB
from dba_segments
where segment_type='TABLE' and segment_name='EMPS' AND OWNER = 'PEGA';
SEGMENT_NAME SEGMENT_TYPE MB
-------------------------------------------------- ------------------ ----------
EMPS TABLE 200
Now I analyze table.
SQL> BEGIN dbms_stats.gather_table_stats('PEGA', 'emps'); end;
2 /
PL/SQL procedure successfully completed.
Now, my testing starts...
(TEST1) Check the plan for this statement:
SELECT * FROM emps WHERE active = 'Y';
It correctly does a FTS with cost of 6885.
(TEST2) Now, I create an index for the ACTIVE column.
SQL> CREATE INDEX emps_active_idx ON emps (active);
Index created.
No need to analyze the table because, it is done automatically when the new index is created.
Now, I check the explain plan: It still does a FTS with a cost of 6885.
So, it thinks that still 1/2 the table will be ACTIVE = 'Y' records and does a FTS anyway. The new index is not used.
(TEST3) Now I make every record's ACTIVE column to 'N', except for EMPID = 25001 (i.e. the middle record), which has value 'Y' for ACTIVE.
SQL> UPDATE emps SET active = 'N' WHERE empid != 25001;
49999 rows updated.
SQL> COMMIT;
Commit complete.
Now, only 1 record, EMPID = 25001, the middle-record has ACTIVE = Y. All other records have ACTIVE = N.
Now, I check the plan for:
SELECT * FROM emps WHERE active = 'Y';
Plan says that a FTS will be done at a cost of 6885.
Since the table statistics are not updated, the above can be explained.
(TEST4) Now I re-analyze the table.
SQL> BEGIN dbms_stats.gather_table_stats('PEGA', 'emps'); end;
2 /
PL/SQL procedure successfully completed.
Now, I check the plan for below SQL which should bring only 1 record:
SELECT * FROM emps WHERE active = 'Y';
Now, here we have a problem. STILL it is doing a FTS with cost 6885.
(TEST5) Now, I actually run the query with that hint.
SQL> SET TIMING ON
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ /* level42dl */ * FROM emps WHERE active = 'Y';
... 1 record comes.
Elapsed: 00:00:07.87
SQL> SELECT sql_id, child_number, SUBSTR(sql_text, 1, 50) sql1 FROM v$sql WHERE sql_text LIKE '%level42dl%' AND sql_text NOT LIKE '%sql_id%';
SQL_ID CHILD_NUMBER SQL1
------------- ------------ --------------------------------------------------
3n94k36jzhk86 0 SELECT /*+ GATHER_PLAN_STATISTICS */ /* level42dl
SQL> SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR('3n94k36jzhk86',0, 'ALLSTATSLAST'));
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
SQL_ID 3n94k36jzhk86, child number 0
SELECT /*+ GATHER_PLAN_STATISTICS */ /* level42dl */ * FROM emps WHERE active = 'Y'
Plan hash value: 2715242515
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:07.17 | 25159 | 9580 |
|* 1 | TABLE ACCESS FULL| EMPS | 1 | 25000 | 1 |00:00:07.17 | 25159 | 9580 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("ACTIVE"='Y')
19 rows selected
As you can see, it is STILL doing a FTS.
This is totally confusing, if you ask me...