SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Fun with indexes: selectivity, clustering factor, function-based indexes

mathguyJun 25 2018 — edited Jun 27 2018

Before we start:

SQL> select banner from v$version;

BANNER

--------------------------------------------------------------------------------

Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

PL/SQL Release 12.2.0.1.0 - Production

CORE 12.2.0.1.0 Production

TNS for Linux: Version 12.2.0.1.0 - Production

NLSRTL Version 12.2.0.1.0 - Production

I don't think it matters, but if anyone is curious - I set up the database as multi-tenant; I did all the work in a PDB. Running on top of Oracle Linux 7.5.

- - -

I ran a few informal performance tests, for my own learning, and I ran into a few interesting things. I am sharing here for two reasons - first, some of this may be helpful to others; and second, I have a few questions of my own, and I am seeking your critique, both with regard to the way I set up my tests and to my speculation on what may be happening in some instances.

I created a table with three columns, as shown below. I populated it with 100 million rows; the first column is an ID, primary key; the second is a date, with values from 2018-01-01 00:00:00 to 2018-12-31 23:59:59, with most values appearing three times in the table (the rest appear four times); and a numeric VAL column, with values between 1 and 10,000. CREATE TABLE and INSERT statements below, for those inclined to repeat the tests.

create table t (id number primary key, created_date date, val number);
insert into t
  with h ( x ) as ( select level from dual connect by level <= :input ),
       g ( y ) as ( select level from dual connect by level <= :input )
  select :input * (x - 1) + y
       , date '2018-01-01' + numtodsinterval(mod(:input * (x - 1) + y, 31536000), 'second')
       , y
  from   h cross join g
;

NOTE - :input controls the size of the test data. I wrote the INSERT that way so I can test with :input = 100 (just to make sure I got it more-or-less right), but in creating the table I used :input = 1e4 (ten thousand), resulting in a table T with 100 million rows.

One day has 86,400 seconds, and a non-leap year has 365 days, meaning 31,356,000 seconds; that is the magic number used in the INSERT statement. 100 million is a bit more than three times that number; so the CREATED_DATE increases by one second with each successive ID, until we reach the end of the year; then the date wraps back to the beginning of the year, it reaches the end of the year again, etc. In total, most of the dates (meaning date-times, with resolution down to seconds) appear three times in the table, with a small number appearing four times. (A simple closed-form computation, or a SELECT statement WHERE ID = 1e8, shows that dates through '2018-03-04 09:46:40' appear four times; the rest appear three times.)

With this setup, suppose I want to compute the average of the VAL column for a fixed date; say DATE '2018-06-25'. (Initially I had trunc(sysdate) but that is not repeatable.) There are two ways to write a query for this:

select avg(val) as avg_val
from   t
where  trunc(created_date) = date '2018-06-25';
select avg(val)
from   t
where  created_date >= date '2018-06-25' and created_date < date '2018-06-25' + 1;

At this point there is only one index on the table, on column ID (primary key), which is irrelevant for this query (or "these queries").

Running each query a few times - and ALTER SYSTEM FLUSH BUFFER_CACHE before each run, so that all the needed data is read from disk each time - I found that the TRUNC version typically takes 16 seconds. The non-TRUNC version typically takes less than two seconds. Now, there is some variability (I ran the same tests a couple of days ago, with TRUNC(SYSDATE), and I got somewhat different results - 16.5 seconds vs. 3.3 seconds); but the order of magnitude of the relative performance gain/loss is similar.

So - Conclusion 1: At least in some cases, avoiding the overhead of function calls (here, one TRUNC() call for each row) may have significant benefits, even in the absence of an index on the relevant column.

At this point, let's summarize the statistics: the 100 million rows are distributed more or less uniformly over the CREATED_DATE values. The specific query retrieves 86,400 * 3 = 259,200 rows, or about 0.26% of the total number of rows in the table. (We know this by direct computation; Thomas the Unbeliever may run a SELECT COUNT(*) query to verify. Easier to do after we add an index.)

Now, perhaps we shouldn't read too much into this. The rows are very short (18 bytes on average, we will find after we run table stats), so I/O is probably quite a bit less than would be typical in real life. I/O should be the same for both queries; if it is much longer than in my tests, the function call overhead will represent a smaller fraction of the total. Also, my test queries are trivial - they just compute an average, with no GROUP BY, no ordering, no joins, analytic functions etc.; in a more realistic scenario, that will also add equal time to both queries, making the relative gain/loss from function calls (or avoiding them) smaller. Always test on your actual data!

- - -

Now let's add an index on CREATED_DATE. Index statistics are gathered at index creation; let's also gather table statistics. Statements below, for those who like to play along. (WARNING: Both the INSERT statement and the CREATE INDEX statements, etc., take a long time - of the order of magnitude of hours. Be patient!)

create index idx_dt on t (created_date);
exec dbms_stats.gather_table_stats(ownname => 'MATHGUY', tabname => 'T', estimate_percent => 1)   --  Obviously, replace MATHGUY with your user name.

When we run the two queries again, the TRUNC version still takes the same 16 seconds. SURPRISE - the non-TRUNC version also still takes the same 1.8 seconds! And a quick look at the EXPLAIN PLAN shows that the index is not being used.

Obviously, selectivity is not the issue. We are selecting 0.26% of all the rows. So what is the issue? For the experienced practitioners (I am not one of them!) the answer is obvious - when selectivity is high, as it is here, and stats are current, and the distribution is uniform, the immediate next suspect is the CLUSTERING FACTOR. Since I am not an experienced practitioner, I didn't immediately think of it; rather, I looked at the index statistics and I saw that it is exceptionally high. (If you have read so far but you are not familiar with CLUSTERING FACTOR: read about it in the documentation, it is a basic and relatively simple concept!)

If I had written a query to insert 30 million rows, covering almost (but not quite) one year worth of seconds, I would not be seeing this. The rows are inserted in the order of ID, which in that case would also be the order of dates; rows line up neatly in blocks, and the clustering factor would be very low.

The problem with my test data, though, is that I insert the same date (to the second) for three ID's, sometimes four ID's, and those ID's are far apart. When ordering by CREATED_DATE, the first row (ID = 1) is in one block, the second row (ID = 31536001) is in another block, and the third row is in yet another block. (And the fourth row - there are for rows with the date '2018-01-01 00:00:00' - is in yet another block.***) You can see how by definition the clustering factor will be exceptionally high - essentially equal to the number of rows in the table.

*** (Actually I was a bit negligent, and the first date in the table has time 00:00:01, not 00:00:00 - but that doesn't change anything.)

Now, in real life, if rows are inserted more or less chronologically (i.e. if there is some correlation between CREATED_DATE and the actual creation date), then clustering will be much tighter, and the index would be used by the second query with no fuss. But note that inserting RANDOM dates (as created, for example, by adding a random number to a fixed date) would have the same very high clustering factor as does my test data. Something to consider when thinking about creating indexes on date columns... (and perhaps on other columns too, although it seems to me that "date" is the most obvious).

- - -

Now, in this very special (and unnatural) setup I know something the Optimizer doesn't know. This is exactly the kind of situation where I can help the Optimizer with an INDEX hint.

What do I know? I know that my data is in fact very tightly clustered; just not in the way Oracle assesses it. As we look at the rows in my table: the first rows are all in different blocks; but the next four rows, and the next four after them, etc., are IN THE SAME blocks as the first four rows. Think of it as a form of "cyclic" clustering.

At this point it may help to look at statistics, both for the table and for the index:

select * from user_tab_statistics where table_name = 'T';

select * from user_ind_statistics where index_name = 'IDX_DT';

The table stats show approx. 100 million rows (not exact, since I sampled 1% of the rows). Estimated blocks: about 322,000. At 8 kB per block, this is about 2.5 GB of data. Average row length: 18 bytes. (Before one asks - 18 bytes times 100 million rows is about 1.8 GB, not 2.5 GB; of course, Oracle reserves some space in the blocks, they are not filled to 100% with data.)

The index stats are more interesting. BLEVEL is 3, so if we could use the index, access should be pretty fast. DISTINCT KEYS are shown as 30724232 - when in fact we know there are exactly 31,356,000. Most importantly, the CLUSTERING FACTOR is over 99 million. This is what tricks the Optimizer into not using the index. In fact, all the rows for the "magic" date in my query are found in just 823 blocks, which is about 0.25% of the total number of blocks - which confirms that they are clustered very tightly indeed!

After we add the hint /*+ index(t, idx_dt) */ the second query takes approx. 0.15 seconds. About 12 times faster than without the index, and about 100 times faster than the TRUNC query (which can't take advantage of the index).

- - -

Does this mean that TRUNC should be banished? Not necessarily!

First, regardless of what the stats say, the data must be relatively well clustered for an index on CREATED_DATE to be helpful. And if it is, but the stats don't show that (because they can't, not because they are stale!) we may need to help the Optimizer.

Second, of course, the query must compare TRUNC_DATE to truncated dates - it can't compare them to 9:00 AM on a given date.

But if these conditions are met (as they are in my tests), perhaps a function-based index on TRUNC(CREATED_DATE) would help. Let's see:

create index idx_trunc on t (trunc(created_date));

Now the TRUNC query runs in 0.08 seconds after flushing the shared pool, and in 0.03 seconds if we keep the plan from one run to the next and we only flush the buffer cache! So, if we need the absolute fastest execution time (for this very special type of query - which may not, in actuality, be TOO special), and with the data clustered as it is, despite what the CLUSTERING FACTOR says, the best option is to create a FBI index on TRUNC(CREATED_DATE).

Interestingly, the NON-TRUNC query also becomes faster, but it is about twice as slow as the TRUNC query. The Explain Plan shows something weird... the Optimizer is smart enough to see that the CREATED_DATE condition can take advantage of the TRUNC FBI; it uses that as the access predicate, but then it still checks the WHERE clause as a filter. I can't think of any good reason for this; the filter predicate (no TRUNC wrapper on the column) is 100% equivalent to the access predicate, and it shouldn't be evaluated a second time. SCRATCH THAT: On further inspection, I see what happened... The TRUNC filter (used for accessing the data) is more general; the Optimizer replaced the second inequality with a NON-STRICT inequality, and applied TRUNC around the fixed dates in the inequalities. This can only mean one thing: even queries comparing to a time-of-day like 9:00 AM may take advantage of the TRUNC FBI: the bounds are expanded to allow a TRUNC comparison for access, and then the actual WHERE clause is used for further filtering. Live and learn!

For anyone who wants to try the "non-index" versions after they created and used the indexes... if you are a beginner like me, you may run into the same issues I did. Oracle will honor a NO_INDEX hint on a normal index, but it doesn't seem to honor that on a function-based index. (I tried the unqualified NO_INDEX hint, and the FULL hint, and the Optimizer would still use the FBI.) The way I found around that was to disable the FBI. Question to the experts... is that the right way? What am I missing - why doesn't the FULL hint force a full table scan?

All comments, however critical, are welcome. The post is already about five times as long as what would be an "unreasonably long post", but if anyone cares to see more, please feel free to ask.

Locked Post

New comments cannot be posted to this locked post.

Locked on Jul 25 2018

Added on Jun 25 2018

#index

25 comments

1,193 views