Someone on a different thread asked about identifying rows where a certain date is a quarter-end date. That question itself is not what I am after here, but it provided a good test case.
Suppose we have a table "a" with 4.5 million rows and two columns: a number "nbr" (from 1 to 4.5 million) and a random date between 01-JAN-2001 and 31-DEC-2016. We want to write a query that returns the "nbr" values for which the date is a quarter-end date.
There are several ways to do that. One tactic is to convert the date to a string with the format mask 'mm-dd' and use an IN condition with the four obvious string values. Another tactic is to truncate the date and compare it to the quarter-end dates within the date range. To generate that list of quarter-end dates, one can build a CTE with a hierarchical query - either by hard-coding knowledge of the date interval, or by scanning the table once just to pick up the MIN and MAX dates. (I call these two variants the "cheap" and the "proper" IN condition, respectively.)
Another tactic is to compare the date directly to a "list" of quarter-end dates (again, in either the "cheap" or the "proper" variant - the "proper" variant doesn't need or use any magic dates or numbers). Here we can either assume the dates are known to have no time component, so they can be compared to "pure dates" with equality (an IN condition or an inner join), or we can assume the dates may have a time component, in which case equality comparisons won't do and we must use inequalities instead - what I will call an "inequality join" (by which I mean range conditions like >= and <, not !=). In my tests the dates actually are "pure dates", but that doesn't matter: in the inequality join, all the inequalities are evaluated anyway (and they consume time!).
I've been told on this site that reading from a table with 4.5 million rows is likely to dominate simple logical and arithmetic operations, so I should pay more attention to proper database design, good use of indexes, writing queries that minimize reads, etc., rather than spend time minimizing CPU. My tests lead me to believe that such advice is not always good: in these examples, the cost of evaluating the filters dwarfs the cost of the data reads. I don't work in the field, but from what I read I understand that 4.5 million rows is big enough for the "reads" to be meaningful.
We often encourage newcomers to avoid wrapping column values in function calls whenever possible. Most of the time the reason given is that "function calls prevent the optimizer from using an index that may exist on the column." That, of course, is true. However, my tests show that the function calls themselves are expensive, and they may increase execution time significantly even when there is no index on anything. They also show that in some cases an index doesn't help all that much: even with an index present, the drawback of the function calls is the calls themselves and the time they take, not the fact that the index is unavailable to the optimizer. (In another case, though, the index is the only thing that matters - by more than two orders of magnitude!)
There were also a few things that surprised me; I have no explanation for them, and I would be grateful if anyone can shed some light. For full disclosure: I don't work in IT (I am an enthusiastic amateur) and I am not working on any real-life problem, so if you are only interested in practical problems that arise in real production systems, you won't find them here. I am learning Oracle for my own pleasure - I find it fun and challenging (some day, when I know a lot more than I do now, I may volunteer someplace) - and I am passionate about writing good code. With that said, I still don't know how to properly read an execution plan or plan statistics; I stared at them for the "surprising facts", and the explanation may be staring at me - I stare right back at it, without a clue.
For anyone who would like to replicate what I've done, here is how I created the input data (and a couple of "sanity checks"):
drop table a purge; -- I may have a table "a" wasting space (from older tests)
create table a (nbr number, dt date);
insert into a (nbr, dt)
select level , trunc( date '2001-01-01' + dbms_random.value(0, date '2017-01-01' - date '2001-01-01') )
from dual connect by level <= 1500000
union all
select level + 1500000, trunc( date '2001-01-01' + dbms_random.value(0, date '2017-01-01' - date '2001-01-01') )
from dual connect by level <= 1500000
union all
select level + 3000000, trunc( date '2001-01-01' + dbms_random.value(0, date '2017-01-01' - date '2001-01-01') )
from dual connect by level <= 1500000
;
commit;
select count(*) as ct, count(distinct dt) as ct_dist, date '2017-01-01' - date '2001-01-01' as exp_ct_dist,
min(dt) as min_dt, max(dt) as max_dt from a;
        CT    CT_DIST EXP_CT_DIST MIN_DT      MAX_DT
---------- ---------- ----------- ----------- -----------
   4500000       5844        5844 01-JAN-2001 31-DEC-2016
select count(*) q_end_dates from a where to_char(dt, 'mm-dd') in ('03-31', '06-30', '09-30', '12-31');
Q_END_DATES
-----------
      49157
(For the last "sanity check": quarter-end dates are 4 out of roughly 365 days in a year, so the count should be approximately 4/365 - about 1/91 - of 4.5 million, i.e. just under 50,000. The observed 49,157 fits.)
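For a more precise expected value than that rough estimate, one can use the counts from the first sanity check (5,844 distinct dates, of which 64 are quarter-ends - 16 years times 4 quarters). A one-liner along these lines (not part of the tests, just arithmetic) gives the expected count:
-- 4.5 million rows spread uniformly over 5,844 distinct dates, 64 of which are quarter-ends;
-- this returns 49281, close to the 49157 observed above
select round(4500000 * 64 / 5844) as expected_q_end_rows from dual;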
Rant #2: I just downloaded Oracle 12.1 EE, with its version of SQL Developer. My recollection from 11.2 XE on my laptop was correct: out of the box, the default NLS_DATE_FORMAT is DD-MON-RR. And we beat "novices" over the head for that...
Here are the different solutions. In each one, the no_merge hint and the outer "where rownum > 1" filter make the query process every row of the inline view while returning no rows, so the timings reflect the database work itself and not the fetching and displaying of results.
-- APPROACH: to_char(dt)
--
select /*+ no_merge(x) */ *
from ( select nbr from a where to_char(dt, 'mm-dd') in ('03-31', '06-30', '09-30', '12-31')
) x
where rownum > 1;
-- APPROACH: trunc(dt)
--
with qtr_ends (qe_dt) as (select add_months(date '2000-12-31', 3 * level) from dual connect by level <= 64)
select /*+ no_merge(x) */ *
from ( select nbr from a where trunc(dt) in (select qe_dt from qtr_ends)
) x
where rownum > 1;
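A word about the magic number 64 in the hard-coded CTE above: it is simply 16 years times 4 quarters. A quick check along these lines (not one of the timed queries) confirms that the CTE produces the list it should:
-- the hard-coded list should contain 64 quarter-end dates,
-- from 31-MAR-2001 through 31-DEC-2016
with qtr_ends (qe_dt) as (select add_months(date '2000-12-31', 3 * level) from dual connect by level <= 64)
select count(*) as ct, min(qe_dt) as first_qe, max(qe_dt) as last_qe from qtr_ends;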
-- APPROACH: dt + cheap IN
--
with qtr_ends (qe_dt) as (select add_months(date '2000-12-31', 3 * level) from dual connect by level <= 64)
select /*+ no_merge(x) */ *
from ( select nbr from a where dt in (select qe_dt from qtr_ends)
) x
where rownum > 1;
-- APPROACH: dt + proper IN
--
with date_range (min_dt, max_dt) as (select min(dt), max(dt) from a),
qtr_ends (qe_dt) as (select add_months(trunc(min_dt) - 1, 3 * level) from date_range
connect by add_months(trunc(min_dt) - 1, 3 * level) <= max_dt)
select /*+ no_merge(x) */ *
from ( select nbr from a where dt in (select qe_dt from qtr_ends)
) x
where rownum > 1;
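To convince yourself that the MIN/MAX-driven list really matches the hard-coded one (with no magic dates or numbers), a symmetric-difference check along these lines - again, not one of the timed queries - should come back as zero:
-- count dates that appear in one list but not the other; should be 0
with date_range (min_dt, max_dt) as (select min(dt), max(dt) from a),
     qtr_ends_proper (qe_dt) as (select add_months(trunc(min_dt) - 1, 3 * level) from date_range
                                 connect by add_months(trunc(min_dt) - 1, 3 * level) <= max_dt),
     qtr_ends_cheap (qe_dt) as (select add_months(date '2000-12-31', 3 * level) from dual connect by level <= 64)
select count(*) as ct_mismatches
from   qtr_ends_proper p full outer join qtr_ends_cheap c on p.qe_dt = c.qe_dt
where  p.qe_dt is null or c.qe_dt is null;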
-- APPROACH: dt + inner join
--
with date_range (min_dt, max_dt) as (select min(dt), max(dt) from a),
qtr_ends (qe_dt) as (select add_months(trunc(min_dt) - 1, 3 * level) from date_range
connect by add_months(trunc(min_dt) - 1, 3 * level) <= max_dt)
select /*+ no_merge(x) */ *
from ( select nbr from a inner join qtr_ends q on a.dt = q.qe_dt
) x
where rownum > 1;
-- APPROACH: dt + inequality join
--
with date_range (min_dt, max_dt) as (select min(dt), max(dt) from a),
qtr_ends (qe_dt) as (select add_months(trunc(min_dt) - 1, 3 * level) from date_range
connect by add_months(trunc(min_dt) - 1, 3 * level) <= max_dt)
select /*+ no_merge(x) */ *
from ( select nbr from a join qtr_ends q on a.dt >= q.qe_dt and a.dt < q.qe_dt + 1
) x
where rownum > 1
;
And here are the results. Notice that the "optimizer cost" has little to do with the actual running time of the queries. After I ran the queries without an index on the date column, I added an index and re-ran them; results are reported both ways.
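For reference, the index in question is a plain single-column index on the date column - something like this (the index name is arbitrary):
-- a plain (default b-tree) index on the date column; the name is arbitrary
create index a_dt_idx on a (dt);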
Rant #3 - the download archive for 12.1 EE included an older version of SQL Developer (3.2.20.10) than the download for 11.2 XE did (that one came with SQL Developer 4.1.3.20). What the heck? On my laptop, with the older version of Oracle but the newer SQL Developer, I am able to show CPU cost and a few other statistics in Explain Plan; not on my desktop with 12.1 and the older SQL Developer (and I was too lazy to grab them from SQL*Plus - I'd rather update SQL Developer). Anyway, that's why I don't show CPU costs below.
APPROACH       to_char(dt)   trunc(dt)   dt + cheap IN   dt + proper IN   dt + inner join   dt + ineq join
w/o index
  opt cost            3586        3327            3327             6638              6638             6666
  time (ms)           1200        1280             240              380               790          > 46000

with index
  opt cost            3586        3327             752             4063              4062             6604
  time (ms)           1190        1280             235              233               230              240

All times are in milliseconds (ms).
Notice how the approaches that don't wrap dt in a function call run about five times faster than the to_char() and trunc() approaches. For the "inequality join", which is what is often needed in practice, that holds only if the date column is indexed; but for the inner join (usable when the dates are known not to include a time-of-day component) and for the equivalent "IN condition" approaches, the conclusion does not depend on the index (though without the index the inner join is the smallest win, at roughly 1.5 times faster).
QUESTIONS IN THE NEXT POST (FIRST "REPLY")