PL/SQL 101 : Grouping Sequence Ranges (Tabibitosan Method)
(based on the thread: Tabibitosan method Tutorial by @"Aketi Jyuuzou")
Author: @"BluShadow"
Credit to: @"Aketi Jyuuzou"
Last Updated: 1st June 2015
Introduction
Back in 2011, Aketi Jyuuzou introducted the community to a method of grouping sequences of numbers, called the Tabibitosan Method in Japan.
(As Aketi no longer appears to be contributing in this community, I'll document this technique rather than try and ask him to do so)
Roughly translated (as Japanese does not directly translate to English), Tabibito means the Traveler, and Tabibitosan is Traveler Calculation.
The basis of Tabobitisan comes from a Japanese High School entrance exam, and is detailed in the following (translated) Japanese web page:
[http://translate.google.co.uk/translate?hl=en&sl=ja&u=https://www.manabinoba.com/index.cfm/7,757,33,html&prev=search](http://translate.google.co.uk/translate?hl=en&sl=ja&u=https://www.manabinoba.com/index.cfm/7,757,33,html&prev=search)
However, in terms of data processing we can simply relate this to our desire to find groups or ranges within sequences, as you'll see in this article.
It's surprising how often questions pop up on the community that require this technique.
1. The basic problem
Let us say we have some data containing several sequence ranges...
Val
---
1
2
3
5
6
7
10
11
12
20
21
As you can see from that data, there's a range of numbers from 1-3, 5-7, 10-12 and 20-21, and we can determine those ranges by seeing where there are 'gaps' in sequence.
But how can we group these sequences together into their ranges using SQL? That's our basic problem.
2. The basic method
Ok, so firstly, let's (theoretically) take our data and apply a row number (RN) to each record
Val RN
--- --
1 1
2 2
3 3
5 4
6 5
7 6
10 7
11 8
12 9
20 10
21 11
Well, that's nothing special. What does that do for us?
Here's the clever bit. If we now subtract our row number (RN) from our sequence number (Val) we get this...
Val RN Val-RN
--- -- ------
1 1 0
2 2 0
3 3 0
5 4 1
6 5 1
7 6 1
10 7 3
11 8 3
12 9 3
20 10 10
21 11 10
As if by magic, we've created a number that is identical (and unique) for each "group" in the range of numbers.
If we've got unique "group" numbers, then SQL is perfectly suited to grouping our rows together with the functionality that is familiar to most... yes... the GROUP BY clause and aggregate functions (MIN, MAX etc.).
Let's do it in SQL and see...
create table myvals as
select 1 as val from dual union all
select 2 from dual union all
select 3 from dual union all
select 5 from dual union all
select 6 from dual union all
select 7 from dual union all
select 10 from dual union all
select 11 from dual union all
select 12 from dual union all
select 20 from dual union all
select 21 from dual
/
Table created.
select val
,row_number() over (order by val) as rn
,val-row_number() over (order by val) as grp
from myvals
order by val
/
VAL RN GRP
---------- ---------- ----------
1 1 0
2 2 0
3 3 0
5 4 1
6 5 1
7 6 1
10 7 3
11 8 3
12 9 3
20 10 10
21 11 10
11 rows selected.
And now let's group our value ranges based on those groups...
select min(val) as range_start
,max(val) as range_end
,count(*) as range_count
from (-- our previous query
select val
,row_number() over (order by val) as rn
,val-row_number() over (order by val) as grp
from myvals
)
group by grp
order by 1
/
RANGE_START RANGE_END RANGE_COUNT
----------- ---------- -----------
1 3 3
5 7 3
10 12 3
20 21 2
4 rows selected.
There we go, that's the basics of the Tabibitosan Method for grouping sequences. It really is that simple, but so many people are unaware of this simple trick.
The "group" relates to the Tabibitosan 'traveller' as it's effectively measuring the distance between our two sequences (the one in our data, and the one we generate as a row number). In the Japanese website, these are the two people walking at different speeds. This is sometimes referred to as "Grouping by Distance".
3. Another Example - Date Ranges
A common requirement is when we have ranges of Dates. Can we use the same method for Dates? Yes we can.
Anything that we can relate to a numeric sequence in some way can be treated like this. Let's take a look...
create table mydates as
select date '2015-04-01' as dt from dual union all
select date '2015-04-02' from dual union all
select date '2015-04-03' from dual union all
select date '2015-04-04' from dual union all
select date '2015-04-07' from dual union all
select date '2015-04-08' from dual union all
select date '2015-04-10' from dual union all
select date '2015-04-12' from dual union all
select date '2015-04-13' from dual union all
select date '2015-04-14' from dual
/
alter session set nls_date_format='YYYY-MM-DD';
select dt
,row_number() over (order by dt) as rn
,dt-row_number() over (order by dt) as grp
from mydates
/
DT RN GRP
---------- ---------- ----------
2015-04-01 1 2015-03-31
2015-04-02 2 2015-03-31
2015-04-03 3 2015-03-31
2015-04-04 4 2015-03-31
2015-04-07 5 2015-04-02
2015-04-08 6 2015-04-02
2015-04-10 7 2015-04-03
2015-04-12 8 2015-04-04
2015-04-13 9 2015-04-04
2015-04-14 10 2015-04-04
10 rows selected.
Here, our "groups" are defined by an actual date. It doesn't matter what the date is, as long as the date is unique to the group.
So, now we can use those dates to group on...
select min(dt) as range_start
,max(dt) as range_end
,count(*) as range_count
from (-- our grouping query
select dt
,row_number() over (order by dt) as rn
,dt-row_number() over (order by dt) as grp
from mydates
)
group by grp
order by 1
/
RANGE_STAR RANGE_END RANGE_COUNT
---------- ---------- -----------
2015-04-01 2015-04-04 4
2015-04-07 2015-04-08 2
2015-04-10 2015-04-10 1
2015-04-12 2015-04-14 3
4 rows selected.
Likewise you could do your ranges for seconds, minutes, hours, months or years as you want, simply by adjusting the group calculation accordingly.
We'll look at a slightly different example, grouping by months later, as it may not be as obvious as you first think, especially when you have multiple records for the same month; and that leads us to look at what happens if we have duplicate rows.
4. Tabibitosan with duplicate values
What do we do if we have duplicate rows in our sequences?
Well, we could distinct those sequences before we apply our Tabibitosan Method to them, that's one way, but requires an extra subquery.
Another way would be to group them before we apply our Tabibitosan Method to them, again that requires an extra subquery
Also, in our real applications, we're probably dealing with more than just a sequence, we probably have some other data too.
So, let's set up some example data, and see if we can use what we've learnt already...
create table mysales as
select 1 as day, 'Fred' as who, 100 as dollars from dual union all
select 1, 'Bob', 50 from dual union all
select 1, 'Jim', 75 from dual union all
select 2, 'Bob', 125 from dual union all
select 2, 'Jim', 100 from dual union all
select 3, 'Fred', 25 as dollars from dual union all
select 4, 'Fred', 50 from dual union all
select 4, 'Jim', 150 from dual union all
select 5, 'Jim', 50 from dual union all
select 8, 'Fred', 25 from dual union all
select 8, 'Bob', 100 from dual union all
select 9, 'Jim', 175 from dual union all
select 9, 'Fred', 75 from dual union all
select 10, 'Fred', 125 from dual union all
select 10, 'Fred', 225 from dual union all
select 11, 'Fred', 75 from dual union all
select 12, 'Fred', 100 from dual union all
select 15, 'Jim', 150 from dual union all
select 16, 'Bob', 150 from dual
/
select day, who, dollars
,row_number() over (order by day) as rn
,day-row_number() over (order by day) as grp
from mysales
order by 1,4
/
DAY WHO DOLLARS RN GRP
---------- ---- ---------- ---------- ----------
1 Fred 100 1 0
1 Bob 50 2 -1
1 Jim 75 3 -2
2 Bob 125 4 -2
2 Jim 100 5 -3
3 Fred 25 6 -3
4 Fred 50 7 -3
4 Jim 150 8 -4
5 Jim 50 9 -4
8 Fred 25 10 -2
8 Bob 100 11 -3
9 Jim 175 12 -3
9 Fred 75 13 -4
10 Fred 125 14 -4
10 Fred 225 15 -5
11 Fred 75 16 -5
12 Fred 100 17 -5
15 Jim 150 18 -3
16 Bob 150 19 -3
19 rows selected.
Hmmm, those groupings don't look right.
That's because we are trying to apply Tabibitosan based on the "Day" but that data is not actually sequential; it effectively has duplicates in it.
So, when we apply our row numbering, which IS sequential, against it, we don't get the right result.
Perhaps there's some other way we can account for these duplicates? Yes there is. Rather than a sequential row number, we want something similar, but that takes account of duplicates.
It may not be immediately obvious, but Oracle provides us with another analytical function, called "dense_rank".
Let's check it out...
select day, who, dollars
,dense_rank() over (order by day) as rn
,day-dense_rank() over (order by day) as grp
from mysales
order by 1,4
/
DAY WHO DOLLARS RN GRP
---------- ---- ---------- ---------- ----------
1 Fred 100 1 0
1 Bob 50 1 0
1 Jim 75 1 0
2 Bob 125 2 0
2 Jim 100 2 0
3 Fred 25 3 0
4 Fred 50 4 0
4 Jim 150 4 0
5 Jim 50 5 0
8 Fred 25 6 2
8 Bob 100 6 2
9 Jim 175 7 2
9 Fred 75 7 2
10 Fred 125 8 2
10 Fred 225 8 2
11 Fred 75 9 2
12 Fred 100 10 2
15 Jim 150 11 4
16 Bob 150 12 4
19 rows selected.
That looks good.
Ranking in principle works like positioning people on a leaderboard in sports. If we had several people competing, and two people come in first place, they are both considered 1st, and then the next person is considered in 3rd place (in general there is no second place because of the two people in 1st place). Dense Ranking works in a similar way, but doesn't leave gaps, so if there are multiple people in one position, the next people are in the next position i.e. two people in 1st position, the next place is 2nd position.
So, looking at our data above, we can see that all the Day 1 records are in 1st position (RN=1), all Day2 records are in 2nd position (RN=2) and so on. You can see that the RN goes up in a gapless sequence even though it too now has duplicates.
This difference between our generated gapless sequence and the Days which have gaps, allows us to apply out Tabibitosan Method, subtracting one from the other to generate unique group identifiers.
So, now we can group the data...
select min(day) as start_day
,max(day) as end_day
,count(distinct who) as sales_people
,count(*) as sales_count
,sum(dollars) as sales_amount
from (-- our dense rank grouping query
select day, who, dollars
,dense_rank() over (order by day) as rn
,day-dense_rank() over (order by day) as grp
from mysales
)
group by grp
order by 1
/
START_DAY END_DAY SALES_PEOPLE SALES_COUNT SALES_AMOUNT
---------- ---------- ------------ ----------- ------------
1 5 3 9 725
8 12 3 8 900
15 16 2 2 300
3 rows selected.
That works really well.
We can see from this that, the key to applying Tabibitosan, is to be able to generate a gapless sequence (even if it has duplicates) so that we can compare it with the sequential (and potentially duplicated) gappy sequence we have in our data. Once we have those two key components we're able to generate those unique groupings.
5. Tabibitosan on Dates - by Month
So, our above example had duplicate rows for the "Days", but those days weren't very realistic for most people's data. It's unlikely we would be recording a number to represent a day.
So, let's change our sales data to have some more realistic dates, and apply our dense_rank method to those dates...
drop table mysales
/
create table mysales as
select date '2014-01-01' as dt, 'Fred' as who, 100 as dollars from dual union all
select date '2014-01-02', 'Bob', 50 from dual union all
select date '2014-01-03', 'Jim', 75 from dual union all
select date '2014-02-12', 'Bob', 125 from dual union all
select date '2014-02-15', 'Jim', 100 from dual union all
select date '2014-03-07', 'Fred', 25 as dollars from dual union all
select date '2014-04-01', 'Fred', 50 from dual union all
select date '2014-04-28', 'Jim', 150 from dual union all
select date '2014-05-02', 'Jim', 50 from dual union all
select date '2014-08-13', 'Fred', 25 from dual union all
select date '2014-08-20', 'Bob', 100 from dual union all
select date '2014-09-05', 'Jim', 175 from dual union all
select date '2014-09-06', 'Fred', 75 from dual union all
select date '2014-10-11', 'Fred', 125 from dual union all
select date '2014-10-14', 'Fred', 225 from dual union all
select date '2014-11-11', 'Fred', 75 from dual union all
select date '2014-12-01', 'Fred', 100 from dual union all
select date '2015-03-06', 'Jim', 150 from dual union all
select date '2015-04-01', 'Bob', 150 from dual
/
select min(dt) as start_day
,max(dt) as end_day
,count(distinct who) as sales_people
,count(*) as sales_count
,sum(dollars) as sales_amount
from (-- our dense rank grouping query
select dt, who, dollars
,dense_rank() over (order by dt) as rn
,dt-dense_rank() over (order by dt) as grp
from mysales
)
group by grp
order by 1
/
START_DAY END_DAY SALES_PEOPLE SALES_COUNT SALES_AMOUNT
---------- ---------- ------------ ----------- ------------
2014-01-01 2014-01-03 3 3 225
2014-02-12 2014-02-12 1 1 125
2014-02-15 2014-02-15 1 1 100
2014-03-07 2014-03-07 1 1 25
2014-04-01 2014-04-01 1 1 50
2014-04-28 2014-04-28 1 1 150
2014-05-02 2014-05-02 1 1 50
2014-08-13 2014-08-13 1 1 25
2014-08-20 2014-08-20 1 1 100
2014-09-05 2014-09-06 2 2 250
2014-10-11 2014-10-11 1 1 125
2014-10-14 2014-10-14 1 1 225
2014-11-11 2014-11-11 1 1 75
2014-12-01 2014-12-01 1 1 100
2015-03-06 2015-03-06 1 1 150
2015-04-01 2015-04-01 1 1 150
16 rows selected.
Well, that works, if we had wanted to group the data by day, but our sales team is lazy (or perhaps just busy doing other things ), and don't make sales in some months, so we want to just group by consecutive months to get a more 'overall' picture of what they've been up to.
For that, we need to consider that we aren't interested in the day component of our DATE values, and our calculation for our group needs to be based on months rather than days (as most will know just subtracting a number from a DATE is subtracting a number of days, not months; and in addition there are variable numbers of days in each month so we cannot just factor up the value we take off).
Therefore, for each of our dates, we'll truncate them to their Month, and we'll calculate our groups using Oracle's add_months function (using a negative value to affect the subtraction). Likewise, for display purposes, we'll just display the year and month so it makes more sense...
select to_char(min(dt),'YYYY-MM') as start_month
,to_char(max(dt),'YYYY-MM') as end_month
,count(distinct who) as sales_people
,count(*) as sales_count
,sum(dollars) as sales_amount
from (-- our dense rank grouping query
select dt, who, dollars
,dense_rank() over (order by trunc(dt,'MM')) as rn
,add_months(trunc(dt,'MM'), -dense_rank() over (order by trunc(dt,'MM'))) as grp
from mysales
)
group by grp
order by 1
/
START_M END_MON SALES_PEOPLE SALES_COUNT SALES_AMOUNT
------- ------- ------------ ----------- ------------
2014-01 2014-05 3 9 725
2014-08 2014-12 3 8 900
2015-03 2015-04 2 2 300
3 rows selected.
Brilliant, we've achieved what we wanted, grouping duplicate rows of dates into month ranges. And because of our Tabibitosan Method, it's nice concise SQL.
6. Summary
Here we've seen a few examples using numbers and dates. If it suits the situation you can convert your dates to numbers using to_number(to_char(dt, 'j')) and apply Tabibitosan method to that, but clearly there are far too many scenarios to cover in this 101 introduction to the topic.
The link at the top of this article to Aketi's thread shows some more advanced scenarios and provides some other links with examples, so I do encourage you to take a look at that thread.
The one thing to remember though is, if you're considering how to group sequences of values into 'ranges', then think "Tabibitosan" !!
3rd June 2015 Update:
Following feedback to Steven Feuersteine's tweet about this article (thanks for the email Steven ), if you're using Oracle 12c or above, there is now another way to achieve this using the new MATCH_RECOGNIZE clause.
For details see @"Stew Ashton"'s excellent blog post demonstrating how this works and how it can improve upon what we've seen with Tabibitosan
https://stewashton.wordpress.com/2014/03/05/12c-match_recognize-grouping-sequences/
Nice one Stew!
[Edited 13/03/2023 - to reformat for the new forum platform]