Consider a list of files, along with their file sizes.
I can use NTILE to split the number of rows into evenly spaced buckets (height)
But I'd like to use something to split the rows where each bucket has approximately same total file size (width).
I may be misinterpreting the use of width_bucket for this purpose, or mis-using it.
Generate some data using the following, where a large number of “files” are small, and a small portion are bigger.
create table sw_distro_large as
select rownum rn , round(dbms_random.value(10,20)) fs from dual connect by level < 1000
union all
select rownum*10 rn , round(dbms_random.value(100,120)) from dual connect by level < 20
union all
select rownum*100 rn , round(dbms_random.value(300,320)) from dual connect by level < 10
;
So let's pretend this gives us ~20kb worth in 1027 files, and I'd like 5 buckets of approx 4kb
select sum(fs) tot_filesize, count(*) cnt, round(sum(fs)/5) desired_bucket_size From sw_distro_large;
TOT_FILESIZE CNT DESIRED_BUCKET_SIZE
------------ ---------- -------------------
19629 1027 3926
If I was bucketing by number of rows, I could use
select count(*), ntile
from (
select s.*
,NTILE(5) OVER (ORDER BY null) ntile
from sw_distro_large s
) group by ntile order by ntile;
COUNT(*) NTILE
---------- ----------
206 1
206 2
205 3
205 4
205 5
If I use similar syntax with width_bucket, the buckets are rather skewed
select sum(fs), wb, count(*)
from (
select s.*
,WIDTH_BUCKET(fs, 0, 1000, 5) wb
from imsapx.sw_distro_large s
) group by wb;
SUM(FS) WB COUNT(*)
---------- ---------- ----------
16824 1 1018
2805 2 9
Interestingly, if I use num_buckets = 10, the distribution basically matches my initial union.
select sum(fs), wb, count(*)
from (
select s.*
,WIDTH_BUCKET(fs, 0, 1000, 10) wb
from imsapx.sw_distro_large s
) group by wb;
SUM(FS) WB COUNT(*)
---------- ---------- ----------
14778 1 999
2046 2 19
2805 4 9
The results I'm chasing can be acquired by manually splitting the buckets using a case statement, but that's not dynamic, and requires me to know the totals beforehand.
select count(*), sum(fs)
,case
when nv between 0 and 4000 then 1
when nv between 4001 and 8000 then 2
when nv between 8001 and 12000 then 3
when nv between 12001 and 16000 then 4
when nv between 16001 and 20000 then 5
end bucket
from (
select rn, fs
,sum(fs) OVER (ORDER BY rn
ROWS BETWEEN UNBOUNDED PRECEDING AND current row) nv
from sw_distro_large s
)
group by case
when nv between 0 and 4000 then 1
when nv between 4001 and 8000 then 2
when nv between 8001 and 12000 then 3
when nv between 12001 and 16000 then 4
when nv between 16001 and 20000 then 5
end;
COUNT(*) SUM(FS) BUCKET
---------- ---------- ----------
162 3990 1
197 4008 2
231 4008 4
209 3635 5
228 3988 3
Should I be using width_bucket? Am I using it incorrectly in this case?
Or is there some other algorithm I could consider?
Cheers, Scott.