SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Distributing data evenly by width

Scott WesleyApr 18 2024

Consider a list of files, along with their file sizes.

I can use NTILE to split the number of rows into evenly spaced buckets (height)

But I'd like to use something to split the rows where each bucket has approximately same total file size (width).

I may be misinterpreting the use of width_bucket for this purpose, or mis-using it.

Generate some data using the following, where a large number of “files” are small, and a small portion are bigger.

create table sw_distro_large as
select rownum rn , round(dbms_random.value(10,20)) fs from dual connect by level < 1000
union all
select rownum*10 rn , round(dbms_random.value(100,120)) from dual connect by level < 20
union all
select rownum*100 rn , round(dbms_random.value(300,320)) from dual connect by level < 10
;

So let's pretend this gives us ~20kb worth in 1027 files, and I'd like 5 buckets of approx 4kb

select sum(fs) tot_filesize, count(*) cnt, round(sum(fs)/5) desired_bucket_size From sw_distro_large;

TOT_FILESIZE        CNT DESIRED_BUCKET_SIZE
------------ ---------- -------------------
       19629       1027                3926

If I was bucketing by number of rows, I could use

select count(*), ntile
from (
 select s.* 
   ,NTILE(5) OVER (ORDER BY null) ntile
 from sw_distro_large s
) group by ntile order by ntile;
  COUNT(*)      NTILE
---------- ----------
      206          1
      206          2
      205          3
      205          4
      205          5

If I use similar syntax with width_bucket, the buckets are rather skewed

select sum(fs), wb, count(*)
from (
 select s.* 
   ,WIDTH_BUCKET(fs, 0, 1000, 5) wb
 from imsapx.sw_distro_large s
) group by wb;
   SUM(FS)         WB   COUNT(*)
---------- ---------- ----------
    16824          1       1018
     2805          2          9

Interestingly, if I use num_buckets = 10, the distribution basically matches my initial union.

select sum(fs), wb, count(*)
from (
 select s.* 
   ,WIDTH_BUCKET(fs, 0, 1000, 10) wb
 from imsapx.sw_distro_large s
) group by wb;
   SUM(FS)         WB   COUNT(*)
---------- ---------- ----------
    14778          1        999
     2046          2         19
     2805          4          9

The results I'm chasing can be acquired by manually splitting the buckets using a case statement, but that's not dynamic, and requires me to know the totals beforehand.

select count(*), sum(fs)
,case 
when nv between 0 and 4000 then 1
when nv between 4001 and 8000 then 2
when nv between 8001 and 12000 then 3
when nv between 12001 and 16000 then 4
when nv between 16001 and 20000 then 5
end bucket
from (
  select rn, fs
   ,sum(fs) OVER (ORDER BY rn
     ROWS BETWEEN UNBOUNDED PRECEDING AND current row) nv
  from sw_distro_large s
)
group by case 
when nv between 0 and 4000 then 1
when nv between 4001 and 8000 then 2
when nv between 8001 and 12000 then 3
when nv between 12001 and 16000 then 4
when nv between 16001 and 20000 then 5
end;
  COUNT(*)    SUM(FS)     BUCKET
---------- ---------- ----------
      162       3990          1
      197       4008          2
      231       4008          4
      209       3635          5
      228       3988          3

Should I be using width_bucket? Am I using it incorrectly in this case?

Or is there some other algorithm I could consider?

Cheers, Scott.

Added on Apr 18 2024

#analytics, #ntile, #sql

4 comments

945 views

1 mention