SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

How to Efficiently Sample a Fixed Number of Rows

Jon of All TradesJul 23 2009 — edited Jul 23 2009

Good afternoon. I need to select a specific number of random rows from a table, and while I believe my logic is right it's taking too long, 30 minutes for a routine data size. Hopefully someone here can show me a more efficient query. I've seen the SAMPLE function, but it just randomly selects rows on a one-by-one basis, without a guaranteed total count.

This is the idea:

INSERT INTO Tmp_Table (Value, Sequence) SELECT Value FROM Perm_Table, DBMS_RANDOM.VALUE;
SELECT Value FROM Tmp_Table WHERE ROWNUM <= 1234 ORDER BY Sequence;

I'd need to put the ORDER BY in a subselect for ROWNUM to work correctly, but anyway that's just an illustration. My actual need is a little more complicated. I have many sets of data; each set has many rows; and for each set I need to return a specific (different) number of rows. Perhaps project A has three rows in this table, and I want to keep two of them; project B has two rows, and I want to keep one of them. So I need to identify, for each row, whether it's valid for that project. This is what my data looks like:

Project Person  Sequence Position Keeper
A       Bill    1234     1        Yes
A       Fred    5678     3        No
A       George  1927     2        Yes
B       April   5784     2        No
B       Janice  2691     1        Yes

I populate Sequence with random values, then calculate the position of each person within their project, and finally discard people who's Position is greater than Max_Targets for the Project. Fred and April have the highest random numbers, so they're cut. It's not the case that I'm just trimming one person from each project; the actual percentage kept will range from zero to 100.

Populating the list with random values is not time-consuming, but calculating Position is. This is my code:

UPDATE Tmp_Targets T1
 SET Position =
  (
  SELECT
   COUNT(*)
  FROM
   Perm_Targets PT1
   INNER JOIN Perm_Targets PT2 ON PT1.Project = PT2.Project
   INNER JOIN Tmp_Targets T2 ON PT2.Target = T2.Target
  WHERE
       T1.Target = PT1.Target
   AND T2.Sequence <= T1.Sequence
  );

The Target fields are PKs, and the Project and Sequence fields are indexed. Is there a better way to approach this? I could write a cursor that pulls out project codes and performs the above operations for each project in turn; that would be logically simpler and possibly faster. Has anyone here addressed a similar problem before? I'd appreciate any ideas.

This is on 9.2, in case it matters. Thank you,

Jonathan

This post has been answered by Sentinel on Jul 23 2009

Jump to Answer

Locked Post

New comments cannot be posted to this locked post.

Locked on Aug 20 2009

Added on Jul 23 2009

5 comments

1,126 views