Skip to Main Content

SQL & PL/SQL

How can I use a subquery in conjunction with recursive subquery factoring?

Pseudo NymNov 11 2022

I am attempting to implement a data sampling algorithm that requires partitioning the data (rows of a table) into 'buckets', iterating through the buckets sequentially, and choosing one data item (row) from each bucket. The row selected for each bucket is determined by maximising a function that is dependent on:
the value of the data value for the row being considered in the current bucket;
the already selected (optimised) value in the previous bucket; and
a property derived from all the data points on the next bucket.
A recursive subquery factoring clause appears to be ideally suited to this. (Indeed, a postgresql implementation of the algorithm I am working towards - which is not mine - uses recursive subquery factoring.))
If I pre-calculate the required derived value (3) for each bucket, and store it into a table, then the recursive subquery factoring clause works as intended. (See SQL below.)
I wish to calculate the result as a single WITH ... SELECT ..., without having to create any additional temporary tables with pre-calculated intermediate results. With Oracle 19c database, if, instead of pre-calculating (3) and storing results in a table to be referenced, I calculate it using a subquery in the main WITH ... SELECT ... clause, it fails to compile with an unexpected error (ORA-00904). (See SQL below.)
The two forms appear to be equivalent and so I cannot understand why the single statement form is not working.
Does anyone understand why this is not working? Have I encountered an Oracle DB limitation or bug and if so, are there any workarounds?
As an aside, assuming I get this to work, I'll be needing to introduce another level of subquery in the WITH clause to take the data from the source table and get it into the form required by the algorithm, so this ability to handle subqueries is important. How can I do this?
Thank you.
INPUT DATA

-- A table holding the input data
CREATE TABLE test_data
(
  id NUMBER -- Row identifier
  , b NUMBER -- Bucket number
  , y NUMBER -- Value
);
INSERT INTO test_data (id, b, y) VALUES(1, 1, 10);
INSERT INTO test_data (id, b, y) VALUES(2, 2, 0);
INSERT INTO test_data (id, b, y) VALUES(3, 3, 5);
INSERT INTO test_data (id, b, y) VALUES(4, 3, 4);
INSERT INTO test_data (id, b, y) VALUES(5, 4, 7);
INSERT INTO test_data (id, b, y) VALUES(6, 5, 6);
INSERT INTO test_data (id, b, y) VALUES(7, 6, 2);
INSERT INTO test_data (id, b, y) VALUES(8, 7, 8);
INSERT INTO test_data (id, b, y) VALUES(9, 8, 1);

EXAMPLE 1 (WORKING)

-- Pre-calculate a table holding intermediate results
CREATE TABLE centroids
(
  b NUMBER -- Bucket number
  , y NUMBER -- Derived value
);
INSERT INTO centroids
SELECT
  TD.b
  , AVG(y) -- average y for each bucket
FROM test_data TD
GROUP BY TD.b;
-- Calculate full result via recursive subquery factoring
WITH
  compiled( id, b, y ) AS
  (
    (
      SELECT
        TD.id
        , TD.b
        , TD.y
      FROM
        test_data TD
      WHERE
        TD.b = 1
    )
    UNION ALL
    (
      SELECT
        CUR.id
        , CUR.b
        , CUR.y
      FROM
        compiled PRE    
      LEFT JOIN
        centroids NXT
      ON
        NXT.b = PRE.b + 2
      JOIN
        test_data CUR
      ON
        CUR.b = PRE.b + 1
        AND
        (
          CUR.id =
            (
              SELECT
                CUR_BEST.id
              FROM
                (
                  SELECT
                    CUR_ORDERED.id
                    , CASE
                        WHEN (NXT.b IS NULL) THEN
                          CUR_ORDERED.y
                        ELSE
                          PRE.y + CUR_ORDERED.y + NXT.y
                      END
                      AS a
                  FROM
                    test_data CUR_ORDERED
                  WHERE
                    CUR_ORDERED.b = PRE.b + 1
                  ORDER BY
                    a DESC
                ) CUR_BEST
              WHERE ROWNUM = 1
          )
        )
    )
  )
SELECT
  C.id
  , C.b
  , C.y
FROM
  compiled C
ORDER BY
  C.b ASC;

EXAMPLE 2 (NOT WORKING)

-- Do entire calculation in a single WITH ... SELECT ... statement
WITH
  -- Calculate intermediate result as a subquery
  -- This part does exactly the same calculation as the initial step in example 1. 
  centroids as
  (
    SELECT
      TD.b
      , AVG(y)
    FROM
      test_data TD
    GROUP BY
      TD.b
  )
  -- This part is exactly the same as example 1
  , compiled( id, b, y ) AS
  (
    (
      SELECT
        TD.id
        , TD.b
        , TD.y
      FROM
        test_data TD
      WHERE
        TD.b = 1
    )
    UNION ALL
    (
      SELECT
        CUR.id
        , CUR.b
        , CUR.y
      FROM
        compiled PRE    
      LEFT JOIN
        centroids NXT
      ON
        NXT.b = PRE.b + 2
      JOIN
        test_data CUR
      ON
        CUR.b = PRE.b + 1
        AND
        (
          CUR.id =
            (
              SELECT
                CUR_BEST.id
              FROM
                (
                  SELECT
                    CUR_ORDERED.id
                    , CASE
                        WHEN (NXT.b IS NULL) THEN
                          CUR_ORDERED.y
                        ELSE
                          PRE.y + CUR_ORDERED.y + NXT.y
                      END
                      AS a
                  FROM
                    test_data CUR_ORDERED
                  WHERE
                    CUR_ORDERED.b = PRE.b + 1
                  ORDER BY
                    a DESC
                ) CUR_BEST
              WHERE ROWNUM = 1
          )
        )
    )
  )  
SELECT
  C.id
  , C.b
  , C.y
FROM
  compiled C
ORDER BY
  C.b ASC;

This fails with:
ORA-00904: "NXT"."Y": invalid identifier

This post has been answered by Pseudo Nym on Nov 11 2022
Jump to Answer
Comments
Post Details
Added on Nov 11 2022
4 comments
344 views