I am attempting to implement a data sampling algorithm that requires partitioning the data (rows of a table) into 'buckets', iterating through the buckets sequentially, and choosing one data item (row) from each bucket. The row selected for each bucket is determined by maximising a function that is dependent on:
the value of the data value for the row being considered in the current bucket;
the already selected (optimised) value in the previous bucket; and
a property derived from all the data points on the next bucket.
A recursive subquery factoring clause appears to be ideally suited to this. (Indeed, a postgresql implementation of the algorithm I am working towards - which is not mine - uses recursive subquery factoring.))
If I pre-calculate the required derived value (3) for each bucket, and store it into a table, then the recursive subquery factoring clause works as intended. (See SQL below.)
I wish to calculate the result as a single WITH ... SELECT ...
, without having to create any additional temporary tables with pre-calculated intermediate results. With Oracle 19c database, if, instead of pre-calculating (3) and storing results in a table to be referenced, I calculate it using a subquery in the main WITH ... SELECT ...
clause, it fails to compile with an unexpected error (ORA-00904). (See SQL below.)
The two forms appear to be equivalent and so I cannot understand why the single statement form is not working.
Does anyone understand why this is not working? Have I encountered an Oracle DB limitation or bug and if so, are there any workarounds?
As an aside, assuming I get this to work, I'll be needing to introduce another level of subquery in the WITH clause to take the data from the source table and get it into the form required by the algorithm, so this ability to handle subqueries is important. How can I do this?
Thank you.
INPUT DATA
-- A table holding the input data
CREATE TABLE test_data
(
id NUMBER -- Row identifier
, b NUMBER -- Bucket number
, y NUMBER -- Value
);
INSERT INTO test_data (id, b, y) VALUES(1, 1, 10);
INSERT INTO test_data (id, b, y) VALUES(2, 2, 0);
INSERT INTO test_data (id, b, y) VALUES(3, 3, 5);
INSERT INTO test_data (id, b, y) VALUES(4, 3, 4);
INSERT INTO test_data (id, b, y) VALUES(5, 4, 7);
INSERT INTO test_data (id, b, y) VALUES(6, 5, 6);
INSERT INTO test_data (id, b, y) VALUES(7, 6, 2);
INSERT INTO test_data (id, b, y) VALUES(8, 7, 8);
INSERT INTO test_data (id, b, y) VALUES(9, 8, 1);
EXAMPLE 1 (WORKING)
-- Pre-calculate a table holding intermediate results
CREATE TABLE centroids
(
b NUMBER -- Bucket number
, y NUMBER -- Derived value
);
INSERT INTO centroids
SELECT
TD.b
, AVG(y) -- average y for each bucket
FROM test_data TD
GROUP BY TD.b;
-- Calculate full result via recursive subquery factoring
WITH
compiled( id, b, y ) AS
(
(
SELECT
TD.id
, TD.b
, TD.y
FROM
test_data TD
WHERE
TD.b = 1
)
UNION ALL
(
SELECT
CUR.id
, CUR.b
, CUR.y
FROM
compiled PRE
LEFT JOIN
centroids NXT
ON
NXT.b = PRE.b + 2
JOIN
test_data CUR
ON
CUR.b = PRE.b + 1
AND
(
CUR.id =
(
SELECT
CUR_BEST.id
FROM
(
SELECT
CUR_ORDERED.id
, CASE
WHEN (NXT.b IS NULL) THEN
CUR_ORDERED.y
ELSE
PRE.y + CUR_ORDERED.y + NXT.y
END
AS a
FROM
test_data CUR_ORDERED
WHERE
CUR_ORDERED.b = PRE.b + 1
ORDER BY
a DESC
) CUR_BEST
WHERE ROWNUM = 1
)
)
)
)
SELECT
C.id
, C.b
, C.y
FROM
compiled C
ORDER BY
C.b ASC;
EXAMPLE 2 (NOT WORKING)
-- Do entire calculation in a single WITH ... SELECT ... statement
WITH
-- Calculate intermediate result as a subquery
-- This part does exactly the same calculation as the initial step in example 1.
centroids as
(
SELECT
TD.b
, AVG(y)
FROM
test_data TD
GROUP BY
TD.b
)
-- This part is exactly the same as example 1
, compiled( id, b, y ) AS
(
(
SELECT
TD.id
, TD.b
, TD.y
FROM
test_data TD
WHERE
TD.b = 1
)
UNION ALL
(
SELECT
CUR.id
, CUR.b
, CUR.y
FROM
compiled PRE
LEFT JOIN
centroids NXT
ON
NXT.b = PRE.b + 2
JOIN
test_data CUR
ON
CUR.b = PRE.b + 1
AND
(
CUR.id =
(
SELECT
CUR_BEST.id
FROM
(
SELECT
CUR_ORDERED.id
, CASE
WHEN (NXT.b IS NULL) THEN
CUR_ORDERED.y
ELSE
PRE.y + CUR_ORDERED.y + NXT.y
END
AS a
FROM
test_data CUR_ORDERED
WHERE
CUR_ORDERED.b = PRE.b + 1
ORDER BY
a DESC
) CUR_BEST
WHERE ROWNUM = 1
)
)
)
)
SELECT
C.id
, C.b
, C.y
FROM
compiled C
ORDER BY
C.b ASC;
This fails with:
ORA-00904: "NXT"."Y": invalid identifier