Processing large volumes of data in PL/SQL
I'm working on a project which requires us to process large volumes of data on a weekly/monthly/quarterly basis, and I'm not sure we are doing it right, so any tips would be greatly appreciated.
Requirement
Source data is in a flat file in "short-fat" format i.e. each data record (a "case") has a key and up to 2000 variable values.
A typical weekly file would have maybe 10,000 such cases i.e. around 20 million variable values.
But we don't know which variables are used each week until we get the file, or where they are in the file records (this is determined via a set of meta-data definitions that the user selects at runtime). This makes identifying and validating each variable value a little more interesting.
Target is a "long-thin" table i.e. one record for each variable value (with numeric IDs as FKs to identify the parent variable and case.
We only want to load variable values for cases which are entirely valid. This may be a merge i.e. variable values may already exist in the target table.
There are various rules for validating the data against pre-existing data etc. These rules are specific to each variable, and have to be applied before we put the data in the target table. The users want to see the validation results - and may choose to bail out - before the data is written to the target table.
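For illustration, the two shapes look roughly like this (table and column names are invented for the example; as noted under Restrictions, we can't actually create objects freely ourselves):

-- "Short-fat" staging shape: one row per case, generic value slots
CREATE TABLE stg_case_wide (
  case_key    VARCHAR2(30),
  key_status  VARCHAR2(10),     -- set by the initial key validation
  var_val_1   VARCHAR2(100),
  var_val_2   VARCHAR2(100),
  var_val_3   VARCHAR2(100)
  -- ... and so on, up to the 2000th value slot
);

-- "Long-thin" target shape: one row per variable value
CREATE TABLE case_variable_value (
  case_id      NUMBER,          -- FK to the parent case
  variable_id  NUMBER,          -- FK to the variable definition
  var_value    VARCHAR2(100)
);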
Restrictions
We have very limited permission to perform DDL, e.g. to create new tables/indexes.
We have no permission to use features such as Oracle external tables or Oracle directories.
We are working with standard Oracle tools, i.e. PL/SQL, and no DWH tools.
DBAs are extremely resistant to giving us more disk space.
We are on Oracle 9iR2, with no immediate prospect of moving to 10g.
Current approach
Source data is uploaded via SQL*Loader into static "short fat" tables.
Some initial key validation is performed on these records.
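For example, flagging records whose key has no match in our existing case data, along the lines of (names again illustrative):

UPDATE stg_case_wide w
   SET key_status = 'INVALID'
 WHERE NOT EXISTS (SELECT NULL
                     FROM cases c
                    WHERE c.case_key = w.case_key);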
Dynamic SQL (plus BULK COLLECT etc) is used to pivot the short-fat data into an intermediate long-thin table, performing the validation on the fly via a combination of including reference values in the dynamic SQL and calling PL/SQL functions inside the dynamic SQL. This means we can pivot+validate the data in one step, and don't have to update the data with its validation status after we've pivoted it.
This upload+pivot+validate step takes about 1 hour 15 minutes for around 15 million variable values.
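To give a flavour of that step, the kind of thing we build and run per selected variable looks roughly like this (the metadata table, staging/intermediate names and the f_is_valid function are simplified stand-ins, not our real code):

DECLARE
  TYPE t_vc IS TABLE OF VARCHAR2(4000);
  v_keys  t_vc;
  v_vals  t_vc;
  v_flags t_vc;
  v_cur   SYS_REFCURSOR;
  v_sql   VARCHAR2(4000);
BEGIN
  -- one pass per variable the user selected in this week's metadata
  FOR r IN (SELECT variable_id, source_column
              FROM run_variable_metadata) LOOP
    -- splice in the staging column name, and validate inline via a PL/SQL
    -- function so each value is flagged as it is pivoted
    v_sql := 'SELECT case_key, ' || r.source_column
          || ', f_is_valid(' || r.variable_id || ', ' || r.source_column || ')'
          || ' FROM stg_case_wide';
    OPEN v_cur FOR v_sql;
    LOOP
      FETCH v_cur BULK COLLECT INTO v_keys, v_vals, v_flags LIMIT 1000;
      EXIT WHEN v_keys.COUNT = 0;
      FORALL i IN 1 .. v_keys.COUNT
        INSERT INTO intermediate_thin (case_key, variable_id, var_value, valid_flag)
        VALUES (v_keys(i), r.variable_id, v_vals(i), v_flags(i));
    END LOOP;
    CLOSE v_cur;
  END LOOP;
  COMMIT;
END;
/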
The subsequent "load to target table" step also has to apply substitution rules for certain "special values" or NULLs.
We do this by BULK collecting the variable values from the intermediate long-thin table, for each valid case in turn, applying the substitution rules within the SQL, and inserting into/updating the target table as appropriate.
Initially we did this via a SQL MERGE, but this was actually slower than doing an explicit check for existence and switching between INSERT and UPDATE accordingly (yes, that sounds fishy to me too).
This "load" process takes around 90 minutes for the same 15 million variable values.
Questions
Why is it so slow? Our DBAs assure us we have plenty of tablespace etc., and that the server is plenty powerful enough.
Any suggestions as to a better approach, given the restrictions we are working under?
We've looked at Tom Kyte's stuff about creating temporary tables via CTAS, but we have had serious problems with dynamic SQL on this project, so we are very reluctant to introduce more of it unless it's absolutely necessary. In any case, we struggle to get permission to create DB objects (tables, indexes etc.) dynamically.
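(For clarity, by "creating temporary tables via CTAS" we mean building work tables on the fly with something like the following; the table name and predicate are just illustrative.)

BEGIN
  EXECUTE IMMEDIATE
     'CREATE TABLE tmp_valid_values NOLOGGING AS '
  || 'SELECT case_key, variable_id, var_value '
  || '  FROM intermediate_thin '
  || ' WHERE valid_flag = ''Y''';
END;
/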
So any advice would be gratefully received!
Thanks,
Chris