SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Correcting outliers by decide outliers by Z-Score method

PAN KEVINNov 23 2019 — edited Nov 24 2019

I have table with outliers,

ID COL1 COL1_Z_SCORE COL2 COL2_Z_SCORE

0 1 0 12 -0.03

1 100 2.22 17 -0.01

2 1 0 31 0.04

3 1 0 530 2.01

4 1 0 22 0.01

5 1 0 32 0.05

6 1 0 -600 -2.44

7 1 0 43 0.09

8 1 0 43 0.09

9 1 0 43 0.09

10 -100 -2.25 43 0.09

Thanks for Frank's answer https://community.oracle.com/message/15449126#15449126, I can use the following method to detect outliers by Z-Score method,

(1) compute the Z-Score by each column

(2) if the absolute value of computed Z-Score > 2.2 ,consider it as outlier and set it NULL

Now, I want to correct outliers, if the value of computed Z-Score > 2.2, set it as two standard deviations, and if the value of computed Z-Score < -2.2, set it as negative two standard deviations

My expected result is as follow, 89.45 and 508.71 is the standard deviation of each columns.

ID COL1 COL1_Z_SCORE COL2 COL2_Z_SCORE

0 1 0 12 -0.03

1 89.45 2.22 17 -0.01

2 1 0 31 0.04

3 1 0 508.71 2.01

4 1 0 22 0.01

5 1 0 32 0.05

6 1 0 -508.71 -2.44

7 1 0 43 0.09

8 1 0 43 0.09

9 1 0 43 0.09

10 -89.45 -2.25 43 0.09

My expected answer can be modified based on Frank's answer as follow, since I need to do this for several columns (not just two). Thanks very much.

WITH UNPIVOTED_DATA AS
 (SELECT ID,
         COL,
         VAL,
         ABS((VAL - AVG(VAL) OVER(PARTITION BY COL)) / STDDEV(VAL)
             OVER(PARTITION BY COL)) AS ABS_Z
  FROM   MYDATA UNPIVOT(VAL FOR COL IN(COL1, COL2)))
SELECT *
FROM   UNPIVOTED_DATA
PIVOT (MIN(CASE WHEN ABS_Z <= 2.2 THEN VAL END) FOR COL IN('COL1' AS COL1,
                                                      'COL2' AS COL2))
ORDER  BY ID;

create table MYDATA (ID integer, COL1 integer, COL2 integer);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (0, 1, 12);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (1, 100, 17);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (2, 1, 31);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (3, 1, 530);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (4, 1, 22);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (5, 1, 32);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (6, 1, -600);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (7, 1, 43);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (8, 1, 43);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (9, 1, 43);
INSERT INTO MYDATA (ID, COL1, COL2) VALUES (10, -100, 43);

This post has been answered by mathguy on Nov 23 2019

Jump to Answer

Added on Nov 23 2019

3 comments

666 views