Skip to Main Content

SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Correcting outliers by decide outliers by Z-Score method

PAN KEVINNov 23 2019 — edited Nov 24 2019

I have table with outliers,

ID  COL1  COL1_Z_SCORE  COL2  COL2_Z_SCORE

0      1           0      12       -0.03

1      100          2.22  17       -0.01

2      1           0      31       0.04

3      1           0      530      2.01

4      1           0      22       0.01

5      1           0      32       0.05

6      1           0      -600       -2.44

7      1           0      43       0.09

8      1           0      43       0.09

9      1           0      43       0.09

10  -100           -2.25  43       0.09

Thanks for Frank's answer https://community.oracle.com/message/15449126#15449126, I can use the following method to detect outliers by Z-Score method,

(1) compute the Z-Score by each column

(2) if the absolute value of computed Z-Score > 2.2 ,consider it as outlier and set it NULL

Now, I want to correct outliers, if the value of computed Z-Score > 2.2, set it as two standard deviations, and if the value of computed Z-Score < -2.2, set it as negative two standard deviations

My expected result is as follow, 89.45 and 508.71 is the standard deviation of each columns.

ID  COL1  COL1_Z_SCORE  COL2  COL2_Z_SCORE

0      1           0      12       -0.03

1      89.45        2.22  17       -0.01

2      1           0      31       0.04

3      1           0      508.71      2.01

4      1           0      22       0.01

5      1           0      32       0.05

6      1           0      -508.71       -2.44

7      1           0      43       0.09

8      1           0      43       0.09

9      1           0      43       0.09

10  -89.45           -2.25 43       0.09

My expected answer can be modified based on Frank's answer as follow, since I need to do this for several columns (not just two). Thanks very much.

WITH UNPIVOTED_DATA AS

(SELECT ID,

         COL,

         VAL,

         ABS((VAL - AVG(VAL) OVER(PARTITION BY COL)) / STDDEV(VAL)

             OVER(PARTITION BY COL)) AS ABS_Z

  FROM   MYDATA UNPIVOT(VAL FOR COL IN(COL1, COL2)))

SELECT *

FROM   UNPIVOTED_DATA

PIVOT (MIN(CASE WHEN ABS_Z <= 2.2 THEN VAL END) FOR COL IN('COL1' AS COL1,

                                                      'COL2' AS COL2))

ORDER  BY ID;

create table MYDATA (ID integer, COL1 integer, COL2 integer);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (0, 1, 12);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (1, 100, 17);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (2, 1, 31);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (3, 1, 530);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (4, 1, 22);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (5, 1, 32);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (6, 1, -600);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (7, 1, 43);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (8, 1, 43);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (9, 1, 43);

INSERT INTO MYDATA (ID, COL1, COL2) VALUES (10, -100, 43);

This post has been answered by mathguy on Nov 23 2019
Jump to Answer
Comments
Post Details
Added on Nov 23 2019
3 comments
613 views