Thread: Introduction to regular expressions ... continued.


Permlink Replies: 29 - Pages: 2 [ 1 2 | Next ] - Last Post: Apr 20, 2007 12:09 PM Last Post By: Volder
cd_2

Posts: 4,849
Registered: 09/08/98
Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 7:28 AM
Click to report abuse...   Click to reply to this thread Reply
After some very positive feedback from Part 1 I'm now continuing on this topic for the interested audience. As always, if you have questions or problems that you think could be solved through regular expression, please post them.

Having fun with regular expressions - Part 2

Finishing my example with decimal numbers, I thought about a method to test regular expressions. A question from another user who was looking for a way to show all possible combinations inspired me in writing a small package.

CREATE OR REPLACE PACKAGE regex_utils AS
  -- Regular Expression Utilities
  -- Version 0.1 
  TYPE t_outrec IS RECORD(
    data VARCHAR2(255)
  );
  TYPE t_outtab IS TABLE OF t_outrec;
  FUNCTION gen_data(
    p_charset IN VARCHAR2 -- character set that is used for generation
  , p_length  IN NUMBER   -- length of the generated 
  ) RETURN t_outtab PIPELINED;
END regex_utils;
/ 
 
CREATE OR REPLACE PACKAGE BODY regex_utils AS
-- FUNCTION gen_data returns a collection of generated varchar2 elements
  FUNCTION gen_data(
    p_charset IN VARCHAR2 -- character set that is used for generation
  , p_length  IN NUMBER   -- length of the generated 
  ) RETURN t_outtab PIPELINED
  IS
    TYPE t_counter IS TABLE OF PLS_INTEGER INDEX BY PLS_INTEGER;
    --
    v_counter t_counter;
    v_exit    BOOLEAN;
    v_string  VARCHAR2(255);
    v_outrec  t_outrec;
  BEGIN 
    FOR max_length IN 1..p_length  
    LOOP
      -- init counter loop
      FOR i IN 1..max_length
      LOOP
        v_counter(i) := 1;
      END LOOP; 
      --
      -- start data generation loop
      --
      v_exit := FALSE;
      --
      WHILE NOT v_exit
      LOOP
        -- start generation
        v_string := '';
        --
        FOR i IN 1..max_length
        LOOP
          v_string := v_string || SUBSTR(p_charset, v_counter(i), 1);
        END LOOP;
        --
        -- set outgoing record
        --
        v_outrec.data := v_string;
        --
        -- now pipe the result 
        --
        PIPE ROW(v_outrec);
        --
        -- increment loop
        --
        <<inc_loop>>
        FOR i IN REVERSE 1..max_length
        LOOP
          v_counter(i) := v_counter(i) + 1;      
          IF v_counter(i) > LENGTH(p_charset) THEN 
             IF i > 1 THEN
                v_counter(i) := 1;
             ELSE
                v_exit := TRUE;   
             END IF;
          ELSE
             -- no further processing required
             EXIT inc_loop;   
          END IF;   
        END LOOP;         
        --
      END LOOP;  
    END LOOP;  
  END gen_data;
END regex_utils;
/ 


This package is a brute force string generator using all possible combinations of a characters in a string up to a maximum length. Together with the regular expressions, I can now show what combinations my solution would allow to pass. But see for yourself:

SELECT *
  FROM (SELECT data col1 
          FROM TABLE(regex_utils.gen_data('+-.0', 5))
       ) t
 WHERE REGEXP_LIKE(NVL(REGEXP_SUBSTR(t.col1, 
                                     '^([+-]?[^+-]+|[^+-]+[+-]?)$'
                                    ), 
                       ' '
                      ), 
                   '^[+-]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)[+-]?$'
                   )
;


You will see some results, which are perfectly valid for my definition of decimal numbers but haven't been mentioned, like '000' or '+.00'. From now on I will also use this package to verify the solutions I'll present to you and hopefully reduce my share of typos.

Counting and finding certain characters or words in a string can be a tedious task. I'll show you how it's done with regular expressions. I'll start with an easy example, count all spaces in the string "Having fun with regular expressions.":

SELECT NVL(LENGTH(REGEXP_REPLACE('Having fun with regular expressions', '[^ ]')), 0)
  FROM dual
  ;


No surprise there. I'm replacing all characters except spaces with a null string. Since REGEXP_REPLACE assumes a NULL string as replacement argument, I can save on adding a third argument, which would look like this:

REGEXP_REPLACE('Having fun with regular expressions', '[^ ]', '')


So REPLACE will return all the spaces which we can count with the LENGTH function. If there aren't any, I will get a NULL string, which is checked by the NVL function. If you want you can play around by changing the space character to somethin else.

A variation of this theme could be counting the number of words. Counting spaces and adding 1 to this result could be misleading if there are duplicate spaces. Thanks to regular expressions, I can of course eliminate duplicates.

Using the old method on the string "Having fun with regular expressions" would return anything but the right number. This is, where Backreferences come into play. REGEXP_REPLACE uses them in the replacement argument, a backslash plus a single digit, like this: '\1'. To reference a string in a search pattern, I have to use subexpressions (remember the round brackets?).

SELECT NVL(LENGTH(REGEXP_REPLACE('Having  fun  with  regular  expressions', '( )\1*|.', '\1')))
  FROM dual
  ;


You may have noticed that I changed from using the "^" as a NOT operator to using the "|" OR operator and the "." any character placeholder. This neat little trick allows to filter all other characters except the one we're looking in the first place. "\1" as backreference is outside of our subexpression since I don't want to count the trailing spaces and is used both in the search pattern and the replacement argument.

Still I'm not satisfied with this: What about leading/trailing blanks, what if there are any special characters, numbers, etc.? Finally, it's time to only count words. For the purpose of this demonstration, I define a word as one or more consecutive letters. If by now you're already thinking in regular expressions, the solution is not far away. One hint: you may want to check on the "i" match parameter which allows for case insensitive search. Another one: You won't need a back reference in the search pattern this time.


Let's compare our solutions than, shall we?

SELECT NVL(LENGTH(REGEXP_REPLACE('Having  fun  with  regular  expressions.  !', 
                                 '([a-z])+|.', '\1', 1, 0, 'i')), 0)
  FROM dual;


This time I don't use a backreference, the "" operator (remember? 1 or more) will suffice. And since I want to count the occurences, not the letters, I moved the "" meta character outside of the subexpression. The "|." trick again proved to be useful.

Case insensitive search does have its merits. It will only search but not transform the any found substring. If I want, for example, extract any occurence of the word fun, I'll just use the "i" match parameter and get this substring, whether it's written as "Fun", "FUN" or "fun". Can be very useful if you're looking for example for names of customers, streets, etc.

Enough about counting, how about finding? What if I want to know the last occurence of a certain character or string, for example the postition of the last space in this string "Where is the last space?"?

Addendum: Thanks to another forum member, I should mention that using the INSTR function can do a reverse search by itself.

WITH t AS (SELECT 'Where is the last space?' col1
             FROM dual)
SELECT INSTR(col1, ' ', -1)
  FROM DUAL;


Now regular expressions are powerful, but there is no parameter that allows us to reverse the search direction. However, remembering that we have the "$" meta character that means (until the) end of string, all I have to do is use a search pattern that looks for a combination of space and non-space characters including the end of a string. Now compare the REGEXP_INSTR function to the previous solution:

SELECT REGEXP_INSTR(t.col1, ' [^ ]*$')                        
  FROM t;


So in this case, it'll remain a matter of taste what you want to use. If the search pattern has to look for the last occurrence of another regular expression, this is the way to solve such a requirement.

One more thing about backreferences. They can be used for a sort of primitive "string swapping". If for example you have to transform column values like swapping first and last name, backreferenc is your friend. Here's an example:

SELECT REGEXP_REPLACE('John Doe', '^(.*) (.*)$', '\2, \1') 
  FROM dual
  ;


What about middle names, for example 'John J. Doe'? Look for yourself, it still works.

You can even use that for strings with delimiters, for example reversing delimited "fields" like in this string '10~20~30~40~50' into '50~40~30~20~10'. Using REVERSE, I would get '05~04~03~02~01', so there has to be another way. Using backreferences however is limited to 9 subexpressions, which limits the following solution a bit, if you need to process strings with more than 9 fields. If you want, you can think this example through and see if your solution matches mine.

SELECT REGEXP_REPLACE('10~20~30~40~50', 
                      '^(.*)~(.*)~(.*)~(.*)~(.*)$', 
                      '\5~\4~\3~\2~\1'
                     )
  FROM dual;


After what you've learned so far, that wasn't too hard, was it? Enough for now ...

Continued in Part 3.

C.

Fixed some typos and a flawed example ...
cd

cd_2

Posts: 4,849
Registered: 09/08/98
Re: Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 7:29 AM   in response to: cd_2 in response to: cd_2
Click to report abuse...   Click to reply to this thread Reply
For the 3rd and final part I have this topics left:

-- variable IN lists
-- LIKE and IN together
-- phone numbers
-- checking inet values, like IP address, e-mail or URL
-- Advancements in 10g2

C.
Rob van Wijk

Posts: 5,305
Registered: 08/17/06
Re: Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 7:48 AM   in response to: cd_2 in response to: cd_2
Click to report abuse...   Click to reply to this thread Reply
Great information C !

One german typo :-)
I'm replacing all characters except spaces mit a null string.

Looking forward to the last part.

Regards,
Rob.
marias

Posts: 1,428
Registered: 06/23/06
Re: Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 7:53 AM   in response to: cd_2 in response to: cd_2
Click to report abuse...   Click to reply to this thread Reply
SELECT REGEXP_REPLACE('John Doe', '^(.*) (.*)$', '\2, \1') FROM dual ;

SELECT REGEXP_REPLACE('10~20~30~40~50', '^(.*)~(.*)~(.*)~(.*)~(.*)$',
'\5~\4~\3~\2~\1' ) FROM dual;

those 2 are pretty interesting.

good article.
Radhakrishna Sa...

Posts: 2,865
Registered: 01/25/06
Re: Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 8:18 AM   in response to: Rob van Wijk in response to: Rob van Wijk
Click to report abuse...   Click to reply to this thread Reply
Thank you very much C. Awaiting other parts.... keep going.

One german typo :-)
I'm replacing all characters except spaces mit a
null string.

I received a functional spec from my Dutch analyst in which it is written

tnsnames voor EDWH:
PCESCRD1 = (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)
(HOST=blah.blah.blah.com)
(PORT=5227)))
(CONNECT_DATA=(SID=pcescrd1)))
db user: BW_I2_VIEWER / BW_I2_VIEWER_SCRD1

Had to look for translators.

Cheers
Sarma.
Rob van Wijk

Posts: 5,305
Registered: 08/17/06
Re: Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 8:21 AM   in response to: Radhakrishna Sa... in response to: Radhakrishna Sa...
Click to report abuse...   Click to reply to this thread Reply
:-)

"voor" means "for" usually.
ebrian

Posts: 2,563
Registered: 02/16/06
Re: Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 8:46 AM   in response to: cd_2 in response to: cd_2
Click to report abuse...   Click to reply to this thread Reply
cd...excellent write-up!!!

Just curious...this:

WITH t AS (SELECT 'Where is the last space?' col1
FROM dual)
SELECT LENGTH(t.col1) - DECODE(INSTR(REVERSE(t.col1), REVERSE(' ')),
0, LENGTH(t.col1),
INSTR(REVERSE(t.col1), REVERSE(' ')) - 1)
FROM t;

could actually be simplified to this...right?

select instr('Where is the last space?',' ',-1) from dual;
cd_2

Posts: 4,849
Registered: 09/08/98
Re: Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 9:11 AM   in response to: ebrian in response to: ebrian
Click to report abuse...   Click to reply to this thread Reply
I knew I should have read up the INSTR function in the user guide. Completely missed that one. So yes, searching backwords is possible through INSTR. If you want to search backward with regular expression, however, you'll still have to use the method I discussed.

C.
nurhidayat

Posts: 736
Registered: 04/12/06
Re: Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 7:56 PM   in response to: cd_2 in response to: cd_2
Click to report abuse...   Click to reply to this thread Reply
After what you've learned so far, that wasn't too hard, was it? Enough for now ...

remembering how hard i was to understand regular expression in my college time, yes that wasn't so hard.

Thanks, cd.
May be you could post this helpful article to Oracle Magazine

Message was edited by:
nurhidayat
cd_2

Posts: 4,849
Registered: 09/08/98
Re: Introduction to regular expressions ... continued.
Posted: Oct 5, 2006 11:45 PM   in response to: nurhidayat in response to: nurhidayat
Click to report abuse...   Click to reply to this thread Reply
When I encountered regular expressions for the first time, I went like "what the heck?". In the end it's just a matter of practice ...

As for the idea of publishing it on Oracle Mag., I'm afraid I already violated one submission rule:

"The article is an original work and has not been published in any other form,"

C.
Thamil Kadavara...

Posts: 15
Registered: 05/08/06
Re: Introduction to regular expressions ... continued.
Posted: Nov 17, 2006 6:32 AM   in response to: cd_2 in response to: cd_2
Click to report abuse...   Click to reply to this thread Reply
Dear Friend,

I need your help please.

I am using Oracle 8.

Quite a lot of forenames in our database have more than one name in the forename column.

I need SQL or Function which:

1) Detect when this is the case,
2) Can pick out a Single name - like first,second or third,
3) where you pass the forename and the number of the name - like
F_extract_name(forename,2)
4) I would need to return blanks where there isn't second or third or fourth name ...

Thanks

Thamil

APC

Posts: 10,439
Registered: 08/27/03
Re: Introduction to regular expressions ... continued.
Posted: Nov 17, 2006 6:42 AM   in response to: Thamil Kadavara... in response to: Thamil Kadavara...
Click to report abuse...   Click to reply to this thread Reply
Please raise a new thread for this question. Regex were only introduced in Oracle 10gr2 so your problem requires a different solution.

When you do post your question properly, remember to specify whether you are using Oracle 8.0 or Oracle 8i.

Regards, APC
user545394

Posts: 1
Registered: 11/27/06
Re: Introduction to regular expressions ... continued.
Posted: Dec 5, 2006 11:12 AM   in response to: cd_2 in response to: cd_2
Click to report abuse...   Click to reply to this thread Reply
cd,

lovely post.

Anyone any tips on how I can use this in with multiple rows?

I have a number of rows which have a field I need to split on a specific delimeter. I want to return a row (with the id field of the originating row) for each split in another field in the same row.

If I have a table with ID and String, for example and the rows are like:

ID STRING
item1 'abc,def,ghi,jkl'
item2 'mno,qrs'
...

What I'd like to end up with is the following written to a new table

Item1 abc
Item1 def
Item1 ghi
Item1 jkl
Item2 mno
Item2 qrs

Any assistance would be very much appreciated.

Cheers
Martin...

matelot

Posts: 35
Registered: 07/31/06
Remove /* this is comment teehee */
Posted: Jan 29, 2007 5:47 PM   in response to: user545394 in response to: user545394
Click to report abuse...   Click to reply to this thread Reply
I need to NULL out comments like "/* hello world test */" from dynam. SQL string
Can someone help ?

I got only this far:

SELECT regexp_replace('This is a /* test of */ regular expression test', '/*:print;*\*/', null) from dual ;

thanks

10gr2
BluShadow

Posts: 12,226
Registered: 09/21/05
Re: Remove /* this is comment teehee */
Posted: Jan 30, 2007 2:12 AM   in response to: matelot in response to: matelot
Click to report abuse...   Click to reply to this thread Reply
  1* SELECT regexp_replace('This is a /* test of */ regular expression test', '/\*[[:print:]]*\*/', null) from dual
SQL> / 
 
REGEXP_REPLACE('THISISA/*TESTOF*/R
----------------------------------
This is a  regular expression test
 
SQL>
Legend
Guru Guru : 2500 - 1000000 pts
Expert Expert : 1000 - 2499 pts
Pro Pro : 500 - 999 pts
Journeyman Journeyman : 200 - 499 pts
Newbie Newbie : 0 - 199 pts
Oracle ACE Director
Oracle ACE Member
Oracle Employee ACE
Helpful Answer (5 pts)
Correct Answer (10 pts)

Point your RSS reader here for a feed of the latest messages in all forums