Regular Expressions for converting HTML to Structured Plain Text

Jritschel-Oracle, Oct 26 2010 (edited Oct 26 2010)
I'm writing a PL/SQL function that will convert HTML to plain text while still preserving some of the formatting and line breaks. One of my challenges is writing a regular expression that captures the text blocks while ignoring the markup. I'm trying to write an expression that will grab all of the text between a pair of start/end tags but discard the tags themselves. For example, to find all of the text between the start and end tags of a paragraph, I want to do something like:

REGEXP_REPLACE('<p style="text-align:center;">This is the body of the paragraph</p>', '<p.*>(.*)</p>', '\1||v_crlf' )

where \1 returns the contents of the paragraph and v_crlf (declared earlier in the function) inserts a line break. I know there are more general expressions that will remove all tags, but I want to specifically identify the tags so I can process them appropriately. This way I can easily convert HTML to plain text for email and reporting without having to keep two versions around. Any help would be greatly appreciated. Once I get this worked out, I will repost with the function code for others to use. Thanks.
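
For reference, here is a minimal sketch of how the paragraph case might be written. It is not a tested solution. It assumes Oracle 10g Release 2 or later (for the non-greedy `.*?` quantifier) and that `<p>` elements are not nested. The local `v_crlf` stands in for the variable declared earlier in the function, and `v_html` and `v_text` are names made up for the example. Two things differ from the call above: the concatenation with `v_crlf` sits outside the string literal, so it is evaluated rather than inserted as literal text, and the greedy `<p.*>` is replaced by a pattern that stops at the first `>` and does not match tags such as `<pre>`.

```sql
-- Sketch only: assumes Oracle 10g Release 2 or later and non-nested <p> elements.
DECLARE
  -- Stand-in for the v_crlf declared earlier in the real function.
  v_crlf CONSTANT VARCHAR2(2) := CHR(13) || CHR(10);
  -- Hypothetical sample input and result holder.
  v_html VARCHAR2(4000) :=
    '<p style="text-align:center;">This is the body of the paragraph</p>';
  v_text VARCHAR2(4000);
BEGIN
  v_text := REGEXP_REPLACE(
              v_html,
              -- Group 1: the rest of the opening tag; group 2: the paragraph body.
              '<p(>|[[:space:]].*?>)(.*?)</p>',
              '\2' || v_crlf,  -- keep the body, append a line break
              1,               -- start at the first character
              0,               -- replace every occurrence
              'in');           -- i: case-insensitive, n: dot also matches newlines
  DBMS_OUTPUT.PUT_LINE(v_text);
END;
/
```

For the sample input this should leave just the paragraph text followed by the line break. Every quantifier in the pattern is kept non-greedy on purpose, since mixing greedy and non-greedy quantifiers in one expression can be hard to reason about. Attribute values that contain a literal `>` would still confuse it.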

Edited by: jritschel on Oct 26, 2010 9:58 AM
Post Details
Locked on Nov 23 2010
Added on Oct 26 2010