Java Programming

Regex to remove undesired HTML tags from an HTML page

807589, Sep 30 2008 (edited Oct 5 2008)
I'm looking for a regex that removes ALL HTML tags except a few that I'd like to keep in a list, such as (P|H1|LI|<rest of list>). For any tag NOT in the list, the regex should remove the whole < -tag stuff- >. Tags in the list may contain blanks, for example < P> or <LI >, and the -tag stuff- can be any legal HTML, including blanks.

What I came up with works for listed tags that do NOT contain spaces: <[^(P|H1|LI)].*?>. It correctly leaves <P> alone, but it fails to recognize < P> (with a space) and removes it.
I tried adding \\s* to skip optional leading spaces, but that didn't work.
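
One likely reason the attempt misbehaves: [^(P|H1|LI)] is a character class, not an alternation. It matches any single character other than (, P, |, H, 1, L, I or ). The blank in < P> satisfies it, so the tag is removed, and a leading \\s* can simply match zero blanks and leave the same result. A negative lookahead is one way to express "not one of these tag names". The following is only a sketch under some assumptions: the class and method names are hypothetical, the list is just the three tags named above, closing tags such as </P> are assumed to be kept as well, and matching is assumed to be case-insensitive. It does not handle comments, scripts, attribute values containing >, or a stray < in text.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class TagStripper {
    // Matches a tag whose name is NOT P, H1 or LI.
    // (?! ... ) is a negative lookahead: after '<', optional blanks, an optional '/'
    // and more optional blanks, the tag name must not be one of the listed names.
    // \b stops P from also protecting PRE, PARAM and similar names.
    private static final Pattern UNWANTED_TAG = Pattern.compile(
            "<(?!\\s*/?\\s*(?:P|H1|LI)\\b)[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the input with every tag outside the list removed.
    static String stripUnlistedTags(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div>< P>Hello <b>world</b></P><LI >item</LI><pre>x</pre></div>";
        System.out.println(stripUnlistedTags(html));
        // prints: < P>Hello world</P><LI >item</LI>x
    }
}
```

To extend the list, add more names inside the (?:P|H1|LI) group. For arbitrary real-world pages an HTML parser is usually a more robust choice than a regex.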
Locked on Nov 2 2008
Added on Sep 30 2008