Java Programming

Regex to extract text from html document

807605, Sep 5 2007 (edited Sep 5 2007)
I need a regular expression to extract words from an HTML document (or any document that contains tags).

For example, given the following HTML:
<div class="test">this is some text</div>
<b>the sky is blue</b>
<script type="text/javascript">
make ad_registerSpace(790600,234,60);
</script>
<a class="text"> this is a link</a>
I need to extract:
this is some text
the sky is blue
this is a link

However I do not want to extract the contents of the <script> tag.

I have been looking through tutorials and have come up with this regex:

"<[^script.*> | .* ]>(.*?)<"

However, for some reason this only extracts words in bold tags: <b>
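
A likely explanation, sketched under the assumption that the pattern is compiled with java.util.regex exactly as posted: the square brackets form a negated character class, which matches exactly one character that is not s, c, r, i, p, t, '.', '*', '>', '|' or a space. The pattern therefore only accepts tags whose name is a single character outside that set, such as <b>. The class name CharClassDemo and the helper captures are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {
    // The pattern exactly as posted above. [^ ... ] is ONE negated character
    // class, so it consumes a single character, not a whole tag name.
    static final Pattern POSTED = Pattern.compile("<[^script.*> | .* ]>(.*?)<");

    // Hypothetical helper: collects capture group 1 of every match.
    static List<String> captures(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = POSTED.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div class=\"test\">this is some text</div>\n"
                + "<b>the sky is blue</b>\n"
                + "<a class=\"text\"> this is a link</a>";
        // Only the <b> element is matched: "div" and "a class=..." are longer
        // than one character, so '<' + one char + '>' never lines up for them.
        System.out.println(captures(html));
    }
}
```

By the same reasoning, a single-letter tag whose letter is in the excluded set (for example <i>, since i appears in "script") would not match either.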

Furthermore, is there a way to tell the regex to ignore comments such as <!-- Comment -->?

Any help with this would be appreciated!
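
One possible approach, offered as a minimal sketch and not a definitive answer: instead of a single regex, remove comments first, then remove whole script elements, then turn every remaining tag into a line break and keep the non-empty pieces. The class name TagTextSketch and the method extractText are hypothetical. The sketch assumes reasonably well-formed input; attribute values containing '>' or unusual markup would break it, and a real HTML parser is generally more robust for that kind of input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TagTextSketch {
    // (?s) lets '.' match newlines; (?i) makes the tag name case-insensitive.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b.*?</script\\s*>");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: returns the text runs found between tags.
    static List<String> extractText(String html) {
        // 1. Drop comments, 2. drop script elements including their contents.
        String s = COMMENT.matcher(html).replaceAll("");
        s = SCRIPT.matcher(s).replaceAll("");
        // 3. Replace each remaining tag with a newline so text runs stay separate.
        s = TAG.matcher(s).replaceAll("\n");
        List<String> out = new ArrayList<>();
        for (String piece : s.split("\n")) {
            String text = piece.trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div class=\"test\">this is some text</div>\n"
                + "<b>the sky is blue</b>\n"
                + "<script type=\"text/javascript\">\n"
                + "make ad_registerSpace(790600,234,60);\n"
                + "</script>\n"
                + "<a class=\"text\"> this is a link</a>\n"
                + "<!-- Comment -->";
        for (String line : extractText(html)) {
            System.out.println(line);
        }
    }
}
```

Doing the removal in separate passes keeps each pattern simple and avoids trying to express "any tag except script" inside one expression.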

Thanks
Locked on Oct 3 2007
Added on Sep 5 2007