Java Programming

Regex to extract text from html document

807605, Sep 5 2007 (edited Sep 5 2007)

I need a regular expression to extract the text from an HTML document (or any document that is made up of tags).

For example, given the following HTML:

<div class="test">this is some text</div>
<b>the sky is blue</b>
<script type="text/javascript">
make ad_registerSpace(790600,234,60);
</script>
<a class="text"> this is a link</a>

I need to extract:
this is some text
the sky is blue
this is a link

However, I do not want to extract the contents of the <script> tag.

I have been looking through tutorials and have come up with this regex:

"<[^script.*> | .* ]>(.*?)<"

However, for some reason this only extracts the text inside bold tags (<b>).
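
A likely explanation, offered as a sketch rather than a definitive diagnosis: in java.util.regex the square brackets form a character class, so [^script.*> | .* ] matches exactly one character that is not one of the listed ones, rather than "any tag except script". The pattern can therefore only match tags whose whole body is a single character outside that set, which holds for <b> but not for <div class="test"> or <a class="text">. The class name CharClassDemo and the helper captures below are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // The pattern from the post, unchanged. The [...] part is a negated
    // character class: it consumes ONE character that is not any of
    // s, c, r, i, p, t, '.', '*', '>', '|' or a space.
    static final Pattern ORIGINAL =
            Pattern.compile("<[^script.*> | .* ]>(.*?)<");

    // Collects capture group 1 of every match (hypothetical helper).
    static List<String> captures(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div class=\"test\">this is some text</div>\n"
                + "<b>the sky is blue</b>\n"
                + "<a class=\"text\"> this is a link</a>";
        // Only <b> has a one-character body, so only its text is captured.
        System.out.println(captures(html)); // prints [the sky is blue]
    }
}
```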

Furthermore, is there a way to tell the regex to ignore HTML comments (the <!-- ... --> form)?
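
One possible way to meet the requirements above (text between tags, no script contents, no comments) is to split the input on everything that should be discarded instead of capturing what lies between tags. The following is only a minimal sketch under stated assumptions: the input is small enough to hold in one String, tags never contain a literal > inside an attribute value, and entities such as &amp; do not need decoding. The names TagTextExtractor and extractText are hypothetical, and the comment in the sample input is made up because the original example did not survive. A regex is not a real HTML parser, so this would not be robust against arbitrary markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TagTextExtractor {
    // Alternatives are tried left to right at each position:
    //   1. a whole comment            <!-- ... -->
    //   2. a whole script element     <script ...> ... </script>
    //   3. any other single tag       <...>
    // (?i) = case-insensitive, (?s) = '.' also matches line breaks.
    private static final Pattern DISCARD = Pattern.compile(
            "(?is)<!--.*?-->|<script\\b.*?</script\\s*>|<[^>]*>");

    // Returns the non-blank text runs left after removing DISCARD matches.
    static List<String> extractText(String html) {
        List<String> out = new ArrayList<>();
        for (String piece : DISCARD.split(html)) {
            String text = piece.trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div class=\"test\">this is some text</div>\n"
                + "<b>the sky is blue</b>\n"
                + "<script type=\"text/javascript\">\n"
                + "make ad_registerSpace(790600,234,60);\n"
                + "</script>\n"
                + "<!-- hypothetical comment, not from the post -->\n"
                + "<a class=\"text\"> this is a link</a>";
        for (String line : extractText(html)) {
            System.out.println(line);
        }
    }
}
```

With the sample input this prints the three wanted lines and nothing from the script block or the comment. Note that a comment in the middle of a sentence would split that sentence into two entries, which may or may not be what is wanted.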

Any help with this would be appreciated!

Thanks

Locked on Oct 3 2007
