I need a regular expression to extract words from an HTML document (or any document made up of tags).
For example, given the following HTML:
<div class="test">this is some text</div>
<b>the sky is blue</b>
<script type="text/javascript">
make ad_registerSpace(790600,234,60);
</script>
<a class="text"> this is a link</a>
I need to extract:
this is some text
the sky is blue
this is a link
However, I do not want to extract the contents of the <script> tag.
I have been looking through tutorials and have come up with this regex:
"<[^script.*> | .* ]>(.*?)<"
However, for some reason this only extracts the words inside bold tags: <b>
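The behaviour can be reproduced with a short test. This is only a sketch in Python (the language is an assumption, since none is stated here); the pattern is copied verbatim from above and the variable names are made up for illustration. In Python's re syntax the bracketed part is read as a negated character class that matches exactly one character, which would explain why only a single-letter tag such as <b> gets through.

```python
import re

# Sample document from the question.
html = '''<div class="test">this is some text</div>
<b>the sky is blue</b>
<script type="text/javascript">
make ad_registerSpace(790600,234,60);
</script>
<a class="text"> this is a link</a>'''

# The pattern exactly as written in the question.
# [^script.*> | .* ] is ONE character that is not any of:
# s c r i p t . * > | or a space.
pattern = re.compile(r"<[^script.*> | .* ]>(.*?)<")

# Only tags of the form <X>, with X a single allowed character, can match.
matches = pattern.findall(html)
print(matches)
```

Tags with attributes (<div class="test">, <a class="text">) and closing tags never match, because the single character after "<" is not immediately followed by ">".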
Furthermore, is there a way to tell the regex to ignore comments such as: <!-- Comment --> ?
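To make the requirement concrete, here is a rough sketch of one possible approach, again in Python as an assumption: strip the <script> blocks and comments first, then collect the text between tags. The function name extract_text and the sample string are hypothetical, and this is not a robust way to handle arbitrary HTML (a real parser is usually the safer choice); it only illustrates the output being asked for.

```python
import re

def extract_text(html):
    """Return the text found between tags, skipping scripts and comments."""
    # Remove <script>...</script> blocks together with their contents.
    cleaned = re.sub(r"<script\b.*?</script\s*>", "", html,
                     flags=re.IGNORECASE | re.DOTALL)
    # Remove <!-- ... --> comments, which may span several lines.
    cleaned = re.sub(r"<!--.*?-->", "", cleaned, flags=re.DOTALL)
    # Capture each run of non-tag characters between '>' and the next '<'.
    pieces = re.findall(r">([^<>]+)(?=<)", cleaned)
    # Drop runs that are only whitespace (for example newlines between tags).
    return [p.strip() for p in pieces if p.strip()]

# Hypothetical sample: the document from the question plus one comment.
sample = '''<div class="test">this is some text</div>
<b>the sky is blue</b>
<!-- Comment -->
<script type="text/javascript">
make ad_registerSpace(790600,234,60);
</script>
<a class="text"> this is a link</a>'''

for line in extract_text(sample):
    print(line)
```

Note that this sketch ignores text that appears before the first tag or after the last one, and it would be confused by a literal "<" or ">" inside an attribute value.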
Any help with this would be appreciated!
Thanks