Best way to parse HTML to retrieve the displayed text?
843810, Mar 13 2007 (edited Mar 19 2007)
Hi Guys,
I am currently working on an application that will check web pages for spelling and grammatical errors.
I am using the JDIC web browser to display the page. The user then clicks "check", and the page is parsed and checked against a dictionary etc.
Misspelt words are then highlighted, in a similar way to the word highlighting on Google's cached pages.
Currently I am having to add a lot of code to ignore all the irrelevant tags (script, head, table etc.) and only look at text in the body of the page. This is not working as well as I had hoped, however.
Is there any way to retrieve only the text that is displayed on the page, without having to go through the source character by character? I presume there are third-party parsers around, but I would prefer to have a bit more control by doing it myself. I just wondered whether I am doing it in a really long-winded way!
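To make the question concrete, here is a minimal sketch of the kind of approach I have in mind, assuming the HTML parser that ships with the JDK (javax.swing.text.html) is acceptable. The class name VisibleTextExtractor and the method extract are made up for illustration. It only approximates "displayed" text: it keeps text inside the body, skips script and style content, and knows nothing about CSS that hides elements.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects body text using the JDK's built-in HTML parser.
final class VisibleTextExtractor {

    static String extract(String html) throws IOException {
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inBody = false; // true once <body> has been opened
            private int skipDepth = 0;      // > 0 while inside <script> or <style>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.BODY) {
                    inBody = true;
                } else if (tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.BODY) {
                    inBody = false;
                } else if ((tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE) && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Keep only text that sits in the body and outside script/style.
                if (inBody && skipDepth == 0) {
                    if (out.length() > 0) {
                        // Simplification: a space between chunks also splits a word
                        // that is broken up by inline tags, e.g. wor<b>ld</b>.
                        out.append(' ');
                    }
                    out.append(data);
                }
            }
        };

        // Third argument true: ignore charset declarations found in the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString();
    }
}
```

The pos argument passed to handleText is the offset of that text in the source, so it might be usable later for mapping a misspelt word back to its position when highlighting. I have not verified how well that holds up on messy real-world pages.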
thanks,
edd