Java SE (Java Platform, Standard Edition)

Best way to parse HTML to retrieve the displayed text?

843810, Mar 13 2007 (edited Mar 19 2007)
Hi Guys,

I am currently working on an application that will check web pages for spelling and grammatical errors.

I am using the JDIC web browser to display the page. The user then clicks Check, and the page is parsed and checked against a dictionary etc.

Then misspelt words are highlighted, in a similar way to the word highlighting on Google's cached pages.

Currently I am having to add a lot of code to ignore all the irrelevant tags (script, head, table, etc.) and only look at text in the body of the page. However, this is not working as well as I had hoped.

Is there any way to retrieve only the text that is displayed on the page, without having to go through the source character by character? I presume there are third-party parsers around, but I would prefer to have a bit more control by doing it myself. I just wondered whether I am doing it in a really long-winded way!

Thanks,

edd
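
One possible way to get at the displayed text without scanning the source character by character, and without a third-party library, is the HTML parser that ships with Swing (`javax.swing.text.html.parser.ParserDelegator`). The sketch below is only an illustration under some assumptions: the class name `VisibleText` and the method `extract` are made up for this example, the page source is assumed to be available as a `String`, and the set of skipped tags (script, style, title) is a guess at what counts as "not displayed". It also puts a space after every text chunk, so a word split by inline markup would come out as two words.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text a browser would roughly display.
final class VisibleText {

    static String extract(String html) {
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Greater than zero while inside a tag whose text is not shown.
            private int skipDepth = 0;

            private boolean isHidden(HTML.Tag tag) {
                return tag == HTML.Tag.SCRIPT
                        || tag == HTML.Tag.STYLE
                        || tag == HTML.Tag.TITLE;
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (isHidden(tag)) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (isHidden(tag) && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (skipDepth == 0) {
                    // Separate chunks with a space; collapsed again below.
                    out.append(data).append(' ');
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse runs of whitespace so the result is one clean line of words.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling `VisibleText.extract(pageSource)` returns the collected text as a single string, which could then be split into words for the dictionary check. The `pos` argument passed to `handleText` is the chunk's position as reported by the parser, so recording it alongside each chunk might help with highlighting later on.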
Post Details
Locked on Apr 16 2007
Added on Mar 13 2007