Parsing HTML and Regular Expressions

807607, Oct 25 2006 (edited Oct 25 2006)
Hello, and thanks in advance for your help.

I'm trying to parse a web page to gather some data. My first approach was regular expressions: I started by trying to identify all of the tags in the document with the pattern "<.*>". Unfortunately, instead of a sequence of tags, I got the entire document back as a single match. It seems that rather than matching the shortest possible substring, the pattern matches the longest possible one. So if I have the following text:

<html><head><body><p>text</body></head></html>

Instead of getting back the list:

<html>, <head>, <body>, <p>, </body>, </head>, </html>

I just get the entire original string back.

Is there any way to configure regular expressions to return the shortest match instead of the longest?
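
For reference, here is a minimal sketch that reproduces the behaviour described above with java.util.regex and compares it with two common alternatives: the reluctant quantifier "<.*?>" and the negated character class "<[^>]*>". The class name TagFinder and the helper findMatches are made up for this illustration and are not part of any library. This is only a sketch for simple input like the sample string; it assumes tags never contain a ">" inside an attribute value or a comment, which real-world HTML does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class TagFinder {

    // Collects every non-overlapping match of regex in input, in order of appearance.
    static List<String> findMatches(String regex, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<html><head><body><p>text</body></head></html>";

        // Greedy: ".*" takes as much as it can, so the match runs from the
        // first '<' to the last '>' and the whole string comes back once.
        System.out.println(findMatches("<.*>", html));

        // Reluctant: ".*?" takes as little as it can, so each match stops
        // at the first '>' after its opening '<'.
        System.out.println(findMatches("<.*?>", html));

        // Negated character class: the tag body may not contain '>' at all,
        // which gives the same result here without any backtracking choice.
        System.out.println(findMatches("<[^>]*>", html));
    }
}
```

On the sample string, the first call prints the whole string as a single match, while the second and third each print the seven tags <html>, <head>, <body>, <p>, </body>, </head>, </html>. Note that "." does not match line terminators unless Pattern.DOTALL is set, so on a multi-line document the greedy and reluctant forms both stay within a single line, whereas the negated character class can cross line breaks.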

On another note, I may want to use an HTML parser instead. Any recommendations? I found a few using a Google search, but didn't see any "authoritative" open source solution, like something from the Apache foundation. What's out there that has good docs and is reliable?

Regards,
Anthony Frasso
Post Details
Locked on Nov 22 2006
Added on Oct 25 2006