Java Programming


Parsing HTML and Regular Expressions

807607, Oct 25 2006 (edited Oct 25 2006)

Hello, and thanks in advance for your help.

I'm trying to parse a web page to gather some data, and I started out using regular expressions. My first step was to identify all of the tags in the document with the pattern "<.*>". Unfortunately, instead of a sequence of tags, I got the entire document back as a single match. It seems that the pattern matches the longest possible substring rather than the shortest. So if I have the following text:

<html><head><body><p>text</body></head></html>

Instead of getting back the list:

<html>, <head>, <body>, <p>, </body>, </head>, </html>

I just get the entire original string back.

Is there any way to configure regular expressions to return the shortest match instead of the longest?
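
For reference, here is a minimal sketch of the behavior described above, assuming the standard java.util.regex package. The class name TagMatchDemo and the helper findAll are illustrative names only, not anything from my actual code. It runs the greedy pattern "<.*>" next to two possible alternatives: the reluctant form "<.*?>" and the negated character class "<[^>]*>". Neither alternative is a real HTML parser; both would likely trip over comments, scripts, or attribute values that contain ">".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagMatchDemo {

    // Hypothetical helper: collects every non-overlapping match of regex
    // in input, in the order Matcher.find() reports them.
    public static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<html><head><body><p>text</body></head></html>";

        // Greedy quantifier: ".*" consumes as much as it can, so the match
        // runs from the first '<' to the last '>' on the line.
        System.out.println(findAll("<.*>", html));

        // Reluctant quantifier: ".*?" consumes as little as it can, so each
        // match stops at the first '>' after its '<'.
        System.out.println(findAll("<.*?>", html));

        // Negated character class: no '>' is allowed inside a tag, which
        // gives the same matches on this input.
        System.out.println(findAll("<[^>]*>", html));
    }
}
```

On the sample string, the greedy pattern yields a single match covering the whole input, while the other two yield the seven tags listed above. Note that "." does not match line terminators by default, so on a multi-line document the greedy pattern would match per line rather than across the whole file unless Pattern.DOTALL is set.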

On another note, I may want to use an HTML parser instead. Any recommendations? I found a few using a Google search, but didn't see any "authoritative" open source solution, like something from the Apache foundation. What's out there that has good docs and is reliable?

Regards,
Anthony Frasso


Locked on Nov 22 2006

Added on Oct 25 2006
