Parsing HTML and Regular Expressions
807607, Oct 25 2006 (edited Oct 25 2006)
Hello, and thanks in advance for your help.
I'm trying to parse a web page in order to gather some data. I originally tried to do this with regular expressions, starting by identifying all of the tags in the document with the pattern "<.*>". Unfortunately, instead of getting a sequence of tags, I just got the entire document back. It seems that instead of matching the shortest possible substring, the pattern matches the longest possible one. So if I have the following text:
<html><head><body><p>text</body></head></html>
Instead of getting back the list:
<html>, <head>, <body>, <p>, </body>, </head>, </html>
I just get the entire original string back.
Is there any way to configure regular expressions to return the shortest match instead of the longest?
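To make the question concrete, here is a minimal sketch, assuming Java's `java.util.regex` package. The class name `TagMatchDemo` and the helper `findAll` are made up for illustration. It reproduces the greedy behavior described above and, for comparison, tries a reluctant quantifier (`<.*?>`) and a negated character class (`<[^>]*>`), both of which appear to return the shorter matches on this input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatchDemo {
    // Hypothetical helper: collects every match of `regex` found in `input`.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<html><head><body><p>text</body></head></html>";

        // Greedy: ".*" grabs as much as it can, so there is one match
        // spanning the first '<' to the last '>'.
        System.out.println(findAll("<.*>", html));
        // prints [<html><head><body><p>text</body></head></html>]

        // Reluctant: ".*?" stops at the first '>' that completes a match.
        System.out.println(findAll("<.*?>", html));
        // prints [<html>, <head>, <body>, <p>, </body>, </head>, </html>]

        // Negated class: never crosses a '>', so each match is a single tag.
        System.out.println(findAll("<[^>]*>", html));
        // prints [<html>, <head>, <body>, <p>, </body>, </head>, </html>]
    }
}
```

This only covers the simple sample above. Real pages with comments, scripts, or a '>' inside an attribute value would likely trip up either pattern.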
On another note, I may want to use an HTML parser instead. Any recommendations? I found a few through a Google search, but didn't see any "authoritative" open source solution, such as something from the Apache Software Foundation. What's out there that is reliable and has good docs?
Regards,
Anthony Frasso