I've been using the following code to extract links from <a> tags in HTML. For some pages it throws a StackOverflowError from inside the java.util.regex matcher classes. Is there a better way to do this, or did I make a mistake that causes the overflow? Any help or advice would be greatly appreciated.
    private static final String matchATags = "(?i)<a(.|\\n)*?>";
    private static final String matchHREFPre = "(?i)\\A(.|\n)*HREF\\s*=\\s*\"";
    private static final String matchHREFPost = "\"(.|\n)*$";
    private static final Pattern tagPattern = Pattern.compile(matchATags);

    protected List<String> getLinkText(String text) {
        if (text == null)
            throw new NullPointerException("null text");

        Matcher matcher = tagPattern.matcher(text);
        ArrayList<String> linkList = new ArrayList<String>();

        while (matcher.find()) {
            String link = text.substring(matcher.start(), matcher.end());
            link = link.replaceAll(matchHREFPre, "");
            link = link.replaceAll(matchHREFPost, "");
            linkList.add(link);
        }
        return linkList;
    }
Here is a trace of the exception:
    Parsing error scanning site http://www.amazon.com
    Error parsing http://www.amazon.com/exec/obidos/ASIN/0201752808/xeo
    Caused by java.lang.StackOverflowError
    java.lang.StackOverflowError
        at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
        at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
        at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
        at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
        at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
        ...
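The repeating Branch / GroupHead / LazyLoop frames in the trace suggest the matcher recurses once per character consumed by the `(.|\n)*?` group. For comparison, here is a minimal sketch of an alternative that drops that group. It assumes a plain character class (`[^>]*`) is good enough for the tag body and that the href value is always double-quoted, as in the original patterns. The class name `LinkExtractor`, the method `getLinks`, and the pattern itself are hypothetical names for illustration, not part of the code above. As far as I understand, java.util.regex repeats a simple character class in a loop rather than recursively, so this should not grow the stack with the length of the tag, but I have not verified that on every JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Hypothetical pattern: an <a ...> tag whose attributes contain
    // href = "value"; group 1 captures the value between the double quotes.
    // [^>]*? and [^"]* are character classes, so they also match newlines
    // without needing an alternation such as (.|\n).
    private static final Pattern HREF_PATTERN = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    public static List<String> getLinks(String text) {
        if (text == null)
            throw new NullPointerException("null text");

        List<String> links = new ArrayList<String>();
        Matcher matcher = HREF_PATTERN.matcher(text);
        while (matcher.find()) {
            // The capture group replaces the two replaceAll() calls.
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a class=\"x\"\n HREF = \"http://example.com/a\">A</a>"
                + " <A href=\"/b\">B</A>";
        System.out.println(getLinks(html));
    }
}
```

This sketch still ignores single-quoted and unquoted href values and would also match attributes such as `data-href`, so an actual HTML parser is probably the more robust route if one is available.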