Regex gurus: Intermittent stack overflow from Matcher?

Posted by 807580 on Dec 1 2009 (edited Dec 2 2009)
I've been using the following code to extract links from <a> tags in HTML. For some pages it throws a StackOverflowError from inside the pattern matcher classes. Is there a better way to do this, or did I make a mistake that causes it? Any help or advice would be greatly appreciated.
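    // Matches an opening <a ...> tag; (.|\n) is meant as "any character, including newlines".
    private static final String matchATags    = "(?i)<a(.|\\n)*?>";
    // Matches everything from the start of the tag through HREF=" (removed to expose the URL).
    private static final String matchHREFPre  = "(?i)\\A(.|\n)*HREF\\s*=\\s*\"";
    // Matches the closing quote of the URL and everything after it.
    private static final String matchHREFPost = "\"(.|\n)*$";
    private static final Pattern tagPattern = Pattern.compile(matchATags);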
    protected List<String> getLinkText(String text) {
        if (text == null)
            throw new NullPointerException("null text");

        Matcher matcher = tagPattern.matcher(text);
        ArrayList<String> linkList = new ArrayList<String>();
        while (matcher.find()){
            String link = text.substring(matcher.start(), matcher.end());
            link = link.replaceAll(matchHREFPre, "");
            link = link.replaceAll(matchHREFPost, "");
            linkList.add(link);
        }
        return linkList;
    }
Here is a trace of the exception:
Parsing error scanning site http://www.amazon.com
Error parsing http://www.amazon.com/exec/obidos/ASIN/0201752808/xeo
Caused by java.lang.StackOverflowError
java.lang.StackOverflowError
        at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
        at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
        at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
        at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
        at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
      ...
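
A likely cause, judging from the repeating Branch/GroupHead/LazyLoop frames in the trace: java.util.regex matches a repeated group such as (.|\n)*? recursively, adding several stack frames for every character the group consumes, so a long enough tag (or a stray "<a" with no nearby ">") can exhaust the stack. The sketch below is one possible rewrite, not the code from the post. It assumes href values are double-quoted, and the names LinkExtractor and HREF_PATTERN are made up for illustration. It swaps the repeated group for negated character classes, which the engine repeats in a loop, and captures the href value directly instead of trimming the tag with two replaceAll passes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original post.
final class LinkExtractor {

    // Finds an opening <a ...> tag and captures the double-quoted href value.
    // [^>] and [^"] already match newlines, so no (.|\n) group is needed, and
    // a character class repeated directly is matched in a loop rather than
    // with recursive calls per character.
    // Assumption: href values are always wrapped in double quotes.
    private static final Pattern HREF_PATTERN = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    static List<String> getLinkText(String text) {
        if (text == null) {
            throw new NullPointerException("null text");
        }
        List<String> linkList = new ArrayList<String>();
        Matcher matcher = HREF_PATTERN.matcher(text);
        while (matcher.find()) {
            // Group 1 is the text between the quotes after href=
            linkList.add(matcher.group(1));
        }
        return linkList;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> "
                + "<A\nclass=\"x\" HREF = \"b.html\">B</a>";
        System.out.println(getLinkText(html));
        // prints [http://example.com/a, b.html]
    }
}
```

Note that this behaves a little differently from the posted code: anchors without a double-quoted href are skipped rather than returned as raw tag text, and the \b after "<a" keeps tags such as <abbr> from matching. For arbitrary real-world pages, an HTML parser is generally more robust than any regex.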
Locked on Dec 30 2009