regular expression trouble while parsing CSS

807580Jun 9 2010 — edited Jun 10 2010

Hey forum,

This is probably a simple question for the experts, but it has me stumped. I want to parse CSS files and retrieve all the URLs they link to. My stress test file looks like this:

@import url(one.css);

@import url('two.css');
@import url("three.css");
@import 'four.css';
@import "five.css";

body
{
  background-color: black;
  color: white;
}

h1
{
  background-image: url(six.jpg); color: black; doesnotexist: url( "seven.css" ); 
  text: "this is junk url(jadajada)"
}

@import 'eight.css';

This is not an entirely legal CSS file, but I thought it would be a reasonable approximation of what I can expect to encounter in the wild.
My parsing code looks like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.*;

public final class CSSParserTwo 
{
  private static final String REGEX1 = "url\\( *'(.*)' *\\)";
  private static final String REGEX2 = "url\\( *\"(.*)\" *\\)";
  private static final String REGEX3 = "url\\((.*)\\)";
  private static final String REGEX4 = "@import '(.*)'";
  private static final String REGEX5 = "@import \"(.*)\"";

  public static void main (String[] aArguments)
  {
    try
    {
      BufferedReader in = new BufferedReader(new FileReader("testcss.css"));
      String line = null;
      while((line = in.readLine()) != null)
      {
        matchParts(line, REGEX1);
        matchParts(line, REGEX2);
        matchParts(line, REGEX3);
        matchParts(line, REGEX4);
        matchParts(line, REGEX5);
      }
      in.close();
    } catch (Exception e)
    { e.printStackTrace(); } 
  }

  private static void matchParts( String aText , String regex ){
    Pattern pattern = Pattern.compile( regex );
    Matcher matcher = pattern.matcher( aText );
    while ( matcher.find() ) {
      System.out.println("regex: " + regex);
      System.out.println("line: " + aText);
      System.out.println("URL: " + matcher.group(1) + "\n");
    }
  }
}

I get the following output:

regex: url\((.*)\)
line: @import url(one.css);
URL: one.css

regex: url\( *'(.*)' *\)
line: @import url('two.css');
URL: two.css

regex: url\((.*)\)
line: @import url('two.css');
URL: 'two.css'

regex: url\( *"(.*)" *\)
line: @import url("three.css");
URL: three.css

regex: url\((.*)\)
line: @import url("three.css");
URL: "three.css"

regex: @import '(.*)'
line: @import 'four.css';
URL: four.css

regex: @import "(.*)"
line: @import "five.css";
URL: five.css

regex: url\( *"(.*)" *\)
line:   background-image: url(six.jpg); color: black; doesnotexist: url( "seven.css" ); 
URL: seven.css

regex: url\((.*)\)
line:   background-image: url(six.jpg); color: black; doesnotexist: url( "seven.css" ); 
URL: six.jpg); color: black; doesnotexist: url( "seven.css" 

regex: url\((.*)\)
line:   text: "this is junk url(jadajada)"
URL: jadajada

regex: @import '(.*)'
line: @import 'eight.css';
URL: eight.css

I don't mind matching jadajada, but not getting six.jpg is a problem, and I get the feeling that I match seven.css completely by accident. How do I improve the code so that it extracts all the URLs?
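
For what it's worth, the six.jpg miss looks like the greedy `.*` in REGEX3 running on to the last `)` on the line instead of stopping at the first one. Below is a minimal sketch of one possible alternative, assuming a single combined pattern that excludes quotes and `)` from the captured text and uses a back-reference for the optional quote. The names `CssUrlSketch`, `URL_PATTERN` and `extractUrls` are made up for illustration and are not part of the code above. It is not a real CSS parser, and it would still match jadajada inside the quoted string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssUrlSketch {
  // Hypothetical combined pattern (an assumption, not from the original code).
  // Alternative 1: url( ... ) with an optional single or double quote (group 1),
  //                the URL itself in group 2, and the same quote again via \1.
  // Alternative 2: @import followed by a quoted string, URL in group 4.
  // The character classes exclude quotes and ')' so a match cannot run past
  // the end of the current url(...) the way a greedy .* does.
  private static final Pattern URL_PATTERN = Pattern.compile(
      "url\\(\\s*(['\"]?)([^'\")]*?)\\1\\s*\\)|@import\\s+(['\"])([^'\"]*)\\3");

  // Returns every URL found on one line, in order of appearance.
  static List<String> extractUrls(String line) {
    List<String> urls = new ArrayList<String>();
    Matcher m = URL_PATTERN.matcher(line);
    while (m.find()) {
      // group(2) is null when the @import alternative matched instead.
      urls.add(m.group(2) != null ? m.group(2) : m.group(4));
    }
    return urls;
  }

  public static void main(String[] args) {
    String line = "  background-image: url(six.jpg); color: black; "
        + "doesnotexist: url( \"seven.css\" ); ";
    System.out.println(extractUrls(line)); // prints [six.jpg, seven.css]
  }
}
```

With this approach each line would be passed once to `extractUrls` instead of once per regex, which should also avoid the duplicate hits for two.css and three.css.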

Thanks

Locked on Jul 8 2010

Added on Jun 9 2010
