Hey forum,
This is probably a simple question for the experts, but it has me kinda stumped. I want to parse CSS files to retrieve all the URLs they link to. My stress test file looks like this:
@import url(one.css);
@import url('two.css');
@import url("three.css");
@import 'four.css';
@import "five.css";
body
{
background-color: black;
color: white;
}
h1
{
background-image: url(six.jpg); color: black; doesnotexist: url( "seven.css" );
text: "this is junk url(jadajada)"
}
@import 'eight.css';
This is not entirely legal CSS, but I thought it would be a reasonable approximation of what I can expect to encounter in the wild.
My parsing code looks like this:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.*;

public final class CSSParserTwo
{
    // url(...) with single quotes, with double quotes, and without quotes
    private static final String REGEX1 = "url\\( *'(.*)' *\\)";
    private static final String REGEX2 = "url\\( *\"(.*)\" *\\)";
    private static final String REGEX3 = "url\\((.*)\\)";
    // @import followed directly by a quoted string
    private static final String REGEX4 = "@import '(.*)'";
    private static final String REGEX5 = "@import \"(.*)\"";

    public static void main (String[] aArguments)
    {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new FileReader("testcss.css")))
        {
            String line = null;
            while((line = in.readLine()) != null)
            {
                // try every pattern against every line
                matchParts(line, REGEX1);
                matchParts(line, REGEX2);
                matchParts(line, REGEX3);
                matchParts(line, REGEX4);
                matchParts(line, REGEX5);
            }
        } catch (Exception e)
        { e.printStackTrace(); }
    }

    // print every match of regex in aText, with group 1 as the URL
    private static void matchParts( String aText , String regex ){
        Pattern pattern = Pattern.compile( regex );
        Matcher matcher = pattern.matcher( aText );
        while ( matcher.find() ) {
            System.out.println("regex: " + regex);
            System.out.println("line: " + aText);
            System.out.println("URL: " + matcher.group(1) + "\n");
        }
    }
}
I get the following output:
regex: url\((.*)\)
line: @import url(one.css);
URL: one.css

regex: url\( *'(.*)' *\)
line: @import url('two.css');
URL: two.css

regex: url\((.*)\)
line: @import url('two.css');
URL: 'two.css'

regex: url\( *"(.*)" *\)
line: @import url("three.css");
URL: three.css

regex: url\((.*)\)
line: @import url("three.css");
URL: "three.css"

regex: @import '(.*)'
line: @import 'four.css';
URL: four.css

regex: @import "(.*)"
line: @import "five.css";
URL: five.css

regex: url\( *"(.*)" *\)
line: background-image: url(six.jpg); color: black; doesnotexist: url( "seven.css" );
URL: seven.css

regex: url\((.*)\)
line: background-image: url(six.jpg); color: black; doesnotexist: url( "seven.css" );
URL: six.jpg); color: black; doesnotexist: url( "seven.css"

regex: url\((.*)\)
line: text: "this is junk url(jadajada)"
URL: jadajada

regex: @import '(.*)'
line: @import 'eight.css';
URL: eight.css
I don't mind matching jadajada, but not getting six.jpg is a problem: it only shows up buried inside one long bogus match. I also get the feeling that I only get seven.css by accident. How do I improve the code so that I extract all the URLs?
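For what it's worth, my guess is that the greedy .* is the culprit, since it runs on to the last closing parenthesis on the line. Here is a rough sketch of the direction I'm considering: one combined pattern that uses negated character classes, so a match cannot run past a quote or a closing parenthesis. The names UrlExtractor, CSS_URL and extractUrls are made up for this sketch, and it assumes URLs contain no spaces, quotes or parentheses, which may not hold for everything in the wild.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CSSParserTwo above.
final class UrlExtractor {
    // First alternative: url( ... ) with optional single or double quotes.
    // Second alternative: @import followed directly by a quoted string.
    // [^'")\s]+ stops at the first quote, closing parenthesis or whitespace,
    // so two url(...) values on the same line are matched separately.
    private static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*['\"]?([^'\")\\s]+)['\"]?\\s*\\)"
        + "|@import\\s+['\"]([^'\"]+)['\"]");

    // Returns every URL found in one line of CSS, in order of appearance.
    static List<String> extractUrls(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(line);
        while (m.find()) {
            // Group 1 is set for url(...), group 2 for a bare @import string.
            urls.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String line = "background-image: url(six.jpg); color: black; "
            + "doesnotexist: url( \"seven.css\" );";
        System.out.println(extractUrls(line));
    }
}
```

On the h1 line from my test file this prints [six.jpg, seven.css]. It still picks up jadajada from inside the quoted string, which I can live with, and it does not check that an opening quote is paired with the same closing quote.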
Thanks