Domains and regex
807603, Dec 27 2007 (edited Dec 28 2007)
Hi,
I need to extract different parts of a URL; in particular, I'm interested in the pair "second-level domain" + "top-level domain".
I've used the following pattern to extract the full host name (subdomain + second-level domain + top-level domain) along with other information such as the parameters.
Pattern = "\\b((https?)://([-a-zA-Z0-9.]+)(:[0-9]*)?(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?)"
The third group extracts, for example, "www.google.com" from a URL like "http://www.google.com:8080/?a=b&c=d", but my goal is to extract "google.com".
Does anybody have advice on how to address this? (It's clear that I'm not a regex expert...)
Thanks a lot
ny
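
For illustration, here is a minimal sketch of one way to get from "www.google.com" to "google.com": take the host name first, then apply a second, much smaller regex that keeps only its last two dot-separated labels. The class name DomainSketch and the method name secondLevelDomain are hypothetical, and java.net.URI is swapped in for the big pattern above just to obtain the host. It assumes a single-label top-level domain, so it would give the wrong answer for hosts such as "www.example.co.uk" and for IP addresses; handling those correctly probably needs a list of known public suffixes.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class DomainSketch {
    // Matches the last two dot-separated labels at the end of a host name,
    // allowing an optional trailing dot (e.g. "google.com" in "www.google.com").
    private static final Pattern LAST_TWO_LABELS =
            Pattern.compile("([-a-zA-Z0-9]+\\.[-a-zA-Z0-9]+)\\.?$");

    // Returns "second-level domain" + "top-level domain", or null if the URL
    // has no host or the host has fewer than two labels.
    // Assumption: the top-level domain is a single label (not "co.uk"),
    // and the host is a name rather than an IP address.
    static String secondLevelDomain(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        Matcher m = LAST_TWO_LABELS.matcher(host);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints "google.com"
        System.out.println(secondLevelDomain("http://www.google.com:8080/?a=b&c=d"));
    }
}
```

The same small pattern could presumably be applied to the third group of the original pattern instead of the result of URI.getHost(), if the rest of that pattern is still needed for the port, path and parameters.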