Your opinion needed - generating unique ID from a URL
807605, Aug 20 2007 (edited Aug 20 2007)
I am currently making a web crawler type program, and before rushing straight in and coding something I thought I'd gather a few "expert" opinions, so to speak ;)
A crawler obviously takes a URL, downloads it and saves it in a database somewhere. However, I need to keep a record of which URLs have already been crawled, so as not to crawl the same URL again.
Ideally I need to make an algorithm which takes a String URL and outputs a unique ID for that URL - no other URL would be assigned the same ID.
After a while researching I have run into base64, which could take a URL and output a long String of characters. I have a question here though:
1. Is base64 always going to output UNIQUE IDs for every URL, or is there a slight off chance you could have duplicates?
However, even if base64 worked and it gave me the ID:
aHR0cDovL3d3dy5qYXZhLmNvbQ==
For the URL:
http://www.java.com
I still have to record it in a database and search for its presence to determine if the URL has been crawled before.
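To make the base64 idea concrete, here is a minimal sketch of it, not a finished design. It assumes a JDK that ships java.util.Base64 (Java 8 or later), and it uses an in-memory HashSet as a stand-in for the database. The class name CrawledUrls and its methods are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of any crawler library.
class CrawledUrls {
    // IDs of URLs already crawled (in memory here; a database table in practice).
    private final Set<String> seen = new HashSet<>();

    // Base64 is a reversible encoding, so two different URL strings
    // cannot produce the same ID.
    static String toId(String url) {
        return Base64.getEncoder().encodeToString(url.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes an ID back into the URL it was built from.
    static String fromId(String id) {
        return new String(Base64.getDecoder().decode(id), StandardCharsets.UTF_8);
    }

    // Returns true the first time a URL is recorded, false if it was already present.
    boolean markCrawled(String url) {
        return seen.add(toId(url));
    }

    public static void main(String[] args) {
        CrawledUrls crawled = new CrawledUrls();
        System.out.println(toId("http://www.java.com"));                 // aHR0cDovL3d3dy5qYXZhLmNvbQ==
        System.out.println(crawled.markCrawled("http://www.java.com"));  // true
        System.out.println(crawled.markCrawled("http://www.java.com"));  // false
    }
}
```

Note that the base64 ID is roughly a third longer than the URL itself, and that two spellings of the same page (for example with and without a trailing slash) are different strings and so get different IDs.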
So my question is:
Is there a way to construct a unique NUMERICAL ID for a URL?
This way, it would be much easier to find the record if all were sorted by their numerical ID number.
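For illustration only, here are two ways a numerical ID could be derived from a URL. Both are assumptions on my part rather than a recommended design, and the class and method names are hypothetical. The first reads the URL's bytes as one large number, which keeps IDs unique but makes them as long as the URL. The second takes the first 8 bytes of a SHA-256 digest as a long, which gives a fixed size but cannot strictly guarantee that no two URLs collide.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only.
class NumericUrlId {
    // Exact ID: the URL's UTF-8 bytes read as one positive number.
    // Distinct URLs give distinct numbers (assuming a URL never starts
    // with a NUL character), but the number grows with the URL length.
    static BigInteger exactId(String url) {
        return new BigInteger(1, url.getBytes(StandardCharsets.UTF_8));
    }

    // Fixed-size ID: the first 8 bytes of the SHA-256 digest as a long.
    // Collisions are extremely unlikely but possible in principle.
    static long hashedId(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a standard algorithm name on the Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.java.com";
        System.out.println(exactId(url));
        System.out.println(hashedId(url));
    }
}
```

With the hashed variant, the full URL would still need to be stored next to the number so that a rare collision can be detected by comparing the strings.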
Hope I haven't confused you; if I have, just tell me!
Thanks
-Myles