Java Programming

Your opinion needed - generating unique ID from a URL

807605Aug 20 2007 — edited Aug 20 2007
I am currently making a web-crawler-type program, and before rushing straight in and coding something I thought I'd gather a few "expert" opinions, so to speak ;)

A crawler obviously takes a URL, downloads it and saves it in a database somewhere. However, I need to keep a record of which URLs have previously been crawled, so as not to crawl the same URL again.
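To make the requirement concrete, here is a minimal in-memory sketch of the "have I crawled this already?" check. The class name `CrawlHistory` and the method `markIfNew` are hypothetical names chosen for illustration, and the sketch assumes the set of URLs fits in memory; a real crawler would keep this record in the database.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers which URLs have already been crawled.
final class CrawlHistory {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a URL is offered, false on every repeat.
    boolean markIfNew(String url) {
        return seen.add(url);
    }

    public static void main(String[] args) {
        CrawlHistory history = new CrawlHistory();
        System.out.println(history.markIfNew("http://www.java.com")); // true
        System.out.println(history.markIfNew("http://www.java.com")); // false
    }
}
```

Note that this compares URLs as exact strings, so "http://www.java.com" and "http://www.java.com/" would count as two different URLs.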

Ideally I need to make an algorithm which takes a String URL and outputs a unique ID for that URL - no other URL would be assigned the same ID.

After researching for a while I have run into Base64, which could take a URL and output a long String of characters. I have a question here though:
1. Is Base64 always going to output UNIQUE IDs for every URL, or is there a slight chance of duplicates?

However, even if Base64 worked and gave me the ID:
aHR0cDovL3d3dy5qYXZhLmNvbQ==
For the URL:
http://www.java.com
I would still have to record it in a database and search for its presence to determine whether the URL has been crawled before.
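For reference, a small sketch of the Base64 idea. Base64 is a reversible encoding rather than a hash, so two different byte sequences cannot produce the same output; the catch is that the encoded form is roughly a third longer than the URL itself. The class name `UrlBase64` is hypothetical, and the sketch assumes `java.util.Base64` (Java 8 or later) and plain ASCII URLs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: round-trips a URL through Base64.
final class UrlBase64 {
    // Encodes the URL's UTF-8 bytes as a Base64 string.
    static String encode(String url) {
        return Base64.getEncoder().encodeToString(url.getBytes(StandardCharsets.UTF_8));
    }

    // Recovers the original URL from its Base64 form.
    static String decode(String id) {
        return new String(Base64.getDecoder().decode(id), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String id = encode("http://www.java.com");
        System.out.println(id);         // aHR0cDovL3d3dy5qYXZhLmNvbQ==
        System.out.println(decode(id)); // http://www.java.com
    }
}
```

Because the original URL can always be decoded back out, the ID carries exactly the same information as the URL, which is why it is unique but no easier to search for.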

So my question is:
Is there a way to construct a unique NUMERICAL ID for a URL?

That way, it would be much easier to find a record if all the records were sorted by their numerical ID.
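One possible direction, sketched under assumptions: derive a 64-bit number from a cryptographic hash of the URL. This swaps Base64 for SHA-256, which is not something the post above settles on. A fixed-width number cannot be strictly unique across all possible URLs, since there are more URLs than 64-bit values, but accidental collisions should be extremely unlikely at crawler scale. If strict uniqueness matters, a database-assigned sequence number with a unique index on the URL column would be the safer route. The class name `UrlNumericId` and the method `idFor` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: maps a URL to a 64-bit numeric ID via SHA-256.
final class UrlNumericId {
    // Hashes the URL and keeps the first 8 bytes of the digest as a long.
    // Collisions are possible in principle, only very improbable.
    static long idFor(String url) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(url.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a standard algorithm on the Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idFor("http://www.java.com"));
    }
}
```

The resulting long can be negative; it would fit a signed 64-bit integer column, which the database can index and sort.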

I hope I haven't confused you; if I have, just tell me!

Thanks
-Myles
Post Details
Locked on Sep 17 2007
Added on Aug 20 2007