link checker fails to recognize protocol

807606May 3 2007

Hello all,

I am attempting to write a basic link checker that parses an HTML page and determines whether any of its links are dead. I found a couple of tutorials online, and most of this source comes from one of them. I am trying to familiarize myself with the basic classes, but I am running into multiple problems. The basic link checker verifies each link by sending an HTTP HEAD request, then checks the response code to decide whether the link is dead.

However, I have two major problems with this:

1) How can I verify links that do not use HTTP, for instance an FTP link on a web page?

2) Sometimes when I parse a link, I lose the protocol from the link:

i.e. href="http://www.ibm.com/systems/x/" => href="//www.ibm.com/systems/x/"

    Reader rd = getReader(rootURI);

    // Parse the HTML.
    kit.read(rd, doc, 0);

    // Iterate through the elements
    // of the HTML document.
    ElementIterator it = new ElementIterator(doc);
    javax.swing.text.Element elem;
    System.out.println("Starting Validation from: " + rootURI);
    while ((elem = it.next()) != null) {
        SimpleAttributeSet s = (SimpleAttributeSet) elem.getAttributes().getAttribute(HTML.Tag.A);
        if (s != null) {
            System.out.println("attributeSet: " + s);
            String link = (String) s.getAttribute(HTML.Attribute.HREF);
            System.out.println("Validating: " + link);
            validateHref(link);
        } else {
            //System.out.println("Attribute set is empty");
            //flag this to user
        }
    }

This is the block of code I am using to extract the links from the HTML source. My console output looks like this:

[stdout]
attributeSet: href=//www.ibm.com/systems/x/ class=left-nav-overview
Validating: //www.ibm.com/systems/x/
[stdout]

However, looking at the HTML source, the link should use the http protocol, i.e. http://www.ibm.com/systems/x. Since I later check that HTTP URLs start with "http://", links that are missing that prefix are being ignored.
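For reference, an href that begins with "//" is a scheme-relative reference, and it is normally resolved against the URL of the page it was found on. The following is only a minimal sketch of that resolution step, not a confirmed diagnosis of why the parser reports the link this way. The class name ResolveHref and the helper resolve are hypothetical and not part of the source below; java.net.URI and its resolve method are standard.

```java
import java.net.URI;

class ResolveHref {

    // Hypothetical helper: turns an href found on a page into an absolute
    // URL by resolving it against the URL of that page.
    static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        // URI.resolve keeps the base scheme when the reference starts with "//".
        return base.resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed page URL for illustration (query string omitted).
        String page = "http://www-304.ibm.com/jct01004c/systems/support/supportsite.wss/docdisplay";

        // Scheme-relative link, as printed in the console output above.
        System.out.println(resolve(page, "//www.ibm.com/systems/x/"));

        // A link that is already absolute is returned unchanged.
        System.out.println(resolve(page, "http://www.ibm.com/systems/x/"));
    }
}
```

With something like this in place, the resolved string rather than the raw href could be handed to validateHref, so the "http://" check would no longer skip these links.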

Can someone please explain how I can fix this? Also, I eventually plan to turn this into more of a web spider that traverses multiple levels of pages, checking and validating the links.

Thank you,
Nick

Full source:

	
    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;

    public class enhanceLinkCheck {

        static int failCnt = 0;

        public static void main(String[] args) {
            HttpURLConnection.setFollowRedirects(false);
            EditorKit kit = new HTMLEditorKit();
            Document doc = kit.createDefaultDocument();

            // The Document class does not yet
            // handle charset's properly.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

            try {
                //String rootURI = "http://www-304.ibm.com/jct01004c/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-65651&brandind=5000008#osa";
                String rootURI = "http://www-304.ibm.com/jct01004c/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-65665&brandind=5000008";
                // Create a reader on the HTML content.
                Reader rd = getReader(rootURI);

                // Parse the HTML.
                kit.read(rd, doc, 0);

                // Iterate through the elements
                // of the HTML document.
                ElementIterator it = new ElementIterator(doc);
                javax.swing.text.Element elem;
                System.out.println("Starting Validation from: " + rootURI);
                while ((elem = it.next()) != null) {
                    SimpleAttributeSet s = (SimpleAttributeSet) elem.getAttributes().getAttribute(HTML.Tag.A);
                    if (s != null) {
                        System.out.println("attributeSet: " + s);
                        String link = (String) s.getAttribute(HTML.Attribute.HREF);
                        System.out.println("Validating: " + link);
                        validateHref(link);
                    } else {
                        //System.out.println("Attribute set is empty");
                        //alert user
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.out.println("Failed links: " + failCnt);
            System.exit(1);
        }

        // Returns a reader on the HTML data. If 'uri' begins
        // with "http:", it's treated as a URL; otherwise,
        // it's assumed to be a local filename.
        static Reader getReader(String uri) throws IOException {
            if (uri.startsWith("http:")) {
                // Retrieve from Internet.
                URLConnection conn = new URL(uri).openConnection();
                return new InputStreamReader(conn.getInputStream());
            } else {
                // Retrieve from file.
                return new FileReader(uri);
            }
        }

        private static void validateHref(String urlString) {
            if ((urlString != null) && (urlString.startsWith("http://"))) {
                try {
                    URL url = new URL(urlString);
                    URLConnection connection = url.openConnection();
                    if (connection instanceof HttpURLConnection) {
                        HttpURLConnection httpConnection = (HttpURLConnection) connection;
                        httpConnection.setRequestMethod("HEAD");
                        httpConnection.connect();
                        int response = httpConnection.getResponseCode();
                        if (response >= 400) {
                            System.out.print("[FAILED]");
                            failCnt++;
                        }
                        System.out.println("Response: " + response);
                        String location = httpConnection.getHeaderField("Location");
                        if (location != null) {
                            System.out.println("Location: " + location);
                        }
                        System.out.println();
                    } else {
                        System.out.println("Connection not HTTP: " + connection);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
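
On the first question (non-HTTP links): validateHref above silently skips anything that does not start with "http://". One possible direction, shown here only as a rough sketch under assumptions, is to branch on the URL scheme. HTTP and HTTPS links get a HEAD request as before, while for ftp: and file: links there is no status code, so the sketch simply treats a successful getInputStream() as "alive". The class SchemeAwareCheck, the Status enum, the 5-second timeouts, and the decision to leave other schemes (such as mailto:) unchecked are all hypothetical choices, not part of the original program.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

class SchemeAwareCheck {

    enum Status { ALIVE, DEAD, UNCHECKED }

    // Returns the lower-cased scheme of a link, or null if it has none
    // (for example "//host/path") or cannot be parsed as a URI.
    static String schemeOf(String link) {
        try {
            String scheme = URI.create(link).getScheme();
            return scheme == null ? null : scheme.toLowerCase();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Hypothetical replacement for the body of validateHref.
    static Status check(String link) {
        String scheme = schemeOf(link);
        boolean http = "http".equals(scheme) || "https".equals(scheme);
        boolean streamable = "ftp".equals(scheme) || "file".equals(scheme);
        if (!http && !streamable) {
            // No scheme, or one this sketch does not know how to verify.
            return Status.UNCHECKED;
        }
        try {
            URLConnection conn = URI.create(link).toURL().openConnection();
            conn.setConnectTimeout(5000); // assumed timeouts, in milliseconds
            conn.setReadTimeout(5000);
            if (conn instanceof HttpURLConnection) {
                HttpURLConnection httpConn = (HttpURLConnection) conn;
                httpConn.setRequestMethod("HEAD");
                return httpConn.getResponseCode() < 400 ? Status.ALIVE : Status.DEAD;
            }
            // ftp: and file: have no response code; being able to open
            // the stream is used as the liveness test instead.
            try (InputStream in = conn.getInputStream()) {
                return Status.ALIVE;
            }
        } catch (IOException | IllegalArgumentException e) {
            return Status.DEAD;
        }
    }
}
```

Note that opening an ftp: stream starts retrieving the resource rather than only probing for it, so this is heavier than a HEAD request.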


Locked on May 31 2007

Added on May 3 2007
