
Wednesday, December 31, 2014

Using Krawkraw To Find Broken Links

I recently tweeted that I used Krawkraw to fish out broken links on the company's website.

From all 187 pages, I was able to find 8 broken links: the rogues!

Achieving this required implementing a KrawlerAction that acts only when a broken link is encountered. Such an implementation may look like this:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
// plus the Krawkraw imports for KrawlerAction and FetchedPage

/**
 * A KrawlerAction that records every page fetched with a 404 status,
 * together with the page that linked to it, in a plain text file.
 */
public class NotFoundAction implements KrawlerAction {
    
    private Path outPutFile = Paths.get("output.txt");
    private String entry = "[Not found:] {LINK} [Source Url:] {SOURCE} \n";
    
    public NotFoundAction(String pathToOutput) {
        this.outPutFile = Paths.get(pathToOutput);
    }

    public NotFoundAction() {
    }
    
    @Override
    public void execute(FetchedPage fetchedPage) {
        // writes information about the broken link to file
        if (fetchedPage.getStatus() == 404) {
            writeToFile(fetchedPage);
        }
    }
    
    private void writeToFile(FetchedPage page) {
        
        // create the output file the first time a broken link is found
        if (!Files.exists(outPutFile)) {
            try {
                Files.createFile(outPutFile);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        
        // append an entry describing the broken link and the page that links to it
        try (BufferedWriter writer = Files.newBufferedWriter(outPutFile,
                StandardCharsets.UTF_8, StandardOpenOption.APPEND)) {
            String replaced = entry
                    .replace("{LINK}", page.getUrl())
                    .replace("{SOURCE}", page.getSourceUrl());
            writer.write(replaced);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
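
For completeness, here is a rough sketch of how an action like this gets wired into a crawl. The driver class and method names below (Krawkraw, registerAction, doKrawl) are illustrative placeholders rather than a guaranteed match for Krawkraw's actual API:

// Sketch only: "Krawkraw", "registerAction" and "doKrawl" are placeholder
// names for the crawler's driver API, not necessarily the real signatures.
public class BrokenLinkCrawl {

    public static void main(String[] args) throws Exception {
        // the action defined above; hits are written to output.txt
        NotFoundAction notFoundAction = new NotFoundAction("output.txt");

        Krawkraw crawler = new Krawkraw();          // placeholder driver class
        crawler.registerAction(notFoundAction);     // placeholder: attach the action
        crawler.doKrawl("http://www.example.nl");   // placeholder: crawl the whole site
    }
}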

I thought this little application of Krawkraw might be widely useful, so I wrote the above NotFoundAction and packaged it as an executable jar file that can be run from the command line.

You can download the jar file here:

How to use:

You run the jar from the command line.

The executable jar takes two arguments: the URL to be crawled (specified by -Durl, which is mandatory) and the destination file to write the details of the broken links to (specified by -Doutputfile).
If -Doutputfile is not specified, the output is written to output.txt in the directory the jar is run from.

For example:
// Find broken links in www.example.nl. Since -Doutputfile is not used, 
// the result would be written to output.txt
java -jar -Durl=http://www.example.nl brokenlink-extractor.jar
or
// Find broken links in www.example.nl.
// The results will be written to result.txt on my Desktop
java -jar -Durl=http://www.example.nl -Doutputfile=/home/dadepo/Desktop/result.txt brokenlink-extractor.jar
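
Under the hood, the jar simply reads those two -D system properties before kicking off the crawl. A minimal sketch of that argument handling (the jar's actual main method is not reproduced here, so treat this as illustrative):

public class Launcher {

    public static void main(String[] args) {
        // -Durl is mandatory; -Doutputfile falls back to output.txt
        String url = System.getProperty("url");
        String outputFile = System.getProperty("outputfile", "output.txt");

        if (url == null) {
            System.err.println("Please supply the site to crawl via -Durl");
            System.exit(1);
        }

        NotFoundAction notFoundAction = new NotFoundAction(outputFile);
        // ... then hand the action to the crawler and crawl the url,
        // as sketched after the NotFoundAction listing above
    }
}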

The content of the output file would be in this format: [Not found:] {LINK} [Source Url:] {SOURCE}

{LINK} is the broken link, and {SOURCE} is the page that contains the broken link, so you know exactly where to fix the problem!

So for example:
[Not found:] http://www.example.com/aboutus.html [Source Url:] http://www.example.com/company.html

This means that on http://www.example.com/company.html there is a link to http://www.example.com/aboutus.html, which does not exist.

When the extractor finishes, it prints the total number of pages it crawled. If no output file is created, the crawled site has no broken links.

Hope somebody somewhere finds this useful!
