Diplo Link Checker for Umbraco

Umbraco AngularJS

Diplo Link Checker is a dashboard add-on for Umbraco 7 that allows an editor to easily check their site for broken or problematic links.

I've been a fan of the Umbraco Content Management System for many years. It's powerful, flexible and open source, and it has a great community around it. One of its best features is that it allows developers to contribute packages that extend or enhance Umbraco's functionality. Best of all, thanks to the ethos of open source, the vast majority of these packages are free to download. I have benefited a lot from some of these packages, so I thought it was time to give a little something back…

So I created a dashboard package for Umbraco 7 (using AngularJS and WebAPI) that allows an editor to check all the links within an Umbraco site.

Features

  • Able to check an entire site, or just a section or even a single page
  • Completely asynchronous so can check multiple links simultaneously and provide real-time feedback
  • Caches link status so only checks each unique link once (within a short period)
  • Works for all types of links - external, internal, HTML, files, images and even CSS and JavaScript files
  • Provides feedback on errors with help dialogue
  • Quick edit facility allows you to easily edit the page that contains the broken link directly within Umbraco
  • Advanced options allow you to set the timeout period, toggle between viewing all checked links and only links that have problems

Screenshot

Umbraco Link Checker

Note: This is only for Umbraco 7.1 and above. This is an initial release, so please provide feedback to help improve it.

How it Works and Source Code

The basic premise is that the checker first iterates over every published page in the site from the chosen start node (using Umbraco's published content API) and creates a list of the page IDs to be checked.
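As a rough illustration of that first step, here is a minimal JavaScript sketch of walking a content tree from a start node and collecting the IDs of published pages. The package itself does this server-side with Umbraco's published content API; the node shape (`id`, `published`, `children`) and the function name `collectPageIds` are assumptions made purely for this example.

```javascript
// Hypothetical node shape: { id: number, published: boolean, children: Node[] }
// Returns the IDs of all published pages at or below startNode, in document order.
function collectPageIds(startNode) {
  const ids = [];
  const stack = [startNode];

  while (stack.length > 0) {
    const node = stack.pop();

    // Assumption: an unpublished page hides its descendants too.
    if (!node.published) continue;

    ids.push(node.id);

    // Push children in reverse so they are popped in their original order.
    const children = node.children || [];
    for (let i = children.length - 1; i >= 0; i--) {
      stack.push(children[i]);
    }
  }

  return ids;
}
```

An explicit stack is used instead of recursion so that very deep content trees cannot overflow the call stack.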

This list is then passed back to an Angular controller that sends an asynchronous request to an Umbraco Web API controller, passing in the ID of the node to be checked.

A service then makes an HTTP GET request to the full URL of the page to get back the entire HTML for the page. This HTML is then parsed using the Html Agility Pack (which comes with Umbraco) and a list of every link in the page is collated. Link types that cannot be checked (such as mailto: links) are discarded.
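The filtering part of this step could look something like the sketch below. It is not the package's code (which is C# and uses the Html Agility Pack); it assumes the raw `href`/`src` values have already been pulled out of the HTML, and the function name `collectCheckableLinks` is hypothetical. It keeps only `http:` and `https:` links, which is one simple way of discarding things like mailto: links.

```javascript
// Turns the raw href/src values found on a page into a de-duplicated list of
// absolute URLs that can be checked over HTTP.
function collectCheckableLinks(rawLinks, pageUrl) {
  const seen = new Set();
  const result = [];

  for (const raw of rawLinks) {
    const href = (raw || "").trim();

    // Skip empty values and in-page anchors such as "#top".
    if (href === "" || href.startsWith("#")) continue;

    let url;
    try {
      // Resolves relative links against the URL of the page they were found on.
      url = new URL(href, pageUrl);
    } catch (e) {
      continue; // not a parseable URL
    }

    // Assumption: only http(s) links are checkable; mailto:, tel: etc. are dropped.
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;

    // The fragment does not change which resource is requested.
    url.hash = "";

    if (!seen.has(url.href)) {
      seen.add(url.href);
      result.push(url.href);
    }
  }

  return result;
}
```

For example, called with a page URL of `https://example.com/blog/post`, a link of `style.css` resolves to `https://example.com/blog/style.css`.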

Another service then makes asynchronous HTTP HEAD requests to each of the links in the page, using the HttpClient class in .NET, which makes it easy to issue multiple requests in parallel (an HTTP HEAD request doesn't send back the content body, just the status, so it is much faster than downloading entire pages). The HTTP status code of each request is recorded and sent back to the Angular controller, which updates the UI with the results. I also keep track of every URL that has been checked in a cache, so if the same URL is requested again the result is retrieved from memory rather than re-checked.
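To make the caching idea concrete, here is a small JavaScript sketch of a checker that only checks each unique URL once within a short period. The real package does this in C# with HttpClient; the names `createCachedChecker` and `checkAll`, the `ttlMs` parameter and the injected `checkFn` (whatever actually performs the HEAD request and resolves to a status code) are all assumptions for the sake of the example.

```javascript
// Wraps checkFn (url => Promise<number>) so each URL is only checked once
// per ttlMs milliseconds. `now` is injectable to make the expiry testable.
function createCachedChecker(checkFn, ttlMs, now = Date.now) {
  const cache = new Map(); // url -> { promise, expires }

  return function check(url) {
    const entry = cache.get(url);
    if (entry && entry.expires > now()) {
      return entry.promise; // reuse the earlier (possibly still pending) check
    }

    // Caching the promise rather than the result means two simultaneous
    // requests for the same URL share a single underlying check.
    const promise = checkFn(url);
    cache.set(url, { promise, expires: now() + ttlMs });
    return promise;
  };
}

// Checks every link on a page in parallel and pairs each URL with its status.
function checkAll(urls, check) {
  return Promise.all(
    urls.map(async (url) => ({ url, status: await check(url) }))
  );
}
```

In an environment that provides `fetch`, `checkFn` could be something like `(url) => fetch(url, { method: "HEAD" }).then((r) => r.status)`. Note that this sketch also caches failed checks until they expire, which may or may not be what you want.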

The Angular UI layer then allows filtering to be performed as well as showing more detailed results for each link, including things like a full description of the status code, the line number in the HTML where the link was found etc.
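The filtering and status descriptions could be sketched in plain JavaScript along these lines. This is illustrative only and not taken from the package: the result shape (`url`, `status`), the function names and the rule for what counts as a "problem" are assumptions, and the table covers just a handful of standard status codes.

```javascript
// A small subset of the standard HTTP status descriptions.
const STATUS_TEXT = {
  200: "OK",
  301: "Moved Permanently",
  302: "Found",
  403: "Forbidden",
  404: "Not Found",
  500: "Internal Server Error",
  503: "Service Unavailable",
};

// Returns a human-readable description for a status code.
function describeStatus(status) {
  return STATUS_TEXT[status] || "Unknown status";
}

// Assumption: a "problem" is any status of 400 or above, or no numeric
// status at all (for example, when the request timed out).
function isProblem(result) {
  return typeof result.status !== "number" || result.status >= 400;
}

// Returns either every checked link or only those with problems.
function filterResults(results, onlyProblems) {
  return onlyProblems ? results.filter(isProblem) : results.slice();
}
```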

You can find the entire source code on my GitHub page. It's a bit rough, but hey, that's what pull requests are for!

 


10 Comments


Tommy Enger

Hi! This looks great. Does it work with 7.3? I installed it, click on "Start check", select the root page. Then the "start check" button changes to "checking" and the text below says: "Checked 0 pages starting from Startpage". Nothing more happens, and it stays at "checked 0 pages..." No errors in the console.


Tommy Enger

Ok, I found a way to solve it quick and dirty. Changed to: $http.get(getIdsToCheckUrl + data.id).success(function (data, status, headers, config) { data = data.$values; ..............


Dan Diplo

Hi Tommy,

Strange, as I have tried installing Link Checker on Umbraco 7.3.0 and it worked OK... Was there something specific to your installation that was preventing it from working, do you think? Thanks for posting the fix - I'll bear it in mind for future releases.


Angelbert David

Hi there Dan,

I am trying to use your diplo link checker on Umbraco 6.2.5 (using your 1.5 version), but all the pages that are scanned return 0 links, even though I know that there are links on the pages.

I've tried a few things:

  • Links generated by the template
  • Links actually embedded through the rich text editor on a page

Am I missing something? Some sort of configuration maybe?

thanks.


Dan Diplo

Hi Angelbert,

I'm afraid I don't really have the source for the older version, so it's hard to support it. It did use to work "out-of-the-box", but I haven't tried it on the later 6.2.x builds, I'm afraid.


Evan Moore

Thanks Dan, this is a great tool. I'm excited to look through your code and learn a few things about parsing HTML. I'm using Umbraco 7.3.4 and all external links on our website report as 403 or 404. I've not determined if this is an environmental issue or a problem with the link checker. Have you run into this?


Dan Diplo

Hi Evan,

A 403 normally means the URL exists but the application trying to access it doesn't have permission. Are any of these URLs password protected, or do they require authentication?

One thing my Link Checker does is use HEAD requests rather than traditional GET requests, since this is faster. Occasionally some servers will reject this, as they are configured to only accept GET/POST requests. That's my only thought.

 

 


Eric Schrepel

Trying to use this on Umbraco 7.1.4 and we happen to run the Umbraco BackOffice as https:// rather than http://. We're finding that all the pages return a ton of "bad" links that actually work fine but maybe the https:// is throwing the Diplo link checker? Didn't know if there was a way around this. Also, is it only checking links to pages within our site, or is it examining links to external sites also? We mostly care that our own pages aren't causing 404s, somewhat less so about outside links.


Dan Diplo

Eric, I can't think of any reason why it wouldn't work under HTTPS, but I haven't actually tried it. I'll see if I can test it sometime. It is designed to check both internal and external links. Could you have some firewall or network policy that might be restricting outgoing traffic, perhaps?


blackhawk

This package works perfectly for me in Umbraco 7.6.6. Well done!
