A broken link checker for Umbraco 7
Diplo Link Checker is a dashboard add-on for Umbraco 7 that allows an editor to easily check their site for broken or problematic links.
I've been a fan of the Umbraco Content Management System for many years. It's powerful, flexible and open source, and it has a great community around it. One of its best features is that it allows developers to contribute packages that extend or enhance Umbraco's functionality. Best of all, thanks to the ethos of open source, the vast majority of these packages are free to download. I have benefited greatly from some of these packages, so I thought it was time to give a little something back…
So I created a dashboard package for Umbraco 7 (using AngularJS and Web API) that allows an editor to check all the links within an Umbraco site. Its features include:
- Able to check an entire site, or just a section or even a single page
- Completely asynchronous, so it can check multiple links simultaneously and provide real-time feedback
- Caches link status, so it only checks each unique link once (within a short period)
- Provides feedback on errors with help dialogue plus an overview of all status codes
- Quick edit facility allows you to easily edit the page that contains the broken link directly within Umbraco
- Advanced options allow you to set the timeout period and to toggle between viewing all checked links and only links that have problems
- You can whitelist HTTP codes and only report on those
- You can also configure it to ignore ports (if you are behind a reverse proxy, for example)
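To make the last option concrete, here is a minimal sketch of what ignoring ports could look like: the port is removed from a URL before it is requested. This is only an assumption about the behaviour, written in Python for brevity rather than the package's own C#, and `strip_port` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_port(url: str) -> str:
    """Return the URL without an explicit port (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.port is None:
        return url  # nothing to strip
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # IPv6 literals need their brackets back
    # Note: this simplification also drops any "user:password@" prefix.
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

For example, `strip_port("http://example.com:8080/about?x=1")` gives `"http://example.com/about?x=1"`, while a URL without a port is returned unchanged.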
How it Works and Source Code
The basic premise is that the checker first iterates over every published page in the site from the chosen start node (using Umbraco's published content API) and creates a list of the page IDs to be checked.
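The traversal described here can be sketched roughly as follows. This is an illustration in Python, not the package's code: the `Page` class is a hypothetical stand-in for Umbraco's published content nodes, and skipping the descendants of an unpublished page is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a content node."""
    id: int
    published: bool = True
    children: list[Page] = field(default_factory=list)


def collect_page_ids(start: Page) -> list[int]:
    """Walk the tree from the start node and list published page IDs."""
    ids: list[int] = []
    stack = [start]
    while stack:
        page = stack.pop()
        if not page.published:
            continue  # assumption: an unpublished page hides its descendants
        ids.append(page.id)
        # Reverse so children are visited in their original order.
        stack.extend(reversed(page.children))
    return ids
```

Starting from a section or a single page is then just a matter of passing a different start node.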
This list is then passed back to an Angular controller that sends an asynchronous request to an Umbraco Web API controller, passing in the ID of the node to be checked.
A service then makes an HTTP GET request to the full URL of the page to get back the entire HTML for the page. This HTML is then parsed using the Html Agility Pack (which comes with Umbraco), and a list of every link in the page is collated. Link types that cannot be checked (such as mailto: links) are discarded.
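As a rough illustration of the parsing step, the sketch below uses Python's standard `html.parser` in place of the Html Agility Pack. The names `LinkCollector` and `extract_links` are hypothetical, and the rule for what counts as checkable (only `http` and `https` links) is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

CHECKABLE_SCHEMES = {"http", "https"}  # assumption: everything else is skipped


class LinkCollector(HTMLParser):
    """Collects (absolute_url, line_number) for each checkable <a href>."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[tuple[str, int]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = (dict(attrs).get("href") or "").strip()
        if not href or href.startswith("#"):
            return  # empty links and same-page anchors are not checked
        absolute = urljoin(self.base_url, href)
        if urlsplit(absolute).scheme in CHECKABLE_SCHEMES:
            line, _column = self.getpos()  # position of the opening tag
            self.links.append((absolute, line))


def extract_links(html: str, base_url: str) -> list[tuple[str, int]]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Relative links are resolved against the page URL, and `mailto:` or `javascript:` links drop out because their scheme is not in the checkable set. Recording the line number at this stage is what lets the UI later show where in the HTML a link was found.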
Another service then makes asynchronous HTTP HEAD requests to each of the links in the page using the HttpClient class in .NET, which makes it easy to issue multiple requests in parallel. (An HTTP HEAD request doesn't send back the content body, just the status and headers, so it is much faster than downloading entire pages.) The HTTP status code of each request is recorded and sent back to the Angular controller, which updates the UI with the results. I also keep track of every URL that has been checked in an in-memory cache, so if the same URL is requested again the result is retrieved from memory rather than being re-checked.
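The combination of parallel HEAD requests and a short-lived cache might look something like the sketch below. It uses Python's `asyncio` and `urllib` rather than .NET's HttpClient, so it shows the idea and not the package's implementation. `LinkChecker`, `head_status`, the 300-second lifetime and the use of `0` for "no response" are all assumptions made for illustration.

```python
import asyncio
import time
import urllib.error
import urllib.request


def head_status(url: str, timeout: float = 10.0) -> int:
    """Blocking HEAD request; returns the status code, or 0 if no response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        # urlopen follows redirects, so this reports the final status.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses arrive as exceptions
    except OSError:
        return 0  # DNS failure, refused connection, timeout, etc.


class LinkChecker:
    def __init__(self, fetch=head_status, ttl_seconds: float = 300.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[float, int]] = {}  # url -> (expiry, status)

    async def check(self, url: str) -> int:
        cached = self._cache.get(url)
        if cached and cached[0] > time.monotonic():
            return cached[1]  # still fresh: skip the network
        # Run the blocking request in a worker thread (Python 3.9+).
        status = await asyncio.to_thread(self._fetch, url)
        self._cache[url] = (time.monotonic() + self._ttl, status)
        return status

    async def check_all(self, urls: list[str]) -> dict[str, int]:
        unique = list(dict.fromkeys(urls))  # de-duplicate, keep order
        statuses = await asyncio.gather(*(self.check(u) for u in unique))
        return dict(zip(unique, statuses))
```

A call such as `asyncio.run(LinkChecker().check_all(urls))` returns a dictionary of URL to status code. Passing a different `fetch` function makes the checker easy to exercise without touching the network.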
The Angular UI layer then allows the results to be filtered and shows more detailed information for each link, including a full description of the status code and the line number in the HTML where the link was found.
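The filtering and status descriptions live in the AngularJS layer, but the logic is simple enough to sketch in a few lines of Python. The definition of a "problem" used here (no response, or a status of 400 and above) is an assumption, as are the function names and the `(url, status, line)` tuple shape.

```python
from http import HTTPStatus


def describe(status: int) -> str:
    """Human-readable description of a status code."""
    try:
        known = HTTPStatus(status)
    except ValueError:
        return "No recognised HTTP status"
    text = f"{known.value} {known.phrase}"
    # Not every status has a longer description in the standard library.
    return f"{text}: {known.description}" if known.description else text


def is_problem(status: int) -> bool:
    # Assumption: 0 means no response; 4xx and 5xx are errors.
    return status == 0 or status >= 400


def problems_only(results):
    """Keep only (url, status, line) results that look broken."""
    return [result for result in results if is_problem(result[1])]
```

Toggling between all checked links and only problem links then amounts to showing either the full result list or the output of `problems_only`.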
You can find the entire source code on my GitHub page. It's a bit rough, but hey, that's what pull requests are for!
You can also install the package via NuGet: