Crawling FHU.edu

by Michael Clark
October 4, 2011 8:07 PM

A project that has been on my list for a while now is a broken link and spell check application that we can use to verify the content on the FHU website. As it currently stands, the application's back-end is probably 75% complete.

The broken link portion successfully crawls the website (any website, for that matter) and returns the status code for each page and for every link on it. This has enabled us to find instances where a typo was made when editing CMS pages, or where a link points to a page that no longer exists. The broken link portion currently only records status codes of 200 (OK), 301 (Moved Permanently), 404 (Not Found), or 500 (Internal Server Error). If a page returns anything but those, the application records a status code of 99, noting that further investigation is needed.
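The post doesn't include the crawler's code, but a minimal sketch of a status check like this in .NET might look something like the following. The `GetStatusCode` and `ClassifyStatus` helpers are hypothetical names, and the 99 fallback mirrors the convention described above.

    using System;
    using System.Net;

    class LinkChecker
    {
        // Returns the HTTP status code for a URL, or 99 when the response
        // falls outside the codes the crawler knows how to classify.
        static int GetStatusCode(string url)
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.Method = "HEAD";            // headers are enough for a status check
                request.AllowAutoRedirect = false;  // report 301s instead of silently following them
                using (var response = (HttpWebResponse)request.GetResponse())
                {
                    return ClassifyStatus((int)response.StatusCode);
                }
            }
            catch (WebException ex)
            {
                // Non-success responses (404, 500, ...) surface as exceptions in .NET
                var response = ex.Response as HttpWebResponse;
                return response != null ? ClassifyStatus((int)response.StatusCode) : 99;
            }
        }

        static int ClassifyStatus(int code)
        {
            switch (code)
            {
                case 200:
                case 301:
                case 404:
                case 500:
                    return code;
                default:
                    return 99; // needs further investigation
            }
        }
    }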

The spellchecking portion is what I'm currently working on, and it has been a headache to say the least. I'm using Hunspell with the NHunspell .NET wrapper. Hunspell is the same spell checker used by LibreOffice, Mozilla, Eclipse, Google Chrome, and Mac OS X Snow Leopard. The biggest hurdle at the moment is parsing page content so that things like HTML tags are not checked. This has been problematic simply because of the sheer number of cases to find and fix.
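For reference, the basic NHunspell usage is straightforward once the page text has been extracted; the hard part is the extraction. Here's a small sketch using the standard `en_US.aff`/`en_US.dic` dictionary files (the file paths and the sample word are my own, not from the actual application):

    using System;
    using NHunspell;

    class SpellCheckDemo
    {
        static void Main()
        {
            // en_US.aff / en_US.dic are the standard Hunspell dictionary files
            using (var hunspell = new Hunspell("en_US.aff", "en_US.dic"))
            {
                string word = "universty"; // deliberately misspelled
                if (!hunspell.Spell(word))
                {
                    Console.WriteLine("Misspelled: " + word);
                    foreach (string suggestion in hunspell.Suggest(word))
                    {
                        Console.WriteLine("  Suggestion: " + suggestion);
                    }
                }
            }
        }
    }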

To parse the HTML content, I've come across a free .NET library called HTML Agility Pack. The library uses XPath syntax and enables me to easily select different sections of the markup (nodes). From there I can either remove the nodes, split them, or add to them.
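As a rough illustration of that approach, the sketch below loads a page, removes nodes that should never be spell checked (scripts, styles, comments), and pulls out the remaining text. The URL and the exact set of XPath selectors are just examples, not necessarily what the real application uses.

    using System;
    using HtmlAgilityPack;

    class HtmlTextExtractor
    {
        static void Main()
        {
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://www.fhu.edu/"); // example URL

            // Drop nodes whose contents should never be spell checked
            var skip = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
            if (skip != null)
            {
                foreach (HtmlNode node in skip)
                {
                    node.Remove();
                }
            }

            // InnerText strips the remaining tags, leaving just the page text
            string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
            Console.WriteLine(text);
        }
    }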

Once I finish the back-end portion, I'll begin working on a front-end that will grab data from the database and generate reports detailing the broken links and misspelled words on the website. The back-end will be a task application that runs periodically.

