Using Html Agility Pack to build a Dilbert SharePoint WebPart


I used to have a Dilbert web part on my SharePoint home page. Every morning when I logged in at work and opened my browser, I got a giggle out of the daily Dilbert comic strip. A little while back it stopped working. I don’t fully know why (I think it had something to do with Google FeedBurner changing), and incidentally it appears to be working again. At the time, though, it made me think about creating my own simple version of the web part.

For those who have stumbled upon this blog because they searched for Html Agility Pack but have no idea what Dilbert is, please follow this link: http://www.dilbert.com/

To grab the daily Dilbert image I needed something like jQuery that could query the DOM of the Dilbert web page, but it had to run in the code-behind on the server. I discovered that Html Agility Pack is great for screen scraping HTML and is easy to use. You can download Html Agility Pack from CodePlex at http://htmlagilitypack.codeplex.com/, where you can also find additional examples, source code and documentation.

As taken directly from the CodePlex Html Agility Pack home page:

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don’t HAVE to understand XPATH nor XSLT to use it, don’t worry…). It is a .NET code library that allows you to parse “out of the web” HTML files. The parser is very tolerant with “real world” malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

As the quote says, Html Agility Pack also copes with malformed HTML pages, which is a nice feature.

Working out the URL for the image.

I had to work out the URL of the image. Luckily the Dilbert site puts each strip on its own web page, with a URL in the format http://www.dilbert.com/strips/comic/YYYY-MM-DD/, and the archive goes back to 16th April 1989. Then, using the browser’s developer tools, I clicked on the image to find out where it is located.
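The URL format above is easy to build from a date. Here is a minimal sketch, assuming the format and the 16th April 1989 start date are as described; the class name DilbertUrl and the method ForDate are hypothetical names made up for illustration.

```csharp
using System;
using System.Globalization;

public static class DilbertUrl
{
    // Earliest strip date mentioned in this post.
    private static readonly DateTime FirstStrip = new DateTime(1989, 4, 16);

    // Builds the page URL for a date, using the YYYY-MM-DD format described above.
    public static string ForDate(DateTime date)
    {
        if (date.Date < FirstStrip)
            throw new ArgumentOutOfRangeException("date", "No strips before 16 April 1989.");

        // InvariantCulture keeps the date Gregorian regardless of the server's locale.
        return "http://www.dilbert.com/strips/comic/"
            + date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) + "/";
    }
}
```

Calling DilbertUrl.ForDate(DateTime.Today) gives the page for today's strip, and passing an older date lets the web part show a strip from the archive.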

The developer tools show that inside the div with the class “STR_Image” there is an img tag, which has the source /dyn/str_strip/000000000/00000000/0000000/100000/90000/2000/200/192230/192230.strip.gif. Getting this source value is where Html Agility Pack comes in handy.

So to get this value we need to use Html Agility Pack to:

  1. Load the page
  2. Find the STR_Image div tag and Image Tag
  3. Get the Source for the image
  4. Load the image on the page.

Apart from loading the image in step 4, these steps can all be performed with Html Agility Pack.

Including the Html Agility pack dll within SharePoint WSP.

Download the Html Agility Pack from CodePlex and extract the zip file. Inside the extracted folder you will see many subfolders, each relating to a version of .NET or Windows Phone.

For my SharePoint 2010 environment I’m using the Net20 folder, which contains three files. I’ve added the .dll to a folder in my project and set its “Copy to Output Directory” property to “Copy always”.

In my SharePoint project, double-click on Package, then click on the Advanced button. Now we need to add the .dll: click the Add button, then choose Add Assembly from Project Output.

I’ve added a Visual WebPart. The web part just has an Image and a Label on it. The label is just there to tell the user which date they are looking at.

In the code-behind I’m working through the steps listed earlier.

Load the page

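// Requires "using HtmlAgilityPack;" at the top of the code-behind file.
var todayDate = DateTime.Now.ToString("yyyy-MM-dd");
HtmlWeb htmlWeb = new HtmlWeb();
// Downloads and parses the page for today's strip.
HtmlDocument document = htmlWeb.Load("http://www.dilbert.com/strips/comic/" + todayDate);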

Find the STR_Image div tag and Image Tag

Below I’m using XPath to find the element with the class STR_Image and then grab the img inside it. This is the only Html Agility Pack example I have, but I found this blog post very useful in helping me work out the correct XPath for this project:
http://codingfields.com/guides/htmlagilitypack/

var node = document.DocumentNode.SelectSingleNode("//*[@class='STR_Image']//img");
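To try the XPath without hitting the live site, here is a minimal sketch that parses an HTML string instead. The class name StripParser and the method FindImageSrc are hypothetical names made up for illustration, and the markup you pass in is only a stand-in for the real page. It also shows that SelectSingleNode returns null when nothing matches, so the result is worth checking before it is used.

```csharp
using System;
using HtmlAgilityPack;

public static class StripParser
{
    // Returns the img src inside the STR_Image element, or null if there is no match.
    public static string FindImageSrc(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);   // parses a string instead of downloading a page

        var node = document.DocumentNode.SelectSingleNode("//*[@class='STR_Image']//img");
        if (node == null)
            return null;           // no STR_Image element, or no img inside it

        // GetAttributeValue falls back to the second argument when src is missing.
        var src = node.GetAttributeValue("src", String.Empty);
        return src.Length == 0 ? null : src;
    }
}
```

Passing in a fragment such as <div class="STR_Image"><img src="/dyn/str_strip/192230.strip.gif" /></div> should give back the src value, while markup without the div gives null. Note that @class='STR_Image' only matches when the class attribute is exactly that string, so if the site ever added a second class to the div the XPath would need adjusting.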

Get the Source for the image

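string imageUrl = String.Empty;
// SelectSingleNode returns null if the page layout has changed.
if (node == null)
    throw new Exception("No STR_Image img tag found!");
var imageSrc = node.Attributes["src"];
// The attribute itself is null when the img tag has no src.
if (imageSrc == null || String.IsNullOrEmpty(imageSrc.Value))
    throw new Exception("No image src found!");
imageUrl = "http://www.dilbert.com" + imageSrc.Value;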
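Joining the site address and the src with + works as long as the src is a root-relative path like the one shown earlier. As a hedged alternative, the sketch below lets System.Uri do the joining, which also copes with a src that is already an absolute URL. The class name ImageUrlResolver and the method ToAbsolute are hypothetical names made up for illustration.

```csharp
using System;

public static class ImageUrlResolver
{
    private static readonly Uri SiteRoot = new Uri("http://www.dilbert.com/");

    // Resolves the img src against the site root; absolute URLs pass through unchanged.
    public static string ToAbsolute(string src)
    {
        if (String.IsNullOrEmpty(src))
            throw new ArgumentException("src must not be empty.", "src");

        return new Uri(SiteRoot, src).AbsoluteUri;
    }
}
```

With this in place, the last line of the step above could become imageUrl = ImageUrlResolver.ToAbsolute(imageSrc.Value);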

Load the image on the page.

imgDilbert.ImageUrl = imageUrl;

Once deployed it will look like this.

My sample code, which also includes a bit of error handling and caching, can be found on my SkyDrive (or whatever Microsoft has called it since I wrote this blog).
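For readers who cannot reach the download, here is one possible shape for the caching part. This is only a sketch under my own assumptions and is not necessarily how the sample does it: the class name StripCache, the method GetOrAdd and the idea of keying on the yyyy-MM-dd date string are all made up for illustration. The point is simply that the Dilbert page only needs to be scraped once per date rather than on every page load.

```csharp
using System;
using System.Collections.Generic;

public static class StripCache
{
    private static readonly object Sync = new object();
    private static readonly Dictionary<string, string> UrlsByDate =
        new Dictionary<string, string>();

    // Returns the cached image URL for dateKey, calling fetch only on the first request.
    public static string GetOrAdd(string dateKey, Func<string, string> fetch)
    {
        lock (Sync)
        {
            string url;
            if (!UrlsByDate.TryGetValue(dateKey, out url))
            {
                url = fetch(dateKey);        // if fetch throws, nothing is cached
                UrlsByDate[dateKey] = url;
            }
            return url;
        }
    }
}
```

The fetch delegate would wrap the load, XPath and src steps shown earlier. The lock is held while the page is fetched, which keeps the sketch simple but makes other requests wait, and the dictionary grows by one entry per date until the application pool recycles.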

http://sdrv.ms/YYyQY3

The original Dilbert WebPart that I was using before I created my own can be found below. This uses RSS feeds, and can work in https:// environments.

http://new.amrein.com/apps/page.asp?Q=5734
