Scraping the world (How to screen scrape/rip images off the web)

I recently had to get some info off the web for one of my pet projects and found that there was no web service available to provide the data I was in search of. So I decided to look for sites that could provide me with up-to-date data and then scrape it off there for my own consumption (not the nicest thing to do, I know, but it was the only way; I will not mention the sites at this point). The project consists of 2 parts, namely:

  1. The scraper service that mines the data and pushes only new, relevant data into my data store
  2. The user interface to administer the data and consume it in a functional way.

The idea is still to consume the data in a Silverlight application, or maybe to utilize ASP.NET with some other funky features from WCF REST and Data Services, but the focus is on rapid building, as I want to make use of the app quickly and not spend too much time getting it up.

Tech used in scraper so far :

  1. .NET 4
  2. SubSonic (probably one of the best open source projects I've used)
  3. Silverlight 4
  4. RegEx and LINQ for parsing and getting the "STUFF" into the right format
  5. SGML Reader library by MindTouch to get the HTML well formed
  6. Some other things that I maybe forgot to mention 🙂

Anyhoo, there have been some challenges in scraping the data, as a lot of the sites did not comply with XHTML standards, making it difficult to consume the data (hence SGML Reader coming to the rescue there). String manipulation is OK to do, but I wanted a structured way of working with the data I get back, and to be able to build queries with LINQ. So I opted for getting the data in well-formed XML and then working from there.
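To give an idea of that step, here is a minimal sketch of running messy HTML through SgmlReader and then querying the result with LINQ to XML. The URL and the element I pull out are placeholders, not the real site or data I scraped:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Xml.Linq;
using Sgml;

public static class ScrapeSketch
{
    public static void Run()
    {
        // Placeholder URL - stands in for the actual site being scraped.
        string html;
        using (var client = new WebClient())
            html = client.DownloadString("http://example.com/listing");

        // SgmlReader turns non-XHTML markup into well-formed XML.
        var sgml = new SgmlReader
        {
            DocType = "HTML",
            CaseFolding = CaseFolding.ToLower,
            InputStream = new StringReader(html)
        };

        XDocument doc = XDocument.Load(sgml);

        // Now ordinary LINQ to XML queries work against the cleaned-up markup.
        var imageUrls = doc.Descendants("img")
                           .Select(img => (string)img.Attribute("src"))
                           .Where(src => !string.IsNullOrEmpty(src));

        foreach (var url in imageUrls)
            Console.WriteLine(url);
    }
}
```

Once the markup is well formed, the "structured way of working with the data" is just standard LINQ to XML, which is the whole point of the SgmlReader step.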

Eventually, after pulling the needed info out of the XML into a workable object and pushing that to my store, I moved on to the next task. The data contained URLs to images that I needed to store within my DB, as the app should not rely on the online images; that meant pulling the images off the web, serializing them to binary data and then storing them as an associated record item in the DB. DEFINITELY NOT AS EASY AS I THOUGHT. But I eventually got it right. Here is a little snippet for converting the image from a URL to raw bytes, and back again when retrieved from the data source.

(NOTE: This is quick and dirty, as I am purely building a POC for myself, so no guarantees that this will work for you. I am using log4net, so "_log" references the global logger object of my application; you can take that out and replace the error handling in whatever way you need.)

Image Manipulation Snippet

    /// <summary>
    /// Downloads the image and returns it as raw bytes.
    /// </summary>
    /// <param name="url">The image URL.</param>
    public byte[] DownloadRawImage(string url)
    {
        using (Image imageIn = this.DownloadImage(url))
        using (MemoryStream ms = new MemoryStream())
        {
            // Note: this re-encodes everything as GIF; use imageIn.RawFormat
            // instead if you want to preserve the original format.
            imageIn.Save(ms, System.Drawing.Imaging.ImageFormat.Gif);
            return ms.ToArray();
        }
    }

    /// <summary>
    /// Converts a byte array back to an image.
    /// </summary>
    /// <param name="byteArrayIn">The serialized image bytes.</param>
    public Image byteArrayToImage(byte[] byteArrayIn)
    {
        // Do not dispose this stream here: GDI+ requires it to stay
        // open for the lifetime of the returned Image.
        MemoryStream ms = new MemoryStream(byteArrayIn);
        return Image.FromStream(ms);
    }

    /// <summary>
    /// Downloads an image from a URL, retrying up to five times.
    /// </summary>
    /// <param name="url">The image URL.</param>
    public Image DownloadImage(string url)
    {
        Image tmpImage = null;
        int retries = 5;
        while (retries > 0)
        {
            try
            {
                // Open a connection
                HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(url);
                if (!string.IsNullOrEmpty(_proxyUser))
                {
                    if (!string.IsNullOrEmpty(_proxyDomain))
                        httpWebRequest.Proxy.Credentials = new NetworkCredential(_proxyUser, _proxyPass, _proxyDomain);
                    else
                        httpWebRequest.Proxy.Credentials = new NetworkCredential(_proxyUser, _proxyPass);
                }
                httpWebRequest.AllowWriteStreamBuffering = true;
                // Request the response and read the image off the stream
                using (WebResponse webResponse = httpWebRequest.GetResponse())
                using (Stream webStream = webResponse.GetResponseStream())
                {
                    if (webStream != null)
                        tmpImage = Image.FromStream(webStream);
                }
                break; // success - stop retrying
            }
            catch (Exception ex)
            {
                _log.Error(string.Format("Exception caught in process: {0}", ex));
                _log.Info(string.Format("Retry in process for: {0}", url));
                retries -= 1;
            }
        }
        // May still be null if all retries failed.
        return tmpImage;
    }
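For completeness, a round trip with the snippet might look like this. The URL is a placeholder, and in the real app the bytes go into a binary column on the associated record rather than sitting in a local variable:

```csharp
// Hypothetical usage - the URL stands in for a scraped image link.
byte[] raw = DownloadRawImage("http://example.com/images/logo.gif");
// ... store raw as the binary field of the associated DB record ...

// Later, when the record is read back out of the data store:
Image restored = byteArrayToImage(raw);
```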

The number of records I have to build from the scraped data is also quite large, and processing them takes a long time. My plan is to convert some of my scraping code to make use of the parallel framework in .NET to see what the impact on the results will be.
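If I go that route, the change is probably little more than swapping the download loop for Parallel.ForEach. A sketch, assuming an imageUrls collection and a SaveToStore method standing in for the real persistence code; capping MaxDegreeOfParallelism is worth doing so the target site isn't hammered:

```csharp
using System.Threading.Tasks;

// Sketch: parallelise the download loop with the .NET 4 parallel framework.
// imageUrls and SaveToStore are assumptions standing in for the real code.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(imageUrls, options, url =>
{
    byte[] raw = DownloadRawImage(url);
    SaveToStore(url, raw);
});
```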

Hope this is useful to someone.

Drop me a comment if you found this of interest or you maybe have something similar you are doing.


Serialization in dotnet the easy way

Many times I have been asked, "How can I easily serialize my objects in .NET to an XML representation, or to JSON objects to be injected or called from the client side?" Well, there are surely a lot of ways to skin a cat, but my little sample shows you how to do it very easily using generics.

Usually when you work with serialized objects in some form of string representation, the difficulty comes in with the encoding of characters, so that the data can actually be interpreted correctly, or with formatting, especially of things like dates (to ensure that they can be parsed). In my sample I have created a simple console app that shows how the serialization methods are created using generics and the .NET contract serializer.
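A generic pair of helpers along those lines might look like this. This is a sketch, not the downloadable sample itself, using DataContractSerializer for XML and DataContractJsonSerializer for JSON; the Person type is purely illustrative:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

public static class SerializerHelper
{
    // Serialize any data-contract type to an XML string.
    public static string ToXml<T>(T obj)
    {
        var serializer = new DataContractSerializer(typeof(T));
        using (var ms = new MemoryStream())
        {
            serializer.WriteObject(ms, obj);
            return Encoding.UTF8.GetString(ms.ToArray());
        }
    }

    // Serialize any data-contract type to a JSON string.
    public static string ToJson<T>(T obj)
    {
        var serializer = new DataContractJsonSerializer(typeof(T));
        using (var ms = new MemoryStream())
        {
            serializer.WriteObject(ms, obj);
            return Encoding.UTF8.GetString(ms.ToArray());
        }
    }
}

// Illustrative data-contract type for the helpers above.
[DataContract]
public class Person
{
    [DataMember] public string Name { get; set; }
    [DataMember] public DateTime Born { get; set; }
}
```

The contract serializers handle the character-encoding and date-formatting concerns mentioned above (dates come out in a round-trippable ISO-style form), which is exactly why I prefer them over hand-rolled string building.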

[ Download the Code ]