Claromentis Web Crawler

April 3rd, 2009 by David Sanders

I’ve recently been tasked with developing a web-crawler for our clients at the Savannah Riverkeeper organisation. Their conservation efforts mean they have been manually looking through a long list of web-pages for new PDF documents of applications regarding the Georgia Savannah area, and this is something they wished to automate. This is the result:

[Screenshot: web_crawler]

The application will crawl a URL to a specified link-depth and automatically import new documents into a folder in Claromentis’ Document Manager. It also sends the relevant users notifications of new documents, and any problems encountered. This should really cut down the time needed for our clients to sift a range of sites for relevant new documents and notices.
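To give a feel for how a depth-limited crawl like this can work, here is a minimal sketch in Python. It is not the code of the Claromentis application itself: the names `crawl_for_pdfs`, `LinkExtractor`, `fetch` and `already_imported` are all made up for illustration, and the import into the Document Manager and the user notifications are left out. The `fetch` argument is assumed to be any function that takes a URL and returns the page HTML, or `None` if the page could not be retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_for_pdfs(start_url, max_depth, fetch, already_imported=()):
    """Breadth-first crawl up to max_depth links away from start_url.

    fetch is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns (new PDF URLs, URLs that could not be fetched).
    """
    seen_pages = {start_url}
    known_pdfs = set(already_imported)
    new_pdfs = []
    problems = []
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            # Remember the failure so it can be reported to users later.
            problems.append(url)
            continue

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.lower().endswith(".pdf"):
                if link not in known_pdfs:
                    known_pdfs.add(link)
                    new_pdfs.append(link)
            elif depth < max_depth and link not in seen_pages:
                seen_pages.add(link)
                queue.append((link, depth + 1))

    return new_pdfs, problems
```

Passing `fetch` in as a parameter keeps the sketch easy to try out: a dictionary's `get` method works as a stand-in for real HTTP requests, so the link-depth logic can be checked without touching anybody's server.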

One question I was asked is whether it is legal and/or responsible to trawl other organisations’ sites in this way. A web server has no way of knowing what sort of client is connecting apart from the User Agent (how the client identifies itself to the server) and the IP address, so server administrators really have to take it as a given that their sites will be crawled on a regular basis. Anyone who’s looked at a website access log will know that Google, Yahoo! and Microsoft are constantly on the crawl for new content for their search engines to index. So legal, definitely yes, and in this instance the end certainly justifies the means.
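As an illustration of the User Agent point, this is roughly how a crawler can announce itself to the server, sketched with Python's standard `urllib`. The identity string and the `build_request` helper are hypothetical and are not what the application described here actually sends.

```python
from urllib.request import Request

# Hypothetical identity string, made up for this example.
CRAWLER_USER_AGENT = "ExampleDocCrawler/0.1 (+http://example.org/crawler-info)"


def build_request(url, user_agent=CRAWLER_USER_AGENT):
    """Return a Request that identifies the crawler via the User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})
```

A request built this way shows up in the remote site's access log under that name, so an administrator can at least see who is crawling and how often.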

Using the Claromentis Framework and other FOSS (Free/Open Source Software) tools for these kinds of mini-applications makes Rapid Application Development easily possible, and I was able to get a working prototype and interface ready in plenty of time to meet the client’s requirements.

And who knows, maybe I’ll have helped save an important, useful and beautiful stretch of river in the process.


  1. April 4th, 2009 at 10:23 | #1

    Dave,

    This is a really creative application, which I believe will bring a lot of benefits to our clients.

    Good job!

  2. Matt
    May 29th, 2009 at 15:56 | #2

    Am interested in finding out more about your web-crawler.
