Charlottetown Building Permits in RSS

From rukapedia
Revision as of 21:06, 25 February 2006 by Peter (talk | contribs)

This page documents an experimental script to scrape the City of Charlottetown website to create an RSS feed of weekly building permit summaries.

The General Idea

Every week the City of Charlottetown, where I live, publishes a summary of the building permits issued in the past week as a PDF file, and updates a page on its website with a link to the latest PDF. This system has existed since May 2005, and the index contains links to PDF files going back to May 21, 2005.

I'm interested in following building permit activity -- it's a good way to keep track of what's being built or renovated where -- but I don't want to have to remember to go and check that page every week, and the City provides no way of "subscribing" to the information, and no RSS feed for the information.

I originally looked at the idea of scraping the PDF files themselves to pull out the data and create an RSS feed of the actual permits issued, but the PDF files aren't consistent enough to allow for that. I settled for the notion of creating an RSS feed that would be updated every time a new weekly summary was issued.

The script documented here, then, is intended to be run regularly. It grabs the HTML of the page containing the index of PDF files, uses an HTML parser to locate the URLs of the PDFs (they are consistent enough, so far, to rely on, as they all start with weekly_permits_approved_), compares the URLs it finds to a cache of previously-scraped URLs stored in a MySQL table, updates the table if a new PDF is found, and then creates an RSS feed that contains pointers to the 36 most recent PDFs found.
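The parse-and-compare step might look roughly like the sketch below. It is not the code from permits2rss.php: it assumes PHP's built-in DOMDocument as the HTML parser (the page doesn't say which parser the script uses), the function name extractPermitLinks is hypothetical, the sample HTML and its file paths are made up, and a plain array stands in for the MySQL cache table.

```php
<?php
// Made-up sample standing in for the City's index page.
$html = <<<HTML
<html><body>
<a href="/files/weekly_permits_approved_2006_02_18.pdf">Week ending February 18, 2006</a>
<a href="/files/weekly_permits_approved_2006_02_11.pdf">Week ending February 11, 2006</a>
<a href="/contact.html">Contact</a>
</body></html>
HTML;

// Hypothetical helper: return [href => link text] for every anchor whose
// file name starts with the given prefix.
function extractPermitLinks(string $html, string $prefix = 'weekly_permits_approved_'): array
{
    $doc = new DOMDocument();
    // Real-world HTML rarely validates, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (strpos(basename($href), $prefix) === 0) {
            $links[$href] = trim($anchor->textContent);
        }
    }
    return $links;
}

// Stand-in for the URLs already stored in the cache table.
$cached = ['/files/weekly_permits_approved_2006_02_11.pdf'];

$found = extractPermitLinks($html);
// Anything found on the page but absent from the cache is a new summary.
$new = array_values(array_diff(array_keys($found), $cached));
print_r($new);
```

In the real script the new URLs would then be inserted into the cache table before the feed is regenerated.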

With the RSS feed in place and subscribed to, I can now get automatically alerted in my RSS newsreader whenever a new index of building permits is issued.

Download Source Code

You can browse the source code from a browser, or grab the latest version of the source code using Subversion at:

svn://svn.reinvented.net/CharlottetownRSS/BuildingPermits

or

http://svn.reinvented.net/CharlottetownRSS/BuildingPermits

Dependencies

The script has the following dependencies:

MySQL Table for Permit Summary Cache

You'll need a MySQL server where you can create a table to hold a cache of weekly summaries already discovered. In a database on that server you'll need to create a table with the following structure:

CREATE TABLE `permitcache` (
  `number` int(11) NOT NULL auto_increment,
  `dateadded` datetime NOT NULL default '0000-00-00 00:00:00',
  `url` text NOT NULL,
  `heading` varchar(50) NOT NULL default '',
  `filesize` int(11) NOT NULL default '0',
  UNIQUE KEY `numberdex` (`number`)
);

You don't have to call the table permitcache -- you can set an arbitrary name for it in the script.

Install

To get the script running, take the following steps:

  1. Create the MySQL table as above, and note the hostname, database name, table name, and authentication information for the table.
  2. In the permits2rss.php script, modify the user-configurable options:
    1. Set the MySQL server, database and table information.
    2. Change the URL of the web page containing the index of weekly permit summaries, if required.
    3. Change the variables used to store information to be embedded in the RSS feed, if required.
    4. Set the location for the RSS feed to be created; this should be a web-accessible file location if you want to make the RSS feed public.

If all goes according to plan, you should now be able to run the script as follows:

php ./permits2rss.php

...and the result should be an RSS file in the location you specified.
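For a sense of what goes into that file, here is a minimal sketch of assembling an RSS 2.0 document from cached rows. It is an illustration rather than the script's actual code: the function name buildPermitFeed, the channel values and the example.com URLs are all hypothetical, the row keys simply mirror the heading, url and dateadded columns of the cache table, and the rows are assumed to arrive newest first.

```php
<?php
date_default_timezone_set('UTC');

// Hypothetical helper: build an RSS 2.0 string from at most $limit rows.
function buildPermitFeed(array $rows, array $channelInfo, int $limit = 36): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $rss = $doc->appendChild($doc->createElement('rss'));
    $rss->setAttribute('version', '2.0');
    $channel = $rss->appendChild($doc->createElement('channel'));

    // Text nodes let DOMDocument escape characters such as & and <.
    $addText = function (DOMNode $parent, string $name, string $value) use ($doc): void {
        $parent->appendChild($doc->createElement($name))
               ->appendChild($doc->createTextNode($value));
    };

    foreach (['title', 'link', 'description'] as $name) {
        $addText($channel, $name, $channelInfo[$name]);
    }

    // Keep only the most recent entries (rows are assumed newest first).
    foreach (array_slice($rows, 0, $limit) as $row) {
        $item = $channel->appendChild($doc->createElement('item'));
        $addText($item, 'title', $row['heading']);
        $addText($item, 'link', $row['url']);
        $addText($item, 'guid', $row['url']);
        $addText($item, 'pubDate', date(DATE_RSS, strtotime($row['dateadded'])));
    }

    return $doc->saveXML();
}

// Made-up data: 40 cached summaries, newest first.
$rows = [];
for ($i = 40; $i >= 1; $i--) {
    $rows[] = [
        'heading'   => "Weekly summary $i",
        'url'       => "http://example.com/weekly_permits_approved_$i.pdf",
        'dateadded' => '2006-02-25 21:06:00',
    ];
}

$xml = buildPermitFeed($rows, [
    'title'       => 'Permits & Renovations',
    'link'        => 'http://example.com/',
    'description' => 'Weekly building permit summaries',
]);
echo $xml;
```

Writing the result to the configured feed location would then be a single file_put_contents() call.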

For the feed to be useful, you should set the script up to run regularly -- perhaps once a day -- as a cron job, so that the RSS file is always current.
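As an illustration only, a crontab entry for a once-a-day run could look like the line below. The 6:00 AM schedule and both paths are assumptions; substitute the location of your own php binary and of the script.

```
# Run the scraper every day at 6:00 AM (paths are placeholders)
0 6 * * * /usr/bin/php /path/to/permits2rss.php
```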

Example

I've set up a Test RSS Feed. Examine the contents of this feed if you want to see the intended result of the script.

Bugs and To Do and Updates

I believe I've solved the problem with file_get_contents returning HTTP/1.1 500 Internal Server Error, by changing the code that gets the remote web page to:

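$handle = fopen($permitsurl, 'rb');   // open the remote page as a stream
if ($handle === false) { die("Could not open $permitsurl\n"); }
$html = stream_get_contents($handle); // read the whole page into a string
fclose($handle);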