Charlottetown Building Permits in RSS
This page documents an experimental script to scrape the City of Charlottetown website to create an RSS feed of weekly building permit summaries.
The General Idea
The City of Charlottetown, where I live, publishes a weekly summary of building permits issued in the past week as a PDF file, and updates a page on its website with a link to the latest PDF. This system has existed since May of 2005, and the current index contains links to PDF files going back to January 2007.
I'm interested in following building permit activity -- it's a good way to keep track of what's being built or renovated where -- but I don't want to have to remember to check that page every week, and the City offers neither an RSS feed nor any other way of "subscribing" to the information.
I originally looked at the idea of scraping the PDF files themselves to pull out the data they contain and create an RSS feed of the actual permits issued, but the PDF files aren't consistent enough to allow for that. I settled for the notion of creating an RSS feed that would be updated every time a new weekly summary was issued.
The script documented here, then, is intended to run regularly. It grabs the HTML of the page containing the index of PDF files, uses an HTML parser to locate the URLs of the PDFs (they are consistent enough, so far, to rely on, as they all start with Weekly_permits_approved_), compares those URLs to a cache of previously-scraped URLs stored in a MySQL table, updates the table if a new PDF is found, and then creates an RSS feed that points to the 100 most recently found PDFs.
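The link-finding step can be illustrated with a minimal sketch. This is written in Python with the standard library's html.parser, not the script's own PHP and XML_HTMLSax code, and it assumes that the Weekly_permits_approved_ prefix applies to the filename part of each link. The names PermitLinkParser and find_permit_links and the example.com URLs are made up for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
import posixpath

# Filename prefix mentioned in the text; assumed to apply to the file name only.
PDF_PREFIX = "Weekly_permits_approved_"


class PermitLinkParser(HTMLParser):
    """Collect links whose file name starts with the weekly-summary prefix."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the index page's URL.
        absolute = urljoin(self.base_url, href)
        filename = posixpath.basename(urlsplit(absolute).path)
        if filename.startswith(PDF_PREFIX) and absolute not in self.links:
            self.links.append(absolute)


def find_permit_links(html, base_url):
    """Return the matching PDF URLs in document order, without duplicates."""
    parser = PermitLinkParser(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    # Hypothetical index page; the real page's markup may differ.
    sample = """
    <ul>
      <li><a href="pdfs/Weekly_permits_approved_08Jan2007.pdf">Jan 8</a></li>
      <li><a href="pdfs/Weekly_permits_approved_01Jan2007.pdf">Jan 1</a></li>
      <li><a href="/contact.html">Contact</a></li>
    </ul>
    """
    for url in find_permit_links(sample, "http://example.com/permits/index.html"):
        print(url)
```

Matching on the file name rather than the whole URL keeps the filter working whether the page uses relative or absolute links.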
With the RSS feed in place and subscribed to, my RSS newsreader now alerts me automatically whenever a new weekly summary of building permits is issued.
Download Source Code
You can grab the source code from Github.
The script has the following dependencies:
- The XML_HTMLSax PEAR package. You can probably install this with:
pear install XML_HTMLSax
- The sweet FeedCreator.class.php RSS creation class.
MySQL Table for Permit Summary Cache
You'll need a MySQL server where you can create a table to hold a cache of the weekly summaries already discovered. In a database on that server you'll need to create a table with the following structure:
CREATE TABLE `permitcache` (
  `number` int(11) NOT NULL auto_increment,
  `dateadded` datetime NOT NULL default '0000-00-00 00:00:00',
  `url` text NOT NULL,
  `heading` varchar(50) NOT NULL default '',
  `filesize` int(11) NOT NULL default '0',
  UNIQUE KEY `numberdex` (`number`)
)
You don't have to call the table permitcache -- you can set an arbitrary name for it in the script.
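How a cache table like this separates new summaries from ones already seen can be sketched as follows. This is an illustration only: it is in Python rather than PHP, and it uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server. The column names mirror the table above, but the functions open_cache and add_new_urls are hypothetical, and how the real script fills in heading and filesize is an assumption.

```python
import sqlite3
from datetime import datetime, timezone


def open_cache(path=":memory:"):
    """Open the cache (SQLite stand-in for MySQL) and create the table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS permitcache (
               number    INTEGER PRIMARY KEY AUTOINCREMENT,
               dateadded TEXT    NOT NULL,
               url       TEXT    NOT NULL,
               heading   TEXT    NOT NULL DEFAULT '',
               filesize  INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def add_new_urls(conn, found):
    """Insert any (url, heading, filesize) not yet cached; return the new URLs."""
    added = []
    for url, heading, filesize in found:
        seen = conn.execute(
            "SELECT 1 FROM permitcache WHERE url = ?", (url,)
        ).fetchone()
        if seen:
            continue  # already in the cache, nothing to do
        now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
        conn.execute(
            "INSERT INTO permitcache (dateadded, url, heading, filesize) "
            "VALUES (?, ?, ?, ?)",
            # Truncate the heading to match the 50-character column above.
            (now, url, heading[:50], filesize),
        )
        added.append(url)
    conn.commit()
    return added


if __name__ == "__main__":
    conn = open_cache()
    # Hypothetical scrape results.
    week1 = [("http://example.com/Weekly_permits_approved_01.pdf", "Week 1", 1024)]
    week2 = week1 + [("http://example.com/Weekly_permits_approved_02.pdf", "Week 2", 2048)]
    print(add_new_urls(conn, week1))  # first run: everything is new
    print(add_new_urls(conn, week2))  # second run: only the second URL is new
```

Because each run inserts only URLs that are not already present, the script can be run as often as you like without creating duplicate rows.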
To get the script running, take the following steps:
- Create the MySQL table as above, and note the hostname, database name, table name, and authentication information for the table.
- In the permits2rss.php script, modify the user-configurable options:
- Set the MySQL server, database and table information.
- Change the URL of the web page containing the index of weekly permit summaries, if required.
- Change the variables used to store information to be embedded in the RSS feed, if required.
- Set the location for the RSS feed to be created; this should be a web-accessible file location if you want to make the RSS feed public.
If all goes according to plan, you should now be able to run the script as follows:
...and the result should be an RSS file in the location you specified.
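For a rough idea of what such an RSS file contains, here is a minimal sketch that builds an RSS 2.0 document with one item per cached PDF, capped at 100 items as described earlier. It uses Python's standard library rather than FeedCreator.class.php, so the exact output of the real script will differ. The function build_rss, its parameters, and the example.com URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

MAX_ITEMS = 100  # the feed points to the 100 most recent PDFs


def build_rss(title, link, description, items):
    """Build an RSS 2.0 string.

    items is a list of (heading, url, added) tuples, newest first,
    where added is a timezone-aware datetime.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for heading, url, added in items[:MAX_ITEMS]:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = heading
        ET.SubElement(item, "link").text = url
        # The PDF URL doubles as a stable unique identifier for the item.
        ET.SubElement(item, "guid").text = url
        # RSS 2.0 expects RFC 822 style dates.
        ET.SubElement(item, "pubDate").text = format_datetime(added)
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical cache contents.
    items = [
        ("Permits, week of January 8",
         "http://example.com/Weekly_permits_approved_08Jan2007.pdf",
         datetime(2007, 1, 8, tzinfo=timezone.utc)),
        ("Permits, week of January 1",
         "http://example.com/Weekly_permits_approved_01Jan2007.pdf",
         datetime(2007, 1, 1, tzinfo=timezone.utc)),
    ]
    print(build_rss("Building Permits", "http://example.com/permits/",
                    "Weekly building permit summaries", items))
```

Using the PDF URL as the guid means a newsreader treats each weekly summary as a distinct item and does not flag it again when the feed is regenerated.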
To be useful, the script should be set up to run regularly -- perhaps once a day -- as a cron job, so that the RSS file is always current.
I've set up a Test RSS Feed. Examine the contents of this feed if you want to see the intended result of the script.