Data.gov uses CKAN to store all its data. That means there’s a fairly robust open source engine underlying data.gov, and it exposes an API that should make getting data out of it a fairly easy exercise. There should be no need to scrape. The operative word is should.

Want a group list? That works fine with a simple curl request:
curl -O http://catalog.data.gov/api/3/action/group_list -d '{}'
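
For anyone scripting this, here is a minimal sketch of the same call wrapped in shell functions. CKAN action responses are JSON with a "success" field, so the sketch checks that before trusting the output. The helper names (ckan_url, ckan_ok, ckan_get) are made up for illustration, the 60 second cap is arbitrary, and the grep check is a crude stand-in for a real JSON parser. It also uses a plain GET rather than the POST that -d triggers, on the assumption that read-only actions accept both.

```shell
# Base URL taken from the curl commands in this post.
CKAN_API="http://catalog.data.gov/api/3/action"

# Build the URL for a CKAN action name such as "group_list".
# (ckan_url is a hypothetical helper, not part of CKAN.)
ckan_url() {
  printf '%s/%s\n' "$CKAN_API" "$1"
}

# Succeed only if the JSON read from stdin contains "success": true.
# This is a crude text match, assumed good enough for a quick check.
ckan_ok() {
  grep -Eq '"success"[[:space:]]*:[[:space:]]*true'
}

# Fetch an action over GET, giving up after 60 seconds and
# returning a non-zero status on HTTP errors instead of an error page.
ckan_get() {
  curl -sS --fail --max-time 60 "$(ckan_url "$1")"
}
```

With those defined, something like `ckan_get group_list > group_list.json && ckan_ok < group_list.json` should tell you whether the call really succeeded.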

Want a package list? That doesn’t work as well. The command should be very similar:
curl -O http://catalog.data.gov/api/3/action/package_list -d '{}'

The result is a long wait followed by an error page. Based on the returned error, it looks like CKAN doesn’t have enough resources to build the package list. That means getting the data set list is going to require something a bit messier than a simple API call, at least until the data.gov team fixes its API problem.
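
One possible shape for that messier approach is to page through CKAN's package_search action, which takes rows and start parameters and reports a total count alongside each page of results. This is only a sketch, not the workaround the data.gov team has in mind: whether package_search holds up on data.gov any better than package_list is an assumption, the helper names (page_offsets, search_url, list_package_names) are invented here, the page size of 100 is arbitrary, and it needs jq installed.

```shell
# Base URL taken from the curl commands in this post.
CKAN_API="http://catalog.data.gov/api/3/action"

# Print the start offsets needed to cover $1 records, $2 at a time.
page_offsets() {
  total=$1 rows=$2 start=0
  while [ "$start" -lt "$total" ]; do
    echo "$start"
    start=$((start + rows))
  done
}

# Build a package_search URL for one page: $1 = rows, $2 = start.
search_url() {
  printf '%s/package_search?rows=%s&start=%s\n' "$CKAN_API" "$1" "$2"
}

# Print every data set name, one page per request (needs curl and jq).
list_package_names() {
  rows=${1:-100}
  # Ask for a single record just to read the total count.
  total=$(curl -sS --fail --max-time 60 "$(search_url 1 0)" |
    jq -r '.result.count')
  # Bail out if the count is missing or not a plain number.
  case $total in
    ''|*[!0-9]*) return 1 ;;
  esac
  for start in $(page_offsets "$total" "$rows"); do
    curl -sS --fail --max-time 60 "$(search_url "$rows" "$start")" |
      jq -r '.result.results[].name' || return 1
  done
}
```

Running `list_package_names 100 > packages.txt` would then collect the names in small requests, so a single slow or failed page doesn't sink the whole listing the way one giant package_list call does.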

Update: This is open issue #295 here. It’s been open about a month at time of writing, but after I posted my update with information from comments (thanks again, PenGun) I got a quick response and a promise of a workaround on the way to an eventual fix.
