2010 Census: A pain in my block

I’ve written about my very limited exploits in exploring and leveraging 2010 census data in the past, but this weekend I finally got back around to doing something about it. What I ended up with looks like a pretty useful tool for anyone who needs to project census data onto arbitrary geographies.

A quick recap

For those who didn’t read the first couple of posts about why I’m going through all this trouble in the first place, here’s the story so far:

I work for the Bahá’í Faith. The administration of Bahá’í communities across the continental United States is broken down into what we call Bahá’í Localities. These can sometimes be mapped one-to-one onto an existing geography, such as a municipal boundary or a county, but oftentimes they are combinations or subdivisions of those areas, which makes the task of projecting census data onto them somewhat tricky. After I wrote about getting the census.ire.org Django project set up on my local VM, I got a Twitter shout-out from Joe Germuska over at the Chicago Tribune suggesting that I take a look at building my shapes out of the basic kernel of all census shapes: the block. So, in case the title of this post doesn’t give it away, that’s what I did.

Re-purposing the guts of the IRE app

Once I got the census.ire.org project set up, I kinda sat back and said to myself, “uh, OK, now what…”. My first approach was actually to forget about all the juicy goodness in there and try to just get what I needed from the API that they offer. Problem is, block level data is not available through the API (which I suppose is understandable given that there are like, 8,000,000 census blocks in the U.S.).

Since no one else had the data, it was starting to look like I would have to build it myself (*gulp*). In reading the docs for the IRE Django app, the parts that intrigued me the most were the ones that actually fetched census data and loaded it someplace. The someplace, in their case, was a MongoDB database which, at the end of the day, got dumped out and turned into flat JSONP files that were then uploaded to an S3 bucket so that schmoes like me could access them through the API. The problem with that, at least for me, was that I needed some way of comparing my arbitrary shapes with the census block shapes. What I needed, in a nutshell, was a PostGIS database with both my arbitrary shapes and the census blocks. Lucky for me, I’ve done that before.

So, what it came down to for me was to basically take what they were doing, rip out the MongoDB parts, put in PostGIS parts, and run some spatial lookups. Easy, right? Well, thanks to the work that was already done, yeah. But really, really boring.

Before the boring part, the code

OK, now we’ll get into the meaty part where I show you what I did. My basic approach, like the IRE app’s, was to write a series of scripts that execute the various steps in the process. The real brains of what’s going on are cribbed directly from the IRE app, so I’d literally be nowhere without them. If you’re interested in seeing what they did before going any further, the hub around which everything else revolves is called batch_sf.sh and lives here.

The basic steps I came up with were slightly different from the ones IRE uses, for a few reasons: all I wanted was block level data, I needed to know what area on the surface of the Earth the census data related to, and I didn’t need all the census tables (I cherry-picked 14 that seemed to be what I was after). The steps I followed, in order, are:

  • Load the block level geospatial data into a PostGIS database
  • Fetch the census data, generate the table headers and save it as CSV
  • Create a cross reference table that facilitates a relation between the GIS data and the census data
  • Load the block level census data into already prepared Django models for the 14 tables that I wanted
  • Finally, make the spatial relations between my arbitrary geographies and census blocks

One word of caution before doing any of this stuff: working with block level data takes a huge amount of patience. When I got my batch script working, I ran it against Connecticut and it took 7 hours to complete. And that’s one of the small states with only 67,000+ census blocks. Illinois has something like 450,000+ census blocks so the scale here gets pretty astronomical pretty fast. When I get the final wrinkles ironed out, the next phase will probably involve setting this stuff up on a cloud server, just letting it crank and then coming up with a way of serializing the output into something that I can use without ever having to run this stuff again until the 2020 census.

Load GIS data

This was actually remarkably simple thanks to the GeoDjango LayerMapping tool. All it took was getting the data from here and stuffing it into a Django model. Here’s the model I went with:
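It ends up looking something along these lines (the field names track the attributes in the 2010 TIGER/Line block shapefiles; Locality is the model I keep the Bahá’í localities in, and CensusXRef is the cross reference table I’ll get to in a bit):

# models.py -- runs under GeoDjango; Locality is defined elsewhere in the app.
from django.contrib.gis.db import models

class CensusXRef(models.Model):
    # Throwaway cross reference described below: FILEID + STUSAB + LOGRECNO
    # uniquely identify a record in the SF1 files.
    fileid = models.CharField(max_length=6)
    stusab = models.CharField(max_length=2)
    logrecno = models.CharField(max_length=7)

class CensusBlock(models.Model):
    # Relations described above; they stay null until later steps fill them in.
    locality = models.ForeignKey('Locality', null=True, blank=True)
    xref = models.ForeignKey(CensusXRef, null=True, blank=True)

    # More or less a one-to-one mapping of the TIGER/Line block attributes.
    statefp10 = models.CharField(max_length=2)
    countyfp10 = models.CharField(max_length=3)
    tractce10 = models.CharField(max_length=6)
    blockce10 = models.CharField(max_length=4)
    geoid10 = models.CharField(max_length=15, db_index=True)
    name10 = models.CharField(max_length=10)
    aland10 = models.FloatField(null=True)
    awater10 = models.FloatField(null=True)
    mpoly = models.MultiPolygonField(srid=4326)

    objects = models.GeoManager()

    def __unicode__(self):
        return self.geoid10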

The locality field is a ForeignKey to the model where I store the Bahá’í localities (my arbitrary geographies). The xref field is a ForeignKey to a cross reference table that, at least while it’s loading, gives us the ability to relate a place on the surface of the earth to the number of people that live there. Other than that, it’s pretty much a one-to-one mapping of the GIS data that you get from the Census Bureau. Here’s my LayerMapping script:
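In sketch form, it’s not much more than this (tl_2010_09_tabblock10.shp is the Connecticut block shapefile; the app name is a stand-in, and the script assumes your Django settings are already on the path):

# load_blocks.py -- needs DJANGO_SETTINGS_MODULE set, or wrap it in a
# management command. Usage: python load_blocks.py tl_2010_09_tabblock10.shp
import sys

from django.contrib.gis.utils import LayerMapping
from myapp.models import CensusBlock  # "myapp" is a stand-in for the real app

# Model field -> shapefile attribute.
block_mapping = {
    'statefp10': 'STATEFP10',
    'countyfp10': 'COUNTYFP10',
    'tractce10': 'TRACTCE10',
    'blockce10': 'BLOCKCE10',
    'geoid10': 'GEOID10',
    'name10': 'NAME10',
    'aland10': 'ALAND10',
    'awater10': 'AWATER10',
    'mpoly': 'MULTIPOLYGON',
}

def load(shapefile_path):
    lm = LayerMapping(CensusBlock, shapefile_path, block_mapping,
                      transform=True, encoding='latin-1')
    # strict=False keeps the run going if an individual feature fails to load.
    lm.save(strict=False, verbose=True)

if __name__ == '__main__':
    load(sys.argv[1])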

As I mentioned above, this is something that can take an insane amount of time to complete. If you’re having trouble getting it to finish the whole job, the LayerMapping tool does support passing a fid_range keyword argument into the .save() method with a tuple containing the indexes of the first and last features you want to load. That gives you a way of resuming should something get frozen or die on you. On a test run, I was able to load all 450,000+ shapes for the state of Illinois into my Ubuntu VMs with relatively limited resources (512MB RAM for app server and 1024MB RAM for db server) on my Intel i5 MacBook Pro. It finished. Eventually. It helps if you don’t watch.
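So, if a run does get frozen, picking it back up looks something like this (the feature numbers here are made up; Illinois is state FIPS 17):

# Resuming a stalled Illinois load at feature 250,000.
from django.contrib.gis.utils import LayerMapping
from load_blocks import block_mapping  # the mapping dict from the script above
from myapp.models import CensusBlock

lm = LayerMapping(CensusBlock, 'tl_2010_17_tabblock10.shp', block_mapping,
                  transform=True, encoding='latin-1')
lm.save(fid_range=(250000, 453000), strict=False, verbose=True)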

Fetching census data

This part basically grabs the raw data from the Census Bureau, does some basic cleanup and generates the table headers (using this script from IRE) so that the next step is easier. I pretty much ripped this part off directly from this script within the IRE project. Hey, they told me to steal it. One thing I want to point out from within that file is the last line:

in2csv -e "latin-1" -f fixed -s ${DATAPROCESSING_DIR}/census2010_geo_schema.csv ${STATE_NAME_ABBR}geo2010.sf1 > ${STATE_NAME_ABBR}geo2010.csv

This leverages the ever useful csvkit to take the census2010_geo_schema.csv file and add it as a header to the info about the actual geographic areas that are part of the raw census data you downloaded in the first part of the script. This becomes kind of the lynchpin when we get to the part where we’re loading the census tables into Django models.

Make cross reference table

The Census Bureau provides the census data in what I consider to be relatively cryptic form. I suppose it’s all in the interest of making it portable and platform agnostic but for me it was somewhat of a headache to make heads or tails of what was going on there. Again, if it weren’t for the work put in on the IRE project, I would be nowhere.

The first part of the headache was trying to relate the geospatial data that we’ve already loaded to the census tables that we’re about to load, so that when you query a place, you know how many people live there. The way the IRE project accomplished this was by building a sort of throwaway cross reference relation that both the data about a place and the census data about that place have in common. Which is trickier than it sounds, for reasons that are totally unknown to me. Since MongoDB is a non-relational database, the process the IRE project had to go through involved making a key that was stored when loading the info about a place and could be referred to later when loading the census data about that place. My approach was to build a table that each geography would have a ForeignKey to, and then look up that geography using that relation when I loaded the census data. It has pretty much the same effect as the IRE approach, only it’s relational rather than non-relational. Anyways, here’s the code I used to make the cross reference table:
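In rough form (config is the IRE module I mention below, and the models are the ones sketched out earlier):

# make_xref.py -- usage: python make_xref.py ctgeo2010.csv
import csv
import sys

import config  # copy of the IRE config module, with the block SUMLEV enabled
from myapp.models import CensusBlock, CensusXRef

def run(geo_csv_path):
    for row_dict in csv.DictReader(open(geo_csv_path, 'rb')):
        # Only the summary levels we care about (just blocks, in my case).
        if row_dict['SUMLEV'] not in config.SUMLEVS:
            continue

        xref, created = CensusXRef.objects.get_or_create(
            fileid=row_dict['FILEID'],
            stusab=row_dict['STUSAB'],
            logrecno=row_dict['LOGRECNO'])

        # The block's geoid is just the FIPS codes of everything that contains
        # it, concatenated, plus the block number itself.
        geoid = (row_dict['STATE'] + row_dict['COUNTY'] +
                 row_dict['TRACT'] + row_dict['BLOCK'])
        CensusBlock.objects.filter(geoid10=geoid).update(xref=xref)

if __name__ == '__main__':
    run(sys.argv[1])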

The crux of what’s going on here is that it’s taking the [state abbrv]geo2010.csv file that you created in step two, going through it row by row, and creating a cross reference based upon the FILEID (which is basically the same for all of these since we’re just using the census summary file), the STUSAB (which is a state level identifier) and the LOGRECNO, which is basically an index number for the entry in question. We’re then associating that with a particular census block based upon its geoid, which is basically a concatenation of the various FIPS codes of the areas that contain that block plus the block number.

One thing to note is that this script uses two external modules for some of its functionality: config and utils. In my implementation, these are pretty much straight copies of this and this, respectively. I’m not using everything in there, since some of it has to do with the MongoDB parts.

The part that makes a huge difference (and took me a fair bit of time to figure out) was enabling the block SUMLEV in the config file (this is what that if row_dict['SUMLEV'] not in config.SUMLEVS bit up there is all about). When you get the code from the IRE GitHub repo, it’s not turned on by default (which is understandable). To do that, you just need to add it to the SUMLEVS list on line 16 of the config file. For good measure, I also turned all the other ones off so I wouldn’t be loading data that I wasn’t going to be using.
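For reference, the relevant bit of my copy of the config ends up looking something like this (’101’ is the block-level summary code in the SF1 geo file; the commented-out entries are just examples of the levels I switched off):

# config.py (excerpt)
SUMLEVS = [
    '101',   # block -- the only level I'm loading
    # '040', # state
    # '050', # county
    # '140', # census tract
    # '150', # block group
]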

Actually loading the census counts

Now that we’ve made it possible to cross reference the census data with the geospatial data, let’s load it up. Just so you know what we’re working with, the SF1 (Summary File 1) data that we’re loading here is broken down into 300+ tables which are divided amongst the 47 files we downloaded in step two. You can see the extensive (and rather nitpicky) nature of what that looks like in the census.ire.org app (here’s Chicago). Since I’m really only after 14 of these tables, I only built models to house 14 of them. Here’s a sample (putting the whole thing here would just be silly):
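One of the 14 looks more or less like this (the cell names follow the Census Bureau’s SF1 naming, so table P1, total population, has the single cell P0010001, and table H1, housing units, has H0010001):

# models.py (continued) -- two of the per-table models; the rest follow the
# same pattern with more cells.
class CensusTableP1(models.Model):
    geo_id = models.CharField(max_length=15, db_index=True)
    P0010001 = models.IntegerField(null=True)  # total population

class CensusTableH1(models.Model):
    geo_id = models.CharField(max_length=15, db_index=True)
    H0010001 = models.IntegerField(null=True)  # housing units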

The script used to load data into these is, again, based heavily upon what was already in the IRE project, just with the MongoDB parts taken out and the Django ORM/PostgreSQL/PostGIS parts put in. Here’s what that looks like:
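Roughly this (GET_CENSUS_TABLE and utils.parse_table_from_key are explained in the next paragraph; the column names come from the headers generated back in step two):

# load_tables.py -- usage: python load_tables.py ct000032010.csv
import csv
import sys
from collections import defaultdict

import utils  # copy of the IRE utils module (parse_table_from_key lives here)
from myapp.models import CensusBlock, CensusXRef, CensusTableP1, CensusTableH1

# SF1 table name -> the Django model that holds it (14 entries in the real thing).
GET_CENSUS_TABLE = {
    'P1': CensusTableP1,
    'H1': CensusTableH1,
}

def run(table_csv_path):
    for row_dict in csv.DictReader(open(table_csv_path, 'rb')):
        # Find the block this row describes via the cross reference table.
        try:
            xref = CensusXRef.objects.get(fileid=row_dict['FILEID'],
                                          stusab=row_dict['STUSAB'],
                                          logrecno=row_dict['LOGRECNO'])
            block = CensusBlock.objects.get(xref=xref)
        except (CensusXRef.DoesNotExist, CensusBlock.DoesNotExist):
            continue  # not a block-level record

        # Group the row's cells by which SF1 table they belong to.
        values = defaultdict(dict)
        for col, cell in row_dict.items():
            if col in ('FILEID', 'STUSAB', 'CHARITER', 'CIFSN', 'LOGRECNO'):
                continue
            k = utils.parse_table_from_key(col)
            if k not in GET_CENSUS_TABLE.keys():
                continue
            values[k][col] = int(cell) if cell else None

        # o = m(**v): instantiate the right model with the cell values.
        for k, v in values.items():
            v['geo_id'] = block.geoid10
            m = GET_CENSUS_TABLE[k]
            o = m(**v)
            o.save()

if __name__ == '__main__':
    run(sys.argv[1])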

Quick rundown of the action here: we’re getting a census table file (one of the 47 created in step 2) as a command line argument and associating each row with a CensusBlock object based upon that ever-handy cross reference table we created earlier. The utils.parse_table_from_key(k) part is, again, referencing a function from the utils module that uses a regex to parse which census table the row being looked at contains data from, based upon the pattern that is present in the row. Since I’m only interested in 14 of the tables, I’m only actually loading the data that is from those tables (hence the if k not in GET_CENSUS_TABLE.keys(): part). After that, it’s just passing the geo_id and the table values into the Django model (which is looked up in the GET_CENSUS_TABLE dict) as keyword arguments. By the way, that’s my new favorite Python shortcut: o = m(**v).

The final relation

The final step here is to relate my arbitrary geographies with the census blocks they contain. I chose to do this by creating a ForeignKey from the CensusBlock object to the Locality object which contains it. I’m still tweaking the actual queries that I’m going to run in this stage but, for the purposes of a proof of concept, it’s pretty solid. Here’s my script for this last stage:
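In outline it looks like this. Fair warning: only the first, coveredby, pass is exactly what I describe below; the second and third passes here are just one way of widening the net.

# relate_blocks.py -- associate each census block with the Locality it belongs to.
from myapp.models import CensusBlock, Locality

def run():
    # Pass 1: blocks entirely covered by a locality's boundary.
    for loc in Locality.objects.all():
        CensusBlock.objects.filter(mpoly__coveredby=loc.mpoly,
                                   locality__isnull=True).update(locality=loc)

    # Pass 2 (illustrative): unclaimed blocks that straddle a boundary go to
    # the locality that contains their centroid.
    for loc in Locality.objects.all():
        for block in CensusBlock.objects.filter(mpoly__intersects=loc.mpoly,
                                                locality__isnull=True):
            if loc.mpoly.contains(block.mpoly.centroid):
                block.locality = loc
                block.save()

    # Pass 3 (illustrative): anything still unclaimed that touches a locality
    # at all gets swept in.
    for loc in Locality.objects.all():
        CensusBlock.objects.filter(mpoly__intersects=loc.mpoly,
                                   locality__isnull=True).update(locality=loc)

if __name__ == '__main__':
    run()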

I’m actually trying to catch the census blocks associated with a particular area in three ever-widening passes because the first time I tried this, just using the CensusBlock.objects.filter(mpoly__coveredby=loc.mpoly) filter, I ended up with a bunch of CensusBlock objects that weren’t associated with any Locality objects. It’s still a little messy but, just to show you it can actually be done, here’s a screenshot showing a side-by-side comparison of what our Washington, D.C. Locality boundaries (which in this case correspond precisely with the actual boundaries of Washington, D.C.) look like next to all the census blocks that make it up:

[Screenshot: Washington, D.C. built out of a Bahá’í Locality, side by side with the same area made out of census blocks]

Wrapping it up

After I had my five scripts together, I made another script that basically takes a state name as a command line argument and executes all the other scripts in order. It also handles downloading and unpacking the data:
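Mine is a thin wrapper along these lines (the step script names are the ones from the sketches above, plus a stand-in for the step-two shell script; the actual downloading and unzipping of the TIGER and SF1 files, and their URLs, are left out here for brevity):

# run_state.py -- usage: python run_state.py ct 09
# (state abbreviation plus the state FIPS code, used in the TIGER filename)
import subprocess
import sys

def run(state_abbr, state_fips):
    # Download and unzip the TIGER block shapefile and the SF1 archive here first.
    steps = [
        ['python', 'load_blocks.py', 'tl_2010_%s_tabblock10.shp' % state_fips],
        ['bash', 'fetch_sf1_data.sh', state_abbr],  # step two, IRE-style
        ['python', 'make_xref.py', '%sgeo2010.csv' % state_abbr],
    ]
    # One loader run per SF1 data segment produced in step two.
    for i in range(1, 48):
        steps.append(['python', 'load_tables.py',
                      '%s%05d2010.csv' % (state_abbr, i)])
    steps.append(['python', 'relate_blocks.py'])

    for cmd in steps:
        print ' '.join(cmd)
        subprocess.check_call(cmd)

if __name__ == '__main__':
    run(sys.argv[1], sys.argv[2])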

So, really all you have to do is execute that and wait. And wait. It takes a really long time. Maybe go take in a movie. Or a movie marathon. Watch all 5 seasons of The Wire. You get the idea….

In all practicality…

Getting this far was relatively simple, given the amount of work that was already done for me by the wonderful IRE project. The next stage is going to be where the rubber hits the road in terms of usability. I don’t really want to have to run through this process on every state more than once, so I’m thinking I’ll probably just serialize the parts I need and store them in an S3 bucket where I can grab them when I need them. But that’s a project for later…
