Datasets are often disparate and/or siloed, reducing a user’s ability to utilise multiple sources in an analysis
I believe the biggest, and ironically also the first, stumbling block for budding data scientists is data collection and the subsequent formatting needed for any analysis that follows. To begin with, locating the necessary datasets can be a major task while civic society is still warming to a culture of information sharing. Then, even if the data can be found, each dataset a researcher might wish to use in a given investigation may have a different structure, file format, or means of access, e.g. an API, a downloadable CSV, a raw HTML table requiring scrubbing, or even…a dreaded PDF (groan). To minimise these hassles, we wanted to build a system where city users could draw from several disparate datasets and combine them into a custom analysis of their own. For this initial prototype we decided to focus on the following sources:
- 102 of SACN’s city indicators that are published yearly in the printed State of South African Cities Report.
- The 2011 South African National Census as accessed through the Wazimap data tool built by OpenUp for Media Monitoring Africa (MMA)
- Municipality budget data as accessed through the Municipal Money data portal built by OpenUp for National Treasury
- Custom datasets that city users themselves upload to the platform
Each of these required its own data management back-end component to wrangle the data into a unified format for analysis construction and visualisation, reflecting a modular approach to dataset handling: for each new data source, an adaptor is written to “jack” it into the system. In the rest of this post I will go through, at a high level, the approach used for each of the above sources and how we coded everything up in the platform. While I won’t go into much of the code in these posts, please see the codebase on our GitHub repository.
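To make the “plug-in” idea concrete, here is a rough sketch of what such an adaptor could look like in Python. The class and field names are hypothetical, not the actual SCODA code; the point is simply that every source exposes the same method and returns data in one common shape:

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

# Hypothetical names for the adaptor pattern described above; the actual
# classes in the SCODA codebase may differ.
Record = Tuple[str, int, str, float]  # (region, year, dataset name, value)


class DataSourceAdaptor(ABC):
    """Every data source plugs into the platform through the same interface."""

    @abstractmethod
    def fetch(self, **query) -> List[Record]:
        """Pull data from the source and return it in the unified format."""


class CsvIndicatorAdaptor(DataSourceAdaptor):
    """Illustrative adaptor for a local CSV file of indicator values."""

    def __init__(self, path: str):
        self.path = path

    def fetch(self, **query) -> List[Record]:
        import csv

        with open(self.path, newline="") as f:
            return [
                (row["region"], int(row["year"]), row["dataset"], float(row["value"]))
                for row in csv.DictReader(f)
            ]
```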
City indicators:
For each of the nine biggest metropolitan areas in South Africa, as well as for each of the provinces, SACN compiled a set of indicators ranging from Access to Water all the way to Voter Demographics. To work with this set, we decided to adopt a data format that would be maximally extensible, allowing for easy manipulation and re-sampling for data visualisation, dataset joining, and querying. To this end, the fundamental unit of data was a 4-tuple, (x1, x2, x3, x4), containing the region name (metro or province), the year of the value, the name of the value’s dataset, and the value itself. From this simple representation, we could then group the data into datasets which in turn form indicators, e.g. the dataset Access to Piped Water in the Yard is part of the Access to Water indicator. The approach also allows data to be efficiently resampled into time series for chronological analyses. We use this bottom-up approach to data handling at ODD because it allows the data to be recast for many different applications with simple database queries. It also creates a fundamental unit that, if adopted in our subsequent plug-ins, allows any part of our data to talk to any other.
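As a minimal sketch of what this fundamental unit might look like (illustrative table and column names, not the actual SCODA schema), here it is expressed as a SQLAlchemy model:

```python
from sqlalchemy import Column, Float, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class DataPoint(Base):
    """One 4-tuple: the fundamental unit of data in the platform."""

    __tablename__ = "data_points"  # hypothetical table name

    id = Column(Integer, primary_key=True)
    region = Column(String, nullable=False)   # x1: metro or province name
    year = Column(Integer, nullable=False)    # x2: year of the value
    dataset = Column(String, nullable=False)  # x3: dataset the value belongs to
    value = Column(Float, nullable=False)     # x4: the value itself
```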
Our web application was built with the Python-based Flask framework; data was stored in a PostgreSQL database and accessed via the SQLAlchemy ORM. Server-side data manipulations were performed with Pandas, which also provides extremely handy import-from-SQL functionality that links easily to a SQLAlchemy database object.
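A hedged example of that SQL-to-Pandas hand-off, reusing the hypothetical data_points table from the sketch above and an assumed connection string:

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumed connection string; adjust for your own PostgreSQL instance.
engine = create_engine("postgresql://user:password@localhost/scoda")

# Pull the 4-tuples straight from the database into a DataFrame.
df = pd.read_sql("SELECT region, year, dataset, value FROM data_points", engine)

# Resample one dataset into a region-by-year time series for charting.
water = df[df["dataset"] == "Access to Piped Water in the Yard"]
series = water.pivot_table(index="year", columns="region", values="value")
```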
The Wazimap interface:
In the open data world, just because a dataset is available does not guarantee that it is accessible, and the South African National Census is a prime example. While the data is available from StatsSA, extracting a data series for a specific region is not trivial and a fair amount of post-processing is needed. Thankfully, the kind folk at OpenUp (formerly Code for South Africa) built the excellent Wazimap interface to the census, making it truly accessible in a visual, interactive framework. They also included an API that allows applications to query this data in a logical manner. For those who don’t know, an API (Application Programming Interface) is a means by which one machine can talk to another across some form of network, in this case the internet. On our SCODA platform, we added a web form, powered by the WTForms package, which is translated into the necessary query string for the Wazimap API. The returned data is then recast into our aforementioned 4-tuple format and made available for inclusion in datasets and downstream visualisations. As this information is only for the 2011 census (though the 2016 Household Survey data is scheduled for inclusion in Wazimap soon), the SCODA platform will only allow it to be combined with other 2011 data to ensure responsible data use. To handle this join, we use the Pandas DataFrame merge function, which aligns on both city and year.
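A minimal sketch of that merge, with made-up frames and illustrative values only:

```python
import pandas as pd

# Hypothetical frames: one from the SACN indicators, one from the Wazimap API,
# both already recast from the (region, year, dataset, value) shape.
# All numbers below are illustrative placeholders.
indicators = pd.DataFrame({
    "region": ["Johannesburg", "Cape Town"],
    "year": [2011, 2011],
    "access_to_piped_water": [91.0, 93.0],
})
census = pd.DataFrame({
    "region": ["Johannesburg", "Cape Town"],
    "year": [2011, 2011],
    "population": [4_400_000, 3_700_000],
})

# Align the two sources on both city and year before building an analysis.
combined = indicators.merge(census, on=["region", "year"], how="inner")
```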
The Municipal Money interface:
In a similar fashion to Wazimap’s treatment of the census, OpenUp collaborated with National Treasury to make budget information available for all municipalities in South Africa through the Municipal Money API. The SCODA platform once again uses a web form front-end to this query system, which is significantly more detailed than that for Wazimap. To a user of the SCODA platform the input forms for Wazimap and Municipal Money look very similar; however, the underlying data structures for the two sources are quite different. If you take a look under the hood, viz. our GitHub folder, you can compare the two systems. The returned information is rearranged to conform to our standard format and piped into analyses.
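As a rough, hypothetical illustration of that rearrangement (the URL, query parameters, and response keys below are placeholders, not the real Municipal Money API schema), the pattern is simply “fetch, reshape into 4-tuples, frame”:

```python
import pandas as pd
import requests

# Placeholder endpoint and parameters standing in for a budget-style API call.
response = requests.get(
    "https://example.org/api/budget",
    params={"municipality": "JHB", "year": 2011},
)
response.raise_for_status()

# Recast the (assumed) JSON payload into our (region, year, dataset, value) unit.
rows = [
    ("Johannesburg", item["year"], item["line_item"], item["amount"])
    for item in response.json()["results"]
]
budget = pd.DataFrame(rows, columns=["region", "year", "dataset", "value"])
```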
User datasets via CKAN:
The freely available CKAN open data portal platform is a truly wonderful piece of open-source software, with some of the best documentation I have ever come across…plus it’s easily deployable on an AWS instance. A powerful feature of the system is its DataStore module, which allows an uploaded dataset to be instantly queryable, provided it adopts a certain format. In our case, we used this functionality as the back-end to our user file management and query system. When a file is selected for upload, we submit it to the CKAN instance via its built-in API, and the data it contains is then accessed through the same API. With the file now living on the CKAN server, it can also be accessed, shared, and downloaded as part of the open data portal, allowing easy sharing of information if a user would like it made available, though all files are treated as private until deliberately opened. The win is therefore two-fold: we were able to plug in a robust file management system and provide data sharing and API access at the same time.
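For the curious, here is a hedged sketch of that round trip using the ckanapi client library; the portal URL, API key, and dataset id are placeholders for your own CKAN instance, and it assumes the DataStore ingestion is configured on the server:

```python
from ckanapi import RemoteCKAN

# Placeholder portal address and credentials.
ckan = RemoteCKAN("https://data.example.org", apikey="YOUR-API-KEY")

# Upload the user's file as a resource on an existing dataset (package).
resource = ckan.action.resource_create(
    package_id="user-uploads",
    name="my-city-data.csv",
    format="CSV",
    upload=open("my-city-data.csv", "rb"),
)

# Once the DataStore has ingested the file, it is queryable through the API.
result = ckan.action.datastore_search(resource_id=resource["id"], limit=5)
for record in result["records"]:
    print(record)
```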
Some lessons learned:
In terms of computational nuts and bolts, a significant challenge for me was handling several web forms at once, i.e. data could come from any of the four forms and might require its own validation and handling pipeline. My solution was to add to each form a WTForms field using the HiddenInput widget, set to “on” via a jQuery command when that particular form’s submit button was clicked. On the Flask side, I could then check these fields to find out which form had been submitted and ignore the others. I also had the pleasure of playing with Pandas merge/join/pivot functionality, which is truly superb…I heartily recommend it for any data re-arrangements you may come across.
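In skeletal form (hypothetical form and field names, not the exact SCODA code), the trick looks something like this:

```python
from flask import Flask, request
from wtforms import Form, HiddenField, StringField

app = Flask(__name__)


class WazimapForm(Form):
    # Flipped to "on" by a small jQuery handler when this form's submit is clicked.
    submitted = HiddenField(default="off")
    indicator = StringField("Indicator")


class MunicipalMoneyForm(Form):
    submitted = HiddenField(default="off")
    line_item = StringField("Line item")


@app.route("/query", methods=["POST"])
def query():
    wazimap = WazimapForm(request.form, prefix="wazimap")
    muni = MunicipalMoneyForm(request.form, prefix="muni")

    # Only validate and process the form whose hidden flag was switched on.
    if wazimap.submitted.data == "on" and wazimap.validate():
        return f"Wazimap query for {wazimap.indicator.data}"
    if muni.submitted.data == "on" and muni.validate():
        return f"Municipal Money query for {muni.line_item.data}"
    return "No form submitted", 400
```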
Outlook:
Each of these data sources presented a unique challenge in unifying them into a single framework, but our task was made easier by establishing a fundamental base unit, the 4-tuple format, which made data handling straightforward once the information had been rearranged to fit it. With this approach, we can now continue to add data “plug-ins” for other APIs or sources, knowing that as long as we can convert to our base unit we can make the data talk to each other. A bit of time spent on information architecture design definitely saved a few headaches down the line when we moved on to dashboarding and mapping our data. That part of the journey will be covered in Part III.