Expanding on Monday’s post, today I’m getting into the details of what “a way to download text files and parse them into database loads that is repeatable” means. The goal is to publicly work out a specification of what this MVP looks like.

The full design is essentially a conduit architecture for quickly pulling in datasets from government sources. The simplest of those datasets is the flat file sitting on an FTP server. If it doesn’t slow things down, putting in a stub of that conduit architecture is a nice-to-have, but it can be retrofitted later. The process must be automated because one of the differentiators is building trust that government numbers are not being manipulated after the fact: download the data once, then periodically download it again and compare. Knowing that at least one outside organization is watchdogging your data, and will call you on the slightest hiccup, is a significant value add for the business, so having that repeatability matters even at the beginning.
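To make that concrete, here is a minimal sketch of what the repeatable download step could look like. The host, path, and file names are placeholders (any flat file on a government FTP server), and the checksum ledger is just one simple way to detect later changes:

```python
import ftplib
import hashlib
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical source: any flat file sitting on a government FTP server.
FTP_HOST = "ftp.example.gov"
FTP_PATH = "/pub/data/spending.csv"
ARCHIVE_DIR = Path("archive")

def fetch_snapshot() -> Path:
    """Download the file, stamp it with the retrieval time, and record its checksum."""
    ARCHIVE_DIR.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    local_path = ARCHIVE_DIR / f"spending-{stamp}.csv"

    with ftplib.FTP(FTP_HOST) as ftp:
        ftp.login()  # anonymous login; real sources may need credentials
        with open(local_path, "wb") as fh:
            ftp.retrbinary(f"RETR {FTP_PATH}", fh.write)

    digest = hashlib.sha256(local_path.read_bytes()).hexdigest()
    # Append to a simple ledger so later downloads can be compared for silent edits.
    with open(ARCHIVE_DIR / "checksums.log", "a") as log:
        log.write(f"{stamp}\t{digest}\t{local_path.name}\n")
    return local_path

if __name__ == "__main__":
    print(f"Saved {fetch_snapshot()}")
```

Run on a schedule (cron is fine for an MVP), this gives you the “download it again” half of the trust story with almost no machinery.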

Data needs to be downloaded and treated as if it were source code. The metadata of what changed, when it changed, who changed it, and why is a norm in the coding world but is not generally a norm in the government data world. Establishing that richer metadata standard is a value addition for the project. Loading the data into Git or a similar source code repository allows that richer metadata to be expressed without reinventing the wheel with some proprietary mechanism.
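A rough sketch of that step, assuming the snapshot produced by the download above and a local Git repository that already exists; the fixed file name and commit message are illustrative:

```python
import subprocess
from pathlib import Path

def commit_snapshot(repo: Path, data_file: Path, reason: str) -> None:
    """Copy the latest download into a Git repo under a fixed name and commit it,
    so every change to the published numbers carries who/when/why metadata."""
    tracked = repo / "spending.csv"  # fixed name so Git diffs successive downloads
    tracked.write_bytes(data_file.read_bytes())

    subprocess.run(["git", "-C", str(repo), "add", tracked.name], check=True)
    # Only commit when the file actually differs from the last snapshot.
    staged = subprocess.run(["git", "-C", str(repo), "diff", "--cached", "--quiet"])
    if staged.returncode != 0:
        subprocess.run(["git", "-C", str(repo), "commit", "-m", reason], check=True)

# Example: commit_snapshot(Path("gov-data-repo"), latest_download,
#                          "Nightly re-download of spending.csv")
```

The nice side effect is that `git log` and `git diff` become the audit trail for free, instead of something the project has to build and defend.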

At the same time, the data does need to go into a database so that it can be manipulated and used for generating reports; code repositories are not good for that. Whether you load the data twice (once into the repository, once into the database) or adjust the database to replicate the repository functions available in Git and company is not important. A double load seems fine for an MVP, with a move to a more efficient solution later if it is called for.
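For the database half of the double load, something along these lines would do for an MVP, assuming a CSV with illustrative column names and SQLite as a stand-in for whatever database the reports actually run against:

```python
import csv
import sqlite3
from pathlib import Path

def load_into_db(data_file: Path, db_path: str = "watchdog.db") -> int:
    """Second half of the double load: parse the same flat file into a table
    that reports can query. Column names here are illustrative only."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS spending (
               agency TEXT, fiscal_year INTEGER, amount REAL,
               snapshot_file TEXT)"""
    )
    with open(data_file, newline="") as fh:
        rows = [
            (r["agency"], int(r["fiscal_year"]), float(r["amount"]), data_file.name)
            for r in csv.DictReader(fh)
        ]
    conn.executemany("INSERT INTO spending VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
    return len(rows)
```

Tagging each row with the snapshot file it came from keeps the database loads traceable back to the Git history, which is all the linkage the MVP really needs.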
