Data Cleaning With Muck
Data journalism is a growing field, and as the practice becomes more sophisticated, data journalists are facing all the challenges of modern software development. In order to manage the increasingly complexity of data projects, journalists will need good tools. Muck is a prototype tool for data analysis, with a particular focus on data journalism. It allows users to take an incremental, iterative approach to their work. By breaking programming tasks into smaller steps, journalists can better explore their data, develop stories and try new techniques, while at the same time making their work more transparent and reproducible.
Projects built with Muck are structured as a collection of interdependent steps (we refer to this structure as the “dependency graph”). With a simple naming convention and a little behind-the-scenes magic, Muck is able to infer the dependencies between source code and data files. With this information, Muck rebuilds only the parts of a project that have changed, and their dependent parts. For large datasets, the time savings can be dramatic, allowing users to explore their data with little overhead. Our goal is to provide an environment that encourages correctness and clarity of both code and data, while remaining fast, pragmatic and ergonomic. This post describes a few early results, specifically facilities for patching text and transforming records.
Muck fills a role similar to the traditional Unix tool Make (see: paper and reference). Make, first released in 1976, is a command-line tool for automatically building products (typically compiled executables) from source files in an incremental fashion; when asked to produce a particular product (or “target”), it will only rebuild relevant portions of the project that have changed since the last build. The process is directed by a “makefile”, which specifies, for each target product to be built, the list of source files and products that the target depends on (the “dependencies”), as well as the commands to build the product (the “recipe”). In essence, the developer explicitly writes out the project’s dependency graph in the makefile, thus enabling Make to perform incremental builds.
Muck improves on this paradigm by eliminating the need for the makefile. Instead, it examines the contents of source files whose names match those of the target products, and infers the dependency relationships. While Muck is certainly not the first system to boast this capability, it is notable for supporting both source and data formats common to data journalism.
Test Project: Webster’s 1913 Dictionary
Once we got the basic dependency calculations and build commands working, we began several test projects that could help guide development in a practical direction. One ongoing experiment parses and analyzes the Project Gutenberg version of Webster’s 1913 Unabridged Dictionary. The work required a variety techniques that are fundamental to data journalism: web scraping, error correction, and text parsing. The project is not yet finished, but our experience thus far has led to some interesting additions to Muck.
The dictionary text that we have been working with is quite messy. Its basic structure is straightforward, but we have encountered exceptions in the text at nearly every step of development. At over 100,000 entries, the dictionary is too large to correct by hand (it appears that there have been valiant efforts over the years, but a variety of problems remain), so getting good results is a real challenge.
The first, most glaring problem is that splitting the text into discrete records fails in a few places. These flaws are easily understood, but the code to correct them ranges from straightforward to convoluted.
Ambiguous Escape Sequences
The text contains a variety of obscure escape sequences (specific patterns of text intended to encode symbols or meaning that is not otherwise representable), some of which cannot be automatically parsed because the sequences also occur as legitimate, unescaped text. Simple find-and-replace operations using regular expressions yield lots of false positives.
Common and Rare Flaws
Some flaws occur once or a handful of times, while others occur thousands of times. Time-efficient strategies for correcting rare versus common flaws tend to be quite different. Sadly, it seems that the only way to know whether it makes more sense to correct by hand or programmatically is to try both!
Structure and Information Loss
One interesting thing about the English dictionary (for a programmer at least) is that the text is much more mechanical than prose, but still not so rigidly defined that it can be parsed like a programming language. The pronunciations, parts of speech descriptions, etymologies, and even the definitions themselves are written in a systematic style (although the early Webster’s Dictionary is famous for its colorful definitions: James Somers’ blog post was our initial inspiration for the project). Nonetheless, there seems to be an exception to any syntactic rule that one might conceive, and like natural languages, some ambiguities can only be resolved by semantic understanding of the content. To make matters worse, crucial punctuation like matching parentheses and brackets are missing in some cases. Writing code to parse these various elements into structured data has been downright maddening.
Data cleaning can be a challenging, time consuming process. The primary goal of Muck is to make it easy to create perfectly reproducible data projects; a curious reader should be able to check out a project from a source repository like Github, run the `muck` command, and reproduce all of the computations that create the analysis. Ideally, the various processes that go into the computation will be well organized and easily audited for correctness.
Programming languages offer tremendous capabilities to automate such tasks, but such power comes with the risk of overcorrection. Sometimes it is easier (and less confusing) to make a correction by hand, rather than via code. However, simply altering the original source data is not a reproducible practice, and too many hand corrections make it impossible to properly fact-check a data-driven story.
A classic solution to this problem has been available in the Unix programming world for many years: the `diff` and `patch` tools. “Diffing” is the process of calculating a “diff” (also called a “delta” or “patch”) from two versions of the same document: the diff shows the edits needed to transform the original version into the modified version. “Patching” is the process of applying a patch file to an original document to produce the modified result. Traditionally, these tools have been used to track and communicate changes to program source files, and form the conceptual basis for modern version control systems like Git. However, thanks to the Unix tradition of designing tools to operate on text, `diff` and `patch` are easily applied to text-based data formats as well.
So, we added support for patch files. Muck treats them as just another source format, with single data dependency (the “original” file) and output (the “modified” file). The benefit was immediately obvious: patch files are human-readable, allowing the patches to function as a reproducible version of the “data diary” that many data journalists use. Unfortunately, the various traditional patch formats have some shortcomings for our purposes:
- “Unified diff” (the standard modern patch format) hunks contain hard-to-read positional information at the top of each hunk; diffs produced by `git` are even worse, containing hash values and contextual “@” lines designed specifically for source code.
- Empty Unix patch files are truly empty, omitting the original file name needed by Muck for dependency inference.
- The file paths in patch hunks often contain the build directory prefix, a minor but annoying detail.
Of these, the most significant (and surprising) problem is that for some workflows, once a patch is created, it makes more sense to edit the patch directly rather than to edit the “modified” file and recompute the patch. This is especially true when reviewing a set of patches; occasionally we would find a typo or change our correction strategy part way through. Unix patch formats were not designed with hand-editing in mind. As an experiment, we created our own patch format and tool, called Pat. The pat format is similar to the traditional “unified diff” format, but addresses the above shortcomings directly, and provides us with a means of experimenting further with this sort of workflow. In particular we would like to add commenting syntax, escaping of control characters, line splitting, and intra-line highlighting for ease of use. Pat is still in early development, and currently lacks documentation, but the code can be found at https://github.com/gwk/pat.
While the patching methodology in Muck needs some refinement, it has already proved useful. We believe patching is an important tool for achieving reproducibility in real-world projects because it offers a middle road between manual editing and programmatic correction. The technique should be used with discretion though, and knowing when to switch strategies from patching to programmatic correction is largely a matter of experience. Choosing the right strategy for a given problem often requires experimentation.
Cleaning and Verification
Regardless of whether flaws are fixed by hand or via code, a fundamental challenge is to apply fixes without introducing new flaws. Good developers typically use some sort of testing framework to verify that their code works as intended, but how best to apply these methodologies to data problems is not obvious. Often, flaws are discovered in the course of implementing some end goal or feature, but are best fixed somewhere earlier in the data pipeline. The result is a disconnect between the logic that checks for a flaw and the process that fixes it.
Before proceeding, we should explain our use of the terms “check” and “test”. We make a distinction between “checking” for flaws in the data, with logic in the main pogram, and “testing” the program for logical flaws, via external testing code. This distinction becomes fuzzy if “checks” get factored out into scripts that are not part of the main computational graph, because once externalized they essentially become tests against the data. There are further elaborations to be explored (e.g. best practices regarding assertions and exceptions versus non-fatal logging in data cleaning code) but the main point is that traditional software testing methodologies do not map perfectly to the needs of data cleaning projects.
Once a data cleaning fix is implemented, the checking logic takes on a questionable role: either it issues a warning prior to the fix, or it is applied afterwards and remains silent. In the former case, the programmer quickly learns to ignore the message, and it only serves to obscure more meaningful warnings. In the latter, the developer has several options:
- remove the check because it is slow, confusing or inconveniently located;
- leave the check in place, where it no longer executes;
- factor it out into some side branch of the computational graph.
Removing the checking code is a undesirable, because doing so eliminates the evidence of the flaw and thus the primary explanation of why the fix exists. At the same time, code that never executes is a notorious liability; as projects evolve, unexercised code tends to become incorrect. Only the last option sounds promising, but just moving the code into a separate program does not ensure that the code will get executed or maintained. The broad question we need to address is this: what are good strategies for clarity and correctness when working with these sorts of checks and fixes? At the very least, factoring out the checking code from the program pipeline and into a side branch allows it to be read and run as part of a manual audit, or via automated testing.
As a first step, we implemented a new file source type in Muck, “.list”, which is simply a list of scripts to execute. By creating a top-level ‘test.list’ file listing the test scripts that have accumulated in the project, we can simply run `muck test` and all of the checks will run.
Standalone tests are useful, but by themselves they don’t make the work much easier. At worst, they become a dumping ground for old code. What we need is a way to show how the flaws and fixes are related to the data pipeline. Ideally, we would be able log records that fail checks, even after fixes have been added, so that reviewers can confirm that the fix behaves as intended. The intent of the code would also be much clearer if check, fix, and logging logic were colocated in some fashion.
All of this suggests a major limitation of the file-based perspective inherent to Muck: many operations apply on a per-record basis, rather than per-file. Thus, an emerging goal is to articulate and enable a per-record data cleaning strategy:
- Identify a flaw and implement checks for it.
- Implement a fix.
- Preserve both the check and the fix in the codebase in a clear, coherent way.
- Continually verify that the check and fix remain well behaved using tests.
Auditable Data Transformations
We want the check for a flaw, the application of a fix, and any reporting of the occurrence to be expressed together in a cohesive, clearly written unit. After some experimentation and quite a bit of supporting work, Muck now supports just such a workflow, via a function called `transform`. The technical description of how it works is more painful than using it, so we’ll start with an example:
with muck.transform('input.txt') as t: @t.keep def non_empty_lines(line): 'lines containing only whitespace are ommitted, without logging.' return bool(re.strip()) @t.drop def vulgar(line): 'profane lines are ommitted and logged.' return re.match(r'darn|gosh|shucks', line) @t.edit def (line): 'all occurrences of the word "tweet" are replaced with "trumpet",' 'preserving the leading capital; altered lines are logged.' return re.sub(r'([Tt])weet', r'1rumpet', line) @t.convert def br_tags(line): 'capitalize each line; no logging.' return line.capitalize() @t.flag def big_words(line): 'log lines with long words, but do not alter them.' return any(len(word) > 16 for word in line.split()) @t.put def out(line): 'write the final result line.' print(line, end='')
It’s worth admitting up front that `muck.transform` uses several advanced python features in combination – this makes it a bit tough to describe. `transform` is meant to be used at the top level of a python script, and just like `muck.source`, it takes the name of the input data as its first parameter. It returns a `Transformer` object (`t` in the example), which provides several methods that serve as function decorators. Each decorator indicates a different kind of transformation. When applied to a function, a decorator adds a new stage to the pipeline.
All transformation functions must take a record as their sole parameter. Each kind of transformation has a different behavior:
- `drop`: if the function returns True, the record is logged and dropped from subsequent processing.
- `keep`: if the function returns False, the record is dropped; no logging takes place.
- `edit`: the function returns a value to be passed to the next stage; if the returned value is not equal to the original value, then both are logged as a pair of -/+ lines.
- `convert`: like `edit`, the function can return either the original or a new value; no logging takes place.
- `flag`: if the function returns true, then the value is logged; the value is never altered.
- `put`: the function is expected to write the output or otherwise consume the value; no logging occurs. Usually there is a single `put` at the end of the pipeline, but `transform` allows zero puts (perhaps for a script that just checks validity) or many puts (for example, if some intermediate output or multiple output formats are needed).
All stages are applied to each record in the order in which they were decorated. For those modes that feature automatic logging, the complete effect of the transformation is reported in a dedicated file, without the user having written any reporting code. This leads to a much better verification experience for the user, because Muck’s logging facilities are fancier and more organized than the typical debugging code the user would write. The logging feature also obviates the need for such clutter in project code.
For the technically inclined: there is a `run` method that actually performs the transformation on the input sequence. Note that `run` is not called in our example; instead we use the `with` form to treat the `Transformer` `t` as a Python “context manager”. The `__exit__` method of `Transformer`, which is automatically called at the end of the `with` scope, simply calls `run`. This usage pattern is optional and preferred purely as a convenience.
Our experience with `transform` so far has been that it speeds up development by letting us alter and rearrange multiple stages easily. The individual stages tend to be quite simple and easy to read, whereas our previous solutions were larger blocks of code, often inside nested `for` loops, which are more difficult to reason about. The automatic logging performed by `transform` makes it easy to verify that a given stage is performing as intended, which saves even more time over the course of a project.
Muck is not yet a mature tool, but our experience with it thus far has been promising. The framework it provides for data scripting tasks reduces clutter and boilerplate (an industry term for uninteresting, repetitive setup code), which lowers the effort required to create or alter steps in the project dependency graph. As a result, the conceptual clarity of our test project improved over time, as we gradually organized the code into small, meaningful pieces. This sort of progression stands in contrast to our prior experiences, in which monolithic analysis scripts became increasingly convoluted over the course of development.
The patch and transform techniques demonstrated here are conceptually simple, but they dramatically improve the programming experience for certain kinds of tasks common to data journalism. As we develop Muck further, we hope to identify other classes of problems which can be made less painful with support from the build system. If you have any ideas, let us know!