Histogrammar

Histogrammar is a suite of data aggregation primitives for making histograms and much, much more. A few composable functions can generate many different types of plots, and these functions are reimplemented (exactly!) in multiple languages and serialized to JSON for cross-platform compatibility.

StrangeLoop 2016 presentation

What is this and why might I want it?

Histogrammar allows you to aggregate data using cross-platform, functional primitives. It serves the same need as HBOOK and its descendants (summarizing a large dataset with discretized distributions), but it does so using lambda functions and composition rather than a restrictive set of histogram types.

For instance, to book and fill a histogram in ROOT, you would do the following:

import ROOT

histogram = ROOT.TH1F("name", "title", 100, 0, 10)
for muon in muons:
    if muon.pt > 10:
        histogram.Fill(muon.mass)  # ROOT's method is Fill, with a capital F

In Histogrammar, you would do it like this:

from histogrammar import *

histogram = Select(lambda mu: mu.pt > 10,
                Bin(100, 0, 10, lambda mu: mu.mass,
                    Count()))
for muon in muons:
    histogram.fill(muon)

This works because ROOT's TH1F is just a selection on binned counts. To accumulate something else in each bin, such as the mean and standard deviation (a “profile plot”), you'd change the Count to Deviate and leave everything else the same:

histogram = Select(lambda mu: mu.pt > 10,
                Bin(100, 0, 10, lambda mu: mu.mass,
                    Deviate(lambda mu: mu.trackQuality)))

To make a 2-d histogram of px and py, you'd simply nest the Bin primitives.

histogram = Bin(100, 0, 50, lambda mu: mu.px,
                Bin(100, 0, 50, lambda mu: mu.py,
                    Count()))

Now the content of each bin is another histogram (the x slices). And so on. A good set of primitives can generate any kind of aggregation you need.
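To make the composition idea concrete, here is a minimal sketch of how such primitives could be written in plain Python. These toy classes are hypothetical stand-ins that only mimic the call shapes used above; they are not Histogrammar's actual implementation, and the Muon record and sample values are invented for illustration.

```python
import copy
from collections import namedtuple


class Count:
    """Toy primitive: sums the weights of everything it is given."""
    def __init__(self):
        self.entries = 0.0

    def fill(self, datum, weight=1.0):
        self.entries += weight


class Bin:
    """Toy primitive: routes each datum to one of `num` sub-aggregators."""
    def __init__(self, num, low, high, quantity, value):
        self.num, self.low, self.high = num, low, high
        self.quantity = quantity
        # Each bin gets its own copy of the nested aggregator.
        self.values = [copy.deepcopy(value) for _ in range(num)]

    def fill(self, datum, weight=1.0):
        x = self.quantity(datum)
        if self.low <= x < self.high:
            index = int(self.num * (x - self.low) / (self.high - self.low))
            index = min(index, self.num - 1)  # guard against rounding at the edge
            self.values[index].fill(datum, weight)


class Select:
    """Toy primitive: passes a datum on only if the cut function accepts it."""
    def __init__(self, quantity, cut):
        self.quantity, self.cut = quantity, cut

    def fill(self, datum, weight=1.0):
        w = weight * float(self.quantity(datum))
        if w > 0.0:
            self.cut.fill(datum, w)


# Invented sample data, for illustration only.
Muon = namedtuple("Muon", ["pt", "mass"])
muons = [Muon(pt=25.0, mass=0.5), Muon(pt=5.0, mass=0.5), Muon(pt=40.0, mass=3.2)]

histogram = Select(lambda mu: mu.pt > 10,
                   Bin(10, 0, 10, lambda mu: mu.mass, Count()))
for muon in muons:
    histogram.fill(muon)

counts = [b.entries for b in histogram.cut.values]
print(counts)
```

Because every class exposes the same fill method, any primitive can be nested inside any other, which is what lets a handful of them cover many kinds of plots.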

Histogrammar primitives are designed for distributed computing (they are all order-independent commutative monoids) and cross-platform compatibility (all languages produce the same JSON). As a data analyst, you just express your data aggregation in terms of nested Histogrammar primitives and pass it to any system for evaluation. Since all of the logic of what to fill is encoded in your lambda functions, the aggregation phase is automatic.

for muon in muons:
    histogram.fill(muon)
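The monoid property is what makes distributed filling safe: partial results from separate workers can be added in any order and still match a single pass over all the data. The sketch below illustrates this with a hypothetical BinnedCount class and made-up numbers; it is a toy under those assumptions, not the library's code.

```python
from functools import reduce


class BinnedCount:
    """Toy fixed-width histogram of integer counts (hypothetical)."""
    def __init__(self, num, low, high, counts=None):
        self.num, self.low, self.high = num, low, high
        # A fresh histogram with all-zero counts acts as the identity element.
        self.counts = list(counts) if counts is not None else [0] * num

    def fill(self, x):
        if self.low <= x < self.high:
            i = int(self.num * (x - self.low) / (self.high - self.low))
            self.counts[min(i, self.num - 1)] += 1

    def __add__(self, other):
        if (self.num, self.low, self.high) != (other.num, other.low, other.high):
            raise ValueError("cannot add histograms with different binning")
        summed = [a + b for a, b in zip(self.counts, other.counts)]
        return BinnedCount(self.num, self.low, self.high, summed)


def aggregate(values):
    """Fill a new histogram from one partition of the data."""
    h = BinnedCount(5, 0.0, 10.0)
    for x in values:
        h.fill(x)
    return h


# Invented data, split into three partitions as if on three workers.
data = [0.5, 1.5, 2.5, 7.0, 9.9, 3.3, 4.4, 8.8]
partitions = [data[0:3], data[3:5], data[5:8]]
partials = [aggregate(p) for p in partitions]

merged_forward = reduce(lambda a, b: a + b, partials)
merged_reverse = reduce(lambda a, b: a + b, reversed(partials))
single_pass = aggregate(data)

print(merged_forward.counts, merged_reverse.counts, single_pass.counts)
```

All three results hold the same counts, so the way the data were partitioned and the order in which partial results arrive do not affect the outcome.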

Benefits

Moving the logic of data analysis out of the for loop allows the analyst to describe an entire analysis declaratively. A whole analysis can be wrapped up in subdirectories like

Label(
    dir1 = Label(
        hist1 = Bin(...),
        hist2 = Bin(...)),
    dir2 = ...)

This tree gets filled the same way as a single histogram, because the Label collection is a primitive just like Bin and Count.

Thus, analysis code is now independent of where the data are analyzed. This is especially helpful for aggregating data in “hard to reach” places: across a distributed system like Apache Spark, on a GPU coprocessor, or through a low-bandwidth connection.

In addition, expressing an analysis this way formalizes it so that it can be inspected algorithmically. At any level, the cuts applied to a particular histogram can be inferred by tracing the primitives from the root of the tree to that histogram. Named functions provide bookkeeping, such that a quantity and its label are defined in one place to reduce errors when the units are changed.
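As an illustration of algorithmic inspection, the sketch below walks a toy tree from its root and records which named cuts apply to each histogram. The Cut, Select, Label, and Leaf classes here are hypothetical placeholders written for this example, and the eta field and thresholds are invented; Histogrammar's own classes and their attributes may differ.

```python
class Cut:
    """Hypothetical named function: keeps a label next to the logic."""
    def __init__(self, name, function):
        self.name, self.function = name, function

    def __call__(self, datum):
        return self.function(datum)


class Select:
    """Toy node: applies a named cut to everything below it."""
    def __init__(self, cut, child):
        self.cut, self.child = cut, child


class Label:
    """Toy node: a directory of named children."""
    def __init__(self, **children):
        self.children = children


class Leaf:
    """Toy placeholder for a histogram at the bottom of the tree."""


def cuts_by_histogram(node, path=(), cuts=()):
    """Map each leaf's path to the list of cut names between it and the root."""
    if isinstance(node, Label):
        result = {}
        for name, child in node.children.items():
            result.update(cuts_by_histogram(child, path + (name,), cuts))
        return result
    if isinstance(node, Select):
        return cuts_by_histogram(node.child, path, cuts + (node.cut.name,))
    return {"/".join(path): list(cuts)}


# Invented cuts, for illustration only.
high_pt = Cut("pt > 10", lambda mu: mu.pt > 10)
central = Cut("|eta| < 2.4", lambda mu: abs(mu.eta) < 2.4)

tree = Label(
    inclusive=Label(mass=Leaf()),
    selected=Select(high_pt, Label(
        mass=Leaf(),
        central_mass=Select(central, Leaf()))))

result = cuts_by_histogram(tree)
for name, applied in result.items():
    print(name, applied)
```

Since each cut carries its name alongside its function, the report and the selection logic cannot drift apart when a threshold or unit is changed.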

Scope

Histogrammar aggregates data but does not produce plots (much like HBOOK, which had an associated HPLOT for 1970s-era line printers). Histogrammar has front-end extensions to pass its aggregated data to many different plotting libraries.

Histogrammar also has back-end extensions for aggregating data from different frameworks. It can therefore be thought of as a common language for aggregating and then plotting data, so that each plotting library does not need its own hooks for every kind of aggregation system. As a language, it represents aggregated data in a way that doesn't have to be recomputed for trivial changes in visualization, such as changing the color of axis tickmarks.

Get started!

See the documentation below for installation instructions, tutorials, full references, and the cross-language specification. See the GitHub site for the code.