Building a simple expression language

If you're interested in what we work on, please apply - we're hiring: http://mixpanel.com/jobs/

(This is part one of a two part series, you can find part II here)

The Mixpanel reporting API is built around a custom expression language that customers (and our main reporting application) can use to slice and dice their data. The expression language is a simple tool that allows you to ask powerful and complex questions and quickly get the answers you need.

The actual Mixpanel expression engine is part of a complex, heavily optimized C program, but the core principles are simple. I’d like to build a model of how the expression engine works, in part to illustrate how simple those core principles are, and in part to use for exploring how some of the optimizations work.

This post will use a lot of Python to express common ideas about data and programs. Familiarity with Python should not be required to enjoy and learn from the text, but familiarity with a programming language that has string-keyed hash tables, maps, or dictionaries, or familiarity with the JSON data model will help a lot.

The Mixpanel data model

Before we talk about the expression language, it’s worth talking about the subject of expressions- Mixpanel data. You can think of Mixpanel as a log full of Python dictionaries or JSON Objects – loosely typed tables that map string keys to values that can be numbers, text, lists of values, or nested tables of keys and values. We’ll call each mapping of key to value a property. Here’s what the properties of a typical profile in Mixpanel people analytics might look like, for an app that allows rock stars to sell their records on other planets (in Python notation):

The Mixpanel integration for this app stores the name and years active for a musician, along with a list-valued property containing descriptions of each of the musician’s albums. The album descriptions themselves are collections of properties, and each album contains a property tracks that in turn contains a list of descriptions of each track on the album.

Querying Mixpanel data in Python

A common Mixpanel operation on Mixpanel people analytics is to request every people profile that conforms to a particular query. We might want to see all of the artists who have been active for less than five years, for example. Suppose we had the whole database in memory, as a list named profiles. In Python, we could query the data with a program like this:

Of course, we might want to run other queries as well, so we might write our code like this:

Here we’ve abstracted out the query itself. We can treat any object with a matches(profile) method like a query. If we want to pick a particular artist by name, we can just say

Querying Mixpanel data in the Mixpanel expression language

The basic execution model of our Python queries is to filter data – basically, to loop over every record and apply a filter function to each set of properties in turn. If that function evaluates to true, return the record.

Now what if we want to allow other people and processes to write their own queries? In an ideal world, our query program would look something like this:

We could just eval() python code as it came in (and it would actually work!) but that approach has a lot of other disadvantages. What we really need is a language for defining functions from property sets onto values that also has the following other attributes:

  • It should be quick and simple to parse, and quick and simple for other programs to generate
  • It should simple for us to secure
  • It should be quick and simple to apply, and accommodate optimizations (which means it shouldn’t expose much information about its associated execution model)
  • Functions defined in the language should be guaranteed to terminate, and even perform predictably, which means the language can’t be Turing complete
  • But our language should be powerful enough to ask complex questions (which means it should be recursive)

The Mixpanel expression language fulfills all of those criteria. It has a simple and common structure that is easy to express in parsers and generators – it’s a set of recursively defined expressions, all of which have a value. In particular, Mixpanel expressions are either primitives (like literal values) or the results of combining other expressions together with operators. Here’s a pretty typical expression in the query language:

If we were to pick apart the query expression above into its component expressions, it might look something like this:

A Mixpanel Expression Language

This expression has three kinds of sub-expressions:

    • "Ziggy Stardust", "artist", "years_active" and the numeral 5 are primitive literals. They always evaluate to the same thing
    • properties is a primitive property lookup – it takes on the value of the record that is the subject of the expression. Individual properties can be accessed using the [ ] indexing operator.
    • The operators [ ], ==, and < take multiple expressions and bind them together to produce a new value.

If you built this expression in Python, it might look something like this

Here we build up expressions from smaller expressions, just like in the diagram- each sub-expression corresponds to an instance of an object with an evaluate(properties) method. The individual record is the only thing that matters to the expression, so we pass it as a parameter to evaluate(properties).

It’s not hard to imagine (or to build) a simple parser that constructs and composes expressions from the raw language into a final expression in the format above. The grammar might look something like this in a yacc or bison like parser generator tool:

Digression – optimizing property lookups in expressions

A typical Mixpanel query may run over millions of records, and any speedup we can make to our queries is very important. There are a lot of different ways we optimize our queries when we run them, but I want to talk about a few optimizations in particular that are present in the real Mixpanel, surrounding the properties construct.

It turns out that index lookups are the only useful thing to do with properties, and they’re present in every interesting expression. In fact, most interesting expressions do property lookups using string literals. We can speed things up a lot if we change our parse to look like this:

A Mixpanel Expression Language (1)

Now, instead of interpreting properties["artist"] as a compound expression with three nodes (a properties access, the primitive “artist”, and an index lookup) we’ll interpret it as a single node. That means less indirection, and we won’t need to call evaluate on both sides of the Index every time.

We can perform this optimization during the parse, and just create a different kind of node, like this:

Notice that this only works if we know the complete value of the key beforehand – the value of the key can depend on the parse, but not on the value of other properties. The easy way to make sure that’s the case is to only allow literal values as property indexes. Our previous grammar allowed any arbitrary expression as a property lookup, but this grammar won’t do that. We’re trading a little bit of power for a lot of performance (and the ability to make future optimizations.)

Here’s how we might tweak what we pass to the parser generator to use this new grammar.

This change to the parser has two implications. The first one is an improvement – in the old version of the language, the following was a legal but meaningless parse:

Now we’ll catch this as an error during parsing rather than just always returning a useless undefined result – that’s the kind of change with no real tradeoffs. The other change makes our query language slightly less expressive. In the old version of the language, users could say:

Which is now an error. That’s a choice- it would be easy for us to change the line that reads

to instead read

We could even keep our optimization in the special case of a literal index key- however, this would make the implementation a lot more complicated, and possibly rule out other optimizations down the road. Limiting the language to an explicit, static key in property lookups is actually an important decision, and a tradeoff – we trade some expressiveness in the language for performance now and elbow room to make improvements later.

Of course, it would be easy to add property lookup by dynamic key to the language later, if it turns out to be an important feature. In general, it is difficult to take capabilities away once we have clients out in the wild performing queries, so when faced with a tradeoff like this we generally err on the side of caution.

Optimizing property lookups with PropertyLookup speeds up expression evaluation quite a lot, but we’re not done yet. It turns out that property lookups by key are actually still kind of expensive, and expressions like

that refer to the same property more than once are quite common. Caching the results of the property lookup seems like it’s worth doing. To do the caching, we’ll use a helper class, the Reference:

Note: this is where the Python analogy breaks down a bit. It’s not clear that self.value is all that much faster than properties["value"] in most circumstances in Python. In our real-world implementation in C, caching means the difference between looking things up in a hashtable and just dereferencing a pointer. For the sake of the argument, please suppose that self.value can be optimized more easily by the python interpreter than the dictionary lookup.

A reference is pretty simple, but we can use it to cache lookups in our expressions with just a little extra work, and a change to how we call evaluate:

This implementation works great, but it will require a bunch of extra work from our parsing engine. It also changes the way we call evaluate() – since we’re binding references before the call, there is nothing to pass in.

We can imagine a parser that works something like this:

The changed line, in yellow, refers to some wonderful references helper we haven’t written yet. It would probably looks something like this:

Now we end up with an expression, and a populated references object we can use to bind the expression to a different context with new records. We’ll need adapt our run_query function to accommodate the new interface.

Bringing it on home

We started this journey because we wanted to write a query server – a process that could take expressions, parse them and execute them. Our new queries can do just that.

To sum up, we’ve built a simple engine for querying, and the applied some optimizations. The “apply some optimizations” bit ends up being the lion’s share of the work, and usually involves some trade-off between expressiveness of the language, complexity of the implementation, and performance of the expression.

I hope this post gives you some insight into the basic ideas behind parsing and executing expressions written in a custom language. For some problems, a custom language is the best answer, and this should give you a starting point for thinking about writing one.

UPDATE: We’ve posted a sequel to this post that describes how to take this simple framework to the next (semantic) level, literally, by adding iterators and scope!

If you're interested in what we work on, please apply - we're hiring: http://mixpanel.com/jobs/

Leave a Reply

Your email address will not be published. Required fields are marked *