Mixpanel Engineering

Real-time scaling

Queuing and Batching on the Client and the Server

without comments

We recommend that our customers set up work queues and batch messages as a way to scale server-side Mixpanel implementations upward, but we use the same approach under the hood in our Android client library to scale downward to fit the constraints of a mobile phone: battery power and CPU.

The basic technique, where work to be done is discovered in one part of your application and then stored to be executed in another, is simple but broadly useful, both for scaling up in your big server farm and for scaling down for your customers' smartphones.
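As a minimal sketch of the technique (not the Android library's actual code), the producer side below only records work on a queue, and a separate consumer step drains it in batches. The class name `BatchingSender`, the `send_batch` callback, and the batch size are all hypothetical.

```python
import queue


class BatchingSender:
    """Collects messages where they are discovered and sends them later in batches."""

    def __init__(self, send_batch, batch_size=50):
        self._queue = queue.Queue()      # thread-safe store for pending work
        self._send_batch = send_batch    # hypothetical callback that ships one batch
        self._batch_size = batch_size

    def enqueue(self, message):
        """Producer side: cheap, just remembers the message for later."""
        self._queue.put(message)

    def drain(self):
        """Consumer side: send everything queued so far, batch_size messages at a time.

        Returns the number of batches sent.
        """
        batches = 0
        while True:
            batch = []
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            if not batch:
                return batches
            self._send_batch(batch)
            batches += 1
```

On a server, `drain` would typically run in a worker process; on a phone, it could run on a timer so that the radio wakes up once per batch rather than once per event.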

 

Read the rest of this entry »

Written by joe

February 15th, 2013 at 2:02 pm

Posted in Frontend

Debugging MySQL performance at scale

without comments

On Monday we shipped distinct_id aliasing, a service that makes it possible for our customers to link multiple unique identifiers to the same person. It’s running smoothly now, but we ran into some interesting performance problems during development. I’ve been fairly liberal with my keywords; hopefully this will show up in Google if you encounter the same problem.

The operation we’re doing is conceptually simple: for each event we receive, we make a single MySQL SELECT query to see if the distinct_id is an alias for another ID. If it is, we replace it. This means we get the benefits of multiple IDs without having to change our sharding scheme or move data between machines.
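The lookup described above might look roughly like the sketch below. It uses Python's built-in `sqlite3` module purely as a runnable stand-in for MySQL, and the table name `aliases` and its columns `alias` and `original_id` are assumptions made for illustration, not Mixpanel's schema.

```python
import sqlite3


def resolve_distinct_id(conn, distinct_id):
    """Return the ID that distinct_id is an alias for, or distinct_id itself.

    `conn` is a sqlite3 connection standing in for a MySQL connection;
    the `aliases` table and its columns are hypothetical.
    """
    row = conn.execute(
        "SELECT original_id FROM aliases WHERE alias = ?",
        (distinct_id,),
    ).fetchone()
    # No row means the ID is not an alias, so the event keeps its own ID.
    return row[0] if row is not None else distinct_id
```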

A single SELECT would not normally be a big deal – but we’re doing a lot more of them than most people. Combined, our customers have many millions of end users, and they send Mixpanel events whenever those users do stuff. We did a little back-of-the-envelope math and determined that we would have to handle at least 50,000 queries per second right out of the gate.
Read the rest of this entry »

Written by Tim Trefren

December 7th, 2012 at 1:03 pm

Posted in Backend,Operations

How we handle deploys and failover without disrupting user experience

with 5 comments

At Mixpanel, we believe giving our customers a smooth, seamless experience when they are analyzing data is critically important. When something happens on the backend, we want the user experience to be disrupted as little as possible. We’ve gone to great lengths to learn new ways to maintain this level of quality, and today I want to share some of the techniques we’re employing.

During deploys

Mixpanel.com runs Django behind nginx using FastCGI. Some time ago, our deploys consisted of updating the code on our application servers, then simply restarting the Django process. This would result in a few of our rubber chicken error pages when nginx failed to connect to the upstream Django app servers during the restart. I did some Googling and was unable to find any content solving this problem conclusively for us, so here’s what we ended up doing.

The fundamental concept is very simple. Suppose that currently, the upstream Django server is running on port 8000. I added this upstream block:

upstream app {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001 down;
}

 

So now, when we fastcgi_pass to app, all the requests get sent to our Django server running on port 8000. When we deploy, we get the most up-to-date code and start up a new Django server on port 8001. Then we rewrite the upstream app block to mark 8000 as down instead of 8001, and we perform an nginx reload. The nginx reload starts up new worker processes running the new configuration, and when the old worker processes finish their existing requests, they are gracefully shut down, resulting in no downtime.
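The rewrite step could be scripted along these lines. This is a sketch under assumptions, not our deploy tooling: the function name `activate_port` is hypothetical, and it assumes every server line in the block has the `server 127.0.0.1:PORT` form shown above.

```python
import re

# Matches lines such as "    server 127.0.0.1:8000;" or "    server 127.0.0.1:8001 down;"
SERVER_LINE = re.compile(
    r"^([ \t]*server\s+127\.0\.0\.1:(\d+))(\s+down)?\s*;",
    re.MULTILINE,
)


def activate_port(upstream_conf, live_port):
    """Return the config text with only `live_port` left active.

    Every other server line in the text is marked `down`.
    """
    def rewrite(match):
        prefix, port = match.group(1), int(match.group(2))
        return prefix + (";" if port == live_port else " down;")

    return SERVER_LINE.sub(rewrite, upstream_conf)
```

After writing the result back to the configuration file, the deploy would ask nginx to reload, for example with `nginx -s reload`.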

Another option to consider is using the backup directive instead of down. This causes nginx to automatically fail over to the servers marked with backup when connections to the other servers in the block fail. You’re then able to deploy seamlessly by first restarting the backup server, and then the live one. The advantage here is that no configuration file rewriting is required, nor any restarting of nginx. Unfortunately, some of our legitimate requests take longer than a second to resolve, which produces a false positive for the original server being down.

Spawning is yet another option. Spawning can run your Django server, monkeypatched with eventlet to provide asynchronous IO. Furthermore, it has graceful code reloading. Whenever it detects that any of your application’s Python files have changed, it starts up new processes using the updated files and gracefully switches all request handling to the new processes. Unfortunately, this solution didn’t work out for us, as somewhere within our large Django application we had some long blocking code. This prevented eventlet from switching to another execution context, resulting in timeouts. Nevertheless, this would still be the best option if you can make sure that your WSGI application doesn’t have any blocking code.

During data store failures

At Mixpanel, we employ a custom-built data store we call “arb” to perform the vast majority of queries that our customers run on data. These machines are fully redundant and are queried through HTTP requests using httplib2. When a machine fails for any reason, we want to be able to seamlessly detect the failure and redirect all requests to the corresponding replica machine. Doing this properly required some modification of the HTTPConnection class.

The main problem was that httplib2 only supported a single socket timeout parameter, used for both sending and receiving through the underlying socket. However, we wanted the initial connection to fail very quickly, but still have a long receive timeout, since a query over large amounts of data could legitimately take a long time. Luckily, httplib2 requests allow for passing in a custom connection type, as long as it implements the methods of httplib.HTTPConnection. Armed with this knowledge, we created our own subclass of HTTPConnection that had a custom connect method. Prior to making the connection, we used settimeout on the socket object to lower the timeout to a short 1 second. If the connection was successful, we reverted the timeout back to the original setting.

This way, if we get a socket.error exception on the connection, a custom ConnectTimeoutException gets raised and the machine being connected to is properly marked as down. One small drawback is that the request takes an additional second, but this only needs to happen a small number of times before all future requests see the machine marked as down. For the requests that time out on connection, we simply handle the ConnectTimeoutException and retry the query on the replica machine.

The takeaway here is to take advantage of the ability to change the socket timeout to check for an unresponsive machine. Often with systems that work with large volumes of data, long timeouts are required for database queries. But this is only necessary for established connections. When the connection is initially created, failing fast results in a better user experience, avoiding long delays when a machine goes down.
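The connect-timeout technique from this section can be sketched roughly as follows. This is not our production code: it targets Python 3's `http.client` rather than the `httplib` module named above, and the names `FastConnectHTTPConnection`, `connect_with_failover`, `read_timeout`, `connect_timeout`, and `down_hosts` are hypothetical. Only `ConnectTimeoutException` is a name taken from the text.

```python
import http.client


class ConnectTimeoutException(Exception):
    """Raised when the initial TCP connection fails or times out."""


class FastConnectHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection with a short connect timeout and a long read timeout."""

    def __init__(self, host, port=None, read_timeout=60.0, connect_timeout=1.0, **kwargs):
        super().__init__(host, port, timeout=read_timeout, **kwargs)
        self.read_timeout = read_timeout
        self.connect_timeout = connect_timeout

    def connect(self):
        # HTTPConnection.connect() uses self.timeout while connecting,
        # so shorten it for the connection attempt only.
        self.timeout = self.connect_timeout
        try:
            super().connect()
        except OSError as exc:  # covers socket timeouts and refused connections
            raise ConnectTimeoutException(
                f"could not connect to {self.host}:{self.port}"
            ) from exc
        finally:
            self.timeout = self.read_timeout
        # Connected: restore the long timeout for sending and receiving.
        self.sock.settimeout(self.read_timeout)


def connect_with_failover(primary, replica, down_hosts, **kwargs):
    """Connect to primary unless it is marked down; otherwise fall back to replica.

    `primary` and `replica` are (host, port) tuples; `down_hosts` is a set that
    remembers machines whose connection attempts have failed.
    """
    for host, port in (primary, replica):
        if (host, port) in down_hosts:
            continue
        conn = FastConnectHTTPConnection(host, port, **kwargs)
        try:
            conn.connect()
            return conn
        except ConnectTimeoutException:
            down_hosts.add((host, port))
    raise ConnectTimeoutException("primary and replica are both unreachable")
```

Because `down_hosts` is shared between requests, only the first few requests pay the extra connect delay; later ones skip the failed machine entirely.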

Written by Anlu Wang

September 28th, 2012 at 12:15 pm

We went down, so we wrote a better pure python memcache client

with 8 comments

Memcache is great. Here at Mixpanel, we use it in a lot of places, mostly to cache MySQL queries but also for other data stores. We also use kestrel, a queue server that speaks the memcache protocol.

Because we use eventlet, we need a pure Python memcache client so that eventlet can patch the socket operations to be non-blocking. The de facto standard for this is python-memcached, which we used until recently.
Read the rest of this entry »

July 16th, 2012 at 12:00 pm

Posted in Uncategorized

How to do cheap backups

with 7 comments

This post is a follow up to Why we moved off the cloud.

As a company, we want to do reliable backups on the cheap. By “cheap” I mean in terms of cost and, more importantly, in terms of developers’ time and attention. In this article, I’ll discuss how we’ve been able to accomplish this and the factors that we consider important.

Backups are an insurance policy. As with conventional insurance policies (e.g. renter’s), you want peace of mind that your stuff is covered if disaster strikes, while paying the best price you can from the available options.

Backups are similar. Both your team and your customers can rest a bit more easily knowing that you have your data elsewhere in case of unforeseen events. But on the flip side, backups cost money and time that could be better applied to improving your product — delivering more features, making it faster, etc. This is good motivation for keeping the cost low while still being reliable.

Read the rest of this entry »

Written by peter

February 21st, 2012 at 4:30 pm

Posted in Uncategorized

Internship stories

with 6 comments

Last year, I wrote about my internship story because I felt it was such an impactful experience for me. It was simply a story of how working hard and being out in Silicon Valley can lead to very serendipitous occurrences. I don’t think I could have built Mixpanel without the knowledge and connections I gained at Slide. I learned so much about product, how to “get things done” at a real company, and met really close friends that I will take with me forever in life. I was also fortunate enough to work closely with Max, who has been an invaluable mentor and investor for our business.

The point of that post, of course, was to find ourselves interns. We wanted to get a lot of work done, but we also genuinely wanted to give them an extremely meaningful experience like my own. We’d publicly promised them one, so we set out to make good on it. At the end of the summer I asked them to write about what it was like to intern at Mixpanel. I hope those of you who are considering interning at a startup vs. a big company will benefit.

Read the rest of this entry »

Written by Suhail

November 15th, 2011 at 12:31 pm

Posted in Uncategorized

Why We Moved Off The Cloud

with 58 comments

This post is a follow up to We’re moving. Goodbye Rackspace.

Cloud computing is often positioned as a solution to scalability problems. In fact, it seems like almost every day I read a blog post about a company moving infrastructure to the cloud. At Mixpanel, we did the opposite. I’m writing this post to explain why and maybe even encourage some other startups to consider the alternative.

Read the rest of this entry »

Written by mixpanel

October 27th, 2011 at 12:34 pm

Posted in Uncategorized

How and Why We Switched from Erlang to Python

with 29 comments

A core component of Mixpanel is the server that sits at http://api.mixpanel.com. This server is the entry point for all data that comes into the system – it’s hit every time an event is sent from a browser, phone, or backend server. Since it handles traffic from all of our customers’ customers, it must manage thousands of requests per second, reliably. It implements an interface we’ve spec’d out here, and essentially decodes the requests, cleans them up, and then puts them on a queue for further processing.

Because of these performance requirements, we originally wrote the server in Erlang (with MochiWeb) two years ago. After two years of iteration, the code has become difficult to maintain. No one on our team is an Erlang expert, and we have had trouble debugging downtime and performance problems. So, we decided to rewrite it in Python, the de facto language at Mixpanel.

Given how crucial this service is to our product, you can imagine my surprise when I found out that this would be my first project as an intern on the backend team. I really enjoy working on scaling problems, and the cool thing about a startup like Mixpanel is that I got to dive into one immediately. Our backend architecture is modular, so as long as my service implemented the specification, I didn’t have to worry about ramping up on other Mixpanel infrastructure.

Read the rest of this entry »

Written by mixpanel

August 5th, 2011 at 5:37 pm

Posted in Uncategorized

My first week at Mixpanel, or how I didn’t take down the Internet

with 2 comments

During my first week at Mixpanel I was asked to design, implement and deploy a highly requested feature in our core JavaScript library. I had just started as the new intern and I hit the ground running. Our customers wanted a simple method to track link clicks without having to hassle with browser incompatibilities or fiddle with event models. The new functionality would also lay the groundwork for future enhancements such as form integration. I got to work right away.

Read the rest of this entry »

May 23rd, 2011 at 10:54 am

Posted in Frontend

Sharding techniques

with 5 comments

At Mixpanel, we process billions of API transactions each month and that number can sometimes increase rapidly just in the course of a day. It’s not uncommon for us to see 100 req/s spikes when new customers decide to integrate. Thinking of ways to distribute data intelligently is pivotal in our ability to remain real-time.

I am going to discuss several techniques that allow people to horizontally distribute data. We have interviewed people in the past (by the way, we’re hiring engineers) who made poor partitioning decisions (e.g. partitioning by the first letter in a user’s name), and I think we can spread some knowledge around. Hopefully, you’ll learn something new.
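To illustrate why first-letter partitioning is a poor choice, the sketch below compares it with hashing the whole key, a common alternative that is an assumption here rather than something taken from this excerpt. All function names are hypothetical.

```python
import hashlib
from collections import Counter


def shard_by_first_letter(name, num_shards):
    """The poor scheme: the shard depends only on the first letter of the name."""
    return (ord(name[0].lower()) - ord("a")) % num_shards


def shard_by_hash(key, num_shards):
    """A common alternative: hash the whole key, then take it modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribution(keys, shard_fn, num_shards):
    """Count how many keys land on each shard under the given scheme."""
    return Counter(shard_fn(key, num_shards) for key in keys)
```

With first-letter partitioning, every user whose name starts with the same letter lands on the same shard, so common initials create hot spots; hashing the full key makes placement depend on the whole value instead.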

Read the rest of this entry »

Written by Suhail

May 11th, 2011 at 4:29 pm

Posted in Backend
