Diagnosing networking issues in the Linux Kernel

A few weeks ago we started noticing a dramatic change in the pattern of network traffic hitting our tracking API servers in Washington DC. From a fairly stable daily pattern, we started seeing spikes of 300-400 Mbps, but our rate of legitimate traffic (events and people updates) was unchanged.

Suddenly our network traffic started spiking like crazy.

Pinning down the source of this spurious traffic was a top priority, as some of these spikes were triggering our upstream routers into a DDoS mitigation mode, where traffic was being throttled.

There are several good built-in Linux tools that help in diagnosing networking issues.

  • ifconfig  will show you your interfaces and how many packets are moving across them
  • ethtool -S  will show you more detailed information on packet flow, with counters for things like packets dropped at the NIC level.
  • iptables -L -v -n  will show you the counts of packets being processed by your various firewall rules.
  • netstat -s  will show you the values of a bunch of counters maintained by the kernel network stack, e.g. the number of ACKs, the number of retransmits, etc.
  • sysctl -a | grep net.ip  will show you your kernel network-related settings.
  • tcpdump  will show you the contents of the packets going back and forth.

The clue to our problem was in the output of netstat -s. Unfortunately, when you look at the output of this command, it can be hard to tell what the numbers mean, what they should be, and how they are changing. To see how they were changing, we created a small program that shows the numeric deltas between successive runs of a command, letting us watch how fast the various counters were ticking. A minimal sketch of the idea (the script name and output format here are illustrative, not our exact tool):
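    #!/usr/bin/env python
    # deltas.py: run a command once a second and print how much each
    # numeric counter in its output changed since the previous run.
    import re
    import subprocess
    import sys
    import time

    previous = {}
    while True:
        output = subprocess.check_output(sys.argv[1:]).decode()
        current = {}
        for line in output.splitlines():
            match = re.search(r'\d+', line)
            if match:
                # Key each counter by its line with the number blanked out.
                key = re.sub(r'\d+', 'N', line).strip()
                current[key] = int(match.group())
        for key, value in sorted(current.items()):
            delta = value - previous.get(key, value)
            if delta:
                print('%+8d  %s' % (delta, key))
        previous = current
        time.sleep(1)

Running it against netstat -s, one of the output lines looked particularly worrying.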

The usual rate of this counter on an unaffected server of ours is more like 30-40 per second, so we knew something was wrong here. The counter (netstat prints it as “packets rejects in established connections because of timestamp”) suggested that we were rejecting a large number of packets because they had invalid TCP timestamp values. The short-term fix to quickly mitigate the issue was to turn off TCP timestamps with the following sysctl command:
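    # turn off RFC 1323 TCP timestamps; takes effect immediately
    sysctl -w net.ipv4.tcp_timestamps=0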

This immediately caused the packet storm to stop. It isn’t a permanent solution though, as TCP timestamps are useful for measuring round-trip time and for recognizing old, delayed packets so they aren’t mistaken for part of the current stream. That protection matters on high-speed connections, where TCP sequence numbers can wrap around in timespans on the order of seconds. For more information on TCP timestamps and performance, take a look at RFC 1323.

At Mixpanel we generally run a tcpdump as well whenever we see abnormal traffic patterns, so that we can analyze the traffic afterward and try to determine a root cause. What we found was a huge number of TCP ACK packets being sent back and forth between our API server and a particular IP address. Effectively, our server was stuck in an infinite loop with another server, each host continually acknowledging a TCP timestamp that the other end did not recognize as valid. A capture invocation along these lines is all it takes (the interface and address here are placeholders):
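    # record everything to/from the suspect host for offline analysis
    tcpdump -ni eth0 -w storm.pcap host 203.0.113.7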

At this point we realized we were dealing with an issue that could only be solved in the Linux kernel TCP stack, so our CTO went to the netdev mailing list to see if we could find a solution. Thankfully, the issue had been encountered before, and a fix was available. It turns out this type of packet storm can be initiated by faulty hardware or a third party changing the TCP SEQ, ACK, or timestamp values on a connection to the point where each host thinks the other is sending out-of-window packets. The way to keep this from turning into a packet storm is to limit the rate at which Linux will send duplicate ACK packets to only one or two per second. Here is a great explanation on the topic.
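On kernels that carry the fix, that limit is exposed as a sysctl (per the upstream patchset it landed as net.ipv4.tcp_invalid_ratelimit, in milliseconds; worth double-checking on your kernel version):

    # send at most one duplicate ACK per 500 ms in response to
    # invalid (e.g. out-of-window) packets
    sysctl -w net.ipv4.tcp_invalid_ratelimit=500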

We were able to take this patch and backport it to the current Ubuntu (trusty) kernel that we use. Thankfully Ubuntu makes this pretty simple: recompiling the patched kernel was a matter of Ubuntu’s standard kernel-rebuild steps, roughly the commands below (the patch filename is illustrative), then installing the resulting .deb files and rebooting.
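    # fetch the source and build dependencies for the running kernel
    apt-get source linux-image-$(uname -r)
    sudo apt-get build-dep linux-image-$(uname -r)
    cd linux-*

    # apply the backported fix (filename is illustrative)
    patch -p1 < ../tcp-ack-loop-mitigation.patch

    # build the Ubuntu kernel .deb packages
    fakeroot debian/rules clean
    fakeroot debian/rules binary-headers binary-generic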

Building data products that people will actually use

This was originally posted on High Scalability.

Building data products is not easy.

Many people are uncomfortable with numbers, and even more don’t really understand statistics. It’s very, very easy to overwhelm people with numbers, charts, and tables – and yet numbers are more important than ever. The trend toward running companies in a data-driven way is only growing…which means more programmers will be spending time building data products. These might be internal reporting tools (like the dashboards that your CEO will use to run the company) or, like Mixpanel, you might be building external-facing data analysis products for your customers.

Either way, the question is: how do you build usable interfaces to data that still give deep insights?

We’ve spent the last 6 years at Mixpanel working on this problem. In that time, we’ve come up with a few simple rules that apply to almost everyone:

  1. Help your users understand and trust the data they are looking at
  2. Strike the right balance between ease and power
  3. Support rapid iteration & quick feedback loops

Continue reading

Feb 2015 Mixpanel C++ meetup: Fun with Lambdas (Effective Modern C++ chapter 6)

We’ve been hosting a series of monthly meetups on C++ programming topics. The theme of the series is a chapter-by-chapter reading of Scott Meyers’ new book, “Effective Modern C++”.

The meetings so far have been:

  1. December: Arthur O’Dwyer on “C++11’s New Pointer Types” (EMC++ chapter 4)
  2. January: Jon Kalb on “Rvalue References, Move Semantics, and Perfect Forwarding” (EMC++ chapter 5)
  3. February: Sumant Tambe on “Fun with Lambdas” (EMC++ chapter 6)

Next up, we’ll be continuing chapter 6 with a presentation on “Generic Lambdas from Scratch”. Come by the office and check it out!

Building a simple expression language

The Mixpanel reporting API is built around a custom expression language that customers (and our main reporting application) can use to slice and dice their data. The expression language is a simple tool that allows you to ask powerful and complex questions and quickly get the answers you need.

The actual Mixpanel expression engine is part of a complex, heavily optimized C program, but the core principles are simple. I’d like to build a model of how the expression engine works, in part to illustrate how simple those core principles are, and in part to use for exploring how some of the optimizations work.

This post will use a lot of Python to express common ideas about data and programs. Familiarity with Python isn’t required to enjoy and learn from the text, but familiarity with a programming language that has string-keyed hash tables (maps or dictionaries), or with the JSON data model, will help a lot.
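To give a flavor of the approach (a toy sketch, not Mixpanel’s actual engine or its expression syntax), the heart of such an engine can be a single recursive function over JSON-shaped expression trees:

    # A toy evaluator for JSON-shaped expression trees. Illustrative only.
    def evaluate(expr, event):
        if not isinstance(expr, dict):
            return expr                        # literals evaluate to themselves
        if expr['op'] == 'property':
            return event.get(expr['name'])     # read a property off the event
        left = evaluate(expr['left'], event)
        right = evaluate(expr['right'], event)
        if expr['op'] == '==':
            return left == right
        if expr['op'] == 'and':
            return left and right
        raise ValueError('unknown operator: %r' % expr['op'])

    # Does this event come from the iOS platform with sync enabled?
    query = {'op': 'and',
             'left': {'op': '==',
                      'left': {'op': 'property', 'name': 'platform'},
                      'right': 'iOS'},
             'right': {'op': 'property', 'name': 'sync enabled'}}
    print(evaluate(query, {'platform': 'iOS', 'sync enabled': True}))  # True
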
Continue reading

Queuing and Batching on the Client and the Server

We recommend work queues and message batching to our customers as an approach for scaling up server-side Mixpanel implementations, and we use the same approach under the hood in our Android client library to scale downward to fit the constraints of a mobile phone: battery power and CPU.

The basic technique, where work to be done is discovered in one part of your application and then stored to be executed in another, is simple but broadly useful, both for scaling up in your big server farm and for scaling down to your customers’ smartphones. A minimal sketch of the pattern (the names are illustrative; real implementations add persistence, retries, and flush-on-shutdown):
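    import queue
    import threading
    import time

    BATCH_SIZE = 50
    work = queue.Queue()

    def track(message):
        # Producer side: cheaply record the work and return immediately.
        work.put(message)

    def flush_loop(send_batch):
        # Consumer side: drain the queue and hand messages on in batches.
        while True:
            batch = [work.get()]              # block until there is work
            while len(batch) < BATCH_SIZE:
                try:
                    batch.append(work.get_nowait())
                except queue.Empty:
                    break
            send_batch(batch)

    threading.Thread(target=flush_loop, args=(print,), daemon=True).start()
    for i in range(120):
        track({'event': 'page view', 'n': i})
    time.sleep(1)                             # let the worker flush before exit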

Continue reading

Debugging MySQL performance at scale

On Monday we shipped distinct_id aliasing, a service that makes it possible for our customers to link multiple unique identifiers to the same person. It’s running smoothly now, but we ran into some interesting performance problems during development. I’ve been fairly liberal with my keywords; hopefully this will show up in Google if you encounter the same problem.

The operation we’re doing is conceptually simple: for each event we receive, we make a single MySQL SELECT query to see if the distinct_id is an alias for another ID. If it is, we replace it. This means we get the benefits of multiple IDs without having to change our sharding scheme or move data between machines. The lookup itself is just a point query, something like this sketch (the connection details, table, and column names here are made up, not our actual schema):
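    # Illustrative sketch only: connection details, table, and column
    # names are invented for this example.
    import MySQLdb

    def resolve_distinct_id(conn, distinct_id):
        cur = conn.cursor()
        cur.execute(
            'SELECT canonical_id FROM id_aliases WHERE alias = %s LIMIT 1',
            (distinct_id,))
        row = cur.fetchone()
        return row[0] if row else distinct_id  # no alias: keep the ID as-is

    conn = MySQLdb.connect(host='127.0.0.1', user='app', db='analytics')
    event = {'event': 'Signed Up', 'distinct_id': 'u-123'}
    event['distinct_id'] = resolve_distinct_id(conn, event['distinct_id'])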

A single SELECT would not normally be a big deal – but we’re doing a lot more of them than most people. Combined, our customers have many millions of end users, and they send Mixpanel events whenever those users do stuff. We did a little back-of-the-envelope math and determined that we would have to handle at least 50,000 queries per second right out of the gate.
Continue reading

How we handle deploys and failover without disrupting user experience

At Mixpanel, we believe giving our customers a smooth, seamless experience when they are analyzing data is critically important. When something happens on the backend, we want the user experience to be disrupted as little as possible. We’ve gone to great lengths to learn new ways of maintaining this level of quality, and today I want to share some of the techniques we’re employing.

During deploys

Mixpanel.com runs Django behind nginx using FastCGI. Some time ago, our deploys consisted of updating the code on our application servers, then simply restarting the Django process. This would result in a few of our rubber chicken error pages when nginx failed to connect to the upstream Django app servers during the restart. I did some Googling and was unable to find any content solving this problem conclusively for us, so here’s what we ended up doing.
Continue reading

We went down, so we wrote a better pure python memcache client

Memcache is great. Here at Mixpanel, we use it in a lot of places, mostly to cache MySQL queries but also for other data stores. We also use kestrel, a queue server that speaks the memcache protocol.

Because we use eventlet, we need a pure Python memcache client so that eventlet can patch the socket operations to be non-blocking; a client implemented as a C extension does its I/O outside Python’s socket module, where eventlet’s monkey-patching can’t reach it. The de facto standard pure Python client is python-memcached, which we used until recently. The idea, roughly:
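    import eventlet
    eventlet.monkey_patch()  # swap in green (non-blocking) socket et al.

    # python-memcached does its I/O through Python's socket module, so its
    # calls now cooperatively yield instead of blocking the whole process.
    import memcache
    client = memcache.Client(['127.0.0.1:11211'])
    client.set('greeting', 'hello')
    print(client.get('greeting'))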
Continue reading

How to do cheap backups

This post is a follow up to Why we moved off the cloud.

As a company, we want to do reliable backups on the cheap. By “cheap” I mean in terms of cost and, more importantly, in terms of developers’ time and attention. In this article, I’ll discuss how we’ve been able to accomplish this and the factors we consider important.

Backups are an insurance policy. As with conventional insurance policies (e.g. renter’s), you want peace of mind that your stuff is covered if disaster strikes, while paying the best price you can from the available options.

Backups are similar. Both your team and your customers can rest a bit more easily knowing that you have your data elsewhere in case of unforeseen events. But on the flip side, backups cost money and time that could be better applied to improving your product — delivering more features, making it faster, etc. This is good motivation for keeping the cost low while still being reliable.

Continue reading

Internship stories

Last year, I wrote about my internship story because I felt it was such an impactful experience for me. It was simply a story of how working hard and being out in Silicon Valley can lead to very serendipitous occurrences. I don’t think I could have built Mixpanel without the knowledge and connections I gained at Slide. I learned so much about product and how to “get things done” at a real company, and I made really close friends that I will keep for life. I was also fortunate enough to work closely with Max, who has been an invaluable mentor and investor for our business.

The point of that post, of course, was to find ourselves interns. We wanted to get a lot of work done, but we also genuinely wanted to give them an extremely meaningful experience like my own. We’d publicly promised them one, so we set out to make good on it. At the end of the summer, I asked them to write about what it was like to intern at Mixpanel. I hope those of you who are considering interning at a startup vs. a big company will benefit.

Continue reading