At Mixpanel, we process billions of API transactions each month and that number can sometimes increase rapidly just in the course of a day. It’s not uncommon for us to see 100 req/s spikes when new customers decide to integrate. Thinking of ways to distribute data intelligently is pivotal in our ability to remain real-time.
I am going to discuss several techniques that allow people to horizontally distribute data. We have conducted interviews (by the way, we’re hiring engineers) with people in the past that make poor decisions in partitioning (e.g. partitioning by the first letter in a user’s name) and I think we can spread some knowledge around. Hopefully, you’ll learn something new.
Consistency is the idea that when you add a new node to the system it will not shift all the items and then shard old data to a new node. Balancing means that your partitioning system can evenly distribute the potential load on the system. Split information is effectively the data you can utilize or have deemed useful to evenly distribute data across your infrastructure. For example, at Mixpanel, we do not use a customer id to split data because our customers can range from having 100′s to 10′s of millions of users.
You absolutely need these properties in order to shard your system correctly.
One simple way to partition your data is in cells (I believe Facebook uses this term and I like it). The idea is that you purchase a certain fixed set of hardware upfront: lets say 10 nodes. You partition each user via modulo into each node. So for example, a user with the id of 783,902 will be sharded to node 2 (crazy math: 783902 % 10). The idea of cells is that when you notice performance degradation you can create a new cell with 10 more nodes in it. New users (let’s say users above user_id 1 million) will then always get sharded to a node in that cell.
Cells have a nice property in that if a part of your infrastructure goes down it only affects a subset of your users. That, in many cases, is a huge advantage. Cells can also be used in conjunction with other sharding techniques.
Routing is a bit more complicated. If your data happens to be data where you get to control the split information (e.g. a user_id) then you can use routing. Here’s how it works:
- Decide how large a user_id can potentially be.
- Decide how many nodes you’ll reasonably ever have.
- Make your new user_id the length of #1 and #2.
- The first part of the user_id will be the shard id and the second part will be the actual user_id.
For example, imagine we believe our company can have 1 million nodes then we may only need 32 bits to represent that. That means the first 4 bytes of our user_id will be shard_id and the remaining data will be just the user_id. The last part could be another 8 bytes if we think we can have over 4B (max int) users. Never know!
When a user shows up to comment, upload a picture, etc. you know exactly which node this specific user is on and can grab all the data at once or write the data by just parsing the shard_id from their user_id.
Lookup tables are useful when you can’t do routing. The idea is that you have a giant hash table in some datastore and you map some unique information to a random node in your system. This allows us to have consistency even though we may not know the exact information that is being piped to us. The best example of this is when you have single users who generate tons of data that you do not control and where that data cannot fit on 1 single node. Luckily, data of this type will generally have characteristics that allow you to evenly distribute it.
Lookup tables are easy enough to implement but you may come across the issue that you wish to shard your lookup table. The irony of that situation is intense at times. Fortunately, your fallback plan can be to use cells of lookup tables where you buy enough nodes within a cell that no single customer on the system could reasonably cause bottlenecks. By the time you need that you will understand your infrastructure fairly well anyway to know that exact number.
This sucks but people scale this way. It’s fairly simple: you shard users anyway you want, if a node is seeing problems you migrate some of the data away. If you’re willing to accept the tediousness of data migration then there’s all kinds of solutions to scaling.
Consistent hashing on a ring
Consistent hashing on a ring is interesting and a lot of NoSQL type datastores use it (Cassandra, Riak, etc.). The idea is that you have a ring that is numbered 0 to a huge number (2**64) and the piece of data is assigned a number on that ring. As you add new nodes to the system they are given a number on that ring and they will basically be declared as the owner of certain range of tokens on that ring (e.g. 0 – 1M).
Check out the picture from project voldemort:
Please be aware there are many issues with this but it primarily works well (in my opinion) in caching situations (e.g. memcache). If any node in the ring is added or disappears the data will have to shift making you lose some level of consistency. In caching, this is relatively okay as long as you can avoid the stampeding.
Outside of caching where you need persistency, data must migrate. There’s a concept of vnodes that riak employs to get around this. Cassandra does not use vnodes however. We deprecated Cassandra from our stack specifically because it was difficult to have proper node balance as some nodes would have more data than others. The solution for Cassandra is to continuously double your infrastructure in order to get even data distribution or move nodes around to new tokens on the ring (extremely tiresome).
Tips and tricks
Read repairing is great when things go bad or you wish to migrate data in a less tedious way. The idea is that every time new data is written, you map it to the new shard, and then write it to the new shard as you normally would. Anytime old data is updated or read you re-map it to a new shard_id and write it to the new shard_id. Of course, anytime data is read you always check the new shard, if it exists, great, but if not, then you must check the old shard.
- Is data in new shard? Yes: Return it. No: Ask old shard.
- Is data in the old shard? Yes: Write it to new shard and return it. No: Does not exist or if you’re writing something you must write it to the new shard.
Eventually, you will hit a point where 99% of reads/writes are occurring on your new node.
Write flags can be employed in many partitioning scenarios such as in lookup tables, cells, or even routing. The idea is that once a node appears to be hitting capacity you stop writing any new data to it. For example, at Mixpanel when we see nodes in our system reaching a certain threshold of iowait we will stop writing new data to that node (old data that is currently mapped to it continues to be written) but new data will be sharded to new less loaded nodes in our infrastructure.
If you have any sharding techniques that I have not talked about here that are more novel please talk about them in comments because we’d love to hear them. If you find yourself needing to use any of these techniques on the web, it is generally a sign of success!