by Colin Miller
Many serious applications on the web need to scale to some extent. For smaller projects that you make for yourself or for a small audience, a single server box hosting both the web server and the database can be enough. For larger audiences and application demands, often you'll separate the web server from the database into different boxes. This will allow a small bit of scaling as you can utilize one machine for your processing logic and another for your data store. But what about going further than that?
So you find that your front end machine can no longer handle all of the traffic arriving. Typically you'll start setting up your application on multiple front end machines and hide them behind a load balancer so that your traffic is split between different machines. Now this can add in some new problems depending on how you architected your application (session data can become tricky), but overall it can help release load. In fact if you engineered correctly so as to not utilize session data to a great extent, or if you use a smart load balancer that works on an IP address hashing algorithm to make sure that requests from one client always go to the same server, then you can expand horizontally in a linear fashion. Excellent! But what about your data?
Well most people store their data in some sort of RDBMS (Relational DataBase Management System), basically a database. On the free side there are tools like MySQL, Postgresql, sqlite, berkely db, and probably a slew of others. On the not-free side there are a lot of options, but most people end up talking about Oracle. I'm not going to talk about Oracle because it's a very large, expensive, difficult to manage piece of software that does a lot, but requires a lot of support. I also won't talk about MSSQL because no one should host anything on a Windows machine (personal bias). MySQL is what I'm most familiar with and probably the most popular of the free databases so it is where I will focus.
If you have all of these front end machines set up with load distributed between them you're going to run into a problem where they are all trying to access the same database and they get bogged down because your DB server has too high of a load. What can you do? Well you can set up replication and make slaves is what! Each slave reads commands that were performed on the master and performs them locally so as to keep data in sync. Basically a slave receives all of the same write commands from the master that the master received so that it can have an up to date¹ copy of your data. From then on you can treat your slave as read-only and send read requests to the slave instead of the master, thereby freeing up processing on your master for just writes (and critical reads, but I won't get into that).
You can even add lots of slaves to scale reads horizontally in a linear fashion, which is exactly what we want. But what about writes? While one of your many front end boxes can choose one of your many slave database boxes to read data from, when it comes time to insert or update any data, those requests all have to go to the master database. In a write heavy application that can become a bottleneck. Unfortunately you cannot scale writes like you can with reads. You could create another DB machine with Master-Master replication (each machine is a slave to the other), but unfortunately it doesn't help your write problem at all because each master has to write not only its own requests but also replicate from the other thereby nulling any effect you might have wanted to gain.
The only real solution I've ever heard about is sharding, which isn't a very fun solution at all. Basically you figure out some method for separating the data, say by user_id. So all user data with an ID between a certain range gets sent to one master box and all of the data for IDs in another range go to another box. Unfortunately because your data is now in two separate database systems, you can't run SQL joins on them (Oracle RAC can I believe, but again it's expensive and hard to maintain for the regular developer). Instead you'll have to make your queries separately and massage the data in code, separating your model into multiple places that is error prone.
I start to wonder then, if we need to use a database at all. Is there another data storage engine that would work better? I think about projects like memcached, and I wonder if one could make a similar system but with more integrity. There are some efforts to do so with hadoop, but they're not quite ready yet and I'm not sure about the real-time performance of anything involved with hadoop just yet.
A lot of people won't run into the write problem with scalability. But if you do it can be a real difficulty to solve in an elegant fashion. I haven't done so, but I think it's one of the next areas of research I shall be looking into.
¹ Technically the data isn't guaranteed to be up to date as the replication process has some lag to it. It can take a few seconds or maybe minutes depending on the data and load for replication to catch up and the data to be mirrored onto the slave. That is why any critical reads that need to know exactly what's on the master without doubt must be made to the master server.