OSCON Day 3: Real world scalability
I just got out of Ask Bjørn Hansen's "Real World Scalability" presentation and my head is still spinning. Ask's lightning-fast slides zipped by in a blur of practical tips for speeding up your web site. This presentation, in contrast to Theo Schlossnagle's scalability tutorial, was packed with real life advice. Theo's perspective focused more on the planning and general rules behind scalability, whereas Ask peppered us with details on what to consider and how to make changes to your site.
To set the stage for his talk, Ask talked about vertical scaling, which entails buying faster hardware to make your site run faster. Simple math shows us that instead of buying one really fast machine, you can buy hundreds of cheaper machines that collectively have much more power than the one really fast machine. The key to scalability lies in breaking your Internet site into small independent chunks that can be farmed out to many commodity hardware machines.
The first real life bit of advice focused on caching -- if you can stop regenerating often requested pages and cache the output, you can make dramatic speed improvements to your site. Start by checking your database and web server logs to see which pages are requested most often, and try to cache those pages whole. Some pages present a bigger challenge to cache than others -- pages that contain per user information (e.g. a shopping cart) can't effectively be cached in one piece, because each cached copy is useful for only one person. A cache is more effective when each copy can serve as many requests as possible.
The most drastic way to do this would be to take all of your dynamically generated pages and write them to a cache, so that pages only get regenerated when the underlying data changes. If that is not possible, you can take the data chunks that make up the pages and cache those chunks. The next time a page is requested, it can be regenerated from the cached data chunks. This works really well for slow database queries -- pulling the data from the cache can be many times faster than re-querying the database.
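To make the "cache the chunks" idea concrete, here is a minimal sketch of caching the result of a slow query in process memory. Everything named here (`TTLCache`, `slow_top_stories`, the 30 second lifetime) is made up for illustration and is not something Ask showed; a real site would pick its own keys and expiry rules.

```python
import time


class TTLCache:
    """Tiny in-process cache; each entry expires after `ttl` seconds."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (expiry time, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() only on a miss."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Drop an entry, e.g. when the underlying data changes."""
        self._store.pop(key, None)


calls = {"count": 0}


def slow_top_stories():
    # Hypothetical stand-in for a slow database query.
    calls["count"] += 1
    return ["story-1", "story-2"]


cache = TTLCache(ttl=30.0)
first = cache.get_or_compute("top_stories", slow_top_stories)
second = cache.get_or_compute("top_stories", slow_top_stories)  # served from cache
```

The second lookup never touches `slow_top_stories`. Calling `cache.invalidate("top_stories")` when the data changes forces the next request to regenerate the chunk.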
Caching your pages could be done in process memory, but that data isn't shared with other processes. Shared memory works better, but it's still not shared between machines. The best solution for caching is using memcached, which was developed by Brad Fitzpatrick from LiveJournal. Memcached uses a server model to cache arbitrary bits of data -- because it is accessed over the network, a memcached server can be queried from many client machines. This makes memcached the most flexible web site caching solution around today.
Ask's next round of advice focused on scaling databases -- his first point was to never rely on one server for all your database needs. If your web site does 99% reads and only a few writes, then this is not so critical. But for a lot of other sites that have more write needs, using one master database server for writing data and a network of replicated read-only database servers spreads the load onto multiple servers.
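Here is a rough sketch of how an application might split traffic between one master and several read-only replicas. The class name, the server names, and the "anything that starts with SELECT is a read" rule are all assumptions made for illustration; Ask did not present this code.

```python
import itertools


class ReplicatedRouter:
    """Send writes to the master and spread reads across replicas."""

    def __init__(self, master, replicas):
        self.master = master
        # Round-robin over the replicas; fall back to the master if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def server_for(self, sql):
        """Pick a server name for a SQL statement based on its first keyword."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        return self.master


# Hypothetical server names.
router = ReplicatedRouter("db-master", ["db-read-1", "db-read-2"])
write_target = router.server_for("INSERT INTO posts (title) VALUES ('hi')")
read_targets = [router.server_for("SELECT * FROM posts") for _ in range(3)]
```

One caveat with any setup like this: replicas can lag slightly behind the master, so a read that must see a write made a moment ago may need to go to the master as well.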
If your site still needs more write capacity than any one server can provide, you should consider partitioning your data. Identify independent chunks of data and store these chunks on separate servers. Make sure to select data chunks so that you don't need to do database joins between chunks. If you can do this with your data, you can have multiple write database servers and even more read-only replicated servers. You see the pattern here -- each of Ask's pieces of advice aims to spread the work onto more machines, rather than forcing vertical scaling.
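One simple way to partition is to pick the server from a hash of the key that owns the data, so that everything belonging to one user lands on the same machine and never needs a cross-server join. This is only a sketch under that assumption; the shard names and the choice of user id as the partition key are hypothetical.

```python
import hashlib


def shard_for(user_id, shards):
    """Map a user id to one shard, the same one every time."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    # Use the first four bytes of the hash as an integer, then wrap
    # it around the number of shards.
    index = int.from_bytes(digest[:4], "big") % len(shards)
    return shards[index]


# Hypothetical shard names.
SHARDS = ["users-db-0", "users-db-1", "users-db-2"]
home = shard_for(1234, SHARDS)
```

The trade-off with plain modulo hashing is that changing the number of shards moves most keys to a different server, so it works best when the shard count is planned in advance.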
Next, Ask focused on storing your users' sessions (e.g. user log in information, shopping carts, etc.). The best method is not to store the entire user profile or a whole shopping cart in a session, but to only store its associated id -- store the actual data in the database where it can be shared with other machines. Ask suggests that the golden session balance is to store important information in the database and not-so-important data in the session.
A good way to manage sessions is to keep the data in the session light and to use a cookie (or a few) to store it. Never keep state on the web server -- keep everything stateless. This allows any web server to handle a request for a user, rather than only the specific server that happens to be holding that user's session.
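If the session lives in a cookie, the server has to be able to tell whether the user has tampered with it. A common way to do that is to sign the cookie with a secret shared by all the web servers. The sketch below is my own illustration of that idea, not something from Ask's slides; the function names, the secret, and the cookie layout are all assumptions.

```python
import base64
import hashlib
import hmac
import json

# Assumption: every web server is configured with the same secret.
SECRET = b"change-me"


def encode_cookie(data, secret=SECRET):
    """Serialize a small dict and append an HMAC signature."""
    raw = json.dumps(data, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw)
    sig = hmac.new(secret, payload, hashlib.sha256).hexdigest().encode("ascii")
    return (payload + b"." + sig).decode("ascii")


def decode_cookie(cookie, secret=SECRET):
    """Return the dict if the signature checks out, otherwise None."""
    try:
        payload, sig = cookie.encode("ascii").rsplit(b".", 1)
    except (ValueError, UnicodeEncodeError):
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest().encode("ascii")
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


# Store only ids in the cookie; the cart contents stay in the database.
cookie = encode_cookie({"user_id": 42, "cart_id": 7})
session = decode_cookie(cookie)
```

Any server that knows the secret can verify the cookie without looking anything up, which is what keeps the web tier stateless. Note that signing is not encryption: the user can still read what is in the cookie, so it should hold ids rather than anything sensitive.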
Ask's last major point was using light processes for light tasks. You don't want a heavy Apache process that contains a scripting interpreter (e.g. Perl, Python, etc.) serving out a 49-byte single-pixel GIF file. The best thing to do is to split your web servers into lightweight front end processes that serve out static files and images, and a back end server that does the heavy lifting of accessing the database. This is one change that can be accomplished easily with Apache 2.0 (or mod_proxy) as your front end and Apache 1.x (or 2.0) as your back end.
Between Theo's tutorial and Ask's sessions, I feel that my scalability curiosity has been satisfied -- well satisfied. Time to go and pay attention to some of the other cool presentations. And with that, I'm late for the next cool session...
Do you think these tips are useful?
Categories: Web
Read more entries by Robert Kaye.
