Web Scale Software Challenges for Lay Folk

An Example

This post grew out of a chat with Jae Sinnett, a great jazz drummer, composer, band leader, and music educator here in Tidewater Virginia. Jae likes to write essays about jazz music and the joys and trials of being a working jazz musician. He publishes these on Facebook and he writes well and at length. But Jae’s essays often come out as a single block of text with the paragraph breaks missing.

Thinking Jae had not discovered the secret sauce for getting Facebook to create a paragraph break, I commented on a recent essay to describe the shift-return technique. It turned out that Jae knew this technique but that it worked or failed at random. What could be going on?

Who would think the users would …

Facebook came into being at Harvard to help students get to know each other and to coordinate campus activities. Users made short posts and then met at a local watering hole or activity for longer interaction.

Since then, Facebook has opened its membership so that people can share content with scattered friends and family and form global communities of interest like the several Greyhound groups. So there’s been a bit of mission creep. And some members, like WHRV radio presenter and jazz musician Jae Sinnett, write significant essays of several paragraphs. Is writing paragraphs intended to be a feature of the Facebook post editor? Who knows?

Move fast and break stuff

That’s the Facebook way. Get to market quickly and fix it in the mix. Mark Zuckerberg gives his engineers permission to try things and to make mistakes. He wants them to innovate, and it is OK for the occasional goof to enter production.

But Facebook Works at “Internet Scale”

I’m webmaster for two small non-profits whose sites run on shared hosting services. One host, one database instance, and one set of site modules and media files. I don’t have a lot to keep straight. And I have enough trouble with shared hosting services whose provisioning levels and service standards did not grow to support WordPress and Drupal. But at Internet scale things are different.

Internet scale applications like Facebook, Twitter, Google Search, and YouTube have massive infrastructure to handle their request rates. Assuming a three-layer architecture with a UI service, application services, and database services, multiple instances of each are required to respond to the posts and queries of millions of concurrent users.

Design Issues

Interface design matters. The protocols between services must be well defined and accurately implemented to allow the services to interact correctly while the internal implementation is being refined.

Services must have well-understood semantics. Those semantics must be correctly implemented, and the results must be returned using the agreed-upon interface protocol. When the protocol changes over time, the client must indicate the version it needs and the servers responding must use that version. The request distribution mechanism must route a client request to a compatible instance of the service. The next section explains why.
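For readers who like to see an idea in code, here is a minimal Python sketch of version-aware routing. Everything in it is hypothetical (the `Instance` class, the `route` function, and the service names are made up for illustration). It is not how Facebook actually routes requests, only a picture of the rule described above: send the request to a copy of the service that speaks the protocol version the client asked for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One running copy of a service (hypothetical model)."""
    name: str
    supported_versions: frozenset  # protocol versions this copy can speak


def route(requested_version, instances):
    """Return the first instance that speaks the requested protocol version."""
    for instance in instances:
        if requested_version in instance.supported_versions:
            return instance
    raise LookupError(f"no instance supports protocol version {requested_version}")


# Hypothetical pool in the middle of an upgrade: old and new copies coexist.
pool = [
    Instance("post-service-a", frozenset({1})),
    Instance("post-service-b", frozenset({1, 2})),
    Instance("post-service-c", frozenset({2})),
]

chosen = route(2, pool)
print(chosen.name)  # post-service-b
```

A client that needs version 2 skips the first copy, which only speaks version 1, and lands on the first copy that understands it.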

Version Mix and Match

Change happens, so when an application changes, Facebook or Google may have to replace something like 100,000 copies while continuing to provide service. This happens as a “rolling upgrade,” so there is always a mix of generations of each client and each service running in an Internet scale distributed application. Any client must be able to work with any of several versions of a service and to retry with a different instance if an incompatible instance responds to a request. In an Internet scale cluster, there are no neat one-to-one relationships between clients and servers.
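The retry idea can be sketched in a few lines of Python. This is a toy model under loud assumptions: the fleet is just a list of version numbers, an instance is assumed to understand only its own version, and the names `call_instance` and `call_with_retry` are invented for this example.

```python
import random


class IncompatibleVersion(Exception):
    """Raised when an instance cannot speak the client's protocol version."""


def call_instance(instance_version, client_version):
    # Assumption for this sketch: an instance only understands its own version.
    if instance_version != client_version:
        raise IncompatibleVersion(instance_version)
    return f"handled by v{instance_version}"


def call_with_retry(fleet, client_version, rng):
    """Try instances in random order until a compatible one answers."""
    candidates = list(fleet)
    rng.shuffle(candidates)
    for attempt, instance_version in enumerate(candidates, start=1):
        try:
            return call_instance(instance_version, client_version), attempt
        except IncompatibleVersion:
            continue  # an incompatible copy answered; try a different one
    raise RuntimeError("no compatible instance found")


# Part-way through a rolling upgrade: 6 copies on version 1, 4 on version 2.
fleet = [1] * 6 + [2] * 4
result, attempts = call_with_retry(fleet, client_version=2, rng=random.Random())
print(result, "after", attempts, "attempt(s)")
```

The number of attempts varies from run to run, which is the point: the client cannot know in advance which generation of the service will answer.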

Testing Becomes an M times N Problem

Each client has to be tested with each service that it uses to confirm that both sides implement the protocol correctly and that the semantics of the request are as expected. The testing problem becomes more complex because each client must also be tested with each version of the service expected to be active in the rolling upgrade environment. With M client versions and N service versions, that is M times N combinations.
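To see how quickly the combinations pile up, here is a small Python sketch that lists every pairing. The version names are made up for illustration.

```python
from itertools import product

# Hypothetical versions that are live at the same time during an upgrade.
client_versions = ["web-1", "web-2", "mobile-1"]   # M = 3
service_versions = ["post-1", "post-2"]            # N = 2

# Every client version must be tested against every service version.
test_matrix = list(product(client_versions, service_versions))

for client, service in test_matrix:
    print(f"test {client} against {service}")

print(len(test_matrix), "combinations")  # 6 combinations
```

Three client versions and two service versions already need six test runs. Add one more version on either side and the count jumps again.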

Deployment Issues

Verified software must be distributed to each host that is to run it. This requires accurate configuration management recipes, and any failed update must be retried after its cause has been corrected.
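A rough Python sketch of that retry loop follows. The host names, the `install` stand-in, and the "disk full" failure are all invented for the example. A real configuration management tool does far more, but the shape is similar: try every host, remember the failures, fix the cause, and retry only the hosts that failed.

```python
def deploy(hosts, install):
    """Try every host once; return the hosts where the install failed."""
    failed = []
    for host in hosts:
        try:
            install(host)
        except OSError as error:
            print(f"{host}: failed ({error})")
            failed.append(host)
    return failed


# Hypothetical stand-in for a real installer: host "web-03" has a full disk.
broken = {"web-03"}


def install(host):
    if host in broken:
        raise OSError("disk full")


hosts = ["web-01", "web-02", "web-03"]
pending = deploy(hosts, install)    # first pass: web-03 fails
broken.clear()                      # an operator corrects the cause
pending = deploy(pending, install)  # second pass retries only the failures
print("still failing:", pending)    # still failing: []
```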

Problems Present Randomly

Because everything is mixed and matched dynamically, visible problems can occur randomly. Each user request is processed independently, but a sequence of requests may interact through stored data in the content database. It is entirely possible that a different combination of program versions will respond to each of several post requests that originate from one user. This could explain why paragraphs are preserved in some of Jae’s posts but not in others. All it takes is for two different versions of a program, one that preserves paragraphs and another that does not, to respond to the posts.
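Here is a toy Python simulation of that explanation. It is pure speculation about the cause, not Facebook’s code: `save_post_v1` and `save_post_v2` are hypothetical versions of a post-saving program, and each post is handed to one of them at random.

```python
import random


def save_post_v1(text):
    # Hypothetical older version: collapses all whitespace, losing paragraph breaks.
    return " ".join(text.split())


def save_post_v2(text):
    # Hypothetical newer version: stores the text exactly as typed.
    return text


HANDLERS = [save_post_v1, save_post_v2]


def submit(text, rng):
    """Hand the post to whichever version happens to answer."""
    handler = rng.choice(HANDLERS)
    return handler.__name__, handler(text)


essay = "First paragraph.\n\nSecond paragraph."
rng = random.Random()
for _ in range(5):
    name, stored = submit(essay, rng)
    kept = "\n\n" in stored
    print(name, "paragraphs kept" if kept else "paragraphs lost")
```

Run it a few times and the same essay sometimes keeps its paragraphs and sometimes loses them, with nothing changing on the writer’s side.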