Toward a vision of Sustainable Server Programming

For a number of years I’ve been thinking that I should write down some of the ideas I’ve had and lessons I’ve learned over way too many years banging on a keyboard. In my head this has always been centered around the vague label “sustainable server development”. Let me try to peel away some layers of that, and make the label a little less vague.

We will begin with “server”. A substantial amount of what I’ve written over the past decade or so I would handwave-label as “server side”. But what do I mean by that? For me a “server” is a program intended to be mainly headless, running unattended for considerable periods of time, probably on remote hardware, and providing a well defined service. By preference, for me, a server runs under some form of Unix. Really the landscape has collapsed to three platforms now: some form of Unix, Windows, and the widely variable set of options for software embedded in specialist hardware (although these days that, too, is very frequently a Unix variant). I’ve done a little against Windows, and always found the experience frustratingly complex and ultimately unrewarding. Unix was built from the ground up to provide many of the facilities needed for “server side” coding (I’m thinking particularly of a robust security model, abstracted hardware and networking facilities, and sophisticated multi-processing facilities), and it gives the coder access to multiple layers of the stack between her code and the hardware in ways that Windows makes difficult. Bringing that back to a statement: for me a “server” is a headless, service-oriented piece of code running under Unix, required to be robust, performant and reliable.
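To make that concrete, here’s a minimal sketch of the shape I mean. Everything in it is invented for illustration – the port, the trivial echo “service”, the shutdown policy – and Go is simply a convenient vehicle. The point is the silhouette: headless, long-running, one well defined job, and a clean exit when the operating system asks for one.

```go
// A sketch of a "server" in the sense above: headless, long-running,
// providing one well defined service (echoing a line of text back
// over TCP), and exiting cleanly on SIGTERM. All details illustrative.
package main

import (
    "bufio"
    "context"
    "log"
    "net"
    "os/signal"
    "syscall"
)

func main() {
    // Run unattended until the operating system asks us to stop.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    ln, err := net.Listen("tcp", ":7777") // port chosen arbitrarily for the sketch
    if err != nil {
        log.Fatal(err)
    }
    go func() {
        <-ctx.Done()
        ln.Close() // unblocks Accept so the loop below can exit cleanly
    }()

    log.Println("listening on :7777")
    for {
        conn, err := ln.Accept()
        if err != nil {
            log.Println("shutting down:", err)
            return
        }
        go func(c net.Conn) {
            defer c.Close()
            line, err := bufio.NewReader(c).ReadString('\n')
            if err != nil {
                return
            }
            c.Write([]byte(line)) // the one well defined service: echo
        }(conn)
    }
}
```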

So. “sustainable”. Like any coin, this has two sides (I know, technically all coins have at least three faces): “sustainable server” and “sustainable development”. I believe the two are really linked, and hope over a series of articles to illustrate this. When I talk about a “sustainable server”, I mean something that has been built to minimise hassle and surprise for administrators and for code maintainers. When I talk about “sustainable development” I mean approaches that make building and maintaining robust, reliable and performant code a pleasant and simple 9-to-5 job, rather than a heroic nightmare of late nights, pizza and caffeine.

I am not a fan of heroic coding. There is ample clinical evidence that a tired and stressed coder is a bad coder: some studies suggest that a few days’ disturbed sleep has an effect on cognition comparable to being seriously inebriated. We have a culture that is proving very hard to break, where a mad hack over a sleepless week resulting in partially completed, undocumented, unmaintainable code is an effort to be applauded (and repeated) rather than treated as an unwelcome and undesired exception. While the coder is subject to a variety of lunacies from project managers and product owners, we are our own worst enemies if we keep committing to unhealthy and irrational death marches. A calm and rational approach to developing server side services should make heroics unnecessary: most of the problems to be solved in this space have been solved before, and we’ve got a lot of historical precedent to call on. Most of the time, none of this has to be considered hard or complex, so please just go take a cold shower and a walk around the block, and calm down.

Let me point to an example outside the coding world. Watch a carpenter, or a blacksmith, at work. There’s no sense of rush or panic or urgency. The craftsman knows how long each part of the process takes, has learned from the past, and is happy to re-use established patterns. She gives herself time to deal with the hard parts of the problem by knocking away the simple parts efficiently. And most relevantly: if a project manager rushes in and says “this needs to be done in half the time”, the response is not “oh, in that case we’d better order pizzas because it will be a long night.”

The key elements of what I would classify as ‘good’ server software are as follows:

1) Clarity of purpose. The software does one thing, provides one well defined service;
2) Predictability. The software behaves in a well defined and documented fashion;
3) Robustness. The server should be resilient and gracefully adapt to environmental changes, cope with malformed or malicious requests, and not collapse under extreme load;
4) Manageability. Administrators should be able to monitor and configure the service with ease, and should be able to programmatically manage the service state (a minimal sketch of this follows the list);
5) Performance. The service should respond to requests within 50–100ms, or better, under all conditions. In particular, performance should not degrade unpredictably under load.
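As a gesture toward point 4 (and point 2 alongside it), here is a minimal sketch of the kind of management surface I mean – the endpoint names and status fields are my own invention, not any standard – where the service reports its own health and state in a form that both an administrator and a monitoring script can consume.

```go
// A sketch of the "manageability" point: the service exposes its own
// state over a simple, documented HTTP surface. Endpoint names and
// fields are illustrative only.
package main

import (
    "encoding/json"
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

var started = time.Now()
var requests atomic.Int64

func main() {
    // The one well defined service this program provides.
    http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
        requests.Add(1)
        w.Write([]byte("done\n"))
    })

    // Liveness: cheap, unambiguous, scriptable.
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("ok\n"))
    })

    // State: enough for an administrator to see what is going on.
    http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
        json.NewEncoder(w).Encode(map[string]any{
            "uptime_seconds": int(time.Since(started).Seconds()),
            "requests":       requests.Load(),
        })
    })

    log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The cheapness of this is rather the point: a liveness check and a status dump cost a few dozen lines, and they pay for themselves the first time someone has to answer “is it up, and what is it doing?” at 3am.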

In my experience a lot of coders – and managers of coders – have the idea that setting these goals as a minimum base requirement is unrealistic and expensive. Twaddle and nonsense. Let me point to two exemplars, both available as FOSS and both initially built out by small teams: Varnish and the core Apache server. Neither is part of the base operating system; both are services run on a server. In both cases, all the goals above are amply met. And in both cases, there is a development and maintenance infrastructure around the code which is palpably sustainable and effective.

Varnish is a particularly fine example. There were no surprises or complexities in installing it and running it. It worked as expected ‘out of the box’ without intensive configuration. It’s very easy to monitor and manage. It does one thing, extremely well, and does it in the fashion described, documented and expected. And most importantly it just runs and runs and runs without intervention or alarm.

Let’s make all server software that good, and knock off work at 5pm, enjoy our weekends, take up hobbies and stop these panicked headlong rushes into the night. Our partners, families, waistlines and hearts will thank us for it.

2 thoughts on “Toward a vision of Sustainable Server Programming”

  1. Great post. I like the cut of your jib, young man.

    A couple of quick comments from an enterprise infrastructure guy:

    – I’d not specify a time frame for service performance; it’s dependent on the business requirements. Acceptable performance may be less than 10ms or as long as a second.

    – How the service persists data is important as well. Is each transaction serialized to disk immediately or dumped periodically from memory? This makes a big difference to the design of the underlying infrastructure supporting the service.

    – Is resilience and availability built into the service or provided by the infrastructure? Applications understand their own data and state better than infrastructure can.

    That is all. Carry on.

    1. Thanks Chris, all very relevant points, most of which I will be addressing in later pieces as I delve deeper. In the meantime, some quick off-the-cuff responses:

      “- I’d not specify a time frame for service performance; ”

      Absolutely. My hand-wave is largely informed by the systems I’ve worked on, where the range of acceptable response times is roughly what I quoted. A rough rule of thumb for the worst acceptable response time, though, is around 200ms. Most services these days are tied in one way or another to a user front end brokered through a web browser (yes, wild-assed hand waving there), and anything more than about 200ms of response time there starts being perceptible to the human. Going the other way, the cost of getting down from 50ms to 10ms or lower is often very hard to justify, except in edge cases like real-time processing systems. My experience suggests that getting below the magic 50ms means you start having to engage with infrastructure and deployment issues far more than coding issues – and in that realm there’s often not much more the coder can do without acts of heroism (and luck).
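      To put that budgeting argument in code (the 200ms figure is the same hand-wave as above, and Go is just the vehicle): the one cheap thing the coder can always do is make the budget explicit, so a slow path produces a well defined timeout rather than an unbounded wait.

      ```go
      // Sketch: make the response-time budget explicit. If the work (or a
      // dependency it calls) blows the 200ms budget, the caller gets a well
      // defined timeout error instead of an unbounded wait.
      package main

      import (
          "context"
          "errors"
          "fmt"
          "time"
      )

      // doWork stands in for real request handling; it must respect the
      // deadline carried by ctx.
      func doWork(ctx context.Context) error {
          select {
          case <-time.After(50 * time.Millisecond): // simulated work
              return nil
          case <-ctx.Done():
              return ctx.Err()
          }
      }

      func handle() error {
          // The hand-waved 200ms ceiling from the discussion above.
          ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
          defer cancel()
          return doWork(ctx)
      }

      func main() {
          if err := handle(); errors.Is(err, context.DeadlineExceeded) {
              fmt.Println("budget blown:", err)
          } else if err != nil {
              fmt.Println("failed:", err)
          } else {
              fmt.Println("within budget")
          }
      }
      ```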

      “- How the service persists data is important as well.”

      If it persists data at all 🙂

      This has been one of my key points of ranting for years. I’ve sat in meetings where coders and designers have agonized over increasing the response time of code by 10 or 15ms, while ignoring the fact that the total end-to-end transaction cost is overwhelmingly dominated by the persistence layer. Again, my rough rule of thumb is to anticipate a minimum of 300–500ms for a trip through the persistence layer.
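      The cheapest way to end that kind of meeting is to measure. A minimal sketch, with both latencies invented purely to mirror the numbers above – in real code, compute() and persist() would be your handler and your database or disk write:

      ```go
      // Sketch: time the persistence call separately from the compute, so the
      // end-to-end budget discussion is grounded in numbers rather than taste.
      // The Sleep calls stand in for real work and a real persistence layer.
      package main

      import (
          "fmt"
          "time"
      )

      func compute() { time.Sleep(15 * time.Millisecond) }  // the part people agonize over
      func persist() { time.Sleep(400 * time.Millisecond) } // the part that actually dominates

      func main() {
          t0 := time.Now()
          compute()
          t1 := time.Now()
          persist()
          t2 := time.Now()

          fmt.Printf("compute: %v, persist: %v, total: %v\n",
              t1.Sub(t0), t2.Sub(t1), t2.Sub(t0))
      }
      ```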

      “- Is resilience and availability built into the service or provided by the infrastructure? ”

      Always an interesting question. My off-the-cuff thought is that this is a good place to invoke Separation of Concerns. Let’s take a meaningful example: the host operating system providing a writeable file system. A bunch of things can go wrong: the file system filling, becoming unwriteable, reverting to a previous state, and so forth. Is the application responsible for knowing about all these cases? No, the application only needs to be able to deal with “I asked for something to be written, and it didn’t get written, what do I do now?”. So the overall solution requires two services to be robust and resilient: the operating system’s filesystem facilities need to degrade gracefully when faced with hardware failures, and the application has to degrade gracefully when the services it depends on degrade. The server coder should only be concerned with one of those things – what the server does when the write fails, not the reasons for the failure.
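      A sketch of that division of labour, with the retry-then-degrade policy invented purely for illustration: the application never asks why the write failed, it only decides what to do about the fact that it did.

      ```go
      // Sketch of the separation of concerns above: the application does not
      // diagnose the filesystem, it only answers "I asked for something to be
      // written and it didn't get written, what do I do now?". Here the answer
      // is: retry briefly, then degrade to an in-memory backlog and raise an
      // alarm. The policy is illustrative, not prescriptive.
      package main

      import (
          "log"
          "os"
          "time"
      )

      var backlog [][]byte // degraded mode: hold data until the store recovers

      func store(path string, data []byte) {
          for attempt := 1; attempt <= 3; attempt++ {
              err := os.WriteFile(path, data, 0o644)
              if err == nil {
                  return
              }
              // Disk full, read-only, gone entirely: we don't care which.
              // Only the fact of the failure matters to this layer.
              log.Printf("write failed (attempt %d): %v", attempt, err)
              time.Sleep(100 * time.Millisecond)
          }
          backlog = append(backlog, data)
          log.Printf("degraded: queued %d bytes in memory", len(data))
      }

      func main() {
          store("/tmp/example.dat", []byte("transaction record\n"))
      }
      ```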
