
Chaos monkeys, the rise of the developer and the end of the DBA: Field notes from CassandraSF

By Andy Ormsby

19 Jul 2011
Category: Business Insights

I’ve had a great week at the Cassandra Summit in San Francisco. Here are some notes on how the Cassandra community is changing and, more generally, what this means for corporate IT. 

First, the Cassandra community is growing fast. There were over 450 attendees at the conference, compared with just 150 a year ago. Attendees ranged from well-known tech start-ups, just getting started with managing large datasets, through to large blue-chip organisations.

Some of the most popular talks were pretty technical. But for me, the most interesting was a presentation by Adrian Cockcroft from Netflix that highlighted how the dynamics of enterprise IT are changing. Adrian spoke about Netflix’s continuing migration from Oracle in the datacenter to Cassandra in the cloud (his slides are well worth a look). He made two key points:

  • Moving from Oracle means a change from a highly centralised scale-up view of data management to one in which control of the data used by an application rests with the application, rather than with DBAs.
  • Moving from the datacenter into the cloud means that Netflix has essentially no IT Ops people.  Adrian described this as "NoOps" - an interesting take on the current DevOps debate. 

Since Netflix has its infrastructure on the Amazon cloud, it is effectively outsourcing Ops to Amazon. But that doesn't mean taking a traditional IT operations structure and handing it "as is" to an outsourcing provider. It means working within the limits of what Amazon provides: a highly automated set of capabilities across an elastic and highly regular infrastructure.

Designing large-scale applications means designing for failure. In the cloud, this means avoiding single points of failure and distributing activity across multiple nodes in multiple data centres. Designing for failure means that if bits (even large bits) of the infrastructure fail, the applications continue running.

One of the ways in which Netflix tests whether they have got this right is the excellently named "Chaos Monkey".  This is a script that brings down production servers at random during the day.  Servers are going to fail - it's a fact of life.  But they don't fail frequently enough to make testing resilience of the applications easy. Chaos Monkey increases the failure rate during the part of the day when developers are around - just in case things go wrong. The result is greater confidence in the end service.
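To make the idea concrete, here is a minimal sketch of a Chaos Monkey-style script in Python. It is not Netflix's code: the working-hours window, the function names and the `terminate` callback are all assumptions made for illustration, and a real version would call the cloud provider's API where the callback sits.

```python
import random
from datetime import datetime, time

# Hypothetical working-hours window; the talk only said "during the day".
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def in_working_hours(now: datetime) -> bool:
    """True on weekdays between WORK_START and WORK_END."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def pick_victim(instances, now, rng=random):
    """Return one randomly chosen instance, or None outside working hours."""
    if not instances or not in_working_hours(now):
        return None
    return rng.choice(list(instances))


def unleash(instances, terminate, now=None, rng=random):
    """Terminate one random instance via the supplied (hypothetical) callback."""
    victim = pick_victim(instances, now or datetime.now(), rng)
    if victim is not None:
        # In real use this would be a call to the cloud provider's API.
        terminate(victim)
    return victim
```

Calling `unleash(fleet, terminate)` on a schedule would take out one server at a time, and only when developers are likely to be at their desks to see what breaks.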

Cassandra fits this model well: a distributed database that can scale across multiple data centres, has no single point of failure, and in which all nodes are peers. If a node goes away, not much happens. In fact, with some careful design, even if a whole data centre goes away, the application can continue running.
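A toy model can show why losing a data centre need not stop the application. The sketch below is not Cassandra's API or its actual replication strategy (real Cassandra partitions data and uses configurable replication factors and consistency levels); `Node` and `ToyCluster` are hypothetical names, and every write is simply copied to every live peer.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    datacentre: str
    up: bool = True
    data: dict = field(default_factory=dict)


class ToyCluster:
    """Peer nodes with no master; writes go to every live node (a simplification)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def write(self, key, value):
        """Store the value on all live nodes and return how many accepted it."""
        live = [n for n in self.nodes if n.up]
        for node in live:
            node.data[key] = value
        return len(live)

    def read(self, key):
        """Answer from any live peer that holds the key."""
        for node in self.nodes:
            if node.up and key in node.data:
                return node.data[key]
        raise LookupError(f"no live replica holds {key!r}")

    def fail_datacentre(self, datacentre):
        """Mark every node in the given data centre as down."""
        for node in self.nodes:
            if node.datacentre == datacentre:
                node.up = False
```

With two nodes in each of two data centres, a value written while everything is up can still be read after `fail_datacentre` takes one of them out; only when every replica is down does the read fail.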

So what does this mean for the IT organisation? In the case of Netflix, it loses its DBAs and IT Ops people but keeps its developers as it grows in the cloud.

A recent Gartner report suggests that as many as 1 in 5 companies will have no IT infrastructure of their own by 2012. Despite this, I don't think the role of the developer is going to be threatened any time soon. Indeed, the scope of what developers are responsible for may only grow. Time will tell whether this prediction comes true and we see more examples of Oracle DBAs in white coats being replaced by Chaos Monkeys.

 

 

 
