One of the things I want to spend some time thinking about in this new blog is how we might build massive-scale (as opposed to merely large-scale) distributed systems and what we can learn about building these applications from other areas of knowledge such as biological, organic and social systems.
Last year Werner Vogels gave a talk (related pdf) about how you might build VERY large systems (million+ nodes) and why these systems just will not scale with our current deterministic way of building systems.
Systems of this size are highly fluid in nature. Individual nodes within the system will come and go almost constantly as hardware, software or user failure/action/error causes localised problems. Even with highly reliable hardware and a mean time between failures measured in hundreds of thousands (or even millions) of hours, the sheer number of components in a system with millions of nodes means that you should expect failures every hour. Add in coding errors, support screw-ups and end user errors and it very quickly ends up looking like a digital massacre. Somehow the application that is sitting on top of this quantum flux of failure must be able to deal with all the chaos and provide a stable and coherent user experience.
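To put rough numbers on the failure rate, here is a minimal sketch of the arithmetic. It assumes nodes fail independently at a constant rate of 1/MTBF, which is a simplification (real failures are often correlated), and the function names and fleet sizes are purely illustrative.

```python
import math


def expected_failures_per_hour(nodes: int, mtbf_hours: float) -> float:
    """Average failures per hour across the fleet.

    Assumes each node fails independently at a constant rate of 1 / MTBF.
    """
    return nodes / mtbf_hours


def prob_at_least_one_failure(nodes: int, mtbf_hours: float,
                              window_hours: float = 1.0) -> float:
    """Chance of seeing one or more failures in the given window.

    Uses the Poisson approximation: P(no failures) = exp(-rate * window).
    """
    rate = expected_failures_per_hour(nodes, mtbf_hours)
    return 1.0 - math.exp(-rate * window_hours)


if __name__ == "__main__":
    # Illustrative values only: one million nodes at two MTBF figures.
    for mtbf in (500_000, 1_000_000):
        rate = expected_failures_per_hour(1_000_000, mtbf)
        p = prob_at_least_one_failure(1_000_000, mtbf)
        print(f"MTBF {mtbf:>9,} h: {rate:.1f} failures/hour, "
              f"P(at least one in an hour) = {p:.1%}")
```

Under these assumptions, a million nodes with a 500,000 hour MTBF average two failures an hour, and the chance of getting through any given hour with no failure at all is only about 14%.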
With all this failure, it raises the question: “Is it even possible to build systems of this size and complexity?”
Trying to deal with this flux in a deterministic, synchronous, Turing-style manner is clearly a non-starter. The management overhead would be horrendous, and the overall system would be highly brittle and subject to the most extreme strain.
If we are unable to muscle the system into the desired shape, we need to think about different approaches. The application needs to accept the flux as a fact of life and embrace it. As Bloglines found out this afternoon, even systems of the scale most of us build could take some of this thinking on board.
As Werner points out in his talk, there are many highly complex biological and organic systems that are capable of scaling to massive degrees with virtually no centralised control mechanism in place. These systems are probabilistic and self-organising in nature.
The classic example is the ant colony or bee hive, where each individual ant or bee goes about its own business and yet adds to the collective good. Cells within the human body are capable of acting in a highly consistent and coherent manner, displaying highly complex behaviour (you try programming a system to fight viruses or even breathe!) despite minimal directed management. Even humans are capable of forming complex, self-organising systems with minimal direct interaction: think of markets and other forms of large-scale crowd behaviour.
How do these systems come about? How do they manage to create such stable environments? How do they fail and what are their weak points?