Sunday, November 20, 2011

Errors and Exceptions at OSDC

I spoke at the Open Source Developers Conference (Canberra, Australia) and gave a talk on what turned out to be the Erlang way of building apps:

Sunday, October 30, 2011

Google+ a place for interesting discussions

Much as I don't need *another* social network, Google Plus has, well, some pluses ! It seems a good place for in-depth discussion - I expect that is largely due to the early adopter crowd, but we will see.

Posted via email from Michael's posterous

Sunday, October 16, 2011

Resilient processes (and erlang)

Of late, Erlang has been a great tool that I have been using as part of the day job.

Its headline feature seems to be concurrency - and whilst that is part of it - I think the real secret sauce is reliability. We have concurrent software, we have fast software, but we don't really have a lot of reliable software.

At its core it is quite simple: take the semantics of operating system processes, and their isolation from each other, and scale that down so you can have thousands of them. A typical Erlang application is made up of many Erlang processes (not to be confused with OS processes), which are light and disposable, and each has its own memory heap, garbage collector and so on.
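
As a rough analogy only (this is Python, not Erlang, and not taken from the talk), the "disposable worker" idea can be sketched as a supervisor that simply starts a fresh worker whenever one crashes. The names `supervise`, `make_flaky_worker` and `max_restarts` are made up for illustration, and plain Python functions share memory, so this does not give the isolation that real Erlang processes have:

```python
from typing import Callable, List


def supervise(worker: Callable[[], str], max_restarts: int = 3) -> str:
    """Run `worker`, starting it again from scratch each time it crashes.

    The worker does no defensive error handling of its own; the
    supervisor is the only place that decides what a crash means.
    """
    crashes = 0
    while True:
        try:
            return worker()
        except Exception as exc:
            crashes += 1
            if crashes > max_restarts:
                # Too many crashes: give up and let the failure propagate.
                raise RuntimeError(f"worker failed {crashes} times") from exc


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Build a hypothetical worker that crashes a fixed number of times."""
    attempts: List[int] = []

    def worker() -> str:
        attempts.append(1)
        if len(attempts) <= failures_before_success:
            raise ValueError("simulated crash")
        return f"done after {len(attempts)} attempts"

    return worker


if __name__ == "__main__":
    print(supervise(make_flaky_worker(2)))
```

The point of the shape is that the error handling lives in one place (the supervisor), and the worker stays simple because it is cheap to throw away and restart.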

I did a talk on this recently (and will be refining this for an upcoming talk):

It is worth noting that Erlang draws inspiration from both functional programming and logic programming, both things near and dear to my heart, so it is no surprise I found it quite pleasant in almost every way.


Monday, May 30, 2011

Elasticness and Clouds

Amazon pretty much claimed the word "elastic" in computing when they delivered their Elastic Compute Cloud (EC2) years ago. One of the key features of this is, unsurprisingly, that it is elastic: you use an API (ideally) or a web interface, to provision resources on demand. 

This works nicely - except when it doesn't. After years of experience with this, I have noticed that (in general) at the first sign of trouble the API will start failing: requests get dropped or time out. (Sometimes that can even be an early warning sign of impending doom.)

It is worth noting that most clouds seem to have similar issues regarding APIs - they often don't have the same quality of service that your servers get. This wouldn't normally be a problem, but a common strategy with infrastructure clouds is to make use of this elasticness (duh !) day to day for your operations, as well as for recovery. Frustratingly, due to this behaviour you have to either accept this QoS limitation, or plan around it by provisioning extra resources ahead of time. The latter approach somewhat undoes the benefit of having a highly elastic API - but here we are anyway.
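
To make the "plan around it" option concrete, here is a minimal Python sketch, assuming a hypothetical `provision` callable that wraps whatever cloud API you use and a `warm_pool` of servers you started (and are paying for) ahead of time. None of these names come from a real SDK. It retries the API a few times with exponential backoff, and if the API is still failing it hands out a pre-provisioned spare instead:

```python
import time
from collections import deque
from typing import Callable, Deque


class ProvisioningError(Exception):
    """Raised by the (hypothetical) cloud API wrapper on failure or timeout."""


def acquire_server(
    provision: Callable[[], str],
    warm_pool: Deque[str],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the API with exponential backoff, then fall back to the warm pool."""
    for attempt in range(attempts):
        try:
            return provision()
        except ProvisioningError:
            # Back off between attempts, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    if warm_pool:
        # The API is unhealthy: use a spare that was provisioned earlier.
        return warm_pool.popleft()
    raise ProvisioningError("API unavailable and warm pool is empty")
```

The trade-off is exactly the one described above: the spares in `warm_pool` cost money while idle, which is the price of not depending on the API at the moment things go wrong.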

Somewhere, there is a balance, but at the moment, the big users of the public clouds are treating them increasingly as a less than elastic resource pool (look up Netflix and their use of Amazon for an example of this). I can't help but wonder if this means APIs will fall out of favour for highly elastic workloads, or if the QoS of these APIs will improve over time...


Friday, March 18, 2011

How to waste a friday

An unusually verbose writeup on network issues, and what they will do about them, at status.aws.amazon.com:

Repeated here for posterity (E-BS is a story for another day, but for now): 

From 7:28pm PDT to 9:56pm PDT, a networking issue affected connectivity to a significant number of instances in the US-EAST-1 region. Affected instances experienced degraded network connectivity to the Internet and to instances in other availability zones.

The root cause of last night's issue was when a core network routing device experienced a partial failure. While the router was causing packet loss, the failure was not detected by surrounding network devices and therefore they did not automatically fail traffic over to redundant network paths as intended.

Additionally, our network monitoring tools failed to help our network operators locate the specific source of the connectivity issues. Once our networking team determined the location of the impact, they were able to identify the failing router and manually failed traffic routes away from it. At this point, all affected instances regained full connectivity.

We will be completely replacing the failed network device and our team will work on the failed device to understand the source of the failure. More importantly, we will be working to understanding why our network monitoring did not allow our team to quickly isolate the problem and force the manual failover to redundant network routes. We rely on this monitoring to help us deal with partial failures which defeat the normal redundancy built into high availability network architectures. We understand the impact this event had on some of our users, this just took us too long to figure out, and will be intensely focused on improving our monitoring and addressing the root cause of this failure.



Monday, January 31, 2011

CloudBees runtime

So one of the things I have been working on a bit lately is now finally live: 

The short version: very much like GAE but without the Google. On top of this, it works with the so-called "dev@cloud" (which is Hudson - now renamed to Jenkins - still with me?) so you can have a code push -> test -> run cycle all in one place (if you like).

Zero proprietary APIs, and the data remains yours of course. There will always be a free runtime in the cloud; I think it is necessary to "keep the dream alive" so to speak. Signup is free of course (you may need to put in your phone number so it can check you are human).

It has been nice building something that I want to use. It is a nice place to be. 

Enjoy !
