
Reducing risk while deploying changes in a cloud service

Sometimes a 1-line change is all it takes ...

There is a continuous struggle in today's software development landscape. On one side are the developers, delivery managers, and business owners, who would like to make their work (product features) available to users quickly. On the other side are the QA and DevOps engineers, who would like to take some time to ensure that we avoid regressions and service outages at all costs while we deliver those new features.

Both sides have valid points. Delivering features quickly seems to be a key component of capturing market share. On the other hand, a service outage causes users real pain and translates into revenue loss and, in the worst case, loss of market share or user base.

Well, how about we hire an army of amazing engineers who write flawless code with 100% unit test coverage? Then we can just push that code to production quickly, right? Wrong. Testing (unit, manual, etc.) can only show the presence of bugs, not their absence. True story: during a service incident, several engineers on one of my previous teams had to spend days debugging an issue caused by a single instance of case-sensitive string comparison.
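To make the point concrete, here is a minimal, hypothetical sketch of how a case-sensitive comparison can slip past a passing unit test. It is not the code from the incident above; the function names, the `ALLOWED` set, and the region strings are all made up for illustration.

```python
def is_allowed_region(region: str, allowed: set[str]) -> bool:
    # Case-sensitive membership test: "WestUS" and "westus" are different strings.
    return region in allowed


def is_allowed_region_fixed(region: str, allowed: set[str]) -> bool:
    # Normalize case on both sides before comparing.
    normalized = {name.casefold() for name in allowed}
    return region.casefold() in normalized


# Hypothetical configuration; the names are illustrative only.
ALLOWED = {"westus", "eastus"}

# A unit test written with lowercase fixtures passes for both versions...
assert is_allowed_region("westus", ALLOWED)
assert is_allowed_region_fixed("westus", ALLOWED)

# ...but a caller that sends "WestUS" is rejected by the first version.
print(is_allowed_region("WestUS", ALLOWED))        # False
print(is_allowed_region_fixed("WestUS", ALLOWED))  # True
```

The test suite is green in both cases, so the difference only shows up once real callers send input the fixtures never covered.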

Ring-wise deployments

Over time, every software component accumulates interdependencies with other components in the architecture. When we deploy a sizeable change to a component, or a new component, the probability of introducing bugs increases. These could be caused by logical errors, faulty code, or integration issues. Given this high probability, there needs to be a way to mitigate any negative impact on the customer base while we deploy code to the production environment. Ring-wise continuous deployment is one such method: it mitigates the risk of customer impact while keeping the velocity of feature delivery high.

Divide all environments where the software component is available into several rings of availability, starting from the developer's machine (Ring 0) and ending at the final production environment (Ring 3). Continually deploy and test software features from one ring to the next until they are available to all users. The exact implementation of the rings depends on the type of software, the team, and the target user base. One such definition for a moderate-sized team could be:

Ring#	Description	Testing type
0	Developer's machine	Unit, functional, or manual tests
1	Staging environment	Functional, manual, or integration tests
2	Production preview	Integration, performance
3	Production	TIP (Testing in production)
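The ring-to-ring promotion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ring` class, the `promote` function, the `run_tests` hook, and the build names are hypothetical, and a real pipeline would call its own deployment and test tooling instead of the lambdas used here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Ring:
    number: int
    name: str
    # Hypothetical hook: returns True when this ring's tests pass for a build.
    run_tests: Callable[[str], bool]


def promote(build: str, rings: List[Ring]) -> Optional[int]:
    """Move `build` ring by ring, stopping at the first ring whose tests fail.

    Returns the number of the last ring that passed, or None if Ring 0 failed.
    """
    last_passed = None
    for ring in sorted(rings, key=lambda r: r.number):
        if not ring.run_tests(build):
            # Stop here so the build never reaches the outer rings.
            return last_passed
        last_passed = ring.number
    return last_passed


# Rings from the table above; the test hooks are stand-ins for real test suites.
rings = [
    Ring(0, "Developer's machine", lambda build: True),
    Ring(1, "Staging environment", lambda build: True),
    Ring(2, "Production preview", lambda build: build != "build-with-perf-regression"),
    Ring(3, "Production", lambda build: True),
]

print(promote("build-ok", rings))                    # 3
print(promote("build-with-perf-regression", rings))  # 1
```

The second build stops at Ring 1 because the production preview checks fail, so the regression never reaches Ring 3, where all users would see it.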

Epilogue

The underlying need to deploy software without causing disruptions in the production environment arises from customer-first (user-first) thinking. With that mindset, the tech, product, and operations teams all work towards achieving the optimal experience for their customers. The additional work for ring-wise deployment and testing is part of the cost of delivering that experience.