A mature DevOps organisation
At bol.com, we’ve officially been doing DevOps since 2015. Since then, we have developed an expert group of platform engineering teams. They build and run the infrastructure layers our 170+ engineering teams need to efficiently develop and run their software systems.
Therefore, when we started up a dedicated SRE team in 2020, we stayed away from infrastructure problems other SRE teams often focus on. The platform teams had this one covered.
We focussed on process instead. How can we make it as easy as possible for our teams to apply SRE and find the optimal balance between innovation and reliability?
In online retail the competition is fierce, and the marketplace is global. All our teams need to innovate to the best of their ability for us to stay ahead as a company.
Our SRE team’s stated mission is to enable products to balance reliability and innovation to maximize customer value through data-driven decisions.
We want to give every team the ability to innovate as fast as possible while safeguarding enough reliability to keep users delighted.
When will we be successful?
So what does life look like in a team that’s set up to reap all the benefits SRE promises?
Every team has three to five critical error budgets they’re always aware of. If a budget is threatened, the team limits risk. Until then, they innovate with confidence. All alerting is based on SLOs, and every alert received results in a change, whether that is in resiliency, alerting coverage or something else.
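To make the idea of a threatened error budget concrete, here is a minimal sketch in Python. It assumes an event-based SLO (good versus bad requests over a window). The names `ErrorBudget` and `is_threatened`, and the 25% threshold, are hypothetical choices for illustration and not a description of our actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Event-based error budget over one SLO window (illustrative only)."""

    slo_target: float   # e.g. 0.999 means 99.9% of events must be good
    total_events: int   # all events observed in the window
    bad_events: int     # events that violated the SLI

    @property
    def allowed_bad_events(self) -> float:
        # The budget is the share of events the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched, 0.0 means fully spent,
        # and a negative value means the SLO has been breached.
        allowed = self.allowed_bad_events
        if allowed == 0:
            return 0.0 if self.bad_events else 1.0
        return 1.0 - self.bad_events / allowed


def is_threatened(budget: ErrorBudget, threshold: float = 0.25) -> bool:
    # Assumption: a budget counts as "threatened" once less than
    # `threshold` of it remains. The right threshold is a team decision.
    return budget.remaining_fraction < threshold


healthy = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=400)
at_risk = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=900)

print(is_threatened(healthy))  # plenty of budget left: keep innovating
print(is_threatened(at_risk))  # most of the budget is spent: limit risk
```

In this sketch a team would check `is_threatened` for each of its critical budgets and slow down risky changes only for those that return true.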
Product management takes the lead in setting the SLO targets. They understand that higher reliability targets are an investment that comes with slower innovation. They use this knowledge to weigh reliability targets against innovation requirements.
When someone comes knocking on the team’s door about a service interruption, the conversation can be about improving the SLIs and SLOs instead of firefighting. This provides a positive feedback cycle that maintains the active balance between reliability and innovation.
All this enables engineers to make changes with confidence and invest in resiliency when necessary, and only when necessary.
The road ahead
That is where we’re headed, but we still have a long road ahead of us.
There are a few products and teams where we see SRE applied to such a level that the rewards are clear, but adoption has been slower than we had originally hoped.