Scaling DevSecOps and SRE

G. Scott Tomlin
May 18, 2023

Thoughts about quality, risk, and specialization in developing and operating software

Let me start with something controversial: DevOps and DevSecOps were never meant to be roles on a team, but rather philosophies and practices by which teams execute. If you want two great reads on DevOps, read Gene Kim’s “The Phoenix Project” and “The Unicorn Project.” DevOps was a natural, organic evolution of what software engineering was to become. It was meant to break the barriers between software development and software operations. No more throwing code over the wall and playing the telephone game back and forth over heavy process. We later added security accountability to the list, giving us DevSecOps. Many leadership teams misread this goal, which led to responses like “Great, we are now a DevSecOps team, so we can fire our Ops and Security Engineering teams!” As a result, feature velocity fell because developers were now spending their time on environment management, production support, and more, causing sprints to fail and timelines to be reset.

So, we invented a new role, one even more confused across the industry: Site Reliability Engineering (SRE). We hoped SRE would be our salvation, a role focused on all things production. We thought these engineers could be generalists and could handle everything from incident management and debugging to fixing all production issues.

This constant expansion and contraction, from wanting engineering generalists to wanting engineering specialists, is a natural progression in the world of software, unlike many other engineering disciplines. We do not hire general engineers to build cars. We hire specialists for each role because we need experts. You do not want your Internal Combustion Mechanical Engineer designing the electronic braking system. Where generalists start to manifest in the automotive industry is in maintenance and service roles. But even in these roles there is some level of specialization, or at least a notion of skill levels by experience, from apprentices to experts.

Generalization is Dead, Long Live Generalization

Does this mean there is no room for generalists? No, I do not think so. You are not going to hire an expert builder to build a wooden raised garden bed. You will do everything yourself, from obtaining materials, to fabrication, to actual assembly. The impact and risk of failure here are exceptionally low. Generalists have their place where you can take risks, or when you need broad oversight. Generalists make talented team managers, architects, and project leaders; they have a working knowledge of most of what is happening on the team and in the project but may lack the current skills to reach elevated levels of quality. This is very much the concept of a General Contractor (GC) that you may hire to manage a primary bed and bath remodel project. The GC could do much of the work in the remodel but is not certified to install electrical wiring or plumbing, and certainly does not have the skills to ensure that all engineering is done properly to support the loads that the house needs to carry. We need specialists when quality, safety, and security are critical to the output of the engineering effort.

Startups and small companies start in one of two modes. They either start with super specialized engineers or they start with a team of generalized engineers. Both approaches make sense in different scenarios. A startup that is bringing new technical capabilities to the market, like new optimizations or algorithms, would start with specialized engineers who have deep technical knowledge and experience around the company's focus. A company focused on bringing new services or workflows to the market may favor a more generalized or full stack engineering team. Again, there is nothing wrong with either approach; however, at some point each will be longing for something it does not have. There are points in every company’s path where it needs to either invest in these skill gaps or supplement with outside resources. If the skills are core to the business and add value to the bottom line, the company will be faced with the need to build them internally. For example, a company working in the public safety, medical, or government sectors would benefit from having in-house engineering skills focused on key standards for data and information access, safety, and security. However, a company that is distributing open-source software libraries may not find these as compelling to invest in, yet still has accountability to its customer base, so it may find outsourced solutions better suited to its needs. Both companies may benefit from a specialist for an abbreviated time or just to get things rolling. A company may want someone who can set up its continuous integration (CI) pipelines but will do the ongoing maintenance with internal resources.

Having a healthy understanding of the risk framework for your company will help guide these complex decisions. It will help the company decide when experts are required. The need may come from the scale of your team and/or product, the regulatory environment, the court of public opinion, a change in industry needs, business continuity planning, and many more factors.

What does this have to do with DevOps and SRE? Everything. DevOps and DevSecOps are the philosophies and practices under which an engineering team is responsible for its own implementation, deployment, quality, security, privacy, and operations, from design and development to production operations and monitoring. This is very much a generalized skill approach, especially when applied to single engineers or small teams. It can also give a misguided sense of comfort that all of your engineers are fungible and completely skilled in all phases of the Software Development Lifecycle (SDLC). That belief limits how the team make-up can change when there is a need to invest more in one part of the SDLC than another. However, at some point, production will become your highest priority, and this is how SREs evolved out of the nascent and misunderstood world of DevOps.

Time to Specialize

When production becomes the priority, you will start focusing on key metrics like uptime, time to issue detection, time to issue resolution, time to market for new features, and more! You will also be looking to increase your confidence in deploying updates and new features. These aspects of DevSecOps were likely compromised on at some point in the SDLC. I have made these tradeoffs many times over the course of my career. You will be looking for solutions that will not slow your innovation down or reduce your value to your customer. There is no silver bullet for these types of decisions, but there are a few key areas to focus on: telemetry and monitoring, incident response and management, and the SDLC tooling itself. Do not attempt to boil the ocean all at once; instead, focus on areas where the benefits will be most impactful. I have worked in more than one situation where we reached the point that we were afraid to deploy our own software. We had lost the domain knowledge and instinct that we had when the software was freshly written. Do not let this happen to you. Take action!

Telemetry

If I could recommend the first step for most engineering teams, it would be to focus on telemetry. Peter Drucker, the father of modern management, is often quoted as saying, "you can't manage what you can't measure." Measurement is incredibly important in production software. Your software is out there running without your supervision, being used in ways you did not expect and doing things you did not anticipate. Adding telemetry in the form of logs and dashboards is the first step. This provides the basic tools necessary to operate your software effectively, thus empowering your engineers to practice DevSecOps and SRE to the full extent of their intended goals. I have built dedicated telemetry teams to ensure that we have the right tools, that our software has quality telemetry and, most of all, that it is easy, if not automatic, to get telemetry out of new software.

Telemetry does not just give you logs of error messages. When done properly, it gives you details on the customer's journey, the ability to gather both technical and usage-based metrics, and the ability to raise alarms when your systems are not operating within proper ranges.
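To make the idea of raising alarms on out-of-range metrics concrete, here is a minimal sketch in Python using only the standard library. It is an illustration and not a recommendation of any particular telemetry stack: the names `log_event`, `RangeAlarm`, `error_rate`, the `checkout.error_rate` metric, and the 5% threshold are all hypothetical.

```python
import json
import logging
from dataclasses import dataclass

# Send INFO-level log lines to stderr so the sketch shows output when run.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("telemetry_sketch")


def log_event(name: str, **fields) -> str:
    """Emit one structured (JSON) log line and return it."""
    line = json.dumps({"event": name, **fields}, sort_keys=True)
    logger.info(line)
    return line


@dataclass
class RangeAlarm:
    """Hypothetical alarm that fires when a metric leaves its expected range."""
    metric: str
    low: float
    high: float

    def check(self, value: float) -> bool:
        # True means the value is outside [low, high] and should alert.
        return not (self.low <= value <= self.high)


def error_rate(statuses: list[int]) -> float:
    """Fraction of requests that returned a 5xx status code."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s >= 500) / len(statuses)


# Hypothetical sample: five requests observed at one step of a customer journey.
statuses = [200, 200, 503, 200, 500]
rate = error_rate(statuses)

# Assumed threshold: alert if more than 5% of requests fail.
alarm = RangeAlarm(metric="checkout.error_rate", low=0.0, high=0.05)
log_event(alarm.metric, value=rate, alarming=alarm.check(rate))
```

In a real system the structured events would typically flow into whatever logging and dashboard tooling your team already runs, and the thresholds would come from your own service-level targets.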

Developer Experience

Telemetry is not just for production anymore. Measure your SDLC! Focusing on the metrics for your developer experience becomes valuable and should be your next step. How long are your software builds taking? What is the defect count? What are your static analysis tools finding? How long do deployments take? How many rollbacks or fast follows do you have? A developer experience team is one of the next specializations that I tend to build in my engineering organizations. This team is focused on building the continuous integration and deployment (CI/CD) workflow and empowering engineers to focus more on feature work than on the instrumentation and automation of builds, deployments, and analysis tool integrations. This team can quickly pay for itself in engineering efficiency and confidence building. When coupled with the right infrastructure, it can save massive amounts of engineering effort that is otherwise repeated over and over. It is like getting more engineers for free!
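As a rough sketch of what measuring the SDLC can look like, the Python below summarizes build time, deployment time, and rollback rate from a list of deployment records. The `Deployment` record, the `summarize` function, and the sample numbers are hypothetical; in practice this data would come from your CI/CD system.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Deployment:
    """Hypothetical record of one pipeline run that reached production."""
    build_seconds: float
    deploy_seconds: float
    rolled_back: bool


def summarize(deployments: list[Deployment]) -> dict[str, float]:
    """Return a few developer-experience metrics for a set of deployments."""
    if not deployments:
        raise ValueError("no deployments to summarize")
    return {
        "median_build_seconds": median(d.build_seconds for d in deployments),
        "median_deploy_seconds": median(d.deploy_seconds for d in deployments),
        # Booleans sum as 0/1, so this is the share of deployments rolled back.
        "rollback_rate": sum(d.rolled_back for d in deployments) / len(deployments),
    }


# Made-up sample data for illustration only.
history = [
    Deployment(build_seconds=300, deploy_seconds=60, rolled_back=False),
    Deployment(build_seconds=420, deploy_seconds=90, rolled_back=False),
    Deployment(build_seconds=360, deploy_seconds=75, rolled_back=True),
    Deployment(build_seconds=600, deploy_seconds=120, rolled_back=False),
]
summary = summarize(history)
print(summary)
```

Medians are used here instead of means so that one unusually slow build does not dominate the picture; that is a design choice, not a rule.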

At my last gig, we took this to heart. We changed our virtual machine focused infrastructure and deployment scheme to one focused on containerization and Kubernetes. The impact on the SDLC was massive. Before the change, launching a new service took several software engineers, “developer experience” engineers, and infrastructure engineers many days to weeks to complete. After the change, a single software engineer could deploy a brand-new service to production in mere minutes without needing support from the developer experience or infrastructure teams. Not only was the deployment time improved, but the new pipelines provided better static analysis feedback for the engineers and raised the quality and security of the software being deployed. These faster deployments also improved our time to resolution of production issues.

Incident Management

This brings me to my last key focus area, incident management. This is the least technical of the focus areas, but the most important. If there is one skill that an SRE team should have, it is incident management. Production incidents have a cost. It may be direct revenue, it may be reputation, but it is always a velocity killer for engineering. Incident management has three phases: real-time incident management, postmortems, and after-actions.

Real-time incident management is focused on one goal: return production to full service (or an acceptable level of degradation, depending on business priorities). Times like this are where an incident commander (IC) is critical. The IC keeps the engineers focused on recovery and not discovery. It is easy to go down rat holes trying to find root causes, but companies should focus on recovery as the highest priority. The IC is also there to ensure that changes in production are coordinated and made in a visible and documented way. Often there is a secondary role to the IC, the chronicler. The chronicler documents the actions, conversations, and players in the incident. They are also often the person who communicates with stakeholders, keeping the greater organization informed.
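The chronicler's record can be as simple as a timestamped timeline. The Python sketch below shows one possible shape for it, along with the time-to-detect and time-to-recover figures it makes available afterward. The `Incident` class, its field names, and the sample timestamps are all hypothetical and not taken from any specific incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record kept by the chronicler."""
    started_at: datetime
    detected_at: Optional[datetime] = None
    recovered_at: Optional[datetime] = None
    timeline: list = field(default_factory=list)

    def record(self, at: datetime, who: str, note: str) -> None:
        """Append one action or decision to the timeline."""
        self.timeline.append((at, who, note))

    def time_to_detect(self) -> Optional[timedelta]:
        if self.detected_at is None:
            return None
        return self.detected_at - self.started_at

    def time_to_recover(self) -> Optional[timedelta]:
        if self.recovered_at is None:
            return None
        return self.recovered_at - self.started_at


# Made-up example: impact starts at 10:00, is detected at 10:07,
# and service is restored at 10:42.
incident = Incident(started_at=datetime(2023, 5, 18, 10, 0))
incident.detected_at = datetime(2023, 5, 18, 10, 7)
incident.record(incident.detected_at, "on-call", "Alert fired; IC assigned")
incident.record(datetime(2023, 5, 18, 10, 30), "IC", "Decision: roll back last deploy")
incident.recovered_at = datetime(2023, 5, 18, 10, 42)
incident.record(incident.recovered_at, "IC", "Service restored")

print(incident.time_to_detect(), incident.time_to_recover())
```

A record like this also gives the later postmortem a factual starting point instead of relying on memory.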

Postmortems (PMs) are typically done within a couple of days after the incident. The goal is to find the root cause of the incident and identify actions to be taken to ensure the incident does not recur. In my experience, following a blameless PM approach is key. It facilitates open and honest conversations and helps get to the root of the problems. Google has a great description of their process, which I highly recommend, in Chapter 15 of Site Reliability Engineering. I also subscribe to the “Five Whys” approach to find root causes (RCs). Tune these processes to something that works for your team, and evolve them as your teams grow and mature.

The output of PMs is after-actions (AAs). AAs are the tasks identified in the PM that need to happen to ensure the incident does not recur. As much as I do not like to say it, there are times when an RC is not identifiable, and that is OK. In these cases, the AAs may be adding more and better telemetry, so we have insight into what is happening during the incident. This may enable us to detect and raise an alert when the problem happens again, thus reducing the time to recovery. Playbooks are often outputs of AAs as well. They are critical in documenting steps to recovery, again reducing the outage time. You may find symptoms that can be addressed and thus lower your overall risk. Working through the “Five Whys” will also uncover areas where risk has accumulated.

What Now?

DevSecOps and SRE are not cure-alls. They are not a prescription. They are not one size fits all. If you are focused on risk management and measurement, you will be able to find how to make DevSecOps and SRE work for your team’s needs and goals. Immaturity in software quality, developer experience, incident management, and telemetry is a risk to your business and success. I am a strong believer in the DevSecOps and SRE approaches, but they require specialized execution when you hit a certain scale of team size, product complexity, or customer base. These are evolutionary efforts and should be implemented with careful planning. Hire experts and consultants when needed; you do not have to learn alone. You should build in-house muscles in these areas when they become important to your technology or strategy. Muscle memory is where your teams will scale. These changes may feel uncomfortable at the start, but with practice and accountability they will become automatic. The individuals on your team will benefit from DevSecOps and SRE as they develop a broader set of skills, leading to career advancement and greater overall satisfaction with their work.