by Dave Gladfelter (SETI, Google Drive)
The SETI (Software Engineer, Tools and Infrastructure) role at Google is a strange one in that there’s no obvious reason why it should exist. The SWEs (Software Engineers) on a project understand its problems best, and understanding a problem is most of the way to fixing it. How can SETIs bring unique value to a project when SWEs have more on-the-ground experience with their impediments?
The answer is scope. A SWE is rewarded for being an expert in their particular area and domain and is highly motivated to make optimizations to their carved-out space. SETIs (and Test Engineers and EngProd in general) identify and solve product-wide problems.
Product-wide problems frequently arise because local optimizations don’t necessarily add up to product-wide optimizations. The reason may be the limits of attention, blind spots, or mis-aligned incentives, but a group of SWEs each optimizing for their own sub-projects will not achieve product-wide maxima.
Often SETIs and Test Engineers (TEs) know what behavior they’d like to see, such as more integration tests. We may even have management’s ear and convince them to mandate such tests. However, in the absence of incentives, it’s unlikely that the decisions SWEs make in response to such mandates will add up to the behavior we desire. Mandates around methods/practices are often ineffective. For example, a mandate of documentation for each public method on an interface often results in “method foo does foo.”
The best way to create product-wide efficiencies is to change the way the team or process works in ways that will (initially) be uncomfortable for the engineering team, but that pay dividends that can’t be achieved any other way. SETIs and TEs must work to identify the blind spots and negative interactions between engineering teams and change the environment in ways that align engineering teams’ incentives. When properly incentivized, SWEs will make optimal decisions enhanced by product-wide vision rather than micro-management.
Common Product-Wide Problems
One common example of local optimizations resulting in cross-team de-optimization is documentation and ease-of-use of internal APIs. The team that implements an internal API is not rewarded for making it easy to use except in the most oblique ways. Clients are compelled to use the internal APIs provided to them, so the API owner has a monopoly and will set the price of using it at “you must read all the code and debug it yourself” in the absence of incentives or (rare) heroes.
Big, slow releases
Another example is large and slow releases. Without EngProd help or external pressure, teams will gravitate to the slowest, biggest release possible.
This makes sense from the position of any individual SWE: releases are painful, you have to ensure that there are no UI and API regressions, watch traffic and error rates for some time, and re-learn and use tools and processes that are complex and specific to releases.
Multiple teams will naturally gravitate to having one big release so that all of these costs can be bundled into one operation for “efficiency.” The result is that engineers don’t get feedback on features for weeks and versioning of APIs and data stores is ignored (since all the parts of the system are bundled together into one big release). This greatly slows down developer and feature velocity and greatly increases risks of cascading failures when the release fails.
How EngProd fixes product-wide problems
SETIs can nibble around the edges of these kinds of problems by writing tools and automation. TEs can create easy-to-use test environments that facilitate isolating and debugging faults in integration and ambiguities in APIs. We can use fancy technologies to sample live traffic and ensure that new versions of systems behave the same as previous versions. We can review design docs to ensure that they have an appropriate test plan. Often these actions do have real value. However, these are not the best way to align incentives to create a product-wide solution. Facilitating engineering teams’ fruitful collaboration (and dis-incentivizing negative interactions) gives EngProd a multiplier that is hard to achieve with only tooling and automation.
Heroes are few and far between so we must turn to incentives, which is where discomfort comes in. Continuity is comfortable and change is painful. EngProd looks at how to change the problem so that teams are incentivized to work together fruitfully and disincentivized (discomforted) to pursue local optimizations exclusively.
So how does EngProd align incentives? Certainly there is a place for making optimal behaviors easier, for example by providing easy-to-use integration environments. However, incentivizing optimal behaviors via negative feedback should not be overlooked. Each problem is different, so let’s look at how to address the two examples above:
Incentivizing easy-to-use APIs
Engineers will make the things they’re incentivized to make. For APIs, give teams an incentive to provide integration help in the form of fakes. EngProd works with team leads to ensure there are explicit objectives to provide fakes for their APIs as part of the rollout.
Fakes are as-simple-as-possible implementations of a service that can still be used to do pre-submit testing of client interactions with the system. They don’t replace integration tests, but they reduce the likelihood of finding errors in subsequent integration test runs by an order of magnitude.
Furthermore, have some subset of the same client-owned and server-owned tests run against the fakes (for quick presubmit testing) as well as the real implementation (for continuous integration testing). Then work with management to make it the responsibility of the fake owner to debug any discrepancies in either the client-owned or the server-owned tests.
This reverses the pain! API owners, who are in a position to make APIs better, are now the ones experiencing negative incentives when APIs are not easy to use. Previously, when clients felt the pain, they had no recourse other than to file easily-ignored bugs (“Closed: working as intended”) or contribute changes to the API owners’ codebase, hurting their own performance with distractions.
This will incentivize API owners to design APIs to be as simple as possible with as few side-effects as possible, and to provide high-quality fakes that make it easy for clients to integrate with the API. Some teams will certainly not like this change at first, but I have seen API teams come to the realization that this is the best choice for the larger effort and implement these practices despite their cost to the team in the short run.
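The fake-plus-shared-tests arrangement described above can be sketched roughly as follows. This is only an illustration, not a description of any real Google API: `QuotaService`, `FakeQuotaService`, and `run_contract_checks` are hypothetical names invented for the example. The point is that one set of checks is written against the interface, so it can run against the fake in presubmit and against the real implementation in continuous integration.

```python
from typing import Callable, Dict, List, Protocol


class QuotaService(Protocol):
    """Hypothetical API surface shared by the real service and its fake."""

    def reserve(self, user: str, amount: int) -> bool: ...

    def used(self, user: str) -> int: ...


class FakeQuotaService:
    """As-simple-as-possible in-memory fake, owned by the API team."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._used: Dict[str, int] = {}

    def reserve(self, user: str, amount: int) -> bool:
        current = self._used.get(user, 0)
        # Reject non-positive amounts and anything that would exceed the limit.
        if amount <= 0 or current + amount > self._limit:
            return False
        self._used[user] = current + amount
        return True

    def used(self, user: str) -> int:
        return self._used.get(user, 0)


def run_contract_checks(make_service: Callable[..., QuotaService]) -> List[str]:
    """Run the shared checks against any implementation.

    Returns a list of failure descriptions; an empty list means the
    implementation behaved as the checks expect.
    """
    failures: List[str] = []
    service = make_service(limit=10)
    if not service.reserve("alice", 6):
        failures.append("in-limit reservation was rejected")
    if service.reserve("alice", 5):
        failures.append("over-limit reservation was accepted")
    if service.used("alice") != 6:
        failures.append("usage does not reflect accepted reservations only")
    if service.used("bob") != 0:
        failures.append("unknown user should have zero usage")
    return failures
```

In presubmit a client team would call `run_contract_checks(FakeQuotaService)`; in continuous integration the same function would be handed a factory for the real client. Any difference between the two result lists is a discrepancy that lands on the fake owner's desk.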
Helping management set engineering team objectives may not seem like a typical SETI responsibility. However, although management is responsible for setting performance incentives and objectives, they are not well-positioned to understand how the low-level decisions of different teams create harmful interactions and lower cross-team performance. They need SETI and TE guidance to create an environment that encourages optimal behaviors.
Fast, small releases
Being forced to release more frequently than feature deployment strictly requires has many beneficial side-effects that make release velocity a goal unto itself. SETIs and TEs faced with big, slow releases work with management to mandate a move to a set of smaller, more frequent releases. As release velocity is ratcheted up, negative behaviors such as too much manual testing or too much internal coupling become more painful, and many optimal behaviors are incentivized.
Less coupling between systems
When software is released together, it is easy to treat the seams between different components as implementation details. The resulting systems become so intertwined (coupled) that responsibilities between them are completely and randomly mixed and their interactions are too complex for any one person to understand. When two components are released separately and at different times, different versions of them must be compatible with one another. Engineers who were previously complacent about this fragility will become fearful of failed releases due to implicit contract changes. They will change their behavior in beneficial ways such as defining the contract between components explicitly and creating regression testing for it. The result is a system composed of robust, self-contained, more easily understood components.
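As a rough illustration of an explicit contract with regression checks, consider two versions of a component that exchange JSON messages. Everything here is hypothetical (the field names, `REQUIRED_FIELDS`, and the `encode_*`/`decode_*` helpers); the sketch only shows the idea that an old reader must tolerate a new writer and vice versa once the two are released at different times.

```python
import json
from typing import Any, Dict

# The explicit contract: fields every version must write, with their types.
REQUIRED_FIELDS = {"id": int, "title": str}


def encode_v1(doc_id: int, title: str) -> str:
    return json.dumps({"id": doc_id, "title": title})


def encode_v2(doc_id: int, title: str, owner: str) -> str:
    # v2 adds an optional field without touching the required ones.
    return json.dumps({"id": doc_id, "title": title, "owner": owner})


def decode_v1(payload: str) -> Dict[str, Any]:
    data = json.loads(payload)
    # Old reader: picks out the fields it knows and ignores the rest.
    return {"id": data["id"], "title": data["title"]}


def decode_v2(payload: str) -> Dict[str, Any]:
    data = json.loads(payload)
    # New reader: supplies a default when an old writer omitted "owner".
    return {
        "id": data["id"],
        "title": data["title"],
        "owner": data.get("owner", "unknown"),
    }


def satisfies_contract(payload: str) -> bool:
    """True if the payload carries every required field with the right type."""
    data = json.loads(payload)
    return all(
        isinstance(data.get(name), kind) for name, kind in REQUIRED_FIELDS.items()
    )
```

A regression suite would then assert all four writer/reader pairings, for example that `decode_v1(encode_v2(...))` and `decode_v2(encode_v1(...))` both succeed, so that an implicit contract change fails a test instead of a release.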
Better/More automated testing
Manual testing becomes more painful as release velocity is ramped up. This will incentivize automated regression, UI and performance tests. This makes the team more agile and able to catch defects sooner and more cheaply.
When incremental feature changes can be released to dogfood or other beta channels more frequently, user interaction designers and product managers get much faster feedback about what paths lead to better user engagement and experience than in big, slow releases where an entire feature is deployed simultaneously. This results in a better product.
The SETIs and TEs optimize interactions between teams and create fixes for product-wide, cross-team problems in order to improve engineering productivity and velocity. There are many worthwhile projects that EngProd can do using broad knowledge of the system and expertise in refactoring, automation and testing, such as creating test fixtures that enable continuous integration testing or identifying and combining duplicative tests or tools.
That said, the biggest problem that EngProd is positioned to solve is to break the chain of local optimizations resulting in cross-team de-optimizations. To that end, discomfort is a tool that can incentivize engineers to find solutions that are optimal for the entire product. We should look for and advocate for these transformative changes.