Efficacy Presubmit



    By Peter Spragins
    with input from John Roane, Collin Johnston, Matt Rodrigues and Dave Chen

    A Brief History of Efficacy

    The Efficacy team, originally named “Test Efficacy”, was formed in 2014 as a small team to quantify the value of individual tests to the development process. Some tests were particularly valuable because they provided a reliable breakage signal for critical code. Other tests were not useful because they were non-deterministic or never failed. To complicate matters, the value of a test would also change over time. The team’s initial intention was to present this information to developers and help them optimize the development process.

    To achieve the goal of informing developers about their tests, the team had to collect a huge amount of developer infrastructure/workflow data from a variety of sources across Google. Collecting all of this data in one place turned out to be incredibly valuable.

    In addition to collecting and processing the data, the team developed a somewhat radical philosophy towards running tests at scale: the only important results come from tests which deterministically fail. Running an additional test that you know will pass is not a valuable signal to developers, and likely a waste of resources.

    Background on Google Presubmit

    The process of committing code at Google has several testing stages. Perhaps the three most important testing stages are:

    1. Individual ad-hoc testing
    2. Presubmit
    3. Continuous build/continuous integration (hereafter referred to as continuous build).

    Stages 1 and 2 can actually be interleaved in any order and repeated any number of times.
    A presubmit executes all of the tests which are known to be affected by the edited code within one user’s proposed code changes. The “affected tests” are calculated with the help of a “project definition”, a configuration maintained by teams. A presubmit can run at any point during the change proposal process, but most importantly it must run before a user can permanently commit their changes.
    Continuous build (stage 3) is the continuous running of all tests within a project at the newest committed version of the code. Continuous build will execute tests even when they have already passed at presubmit.
    The same test may run several times at presubmit during the development process, one last time at presubmit before a commit, and then once again at continuous build after the change has been merged into the main branch of Google’s huge repository. For this reason, a “missed failure” at presubmit is not a critical failure. The test will still be run at continuous build, and the change will be rolled back if the test fails.

    Efficacy Presubmit Service

    Efficacy Presubmit Service is the fusion of “running the right tests at the right time” with one of the largest collections of test/developer data in the world. The service has one simple job: save time and resources by not running, or even compiling, tests that we are very confident will pass at presubmit. The ideal “Efficacy Presubmit” would predict ahead of time which tests will pass and run only the tests that are going to fail. The user could then get feedback from the failing tests and fix their mistakes at the minimum possible cost in user and CPU time.
    To make this idea possible we have made one significant abstraction of the actual presubmit testing process. A given presubmit may run zero tests, or many. In a presubmit with one test, if that test fails then the presubmit fails. In a presubmit with a thousand tests, a single failing test will still fail the presubmit. Efficacy Presubmit makes the abstraction that each of these test executions is an equivalent unit, regardless of the size of the presubmit it belongs to. This greatly simplifies creating a training dataset.
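
    The abstraction above can be sketched as a flattening step: every test execution becomes one labelled row, whatever the size of its presubmit. This is only an illustration; `ExecutionRow`, `flatten_presubmits`, the field names and the test targets are hypothetical and are not taken from the actual service.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ExecutionRow:
    """One training example: a single test execution inside a presubmit."""
    presubmit_id: str
    test_name: str
    failed: bool  # label: True if this execution failed


def flatten_presubmits(presubmits: Dict[str, Dict[str, bool]]) -> List[ExecutionRow]:
    """Turn {presubmit_id: {test_name: failed}} into one row per execution."""
    rows = []
    for presubmit_id, results in presubmits.items():
        for test_name, failed in results.items():
            rows.append(ExecutionRow(presubmit_id, test_name, failed))
    return rows


# Hypothetical presubmits: one with no tests, one with one test, one with three.
presubmits = {
    "p1": {},
    "p2": {"//a:test": True},
    "p3": {"//a:test": False, "//b:test": False, "//c:test": True},
}

rows = flatten_presubmits(presubmits)
for row in rows:
    print(row)
```

    The empty presubmit contributes no rows, and the rows from the small and large presubmits are indistinguishable in shape, which is what makes them usable as uniform training examples.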

    Machine Learning / Probabilistic Safety

    Quick background on ML

    ML techniques and processes are quite well known throughout the industry at this point. The TensorFlow tutorials are a great introduction. The type of ML we use is classification. A classifier is essentially a mapping from the domain of the dataset to the range of the classes. MNIST is a very famous example of classification. An MNIST classifier maps from the domain of the input image to the range of digits {0, 1, …, 9}.

    In some other classification problems, the inputs are more “tabular”. A famous example of tabular classification is Iris Species. This is very similar to what Efficacy does.

    Efficacy’s Application of ML

    Given the abstraction on the presubmit testing process described above, predicting the outcomes of automated testing at a large company is a perfect machine learning problem in many ways. You have:

    1. A very large labelled dataset: the set of test executions and their results

    2. Copious numerical feature columns with trustworthy values

    3. Recent failure history of each test

    4. Various “distance” metrics from edited source files to tests - i.e. is this a test for the edited code?

    5. Test size and runtime data

    6. Several dimensions that can be aggregated

    There are some aspects of the problem which make ML difficult as well:

    1. The classes are highly imbalanced with respect to labels (the vast majority of tests are going to pass, not fail)
    2. Flaky tests can mislead the model because their labels are “untrue”
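
    The class imbalance matters because plain accuracy becomes a misleading metric. As a minimal sketch with made-up numbers (10 failures in 1,000 executions, chosen only for illustration), a "model" that predicts every test will pass looks excellent on accuracy while catching no failures at all:

```python
def accuracy(predictions, labels):
    """Fraction of executions whose predicted label matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def sensitivity(predictions, labels):
    """Fraction of truly failing executions that were predicted to fail."""
    caught = sum(1 for p, y in zip(predictions, labels) if p and y)
    return caught / sum(labels)


# Hypothetical dataset: True means the test failed (the positive class).
labels = [True] * 10 + [False] * 990

# A degenerate model that always predicts "pass".
always_pass = [False] * len(labels)

print("accuracy:", accuracy(always_pass, labels))
print("sensitivity:", sensitivity(always_pass, labels))
```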

    We chose to reduce the problem to binary classification: for each test, the model decides whether or not to run it. Failure is the positive class, and everything else is the negative class.

    We pick a threshold that results in an extremely low number of false negatives, that is, failing tests which are not run because the model thinks they would have passed. This reduces the number of correctly skipped tests (true negatives) in exchange for a very high margin of safety. In addition, tests will be run afterwards at continuous build anyway, making presubmit skipping very safe.
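
    One way to picture the threshold choice is the sketch below. It assumes the model emits a failure score per test and that a test is run when its score is at or above the threshold; both the scoring convention and the helper names (`pick_threshold`, `should_run`) are assumptions made for illustration, and the scores are invented.

```python
import math
from typing import List


def pick_threshold(scores: List[float], labels: List[bool],
                   min_sensitivity: float) -> float:
    """Highest threshold that still runs at least `min_sensitivity` of the
    failing tests in a validation set (labels: True means the test failed)."""
    failing = sorted((s for s, failed in zip(scores, labels) if failed),
                     reverse=True)
    if not failing:
        raise ValueError("need at least one failing example")
    # Number of failing tests that must land at or above the threshold.
    needed = max(1, math.ceil(min_sensitivity * len(failing)))
    return failing[needed - 1]


def should_run(score: float, threshold: float) -> bool:
    """Run the test unless the model is confident enough that it will pass."""
    return score >= threshold


# Hypothetical validation data: predicted failure scores and true outcomes.
scores = [0.90, 0.40, 0.20, 0.05, 0.02, 0.30, 0.01, 0.15, 0.03]
labels = [True, True, False, False, False, True, False, True, False]

threshold = pick_threshold(scores, labels, min_sensitivity=1.0)
skipped = [(s, y) for s, y in zip(scores, labels) if not should_run(s, threshold)]
false_negatives = [s for s, failed in skipped if failed]

print("threshold:", threshold)
print("skipped:", len(skipped), "false negatives:", len(false_negatives))
```

    Demanding that every known failure is caught pushes the threshold down to the lowest failing score, so the passing test scored 0.20 is run as well. That is the cost in skipped tests paid for the margin of safety.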

    Difficulties of Scale

    In addition to the problems that were natural to the “schema” of the dataset, we faced some problems due to the scale of Google’s testing.

    Many of these problems stem from the fact that Google works out of one large repository (paper, talk). Because of this, some presubmits have a very large number of tests, and some commits require a large number of presubmits before they are finished. This means that the service has to make predictions for a very large number of tests all at once. If a presubmit tried to run every test at Google, the service would have to make a prediction for each test individually. For N tests, that means N times the number of feature columns, and so on. Loading the data to generate all of these feature values uses a lot of memory.

    Another difficulty of doing this work at scale is that even when false negatives are very rare for any individual prediction, the sheer number of predictions means they still happen somewhat frequently in absolute terms. This requires our team to be open to communication with any customer team. In some cases we may have to tell them they were the victim of a very low probability event. In other cases we may find a bug, or room for improvement.
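
    The arithmetic behind "rare but still frequent" can be sketched as follows. The numbers are purely hypothetical (they are not measurements from the service), and the second function assumes misses are independent, which is a simplification.

```python
import math


def expected_misses(skipped_tests: int, miss_probability: float) -> float:
    """Expected number of false negatives among the skipped tests."""
    return skipped_tests * miss_probability


def probability_of_any_miss(skipped_tests: int, miss_probability: float) -> float:
    """1 - (1 - p)^n, assuming independent misses.

    Computed with log1p/expm1 so that tiny probabilities keep their precision.
    """
    return -math.expm1(skipped_tests * math.log1p(-miss_probability))


# Hypothetical scale: 5 million skipped tests, each with a one-in-a-million
# chance of hiding a real failure.
n_skipped = 5_000_000
p_miss = 1e-6

print("expected misses:", expected_misses(n_skipped, p_miss))
print("P(at least one miss):", probability_of_any_miss(n_skipped, p_miss))
```

    Under these assumed numbers, several misses are expected and at least one is close to certain, even though each individual prediction is almost never wrong.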

    Results

    The two key numbers for the system’s performance are sensitivity, the percentage of failing tests we actually execute, and specificity, the percentage of passing tests we actually skip. The two numbers go hand in hand: for a given model, requiring a higher sensitivity results in a lower specificity, and vice versa. We can easily tune the percentage of tests skipped, which changes the fidelity of the testing signal developers receive.
    When the system produces a false negative, developers can be negatively affected. Rarely, it will allow a developer to commit code that breaks a test during continuous build. This results in a broken “project”, which takes some time to detect, followed by a roll-back of the code. This requires some developer time, and a flexible mentality towards testing. To come out ahead, we must skip millions of tests for every negative developer experience. The sensitivity of our system is very high, and our specificity is around 25%.
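
    The two definitions can be written out in terms of the four possible outcomes for a test. In the sketch below the counts are hypothetical: the passing-test counts are chosen to match the roughly 25% specificity quoted above, while the failing-test counts are invented, since the text only says that sensitivity is very high.

```python
def sensitivity(failing_run: int, failing_skipped: int) -> float:
    """Share of failing tests that were actually executed: TP / (TP + FN)."""
    return failing_run / (failing_run + failing_skipped)


def specificity(passing_skipped: int, passing_run: int) -> float:
    """Share of passing tests that were actually skipped: TN / (TN + FP)."""
    return passing_skipped / (passing_skipped + passing_run)


# Hypothetical counts. Failure is the positive class, so:
#   true positive  = failing test that was run
#   false negative = failing test that was skipped
#   true negative  = passing test that was skipped
#   false positive = passing test that was run
failing_run, failing_skipped = 99, 1
passing_skipped, passing_run = 250, 750

print("sensitivity:", sensitivity(failing_run, failing_skipped))
print("specificity:", specificity(passing_skipped, passing_run))
```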