How Facebook Uses High-Precision Tests to Prevent Performance Regressions for Mobile Apps
In the early days of mobile optimization at Facebook, each fix brought a major performance gain. Today, the apps are so optimized that time is better spent preventing small regressions that, if shipped, could erode that progress. At Facebook’s scale, this means checking thousands of commits a day to find regressions as small as 1%. Previous methods worked well for detecting large performance changes.
To detect smaller changes, however, the team had to build a new system called MobileLab, in which both the tests and the test environment are substantially more deterministic.
In production, MobileLab has already prevented thousands of regressions from shipping, thanks to its ability to detect very small changes in performance (as small as 1% for many metrics). Compared with the previous standard, MobileLab improves confidence intervals by 7x while reducing false positives by 75%.
The Facebook team first had to build a validation framework that would allow quick iteration and make it easy to see when a change had succeeded.
MobileLab determines whether a performance metric has changed from a known good build (control) to an unknown new build (treatment). The team chose to tolerate a 5% false-positive rate, as is common in many statistical analyses.
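As a rough illustration of such a control-versus-treatment decision, the sketch below runs a two-sided Welch-style test on two lists of timings. It is not MobileLab's actual test: the function name `welch_test`, the normal approximation for the p-value, and the sample numbers are all assumptions made for this example.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_test(control, treatment, alpha=0.05):
    """Two-sided test for a difference in means (normal approximation).

    Returns (relative_change, p_value, significant).
    """
    mean_c, mean_t = mean(control), mean(treatment)
    # Standard error of the difference in means; variances may differ.
    std_err = math.sqrt(variance(control) / len(control)
                        + variance(treatment) / len(treatment))
    if std_err == 0:
        # Noise-free samples: any difference in means is exact.
        p_value = 1.0 if mean_c == mean_t else 0.0
    else:
        z = (mean_t - mean_c) / std_err
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    relative_change = (mean_t - mean_c) / mean_c
    return relative_change, p_value, p_value < alpha


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical timings in milliseconds; the treatment is about 2% slower.
    control = [rng.gauss(1000.0, 10.0) for _ in range(50)]
    treatment = [rng.gauss(1020.0, 10.0) for _ in range(50)]
    change, p_value, significant = welch_test(control, treatment)
    print(f"change={change:+.2%} p={p_value:.4f} significant={significant}")
```

With `alpha=0.05`, about one in twenty comparisons of identical builds would be flagged by chance, which is the false-positive rate the team chose to tolerate.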
The validation framework runs many experiments and reports summary statistics across these repeated experiments. As input, the team provides A/A experiments, for which no difference should be reported, as well as pairs of builds with a regression of known size. Facebook used these summary statistics to understand the variability of the metrics and to validate the statistical assumptions of the hypothesis test.
Here are some examples of validation framework outputs:
- Observed false-positive rate: By construction of the hypothesis test, a false-positive rate of 5% is expected when many A/A experiments are run. If the test's assumptions are violated for some reason, the observed false-positive rate can be drastically different.
- Variance of control and treatment metrics: The framework reports the observed variance of each metric so that progress toward a 10x reduction can be monitored.
- Average metric value by trial number within experiments: If the data are independent and identically distributed (IID), there should be no correlation between the trial number and the observed value.
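Two of these outputs can be sketched in a few lines. The code below is a simplified illustration, not the framework itself: the function names, the z-test used as the decision rule, and the simulated trial functions are assumptions for this example.

```python
import math
import random
from statistics import NormalDist, mean, variance


def is_significant(control, treatment, alpha=0.05):
    """Two-sided z-test on the difference in means (normal approximation)."""
    std_err = math.sqrt(variance(control) / len(control)
                        + variance(treatment) / len(treatment))
    if std_err == 0:
        return mean(control) != mean(treatment)
    z = (mean(treatment) - mean(control)) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def false_positive_rate(run_trial, experiments=1000, trials=50):
    """Run repeated A/A experiments and return the share flagged as changed."""
    flagged = 0
    for _ in range(experiments):
        control = [run_trial(i) for i in range(trials)]
        treatment = [run_trial(i) for i in range(trials)]
        flagged += is_significant(control, treatment)
    return flagged / experiments


def trial_index_correlation(values):
    """Pearson correlation between trial number and value (values must vary)."""
    n = len(values)
    idx_mean, val_mean = (n - 1) / 2, mean(values)
    cov = sum((i - idx_mean) * (v - val_mean) for i, v in enumerate(values))
    idx_var = sum((i - idx_mean) ** 2 for i in range(n))
    val_var = sum((v - val_mean) ** 2 for v in values)
    return cov / math.sqrt(idx_var * val_var)


if __name__ == "__main__":
    rng = random.Random(42)

    def iid_trial(i):
        return rng.gauss(1000.0, 10.0)

    def warming_trial(i):
        # Hypothetical device that slows down as it heats up.
        return rng.gauss(1000.0 + 2.0 * i, 10.0)

    print("A/A false-positive rate:", false_positive_rate(iid_trial))
    print("IID correlation:",
          trial_index_correlation([iid_trial(i) for i in range(50)]))
    print("Drifting correlation:",
          trial_index_correlation([warming_trial(i) for i in range(50)]))
```

For IID trials, the false-positive rate should land near the chosen 5% and the correlation near zero. A correlation far from zero indicates that later trials behave differently from earlier ones.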
With the framework in place, the team tried out its ideas and was surprised to find that even a simple experiment violated some of the statistical assumptions that had been made.
The figure below shows performance shifting during an experiment, which means that the tests were not, in fact, IID as had been assumed.
To better understand what was happening, the team took several steps to improve consistency and to explain the differences observed between tests.
Consistent device performance
The team began by using Profilo, a tracing tool, to understand the differences between tests in the system.
Every trace the system produced was so different from the others that the team could not make progress. They needed to start with something simple and work up to an application as complicated as Facebook for Android.
They therefore developed a standalone CPU benchmark, which let them evaluate the behavior of the platform separately from the behavior of the workloads.
Using the benchmark, they found that the performance of the devices was extremely variable.
Android dynamically changes the CPU frequency to improve battery life and control phone temperature.
This meant that the speed of the phone changed as the device heated up. Using the CPU governor settings on the lab phones, the team was able to lock the processors at a fixed frequency at which they could run indefinitely without overheating.
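One way to pin the CPU frequency on a Linux-based device is to write the same value to the minimum and maximum frequency files that the kernel's cpufreq subsystem exposes under sysfs. The sketch below only builds the `adb` commands. It assumes that the adb shell runs as root (for example after `adb root` on a debuggable build), that the device exposes the standard `scaling_min_freq` and `scaling_max_freq` files, and that the chosen frequency is one the hardware supports. The function names are made up for this example, and the article does not say which exact settings the team used.

```python
import subprocess


def pin_cpu_commands(cpu_count, freq_khz, serial=None):
    """Build adb commands that pin every core to one fixed frequency.

    Assumes a root adb shell and the standard Linux cpufreq sysfs layout.
    """
    adb = ["adb"] + (["-s", serial] if serial else [])
    commands = []
    for cpu in range(cpu_count):
        base = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq"
        # With min == max, the governor has no room left to scale.
        for name in ("scaling_min_freq", "scaling_max_freq"):
            commands.append(adb + ["shell", f"echo {freq_khz} > {base}/{name}"])
    return commands


def apply_commands(commands):
    """Run each command and raise if any of them fails."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical values: 4 cores pinned to 1.2 GHz (sysfs uses kHz).
    for command in pin_cpu_commands(cpu_count=4, freq_khz=1_200_000):
        print(" ".join(command))
```

A frequency well below the maximum is the natural choice here, since the aim is a speed the device can sustain without thermal throttling rather than peak performance.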
They also built a similar benchmark for the GPU and locked it to a fixed frequency as well. This work produced much more consistent benchmark results.
These simple benchmarks also allowed them to verify that performance on different devices of the same model is consistent when running at a fixed frequency.
Benchmarks run continuously, like any other test, and go through the validation framework to ensure that the platform remains of high quality.
When they added a new device model to the lab, they used these benchmarks to find the optimal governor settings for the new device model.
Consistent application performance
Consistent benchmarks were useful, but real applications are more than just a CPU-limited benchmark. For example, many interactions in applications get information from Facebook servers using GraphQL. This means that the performance characteristics of Facebook’s servers and their dependent services affect benchmarks. If you happen to access Facebook.com at peak times, pages can load a little slower and change test results.
To isolate the system from these effects, the team created an HTTP proxy server. The proxy server is made available to the device over USB using adb reverse. It records request and response pairs, which can then be replayed to provide consistent data and timing for the benchmark. This approach is similar to BrowserLab, Facebook’s browser benchmarking system.
The proxy server significantly decreased noise in the system, but more importantly, it reduced the number of components involved, restricting them to the test runner and the phone. With so many components eliminated, it became easier to reason about the remaining sources of nondeterminism.
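The record-and-replay idea can be sketched as a small store keyed by request. This is an illustration only: the class `ReplayStore`, its methods, and the stored fields are assumptions for this example, and a real proxy would also need to handle headers, streaming, and TLS. On the device side, a command such as `adb reverse tcp:8080 tcp:8080` (port chosen arbitrarily here) makes a port on the host reachable from the phone.

```python
import hashlib
import json
import time


class ReplayStore:
    """Keeps one recorded response per request and replays it on later runs."""

    def __init__(self):
        self._responses = {}

    @staticmethod
    def _key(method, url, body):
        # Identical requests (method, URL, body) map to the same entry.
        return f"{method.upper()} {url} {hashlib.sha256(body).hexdigest()}"

    def record(self, method, url, body, status, payload, delay_s=0.0):
        self._responses[self._key(method, url, body)] = {
            "status": status, "payload": payload, "delay_s": delay_s}

    def replay(self, method, url, body=b"", sleep=time.sleep):
        entry = self._responses.get(self._key(method, url, body))
        if entry is None:
            return None  # unrecorded request: the caller decides how to fail
        sleep(entry["delay_s"])  # reproduce the same latency on every run
        return entry["status"], entry["payload"]

    def save(self, path):
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(self._responses, handle)


if __name__ == "__main__":
    store = ReplayStore()
    url = "https://example.test/graphql"  # placeholder URL
    store.record("POST", url, b'{"query": "feed"}', 200, '{"stories": []}', 0.05)
    print(store.replay("POST", url, b'{"query": "feed"}'))
```

Replaying a fixed delay along with the payload means that both the data and the timing seen by the application stay the same from trial to trial.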
Application state in tests
With the system free of external sources of noise, the team focused on the behavior of the application. They began to observe consistent performance patterns across the tests in each experiment. For non-cold-start tests, this can be explained by the application's in-memory state. But even cold-start tests showed this pattern: during testing, the application was changing its state on disk.
To control this, the team now backs up the application's disk state before the first trial.
Between trials, the application is stopped and that state is restored. This means that application state cannot carry over from one trial to the next.
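The backup-and-restore loop might look like the sketch below. The device operations are passed in as callables because the article does not describe how they are implemented. `FakeApp` is a hypothetical stand-in used only to show the effect. On a real device, stopping the app could use `adb shell am force-stop <package>`.

```python
def run_experiment(run_trial, stop_app, backup_state, restore_state, trials=50):
    """Run repeated trials, each starting from the same on-disk state."""
    snapshot = backup_state()        # taken once, before the first trial
    results = []
    for _ in range(trials):
        stop_app()                   # discard in-memory state
        restore_state(snapshot)      # discard on-disk changes from the last trial
        results.append(run_trial())
    return results


class FakeApp:
    """Hypothetical stand-in for a device: each run grows an on-disk cache."""

    def __init__(self):
        self.disk = {"cache_entries": 0}

    def stop(self):
        pass  # nothing to do for the fake

    def backup(self):
        return dict(self.disk)

    def restore(self, snapshot):
        self.disk = dict(snapshot)

    def run_trial(self):
        self.disk["cache_entries"] += 1
        return 100.0 + self.disk["cache_entries"]  # slower as the cache grows


if __name__ == "__main__":
    app = FakeApp()
    print(run_experiment(app.run_trial, app.stop, app.backup, app.restore, trials=5))
```

Without the restore step, each trial in this toy example would be slower than the one before, which is the kind of trial-to-trial pattern described above.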
After all these optimizations, the team returned to the earlier approach of studying profiling traces. The traces were no longer overwhelming, and individual blocks with higher variability were easy to identify. This allowed the team to find other sources of noise:
Disk write performance:
Because the same code is executed repeatedly, disk reads typically hit the operating system's cache. Disk performance is less consistent than CPU performance, so small writes contributed a surprising amount of noise. The team resolved this by remounting the application's data directories on a RAM disk using tmpfs. Neither read nor write performance is realistic in this scenario, so the team augmented the latency metrics with I/O metrics such as classes loaded and bytes written.
Device clock:
Some application behavior depends on the system time. One example is code that determines the TTL for cache entries. If too much time has passed, the application takes different code paths and has different performance characteristics. This was corrected by resetting the device clock to the same timestamp before each trial.
Crash dialogs:
In some rare cases, the application may crash during execution, causing Android to display a crash dialog box. Without further action, this dialog box would remain on the screen for the remainder of the test. The test would continue to work correctly, but the presence of the dialog box affected application performance and produced slightly slower metrics.
Logcat tailing:
Originally, metrics were communicated through log lines written by the application, and the test runner tailed logcat to receive them. Actively listening to logcat via ADB adds noise during the test. Instead, the team established an ADB reverse tunnel and had the application send metrics directly to the test runner over a socket connection.
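A test runner that receives metrics over a socket could be sketched as follows. The wire format (one JSON object per line), the metric name, and the function names are assumptions for this example, since the article does not describe MobileLab's protocol. In a real setup, `adb reverse tcp:<port> tcp:<port>` would let the app on the phone connect to the runner's port. Here a local thread plays the role of the app.

```python
import json
import socket
import threading


def start_listener(host="127.0.0.1", port=0):
    """Open a listening socket; port 0 lets the OS pick a free port."""
    server = socket.create_server((host, port))
    return server, server.getsockname()[1]


def receive_metrics(server, expected):
    """Accept one connection and read newline-delimited JSON metrics."""
    conn, _ = server.accept()
    metrics = []
    with conn, conn.makefile("r", encoding="utf-8") as lines:
        for line in lines:
            metrics.append(json.loads(line))
            if len(metrics) == expected:
                break
    return metrics


if __name__ == "__main__":
    server, port = start_listener()

    def fake_app():
        # Stand-in for the application sending one metric to the runner.
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(b'{"name": "cold_start_ms", "value": 812.5}\n')

    sender = threading.Thread(target=fake_app)
    sender.start()
    print(receive_metrics(server, expected=1))
    sender.join()
    server.close()
```

Compared with tailing logcat, the runner only does work when the application actually sends a metric.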
Combining the methods described above yields the experiment flow shown in the figure below. With these changes, MobileLab has reduced the variance in important performance tests by an order of magnitude and can reliably detect regressions of less than 1% with only 50 trials.
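The link between noise, trial count, and the smallest detectable regression can be approximated with a standard power calculation for comparing two means. The sketch below is a back-of-the-envelope estimate. The 5% significance level matches the article, while the 80% power target and the example noise levels are assumptions, not MobileLab figures.

```python
import math
from statistics import NormalDist


def minimum_detectable_change(cv, trials, alpha=0.05, power=0.8):
    """Smallest relative change a two-sample test can detect reliably.

    cv is the coefficient of variation (std dev / mean) of a single trial;
    trials is the number of trials per side (control and treatment).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * cv * math.sqrt(2 / trials)


if __name__ == "__main__":
    for cv in (0.10, 0.03, 0.01):  # hypothetical per-trial noise levels
        change = minimum_detectable_change(cv, trials=50)
        print(f"noise {cv:.0%} -> detectable change {change:.2%}")
```

Under these assumptions, detecting a change below 1% with 50 trials per side requires per-trial noise of roughly 1.8% or less, which is why reducing variance matters more than simply adding trials.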
Testing with MobileLab
In the previous system, the team used end-to-end correctness tests as performance tests. End-to-end tests work well when a test needs to run only once.
But in a performance test that runs many times, they greatly increase the duration of the test. In addition, each step may not be completely deterministic, resulting in high variance within an experiment.
Test speed and accuracy were among the main complaints about the old system.
In MobileLab, the team instead provides a more limited and opinionated test API that encourages users to write tests that get to the measurement as quickly and simply as possible. Unlike some end-to-end test frameworks, MobileLab does not search for UI elements or load additional libraries into the application for testing. This removes the overhead the framework would otherwise add.
Facebook sets performance targets based on the time taken to complete interactions. It is impossible, in a lab environment, to represent the huge combination of environments, devices, and connection speeds. So in many cases, the team chose to close these gaps by collecting additional metrics instead of creating more tests. Some examples of the metrics tracked are classes loaded, bytes read from and written to disk, and memory allocated. These consumption metrics also help explain why timing metrics have changed, saving investigation time.
The approach of using noise reduction and consumption metrics results in a benchmark that is more synthetic and less representative of all scenarios. Despite clear differences from production, the team found that MobileLab is still able to find regressions that occur in real-world scenarios. The focus is on being directionally correct, not on producing the same magnitude as production. All tests, including continuous runs, are run as A/B tests, reporting the difference between A and B instead of the absolute value of either side.
The precision improvements make it possible to use device time more efficiently and to spend resources on simulating more representative scenarios. For example, the team can now run jobs that simulate different combinations of A/B product experiences or different mixes of News Feed stories. When MobileLab fails to detect a regression that reaches production, they use this information to close gaps in lab coverage.
Automated Regression Detection
Using MobileLab’s highly accurate measurement system, the team limits the regressions shipped to production. MobileLab runs automatically and continuously, performing an hourly comparison of the current production branch with master. Additional statistics are applied to detect step changes in metrics, and an automatic bisect finds the code change that caused the regression.
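The article does not say which statistics are used to detect step changes. As one simple possibility, the sketch below scans every split point in a series of hourly results and scores the difference between the means before and after it, measured in standard errors. The function name, the threshold of 4, and the sample data are assumptions for this example.

```python
import math
from statistics import mean, variance


def find_step(series, min_segment=3, threshold=4.0):
    """Return the index where the series mean shifts most clearly, or None.

    Each split is scored by the difference between the segment means,
    measured in units of its standard error.
    """
    best_index, best_score = None, threshold
    for i in range(min_segment, len(series) - min_segment + 1):
        before, after = series[:i], series[i:]
        std_err = math.sqrt(variance(before) / len(before)
                            + variance(after) / len(after))
        gap = abs(mean(after) - mean(before))
        if std_err == 0:
            score = math.inf if gap > 0 else 0.0
        else:
            score = gap / std_err
        if score > best_score:
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    # Hypothetical hourly metric values with a shift after the sixth run.
    hourly = [100, 101, 99, 100, 101, 99, 110, 111, 109, 110, 111, 109]
    print(find_step(hourly))
```

The returned index marks the first run after the shift, which narrows down the range of commits that a bisect then has to search.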
The bisect process integrates with the task tracking system, automatically alerting the engineer who authored the change behind the detected problem and blocking future releases until the regression is resolved.
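The automatic search for the offending change is essentially a binary search over the commit range. In the sketch below, `is_regressed` is a placeholder for whatever check decides whether a commit is slower than the known good baseline (in practice, a full A/B experiment). The function and commit names are made up for this example.

```python
def bisect_regression(commits, is_regressed):
    """Return the first commit for which is_regressed(commit) is True.

    Assumes commits[0] is known good, commits[-1] is known regressed, and the
    regression persists in every commit after the one that introduced it.
    """
    low, high = 0, len(commits) - 1   # low: known good, high: known regressed
    while high - low > 1:
        mid = (low + high) // 2
        if is_regressed(commits[mid]):
            high = mid
        else:
            low = mid
    return commits[high]


if __name__ == "__main__":
    commits = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]  # hypothetical
    first_bad = "c5"
    checked = []

    def is_regressed(commit):
        checked.append(commit)
        return commits.index(commit) >= commits.index(first_bad)

    print(bisect_regression(commits, is_regressed), "after checking", checked)
```

Because each check is a complete experiment, the logarithmic number of checks matters: a range of a thousand commits needs about ten experiments rather than a thousand.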
Because the apps are released weekly, the detection and bisect process must be fast and reliable. Otherwise, there may not be enough time to fix a regression before the release.
The approach and lessons the Facebook team learned in creating MobileLab apply to many performance benchmarking scenarios, not just mobile devices.
Specifically: a methodical approach is necessary for evaluating changes to the system; tools that make this evaluation easy speed up progress; and dividing the problem into smaller pieces and limiting the number of components in the system makes it easier to find and reduce noise.
We hope that by sharing these learnings, others may find similar success.
These improvements in signal quality are helping the team detect previously undetectable regressions, making MobileLab an important part of the performance team’s workflow at Facebook.