Today I would like to describe one way to build a scalable and frictionless benchmarking pipeline for Android native libraries, aiming to support different benchmark and device variants. It is designed for open source projects, so it composes public services that are commonly free for such use. The ingredients are cloud virtual machines for building, local single board computers (e.g., Raspberry Pi) for hosting Android devices and executing benchmarks, a Dana server for tracking benchmark results of landed changes, and Python scripts for posting benchmark comparisons to pull requests. A Buildkite pipeline chains them together and drives the full flow.
The importance of benchmarking on real devices goes without saying for libraries that care about performance. ML inference is one such case, especially on mobile devices, where we have limited resources yet want real-time interactions. Benchmarking such libraries on mobile devices, especially Android, is quite difficult, due to Android's distinct development flow and the overwhelming number of device variants. A good benchmarking pipeline should be resilient, extensible, and productive to use.
Before going into details, I just want to note again that the solution described here is for native libraries, that is, C/C++ binaries, instead of full Java apps. Although native libraries are meant to be integrated into some Java app eventually, they are built using the Android NDK and normally do not require Android Studio as the development environment. It's common to develop these native libraries with the normal C/C++ flow and test/benchmark them under the adb environment, which is better for automation and continuous integration.
Requirements and Goals
This solution is built for IREE, an end-to-end ML compiler and runtime, where edge/mobile devices are one of the main deployment scenarios. So in the following I’ll use IREE’s pipeline as an example. While some of the considerations may be specific to IREE, the general idea and pipeline should be widely applicable.
First and foremost, the pipeline should follow the open source way, as we are building it for an open source project. We would like to adopt public services and make information universally accessible. This avoids requiring contributors to learn project-specific infrastructure and makes issues easy to understand and debug.
The benchmarking pipeline needs to be scalable across multiple dimensions. Clearly there are various ML models and Android phones. For IREE, we additionally want to support the various CPUs, GPUs, and dedicated accelerators on a phone. Like other systems handling heterogeneous devices, IREE uses a hardware abstraction layer (HAL) to handle them all. Each such compute device can be driven by one or multiple HAL “drivers”1, for example, Dylib/VMVX for the CPU and Vulkan for the GPU. Therefore, drivers form their own dimension. That's still not all: for the same ML model, compute device, and driver, we can have multiple benchmark modes, like big/little cores and the number of threads for the CPU. So in summary, we have at least 1) ML models, 2) Android phones, 3) compute devices, 4) IREE HAL drivers, and 5) benchmark modes.
Any of the above dimensions can see new variants. In order to be scalable, the dimensions need to be decoupled, so that adding a variant in one dimension does not affect the others, and composable, so that the products involving the new variant are automatically generated.
Actually, the above dimensions largely fall into two categories. ML models, IREE HAL drivers, and benchmark modes are on the software side. They can be scaled relatively easily. Android phones and compute devices involve real hardware; that’s harder.
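The software-side composability boils down to a cross product over the dimensions. As a minimal sketch (the variant names here are illustrative, not IREE's real configuration lists):

```python
import itertools

# Hypothetical variants for each software-side dimension; the real lists
# live in IREE's benchmark configuration.
models = ["MobileNetV2", "MobileBert"]
drivers = ["Dylib", "VMVX", "Vulkan"]
modes = ["big-cores", "little-cores", "4-threads"]

# The benchmark matrix is simply the cross product: appending a variant to
# any one list automatically yields all the new combinations.
matrix = list(itertools.product(models, drivers, modes))
print(len(matrix))  # 2 * 3 * 3 = 18
```

Adding, say, a new driver only requires touching the `drivers` list; every model/mode pairing for it is generated for free.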
The benchmarking pipeline also needs to be frictionless. It should integrate into the normal development flow and avoid manual steps as much as possible. We would like the ability to try out changes on pull requests, and we also need a tool to track the performance of landed commits and perform regression analysis. They should come at negligible additional cost, though.
The whole pipeline contains three main steps:
- Benchmark artifact generation: building benchmark binaries and model artifacts for all the target phones.
- Benchmark execution: filtering and executing all suitable benchmarks on a target phone.
- Benchmark result presentation: either posting on pull requests or pushing to the Dana server.
We use Buildkite as the harness for the full pipeline. Buildkite is a public continuous integration platform that employs a server-client architecture. buildkite.com is the central coordinator that listens to GitHub webhook events and dispatches CI jobs to Buildkite agents.
Buildkite itself does not provide agents; each project using Buildkite brings its own. Buildkite agents can run on many platforms (Linux, macOS, Windows, etc.), so a project can compile and run on all of them. A project can register as many agents with the Buildkite server as it wants, and these agents can focus on performing different tasks. The agents are just normal programs; they can execute effectively arbitrary code. So it's very flexible.
Each step in a Buildkite pipeline can be performed by a different Buildkite agent. The Buildkite server selects the agent for a step by matching the step's required agent tags against all registered agents and dispatching to whichever matching agent is available.
Benchmark artifact generation
The first step is generating the benchmark binary and model artifacts. This step does not need access to the exact to-be-benchmarked device; we can just generate all configurations of interest.
However, this step is resource demanding. For IREE, it's especially so because we need to compile both TensorFlow and LLVM. So it's impractical to run locally for each device to be benchmarked. Instead we aggregate the generation and run it on powerful cloud virtual machines.
That means managing the whole build environment ourselves, though. Docker is very helpful for this purpose. It also has the benefit that we can manage the build environment via code changes, as we can check in the Dockerfiles and reference different Docker image SHAs in the Buildkite pipeline's YAML file.
The generated artifacts are then uploaded to the Buildkite server. They are later downloaded for use by the execution step.
Benchmark execution
With the benchmark artifacts generated, the next step is to execute them to collect benchmark numbers. This step needs access to concrete devices.
Devices can have CPU/GPU of different architectures. In the first artifact generation step, many architectures are targeted. So for this step, we need to add scripts to probe the benchmark device, filter those generated artifacts, and execute suitable benchmarks on the device.
How to probe the device is specific to the operating system. For Android, we can use adb shell cat /proc/cpuinfo and adb shell cmd gpu vkjson to get information regarding the CPU/GPU. With the detailed product/architecture, e.g., ARMv8.2-A/Adreno-640/Mali-G78, we can filter the generated artifacts to execute benchmarks.
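The probing can be scripted with a thin wrapper over adb. The following is a sketch, not IREE's actual script: the helper names are mine, and the vkjson JSON schema varies across Android versions, so treat the field lookups as assumptions to verify on your device.

```python
import json
import subprocess

def adb_shell(cmd):
    # Runs a command on the attached Android device via adb; assumes adb is
    # on PATH and a single device is connected.
    return subprocess.run(["adb", "shell", cmd], check=True,
                          capture_output=True, text=True).stdout

def parse_cpu_features(cpuinfo_text):
    # On ARM, /proc/cpuinfo lists a "Features" line per core; the feature
    # set (fp, asimd, aes, ...) tells us the CPU architecture level.
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("features"):
            return set(line.split(":", 1)[1].split())
    return set()

def parse_gpu_name(vkjson_text):
    # "adb shell cmd gpu vkjson" dumps Vulkan device info as JSON; this
    # assumes a "devices" array with standard Vulkan properties inside.
    info = json.loads(vkjson_text)
    return info["devices"][0]["properties"]["deviceName"]
```

In use, one would call, e.g., `parse_cpu_features(adb_shell("cat /proc/cpuinfo"))` to decide which artifacts the device can run.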
Filtering the generated artifacts means that in the first generation step the artifacts must be placed in a well-defined directory structure (or tracked by some other mechanism, e.g., a manifest file containing the various details). So some naming conventions are needed here.
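To make the idea concrete, suppose artifacts follow a hypothetical `<model>/<target-architecture>/<driver>/<file>` convention (IREE's real layout differs); filtering then reduces to matching one path component against the probed architectures:

```python
def filter_artifacts(paths, supported_architectures):
    # paths follow the hypothetical <model>/<arch>/<driver>/<file> naming
    # convention; keep only those whose architecture component is among the
    # ones the probed device supports.
    kept = []
    for p in paths:
        parts = p.split("/")
        if len(parts) >= 3 and parts[1] in supported_architectures:
            kept.append(p)
    return kept
```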
Apparently we cannot perform these tasks on the Android devices themselves. Using SBCs like the Raspberry Pi as the host is a great solution here, because the task is not resource demanding. SBCs can handle it well with a familiar Linux environment, and adb is used to communicate with the Android devices.
Here is the Python script for the above.
With this probing and filtering mechanism, we can achieve scalability at the device level. Each time we add a new device for benchmarking, the script will automatically filter artifacts for it to run. And with Buildkite’s distributed agent architecture, it means anybody can plug in new devices from anywhere (assuming they have the proper agent token).
In IREE we use Google Benchmark as the benchmarking library. It allows dumping the results in the JSON format. Benchmark results from execution at each Buildkite agent are uploaded to the Buildkite server as JSON files, containing all the details. Then they are aggregated and presented in the next step.
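Google Benchmark's JSON output (via `--benchmark_format=json` or `--benchmark_out`) has a top-level `benchmarks` array, so aggregating results is straightforward. A minimal sketch:

```python
import json

def extract_results(json_text):
    # Google Benchmark's JSON output carries a "benchmarks" array; each
    # entry has the benchmark name, timings, and the time unit.
    data = json.loads(json_text)
    return {b["name"]: (b["real_time"], b["time_unit"])
            for b in data["benchmarks"]}
```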
Benchmark result presentation
In the final step, a Buildkite agent in the cloud downloads all benchmark results and presents them. Depending on whether this is run for landed commits or pull requests, it means pushing to a dashboard or posting as pull request comments.
Publishing results to dashboard
We’d like to have a dashboard for historical benchmark results. A couple of merits we are expecting from it:
- It should provide some API so it’s easy to integrate with Buildkite.
- It should be benchmarking-centric and provide an intuitive and customizable UI.
- It should provide good regression analysis flows.
There seem to be no existing services fulfilling all of this. Fortunately, the open source Dana project ticks the above boxes, so we are hosting our own Dana server at https://perf.iree.dev. It exposes APIs for updating and querying benchmark results. So we just teach the Buildkite agent in this step about the API token and let it call the corresponding APIs.
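The shape of such a call looks roughly like the following sketch. The endpoint and payload fields are my reading of Dana's API documentation, so double-check them against the Dana version you deploy:

```python
import json
import urllib.request

DANA_URL = "https://perf.iree.dev"  # the IREE-hosted Dana server

def build_add_sample_request(api_token, project_id, serie_id, build_id, value):
    # Sketch of posting one benchmark sample to a Dana server; endpoint
    # name and payload shape are assumptions based on Dana's API docs.
    payload = {
        "projectId": project_id,
        "serieId": serie_id,
        "sample": {"buildId": build_id, "value": value},
    }
    return urllib.request.Request(
        DANA_URL + "/apis/addSample",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + api_token},
        method="POST",
    )
    # Send with urllib.request.urlopen(request).
```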
If you are interested in the details, the Python scripts can be found here.
Dana requires an identifier for each benchmark. Considering all the extensibility points and forward compatibility, in IREE, we use the following scheme to identify a benchmark:
<model-name> `[` <model-tag>.. `]` `(` <model-source> `)` <benchmark-mode>.. `with` <iree-driver> `@` <phone-model> `(` <target-architecture> `)`
For example, “MobileNetV2 [fp32,imagenet] (TensorFlow) full-inference with IREE-Vulkan @ Pixel-4 (GPU-Adreno-640)”. It’s a bit lengthy, but it clearly shows all important details about the benchmark.
These benchmark identifiers can be trivially constructed from the benchmark result JSON files, which contain all the information.
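Constructing the identifier is a simple string assembly; here is a sketch (the argument names are mine, while the format follows the scheme above):

```python
def benchmark_identifier(model, tags, source, mode, driver, phone, arch):
    # Assembles a Dana benchmark identifier following the scheme
    # described above from fields in the benchmark result JSON.
    return (f"{model} [{','.join(tags)}] ({source}) {mode} "
            f"with {driver} @ {phone} ({arch})")
```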
Posting on pull requests
Being able to invoke benchmarking on a pull request is super helpful for trying out new functionality and proactively guarding against performance regressions. In IREE, we use a buildkite:benchmark GitHub label to manually trigger benchmarks on specific pull requests. Adding one label is trivial and that's all one needs to do; after all benchmark results are available, the Buildkite agent will post them as pull request comments (example), in a nice and concise way, diffing against previously known results. So this is frictionless.
This, of course, requires calling GitHub APIs, for example, for creating/updating a pull request comment. (Yes, those APIs are for issues, but they also work for pull requests.) In order to avoid cluttering the pull request, the full benchmark results are pushed to a Gist, and only an abbreviated version is posted as the comment.
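For reference, GitHub's "create an issue comment" endpoint is `POST /repos/{owner}/{repo}/issues/{issue_number}/comments`, and it accepts a pull request number in place of an issue number. A minimal sketch of building such a request (not IREE's actual script):

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"

def build_comment_request(owner, repo, pr_number, body, token):
    # Builds the request for GitHub's "create an issue comment" endpoint,
    # which also works for pull requests.
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues/{pr_number}/comments"
    return urllib.request.Request(
        url,
        data=json.dumps({"body": body}).encode("utf-8"),
        headers={"Authorization": "token " + token,
                 "Accept": "application/vnd.github.v3+json"},
        method="POST",
    )
    # Send with urllib.request.urlopen(request).
```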
The Python scripts can be found here.
That’s it. Hopefully this can be helpful to you if you are also shopping for a solution to fulfill similar needs.
You can think of them as device-specific concrete implementations of HAL abstractions. ↩︎