Senior Software Engineer, Google
AI Frameworks & Compilers. Now: Vulkan compute, IREE, MLIR. Previous: Vulkan graphics, SPIR-V toolchain.
Compilers are often critical components in development toolchains that boost developer productivity. A compiler is normally used as a monolithic black box that consumes a high-level source program and produces a semantically equivalent low-level one. It is still structured inside, though; what flows between internal layers are called intermediate representations (IRs). IRs are critical to compilers. Just as there are many compilers, there are also many IRs in use. I’m fortunate to have direct experience with three major schools of IRs or infrastructures thus far: LLVM IR, SPIR-V, and MLIR. My experience is particularly extensive with the last two, both of which I joined at an early stage of development. So I’d like to write a series of blog posts to record my understanding of compilers and IRs. Hopefully it will be useful to others.
This blog post talks about how to generate performant code for convolution ops using MLIR’s multiple levels of abstraction and transformations. I initially created this approach for targeting ARM Mali GPUs in IREE, but given that it is just direct tiling and vectorization, it should be widely applicable. I will walk through the lowering steps, so if you are interested in how to organize MLIR’s various dialects and patterns together to achieve similar tasks, this blog post might also be useful.
Today I would like to describe one way to build a scalable and frictionless benchmarking pipeline for Android native libraries, aiming to support different benchmark and device variants. It is meant for open source projects, so it composes public services that are commonly free for such projects. The ingredients are cloud virtual machines for building, local single-board computers (e.g., Raspberry Pi) for hosting Android devices and executing benchmarks, a Dana server for keeping track of benchmark results of landed changes, and Python scripts for posting benchmark comparisons to pull requests. A Buildkite pipeline chains them together and drives the full flow.
Nowadays GPUs are used for both graphics rendering and general-purpose compute (GPGPU). For the latter, CUDA is the indisputable leading solution. However, with so many other GPU vendors, the quest for a GPGPU standard never stops. OpenCL was a great attempt and is widely used, but it still falls short in many aspects. Given the success of Vulkan in graphics, and given that it is both a graphics and a compute API, one would wonder whether it can actually be the next-generation GPGPU standard. I certainly believe so, but the road is not strewn with roses.
These days if you would like to learn about machine learning, there are abundant great resources on the web discussing model architectures and how to code and train them. Materials about inference, though, are generally much harder to find, especially for edge and mobile. You might ask, inference is just the forward pass of training, so how hard can it be? Actually, it faces lots of unique challenges, to the extent that we are basically solving completely different major problems. I have been working on inference at the edge for a while, so let me capture them in this blog post, by contrasting training and inference in the cloud.