GPGPU, ML Inference, and Vulkan Compute

Nowadays GPUs are used for both graphics rendering and general-purpose compute (GPGPU). For the latter, CUDA is indisputably the leading solution. Yet with so many other GPU vendors, the quest for a GPGPU standard never stops. OpenCL was a great attempt and is widely used, but it still falls short in many respects. Given Vulkan’s success in graphics, and given that it is both a graphics and a compute API, one might wonder whether it can actually be the next-generation GPGPU standard. I certainly believe so, but the road is not strewn with roses.

Please don’t disagree yet. 😊 Generally speaking, compute can mean anything. Even for GPGPU, there are a variety of domains and applications. These are all broad and multifaceted topics; we might be thinking/talking about different aspects. So let me be more specific:

I believe Vulkan (compute) has the potential to be the next-generation GPGPU standard, supported by various GPUs and serving various domains; one immediate compelling application, though, is machine learning inference in resource-constrained scenarios, such as on mobile/edge devices and in gaming. By succeeding there, Vulkan (compute) will gain further ground as a GPGPU standard and spread to more domains and applications.

I’ll explain the rationale and status next so hopefully afterwards you’ll find the above is reasonable. Before that, please feel free to grab your snacks because due to the topics in question, this blog post is inevitably broad and lengthy; I may be a bit free form here and there and handwave or even speculate a bit sometimes.

Intro to Vulkan: graphics and compute

Just in case you are not that familiar with Vulkan, here is a super quick introduction:

Vulkan is a modern cross-platform GPU API for both graphics and compute; it gives developers explicit control (over object lifetime management, different memory allocation strategies and mechanisms, workload composition and reuse, clear dependency specification, fine-grained synchronization, etc.) in order to achieve low overhead (via slim drivers with predictable behavior and little magic, etc.) and high efficiency (via better multithreading support, various pooling objects, a clear cost for each API call, etc.).

Since its introduction five years ago, Vulkan has been enjoying great adoption and witnessing a thriving ecosystem for graphics. It is becoming the common abstraction layer—there are many open source projects that implement other APIs on top of Vulkan for porting existing applications, or that implement Vulkan on top of other APIs for using one API to rule all platforms.

The success in graphics certainly makes Vulkan widely accessible on various platforms and helps it gain traction for pure compute, which is what I’ll mostly talk about today. In Vulkan, the graphics-specific bits are optional; there is a clean subset for pure compute. I wrote about it in a previous blog post; please feel free to check it out to learn more. From now on, I’ll just use “Vulkan compute” to mean the subset for pure compute.

With this background, now we can chat about GPGPU and ML inference. In general, GPGPU is an ecosystem. Ecosystems are formed when collections of interested parties develop solutions to serve everyone together. That naturally entails technology and business.

The Technical Aspect

Normally technical discussions are straightforward because whether a solution has technical merit is, most of the time, clear to see. However, that assumes we have properly defined the problem to solve and understood the constraints it entails. Technical decisions that are reasonable for one problem might be totally off for others. So it’s worth pinning down the domain and problem first.

Domain and problem

For GPGPU, there are a lot of domains and applications. Vulkan is already gaining traction in several of them, for example audio/video processing and FFT. But today I’ll mainly talk about machine learning, where I have direct experience.

ML itself is a large enough domain to see different areas and use cases. First, there is the split between training and inference. Then, for inference, it can happen either in the cloud or at the edge. In a previous blog post I contrasted their different characteristics and particularly explained the unique challenges for edge inference. Please feel free to take a look for the details. I’ll rephrase the parts relevant to GPUs here because it’s necessary context to understand why Vulkan compute’s technical merits matter.

Cloud ML training and inference

Training needs to process a huge amount of data. That allows effective batching to exploit GPU parallelism. For inference in the cloud, because we can aggregate requests from everywhere, we can also effectively batch them. This characteristic allows training and inference in the cloud to sustain high GPU utilization relatively easily, so for them we are mostly GPU bound. Under such circumstances, the specific API for driving the GPU does not matter much technically. (Though there might exist business reasons like using GPUs from different vendors. I’m getting ahead of myself here.)

Edge ML inference

Inference at the edge is a completely different problem. For training and inference in the cloud, we pretty much have ultimate control and can choose whatever technology stack and iterate the model and infrastructure in whatever fashion. Inference at the edge, especially on mobile devices, has little control over the environment and system; we can basically only take what’s there as it is.

Mobile devices, especially for Android, have highly fragmented hardware. For GPUs, we have Qualcomm Adreno, ARM Mali, Imagination PowerVR, and soon AMD RDNA. Each has multiple generations. Then they are packed into different SoCs and assembled into different final phones, so more variants. Together with the notorious Android version fragmentation problem, even for the same SoC, we can see GPU drivers at different versions.

This heterogeneity and fragmentation is additionally coupled with GPU software stack quality issues. Both OpenGL and OpenCL feature a thick driver stack (including full compilers for kernels/shaders and implicit runtime status tracking and validation) and do not have strong conformance test suites. That results in different bugs or inconsistent performance among different devices, and unpredictable performance on the same device.

The above challenges together make it very hard to utilize OpenGL/OpenCL for ubiquitous mobile inference. There are more problems if we look at the nature of inference on mobile devices.

Inference happening on the mobile device really needs to be efficient and predictable. Mobile phones have very limited resources (both energy and computation) to be shared by all running apps and tasks. So we cannot blissfully ignore power draw and assume the full compute capability is ours to use. And they are real-time interactive devices. Unstable performance can cause perceivable lag and a negative user experience. OpenGL/OpenCL’s thick driver stack can hinder efficiency and predictability.

Also, inference workloads at the edge are typically of small and variable sizes; we frequently just handle one image, one language sentence, or one audio sequence. This is quite different from games (which OpenGL is designed for) and high performance compute (which OpenCL aims at), where by nature there are lots of tasks and data to handle.

Because of these different characteristics, we cannot simply adopt what works for training and inference in the cloud. For example, a fat runtime that claims the whole GPU, dispatches different flavors of hand-written ops, and synchronizes after each one is not optimal here: it does not play nicely with app multi-tenancy, cannot scale to all the hardware architectures and software variants, and does not give us the best performance, because aggressive synchronization hurts small workloads.

We need to have another solution that meets the needs of edge—a thin runtime on a bare metal GPU stack to gain the most control for eliminating inefficiency and unpredictability, and a compiler for generating kernels to handle the proliferation of hardware architectures. This is where Vulkan is a great fit. Next let’s look more into this, after clearing one more thing.

ML in gaming

I mentioned gaming in my original claim and I haven’t forgotten about it. Gaming is essentially where we see constraints similar to those of inference at the edge. There is a super tight latency envelope to meet. To render at a minimum of 30 FPS, each frame can take at most ~33 ms. That’s everything. Typically graphics rendering already takes the majority of it. To slot some ML inference (e.g., style transfer) in, it must use the same API, as the cost of crossing APIs is just prohibitive, regardless of whether the user has a different API environment set up.
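To make the latency envelope concrete, here is a minimal sketch of the frame-budget arithmetic. The function names and the 25 ms rendering cost are hypothetical values chosen for illustration, not measurements.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def inference_headroom_ms(fps: float, render_ms: float) -> float:
    """Time left in a frame for ML inference once rendering is accounted for."""
    return frame_budget_ms(fps) - render_ms


for fps in (30, 60):
    print(f"{fps} FPS -> {frame_budget_ms(fps):.1f} ms per frame")

# Assumed rendering cost of 25 ms per frame (hypothetical).
print(f"headroom at 30 FPS: {inference_headroom_ms(30, 25.0):.1f} ms")
```

With these assumed numbers only about 8 ms remain for inference at 30 FPS, and the headroom turns negative at 60 FPS, which is why any cross-API overhead matters so much.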

On the other hand, the concerted evolution from implicit APIs (OpenGL, Direct3D 11) to explicit ones (Vulkan, Direct3D 12, Metal) across the whole graphics industry testifies to the value of thin GPU stacks for performance under tight constraints and with limited resources. We really need to take as much control as possible, and not let too much abstraction get in the way.

Vulkan compute for ML

Apologies for the long introduction and context above; they are really essential for understanding the problem space and developing the proper trade-off criteria. As with computation in general, we have a solution spectrum (custom hardware → compiled software → interpreted software) offering different trade-offs between performance and programmability/portability. Although the whole computing world is layered with abstractions, each layer introduces additional cost, which can be problematic in places where we are under a lot of constraints and still want predictable performance.

In the spectrum, Vulkan chooses the lowest possible API abstraction and the largest possible offline kernel compilation.

Explicit control with bare minimal drivers

In Vulkan most of the time the concepts map directly to hardware mechanisms. This gives developers explicit control over basically everything and results in a bare minimal driver.

As an example, for buffers, the backing memory (VkDeviceMemory) and the buffer handle (VkBuffer) are separate objects. To use a buffer in a kernel, it needs to be bound to the compute pipeline via a descriptor set (VkDescriptorSet), which has its own layout (VkDescriptorSetLayout). There are many other examples; for instance, Vulkan does not try to hide cache incoherence and requires the developer to manage it explicitly via barriers.

In general, Vulkan requires developers to explicitly manage every object’s lifetime and memory, and to perform all synchronization where necessary, as by default everything can run out of order. (This is where it differs greatly from CUDA/OpenCL, where by default kernels launched into the same stream are guaranteed to complete in submission order.)
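The effect of "everything can run out of order unless told otherwise" can be illustrated with a toy model. This is not the Vulkan API: `legal_orders` and the dispatch names are hypothetical, and real Vulkan barriers express execution and memory dependencies between pipeline stages rather than simple pairs of names.

```python
from itertools import permutations


def legal_orders(commands, dependencies):
    """Enumerate the completion orders allowed by explicit (before, after) pairs."""
    legal = []
    for order in permutations(commands):
        position = {name: index for index, name in enumerate(order)}
        if all(position[before] < position[after] for before, after in dependencies):
            legal.append(order)
    return legal


dispatches = ["conv", "relu", "pool"]

# No dependencies declared: every ordering is allowed.
print(len(legal_orders(dispatches, [])))  # 6

# Dependencies declared explicitly: only the intended ordering remains.
chain = [("conv", "relu"), ("relu", "pool")]
print(legal_orders(dispatches, chain))  # [('conv', 'relu', 'pool')]
```

In an in-order stream the second result comes for free; in this model, as in Vulkan, every ordering constraint has to be stated and nothing is implied by submission order.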

This level of explicitness certainly hurts programmability. (Compilers can come to the rescue though.) But it does mean developers now have the ultimate control. The objects Vulkan exposes are real GPU resources, or reflect how the GPU really functions under the hood. So there is a cost to create/use/destroy them.

Note that as long as there are abstractions, there is no escape from translating high-level abstractions down, as that’s not how the hardware works. It’s done either by the developer or by the driver. When these objects are exposed, developers can handle them more appropriately than the driver can, as developers know the app logic best and can create/destroy objects at the proper time and batch/pool them, while the driver is most of the time just guessing and trying to be smart.
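As a rough illustration of the batch/pool point, here is a toy pool written in plain Python. `BufferPool` is a hypothetical class that uses `bytearray` as a stand-in for a GPU allocation; it only shows how an application that knows its own usage pattern can amortize creation cost.

```python
class BufferPool:
    """Reuses released buffers of the same size instead of allocating new ones."""

    def __init__(self):
        self._free = {}        # size in bytes -> list of released buffers
        self.allocations = 0   # number of real allocations performed

    def acquire(self, size: int) -> bytearray:
        free = self._free.get(size)
        if free:
            return free.pop()
        self.allocations += 1
        return bytearray(size)  # stand-in for an expensive GPU allocation

    def release(self, buffer: bytearray) -> None:
        self._free.setdefault(len(buffer), []).append(buffer)


pool = BufferPool()
for _ in range(100):            # e.g. 100 inference invocations
    scratch = pool.acquire(4096)
    pool.release(scratch)

print(pool.allocations)  # 1
```

A driver has to guess that the same 4096-byte scratch buffer will be needed again; the application simply knows.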

Explicit control results in an ultra thin driver without generally applicable “magic” and improves performance and predictability.

Largest possible offline compilation

Vulkan uses SPIR-V for expressing kernels. SPIR-V is a binary intermediate language like LLVM IR. It’s consumed by the driver compiler for the last-mile architecture-specific compilation.

This puts the whole kernel compiler frontend outside of the driver. So it eliminates a whole host of compiler frontend bugs from the driver and enables even thinner drivers and more predictable performance. Furthermore, now we can directly target Vulkan/SPIR-V with other high-level languages. HLSL is already enabled for graphics. For ML, directly compiling TensorFlow/PyTorch models is now possible (and a reality).
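Because a SPIR-V module is just a binary stream of 32-bit words, tools outside the driver can produce and inspect it easily. Below is a minimal sketch that reads the five-word module header (magic number, version, generator, ID bound, reserved schema word). The function name and the returned dictionary layout are my own choices for illustration; the example header is synthesized rather than taken from a real shader.

```python
import struct

SPIRV_MAGIC = 0x07230203  # first word of every SPIR-V module


def parse_spirv_header(blob: bytes) -> dict:
    """Decode the 5-word SPIR-V header, detecting endianness from the magic number."""
    if len(blob) < 20:
        raise ValueError("a SPIR-V module needs at least a 20-byte header")
    for prefix, name in (("<", "little"), (">", "big")):
        magic, version, generator, bound, schema = struct.unpack(prefix + "5I", blob[:20])
        if magic == SPIRV_MAGIC:
            return {
                "endianness": name,
                # Version word layout: 0 | major | minor | 0 (one byte each).
                "version": ((version >> 16) & 0xFF, (version >> 8) & 0xFF),
                "generator": generator,
                "bound": bound,
                "schema": schema,
            }
    raise ValueError("not a SPIR-V module: bad magic number")


# Synthesized header claiming SPIR-V 1.3 with an ID bound of 8.
header = struct.pack("<5I", SPIRV_MAGIC, 0x00010300, 0, 8, 0)
info = parse_spirv_header(header)
print(info["version"], info["bound"])  # (1, 3) 8
```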

Compilers are the proven approach to handle hardware heterogeneity. They can offer performance together with programmability. With the overwhelming heterogeneity we see in mobile GPUs, there is no other easy way to achieve both.

In contrast, for OpenGL/OpenCL, we send the whole source code string to the driver, which contains a full-blown compiler. That causes lots of functionality and performance consistency issues. Also, using if-else to #define kernel variables and concatenating source code strings to handle different cases at runtime can only take us so far.
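The string-concatenation pattern looks roughly like the sketch below. `build_kernel_source`, the `TILE` macro, and the kernel itself are hypothetical examples; in a real OpenCL application the resulting string would be handed to the driver's compiler at runtime (for instance via `clCreateProgramWithSource`).

```python
KERNEL_BODY = """
__kernel void add(__global const DTYPE* a,
                  __global const DTYPE* b,
                  __global DTYPE* out) {
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""


def build_kernel_source(use_fp16: bool, tile: int) -> str:
    """Specialize a kernel by prepending #define lines chosen with if-else."""
    lines = []
    if use_fp16:
        lines.append("#pragma OPENCL EXTENSION cl_khr_fp16 : enable")
        lines.append("#define DTYPE half")
    else:
        lines.append("#define DTYPE float")
    lines.append(f"#define TILE {tile}")
    return "\n".join(lines) + "\n" + KERNEL_BODY


print(build_kernel_source(use_fp16=True, tile=8))
```

Each new data type, tile size, or device quirk adds another branch, and mistakes only surface when a particular device's driver compiles the string.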

CPU friendliness and best practices

Yeah, we are talking about GPUs. But the CPU is still important as that’s where workloads are prepared (and where all the GPU API calls are executed). Due to the particular nature of edge ML inference, the importance of having efficient CPU code is even higher as the GPU is typically starved.

Vulkan is designed according to modern CPU architectures and programming best practices. It has great multithreading support, various pooling objects for amortizing costs, and proper separation between development and runtime. For example, functionalities previously done by the driver at runtime, like API parameter validation, are now completely a development time concept with validation layers. There is no runtime cost.

Robust GPU software stack

The thin driver helps a lot to have a robust software stack. Another aspect that contributes to this is that Vulkan has had an extensive conformance test suite (CTS) from day one. All core Vulkan functionalities are required to have corresponding CTS tests. To make things even nicer, Android has its own additional tests. While the CTS certainly cannot catch all the issues, it’s a great guard, especially after years of development. OpenCL didn’t do well on this front: it was initially released in 2009, but its CTS wasn’t available as open source until 2017.

Addressing mobile ML inference challenges

Okay now let’s look particularly at how Vulkan addresses mobile ML inference challenges.

  • Heterogeneity is handled via Vulkan’s extensive mechanisms for different hardware—layers, extensions, features, and limits. This all happens within the same interface and framework. Putting kernel compilation offline as much as possible makes it possible to have systems generate code to best utilize different architectures without having unpleasant string manipulation in the codebase.
  • Overall, thin drivers help a lot with GPU stack quality. GPU drivers used to have many issues during the initial days when Vulkan was brought up. But now they are much better due to thin drivers and better CTS.
  • Programmability is where Vulkan falls short. However, for ML, we don’t program in a shading language anyway; the models are authored in Python at a much higher level. Compilers help to address the issue, even for the verbose runtime API code! (I won’t go into details here, but the explicitness of the Vulkan API makes it a really good compiler target, as each API call has a clear cost. Don’t be intimidated by compilers; they are essentially just translating programs from one form into another. Anything of a similar nature can be thought of as a compiler.)
  • Performance and predictability are now more in the hands of the developers. There is much less magic and overhead in the stack that developers cannot control. Now it’s possible to better utilize both the GPU and the CPU for running small-scale ML workloads.

Okay, this wraps up the discussion on the technical side. Everything seems quite promising thus far (if you agree). Next onto business, where things start to shape up differently. In business, there is usually even less of a clear right and wrong, and things can change quickly.

The Business Aspect

The broad GPGPU ecosystem is again huge. Given I’m talking about inference at the edge, I’ll again restrict it to the Android ecosystem for now. Many parties are involved: GPU vendors, the Android platform holder, device manufacturers (Android OEMs), app developers (either companies or individuals), and end users. [1]

The platform holder

Let’s look at the platform holder first as for frameworks and APIs, the platform holder is very influential.

Android is held by the Android Open Source Project (AOSP), led by Google. For an open OS that suffers a lot from fragmentation issues, one of the top concerns is to reduce fragmentation and improve consistency. GPUs are a prime example here: they are used for gaming, which many users care deeply about and where a huge amount of developer revenue comes from.

But we already know that the GPU landscape is fairly diverse. Past experience with OpenGL in that landscape has not been pleasant. The functionality and performance inconsistency causes lots of headaches and pain for developers targeting the broad Android market and gives Android a bad name. Those are hard lessons learnt. That’s why OpenCL is not officially supported in AOSP. One OpenGL is difficult enough; throwing in OpenCL means doubling all the issues.

That’s also why Android tries to consolidate and have one true GPU API—Vulkan. With its low level nature, in the long run, both OpenGL and OpenCL can be emulated via software, which can be updated much more easily, and so GPU vendors just need to implement one driver. Strong focus on Vulkan CTS should help alleviate consistency issues.

So using Vulkan compute for ML inference is quite aligned here. But ultimately, Android is an open system and Android OEMs can feel free to customize.

Android OEMs

Android OEMs decide which GPU to buy and the drivers for the final phone. So they actually have the ultimate control. The goal is to sell more devices to end users. But how? Why would an end user choose one phone over another? It’s differentiation. So it’s pretty common to see Android devices boasting about photography, gaming, or AI prowess.

This is where things start to become tricky. It’s nice to have one GPU API to rule them all. But because of the existing ecosystem around OpenGL for gaming and OpenCL for compute, OEMs are still forced to support them. That’s why OpenCL does in fact ship on many phones.

Now with Vulkan it’s more work, at least in the short term. So it’s understandable that an OEM will have fewer resources for each API, or lack incentives. There is no good way to improve the situation quickly, as it’s essentially ecosystem evolution. That only resolves with time: old phones will gradually phase out, Vulkan will become more and more prevalent, and software OpenGL/OpenCL implementations will become more and more stable and performant.

App developers

For app developers, the goal is to deploy to as many devices and reach as many end users as possible. The difficulty for them, though, is fragmentation; it’s just beyond typical app developers' ability to handle. So the safe bet is just to use what’s readily there and more reliable—the CPU. Or at least that’s what Facebook does; not even GPU, let alone dedicated accelerators.

So we have a chicken-and-egg problem. Developers are less likely to use the GPU unless the software stack is robust and performant enough, but without developer interest, platforms and OEMs can be less motivated to address the system issues. There is again no good way to change this quickly. It only improves with time as we see a more stable GPU stack and better accelerator toolchain support.

Challenges, Status, and Looking Forward

Okay, I’ve outlined both the technical and business aspects of Vulkan compute as a solution for edge ML inference, as one important GPGPU domain. Technically it makes a lot of sense, but for business there are various issues.

However, if we look closely, it basically boils down to an immature ecosystem. Vulkan is still mainly used for graphics rendering. Using its compute subset for ML is still at a relatively early stage. So we are seeing all the typical issues of a new ecosystem.

But it’s getting more and more real for sure: we see various great mobile AI inference frameworks use Vulkan compute to drive the GPU, including Tencent ncnn, Alibaba MNN, and recently PyTorch Mobile. ncnn stands out here for its model/feature coverage and its production usage in Tencent apps.

The existing frameworks still use the Vulkan API like OpenCL though. They have hand-written kernels authored in GLSL for high-level ML ops. In IREE we believe there is a more native way to utilize Vulkan compute to its full potential. It requires going fully down the compilation road to generate both GPU kernels and CPU API calls. The Vulkan API calls are meant to be generated by higher-level abstractions (for graphics, that can be game engines) suiting the specific problem domain, and the SPIR-V kernels should be compiled directly from domain-specific languages.

Going down this road means much more infrastructure investment. We need to build the full compilation stack, rather than leveraging existing ones. But it’s worth it considering the benefits it brings. So, let’s take a step back to look at the whole stack (across specification and drivers, operating systems, compilers, programming languages, frameworks, toolchain) and capture the status of the world:

  • Five years after Vulkan’s initial release, GPU drivers are in much better shape. They are more stable and reliable. And the nice thing is that, given Vulkan is also used for graphics, there is a strong tendency for them to get better and better. Still, there are problematic parts right now, like low adoption of certain key extensions (e.g., timeline semaphore), a restrictive SPIR-V programming model (e.g., pointer casting), and missing compute-specific functionalities that would make ML even better. The Vulkan ML TSG is working on these.
  • Vulkan is getting better and better coverage on Android devices. As of early 2021, 36%+ of devices support Vulkan 1.1. That may not seem like much, but note that Android holds 80%+ market share and has 3+ billion active devices. So in absolute numbers, it’s massive. But given that using compute for ML is still in its early stage, there is no special optimization towards it. It’s actually not surprising to see various issues, like pure Vulkan compute not being treated as favorably as OpenCL workloads by the scheduler.
  • We need a new compiler stack from ML frameworks down to SPIR-V and Vulkan. It’s structurally there: the kernel compilation functionalities are mainly in MLIR while the API side is in IREE. Right now it can compile quite a few vision and language models end-to-end (from TensorFlow SavedModels). Getting it optimized towards each GPU architecture is a long fight though.
  • For programming languages and frameworks, there is no need to invent new ones; there are already plenty. Thanks to the compiler approach, supporting new programming languages or frameworks is much less work.
  • The toolchain for pure Vulkan compute is lagging behind. Existing debuggers and profilers are mainly for graphics usage; they assume render loops and presentation. For Android, there is the additional assumption of a full-blown app. But for pure Vulkan compute on Android, we would like to use the command line under the adb environment for productivity. The lack of handy tools has actually turned out to be the most annoying part of our development.
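The absolute device count implied by the coverage figures in the list above can be worked out directly. The sketch below treats the quoted "36%+" and "3 billion+" as exact lower bounds, which is an assumption; the function and constant names are mine.

```python
ACTIVE_ANDROID_DEVICES = 3_000_000_000  # lower bound quoted above
VULKAN_1_1_SHARE_PERCENT = 36           # lower bound quoted above, early 2021


def devices_with_vulkan_1_1(active_devices: int, share_percent: int) -> int:
    """Estimate how many devices support Vulkan 1.1 (integer arithmetic)."""
    return active_devices * share_percent // 100


count = devices_with_vulkan_1_1(ACTIVE_ANDROID_DEVICES, VULKAN_1_1_SHARE_PERCENT)
print(f"{count / 1e9:.2f} billion devices")  # 1.08 billion devices
```

So even at roughly a third of the installed base, Vulkan 1.1 reaches over a billion devices.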

I’ll just stop here as this blog post is already fairly lengthy. Thanks for reading it through and apologies for all the hand waving and speculation. Hopefully this blog post, together with the previous one in this series, lays down enough background and context about Vulkan compute for ML inference. In future blog posts I will expand on more of the challenges and also talk about how to optimize towards different mobile GPU architectures.

Using Vulkan compute for edge ML inference is certainly a very promising and thrilling direction. I’m very happy that I can be a part and contribute to it. Hopefully this eventually pushes the frontier towards better utilizing edge GPUs for ML inference ubiquitously!

  1. There are also SoC vendors. But they are typically either the GPU vendor (e.g., Qualcomm) or the Android OEM (e.g., Samsung) at the same time. I’ll omit them for simplicity.