These days, if you would like to learn about machine learning, there are abundant great resources on the web discussing model architectures and how to code and train them. Materials about inference, though, are generally much harder to find, especially for edge and mobile. You might ask: inference is just the forward pass of training, so how hard can it be? Actually, it faces lots of unique challenges, to the extent that we are basically solving completely different major problems. I have been working on inference at the edge for a while, so let me capture these challenges in this blog post by contrasting edge inference with training and inference in the cloud.
Training, especially when it depends on gradient descent and backpropagation, is by nature highly iterative. The core of the algorithm is two nested loops: they iterate over epochs and batches, performing the forward loss calculation and the backward loss distribution. This structure, plus the need to update parameters after each iteration, really stresses system utilization, as it is full of control flow and synchronization, both of which are bad for any sort of chip.[1]
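To make the two-loop structure concrete, here is a minimal sketch in plain Python. It fits a one-variable linear model with mini-batch gradient descent; the function name `train_linear`, the hyperparameters, and the toy data are all made up for illustration and are not taken from any real framework.

```python
import random

def train_linear(data, epochs=200, batch_size=4, lr=0.05, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent (illustrative only)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):                        # outer loop: epochs
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):  # inner loop: batches
            batch = data[i:i + batch_size]
            # Forward pass: prediction errors behind the mean squared error.
            errors = [(w * x + b) - y for x, y in batch]
            # Backward pass: gradients of the mean squared error.
            grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            # Parameter update: a synchronization point after every batch.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free samples of y = 3x + 1, invented for this example.
samples = [(x / 10, 3 * (x / 10) + 1) for x in range(-10, 11)]
w, b = train_linear(samples)
```

Every inner iteration ends with a parameter update that the next iteration depends on, which is exactly the kind of serialization point that makes it hard to keep hardware busy.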
Fortunately, for ML today we are typically performing deep learning with neural networks, which require a large amount of data to fit, so batching can help improve chip utilization. Still, the nature of the algorithm, together with the increasing need to handle millions or billions of parameters in enormous models, means training is more and more a datacenter endeavor involving machine clusters with plentiful GPUs or dedicated AI accelerators.[2]
At this scale, we are really discussing distributed systems and should use the mindset and tools from that field. That said, the unique nature of deep learning and its training algorithm does introduce new dimensions for consideration: we need to balance training speed against model accuracy; we can tolerate parameter/gradient inconsistency between different workers; we need to make sure the model generalizes; and so on. So we trade off among data or model parallelism, AllReduce or parameter server, synchronous or asynchronous updates, and such.
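As a rough illustration of one point in that design space, the sketch below simulates synchronous data parallelism with an AllReduce-style gradient average in a single process. The helper names (`local_gradient`, `all_reduce_mean`, `sync_data_parallel_step`) and the toy data are assumptions for this example; a real system would run the workers on separate devices and use a collective communication library.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return 2 * sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for AllReduce: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def sync_data_parallel_step(w, shards, lr):
    # Each worker computes a gradient on its own shard (in parallel on real hardware).
    grads = [local_gradient(w, shard) for shard in shards]
    # Synchronization barrier: nobody proceeds until all gradients are averaged.
    reduced = all_reduce_mean(grads)
    # All workers apply the identical update, so their parameters stay consistent.
    return w - lr * reduced[0]

# Toy data from y = 2x, split evenly across two simulated workers.
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(50):
    w = sync_data_parallel_step(w, shards, lr=0.01)
```

With equally sized shards, the averaged gradient equals the gradient over the full batch, so the synchronous variant trades waiting time at the barrier for exact consistency. Asynchronous schemes relax that barrier and accept some staleness instead.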
Inference just performs the forward calculation on one or a few data points. It’s much less data and compute intensive than training, so it can run either in the cloud or on edge devices. These two deployment scenarios have quite different characteristics, which again affect the trade-offs.
Inference in the cloud allows aggregating requests from everywhere. This gives it characteristics similar to training. We can sustain high GPU or AI accelerator utilization with a single task via batching. As long as there are enough requests, the inference task can be the only computation happening on a chip and acquire all of its resources yet still maintain high utilization.
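A minimal sketch of that request aggregation might look like the following. The `drain_batches` helper and the `run_model` placeholder are hypothetical; a real serving system would also bound how long a request may wait for its batch to fill.

```python
from collections import deque

def drain_batches(pending, max_batch_size):
    """Yield batches of up to max_batch_size requests from a pending queue."""
    while pending:
        size = min(max_batch_size, len(pending))
        yield [pending.popleft() for _ in range(size)]

def run_model(batch):
    # Placeholder for a real batched forward pass on the accelerator.
    return [x * 2 for x in batch]

# Ten requests that arrived from different clients (made-up inputs).
pending = deque(range(10))
results = []
launches = 0
for batch in drain_batches(pending, max_batch_size=4):
    results.extend(run_model(batch))
    launches += 1
```

Ten requests are served with three model launches instead of ten, which is how a steady stream of requests keeps a single chip well utilized.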
Under such circumstances, effectively we are mostly GPU or accelerator bound. This is where it makes more economic sense to develop all sorts of customized chips to accelerate a specific task because that’s really the bottleneck. More FLOPS helps throughput a lot.
However, due to latency and security concerns, inference at the edge is more compelling, and it is quite a different problem.
Mobile Inference Challenges
Fundamentally, edge devices, with mobile phones as a prominent example, have an environment that we don’t control. Unlike in the cloud, where we can feel free to choose whatever stack for building a solution and iterate on it in whatever manner, we can pretty much only take what’s in a mobile phone’s stack just as it is. Changes may be requested and made, but it’s a lengthy procedure and at the mercy of many parties in the mobile phone vendor chain. Let’s look at the various challenges we face in mobile inference.
Resource constrained environment
Mobile phones are very much resource constrained. They are powered by batteries and have much less powerful CPUs/GPUs than the cloud. What’s more, these limited resources must support a host of apps running simultaneously. So we are fighting with:
- App and task multi-tenancy. All the running apps and tasks compete for the limited resources. Foreground interactive tasks have priority in order to maintain the smoothness of the whole system.
- Power consumption and heat dissipation. The form factor of the mobile phone determines that there is a limit on total battery power. An app or task cannot freely draw as much power as it wants. We also cannot push the phone to extremes (like beyond 4 watts) for a long time, because that will exceed the phone’s ability to dissipate heat and then we will see throttling.
- Operating system scheduling and throttling. Even though current phones have CPUs with up to 1+3+4 cores, typically we cannot expect a task to be scheduled on the big cores if it’s short-lived (like less than 50 ms). If it indeed takes longer, then the OS might place it on some big core, but now there is the problem of power consumption: big cores draw much more power (like 5x) than small ones. So again it might not be sustained for long before we see throttling; it’s really for a short burst. Multithreading across multiple big cores is hardly possible. That is the CPU side. GPUs are better at handling lots of requests and maintaining high throughput. So if we are not running games, GPUs are generally more than enough to render the UI and run some ML inference workload at the same time. But the OS might not clock the GPU at its highest frequency either, for energy reasons.
End-to-end real-time solution for small workloads
Also, while training and cloud inference have the flexibility to focus on one specific task (even if it’s a subtask) and be conducted offline in a batched manner, inference at the edge typically needs to be incorporated into an end-to-end solution and serve in real time. For example, face unlocking involves multiple models, e.g., one for detecting a face in image frames captured by the camera at a high FPS and one for matching the detected face, potentially with additional processing like cropping, rotating, and others. All of these need to happen in the blink of an eye after you pick up your phone. So we have:
- Small and variable workload sizes. Inference on a phone typically handles one data point (one image, one sentence, one audio sequence, etc.). The workload can be too small to justify the startup cost of GPUs or other accelerators and to hide the inefficiencies elsewhere in the GPU/accelerator stack. We might be able to batch a bit like the cloud, but that’s only possible for tasks not sensitive to latency.
- Work distribution and chip utilization. Because of the workload characteristics, mobile ML needs to be smart at deciding how to run a model, depending on the model itself (e.g., whether it contains enough computation to even be worth dispatching to the GPU) and the current state of the system (e.g., screen on/off, overheating or not, CPU/GPU load, etc.).
- Latency sensitivity. Typically a user interacts with the phone in real time; any noticeable lag can cause a negative experience. So the response should be immediate.
- Deep integration with app logic. ML inference is just a component of the app. So it should be built with that mindset and be non-intrusive wherever possible. It should not force design choices on the app; instead, options should be provided to let the app suit its own needs.
Most of the challenges discussed thus far are not concerns for training and inference in the cloud. Even for those that are, like work distribution and chip utilization, we can in a sense ignore them and “hide” the inefficiency via brute force: just throw in more machines with more GPUs. That is certainly not an option for the edge.
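To give a feel for the work distribution decision described in the list above, here is a toy cost model in Python. Everything in it is an assumption made for illustration: the `choose_device` function, the throughput numbers, the fixed 1 ms GPU dispatch overhead, and the load threshold. A real runtime would measure these rather than hard-code them.

```python
from dataclasses import dataclass

@dataclass
class SystemState:
    gpu_load: float  # fraction of GPU time already in use, 0.0 to 1.0

def estimate_ms(flops, flops_per_ms, fixed_overhead_ms=0.0):
    """Very rough latency estimate: fixed startup cost plus compute time."""
    return fixed_overhead_ms + flops / flops_per_ms

def choose_device(model_flops, state,
                  cpu_flops_per_ms=2e6, gpu_flops_per_ms=2e7,
                  gpu_overhead_ms=1.0, max_gpu_load=0.8):
    # If the GPU is already busy (e.g., a game is rendering), stay on the CPU.
    if state.gpu_load > max_gpu_load:
        return "cpu"
    cpu_ms = estimate_ms(model_flops, cpu_flops_per_ms)
    gpu_ms = estimate_ms(model_flops, gpu_flops_per_ms, gpu_overhead_ms)
    # Dispatch to the GPU only when the work outweighs its startup cost.
    return "gpu" if gpu_ms < cpu_ms else "cpu"

idle = SystemState(gpu_load=0.1)
gaming = SystemState(gpu_load=0.95)

small_model = choose_device(1e6, idle)    # 0.5 ms on CPU vs 1.05 ms on GPU
large_model = choose_device(1e8, idle)    # 50 ms on CPU vs 6 ms on GPU
large_busy = choose_device(1e8, gaming)   # GPU is occupied
```

With these made-up numbers, the small model stays on the CPU because the dispatch overhead dominates, while the large model goes to the GPU unless the GPU is already loaded.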
Heterogeneous and fragmented system
If you think the above is still manageable as we just need to be very careful and write efficient and performant libraries, hear me out: the above list is just the start. 😊
The above points are just the nature of mobile devices and inference. Looking further at the system and implementation side, we also need to fight:
- Hardware fragmentation. CPUs are okay here; it’s basically ARMv8 nowadays. For GPUs, we have Qualcomm Adreno, ARM Mali, Imagination PowerVR, and soon AMD RDNA. Each has multiple generations. And they are packed into an SoC. So more variants. Then the SoC is assembled into the final phone by device OEMs. Even more variants. CPUs and GPUs are general compute devices; it’s already overwhelming. If we additionally put various AI accelerators into the picture, it’s just daunting.
- Software stack fragmentation. There are quite a few ML frameworks; models authored with them typically need to be converted before deployment. So a different path for each such framework. There are Android and iOS. For Android, there is the notorious version fragmentation issue. Together with hardware fragmentation, we have a compounding problem. Even for the same SoC, we can have GPU drivers at different versions, which may lack certain functionalities or have various bugs. So yeah, thus far the device configuration space for GPU has really grown into many dimensions: GPU architecture, SoC, Android version, driver version. AI accelerators rely entirely on vendor-specific toolchains that target only very specific SoCs.
- Layered system. There are app sandboxes and multiple levels of abstraction. They are meant to provide security or to hide chip or hardware differences, but they introduce performance instability and additional costs, which can very well exceed the original computation cost if the workload is small.
- Deployment constraints. Unlike in the datacenter, where we can deploy new infrastructure or models in whatever manner, mobile devices typically update apps through app stores, under the user’s control. Bandwidth and size are real concerns here, in addition to runtime performance and efficiency.
And it’s not only me trying to scare you or something. Facebook published “Machine Learning at Facebook: Understanding Inference at the Edge” in early 2019, which speaks to many of the same challenges with concrete numbers, particularly on the system and usability side. I highly suggest you give it a read. I’ll quote some of their key points in the following section.
A Different Categorization of Challenges
Basically, the challenges can be divided into two categories: those that dictate whether we can do ML at all and those that require us to have very thoughtful design and efficient implementation. For the former:
- Heterogeneity is a fundamental characteristic in the mobile world. “The most-commonly used SoC has less than 4% of the market share; there are only 30 SoCs with more than 1% coverage and their joint coverage is only 51%; 225 SoCs are still just covering 95%.” Without a proper way to handle this, we are either restricted to a limited subset, or we cannot utilize the full power of each phone. Hardware fragmentation is the main cause.
- Programmability is a primary roadblock for fully utilizing mobile SoCs other than CPUs. It’s great to see various SoCs have dedicated AI accelerators sporting a fantastic FLOPS number, but the reality is they are hardly used, due to toolchain issues and portability. Only major mobile and chipset vendors have AI accelerators. Lots of effort is needed to deploy models, and a different SoC means redoing the work. Facebook reported in their paper that they resort to mainly using the CPU because that’s the most available and reliable. The heterogeneous and fragmented system as a whole causes programmability issues.
- Stack robustness. Even the GPU is not well used on mobile phones, due to the inconsistent and buggy GPU stack for OpenGL ES and OpenCL; the latter is also not officially supported in the Android Open Source Project. Having a robust stack helps deliver an end-to-end real-time solution for small workloads. Looking forward, Vulkan is a very promising GPGPU API as it provides a thin stack with extensive conformance tests.
For the latter:
- An end-to-end configurable deployment solution is great for the model life cycle. The more streamlined the path from research to deployment and the less manual interaction it needs, the better. This helps with software stack fragmentation, deep integration with app logic, and deployment constraints.
- Efficiency in all dimensions is needed, including energy, runtime, size, etc. The solution should be performant and use the least resources possible to be friendly to others. Efficiency helps with the small and variable workload sizes, work distribution and chip utilization, and the resource constrained environment.
- Predictability is a very important factor for mobile inference, to avoid perceivable lag that causes a negative experience in real-time workloads. Facebook even argues that for them “co-processors and accelerators are used for power and stable performance; speedup is often secondary.” This helps with latency sensitivity.
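One way to see why predictability can matter more than raw speed is to compare median and tail latency. The sketch below uses a nearest-rank percentile over two invented latency traces; the numbers are fabricated purely for illustration and do not come from the Facebook paper or any measurement.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical per-inference latencies in milliseconds.
accelerator_ms = [8, 8, 9, 8, 9, 8, 8, 9, 8, 9]   # slower median, very stable
cpu_ms = [5, 5, 6, 5, 5, 30, 5, 6, 5, 25]         # faster median, occasional spikes

accel_p50, accel_p90 = percentile(accelerator_ms, 50), percentile(accelerator_ms, 90)
cpu_p50, cpu_p90 = percentile(cpu_ms, 50), percentile(cpu_ms, 90)

# Gap between tail and median: a simple proxy for how noticeable jitter is.
accel_jitter = accel_p90 - accel_p50
cpu_jitter = cpu_p90 - cpu_p50
```

In this made-up trace the CPU wins on median latency but loses badly at the tail, and the tail is what a user perceives as lag.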
How to Address
So, inference at the edge is juggling among many constraints and trying to find a balance. It’s bad to just paint the sky as grey without giving a way out.
Generally, compilers have proven to be great at both handling heterogeneity and achieving efficiency. In IREE, we are developing an end-to-end compiler for generating both kernels and scheduling logic to handle heterogeneity and programmability. For GPU we use Vulkan, which is a very thin abstraction, to gain the most control of the stack and achieve performance and predictability. These are topics worth their own blog posts; this one is fairly long already, so I’ll just stop here. Till next time! 😊
[1] Practically it is less so for control flow on CPUs. Branch prediction on CPUs is quite good these days, at the cost of complicated hardware logic. But still, CPUs also love streamlined code, especially for utilizing SIMD functionalities.
[2] For edge devices we sometimes also want to train. But that’s mostly fine-tuning a previously trained model to fit a specific task or user. The vast majority of the parameters are frozen in such cases.