Sampling Performance Counters from Mobile GPU Drivers

2021-07-08

In a previous blog post I gave a general introduction to GPU driver internals in Android/Linux systems. Following up on it, today I will explain how one specific piece of functionality, hardware performance counter (perf counter) queries, is handled in both the Qualcomm Adreno and ARM Mali drivers, by walking through the kernel driver source code.

Rationale: A Perf Counter Sampling Library

But why look into perf counters? What’s interesting about them? Perf counters are special processor registers that expose various metrics about the processor. They are crucial for understanding software performance issues and making the best use of the hardware.

But it also comes down to a library that I’ve been working on recently. So please allow me to digress a bit here. 😊

Unlike with CPUs, in the GPU world we have many different GPU architectures from quite a few hardware vendors. To understand the fine performance details, we typically need to resort to vendor-specific tools like AMD Radeon GPU Profiler, ARM Mobile Studio, NVIDIA Nsight, and Qualcomm Snapdragon Profiler. They are really all-encompassing tool suites, providing many utilities and aiming to profile the whole system. If you only care about one vendor or one GPU architecture and would just like an integrated solution, then they are great. But if you need to support multiple vendors and/or have your own profiling solution in the development flow, e.g., for better automation, then they are not so great: managing an IDE-like GUI application for each vendor is not really fun.

So I have been working on a lightweight and embeddable library for sampling GPU perf counters as an alternative solution. It should support multiple vendors. For performance (as we normally sample perf counters at a very high frequency), the plan is to directly interact with the GPU kernel driver. This is actually inspired by HWCPipe, which is a great resource showing how it is done for ARM Mali GPUs. However, some of its design choices (e.g., mandatory C++ features like STL and exceptions) render it unsuitable for my needs. The biggest issue, though, is that it does not support other vendors. So this motivated me to write my own.

Anyway, I should probably write another blog post once I have the library ready. Now switching back to the main topic for today: perf counters in drivers.

Methodology

We will be looking at perf counters for both Qualcomm Adreno and ARM Mali GPUs. That requires reading the kernel driver source code, which is not in the upstream Linux kernel source tree. Instead, it is released by the Android OEMs shipping products with these GPUs.

Here, I’ll use the code released by Samsung for their Galaxy S21 series. Depending on the market, the Galaxy S21 contains either the Snapdragon 888 (e.g., devices with model number SM-G991U) or the Exynos 2100 (e.g., devices with model number SM-G991B) SoC. I downloaded the kernel code from Samsung’s open source website and put it on GitHub (SM-G991U, SM-G991B) so that I can grab links to use in this post.

As explained in my previous blog post, GPU drivers use common frameworks and have a similar structure in their implementation. That gives us anchors for reading the source code. For example, we can search for module_init in the driver’s directory to find the entry point for the whole module. Similarly, the argument to platform_driver_register defines the driver’s major traits, including the name, which hardware device to match, and so on.

With all of the above, we can look into each GPU now.

Adreno GPU

First let’s have some fun reading the kernel code.

Kernel code walkthrough

The kernel driver for Adreno GPUs is called KGSL, short for Kernel Graphics Support Layer. It is written as a loadable kernel module and uses the platform driver framework. So the anchors mentioned in the above section work; grepping for them will show that the drivers/gpu/msm/adreno.c file is the main file [1] pulling everything together and containing various function pointers.

Device/driver information

Within it, adreno_platform_driver is the struct defining the driver:

static struct platform_driver adreno_platform_driver = {
  .probe = adreno_probe,
  .remove = adreno_remove,
  .driver = {
    .name = "kgsl-3d",
    .pm = &adreno_pm_ops,
    .of_match_table = of_match_ptr(adreno_match_table),
  }
};

That’s where the driver name, kgsl-3d, comes from; and it matches against the devices listed in the match table [2], adreno_match_table:

static const struct of_device_id adreno_match_table[] = {
  { .compatible = "qcom,kgsl-3d0", .data = &device_3d0 },
  { },
};

The .compatible field is the interesting one; it follows the <vendor>,<device> format. So the driver is compatible with devices from vendor qcom and with the name kgsl-3d0. Such devices can be bound to and managed by this driver.
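To make the <vendor>,<device> convention concrete, here is a small userspace sketch in C that splits a compatible string at the comma. The helper name split_compatible is made up for illustration; the kernel itself matches the whole string against the of_device_id table and does not need to take it apart like this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits a Device Tree compatible string of the form "<vendor>,<device>".
 * Returns 0 on success, -1 if there is no comma or an output buffer is too
 * small. split_compatible is a hypothetical helper, not a kernel API. */
int split_compatible(const char *compatible,
                     char *vendor, size_t vendor_size,
                     char *device, size_t device_size)
{
  assert(compatible != NULL && vendor != NULL && device != NULL);

  const char *comma = strchr(compatible, ',');
  if (comma == NULL)
    return -1;

  size_t vendor_len = (size_t)(comma - compatible);
  size_t device_len = strlen(comma + 1);
  if (vendor_len + 1 > vendor_size || device_len + 1 > device_size)
    return -1;

  memcpy(vendor, compatible, vendor_len);
  vendor[vendor_len] = '\0';
  memcpy(device, comma + 1, device_len + 1); /* copies the terminator too */
  return 0;
}
```

Called on "qcom,kgsl-3d0", it fills in qcom as the vendor and kgsl-3d0 as the device.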

Thus far this is all pretty straightforward stuff; but I just wanted to point it out so we are on a solid footing regarding the device/driver names.

If you want to understand more about the device, details can be found by searching for kgsl-3d0 in the kernel codebase, because devices need to expose that name in the compatible field of their Device Tree Source files. For example, Adreno 660 is defined in the arch/arm64/boot/dts/vendor/qcom/lahaina-gpu.dtsi file [3]. The documentation for its fields can be found in the arch/arm64/boot/dts/vendor/bindings/gpu/adreno.txt file. Low-level details there are unlikely to be useful to us, but the power levels can be interesting as they define the frequencies we can see for the GPU.

Let’s continue to look at device driver binding, which is done in the adreno_probe function. It shows that the driver is actually an aggregate driver; it uses component helpers to pull in components like the Graphics Management Unit. Anyway, eventually it calls the adreno_bind function, which in turn calls the GPU core specific probe function:

static int adreno_bind(struct device *dev)
{
  struct platform_device *pdev = to_platform_device(dev);
  const struct adreno_gpu_core *gpucore;
  // ...

  return gpucore->gpudev->probe(pdev, chipid, gpucore);
}

GPU core definition

We are approaching the meaty definitions–the adreno_gpu_core struct:

struct adreno_gpu_core {
  enum adreno_gpurev gpurev;
  unsigned int core, major, minor, patchid;
  const char *compatible;
  unsigned long features;
  struct adreno_gpudev *gpudev;
  const struct adreno_perfcounters *perfcounters;
  // ...
};

Yes, perfcounters! But before looking into that, also worth noting is the adreno_gpudev struct inside. It’s a huge struct containing GPU core specific function pointers, including the probe function mentioned earlier.

Looking at where adreno_gpu_core is referenced, we can find the full list of Adreno GPU core definitions in the drivers/gpu/msm/adreno-gpulist.h file. This is basically the main file containing pointers to various GPU core facts. From it we can see, for example, that for the A6XX GPU series, the perf counters are defined in the adreno_a6xx_perfcounters variable, in the drivers/gpu/msm/adreno_a6xx_perfcounter.c file. There we can find all the perf counter groups.

(It might seem that we are going through a rather convoluted approach here to discover this, as we might be able to directly find such information by trying to find source files with keywords related to perf counters. But I generally feel the above is better as it is more principled and can be used to discover whatever you’d like to know. The same holds for the following analysis.)
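The pattern behind adreno_gpu_core and adreno_gpudev, a table of per-core entries that each point at a set of function pointers, can be mimicked in a few lines of userspace C. Everything prefixed with mock_ below is hypothetical and heavily simplified; it only illustrates the dispatch, not the real KGSL structs or the values in adreno-gpulist.h.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mock of the dispatch pattern. All mock_ names and the table
 * values are made up for illustration. */
struct mock_gpu_core;

struct mock_gpudev {
  /* Per-GPU-family hook, playing the role of adreno_gpudev's probe pointer. */
  int (*probe)(const struct mock_gpu_core *core);
};

struct mock_gpu_core {
  unsigned int core, major, minor, patchid;
  const struct mock_gpudev *gpudev;
};

int mock_a6xx_probe(const struct mock_gpu_core *core)
{
  assert(core != NULL);
  return core->core == 6 ? 0 : -1;
}

const struct mock_gpudev mock_a6xx_gpudev = { .probe = mock_a6xx_probe };

/* Table of known cores, in the spirit of adreno-gpulist.h. */
const struct mock_gpu_core mock_gpulist[] = {
  { .core = 6, .major = 6, .minor = 0, .patchid = 0,
    .gpudev = &mock_a6xx_gpudev },
};

/* Generic code: look up the core entry, then jump through its pointer. */
int mock_bind(unsigned int core, unsigned int major)
{
  size_t count = sizeof(mock_gpulist) / sizeof(mock_gpulist[0]);
  for (size_t i = 0; i < count; ++i) {
    const struct mock_gpu_core *entry = &mock_gpulist[i];
    if (entry->core == core && entry->major == major)
      return entry->gpudev->probe(entry);
  }
  return -1; /* unknown GPU core */
}
```

The generic mock_bind never names a6xx directly; supporting another GPU family would only mean adding a table entry and a new set of function pointers, which is why following the table is a reliable way to navigate the real driver.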

Ioctl interface

Okay, now we know there are quite a few perf counter groups. But still, how do we query them from the kernel? That is where the ioctl system call comes in.

Let’s pick up where we left off: the adreno_bind function calls the GPU core specific probe function. If we look at a concrete one, e.g., the a6xx_probe function registered in the adreno_a6xx_gpudev struct, it calls the a6xx_probe_common function, which calls the adreno_device_probe function, which in turn calls the adreno_setup_device function. adreno_setup_device references an adreno_functable struct. There, we have a bunch of function pointers, including the one for ioctl: the adreno_ioctl function. It actually only handles a few ioctl commands, all listed in the adreno_ioctl_funcs array and all related to perf counters:

static struct kgsl_ioctl adreno_ioctl_funcs[] = {
  { IOCTL_KGSL_PERFCOUNTER_GET, adreno_ioctl_perfcounter_get },
  { IOCTL_KGSL_PERFCOUNTER_PUT, adreno_ioctl_perfcounter_put },
  { IOCTL_KGSL_PERFCOUNTER_QUERY, adreno_ioctl_perfcounter_query },
  { IOCTL_KGSL_PERFCOUNTER_READ, adreno_ioctl_perfcounter_read },
  { IOCTL_KGSL_PREEMPTIONCOUNTER_QUERY, adreno_ioctl_preemption_counters_query },
};

Searching the command symbols, we find they are all defined in the include/uapi/linux/msm_kgsl.h header. uapi means APIs for userspace here, so that matches. After reading the related ioctl struct comments, it’s relatively clear that we need to issue

  • IOCTL_KGSL_PERFCOUNTER_GET for activating the perf counters we want. There is a limit on how many counters we can enable per group. (The limit is reflected by how many adreno_perfcount_registers we have per group. They can be found in, for example, the adreno_a6xx_perfcounter.c file.)
  • IOCTL_KGSL_PERFCOUNTER_PUT for deactivating perf counters once we are done.
  • IOCTL_KGSL_PERFCOUNTER_READ for sampling perf counters.

And the full list of perf counter groups is also defined in the same header file.
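Put together, the three commands form a get, read, put sequence. The following is a minimal sketch of that sequence for a single counter. The struct layouts and ioctl numbers are copied in as assumptions about what include/uapi/linux/msm_kgsl.h contains, purely so the snippet stands alone; real code should include the header from the matching kernel tree and check these definitions against it. The function name kgsl_sample_once is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>

/* ASSUMPTION: the definitions below mirror include/uapi/linux/msm_kgsl.h.
 * Verify them against your kernel tree before relying on them. */
#define KGSL_IOC_TYPE 0x09

struct kgsl_perfcounter_get {
  unsigned int groupid;
  unsigned int countable;
  unsigned int offset;
  unsigned int offset_hi;
  unsigned int __pad;
};

struct kgsl_perfcounter_put {
  unsigned int groupid;
  unsigned int countable;
  unsigned int __pad[2];
};

struct kgsl_perfcounter_read_group {
  unsigned int groupid;
  unsigned int countable;
  unsigned long long value;
};

struct kgsl_perfcounter_read {
  struct kgsl_perfcounter_read_group *reads;
  unsigned int count;
  unsigned int __pad[2];
};

#define IOCTL_KGSL_PERFCOUNTER_GET \
  _IOWR(KGSL_IOC_TYPE, 0x38, struct kgsl_perfcounter_get)
#define IOCTL_KGSL_PERFCOUNTER_PUT \
  _IOW(KGSL_IOC_TYPE, 0x39, struct kgsl_perfcounter_put)
#define IOCTL_KGSL_PERFCOUNTER_READ \
  _IOWR(KGSL_IOC_TYPE, 0x3B, struct kgsl_perfcounter_read)

/* Activates one counter, samples it once, and releases it again.
 * fd is an open descriptor for the KGSL device node.
 * Returns 0 on success and -1 if any ioctl fails. */
int kgsl_sample_once(int fd, unsigned int group, unsigned int countable,
                     unsigned long long *value)
{
  assert(value != NULL);

  struct kgsl_perfcounter_get get;
  memset(&get, 0, sizeof(get));
  get.groupid = group;
  get.countable = countable;
  if (ioctl(fd, IOCTL_KGSL_PERFCOUNTER_GET, &get) < 0)
    return -1;

  struct kgsl_perfcounter_read_group sample;
  memset(&sample, 0, sizeof(sample));
  sample.groupid = group;
  sample.countable = countable;

  struct kgsl_perfcounter_read read_args;
  memset(&read_args, 0, sizeof(read_args));
  read_args.reads = &sample;
  read_args.count = 1;
  int status = ioctl(fd, IOCTL_KGSL_PERFCOUNTER_READ, &read_args);

  /* Always release the counter, even when the read failed. */
  struct kgsl_perfcounter_put put;
  memset(&put, 0, sizeof(put));
  put.groupid = group;
  put.countable = countable;
  if (ioctl(fd, IOCTL_KGSL_PERFCOUNTER_PUT, &put) < 0)
    status = -1;

  if (status < 0)
    return -1;
  *value = sample.value;
  return 0;
}
```

On a device, fd would come from opening the KGSL device node (assumed here to be /dev/kgsl-3d0), and the group and countable values would come from the header and the perf counter list for the GPU in question. A real sampler would keep the counters activated and issue only the read command in its loop, batching several counters into one reads array.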

Perf counters

I hope the above is interesting. Thus far we know the ioctl commands to use for interacting with the kernel driver and we know there are quite a few perf counter groups. But we still don’t know what those exact counters are! That’s where the open source Freedreno driver comes in super helpful. In its envytools subproject, we can directly find all the perf counters in an XML database. For example, for the A6XX series, it would be the registers/adreno/a6xx.xml file. It contains enums whose names end with perfcounter_select and that’s what we want.

Up to this point, we basically have all the information we need. Now we can put everything together as a proof of concept. I created a Gist for it to sample a hardcoded list of counters for 100 iterations. Everything seems fine.

Mali GPU

Thanks to the HWCPipe project, we already know how to sample perf counters from the kernel directly. But the above methodology and steps still apply. (And to truly understand the interaction, reading the kernel code is inevitable.) I’ll just point out some key points regarding the Mali driver code here.

Kernel API versions

Compared to Adreno GPUs, the driver code for Mali GPUs is actually much more complex: you can find multiple copies of the driver at different versions, and within the same copy, it’s using a versioned API! This actually makes sense. Unlike Adreno, which is only used by Qualcomm, Mali GPUs are licensed to various SoC vendors as IP blocks. Different vendors have different needs that ARM needs to serve, which requires the kernel code to be structured this way.

But it does mean more steps to interact with the kernel driver. We need to additionally negotiate the API version and set up the API context.
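As a rough sketch of what those extra steps look like, the snippet below performs the version check and then sets the context creation flags on an already opened Mali device descriptor. The structs and ioctl numbers are written down as assumptions about what mali_kbase_ioctl.h contains in recent kbase versions; older driver versions use a different interface, so check the header in the kernel tree you target. The function name mali_setup_context is made up, and which create_flags value to pass is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* ASSUMPTION: these definitions mirror mali_kbase_ioctl.h. Verify them
 * against the driver copy in your kernel tree. */
#define KBASE_IOCTL_TYPE 0x80

struct kbase_ioctl_version_check {
  uint16_t major;
  uint16_t minor;
};

struct kbase_ioctl_set_flags {
  uint32_t create_flags;
};

#define KBASE_IOCTL_VERSION_CHECK \
  _IOWR(KBASE_IOCTL_TYPE, 0, struct kbase_ioctl_version_check)
#define KBASE_IOCTL_SET_FLAGS \
  _IOW(KBASE_IOCTL_TYPE, 1, struct kbase_ioctl_set_flags)

/* Negotiates the API version and sets up the context on an open Mali fd.
 * *major and *minor carry the version the caller asks for; on success they
 * are overwritten with whatever the ioctl left in the struct.
 * Returns 0 on success and -1 if either ioctl fails. */
int mali_setup_context(int fd, uint32_t create_flags,
                       uint16_t *major, uint16_t *minor)
{
  assert(major != NULL && minor != NULL);

  struct kbase_ioctl_version_check version;
  memset(&version, 0, sizeof(version));
  version.major = *major;
  version.minor = *minor;
  if (ioctl(fd, KBASE_IOCTL_VERSION_CHECK, &version) < 0)
    return -1;

  struct kbase_ioctl_set_flags flags;
  memset(&flags, 0, sizeof(flags));
  flags.create_flags = create_flags;
  if (ioctl(fd, KBASE_IOCTL_SET_FLAGS, &flags) < 0)
    return -1;

  *major = version.major;
  *minor = version.minor;
  return 0;
}
```

Only after both calls succeed does it make sense to issue further ioctl commands, such as the ones for querying GPU properties or setting up perf counters.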

GPU characteristics

Figuring out the exact GPU characteristics is also harder, as Mali GPUs are configurable (again to satisfy different SoC vendors' needs). There can be a varying number of cores or L2 cache slices, and all of that needs to be factored in to properly calculate the final perf counter values. To make things even more obscure, unlike Adreno GPUs, where the GPU ID matches the marketing product name (Adreno 540/650/etc.), the GPU ID reported by the Mali kernel driver has nothing to do with the marketing name (Mali G57/G78/etc.). These GPU properties are all packed as key-value pairs and returned as a flat buffer when using ioctl to query them. To show how it’s done, here is a Gist file dumping some interesting properties of Mali GPUs.
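To give a feel for walking such a flat buffer, here is a small parser sketch. The encoding it assumes is not spelled out in this post and should be checked against the driver source: each entry starts with a 32-bit little-endian key whose low two bits select the value width (1, 2, 4, or 8 bytes) and whose remaining bits are the property ID, followed directly by the value. The names read_le and find_gpu_prop are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads `width` little-endian bytes starting at p (width is at most 8). */
uint64_t read_le(const uint8_t *p, size_t width)
{
  uint64_t value = 0;
  for (size_t i = 0; i < width; ++i)
    value |= (uint64_t)p[i] << (8 * i);
  return value;
}

/* Scans a flat key/value property buffer for the property `wanted_id`.
 * ASSUMED encoding per entry: a 32-bit little-endian key, where
 * (key & 3) selects a value width of 1 << (key & 3) bytes and (key >> 2)
 * is the property ID, immediately followed by the value.
 * Returns 0 and stores the value on success, -1 if the property is
 * missing or the buffer is truncated. */
int find_gpu_prop(const uint8_t *buf, size_t size, uint32_t wanted_id,
                  uint64_t *value)
{
  assert(buf != NULL && value != NULL);

  size_t pos = 0;
  while (size - pos >= 4) {
    uint32_t key = (uint32_t)read_le(buf + pos, 4);
    pos += 4;

    size_t width = (size_t)1 << (key & 3u);
    if (size - pos < width)
      return -1; /* value runs past the end of the buffer */

    if ((key >> 2) == wanted_id) {
      *value = read_le(buf + pos, width);
      return 0;
    }
    pos += width;
  }
  return -1;
}
```

Because every entry carries its own width, the parser can skip properties it does not know about, which is handy given how much the set of properties varies across driver versions.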

Perf counters

Perf counters in the Mali kernel driver go through a separate API entry point. Unlike Adreno, where we use the main device file descriptor to handle perf counters, for Mali GPUs we need to request another dedicated file descriptor from the driver for perf counters. The drivers/gpu/arm/bv_r26p0/mali_kbase_ioctl.h file contains top-level ioctl commands, including the ioctl command for setting up the perf counter reader. The ioctl commands for perf counters are in the drivers/gpu/arm/bv_r26p0/mali_kbase_hwcnt_reader.h file. The ioctl entry point function is kbasep_vinstr_hwcnt_reader_ioctl.

In a Mali GPU, there are four functionality blocks (job manager, tiler, shader core, memory) that can emit perf counters. Each functionality block always returns a fixed-size block containing 64 counters. All the perf counters are packed into a contiguous buffer, whose layout is detailed here. But the exact meaning of each counter varies per device. What’s nice, though, is that ARM publishes very detailed explanations of their perf counters. So no need for guesswork. 😊
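The fixed block size makes indexing into a sample simple arithmetic. The sketch below assumes that each counter is a 32-bit value and that a sample holds one job manager block, one tiler block, one memory block per L2 slice, and one shader core block per core, with no gaps. Those are simplifications for illustration (a real dump may, for example, reserve slots for cores that are not present), so the linked layout description stays the authority. All function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MALI_COUNTERS_PER_BLOCK 64

/* Number of blocks in one sample under the simplified layout assumed here:
 * job manager + tiler + one memory block per L2 slice + one per core. */
size_t mali_block_count(size_t num_l2_slices, size_t num_shader_cores)
{
  return 1 + 1 + num_l2_slices + num_shader_cores;
}

/* Index, in counters rather than bytes, of `counter` within `block`. */
size_t mali_counter_index(size_t block, size_t counter)
{
  assert(counter < MALI_COUNTERS_PER_BLOCK);
  return block * MALI_COUNTERS_PER_BLOCK + counter;
}

/* Sums one counter over `num_blocks` consecutive blocks, e.g. over all
 * shader cores, since per-core values need to be combined to get a
 * GPU-wide number. Assumes 32-bit counters. */
uint64_t mali_sum_counter(const uint32_t *sample, size_t first_block,
                          size_t num_blocks, size_t counter)
{
  assert(sample != NULL);

  uint64_t total = 0;
  for (size_t i = 0; i < num_blocks; ++i)
    total += sample[mali_counter_index(first_block + i, counter)];
  return total;
}
```

This is where the number of cores and L2 slices from the GPU properties comes in: without them we can neither size the sample buffer nor know how many blocks to sum over.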

Closing Remarks

That’s it. Thanks for reading through! Hopefully this provides useful information for you. It just happens that I need to understand perf counters so I’m using them as examples here. But really, this blog post is more to show that by inspecting the kernel code we can gain a lot of insights into those mobile GPUs.


  1. Note that there is also a drivers/gpu/drm/msm directory, but that’s for the open source driver.

  2. In case you are curious, “of” stands for “Open Firmware”. The Open Firmware Project defines the device tree.

  3. “Lahaina” is Snapdragon 888’s codename.