gpu | Lei.Chat()

Gluon: Explicit Performance

Gluon extends the Triton language and compiler with an additional approach to GPU kernel programming. It strikes a different balance on the portability-performance spectrum by exposing more compiler internals, giving developers more explicit control to reach a higher performance ceiling. In this blog post I’ll explain Gluon as I understand it. I will also use this as an opportunity to talk about domain-specific languages, particularly in the context of rapidly evolving agentic software development.

2026-02-28

13 min read

compiler, triton

triton

Triton Linear Layout: Examples

The previous blog post covered Triton linear layout concepts, aiming to provide the underlying motivation and an intuitive understanding. As a companion, in this one I’d like to touch on linear layout internals and follow up with some concrete examples that show it in action and make it more comprehensible. In the same vein, I prefer plain language and explanations over mathematical terms and interpretations.

2026-01-10

13 min read

compiler, triton

triton

Triton Linear Layout: Concept

Layout is a core concept in Triton for representing and optimizing distribution mappings from source problems to the target hardware compute and memory hierarchy. In this blog post I will talk about linear layout in Triton, the new unifying mechanism over existing bespoke layouts for different purposes. The aim is to provide motivation and an intuitive understanding of linear layout; I will rely on examples and illustrations instead of theories and proofs.

2024-12-31

16 min read

compiler, triton

triton

Triton Compiler Development Tips

Triton provides an elegant solution for programming GPU kernels in Python, positioning itself as a critical component in the modern AI software stack. To deliver performance and portability, it relies on a compiler, whose capability determines the potential. Hacking on the compiler internals is not a simple task. Here are some tips that are hopefully useful to folks. I’ll try to keep this blog post updated periodically.

2024-12-25

10 min read

compiler, triton

triton

Single-node ML Runtime Foundation

Previous blog posts gave an overview of the MLIR dialect hierarchy for kernel code generation (CodeGen) and zoomed in on the Linalg and Vector dialects among them. Now I will switch to discussing the runtime side a bit, in order to provide a holistic view of MLIR-based machine learning (ML) compilers. This one touches on the foundation and basics, including the target landscape, runtime requirements, and designs to meet them.

2023-04-01

18 min read

runtime, mlir, ml-inference

ml-inference, compiler-development

CodeGen Performant Convolution Kernels for Mobile GPUs

This blog post talks about how to generate performant code for convolution ops using MLIR’s multiple levels of abstractions and transformations. I initially created it to target ARM Mali GPUs in IREE, but given that it is just direct tiling and vectorization, it should be widely applicable. I will walk through the lowering steps, so if you are interested in how to organize MLIR’s various dialects and patterns together to achieve similar tasks, this blog post might also be useful.

2021-09-19

36 min read

android, ml-inference, gpu-performance, compiler, mlir

gpu-codegen

Sampling Performance Counters from Mobile GPU Drivers

In a previous blog post I gave a general introduction to GPU driver internals in Android/Linux systems. Following up on it, today I will explain how one specific functionality, hardware performance counter (perf counter) queries, is handled in both Qualcomm Adreno and ARM Mali drivers, by walking through the kernel driver source code.

2021-07-08

10 min read

android, gpu-driver, gpu-performance

gpu-driver

Android/Linux GPU Drivers: Internals and Resources

Recently I have been working on a library that needs to interact directly with GPU kernel drivers from various vendors on Android/Linux systems. Compared to the various GPU APIs, information at this level is quite sparse, so it is not a straightforward task, to say the least, and I ended up piecing multiple sources together to figure out the details. I am writing these driver internals and resources down in case they are useful to others who are interested in these low-level bits.

2021-07-05

12 min read

android, linux, gpu-driver

gpu-driver

What is Vulkan Compute?

Vulkan is designed to be both a graphics and compute API. However, there is no formal definition of the compute subset from the Khronos Group, the industry consortium behind Vulkan. The unified Vulkan specification does not help here either, as it contains everything, both graphics and compute. Unlike the complicated graphics subset, the compute subset is actually quite straightforward and clean. So in this blog post I try to explain what Vulkan compute is, from my point of view.

2021-06-25

16 min read

vulkan-compute

Shader Toolchain: HLSL in Vulkan

At the 2018 Vulkan Developer Day in Montréal, I gave a talk titled “Shader Toolchain: HLSL in Vulkan”. Here are the links to the video recording, slides, and documentation/downloads for DirectX Shader Compiler (DXC) SPIR-V CodeGen.

2018-05-12

1 min read

vulkan-graphics, dxc