Gluon: Explicit Performance

2026-02-28
13 min read

Gluon extends the Triton language and compiler with an additional approach to GPU kernel programming. It strikes a different balance between portability and performance by exposing more compiler internals, giving developers more explicit control to reach a higher performance ceiling. In this blog post I’ll explain Gluon as I understand it. I will also use it as an opportunity to talk about domain-specific languages, particularly in the context of rapidly evolving agentic software development.

Gluon and Triton

Let’s start by introducing Gluon. Like previous blog posts in this series, I’ll discuss the overall structure and design choices and explain my mental model of it, instead of sweeping through language features or providing a guide on how to get started implementing specific kernels for a particular problem. For the latter, there already exist great talks and NVIDIA GPU-specific tutorials; we are also developing similar ones for AMD GPUs.

Frontend to Triton GPU IR

Inside the Triton compiler, we have three levels of IRs, namely, Triton tt, Triton GPU ttg, and LLVM llvm. Triton can be seen as the Python frontend to the tt IR: we mechanically parse the Python AST using the visitor pattern to emit corresponding tt ops. There are a few layers in this procedure:

@triton.jit (decorator)          # python/triton/runtime/jit.py
  → JITFunction._do_compile()    # python/triton/runtime/jit.py
    → ASTSource.make_ir()        # python/triton/compiler/compiler.py
      → ast_to_ttir()            # python/triton/compiler/code_generator.py
        → TritonSemantic         # python/triton/language/semantic.py
          → ir.builder           # python/src/ir.cc
  1. The @triton.jit decorator wraps a Python function in a JITFunction instance. On the first kernel launch, its _do_compile() method creates an ASTSource instance, which holds the kernel function and metadata, and passes it to the compile() method.
  2. Next, ASTSource’s make_ir() method calls into the code generator via ast_to_ttir(). Inside, we build a CodeGenerator instance, which inherits from the standard Python ast.NodeVisitor.
  3. CodeGenerator creates an ir.builder and a TritonSemantic instance. During AST traversal, TritonSemantic methods are invoked accordingly and use ir.builder to generate IR. For example, for tl.load we have this semantic method and pybind builder method.
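The visitor-pattern step above can be sketched in miniature with Python's standard ast module. This is a hedged toy, not Triton's actual code: TinyCodeGen and its ops list are hypothetical stand-ins for CodeGenerator, TritonSemantic, and ir.builder.

```python
import ast

class TinyCodeGen(ast.NodeVisitor):
    """A toy stand-in for Triton's CodeGenerator: walk the Python AST
    and record a pseudo-op per node, the way TritonSemantic drives
    ir.builder in the real flow."""

    def __init__(self):
        self.ops = []

    def visit_BinOp(self, node):
        self.generic_visit(node)  # emit operands first, then the op
        self.ops.append(f"emit {type(node.op).__name__}")

    def visit_Call(self, node):
        self.generic_visit(node)
        # Assumes calls look like `tl.something(...)`, i.e. an Attribute.
        self.ops.append(f"emit call {node.func.attr}")

gen = TinyCodeGen()
gen.visit(ast.parse("tl.load(ptr) + tl.load(other)"))
print(gen.ops)  # ['emit call load', 'emit call load', 'emit Add']
```

The real CodeGenerator handles scoping, control flow, and type inference on top of this skeleton, but the dispatch mechanism is the same ast.NodeVisitor pattern.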

Similarly, Gluon is effectively the Python frontend to the ttg IR. It defines its own @gluon.jit decorator, GluonJITFunction (a subclass of JITFunction), and GluonASTSource (a subclass of ASTSource) in its _runtime.py. Overall, though, it shares much of the above mechanical AST parsing and IR generation flow, particularly the core CodeGenerator class; the only difference is the IR builder and semantic instance plugged in.

The following is an overall JIT compilation flow comparison between Triton and Gluon:

Triton:
  @triton.jit → ASTSource → ast_to_ttir()
    → CodeGenerator + TritonSemantic + ir.builder
    → ttir  →  ttgir  →  llir  →  ptx/amdgcn  →  cubin/hsaco
Gluon:
  @gluon.jit → GluonASTSource → ast_to_ttir()
    → CodeGenerator + GluonSemantic + GluonOpBuilder
            →  ttgir  →  llir  →  ptx/amdgcn  →  cubin/hsaco

As you can see from the above, the major difference is skipping the tt IR and directly exposing and building ttg IR from Python. It naturally means that developers now gain access to low-level explicit controls that were previously hidden inside the compiler. It also means that developers now need to handle the optimizations previously performed by the tt-to-ttg conversion.

GPU-specific explicit controls

Compared to Triton tt IR, which focuses on describing the algorithm, Triton GPU ttg IR adds more GPU-specific abstractions, the most notable being 1) explicit layouts. Layouts are the critical mechanism for expressing how tensor elements are distributed or shared inside the GPU, which is a tile-based architecture. You can read my previous blog posts for an introduction if you are not familiar with them.
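To make the idea concrete, here is a hedged pure-Python toy of one layout flavor: a blocked distribution assigning contiguous chunks of a 1-D tensor to threads. Real ttg layouts carry much more (sizePerThread, threadsPerWarp, warpsPerCTA, order, and so on); blocked_owner is an invented illustration, not a Triton or Gluon API.

```python
# A layout, at its core, is a map from tensor element index to the
# thread (and register slot) that owns it. A "blocked" layout hands
# each thread a contiguous chunk, wrapping around once all threads
# have a chunk.

def blocked_owner(elem, size_per_thread, num_threads):
    """Which thread owns element `elem` under a blocked distribution."""
    return (elem // size_per_thread) % num_threads

# 64 threads, 2 elements per thread:
# elements 0,1 -> thread 0; elements 2,3 -> thread 1; ...
owners = [blocked_owner(e, size_per_thread=2, num_threads=64) for e in range(8)]
print(owners)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Different layouts are just different maps; picking the right one decides whether loads coalesce, whether an mma op can consume the data directly, and how much data movement a conversion costs.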

In addition to layouts, ttg IR also has 2) explicit shared memory management. In the canonical Triton flow this is entirely managed inside the compiler, where developers can typically only influence it indirectly, by setting the number of shared memory buffers to use for software pipelining. Now this all becomes explicitly programmable in Python via the shared_memory_descriptor API.
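As a hedged toy of what that buffer management amounts to: with N shared memory buffers, the producer of iteration i writes slot i % N while the consumer reads the slot filled one iteration earlier. The function names here are invented for illustration; the real shared_memory_descriptor API also carries element types, shapes, and shared layouts.

```python
# Multi-buffered shared memory rotation: previously the developer only
# picked N (num_stages) and the compiler did the indexing; in Gluon the
# slot arithmetic itself is written in the kernel.

def producer_slot(i, num_buffers):
    """Slot the producer writes on iteration i."""
    return i % num_buffers

def consumer_slot(i, num_buffers):
    """Slot the consumer reads: it trails the producer by one iteration."""
    return (i - 1) % num_buffers

slots = [(producer_slot(i, 3), consumer_slot(i, 3)) for i in range(1, 5)]
print(slots)  # [(1, 0), (2, 1), (0, 2), (1, 0)]
```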

At the level of Triton GPU IR, there are both common ops applicable to multiple (vendor) architectures and specific ops applicable to just one. This aspect is also reflected in Gluon, where we can use 3) architecture-specific features. With this kind of control, Gluon kernels are no longer portable the way Triton kernels are.

Block/warp level programming

Another major feature of Gluon is that it provides a well-defined explicit API for 4) warp specialization. This arguably lowers the programming model by one level compared to Triton, where we can only program at the block level and let all warps go through uniform control flow paths.

This reflects the GPU hardware evolution trend. From Ampere to Hopper to Blackwell, we see GPUs become more and more complex and asynchronous. To achieve higher compute throughput, tensor cores get enhanced greatly every generation. To feed those tensor cores, special hardware units like the TMA were introduced to match. We want to utilize all the resources inside the streaming multiprocessor to keep the various pipelines fully occupied, as any stall is now much more exposed and affects overall performance far more dramatically than before, given how fast and extreme everything has become.

Carefully arranging all instruction issuing statically with a compiler is a very hard problem. Instead, relying on the hardware's dynamic scheduling to "fill the bubbles" gives a much simpler software stack and achieves better overall utilization across various workloads, as it can respond dynamically to hardware pipeline backpressure. So across generations, we see warps becoming more and more capable of specializing in different roles and performing async tasks, using shared memory as a "working space" with flexible synchronization primitives.

Gluon’s warp specialization API exposes this warp-level programming capability. It still maintains a style similar to Triton: although we now have multiple partitions, inside each partition we still reason about all warps collectively as a "subblock". This makes kernels easier to write and understand. Furthermore, under the hood the compiler hides tedious steps, like packing/unpacking data in shared memory for warps to identify themselves and collaborate with each other, and synchronizing warps at API boundaries, to provide a clean experience. See the tutorial as an example:

# Allocate all the buffers and barriers.
# ...

descs = (a_desc, b_desc, c_desc)
barriers = (load_empty_bars, load_ready_bars, c_empty_bars, c_ready_bars)
buffers = (a_bufs, b_bufs, c_bufs)
numel = (xnumel, ynumel)

pid = gl.program_id(0)
xoff = pid * XBLOCK

# `gl.warp_specialize` declares a warp-specialized section of the kernel.
# It takes the arguments for the default partition, which can include
# tensors, along with the default partition function. It also takes the
# arguments for the worker partitions, which cannot include tensors, and
# a list of worker partition functions. The warps and register budget
# for each partition are passed as lists.
# ...
gl.warp_specialize([
    (compute_partition, (barriers, buffers, ynumel, YBLOCK, layout)),
    (load_partition, (descs, barriers, buffers, xoff, numel, YBLOCK)),
    (store_partition, (descs, barriers, buffers, xoff, numel, YBLOCK)),
], [1, 1], [24, 24])

GPU-specific explicit optimizations

Bypassing the tt IR and directly exposing ttg IR to Python yields more control to developers. That does not come for free; the associated trade-off is that developers now also need to perform explicit optimizations that were previously done by the compiler during the tt-to-ttg IR conversion flow. You can find those optimizations in the make_ttgir() methods of the CUDA and HIP backends.

One such major optimization is software pipelining (SWP) and the associated shared memory buffer management. SWP helps set up the kernel's top-level memory-compute structure, which is core to getting high performance. It prefetches memory operations from the next iteration while performing the compute of the current iteration. Due to the prefetch, compute no longer depends on memory operations in the same loop iteration, making the compiler happier since it can better overlap memory and compute instructions. (With warp specialization, they are further decoupled into different producer-consumer warps; an even happier compiler, given there is less to schedule and overlap! But arguably that is an entirely different programming model; here we are talking about the same programming model as canonical Triton and looking at the "delta" we get by using Gluon.)
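The loop restructuring SWP performs can be sketched with a hedged instruction-trace toy. Pure Python, invented names, no real prefetching; it only shows how the schedule changes shape.

```python
# Without SWP, each iteration is load(i) then compute(i), so compute
# stalls on its own load. With SWP, the load for iteration i+1 is
# issued before compute(i), letting memory and compute overlap.

def schedule_naive(n):
    trace = []
    for i in range(n):
        trace += [f"load{i}", f"compute{i}"]
    return trace

def schedule_pipelined(n):
    trace = ["load0"]                    # prologue: prefetch the first tile
    for i in range(n):
        if i + 1 < n:
            trace.append(f"load{i+1}")   # prefetch next iteration's tile
        trace.append(f"compute{i}")      # no longer waits on load{i}
    return trace

print(schedule_pipelined(3))
# ['load0', 'load1', 'compute0', 'load2', 'compute1', 'compute2']
```

With num_stages buffers, several such loads can be in flight at once; the slot rotation shown earlier decides which shared memory buffer each in-flight load targets.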

Another major manual optimization is specifying layouts on various ops, with manual gl.convert_layout() calls when needed. By definition, Gluon yields this control to developers, so the compiler won't do any layout propagation and resolution that would disturb what the developers dictate. There is of course a trade-off here, given this can be quite tedious and bloat the kernel code. So there is also lightweight layout inference to aid the job, like gl.AutoLayout and gl.CoalescedLayout, as demonstrated by this test.
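A hedged toy of what a layout conversion means: if a layout is a map from element index to owning thread, then converting between two layouts must shuffle exactly the elements whose owners differ between the source and destination maps. Both owner functions below are invented illustrations, not Gluon APIs.

```python
# Two toy layouts over a 1-D tensor: a blocked one (contiguous chunks
# per thread) and a round-robin one. A convert_layout between them is
# a data shuffle; elements whose owner is the same in both layouts can
# stay in place.

def owner_blocked(e, size_per_thread, num_threads):
    return (e // size_per_thread) % num_threads

def owner_round_robin(e, num_threads):
    return e % num_threads

num_threads, n = 4, 8
moves = [e for e in range(n)
         if owner_blocked(e, 2, num_threads) != owner_round_robin(e, num_threads)]
print(moves)  # [1, 2, 3, 4, 5, 6] -- elements 0 and 7 already agree
```

This is why gratuitous conversions are costly (cross-thread shuffles, often through shared memory) and why choosing compatible layouts up front is itself an optimization.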

Being able to invoke architecture-specific features helps performance optimization immensely. One such example is buffer load/store on AMD GPUs. Buffer load/store intrinsics on MI300X/MI350X hardware accept one base 64-bit pointer for a whole warp, plus per-thread 32-bit offsets. They natively support out-of-bounds accesses by returning zeros for reads or discarding writes, which is pretty useful for handling masking; normal global loads/stores need explicit if-else structures as guards.

From Triton, though, the API is tl.load/tl.store, which accepts per-thread 64-bit pointers. In the compiler it is pretty hard to prove in a general manner that a tl.load op can be safely converted into a buffer load by fitting into the 32-bit indexing range. On the Gluon path, we just call the gl.amd.cdna3.buffer_load op, for example. This is a straightforward case; there are many such examples where we can perform targeted optimizations that were difficult or impossible on the canonical Triton path.
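The masking difference can be modeled with a hedged pure-Python toy (invented helpers, not the Gluon or Triton APIs): the hardware-style buffer load zeroes out-of-bounds lanes, while the tl.load style guards each lane with a software-computed mask. Both yield the same values; only where the bounds check happens differs.

```python
# Two ways to read lanes near a tensor boundary.

def buffer_load(data, offsets):
    """Hardware-style: out-of-range offsets yield 0, no software mask."""
    return [data[o] if 0 <= o < len(data) else 0 for o in offsets]

def masked_global_load(data, offsets, other=0):
    """tl.load-style: a mask computed in software guards each lane."""
    mask = [0 <= o < len(data) for o in offsets]
    return [data[o] if m else other for o, m in zip(offsets, mask)]

data = [10, 20, 30]
offs = [0, 2, 5]  # offset 5 is out of bounds
print(buffer_load(data, offs))         # [10, 30, 0]
print(masked_global_load(data, offs))  # [10, 30, 0]
```

In hardware, the buffer variant saves the per-lane compare-and-branch entirely, which is where the performance benefit on CDNA3 comes from.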

Portability vs Performance

One might wonder: it would be great if we had one solution that is both highly portable and performant across various architectures. Then we would just need to invest once initially, keep up incremental development for new architectures, and always reap the benefits.

The reality is, unfortunately, that portability and performance are two conflicting goals. Performance by definition means (over)fitting to one specific architecture and leveraging all the hardware-specific features to achieve the top FLOP/s or TB/s; portability, on the contrary, means covering a broad range, inevitably leading to abstractions and leaning on common-denominator features.

General compilers

General compilers deliver strongly on portability but less so on performance.

Building and maintaining a general compiler stack in a solid production state requires an enormous amount of resources. Such a high investment cost naturally leads to collective industry and community efforts, from many companies, spanning years or decades. GCC and Clang are successful models here: they support various general programming languages and target many different CPU/GPU/xPU architectures.

With so many targets to support, conflicting goals are inevitable, especially around performance. Each architecture can have different optimization directions, and every direction can require special if-else statements or code paths somewhere. The codebase would become messy if all such optimizations got a free pass, and the maintenance burden would be prohibitively high. Any change could regress some workload for somebody and ripple out in unexpected ways.

So it's natural that certain changes will be excluded even if they improve the performance of some workload on some architecture. Optimizations need to be more generally applicable in order to benefit everybody in the ecosystem, not increase everybody's maintenance burden for one specific workload. Along this line, it's understandable that general compilers deliver more as a portability solution, and their performance tends toward the common denominator.

However, people still want to push performance boundaries with general compilers, given we have a portable solution that is vastly successful and installed by default in various development environments. To leverage existing assets as much as we can and tame maintenance cost, we build commonly reusable mechanisms and drive them with architecture-specific policies. The hope is that with good decoupling and isolation, we get the best of both worlds.

It's a good design paradigm and fulfills many cases, though as we know there is no free lunch: if we rely on those few tunable knobs for control, we effectively perform a lossy compression of information, so we will inevitably run into "hash collisions" where various workloads compete with contradictory settings. We end up with quite a few compiler heuristics and "magic numbers" to tweak in a whack-a-mole manner.

Domain-specific compilers

Domain-specific compilers can typically deliver higher performance than general compilers.

By definition, being domain-specific means narrowing the focus to a very specific problem space. Within that context, the approaches to performance are more likely uniform, so we can adopt them in a principled manner, given there is no need to compromise to serve other problems' needs too. For example, if we only care about GPUs, then we definitely need first-party shared memory usage and bank conflict optimizations, which can span multiple steps throughout the compiler stack. It would be hard to ask non-GPU architectures to take this on, as it would be pure burden for no gain.

Solutions, including both the language and the compiler, can be quite tailored to the specific problem domain. We can adjust the language surface area between developers and compilers to decide whether we want developers to program the high-level algorithm or the low-level execution details. We can freely define ways to pass information down through the compiler stack to reference later, given that compilation flows are generally lossy on high-level structure. Also, we don't need to solve everything in the compiler: developers definitely know more details about their specific problems, and no amount of compiler heuristics can guess better.

Domain-specific compilers have smaller scopes, which means less effort to build and maintain. Still, the effort is nontrivial, given there is a lot of design involved, like, at the core, how we keep language features minimal and orthogonal so they compose to achieve combinational effects when addressing problems. Additionally, we need all the bells and whistles like releases, tooling, and so on.

Though, with all the effort above, we still don't get 100% of the performance of hand-written assembly. And LLM capabilities are still advancing in big strides. Some day they will be able to write good assembly code. Does any of this still matter?

Agentic development

Predicting the future is hard. But as past experience shows, there is rarely a silver bullet or one solution fitting all.

With today's highly capable LLMs, the cost of writing software decreases dramatically. What that brings is the ability to create endless software customizations tailored to specific use cases. The need to create one unified solution serving many needs is a consequence of human limitations, after all: we need to keep the investment and maintenance burden down for ourselves. So AI achieves an "unbundling" effect here; we no longer need to force one giant solution for everything.

It is just a matter of time before those solutions overwhelm human abilities to comprehend: what the AI is doing, whether it is correct, whether it makes sense, whether it is safe, and so on. There will certainly be cases where we don't care what's behind the curtain and can let LLMs do whatever. For the cases we do care about, letting LLMs write the solution in a specific domain-specific language is certainly a great way to keep them scrutinizable!

At the end of the day, a language is a "contract": it defines the surface area through which we can express solutions to problems. And the associated compiler stack is a very clear "verifier" of whether a solution works or not. The language sets a clear boundary, formally guaranteeing that LLMs won't be able to do anything beyond it. We see more people adopting Rust for its safety, partially for this reason. The language also serves as the vehicle by which the solution remains understandable to human beings.

Given the domain-specific nature, such solutions should be easier to understand, with higher performance potential, especially if we let agents keep iterating to push for better and better code. The trick might be that we now need to dial the knob further inside the design space to strike new balances for these needs, like Gluon, and maybe more to come down the road. An exciting time to be in!

Summary

In this blog post I explained Gluon, which is effectively a Python frontend to the Triton GPU ttg IR. It provides GPU-specific explicit controls and optimization possibilities compared to canonical Triton. Starting from Gluon, in the second part I laid out my hand-wavy thinking on portability and performance in the context of general and domain-specific compilers. In this era of great agentic AI advances, I think domain-specific languages will be a good companion to it.