By analyzing the object binaries in ir_dumps, we found that the Triton kernel was not compiled into Vector instructions, so its performance is much worse than the cpublas kernel that PyTorch calls.
Below, we asked iFlow to help analyze the compilation flow that should generate Vector instructions starting from the linalg code. The following is a record of several rounds of the conversation:
@bare_fused_mm_softmax_tt.mlir and @bare_fused_mm_softmax_linalg.mlir
✔ read_many_files Will attempt to read and concatenate files using patterns: `bare_fused_mm_softmax_tt.mlir…
### ReadManyFiles Result (Target Dir: `C:\work\triton\spine-triton\fused_bare_mm_softmax`)
Successfully read and concatenated content from **2 file(s)**.
**Processed Files:**
- `bare_fused_mm_softmax_linalg.mlir`
- `bare_fused_mm_softmax_tt.mlir`
✦ I can see both MLIR files now. The Triton (tt) version and the Linalg version represent different stages of
the same computation. The Triton version is at a higher level with GPU-specific operations like tt.dot,
tt.load, and tt.store, while the Linalg version has been lowered to more general operations.
The Triton version shows:
1. GPU-specific operations using tt.get_program_id to get thread IDs
2. Tensor operations like tt.dot for matrix multiplication
3. Triton-specific reductions using tt.reduce
4. Memory operations using tt.load and tt.store with pointer arithmetic
The Linalg version shows:
1. More general operations suitable for CPU execution
2. Linalg-specific operations like linalg.matmul and linalg.reduce
3. Affine map indexing for memory access patterns
✦ Both implement the same fused matmul-softmax algorithm but at different abstraction levels. The Triton
version would typically be further lowered to Linalg and then to loops before final code generation.
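To make the lowering discussion concrete, the fused algorithm that both files implement can be sketched as a NumPy reference model (a semantic sketch with illustrative shapes; the real kernel tiles the work per program id):

```python
import numpy as np

def fused_mm_softmax(a, b):
    """Reference model of the fused matmul + row-wise softmax.

    Mirrors the algorithm in the MLIR files at a high level; the real
    kernel tiles the computation per program id.
    """
    c = a @ b                                   # tt.dot / linalg.matmul
    row_max = c.max(axis=1, keepdims=True)      # tt.reduce (max), for stability
    e = np.exp(c - row_max)                     # elementwise exponential
    return e / e.sum(axis=1, keepdims=True)     # tt.reduce (add) + divide

# Illustrative shapes, not the kernel's actual tile sizes
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)
out = fused_mm_softmax(a, b)
assert np.allclose(out.sum(axis=1), 1.0, atol=1e-5)  # each softmax row sums to 1
```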
@bare_fused_mm_softmax_tt.mlir and
@bare_fused_mm_softmax_linalg.mlir and
@bare_fused_mm_softmax_ll.mlir and
@bare_fused_mm_softmax_.ll
✔ read_many_files Will attempt to read and concatenate files using patterns: `bare_fused_mm_softmax_tt.mlir…
### ReadManyFiles Result (Target Dir: `C:\work\triton\spine-triton\fused_bare_mm_softmax`)
Successfully read and concatenated content from **4 file(s)**.
**Processed Files:**
- `bare_fused_mm_softmax_.ll`
- `bare_fused_mm_softmax_linalg.mlir`
- `bare_fused_mm_softmax_ll.mlir`
- `bare_fused_mm_softmax_tt.mlir`
✦ I can see the complete compilation pipeline now. The files represent different stages of the same fused
matmul-softmax operation:
1. Triton IR (`bare_fused_mm_softmax_tt.mlir`): The highest level with GPU-specific operations like tt.dot,
tt.get_program_id, tt.load, and tt.store
2. Linalg IR (`bare_fused_mm_softmax_linalg.mlir`): Middle level with linalg.matmul, linalg.reduce, and
linalg.generic operations
3. LLVM IR (`bare_fused_mm_softmax_ll.mlir`): Lower level with llvm.func and structured control flow
operations
4. Raw LLVM IR (`bare_fused_mm_softmax_.ll`): The lowest level with detailed control flow and memory
operations
The compilation process follows the typical Triton pipeline: Triton → Linalg → Loop-based Affine → LLVM →
Target-specific assembly. Each file represents a different stage of lowering and optimization.
✦ To get RISC-V vector instructions, the raw LLVM IR would need to be compiled by an LLVM backend targeting
RISC-V with the vector extension (RVV) enabled.
Can you explain how the Triton IR is compiled into RISC-V Vector instructions?
Please include:
1. summarize the major steps;
2. describe in detail what each step does;
3. give me an example of what the final Vector assembly instructions will look like.
✦ I'll explain how the Triton IR is compiled into RISC-V Vector instructions.
1. Major Steps
1. Triton IR → Linalg IR (high-level tensor operations)
2. Linalg IR → Affine IR (loop-based representation)
3. Affine IR → LLVM IR (low-level representation)
4. LLVM IR → Target-specific assembly (RISC-V with vector extensions)
2. Detailed Description of Each Step
Step 1: Triton IR to Linalg IR
- Triton-specific operations like tt.dot, tt.load, and tt.store are lowered to standard Linalg operations
- The tt.dot operation becomes a linalg.matmul operation
- Memory operations are converted to standard memref.load and memref.store
- Reduction operations like tt.reduce become linalg.reduce operations
Step 2: Linalg IR to Affine IR
- Linalg operations are decomposed into explicit loop structures using affine loops
- The linalg.matmul operation becomes nested affine.for loops with multiply-accumulate patterns
- Reduction operations become loops with accumulation patterns
- Memory accesses are expressed as affine load/store operations with computed indices
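As an illustration of what Step 2 produces, the affine.for nest for linalg.matmul can be mimicked in plain Python (a didactic sketch of the loop structure, not generated code):

```python
def matmul_loops(a, b, m, n, k):
    """Explicit loop nest mirroring the affine.for lowering of linalg.matmul."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):              # outer affine.for over rows
        for j in range(n):          # affine.for over columns
            acc = 0.0               # accumulator initialized per (i, j)
            for kk in range(k):     # reduction loop: multiply-accumulate
                acc += a[i][kk] * b[kk][j]
            c[i][j] = acc           # affine.store of the accumulated value
    return c

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_loops(a, b, 2, 2, 2) == [[19.0, 22.0], [43.0, 50.0]]
```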
Step 3: Affine IR to LLVM IR
- Affine loops are converted to standard control flow (scf.for, scf.if)
- Memory operations become LLVM pointer operations
- The structured control flow is lowered to basic blocks and branches
- LLVM IR is generated with standard operations that are closer to machine code
Step 4: LLVM IR to RISC-V Vector Assembly
- The LLVM backend pattern-matches operations that can be vectorized
- Consecutive memory loads become vector loads (vlseg.v or similar)
- Arithmetic operations become vector operations (vfmacc.vv, vfadd.vv, etc.)
- Reductions are implemented using RISC-V vector reduction instructions
3. Example of Final Vector Assembly Instructions
Here's what the vector assembly might look like for parts of the computation:
# a_ptr, b_ptr, out_ptr stand for the registers holding the three base pointers
li        t0, 512               # Loop counter
vsetvli   t1, t0, e32, m8       # Set vector length for 32-bit floats
vmv.v.i   v4, 0                 # Zero the accumulator once, before the loop

matmul_loop:
vle32.v   v1, (a_ptr)           # Vector load from matrix A
vle32.v   v2, (b_ptr)           # Vector load from matrix B
vfmacc.vv v4, v1, v2            # Multiply-accumulate: v4 += v1 * v2
addi      a_ptr, a_ptr, 256     # Advance in A (64 elements * 4 bytes)
addi      b_ptr, b_ptr, 256     # Advance in B
sub       t0, t0, 64            # Decrement the counter
bnez      t0, matmul_loop       # Continue if not done

# Softmax computation
vfredmax.vs v7, v4, v7          # Reduce to the row max (v7[0] seeds the reduction)
vfmv.f.s  ft0, v7               # Move the max into a scalar FP register
vfsub.vf  v8, v4, ft0           # Subtract the max from each element (stability)
# RVV has no exp instruction; v9 = exp(v8) comes from a polynomial
# approximation or a vectorized math-library routine
vfredusum.vs v11, v9, v10       # Reduce to the normalization sum (v10[0] = 0)
vfmv.f.s  ft1, v11              # Move the sum into a scalar FP register
vfdiv.vf  v12, v9, ft1          # Divide by the sum to finish the softmax

# Store result
vse32.v   v12, (out_ptr)        # Store the result to the output
✦ These instructions demonstrate:
- Vector loads (vle32.v) for loading multiple data elements at once
- Vector arithmetic (vfmacc.vv, vfsub.vf, vfdiv.vf) for parallel computation
- Vector reductions (vfredmax.vs, vfredusum.vs) for the max/sum operations
- Vector stores (vse32.v) for saving results
- The RISC-V vector extension's support for configurable vector lengths with vsetvli
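The reduction instructions in the example can be modeled in plain Python to show what they compute (a semantic sketch only; tail and mask behavior is ignored, and vfredusum.vs was named vfredsum.vs before RVV 1.0):

```python
def vfredmax_vs(vs2, seed):
    """Model of vfredmax.vs: max over vs2's elements and the scalar seed in vs1[0]."""
    return max(max(vs2), seed)

def vfredusum_vs(vs2, seed):
    """Model of vfredusum.vs: (unordered) sum of vs2's elements plus the seed in vs1[0]."""
    return sum(vs2) + seed

row = [1.0, 4.0, 2.0]
assert vfredmax_vs(row, float("-inf")) == 4.0  # the row max used for stability
assert vfredusum_vs(row, 0.0) == 7.0           # the softmax normalization sum
```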
@bare_fused_mm_softmax_tt.mlir and @bare_fused_mm_softmax_linalg.mlir and @bare_fused_mm_softmax_ll.mlir and @bare_fused_mm_softmax_.ll and @bare_fused_mm_softmax_.o.dmp and @test_fused_mm_softmax.single_run.py
Can you explain how the Triton IR is compiled into RISC-V Vector instructions?
Please include:
1. summarize the major steps;
2. describe in detail what each step does;
3. give me an example of what the final Vector assembly instructions will look like.
✔ read_many_files Will attempt to read and concatenate files using patterns: `bare_fused_mm_softmax_tt.mli…
### ReadManyFiles Result (Target Dir: `C:\work\triton\spine-triton\fused_bare_mm_softmax`)
Successfully read and concatenated content from **6 file(s)**.
**Processed Files:**
- `bare_fused_mm_softmax_.ll`
- `bare_fused_mm_softmax_.o.dmp`
- `bare_fused_mm_softmax_linalg.mlir`
- `bare_fused_mm_softmax_ll.mlir`
- `bare_fused_mm_softmax_tt.mlir`
- `test_fused_mm_softmax.single_run.py`
✦ I'll explain how the Triton IR is compiled into RISC-V Vector instructions, analyzing the files you've
provided.
1. Major Steps Summary
The compilation pipeline from Triton IR to RISC-V assembly follows these major steps:
1. Triton IR (TT.MLIR): High-level Triton operations like tt.dot, tt.load, tt.store, and tt.reduce
2. Linalg IR: Lowering to standard MLIR operations like linalg.matmul, linalg.reduce
3. LLVM IR: Further lowering to LLVM operations with memory management and parallel dispatch
4. RISC-V Assembly: Final compilation to RISC-V assembly code
2. Detailed Step-by-Step Process
Step 1: Triton IR Analysis
The bare_fused_mm_softmax_tt.mlir file contains the high-level representation with Triton-specific
operations:
- tt.dot for matrix multiplication
- tt.reduce for finding max/sum in softmax
- tt.load/tt.store for memory operations
- tt.get_program_id for parallelization
Step 2: Linalg Lowering
In bare_fused_mm_softmax_linalg.mlir, we see the Triton operations converted to standard operations:
- tt.dot becomes linalg.matmul
- tt.reduce becomes linalg.reduce with max/add operations
- Memory operations use memref abstractions
Step 3: LLVM Lowering
The bare_fused_mm_softmax_ll.mlir shows the lowering to LLVM with:
- Memory management using spine_malloc, spine_thread_malloc
- Parallel execution with spine_parallel_dispatch_1d
- Complex nested loops for the fused computation
Step 4: RISC-V Assembly Generation
In bare_fused_mm_softmax_.o.dmp, we see the final assembly code with:
- Scalar floating-point operations (e.g., flw, fsw, fmul.s, fadd.s)
- Vector extension instructions are available but not used in the main computation loops
3. RISC-V Vector Assembly Examples
Looking at the assembly in bare_fused_mm_softmax_.o.dmp, I notice that while the RISC-V vector extension is available (shown by csrr a0,vlenb at offset 0x55c), the main computational loops actually use scalar instructions rather than vector instructions.
For example, in the computational core:
45c: 0006a787  flw    fa5,0(a3)    # Load float from address in a3 into fa5
460: 00072707  flw    fa4,0(a4)    # Load float from address in a4 into fa4
47a: 10e7f7d3  fmul.s fa5,fa5,fa4  # Multiply: fa5 = fa5 * fa4
47e: 00f6f7d3  fadd.s fa5,fa3,fa5  # Add: fa5 = fa3 + fa5
482: 00f4a027  fsw    fa5,0(s1)    # Store result to address in s1
These are scalar floating-point operations, not vector operations. For the same load-multiply-add-store pattern, vectorized code would look like this:
# Example of what vectorized instructions would look like:
vsetvli  t0, a0, e32, m8  # Set vector length for 32-bit floats, LMUL = 8
vle32.v  v1, (a1)         # Vector load of 32-bit floats from address in a1
vle32.v  v2, (a2)         # Vector load of 32-bit floats from address in a2
vfmul.vv v3, v1, v2       # Vector multiply: v3 = v1 * v2
vfadd.vv v4, v3, v0       # Vector add: v4 = v3 + v0
vse32.v  v4, (a3)         # Vector store of the result to address in a3
The compilation process has not effectively utilized the RISC-V vector extension in the main computational kernels, even though the hardware capability is detected (as shown by the vlenb CSR access). This suggests one or more of the following:
1. The current Triton-to-RISC-V compiler backend doesn't fully optimize for vector instructions
2. The loop structure or data dependencies prevent effective vectorization
3. The compilation flags don't enable vectorization optimizations
✦ For proper vectorization, the Triton compiler would need to generate vectorized code during the lowering process, possibly through loop transformations and vector code generation passes that are currently not being applied effectively.
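A quick way to confirm this diagnosis on any .o.dmp file is to count vector versus scalar FP mnemonics in the disassembly. Below is a hypothetical helper script (the mnemonic patterns are illustrative, not exhaustive, and `count_fp_ops` is a name introduced here, not part of the toolchain):

```python
import re

# RVV mnemonics have a v prefix and a .suffix (vle32.v, vfmacc.vv, ...);
# scalar single-precision FP mnemonics include flw, fsw, fmul.s, fadd.s, ...
VECTOR_RE = re.compile(r"\bv[a-z]\w*\.[a-z]+")
SCALAR_FP_RE = re.compile(r"\bf(?:lw|sw|add\.s|sub\.s|mul\.s|div\.s)\b")

def count_fp_ops(dump_text):
    """Count vector vs. scalar FP instructions in objdump-style text."""
    vector = sum(len(VECTOR_RE.findall(line)) for line in dump_text.splitlines())
    scalar = sum(len(SCALAR_FP_RE.findall(line)) for line in dump_text.splitlines())
    return vector, scalar

sample = """\
45c: 0006a787  flw    fa5,0(a3)
47a: 10e7f7d3  fmul.s fa5,fa5,fa4
47e: 00f6f7d3  fadd.s fa5,fa3,fa5
482: 00f4a027  fsw    fa5,0(s1)
"""
assert count_fp_ops(sample) == (0, 4)  # all scalar: the kernel did not vectorize
```

A dump whose hot loops vectorized properly would instead show a nonzero vector count and few scalar FP instructions in the inner loops.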
fused_bare_mm_softmax.zip (98.3 KB)