Reading Spine-Triton's MLIR Code: iFlow Explains the Compilation Pipeline with Example Vector Instructions

By analyzing the object binary files in ir_dumps, we found that the Triton kernel was not compiled into Vector instructions, which is why its performance is far worse than the cpublas operator that PyTorch calls.
Below, we ask iFlow心流 to help analyze the compilation flow that would generate Vector instructions starting from the linalg code. What follows is a record of several rounds of the conversation:

@bare_fused_mm_softmax_tt.mlir and @bare_fused_mm_softmax_linalg.mlir

  ✔  read_many_files Will attempt to read and concatenate files using patterns: `bare_fused_mm_softmax_tt.mlir…
     ### ReadManyFiles Result (Target Dir: `C:\work\triton\spine-triton\fused_bare_mm_softmax`)

     Successfully read and concatenated content from **2 file(s)**.

     **Processed Files:**
     - `bare_fused_mm_softmax_linalg.mlir`
     - `bare_fused_mm_softmax_tt.mlir`

✦ I can see both MLIR files now. The Triton (tt) version and the Linalg version represent different stages of  
  the same computation. The Triton version is at a higher level with GPU-specific operations like tt.dot,      
  tt.load, and tt.store, while the Linalg version has been lowered to more general operations.

  The Triton version shows:
   1. GPU-specific operations using tt.get_program_id to get thread IDs
   2. Tensor operations like tt.dot for matrix multiplication
   3. Triton-specific reductions using tt.reduce
   4. Memory operations using tt.load and tt.store with pointer arithmetic

  The Linalg version shows:
   1. More general operations suitable for CPU execution
   2. Linalg-specific operations like linalg.matmul and linalg.reduce
   3. Affine map indexing for memory access patterns

✦ Both implement the same fused matmul-softmax algorithm but at different abstraction levels. The Triton       
  version would typically be further lowered to Linalg and then to loops before final code generation.

@bare_fused_mm_softmax_tt.mlir and
@bare_fused_mm_softmax_linalg.mlir and
@bare_fused_mm_softmax_ll.mlir and
@bare_fused_mm_softmax_.ll

  ✔  read_many_files Will attempt to read and concatenate files using patterns: `bare_fused_mm_softmax_tt.mlir…
     ### ReadManyFiles Result (Target Dir: `C:\work\triton\spine-triton\fused_bare_mm_softmax`)

     Successfully read and concatenated content from **4 file(s)**.

     **Processed Files:**
     - `bare_fused_mm_softmax_.ll`
     - `bare_fused_mm_softmax_linalg.mlir`
     - `bare_fused_mm_softmax_ll.mlir`
     - `bare_fused_mm_softmax_tt.mlir`

✦ I can see the complete compilation pipeline now. The files represent different stages of the same fused      
  matmul-softmax operation:

   1. Triton IR (`bare_fused_mm_softmax_tt.mlir`): The highest level with GPU-specific operations like tt.dot, 
      tt.get_program_id, tt.load, and tt.store

   2. Linalg IR (`bare_fused_mm_softmax_linalg.mlir`): Middle level with linalg.matmul, linalg.reduce, and     
      linalg.generic operations

   3. LLVM-dialect MLIR (`bare_fused_mm_softmax_ll.mlir`): Lower level with llvm.func and explicit control-flow
      operations

   4. Raw LLVM IR (`bare_fused_mm_softmax_.ll`): The lowest level with detailed control flow and memory        
      operations

  The compilation process follows the typical Triton pipeline: Triton → Linalg → Loop-based Affine → LLVM →    
  Target-specific assembly. Each file represents a different stage of lowering and optimization.

✦ To get RISC-V vector instructions, the raw LLVM IR would need to be compiled with an LLVM backend configured 
  for RISC-V target with vector extensions enabled.

Can you explain how the Triton IR is compiled into RISC-V Vector instructions?
Please include:
1. Summarize the major steps;
2. Describe in detail what each step does;
3. Give an example of what the final Vector assembly instructions will look like.

✦ I'll explain how the Triton IR is compiled into RISC-V Vector instructions.

  1. Major Steps

   1. Triton IR → Linalg IR (high-level tensor operations)
   2. Linalg IR → Affine IR (loop-based representation)
   3. Affine IR → LLVM IR (low-level representation)
   4. LLVM IR → Target-specific assembly (RISC-V with vector extensions)

  2. Detailed Description of Each Step

  Step 1: Triton IR to Linalg IR
   - Triton-specific operations like tt.dot, tt.load, and tt.store are lowered to standard Linalg operations   
   - The tt.dot operation becomes a linalg.matmul operation
   - Memory operations are converted to standard memref.load and memref.store
   - Reduction operations like tt.reduce become linalg.reduce operations

  Step 2: Linalg IR to Affine IR
   - Linalg operations are decomposed into explicit loop structures using affine loops
   - The linalg.matmul operation becomes nested affine.for loops with multiply-accumulate patterns
   - Reduction operations become loops with accumulation patterns
   - Memory accesses are expressed as affine load/store operations with computed indices
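  The loop nest that linalg.matmul decomposes into can be mirrored in plain Python. This is an illustrative sketch of the affine.for structure, not code taken from the generated IR:

```python
def matmul_loops(a, b, m, n, k):
    """Triple affine.for-style nest with a multiply-accumulate innermost loop."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):            # outer affine.for over rows of A
        for j in range(n):        # middle affine.for over columns of B
            acc = 0.0
            for p in range(k):    # reduction loop: load, fmul, fadd
                acc += a[i][p] * b[p][j]
            c[i][j] = acc         # affine store of the accumulated value
    return c

# [[1, 2]] @ [[3], [4]] = [[1*3 + 2*4]] = [[11.0]]
assert matmul_loops([[1, 2]], [[3], [4]], 1, 1, 2) == [[11.0]]
```

The innermost `acc += a[i][p] * b[p][j]` is exactly the multiply-accumulate pattern that a vectorizing backend would later map onto vfmacc.vv.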

  Step 3: Affine IR to LLVM IR
   - Affine loops are converted to standard control flow (scf.for, scf.if)
   - Memory operations become LLVM pointer operations
   - The structured control flow is lowered to basic blocks and branches
   - LLVM IR is generated with standard operations that are closer to machine code

  Step 4: LLVM IR to RISC-V Vector Assembly
   - The LLVM backend pattern-matches operations that can be vectorized
   - Consecutive memory loads become vector loads (vle32.v, or strided/segment variants such as vlse32.v and vlseg2e32.v)
   - Arithmetic operations become vector operations (vfmacc.vv, vfadd.vv, etc.)
   - Reductions are implemented using RISC-V vector reduction instructions

  3. Example of Final Vector Assembly Instructions

  Here's what the vector assembly might look like for parts of the computation:

    # Configure the vector unit for 32-bit floats (LMUL=8 register grouping)
    li t0, 512                     # Total number of elements to process
    vsetvli t1, t0, e32, m8, ta, ma
    slli t2, t1, 2                 # Bytes consumed per iteration (vl * 4)

    # Matrix multiplication loop
    vmv.v.i v16, 0                 # Zero the accumulator register group
    matmul_loop:
      vle32.v  v0, (a0)            # Load a strip of matrix A
      vle32.v  v8, (a1)            # Load a strip of matrix B
      vfmacc.vv v16, v0, v8        # Multiply-accumulate: v16 += v0 * v8
      add  a0, a0, t2              # Advance the A pointer
      add  a1, a1, t2              # Advance the B pointer
      sub  t0, t0, t1              # Decrement the element counter by vl
      bnez t0, matmul_loop         # Continue if not done

    # Softmax computation
    vfredmax.vs v24, v16, v24      # Reduce to the row max (numerical stability)
    vfmv.f.s ft0, v24              # Move the max into a scalar FP register
    vfsub.vf v16, v16, ft0         # Subtract the max from each element
    # Note: RVV has no vector exp instruction; the exponential is computed by a
    # polynomial approximation or a vectorized math-library call (result in v16)
    vfredusum.vs v24, v16, v24     # Reduce to the sum of exponentials
    vfmv.f.s ft1, v24              # Move the sum into a scalar FP register
    vfdiv.vf v16, v16, ft1         # Divide by the sum for softmax

    # Store result
    vse32.v v16, (a2)              # Store the result row to the output

✦ These instructions demonstrate:
   - Unit-stride vector loads for moving many data elements per instruction
   - Vector arithmetic (vfmacc.vv, vfsub.vf, vfdiv.vf) for parallel computation
   - Vector reductions (vfredmax.vs, vfredusum.vs) for the max/sum steps of softmax
   - Vector stores for writing results back to memory
   - The RISC-V vector extension's support for configurable vector lengths via vsetvli
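  The max-reduce, subtract, exponentiate, sum-reduce, divide sequence above is the standard numerically stable softmax. In scalar Python the same algorithm reads (a sketch of the algorithm, not generated code):

```python
import math

def softmax_row(x):
    """Stable softmax: max-reduce, shift, exp, sum-reduce, normalize."""
    m = max(x)                         # reduction: row max (numerical stability)
    e = [math.exp(v - m) for v in x]   # subtract the max, then exponentiate
    s = sum(e)                         # reduction: sum of exponentials
    return [v / s for v in e]          # divide by the sum to normalize

out = softmax_row([1.0, 2.0, 3.0])
assert abs(sum(out) - 1.0) < 1e-9      # probabilities sum to 1
assert out[2] > out[1] > out[0]        # larger inputs get larger weights
```

Subtracting the row max first keeps every exponent non-positive, so `math.exp` never overflows even for large inputs.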

@bare_fused_mm_softmax_tt.mlir and @bare_fused_mm_softmax_linalg.mlir and @bare_fused_mm_softmax_ll.mlir and @bare_fused_mm_softmax_.ll and @bare_fused_mm_softmax_.o.dmp and @test_fused_mm_softmax.single_run.py
Can you explain how the Triton IR is compiled into RISC-V Vector instructions?
Please include:
1. Summarize the major steps;
2. Describe in detail what each step does;
3. Give an example of what the final Vector assembly instructions will look like.

  ✔  read_many_files Will attempt to read and concatenate files using patterns: `bare_fused_mm_softmax_tt.mli…
     ### ReadManyFiles Result (Target Dir: `C:\work\triton\spine-triton\fused_bare_mm_softmax`)

     Successfully read and concatenated content from **6 file(s)**.

     **Processed Files:**
     - `bare_fused_mm_softmax_.ll`
     - `bare_fused_mm_softmax_.o.dmp`
     - `bare_fused_mm_softmax_linalg.mlir`
     - `bare_fused_mm_softmax_ll.mlir`
     - `bare_fused_mm_softmax_tt.mlir`
     - `test_fused_mm_softmax.single_run.py`

✦ I'll explain how the Triton IR is compiled into RISC-V Vector instructions, analyzing the files you've
  provided.

  1. Major Steps Summary

  The compilation pipeline from Triton IR to RISC-V assembly follows these major steps:

   1. Triton IR (TT.MLIR): High-level Triton operations like tt.dot, tt.load, tt.store, and tt.reduce
   2. Linalg IR: Lowering to standard MLIR operations like linalg.matmul, linalg.reduce
   3. LLVM IR: Further lowering to LLVM operations with memory management and parallel dispatch
   4. RISC-V Assembly: Final compilation to RISC-V assembly code

  2. Detailed Step-by-Step Process

  Step 1: Triton IR Analysis
  The bare_fused_mm_softmax_tt.mlir file contains the high-level representation with Triton-specific
  operations:
   - tt.dot for matrix multiplication
   - tt.reduce for finding max/sum in softmax
   - tt.load/tt.store for memory operations
   - tt.get_program_id for parallelization

  Step 2: Linalg Lowering
  In bare_fused_mm_softmax_linalg.mlir, we see the Triton operations converted to standard operations:
   - tt.dot becomes linalg.matmul
   - tt.reduce becomes linalg.reduce with max/add operations
   - Memory operations use memref abstractions

  Step 3: LLVM Lowering
  The bare_fused_mm_softmax_ll.mlir shows the lowering to LLVM with:
   - Memory management using spine_malloc, spine_thread_malloc
   - Parallel execution with spine_parallel_dispatch_1d
   - Complex nested loops for the fused computation
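  Conceptually, a 1-D dispatch such as spine_parallel_dispatch_1d invokes the kernel body once per program id over a grid. The following is a hypothetical serial model of that contract, not the actual runtime API (a real runtime would distribute the calls across worker threads):

```python
def parallel_dispatch_1d(kernel, grid_size, *args):
    """Invoke `kernel` once per program id, modeling a 1-D grid launch."""
    for pid in range(grid_size):
        kernel(pid, *args)

def fill_kernel(pid, out, block):
    # Each program instance handles one contiguous block of the output,
    # the same role tt.get_program_id plays in the Triton IR.
    for i in range(pid * block, (pid + 1) * block):
        out[i] = float(pid)

out = [0.0] * 8
parallel_dispatch_1d(fill_kernel, 4, out, 2)
assert out == [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```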

  Step 4: RISC-V Assembly Generation
  In bare_fused_mm_softmax_.o.dmp, we see the final assembly code with:
   - Scalar floating-point operations (e.g., flw, fsw, fmul.s, fadd.s)
   - Vector extension instructions are available but not used in the main computation loops

  3. RISC-V Vector Assembly Examples

  Looking at the assembly in bare_fused_mm_softmax_.o.dmp, I notice that while the RISC-V vector extension is available (shown by csrr a0,vlenb at offset 0x55c), the main computational loops actually use scalar instructions rather than vector instructions.

  For example, in the computational core:
    45c: 0006a787   flw     fa5,0(a3)     # Load float from address in a3 into fa5
    460: 00072707   flw     fa4,0(a4)     # Load float from address in a4 into fa4
    47a: 10e7f7d3   fmul.s  fa5,fa5,fa4   # Multiply: fa5 = fa5 * fa4
    47e: 00f6f7d3   fadd.s  fa5,fa3,fa5   # Add: fa5 = fa3 + fa5
    482: 00f4a027   fsw     fa5,0(s1)     # Store the result to address in s1

  These are scalar floating-point operations, not vectorized ones. For comparison, a vectorized version of the same multiply-add would look like this:

    # Example of what vectorized instructions would look like:
    vsetvli  t0, a0, e32, m8, ta, ma  # Set vector length for 32-bit floats, LMUL=8
    vle32.v  v8, (a1)                 # Vector load 32-bit floats from address in a1
    vle32.v  v16, (a2)                # Vector load 32-bit floats from address in a2
    vfmul.vv v24, v8, v16             # Vector multiply: v24 = v8 * v16
    vfadd.vv v24, v24, v0             # Vector add: v24 = v24 + v0
    vse32.v  v24, (a3)                # Vector store the result to address in a3

  The compilation process has not effectively utilized the RISC-V vector extensions in the main computational kernels, even though the hardware capability is detected (as shown by the vlenb CSR access). This suggests either:
   1. The current Triton-to-RISC-V compiler backend doesn't fully optimize for vector instructions
   2. The loop structure or data dependencies prevent effective vectorization
   3. The compilation flags don't enable vectorization optimizations
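  A quick way to verify this kind of finding is to scan the disassembly dump for RVV mnemonics. Here is an illustrative helper (assuming a textual dump in the style of bare_fused_mm_softmax_.o.dmp, one instruction per line):

```python
import re

# Mnemonic prefixes covering RVV loads/stores, arithmetic, and configuration.
RVV_PREFIXES = ("vle", "vse", "vls", "vss", "vlseg", "vsseg",
                "vf", "vadd", "vmul", "vsetvli", "vsetivli")

def count_vector_ops(dump_text):
    """Count occurrences of RVV mnemonics in a disassembly dump."""
    counts = {}
    for line in dump_text.splitlines():
        m = re.search(r"\b(v[a-z0-9.]+)\s", line)
        if m and m.group(1).startswith(RVV_PREFIXES):
            counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts

scalar_dump = "45c: 0006a787  flw fa5,0(a3)\n47a: 10e7f7d3  fmul.s fa5,fa5,fa4"
vector_dump = "100: 020574d7  vsetvli s1,a0,e32,m8,ta,ma\n104: 0205e407  vle32.v v8,(a1)"
assert count_vector_ops(scalar_dump) == {}
assert count_vector_ops(vector_dump) == {"vsetvli": 1, "vle32.v": 1}
```

An empty result over the hot loops, combined with a lone `csrr a0,vlenb`, is consistent with the observation that the hardware capability is probed but never exploited.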

✦ For proper vectorization, the Triton compiler would need to generate vectorized code during the lowering process, possibly through loop transformations and vector code generation passes that are currently not being applied effectively.

fused_bare_mm_softmax.zip (98.3 KB)