## ThreadFuser: A SIMT Analysis Framework for MIMD Programs

Ahmad Alawneh, Ni Kang, Mahmoud Khairy\*, Timothy G. Rogers



Elmore Family School of Electrical and Computer Engineering \*Currently at AMD

**MICRO 2024** 

## **Expanding the Parallelism with SIMT**

- Rising Demand for Parallelism
  - Traditional parallel programs use MIMD, primarily targeting CPUs
- Shifting Paradigms
  - The slowing of Moore's Law
  - CPUs face energy-efficiency limitations
- Emergence of Accelerators
  - GPUs with SIMT are leading this trend



#### **100X GPU DEVELOPERS**



SIMT: Single Instruction Multiple Threads

2024 NVIDIA Corporation Annual Review

### **Pathway of Program Evolution**



#### What is Next?




## Challenges





- Porting effort is needed to exploit the SIMT accelerator
- Porting is risky
  - Time-consuming
  - Speedups are not guaranteed
- Hardware designers struggle to analyze the potential efficiency of SIMT hardware due to a lack of diverse software










Goal: If we already have a parallel CPU version, can we analyze the performance of a MIMD application on SIMT hardware without porting?


#### Single Instruction Multiple Thread (SIMT)

- Execution Model
  - Based on SPMD running on SIMD hardware
  - Threads grouped into warps/wavefronts
- CUDA, ROCm, and OpenCL are popular frameworks
- Efficiency Gains
  - Pipeline: Fetch, decode, and schedule once per warp
  - Memory: Coalesced accesses reduce traffic

Warp/wavefront: 32 or 64 threads
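The memory-side gain from coalescing can be illustrated with a small sketch that counts how many memory segments a single warp-wide load touches. The 32-byte segment size, the 4-byte access width, and the function name `count_transactions` are assumptions chosen for illustration; they are not taken from the talk or from any particular GPU.

```python
# Illustrative sketch only: count the memory transactions one warp generates.
# SEGMENT_BYTES is an assumed transaction granularity, not a value from the talk.
SEGMENT_BYTES = 32
WARP_SIZE = 32


def count_transactions(addresses, access_bytes=4, segment_bytes=SEGMENT_BYTES):
    """Return the number of distinct segments touched by one warp-wide access."""
    segments = set()
    for addr in addresses:
        first = addr // segment_bytes
        last = (addr + access_bytes - 1) // segment_bytes
        segments.update(range(first, last + 1))
    return len(segments)


# Each thread reads the next consecutive 4-byte word: accesses coalesce.
coalesced = [4 * lane for lane in range(WARP_SIZE)]
# Each thread reads one word every 128 bytes: every access needs its own segment.
strided = [128 * lane for lane in range(WARP_SIZE)]

print(count_transactions(coalesced))  # 4
print(count_transactions(strided))    # 32
```

Under these assumptions the same 32 loads cost 4 transactions when adjacent threads touch adjacent words and 32 transactions when they do not.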



# **Control-Flow and Memory Divergence in SIMT**

#### **Control Flow Divergence**



#### SIMT Stack

| Rec. point | Next BBL | Active mask |     |
|------------|----------|-------------|-----|
| -          | D        | 1111        |     |
| D          | C        | 1100        |     |
| D          | B        | 0011        | TOS |

SIMT Efficiency
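A minimal sketch of how a stack in this state drains may help. It assumes a hypothetical four-thread warp whose block A has just executed a divergent branch to B and C with reconvergence at D, and it assumes every basic block issues one instruction. SIMT efficiency is computed here as the average fraction of active threads per issued block, which is the usual definition but is stated as an assumption about what the slide's figure showed.

```python
# Illustrative sketch only: drain a SIMT stack and compute SIMT efficiency.
# The CFG (A -> B/C -> D) and one-instruction-per-block are assumptions.

def run_simt_stack():
    # Fall-through successor of each block; D is the last block.
    successors = {"B": "D", "C": "D", "D": None}
    # Stack right after the divergent branch in A, bottom to top, as
    # (reconvergence point, next basic block, active mask).
    stack = [(None, "D", "1111"), ("D", "C", "1100"), ("D", "B", "0011")]
    issued = [("A", "1111")]  # A ran with all threads active
    while stack:
        reconv, bbl, mask = stack.pop()  # always execute the top of stack (TOS)
        # Run until this entry reaches its reconvergence point, then pop.
        while bbl is not None and bbl != reconv:
            issued.append((bbl, mask))
            bbl = successors[bbl]
    return issued


def simt_efficiency(issued):
    """Average fraction of active threads over all issued blocks."""
    warp_width = len(issued[0][1])
    active = sum(mask.count("1") for _, mask in issued)
    return active / (len(issued) * warp_width)


trace = run_simt_stack()
print(trace)                   # [('A', '1111'), ('B', '0011'), ('C', '1100'), ('D', '1111')]
print(simt_efficiency(trace))  # 0.75
```

B and C each run with half the warp masked off, so the divergent region pulls efficiency for this toy trace down to 75%.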

#### **Memory Divergence**



### **Previous Work**





Prior approaches depend on machine-learning models to predict execution time \*

#### Limitations:

- Single Metric Focus
- Small Codebase
- No Architecture Exploration

\* 1. Predicting Cross-Architecture Performance of Parallel Programs (IPDPS 2024)

2. Cross-Architecture Performance Prediction (XAPP) (MICRO 2015)








- Low overhead porting estimate
- Detailed analysis for developer and architect








**ThreadFuser** is an analysis framework that enables performance analysis of any MIMD CPU program on SIMT hardware

### **SYSTEM OVERVIEW**

**Output:** Cycle-level performance analysis using a trace-driven simulator



#### **ThreadFuser Tracer and Analyzer**



## **Experimental Setup**

- 36 workloads spanning task-parallel and data-parallel programs

| Benchmark suite | Parallelism             |
|-----------------|-------------------------|
| Rodinia         | data-parallel           |
| Parapoly        | data-parallel           |
| µsuite          | task-parallel           |
| DeathStarBench  | task-parallel           |
| ParSec 3.0      | task- and data-parallel |

- Tracer built with Intel's Pin tool
- SIMT efficiency and memory divergence correlated against an NVIDIA H100
  - 11 applications with identical CPU and GPU implementations
- Accel-Sim for performance simulation (x86 support added)
  - For task-parallel microservice applications, we batch requests to the same microservice and run them on SIMT
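The request batching in the last bullet can be sketched as follows. This is only an illustration of the idea, not the framework's actual implementation: the `batch_requests` helper, the 32-thread warp width, and the `(service, request_id)` request format are all assumptions, and the service names are borrowed from the DeathStarBench workloads purely as labels.

```python
# Illustrative sketch only: group requests for the same microservice so that
# each group can be run as one warp. Names and warp width are assumptions.
from collections import defaultdict

WARP_SIZE = 32  # assumed warp width


def batch_requests(requests, warp_size=WARP_SIZE):
    """Group (service, request_id) pairs into per-service batches of at most warp_size."""
    by_service = defaultdict(list)
    for service, request_id in requests:
        by_service[service].append(request_id)
    warps = []
    for service, ids in by_service.items():
        # Split each service's requests into warp-sized chunks.
        for start in range(0, len(ids), warp_size):
            warps.append((service, ids[start:start + warp_size]))
    return warps


requests = [("UniqueID", i) for i in range(40)] + [("UserTag", i) for i in range(10)]
warps = batch_requests(requests)
print([(service, len(ids)) for service, ids in warps])
# [('UniqueID', 32), ('UniqueID', 8), ('UserTag', 10)]
```

Requests to different microservices never share a batch in this sketch, so every thread in a warp starts from the same handler code; partially filled batches simply leave some lanes idle.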

## **Use Case: Quick Porting Estimation**

- For Developers:
  - Provides SIMT efficiency projections, helping to assess the potential effort required for porting
- For Architects:
  - Provides insights for designing future SIMT architectures and accelerators suited to diverse applications



#### **Use Case: Detailed Performance Analysis**

Full cycle level analysis using generated SIMT-based traces and trace-driven simulation



Figure legend: GPU speedup (real HW)

### Conclusion

#### Summary:

- A framework for comprehensive performance analysis of any MIMD CPU binary on SIMT hardware

#### Impact:

- Enables performance evaluation on SIMT hardware without code porting
- Supports SIMT hardware design by offering efficiency insights across diverse applications

Questions





# Backup




#### **Use Case: Source Code Analysis**

- ThreadFuser diagnoses low SIMT efficiency in ported MIMD implementations when source code is available.
- Example: HDSearch-Midtier.












#### **Correlation**








### Synchronization study










- ThreadFuser is an analysis framework that predicts the performance of MIMD CPU programs on SIMT hardware.
- Analyzes dynamic traces from unmodified CPU binaries.
- Offers insights into control-flow efficiency, memory divergence, and synchronization on SIMT hardware.
- Integrates with simulators such as Accel-Sim for cycle-level analysis.



#### Workloads

| Group                                | Suite            | Workload                         | #SIMT Threads |
|--------------------------------------|------------------|----------------------------------|---------------|
| Correlation workloads                | Rodinia 3.1 [12] | BFS                              | 4K            |
|                                      |                  | Nearest Neighbors (NN)           | 42K           |
|                                      |                  | Stream Cluster (SC)              | 16K           |
|                                      |                  | b+tree                           | 4K            |
|                                      |                  | Particle Filter (PF)             | 4K            |
|                                      | Parapoly [49]    | BFS                              | 4K            |
|                                      |                  | Connected Components (CC)        | 4K            |
|                                      |                  | Page Rank                        | 4K            |
|                                      |                  | Nbody                            | 4K            |
|                                      | Micro Benchmark  | VectorAdd                        | 1K            |
|                                      |                  | Uncoalesced Vector operation     | 1K            |
| Workloads with no GPU implementation | µsuite [41]      | McRouter (Memcached, Mid, Leaf)  | 2K            |
|                                      |                  | TextSearch (Mid, Leaf)           | 2K            |
|                                      |                  | HDImageSearch (Mid, Leaf)        | 2K            |
|                                      | DeathStarBench   | Post                             | 2K            |
|                                      |                  | Text                             | 2K            |
|                                      |                  | URLShort                         | 2K            |
|                                      |                  | UniqueID                         | 2K            |
|                                      |                  | UserTag                          | 2K            |
|                                      |                  | User                             | 2K            |
|                                      | Others           | Rotate [7]                       | 1K            |
|                                      |                  | MD5 [7]                          | 512           |
|                                      | ParSec 3.0 [10]  | blackscholes                     | 1K            |
|                                      |                  | streamcluster                    | 8K            |
|                                      |                  | bodytrack                        | 1K            |
|                                      |                  | facesim                          | 1K            |
|                                      |                  | fluidanimate                     | 4K            |
|                                      |                  | freqmine                         | 2K            |
|                                      |                  | swaptions                        | 512           |
|                                      |                  | vips                             | 512           |
|                                      |                  | x264                             | 4K            |
|                                      | Others           | Pigz [1]                         | 128           |

TABLE I: Studied Workloads. #SIMT threads is the number of threads simulated by ThreadFuser.




#### **Memory Transactions**



