
Notes on TensorIR


An abstraction for automatic tensorized program optimization.


1. Introduction

Challenge: supporting a broad range of models across a broad spectrum of devices.

The trend of hardware specialization further complicates this problem.

For example, NVIDIA introduces specialized primitives that use Tensor Cores to speed up computation.

Domain experts have started to develop micro-kernel primitives. These hardware instructions and micro-kernels operate on multi-dimensional tensor regions and perform tensor operations such as load, dot product, and matmul.

The goal of modern ML systems: optimize programs that contain hierarchical loop nests, multi-dimensional loads, and tensor intrinsics.

The limitation of existing methods: they only search over a program space of loop-nest transformations and cannot handle tensorized programs.

Key challenge: construct an abstraction and search space that cover both loop-nest transformations and tensorized computation.

Two components of TensorIR: the TensorIR abstraction itself (Section 3) and an auto-scheduling framework for tensorized programs (Section 4).

2. Overview

The workflow of how a domain expert optimizes a tensorized program (see the paper's workflow figure):

Divide-and-conquer approach: divide the original problem into sub-problems of tensorized computation and of the loop nests that use it, then optimize them separately.

Existing methods:

  1. bottom-up (Halide, TVM…): models the search space using loop-nest iterators around scalar operation bodies, and finds the best loop-nest transformation.
  2. top-down: decompose the problem into sub-problems.

Introduce a new abstraction, the block, into the loop nests. A block carries just enough signature information to isolate the inner problem space from the outer one, so the two sub-problems can be transformed independently. For example, a block can represent a tensorized computation primitive of a hardware backend.

Construct a search space of possible ways to divide the problem, guided by the hardware tensor computation primitives.

3. TensorIR abstraction

An example of a TensorIR program:
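
Below is a minimal sketch in TVMScript (TVM's Python dialect); the 16×16 matmul shapes and names mirror the paper's running example, and the exact syntax is an assumption tied to recent TVM versions:

```python
from tvm.script import tir as T

@T.prim_func
def matmul(A: T.Buffer((16, 16), "float32"),
           B: T.Buffer((16, 16), "float32"),
           C: T.Buffer((16, 16), "float32")) -> None:
    for y, x, k in T.grid(16, 16, 16):
        with T.block("C"):
            # Bind block iterators: vy, vx are spatial ("S"), vk is a reduction ("R").
            vy, vx, vk = T.axis.remap("SSR", [y, x, k])
            with T.init():
                C[vy, vx] = T.float32(0)
            C[vy, vx] = C[vy, vx] + A[vy, vk] * B[vk, vx]
```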

It contains three main elements: multi-dimensional buffers, loop nests, and blocks.

A block can contain either nested loop nests with sub-blocks, or a sequence of imperative statements that constitute the body of the computation.

3.1 Block

A block for the matmul (the "C" block in the example above):

Rationale. To isolate tensorized computation, introduce a block signature that contains sufficient dependency information for transformations. The signature can also be used to verify the correctness of iterator bindings during transformations.

Block Iterator Domain. The block signature stores the iterator domain information and the constraints on the iterators. In the example above, vx, vy, and vk must be bound to iterators in the domain [0, 16). vk is a reduction axis, so it cannot be bound to a parallel loop unless the reduction is atomic.

Access Region and Dependency. Mark each block's dependencies only with respect to the multi-dimensional buffer regions it reads and writes; this enables a broader range of transformations, such as data layout transformation and recomputation.

Reduction Block and Initialization. Introduce an optional initialization statement for blocks that perform reduction; it is executed during the first iteration of the reduction. Transformation primitives are provided to convert between the two-block representation (separate init and update blocks) and the init-block representation.
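
A hedged sketch of moving between the two representations with TVM's `decompose_reduction` schedule primitive, continuing with the `matmul` sketch above:

```python
import tvm
from tvm import tir

sch = tir.Schedule(matmul)
block_c = sch.get_block("C")
y, x, k = sch.get_loops(block_c)
# Split the init-block representation into separate init + update blocks,
# placing the initialization just above the reduction loop k.
init_block = sch.decompose_reduction(block_c, k)
```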

3.2 Scheduling Transformation

A block is schedulable if it only contains loop nests with sub-blocks as their leaves. We can transform the loop nests and sub-block computation locations within a schedulable block by analyzing the sub-block signatures and their dependency information. An opaque block can contain schedulable sub-blocks, and a schedulable block can contain opaque blocks (such as a Tensor Core computation).

Loop Transformations. Loop tiling (split, reorder) and compute-location mutation.

We can also bind loops to GPU threads and provide annotation hints such as vectorization and unrolling; see the sketch below.
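
A hedged sketch of these primitives using TVM's TensorIR schedule API (tile sizes and loop names are illustrative; `matmul` is the PrimFunc sketched earlier):

```python
import tvm
from tvm import tir

sch = tir.Schedule(matmul)
block_c = sch.get_block("C")
y, x, k = sch.get_loops(block_c)

# Loop tiling: split both spatial loops into 4x4 tiles, then reorder.
yo, yi = sch.split(y, factors=[None, 4])
xo, xi = sch.split(x, factors=[None, 4])
sch.reorder(yo, xo, k, yi, xi)

# Bind the outer loops to GPU threads and add annotation hints.
sch.bind(yo, "blockIdx.x")
sch.bind(xo, "threadIdx.x")
sch.unroll(yi)       # unrolling hint
sch.vectorize(xi)    # vectorization hint on the innermost spatial loop
```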

Blockization. Isolate the computation of a sub-region into a new sub-block. Other primitives also change the block hierarchy: caching primitives, and the back-and-forth transformation between a reduction block and its corresponding two-block form (sketched below).
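
A hedged sketch of blockization and caching, as a fresh schedule on the same `matmul` (the 4×4×4 tile is an illustrative choice):

```python
import tvm
from tvm import tir

sch = tir.Schedule(matmul)
block_c = sch.get_block("C")
y, x, k = sch.get_loops(block_c)
yo, yi = sch.split(y, factors=[None, 4])
xo, xi = sch.split(x, factors=[None, 4])
ko, ki = sch.split(k, factors=[None, 4])
sch.reorder(yo, xo, ko, yi, xi, ki)

# Blockize: isolate the 4x4x4 inner computation into a new sub-block,
# e.g. so it can later be matched against a hardware tensor primitive.
outer = sch.blockize(yi)

# Caching primitive: stage buffer A (read index 0 here) in shared memory.
a_shared = sch.cache_read(outer, 0, "shared")
```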

Separation of scheduling and TensorIR. Each schedule primitive is simply a standalone transformation from one TensorIR program to another.

3.3 Validation

Loop Nest Validation.

Check whether the iterator bindings provided by the loop nests match the constraints of the iterator domains, including domain sizes and iterator-independence information. For example, binding v1 = i and v2 = i * 2 is illegal, since the two iterators are not independent. Producer-consumer relations are also checked.

Threading Validation.

Correctness of Schedule Primitives.

Summary

The first two kinds of validation filter out invalid TensorIR programs. The third ensures the correctness of transformations.

3.4 Programming Effort

Programs can be written directly in a Python-AST dialect (as in the matmul example above).
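
As a small hedged illustration, the dialect round-trips: a TensorIR program, before or after transformation, prints back as Python:

```python
# `matmul` is the PrimFunc sketched earlier.
print(matmul.script())  # prints the TVMScript form of the program
```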

4. Auto-scheduling tensorized programs

4.1 Abstraction for Tensor Intrinsics

Introduce a TensorIntrin construct composed of two blocks: one block describes the computation semantics, and the other provides the low-level implementation.

A TensorIntrin also encodes the data type, storage scope, memory layout, and contiguity constraints through its multi-dimensional buffer specifications.
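
A hedged sketch of a TensorIntrin in TVM, assuming a hypothetical 4-element hardware dot-product instruction named `hw_dot4`; the buffer declarations are where dtype, layout, and contiguity constraints live:

```python
from tvm.script import tir as T
from tvm.tir import TensorIntrin

@T.prim_func
def dot4_desc(a: T.handle, b: T.handle, c: T.handle) -> None:
    # Buffer specs carry data type, shape, and contiguity constraints.
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (1,), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(C[0:1], A[0:4], B[0:4])
        T.writes(C[0:1])
        # Computation semantics: a 4-element dot-product update.
        for i in T.serial(4):
            with T.block("update"):
                vi = T.axis.reduce(4, i)
                C[0] = C[0] + A[vi] * B[vi]

@T.prim_func
def dot4_impl(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (1,), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(C[0:1], A[0:4], B[0:4])
        T.writes(C[0:1])
        # Low-level implementation: a call to the hypothetical instruction.
        T.evaluate(T.call_extern("float32", "hw_dot4",
                                 A.access_ptr("r"), B.access_ptr("r"),
                                 C.access_ptr("rw")))

TensorIntrin.register("hw_dot4", dot4_desc, dot4_impl)
```

After registration, `sch.tensorize(loop, "hw_dot4")` can replace a matching sub-block with the implementation.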

4.2 Tensorization Candidate Generation

4.3 Tensorized Program Sketch Generation

Fix parts of the program structure while leaving the remaining parameter choices open; see the sampling sketch below.
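
A hedged sketch of what "leaving space for parameters" looks like with TVM's sampling primitives (names as in the earlier examples):

```python
import tvm
from tvm import tir

sch = tir.Schedule(matmul)
block_c = sch.get_block("C")
y, x, k = sch.get_loops(block_c)
# The sketch fixes the structure (a two-level split of y) but samples
# the tiling factors instead of fixing them.
factors = sch.sample_perfect_tile(y, n=2)
yo, yi = sch.split(y, factors=factors)
```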

Data Movement as a First-Class Citizen.

Limitation of existing methods: they focus only on the schedule of computations.

Data movement decisions depend on computation-schedule decisions such as tiling, thread bindings, execution scopes, and the granularity of producer-consumer data flow.

We decouple data movement from computation schedules and insert AutoCopy blocks.

The signature of an AutoCopy block exposes only the buffer access information.

The body describes the details of the data movement task, including buffer position mapping, threading, and storage scope requirements.

A data movement scheduler takes this information as input and performs memory-related schedule transformations, such as vectorization, cooperative fetching, and shared-memory swizzling.
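
A hedged sketch of such memory-related transformations with TVM's schedule primitives, as a fresh schedule on the `matmul` sketched earlier (scopes and split factors are illustrative):

```python
import tvm
from tvm import tir

sch = tir.Schedule(matmul)
block_c = sch.get_block("C")

# Stage buffer A (read index 0) through shared memory; the copy block
# can now be scheduled independently of the computation block.
a_shared = sch.cache_read(block_c, 0, "shared")
y_a, x_a = sch.get_loops(a_shared)        # the 16x16 copy loops
fused = sch.fuse(y_a, x_a)
tx, vec = sch.split(fused, factors=[64, 4])
sch.bind(tx, "threadIdx.x")               # cooperative fetching across threads
sch.vectorize(vec)                        # vectorized global->shared loads
```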

