Updates from the IREE team


Data-Tiling Walkthrough

Data-tiling is the modification of the data layout of the operands of certain operations, such as matrix multiplication, that prefer specific layouts. These layout preferences depend on the operation and the target hardware. For example, matrix multiplication may need to use hardware matrix-multiplication instructions that perform optimally with a specific matrix data layout.
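
As a rough sketch, the layout change itself can be expressed with the upstream pack op (spelled tensor.pack in older MLIR versions, linalg.pack in newer ones). The 128x256 shape and the 8x4 inner tile below are purely illustrative; real tile sizes are chosen per target:

func.func @pack_example(%matrix: tensor<128x256xf32>) -> tensor<16x64x8x4xf32> {
  // Abstract destination tensor in the tiled layout.
  %dest = tensor.empty() : tensor<16x64x8x4xf32>
  // Rewrite the 128x256 matrix as a 16x64 grid of 8x4 tiles.
  %packed = tensor.pack %matrix inner_dims_pos = [0, 1] inner_tiles = [8, 4]
      into %dest : tensor<128x256xf32> -> tensor<16x64x8x4xf32>
  return %packed : tensor<16x64x8x4xf32>
}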

Layout changes may also be motivated by memory access performance, as data-tiling can result in improved locality of memory accesses, fewer cache lines being accessed, and generally simpler memory access patterns that are more likely to be handled performantly by the memory system.

These layout changes can be propagated as far as possible across the workload, so the entire workload can use the updated layouts, as opposed to having to perform layout transformations at runtime. This may involve fusions or constant-evaluation that can amortize or remove layout-transformation overheads.

The main conceptual difficulty in modeling this in tensor-level MLIR is that tensors don't have layouts: tensors are higher-level, abstract arrays. This is addressed by the concept of tensor encodings, which this document will explain.
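
As a minimal illustration of the mechanism, the builtin ranked tensor type accepts an arbitrary attribute as its encoding. The string attribute below is only a stand-in; IREE's data-tiling uses its own encoding attributes, which later compilation passes materialize into concrete packed layouts:

// "matmul_lhs" is a placeholder encoding: it marks this tensor as "the LHS
// of a matmul" without committing to a concrete in-memory layout yet.
func.func @encoded(%lhs: tensor<?x?xf32, "matmul_lhs">) -> tensor<?x?xf32, "matmul_lhs"> {
  return %lhs : tensor<?x?xf32, "matmul_lhs">
}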

IREE / MLIR / Linalg tutorial

Introduction

This tutorial is simultaneously about IREE, MLIR, and specifically the MLIR Linalg dialect.

What is MLIR?

MLIR is a programming language, but MLIR in itself is almost just an empty shell. What it really provides is a framework for defining MLIR dialects, which are where the features come from.

The "IR" part of the MLIR name stands for "intermediate representation". It means that MLIR is meant to be primarily for compiler-internal representations of code. But MLIR is actually fairly nice for humans to work with, and it's not hard to hand-author some MLIR programs from scratch. That is exactly the topic of this tutorial.
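
For instance, a hand-authored MLIR function using only the func and arith dialects can be as small as this:

// Add two floating-point scalars.
func.func @add(%a: f32, %b: f32) -> f32 {
  %sum = arith.addf %a, %b : f32
  return %sum : f32
}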

Exploring CPU microkernels on a matmul example

Basic setup, command lines

Source file: matmul.mlir:

func.func @matmul_dynamic(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>) outs(%acc: tensor<?x?xf32>) -> tensor<?x?xf32>
  return %result: tensor<?x?xf32>
}

Basic compilation command line:

$ iree-compile matmul.mlir -o /tmp/matmul.vmfb \
  --iree-hal-target-device=local \
  --iree-hal-local-target-device-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-enable-ukernels=all

This creates an IREE bytecode module:

$ ls -l /tmp/matmul.vmfb

-rw-rw-r-- 1 2884 Jan 22 10:37 /tmp/matmul.vmfb
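
The module can then be executed with iree-run-module. The flags below follow current IREE releases, but the exact spelling of --device and of the --input value syntax may differ between versions; the 2x2 inputs are just small test values:

$ iree-run-module --module=/tmp/matmul.vmfb \
  --device=local-task \
  --function=matmul_dynamic \
  --input="2x2xf32=[[1 2][3 4]]" \
  --input="2x2xf32=[[5 6][7 8]]" \
  --input="2x2xf32=[[0 0][0 0]]"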

CUDA backend

IREE is being designed with re-targetability as a core goal: it should be possible to use IREE to target a broad spectrum of power regimes, from embedded systems to distributed clusters; and it should be possible to extend IREE to target new back-ends without having to reinvent the wheel each time.

To explore this, we recently branched out from our initial focus on low-latency mobile deployments with a goal of using IREE to target data center workloads on Nvidia CUDA. This post describes how we quickly brought up a CUDA back-end for IREE and used it to train BERT, then shares some metrics and next steps.

Matrix Multiplication with MMT4D

Introduction

Matrix multiplication (matmul) is an important operation in ML workloads that poses specific challenges to code generation. For example, matmul makes repeated accesses to the same data, which makes locality of reference a top concern.

Moreover, modern CPU instruction set architectures (ISAs) offer specialized SIMD instructions that the matmul implementation needs to use to achieve optimal performance, and these instructions expect data to be in a particular layout.

This article is about an in-development MLIR operation, linalg.mmt4d, offering a compilation path for linalg.matmul that is designed from the ground up for these efficiency considerations.
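
As a sketch, this is roughly what the op looks like at the tensor level. The 8x8 tile shape (with K0 = 1) is illustrative only; real tile sizes are picked to match the target's SIMD instructions:

// lhs is MxKxM0xK0, rhs is NxKxN0xK0 (already transposed, hence the "t" in
// mmt4d), and the accumulator is MxNxM0xN0.
func.func @mmt4d_example(%lhs: tensor<16x32x8x1xf32>,
                         %rhs: tensor<24x32x8x1xf32>,
                         %acc: tensor<16x24x8x8xf32>) -> tensor<16x24x8x8xf32> {
  %result = linalg.mmt4d ins(%lhs, %rhs : tensor<16x32x8x1xf32>, tensor<24x32x8x1xf32>)
                         outs(%acc : tensor<16x24x8x8xf32>) -> tensor<16x24x8x8xf32>
  return %result : tensor<16x24x8x8xf32>
}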

TFLite support via TOSA

IREE can now execute TensorFlow Lite (TFLite) models through the use of TOSA, an open standard of common tensor operations and a part of MLIR core. TOSA’s high-level representation of tensor operations provides a common front end for ingesting models from different frameworks. In this case we ingest a TFLite FlatBuffer and compile it to TOSA IR, which IREE takes as an input format to compile to its various backends.
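
For a feel of what that intermediate form looks like, here is a tiny hand-written snippet (an element-wise add, one of the simplest TOSA operations), written in MLIR's generic op syntax to stay version-agnostic; models imported from real TFLite FlatBuffers are of course much larger:

func.func @tosa_add(%a: tensor<1x8xf32>, %b: tensor<1x8xf32>) -> tensor<1x8xf32> {
  %sum = "tosa.add"(%a, %b) : (tensor<1x8xf32>, tensor<1x8xf32>) -> tensor<1x8xf32>
  return %sum : tensor<1x8xf32>
}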