Demo: Data-tiling with multi-device
This write-up demonstrates how data-tiling works when there are multiple devices. It is a follow-up to the write-up How data-tiling works with encoding specialization.
Data-tiling is a technique that transforms input data into a particular layout for good performance. It lets you access data efficiently through the cache hierarchy and perform the computation with much lower latency.
IREE is a whole-graph compiler, so there are many opportunities to remove layout-transformation overhead: the transformations may be propagated, fused into other operations, or constant-evaluated for weights. IREE uses encodings to apply the data-tiling technique, and this post explores how encodings work in data-tiling.
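As a rough illustration of the kind of layout transformation data-tiling performs, here is a minimal, hand-written sketch that packs a row-major matrix into a tiled 4-D layout using the upstream tensor.pack op. The tile sizes and the op are illustrative only; IREE actually represents the choice of layout with encoding attributes that are materialized later in the pipeline rather than with hand-written pack IR:
// Illustrative only: pack a 128x256 row-major matrix into 8x1 tiles,
// producing a 16x256x8x1 tensor whose innermost dims form a SIMD-friendly tile.
func.func @pack_lhs(%src: tensor<128x256xf32>) -> tensor<16x256x8x1xf32> {
  %dest = tensor.empty() : tensor<16x256x8x1xf32>
  %packed = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [8, 1]
      into %dest : tensor<128x256xf32> -> tensor<16x256x8x1xf32>
  return %packed : tensor<16x256x8x1xf32>
}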
This tutorial is simultaneously about IREE, MLIR, and specifically the MLIR Linalg dialect.
MLIR is a programming language, but MLIR in itself is almost just an empty shell. What it really provides is a framework for defining MLIR dialects, which are where the features come from.
The "IR" part of the MLIR name stands for "intermediate representation". It means that MLIR is meant to be primarily for compiler-internal representations of code. But MLIR is actually fairly nice for humans to work with, and it's not hard to hand-author some MLIR programs from scratch. That is exactly the topic of this tutorial.
Source file: matmul.mlir:
func.func @matmul_dynamic(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {
%result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>) outs(%acc: tensor<?x?xf32>) -> tensor<?x?xf32>
return %result: tensor<?x?xf32>
}
Basic compilation command line:
$ iree-compile matmul.mlir -o /tmp/matmul.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-cpu=znver4 \
--iree-llvmcpu-enable-ukernels=all
This creates an IREE bytecode module:
$ ls -l /tmp/matmul.vmfb
-rw-rw-r-- 1 2884 Jan 22 10:37 /tmp/matmul.vmfb
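To sanity-check the module, it can be executed with iree-run-module. The invocation below is a minimal sketch, assuming the local-task CPU device and splat values for the three matmul operands:
$ iree-run-module --module=/tmp/matmul.vmfb \
    --device=local-task \
    --function=matmul_dynamic \
    --input="2x2xf32=1" \
    --input="2x2xf32=2" \
    --input="2x2xf32=0"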
IREE is being designed with re-targetability as a core goal: it should be possible to use IREE to target a broad spectrum of power regimes, from embedded systems to distributed clusters; and it should be possible to extend IREE to target new back-ends without having to reinvent the wheel each time.
To explore this, we recently branched out from our initial focus on low-latency mobile deployments with a goal of using IREE to target data center workloads on Nvidia CUDA. This post describes how we quickly brought up a CUDA back-end for IREE and used it to train BERT, then shares some metrics and next steps.
Matrix multiplication (matmul) is an important operation in ML workloads that poses specific challenges to code generation. For example, matmul makes repeated accesses to the same data, which makes locality of reference a top concern.
Moreover, modern CPU instruction set architectures (ISAs) offer specialized SIMD instructions that a matmul implementation needs to use to achieve optimal performance, and these instructions expect data to be in a particular layout.
This article is about an in-development MLIR operation, linalg.mmt4d, offering a compilation path for linalg.matmul that is designed from the ground up for these efficiency considerations.
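For a concrete picture of the op, here is a small, hand-written sketch of linalg.mmt4d on already-packed operands; the 8x1, 8x1, and 8x8 tile shapes are illustrative, not the ones IREE would pick for a given CPU:
// LHS is MxKxM0xK0, RHS is NxKxN0xK0, and the accumulator is MxNxM0xN0,
// so each inner M0xK0 / N0xK0 / M0xN0 tile is contiguous in memory.
func.func @mmt4d_packed(%lhs: tensor<8x16x8x1xf32>, %rhs: tensor<4x16x8x1xf32>, %acc: tensor<8x4x8x8xf32>) -> tensor<8x4x8x8xf32> {
  %result = linalg.mmt4d ins(%lhs, %rhs: tensor<8x16x8x1xf32>, tensor<4x16x8x1xf32>) outs(%acc: tensor<8x4x8x8xf32>) -> tensor<8x4x8x8xf32>
  return %result: tensor<8x4x8x8xf32>
}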
IREE can now execute TensorFlow Lite (TFLite) models through the use of TOSA, an open standard of common tensor operations, and a part of MLIR core. TOSA’s high-level representation of tensor operations provides a common front-end for ingesting models from different frameworks. In this case we ingest a TFLite FlatBuffer and compile it to TOSA IR, which IREE takes as an input format to compile to its various backends.
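A typical flow is sketched below; the file names are placeholders, and the exact import tooling depends on your IREE installation (iree-import-tflite is distributed separately from the core compiler):
$ iree-import-tflite model.tflite -o model_tosa.mlir
$ iree-compile model_tosa.mlir -o /tmp/model.vmfb \
    --iree-input-type=tosa \
    --iree-hal-target-backends=llvm-cpu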