Demo: Data-tiling with multi-device
This write-up demonstrates how data-tiling works when there are multiple devices. It is a follow-up to the write-up How data-tiling works with encoding specialization.
Data-tiling is a technique that transforms the input data into a particular layout for good performance. It lets you access data efficiently through the cache hierarchy and perform the computation with much lower latency.
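As a rough illustration (not part of this demo's IR), the upstream tensor.pack op expresses this kind of layout change: it rewrites a row-major matrix into a tiled layout whose inner dimensions match the tile shape the target prefers. The 8x4 tile size below is an assumption chosen for illustration:
// 128x256 matrix reshaped into 16x64 outer tiles, each holding an 8x4 inner tile.
%dest = tensor.empty() : tensor<16x64x8x4xf32>
%packed = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [8, 4]
    into %dest : tensor<128x256xf32> -> tensor<16x64x8x4xf32>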
IREE is a whole-program compiler, so there are many opportunities to remove layout-transformation overhead: the layout changes can be propagated, fused into other operations, or constant-evaluated for weights. IREE uses encodings to apply the data-tiling technique, and this post explores how encodings work in data-tiling.
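In IREE the concrete layout is not chosen up front: the compiler first tags the matmul operands with an encoding and only later materializes that encoding into a packed layout for the chosen device. The sketch below shows the general shape of such IR; the exact fields of the #iree_encoding.encoding attribute vary across IREE versions, so treat the attribute contents as an assumption:
// Hedged sketch: the attribute fields are illustrative, not tied to a specific IREE release.
#enc_lhs = #iree_encoding.encoding<operand_index = 0, op_type = matmul, element_types = [f32, f32, f32]>
%lhs_enc = iree_encoding.set_encoding %lhs : tensor<?x?xf32> -> tensor<?x?xf32, #enc_lhs>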
This tutorial is simultaneously about IREE, MLIR, and specifically the MLIR Linalg dialect.
MLIR is a programming language, but MLIR in itself is almost just an empty shell. What it really provides is a framework for defining MLIR dialects, which are where the features come from.
The "IR" part of the MLIR name stands for "intermediate representation". It means that MLIR is meant to be primarily for compiler-internal representations of code. But MLIR is actually fairly nice for humans to work with, and it's not hard to hand-author some MLIR programs from scratch. That is exactly the topic of this tutorial.
Source file: matmul.mlir:
func.func @matmul_dynamic(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>) outs(%acc: tensor<?x?xf32>) -> tensor<?x?xf32>
  return %result: tensor<?x?xf32>
}
Basic compilation command line:
$ iree-compile matmul.mlir -o /tmp/matmul.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-cpu=znver4 \
--iree-llvmcpu-enable-ukernels=all
This creates an IREE bytecode module:
$ ls -l /tmp/matmul.vmfb
-rw-rw-r-- 1 2884 Jan 22 10:37 /tmp/matmul.vmfb
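The module can then be executed with iree-run-module; the device, function name, and splat input values below are just one plausible invocation for this example:
$ iree-run-module --module=/tmp/matmul.vmfb \
    --device=local-task \
    --function=matmul_dynamic \
    --input=2x2xf32=1 \
    --input=2x2xf32=2 \
    --input=2x2xf32=0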
Matrix multiplication (matmul) is an important operation in ML workloads that poses specific challenges to code generation. For example, matmul makes repeated accesses to the same data, which makes locality of reference a top concern.
Moreover, modern CPU instruction set architectures (ISAs) offer specialized SIMD instructions that the matmul implementation needs to use to achieve optimal performance, and these instructions expect data to be in a particular layout.
This article is about an in-development MLIR operation, linalg.mmt4d, offering a compilation path for linalg.matmul that is designed from the ground up for these efficiency considerations.
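As a rough sketch of that target form (the tile sizes here are illustrative, not necessarily what the compiler would pick), linalg.mmt4d uses the same ins/outs assembly as linalg.matmul, but its operands already carry the tile dimensions explicitly:
// lhs is M1xK1xM0xK0, rhs is N1xK1xN0xK0, acc is M1xN1xM0xN0, with M0=8, N0=8, K0=1.
%tiled = linalg.mmt4d ins(%lhs4d, %rhs4d : tensor<?x?x8x1xf32>, tensor<?x?x8x1xf32>)
                      outs(%acc4d : tensor<?x?x8x8xf32>) -> tensor<?x?x8x8xf32>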