Demo: Data-tiling with multi-device
This write-up demonstrates how data-tiling works when there are multiple devices. It is a follow-up to the write-up How data-tiling works with encoding specialization.
Data-tiling is a technique that transforms input data into a particular layout for good performance. It lets you access data efficiently through the cache hierarchy and perform the computation with much lower latency.
IREE is a whole-graph compiler, so there are many opportunities to remove layout-transformation overhead: the transformations may be propagated, fused into other operations, or constant-evaluated for weights. IREE uses encodings to apply the data-tiling technique, and this post explores how encodings work in data-tiling.
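As a rough illustration of the kind of layout transformation data-tiling performs, here is a minimal, hand-written sketch that packs a row-major matrix into a tiled 4-D layout using the upstream tensor.pack op. The tile sizes and the op are illustrative only; IREE actually represents the choice of layout with encoding attributes that are materialized later in the pipeline rather than with hand-written pack IR:
// Illustrative only: pack a 128x256 row-major matrix into 8x1 tiles,
// producing a 16x256x8x1 tensor whose innermost dims form a SIMD-friendly tile.
func.func @pack_lhs(%src: tensor<128x256xf32>) -> tensor<16x256x8x1xf32> {
  %dest = tensor.empty() : tensor<16x256x8x1xf32>
  %packed = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [8, 1]
      into %dest : tensor<128x256xf32> -> tensor<16x256x8x1xf32>
  return %packed : tensor<16x256x8x1xf32>
}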
This tutorial is simultaneously about IREE, MLIR, and specifically the MLIR Linalg dialect.
MLIR is a programming language, but MLIR in itself is almost just an empty shell. What it really provides is a framework for defining MLIR dialects, which are where the features come from.
The "IR" part of the MLIR name stands for "intermediate representation". It means that MLIR is meant to be primarily for compiler-internal representations of code. But MLIR is actually fairly nice for humans to work with, and it's not hard to hand-author some MLIR programs from scratch. That is exactly the topic of this tutorial.
Source file: matmul.mlir:
func.func @matmul_dynamic(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {
%result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>) outs(%acc: tensor<?x?xf32>) -> tensor<?x?xf32>
return %result: tensor<?x?xf32>
}
Basic compilation command line:
$ iree-compile matmul.mlir -o /tmp/matmul.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-cpu=znver4 \
--iree-llvmcpu-enable-ukernels=all
This creates an IREE bytecode module:
$ ls -l /tmp/matmul.vmfb
-rw-rw-r-- 1 2884 Jan 22 10:37 /tmp/matmul.vmfb
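To sanity-check the module, it can be executed with iree-run-module. The invocation below is a minimal sketch, assuming the local-task CPU device and splat values for the three matmul operands:
$ iree-run-module --module=/tmp/matmul.vmfb \
    --device=local-task \
    --function=matmul_dynamic \
    --input="2x2xf32=1" \
    --input="2x2xf32=2" \
    --input="2x2xf32=0"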
IREE is being designed with re-targetability as a core goal: it should be possible to use IREE to target a broad spectrum of power regimes, from embedded systems to distributed clusters; and it should be possible to extend IREE to target new back-ends without having to reinvent the wheel each time.
To explore this, we recently branched out from our initial focus on low-latency mobile deployments with a goal of using IREE to target data center workloads on Nvidia CUDA. This post describes how we quickly brought up a CUDA back-end for IREE and used it to train BERT, then shares some metrics and next steps.
Matrix multiplication (matmul) is an important operation in ML workloads that poses specific challenges to code generation. For example, matmul makes repeated accesses to the same data, which makes locality of reference a top concern.
Moreover, modern CPU instruction set architectures (ISAs) offer specialized SIMD instructions that a matmul implementation needs to use to achieve optimal performance, and these instructions expect data to be in a particular layout.
This article is about an in-development MLIR operation, linalg.mmt4d, offering a compilation path for linalg.matmul that is designed from the ground up for these efficiency considerations.
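For a concrete picture of the op, here is a small, hand-written sketch of linalg.mmt4d on already-packed operands; the 8x1, 8x1, and 8x8 tile shapes are illustrative, not the ones IREE would pick for a given CPU:
// LHS is MxKxM0xK0, RHS is NxKxN0xK0, and the accumulator is MxNxM0xN0,
// so each inner M0xK0 / N0xK0 / M0xN0 tile is contiguous in memory.
func.func @mmt4d_packed(%lhs: tensor<8x16x8x1xf32>, %rhs: tensor<4x16x8x1xf32>, %acc: tensor<8x4x8x8xf32>) -> tensor<8x4x8x8xf32> {
  %result = linalg.mmt4d ins(%lhs, %rhs: tensor<8x16x8x1xf32>, tensor<4x16x8x1xf32>) outs(%acc: tensor<8x4x8x8xf32>) -> tensor<8x4x8x8xf32>
  return %result: tensor<8x4x8x8xf32>
}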
IREE can now execute TensorFlow Lite (TFLite) models through the use of TOSA, an open standard of common tensor operations, and a part of MLIR core. TOSA’s high-level representation of tensor operations provides a common front-end for ingesting models from different frameworks. In this case we ingest a TFLite FlatBuffer and compile it to TOSA IR, which IREE takes as an input format to compile to its various backends.
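A typical flow is sketched below; the file names are placeholders, and the exact import tooling depends on your IREE installation (iree-import-tflite is distributed separately from the core compiler):
$ iree-import-tflite model.tflite -o model_tosa.mlir
$ iree-compile model_tosa.mlir -o /tmp/model.vmfb \
    --iree-input-type=tosa \
    --iree-hal-target-backends=llvm-cpu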