
Meta LLM Compiler - Foundation models of compiler optimisation

5 min read

Meta's LLM Compiler Research Paper

Meta have just published an interesting research paper and a series of foundation models specifically tailored for code optimisation tasks. They have been trained on a huge amount of LLVM IR and assembly code, with fine-tuning to interpret compiler behaviour. You can read the paper here

The models we work with at Ultraleap are deployed on edge VR/AR devices, so binary size and code optimisation are things we care deeply about: they directly affect the FPS of our models, as well as power consumption. This got me thinking: could this be used to optimise the LLVM IR that the XLA compiler outputs for TensorFlow models? Spoilers - definitely not yet, and I'll explain why.

XLA takes model graphs from ML frameworks (defined in StableHLO) and compiles them into machine code for various architectures. The steps for converting the model graph into a target-optimised executable include:

  1. XLA performs several passes to optimise the StableHLO graph; at this stage they are target independent. This includes steps like buffer analysis for allocating memory for the computation at runtime.
  2. XLA then sends the HLO computation to a backend for further optimisations, now with target-specific information. So if you're using CUDA on an NVIDIA GPU, you may get some op fusions that are beneficial for that programming model.
  3. Finally we have target-specific code generation, and XLA uses LLVM for its low-level IR, optimisation and code generation. These backends emit the LLVM IR necessary to represent the HLO computation efficiently.

So in theory, we may be able to use this LLM to optimise TensorFlow-compiled graphs.
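To get a feel for the first two stages, recent TensorFlow versions can show you the HLO for a JIT-compiled function both before and after XLA's optimisation passes. Here's a minimal sketch (the function and shapes are arbitrary, and the available stage names may vary between TensorFlow versions):

import tensorflow as tf

@tf.function(jit_compile=True)
def f(x):
    return tf.nn.relu(x) * 2.0

x = tf.random.normal((4, 8))

# HLO as handed to XLA, before the optimisation pipeline runs
print(f.experimental_get_compiler_ir(x)(stage="hlo"))
# HLO after XLA's optimisation passes
print(f.experimental_get_compiler_ir(x)(stage="optimized_hlo"))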

Start with the basics

To figure out how the model works, I initially put together a toy example that generates LLVM IR from Python code and passes it into Meta's model. This uses the llvmlite Python package as an IR builder.

The repo is here for anyone wanting to build on top of it.

Here's a snippet:

from llvmlite import ir, binding

# Reference Python function we want to express in IR: multiply the input by
# two (x + x). The second argument is unused; it is kept only to mirror the
# two-argument IR signature below.
def multiply_by_two(x, y):
    return x + x

# Declare an i32(i32, i32) function in a fresh module.
func_type = ir.FunctionType(ir.IntType(32), [ir.IntType(32), ir.IntType(32)])
module = ir.Module(name="toy_llm_compiler")
module.triple = binding.get_default_triple()
function = ir.Function(module, func_type, name="multiply_by_two")

# Build the function body: x + x is equivalent to multiplying x by two.
ent_b = function.append_basic_block(name="entry")
builder = ir.IRBuilder(ent_b)
arg_x, arg_y = function.args
result = builder.add(arg_x, arg_x)
builder.ret(result)

# save the module as a file
with open("multiply_by_two.ll", "w") as f:
    f.write(str(module))

This defines a trivial multiply-by-two function and writes the LLVM IR out to a file. If we run it, we get our multiply_by_two.ll file saved to disk:

; ModuleID = "toy_llm_compiler"
target triple = "x86_64-unknown-linux-gnu"
target datalayout = ""

define i32 @"multiply_by_two"(i32 %".1", i32 %".2")
{
entry:
  %".4" = add i32 %".1", %".1"
  ret i32 %".4"
}
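Before asking the LLM, it's also worth generating LLVM's own opt -O3 output as a reference to compare the model's suggestion against. A minimal sketch, assuming the LLVM opt tool is installed and on your PATH (flag spellings can vary slightly across LLVM versions):

import subprocess

# Produce a ground-truth optimised version with LLVM's own optimiser.
subprocess.run(
    ["opt", "-S", "-O3", "multiply_by_two.ll", "-o", "multiply_by_two_O3_ref.ll"],
    check=True,
)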

The next step is to load the LLM and pass in the prompt. For this we can use the Hugging Face transformers package.

from pathlib import Path

from transformers import AutoModelForCausalLM, AutoTokenizer

working_dir = Path(__file__).resolve().parent

# your huggingface access token
access_token = "hf_..."  # replace with your own token

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/llm-compiler-7b-ftd", token=access_token
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/llm-compiler-7b-ftd", token=access_token
)

# load llvm ir as text
with open(working_dir / "multiply_by_two.ll", "r") as f:
    code = f.read()

inputs = tokenizer(
    "Please optimise the following LLVM IR code using opt -O3:\n" + code,
    return_tensors="pt",
)
generate_ids = model.generate(inputs.input_ids, max_length=1024)
tokens = tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)

with open(working_dir / "multiply_by_two_opt.ll", "w") as f:
    f.write(tokens[0])

TensorFlow

I've set up a toy example of a small TensorFlow model which is compiled with JIT, since for this experiment we want XLA's runtime compilation rather than Ahead-of-Time (AOT) compilation. However, even when compiling this small model, the LLVM IR that is spat out is huge, roughly 3,000 lines. If we do a quick back-of-the-envelope token calculation against the model's ~15k token context window:

  • Assumed line length of code: ~80 characters per line
  • Assumed tokenisation: ~4 characters per token (a common rule of thumb), so roughly 20 tokens per line
  • Context window size: ~15,000 tokens

To find the number of lines that fit:

Context window size / tokens per line = 15,000 / 20 = 750 lines

Which is far lower than the ~3,000 lines we need!
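Rather than estimating, you can also measure the prompt size directly with the model's tokenizer. A minimal sketch (the IR path below is a hypothetical placeholder for wherever XLA dumped your .ll file):

from transformers import AutoTokenizer

# pass token=... if the repo is gated for your account
tokenizer = AutoTokenizer.from_pretrained("facebook/llm-compiler-7b-ftd")

# NOTE: hypothetical example path for an XLA-dumped IR file
with open("data/baseline/module_0000.ir-no-opt.ll") as f:
    ir_text = f.read()

n_tokens = len(tokenizer(ir_text).input_ids)
print(f"IR prompt would be {n_tokens} tokens against a ~15k token context window")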

However, the toy example is at least a useful proof of concept to build on. It works by:

  1. using a pre-built TensorFlow docker image to build the package on top of
  2. using docker compose to pass the required XLA env vars, which dump the LLVM IR to the data/baseline directory
  3. running a simple model train-and-compile process
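As a rough sketch of steps 2 and 3 outside docker (assumptions: the layer and batch sizes are arbitrary, and --xla_dump_to also writes the emitted LLVM IR alongside the HLO dumps, which can vary by backend and TensorFlow version):

import os

# XLA flags must be set before TensorFlow initialises XLA.
os.environ["XLA_FLAGS"] = "--xla_dump_to=data/baseline"

import tensorflow as tf

# A deliberately tiny model (sizes are arbitrary for this sketch).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", jit_compile=True)  # JIT -> XLA

x = tf.random.normal((256, 8))
y = tf.random.normal((256, 1))
model.fit(x, y, epochs=1, batch_size=32, verbose=0)
# data/baseline should now contain the dumped HLO (and, if supported, LLVM IR) files.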

Summary & Limitations

This certainly seems like an interesting use case: I can imagine a future where these models keep getting better and we can deploy them to optimise code very quickly for specific applications. However, there are some rather large limitations with this model currently (which apply to most LLMs in general):

  • context length - as we've seen, the LLVM IR for even a tiny TensorFlow model is far too long to fit into the LLM Compiler's meagre ~15k token context window. We could chunk it, but a general shortcoming of LLMs is retaining information between prompts.
  • compiler code accuracy - LLVM IR is hyper-specific and needs to be modified with great care for it to actually be interpreted properly at the machine level, and LLMs aren't known for their accuracy. An application I think would be useful is some sort of "fuzzy analysis": a model that highlights general areas of unoptimised compiler code and suggests improvements.
  • novelty - this generally isn't an enormously interesting piece of research: training an LLM on buckets of LLVM IR data was bound to be an improvement for compiler code optimisation over a general foundation model like GPT-4. It does, however, show that these models can be greatly improved over their foundation counterparts.
© 2024 by (Samuel) JenkinsML. All rights reserved.