Meta LLM Compiler - Foundation models of compiler optimisation
Meta's LLM Compiler Research Paper
Meta have just published an interesting research paper and a series of foundation models that are specifically tailored for code optimisation tasks. They have been trained on a huge amount of LLVM IR and assembly code, with fine-tuning to interpret compiler behaviour. You can read the paper here
The models we work with at Ultraleap are deployed on edge VR / AR devices, so binary size and code optimisation are things we care deeply about: they directly affect the FPS of our models, as well as power consumption. This got me thinking: could this be used to optimise the LLVM IR that the XLA compiler outputs for a TensorFlow model? Spoilers - definitely not yet, and I'll explain why.
XLA takes model graphs from ML frameworks (which are defined in StableHLO) and compiles them into machine instructions for various architectures. The steps for converting the model graph into a target-optimised executable include:
- XLA performs several passes to optimise the StableHLO graph; these are target-independent at this stage. This includes steps like buffer analysis for allocating memory for computation at runtime.
- XLA then sends the HLO computation to a backend for further optimisations, now with target-specific information. So if you're using CUDA on an NVIDIA GPU, you may get some op fusions that are beneficial for that programming model.
- We then have target-specific code generation. XLA uses LLVM for its low-level IR, optimisation and code generation, so these backends emit the LLVM IR necessary to represent the HLO computation efficiently.
So in theory, we may be able to use this LLM to optimise compiled TensorFlow graphs.
Start with the basics
To figure out how the model works, I initially put together a toy example that generates LLVM IR from Python code and passes it into Meta's model. This uses the llvmlite Python package as an IR builder.
The repo is here for anyone wanting to build on top of it
Here's a snippet:
from llvmlite import ir, binding

# Reference Python function; the IR built below does not use it.
def multiply_by_two(x, y):
    return x * y

# Function signature: i32 (i32, i32)
func_type = ir.FunctionType(ir.IntType(32), [ir.IntType(32), ir.IntType(32)])

module = ir.Module(name="toy_llm_compiler")
module.triple = binding.get_default_triple()

function = ir.Function(module, func_type, name="multiply_by_two")
ent_b = function.append_basic_block(name="entry")
builder = ir.IRBuilder(ent_b)

arg_x, arg_y = function.args
# x + x doubles the first argument; arg_y is unused.
result = builder.add(arg_x, arg_x)
builder.ret(result)

# save the module as a file
with open("multiply_by_two.ll", "w") as f:
    f.write(str(module))
This defines a very simple function that doubles its first argument (as x + x), and writes out the LLVM IR to a file. If we run this we get our multiply_by_two.ll file saved to disk:
; ModuleID = "toy_llm_compiler"
target triple = "x86_64-unknown-linux-gnu"
target datalayout = ""

define i32 @"multiply_by_two"(i32 %".1", i32 %".2")
{
entry:
  %".4" = add i32 %".1", %".1"
  ret i32 %".4"
}
The next step is to load the LLM and pass in the prompt. For this we can use the Hugging Face transformers package.
import os
from pathlib import Path

from transformers import AutoModelForCausalLM, AutoTokenizer

working_dir = Path(__file__).resolve().parent

# your Hugging Face access token, read from the environment rather than
# hard-coded (the HF_TOKEN variable name is an assumption)
access_token = os.environ["HF_TOKEN"]

# recent transformers releases take the token via the `token` argument
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/llm-compiler-7b-ftd", token=access_token
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/llm-compiler-7b-ftd", token=access_token
)

# load llvm ir as text
with open(working_dir / "multiply_by_two.ll", "r") as f:
    code = f.read()

inputs = tokenizer(
    "Please optimise the following llvm ir code using opt -O3:\n" + code,
    return_tensors="pt",
)

generate_ids = model.generate(inputs.input_ids, max_length=1024)

tokens = tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)

with open(working_dir / "multiply_by_two_opt.ll", "w") as f:
    f.write(tokens[0])
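One thing to be aware of: for a decoder-only model, generate returns the prompt tokens followed by the newly generated ones, so the file written above will start with the prompt text. Below is a minimal sketch of trimming the prompt off before decoding, shown on plain Python lists so it runs without the model. The completion_only helper is a hypothetical name, not part of transformers.

```python
def completion_only(prompt_ids, output_ids):
    """Return only the ids that were generated after the prompt."""
    prompt_ids, output_ids = list(prompt_ids), list(output_ids)
    n = len(prompt_ids)
    if output_ids[:n] != prompt_ids:
        raise ValueError("output does not start with the prompt ids")
    return output_ids[n:]


# Stand-in ids for illustration; with the real model these would be
# inputs.input_ids[0].tolist() and generate_ids[0].tolist().
prompt = [1, 15, 27, 42]
output = [1, 15, 27, 42, 99, 100, 2]
print(completion_only(prompt, output))
```

With tensors, the equivalent is to slice generate_ids[:, inputs.input_ids.shape[1]:] before calling batch_decode.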
TensorFlow
I've set up a toy example of a small TensorFlow model that is compiled with JIT, which is how XLA is invoked here, as we want runtime optimisations rather than Ahead of Time (AOT) compilation. However, even for this small model, the LLVM IR that is spat out is huge, i.e. ~3000 lines. A quick back-of-the-envelope calculation against this model's context window limit of 15k:
- Assumed line length of code: 80 characters per line
- Context window size: 15,000 tokens, pessimistically counted as one character each
To find the number of lines:
Context window size / average line length = 15000 / 80 = 187.5 lines
Which is far lower than what we require! A context window is really measured in tokens, and a token usually covers more than one character, but even at a generous 4 characters per token (an assumed rule of thumb, not a measured figure for LLVM IR) that is only 750 lines.
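The same estimate as a few lines of Python, in case you want to try other assumptions. The chars_per_token parameter is an assumption added for illustration: 1 reproduces the pessimistic calculation above, and 4 is only a rough rule of thumb, not something measured on LLVM IR.

```python
def lines_that_fit(context_window, chars_per_line=80, chars_per_token=1.0):
    """Estimate how many lines of IR fit into the context window."""
    return context_window * chars_per_token / chars_per_line


IR_LINES = 3000  # approximate size of the dumped LLVM IR

pessimistic = lines_that_fit(15_000)                    # one character per token
generous = lines_that_fit(15_000, chars_per_token=4.0)  # assumed rule of thumb

print(pessimistic, generous, generous < IR_LINES)
```

Either way the budget comes out well short of the ~3000 lines we need.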
However, the toy example is at least a useful proof of concept to build on top of. It works by:
- using a pre-built TensorFlow Docker image as the base for the package
- using Docker Compose to pass the required XLA env vars to dump the LLVM IR to the data/baseline directory
- running a simple model train and compile process
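As a rough illustration of the env var step, here is a sketch that builds an environment with XLA's dump directory set. XLA reads its options from the XLA_FLAGS variable, and --xla_dump_to sets where dumps are written; the with_xla_dump helper is a hypothetical name, and the exact flags the repo uses may differ. Which files end up in the directory depends on the backend and the XLA version.

```python
import os
import shlex


def with_xla_dump(dump_dir, base_env=None):
    """Return a copy of the environment that asks XLA to dump into dump_dir."""
    env = dict(os.environ if base_env is None else base_env)
    flags = shlex.split(env.get("XLA_FLAGS", ""))
    # Drop any earlier dump target so the new one takes effect.
    flags = [f for f in flags if not f.startswith("--xla_dump_to=")]
    flags.append(f"--xla_dump_to={dump_dir}")
    env["XLA_FLAGS"] = " ".join(flags)
    return env


env = with_xla_dump("data/baseline", base_env={"XLA_FLAGS": "--xla_dump_hlo_as_text"})
print(env["XLA_FLAGS"])
```

The resulting dict can be passed as env= to subprocess.run when launching the training script, or the same XLA_FLAGS value can go in the environment section of the compose file.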
Summary & Limitations
This certainly seems like an interesting use case. I can imagine a future where these models get better and better, and we can deploy them to optimise code very quickly for specific applications. However, there are some rather large limitations on this model currently (which are generally applicable to most LLMs):
- context length - as we see, the LLVM IR for even a tiny TensorFlow model is far too long to be processed in the LLM Compiler's meagre 15k context window. We could chunk it, but a general shortcoming of LLMs is that they retain no information between prompts.
- compiler code accuracy - LLVM IR is hyperspecific and needs to be modified with great care for it to actually be interpreted properly at the machine level. LLMs aren't known for their accuracy. I think a useful application would be some sort of "fuzzy analysis" - a model that highlights general areas of unoptimised compiler code, and suggests improvements.
- This generally isn't an enormously interesting piece of research: training an LLM on buckets of LLVM IR data was bound to be an improvement for compiler code optimisation over a foundation model like GPT-4. It does, however, show that these models can be greatly improved over their foundational models.
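To make the chunking idea from the context length point concrete, here is a minimal sketch that splits an IR module into per-function pieces and packs them greedily under a size budget. Everything in it is an assumption for illustration: split_functions and pack_chunks are hypothetical helpers, the budget is in characters rather than tokens, and the parsing is naive (it assumes each function starts with a line beginning "define " and ends with a line that is just "}"). It does nothing about the real problem, which is that each chunk loses sight of the others.

```python
def split_functions(ir_text):
    """Split LLVM IR text into (header, list of function definitions)."""
    header, functions, current = [], [], None
    for line in ir_text.splitlines():
        if current is None and line.startswith("define "):
            current = [line]
        elif current is not None:
            current.append(line)
            if line.strip() == "}":
                functions.append("\n".join(current))
                current = None
        else:
            # Anything outside a function body (triple, declares, metadata).
            header.append(line)
    return "\n".join(header).strip(), functions


def pack_chunks(header, functions, max_chars):
    """Greedily group functions so each chunk, header included, fits max_chars.

    A single function larger than the budget still gets a chunk of its own.
    """
    chunks, current = [], header
    for fn in functions:
        candidate = current + "\n\n" + fn
        if len(candidate) > max_chars and current != header:
            chunks.append(current)
            candidate = header + "\n\n" + fn
        current = candidate
    if current != header:
        chunks.append(current)
    return chunks


ir = """target triple = "x86_64-unknown-linux-gnu"

define i32 @a(i32 %x) {
entry:
  ret i32 %x
}

define i32 @b(i32 %x) {
entry:
  ret i32 %x
}
"""

header, functions = split_functions(ir)
chunks = pack_chunks(header, functions, max_chars=120)
print(len(functions), len(chunks))
```

Repeating the header in every chunk keeps the target triple visible to the model, at the cost of some of the budget.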