Optimizing LLM Performance on Mac and Ubuntu

When you’re wrangling large language models (LLMs) on a MacBook Pro with an Apple M1 or M2 chip, or on an Ubuntu box rocking twin NVIDIA 2080 Ti GPUs, performance tuning and GPU acceleration aren’t just nice-to-haves. They’re survival. Here’s how I’ve been installing and running Ollama (with GPU support!) and a few other OpenAI API-compatible tools that play nice with local hardware.

1. Ollama with GPU Acceleration on macOS

Ollama taps into the llama.cpp library, which, in turn, rides macOS’s Metal API for GPU acceleration. Here’s how to make sure Ollama is actually firing on all GPU cylinders:

Enabling GPU Acceleration

You don’t need to jump through any extra hoops—Ollama auto-detects your hardware and uses GPU acceleration out of the box. If you’ve got an M1 or M2 MacBook, it’ll quietly lean on the Metal framework for a nice speed boost.

Checking if GPU Acceleration is Working

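Want to see if GPU acceleration’s actually happening? While a model is loaded, run ollama ps and check the PROCESSOR column, which reports how much of the model is sitting on the GPU versus the CPU. You can also dig through the server logs for any telltale Metal mentions, or compare tokens per second against a CPU-only run.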
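
If you’d rather put a number on it, Ollama’s local REST API reports how many tokens it generated and how long that took, so you can compute tokens per second and compare runs. Here’s a minimal sketch, assuming the default endpoint at http://localhost:11434 and a model you’ve already pulled. The helper names are mine, and "llama2" is just a placeholder tag.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumed)


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count / eval_duration (duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_prompt(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generate request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Usage (needs a running Ollama server; "llama2" is a placeholder model tag):
# stats = run_prompt("llama2", "Explain Metal in one sentence.")
# print(f"{tokens_per_second(stats):.1f} tokens/s")
```

Run the same prompt a few times and ignore the first result, since that one usually includes model load time.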

Performance Tweaks

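Use quantized models: Ollama lets you pull quantized model variants (4-bit, 8-bit, and so on). Fewer bits per weight means a smaller memory footprint and less memory traffic per token, usually at some cost in output quality.
Go smaller: Picking a smaller model (say, a 7B-parameter LLaMA variant) leaves more headroom in the memory your GPU shares with everything else.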
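
To see why both tweaks matter, it helps to run the back-of-the-envelope math. This sketch estimates the size of the weights alone (parameters times bits per weight, divided by 8). It ignores the KV cache and runtime overhead, and real 4-bit formats store a little extra for scaling factors, so treat the numbers as a floor rather than a measurement.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# A 7B-parameter model at common precisions.
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_footprint_gb(7e9, bits):.1f} GB")
```

At 16-bit the weights come to roughly 14 GB, which is tight on a 16 GB MacBook. At 4-bit they drop to about 3.5 GB.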

2. Other Tools and Platforms

If you’re eyeing alternatives to Ollama that support local execution, GPU acceleration, and OpenAI API compatibility, here are a few worth trying:

(1) Text Generation Web UI

What it is: An open-source web interface for running models like LLaMA, GPT-J, and friends.
Why it’s cool:
- OpenAI API compatible
- GPU acceleration via macOS Metal
Install it:

git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
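# The requirements file name varies by version and platform; check the README if this one is missing.
pip install -r requirements.txt
# --api turns on the OpenAI-compatible endpoints.
python server.py --api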

(2) GPT4All

What it is: Local LLM project with support for a bunch of models.
Why it’s cool:
- Metal acceleration on macOS
- CLI and GUI that don’t make you want to flip tables
Install it:

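brew install --cask gpt4all
# Then launch the GPT4All app from Applications. For scripting, the Python bindings install with: pip install gpt4all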

(3) LocalAI

What it is: OpenAI-compatible LLaMA model API server.
Why it’s cool:
- Metal GPU support on macOS
- REST API support
Install it:

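# v1.0.0 is an old pin and asset names change between releases; pick the current one from the releases page (the project now lives under mudler/LocalAI).
curl -LO https://github.com/go-skynet/LocalAI/releases/download/v1.0.0/local-ai-darwin-arm64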
chmod +x local-ai-darwin-arm64
./local-ai-darwin-arm64

(4) MLC LLM

What it is: A framework tuned for macOS and mobile gadgets.
Why it’s cool:
- GPU acceleration via Apple Metal
- Runs a bunch of models
Install it: Just grab the precompiled binary, load up a model, and go.

(5) llama.cpp

What it is: A lean, mean, LLaMA-running machine.
Why it’s cool:
- Supports Metal API
Install it:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
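# Newer checkouts build with CMake instead (cmake -B build && cmake --build build) and name the binary llama-cli.
make
# -p sets the prompt; -ngl sets how many layers to offload to the GPU.
./main -m path/to/llama/model -p "Hello" -ngl 99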

3. Running Multiple NVIDIA GPUs on Ubuntu

If you’re going full beast mode with two NVIDIA 2080 Ti GPUs on Ubuntu and want to spin up an OpenAI-compatible API for large models, here’s roughly what I did:

System and Environment Prep

Install NVIDIA drivers:

sudo apt update
# Run ubuntu-drivers devices first if you’re unsure which driver version your release offers.
sudo apt install -y nvidia-driver-530
sudo reboot

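Install CUDA and cuDNN: Make sure the CUDA version is one your driver supports (nvidia-smi shows the highest CUDA version the installed driver can run). Don’t ask how I know. The PyTorch wheels below bundle their own CUDA runtime, so the system toolkit mostly matters when something needs to compile CUDA extensions, as DeepSpeed sometimes does.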
Install Python and deps:

sudo apt install -y python3 python3-pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
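
Before going further, it’s worth confirming that PyTorch actually sees both cards. Here’s a small sanity check, written so it degrades gracefully if torch isn’t installed yet. The function name is my own.

```python
def gpu_report() -> dict:
    """Summarize what PyTorch can see; safe to call even if torch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    count = torch.cuda.device_count()
    return {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "cuda_build": torch.version.cuda,  # CUDA version the wheel was built against
        "gpus": [torch.cuda.get_device_name(i) for i in range(count)],
    }


print(gpu_report())
```

On the dual-GPU box you’d hope to see cuda_available set to True and two entries under gpus. If the list is empty, the driver or the wheel’s CUDA build is the first thing to check.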

Using Distributed Inference Frameworks

To put both GPUs to work, frameworks like DeepSpeed or Hugging Face Accelerate are your friends:

DeepSpeed

Install:

pip install deepspeed

Sample inference:

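import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Launch with: deepspeed --num_gpus 2 infer.py (infer.py is whatever you name this file).
# mp_size=2 splits each layer across both cards; newer releases prefer tensor_parallel={"tp_size": 2}.
ds_engine = deepspeed.init_inference(model, mp_size=2, dtype=torch.float16)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(f"cuda:{int(os.getenv('LOCAL_RANK', '0'))}")
outputs = ds_engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))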

Hugging Face Accelerate

Install:

pip install accelerate

Configure:

accelerate config

Sample inference:

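import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import dispatch_model, infer_auto_device_map

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Plan which layers go on which GPU (keeping each transformer block whole)...
device_map = infer_auto_device_map(model, no_split_module_classes=["GPTNeoBlock"])
# ...then actually move them. Passing device_map="auto" to from_pretrained does both steps in one go.
model = dispatch_model(model, device_map=device_map)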
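
If the idea of a device map feels abstract, this toy version may help. It’s a simplified illustration of layer-by-layer placement and not Accelerate’s actual algorithm. The layer sizes and memory budgets are made-up numbers, and the function name is mine.

```python
def greedy_device_map(layer_sizes_gb, budgets_gb):
    """Assign layers in order, moving to the next device when the current one is full."""
    device_map, device, used = {}, 0, 0.0
    for index, size in enumerate(layer_sizes_gb):
        # Skip ahead until we find a device with room for this layer.
        while device < len(budgets_gb) and used + size > budgets_gb[device]:
            device, used = device + 1, 0.0
        if device == len(budgets_gb):
            raise MemoryError(f"layer {index} does not fit on any remaining device")
        device_map[f"layer_{index}"] = device
        used += size
    return device_map


# Four hypothetical 1 GB layers across two GPUs with 2 GB of room each.
print(greedy_device_map([1.0, 1.0, 1.0, 1.0], [2.0, 2.0]))
```

The practical takeaway is that with this kind of placement each GPU holds a different slice of the model, so a single request passes through the cards one after another. That gets you more usable memory, not necessarily more speed, which is the main difference from DeepSpeed splitting each layer across both GPUs.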

4. Serving OpenAI-Compatible APIs

If you want to serve up an OpenAI-compatible API, check out FastAPI or LocalAI:

FastAPI + Uvicorn

Install:

pip install fastapi uvicorn

API service code:

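from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()
model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load in fp16 to halve memory use, and put the model on the same GPU as the inputs.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda:0")

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 64

@app.post("/v1/completions")
def completions(request: CompletionRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda:0")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return {"choices": [{"text": tokenizer.decode(new_tokens, skip_special_tokens=True)}]}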
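
The handler above returns only a choices list, which is enough for a quick test but thinner than what many OpenAI client libraries expect. Here’s a sketch of a helper that fills in the rest of the legacy completions response shape. The function name and the default finish_reason are my own choices, and you’d need to supply the token counts from the tokenizer yourself.

```python
import time
import uuid


def completion_response(
    model: str,
    text: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str = "stop",  # use "length" when generation hit the token limit
) -> dict:
    """Build a reply shaped like the legacy OpenAI /v1/completions response."""
    return {
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {"text": text, "index": 0, "logprobs": None, "finish_reason": finish_reason}
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

Returning this dict from the endpoint instead of the bare choices list should keep stricter clients from choking on missing fields.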

Run it (assuming the code above is saved as serve.py):

uvicorn serve:app --host 0.0.0.0 --port 8000
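
Once the server is up on port 8000, you can poke it from Python without installing any client library. This is a minimal sketch using only the standard library. The function names are mine, and the "model" field is included because OpenAI-style servers generally expect one, even though a simple single-model handler may ignore it.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, prompt: str, model: str = "local-model", max_tokens: int = 64
) -> urllib.request.Request:
    """Prepare a POST to an OpenAI-style /v1/completions endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, prompt: str, **kwargs) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_completion_request(base_url, prompt, **kwargs)) as reply:
        return json.load(reply)["choices"][0]["text"]


# Usage (the server must already be running):
# print(complete("http://localhost:8000", "Once upon a time"))
```

Because it only depends on the endpoint path and JSON shape, the same client should work against any of the OpenAI-compatible servers in this post if you swap the base URL and model name.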

LocalAI

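Install: Same drill as above, but grab the Linux build (or the Docker image) instead of the darwin-arm64 binary.
Run (flag names have shifted between releases, so check local-ai --help if this complains):

local-ai --models-path /models --address :8080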

Wrapping Up

That’s pretty much the playbook for making Ollama and other OpenAI-compatible tools fly on your local GPU, whether you’re team MacBook (thank you, Metal API) or team Ubuntu (hello, DeepSpeed). Follow these setups and you’ll squeeze every last drop out of your hardware while serving up solid APIs.