Optimizing LLM Performance on Mac and Ubuntu

When you’re wrangling large language models (LLMs) on a MacBook Pro with an Apple M1 or M2 chip, or on an Ubuntu box with twin NVIDIA 2080 Ti GPUs, performance tuning and GPU acceleration aren’t just nice-to-haves. They’re survival. Here’s how I’ve been installing and running Ollama (with GPU support!) and a few other OpenAI API-compatible tools that play nice with local hardware.
1. Ollama with GPU Acceleration on macOS
Ollama taps into the llama.cpp library, which, in turn, rides macOS’s Metal API for GPU acceleration. Here’s how to make sure Ollama is actually firing on all GPU cylinders:
Enabling GPU Acceleration
You don’t need to jump through any extra hoops—Ollama auto-detects your hardware and uses GPU acceleration out of the box. If you’ve got an M1 or M2 MacBook, it’ll quietly lean on the Metal framework for a nice speed boost.
Checking if GPU Acceleration is Working
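Want to see if GPU acceleration’s actually happening? Run ollama ps while a model is loaded; its PROCESSOR column reports how much of the model sits on the GPU versus the CPU. You can also run an inference task and compare its speed to a CPU-only run, or dig through the server logs for any telltale Metal mentions.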
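To put a number on the comparison, here is a minimal sketch that times one request against a local Ollama server. It assumes Ollama is listening on its default address (http://localhost:11434) and relies on the eval_count and eval_duration fields that Ollama's /api/generate endpoint returns; the helper names are my own.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation throughput from an Ollama response (durations are in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str, url: str = OLLAMA_URL) -> float:
    """Send one non-streaming request and return tokens generated per second."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Calling benchmark with the name of a model you have already pulled and a short prompt gives a tokens-per-second figure. Run it a few times, since the first call also pays the model load time.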
Performance Tweaks
- Use quantized models: Ollama lets you load quantized model variants (4-bit or 8-bit, for example), which shrink the memory footprint and usually speed up generation, at some cost in output quality.
- Go smaller: Picking a smaller variant (say, llama-7b) keeps the hardware stress to a minimum.
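As a rough back-of-the-envelope check on why quantization and smaller models help, the sketch below estimates the memory taken by the weights alone. It is a simplification: it ignores the KV cache, activations, and per-format overhead, so treat the numbers as a lower bound. The function name is just illustrative.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gib(7e9, bits):.1f} GiB")
# 7B model at 16-bit: ~13.0 GiB
# 7B model at 8-bit: ~6.5 GiB
# 7B model at 4-bit: ~3.3 GiB
```

On a machine with unified memory, that difference often decides whether the model fits at all.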
2. Other Tools and Platforms
If you’re eyeing alternatives to Ollama that support local execution, GPU acceleration, and OpenAI API compatibility, here are a few worth trying:
(1) Text Generation Web UI
- What it is: An open-source web interface for running models like LLaMA, GPT-J, and friends.
- Why it’s cool:
  - OpenAI API compatible
  - GPU acceleration via macOS Metal
- Install it:
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt
python server.py
(2) GPT4All
- What it is: Local LLM project with support for a bunch of models.
- Why it’s cool:
  - Metal acceleration on macOS
  - CLI and GUI that don’t make you want to flip tables
- Install it:
brew install gpt4all
gpt4all
(3) LocalAI
- What it is: An OpenAI-compatible API server for LLaMA-family models.
- Why it’s cool:
  - Metal GPU support on macOS
  - REST API support
- Install it:
curl -LO https://github.com/go-skynet/LocalAI/releases/download/v1.0.0/local-ai-darwin-arm64
chmod +x local-ai-darwin-arm64
./local-ai-darwin-arm64
(4) MLC LLM
- What it is: A framework tuned for macOS and mobile gadgets.
- Why it’s cool:
  - GPU acceleration via Apple Metal
  - Runs a bunch of models
- Install it: Just grab the precompiled binary, load up a model, and go.
(5) llama.cpp
- What it is: A lean, mean, LLaMA-running machine.
- Why it’s cool:
  - Supports Metal API
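- Install it (these are the classic Makefile steps; recent releases build with CMake and name the binary llama-cli instead of main):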
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
./main -m path/to/llama/model
3. Running Multiple NVIDIA GPUs on Ubuntu
If you’re going full beast mode with two NVIDIA 2080 Ti GPUs on Ubuntu and want to spin up an OpenAI-compatible API for large models, here’s roughly what I did:
System and Environment Prep
- Install NVIDIA drivers:
sudo apt update
sudo apt install -y nvidia-driver-530
sudo reboot
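- Install CUDA and cuDNN: Make sure your versions match your drivers (don’t ask how I know). The PyTorch wheels installed below bundle their own CUDA runtime, so the system toolkit mainly matters when you compile extensions such as DeepSpeed’s custom ops.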
- Install Python and deps:
sudo apt install -y python3 python3-pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
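Before going further, it is worth confirming that PyTorch actually sees both cards. The snippet below is a small sanity check using standard torch.cuda calls; the describe_gpus helper is my own name, and it deliberately returns an empty list instead of failing when PyTorch or CUDA is missing.

```python
def describe_gpus() -> list[str]:
    """Return one line per CUDA device PyTorch can see (empty if none)."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        f"cuda:{i} {torch.cuda.get_device_name(i)} (CUDA {torch.version.cuda})"
        for i in range(torch.cuda.device_count())
    ]


for line in describe_gpus():
    print(line)
```

On the dual 2080 Ti setup you would expect two lines, one for cuda:0 and one for cuda:1. No output usually points at a driver problem or a CPU-only PyTorch build.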
Using Distributed Inference Frameworks
To put both GPUs to work, frameworks like DeepSpeed or Hugging Face Accelerate are your friends:
DeepSpeed
- Install:
pip install deepspeed
- Sample inference:
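import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The 2080 Ti has no bfloat16 support, so load the weights in float16.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
# Shard the model across both GPUs. Launch with: deepspeed --num_gpus 2 your_script.py
# (newer DeepSpeed releases prefer tensor_parallel={"tp_size": 2} over mp_size).
ds_engine = deepspeed.init_inference(model, mp_size=2, dtype=torch.float16)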
Hugging Face Accelerate
- Install:
pip install accelerate
- Configure:
accelerate config
- Sample inference:
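import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import dispatch_model, infer_auto_device_map

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
# Plan which layers go on which device, keeping each transformer block on one GPU.
device_map = infer_auto_device_map(model, no_split_module_classes=["GPTNeoBlock"])
# Computing the map moves nothing by itself; dispatch_model applies it.
model = dispatch_model(model, device_map=device_map)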
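One thing to watch: infer_auto_device_map fills devices in order, so a model that fits on the first GPU tends to stay there entirely. Passing a max_memory mapping caps what each device may take. The helper below is a hypothetical convenience for building that mapping; the 11 GiB figure is the 2080 Ti's memory and the 2 GiB headroom is just a guess on my part.

```python
def balanced_max_memory(n_gpus: int, gpu_gib: int, headroom_gib: int = 2) -> dict:
    """Build a max_memory mapping that caps every GPU at the same budget."""
    if headroom_gib >= gpu_gib:
        raise ValueError("headroom must be smaller than GPU memory")
    return {i: f"{gpu_gib - headroom_gib}GiB" for i in range(n_gpus)}


# Two 11 GiB cards, leaving 2 GiB each for activations and the KV cache.
max_memory = balanced_max_memory(n_gpus=2, gpu_gib=11)
print(max_memory)  # {0: '9GiB', 1: '9GiB'}
```

The result can be passed as infer_auto_device_map(model, max_memory=max_memory). To force a small model to actually split across both cards, set the per-GPU cap below the model's total size.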
4. Serving OpenAI-Compatible APIs
If you want to serve up an OpenAI-compatible API, check out FastAPI or LocalAI:
FastAPI + Uvicorn
- Install:
pip install fastapi uvicorn
- API service code:
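from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()
model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load in half precision and move to the first GPU so inputs and weights share a device.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda:0")

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 64

@app.post("/v1/completions")
def completions(request: CompletionRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda:0")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    # Decode only the newly generated tokens; a raw tensor is not JSON-serializable.
    text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return {"choices": [{"text": text}]}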
- Run it (assuming the code above is saved as serve.py):
uvicorn serve:app --host 0.0.0.0 --port 8000
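Off-the-shelf OpenAI clients generally expect more than a bare choices list in the reply. Here is a sketch of a helper that wraps generated text in the usual text-completion shape. The field names follow the OpenAI completions response format; the hard-coded finish_reason of "stop" and the helper name are my own simplifications.

```python
import time
import uuid


def make_completion_response(text: str, model: str,
                             prompt_tokens: int, completion_tokens: int) -> dict:
    """Wrap generated text in the shape of an OpenAI text-completion response."""
    return {
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        # Assumption: generation always ended normally, hence "stop".
        "choices": [{"text": text, "index": 0, "logprobs": None, "finish_reason": "stop"}],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

In the endpoint, the token counts can come from the lengths of the tokenized prompt and the generated sequence.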
LocalAI
- Install: See above for the drill.
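- Run (the release binary is named local-ai; flag names have changed between versions, so confirm with --help):
local-ai --models-path /models --address :8080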
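Whichever server you pick, the client side looks the same. The sketch below builds and sends an OpenAI-style completion request using only the standard library. The base URL, default model name, and helper names are placeholders of mine; point them at whatever you are actually running.

```python
import json
import urllib.request


def build_completion_request(base_url: str, prompt: str, max_tokens: int = 64,
                             model: str = "EleutherAI/gpt-neo-2.7B") -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /v1/completions endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, prompt: str) -> str:
    """Send the request and return the first choice's text."""
    with urllib.request.urlopen(build_completion_request(base_url, prompt)) as reply:
        return json.load(reply)["choices"][0]["text"]
```

For example, complete("http://localhost:8000", "Hello") would talk to the FastAPI service, and swapping the port to 8080 would target LocalAI instead.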
Wrapping Up
That’s pretty much the playbook for making Ollama and other OpenAI-compatible tools fly on your local GPU, whether you’re team MacBook (thank you, Metal API) or team Ubuntu (hello, DeepSpeed). Follow these setups and you’ll squeeze every last drop out of your hardware while serving up solid APIs.


