Running LLaMA 65B on a 64GB M1 Max MacBook with llama.cpp


More detailed instructions here: https://til.simonwillison.net/llms/llama-7b-m2

There are several more steps required to run 13B/30B/65B.
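
For reference, this is roughly the directory layout the conversion script expects (the larger models ship as multiple consolidated.*.pth shards: 2 for 13B, 4 for 30B, 8 for 65B):

models/
  tokenizer.model
  7B/
    consolidated.00.pth
    params.json
  65B/
    consolidated.00.pth
    ...
    consolidated.07.pth
    params.json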

Go star llama.cpp: https://github.com/ggerganov/llama.cpp

Setup

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
pip install torch numpy sentencepiece # you might want to do this in a virtual env
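
If you'd rather keep the Python dependencies out of your system install, a minimal virtual environment sketch:

python3 -m venv venv
source venv/bin/activate
pip install torch numpy sentencepiece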

Updates

There are now helper scripts for converting and quantizing the weights:

python3 convert-pth-to-ggml.py models/7B/ 1   # 1 = convert the weights to f16

./quantize.sh 7B
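
Once the 7B model is quantized you can run it the same way as the larger models below; the prompt here is just an example:

./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p "The first man on the moon was"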

65B

python3 convert-pth-to-ggml.py models/65B/ 1
./quantize ./models/65B/ggml-model-f16.bin ./models/65B/ggml-model-q4_0.bin 2
./quantize ./models/65B/ggml-model-f16.bin.1 ./models/65B/ggml-model-q4_0.bin.1 2
./quantize ./models/65B/ggml-model-f16.bin.2 ./models/65B/ggml-model-q4_0.bin.2 2
./quantize ./models/65B/ggml-model-f16.bin.3 ./models/65B/ggml-model-q4_0.bin.3 2
./quantize ./models/65B/ggml-model-f16.bin.4 ./models/65B/ggml-model-q4_0.bin.4 2
./quantize ./models/65B/ggml-model-f16.bin.5 ./models/65B/ggml-model-q4_0.bin.5 2
./quantize ./models/65B/ggml-model-f16.bin.6 ./models/65B/ggml-model-q4_0.bin.6 2
./quantize ./models/65B/ggml-model-f16.bin.7 ./models/65B/ggml-model-q4_0.bin.7 2
./main -m ./models/65B/ggml-model-q4_0.bin -t 8 -n 128 -p "when is the singularity going to kill us?"
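
Note that ./main only needs the path to the first quantized file; llama.cpp picks up the .1, .2, etc. shards on its own.

The eight per-shard quantize commands can also be written as a loop if you prefer (a sketch; the trailing 2 selects the q4_0 quantization type):

./quantize ./models/65B/ggml-model-f16.bin ./models/65B/ggml-model-q4_0.bin 2
for i in 1 2 3 4 5 6 7; do
  ./quantize ./models/65B/ggml-model-f16.bin.$i ./models/65B/ggml-model-q4_0.bin.$i 2
done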

30B
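
As with 65B, convert the PyTorch weights to ggml first (assuming they are in models/30B/):

python3 convert-pth-to-ggml.py models/30B/ 1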

./quantize ./models/30B/ggml-model-f16.bin   ./models/30B/ggml-model-q4_0.bin 2
./quantize ./models/30B/ggml-model-f16.bin.1 ./models/30B/ggml-model-q4_0.bin.1 2
./quantize ./models/30B/ggml-model-f16.bin.2 ./models/30B/ggml-model-q4_0.bin.2 2
./quantize ./models/30B/ggml-model-f16.bin.3 ./models/30B/ggml-model-q4_0.bin.3 2
./main -m ./models/30B/ggml-model-q4_0.bin -t 8 -n 128 -p "never gonna"
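
llama.cpp also has an interactive mode if you'd rather chat with the model than hand it a single prompt. Flags change quickly, so check ./main --help, but something along these lines should work (a sketch):

./main -m ./models/30B/ggml-model-q4_0.bin -t 8 -n 256 --color -i -r "User:" -p "User: Hi there!"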