Running LLaMA 65B on a 64GB M1 Max MacBook with llama.cpp
65B running on m1 max/64gb! 🦙🦙🦙🦙🦙🦙🦙 pic.twitter.com/Dh2emCBmLY
— Lawrence Chen (@lawrencecchen) March 11, 2023
More detailed instructions here: https://til.simonwillison.net/llms/llama-7b-m2
There are several more steps required to run 13B/30B/65B.
Go star llama.cpp: https://github.com/ggerganov/llama.cpp
Setup
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
pip install torch numpy sentencepiece # you might want to do this in a virtual env
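If you'd rather keep those Python packages isolated, a throwaway virtual environment works too (the venv name here is just an example):

python3 -m venv venv
source venv/bin/activate
pip install torch numpy sentencepiece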
Updates
There are now helper scripts for converting and quantizing the weights (shown here for 7B):
python3 convert-pth-to-ggml.py models/7B/ 1
./quantize.sh 7B
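If I'm reading the script right, the trailing 1 asks convert-pth-to-ggml.py for f16 output, and quantize.sh then produces the 4-bit q4_0 file from it. After that, the 7B model should run the same way as the bigger ones below, for example:

./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p "the first man on the moon was"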
65B
python3 convert-pth-to-ggml.py models/65B/ 1
./quantize ./models/65B/ggml-model-f16.bin ./models/65B/ggml-model-q4_0.bin 2
./quantize ./models/65B/ggml-model-f16.bin.1 ./models/65B/ggml-model-q4_0.bin.1 2
./quantize ./models/65B/ggml-model-f16.bin.2 ./models/65B/ggml-model-q4_0.bin.2 2
./quantize ./models/65B/ggml-model-f16.bin.3 ./models/65B/ggml-model-q4_0.bin.3 2
./quantize ./models/65B/ggml-model-f16.bin.4 ./models/65B/ggml-model-q4_0.bin.4 2
./quantize ./models/65B/ggml-model-f16.bin.5 ./models/65B/ggml-model-q4_0.bin.5 2
./quantize ./models/65B/ggml-model-f16.bin.6 ./models/65B/ggml-model-q4_0.bin.6 2
./quantize ./models/65B/ggml-model-f16.bin.7 ./models/65B/ggml-model-q4_0.bin.7 2
./main -m ./models/65B/ggml-model-q4_0.bin -t 8 -n 128 -p "when is the singularity going to kill us?"
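The 65B weights come as eight shards, which is why the quantize command is repeated for .bin through .bin.7. If you'd rather not paste eight near-identical lines, a small shell loop should do the same thing (the same idea works for the four 30B shards below):

for suffix in "" .1 .2 .3 .4 .5 .6 .7; do
  ./quantize ./models/65B/ggml-model-f16.bin$suffix ./models/65B/ggml-model-q4_0.bin$suffix 2
done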
30B
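Same pattern as 65B: presumably you convert the 30B weights to f16 first, then quantize each of the four shards.

python3 convert-pth-to-ggml.py models/30B/ 1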
./quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q4_0.bin 2
./quantize ./models/30B/ggml-model-f16.bin.1 ./models/30B/ggml-model-q4_0.bin.1 2
./quantize ./models/30B/ggml-model-f16.bin.2 ./models/30B/ggml-model-q4_0.bin.2 2
./quantize ./models/30B/ggml-model-f16.bin.3 ./models/30B/ggml-model-q4_0.bin.3 2
./main -m ./models/30B/ggml-model-q4_0.bin -t 8 -n 128 -p "never gonna"