You need to pass `n_gpu_layers` when initializing `Llama()`, which offloads that many transformer layers to the GPU. If you have enough VRAM, just set an arbitrarily high number (or `-1` for all layers); otherwise, decrease it until you no longer run out of memory.
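For example, a minimal sketch (the model path is a placeholder; point it at your own GGUF file):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 offloads all layers; lower it if you run out of VRAM
)

output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])
```

Note that this only takes effect if llama-cpp-python was built with a GPU backend (e.g. CUDA or Metal); with a CPU-only build, `n_gpu_layers` is ignored. You can verify the offload in the load log, which reports how many layers were placed on the GPU.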