Complete guide to running a GPU-accelerated LLM with WSL2
This is probably the easiest way to run an LLM for free on your PC
5 min read · Jul 4, 2023
If you would like to test different LLMs locally for free and happen to have a GPU-powered PC at home, you're in luck — thanks to the wonderful open-source community, running different LLMs on Windows is very straightforward. In this article I will show you how to run a version of the Vicuna model in WSL2 with GPU acceleration and prompt the model from Python via an API.
What we will use:
- Windows 10 or 11 — I used Windows 11, but everything should work on Windows 10 as well.
- GPU card — I have an Nvidia RTX 3060 with 12 GB of dedicated memory, but this will work with less dedicated GPU memory or even without a GPU. Just know that if you use a CPU, model inference will be very slow.
- The popular text-generation-webui — this is a ready-to-use Gradio web UI for running different LLMs, including LLaMA-family models like Vicuna.
- TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g model — this is a version of Vicuna 13B compressed using 4-bit quantization. I picked this model because it is a lightweight (less than 8 GB) version of the popular Vicuna model (the authors of the Vicuna model claim it achieves 90%…
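Once the web UI is running with its API enabled, prompting the model from Python is a short script. The sketch below is a minimal example against the legacy `/api/v1/generate` endpoint that text-generation-webui exposed at the time (started with the `--api` flag); the port, endpoint path, and payload fields are assumptions that may differ in your installed version, so check its documentation for the exact schema.

```python
# Minimal sketch: prompt a model served by text-generation-webui from Python.
# Assumes the UI was launched with --api, exposing the legacy blocking API
# on port 5000 (adjust URL/fields for your version).
import json
import urllib.request

API_URL = "http://localhost:5000/api/v1/generate"  # assumed default

def build_payload(prompt: str, max_new_tokens: int = 200) -> dict:
    """Assemble the JSON body for the generate endpoint."""
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": 0.7,
    }

def ask(prompt: str) -> str:
    """POST the prompt to the local API and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    # The legacy API returns {"results": [{"text": "..."}]}
    return result["results"][0]["text"]

# Usage (with the server running):
#   print(ask("Explain WSL2 in one sentence."))
```

Only the Python standard library is used here, so no extra packages are needed on the client side.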