Complete guide to running a GPU-accelerated LLM with WSL2

This is probably the easiest way to run an LLM for free on your PC

Dasha Herrmannova, Ph.D.
5 min read · Jul 4, 2023
[Image: A laptop computer with a llama wearing glasses popping out of the screen. Created using Midjourney.]

If you would like to test different LLMs locally for free and happen to have a GPU-powered PC at home, you're in luck: thanks to the wonderful open-source community, running different LLMs on Windows is very straightforward. In this article, I will show you how to run a version of the Vicuna model in WSL2 with GPU acceleration and prompt the model from Python via an API.
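To give a flavor of the end result, here is a minimal sketch of prompting a locally running instance from Python. The endpoint URL, port, and payload fields are assumptions based on the web UI's API extension being enabled; adjust them to match your setup.

```python
# Sketch: prompting a locally running LLM web UI from Python.
# The URL, port, and payload field names below are assumptions (they match
# what the web UI's API extension commonly expects); adjust to your setup.
import json
import urllib.request

API_URL = "http://localhost:5000/api/v1/generate"  # assumed host/port


def build_payload(prompt: str, max_new_tokens: int = 200) -> dict:
    """Assemble a minimal generation request body."""
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": 0.7,
    }


def query_model(prompt: str) -> str:
    """POST the prompt to the local API and return the generated text."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # Assumed response shape: {"results": [{"text": "..."}]}
    return result["results"][0]["text"]


# Example usage (requires the web UI running with its API enabled):
# print(query_model("What is WSL2?"))
```

The later sections of this guide walk through getting the server side of this running.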

What we will use:

  • Windows 10 or 11 — I used Windows 11, but everything should work on Windows 10 as well.
  • GPU card — I have an Nvidia RTX 3060 with 12 GB of dedicated memory, but this will work with less dedicated GPU memory, or even without a GPU. Just know that if you use a CPU, model inference will be very slow.
  • The popular text-generation-webui — a ready-to-use Gradio web UI for running different LLMs, including LLaMA-family models like Vicuna.
  • The TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g model — a version of Vicuna 13B compressed using 4-bit quantization. I picked this model because it is a lightweight (less than 8 GB) version of the popular Vicuna model (the authors of the Vicuna model claim it achieves 90%
