Run OpenAI's gpt-oss-20b LLM Locally: A How-To Guide

The Register

OpenAI recently made its large language models (LLMs) more accessible by releasing two “open-weight” models, gpt-oss-20b and gpt-oss-120b, which users can download and run directly on their personal computers. The move marks a significant step toward democratizing access to advanced AI, letting users harness powerful models without relying on cloud infrastructure.

The lighter of the two, gpt-oss-20b, features 21 billion parameters – a measure of its complexity and size – and requires approximately 16GB of free memory to operate. Its larger sibling, gpt-oss-120b, is a substantially more demanding model with 117 billion parameters, necessitating a hefty 80GB of memory. To put this in perspective, a cutting-edge “frontier” model like DeepSeek R1 boasts 671 billion parameters and demands around 875GB of memory, which explains why major AI developers are rapidly constructing massive data centers. While gpt-oss-120b remains largely out of reach for most home setups, gpt-oss-20b is surprisingly accessible.
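Where do these memory figures come from? A back-of-envelope calculation helps: a model's weights dominate its footprint, and gpt-oss ships with heavily quantized weights. Here is a minimal sketch, assuming roughly 4-bit weights – an assumption, but one consistent with the ~13GB download size mentioned below:

```python
# Back-of-envelope memory estimate for gpt-oss-20b. Assumption: weights
# stored at roughly 4 bits (0.5 bytes) per parameter, in line with the
# ~12.4GB-13GB download Ollama reports for the model.
params = 21e9            # 21 billion parameters
bytes_per_param = 0.5    # ~4-bit quantized weights (assumption)
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB")  # ~10.5 GB
# KV cache, activations, and runtime overhead push the practical
# requirement toward the 16GB figure quoted above.
```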

To run gpt-oss-20b, a computer needs either a graphics processing unit (GPU) with at least 16GB of dedicated video random access memory (VRAM), or a minimum of 24GB of system memory, ensuring at least 8GB remains available for the operating system and other applications. Performance depends crucially on memory bandwidth: a graphics card using GDDR7 or GDDR6X memory, which can transfer data at over 1,000 GB/s, will significantly outperform a system feeding the model from typical laptop or desktop DDR4 or DDR5 memory, which manages only 20 to 100 GB/s.
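The reason bandwidth dominates is that, for every token generated, the runtime must stream the model's weights through the processor, so sustained throughput is roughly capped at bandwidth divided by the model's in-memory size. A rough sketch of that ceiling, using assumed round-number bandwidths:

```python
# Why bandwidth matters: generating one token requires streaming (most
# of) the model weights from memory, so throughput is roughly bounded
# by memory bandwidth divided by the model's in-memory size.
model_gb = 12.5   # approximate in-memory size of gpt-oss-20b (assumption)

for label, bandwidth_gbps in [
    ("dual-channel DDR5 (typical laptop)", 80),
    ("GDDR6X/GDDR7 graphics card", 1000),
]:
    tokens_per_sec = bandwidth_gbps / model_gb
    print(f"{label}: ~{tokens_per_sec:.0f} tokens/sec ceiling")
# ~6 vs ~80 tokens/sec: the order-of-magnitude gap described above.
```

This naive model is only an illustration of how the numbers scale: gpt-oss is a mixture-of-experts design that activates only a fraction of its weights per token, so real systems can beat this dense-model ceiling.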

For local deployment, Ollama emerges as a key tool. This free client application streamlines the process of downloading and executing these LLMs across Windows, Linux, and macOS. Users can begin by downloading and installing Ollama for their respective operating systems. Once launched, the application typically defaults to gpt-oss:20b. Initiating a prompt, such as “Write a letter,” will trigger a substantial download of the model data – approximately 12.4GB to 13GB depending on the platform – a process that can take a considerable amount of time. After the download completes, users can interact with gpt-oss-20b through Ollama’s intuitive graphical interface.
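The graphical client is not the only way in: Ollama also serves a local HTTP API on port 11434 that scripts and other applications can call. A minimal sketch in Python, assuming Ollama is running and the model has already been downloaded:

```python
import requests

# Prompt gpt-oss-20b through the local HTTP API the Ollama app serves
# on port 11434 (assumes Ollama is running and the model is present).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:20b",
        "prompt": "Write a letter",
        "stream": False,   # return one JSON object instead of a stream
    },
    timeout=600,           # slow hardware can take minutes (see below)
)
resp.raise_for_status()
print(resp.json()["response"])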

For those who prefer a more technical approach or seek performance insights, Ollama also supports command-line interface (CLI) operation. Running Ollama from the terminal allows users to activate a “verbose mode,” which provides detailed statistics, including the time taken to complete a query. This option is available across all supported operating systems, offering greater control and diagnostic information.
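On any platform, `ollama run gpt-oss:20b --verbose` from a terminal prints those statistics after each response. The same figures – nanosecond timings and token counts – also appear in the final JSON object the local API returns, so a script can compute them itself. A sketch, reusing the assumptions above:

```python
import requests

# Ask a short question and report verbose-style statistics. Field names
# (total_duration, eval_count, eval_duration) come from Ollama's
# /api/generate response; durations are reported in nanoseconds.
data = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:20b",
        "prompt": "Who was the first president of the United States?",
        "stream": False,
    },
    timeout=600,
).json()

tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"total time : {data['total_duration'] / 1e9:.1f} s")
print(f"generation : {tokens_per_sec:.1f} tokens/sec")
```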

To evaluate gpt-oss-20b’s local performance, tests were conducted on three diverse hardware configurations using two prompts: a request for a 600-word fan letter to Taylor Swift and a simpler query about the first US president. The test devices included a Lenovo ThinkPad X1 Carbon laptop (Core Ultra 7 165U CPU, 64GB LPDDR5x-6400 RAM), an Apple MacBook Pro (M1 Max CPU, 32GB LPDDR5x-6400 RAM), and a custom-built PC featuring a discrete Nvidia RTX 6000 Ada GPU (AMD Ryzen 9 5900X CPU, 128GB DDR4-3200 RAM).

The Lenovo ThinkPad X1 Carbon exhibited notably slow performance: the fan letter took 10 minutes and 13 seconds, while the simple presidential query required 51 seconds. The sluggishness was largely down to Ollama’s inability to leverage the laptop’s integrated graphics or neural processing unit (NPU), forcing all the work onto the less efficient CPU; the model’s initial “thinking” phase alone, in which it reasons through a prompt before generating output, took a minute or two. In contrast, the Apple MacBook Pro significantly outperformed the ThinkPad despite a similar memory data rate, completing the fan letter in just 26 seconds and answering the presidential question in a mere three seconds – the M1 Max’s much wider memory bus gives it far greater bandwidth. Unsurprisingly, the desktop PC, powered by the high-end Nvidia RTX 6000 Ada GPU, delivered the fan letter in a swift six seconds and the answer to the presidential query in under half a second.
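Those wall-clock times imply very different throughput. As a rough illustration only – the tests reported times, not token counts – assume the 600-word letter plus the model’s reasoning comes to about 1,000 generated tokens:

```python
# Implied generation rates for the fan-letter test, assuming ~1,000
# tokens of output (a guess; only wall-clock times were measured).
timings_sec = {
    "ThinkPad X1 Carbon (CPU only)": 10 * 60 + 13,
    "MacBook Pro (M1 Max)": 26,
    "PC with RTX 6000 Ada": 6,
}
TOKENS = 1000
for machine, seconds in timings_sec.items():
    print(f"{machine}: ~{TOKENS / seconds:.0f} tokens/sec")
# Roughly 2 vs 38 vs 167 tokens/sec: a spread that mirrors each
# system's memory bandwidth, as discussed above.
```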

These results underscore that local performance of gpt-oss-20b is highly dependent on hardware. Systems equipped with powerful dedicated GPUs or modern Apple Silicon processors can expect robust performance. However, users on Intel or AMD-powered laptops relying on integrated graphics that Ollama does not fully support may experience considerable delays, potentially necessitating a break while their queries process. For those facing such performance bottlenecks, alternative applications like LM Studio, which also facilitates local LLM execution, might offer a more optimized experience.