NVIDIA Optimizes OpenAI’s GPT-OSS for Ultra-Fast Local Inference
NVIDIA and OpenAI have partnered to optimize two new open-source reasoning models, gpt-oss-20b and gpt-oss-120b, for NVIDIA GPUs, marking a major step in making advanced AI accessible across devices from the cloud to personal PCs. The models, designed for complex tasks such as web search, in-depth research, coding assistance, and document analysis, are now available to developers, enterprises, and AI enthusiasts worldwide.

Trained on NVIDIA H100 GPUs, both models use a mixture-of-experts architecture with chain-of-thought reasoning and support context lengths of up to 131,072 tokens (among the longest available for local inference), enabling deep understanding of large documents and complex queries. They are optimized for NVIDIA’s latest hardware, including the GeForce RTX 5090 and Blackwell-based systems, delivering up to 256 tokens per second on high-end consumer GPUs and over 1.5 million tokens per second on NVIDIA GB200 NVL72 systems.

A key innovation is the use of MXFP4 4-bit precision, which reduces memory and power demands without sacrificing accuracy, an efficiency gain that becomes critical as models scale toward trillion-parameter sizes. A simplified illustration of the idea appears at the end of this article.

NVIDIA’s collaboration with OpenAI extends beyond hardware. The company has worked with leading open-source frameworks, including Ollama, llama.cpp, Hugging Face, vLLM, and FlashInfer, to ensure seamless deployment across platforms; minimal code sketches for several of these paths also appear at the end of this article.

Ollama now offers out-of-the-box support for the gpt-oss models on RTX AI PCs, letting users run them with minimal setup: select the model and start chatting. The Ollama app also supports file uploads (PDFs and text), multimodal inputs (images), and customizable context lengths, making it well suited to developers and hobbyists alike.

Windows users can also access the models through Microsoft AI Foundry Local, now in public preview, which enables on-device inference via the command line, an SDK, or an API. It uses ONNX Runtime optimized with CUDA, with NVIDIA TensorRT support coming soon.

This release underscores NVIDIA’s full-stack leadership in AI, from training to inference, and its commitment to open-source innovation. The CUDA platform, with over 450 million downloads, continues to serve as the foundation for AI development across some 250 countries and territories. By integrating OpenAI’s models into its ecosystem, NVIDIA empowers a global community of 6.5 million developers to build next-generation AI applications in healthcare, manufacturing, creative workflows, and beyond.

The partnership builds on a long history of collaboration, dating back to 2016, when Jensen Huang delivered the first DGX-1 supercomputer to OpenAI. Today’s release demonstrates how open models, combined with optimized software and hardware, accelerate innovation and strengthen U.S. leadership in AI.

NVIDIA also continues to support the open-source community through contributions to llama.cpp and GGML, including CUDA Graphs support and CPU-overhead reductions that improve RTX performance. Developers can explore more through the RTX AI Garage blog series, NVIDIA’s social channels, and the Discord community.

With these models, OpenAI and NVIDIA are unlocking a new era of accessible, powerful AI, one in which advanced reasoning is no longer limited to large corporations but available to anyone with an RTX-powered device.
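For anyone who wants to try the models immediately, below is a minimal sketch using Ollama’s Python client. It assumes Ollama is installed and serving locally and that the model has already been pulled; the "gpt-oss:20b" tag matches Ollama’s published listing at the time of writing, so check `ollama list` if it differs on your system.

```python
# Minimal chat with gpt-oss-20b via Ollama's Python client.
# Assumes the Ollama service is running and the gpt-oss:20b model is pulled.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",
    messages=[
        {"role": "user", "content": "Summarize mixture-of-experts routing in three bullets."}
    ],
)
print(response["message"]["content"])
```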
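For server-side deployment, a minimal vLLM sketch could look like the following. The Hugging Face model ID openai/gpt-oss-20b is the published listing; actually running it assumes a recent vLLM build with gpt-oss support and a GPU with sufficient memory.

```python
# Offline batch inference with vLLM; requires a gpt-oss-capable vLLM release.
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain why 4-bit precision reduces memory bandwidth pressure."], params
)
print(outputs[0].outputs[0].text)
```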
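Foundry Local exposes an OpenAI-compatible endpoint on the local machine, so a standard OpenAI client can address it. The base URL and model alias below are placeholders (Foundry Local reports its actual port and model names at startup), so substitute the values your installation prints.

```python
# Querying gpt-oss through Foundry Local's OpenAI-compatible local endpoint.
# Base URL and model alias are placeholders; use what Foundry Local reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder alias
    messages=[{"role": "user", "content": "What context length do you support?"}],
)
print(resp.choices[0].message.content)
```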
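Finally, to make the MXFP4 idea concrete: the format packs small blocks of 4-bit floating-point (FP4, E2M1) values that share a single power-of-two scale per block, so storage drops to roughly 4 bits per weight plus a small amortized scale. The sketch below is a simplified NumPy illustration of that block-scaling scheme, not NVIDIA’s actual kernels; the 32-element block size is taken from the MX specification.

```python
# Illustrative block-scaled 4-bit quantization in the spirit of MXFP4.
# Real MXFP4 stores FP4 (E2M1) codes plus one shared power-of-two scale
# per 32-element block; this sketch only models the rounding behavior.
import numpy as np

FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 grid
BLOCK = 32  # elements sharing one scale

def quantize_block(x: np.ndarray) -> tuple[int, np.ndarray]:
    """Pick a power-of-two scale that fits the block's largest magnitude into
    the FP4 range, then round each element to the nearest representable value."""
    amax = np.abs(x).max()
    exp = 0 if amax == 0 else int(np.ceil(np.log2(amax / FP4_MAGNITUDES[-1])))
    scaled = x / 2.0**exp
    # Candidate values carry each element's sign; pick the closest per element.
    candidates = np.sign(scaled)[:, None] * FP4_MAGNITUDES[None, :]
    nearest = np.abs(scaled[:, None] - candidates).argmin(axis=1)
    return exp, candidates[np.arange(len(x)), nearest]

def dequantize_block(exp: int, q: np.ndarray) -> np.ndarray:
    return q * 2.0**exp

x = np.random.randn(BLOCK)
exp, q = quantize_block(x)
print("max abs reconstruction error:", np.abs(x - dequantize_block(exp, q)).max())
```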