HyperAI
Headlines
Anthropic eyes $5B funding round that could push valuation to $170B despite CEO's concerns over ties to authoritarian investors
5 days ago
Optimizing LLMs for Performance and Accuracy with Post-Training Quantization

Quantization is a key technique for enhancing inference performance with minimal overhead. It significantly improves latency, throughput, and memory efficiency by reducing model precision in a controlled manner, without requiring retraining. Most models today are trained in FP16 or BF16, with some, like DeepSeek-R1, using FP8 natively. Further quantization to formats such as FP4 unlocks even greater efficiency and performance gains, supported by an expanding open-source ecosystem.

NVIDIA TensorRT Model Optimizer's post-training quantization (PTQ) framework offers a flexible, modular approach to these optimizations. It supports a wide range of formats, including NVFP4, which is optimized for NVIDIA Blackwell GPUs, and integrates advanced calibration methods like SmoothQuant, activation-aware weight quantization (AWQ), and AutoQuantize for superior results. The framework is ecosystem-friendly: it works seamlessly with native PyTorch, Hugging Face, NeMo, and Megatron-LM checkpoints, and integrates easily with inference engines such as TensorRT-LLM, vLLM, and SGLang.

This guide explores PTQ techniques in depth and demonstrates how to use TensorRT Model Optimizer to compress AI models while preserving high accuracy, enhancing both user experience and application performance.

Understanding Quantization

Neural networks consist of layers with learnable parameters (weights and biases) and the activations that flow between them, which together enable them to perform complex tasks. Models are typically trained at full precision (FP32/TF32), half precision (BF16/FP16), or mixed precision, and increasingly FP8. The training precision determines the native precision of the model, directly impacting computational load and memory usage during inference.

Quantization allows developers to trade excess precision used during training for faster inference and a reduced memory footprint. The performance gains depend on the extent of quantization, the difference between native and target precision, and the chosen algorithm. Figure 1 illustrates how high-precision weights are resampled into lower-precision representations.

Common data types used in quantization include FP32, FP16, BF16, FP8, FP4, INT8, and INT4. Each has a distinct bit width and representable range (see Table 1). For example, FP16 values can be quantized to FP4, reducing precision and thus the number of distinct representable values, which leads to coarser resolution (Figure 2).

Table 1 summarizes key data types, their bit width, and dynamic range:

| Data Type | Total Bits | Representable Range | Format Type |
|-----------|------------|---------------------|----------------|
| FP32 | 32 | ±3.4 × 10³⁸ | Floating point |
| FP16 | 16 | ±65,504 | Floating point |
| BF16 | 16 | ±3.4 × 10³⁸ | Floating point |
| FP8 | 8 | ±448 | Floating point |
| FP4 | 4 | -6 to +6 | Floating point |
| INT8 | 8 | -128 to +127 | Integer |
| INT4 | 4 | -8 to +7 | Integer |

Quantization involves mapping original values to a smaller range using a scaling factor. The scale factor is computed as:

S = max(|X|) / (2^(b-1) - 1)

where b is the target bit count and max(|X|) is the largest absolute value in the original data. The quantized value is then calculated as:

Q = round(X / S)

Figure 3 shows an example of quantizing the FP16 values {4.75, 2.01, -3.44, -7.11, 0, 13.43, -4.91, -6.43} to FP4, resulting in {2, 1, -2, -4, 0, 7, -3, -3} using symmetric static quantization.
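As a minimal sketch in plain NumPy (not part of the Model Optimizer API), the two formulas above can be applied directly to reproduce the Figure 3 values:

```python
import numpy as np

def symmetric_quantize(x, bits=4):
    """Per-tensor symmetric static quantization as described above:
    S = max(|X|) / (2^(b-1) - 1), Q = round(X / S)."""
    x = np.asarray(x, dtype=np.float32)
    qmax = 2 ** (bits - 1) - 1           # 7 for a 4-bit target
    scale = np.abs(x).max() / qmax       # a single scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale

def dequantize(q, scale):
    """Map quantized values back toward the original range."""
    return q * scale

values = [4.75, 2.01, -3.44, -7.11, 0, 13.43, -4.91, -6.43]
q, s = symmetric_quantize(values)
print(q)                  # [ 2  1 -2 -4  0  7 -3 -3], matching the Figure 3 example
print(dequantize(q, s))   # coarse reconstruction; the residual is the quantization error
```

The dequantized values make the cost of the smaller bit width visible: the coarser the grid, the larger the gap between the original and reconstructed numbers.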
While many quantization methods exist, this post focuses on effective PTQ using TensorRT Model Optimizer. The library simplifies the process with a clean API, enabling developers to apply optimal configurations without needing deep implementation knowledge.

PTQ with TensorRT Model Optimizer

TensorRT Model Optimizer is a powerful library designed to optimize inference performance across a wide range of models. After optimization, models can be deployed via frameworks like Dynamo, SGLang, TensorRT-LLM, and vLLM. Table 2 outlines the quantization formats supported by Model Optimizer, including floating-point, integer, and key-value (KV) cache options.

| Quantization Format | Description |
|---------------------|-------------|
| Per-Tensor FP8 | Full-model FP8 quantization using default scale encoding |
| FP8 Block-wise Weight Only | 2D block-wise, weight-only quantization with shared scaling per block |
| FP8 Per Channel and Per Token | Per-channel weights, dynamic per-token activations |
| nvfp4 | Default FP4 quantization for weights and activations |
| INT8 SmoothQuant | 8-bit quantization with SmoothQuant calibration; per-channel weights, per-tensor activations |
| W4A16 | 4-bit weight-only quantization with AWQ calibration; group-wise weights, FP16 activations |
| W4A8 | 4-bit weights, FP8 activations with AWQ; block-wise weights, per-tensor FP8 activations |
| fp8 (KV) | FP8 quantization of key-value caches in attention layers |
| nvfp4 (KV) | FP4 quantization of KV caches in transformer attention layers |
| nvfp4_affine (KV) | Affine scaling for KV cache quantization |

Choosing the right quantization format, KV cache precision, and calibration method depends on the model and workload. Model Optimizer provides several calibration techniques to help make these decisions.

Standard Quantization with Min-Max Calibration

Before quantization, models must be calibrated to determine activation ranges. Min-max calibration is one of the simplest and most widely used methods: a small, representative dataset is passed through the model to collect activation statistics, and the minimum and maximum values observed for each tensor are used to compute scaling factors. While fast and easy to implement, min-max calibration can be sensitive to outliers and lacks adaptive scaling, which limits its accuracy in some cases.

Using Model Optimizer, calibration is streamlined with utility functions like get_dataset_dataloader and create_forward_loop. For example, to calibrate using the cnn_dailymail dataset:

    import modelopt.torch.quantization as mtq
    # Utility helpers shipped with Model Optimizer; import paths may vary by version.
    from modelopt.torch.utils.dataset_utils import (
        create_forward_loop,
        get_dataset_dataloader,
    )

    calib_loader = get_dataset_dataloader(
        dataset_name="cnn_dailymail",
        tokenizer=tokenizer,
        batch_size=batch_size,
        num_samples=calib_samples,
        device="cuda",
    )
    forward_loop = create_forward_loop(dataloader=calib_loader)

Quantization is then applied using a configuration object. In this case, NVFP4 is used:

    quant_cfg = mtq.NVFP4_DEFAULT_CFG
    model = mtq.quantize(model, quant_cfg, forward_loop=forward_loop)

After successful quantization, the model can be exported as a Hugging Face checkpoint using export_hf_checkpoint.

Advanced Calibration Techniques

Calibration plays a crucial role in determining scaling factors based on input data. Simple methods like max calibration use only the maximum absolute value, which may underutilize the dynamic range. More advanced techniques like SmoothQuant and AWQ improve accuracy by adapting scaling based on activation patterns.

Activation-Aware Weight Quantization (AWQ)

Introduced in 2023, AWQ improves weight quantization by considering activation distributions. It identifies less active weight channels, those contributing minimally to outputs, and allows them to be more aggressively quantized. Meanwhile, salient weights (those aligned with high-activation channels) are preserved with higher fidelity. This selective approach minimizes quantization error where it matters most, enabling effective 4-bit weight quantization with minimal accuracy loss. Figure 4 illustrates how AWQ applies different scaling per channel based on activation magnitude.

AWQ's strength lies in its ability to maintain the output distribution by adjusting weight scales post-training. The Model Optimizer API allows fine-grained control, such as setting the block size, for optimal performance.
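A rough sketch of what that fine-grained control can look like, reusing the model and forward_loop from the calibration snippet above. The INT4_AWQ_CFG preset name follows Model Optimizer's published configs, but the nested block-size key layout may differ between releases, so treat the dictionary edit as an assumption rather than a fixed recipe:

```python
import copy

import modelopt.torch.quantization as mtq

# Start from the library's AWQ preset and adjust the weight-quantizer block size.
# Deep-copy first so the module-level preset dictionary is not mutated.
quant_cfg = copy.deepcopy(mtq.INT4_AWQ_CFG)

# Smaller blocks mean more scale factors and finer-grained weight groups
# (the exact key layout is version-dependent; verify against your install).
quant_cfg["quant_cfg"]["*weight_quantizer"]["block_sizes"][-1] = 64

# Calibrate and quantize with the same forward loop used for min-max calibration.
model = mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
```

Shrinking the block size generally trades a little extra scale-factor overhead for better accuracy on layers with uneven weight distributions.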
SmoothQuant

Developed in 2022, SmoothQuant tackles the problem of activation outliers, which are common in transformer models due to large attention values in Q/K/V projections. These outliers can cause severe quantization errors when using standard methods. SmoothQuant addresses this by scaling down activations and compensating by scaling up the corresponding weights, preserving the mathematical validity of the output (Figure 5). This balancing act reduces quantization noise and improves robustness. The technique is particularly effective for models with skewed activation distributions and is well suited for use with FP8 and FP4 quantization.

AutoQuantize

Model Optimizer's AutoQuantize function uses a gradient-based sensitivity analysis to evaluate each layer's tolerance to quantization. It then selects the optimal quantization format, such as INT8 or NVFP4, or even skips quantization entirely, on a per-layer basis. This layer-wise optimization allows aggressive compression of less sensitive layers while preserving precision in critical ones. The process is guided by user-defined constraints like effective_bits, enabling trade-offs between performance and accuracy.

AutoQuantize supports diverse hardware targets and model requirements, offering fine-grained control. However, due to its search-based nature, it can be computationally intensive. To reduce overhead, users can limit the search space or skip KV cache calibration.

Applying AutoQuantize is straightforward (see the sketch below):
1. Define candidate quantization configurations for weights and KV caches.
2. Run the auto-quantization function with the model and forward loop.
3. Export the resulting model.

Figure 6 visualizes the workflow: each layer is evaluated and assigned the best-fit quantization format based on sensitivity and performance goals.
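The three steps above might look roughly like the following. mtq.auto_quantize is a real Model Optimizer entry point, but the argument names and the way candidate formats are passed vary between releases, so everything beyond the constraints idea should be read as an assumption to check against the installed version's documentation:

```python
import modelopt.torch.quantization as mtq

# Step 1: candidate formats for the per-layer search; None lets a layer stay unquantized.
# Some releases expect preset names as strings instead of config objects.
candidate_formats = [mtq.NVFP4_DEFAULT_CFG, mtq.FP8_DEFAULT_CFG, None]

# Step 2: run the search, constrained by a target average bit width per weight.
model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},
    quantization_formats=candidate_formats,
    data_loader=calib_loader,                       # calibration data from the earlier snippet
    forward_step=lambda model, batch: model(**batch),
    loss_func=lambda output, batch: output.loss,    # sensitivity scoring signal
)
```

Step 3 is the same export path described in the next section: the auto-quantized model is written out with export_hf_checkpoint.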
Results of NVFP4 Quantization

NVFP4 delivers the highest compression ratio among the formats supported in Model Optimizer PTQ. It provides stable accuracy recovery and substantial throughput improvements across major LLMs. Figure 8 shows the trade-off between throughput and accuracy for models like Qwen 23B, DeepSeek-R1-0528, and Llama Nemotron Ultra: NVFP4 achieves a 2–3x speedup in token generation while maintaining over 95% of the original accuracy.

Figure 9 compares responses from a DeepSeek-R1 model in FP8 vs. NVFP4. Both produce identical outputs, but the NVFP4 version responds significantly faster, demonstrating that high performance doesn't have to come at the cost of fidelity. This balance of speed and accuracy enables efficient total cost of ownership (TCO) optimization with no degradation in AI workload quality.

Exporting PTQ-Optimized Models

After applying PTQ, models can be exported to Hugging Face checkpoints for easy sharing and deployment. This format is compatible with major inference engines like vLLM, SGLang, TensorRT-LLM, and Dynamo. To export:

    from modelopt.torch.export import export_hf_checkpoint

    export_hf_checkpoint(model, export_dir=export_path)

Pre-quantized models are also available on the Hugging Face Hub: the NVIDIA Model Optimizer collection includes ready-to-use checkpoints for Llama 3, Llama 4, and DeepSeek.

Summary

Quantization is a powerful method for accelerating model inference, delivering major gains in latency, throughput, and memory efficiency without retraining. While most models run in FP16 or BF16, and some in FP8, pushing to FP4 unlocks even greater efficiency. NVIDIA TensorRT Model Optimizer takes this further with support for NVFP4 (optimized for Blackwell GPUs), advanced calibration techniques like SmoothQuant, AWQ, and AutoQuantize, and seamless integration with PyTorch, Hugging Face, NeMo, Megatron-LM, vLLM, SGLang, TensorRT-LLM, and Dynamo. The result is a flexible toolkit for model compression without compromise, enabling faster, leaner, and more scalable AI deployments that maintain accuracy and enhance user experience. Explore the Jupyter notebook tutorials or try the pre-quantized models on Hugging Face to experience the benefits firsthand.
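As a quick, hedged example of consuming such a checkpoint (whether exported above or pulled from the Hub), the sketch below loads it with vLLM's offline API. Whether the quantization scheme is auto-detected or needs an explicit flag depends on the vLLM version, so treat the details as assumptions:

```python
from vllm import LLM, SamplingParams

# export_path is the directory written by export_hf_checkpoint above.
# Recent vLLM releases can usually infer the quantization scheme from the
# checkpoint's config files; if not, an explicit quantization argument may be
# required (version-dependent, so verify against your installed vLLM).
llm = LLM(model=export_path)

outputs = llm.generate(
    ["Summarize post-training quantization in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```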
a day ago
A Data Analyst’s Guide to Learning from YouTube with AI: How to Extract Insights Efficiently Without Watching Full Videos
a day ago
The Unfulfilled Promise: Why Today's AI Falls Short of Genuine Intelligence — Part 1

Everyone's talking about Artificial General Intelligence (AGI) these days. Industry leaders are pouring immense sums into the pursuit of "true AGI" and even "superintelligence." But here's my hot take: if we keep going down the same path we're on, this goal remains fundamentally out of reach.

Current AI systems, while undeniably capable and often astonishing, are still, at their core, statistical models. They're brilliant at pattern recognition and prediction, but there's no genuine reasoning happening under the hood. What's more, compared to the incredible adaptability of the human mind, these models are still largely limited to the specific tasks they've been trained for.

Okay, so some folks might say they do reason, but let me ask you: if you build a machine to act exactly like a human, right down to every tiny brain pathway and a gazillion rules, is that true intelligence? Let's be real for a sec. If you could actually hardcode a machine to behave just like you, every thought, every little twitch, every single reaction, would that really be intelligent behavior, or would it just look smart to anyone watching, like a super sophisticated puppet show?

You've probably also heard another common statement in AI circles: that humans "learn from fewer samples than machines." It's an appealing idea, suggesting a kind of inherent efficiency in biological learning. But in the first part of this blog post, I'm going to break down why that's a myth, and why our understanding of "data" in human learning might be fundamentally flawed.

Part 1: The "Fewer Samples" Myth – Why Human Learning Isn't About Less Data, But Better Data

You may have heard this before in AI discussions: "Humans can learn from fewer samples than machines." While it sounds intuitively appealing, I'm here to offer a hot take: this idea is fundamentally flawed, and clinging to it might be one of the biggest misconceptions holding back the pursuit of Artificial General Intelligence (AGI). We often underestimate the sheer volume and, more importantly, the quality and richness of the information stream we're immersed in from the moment we're born. Let's break down why this comparison misses the mark, and what AI can truly learn from human cognition.

The Daily Data Deluge of a Human Being

Forget "fewer samples." We are data processing machines of incredible scale. Consider these fascinating insights, some of which have been discussed for years and continue to highlight our immense processing capabilities:
- The human brain processes about 11 million pieces of information per second, but only around 40 of those make it to conscious awareness.
- Our sensory systems are constantly streaming data: vision, hearing, touch, smell, and internal bodily signals.
- Even at a conservative estimate, the brain receives and filters 100,000 to 1 million bits of information per second.

These figures underscore that far from learning from "fewer" samples, humans are constantly bathed in a torrent of information.

A Child's First Decade: An Ocean of Information

Now, let's extrapolate this to a child's foundational learning years. While a baby likely isn't reading "The Hobbit," the raw, continuous sensory input is still tremendous. Let's make a very conservative estimate of the data processed by the age of 10:
- Visual data: Imagine 12 waking hours a day, 365 days a year. Over 10 years, that's 43,800 hours. Even at a mere 10 "frames" per second (far less than real-time perception), that's over 1.5 billion visual "samples." Each isn't a static image, but a dynamic, multi-faceted scene.
- Auditory data: Conservatively, hearing 10,000 words or distinct auditory events daily for 10 years tallies up to 36.5 million auditory "samples."
- Tactile, proprioceptive, olfactory, and gustatory data: The constant stream from touch, movement, smell, and taste adds millions more "samples" daily, forming a continuous, embodied sensory experience.

The Terabytes of Life Experience

Translating this raw, continuous sensory input into digital terms, even with highly conservative estimates, leads to a minimum of around 88 terabytes of raw, integrated sensory data processed by a human by their 10th birthday. And this is just the raw input; it doesn't account for the complex internal representations and learned models.
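For readers who want to sanity-check the raw counts, here is a tiny arithmetic sketch. The terabyte conversion is deliberately left out, since it depends on per-sample size assumptions the post doesn't spell out:

```python
# Back-of-the-envelope check of the figures quoted above (pure arithmetic).
waking_hours_per_day = 12
years = 10

hours = waking_hours_per_day * 365 * years      # waking hours over a decade
visual_samples = hours * 3600 * 10              # at 10 "frames" per second
auditory_samples = 10_000 * 365 * years         # words / distinct auditory events

print(f"{hours:,} waking hours")                # 43,800
print(f"{visual_samples:,} visual samples")     # 1,576,800,000 (over 1.5 billion)
print(f"{auditory_samples:,} auditory samples") # 36,500,000

# Converting these counts into the ~88 TB total quoted above requires assumptions
# about bytes per visual/auditory/tactile sample, so that figure is left as the
# author's estimate rather than re-derived here.
```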
The Stark Contrast: Human vs. AI Data

Now, let's compare this to the datasets used by even the most advanced AI models. While these models are trained on immense datasets, their scale, especially in terms of integrated, multimodal, and continuously contextualized data, often pales in comparison to human experience:
- GPT-3: OpenAI's groundbreaking GPT-3 was trained on hundreds of billions of tokens, which, when filtered and processed, amounted to a dataset size in the range of ~45 TB of text.
- Llama 3: Meta's more recent Llama 3 models were pre-trained on an even larger scale, utilizing over 15 trillion tokens. Depending on token encoding, this translates to approximately 60 terabytes of text data.
- Other foundation models (GPT-4, Claude 3, Gemini): While exact training data sizes for cutting-edge models like GPT-4, Claude 3, and Gemini are often proprietary, industry estimates suggest they are trained on datasets that range from tens to low hundreds of terabytes.

While these AI datasets are vast, they are still significantly smaller in sheer volume than the estimated integrated, multimodal data a human processes in their first decade. More importantly, they lack the inherent richness and real-world interconnectedness of human experience. They are primarily text-based, or multimodal in a stitched-together fashion, rather than inherently integrated from the ground up.

It's Not Less Data, It's Better Processing of Richer Data

The core of the "fewer samples" myth lies in a misunderstanding of what a "sample" truly means for a human. We don't just consume discrete, isolated data points like an image file or a text token. Our learning is characterized by:
- Multimodal integration: Our brain seamlessly fuses sight, sound, touch, smell, and taste, creating a holistic understanding of the world. Recent research into "Embodied Multimodal Large Models (EMLMs)" in AI is a direct acknowledgment of this human advantage, aiming to integrate diverse sensory modalities for more robust AI.
- Contextual and embodied learning: Every piece of information is learned within a dynamic, real-world context, directly linked to our physical interactions and consequences. We don't just "see" an object; we interact with it, understand its properties, and experience its effects. This embodied interaction is a critical component of how we build understanding.
- Active and feedback-driven learning: Human learning is a continuous loop of experimentation, immediate feedback, and self-correction. We are not passive observers. This aligns with the growing focus in AI on "agentic AI," which seeks to teach models to behave and adapt based on real-world interactions and expert guidance.
- Hierarchical and abstract reasoning: Beyond mere pattern recognition, we build complex conceptual models, understanding relationships, categories, and abstract principles. This allows us to generalize from fewer novel experiences because we have a robust internal model of how the world works, built on a lifetime of rich data.

The True Path to Intelligence: Learning from Biology and Human Cognition

The profound takeaway for AI development isn't that humans learn with less data, but that our biological architecture is uniquely designed to extract vastly more meaning and build incredibly sophisticated representations from the massive, high-quality, multimodal data stream we're constantly immersed in. Our learning is incredibly efficient per unit of extracted information, not per raw byte.

This suggests that the pursuit of Artificial General Intelligence (AGI) won't simply be achieved by dumping more data and compute into current, fundamentally unimodal or weakly multimodal architectures. The real leap will come when AI systems can:
- Integrate truly multimodal, high-dimensional, and contextually rich data streams at their core, mimicking the seamless fusion of human senses.
- Engage in active, embodied learning, with continuous, real-time feedback loops, allowing them to interact with and learn from their environment like a human child.
- Develop sophisticated symbolic reasoning, abstraction, and the ability to construct internal models of the world, moving beyond statistical correlations to true understanding.

It's not about the quantity of data alone; it's about the inherent quality, interconnectedness, and the underlying processing architecture that makes human learning so remarkably powerful and adaptable. The next frontier in AI isn't just bigger models, but fundamentally smarter ways of learning from the world, much like we do.
a day ago
Your chats with Meta's AI could end up in Google searches, just like early versions of ChatGPT's shared conversations did before OpenAI reversed course. Meta's standalone Meta AI app lets users share conversations to a public "Discover" feed, which Google can index and display in search results. Even though Meta now warns users that shared chats are public and visible to anyone, many of the conversations still contain personal details like names, emails, and sensitive topics. While the company says it's working to reduce accidental disclosures, the feature remains active, meaning your AI chat could show up in a Google search, possibly linked to your social media identity. And unlike OpenAI, which stopped indexing shared ChatGPT conversations, Meta has no plans to change that.
a day ago
PagerDuty Named a Leader and Outperformer in the 2025 GigaOm Radar for AIOps for Fourth Consecutive Year
a day ago
Asana to Release Second Quarter Fiscal Year 2026 Financial Results on September 3, 2025, with Investor Webcast Scheduled
a day ago
Apple is set to significantly expand its investments in artificial intelligence, CEO Tim Cook revealed, underscoring the company’s growing commitment to advancing AI capabilities across its products and services. Speaking in a recent interview, Cook emphasized Apple’s openness to acquiring AI-focused companies as a way to accelerate innovation and strengthen its technological edge. He also addressed broader economic concerns, including the impact of Trump-era tariffs, noting that while trade policies remain a challenge, Apple continues to adapt its global supply chain strategies to maintain efficiency and competitiveness. The comments come as Apple intensifies its efforts to integrate AI more deeply into its ecosystem, from on-device machine learning to enhanced personal assistant features, signaling a pivotal shift in the company’s long-term vision.
2 days ago
Tesla ordered to pay $329 million in damages after fatal Autopilot crash, jury rules
2 days ago
OpenAI Warns Students Against Using ChatGPT as an 'Answer Machine' and Pushes for Productive Struggle in AI Education
2 days ago