The Unfulfilled Promise: Why Today's AI Falls Short of Genuine Intelligence (Part 1)

Everyone's talking about Artificial General Intelligence (AGI) these days. Industry leaders are pouring immense sums into the pursuit of "true AGI" and even "superintelligence." But here's my hot take: if we keep going down the same path we're on, that goal remains fundamentally out of reach.

Current AI systems, while undeniably capable and often astonishing, are still, at their core, statistical models. They're brilliant at pattern recognition and prediction, but there's no genuine reasoning happening under the hood. And compared to the incredible adaptability of the human mind, these models remain largely limited to the tasks they've been trained for.

Okay, some folks might say they do reason. But let me ask you: if you build a machine to act exactly like a human, right down to every tiny brain pathway and a gazillion rules, is that true intelligence? Let's be real for a sec. If you could actually hardcode a machine to behave just like you, every thought, every little twitch, every single reaction, would that really be intelligent behavior, or would it just look smart to anyone watching, like a very sophisticated puppet show?

You've probably also heard another common statement in AI circles: that humans "learn from fewer samples than machines." It's an appealing idea, suggesting a kind of inherent efficiency in biological learning. In this first part of the post, I'm going to break down why that's a myth, and why our understanding of "data" in human learning might be fundamentally flawed.

Part 1: The "Fewer Samples" Myth – Why Human Learning Isn't About Less Data, But Better Data

"Humans can learn from fewer samples than machines." While it sounds intuitively appealing, this idea is fundamentally flawed, and clinging to it might be one of the biggest misconceptions holding back the pursuit of AGI. We routinely underestimate the sheer volume and, more importantly, the quality and richness of the information stream we're immersed in from the moment we're born. Let's break down why the comparison misses the mark, and what AI can truly learn from human cognition.

The Daily Data Deluge of a Human Being

Forget "fewer samples." We are data-processing machines of incredible scale. Consider these figures, which have been discussed for years and continue to highlight our immense processing capabilities:

The human brain processes about 11 million pieces of information per second, but only around 40 of those make it to conscious awareness.
Our sensory systems are constantly streaming data: vision, hearing, touch, smell, and internal bodily signals.
Even conservative estimates put the stream the brain receives and filters at 100,000 to 1 million bits of information per second.

These figures underscore that far from learning from "fewer" samples, humans are constantly bathed in a torrent of information.

A Child's First Decade: An Ocean of Information

Now, let's extrapolate this to a child's foundational learning years. While a baby likely isn't reading "The Hobbit," the raw, continuous sensory input is still tremendous. Let's make a very conservative estimate of the data processed by age 10:

Visual Data: Imagine 12 waking hours a day, 365 days a year. Over 10 years, that's 43,800 hours. Even at a mere 10 "frames" per second (far less than real-time perception), that's over 1.5 billion visual "samples." Each one isn't a static image but a dynamic, multi-faceted scene.

Auditory Data: Conservatively, hearing 10,000 words or distinct auditory events daily for 10 years tallies up to 36.5 million auditory "samples."

Tactile, Proprioceptive, Olfactory, Gustatory Data: The constant stream from touch, movement, smell, and taste adds millions more "samples" daily, forming a continuous, embodied sensory experience.

The Terabytes of Life Experience

Translating this raw, continuous sensory input into digital terms, even highly conservative estimates land at a minimum of around 88 terabytes of raw, integrated sensory data processed by a human by their 10th birthday. And that is just the raw input; it doesn't account for the complex internal representations and learned models built on top of it.
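To make the arithmetic behind these estimates concrete, here is a minimal back-of-envelope sketch in Python. The per-channel sizes it assumes (roughly 50 KB per visual frame, 16 kHz mono audio, about 25 KB/s for the remaining senses combined) are illustrative guesses of mine, not measured values; they are simply one set of inputs that happens to land near the ~88 TB ballpark, and different assumptions would shift the total considerably.

```python
# Back-of-envelope estimate of the raw sensory input in a child's first decade.
# The per-channel byte sizes below are illustrative assumptions, not measurements;
# different (equally defensible) choices would shift the total substantially.

WAKING_HOURS_PER_DAY = 12
DAYS = 365 * 10
waking_seconds = WAKING_HOURS_PER_DAY * 3600 * DAYS      # 157,680,000 s

# Vision: 10 "frames" per second, at an assumed ~50 KB per compressed frame.
visual_frames = waking_seconds * 10                      # ~1.58 billion samples
visual_bytes = visual_frames * 50_000

# Hearing: continuous audio at 16 kHz, 16-bit mono (~32 KB/s); the ~10,000
# daily words/events from the estimate above are counted as discrete samples.
auditory_events = 10_000 * DAYS                          # 36.5 million samples
audio_bytes = waking_seconds * 32_000

# Touch, proprioception, smell, taste: an assumed ~25 KB/s in aggregate.
other_bytes = waking_seconds * 25_000

TB = 10**12
total_tb = (visual_bytes + audio_bytes + other_bytes) / TB
print(f"visual samples:   {visual_frames:,}")
print(f"auditory samples: {auditory_events:,}")
print(f"raw sensory data: ~{total_tb:.0f} TB over the first decade")
```

Swap in your own per-channel assumptions and the total moves accordingly; the point is the order of magnitude, not the precise figure.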
The Stark Contrast: Human vs. AI Data

Now, let's compare this to the datasets used by even the most advanced AI models. These models are trained on immense corpora, but their scale, especially in terms of integrated, multimodal, and continuously contextualized data, often pales in comparison to human experience:

GPT-3: OpenAI's groundbreaking GPT-3 was trained on hundreds of billions of tokens. Its largest data source, Common Crawl, started out as roughly 45 TB of compressed plaintext, which filtering and processing reduced to about 570 GB of actual training text.

Llama 3: Meta's more recent Llama 3 models were pre-trained on an even larger scale, using over 15 trillion tokens. Depending on token encoding, this translates to approximately 60 terabytes of text data.

Other Foundation Models (GPT-4, Claude 3, Gemini): Exact training data sizes for cutting-edge models like GPT-4, Claude 3, and Gemini are proprietary, but industry estimates suggest datasets ranging from tens to low hundreds of terabytes.

While these AI datasets are vast, they are still generally smaller in sheer volume than the estimated integrated, multimodal data a human processes in their first decade. More importantly, they lack the inherent richness and real-world interconnectedness of human experience. They are primarily text-based, or multimodal in a stitched-together fashion, rather than inherently integrated from the ground up.
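To see where the terabyte conversions above come from, here is a quick sketch. The roughly 4 bytes per token it assumes is a common rule of thumb for English text, not an exact property of any particular tokenizer, so treat the output as an order-of-magnitude figure.

```python
# Rough conversion from training-token counts to text volume.
# Assumes ~4 bytes of text per token; real ratios vary by tokenizer and language.
BYTES_PER_TOKEN = 4
TB = 10**12

llama3_tokens = 15e12  # "over 15 trillion tokens"
print(f"Llama 3 pre-training text: ~{llama3_tokens * BYTES_PER_TOKEN / TB:.0f} TB")

human_decade_tb = 88   # the back-of-envelope sensory estimate from earlier
print(f"Estimated human sensory input by age 10: ~{human_decade_tb} TB")
```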
It's Not Less Data, It's Better Processing of Richer Data

The core of the "fewer samples" myth lies in a misunderstanding of what a "sample" truly means for a human. We don't consume discrete, isolated data points like an image file or a text token. Our learning is characterized by:

Multimodal Integration: The brain seamlessly fuses sight, sound, touch, smell, and taste, creating a holistic understanding of the world. Recent research into "Embodied Multimodal Large Models" (EMLMs) in AI is a direct acknowledgment of this human advantage, aiming to integrate diverse sensory modalities for more robust AI.

Contextual & Embodied Learning: Every piece of information is learned within a dynamic, real-world context, directly linked to our physical interactions and their consequences. We don't just "see" an object; we interact with it, understand its properties, and experience its effects. This embodied interaction is a critical part of how we build understanding.

Active & Feedback-Driven: Human learning is a continuous loop of experimentation, immediate feedback, and self-correction. We are not passive observers. This aligns with the growing focus in AI on "agentic AI," which seeks to teach models to behave and adapt based on real-world interactions and expert guidance.

Hierarchical & Abstract Reasoning: Beyond mere pattern recognition, we build complex conceptual models, understanding relationships, categories, and abstract principles. This lets us generalize from a handful of novel experiences, not because we need little data, but because we already have a robust internal model of how the world works, built on a lifetime of rich input.

The True Path to Intelligence: Learning from Biology and Human Cognition

The takeaway for AI development isn't that humans learn with less data; it's that our biological architecture is built to extract vastly more meaning, and far more sophisticated representations, from the massive, high-quality, multimodal data stream we're constantly immersed in. Our learning is remarkably efficient in terms of the meaning extracted, not the raw bytes consumed.

This suggests that AGI won't be reached simply by pouring more data and compute into current, fundamentally unimodal or weakly multimodal architectures. The real leap will come when AI systems can:

Integrate truly multimodal, high-dimensional, and contextually rich data streams at their core, mimicking the seamless fusion of human senses.

Engage in active, embodied learning, with continuous, real-time feedback loops, allowing them to interact with and learn from their environment like a human child.

Develop sophisticated symbolic reasoning, abstraction, and the ability to construct internal models of the world, moving beyond statistical correlations to genuine understanding.

It's not about the quantity of data alone; it's about the inherent quality, interconnectedness, and underlying processing architecture that make human learning so remarkably powerful and adaptable. The next frontier in AI isn't just bigger models, but fundamentally smarter ways of learning from the world, much like we do.
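None of this dictates a particular architecture, but a toy sketch can show the spirit of the shift being argued for here: observations from several "senses" fused into one representation, and a model updated from immediate feedback on its own actions rather than from a static, pre-labeled dataset. Everything in the snippet (the environment, the features, the update rule) is invented purely for illustration.

```python
import random

# Toy illustration of two ideas from this section:
#   1. early fusion: "visual" and "tactile" features enter one shared representation
#   2. active, feedback-driven learning: the agent acts, gets immediate feedback,
#      and updates its internal model online instead of consuming a static dataset.

def make_object():
    """A hypothetical object with a hidden property the agent must infer."""
    heavy = random.random() < 0.5
    visual = [1.0 if heavy else 0.0, random.random()]   # e.g. apparent size, colour
    tactile = [1.0 if heavy else 0.0, random.random()]  # e.g. resistance, texture
    return visual, tactile, heavy

def fuse(visual, tactile):
    """Early fusion: one joint feature vector instead of separate per-modality streams."""
    return visual + tactile + [1.0]  # trailing 1.0 acts as a bias term

weights = [0.0] * 5
lr = 0.1

# Active loop: perceive across modalities, act, get feedback, revise the model.
for step in range(500):
    visual, tactile, heavy = make_object()
    x = fuse(visual, tactile)

    # Act: predict whether lifting this object will need a strong grip.
    score = sum(w * xi for w, xi in zip(weights, x))
    prediction = score > 0

    # Immediate feedback from the "world", followed by an online correction.
    error = (1 if heavy else -1) - (1 if prediction else -1)
    weights = [w + lr * error * xi for w, xi in zip(weights, x)]

# Check how well the learned internal model generalises to new objects.
correct = 0
for _ in range(200):
    visual, tactile, heavy = make_object()
    x = fuse(visual, tactile)
    if (sum(w * xi for w, xi in zip(weights, x)) > 0) == heavy:
        correct += 1
print(f"accuracy after interactive learning: {correct / 200:.0%}")
```

The point is not the perceptron-style update itself but the loop structure: fused multimodal perception, action, immediate feedback, and continual revision of an internal model.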
Wrapping Up Part 1

To pull the argument of Part 1 together: the pursuit of Artificial General Intelligence is one of the most ambitious goals in modern technology, yet despite massive investments and rapid progress, today's AI systems still fall far short of genuine intelligence. The core issue lies not in the amount of data or compute, but in a fundamental misunderstanding of how human intelligence works. A common narrative in AI circles claims that humans learn from fewer samples than machines, implying a kind of innate efficiency in biological cognition. That idea is a myth, and clinging to it may be one of the biggest roadblocks to achieving true AGI.

Humans are not learning from less data. We are constantly processing an overwhelming flood of information from birth. Consider just the sensory input over a child's first decade: 12 waking hours a day, 365 days a year, amounts to 43,800 hours. Even at a conservative 10 visual "frames" per second, that's over 1.5 billion visual samples. Add auditory input of around 10,000 words or distinct sounds per day, and you reach 36.5 million auditory samples in ten years. Tactile, proprioceptive, olfactory, and gustatory data contribute millions more. Converted into digital terms, this raw sensory stream totals at least 88 terabytes of integrated data by age 10, more than most AI models are trained on.

Now contrast this with the largest AI models. GPT-3's training mix was distilled from roughly 45 terabytes of raw crawled text down to well under a terabyte of filtered training data. Llama 3 used over 15 trillion tokens, translating to about 60 terabytes. Even the most advanced models like GPT-4, Claude 3, and Gemini are trained on datasets estimated in the tens to low hundreds of terabytes, still modest beside the volume and richness of human sensory experience.

But the difference isn't just in quantity; it's in quality and integration. Human learning is inherently multimodal, embodied, and context-rich. We don't process vision, sound, and touch as separate streams: the brain fuses them seamlessly, creating a unified perception of the world. We learn not just by observing but by doing: touching, moving, failing, and adjusting in real time. This active, feedback-driven loop lets us build deep, abstract models of reality. AI, by contrast, still operates largely on static, pre-labeled datasets. It sees images or text as isolated tokens, not as part of a continuous, embodied experience, and even multimodal models often stitch modalities together rather than integrate them from the ground up.

The real lesson from human cognition isn't that we use less data; it's that we extract far more meaning from it. Our brains are built to compress, generalize, and reason across vast, interconnected experiences. They develop internal models of cause and effect, object permanence, social dynamics, and abstract concepts. This is why we can learn a new concept after just one example: not because we're seeing fewer samples, but because we're drawing on a lifetime of context.

The path to AGI won't come from scaling up current models with more data and compute. It will come from rethinking how AI learns: building systems that can perceive the world multimodally, interact with it dynamically, and develop rich, symbolic representations of reality. The future of AI lies not in mimicking human output, but in emulating the way humans truly learn: through continuous, embodied, context-aware, and meaning-driven experience.