Foundation Models and the Return of General-Purpose AI Systems

Part of The Long Human Road to AI, Season 1.

In the summer of 1956, a small group of researchers gathered at Dartmouth College with a sweeping bet: that every feature of intelligence could be so precisely described that a machine could be made to simulate it. Language, abstraction, problem solving, self-improvement—nothing was off the table. The bet did not pay off on schedule, and for most of the next sixty years AI advanced by narrowing its scope. Programs that mastered chess did not translate Japanese. Systems that recognized faces did not write essays. The broad ambition went dormant.

Then, in the early 2020s, a new kind of system began to feel different. One model could summarize a contract, draft an email, explain a poem, help debug code, and answer questions in dozens of languages. It was not reliable at all of those things, and it often made mistakes that a child would not, but its range was unmistakably broad. The old ambition had returned—not as a finished machine mind, but as a family of systems called foundation models.

What a foundation model is

The term sounds technical, but the idea is simple. A foundation model is a broadly trained model, usually trained with self-supervision at scale, that can be adapted to many downstream tasks.

Think of it as a shared base material for many products. One large model is trained once on enormous amounts of text, images, audio, or code. Then smaller teams, sometimes with modest resources, adapt it through prompts, fine-tuning, or retrieval systems to do something specific. The same base can become a medical-question assistant, a coding companion, a translation tool, or an interactive tutor.

The phrase was introduced and examined in depth in a 2021 Stanford report, which also warned that this “homogenization” of AI capabilities carries concentrated risk: flaws in the base can propagate across many applications. That caution is part of the definition, not an afterthought.

The old dream returns in a new form

The 1955 Dartmouth proposal framed AI as a broad program involving language, abstraction, problem solving, and self-improvement. The language was optimistic, but the ambition was real: intelligence as a general competence, not a collection of narrow tricks.

Point C2 Foundation models revive general-purpose AI ambition by supporting many tasks from a shared base, but this should not be equated with humanlike understanding.

For decades, the field moved in the opposite direction. Expert systems succeeded when the rules were narrow. Machine-learning classifiers succeeded when the problem was well-defined. Progress came from tightening the scope, not expanding it. Foundation models reversed that pattern. They are not evidence that machines understand the world the way humans do, but they do show that a single trained base can be stretched across a surprising range of tasks.

The transformer made scale easier to use

No single paper caused the modern wave, but one architecture became impossible to ignore. In 2017, researchers introduced the Transformer, a design for sequence transduction that replaced recurrence and convolution with attention.

Point C3 The Transformer replaced recurrence and convolution with attention for sequence transduction and made training more parallelizable.

The practical effect was important. Earlier recurrent sequence models processed tokens one step at a time, which made training on large datasets slow and expensive. Attention allowed the model to relate different positions in a sequence directly, which meant the computation could be parallelized across positions and spread more efficiently across modern hardware. The Transformer became the practical base for later large language models.

But it is easy to overstate this. Architecture alone does not explain the modern AI wave. Data, compute, engineering practice, and product deployment all converged. The transformer was a door that many other developments pushed open.
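To make the attention step concrete, here is a minimal sketch of scaled dot-product attention for a single head, written in plain Python. It is an illustration under simplifying assumptions, not the full architecture: a real Transformer adds learned projections, multiple heads, masking, and feed-forward layers, and the toy vectors below are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on plain Python lists.

    queries, keys: lists of vectors with the same dimension d_k.
    values: list of vectors, one per key.
    Returns one output vector per query.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Each query is compared with every key directly, so no information
        # has to be passed along the sequence one position at a time.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy sequence of three positions with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(x, x, x)  # self-attention: queries, keys, values all from x
```

Because each query's scores depend only on the inputs and not on a previous step's output, the loop over queries can in principle run in parallel, which is the property the section describes.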

Pretraining turned unlabeled data into a reusable base

Before foundation models, the default was to train a separate model for each task. If you wanted a sentiment classifier, you collected labeled reviews and trained a classifier. If you wanted a translator, you collected parallel sentences and trained a translator. Each task needed its own dataset and its own architecture.

Point C4 Broad pretraining enabled models such as BERT and GPT-3 to be adapted or prompted across many tasks.

BERT, introduced in 2018, showed how deep bidirectional pretraining on unlabeled text could support many downstream language tasks with small task-specific additions. Two years later, in 2020, GPT-3 showed that scaling autoregressive language models could improve few-shot performance through text prompts: the model could attempt a new task after seeing only a handful of examples embedded in a prompt.

These are different model styles—bidirectional versus autoregressive—but both shifted the default from task-specific models toward reusable base models. Instead of building a new model from scratch, practitioners increasingly asked: how do I get a large pretrained model to do what I want?
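The few-shot prompting pattern described above can be sketched as plain string construction. The function name, the "Input:/Output:" layout, and the translation examples are assumptions chosen for illustration; they are not a format prescribed by any particular model.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a text prompt with a handful of worked examples.

    instruction: a one-line task description.
    examples: list of (input, output) pairs shown to the model.
    query: the new input the model should complete.
    """
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends with an open "Output:" so that the most likely
    # continuation is the answer to the new input.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("bread", "pain")],
    "water",
)
```

No model weights change here: the task is specified entirely in the text, which is why small wording changes can shift results.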

Scaling became a research program and an industrial strategy

Once pretrained bases became valuable, the relationship among model size, data, and compute became a central object of study. Empirical scaling laws mapped how loss improved as models, datasets, and training budgets grew.

Point C5 Scaling research made model size, data, and compute explicit variables, while later work emphasized compute-optimal allocation rather than model size alone.

The framing was powerful: scale could be treated as a method, a budgeting discipline, and an infrastructure requirement. But it was never a complete theory of intelligence. Later work, such as the Chinchilla study, argued that many models were undertrained for their size and that data scale must grow alongside model scale. More capacity does not automatically produce judgment, truth, or safety.

This is where the industrial story intersects with the scientific one. Training the largest models requires clusters of specialized processors, power budgets measured in megawatts, and teams that span research, systems engineering, and data curation. Scale became both a research finding and a capability that only a few organizations could afford to push to the frontier.
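As a rough illustration of compute-optimal budgeting, the sketch below splits a training budget between parameters and tokens. It rests on two commonly cited rules of thumb that are assumptions here, not exact laws: training cost of about 6 floating-point operations per parameter per token, and roughly 20 training tokens per parameter, a heuristic often associated with the Chinchilla results. The function names are hypothetical.

```python
import math

# Assumed rules of thumb, not exact constants.
FLOPS_PER_PARAM_TOKEN = 6.0   # approximate training cost: C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0       # approximate compute-optimal data ratio: D ~ 20 * N


def training_flops(params, tokens):
    """Approximate training compute for a model with `params` parameters
    trained on `tokens` tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal_split(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Split a compute budget into (params, tokens) under the assumed ratio.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    """
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Example: the budget implied by 70 billion parameters and 1.4 trillion tokens.
budget = training_flops(70e9, 1.4e12)
params, tokens = compute_optimal_split(budget)
```

The point of the sketch is the shape of the tradeoff: under these assumptions, doubling model size without also growing the dataset leaves the model undertrained for its budget.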

Post-training made models more usable

A large pretrained model is a predictor of likely text. It is not, by default, a helpful assistant. Left to itself, it may continue a prompt rather than answer it, produce fluent nonsense, or echo biases in its training data.

Point C6 Instruction tuning and RLHF can improve usefulness and intent-following, but do not eliminate mistakes or alignment limits.

Instruction tuning and reinforcement learning from human feedback (RLHF) showed that model behavior could be steered after pretraining. In RLHF, people rate or rank model outputs, those judgments are typically used to train a reward model, and the language model is then adjusted to produce more useful, more cautious, or more honest-seeming responses. The result is often a better interactive experience, but post-training does not remove hallucinations, bias, or misuse risk. It changes the surface behavior of a system whose underlying limitations remain.
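One common way to turn human rankings into a training signal is a pairwise preference loss on a reward model: the loss is small when the response people preferred gets the higher score. The sketch below shows only that loss, with made-up reward values; it is an assumption about one standard formulation, not a description of any specific system's full training pipeline.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Low when the human-preferred response scores higher than the
    rejected one, high when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up reward scores for two candidate responses to the same prompt.
agrees_with_raters = preference_loss(2.0, 0.0)     # about 0.13
disagrees_with_raters = preference_loss(0.0, 2.0)  # about 2.13
```

A reward model trained this way only captures what raters preferred, which is one reason post-training can introduce failure modes such as sycophancy.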

Multimodality widened the idea of a foundation model

The first wave of large language models worked mostly with text. The next wave widened the interface. CLIP used natural-language supervision to learn visual models that could connect images and captions.

Point C7 Natural-language supervision and multimodal training widened foundation-model behavior beyond text-only tasks.

Later systems combined text and image inputs and outputs, and the category “foundation model” expanded to include models that work across different kinds of data. This is genuinely useful. It is also easy to misread. Multimodality widens the range of tasks that can be attempted from a shared base; it is not proof that the system has human sensory grounding.
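The CLIP-style idea of connecting images and captions can be sketched as a similarity lookup in a shared embedding space: embed the image, embed one caption per candidate label, and pick the closest caption. The vectors below are hand-written stand-ins; in a real system they would come from trained image and text encoders, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, caption_embeddings):
    """Return the caption whose embedding is closest to the image embedding.

    caption_embeddings: dict mapping caption text to its embedding.
    """
    scores = {
        caption: cosine(image_embedding, embedding)
        for caption, embedding in caption_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hand-written toy embeddings (assumptions, not outputs of a real encoder).
captions = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
}
label, scores = zero_shot_classify([0.9, 0.1, 0.0], captions)
```

Nothing in this procedure requires perception in a human sense: classification reduces to comparing learned vectors, which is consistent with the caution above about sensory grounding.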

Retrieval, tools, and agents move work outside the model

One of the most important developments of the 2020s was the realization that a model’s knowledge does not have to live entirely inside its parameters. Retrieval-augmented generation combined a language model with an external memory store, so the model could ground its output in retrieved documents rather than only in what it had memorized during training.
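A minimal sketch of the retrieval-augmented pattern follows: find the passages most relevant to a question, then place them in the prompt so the model can answer from them. Word overlap stands in for the learned dense retrieval a real system would likely use, and the function names, prompt wording, and documents are assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query and keep the top k.

    Word overlap is a simple stand-in for learned vector search.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(query, documents, k=2):
    """Build a prompt that asks the model to answer from retrieved passages."""
    passages = retrieve(query, documents, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below and cite them by number.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


documents = [
    "The Transformer was introduced in 2017.",
    "BERT was introduced in 2018.",
    "Programs that mastered chess did not translate Japanese.",
]
prompt = build_grounded_prompt("When was BERT introduced?", documents, k=1)
```

The external store can be updated without retraining the model, which is the practical appeal; the model can still misread or ignore the retrieved text.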

Point C8 Retrieval, tool use, and reasoning/action loops can extend model behavior by connecting models to external sources, APIs, and environments.

Researchers also explored tool use, in which a model learns to call APIs, run code, or search the web, and ReAct-style systems that interleaved reasoning traces with actions. These designs make systems more capable by moving parts of the task into external systems with state and memory.

“Agent” is used here operationally: a system that uses model outputs plus tools, state, and action selection over time. It does not imply autonomy, personhood, or reliable intent.
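In that operational sense, a reasoning-and-action loop can be sketched in a few lines: the model emits a step, the surrounding program runs any requested tool, and the observation is appended to the transcript for the next step. Everything named here is hypothetical: `scripted_model` is a fixed stand-in for a language model, `calculator` is a toy tool, and the "Action: tool[input]" format is an assumed convention.

```python
def calculator(expression):
    """Toy tool: evaluates 'a + b' or 'a * b' for two integers."""
    left, op, right = expression.split()
    a, b = int(left), int(right)
    if op == "+":
        return str(a + b)
    if op == "*":
        return str(a * b)
    raise ValueError(f"unsupported operator: {op}")


TOOLS = {"calculator": calculator}


def scripted_model(transcript):
    """Stand-in for a language model: returns a fixed next step."""
    if "Observation:" not in transcript:
        return "Thought: I should compute this with a tool.\nAction: calculator[12 * 7]"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: The tool returned a result.\nFinal Answer: {observation}"


def run_agent(question, model, tools, max_steps=3):
    """Alternate model steps and tool calls until a final answer or step limit."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip(), transcript
        if "Action:" not in step:
            break  # the model produced neither an action nor an answer
        # Parse the assumed "Action: tool[input]" format.
        action = step.split("Action:", 1)[1].strip()
        name, argument = action.split("[", 1)
        observation = tools[name.strip()](argument.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return None, transcript


answer, transcript = run_agent("What is 12 * 7?", scripted_model, TOOLS)
```

The state, the step limit, and the tool execution all live in ordinary code outside the model, which is why such systems are better described as model-plus-scaffolding than as autonomous actors.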

Evaluation and governance lag the surface impression

Foundation models can feel uncanny. They write plausible prose, answer obscure questions, and sometimes seem to reason. That surface impression makes it easy to forget how uneven their reliability is. Benchmarks can be gamed by training on their test sets. Accuracy alone hides tradeoffs in calibration, robustness, fairness, bias, toxicity, and efficiency.

Point C9 Language-model evaluation needs multi-metric transparency because accuracy alone hides tradeoffs in calibration, robustness, fairness, bias, toxicity, and efficiency.

The HELM evaluation framework argued for exactly this kind of holistic transparency. It is a useful antidote to single-number leaderboard culture.
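A small sketch shows why a single accuracy number can mislead. Two hypothetical models below are right equally often, but one is overconfident; a simple expected calibration error separates them. The data is made up, and these two metrics are only a slice of what a holistic evaluation would report.

```python
def accuracy(records):
    """Fraction of predictions that were correct.

    records: list of (confidence, correct) pairs, confidence in [0, 1].
    """
    return sum(1 for _, correct in records if correct) / len(records)


def expected_calibration_error(records, bins=5):
    """Weighted average gap between stated confidence and observed accuracy,
    computed over equal-width confidence bins."""
    buckets = [[] for _ in range(bins)]
    for confidence, correct in records:
        index = min(int(confidence * bins), bins - 1)
        buckets[index].append((confidence, correct))
    total = len(records)
    error = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        bucket_accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        error += (len(bucket) / total) * abs(mean_confidence - bucket_accuracy)
    return error


# Made-up results: both models answer 3 of 4 questions correctly.
overconfident = [(0.9, True), (0.9, True), (0.9, True), (0.9, False)]
calibrated = [(0.75, True), (0.75, True), (0.75, True), (0.75, False)]

report = {
    name: {
        "accuracy": accuracy(rows),
        "calibration_error": expected_calibration_error(rows),
    }
    for name, rows in [("overconfident", overconfident), ("calibrated", calibrated)]
}
```

On a leaderboard that reports accuracy alone, these two models would tie; reporting both numbers side by side exposes the difference.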

As of 2026-06-19, the 2026 AI Index reports rapid changes in AI capabilities, adoption, incidents, and responsible-AI measurement gaps. Those trends are useful context, not evergreen facts.

Governance is also moving quickly. As of 2026-06-19, NIST AI 600-1 is the generative AI profile used here for lifecycle risk-management framing.

And as of 2026-06-19, European Commission pages state that EU general-purpose AI model rules became effective in August 2025 and that the Code of Practice supports compliance.

The human road ahead

Foundation models are a genuine turning point in the long history of AI. They brought back the ambition of general-purpose systems after decades of narrow success. They made it practical to adapt one trained base to many tasks. They created new interfaces for language, images, code, and tools.

But they also introduced a new kind of overclaim. Broad capability is not understanding. Few-shot prompting is not teaching. Tool use is not agency. Scaling is not a theory of intelligence. The systems are powerful, unreliable, and deeply shaped by the data, objectives, and institutions that produced them.

That combination—capability without reliable understanding, generality without accountability—is why the next part of the road leads through labor, institutions, governance, and meaning. The technology is only half the story. The other half is what humans choose to do with it.

Sources used (19 sources)

A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence primary
Attention Is All You Need research
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding research
Scaling Laws for Neural Language Models research
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks research
Language Models are Few-Shot Learners research
Learning Transferable Visual Models From Natural Language Supervision research
On the Opportunities and Risks of Foundation Models research-report
Training language models to follow instructions with human feedback research
Training Compute-Optimal Large Language Models research
ReAct: Synergizing Reasoning and Acting in Language Models research
Holistic Evaluation of Language Models research
Toolformer: Language Models Can Teach Themselves to Use Tools research
GPT-4 Technical Report technical-report
Artificial Intelligence Risk Management Framework 1.0 guidance
NIST AI RMF Generative AI Profile guidance
The 2026 AI Index Report report
AI Act policy
Drawing-up a General-Purpose AI Code of Practice policy

Look closer

Sources and notes

These notes collect the sources, counterpoints, and review status behind the article's important points. Read the essay first; open this when you want to check something.

Confidence reflects how strongly the sources support the point (low / medium / high). Status describes the point's role (e.g., core, argument, landscape). Sources link to supporting material; counterpoints note boundary conditions or conflicting findings.

C001 high framing

A foundation model is a broadly trained model, generally trained with self-supervision at scale, that can be adapted to many downstream tasks.

Sources (1): “The 2021 Stanford report defines foundation models as models trained on broad data at scale that can be adapted to a wide range of downstream tasks, and highlights the homogenization risks of a shared base.”
On the Opportunities and Risks of Foundation Models direct
Counterpoints (1): The term is still contested; some practitioners reserve it for very large language models or require specific adaptation mechanisms, while others use it more loosely.

C002 high core

Foundation models revive general-purpose AI ambition by supporting many tasks from a shared base, but this should not be equated with humanlike understanding.

Sources (3): “The 1955 Dartmouth proposal framed AI as a broad program involving language, abstraction, problem solving, and self-improvement, establishing the original general-purpose ambition.”
A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence background

“Foundation models enable many downstream tasks from a shared base, creating homogenization of capabilities and risks across applications.”
On the Opportunities and Risks of Foundation Models direct

“GPT-3 demonstrated few-shot task behavior across many language tasks through prompting, while also reporting failures and limitations.”
Language Models are Few-Shot Learners direct
Counterpoints (1): Success on a broad range of tasks does not demonstrate understanding; foundation models can produce fluent, confident errors and fail tasks that require grounded reasoning or causal knowledge.

C003 high argument

The Transformer replaced recurrence and convolution with attention for sequence transduction and made training more parallelizable.

Sources (1): “Attention Is All You Need proposes the Transformer architecture, relying entirely on attention mechanisms for sequence transduction and allowing greater parallelization during training.”
Attention Is All You Need direct
Counterpoints (1): The Transformer was one factor among many; data scale, compute, engineering practices, and deployment also contributed to later large-model successes.

C004 high argument

Broad pretraining enabled models such as BERT and GPT-3 to be adapted or prompted across many tasks.

Sources (2): “BERT introduces deep bidirectional pretraining on unlabeled text and shows that a small task-specific head can adapt the model to many downstream language tasks.”
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding direct

“GPT-3 demonstrates that scaling autoregressive language models improves few-shot learning through text prompts without fine-tuning.”
Language Models are Few-Shot Learners direct
Counterpoints (1): Prompting and fine-tuning are brittle; small changes in prompt wording can produce large performance swings, and adaptation does not work equally well for all tasks or languages.

C005 high argument

Scaling research made model size, data, and compute explicit variables, while later work emphasized compute-optimal allocation rather than model size alone.

Sources (2): “Scaling Laws for Neural Language Models reports empirical power-law relationships among model size, dataset size, compute, and language modeling loss.”
Scaling Laws for Neural Language Models direct

“Training Compute-Optimal Large Language Models argues that many large models are undertrained and that data scale should grow alongside model size for compute-optimal training.”
Training Compute-Optimal Large Language Models direct
Counterpoints (1): Scaling laws describe predictive loss on specific datasets, not general intelligence, safety, or real-world utility; the optimal allocation also depends on inference cost and application constraints.

C006 high risk

Instruction tuning and RLHF can improve usefulness and intent-following, but do not eliminate mistakes or alignment limits.

Sources (1): “Training language models to follow instructions with human feedback shows that RLHF improves instruction following while the authors note remaining mistakes and incomplete alignment.”
Training language models to follow instructions with human feedback direct
Counterpoints (1): Post-training can suppress some harmful outputs but may also create new failure modes, such as sycophancy, overrefusal, or reward hacking, that are not present in the base model.

C007 medium-high argument

Natural-language supervision and multimodal training widened foundation-model behavior beyond text-only tasks.

Sources (2): “CLIP learns transferable visual models from natural-language supervision, enabling zero-shot image classification through text prompts.”
Learning Transferable Visual Models From Natural Language Supervision direct

“The GPT-4 Technical Report describes a large multimodal model that accepts text and image inputs and generates text outputs, while noting limited public disclosure of architecture details.”
GPT-4 Technical Report direct
Counterpoints (1): Multimodal behavior does not imply humanlike sensory grounding; models can associate patterns across modalities without embodied perception or causal models of the world.

C008 high argument

Retrieval, tool use, and reasoning/action loops can extend model behavior by connecting models to external sources, APIs, and environments.

Sources (3): “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks combines parametric memory with non-parametric retrieval to improve factual consistency and provenance.”
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks direct

“ReAct interleaves reasoning traces and actions in language models, showing improved performance on tasks requiring interaction with external environments.”
ReAct: Synergizing Reasoning and Acting in Language Models direct

“Toolformer shows that language models can learn to call external tools such as search engines and calculators through self-supervised training.”
Toolformer: Language Models Can Teach Themselves to Use Tools direct
Counterpoints (1): Tool use increases capability but also expands the failure surface; models can invoke tools incorrectly, trust unreliable outputs, or expose systems to security and misuse risks.

C009 high risk

Language-model evaluation needs multi-metric transparency because accuracy alone hides tradeoffs in calibration, robustness, fairness, bias, toxicity, and efficiency.

Sources (1): “Holistic Evaluation of Language Models evaluates models across multiple metrics and scenarios, revealing tradeoffs hidden by single benchmarks.”
Holistic Evaluation of Language Models direct
Counterpoints (1): Holistic evaluation is expensive and may still miss real-world harms; simpler benchmarks remain common because they are cheaper to run and easier to compare across papers.

C010 medium landscape

As of 2026-06-19, the 2026 AI Index reports rapid changes in AI capabilities, adoption, incidents, and responsible-AI measurement gaps.

Sources (1): “The 2026 AI Index Report from Stanford HAI tracks rapid changes in capabilities, adoption, incidents, and responsible-AI measurement as of the report's release.”
The 2026 AI Index Report direct
Counterpoints (1): AI Index aggregates third-party data and reflects available metrics and reporting practices; some categories have incomplete global coverage and definitions that change year to year.

C011 medium-high landscape

As of 2026-06-19, NIST AI 600-1 is the generative AI profile used here for lifecycle risk-management framing.

Sources (2): “NIST AI RMF Generative AI Profile provides risk categories and lifecycle risk-management guidance for generative AI systems.”
NIST AI RMF Generative AI Profile direct

“The NIST AI Risk Management Framework 1.0 provides a lifecycle governance and risk-management foundation used by the generative AI profile.”
Artificial Intelligence Risk Management Framework 1.0 background
Counterpoints (1): NIST guidance is voluntary in the United States and may be superseded or supplemented by sector-specific rules or future framework versions.

C012 medium landscape

As of 2026-06-19, European Commission pages state that EU general-purpose AI model rules became effective in August 2025 and that the Code of Practice supports compliance.

Sources (2): “European Commission pages describe the EU AI Act regulatory framework, including general-purpose AI model obligations.”
AI Act direct

“The EU Code of Practice page states that the general-purpose AI Code of Practice became effective on 1 August 2025 as a voluntary compliance tool.”
Drawing-up a General-Purpose AI Code of Practice direct
Counterpoints (1): Regulatory timing and interpretation evolve; the exact scope and enforcement of GPAI rules may change and should be rechecked after 2026-12-31.

Review record: how this was made

Created 2026-06-20 by human. Policy: policy:default v1.0.0.

✓ Approved hash matches current article

Reviews

agent · approved · 2026-06-20
Scope: claims, sources, tone, privacy
Initial seed draft reviewed against the source map, agent brief, and privacy contract. No client-specific or proprietary information detected. High-volatility current-state claims flagged for recheck after 2026-12-31. Approved for publication after final review.
sibling-agent · approved · 2026-06-20
Scope: claims, sources, tone, privacy, cross-links
Sibling review approved after addressing unused source removal and current-state source recheck notes. No blockers remain.
human · approved · 2026-06-20
Scope: claims, sources, tone, privacy
contentHash: 14bbae6f14e0ba00…
Human final review approved for publication after sibling-agent review and CI pass.

Machine-readable files

The same points, sources, and relationships are also available as structured files for agents and tools. The JSON follows the publication record schema.

Related articles

The Long Human Road to AI: A Reader’s Guide to Season 1

Learning Machines: Statistics, Neural Networks, and the Data Turn

The Human Road Through AI: Labor, Institutions, Governance, and Meaning
