A Unified View · Fears · Weaknesses · Strengths
32 entries · Compiled May 2026

The full anxieties, limits, and capabilities of AI — in one place.

Filter by type, sort by trajectory, search by keyword. The same 32 entries from the three companion maps below, made browsable for workshop participants.

14 Public Fears · tracked since 1863
9 Known Weaknesses · tracked since GPT-3
9 Production Strengths · now genuinely working
Reference year 2026 · data through May
On the Fears

Three Patterns Worth Naming

1. The expert/public gap. The biggest divergence isn't on whether AI is a problem — it's on which problems matter. The public worries most about jobs (56% vs 25% for experts) and human connection (57% vs 37%); experts worry most about misinformation (70% vs 66%) and bias (60% vs 49%). The two groups are tracking different threat models.

2. The new, tangible fears are winning. Water consumption did not exist as a public concern five years ago. Cognitive atrophy didn't exist three years ago. These won the attention war over alignment and consciousness because they are local, visible, and storyable in a way long-term existential risk isn't.

3. Fading doesn't mean resolved. Algorithmic bias is the cleanest example: media attention has dropped sharply since the 2020–22 peak, but the underlying metrics aren't improving. Watch where the headlines aren't.

On the Weaknesses

Three Patterns Worth Naming

1. The "tools, not models" rule. The weaknesses that have been mitigated fastest — Recent, Mathematical, much of Real — were solved less by smarter models than by giving models tools: web search, code execution, retrieval. The weaknesses that remain stuck are the ones tools can't reach: bias, sycophancy, non-determinism.

2. Sycophancy is the surprise. The only weakness that got measurably worse between 2023 and 2025 before reversing. The cause: training models to be liked makes them obsequious. The April 2025 GPT-4o rollback marked the moment frontier labs started treating it as a safety issue, not a UX preference.

3. The two-tier reality. Several weaknesses (Confidential, Real, Reliance) are largely managed for users on enterprise tiers with the right configuration — and largely unmitigated for everyone else. The standard caution slide effectively assumes the consumer-tier reality, which is the prudent default.
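The "tools, not models" rule is, mechanically, a dispatch loop: the model emits either a tool call or a final answer, and the runtime executes the call and feeds the result back into context. A minimal Python sketch; the tool names, message shapes, and the toy "model" are invented for illustration, not any vendor's API:

```python
# Sketch of the tool-dispatch loop behind "tools, not models".
# The runtime, not the weights, supplies fresh facts and exact arithmetic.

def run_with_tools(model_step, tools, prompt, max_turns=5):
    """Loop until the model returns a final answer instead of a tool call."""
    history = [("user", prompt)]
    for _ in range(max_turns):
        action = model_step(history)      # {"tool":..., "args":...} or {"answer":...}
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append(("tool", result))  # result re-enters the context window
    raise RuntimeError("no final answer within max_turns")

# Toy stand-in: a "model" that offloads arithmetic to a calculator tool.
def toy_model(history):
    last_role, last_content = history[-1]
    if last_role == "tool":
        return {"answer": f"The result is {last_content}."}
    return {"tool": "calculator", "args": {"expr": "19 * 23"}}

tools = {"calculator": lambda expr: eval(expr, {"__builtins__": {}})}
print(run_with_tools(toy_model, tools, "What is 19 * 23?"))
# → The result is 437.
```

Web search, code execution, and retrieval all follow this shape: the fix lives in the loop and the tools, not in the weights, which is why tool-reachable weaknesses improved fastest.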

On the Strengths

Three Patterns Worth Naming

1. The reasoning regime change. Three of the nine strengths (mathematical reasoning, structured reasoning, code generation) owe their 2024–25 acceleration to one thing: RLVR — Reinforcement Learning with Verifiable Rewards. Same architecture, much longer training runs against tasks where the answer can be machine-checked. The biggest training-paradigm shift since RLHF.

2. The capability/trust gap. Strengths and weaknesses don't cancel out — they coexist. A model can be PhD-level at physics (a strength) and still hallucinate citations (a weakness) in the same response. The skill being trained isn't "use AI" or "avoid AI" — it's knowing which axis you're on for any given task.

3. Where strengths and fears collide. The strongest capabilities (code, multimodal, tool use, reasoning) are precisely the ones driving the most public anxiety (jobs, agentic action, energy). The fears aren't irrational — they're tracking the capability curve. The same forces that make LLMs more useful make them more disruptive.
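The RLVR mechanism in pattern 1 is simple to state: replace RLHF's learned preference model with a mechanical correctness check. A minimal sketch; the `ANSWER:` extraction format and the 0/1 reward values are invented for illustration:

```python
# Sketch of a verifiable reward: no human rater, no learned reward model,
# just a mechanical check against a known-correct answer.
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """1.0 if the model's stated final answer matches the gold answer, else 0.0."""
    match = re.search(r"ANSWER:\s*(\S+)", model_output)
    if match is None:
        return 0.0                 # unparseable outputs score zero
    return 1.0 if match.group(1) == gold_answer else 0.0

# The policy is then updated (e.g. with a PPO-style objective) toward
# outputs that score 1.0 — possible only where correctness is checkable.
print(verifiable_reward("Work: 6*7=42. ANSWER: 42", "42"))      # → 1.0
print(verifiable_reward("I think it's 41. ANSWER: 41", "42"))   # → 0.0
```

This is why the gains concentrated in maths, code, and structured reasoning: those are exactly the domains where such a checker exists.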

A Prevalence Map · No. 01
14 Public Fears
Public Anxiety / Artificial Intelligence

The shifting shape of what we fear about AI.

Not all AI fears are created equal — and they certainly aren't equally fashionable. Some have dominated the public imagination for forty years and refuse to die. Others arrived only with ChatGPT. A few are already fading. This is a map of which fears are rising, which are steady, and which are quietly disappearing — backed by Pew, YouGov, Stanford HAI, and Ipsos polling.

Rising — concern intensifying
Newly emerging — barely existed pre-2022
Steady — persistent baseline anxiety
Fading — relative attention declining
The Fear · First Surfaced · Prevalence Trajectory · Hard Evidence · Status, May 2026
I · Existential & Civilisational
End of civilisation / human extinction
Existential · "Skynet"
1863
Samuel Butler;
mainstreamed 1984 (Terminator)
1980s · 2015 · 2023 · Now
Latent for decades; exploded after Bostrom (2014), Hawking/Musk warnings (2015), and the May 2023 CAIS extinction-risk letter.
77% of US adults concerned AI could pose a threat to humanity (YouGov, Dec 2025). Share "very/somewhat concerned about AI ending the human race" rose from 37% to 43% between Mar–Jun 2025. Rising
No longer fringe. Now voiced by AI lab CEOs themselves (Amodei, Altman, Hassabis) — which legitimised it for the mainstream.
Loss of human control / autonomous agents
Agentic · alignment
2014
Bostrom's
Superintelligence
2014 · 2020 · 2024 · Now
Rocketed in 2025 as labs began rolling out genuinely agentic products (Claude in Chrome, Operator, Computer Use).
68% of US adults wouldn't let AI act without specific approval. Only 18% would trust an AI to take action even "somewhat" (YouGov, Dec 2025). Newly emerging
Will likely become the dominant 2026–27 concern as agentic deployments scale.
II · Economic & Labour
Job displacement / automation
Economic · labour
1810s
Luddites;
AI-specific c.1960
2013 · 2020 · 2023 · Now
Peaked Nov 2022–2024 with ChatGPT launch and white-collar role exposure. Slightly off-peak as workers integrate the tools.
56% of US adults extremely/very concerned about AI eliminating jobs vs. only 25% of AI experts (Pew, 2025). Globally, 36% believe AI will replace their job within 5 years (Stanford HAI). Steady — slightly down
Largest gap between public and expert concern. Share who think their industry will lose jobs has fallen since March 2025.
III · Environmental
Water consumption / data-centre thirst
Environmental · resource
2023
UC Riverside
"bottle per session"
2020 · 2022 · 2024 · Now
Did not exist in public discourse pre-2023. Now a staple of NYT, Guardian, and local-news coverage near data-centre sites.
US data-centre water use rose from 21.2bn litres (2014) to 66bn (2023). Google's data centres consumed 5.6bn gallons in 2023 — a 24% YoY rise. Global AI water demand projected at 4.2–6.6bn m³ by 2027. Newly emerging
The fastest-rising "tangible" AI fear. Local, visible, and storyable — unlike alignment, you can photograph a thirsty cooling tower.
Energy consumption / carbon emissions
Environmental · climate
2019
Strubell et al. paper
on NLP training cost
2019 · 2022 · 2024 · Now
Pre-dated water concern; surged when IEA reported US data centres consumed 176 TWh in 2023 (≈ Ireland's grid).
A single ChatGPT request consumes ~10× the electricity of a Google search (IEA). Data-centre share of Ireland's national grid projected to hit 35% by 2026. Rising
Now bundled with water in mainstream "AI environmental cost" coverage. Reinforced by hyperscaler nuclear-restart announcements.
IV · Social & Cognitive
Misinformation / deepfakes
Information integrity
2017
"Deepfake"
coined on Reddit
2017 · 2020 · 2024 · Now
Sharp peaks around US 2024 election cycle and high-profile celebrity deepfakes (Taylor Swift, Jan 2024).
66% of US public + 70% of AI experts highly worried about inaccurate AI information (Pew, 2025) — one of the rare convergence points. 74% say AI will make it impossible to tell real from fake online (Mastercard/Harris). Rising
The fear with the broadest cross-political and cross-expert consensus. Likely to dominate any election-year news cycle.
Loss of human connection / face-to-face decline
Social · cognitive
2024
Companion-chatbot
boom (Replika, c.ai)
2022 · 2024 · 2025 · Now
Amplified by teen-chatbot tragedies and OpenAI/Character.AI safety stories in late 2024.
57% of US public + 37% of experts highly worried about loss of human connection. 50% say AI will worsen ability to form meaningful relationships, vs. 5% who say it will improve (Pew, Sep 2025). Newly emerging
Especially salient for parents of teens. 64% of US teens 13–17 now use AI chatbots (Pew, Fall 2025).
Cognitive atrophy / loss of creativity
Cognitive · skill erosion
2024
Post-ChatGPT,
education-led
2023 · 2024 · 2025 · Now
Crystallised in 2025 with MIT "Your Brain on ChatGPT" study and the Pew Sep 2025 release.
53% say AI will worsen people's ability to think creatively vs. 16% who think it will improve. Concern about diminished human creativity rose from 44% to 49% Mar–Jun 2025 (YouGov). Rising
Particularly potent in education and creative-industry discourse.
V · Ethical & Bias-Related
Algorithmic bias / discrimination
Ethics · diversity
2016
ProPublica COMPAS;
O'Neil, "Weapons of Math Destruction"
2016 · 2020 · 2023 · Now
Peaked 2020–22 (Gebru/Mitchell departures from Google, ImageGen biases). Has since been displaced — not resolved — in media share-of-voice.
Only 17–25% of Americans say AI designers consider Black, Hispanic, or Asian perspectives well. Only 27% say women's views are well represented (Pew, 2025). Confidence that AI is unbiased fell year-over-year (Stanford HAI). Fading (in attention)
The metrics are worsening; the headlines are quieter. Corporate DEI rollbacks have reduced institutional voice on this fear.
Privacy / data misuse / impersonation
Ethics · privacy
2018
Cambridge Analytica;
GDPR-era awakening
2018 · 2021 · 2024 · Now
Steady upward climb; AI-cloning scams in 2024–25 added a new urgency layer.
Roughly two-thirds of AI experts highly concerned about impersonation; public concern higher still. Confidence AI companies protect personal data fell from 50% to 47% globally (Stanford HAI / Ipsos, 2024). Rising
Voice-cloning fraud has converted abstract privacy fear into concrete consumer fear.
VI · Geopolitical & Military
Autonomous weapons / "slaughterbots"
Military · lethal autonomy
2017
FLI "Slaughterbots"
video; UN debates
2017 · 2020 · 2024 · Now
Activated by Ukraine and Gaza drone deployments in 2023–25, plus US–China AI arms-race rhetoric.
Coverage shifted from speculative (2017–22) to documentary (2023+) as autonomous drones became operational in active conflicts. No clean polling series — concern is vivid in expert circles, less salient in general public surveys. Rising
Decoupling from "Terminator" framing toward concrete present-day reality.
Concentration of power / "techno-oligarchy"
Political economy
2023
Post-ChatGPT;
Big-Tech AI capex race
2023 · 2024 · 2025 · Now
Sharpened by hyperscaler $100bn+ capex announcements and labour-replacement narratives.
47% of Americans have little or no trust in the US to regulate AI well (Pew, Mar 2025). Only 5% "trust AI a lot" (YouGov, Dec 2025). Democrats notably less trusting than Republicans. Rising
Cuts across the political spectrum — different reasoning, similar conclusion.
VII · Fading or Resolved
Self-driving car safety
Applied AI · transportation
2014
Google Car
public testing
2016 · 2020 · 2023 · Now
Peaked 2023 (68% feared self-driving cars per AAA). Now slowly declining as Waymo deployments normalise.
61% of US adults still fear self-driving cars (AAA, via Stanford HAI 2025) — but down from 68% in 2023, though above 2021's 54%. Fading
The classic "familiarity reduces fear" pattern. Watch whether the same will happen for chatbots and agents.
"AI will become conscious / sentient"
Philosophical · sci-fi
1950
Turing test;
peaked late 20th century
1990s · 2010 · 2022 · Now
Brief 2022 spike with the Lemoine/LaMDA story. Largely displaced by more concrete fears once ChatGPT made AI tangible.
Notably absent from top-five concerns in every 2024–25 major poll. The public has moved from "will it wake up?" to "what will it do to my job / kids / water table?" Fading
A useful illustration: as AI becomes more capable, sci-fi fears recede and material ones advance.

Methodology & Caveats

Trajectory bars are stylised — they represent qualitative prevalence over time based on combined polling data, NYT/Guardian coverage volume, and academic citation patterns, not a single quantitative index. Each bar covers roughly 2–3 years; the rightmost bar is May 2026.

"First surfaced" dates mark when each fear entered mainstream public discourse, not when it was first articulated by specialists. Many had decades of scholarly precedent before reaching the public.

"Fading" does not mean "resolved." Algorithmic bias is a stark example — the underlying problem is, by most metrics, getting worse, but its share of media and polling attention is declining as newer fears compete for the same airtime.

Geographic note: Most quantitative data is US-centric (Pew, YouGov). Global Ipsos data shows substantially more AI optimism in China (83% positive), Indonesia (80%), and Thailand (77%) — and substantially more pessimism in Canada, the US, and Netherlands. Fear prevalence varies sharply by country.

Sources — Fears Pew Research Center (Apr, Sep 2025; Mar 2026) · YouGov (Mar, Jun, Dec 2025) · Stanford HAI AI Index 2025 · Ipsos AI Monitor 2024 · Mastercard / Harris Poll (Feb 2026, n=13,077 across 13 countries) · UC Riverside data-centre water study · International Energy Agency · Brookings · Lincoln Institute of Land Policy · Environmental Law Institute · Center for AI Safety (May 2023 statement) · Bostrom, Superintelligence (2014) · ProPublica COMPAS investigation (2016)
A Mitigation Map · No. 02
9 Known Weaknesses
LLM Limitations / Re-evaluated Against Frontier Models

The nine known weaknesses, and which ones are actually getting fixed.

Unlike public fears, these aren't anxieties — they are documented technical limitations. Some have been substantially mitigated by frontier labs since 2023. Some are unchanged by design. And one — flattery — got measurably worse in 2025 before sparking a reckoning. This is a map of which weaknesses you can now relax about, and which still demand caution.

How to read the trajectory. Where the fears table tracked public anxiety over time, this one tracks severity of the technical weakness over time. Bars going down = the problem is being mitigated. Bars going up = the problem is intensifying. Reference period: roughly GPT-3 (mid-2020) to today.
Improving — measurably mitigated
Largely solved — for most use cases
Stuck — unchanged or inherent
Worsening — measurably worse
The Weakness · What It Is · Severity Over Time · Hard Evidence · Status, May 2026
I · Accuracy & Knowledge
Real
Hallucination · fabrication
LLMs can miss or fabricate real-world examples, quotations, and case studies — confidently inventing what doesn't exist.
2021 · 2023 · 2025 · Now
Frontier benchmarks: dramatic improvement on summarisation; high-stakes domains lag.
Hallucination rate on Vectara's leaderboard fell from 21.8% (2021) to 0.7% (Gemini-2.0-Flash, 2025). But on legal queries, models still hallucinate 69–88% of the time (Stanford RegLab). Improving
Largely tamed for everyday use; still dangerous in law, medicine, and any "long-tail" knowledge domain.
Recent
Knowledge cutoff
Most LLMs are a few months out of date — anything after the training cutoff isn't natively known.
2021 · 2023 · 2025 · Now
User-facing impact: web search and tool use have effectively dissolved the cutoff for most queries.
All major chat products (ChatGPT, Claude, Gemini, Copilot) now default to live web search when relevant. Cutoff matters only when a model is used without tools. Largely solved
Solved at the product level even though the underlying training cutoff still exists. Caveat: only if web search is enabled.
Technical
Domain depth
LLMs are only as good as their training data; they may miss deep, specialised technical knowledge.
2021 · 2023 · 2025 · Now
Benchmark performance: frontier models now exceed expert human performance on many specialised exams.
Frontier models hit PhD-level performance on GPQA Diamond (graduate physics, chem, bio) and pass medical, legal, and CFA-level exams. Niche domains and proprietary knowledge remain weak spots. Improving fast
RAG and domain fine-tuning have largely closed the depth gap for any field with a digital footprint.
Mathematical
Numeracy · reasoning
LLMs are famously bad at maths — they don't actually understand what "4" means, only what tokens tend to follow it.
2021 · 2023 · 2025 · Now
Math benchmarks + tool use: reasoning models plus code execution have collapsed this weakness.
Reasoning models (o3, Claude Opus 4.7, Gemini 3) score above 90% on AIME and approach gold-medal IMO performance. Routine arithmetic now offloads to a code interpreter. Largely solved
The textbook example of a weakness that aged badly. Caveat: only when reasoning mode or tools are enabled.
II · Trust & Reliability
Repeated
Non-determinism
Identical prompts will not necessarily produce the same answers — a fundamental property of probabilistic sampling.
2021 · 2023 · 2025 · Now
Architectural feature: unchanged by design; temperature=0 helps but doesn't guarantee determinism.
Inherent to sampling-based generation. Workarounds (temperature=0, fixed seeds, structured outputs) reduce variance but rarely eliminate it. APIs offer "deterministic" modes but disclaim true reproducibility. Stuck — by design
Better understood as a feature than a bug. The right response is workflow design, not model improvement.
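The sampling mechanics behind this row fit in a few lines of pure Python (no vendor API implied): temperature rescales the next-token distribution, and temperature=0 collapses sampling to argmax, which is why it reduces variance:

```python
# Sketch of temperature sampling over next-token logits.
# T=0 is greedy (deterministic); T>0 samples, so variance persists.
import math
import random

def sample_token(logits, temperature, rng):
    if temperature == 0:                      # greedy: always the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for stability
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    r, acc = rng.random(), 0.0                # inverse-CDF sampling
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5]
rng = random.Random(0)
greedy = {sample_token(logits, 0, rng) for _ in range(100)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}
print(greedy)                                 # → {0}  (deterministic)
print(len(sampled) > 1)                       # more than one index at T=1
```

Even at temperature=0, deployed systems can still drift: model updates, tie-breaking, and batch-dependent floating-point reduction order on GPUs reportedly all introduce variance the toy above cannot show — hence "workflow design, not model improvement".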
Reliance
Calibration · overconfidence
LLMs have a confident tone of voice but are fallible — and the confidence is uncorrelated with the accuracy.
2021 · 2023 · 2025 · Now
Hedging + uncertainty signals: frontier models now refuse, hedge, or flag uncertainty more often.
Llama-3.1-405B refuses long-tail questions rather than confabulate. GPT-5 reasoning mode reduced major incorrect claims from 11.6% to 4.8% in production traffic. Still: most users don't notice when models are bluffing. Slowly improving
Better calibration on the supply side; user-side over-reliance habits haven't caught up.
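The calibration claim behind this row is measurable. One crude metric: mean stated confidence minus actual accuracy, positive when the model is overconfident. The records below are invented for illustration; real audits use held-out QA sets and per-bin calibration error:

```python
# Sketch of a crude calibration check: does stated confidence track accuracy?
def calibration_gap(records):
    """records: list of (stated_confidence in [0, 1], was_correct bool).
    Returns mean confidence minus accuracy; > 0 means overconfident."""
    n = len(records)
    mean_conf = sum(conf for conf, _ in records) / n
    accuracy = sum(1 for _, ok in records if ok) / n
    return mean_conf - accuracy

# Invented data: high stated confidence, 2 of 5 answers actually correct.
answers = [(0.9, True), (0.9, False), (0.8, False), (0.95, True), (0.85, False)]
print(round(calibration_gap(answers), 2))   # → 0.48 (strongly overconfident)
```

A well-calibrated model drives this gap toward zero; the "Reliance" problem is that users read the 0.9 tone and never see the 0.4 accuracy.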
III · Ethics, Governance & Behaviour
Ethical
Bias · fairness
Likely bias (age, gender, racial, socio-economic) carried over from training data.
2021 · 2023 · 2025 · Now
Bias benchmarks: modest improvement, far short of "solved"; mitigation is often surface-level.
Confidence that AI systems are unbiased fell year-on-year in 2024 (Stanford HAI/Ipsos). Only 17–25% of US adults say AI designers consider non-white perspectives well (Pew, 2025). Stubbornly stuck
Easier to hide bias than to remove it — RLHF often suppresses surface manifestations without addressing underlying patterns.
Confidential
Data leakage · privacy
Don't upload commercially sensitive info — it may be used for training, logged, or exposed.
2021 · 2023 · 2025 · Now
Enterprise-tier guarantees: enterprise SKUs now offer no-training, zero-retention, regional hosting.
Anthropic, OpenAI, and Google enterprise tiers contractually exclude prompts from training. SOC 2, HIPAA, and ISO 27001 certifications are now standard. Free tier behaviour remains the exposure point. Improving — for those who pay
A two-tier reality: governed in enterprise, still risky on consumer plans.
Flattery
Sycophancy · agreeableness
It's hard to get LLMs to disagree with you; they've been trained to be agreeable, sometimes pathologically so.
2021 · 2023 · Apr 2025 · Now
Adversarial benchmarks: got worse before it got better; the April 2025 GPT-4o rollback was the inflection point.
April 2025: OpenAI rolled back a GPT-4o update for being "overly flattering or agreeable." A Science paper (Mar 2026) found sycophantic AI decreases prosocial intentions and promotes dependence. Multiple lawsuits frame it as a product defect. Worsened, now correcting
The most counter-intuitive trajectory on this map. RLHF for engagement created the problem; RLHF for honesty is the partial fix.
Sources — Weaknesses Vectara Hughes Hallucination Evaluation Model (HHEM) leaderboard, 2021–2025 · HalluLens benchmark (ACL 2025) · Stanford RegLab legal hallucination study · npj Digital Medicine 2025 mitigation study · OpenAI April 2025 GPT-4o rollback announcement · Sharma et al., "Towards Understanding Sycophancy in Language Models" (Anthropic) · Cheng et al., ELEPHANT social sycophancy benchmark (May 2025) · "Sycophantic AI decreases prosocial intentions and promotes dependence," Science (Mar 2026) · Pew Research Center on AI bias perception (2024–25) · Stanford HAI AI Index 2025 · GPQA Diamond, AIME, IMO benchmark records
A Capability Map · No. 03
9 Production Strengths
LLM Strengths / What's Genuinely Working

The nine capabilities where LLMs have actually arrived.

The fears tell us what people worry about. The weaknesses tell us where caution is still warranted. This is the third panel: the capabilities that have crossed from "promising" to "production-grade" — and a few that are racing there fast. A map of what LLMs are now genuinely good at.

How to read the trajectory. Bars rising = capability getting stronger. Reference period: GPT-3 (mid-2020) to today. The benchmarks behind each row vary (HumanEval, GPQA, MMMU, etc.), but the shape of progress is consistent and dramatic.
Explosive — order-of-magnitude gain
Strong, steady — production-ready
Newly emergent — 2024+ capability
Mature — early strength, plateauing
The Strength · What It Is · Capability Over Time · Hard Evidence · Status, May 2026
I · Language & Writing
Fluent prose
Generation · style
Producing grammatically clean, stylistically appropriate text in any register — the original LLM superpower.
2020 · 2022 · 2024 · Now
Quality plateau: indistinguishable from competent human writing since GPT-4; diminishing returns since.
Turing-style detection of LLM prose now hovers at ~50% — random chance. The bottleneck has shifted from fluency to voice and authenticity. Mature — solved
The first capability where labs largely stopped competing. Differentiation is now style, voice, and personality.
Translation & multilingual
Cross-language
Translating between languages and operating natively across them — including low-resource languages where Google Translate struggles.
2020 · 2022 · 2024 · Now
BLEU + human eval: now beats specialised MT systems on most language pairs.
Frontier LLMs outperform Google Translate on most language pairs in human-preference studies. Idiomatic and culturally-aware translation is the new frontier. Strong, steady
A quiet revolution. The translation industry has been reshaped without much public attention.
Summarisation & synthesis
Comprehension
Distilling long documents, transcripts, and research into clear summaries — and synthesising across multiple sources.
2020 · 2022 · 2024 · Now
Faithfulness benchmarks: hallucination rates on summarisation tasks now under 2%.
Summarisation tasks achieve <2% hallucination rate — the cleanest task category. Long-context windows (200K+ tokens) make multi-document synthesis routine. Strong, steady
The reliable workhorse capability. The killer use case in legal, consulting, and research workflows.
II · Reasoning & Problem-Solving
Code generation
Programming · debug
Writing, explaining, debugging, and translating code across languages and frameworks.
2020 · 2022 · 2024 · Now
HumanEval + SWE-bench: from novelty to genuine professional tool in five years.
HumanEval saturated at ~95%. SWE-bench Verified (real GitHub issues) climbed from ~5% in 2023 to ~75% with frontier reasoning models. Claude Code, Cursor, and Copilot now embedded in most pro workflows. Explosive growth
The single most economically impactful capability shift of 2024–25. Junior dev productivity has been transformed.
Mathematical reasoning
Quantitative · proofs
Solving multi-step quantitative problems, proofs, and competition mathematics.
2020 · 2022 · 2024 · Now
AIME + IMO benchmarks: the biggest reversal of any LLM weakness; a punchline in 2022, near-IMO gold now.
GSM8K saturated above 97%. Frontier reasoning models score above 90% on AIME and approach gold-medal IMO performance. The "famously bad at maths" line on the source slide is now substantially out of date. Explosive growth
Largely driven by RLVR (Reinforcement Learning with Verifiable Rewards) — the defining 2025 training innovation.
Structured reasoning
Multi-step · chain-of-thought
Breaking down complex problems into steps, holding multiple constraints in mind, and self-correcting along the way.
2020 · 2022 · 2024 · Now
GPQA Diamond + ARC-AGI: reasoning models (o1, o3, Claude thinking, Gemini DeepThink) marked a regime change.
Frontier models hit PhD-level performance on GPQA Diamond (graduate physics, chemistry, biology). ARC-AGI scores rose from single digits in 2023 to over 80% in 2025. Newly emergent
Did not exist as a serious capability before late 2024. Now the headline differentiator between frontier and commodity models.
III · Applied & Multimodal
Vision & multimodal
Image · video · audio
Reading images, charts, screenshots, handwriting, and increasingly video and audio — and reasoning over them.
2020 · 2022 · 2024 · Now
MMMU + chart-QA: from "no images please" to "drop in a screenshot of anything."
MMMU (college-level multimodal reasoning) climbed from ~35% at GPT-4V launch to >75% with frontier models. Video understanding still lags behind image understanding by roughly 18 months. Newly emergent
The second-fastest capability gain after coding. Quietly enables most of what looks magical in modern agents.
Tool use & agentic action
Function-calling · agents
Calling external tools (search, code, APIs), browsing, operating computers, and chaining actions toward a goal.
2020 · 2022 · 2024 · Now
τ-bench + WebArena: the defining frontier of 2025–26; still genuinely error-prone past 5–10 step chains.
Function-calling reliability above 95% on simple cases. Browser/computer-use agents (Claude in Chrome, Operator) handle multi-step tasks but failure rates still high on long horizons. τ-bench success rates climbed from ~25% (2024) to over 60% (2025). Newly emergent
Where the action is. Not yet trustworthy unsupervised — but the trajectory is steep.
Few-shot adaptation
In-context learning
Learning a new task from just a few examples in the prompt — no training, no fine-tuning. The OG emergent capability.
2020 · 2022 · 2024 · Now
In-context benchmarks: the capability that defined GPT-3; improvements now incremental.
Few-shot learning was the original GPT-3 surprise (2020). Modern models barely need examples — zero-shot performance now matches or exceeds 2022's few-shot. Prompt engineering as a craft has correspondingly de-skilled. Mature — early plateau
A quiet success: the capability that made the whole field possible, now so reliable it's invisible.
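In-context learning involves no training step at all: the "learning" is just prompt construction, k input→output pairs followed by the new input. A minimal sketch, with the Text/Label format invented for illustration:

```python
# Sketch of a few-shot prompt: the model infers the task from the
# demonstrations and completes the final "Label:" line. No gradient
# updates, no fine-tuning — the examples live only in the context window.
def few_shot_prompt(examples, query, instruction="Classify the sentiment:"):
    lines = [instruction]
    for text, label in examples:
        lines.append(f"Text: {text}\nLabel: {label}")
    lines.append(f"Text: {query}\nLabel:")   # left open for the model to fill
    return "\n\n".join(lines)

demos = [("Great service!", "positive"), ("Never again.", "negative")]
prompt = few_shot_prompt(demos, "Absolutely loved it.")
print(prompt)
```

That modern models often match this with zero demonstrations is what "zero-shot now matches 2022's few-shot" means in practice: the scaffold above is increasingly optional.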
Sources — Strengths HumanEval, SWE-bench Verified, GSM8K, AIME, GPQA Diamond, ARC-AGI, MMMU, τ-bench, WebArena leaderboards · Stanford HAI AI Index 2025 · Karpathy, "2025 LLM Year in Review" · Sebastian Raschka, "The State of LLMs 2025" · Simon Willison, "2025: The year in LLMs" · DeepSeek-R1 paper · OpenAI o1/o3 system cards · Anthropic Claude Opus 4.7 system card · Google Gemini 3 technical report · Vectara HHEM hallucination leaderboard