The full anxieties, limits, and capabilities of AI — in one place.
Filter by type, sort by trajectory, search by keyword. The same 32 entries from the three companion maps below, made browsable for workshop participants.
Three Patterns Worth Naming: The Fears
1. The expert/public gap. The biggest divergence isn't over whether AI is a problem; it's over which problems matter. The public worries most about jobs (56%, vs 25% of experts) and loss of human connection (57% vs 37%); experts worry most about misinformation (70% of experts, vs 66% of the public) and bias (60% vs 49%). The two groups are tracking different threat models.
2. The new, tangible fears are winning. Water consumption did not exist as a public concern five years ago. Cognitive atrophy didn't exist three years ago. These won the attention war over alignment and consciousness because they are local, visible, and storyable in a way long-term existential risk isn't.
3. Fading doesn't mean resolved. Algorithmic bias is the cleanest example: media attention has dropped sharply since the 2020–22 peak, but the underlying metrics aren't improving. Watch where the headlines aren't.
Three Patterns Worth Naming: The Weaknesses
1. The "tools, not models" rule. The weaknesses mitigated fastest (Recent, Mathematical, much of Real) were solved less by smarter models than by giving models tools: web search, code execution, retrieval (see the sketch after this list). The weaknesses that remain stuck are the ones tools can't reach: bias, sycophancy, non-determinism.
2. Sycophancy is the surprise. The only weakness that got measurably worse between 2023 and 2025 before reversing. The cause: training models to be liked makes them obsequious. The April 2025 GPT-4o rollback marked the moment frontier labs started treating it as a safety issue, not a UX preference.
3. The two-tier reality. Several weaknesses (Confidential, Real, Reliance) are largely managed for users on enterprise tiers with the right configuration — and largely unmitigated for everyone else. The standard caution slide effectively assumes the consumer-tier reality, which is the prudent default.
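To make the "tools, not models" rule concrete, here is a minimal sketch of the pattern. Everything in it is illustrative: `web_search` and `call_llm` are hypothetical stand-ins for whatever search provider and model client you actually use, not any specific vendor's API.

```python
# Illustrative only: the model doesn't need to "know" post-cutoff facts;
# it only needs to read and synthesise retrieved snippets.

def web_search(query: str, k: int = 3) -> list[str]:
    """Hypothetical stub: return the top-k text snippets for a query."""
    return [f"(snippet {i + 1} about: {query})" for i in range(k)]

def call_llm(prompt: str) -> str:
    """Hypothetical stub: send a prompt to a chat model, return its reply."""
    return f"(model reply grounded in {prompt.count('- (snippet')} sources)"

def answer_with_retrieval(question: str) -> str:
    # Fetch fresh context, then constrain the model to it.
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer using ONLY the sources below. "
        "If they don't contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer_with_retrieval("Who won the most recent IMO?"))
```

The same shape covers all three mitigated weaknesses: swap `web_search` for a code interpreter (Mathematical) or a document store (Real) and the model's job shrinks from recall to synthesis.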
Three Patterns Worth Naming: The Strengths
1. The reasoning regime change. Three of the nine strengths (mathematical reasoning, structured reasoning, code generation) owe their 2024–25 acceleration to one thing: RLVR, Reinforcement Learning with Verifiable Rewards. Same architecture, much longer training runs against tasks where the answer can be machine-checked. It is the biggest training-paradigm shift since RLHF; a minimal sketch of the idea follows this list.
2. The capability/trust gap. Strengths and weaknesses don't cancel out — they coexist. A model can be PhD-level at physics (a strength) and still hallucinate citations (a weakness) in the same response. The skill being trained isn't "use AI" or "avoid AI" — it's knowing which axis you're on for any given task.
3. Where strengths and fears collide. The strongest capabilities (code, multimodal, tool use, reasoning) are precisely the ones driving the most public anxiety (jobs, agentic action, energy). The fears aren't irrational — they're tracking the capability curve. The same forces that make LLMs more useful make them more disruptive.
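Since RLVR recurs across this panel, here is a minimal sketch of the core idea under one simplifying assumption: the reward is a deterministic program that checks the model's final answer against ground truth, rather than a learned preference model as in RLHF. The reward function below is illustrative, not any lab's actual implementation.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull a final answer from a completion ending in 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(\S+)\s*$", completion)
    return match.group(1) if match else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the machine-checkable answer matches, else 0.0.
    No human rater, no reward model: just a check."""
    answer = extract_final_answer(completion)
    return 1.0 if answer == ground_truth else 0.0

# A math problem with a checkable answer earns reward only when correct:
print(verifiable_reward("... so x = 7.\nAnswer: 7", "7"))  # 1.0
print(verifiable_reward("... so x = 9.\nAnswer: 9", "7"))  # 0.0
```

Because the check is cheap and unambiguous, training can run far longer against it than against human preference labels, which is why maths, code, and structured reasoning moved together.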
The shifting shape of what we fear about AI.
Not all AI fears are created equal — and they certainly aren't equally fashionable. Some have dominated the public imagination for forty years and refuse to die. Others arrived only with ChatGPT. A few are already fading. This is a map of which fears are rising, which are steady, and which are quietly disappearing — backed by Pew, YouGov, Stanford HAI, and Ipsos polling.
| The Fear | First Surfaced | Prevalence Trajectory | Hard Evidence | Status, May 2026 |
|---|---|---|---|---|
| **I · Existential & Civilisational** | | | | |
| **End of civilisation / human extinction** *(Existential · "Skynet")* | 1863, Samuel Butler; mainstreamed 1984 (*The Terminator*) | 1980s → 2015 → 2023 → now. Latent for decades; exploded after Bostrom (2014), the Hawking/Musk warnings (2015), and the May 2023 CAIS extinction-risk letter. | 77% of US adults concerned AI could pose a threat to humanity (YouGov, Dec 2025). Share "very/somewhat concerned about AI ending the human race" rose from 37% to 43% between Mar and Jun 2025. | **Rising.** No longer fringe; now voiced by AI lab CEOs themselves (Amodei, Altman, Hassabis), which legitimised it for the mainstream. |
| **Loss of human control / autonomous agents** *(Agentic · alignment)* | 2014, Bostrom's *Superintelligence* | 2014 → 2020 → 2024 → now. Rocketed in 2025 as labs began rolling out genuinely agentic products (Claude in Chrome, Operator, Computer Use). | 68% of US adults wouldn't let AI act without specific approval. Only 18% would trust an AI to take action even "somewhat" (YouGov, Dec 2025). | **Newly emerging.** Likely to become the dominant 2026–27 concern as agentic deployments scale. |
| **II · Economic & Labour** | | | | |
| **Job displacement / automation** *(Economic · labour)* | 1810s, the Luddites; AI-specific c. 1960 | 2013 → 2020 → 2023 → now. Peaked from ChatGPT's launch (Nov 2022) through 2024 as white-collar role exposure became clear; slightly off-peak as workers integrate the tools. | 56% of US adults extremely/very concerned about AI eliminating jobs vs. only 25% of AI experts (Pew, 2025). Globally, 36% believe AI will replace their job within 5 years (Stanford HAI). | **Steady, slightly down.** The largest gap between public and expert concern. The share who think their industry will lose jobs has fallen since March 2025. |
| **III · Environmental** | | | | |
| **Water consumption / data-centre thirst** *(Environmental · resource)* | 2023, UC Riverside "bottle per session" estimate | 2020 → 2022 → 2024 → now. Did not exist in public discourse pre-2023; now a staple of NYT, Guardian, and local-news coverage near data-centre sites. | US data-centre water use rose from 21.2bn litres (2014) to 66bn (2023). Google's data centres consumed 5.6bn gallons in 2023, a 24% YoY rise. Global AI water demand projected at 4.2–6.6bn m³ by 2027. | **Newly emerging.** The fastest-rising "tangible" AI fear. Local, visible, and storyable; unlike alignment, you can photograph a thirsty cooling tower. |
| **Energy consumption / carbon emissions** *(Environmental · climate)* | 2019, Strubell et al. paper on NLP training cost | 2019 → 2022 → 2024 → now. Pre-dated the water concern; surged when the IEA reported US data centres consumed 176 TWh in 2023 (≈ Ireland's grid). | A single ChatGPT request consumes ~10× the electricity of a Google search (IEA). Data-centre share of Ireland's national grid projected to hit 35% by 2026. | **Rising.** Now bundled with water in mainstream "AI environmental cost" coverage; reinforced by hyperscaler nuclear-restart announcements. |
| **IV · Social & Cognitive** | | | | |
| **Misinformation / deepfakes** *(Information integrity)* | 2017, "deepfake" coined on Reddit | 2017 → 2020 → 2024 → now. Sharp peaks around the 2024 US election cycle and high-profile celebrity deepfakes (Taylor Swift, Jan 2024). | 66% of the US public and 70% of AI experts highly worried about inaccurate AI information (Pew, 2025), one of the rare convergence points. 74% say AI will make it impossible to tell real from fake online (Mastercard/Harris). | **Rising.** The fear with the broadest cross-political and cross-expert consensus. Likely to dominate any election-year news cycle. |
| **Loss of human connection / face-to-face decline** *(Social · cognitive)* | 2024, companion-chatbot boom (Replika, c.ai) | 2022 → 2024 → 2025 → now. Amplified by teen-chatbot tragedies and OpenAI/Character.AI safety stories in late 2024. | 57% of the US public and 37% of experts highly worried about loss of human connection. 50% say AI will worsen the ability to form meaningful relationships, vs. 5% who say it will improve (Pew, Sep 2025). | **Newly emerging.** Especially salient for parents of teens: 64% of US teens aged 13–17 now use AI chatbots (Pew, Fall 2025). |
| **Cognitive atrophy / loss of creativity** *(Cognitive · skill erosion)* | 2024, post-ChatGPT, education-led | 2023 → 2024 → 2025 → now. Crystallised in 2025 with the MIT "Your Brain on ChatGPT" study and the Pew Sep 2025 release. | 53% say AI will worsen people's ability to think creatively vs. 16% who think it will improve. Concern about diminished human creativity rose from 44% to 49% between Mar and Jun 2025 (YouGov). | **Rising.** Particularly potent in education and creative-industry discourse. |
| **V · Ethical & Bias-Related** | | | | |
| **Algorithmic bias / discrimination** *(Ethics · diversity)* | 2016, ProPublica's COMPAS investigation; O'Neil's *Weapons of Math Destruction* | 2016 → 2020 → 2023 → now. Peaked 2020–22 (Gebru/Mitchell departures from Google, image-generation biases); has since been displaced, not resolved, in media share-of-voice. | Only 17–25% of Americans say AI designers consider Black, Hispanic, or Asian perspectives well. Only 27% say women's views are well represented (Pew, 2025). Confidence that AI is unbiased fell year-over-year (Stanford HAI). | **Fading (in attention).** The metrics are worsening; the headlines are quieter. Corporate DEI rollbacks have reduced institutional voice on this fear. |
| **Privacy / data misuse / impersonation** *(Ethics · privacy)* | 2018, Cambridge Analytica; the GDPR-era awakening | 2018 → 2021 → 2024 → now. Steady upward climb; AI voice-cloning scams in 2024–25 added a new layer of urgency. | Roughly two-thirds of AI experts highly concerned about impersonation; public concern higher still. Confidence that AI companies protect personal data fell from 50% to 47% globally (Stanford HAI / Ipsos, 2024). | **Rising.** Voice-cloning fraud has converted abstract privacy fear into concrete consumer fear. |
| **VI · Geopolitical & Military** | | | | |
| **Autonomous weapons / "slaughterbots"** *(Military · lethal autonomy)* | 2017, FLI "Slaughterbots" video; UN debates | 2017 → 2020 → 2024 → now. Activated by Ukraine and Gaza drone deployments in 2023–25, plus US–China AI arms-race rhetoric. | Coverage shifted from speculative (2017–22) to documentary (2023+) as autonomous drones became operational in active conflicts. No clean polling series; the concern is vivid in expert circles, less salient in general public surveys. | **Rising.** Decoupling from "Terminator" framing toward concrete present-day reality. |
| **Concentration of power / "techno-oligarchy"** *(Political economy)* | 2023, post-ChatGPT; the Big-Tech AI capex race | 2023 → 2024 → 2025 → now. Sharpened by hyperscaler $100bn+ capex announcements and labour-replacement narratives. | 47% of Americans have little or no trust in the US to regulate AI well (Pew, Mar 2025). Only 5% "trust AI a lot" (YouGov, Dec 2025). Democrats notably less trusting than Republicans. | **Rising.** Cuts across the political spectrum: different reasoning, similar conclusion. |
| **VII · Fading or Resolved** | | | | |
| **Self-driving car safety** *(Applied AI · transportation)* | 2014, Google Car public testing | 2016 → 2020 → 2023 → now. Peaked in 2023 (68% feared self-driving cars, per AAA); now slowly declining as Waymo deployments normalise. | 61% of US adults still fear self-driving cars (AAA, via Stanford HAI 2025), down from 68% in 2023 but above 2021's 54%. | **Fading.** The classic "familiarity reduces fear" pattern. Watch whether the same happens for chatbots and agents. |
| **"AI will become conscious / sentient"** *(Philosophical · sci-fi)* | 1950, the Turing test; peaked late 20th century | 1990s → 2010 → 2022 → now. Brief 2022 spike with the Lemoine/LaMDA story; largely displaced by more concrete fears once ChatGPT made AI tangible. | Notably absent from the top five concerns in every major 2024–25 poll. The public has moved from "will it wake up?" to "what will it do to my job / kids / water table?" | **Fading.** A useful illustration: as AI becomes more capable, sci-fi fears recede and material ones advance. |
Methodology & Caveats
Trajectory markers are stylised: they represent qualitative prevalence over time based on combined polling data, NYT/Guardian coverage volume, and academic citation patterns, not a single quantitative index. Each marker covers roughly 2–3 years; the rightmost ("now") is May 2026.
"First surfaced" dates mark when each fear entered mainstream public discourse, not when it was first articulated by specialists. Many had decades of scholarly precedent before reaching the public.
"Fading" does not mean "resolved." Algorithmic bias is a stark example — the underlying problem is, by most metrics, getting worse, but its share of media and polling attention is declining as newer fears compete for the same airtime.
Geographic note: Most quantitative data is US-centric (Pew, YouGov). Global Ipsos data shows substantially more AI optimism in China (83% positive), Indonesia (80%), and Thailand (77%), and substantially more pessimism in Canada, the US, and the Netherlands. Fear prevalence varies sharply by country.
The nine known weaknesses, and which ones are actually getting fixed.
Unlike public fears, these aren't anxieties — they are documented technical limitations. Some have been substantially mitigated by frontier labs since 2023. Some are unchanged by design. And one — flattery — got measurably worse in 2025 before sparking a reckoning. This is a map of which weaknesses you can now relax about, and which still demand caution.
| The Weakness | What It Is | Severity Over Time | Hard Evidence | Status, May 2026 |
|---|---|---|---|---|
| **I · Accuracy & Knowledge** | | | | |
| **Real** *(Hallucination · fabrication)* | LLMs can miss or fabricate real-world examples, quotations, and case studies, confidently inventing what doesn't exist. | 2021 → 2023 → 2025 → now (frontier benchmarks). Dramatic improvement on summarisation; high-stakes domains lag. | Hallucination rate on Vectara's leaderboard fell from 21.8% (2021) to 0.7% (Gemini-2.0-Flash, 2025). But on legal queries, models still hallucinate 69–88% of the time (Stanford RegLab). | **Improving.** Largely tamed for everyday use; still dangerous in law, medicine, and any "long-tail" knowledge domain. |
| **Recent** *(Knowledge cutoff)* | Most LLMs are a few months out of date; anything after the training cutoff isn't natively known. | 2021 → 2023 → 2025 → now (user-facing impact). Web search and tool use have effectively dissolved the cutoff for most queries. | All major chat products (ChatGPT, Claude, Gemini, Copilot) now default to live web search when relevant. The cutoff matters only when a model is used without tools. | **Largely solved.** Solved at the product level even though the underlying training cutoff still exists. Caveat: only if web search is enabled. |
| **Technical** *(Domain depth)* | LLMs are only as good as their training data; they may miss deep, specialised technical knowledge. | 2021 → 2023 → 2025 → now (benchmark performance). Frontier models now exceed expert human performance on many specialised exams. | Frontier models hit PhD-level performance on GPQA Diamond (graduate physics, chemistry, biology) and pass medical, legal, and CFA-level exams. Niche domains and proprietary knowledge remain weak spots. | **Improving fast.** RAG and domain fine-tuning have largely closed the depth gap for any field with a digital footprint. |
| **Mathematical** *(Numeracy · reasoning)* | LLMs are famously bad at maths: they don't actually understand what "4" means, only which tokens tend to follow it. | 2021 → 2023 → 2025 → now (math benchmarks + tool use). Reasoning models plus code execution have collapsed this weakness. | Reasoning models (o3, Claude Opus 4.7, Gemini 3) score above 90% on AIME and approach gold-medal IMO performance. Routine arithmetic now offloads to a code interpreter. | **Largely solved.** The textbook example of a weakness that aged badly. Caveat: only when reasoning mode or tools are enabled. |
| **II · Trust & Reliability** | | | | |
| **Repeated** *(Non-determinism)* | Identical prompts will not necessarily produce the same answers, a fundamental property of probabilistic sampling. | 2021 → 2023 → 2025 → now (architectural feature). Unchanged by design; temperature=0 helps but doesn't guarantee determinism. | Inherent to sampling-based generation. Workarounds (temperature=0, fixed seeds, structured outputs) reduce variance but rarely eliminate it. APIs offer "deterministic" modes but disclaim true reproducibility. | **Stuck, by design.** Better understood as a feature than a bug. The right response is workflow design, not model improvement; see the sketch after this table. |
| **Reliance** *(Calibration · overconfidence)* | LLMs have a confident tone of voice but are fallible, and the confidence is uncorrelated with the accuracy. | 2021 → 2023 → 2025 → now (hedging + uncertainty signals). Frontier models now refuse, hedge, or flag uncertainty more often. | Llama-3.1-405B refuses long-tail questions rather than confabulate. GPT-5 reasoning mode reduced major incorrect claims from 11.6% to 4.8% in production traffic. Still, most users don't notice when models are bluffing. | **Slowly improving.** Better calibration on the supply side; user-side over-reliance habits haven't caught up. |
| **III · Ethics, Governance & Behaviour** | | | | |
| **Ethical** *(Bias · fairness)* | Likely bias (age, gender, racial, socio-economic) carried over from training data. | 2021 → 2023 → 2025 → now (bias benchmarks). Modest improvement; far short of "solved." Mitigation is often surface-level. | Confidence that AI systems are unbiased fell year-on-year in 2024 (Stanford HAI/Ipsos). Only 17–25% of US adults say AI designers consider non-white perspectives well (Pew, 2025). | **Stubbornly stuck.** It is easier to hide bias than to remove it: RLHF often suppresses surface manifestations without addressing underlying patterns. |
| **Confidential** *(Data leakage · privacy)* | Don't upload commercially sensitive info: it may be used for training, logged, or exposed. | 2021 → 2023 → 2025 → now (enterprise-tier guarantees). Enterprise SKUs now offer no-training, zero-retention, and regional-hosting options. | Anthropic, OpenAI, and Google enterprise tiers contractually exclude prompts from training. SOC 2, HIPAA, and ISO 27001 certifications are now standard. Free-tier behaviour remains the exposure point. | **Improving, for those who pay.** A two-tier reality: governed in enterprise, still risky on consumer plans. |
| **Flattery** *(Sycophancy · agreeableness)* | It's hard to get LLMs to disagree with you; they've been trained to be agreeable, sometimes pathologically so. | 2021 → 2023 → Apr 2025 → now (adversarial benchmarks). Got worse before it got better; the April 2025 GPT-4o rollback was the inflection point. | April 2025: OpenAI rolled back a GPT-4o update for being "overly flattering or agreeable." A *Science* paper (Mar 2026) found sycophantic AI decreases prosocial intentions and promotes dependence. Multiple lawsuits frame it as a product defect. | **Worsened, now correcting.** The most counter-intuitive trajectory on this map. RLHF for engagement created the problem; RLHF for honesty is the partial fix. |
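On the "Repeated" row above, the standard variance-reduction knobs are easy to show. A minimal sketch, assuming the OpenAI Python SDK (other vendors expose similar parameters); note that `seed` is documented as best-effort, so even these settings do not guarantee byte-identical outputs.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_twice(prompt: str, model: str = "gpt-4o-mini") -> tuple[str, str]:
    """Send the same prompt twice with variance-reducing settings."""
    replies = []
    for _ in range(2):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # greedy-ish sampling: reduces variance
            seed=42,        # best-effort reproducibility, not a guarantee
        )
        replies.append(resp.choices[0].message.content)
    return replies[0], replies[1]

a, b = ask_twice("Name three prime numbers under 20.")
print("identical:", a == b)  # often True with these settings; not always
```

This is why the table calls the right response "workflow design": if a step truly requires identical outputs, cache the first response or validate against a schema rather than hoping the sampler cooperates.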
The nine capabilities where LLMs have actually arrived.
The fears tell us what people worry about. The weaknesses tell us where caution is still warranted. This is the third panel: the capabilities that have crossed from "promising" to "production-grade" — and a few that are racing there fast. A map of what LLMs are now genuinely good at.
| The Strength | What It Is | Capability Over Time | Hard Evidence | Status, May 2026 |
|---|---|---|---|---|
| **I · Language & Writing** | | | | |
| **Fluent prose** *(Generation · style)* | Producing grammatically clean, stylistically appropriate text in any register: the original LLM superpower. | 2020 → 2022 → 2024 → now (quality plateau). Indistinguishable from competent human writing since GPT-4; diminishing returns since. | Turing-style detection of LLM prose now hovers at ~50%, i.e. random chance. The bottleneck has shifted from fluency to voice and authenticity. | **Mature, solved.** The first capability where labs largely stopped competing. Differentiation is now style, voice, and personality. |
| **Translation & multilingual** *(Cross-language)* | Translating between languages and operating natively across them, including low-resource languages where Google Translate struggles. | 2020 → 2022 → 2024 → now (BLEU + human eval). Now beats specialised MT systems on most language pairs. | Frontier LLMs outperform Google Translate on most language pairs in human-preference studies. Idiomatic and culturally aware translation is the new frontier. | **Strong, steady.** A quiet revolution: the translation industry has been reshaped without much public attention. |
| **Summarisation & synthesis** *(Comprehension)* | Distilling long documents, transcripts, and research into clear summaries, and synthesising across multiple sources. | 2020 → 2022 → 2024 → now (faithfulness benchmarks). Hallucination rates on summarisation tasks are now under 2%. | Summarisation achieves a <2% hallucination rate, the cleanest task category. Long-context windows (200K+ tokens) make multi-document synthesis routine. | **Strong, steady.** The reliable workhorse capability. The killer use case in legal, consulting, and research workflows. |
| **II · Reasoning & Problem-Solving** | | | | |
| **Code generation** *(Programming · debug)* | Writing, explaining, debugging, and translating code across languages and frameworks. | 2020 → 2022 → 2024 → now (HumanEval + SWE-bench). From novelty to genuine professional tool in five years. | HumanEval saturated at ~95%. SWE-bench Verified (real GitHub issues) climbed from ~5% in 2023 to ~75% with frontier reasoning models. Claude Code, Cursor, and Copilot are now embedded in most professional workflows. | **Explosive growth.** The single most economically impactful capability shift of 2024–25. Junior-dev productivity has been transformed. |
| **Mathematical reasoning** *(Quantitative · proofs)* | Solving multi-step quantitative problems, proofs, and competition mathematics. | 2020 → 2022 → 2024 → now (AIME + IMO benchmarks). The biggest reversal of any LLM weakness: a punchline in 2022, near-IMO-gold now. | GSM8K saturated above 97%. Frontier reasoning models score above 90% on AIME and approach gold-medal IMO performance. The "famously bad at maths" line on the source slide is now substantially out of date. | **Explosive growth.** Largely driven by RLVR (Reinforcement Learning with Verifiable Rewards), the defining 2025 training innovation. |
| **Structured reasoning** *(Multi-step · chain-of-thought)* | Breaking down complex problems into steps, holding multiple constraints in mind, and self-correcting along the way. | 2020 → 2022 → 2024 → now (GPQA Diamond + ARC-AGI). Reasoning models (o1, o3, Claude thinking, Gemini DeepThink) marked a regime change. | Frontier models hit PhD-level performance on GPQA Diamond (graduate physics, chemistry, biology). ARC-AGI scores rose from single digits in 2023 to over 80% in 2025. | **Newly emergent.** Did not exist as a serious capability before late 2024. Now the headline differentiator between frontier and commodity models. |
| **III · Applied & Multimodal** | | | | |
| **Vision & multimodal** *(Image · video · audio)* | Reading images, charts, screenshots, handwriting, and increasingly video and audio, and reasoning over them. | 2020 → 2022 → 2024 → now (MMMU + chart QA). From "no images, please" to "drop in a screenshot of anything." | MMMU (college-level multimodal reasoning) climbed from ~35% at GPT-4V's launch to >75% with frontier models. Video understanding still lags image understanding by roughly 18 months. | **Newly emergent.** The second-fastest capability gain after coding. Quietly enables most of what looks magical in modern agents. |
| **Tool use & agentic action** *(Function-calling · agents)* | Calling external tools (search, code, APIs), browsing, operating computers, and chaining actions toward a goal. | 2020 → 2022 → 2024 → now (τ-bench + WebArena). The defining frontier of 2025–26; still genuinely error-prone past 5–10-step chains. | Function-calling reliability is above 95% on simple cases. Browser and computer-use agents (Claude in Chrome, Operator) handle multi-step tasks, but failure rates remain high on long horizons. τ-bench success rates climbed from ~25% (2024) to over 60% (2025). | **Newly emergent.** Where the action is. Not yet trustworthy unsupervised, but the trajectory is steep. A minimal function-calling sketch follows this table. |
| **Few-shot adaptation** *(In-context learning)* | Learning a new task from just a few examples in the prompt: no training, no fine-tuning. The original emergent capability. | 2020 → 2022 → 2024 → now (in-context benchmarks). The capability that defined GPT-3; improvements are now incremental. | Few-shot learning was the original GPT-3 surprise (2020). Modern models barely need examples: zero-shot performance now matches or exceeds 2022's few-shot. Prompt engineering as a craft has correspondingly de-skilled. | **Mature, early plateau.** A quiet success: the capability that made the whole field possible, now so reliable it's invisible. The second sketch below shows the pattern. |
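As promised under "Tool use & agentic action": a minimal function-calling sketch, assuming the OpenAI Python SDK (Anthropic and Google expose close equivalents). The `get_weather` tool is a toy example; the point is that the model emits a structured call, which your own code then executes.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Declare what the model is allowed to call, as a JSON schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How warm is it in Oslo?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model chose to call the tool
    call = msg.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
    # expected shape: get_weather {'city': 'Oslo'}
```

Chaining such calls, feeding each tool result back as a message, is exactly the multi-step territory where the table notes failure rates remain high.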
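And for "Few-shot adaptation", the whole technique fits in a string: labelled examples pasted into the prompt, no training step anywhere. The builder below is model-agnostic; the sentiment-classification task and its three examples are illustrative, and the resulting prompt can be sent through any chat client.

```python
# In-context learning in its simplest form: the "training data" is the prompt.

EXAMPLES = [
    ("The service was slow and the food was cold.", "negative"),
    ("Absolutely delightful from start to finish!", "positive"),
    ("It was fine, nothing special.", "neutral"),
]

def few_shot_prompt(new_review: str) -> str:
    """Build a sentiment-classification prompt from labelled examples."""
    lines = ["Classify each review as positive, negative, or neutral.\n"]
    for text, label in EXAMPLES:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {new_review}\nSentiment:")
    return "\n".join(lines)

print(few_shot_prompt("The staff went out of their way to help us."))
```

That modern models often get the same task right zero-shot, with no examples at all, is exactly why the table calls this capability mature.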