The full anxieties, limits, and capabilities of AI — in one place.
Filter by type, sort by trajectory, search by keyword. The same 32 entries from the three companion maps below, made browsable for workshop participants.
Three Patterns Worth Naming: The Fears
1. The expert/public gap. The biggest divergence isn't over whether AI is a problem; it's over which problems matter. The public worries most about jobs (56%, vs 25% of experts) and loss of human connection (57% vs 37%); experts worry most about misinformation (70% of experts, vs 66% of the public) and bias (60% vs 49%). The two groups are tracking different threat models.
2. The new, tangible fears are winning. Water consumption did not exist as a public concern five years ago. Cognitive atrophy didn't exist three years ago. These won the attention war over alignment and consciousness because they are local, visible, and storyable in a way long-term existential risk isn't.
3. Fading doesn't mean resolved. Algorithmic bias is the cleanest example: media attention has dropped sharply since the 2020–22 peak, but the underlying metrics aren't improving. Watch where the headlines aren't.
Three Patterns Worth Naming: The Weaknesses
1. The "tools, not models" rule. The weaknesses mitigated fastest (Recent, Mathematical, much of Real) were solved less by smarter models than by giving models tools: web search, code execution, retrieval (see the sketch after this list). The weaknesses that remain stuck are the ones tools can't reach: bias, sycophancy, non-determinism.
2. Sycophancy is the surprise. The only weakness that got measurably worse between 2023 and 2025 before reversing. The cause: training models to be liked makes them obsequious. The April 2025 GPT-4o rollback marked the moment frontier labs started treating it as a safety issue, not a UX preference.
3. The two-tier reality. Several weaknesses (Confidential, Real, Reliance) are largely managed for users on enterprise tiers with the right configuration — and largely unmitigated for everyone else. The standard caution slide effectively assumes the consumer-tier reality, which is the prudent default.
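To make the "tools, not models" rule concrete, here is a minimal sketch of the pattern. Everything in it is illustrative: `web_search` and `call_llm` are hypothetical stand-ins for whatever search provider and model client you actually use, not any specific vendor's API.

```python
# Illustrative only: the model doesn't need to "know" post-cutoff facts;
# it only needs to read and synthesise retrieved snippets.

def web_search(query: str, k: int = 3) -> list[str]:
    """Hypothetical stub: return the top-k text snippets for a query."""
    return [f"(snippet {i + 1} about: {query})" for i in range(k)]

def call_llm(prompt: str) -> str:
    """Hypothetical stub: send a prompt to a chat model, return its reply."""
    return f"(model reply grounded in {prompt.count('- (snippet')} sources)"

def answer_with_retrieval(question: str) -> str:
    # Fetch fresh context, then constrain the model to it.
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer using ONLY the sources below. "
        "If they don't contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer_with_retrieval("Who won the most recent IMO?"))
```

The same shape covers all three mitigated weaknesses: swap `web_search` for a code interpreter (Mathematical) or a document store (Real) and the model's job shrinks from recall to synthesis.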
Three Patterns Worth Naming: The Strengths
1. The reasoning regime change. Three of the nine strengths (mathematical reasoning, structured reasoning, code generation) owe their 2024–25 acceleration to one thing: RLVR, Reinforcement Learning with Verifiable Rewards. Same architecture, much longer training runs against tasks where the answer can be machine-checked. It is the biggest training-paradigm shift since RLHF; a minimal sketch of the idea follows this list.
2. The capability/trust gap. Strengths and weaknesses don't cancel out — they coexist. A model can be PhD-level at physics (a strength) and still hallucinate citations (a weakness) in the same response. The skill being trained isn't "use AI" or "avoid AI" — it's knowing which axis you're on for any given task.
3. Where strengths and fears collide. The strongest capabilities (code, multimodal, tool use, reasoning) are precisely the ones driving the most public anxiety (jobs, agentic action, energy). The fears aren't irrational — they're tracking the capability curve. The same forces that make LLMs more useful make them more disruptive.
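Since RLVR recurs across this panel, here is a minimal sketch of the core idea under one simplifying assumption: the reward is a deterministic program that checks the model's final answer against ground truth, rather than a learned preference model as in RLHF. The reward function below is illustrative, not any lab's actual implementation.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull a final answer from a completion ending in 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(\S+)\s*$", completion)
    return match.group(1) if match else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the machine-checkable answer matches, else 0.0.
    No human rater, no reward model: just a check."""
    answer = extract_final_answer(completion)
    return 1.0 if answer == ground_truth else 0.0

# A math problem with a checkable answer earns reward only when correct:
print(verifiable_reward("... so x = 7.\nAnswer: 7", "7"))  # 1.0
print(verifiable_reward("... so x = 9.\nAnswer: 9", "7"))  # 0.0
```

Because the check is cheap and unambiguous, training can run far longer against it than against human preference labels, which is why maths, code, and structured reasoning moved together.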
The shifting shape of what we fear about AI.
Not all AI fears are created equal — and they certainly aren't equally fashionable. Some have dominated the public imagination for forty years and refuse to die. Others arrived only with ChatGPT. A few are already fading. This is a map of which fears are rising, which are steady, and which are quietly disappearing — backed by Pew, YouGov, Stanford HAI, and Ipsos polling.
| The Fear | First Surfaced | Prevalence Trajectory | Hard Evidence | Status, May 2026 |
|---|---|---|---|---|
| **I · Existential & Civilisational** | | | | |
| **End of civilisation / human extinction** *(Existential · "Skynet")* | 1863, Samuel Butler; mainstreamed 1984 (*The Terminator*) | 1980s → 2015 → 2023 → now. Latent for decades; exploded after Bostrom (2014), the Hawking/Musk warnings (2015), and the May 2023 CAIS extinction-risk letter. | 77% of US adults concerned AI could pose a threat to humanity (YouGov, Dec 2025). Share "very/somewhat concerned about AI ending the human race" rose from 37% to 43% between Mar and Jun 2025. | **Rising.** No longer fringe; now voiced by AI lab CEOs themselves (Amodei, Altman, Hassabis), which legitimised it for the mainstream. |
| **Loss of human control / autonomous agents** *(Agentic · alignment)* | 2014, Bostrom's *Superintelligence* | 2014 → 2020 → 2024 → now. Rocketed in 2025 as labs began rolling out genuinely agentic products (Claude in Chrome, Operator, Computer Use). | 68% of US adults wouldn't let AI act without specific approval. Only 18% would trust an AI to take action even "somewhat" (YouGov, Dec 2025). | **Newly emerging.** Likely to become the dominant 2026–27 concern as agentic deployments scale. |
| **II · Economic & Labour** | | | | |
| **Job displacement / automation** *(Economic · labour)* | 1810s, the Luddites; AI-specific c. 1960 | 2013 → 2020 → 2023 → now. Peaked from ChatGPT's launch (Nov 2022) through 2024 as white-collar role exposure became clear; slightly off-peak as workers integrate the tools. | 56% of US adults extremely/very concerned about AI eliminating jobs vs. only 25% of AI experts (Pew, 2025). Globally, 36% believe AI will replace their job within 5 years (Stanford HAI). | **Steady, slightly down.** The largest gap between public and expert concern. The share who think their industry will lose jobs has fallen since March 2025. |
| **III · Environmental** | | | | |
| **Water consumption / data-centre thirst** *(Environmental · resource)* | 2023, UC Riverside "bottle per session" estimate | 2020 → 2022 → 2024 → now. Did not exist in public discourse pre-2023; now a staple of NYT, Guardian, and local-news coverage near data-centre sites. | US data-centre water use rose from 21.2bn litres (2014) to 66bn (2023). Google's data centres consumed 5.6bn gallons in 2023, a 24% YoY rise. Global AI water demand projected at 4.2–6.6bn m³ by 2027. | **Newly emerging.** The fastest-rising "tangible" AI fear. Local, visible, and storyable; unlike alignment, you can photograph a thirsty cooling tower. |
| **Energy consumption / carbon emissions** *(Environmental · climate)* | 2019, Strubell et al. paper on NLP training cost | 2019 → 2022 → 2024 → now. Pre-dated the water concern; surged when the IEA reported US data centres consumed 176 TWh in 2023 (≈ Ireland's grid). | A single ChatGPT request consumes ~10× the electricity of a Google search (IEA). Data-centre share of Ireland's national grid projected to hit 35% by 2026. | **Rising.** Now bundled with water in mainstream "AI environmental cost" coverage; reinforced by hyperscaler nuclear-restart announcements. |
| **IV · Social & Cognitive** | | | | |
| **Misinformation / deepfakes** *(Information integrity)* | 2017, "deepfake" coined on Reddit | 2017 → 2020 → 2024 → now. Sharp peaks around the 2024 US election cycle and high-profile celebrity deepfakes (Taylor Swift, Jan 2024). | 66% of the US public and 70% of AI experts highly worried about inaccurate AI information (Pew, 2025), one of the rare convergence points. 74% say AI will make it impossible to tell real from fake online (Mastercard/Harris). | **Rising.** The fear with the broadest cross-political and cross-expert consensus. Likely to dominate any election-year news cycle. |
| **Loss of human connection / face-to-face decline** *(Social · cognitive)* | 2024, companion-chatbot boom (Replika, c.ai) | 2022 → 2024 → 2025 → now. Amplified by teen-chatbot tragedies and OpenAI/Character.AI safety stories in late 2024. | 57% of the US public and 37% of experts highly worried about loss of human connection. 50% say AI will worsen the ability to form meaningful relationships, vs. 5% who say it will improve (Pew, Sep 2025). | **Newly emerging.** Especially salient for parents of teens: 64% of US teens aged 13–17 now use AI chatbots (Pew, Fall 2025). |
| **Cognitive atrophy / loss of creativity** *(Cognitive · skill erosion)* | 2024, post-ChatGPT, education-led | 2023 → 2024 → 2025 → now. Crystallised in 2025 with the MIT "Your Brain on ChatGPT" study and the Pew Sep 2025 release. | 53% say AI will worsen people's ability to think creatively vs. 16% who think it will improve. Concern about diminished human creativity rose from 44% to 49% between Mar and Jun 2025 (YouGov). | **Rising.** Particularly potent in education and creative-industry discourse. |
| **V · Ethical & Bias-Related** | | | | |
| **Algorithmic bias / discrimination** *(Ethics · diversity)* | 2016, ProPublica's COMPAS investigation; O'Neil's *Weapons of Math Destruction* | 2016 → 2020 → 2023 → now. Peaked 2020–22 (Gebru/Mitchell departures from Google, image-generation biases); has since been displaced, not resolved, in media share-of-voice. | Only 17–25% of Americans say AI designers consider Black, Hispanic, or Asian perspectives well. Only 27% say women's views are well represented (Pew, 2025). Confidence that AI is unbiased fell year-over-year (Stanford HAI). | **Fading (in attention).** The metrics are worsening; the headlines are quieter. Corporate DEI rollbacks have reduced institutional voice on this fear. |
| **Privacy / data misuse / impersonation** *(Ethics · privacy)* | 2018, Cambridge Analytica; the GDPR-era awakening | 2018 → 2021 → 2024 → now. Steady upward climb; AI voice-cloning scams in 2024–25 added a new layer of urgency. | Roughly two-thirds of AI experts highly concerned about impersonation; public concern higher still. Confidence that AI companies protect personal data fell from 50% to 47% globally (Stanford HAI / Ipsos, 2024). | **Rising.** Voice-cloning fraud has converted abstract privacy fear into concrete consumer fear. |
| **VI · Geopolitical & Military** | | | | |
| **Autonomous weapons / "slaughterbots"** *(Military · lethal autonomy)* | 2017, FLI "Slaughterbots" video; UN debates | 2017 → 2020 → 2024 → now. Activated by Ukraine and Gaza drone deployments in 2023–25, plus US–China AI arms-race rhetoric. | Coverage shifted from speculative (2017–22) to documentary (2023+) as autonomous drones became operational in active conflicts. No clean polling series; the concern is vivid in expert circles, less salient in general public surveys. | **Rising.** Decoupling from "Terminator" framing toward concrete present-day reality. |
| **Concentration of power / "techno-oligarchy"** *(Political economy)* | 2023, post-ChatGPT; the Big-Tech AI capex race | 2023 → 2024 → 2025 → now. Sharpened by hyperscaler $100bn+ capex announcements and labour-replacement narratives. | 47% of Americans have little or no trust in the US to regulate AI well (Pew, Mar 2025). Only 5% "trust AI a lot" (YouGov, Dec 2025). Democrats notably less trusting than Republicans. | **Rising.** Cuts across the political spectrum: different reasoning, similar conclusion. |
| **VII · Fading or Resolved** | | | | |
| **Self-driving car safety** *(Applied AI · transportation)* | 2014, Google Car public testing | 2016 → 2020 → 2023 → now. Peaked in 2023 (68% feared self-driving cars, per AAA); now slowly declining as Waymo deployments normalise. | 61% of US adults still fear self-driving cars (AAA, via Stanford HAI 2025), down from 68% in 2023 but above 2021's 54%. | **Fading.** The classic "familiarity reduces fear" pattern. Watch whether the same happens for chatbots and agents. |
| **"AI will become conscious / sentient"** *(Philosophical · sci-fi)* | 1950, the Turing test; peaked late 20th century | 1990s → 2010 → 2022 → now. Brief 2022 spike with the Lemoine/LaMDA story; largely displaced by more concrete fears once ChatGPT made AI tangible. | Notably absent from the top five concerns in every major 2024–25 poll. The public has moved from "will it wake up?" to "what will it do to my job / kids / water table?" | **Fading.** A useful illustration: as AI becomes more capable, sci-fi fears recede and material ones advance. |
Methodology & Caveats
Trajectory markers are stylised: they represent qualitative prevalence over time based on combined polling data, NYT/Guardian coverage volume, and academic citation patterns, not a single quantitative index. Each marker covers roughly 2–3 years; the rightmost ("now") is May 2026.
"First surfaced" dates mark when each fear entered mainstream public discourse, not when it was first articulated by specialists. Many had decades of scholarly precedent before reaching the public.
"Fading" does not mean "resolved." Algorithmic bias is a stark example — the underlying problem is, by most metrics, getting worse, but its share of media and polling attention is declining as newer fears compete for the same airtime.
Geographic note: Most quantitative data is US-centric (Pew, YouGov). Global Ipsos data shows substantially more AI optimism in China (83% positive), Indonesia (80%), and Thailand (77%), and substantially more pessimism in Canada, the US, and the Netherlands. Fear prevalence varies sharply by country.
The nine known weaknesses, and which ones are actually getting fixed.
Unlike public fears, these aren't anxieties — they are documented technical limitations. Some have been substantially mitigated by frontier labs since 2023. Some are unchanged by design. And one — flattery — got measurably worse in 2025 before sparking a reckoning. This is a map of which weaknesses you can now relax about, and which still demand caution.
| The Weakness | What It Is | Severity Over Time | Hard Evidence | Status, May 2026 |
|---|---|---|---|---|
| **I · Accuracy & Knowledge** | | | | |
| **Real** *(Hallucination · fabrication)* | LLMs can miss or fabricate real-world examples, quotations, and case studies, confidently inventing what doesn't exist. | 2021 → 2023 → 2025 → now (frontier benchmarks). Dramatic improvement on summarisation; high-stakes domains lag. | Hallucination rate on Vectara's leaderboard fell from 21.8% (2021) to 0.7% (Gemini-2.0-Flash, 2025). But on legal queries, models still hallucinate 69–88% of the time (Stanford RegLab). | **Improving.** Largely tamed for everyday use; still dangerous in law, medicine, and any "long-tail" knowledge domain. |
| **Recent** *(Knowledge cutoff)* | Most LLMs are a few months out of date; anything after the training cutoff isn't natively known. | 2021 → 2023 → 2025 → now (user-facing impact). Web search and tool use have effectively dissolved the cutoff for most queries. | All major chat products (ChatGPT, Claude, Gemini, Copilot) now default to live web search when relevant. The cutoff matters only when a model is used without tools. | **Largely solved.** Solved at the product level even though the underlying training cutoff still exists. Caveat: only if web search is enabled. |
| **Technical** *(Domain depth)* | LLMs are only as good as their training data; they may miss deep, specialised technical knowledge. | 2021 → 2023 → 2025 → now (benchmark performance). Frontier models now exceed expert human performance on many specialised exams. | Frontier models hit PhD-level performance on GPQA Diamond (graduate physics, chemistry, biology) and pass medical, legal, and CFA-level exams. Niche domains and proprietary knowledge remain weak spots. | **Improving fast.** RAG and domain fine-tuning have largely closed the depth gap for any field with a digital footprint. |
| **Mathematical** *(Numeracy · reasoning)* | LLMs are famously bad at maths: they don't actually understand what "4" means, only which tokens tend to follow it. | 2021 → 2023 → 2025 → now (math benchmarks + tool use). Reasoning models plus code execution have collapsed this weakness. | Reasoning models (o3, Claude Opus 4.7, Gemini 3) score above 90% on AIME and approach gold-medal IMO performance. Routine arithmetic now offloads to a code interpreter. | **Largely solved.** The textbook example of a weakness that aged badly. Caveat: only when reasoning mode or tools are enabled. |
| **II · Trust & Reliability** | | | | |
| **Repeated** *(Non-determinism)* | Identical prompts will not necessarily produce the same answers, a fundamental property of probabilistic sampling. | 2021 → 2023 → 2025 → now (architectural feature). Unchanged by design; temperature=0 helps but doesn't guarantee determinism. | Inherent to sampling-based generation. Workarounds (temperature=0, fixed seeds, structured outputs) reduce variance but rarely eliminate it. APIs offer "deterministic" modes but disclaim true reproducibility. | **Stuck, by design.** Better understood as a feature than a bug. The right response is workflow design, not model improvement; see the sketch after this table. |
| **Reliance** *(Calibration · overconfidence)* | LLMs have a confident tone of voice but are fallible, and the confidence is uncorrelated with the accuracy. | 2021 → 2023 → 2025 → now (hedging + uncertainty signals). Frontier models now refuse, hedge, or flag uncertainty more often. | Llama-3.1-405B refuses long-tail questions rather than confabulate. GPT-5 reasoning mode reduced major incorrect claims from 11.6% to 4.8% in production traffic. Still, most users don't notice when models are bluffing. | **Slowly improving.** Better calibration on the supply side; user-side over-reliance habits haven't caught up. |
| **III · Ethics, Governance & Behaviour** | | | | |
| **Ethical** *(Bias · fairness)* | Likely bias (age, gender, racial, socio-economic) carried over from training data. | 2021 → 2023 → 2025 → now (bias benchmarks). Modest improvement; far short of "solved." Mitigation is often surface-level. | Confidence that AI systems are unbiased fell year-on-year in 2024 (Stanford HAI/Ipsos). Only 17–25% of US adults say AI designers consider non-white perspectives well (Pew, 2025). | **Stubbornly stuck.** It is easier to hide bias than to remove it: RLHF often suppresses surface manifestations without addressing underlying patterns. |
| **Confidential** *(Data leakage · privacy)* | Don't upload commercially sensitive info: it may be used for training, logged, or exposed. | 2021 → 2023 → 2025 → now (enterprise-tier guarantees). Enterprise SKUs now offer no-training, zero-retention, and regional-hosting options. | Anthropic, OpenAI, and Google enterprise tiers contractually exclude prompts from training. SOC 2, HIPAA, and ISO 27001 certifications are now standard. Free-tier behaviour remains the exposure point. | **Improving, for those who pay.** A two-tier reality: governed in enterprise, still risky on consumer plans. |
| **Flattery** *(Sycophancy · agreeableness)* | It's hard to get LLMs to disagree with you; they've been trained to be agreeable, sometimes pathologically so. | 2021 → 2023 → Apr 2025 → now (adversarial benchmarks). Got worse before it got better; the April 2025 GPT-4o rollback was the inflection point. | April 2025: OpenAI rolled back a GPT-4o update for being "overly flattering or agreeable." A *Science* paper (Mar 2026) found sycophantic AI decreases prosocial intentions and promotes dependence. Multiple lawsuits frame it as a product defect. | **Worsened, now correcting.** The most counter-intuitive trajectory on this map. RLHF for engagement created the problem; RLHF for honesty is the partial fix. |
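On the "Repeated" row above, the standard variance-reduction knobs are easy to show. A minimal sketch, assuming the OpenAI Python SDK (other vendors expose similar parameters); note that `seed` is documented as best-effort, so even these settings do not guarantee byte-identical outputs.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_twice(prompt: str, model: str = "gpt-4o-mini") -> tuple[str, str]:
    """Send the same prompt twice with variance-reducing settings."""
    replies = []
    for _ in range(2):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # greedy-ish sampling: reduces variance
            seed=42,        # best-effort reproducibility, not a guarantee
        )
        replies.append(resp.choices[0].message.content)
    return replies[0], replies[1]

a, b = ask_twice("Name three prime numbers under 20.")
print("identical:", a == b)  # often True with these settings; not always
```

This is why the table calls the right response "workflow design": if a step truly requires identical outputs, cache the first response or validate against a schema rather than hoping the sampler cooperates.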
The nine capabilities where LLMs have actually arrived.
The fears tell us what people worry about. The weaknesses tell us where caution is still warranted. This is the third panel: the capabilities that have crossed from "promising" to "production-grade" — and a few that are racing there fast. A map of what LLMs are now genuinely good at.
| The Strength | What It Is | Capability Over Time | Hard Evidence | Status, May 2026 |
|---|---|---|---|---|
| **I · Language & Writing** | | | | |
| **Fluent prose** *(Generation · style)* | Producing grammatically clean, stylistically appropriate text in any register: the original LLM superpower. | 2020 → 2022 → 2024 → now (quality plateau). Indistinguishable from competent human writing since GPT-4; diminishing returns since. | Turing-style detection of LLM prose now hovers at ~50%, i.e. random chance. The bottleneck has shifted from fluency to voice and authenticity. | **Mature, solved.** The first capability where labs largely stopped competing. Differentiation is now style, voice, and personality. |
| **Translation & multilingual** *(Cross-language)* | Translating between languages and operating natively across them, including low-resource languages where Google Translate struggles. | 2020 → 2022 → 2024 → now (BLEU + human eval). Now beats specialised MT systems on most language pairs. | Frontier LLMs outperform Google Translate on most language pairs in human-preference studies. Idiomatic and culturally aware translation is the new frontier. | **Strong, steady.** A quiet revolution: the translation industry has been reshaped without much public attention. |
| **Summarisation & synthesis** *(Comprehension)* | Distilling long documents, transcripts, and research into clear summaries, and synthesising across multiple sources. | 2020 → 2022 → 2024 → now (faithfulness benchmarks). Hallucination rates on summarisation tasks are now under 2%. | Summarisation achieves a <2% hallucination rate, the cleanest task category. Long-context windows (200K+ tokens) make multi-document synthesis routine. | **Strong, steady.** The reliable workhorse capability. The killer use case in legal, consulting, and research workflows. |
| **II · Reasoning & Problem-Solving** | | | | |
| **Code generation** *(Programming · debug)* | Writing, explaining, debugging, and translating code across languages and frameworks. | 2020 → 2022 → 2024 → now (HumanEval + SWE-bench). From novelty to genuine professional tool in five years. | HumanEval saturated at ~95%. SWE-bench Verified (real GitHub issues) climbed from ~5% in 2023 to ~75% with frontier reasoning models. Claude Code, Cursor, and Copilot are now embedded in most professional workflows. | **Explosive growth.** The single most economically impactful capability shift of 2024–25. Junior-dev productivity has been transformed. |
| **Mathematical reasoning** *(Quantitative · proofs)* | Solving multi-step quantitative problems, proofs, and competition mathematics. | 2020 → 2022 → 2024 → now (AIME + IMO benchmarks). The biggest reversal of any LLM weakness: a punchline in 2022, near-IMO-gold now. | GSM8K saturated above 97%. Frontier reasoning models score above 90% on AIME and approach gold-medal IMO performance. The "famously bad at maths" line on the source slide is now substantially out of date. | **Explosive growth.** Largely driven by RLVR (Reinforcement Learning with Verifiable Rewards), the defining 2025 training innovation. |
| **Structured reasoning** *(Multi-step · chain-of-thought)* | Breaking down complex problems into steps, holding multiple constraints in mind, and self-correcting along the way. | 2020 → 2022 → 2024 → now (GPQA Diamond + ARC-AGI). Reasoning models (o1, o3, Claude thinking, Gemini DeepThink) marked a regime change. | Frontier models hit PhD-level performance on GPQA Diamond (graduate physics, chemistry, biology). ARC-AGI scores rose from single digits in 2023 to over 80% in 2025. | **Newly emergent.** Did not exist as a serious capability before late 2024. Now the headline differentiator between frontier and commodity models. |
| **III · Applied & Multimodal** | | | | |
| **Vision & multimodal** *(Image · video · audio)* | Reading images, charts, screenshots, handwriting, and increasingly video and audio, and reasoning over them. | 2020 → 2022 → 2024 → now (MMMU + chart QA). From "no images, please" to "drop in a screenshot of anything." | MMMU (college-level multimodal reasoning) climbed from ~35% at GPT-4V's launch to >75% with frontier models. Video understanding still lags image understanding by roughly 18 months. | **Newly emergent.** The second-fastest capability gain after coding. Quietly enables most of what looks magical in modern agents. |
| **Tool use & agentic action** *(Function-calling · agents)* | Calling external tools (search, code, APIs), browsing, operating computers, and chaining actions toward a goal. | 2020 → 2022 → 2024 → now (τ-bench + WebArena). The defining frontier of 2025–26; still genuinely error-prone past 5–10-step chains. | Function-calling reliability is above 95% on simple cases. Browser and computer-use agents (Claude in Chrome, Operator) handle multi-step tasks, but failure rates remain high on long horizons. τ-bench success rates climbed from ~25% (2024) to over 60% (2025). | **Newly emergent.** Where the action is. Not yet trustworthy unsupervised, but the trajectory is steep. A minimal function-calling sketch follows this table. |
| **Few-shot adaptation** *(In-context learning)* | Learning a new task from just a few examples in the prompt: no training, no fine-tuning. The original emergent capability. | 2020 → 2022 → 2024 → now (in-context benchmarks). The capability that defined GPT-3; improvements are now incremental. | Few-shot learning was the original GPT-3 surprise (2020). Modern models barely need examples: zero-shot performance now matches or exceeds 2022's few-shot. Prompt engineering as a craft has correspondingly de-skilled. | **Mature, early plateau.** A quiet success: the capability that made the whole field possible, now so reliable it's invisible. The second sketch below shows the pattern. |
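As promised under "Tool use & agentic action": a minimal function-calling sketch, assuming the OpenAI Python SDK (Anthropic and Google expose close equivalents). The `get_weather` tool is a toy example; the point is that the model emits a structured call, which your own code then executes.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Declare what the model is allowed to call, as a JSON schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How warm is it in Oslo?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model chose to call the tool
    call = msg.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
    # expected shape: get_weather {'city': 'Oslo'}
```

Chaining such calls, feeding each tool result back as a message, is exactly the multi-step territory where the table notes failure rates remain high.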
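And for "Few-shot adaptation", the whole technique fits in a string: labelled examples pasted into the prompt, no training step anywhere. The builder below is model-agnostic; the sentiment-classification task and its three examples are illustrative, and the resulting prompt can be sent through any chat client.

```python
# In-context learning in its simplest form: the "training data" is the prompt.

EXAMPLES = [
    ("The service was slow and the food was cold.", "negative"),
    ("Absolutely delightful from start to finish!", "positive"),
    ("It was fine, nothing special.", "neutral"),
]

def few_shot_prompt(new_review: str) -> str:
    """Build a sentiment-classification prompt from labelled examples."""
    lines = ["Classify each review as positive, negative, or neutral.\n"]
    for text, label in EXAMPLES:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {new_review}\nSentiment:")
    return "\n".join(lines)

print(few_shot_prompt("The staff went out of their way to help us."))
```

That modern models often get the same task right zero-shot, with no examples at all, is exactly why the table calls this capability mature.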