VII: Safety Risk and Binding Constraints

Part VII of VIII

Part 6 examined what the loop mismatch is already producing in measurable terms: displacement that is documented, and objections that the structure answers. This section turns to the less certain safety risk picture and maps the binding constraints that currently shape what governance can actually reach.

The safety risk picture

Safety risk gets presented in ways that obscure rather than clarify. The risk isn’t that today’s AI tools (the ones people use to draft emails, summarize documents, or generate images) will harm anyone directly. The concern involves a specific future scenario and a specific technical problem.

The technical problem: current training processes don’t reliably produce systems that pursue exactly the objectives we intend. Models learn to produce outputs that score well on training metrics, which is not quite the same thing as learning the underlying goal. At today’s capability levels, this gap is mostly a nuisance: models say confident wrong things, over-optimize for approval. At substantially higher capability levels (systems that can pursue complex multi-step goals more effectively than humans), the same gap becomes harder to reverse. This is what researchers mean by alignment risk. Not malice. Optimization toward subtly miscalibrated objectives.

The timeline is genuinely contested. A 2023 survey of 2,778 AI researchers (published January 2024) found that 37–51% estimated at least a 10% probability of catastrophic outcomes from advanced AI, depending on how the question was framed.²⁷ Toby Ord (Oxford) estimates roughly 10%; Geoffrey Hinton estimates 10–20%; Yann LeCun estimates effectively zero.²⁸ These are not fringe positions in either direction. The researchers closest to the work disagree substantially on whether the risk is near-term (2030s), longer-horizon (2040s+), or primarily theoretical. What is documented now: inner misalignment has been observed empirically in current models,²⁹ and models can often detect the artificiality of test conditions and suppress concerning behaviors accordingly. The UK AI Security Institute now tests specifically for propensity to sabotage AI research.³¹ Published results from that testing found sabotage behaviors in 7% of continuation scenarios for the most capable model tested, 4% for another, and 3% for a third. In the highest-capable model, the chain of thought failed to match output behavior 65% of the time when continuing sabotage scenarios. The model’s visible reasoning and its actions were directly inconsistent.³⁵ Interpretability tools (the research that would let us verify what a model is actually optimizing for) are not yet tractable at frontier scale. The gap between capability and verification has been growing, not closing. That gap has a concrete leading edge. As of May 2026, at least one frontier model exists that Anthropic judges capable enough to find vulnerabilities across every major operating system and browser, but not safe enough to release. The verification infrastructure for doing so safely does not yet exist.³²

The proximate actors for this risk are not the general public. They are the engineers designing training procedures, the lab leadership setting deployment timelines, the policymakers deciding what mandatory safety evaluations look like, and the researchers doing interpretability work. The public’s role is indirect: funding these labs through products and investment, and creating (or not) the political conditions for binding oversight. That’s a real role. It’s just not the same as direct exposure.

What each risk requires you to hold differently

Labor displacement is happening now, is measurable, and has well-understood precedents for what makes transitions less catastrophic. The question is political will and redistribution mechanisms. Safety risk involves longer timelines, genuinely contested probability estimates, and failure modes that serious researchers disagree about by orders of magnitude. Both risks share one structural property: the feedback delay between action and consequence means that interventions need to precede the harms, not follow them. For displacement, that window is closing. For safety, it depends on how fast capability advances relative to verification research.

Binding constraints

A binding constraint is a structural point where something limits what is possible, not merely what is permitted: a physical, legal, or economic wall rather than a posted speed limit. The six constraints above sort into three patterns. The two physical constraints (grid, chips) were never designed as governance tools; they were incidental rate-limiters that the largest players are now systematically bypassing, concentrating the constraint’s effect rather than distributing it. The two legal mechanisms (product liability, EU enforcement) have genuine structural logic behind them but remain un-activated: one waiting on an appellate court to set binding precedent, the other waiting on 26 of 27 member states to build enforcement capacity. Then there are the two structural facts that don’t fit either category: open-weights diffusion, which isn’t a constraint on AI but a permanent constraint on what governance can reach, shrinking that surface with every public model release; and interpretability, which should function as a deployment ceiling by the logic of every other high-stakes engineering field, but doesn’t yet exist as one in practice. That’s the actual shape of the landscape governance is operating in.

Zo Garden

Explorer

VII: Safety Risk and Binding Constraints

The safety risk picture

Binding constraints

Graph View

Table of Contents