There’s a natural hope, especially among people who are accustomed to betting on their beliefs, that those with differing views on AI risk can identify 'crux' forecasting questions. The idea is that even if their beliefs differ now, their beliefs conditional on a particular resolution of a crux question will be much closer together.
I think this exercise is going to produce less 'conditional consensus' than it seems like it should, even if everyone involved is engaging in good faith.
There are various reasons for this, which I’ll outline below. Importantly, some of the points are reasons to be pessimistic about getting people with different views to agree in advance on how much a resolved crux should move them, but not reasons to be pessimistic about how much they will actually update once the evidence emerges.
Several of these points are mentioned in some form in FRI’s XPT report (in one case credited to me; in other cases, I don’t think I was the only person to make some version of the point). I wrote this post because I wanted a cleaner articulation to exist than appears there.
Why conditional consensus is harder than it looks
Priors affect the expected nature of resolution
Consider two people with very different beliefs about rainfall in a region, interpreting a forecast of a 60% chance of rain there. One, who thinks the region has extremely low rainfall, will expect just a light drizzle if it rains. The other, who thinks the region is very wet, might anticipate a heavy downpour. Both might be happy to use the same forecast, but they interpret it differently because of their priors.
The person who expects the region to be very wet is going to assign most of that 60% probability mass to pretty heavy rainfall, while the person who thinks the area is very dry is going to assign most of it to very light rainfall. If the disagreement they are trying to resolve is about the total volume of rain, then agreeing on a 60% chance of rain doesn't actually move their distributions much closer together.
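To make the gap concrete, here’s a toy sketch (with entirely made-up numbers) of two people who accept the same 60% forecast but hold different conditional beliefs about how much rain falls if it rains at all:

```python
# Toy sketch with made-up numbers: two people accept the same "60% chance of
# rain" forecast but disagree about how heavy the rain will be if it comes.

p_rain = 0.6  # the shared forecast

# Conditional distributions over rainfall volume in mm, given that it rains.
# Each maps volume -> probability and sums to 1.
dry_region_prior = {1: 0.80, 5: 0.15, 20: 0.05}  # "this region is dry": mostly drizzle
wet_region_prior = {1: 0.05, 5: 0.15, 20: 0.80}  # "this region is wet": mostly downpours

def expected_volume(conditional, p_event):
    """Expected rainfall in mm, treating 'no rain' as 0 mm."""
    return p_event * sum(volume * p for volume, p in conditional.items())

print(expected_volume(dry_region_prior, p_rain))  # ≈1.5 mm
print(expected_volume(wet_region_prior, p_rain))  # ≈10.1 mm
```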
Different priors lead to different expectations of what would cause a question to resolve positively. Because any question about a future event has some wiggle room in what might trigger it, people with different priors will expect the same resolution to provide different amounts of evidence. In the case of AI risk, there is almost certainly going to be a lot of this kind of wiggle room in candidate cruxes, because it’s hard to precisely specify the nature of an unprecedented event ahead of time.
Different expectations of what accompanies resolution
Consider two economists thinking about how to condition on the announcement of a $2 trillion infrastructure spending package in the next year. One expects this to happen alongside strong economic indicators, as a sensible decision by a government with a good handle on the economy. The other anticipates it as a desperate measure by a government facing a recession and rising unemployment. They may agree on the likelihood of the announcement, and largely on the nature of the announcement itself, but their expectations about what else will be true in the world where the announcement occurs remain radically different.
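A toy sketch of this, again with made-up numbers: two joint distributions that agree on the probability of the announcement while implying very different things about the rest of the world in which it happens.

```python
# Toy sketch with made-up numbers: two economists agree that
# P(package announced) = 0.3, but hold very different joint beliefs about
# the announcement and the state of the economy.

# Joint distributions over (announced, recession); entries sum to 1.
optimist = {  # the package would be a confident move by a healthy government
    (True, False): 0.27, (True, True): 0.03,
    (False, False): 0.50, (False, True): 0.20,
}
pessimist = {  # the package would be a desperate response to a downturn
    (True, False): 0.03, (True, True): 0.27,
    (False, False): 0.50, (False, True): 0.20,
}

def conditional_recession(joint):
    """P(recession | package announced) under the given joint distribution."""
    p_announced = sum(p for (announced, _), p in joint.items() if announced)
    return joint[(True, True)] / p_announced

for name, joint in [("optimist", optimist), ("pessimist", pessimist)]:
    p_announced = sum(p for (announced, _), p in joint.items() if announced)
    print(name, round(p_announced, 2), round(conditional_recession(joint), 2))
# Both agree P(announced) = 0.3, yet P(recession | announced) is 0.1 vs 0.9.
```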
Different people have vastly different beliefs about what else is likely to be true if a particular event occurs. This concern, like the previous one, is especially relevant for unprecedented events, where people’s priors currently look very different. Even if two people agree on the likelihood and nature of a specific outcome, they might have strongly divergent views on what else will be true in that scenario.
In AI discussions, this might manifest as different expectations about the state of the world when a specific AI capability is achieved. Even if two researchers agree on the likelihood of an AI system reaching human-level performance on a particular benchmark and what ‘human-level performance’ means, they might have radically different ideas about what other technological, social, or economic conditions will accompany that achievement.
Conditioning on observations
How would you change your estimate of the chance of a given building being successfully robbed, if you learned that an alarm in the building had been triggered? The alarm going off could imply several things:
The building has a (working) security system
A potential burglar failed to disable it
The building is more likely to have CCTV installed
The police are more likely to be notified early and arrive in time
Someone attempted to break into the building
These implications don't all point in the same direction, and which of them dominates will depend on your priors about the individual facts.
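Here’s a rough illustration, with made-up numbers, of how the same observation can push the estimate in opposite directions depending on those priors:

```python
# Toy sketch with made-up numbers: how much does observing a triggered alarm
# change your estimate of a successful robbery? In this simple model, a
# triggered alarm tells you three things at once: an attempt was made, the
# building has a working system, and the burglar failed to disable it.

def prior_and_posterior(p_attempt, p_system, p_disabled,
                        p_success_triggered, p_success_untriggered):
    """Return (P(success), P(success | alarm triggered)).

    p_attempt             chance someone attempts a robbery
    p_system              chance the building has a working alarm system
    p_disabled            chance the burglar disables the system, given one exists
    p_success_triggered   chance the robbery succeeds once the alarm has gone off
    p_success_untriggered chance it succeeds when no alarm goes off
    """
    prior = p_attempt * (
        p_system * (p_disabled * p_success_untriggered
                    + (1 - p_disabled) * p_success_triggered)
        + (1 - p_system) * p_success_untriggered
    )
    # Conditioning on the alarm pins down: attempt made, system present, not disabled.
    return prior, p_success_triggered

# Person A: attempts are rare, and police rarely arrive in time even when alerted.
print(prior_and_posterior(0.01, 0.5, 0.5, 0.4, 0.9))  # ≈(0.008, 0.4): estimate jumps up

# Person B: attempts are common, and a triggered alarm usually brings police in time.
print(prior_and_posterior(0.5, 0.2, 0.5, 0.1, 0.9))   # ≈(0.41, 0.1): estimate drops
```

With Person A’s numbers, the observation takes them from under 1% to 40%; with Person B’s, it takes them from roughly 40% down to 10%.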
Observing an event isn't the same as the event actually happening. When we try to condition our beliefs on future events, we're really conditioning on our observations of those events, which may not fully capture reality. The observation often bundles together multiple pieces of information, some of which might even push our beliefs in opposite directions.
In AI discussions, this is a particularly big problem when thinking about how to condition on ‘red lines’ being crossed for dangerous capabilities (for example, sophisticated deception, or autonomous operation and shutdown-resistance). If you’re very sceptical that AI will ever be able to achieve these capabilities, observing them might be a significant source of worry. If, however, you’re confident that future AI systems will be able to achieve such capabilities, you might be most worried about whether those capabilities will be compellingly demonstrated before it’s too late, meaning that the demonstration might be reassuring.
Same evidence, different interpretations
While the previous points primarily affect our ability to agree on cruxes in advance, this final issue presents a challenge even after a supposed crux has resolved. It's not just about predicting how we'll update our beliefs, but also about fundamentally different ways of interpreting the same evidence.
Consider the reactions to AI achievements in game playing, such as Deep Blue defeating Kasparov in chess in 1997. Some researchers saw this as a sign that artificial general intelligence (AGI) was just around the corner. They argued that if an AI could master a complex game like chess, surely it would soon be able to tackle other cognitive tasks at a human or superhuman level. Others, however, interpreted the same event as a narrow achievement in a highly constrained domain, with little bearing on progress towards AGI.
More recently, we've seen similar divergent interpretations of large language model capabilities. When GPT-3 was released, some critics pointed to its mistakes — such as basic arithmetic errors or failures in common-sense reasoning — as evidence that transformer-based models had fundamental limitations. They argued that these errors showed that such models didn't truly understand language or possess genuine intelligence. Others saw the same errors as minor issues that would likely be resolved with scaled-up training and improved architectures, focusing instead on the model's impressive capabilities in areas like natural language generation and task adaptation.
In both cases, the same piece of evidence led to radically different interpretations and predictions about the future of AI. This divergence isn't just about different prior beliefs, but about the difficulty of agreeing on what the evidence actually means, even after it has been observed.
I worry that even if we do manage to agree on what future AI capabilities or achievements we're conditioning on, there’s still going to be room for fundamental disagreement on how to interpret those achievements when they occur. This seems especially likely because key concepts like 'intelligence' and 'agency' lack widely agreed-upon definitions.
What does this mean for conditional consensus?
I’m not sure. I don’t think it means all hope is lost. In particular, I think it’s encouraging that three of the points above are primarily about the difficulty of agreeing in advance on what people will find cruxy, rather than reasons to expect people not to update once the evidence actually arrives.
All of the above focuses on a pretty formal ‘forecaster-y’ approach to conditional consensus, as opposed to things that look more like:
Get people talking to each other and finding common ground and seeing the other side as ‘real people’ rather than faceless adversaries.
Even within the more forecaster-y approaches, I think the points above can (and hopefully will) steer future work in more productive directions. In particular, I hope that getting people to articulate their reasoning about particular scenarios in detail (potentially with the help of facilitators, who might be LLM-based) will shed a lot of light on the concerns above, including how they might be resolved. I’m glad that FRI experimented with this (and, as far as I can tell, hopes to do more of it).