Having a Whistleblowing Function Isn't Enough
on the gap between available-in-theory and used-in-practice
When I was a teacher in the UK, at the start of every school year, there were a couple of days of mandatory staff training. As one part of this, we'd be given examples of minor evidence that could indicate a concern about a child, a parent, or another staff member, and then we’d need to say what we should do if we were to see such evidence. These were not difficult questions to answer.
In essentially every case where there was even a slight chance of something serious, the answer was the same: tell the Designated Safeguarding Lead (DSL).
If the concern was about the DSL themselves? Tell the Deputy DSL.
If you had concerns about both the DSL and Deputy DSL, or felt uncomfortable approaching either for any reason? Tell a [specific, named] board member with no close connection to the staff instead.
Not only was this procedure something all staff knew, but the contact details for the DSL, Deputy DSL, and board member were also printed on the lanyard that every staff member, student, and external visitor wore. It was explicitly part of my job to know how and when to contact the relevant people.
By contrast, the other day an Anthropic employee happened to mention that they wished Anthropic had a way to anonymously report concerns — not realizing that this already existed¹.
The point of this system wasn't to create a culture of suspicion where colleagues were constantly being watched. In fact, it was closer to the opposite – the rules were procedural and clear enough that telling the DSL felt devoid of judgement. Saying something felt very low-stakes; it didn't seem like an accusation.
The central argument I want to make in this post is that whistleblowing systems should be universally known and psychologically easy to use – not just technically available. A good system removes the burden of judgement from individual reporters and makes reporting feel like a normal, low-stakes procedure rather than an exceptional, high-stakes accusation.
Rigid reporting works better than judgement calls
Imagine a work environment with little clarity about raising concerns. Without hard rules, you end up with an implicit norm of "if you trust your colleague, why would you mention it?" This approach has two major problems:
It leaves room for people who are friends with colleagues (or worried about how colleagues will feel) to self-censor.
It makes the felt seriousness of sharing information much higher.
To put this another way: if it's normal for me to see some small piece of evidence and raise it, then my doing so doesn't imply much about what I think of my colleagues or whether I trust them. My colleagues know that even in cases where I do trust them, the obviously correct thing would still be to flag what I’ve seen – and the same goes for them flagging something involving me.
But if the decision to share something is mostly a judgement call rather than a procedure – and my judgments are influenced by factors like trust – then my decision to share implies meaningful distrust in my colleague.
I saw this dynamic clearly as a teacher, because most teachers see themselves as people who care deeply about their students’ well-being. Suggesting a colleague might be failing in this fundamental aspect of the job (or worse, abusing their position) is a serious matter.
Inside the head of someone on the inside
I think this dynamic matters inside AI companies too. Many things will make it hard to raise concerns even if systems are set up perfectly, so the least that companies can do is avoid creating additional friction. Let me paint a scenario:
Imagine you're a red-teamer or someone on an evals team looking at some test results, and something doesn't quite look right. I'm not understating for effect here: this dynamic is especially important when the evidence isn't a "smoking gun" of egregious misalignment. Unfortunately, I think there's a serious possibility that we never see such "smoking guns" – for example, because sufficiently capable and situationally aware systems may² sandbag on test results in order to be deployed.
Let's say you've been testing some dangerous capability of an unsafeguarded model. Previous model generations have consistently shown improvement on this capability, and you'd expect the new training run to show increased capability to the point where safeguards would need to be triggered to continue training.
But surprisingly, the capability just doesn't look that good. The model seems slightly worse at doing this particular dangerous thing than the previous generation, even though training is going well overall from a capability perspective, with new heights already being set on most other metrics.
In some ways, this should feel like good news. You're looking at a test where you care about the number being low, and it is low. So you shouldn't be worried, right?
But this news is consistent with one of the scariest scenarios you’ve considered – where a model has not only realised it's being tested, but also that this specific test could prevent its deployment, and is underperforming as a result.
Or maybe the training just didn't go that well and everything's completely fine! That's totally possible too. Training is very hard to predict. Your colleagues tried some new data filtering; maybe that helped? Capability elicitation is still more of an art than a science anyway…
What are you going to do? Can you get more compute and run the tests again with more thinking time? That’s going to be expensive. Does training need to stop while you do this? That’s going to be really expensive. Should you ask the elicitation team to redo their work (or ask another team to check it)? Urgh, that would feel a lot like accusing them of being bad at their jobs (or worse).
What if there's a product launch next week? Or you've heard rumors that a competitor is about to launch their new system and you want to scoop them? Maybe you’ve already launched, and this eval run was the result of a sprint to get the model card out soon after the release. Are you going to make the case that deployment should be reversed?
Maybe you’re in a slightly more sane world and you’re running these evals prior to wider internal deployment. But what if you've really enjoyed using the old model internally, you and a bunch of your colleagues are excited for the new one, and there's a team of people who are almost giddy with excitement about getting to see what the new system can do? You just don't want to see all those disappointed faces, especially when there might be nothing wrong at all. Or what if this would cross the threshold where further development work needs to take place in a SCIF? Are you going to be the person who said the thing which caused that to happen? Your colleagues are going to hate you.
It’s not just schools
Similar systems exist in other industries, such as healthcare and government contracting, where ambiguity around reporting can also have serious consequences.
Healthcare professionals are often mandated by law to report some categories of concern clearly and immediately. For instance, nurses and doctors in many countries are required to report suspected abuse or neglect, serious medical errors, and certain contagious diseases directly to relevant authorities. Knowing exactly when and how to report isn't left to individual judgement – these obligations are explicit and reinforced through training and professional guidelines.
Likewise, financial misconduct in government contracting is subject to strict reporting rules. U.S. federal contractors must promptly disclose credible evidence of wrongdoing like fraud or conflicts of interest under Federal Acquisition Regulation (FAR) guidelines. Whistleblower protections aim to ensure employees can report issues without fear of retaliation. It’s common to see large posters prominently displayed in workplaces, clearly defining violations and providing reporting instructions.
These systems are all trying to solve the same problem – preventing situations where someone:
Isn't sure exactly how big a deal what they've seen is
Doesn't immediately know the procedure for escalating it
(Maybe) Perceives the likely consequences of sharing as negative
Puts off thinking about this difficult problem
And then doesn’t do anything about it.
This is very different from deliberately covering something up. If it feels like you're accusing someone of doing something really bad based on limited evidence, that's a sucky thing to do. Even more generally, taking what feels like a big action – especially when there’s some ambiguity about how responsible you are for it – is hard! If people don’t do a certain thing under normal circumstances, you shouldn’t expect that thing to happen without deliberate effort.
It shouldn’t be your call
This is why I want the decision not to rest solely in the hands of the person exercising this judgement. At minimum, the information should be shared with others. Ideally, there's no choice (at least not without risking severe penalties) but to share your concerns, and the people who end up hearing them have the independence to make tough calls.
I'm not arguing for a hard rule that says any ambiguity means immediate, drastic action. But simply sharing potentially concerning information, including ambiguous information, should happen in a way that doesn't depend on the strength of character shown by a single individual who doesn't want to disappoint their colleagues.
The Anthropic employee I mentioned before, who wasn’t aware of their company’s policy, might at some point see something potentially concerning. If that happens, I don’t want them to face the additional barriers of having to work out if there even is a policy, how to follow it, whether they can trust it, and what they should expect to happen as a result of raising the concern.
This won’t solve everything
The system I described in schools still had a potential single point of failure in the DSL. It was an important part of their job to collect relevant information, and to decide which things to follow up on or monitor for further evidence. The DSL might receive a lot of information, but if they do a bad job of aggregating it or don't take concerns seriously enough, mistakes can still happen.
Similarly, there can in practice be a very big gap between the literal wording of a policy and its actual implementation, especially when it comes to hard-to-prove things like retaliation. I think that clear, safe procedures for sharing information, which are known to all staff, are desirable to have for organisations working in high-risk domains. I don’t think they are sufficient.
The NASA shuttle disasters provide a sobering example of how dangerous poor organisational culture (especially when rushing towards a deadline) can be. The Rogers Commission Report investigating the Challenger disaster found that leadership had dismissed critical safety concerns raised by engineers. Seventeen years later, after Columbia disintegrated upon re-entry, the Columbia Accident Investigation Board made a devastating observation: "the causes of the institutional failure responsible for Challenger have not been fixed." Having a designated person to receive concerns means nothing if that person lacks the authority, independence, or organisational backing to act on uncomfortable information.
Trusting your team
I don't feel good about self-assessment with no independent checks. And the reality is, even if the people on your team completely trust each other's judgement, others outside your organisation don't have the same basis for trust. Even if you think your colleagues have the strength of character to make the right call in tough, ambiguous situations, do you think it's reasonable to expect staff members at rival companies to have that same trust in your colleagues? If not, then you might want external checks not because you think you need to monitor yourselves, but because you want to credibly signal to others that you won't respond badly to the incentives to race.
The bare minimum
What does minimal adequacy look like for AI companies? At the very least:
Clear reporting procedures that all staff reliably know or can quickly access, even under pressure.
A reporting process that feels normal, not exceptional. Raising a concern should feel more like following established protocol than leveling an accusation against colleagues.
Protected channels that both insiders and outsiders can justifiably trust. Not just promises, but structures that make retaliation difficult and ensure concerns get proper consideration.
I'm far from an expert on organisational policies, and different contexts will require different approaches. So I’m not going to try to articulate exactly how different AI companies should achieve these things. I just want to ask them to figure it out. The responsibility falls on their leadership to solve both the awareness problem and the trust problem.
Because a whistleblowing function that won’t get used is no function at all.
1. As of October 15, Anthropic's Responsible Scaling Policy contained this commitment:
Governance and transparency. To facilitate the effective implementation of [our RSP] across the company, we commit to several internal governance measures, including maintaining the position of Responsible Scaling Officer, establishing a process through which Anthropic staff may anonymously notify the Responsible Scaling Officer of any potential instances of noncompliance, and developing internal safety procedures for incident scenarios.
I’ve since heard that Anthropic has an anonymous hotline via a third-party whistleblower support company. That seems good to me, but I think it would be a lot better if, as with my school, they made sure all of their employees knew about it, and better still if it were also public knowledge.
Google DeepMind doesn't appear (based on a moderate-effort search) to have an existing policy or public commitment.
OpenAI has a Raising Concerns policy, which mentions a 24-hour reporting function that permits anonymity. I think the wording of this policy generally seems great.
Unfortunately, though this isn't the main focus of the essay, I don't feel comfortable praising the wording of OpenAI's policy without noting that policies mean very little if staff do not trust that they will be adhered to. OpenAI has received significant public criticism for its approach to handling whistleblowing, and a perceived threat of retaliation can have serious negative effects on staff willingness to raise concerns, as can a lack of confidence that their reports will be taken seriously.
xAI mentions anonymous internal reporting, with protections from retaliation, as part of their (draft) Risk Management Framework. Both mentions seem good.
2. In fact, we've already seen some (weak) evidence of this. I don't feel concerned about models doing this today, but I do think current evidence is consistent with this concern getting much bigger over the next few generations of models.