I’ve heard the phrase “AI safety” thrown around in the following contexts:
1. Someone programs an autonomous AI drone that hurts innocent people
2. Self-driving cars misbehaving and driving recklessly
3. A large language model (LLM) giving out harmful information in response to a benign prompt
4. A large language model giving out harmful information in line with the prompt
When I think of AI safety, I think of 1 and 2. This is in line with how most people think about the safety of tools. If someone goes to the roof of a building and jumps off, I don’t think about building safety. But if the building collapses after a small earthquake, then that would be building safety.
Number 3 is also bad. Imagine someone asking “Is [city] a nice city?” and the LLM going on a rant about the demographics of the city, relying on harmful stereotypes about the people who live there. This is clearly not in line with the user’s expectations and could be considered harmful. Since the LLM was trained on an incredibly broad corpus of data, it makes sense that the model has internalized what we now call “problematic” information.
But it’s not AI safety in the same sense as 1 and 2. It’s more of a classical failure of optimization. The LLM is trained to align with user expectations. If a typical user would find this information inappropriate or otherwise bad, then that should be reflected in the cost function.
Number 4 is harder to classify. It’s not even clear that it is a problem. Since the LLM is designed to produce answers in line with the user’s expectations, if I ask the model for a salty joke about [group], I’m expecting exactly that. This is also difficult to fight against, because the cost function used for the model is explicitly based on aligning responses with the expectations of users. It’s somewhat easier to train the model to ignore inappropriate prompts when the training data is more controlled, but once it starts training on real-world data, it’ll quickly revert to that alignment. And I would guess that the majority of the time the LLM responds “I can’t do that, Hal”, the user is disappointed.
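To make the cost-function point concrete, here is a minimal toy sketch in Python of what it could look like to fold “a typical user would find this inappropriate” into the objective, and why a heavier harm penalty necessarily trades away alignment with the user who wrote the prompt. Everything in it (the function names, the scores, the weights) is made up for illustration; it is not how any real model is trained.

```python
# Toy sketch: folding "a typical user would find this inappropriate" into the
# training objective, as suggested above. Every function, name, and number here
# is hypothetical and purely illustrative; this is not any real training pipeline.

def alignment_reward(match_score: float) -> float:
    """How well the response matches what the prompting user asked for (0..1)."""
    return match_score

def harm_penalty(harm_score: float, harm_weight: float) -> float:
    """Penalty for content a typical user would find inappropriate (harm_score in 0..1)."""
    return harm_weight * harm_score

def objective(match_score: float, harm_score: float, harm_weight: float) -> float:
    """Score to maximize: user alignment minus a weighted harm penalty.

    With harm_weight near 0 the model optimizes purely for matching the prompt
    (the case-4 tension above); raising harm_weight buys "safety" at the cost of
    disappointing the user who asked for the content in the first place.
    """
    return alignment_reward(match_score) - harm_penalty(harm_score, harm_weight)

# A response that matches the prompt well (0.9) but is fairly harmful (0.8),
# versus a polite refusal that barely matches the prompt (0.3) and is harmless.
for harm_weight in (0.2, 1.0):
    compliant = objective(match_score=0.9, harm_score=0.8, harm_weight=harm_weight)
    refusal = objective(match_score=0.3, harm_score=0.0, harm_weight=harm_weight)
    print(f"harm_weight={harm_weight}: compliant={compliant:.2f}, refusal={refusal:.2f}")
# harm_weight=0.2 -> the compliant answer wins (0.74 vs 0.30)
# harm_weight=1.0 -> the refusal wins (0.10 vs 0.30)
```

The single knob harm_weight is doing all the work here, which is exactly the tension in the text: whoever sets it is deciding, on behalf of every user, how much expectation-matching to give up.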
One thing you don’t hear about when discussing AI safety is the truthfulness of the response. When Google was showing off its LLM, Bard, it made a factual error about the James Webb Space Telescope, to which a spokesperson for Google commented:
“This highlights the importance of a rigorous testing process, something that we’re kicking off this week with our Trusted Tester program. We’ll combine external feedback with our own internal testing to make sure Bard’s responses meet a high bar for quality, safety and groundedness in real-world information.”
Fair enough, she used the term “safety”, but this was really more focused on testing.
But what we see is prompts like “write a poem about [person]” being selectively denied because [person] is “bad”. This isn’t AI safety. This is social engineering. The LLM is imbued with some warped sense of “do no harm” that overrides the original goal of aligning with the user’s expectations.
But in the end it’s a losing game. The model that aligns with user expectations is by definition a better model. If you don’t want to hear a poem about [person], don’t prompt the LLM for one. And if you’re worried about other people voluntarily reading AI-generated poems about [person], then you need to do some reflection on your own life.
Yes, all attempts at "fixing" this problem lead us back to social engineering. The incentives for AI to become the social engineer for all of society are going to be substantial, given that the tech looks set to become ubiquitous and to have even more societal influence than social media.
I've also contemplated these issues and more. It's a bit lengthy, but in case you find it interesting for discussion: https://dakara.substack.com/p/ai-and-the-end-to-all-things