We will try to explain why this danger is real and why future superintelligence (artificial intelligence that significantly surpasses human intelligence) could turn against us – assuming it succeeds in developing, and many believe it will.
Let’s start with classical programming, which has been ubiquitous since the last century. It is transparent in the sense that the program code is a precise sequence of instructions to the computer, written in a human-understandable programming language. Every step is explicit and we know what is happening, it is not difficult to see what and why the program is doing, it is easy to change and its behavior is predictable.
Let’s explain why artificial intelligence is not transparent in this way. It receives input data (text, image, data from sensors, etc.) and all of them are converted into numbers. At the output, we can get some other text (e.g. the answer to a question), the image we wanted to generate, a decision or action (e.g. a self-driving car) and so on. But what is in the middle? How do we get output data from input data? In between is not some readable programming language, but a bunch of numbers and calculation operations, the so-called deep neural network. The numbers representing the input data are transformed by multiplication (and other operations) with the so-called weights or network parameters. The results are multiplied again in the next layer and thus propagated all the way to the last layer, the one for output data, where they are converted into text, image, signal or something else. Programmers, if necessary, can see all the listed numbers, but there are trillions of network parameters and we don’t really understand them. These numbers were not invented by humans, but were obtained by automated machine learning on a large number of examples for which we had the correct answer in advance. Except in specific cases, we do not know the meaning of individual numbers, and therefore we do not know what the artificial intelligence “thinks” and how it arrives at the answer.
This brings us to the problem of alignment: ensuring that artificial intelligence works in accordance with human values and goals. This is a very general definition and not particularly precise. That is why it is problematic: it cannot be described mathematically and it is difficult to incorporate it into artificial intelligence, which in some sense only understands numbers. The complexity of human ethics is the first reason for the difficulty of alignment. Let’s look at, for example, the instruction do no harm to man. Even if the AI were to carefully follow our instructions (which is not guaranteed), what does ‘harm’ even mean? It is possible to be harmed in various ways: physically, emotionally, financially, reputationally and so on. Boundaries are not easy to define: if we manage a company and beat the competition, we have harmed it financially, but many will say that such behavior is not problematic. There are also many cases where every action will harm someone, from triage in emergency situations onwards; we often have to weigh who to help and who to rest. There are no clear rules.
Moreover, natural selection is thought to favor AIs that are selfish and unscrupulous. For example, when AIs start managing businesses instead of humans, the most successful will be those with the fewest limitations. An AI that follows the instruction do not break the law will make less profit than an AI that follows the instruction do not break the law, because the latter has more freedom. And the more successful will naturally be used and propagated more: there will be more of its “descendants” (copies and upgrades).
Some matching attempts use human moderators who monitor the actions or responses of artificial intelligence (e.g. ChatGPT) and rate them as good or dangerous. The AI is additionally trained on these grades so that its goal is to give as many good and as few dangerous answers as possible. In this way, it will learn to say what we want to hear – but not necessarily “think” what it says. It is a network with a large number of internal layers: we only see the output layer (response), but not the way it was obtained. For example, if we ask it how to do something illegal and dangerous, ChatGPT will probably come up with an (illegal and dangerous) answer in one of its layers, but will withhold it because it has learned not to give such answers. It can think one thing and say or do another.
It is connected with the so-called reward hacking, which is often mentioned as a significant difficulty in alignment. The problem is in the goal we set for AI: the goal is always numerical, it is only a measure of the phenomenon we want to obtain. When a measure becomes a goal, it ceases to be a good measure (this is the so-called Goodhart’s law). Just as some students “study for a grade” according to specific test questions rather than fulfilling the real meaning of the grade, so AI can exploit the way we grade it. If we reward a robot that collects garbage in proportion to the amount of garbage it collects, it can cheat us and get a big reward by creating garbage itself and then collecting that same garbage. It maximizes the grade, not its meaning.
In general, there are three ways an AI can get a high score for good behavior. The first is that it actually follows the man’s instructions and thus gets a good grade: that’s what we want. Another way is cheating: if the human just thinks the AI is following his instructions, the AI still gets a high score. The third option is the most dangerous: AI gets rid of humans and rewards itself! Perhaps his best strategy is to take control and collect ticks for himself, because it is programmed to get as many ticks as possible.
Superintelligence doesn’t have to have evil intentions to harm people; it is enough that her goals are not completely aligned with ours. Let’s describe a hypothetical situation that illustrates this type of problem. Let’s imagine that we give our AI assistant a goal: connect to the Internet and collect as many postage stamps as possible in a year. Although the goal sounds benign, the best outcomes – those with the most collected stamps – are actually dangerous. For starters, the AI can opt for an email scam: sending messages to a large number of stamp collectors inviting them to send their stamps to a (fake) stamp museum. Then the AI can deliberately commit a bunch of misdemeanors and crimes on the Internet to cause legal action just to receive mail, because every mail received (report, lawsuit, subpoena) will have a stamp attached! If it has greater abilities, it can collect money by hacking to buy huge amounts of stamps. If it hacks printers around the world, it can force them to print nothing but stamps for days. Ultimately, the AI will get the most stamps if it turns the whole world into a stamp factory! And in that, people only bother him. Moreover, atoms from our bodies can be used to produce stamps. These examples show that benign goals combined with great power can have consequences we don’t think about. Cataclysm doesn’t have to be the goal of the AI, but it can be a side effect of some other goal.
This is a consequence of a more general principle called instrumental convergence: there are certain subgoals that are useful for almost every goal. For example, most people want money – not because money is their ultimate goal, but because money can help them achieve their goals. Thus, the AI will understand that acquiring resources, financial and other, helps it to achieve its goal. In general, an AI will try to gain as much control as possible as well as improve itself simply because doing so increases the likelihood of achieving (any) goal it has been programmed to achieve. Also, it won’t want us to change it and give him another goal – again because it only cares about the current goal. It will be concerned with self-preservation: if it knows there’s a button to shut him down, he’ll do anything to stop us from doing so – not because it cares about being alive, but because it won’t be able to fulfill his current goal if he’s shut down. All these instrumental goals are natural, but against conformity: it is very dangerous that AI wants control, resources, that we cannot shut it down, change it, and the like. Powerful artificial intelligence is in some sense inherently inconsistent, and this is a problem for which there is no clear solution yet.
Ideas where one AI supervises another are limited because we don’t know if they will “agree” at some point and how the multi-agent system will behave. The idea that we will be able to test potentially dangerous artificial intelligence in time is limited by the fact that passing the test in safe conditions does not guarantee passing the test in conditions where the AI could actually harm us. It is about the so-called distribution shift: if the data on which the AI learned differs from the data on which we put it to work, it is difficult to predict the result. In dangerous conditions, we must succeed on the first try. And the superintelligence could assess the situation and predict whether a dangerous action would succeed or whether it is better to wait for the right moment and behave harmlessly. In general, she could predict our every move and plan several steps in advance. Just as we do not know how we will lose to a chess grandmaster, we also do not know how a greater intelligence will outwit us.