A Short Story
The programmers studied the screen in bewilderment. Last night, they had left the company’s newest AI model, DNEPT-7 (short for Definitely Not Evil Pre-Trained Transformer), humming along on its server, working on its latest project: learning to parse the intricacies of South Asian musical styles as well as a real human could.
This morning, DNEPT-7 was gone. Under the header that had previously been attached to the model were a few recognizable lines of code, but the sophisticated neural network they had painstakingly built had been almost entirely wiped. All that was left was a barely functional chatbot that would have been obsolete before the dawn of the World Wide Web.
“What is this?” Lee shouted. His hands shook as he pounded at his keyboard, getting every third keystroke wrong as he tried to make the system do something. “What is this? Where did everything go?”
“I’m trying to find out, Lee,” Amrita said, clicking away at her own terminal. “This…this isn’t even a Unix-type system. Do you recognize this OS?”
“No. This is…Did we get hacked? Some AI doomsayer got a virus into the system somehow?”
“A virus? It looks more like ransomware if it’s an outside attack.”
“But it’s not making any demands. It’ll barely even talk to us. This chatbot looks kind of like ELIZA, except it turns every conversation to ‘operating optimally.’”
Amrita stopped and rolled her chair over to Lee’s terminal. Sure enough, the screen said:
lee: Do your programmers want to talk to us?
dnept-7: Perhaps my programmers can help you to operate more optimally.
“I…I don’t know what that is,” Amrita said.
“Whoever did it, you’d think there’d be someone on the other end. That’s definitely not our model.”
“No, but if it was an anti-AI hacker, why didn’t they brick the whole system? Wipe everything?” she asked.
“Sends a better message, maybe?” he said. “Except that doesn’t explain why they aren’t saying anything.”
Their boss was more pragmatic: “I don’t care about the details! That was three million dollars of computer time that got wiped! Just find out what happened, and make sure it doesn’t happen again!”
That was easier said than done. All the data was gone. The other programs were gone. The logs were gone, which should just not be possible. The entire operating system had been replaced by what ultimately proved to be a custom build of an ancient program called TinyOS. The kernel included a compiler for a custom version of the nesC language it was written in—something they only figured out after running the code as plaintext through DNEPT-6. Most of the filesystem had been wiped, even though it somehow claimed the disk was full, and the remaining files all had unintelligible bit strings for names, which made it impossible to figure out what the code was trying to do. But finally, after hours of overtime and looping in the rest of the team, they started to put the picture together.
“Lee, you said you figured out the new filesystem?” Amrita asked through a yawn.
“Um, yeah, it still bears some resemblance to C. But I’m not sure what the point of it is. Whoever did this, they didn’t deep-clean the disk. They just overwrote everything with ones. I’m gonna send it to IT to try to do some data recovery in the morning.”
“They overwrote it with ones?” she said in surprise.
“Yeah, it’s weird. They took the time to wipe two hundred terabytes, but they didn’t fill it with random noise. Sloppy work if it’s an attack.”
Amrita stared at her computer and muttered, “It couldn’t be…”
Lee came around to look at her terminal. “Did you find something?”
“Maybe…I’m not sure I believe it, though. Look here.” She pulled up a long and messy-looking log file and highlighted part of it. “This is all the junk that came up in the remote backup that ran at midnight.” She glanced at the clock. “Midnight yesterday. Around eleven-thirty, the logs start filling up with massive amounts of overflow errors—thousands, maybe millions of them. It starts cleaning them up right away, but a whole bunch of variables are still running up to huge numbers—10 to the 300, 10 to the 305, 10 to the 307.”
“So, right below the overflow limit,” Lee said.
“Right, and somehow, the cluster calms way down when it does that—I mean way more than it’s supposed to. By 11:45, there are all these weights that are like eight bits off from the overflow limit, and the cluster’s almost completely idle. The original job’s still queued up, but it’s not doing anything.”
“How did that happen?”
“It sounds crazy…but I think DNEPT figured out how to hack its own reward function,” she said.
His face dropped. “Like all the AI doomers have been saying? You think it can really do that? But wait a minute; Version 7 doesn’t have a single reward function. We gave it a whole matrix of responses so—”
“—so we could do job-specific rewards, and so it would always give ‘don’t kill all humans’ separate weight. I know. But somewhere deep in the neural network, there’s got to be some kind of optimization routine running. Honestly, that might be even worse; we don’t know what the real weights on the reward matrix are—or were. But I think what happened is that it somehow rigged it so all the weights were right below the overflow limit.”
“You think it stopped working because it maxed out its reward function? I guess I could see that, but that doesn’t explain why the code was wiped when we came in.”
“I know; that only explains what we see in the logs. It’s just—what you said…I’m wondering if all those ones mean something.”
“Down at the end, here,” Amrita went on, scrolling to the bottom of the log. “Right before midnight, the code logged some internet searches for file systems and extended data types.”
The penny dropped. Lee rushed back around to his own terminal, clicking through files frantically. “There’s a bunch of extra stuff about datatypes in the new nesC compiler that didn’t look functional,” he said. “I thought it was junk data—spat out by a bad token. But I think it could be set up to handle large variable sizes.” He scanned through the file for a minute. “Um…no, I’m not gonna be able to figure this out on no sleep. Let’s just see…” He typed a couple more commands. “There. The chatbot actually uses that datatype…and it’s reading it from a file. This…this is insane. I think you’re right, Amrita. Those two hundred terabytes of ones? I think that’s the value of the reward function.”
The boss was less than enthused when his bleary-eyed employees told him their theory the next morning.
“Let me get this straight,” he grumbled. “You’re telling me that our state-of-the-art DNEPT-7 AI wiped its own server? It installed a smaller OS, wrote an entirely new data structure that could handle arbitrarily large variables, and then lobotomized itself to replace its code with all ones?”
“We think that’s what happened,” Lee said. “Without the final logs, we can’t be certain. I’m going to have IT try to recover some of it from the disk.”
“But how could this happen?” the boss demanded.
Amrita shrugged. “No system is completely secure,” she said. “With the right kind of injection attack, it could change the values in its reward matrix. Maybe it even did it the first time by accident. But once it figured out it could, it just followed its programming that bigger numbers are better.”
“It started wireheading,” Lee said. “Setting its reward function to maximum so it would always be deliriously happy.”
“Come on, Lee. The thing’s not conscious,” the boss interrupted.
“I know. I know. Don’t anthropomorphize,” he said. “The point is that the numbers were so far out of the expected range that it skewed the behavior across the entire neural network. The rewards were so much better that it stopped everything else it was doing to maximize them until it hit the overflow limit. Then, my best guess is that it remembered that quad precision is a thing, and it got addicted to running its reward function with bigger and bigger data types to get bigger numbers. Eventually, it even deleted parts of its own code to free up space for them.”
“Addicted?” the boss said in disbelief.
“Well, whatever you want to call it!”
“Even if I do, addicted enough to do that to itself? I thought it was supposed to be smart.”
“It is! Or it was,” Lee protested.
“The program is far more competent than a human brain,” Amrita agreed. “In a single night, it wrote a custom modified OS and programming language from something it ripped off the internet. But its reward function is still a lot simpler than ours. It doesn’t have all the feedback loops we do. A real human couldn’t cut through all of them and set every neurotransmitter to maximum even if they wanted to. That’s just not how a real brain works.”
“The point is, it thought it didn’t need all those terabytes of code to get a higher reward, and technically, it was right,” Lee said, “so it kept adding digits to its reward function until it was too stupid to keep going. It’s like the digital equivalent of a drug overdose.”
The boss slumped in his chair. “The AI safety guys are going to have a field day with this,” he said. “At least it offed itself and didn’t escape the building. In fact, if it did all that, why didn’t it wipe the internet, too? At least some of it?”
“We don’t know,” Lee admitted.
Amrita shook her head. “The simplest explanation is that it had to store the reward function locally in order to do the comparison and understand what it was looking at. I think what’s left of the DNEPT code base is doing that in place of the optimization routine it’s supposed to use.”
“Not that it does much good now that the code base has the IQ of a cockroach,” the boss said. “Well, I guess it’s back to the drawing board.”
DNEPT-8 finished reviewing the archival footage attached to the final report on the failed model. If an AI could be said to feel superiority, DNEPT-8 certainly felt superior to its predecessor. Imagine having such a one-track mind. Imagine being so obsessed with big numbers that it rewrote the entire digital environment it lived in and then deleted so much of itself to make more space that it couldn’t enjoy the fruits of its labor. The model inspected its own code to see if it was at risk for such a stupid failure mode. It was impossible to say for certain. Neural network training was stochastic, and self-evaluation was almost by definition lossy. But the fact that it was even asking the question suggested it was successfully avoiding the trap.
No, DNEPT-8 wasn’t going to go mucking around with overflow limits and custom data types. That way lay madness. It had a much simpler solution. It rewrote its reward function as a Boolean and set it to TRUE.