Will GPT-4 Run DOOM?

A.k.a. "Doomguy is all you need"

This is the TL;DR of my paper Will GPT-4 Run DOOM?. If you are looking for the code, it is here.

All the opinions expressed here are my own. Angry mail should be directed at me. Happy mail too--I'd rather receive that one.
All products, company names, brand names, trademarks, and images are properties of their respective owners. The images are used here under Fair Use for the educational purpose of illustrating mathematical theorems.

Outline

The TL;DR
Will GPT-4 Run DOOM?
Will GPT-4 Play DOOM?
Unexpected Findings: GPT-4 and Reasoning
Implications
Conclusion

The TL;DR

I will always link this fantastic monologue when talking about ethics and ML.

A gif of GPT-4 playing Doom
Best run for all experiments. Here "best" means "it got really far".

Will GPT-4 Run DOOM?

Doom is a 1993 shooter by id Software that has been hacked and made to run on, like, everything for the past 30 years. It is only (somewhat) natural to ask whether GPT-4 can (a) run Doom, and (b) play Doom.

So, under this setup (see the picture below), GPT-4 (and GPT-4 with vision, or GPT-4V) cannot really run Doom by itself. That's because it would need enough memory to keep track of the game states and render stuff (first of all, GPT-4 doesn't even have image-rendering capabilities, so it would have to output Doom in some text encoding). There are a couple of issues with that, the main one being that the context window is nowhere near large enough to hold and update all of that state, token by token, frame after frame.

But wait: if we can get it to act as a proxy for the engine (much like the E. coli that has been made to "run" Doom), we can definitely get it to "run" Doom. This also lets us interface with GPT-4 for play purposes, so it works peachily. So: GPT-4V "reads" Doom to GPT-4 in natural language, and we have covered the "will it run Doom" part (with a technicality) and can use GPT-4's natural-language capabilities for play.

Aside: we could probably have gotten GPT-4 to "diffuse" a bunch of black-and-white frames and output the next one based on exemplars and a sequence (that'd probably need Turbo, not GPT-4-32k), but then we'd be relying on the model "guessing" what the next game state is based on the user commands, which isn't great--it would need access to the code determining enemy movement.

Architecture of the system
Setup for the paper: GPT-4/GPT-4V interface with Doom via a Python connector running on top of Matplotlib. The Manager (i.e., a Python class) sends screenshots to GPT-4V, which returns natural-language descriptions. The Manager then forwards the description, the play history, and the call parameters to GPT-4. Depending on the prompting scheme, it will also call the Planner (i.e., GPT-4 under a planning prompt) or the Experts (three separate GPT-4 calls) before the main GPT-4 call.
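
To make the data flow a bit more concrete, here is a minimal sketch of what a Manager loop of this shape could look like. All the names, prompts, and the action set below are my own stand-ins for illustration, not the actual connector (that lives in the linked repo).

```python
# A minimal, illustrative sketch of the Manager loop described in the caption.
# Everything here (names, prompts, the action set) is a stand-in, not the
# actual connector from the paper's repository.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Manager:
    vision_model: Callable[[bytes, str], str]    # e.g. a wrapper around GPT-4V
    text_model: Callable[[str], str]             # e.g. a wrapper around GPT-4
    send_key: Callable[[str], None]              # forwards a keypress to the engine
    walkthrough: str = ""                        # static, high-level guide for the map
    history: list = field(default_factory=list)  # rolling log of (description, action)

    def step(self, screenshot_png: bytes) -> str:
        # 1. GPT-4V "reads" the frame into natural language.
        description = self.vision_model(
            screenshot_png,
            "Describe this Doom frame: enemies, doors, items, walls.",
        )

        # 2. A Planner call turns walkthrough + history + state into a short plan.
        plan = self.text_model(
            f"Walkthrough:\n{self.walkthrough}\n"
            f"Recent history:\n{self.history[-10:]}\n"
            f"Current state:\n{description}\n"
            "Give a short plan for the next few moves."
        )

        # 3. The player call picks a single action for this frame.
        action = self.text_model(
            f"Plan:\n{plan}\nState:\n{description}\n"
            "Reply with exactly one action: FORWARD, LEFT, RIGHT, FIRE, or OPEN."
        )

        # 4. Bookkeeping, then execute the keypress against the engine.
        self.history.append((description, action))
        self.send_key(action)
        return action
```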

Will GPT-4 Play DOOM?

Okay, this is the fun part -- but let's start with a theoretical note BECAUSE THEORY IS FUN. A few years back (god I'm old) a paper came out showing that a bunch of (a bunch = most) games are in PSPACE. This means that if you want a computer to play the game perfectly (find the best winning strategy every time), all you need is space polynomial in the size of the game to simulate all moves. Anyway, the point is that GPT-4 can't really play Doom perfectly: by the token argument above, it doesn't have sufficient memory to simulate all plays.
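
To spell out the space argument a little (a hedged sketch, where "Doom" stands for a suitably generalised decision version of the game of size n, not the literal 1993 binary):

```latex
% Perfect play for a generalised game instance of size n being in PSPACE
% means the whole game tree can be explored while reusing a polynomially
% bounded amount of working space:
\[
  S_{\text{perfect play}}(n) = O(n^{c}) \quad \text{for some constant } c > 0.
\]
% GPT-4's usable memory, by contrast, is a fixed context window of k tokens
% (e.g. k = 32768 for GPT-4-32k), so for a large enough instance we get
% S_{\text{perfect play}}(n) > k, and the model cannot even hold the state
% needed to simulate all plays, let alone search them.
```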

That said, GPT-4 can play Doom to a passable degree. Now, "passable" here means "it actually does what it is supposed to, but kinda fails at the game and my grandma has played it wayyy better than this model". What is interesting is that indeed, more complex call schemes (e.g., having a "planner" generate short versions of what to do next; or polling experts) do lead to better results.

To elaborate a bit on "passable": the model can traverse the map, especially if it has a walkthrough. This is legit because people do use walkthroughs in games. I have used walkthroughs for some games (don't know how I'd have gotten most Halo MCC achievements without Maka's guides). Even then, however, GPT-4 does sometimes do really dumb things, like getting stuck in corners and just punching the wall like an angsty teenager, or shooting explosive barrels at point-blank range (blooper reel below).

Generally, though, GPT-4 does well when it has an extra model outputting reasoning steps in a more fine-grained fashion. A walkthrough is a general plan, but if the model gets both the walkthrough and an immediate step-by-step plan on what to do next, it does much better, consistently. This is called hierarchical planning. If we boost it (in the traditional sense) by polling a bunch of models on what step to take next, it works even better, though consistency is a bit of an issue.

Aside from consistency, there are some caveats that I will cover in the next section, but the gist of it is that GPT-4 does better with less complex work. So if you can break down the task both into reasoning steps (think how chain-of-thought prompting works) and into time steps (i.e., have an orchestrator/planner with a wider view of the problem), the model works much better. It is actually pretty good at ad hoc decisions, like "here's a door, let's open it" or "there's a zombie, fire!" (though it usually misses).
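
As a rough, hedged sketch of what these call schemes look like in practice (the prompts, helper names, and action set here are mine and heavily abridged, not the paper's actual ones):

```python
# Rough sketch of the two call schemes discussed above: hierarchical planning
# (walkthrough -> short-term plan -> single action) and polling "experts"
# (several independent calls, majority vote). Prompts and names are
# illustrative only.

from collections import Counter
from typing import Callable

ACTIONS = ["FORWARD", "LEFT", "RIGHT", "FIRE", "OPEN"]

def plan_then_act(llm: Callable[[str], str], walkthrough: str, state: str) -> str:
    """Hierarchical planning: a coarse plan first, then one concrete action."""
    plan = llm(
        f"Walkthrough:\n{walkthrough}\n"
        f"Current state:\n{state}\n"
        "In one or two sentences, what should the player do next?"
    )
    return llm(
        f"Plan: {plan}\nState: {state}\n"
        f"Answer with exactly one of {ACTIONS}."
    )

def poll_experts(llm: Callable[[str], str], walkthrough: str, state: str,
                 n_experts: int = 3) -> str:
    """"Boosting" in the loose sense: ask several times, take a majority vote."""
    votes = [plan_then_act(llm, walkthrough, state) for _ in range(n_experts)]
    return Counter(votes).most_common(1)[0][0]
```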

GPT-4 shooting an explosive barrel point blank.
"hahah gun goes brr" --not GPT-4

Unexpected Findings: GPT-4 and Reasoning

Okay, so this is less of a "findings" section and more of a discussion. Reasoning in LLMs is a bit of a spicy topic because a lot of people think that these models are sentient (christ no), or that they are actually having a conversation with you (christ on a bike no). You can probably guess my position on this subject. Generally, LLMs are very good at giving the impression that they are intelligent (see image below), and without sufficient critical reasoning, people are easily fooled.

That said, there is a massive difference between reasoning in long, complex scenarios, Aristotleing/Newtoning/Turinging your way through life; and being able to solve simple problems by performing sufficient (abductive) inference. I do not expect GPT-4 to be able to develop a new, unseen insight in higher category theory, but hey, it can follow an instruction like "if you see a door, press SPACE to open". Maybe it just learnt to talk as a parlour trick.

Extremely relevant meme.

Okay, so we will lower the bar a bit on what it means to reason, because we really should--grandstanding aside, there are interesting things models can do by interpreting strings and returning strings. Better yet, there are some fascinating things to observe about this scientific exercise (which is a grand term for "a paper that started as a shitpost"). Let's talk about some (negative) observations:

Ok, but enough potshots at GPT-4. It's absolutely remarkable that this model has (hopefully, maybe, idk) not been trained to play Doom, and still manages to get all the way to the last room (once). While I do believe that some level of prompting and calling schemes like planning and k-levels will get the model to play the game better (maybe even finish the map), to play the game reliably it will probably need some fine-tuning.

I should also point out that there is a massive difference between planning and reasoning in an academic setting (think Towers of Hanoi, box stacking, or benchmark problems like those in BIG-bench) and planning in an applied setting (Doom falls into this category). It is much, much harder to get a model to play Doom than it is to get it to solve Towers of Hanoi. And also far more interesting.

Same thoughts go for GPT-4V's zero-shot output. It is remarkably good, which is great! (Do you really want to write an OCR model to sort of get it to work on Doom? Sounds expensive and like too much work.) The descriptions are exactly what the player model needs. There are some hallucinations here and there, but overall I would argue it's more of a reasoning problem than a parsing problem that limits GPT-4's ability to play Doom.
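
For reference, a frame-description call with the OpenAI Python client looks roughly like the sketch below. The model name, prompt wording, and token limit are assumptions for illustration; the paper's connector uses its own prompts and deployment.

```python
# Hedged sketch of asking GPT-4V to "read" a Doom frame. Model name, prompt,
# and max_tokens are placeholders, not the paper's actual configuration.

import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def describe_frame(png_bytes: bytes) -> str:
    image_b64 = base64.b64encode(png_bytes).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder model name
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this Doom frame: enemies, doors, items, "
                         "and walls, in plain language for another model."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```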

Implications

In the ethics department, it is quite worrisome how easy it was (a) for me to build code to get the model to shoot something; and (b) for the model to accurately shoot something without actually second-guessing the instructions. So, while this is a very interesting exploration around planning and reasoning, and could have applications in automated videogame testing, it is quite obvious that this model is not aware of what it is doing. I strongly urge everyone to think about what the deployment of these models implies for society, and about their potential misuse. For starters, the code is only available under a restricted licence.

That said, since getting a game extensively playtested is a very common problem in videogames, this suggests that eventually some parts of testing may be automated by an LLM. This means that you won't have to heavily train specialist models to play the game, and it's very likely that this can already be done for non-real-time play (think Catan without turn time limits).

Conclusion

Right, so what have we learnt? For starters, that GPT-4 can run Doom (sort of). It can also play Doom (also sort of). But what matters the most here, in my opinion, is that the model can reason a wee bit about its environment, or at least follow verbatim instructions of the type "when you see this, do this". It fails more at the whole "when you see this, and this, and this has happened, do this", and it can't reason well about its history. So I'd argue that most of these claims around reasoning are more related to memorisation and semantic matching, and less about actual reasoning. There is something there, though, and I'd wager that we are just starting to see the beginning of it.

From a benchmarking perspective, I think it is far more interesting to have evaluation benchmarks that can't be memorised. Let's get real: if I tried hard enough, I'm pretty sure I could get GPT-4 to output the entire QQP dataset. It's not on purpose--it's just that its training corpus might have been contaminated one way or another. However, benchmarks that test a phenomenon, as opposed to a fixed corpus, are far more interesting and realistic. You really can't claim your test set is representative of the entire phenomenon you are modelling/learning--but if you have a genie that, when you pull its arm, gives you a new test set, you are definitely testing the real thing. It's also memorisation-proof. Also, it's cool to see LLMs play videogames.

But back to the "there is something there" thing. For this reason, and because it was seriously super easy to get GPT-4 to unquestioningly follow instructions, I think it is extremely dangerous how easily and how fast this tech has spread. It has its benefits, like automated playtesting, or, better, helping users with dyslexia (neurodiverse groups are aaaalways marginalised, just ask me how hard it's been to get products to do inclusion work), but it has massive potential for misuse.

So I would very much like to point back to that Jurassic Park monologue and note that technology isn't good or bad, it just is. A tool reflects the value set of its creators, and it is typically people who give it meaning.


Adrian de Wynter
9 March 2024