EPISODE 1673

[INTRODUCTION]

[00:00:00] ANNOUNCER: Decompilation is the process of translating a compiled program's byte code back into a higher-level programming language like C. There's a vibrant and growing scene of engineers working to decompile classic video games. And some of the most prominent projects have focused on the Nintendo 64. 

Recent successes include Super Mario 64, The Legend of Zelda: Ocarina of Time, and Paper Mario. Ethan Roseman and Mark Street are both software engineers with experience in the decompilation scene. In addition to their work on specific games, they're active in creating open-source tooling for the decompilation community, including Splat, which is a binary splitting tool, and decomp.me, which is a collaborative decompilation and reverse engineering site. Ethan and Mark joined the podcast to talk about N64 game decompilation, surprising discoveries in the game code, tool development, and much more. 

Joe Nash is a developer, educator, and award-winning community builder who has worked at companies including GitHub, Twilio, Unity, and PayPal. Joe got a start in software development by creating mods and running servers for Garry's Mod. And game development remains his favorite way to experience and explore new technologies and concepts.

[INTERVIEW]

[00:01:27] JN: Ethan and Mark, welcome to the show. Thank you for joining me today.

[00:01:30] ER: Thank you for having us.

[00:01:31] MS: Thanks for having us, Joe. 

[00:01:33] JN: This is a topic that we've been wanting to cover in-depth for a while. We've had various N64 ROM hackers come and chat to us. I'm very excited that you're both here to go deep on this. I guess to start off, we want to get into what is decompilation? We've had this covered a bit before. But I think there's some nuance there we haven't had. Ethan, do you want to kick us off with telling us your idea of what decompilation is?

[00:01:52] ER: Yeah. Generally, decompilation is the process of taking a compiled language's bytecode, which a higher-level language like C gets compiled to, and then turning it back into C. When we're talking about decompilation in the game decomp scene, usually we're actually talking about matching decomp, which takes that a step further. And what that means is that the C we're then rebuilding matches byte-for-byte with the input binary. We're basically reproducing a process that can take the binary, deconstruct it, and then reconstruct it exactly the way it was originally.
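
[CODE SKETCH]

As a rough illustration of the "matching" criterion described above, the check boils down to a byte-for-byte comparison between the rebuilt binary and the original. This is a minimal sketch in Python; the function name `first_mismatch` and the sample bytes are hypothetical and not taken from any project's tooling.

```python
from typing import Optional


def first_mismatch(original: bytes, rebuilt: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if the
    two binaries are identical (a "match")."""
    for offset, (a, b) in enumerate(zip(original, rebuilt)):
        if a != b:
            return offset
    # All shared bytes agree; a length difference is still a mismatch.
    if len(original) != len(rebuilt):
        return min(len(original), len(rebuilt))
    return None


# Made-up bytes standing in for an original binary and two rebuild attempts.
original = bytes.fromhex("27bdffe8afbf0014")
good_rebuild = bytes.fromhex("27bdffe8afbf0014")
bad_rebuild = bytes.fromhex("27bdffe0afbf0014")

print(first_mismatch(original, good_rebuild))  # None
print(first_mismatch(original, bad_rebuild))   # 3
```

A real project would run this kind of comparison on the whole rebuilt ROM, typically as the last step of the build.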

[00:02:30] JN: Awesome. That makes sense. Mark, before we ask a question, do you got anything to add? 

[00:02:35] MS: No. Other than it's not just throwing something into Ghidra and hoping for the best. It's quite an intensive process. It takes time and effort to get that match. Yeah.

[00:02:45] JN: I definitely have a lot of questions. I was unfamiliar with matching decomp before preparing for this interview. And I have a lot of questions about all manners of how that works. But firstly, I'm very interested in how you two got - and how anyone who gets into the scene gets into it. For me, it's really fascinating. This combination of, you know, past consoles that lots of people have kind of forgotten about. Going back to it and then putting all this time and effort into reviving these projects and into trying to archive these projects in a way. How did you both get your start with decompilation? Mark, do you want to kick us off?

[00:03:16] MS: Yeah, sure. For me, I'm not quite sure how I heard about the Super Mario 64 decompilation. But maybe a Reddit thread. Something like that. And then I heard that they ported that game onto the Nintendo 3DS. I've had a 3DS since it came out. But it's been up in the loft. I dusted it off and then installed the Super Mario 64 port. And I was like blown away by the concept of running this N64 game without emulation. Just like natively on there. And then I spent some time like forking that 3DS port and improving it in my own way like adding stereo 3D and trying to make it run a bit better and things like that.

And then flowing on from that, kind of learned about the process of they decompiled Super Mario and then thought, "Oh, that's easy. I can do that." And so, started asking or like joined the community and started asking about what is the process? How do I do it? 

And kind of back then there was - yeah, I think the Super Mario - I guess Ethan can jump in on this. But the Super Mario decomp was a bit more of like a private project that was worked on. Whereas I wanted to ask anyone or like I just wanted to gain the knowledge and there wasn't that real strong community there that there is now today to ask. Yeah, I just started hacking away and just took on a project way too big kind of thing and threw myself in the deep end. Yeah, thankfully, learned a lot and had a lot of great support from people like Ethan to help out.

[00:04:45] JN: Yeah. Ethan, where do you come in on this? 

[00:04:47] ER: Yeah. I think, also, I first learned about decomp through the Super Mario 64 project. And I have a story of my dad holding me up at Blockbuster to play that game on - and it was just like magic. I mean, for everyone, that game was magic. But that's like always just been a really important game for me. And to see that people were able to create C code from that game that they could build back and then reproduce the game was just insane to me. And I thought that I would - it's like totally above my skill level. I'm a software engineer. But I can't do that. There's no way. And it was also like really far into the project and I thought, "I'll just watch this." 

And then I learned that they were - some of the same people were going and making an Ocarina of Time decomp. And I realized, "Oh, actually, if this is like a starting from scratch, maybe I can kind of jump in and learn." And I did. And it was like a slow process. But I ended up learning a ton and then helping out with that project. And then going off and starting the Paper Mario decomp. And then it just kind of snowballed from there. 

[00:05:53] JN: That's awesome. Yeah, I definitely have similar memories of Super Mario 64 in various retail stores. Yeah. I guess you both kind of cover this in various ways, nostalgia and that kind of thing. But as you mentioned with matching decomp, it is a bit of a laborious process. These are things that are quasi-underground sometimes. It's not like there's huge fame or acclaim from these projects a lot of the time. What motivates you to work on these projects?

[00:06:21] ER: That's a really complicated question. Because I think it changes and evolves. And the answer will vary from day to day. For me, originally, it was learning about my favorite games. Just like learning more about them. Because what we're doing is we're seeing how they work on the most microscopic, fundamental level. And I just find that whole thing fascinating. We're like uncovering secrets within the game ROM that aren't necessarily even viewable from the player perspective. Because there's lots of examples of like code that never gets executed, assets that never get displayed. Stuff like that. 

But then I think the more you decomp, these projects are so long and so challenging at times, you start to basically - for me, I'm a big puzzle game player. And for me, decomp is a puzzle game. And we'll go into this more later, I'm sure. But we do things on like a function level. Each function we decompile is like a little puzzle in and of itself. And there's more to it than just getting the match. There's documentation, naming, and all sorts of other things. But getting that match. Starting from assembly and then trying to write a piece of C that compiles back to that assembly is a puzzle. And I just get addicted to that puzzle game. That's the biggest motivating factor, I think, for me. 

[00:07:38] JN: Yeah. That's one of the thoughts I had when looking at decomp.me, which we'll come around to in a bit, was like this could be a Zachtronics game. This could be like one of the assembly programming games. Mark, how about you?

[00:07:50] MS: Yeah. I was going to say, it's kind of like if you had a really complicated, really long book of Sudokus. And then each function is a Sudoku. And some of them are relatively easy and some of them you could spend hours on. But you know there is a solution. There has to be a solution. And just trying to figure that out is part of that. And then there is, yeah, the endorphin hit of, "Yes, I got the match." Or even if you've helped someone else match some code or they've helped you match something then, yeah, there's that shared part of it. There is like the reverse engineering aspect of it as well.

With an N64 ROM, you don't just have the game code. You have, of course, the assets, or the images, or the music. Some of these games have a custom script engine and then custom scripts for each level and things like that. You can take a break from just doing the pure matching part and say, "Okay. Well, I know they're using some game engine. How does that script work? I could spend like three months just looking at that and trying to reverse engineer that." But then having the code and those functions that use those scripts helps you better reverse engineer that thing.

These are potentially multi-year projects. You don't have to focus 100% on the matching stuff. You can then spend some time on other bits. And I guess perhaps that's why Ethan and I spend time on the tooling as well. It's like a break from that. And it's, yeah, something to do that's still marching towards the end goal. A bit of a break from just banging your head against the wall when the function doesn't match. Yeah.

[00:09:15] JN: Yeah. There's always something very satisfying about making tools to do the thing that you do. I can definitely understand how that would come about. And we'll definitely get around more to that in a bit. But, I mean, both of you so far have mentioned various aspects of the decompilation process. I'd like to go deep on that and start with like - all right. So, we've got a game we want to decompile. What is the process here? How do we even begin? Ethan, do you want to kick us off?

[00:09:36] ER: Yeah. I'll say the high-level explanation without getting into like practical tooling because that's been changing a lot. But then we can talk about that after. For an N64 game, there's no file system. It's just a bunch of bytes. It's a ROM. The high-level is that you need to identify the code in the ROM and what like virtual memory address the code is located at. And then you disassemble the code to like assembly files. 

And then the assets, you can either export into some sort of high-level modern asset format or you can just leave as binary data, which we typically do at the beginning of a project just to keep things simple. Then you have these blobs of data and then you have this code all extracted out. Then you need to write a build system to reassemble the assembly and then re-link together all the different blobs of assembly, and objects, and all this stuff. 

And then once you have that, which is like the most basic kind of barebones thing you could do, then you start taking these assembly chunks and then you turn them into C files. And you basically create a C file that's like all it is is just injected assembly function, injected assembly function, inject - it's just lines of basically include this assembly, include this assembly. You're not really actually changing anything but you're just starting this C file infrastructure. 

Then each of those little lines you can try replacing with an actual C function. And that's where like the fun begins. And it's a matter of taking those injected assembly bits of code and then turning them into like actual C code. That is the main thing we call decompilation. But there's so many other things like the assets and - yeah, it's a whole world.
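
[CODE SKETCH]

The first steps of the process described above (carve the ROM into code and data chunks, then prove you can put them back together unchanged) can be sketched roughly as follows. This is only an illustration: the segment table, names, and offsets are invented, and `rebuild_rom` is a stand-in for a real assemble-and-link step rather than how any particular tool works.

```python
from typing import Dict, List, Tuple

# (name, kind, start offset). Offsets and names are invented for the example.
SEGMENTS: List[Tuple[str, str, int]] = [
    ("header", "bin", 0x00),
    ("boot", "asm", 0x10),
    ("main", "asm", 0x20),
    ("textures", "bin", 0x38),
]


def split_rom(rom: bytes, segments: List[Tuple[str, str, int]]) -> Dict[str, bytes]:
    """Cut the ROM into named chunks; each segment runs to the next start."""
    chunks = {}
    for i, (name, _kind, start) in enumerate(segments):
        end = segments[i + 1][2] if i + 1 < len(segments) else len(rom)
        chunks[name] = rom[start:end]
    return chunks


def rebuild_rom(chunks: Dict[str, bytes], segments: List[Tuple[str, str, int]]) -> bytes:
    """Stand-in for the assemble-and-link step: join chunks in ROM order."""
    return b"".join(chunks[name] for name, _kind, _start in segments)


rom = bytes(range(64))  # fake 64-byte "ROM"
chunks = split_rom(rom, SEGMENTS)
assert rebuild_rom(chunks, SEGMENTS) == rom  # round trip is byte-identical
print({name: len(data) for name, data in chunks.items()})
```

Once that round trip holds, individual "asm" chunks can be replaced piece by piece with C, re-checking the whole ROM after every change.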

[00:11:23] JN: Mark, when you were talking about how you got into it, you mentioned like certain projects feeling too big and that kind of thing. Does this process change game-to-game? What makes like decompiling one game easier than another? 

[00:11:35] MS: I guess the thing to talk about for Super Mario 64 was that it was compiled or at least one of the versions was compiled without any optimization. If you look at the output of the assembly, it's almost like if you're writing C code, this variable is this variable plus that variable. Or this one then times it by two then divided by 100 or whatever. If you wrote three lines of C, it would probably come out as maybe six lines of assembly kind of thing. And then it's kind of laborious because you have to be exactly precise. 

But you can just step through it and get it right. Whereas more often than not, the games are compiled with optimizations. And so, those five lines of C or whatever might be compiled into two lines of assembly or something. Or tweaked, or moved around and things like that. That's at a very basic level how things can be more complicated. 

And then different games work around the memory limitations of the N64 or older consoles in different ways. Generally, the game will be split into things called overlays where, for example, if you had 10 levels, you might have an overlay for each level. You only need level one when you're playing level one. You don't need to load the other nine levels. It may be that those 10 levels are very similarly structured. Once you've figured out how to disassemble and match level one, it's very similar to do the other nine. Or they might be completely different. And so, you've got 10 times as much work as you thought. But, yeah, I don't know that any particular game is going to be necessarily easier than another.

There's also the fact that different games may be compiled with different compilers. The main two compilers for the N64 era are IDO, which is the SGI compiler, and then there's GCC. And GCC does a lot of - I'd say it's a bit smarter. It does better optimizations, which can help you. But it can also be tricky to figure out when the compiler is being too smart for you. And then IDO has its own quirks where, yeah, you just have to learn that this is a pattern, or that white space can actually make a difference when you're trying to match a function, and things like that, where you shouldn't have to worry that putting something on two lines or one line creates different code.

[00:13:47] JN: Yeah. We've got a whole heap of things there realizing the ramifications of so - No. No. It's great. Obviously, you're trying to match your C code to this output assembly. And as you said, it can be optimized or not. How do you go about tackling that? Are you trying to write C code that produces the optimized assembly? Or are you trying to work out what the optimization flags passed to the compiler were to get to that state? How do you tackle that?

[00:14:09] MS: You generally pick a very short function. Maybe just a function that calls a couple of other functions or something like that and maybe passes a couple of parameters. And through that, you can tell, "Is there any optimization?" Generally, it would be zero, one, or two. 

[00:14:22] JN: Of course, yeah. 

[00:14:24] MS: Sometimes three and that generally means you can inline other functions. But most of the games are compiled with level two optimization. And so, that's kind of a safe bet to assume that it's going to be O2. But then it's like is it a GCC compiler? If so, which version or which flavor of it? Is it IDO? There's a couple of different versions to use there.

There are some heuristics between GCC and IDO. You could analyze a bit of assembly. And if it uses a certain type of jump call, it might be this compiler or it might be that one. But, generally, it'd be, yeah, find a short function that should be obvious as to what the C code would be. And then, yeah, cycle through a few options of compilers and flags. 
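
[CODE SKETCH]

The "cycle through a few options of compilers and flags" step described above is essentially a small brute-force search. The sketch below shows the shape of it under heavy assumptions: `fake_compile` is a placeholder that returns made-up bytes instead of invoking a real compiler, and the compiler names are hypothetical.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple


def find_matching_configs(
    target: bytes,
    source: str,
    compilers: Iterable[str],
    opt_levels: Iterable[str],
    compile_fn: Callable[[str, str, str], bytes],
) -> List[Tuple[str, str]]:
    """Try every compiler/flag pair and keep those whose output equals target."""
    matches = []
    for compiler, opt in product(compilers, opt_levels):
        if compile_fn(compiler, opt, source) == target:
            matches.append((compiler, opt))
    return matches


def fake_compile(compiler: str, opt: str, source: str) -> bytes:
    """Placeholder for invoking a real compiler; returns made-up bytes that
    merely depend on the configuration so the search has something to find."""
    return f"{compiler}|{opt}|{len(source)}".encode()


source = "int add(int a, int b) { return a + b; }"
target = fake_compile("compiler_b", "-O2", source)  # pretend this came from the ROM
found = find_matching_configs(
    target,
    source,
    compilers=["compiler_a", "compiler_b"],
    opt_levels=["-O0", "-O1", "-O2"],
    compile_fn=fake_compile,
)
print(found)  # [('compiler_b', '-O2')]
```

In practice `compile_fn` would shell out to each candidate toolchain and extract the function's machine code, which is why a short, obvious function makes the best probe.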

But then I'm going to like stomp over Ethan here. But like in the Paper Mario decompilation, they had a lot of functions that they managed to match. But it was what we call a fake match where you've had to add a bit of a tweak, like a bit of junk code or a bit of weird stuff. And then later in the project they realized there was another particular flag that had been passed to GCC, and they could then remove all of those like hacks and get the same match.

[00:15:25] ER: Yeah. Two things came up while Mark was just talking that I kind of neglected to go into. I skipped over this. But when you have these ASM blobs and you turn them into C files and then you're like, "Okay, I want to write some C code and then compile this back," yeah, this whole thing of you need the - not only the correct compiler options. But you need the original compiler. 

And, usually, you can't even differ by like a minor version. You need the exact version. And finding that out can be so - historically, it was such a complicated thing to do because you basically had to manually set up a little script or something that would compile with a bunch of different compilers against one function. 

In the case of Paper Mario, it's kind of a weird story. Paper Mario is made by Intelligent Systems and they produced this debugging device for the N64 called the IS Viewer I think it's called. And it let you kind of show little printouts, debug logs while you're running the game. You could like view these debug logs as a developer. 

And Intelligent Systems had like a publicly-facing developer portal kind of site where they showed like the version of - if you want to build your game and utilize this IS Viewer protocol, you can use like this GCC, this binutils. They kind of had this tooling and very specific versions and patches available on their public website.

And I think some people saw that and thought, "Hmm, this is interesting. They also made an N64 game. Maybe they used the same compiler version. Same binutils version." And sure enough, it matched on like a really small function. This was like long before I was even working on the project. People had made like a test repo and kind of proved that we found the compiler. 

But, yeah. Like Mark said, we were having these weird issues where like we had to - I'm trying to think of the specifics. For global symbols, we had to all the time make these temp variables that were like double pointers to the globals and then like de-reference them twice or something like that. And we had to make all these like manual scopes all throughout the functions that you just wouldn't make as a programmer. It read very badly. And we knew that there was something wrong.

And for the longest time, we thought like we didn't have the correct compiler. Maybe the settings were wrong. And, yeah, someone found that the flags - we were missing a flag called -fforce-addr. It's just a GCC flag that isn't on by default. And when we turned that on, it like revolutionized the codebase. Because everything still matched except for all of these gross-looking functions in which we could take away all the gross stuff and then they all matched. It was just like it was so cool to hit that point.

[00:18:04] JN: Yeah. I'm now getting this picture of like decompilers. And folks like yourselves working on this actually just being compiler archaeologists and having all this cursed knowledge about very specific versions of GCC and how they do stuff. I guess one question I have is like are there decompilation projects out there that are blocked on like not being able to figure out the compiler? 

[00:18:23] ER: There are decomps that have been blocked on not having access to the compiler. In the GameCube scene, there's so many versions of the toolchain that existed throughout time. And there are pretty much known to be games that are on a version that we just don't have. 

In the N64 space, I think we're lucky that we have libultra - sorry, libultra being the SDK - libultra and all of the compilers of the time. And GCC is open-source. Most of the GCC versions are publicly available. I don't think for N64 that's been the case. Mark, do you know of any?

[00:18:57] MS: No. I don't think so. I think it's a bit more tricky in the PlayStation scene. There's a particular version of GCC, 2.6.3, which we know exists. And it came on a floppy disc. And no one has the floppy disc. We've basically built our own version: just download the source code for GCC 2.6.3 and apply some patches that we believe would match what the real version is. But, yeah, no one has that. Well, maybe someone has it lying around in a file cabinet somewhere.

But, yeah, this would then come onto the topic for later. But something that would prevent you from decompiling, let's say, a PlayStation 5 game, which I know is a lot more modern than N64, would be getting access to the toolchains. Also, these games are potentially massively more complicated and have hundreds of developers on them. But without access to the original compiler or a version of it, you're going to be pretty stuck really.

[00:19:52] JN: Yes. I mean, my next question was going to be you know why - you mention a lot of Nintendo 64. It seems a lot of this activity is focused on Nintendo 64. Is that because of toolchains? Or what's feeding into why you both target Nintendo 64 so often? 

[00:20:06] ER: It's where most of the tooling started. And it's where I grew up as a Nintendo kid. I didn't have a PlayStation until I was in high school. But my first console was a Nintendo 64. Non-portable console that is. But, yeah, there's just so many better tools for it and I just have such a fondness for a lot of the games. That's why I started with it.

[00:20:25] JN: Sure. 

[00:20:26] MS: Yeah. I think that's fair. I think older consoles may or may not have been - for a lot of, I guess, consoles before the N64, the games would have been written in Assembly rather than a higher-level language like C. If you start going back, there are kind of disassemblies for like, let's say, the first - the NES versions of Super Mario. But they're not really - they're just annotated assembly rather than - you can't port it to C because there was no - it wouldn't match because it was written in Assembly originally.

Yeah, a lot of the tooling would have come out of the Super Mario 64 project. And then because the N64 is a MIPS CPU, and PlayStation 1 and PlayStation 2 are also MIPS CPUs, a lot of the tooling can be reused or repurposed without huge changes to then open up to another platform. Whereas, for example, the GameCube is PowerPC. And so, you can't just use the same custom tooling that understands MIPS assembly or the MIPS instruction set and things and feed in some PowerPC stuff. You'd have to then add support for PowerPC to these things. And that has been done by people definitely smarter than me. But, yeah, having MIPS as a common layer has made like PlayStation and PlayStation 2 decomp possible or easier because you've got the foundation of the work being done for N64.

[00:21:49] JN: Yeah. The architectures - I've seen a couple of decompilation projects. And I don't know to what extent this is just a legal hedging versus everything else. But like making the distinction that like, "Hey, this is just decompilation. This is not a PC port." The effort here and the intention here is not to get it running on PC. 

I'll just go to a higher level on this. When you're writing the C function that you're getting into Assembly, to what extent does that code that you're writing like look like game code? Like that tells the story of the game versus like looking like obfuscated code, code golf C, right? Do you know what I mean? To what extent is this code readable as something you might write yourself if you were creating that game from scratch? Does that make sense as a question? 

[00:22:33] ER: Very much. I'll start by just saying that it can be anything under the sun from it looks like complete nonsense, and like not only not like game code but not even like code a human would write, to looking like, "Wow. This is like really nice. Well-documented. I can read this piece of code and understand the context in the game."

A lot of that has to do with just reverse engineering, raw reverse engineering completely unrelated to the decomp. Just figuring out what this function is doing in the context of the game. Figuring out what the symbols are that the function is maybe interacting with. Things like that. And different projects have different levels of how documented they are. 

This is actually an interesting topic. Because I think to most people who follow decomp projects, people like measure progress in the bytes decompiled out of total decompilable bytes. And we publish these numbers on fun graphs that are very like fun for us and for outside spectators because they're motivating, they're exciting to watch the graph go up.

But this graph only covers one tiny little aspect of a project. It doesn't cover anything about like how well the assets are understood, or how well the code is documented, or anything like that. Yeah, there's always so much work to do on these projects in terms of like making the code usable and understandable in my opinion.
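
[CODE SKETCH]

The progress number mentioned above, bytes decompiled out of total decompilable bytes, is simple arithmetic. A minimal sketch, assuming a per-function table of sizes and a matched flag; the function names and sizes here are invented.

```python
from typing import Dict, Tuple

# function name -> (size in bytes, has matching C?). All values are invented.
FUNCTIONS: Dict[str, Tuple[int, bool]] = {
    "func_80000400": (0x40, True),
    "func_80000440": (0x120, False),
    "func_80000560": (0x80, True),
    "func_800005E0": (0x200, False),
}


def progress_percent(functions: Dict[str, Tuple[int, bool]]) -> float:
    """Percentage of code bytes that belong to matched functions."""
    total = sum(size for size, _done in functions.values())
    matched = sum(size for size, done in functions.values() if done)
    return 100.0 * matched / total if total else 0.0


print(f"{progress_percent(FUNCTIONS):.2f}% of code bytes matched")
```

Note that in this made-up table half of the functions are matched but well under half of the bytes are, and neither figure says anything about naming, documentation, or assets.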

[00:23:59] JN: Interesting. Okay. Yeah. To try and connect that back to what you were saying, Mark. I guess, at what point do these projects transition from like we're writing and understanding the code to now we can start tinkering with it and adding stuff to it? Now we can get it running on other platforms. What is the whole process? I guess, how far along does it have to be? But I guess that's - I don't - yeah.

[00:24:21] MS: Yeah. I can try and answer some of that. As you start decompiling functions - let's say you've decompiled one out of a thousand functions and it's a matching function. Obviously, if you then have that as your program, that's only going to work on potentially the N64 or an equivalent system. Because the other 999 functions are just raw assembly.

You need to basically decompile enough of the game that a particular code path is going to be all in C or whatever the higher-level language is. Because, otherwise, yeah, you just get to a point where you can't run that function. So, everything's going to crash.

We also have this concept of non-matching. You might have got a function decompiled and almost matching but some of the variable - the registers are switched or the lines reordered in Assembly. And so, whilst it's not a byte-perfect match, you can still compile it and run it and the game would still work as expected.

The next, like a level worse than that, would be a non-matching where there's an extra instruction or too few instructions. And so, what that would do is shift the contents of the ROM up or down by however many lines. And the problem there is that any references within the game - this pointer is at this memory address. It will now point to the wrong memory address. And so, the game will crash.

A lot of projects want to reach this point in the road where they are shiftable, which means they've identified all the pointers or hardcoded addresses in the game and converted them to symbols, so that if you add a couple of lines to a function and everything gets shifted, all those pointers are still correct. But that's a huge project in itself. Because it's not just the game code. You've got to point to the right assets. Like this image is at this address in the ROM. But the ROM's now shifted by, I don't know, eight bytes down or whatever it is. And so, it's a big process.
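
[CODE SKETCH]

Shiftability as described above can be illustrated with a toy linker. This is a sketch under simplifying assumptions: the `link` function, section names, and sizes are hypothetical, and a real ROM has many more kinds of references. It shows a hardcoded address going stale after a function grows by eight bytes, while a symbol resolved at link time stays correct.

```python
from typing import Dict, List, Tuple


def link(sections: List[Tuple[str, bytes]]) -> Tuple[bytes, Dict[str, int]]:
    """Lay sections out back to back; return the image and each section's offset."""
    image = b""
    symbols = {}
    for name, data in sections:
        symbols[name] = len(image)
        image += data
    return image, symbols


code = b"\x00" * 16    # stand-in for a function's machine code
texture = b"TEXTURE!"  # stand-in for an asset that the code points at

# Original layout: the texture lands at offset 16.
_, symbols = link([("code", code), ("texture", texture)])
hardcoded_pointer = symbols["texture"]  # a raw number baked into the data

# A non-matching function that is 8 bytes longer shifts everything after it.
longer_code = code + b"\x00" * 8
image, new_symbols = link([("code", longer_code), ("texture", texture)])

symbolic_pointer = new_symbols["texture"]  # re-resolved at link time

print(image[hardcoded_pointer:hardcoded_pointer + 8] == texture)  # False
print(image[symbolic_pointer:symbolic_pointer + 8] == texture)    # True
```

The hard part in a real project is the step this sketch skips: finding every raw number in code and data that is secretly an address.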

I'd also go back to saying that the motivation for decomp isn't necessarily to port it to another system. It could be just to understand the game more. There's been a lot of - like as Ethan mentioned earlier, like a lot of secrets have come out of the woodwork. I think it's Kirby. Someone who was decompiling that noticed that there was a particular function that was called if you have - it's a level select or something like that. Or actually a debugging menu. Where if you plug the N64 controller into the fourth port and hold some particular button press at startup, it will do like a level select, which you're never going to discover or very unlikely to discover just organically. There's that kind of thing.

[00:26:55] JN: That element of it really reminds me of I guess like people are very quick to like data mine like newer games, right? I think I only became aware of this through Elden Ring because there's so many channels that are like looking into what's actually happening in Elden Ring through data mining. That's really interesting. Are there any particular - you gave that Kirby example. Ethan, do you have any examples of like secrets that you found in projects you've been involved with? Ocarina of Time. Are there any juicy ones in Ocarina of Time?

[00:27:19] ER: The thing about Ocarina of Time is that I played that game as a kid and I like it. But I never finished it. And I jumped in on this project full of people who are so passionate about that game. And I did a lot of decomp. But I, to be completely honest, didn't understand most of the context of what I was decomping other than this is the code for the chicken or whatever. And so, my knowledge of that game engine and that game is far less than one might think if they see how much I worked on that project. 

I will say, for Paper Mario, we found at the very end of the ROM there's a bunch of the world maps or the areas, we call them, are like each screen has like a bunch of C files associated with it and they're in these like groups. And they're maybe kmr_02. KMR is like the area and then two is the map. 

We would find at the end of the ROM a bunch of C code that seemingly wasn't ever executed by anything. And then when we looked into it a bit more, we found out that it looked very similar in structure to some of these maps. However, there were some slight differences. There were some debug statements added and things like that. And what we think it is is we think it's like either an old version of map code or it's like debug stuff left in and then like accidentally left at the end of the ROM during the ROM copying process. 

We found like this like cool like archaeological site of like old debug code. There's also some really crazy anti-tamper code that actually like the code itself is looking at the instructions of the ROM and then performing like a checksum just to make sure that nothing has been moved around or anything like that. That was pretty fascinating as well.
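
[CODE SKETCH]

The anti-tamper idea mentioned above, code that reads the game's own instructions and checksums them, can be sketched generically. The actual algorithm used in the game is not described in this conversation, so the additive 32-bit checksum below is purely an assumption for illustration, as are the instruction words.

```python
import struct


def checksum_words(code: bytes) -> int:
    """Sum big-endian 32-bit words, wrapping at 32 bits."""
    assert len(code) % 4 == 0, "expects whole 32-bit instructions"
    total = 0
    for (word,) in struct.iter_unpack(">I", code):
        total = (total + word) & 0xFFFFFFFF
    return total


# Made-up instruction words standing in for a protected region of code.
region = struct.pack(">4I", 0x27BDFFE8, 0xAFBF0014, 0x0C000100, 0x00000000)
expected = checksum_words(region)  # value a developer would bake into the game

tampered = bytearray(region)
tampered[7] ^= 0x01  # flip one bit, as a patch or hack might

print(checksum_words(region) == expected)           # True
print(checksum_words(bytes(tampered)) == expected)  # False
```

A check like this is also why matching matters: a rebuild that moves or changes even one instruction in the protected region would trip it.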

[00:29:00] JN: I think it was James - or a guest who mentioned some wild anti-tamper code where it was like checking - one of the consoles had like - when the logo for the console, when that Nintendo logo popped up at the start.

[00:29:11] MS: Game Boy. Yeah. 

[00:29:12] JN: Yeah. Some of that stuff is crazy. Talking about looking at the past code and the things in it. We've had this conversation a couple of times where talking about games programming, kind of this idea - as a software engineer, you're often thinking about maintenance. And the onboard quality of the codebase and it all being very architecturally rigorous and stuff. And just that's not often a concern in games. And it's very freeing. You can just write the garbage code and it goes in there. Obviously, these games were enormous works of art and very popular. Is this a correct observation? Or are these games all gar - is the code all chaotic?

[00:29:45] ER: Yeah. I have a strong opinion on this because it's very common in our community I think for people to kind of make fun of this code. And I get it. Some of the code is like garbage. And it's really cool also to see certain programmers. If you know who wrote different systems, you can kind of identify trends. But I also think we unfairly hold the developers of the 90s to a much higher standard. 

We know so much more about software engineering today. And, especially, we know so much more about the N64 itself. Something I was thinking about that I should say in this interview is that I'm totally certain that there are people in our community who know more about how the N64 works than anyone did at the time. And that sounds weird, but it's definitely true. There are people who are just like crazy dedicated to understanding the internals of how the microcode works and all that stuff. Yeah. 

But on that note, Paper Mario has like seven, or eight, or nine domain-specific script languages that get run by - they have interpreters that run in the game in real-time. And there's a scripting language for the coin sparkle in the HUD. There's like next to your - at the top when it shows how many coins you have. There's like a script that identifies like how that sparkles. And that scripting language is only used for that one script. Who knows what they were thinking when they made that? But, yeah. There's stuff like that where you're like this could be done so much more simply and save more resources. Yeah. 

[00:31:19] JN: - pushes frames? It's fun. 

[00:31:22] ER: Yeah. It can be really fun when things are like overly complicated too. Because you're trying to understand like what they were thinking and what the motivation for certain things was. Yeah.

[00:31:33] MS: Yeah. I feel like we would spend potentially more time - well, almost definitely more time decompiling a function than the developer would have spent writing it. And we've got - normally, when you're doing the reverse engineering or the decompilation, you'll have your IDE or whatever with some C code and then another window with a diff of the binary versus what your C code is output to. And so, you're staring at the assembly. And the developers are never going to be doing that. They'll just write the code and then compile it and hope for the best. 

And so, I'm not sure if it's the same for GCC. But it's definitely true of IDO. Where if you forget - or if you write a floating-point number, like 1.2, if you forget to put the F suffix to make it a float, it's a double. And so, a double is going to have a lot more code or a lot more - requires a lot more processing than just a floating-point number. And/or if they're converting it to an integer instead and there's just more code. And so, if you're working through a function, you realize that, "Oh, that was 1.2 as a float. That was 1.2 as a double. That was 1.3." There's all this extra code that is being run and they didn't need to do it. Or they could have used 12 and then divided it by 10 at the end or something like that where - but we're spending all that time looking at this line and looking at this chunk of assembly that one line is making. But that's just a much more lower-level version. I haven't got to the point of the project I'm working on to fully understand all the machinery of all these functions, how they're working. And is this function actually spending 20% of the CPU time and it's really dumb? I'm looking much more at the granular level of, yeah, these two lines are doing this. But, yeah. As Ethan mentioned, I think the devs get a hard time from keyboard warriors these days.
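
To make the float-versus-double point concrete, here is a small sketch of how the same literal is encoded at each width. It only shows the constants a compiler would have to embed, not IDO's actual code generation, and the helper names are made up for illustration.

```python
import struct


def encode_float(value: float) -> str:
    """Big-endian IEEE 754 single-precision bytes of value, as hex."""
    return struct.pack(">f", value).hex()


def encode_double(value: float) -> str:
    """Big-endian IEEE 754 double-precision bytes of value, as hex."""
    return struct.pack(">d", value).hex()


as_float = encode_float(1.2)    # what a 1.2f literal would be stored as
as_double = encode_double(1.2)  # what a bare 1.2 literal would be stored as

# Rounding 1.2 to single precision and widening it again gives a slightly
# different value, so the two literals are not interchangeable when matching.
widened = struct.unpack(">f", struct.pack(">f", 1.2))[0]
```

The single-precision constant is four bytes and the double-precision one is eight, and arithmetic on the double form additionally needs conversions whenever the surrounding variables are floats. That is where the extra assembly comes from.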

[00:33:15] JN: Yeah. It's kind of sad to hear they get a hard time. I always think of it as like the important thing was that these games run and that they were playable, right? And that it's a freeing environment as a programmer because you're not stressing the small things. You're just getting it to work. It's sad to hear that people are reviewing these games in that way so far on. 

Yeah. Even what you said about people know the consoles now. I 100% believe that. You see that in a single console generation even that like the games that come at the end just know the hardware so much more than at the start that they're able to do so much more of it. I can only imagine what people like making new N64 games today could do with their cursed knowledge. 

I want to get on to the tools you two have made. I don't want to spend too much time to this one question about something you just said, Mark. You're noticing these things in the code. You know people not correctly marking their types and that kind of thing. These things happen. Has there been - what happens if you encounter a bug? Do you - I guess, when you're at that matching process, is that even obvious what's going - what you've seen? Or is just like, "Oh, that code looks kind of weird. I'm trying to match like a weird thing. How does that work?" 

[00:34:16] MS: An obvious example would be here's a variable that hasn't been initialized. And maybe it's fine. A different example could be that the way that when you call a function in, I would say MIPS. But that's too specific or too vague. But there's registers that are set when you return from a function. 

And so, it might be that a variable is set just because of luck, that you've called that function. And so, yeah, the fact that this variable is - or a variable was used that was undefined. If you use a different compiler, that would produce a game-breaking bug, or a crash, or something. But the developers got lucky. I guess Ethan might have better examples of like specific cases of that. 

But what we'd often do would be to write or like to have like an #ifdef in the code of like this is undefined behavior. And so, if we were going to use a different compiler or if you were going to try and run it on a different system, you'd want to add an initialized variable for that and things like that.

[00:35:09] ER: Yes. In Paper Mario, there's a couple small bugs where it's like, "Oh, they forgot to use this enum value here." They used the same one in both places. They meant to put off on, and they did on on or something like that. Then those we'll just mark with - we use Doxygen, so we kind of generate docs for the project, and we annotate all the bugs. Then we have a mod-friendly fork of the project called Paper Mario DX, which serves as a base for modding. That tries to remove some of the crazy bugs. 

One really annoying one I'll just quickly touch on is, and speaking of bad programming because it's interesting, Paper Mario has a function called draw box that is utterly massive. I think it's the second or third largest function in the whole game, and it's responsible for drawing any box-looking UI element. The way that they do this with one function is by giving it 20 different arguments to change the coloring, the styling, the size, all these things. Draw box has a crazy loop in it that does some weird stuff and actually counts incorrectly and uses more memory than it should. 

The only reason that it doesn't break the game is because the memory map of Paper Mario is very kind of manually organized, and there's extra space in some places that there doesn't need to be. When we try to shift Paper Mario, which Mark touched on earlier, when we produce a version that you can move things around, we noticed that the game was breaking because this function was using more memory than it should have been able to. We actually had to be like, "Okay. No, we need to make this thing larger, so there's space for it." There are some more serious bugs and then some lighter bugs. Yes. 

[00:36:49] JN: Yes. No. I was going to ask on the sides of this, at what point do you say, "We're making this as an archival and a history lesson."? Like, "Here's the code as it was," versus, "Okay, we need to fix this thing." But I think you've touched on that perfectly. It's literally a bug that's hampering the decompilation effort. I guess at that point, you're forced to do something about it, right? 

[00:37:06] ER: For the matching build, we do still keep it the way it is, but we definitely produce - we make some sort of #ifdef, like Mark said, where you can build a version that does work. I just want to quickly say also you talked about what you just said about are we kind of preserving the original code. This is a really interesting philosophical conversation, and I think people approach it differently. The way I look at it is we're not trying to reproduce what the original code was. We're trying to find code that produces the original game but is ideally better in terms of more documented, more accessible, easier to understand. 

Especially since these projects are kind of for a global audience or global consumption, there's a lot of Japanese, English in a lot of these projects. We're trying to move things so that they can be more understood by people. Usually, we kind of Englishify certain things and just explain them, hopefully, better than they originally were, but yes. 

[00:38:01] JN: That's awesome. 

[00:38:02] ER: Cool. 

[00:38:03] MS: Yes. Just to add a bit more to that, the process of compiling a game or any code, you're going to lose information. So it may be that there was extra code - well, this is something we struggle with matching where there could be some extra code that was optimized out because it didn't actually affect the control flow of the function. But the compiler will still assign registers to those variables. Then later in the function, it's using different registers because the earlier registers aren't free anymore. So, yes, that's a bit where you're not going to get. Or you're definitely not producing the same code as the original developers. But, yes, it's some code that will produce the same result. 

[00:38:40] JN: Yes. It's the same output. Yes. It's a fascinating distinction. Okay, I want to make sure we get onto the decomp.me stuff before we - the decomp.me stuff. As we've covered a couple of times, this process of matching decompilation is very meticulous. I guess in some cases, it must even verge on tedious. What is the development environment? What is your toolset as you go about doing this? What tools are available to help this along? 

[00:39:03] ER: I'll just quickly run through them. Mark, you can pick up anything I forget. Around when SM64 was being decomped, some people made - [inaudible 00:39:12] and Simon Lindholm made this decompiler called MIPS to C, which is just a very good MIPS decompiler. It's since become M2C, machine code to C, because it now supports other instruction architectures besides MIPS. So it's used on PowerPC and stuff like that. We use that as a decompiler a lot of the time. 

We have an Assembly diffing tool called asm-differ. Like Mark was talking about, you see the original Assembly on one side and then the Assembly reproduced from your C code on the other. You can kind of visually see if they're similar, if they're different through a really nice UI and coloring and stuff like that. We also have a really cool tool called the permuter, which if we can't match a function, it takes our existing attempt at the function, and it just rapidly changes all sorts of different things in a very random manner and just runs the compiler as quick as possible. It does the diff, and it actually calculates a diff score. When the diff score of its permutations is lower than the original score from your input, it'll tell you, "Oh, here's a better match." Then you go and look at what it gave you. Then you kind of tweak it. That's a really crazy, fun iterative process. 
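
The loop described here can be sketched in a few lines. This is a simplified illustration, not the real permuter: the scoring function is a crude stand-in for asm-differ's score, `compile_fn` and `mutate_fn` are hypothetical callables supplied by the caller, and the toy versions below only shuffle a list of instruction names rather than rewriting C and invoking a compiler.

```python
import random


def diff_score(target_asm, candidate_asm):
    """Crude score: mismatched positions plus length difference (0 = match)."""
    mismatches = sum(1 for want, got in zip(target_asm, candidate_asm) if want != got)
    return mismatches + abs(len(target_asm) - len(candidate_asm))


def permute_search(source, target_asm, compile_fn, mutate_fn, iterations, seed=0):
    """Randomly mutate the source and keep any change that lowers the score."""
    rng = random.Random(seed)
    best_source = source
    best_score = diff_score(target_asm, compile_fn(source))
    for _ in range(iterations):
        if best_score == 0:  # byte-for-byte match found
            break
        candidate = mutate_fn(best_source, rng)
        score = diff_score(target_asm, compile_fn(candidate))
        if score < best_score:  # "here's a better match"
            best_source, best_score = candidate, score
    return best_source, best_score


# Toy stand-ins so the sketch runs without a real compiler.
def toy_compile(statements):
    return list(statements)  # pretend each statement becomes one instruction


def toy_mutate(statements, rng):
    mutated = list(statements)
    i = rng.randrange(len(mutated) - 1)
    mutated[i], mutated[i + 1] = mutated[i + 1], mutated[i]  # swap neighbours
    return mutated


target = ["lw", "addiu", "sw"]
attempt = ["addiu", "lw", "sw"]
best, score = permute_search(attempt, target, toy_compile, toy_mutate, iterations=200)
```

In practice a person still reviews whatever the search returns, since a lower score often comes from an unnatural-looking change that hints at the real fix rather than being the fix itself.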

[00:40:19] JN: That's so gross. It's perfect. It's excellent. Yes. 

[00:40:23] ER: Yes. Mark, do you - what else am I forgetting? 

[00:40:26] MS: Well, I was going to talk about splat, for example. Yes, in the early days of N64 decomp, you want a way to split out the ROM into, yes, the code, the assets, whatever else. There was a project called n64split, which if you Google N64 decompilation, there are still some websites that talk about that, although I'm not sure if it's actively maintained. Then, yes, as I was getting into decomp, Ethan started writing a spiritual successor called N64 splat in Python. It was just much more extensible, and this has been the foundation for a lot of N64 projects. 

Now, it supports more than just N64. So it's just called splat. But, yes, that will help you separate out the game code from the assets. The disassembly that it uses has come on a lot. In the early days, we used Capstone, an open-source disassembler engine. Now, one of the guys in the community has created his own MIPS disassembler, which is a lot, lot better, at least for our purposes, than the Capstone engine. It can identify or help you identify file splits. 

Right at the beginning, Ethan was talking about you'll take your list of Assembly files, and you want to separate them into C code, C files with lists of - include blocks in. The question there is, well, how do you know where the start of the C file is and the end of the C file. Thankfully, for N64, all of the game files, I'd say all of them, but most of them, are aligned on a 16-byte boundary. So you know if there's extra white space or nops, zeros at the end of a particular function to align it to a 16-byte barrier, a boundary, that's probably a new file. That's a way to split things out. Then there's other heuristics as well. But that tool is - it makes it very accessible to just start the project. Ethan's the one to thank for that. 
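
The alignment heuristic can be sketched like this. It is a minimal illustration rather than splat's or spimdisasm's actual logic: the function name and the `(start, end)` range format are assumptions, and it relies only on the fact that a MIPS nop is an all-zero word.

```python
def guess_file_splits(code: bytes, func_ranges):
    """Guess offsets where a new source file probably begins.

    func_ranges is a sorted list of (start, end) byte offsets for each
    function, with end exclusive and not counting trailing padding.
    A boundary is reported when a function is followed by zero padding
    (nops) and the next function starts on a 16-byte boundary.
    """
    splits = []
    for (_, end), (next_start, _) in zip(func_ranges, func_ranges[1:]):
        padding = code[end:next_start]
        if padding and next_start % 16 == 0 and not any(padding):
            splits.append(next_start)
    return splits


# Toy layout: an 8-byte function, 8 bytes of zero padding, then two
# functions packed back to back with no padding between them.
code = b"\x01" * 8 + b"\x00" * 8 + b"\x02" * 16
ranges = [(0, 8), (16, 24), (24, 32)]
splits = guess_file_splits(code, ranges)
```

A file whose code happens to end exactly on a 16-byte boundary leaves no padding behind, so this check alone would miss that split, which is one reason other heuristics are needed alongside it.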

[00:42:18] ER: Well, I also want to say, yes, I started splat. Alex from the guy who's - he's like a co-lead on the Paper Mario project. He tremendously helped me with working on that as well. Then Angelo, the guy who made the MIPS disassembler called spimdisasm, he's been helping with splat. Then Tun and Mark also has helped a ton with splat. There are so many contributors to all of these tools. We all kind of work on all the tools, and it's such a fun collaborative environment and with so many smart awesome people. There's also decomp.me which I could get into if you wanted to launch into that. 

[00:42:51] JN: Let's do it. Yes, yes. Tell me about decomp.me. 

[00:42:54] ER: Okay. The history of that is that, historically, when we're helping people match a function, we're taking a screenshot of asm-differ and saying, "Help me. What do I do?" That's just - I honestly don't know how people did it in the past. I was never very good at helping remotely like that. I always had to kind of set it up, reproduce their setup locally. Then there would be people who would say, "It would be really cool if we had a way to share functions online, so we didn't have to do all this." I remember being one of the first naysayers being like, "That's really hard to set up. You'd have to make a whole database, and it's too much work." 

Then a few years went by, and I'm a fiend for starting new tool projects. I just said, "You know what? No, I'm - screw it. I'm doing this." Alex jumped on that with me. He did all the UI, and it looks amazing, in my opinion, especially for an open-source project, but just in general. Yes. Then I started with the backend, and then I've gotten some help from Mark and Simon and some others. Yes. So it's a website that lets you upload a blob of Assembly, and then you have a kind of workspace to try to match it. It has basically all the tooling we talked about before. It uses the MIPS to C decompiler, or M2C decompiler. It uses asm-differ as a library. So instead of a terminal application, you're calling it through Python and then getting JSON back. 

Yes, it doesn't use the permuter yet. But someday, we want to add that. It allows you to kind of share progress. 

I've gotten a lot of people who have told me that it's enabled them to do decomp in general. I think a lot of people were afraid to - not afraid but put off by setting up all these different tools in the terminal. I get that, and it's so rewarding to me that people have found it so useful. You mentioned earlier, you thought it looked kind of it could be a Zachtronics game. The original idea for it was that we wanted to add and we might still, but we want to add points to it. When you match a function, you'll get points on your account based on how large the function is. We did want to gamify it, and we're still thinking about how to do that, but yes. 

[00:44:51] JN: Yes. Knowing very little about decompilation, I was very delighted by decomp. I guess this kind of comes back to what I said earlier about the two of you becoming cursed compiler archaeologists. I think at one point, in one of the FAQ pages you mentioned the Compiler Explorer project where you can look at the output of various compilers to understand them. I think it just looks like - I mean, I guess both decomp.me but also what you've done to make it accessible. But also, decompilation in general must just be such a wonderful, interesting way to learn about computer architecture and how things actually work. It seems very cool, and it seems it has a really active community, too. 

You talk a little bit. You mentioned various aspects of the community along the way that people work on the projects and the way you go about documenting. What is the community around decomp.me like?

[00:45:33] ER: We have a Discord for the website itself, and then we have a channel there where people can ask for decomp help. But mostly, the Discord is actually focused on site development and bug reports and stuff. Something that is unfortunate about the website that we want to improve, it's not a great entry point if you're not involved with a project already. It's kind of - it's a tool. It's not a community hub yet. We do want to change that. 

The cool thing about the community of decomp.me is that there are people who have - I didn't even realize there was a decomp for a game. Then suddenly, people are coming in the server saying like, "Oh, there's a problem with this compiler." Or like, "Could you make a -" We have these things called presets, which is it attaches a bunch of options to the name of your game. So you can say boom and just upload the code and get your options. They'll say, "Could we have a preset for this game?" I'm thinking, "Oh, I didn't even know that that was a process. That's awesome." 

We've also - the other cool thing that is really impressive about the website, I think, and we can thank Mark for this, is just how many platforms there are that we support on it. Mark has added several of them. I don't know exactly how many, but we support not only N64 and PlayStation. But we have Windows support now. We have portable consoles. Yes, it's becoming - there's just so much on there. There are so many of those platform communities that I don't really know much about at all, so yes. 

[00:46:49] JN: Yes. I mean, I'd love to hear more about that, Mark. How do you add support? What is happening for the decompilation step in the background? Is this literally running all of the compilers? How is it -

[00:46:59] MS: Basically yes. I mean, I'd say it's basically gluing together the existing tooling with a nice web front-end and a database on the back. Yes, the existing tooling. For example, well, M2C. If M2C has support for the Assembly or the platform, then great. Otherwise, you just start with an empty C function to start writing things in. But, yes, the main part is the differ library needs to understand the registers or the architecture and things like that. Once smarter people than me have added that kind of functionality, it's relatively easy to glue it all together and add another platform. 

The difficulty is sourcing the compilers, and that's normally the way. A lot of people say, "Oh, hey. I want to decompile this game. Or can you add this game or add a preset for this game to decomp.me?" Then we'll ask them or have they got a project already. Do they know what the compiler is and all these things? There's a lot of homework that someone should do first before we can add something to the site. This is one of the potential pitfalls of decomp is that it can sound quite exciting to start on a project, but it is potentially a huge undertaking. 

So it can get - there's been a lot of projects that start, and then you have some - like a burst of energy for a couple of months. Then the excitement fades when people realize that they're not one percent through the game in three months. Oh, that's going to take three, four years at this current rate. So people may or may not decide to abandon things. Yes, it's an interesting one. I guess that goes back to what I was saying about because these projects are so big, and there are so many different things you can do to stop feeling burnt out by just trying to match things. You can work on some reverse engineering or the tooling or help someone do a different project or just take a break for a couple of months entirely and things. 

[00:48:39] JN: Cool. Yes. I guess the same as any open-source project, burnout is a very real possibility. I'm now just - all the talk about needing to find various versions of compilers and just imagine a compiler floppy disk drop box being at a local library. It's like, "You found this compiler. Please drop your compiler off it." 

[00:48:54] MS: That would be fantastic because yes. I mean, that's the thing that's kind of interesting with like I don't know enough about or very much about the GameCube scene, but they have tried to write scripts or add-ons. Once you've compiled a function, then they do some manipulation of the file to better match it because they know the real compiler does this thing, but they don't have that real compiler. So they kind of Frankenstein some pieces together. 

A different example would be in the PlayStation decompilation. All of the tooling is 16-bit Windows binaries, so the early ones. You're running DOSBox or DOS Emulator or something like that to try and run things, which is pretty rubbish if you're trying to run it on a modern system. Some of those earlier versions of the toolchain, we can just compile an old GCC on one CPU, so use that. But then the assembler is still the Windows 16-bit DOS executable or whatever. I've written a tool that helps massage the output of GCC into something that a modern assembler like [inaudible 00:49:58] assembler can still match. So then you're actually using a completely modern toolchain rather than having to run some super old versions of things. 

[00:50:08] JN: Yes. It brings me on to - I guess as we get towards the end here, one of the questions I want to ask is you've covered these kind of, I guess, big leaps in tooling that made things a lot easier and different things that are coming on. What's the next big leap for the decompilation community? Is there anything coming up that you think will really help the process along or that you'd like to see?

[00:50:26] ER: Sure, yes. I have two things. The meme answer is AI, AI for decomp. Easier said than done. A few of us have looked into it. Because of the nature of what we're doing, we're going for very - our requirements are very strict. AI is very - language models, for example, are good at being creative. But we need something that matches. There are some papers on it, but I think it's a little far off before we have some tool that is usable by everyone for that. 

Something that one of my kind of aspirations is I wrote a tool for Paper Mario that detects duplicate functions within a program. We used it to kind of de-duplicate a lot of places where they probably had an include. Then they wrote one function, and it's in the game a hundred times. What we can do is we can determine where those all are, write one function, and then include it 100 times. Then we've just decompiled 100 functions. 

Then I took it a step further and I made the tool look for subsequences of matching code. So we could say like, "Oh, this function exhibits similar behavior to this other one. We haven't matched this one, but we have matched this one. We can copy this if else into here. Then, oh, all of a sudden, now we've matched this function." What I'd like to do is build a persistent database of C to ASM snippets, kind of a Shazam or SoundHound for code. My dream is that this thing can scan a big repository and then kind of piece together a bunch of stuff. If it can't give you the function, it could tell you like, "Look at this from here. Look at this from here." I think that will really kind of help a lot in this area. 
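
The exact-duplicate part of that idea can be sketched as follows. This is not Ethan's tool, just one plausible approach: it assumes functions arrive as lists of disassembled text lines, and it assumes symbols follow an address-based `func_XXXXXXXX` / `D_XXXXXXXX` naming style, which it masks out so that copies of the same function at different addresses hash identically.

```python
import hashlib
import re
from collections import defaultdict

# Assumed symbol style: address-based names such as func_80240000 or D_80241000.
_SYMBOL = re.compile(r"\b(?:func|D)_[0-9A-Fa-f]+\b")


def normalize(asm_lines):
    """Mask address-based symbol names so relocated copies compare equal."""
    return tuple(_SYMBOL.sub("SYM", line.strip()) for line in asm_lines)


def find_duplicates(functions):
    """Group function names whose normalized Assembly is identical.

    functions maps a function name to its list of Assembly lines.
    Returns a list of name groups, each containing two or more names.
    """
    groups = defaultdict(list)
    for name, asm_lines in functions.items():
        text = "\n".join(normalize(asm_lines))
        groups[hashlib.sha1(text.encode()).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


functions = {
    "map_a_init": ["jal func_80240000", "nop", "jr $ra"],
    "map_b_init": ["jal func_80250000", "nop", "jr $ra"],
    "unrelated": ["lw $v0, 0x0($a0)", "jr $ra"],
}
duplicates = find_duplicates(functions)
```

Finding shared subsequences rather than whole-function duplicates would need something more than a single hash per function, for example hashing sliding windows of normalized instructions.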

[00:51:57] JN: Yes. That'd be awesome. I guess going back to the AI point, I guess this is more just traditional machine learning than LLMs. Once you've got a big database of labeled C to Assembly, you can just hit that with machine learning of some kind, right? 

[00:52:11] ER: Sure. Yes, yes. 

[00:52:12] JN: Interesting. Mark, anything come up?

[00:52:15] MS: Well, I mean, not technology-wise. I think I'd like people who are thinking about doing it to try it. A lot of people will say, "Oh, it'd be great if this game was decompiled." Then my response is always, "Why not try?" A lot of people say, "Oh, it's too difficult." It's like, well, it is difficult but it is very rewarding, and you can learn a hell of a lot, and there's a great community to answer all your questions and to help you along. I'd say, yes, I didn't know anything about decomp when I just threw myself into it, and look at me now. 

Yes, I think there's potentially a lot of cheerleaders in this community of like, "Oh, yes. Great. Great work, guys. It'd be great if you do this." But, yes, just be the change is kind of the - my take on -

[00:52:54] JN: Yes. I guess I have two questions on that. You didn't know a lot about decomp. What was your C and Assembly and system architecture and compilers knowledge?

[00:53:04] MS: Very low. I could do a Hello World in C. I didn't really know any MIPS Assembly or didn't really know Assembly. It took a long time. Even now, I don't think I could write a function in Assembly. I was talking to Ethan about this offline, but we recognize patterns in the Assembly of like, "Oh, those three statements together mean this particular thing. Well, that's a cast from a long to a short. Or, yes, that's a floating-point operation things or cast or things like that." 

Back to the previous question where I talked about that PlayStation Assembly tool, I learned a lot more writing that tool around the Assembly and how things worked than I had in two years of doing decompilation. Yes. Things like M2C make a huge - it just takes that block of Assembly and writes something that can often just be almost a straight one-shot match, which is just kind of insane, whereas, yes, a Ghidra equivalent would just be almost a load of nonsense. Ghidra can be helpful. I know I [inaudible 00:54:01] it earlier, but it can be helpful in some scenarios where there are some awkward loops or things like that going on. But it will never give you the - anything like the matching or anything like code that would match compared to M2C. 

[00:54:13] JN: Cool. Then, yes, I guess my last question to take us out on that is so for folks who do want to move from cheerleader to actually doing it or who've listened to this and want to get involved, where should they start?

[00:54:23] ER: I want to start with a word of encouragement because like I said earlier, I mean, I have an engineering background, but I didn't think I'd ever be able to do this stuff. It takes time and effort. But if you're interested, I really encourage getting involved. It's extremely fun and rewarding. I think a lot of these skills have other applications beyond the matching decomp. 

We have a public decomp Discord that we can share with you, and I'd be happy to talk to anyone who joins that if they want guidance or anything like that. 

[00:54:51] JN: Perfect. We'll get it in the show notes. 

[00:54:52] ER: You can also join the decomp.me Discord and, yes, either one of those. 

[00:54:57] JN: Awesome. Anything to add, Mark?

[00:54:59] MS: No. I think those two Discords are the right way in. Yes. Even if you're not wanting to do any decompilation, there's plenty of help that we could take on the website and all the tooling in general, so yes. There's all the thing with all this work and not enough hands and not enough time, but yes. 

[00:55:16] JN: Yes. As evidenced by game developers writing a DSL for a single place in Paper Mario, developers love to write tools. So that is always an option. Perfect. Well, thank you both so much. I've definitely learned a lot and now need to immediately resist scratching some itches. This has been wonderful. Thank you both. 

[00:55:30] MS: Cheers, Joe. 

[00:55:30] ER: Thank you. 

[END]