EPISODE 1751 [INTRODUCTION] [00:00:00] ANNOUNCER: PCSX2 is an open-source PlayStation 2 emulator that allows users to play PS2 games on modern hardware. The emulator is remarkable for simulating the complex architecture of the PS2, which includes the Emotion Engine CPU, Graphics Synthesizer, and specialized subsystems. The emulator just hit a major milestone with the release of PCSX2 version 2.0. The release brings many changes, including a Qt-based interface, big picture mode, auto selection of graphics APIs, and native support for macOS. Tellow Krinkle is a developer for PCSX2 who ported the emulator to macOS, among other contributions. In addition to his work on PS2 emulation, he has also worked on Dolphin, which emulates the Nintendo GameCube and Wii. Tellow joins the podcast with Joe Nash to talk about how he got started in emulation, the PS2 architecture, the challenges of rendering PS2 games on modern GPUs, and more. Joe Nash is a developer, educator, and award-winning community builder who has worked at companies including GitHub, Twilio, Unity, and PayPal. Joe got his start in software development by creating mods and running servers for Garry's Mod. And game development remains his favorite way to experience and explore new technologies and concepts. [INTERVIEW] [00:01:29] JN: Welcome, Tellow. How are you doing today? [00:01:31] TK: Pretty good. I'm happy to be here. [00:01:33] JN: Wonderful. I want to start by asking you, we've spoken to folks in the emulator scene before and it seems people come into this whole thing from many different varied paths. How did you get started building emulators? [00:01:45] TK: Yeah. I think my path is probably one of the weirder ones. Well, to start, I guess I looked up a lot to emulators and emulator development when I was younger. And so, for a long time, I kind of wanted to try to work on the Dolphin emulator. And I never ended up getting around to it for a long time.
But then I guess after graduating college, I took a part-time job for a while and I had some extra time and I was like, "Oh. All right. Maybe I can work on an emulator." And I, at that point, kind of looked around and I was like, "Hey, look. It would be cool to port the PCSX2 emulator." At the time, it didn't support macOS. And I was using macOS as my main operating system at the time, I guess. I still am at the moment. And I was like, it'd be cool to have a Mac version of that. Right? And so, that was just kind of my plan. I was like, "Hey, I'll just -" I mean, I knew they previously had a macOS interface. They had old builds from probably four or five years ago. Maybe even older at the time. And so, I was like, "Oh. Well, I'll just -" it must be at least somewhat compatible. Not compatible. But - [00:02:53] JN: Some legacy of it working on the platform there. Yeah. [00:02:55] TK: Yeah. Exactly. And I was like, "Okay. Well, I'll just try to get it working." Yeah. I started that. It turns out it wasn't too bad to at least get the interface showing up. Not the actual emulation or anything. But just like, "Oh, we want to have an interface." There were some minor bugs with some of the - at the time, on Windows, they were using a mixture of wxWidgets and direct Win32 stuff. And then on Linux, they used wxWidgets too. Where Windows had Win32 stuff, they used GDK. And so, there is a GDK version for macOS. It's not amazing. It's kind of terrible actually. But it does technically work. And so, I just used that on macOS for the initial build. And we got some windows working. And from there, well, there were two things that namely stood out. The first is that, at the time, PCSX2 supported two graphics APIs: OpenGL and DirectX 11. Yeah, 11 at the time. And macOS had obviously - DirectX 11 is just a Microsoft thing. No one else supports that.
And then OpenGL, macOS had kind of abandoned it probably about 3 years before and completely stopped updating it to stay up to date with the latest features and stuff. And so, there were a whole bunch of like - PCSX2 had workarounds for lack of various features. And a whole bunch of them had been deleted from the emulator about four years before my first attempt. And my plan, having at this time no graphics programming experience whatsoever, was to attempt to undo the deletion of all of these fallback paths for lack of various features. Yeah. I was like, "Oh. Well, PCSX2," as pretty much all emulators, and software projects in general, are, is under a system called version control, where every change to the emulator is kind of tracked as like a sequence of changes. And so, you can go back to the old one and just be like, "Can I just undo that one please?" I started by just trying to undo all of the commits that had been there that were removing these feature workarounds. And for a few of them, the code had changed too much since then to just directly undo them. I kind of tried to fix those directly. And it didn't fully work. So I was still kind of stuck with the black screen. And so, at that point, I brought it to a Linux computer and I artificially disabled all of the features that weren't supported on macOS and then enabled them one by one to figure out which one my attempt to bring back the workaround was broken on. And, eventually, through that, was able to get at least an image showing up on macOS. [00:05:34] JN: That's a really interesting way to get into an open-source project. Like getting into the git blame and bringing back some old features. That's really interesting. Also, fascinating to hear that you weren't into graphics programming before you got into the emulator. Because having looked a little bit at some of the stuff you work on, and some of the things you work on outside the project, it seems very much now that graphics is your world.
[00:05:54] TK: Yeah. Since then, I've done a lot of graphics stuff. But, yeah. When I started on the emulator, that was actually my first usage of OpenGL. And I kind of was blindly going through there being like, "Oh. Well, they had this before. It probably works something like this. We'll just try this." I had like no clue what I was doing. And didn't know how to fix my black screen without just trying to compare it to it running on Linux to figure out which feature was breaking it. [00:06:21] JN: Amazing. Okay. I want to get into talking about the emulator itself and how it works. And your work on it, especially on the Graphics Synthesizer. But I guess first, because I know it's a fairly wild platform, I guess to talk about the emulator, we should talk about the PlayStation 2 as a system and architecture. Can you tell us a little bit about how the PlayStation 2 worked and I guess all the pieces? [00:06:42] TK: Yeah. The PlayStation 2, it's like three or four processors all kind of connected via a DMA, a memory copy engine. There's the main CPU, which is known as the EE or Emotion Engine. And that's like a 300-megahertz MIPS CPU with a few weird instructions on it. But then they have attached to that CPU, or I guess sitting nearby, two vector units, which are each this like custom instruction set that is just used for vector processing. If you've heard of, whatever, SSE on like Intel CPUs, which allows the CPU to operate on four values at once with each instruction, they're kind of similar to that except for it's like a very dedicated instruction set that doesn't really have much else outside of being able to do these vector float operations. [00:07:34] JN: Is it kind of like the kind of thing you'd use CUDA for? Sorry to interrupt. But is it like kind of in that - [00:07:38] TK: Yes. [00:07:39] JN: Yeah. Okay. Cool. [00:07:39] TK: Yeah. Those VUs. There were two VUs. There's VU0, which is kind of meant to be used as like a co-processor for the CPU, for the EE.
And so, the EE can kind of like just reach into its registers. It can actually use it like as a co-processor - it can just like execute single instructions on it directly. Or you can send it like a little micro-program that'll just run through 20 or 50 instructions. Usually something like that. And then VU1. The two VUs have the same architecture. But they're just kind of like attached to different things. And, therefore, expected to be used differently. VU1 is slightly further, I guess, from the CPU. And it's kind of meant to be used similar to what in a modern graphics card would be the vertex shaders. It's got like a thing that allows it to load stuff from memory into its like internal working memory. And then it would run the program over that. And then there's an instruction in it that just takes whatever is in the internal working memory and just sends it to the next chip on the PlayStation 2, which is the GS, the Graphics Synthesizer. You can think of that as the - well, the PlayStation 2 is old enough that it doesn't have actual shaders. But you can think of it as doing the part of the graphics workload that would now be done by fragment shaders. It starts by receiving just lists of triangles that have already been transformed and are in coordinates that are just like 2D coordinates on the screen with a Z value only for depth testing and nothing else. The coordinates are already in 2D and it just rasterizes the triangles and then colors them in with the texture. And that's about it. [00:09:15] JN: Right. God. There's a lot going on. I guess the console really came - you just said that it doesn't have actual shaders. I guess the development of this console is at such a point where so much of what we now take for granted going into a 3D game is still being decided, so the architecture is just kind of all over the place. [00:09:33] TK: Yeah. Pretty much. Yeah.
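As a rough sketch of what those vector-unit instructions do, here's the 4-lane idea in plain Python. The function names are illustrative stand-ins, not actual VU mnemonics:

```python
# A VU register holds four 32-bit floats (x, y, z, w), and one
# instruction operates on all four lanes at once. These helpers are
# illustrative stand-ins for that behavior, not real VU opcodes.

def vmul(a, b):
    """Lane-wise multiply of two 4-float registers."""
    return [x * y for x, y in zip(a, b)]

def vmadd(acc, a, b):
    """Lane-wise multiply-add (acc + a * b) -- the workhorse of
    vertex transforms, since matrix * vector is four of these."""
    return [c + x * y for c, x, y in zip(acc, a, b)]

row = [2.0, 0.0, 0.0, 1.0]      # one row of a 4x4 transform matrix
vertex = [3.0, 4.0, 5.0, 1.0]   # a vertex position
print(vmul(row, vertex))              # [6.0, 0.0, 0.0, 1.0]
print(vmadd([1.0] * 4, row, vertex))  # [7.0, 1.0, 1.0, 2.0]
```

The point is that one "instruction" touches all four lanes at once, which is why SSE is the natural comparison.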
It has something you could think of as similar to a vertex shader but no fragment shaders, which is kind of funny. It's kind of like the opposite of the GameCube, where they have kind of fixed function vertex processing. And then they have a semi-programmable fragment processing. [00:09:49] JN: Interesting. Okay. I guess this is just very like PC-brained of me. But it's also really interesting how the various parts of the graphics processing are split across so many different pieces of hardware. I guess we're so used to now being like there's the big monolithic graphics card. Right? [00:10:02] TK: Yeah. There's the GPU. And it processes things. Yeah. I know. Back then, the split was kind of different. And so, the vertex shaders are much closer to the CPU. In fact, so close that when we emulate it - I know we've heard people asking us like, "Why can't you put the VU stuff into vertex shaders?" And it's like, the VU can so much more easily communicate with the EE than a modern computer's vertex shaders can communicate with the CPU that that just wouldn't work. [00:10:30] JN: Okay. Yeah. That I guess brings me on to my next question, which was what is it like to emulate this architecture? I imagine there's tradeoffs with the fact that you're working on - I imagine there's a lot of stuff going on in parallel in that architecture that's difficult to capture. [00:10:41] TK: Yes. That tends to cause a lot of fun for us. Actually, I think with the exception of a few games that really abused the ability to very tightly synchronize between the EE and the VUs. And there are a few games like that that we just like can't emulate properly because they expect ridiculously accurate cycle timings between the various co-processors. We try pretty hard to track cycle timing within a single processor. The EE may be a little less so. But the VUs, we very tightly track it because they actually kind of reveal some of their internal timings to the programs.
And so, you kind of have to - which I can go into a bit more later if you want. [00:11:20] JN: Yeah. That's fascinating. Yeah. We can come back to that. [00:11:23] TK: Yeah. But outside of that, there are only a few games that really abuse that hard enough that we break on those. The other fun ones are some of the instructions themselves on some of these things do things in weird ways that are hard to at least quickly emulate. As an example, I think one of the more famous ones is the PlayStation 2's floating-point math. Nowadays - even on the GameCube, I remember Dolphin had a blog post on one instruction that was slightly off from the way you're supposed to do it according to the standard. And it was breaking all their replays of Mario Kart stuff or whatever. But that was about it. And that was like one teeny little difference in how they were handling, I think, a fused multiply-add. In the PlayStation 2's case, the floating-point math just completely ignores what's now standardized as how to represent a 32-bit floating-point value. [00:12:18] JN: Yeah. I think I saw an offhand comment about this. It might have been on the PS2's Wikipedia. But their floating-point's not to the IEEE standard at all. Right. Okay. Yeah. [00:12:25] TK: Yeah. It's kind of amazing. The first thing. In PC floating-point, right? The way it's set up, the 32 bits are split into a single sign bit to say whether it's negative or positive. Then 8 bits for an exponent. And then 23 bits for what is usually called, I think, a mantissa. Pretty much, it's the computer equivalent of scientific notation. Instead of saying a number, like, "Oh, it's 57," you instead say it's like 1.something times 2 to the 32. Right? And that's how floating-point values are able to represent both really large values, like 10 to the 300ish, as well as ridiculously small values like, well, 10 to the negative 300ish. That's kind of the representation.
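That sign/exponent/mantissa split can be poked at directly. A quick sketch, using Python's `struct` module to reinterpret the bits of a 32-bit float:

```python
import struct

def decompose(f):
    """Split a 32-bit float into its sign (1 bit), exponent (8 bits),
    and mantissa (23 bits) fields."""
    bits = struct.unpack('<I', struct.pack('<f', f))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

# 1.0 is +1.0 x 2^0; the exponent field stores 0 plus a bias of 127:
print(decompose(1.0))   # (0, 127, 0)
# 57.0 is +1.78125 x 2^5, so the exponent field holds 127 + 5 = 132:
print(decompose(57.0))  # (0, 132, 6553600)
```

The mantissa field holds only the fraction after the implied leading 1, which is why 1.0 stores all zeros there.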
But on the standardized floating-point, they have a whole bunch of extra things to handle edge cases. As you go up near the very top of the floating-point area, the highest exponent field value that can go into that field is used for storing infinity and not-a-number. Where infinity indicates that you went above a certain value, and it's there to like kind of stick around. If you go above the maximum float, it'll go to infinity. And then even if you try to divide that in half, it'll stay infinity. And if you try to divide it more, it'll stay infinity. And then not-a-number is for like things that go even more off the rails. If you try to take infinity and multiply it by zero, that's even more ridiculous. And it's like, "Okay. That turns into not-a-number." And then once you have a not-a-number, it just kind of like spreads around. Because any operation that includes not-a-number spreads the not-a-number to indicate to you, hopefully, by the end that, "Oh, by the way. Something went wrong in this calculation. It's now not-a-number. We have no clue." Right? And the PlayStation 2, they were just like, "Who needs that?" And so, the highest exponent value is not for infinities and not-a-numbers. It's just one higher exponent than the previous, which means that you can represent floating-point values that are twice the size of - because, remember, this is exponential. Twice the size of the maximum PC floating-point can be represented by the PlayStation 2 in their floating-point. That's the first part of the issue: any number that goes slightly higher than the maximum PC floating-point now accidentally becomes infinity on PC. And it's kind of a minor issue when it's just, "Oh, they're big numbers." But when you remember that infinity is meant to kind of like spread, that's when things get really messy. Because if you multiply by zero on a PlayStation 2, you're guaranteed to get zero back. Right?
Every number being a not-infinity number, you multiply it by zero, you get zero. But on a PC, you multiply infinity by zero, and you get not-a-number. NaN. [00:15:13] JN: You've just opened a door to me. I feel like PlayStation consoles in the past have been the source of more - PlayStation to PC porting has been the source of some really cursed ports. And I didn't know that this was an issue. And now I'm just like, "Oh, God. The amount of errors this brings up." [00:15:29] TK: If you want some fun on that specifically, the Dolphin blog has a blog post about a game that was ported from PlayStation 2 to GameCube. It was - whatever. True Crime: New York City, I think? Where they mentioned that this is one of the games that in Dolphin ended up making them have to emulate floating-point exceptions. Because, of course, this traces right back to the PlayStation 2 floats that you can multiply by zero. Whatever. And so, in their case, I think - when they ported to the GameCube, their AI must have broken. And so, they added a division by zero exception handler that would just say, "Oh. Oh, wait, you just divided by zero? Let me replace -" the correct answer there was, I think, just zero. Or maybe it's - I don't remember what the exact number is. But whenever they accidentally divided by zero, they just caught the exception and replaced the number with a benign thing that wouldn't blow up the thing. And then Dolphin had to make sure that they could actually emulate these division by zero exceptions all because of a game that was scooted over from a PlayStation where division by zero didn't result in - I guess in this case, it would be infinities or not-a-number values that stuck around. [00:16:38] JN: Incredible. Okay. That was one. So - sorry. I interrupted you mid-flow. We had the issue of numbers can be much bigger on PlayStation. And then, of course, when you are working on a different platform, you've got the propagating infinities.
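To see the mismatch concretely, here's a small sketch in Python (whose floats follow IEEE rules like PC floats do), along with the min/max clamp that PCSX2's clamping modes use to pull values back into the finite range. This is an illustration of the idea, not PCSX2's actual code:

```python
import math

FLT_MAX = 3.4028235e38  # largest finite 32-bit float

# On PC, a 32-bit float that overflows becomes infinity, and infinity
# times zero is NaN -- and both spread through later math. (Python
# floats are 64-bit, so we write the overflowed value directly.)
x = math.inf
print(x * 0.5)              # inf -- infinity sticks around
print(math.isnan(x * 0.0))  # True -- and inf * 0 becomes NaN

# The PS2 has no infinity: the same bit pattern is just a very large
# number, and any finite number times zero is zero. Clamping every
# result into the finite range restores that behavior:
def clamp(v):
    return max(-FLT_MAX, min(FLT_MAX, v))

print(clamp(x))        # 3.4028235e+38 -- just a big, finite number
print(clamp(x) * 0.0)  # 0.0 -- multiplies by zero "properly" again
```

The clamped value isn't the number the PS2 would have computed, but at least it behaves like one.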
And then I think you were about to say there's another issue that arose from this. [00:16:52] TK: Yeah. I guess I'll just mention some of the things we do to work around that. Yeah. Because of that, PCSX2 just has - if you look through, I think, the advanced settings or something, there's a number of like clamping modes. And so, what that is, is after every calculation we do with these floating-point values, if you have the clamping modes enabled, it will just do like a min and max with the PC min and max floats to just remove all of the infinities and turn them into just large numbers, which will at least multiply by zero properly, in the PS2's opinion of properly, even if they aren't quite the same number. That's how we try to deal with it. And it doesn't work for every game. And so, I think we have a bunch of things to enable various modes on our thing for specific games when we know which ones make them break less. But there are still games where like the AI just doesn't quite work correctly because we aren't quite doing the math correctly. Then going on to the one last fun thing with the PlayStation 2 floats. That problem was with the really big numbers. And so, the other problem we have is with little small changes in the low bits of a number. Low bits being like the ones that make the least difference. When you have 1.0034, whatever. That last digit. When the PlayStation 2 does math - the IEEE standard, the standard behind PC floating-point, has a thing where they're like, "Oh, you should -" or at least for the default mode. When you do math, like adding or subtracting values, you should round the result as if you had infinite precision when you did the calculation and then rounded it to the nearest float. And to do that, they employ I think three extra bits of precision when they're doing the calculation. I think one tracks if you're below or above 50%. And then one tracks if anything below that has ever gone above zero.
Pretty much it. It's to be able to do a round-to-even rounding, right? If you think of doing a round-to-even, if it's below 0.5, you round down. If it's above 0.5, you round up. But you have to know if it's exactly 0.5. At which point, in the case of floating-point math, you round towards the even number. And so, that requires an extra three bits in every calculation. And, apparently, whoever was designing the PS2 was like, "That's too many bits to keep track of." Right? And so, they just don't. I think we have figured out that their addition uses just one guard bit - that's what those are called. And we had to match that. Because, apparently, some game does decryption of its game binary using the floating-point math unit. And so, if you do it even slightly wrong, it just breaks everything. Because they're decrypting data or something like that. We specifically have a thing in there to truncate the numbers before we do math with them to make sure that these extra bits of precision are just deleted. And then the other one is multiplication. I don't think anyone's actually fully looked into exactly what it does in its math. But what we do know is that if you multiply a number by one - usually, you'd expect that multiplying a number by one will get you that number back. But according to a PlayStation 2, there are certain numbers that, if you multiply by one, they will change slightly. [00:20:09] JN: How many certain numbers are we talking here? Are we talking a range? A very big or weird thing? Are we talking one specific, very precise number? [00:20:16] TK: I think it ranges. But one of the other PCSX2 devs, named Fobes, went over this on their blog. I think they were actually going to be in the interview but they couldn't make it in the end. It would have been cool to have them explain that. Because I'm not quite as familiar with it, other than the fact that there are definitely numbers that you can multiply by one and they'll just lose a bit.
They'll just shrink by one bit. Oops. [00:20:41] JN: Oh, dear. You mentioned the Graphics Synthesizer earlier as an area that you've worked a lot on. How did you get into that particular area? Was that an accident because of starting with the OpenGL thing? Or was there something about that chip in particular that interested you? [00:20:53] TK: Yeah. Initially, I like worked a little bit on it to get OpenGL working and stuff. And then I was like, "Oh, hey. It works. Yay." And I kind of set it down and went to work on other things actually. I guess I didn't mention this earlier. But the next thing I worked on was - Apple had just like wiped 32-bit support off of their operating system the year I started working on this. And so, I'd done a lot of my work originally by just using an older computer that hadn't gotten 32-bit support wiped off of it. But then to actually get it working on like my main laptop, the next thing was like, "Oh, let's actually get this working on 64-bit." I kind of like went over there for a while. And that's actually when I - my first ever PR to the project was actually for the 64-bit support. Mac stuff came later, mostly because it was kind of like, well, that was a project that didn't affect anyone outside of macOS. And, also, because the developers were like, "We deleted all these things four years ago because we were kind of sick of maintaining them. We're not sure we want all those hacks for really old OpenGL stuff back." And so, one of the things that was in the back of my mind at that time was like, "Oh. Well, the reason Apple abandoned OpenGL is because they want everyone to use their new API, Metal. Maybe I should consider writing a Metal renderer for PCSX2." That was kind of in the back of my mind for a while. That was kind of, I guess, part of it. And then the other was during my time working on the 64-bit support.
That's when I kind of got brought into the rest of the PCSX2 team and joined their Discord, whatever. Said, "Hi." Got to know everyone pretty well by the time the 64-bit stuff actually got merged. Because that took a little while. And so, one of the other people in the Discord did a lot of their work on the graphics stuff, and it looked kind of fun. Right? And so, I think I played around with some small fixes to the thing before I attempted to write a Metal renderer for PCSX2. And, yeah, the first attempt didn't actually go especially well. The one that's now in PCSX2 was attempt number two. It kind of went from there. It was during my first attempt at writing the Metal renderer that I found a lot of things that I kind of wanted to change about the graphics thing to make it more friendly to adding new renderers. And so, a little while later, I went through and tried to do an overhaul to make it easier to add new renderers. And then added the Metal renderer. The final version. The actual published version. [00:23:20] JN: Yeah. That's always a very, very satisfying [inaudible 00:23:22] project. Making something extensible. I think it's a nice area to work on. You mentioned a bunch about Apple graphics as we've gone along. And a bunch of it has been very surprising to me. I knew they had the Metal API. I guess I didn't realize how quickly they deprecated everything else. I know very little about the world of Apple graphics. And especially now that they're completely over in custom chip land, I feel like it's completely invisible. Could you tell us a little bit about how Apple approaches graphics processing, I guess? [00:23:49] TK: Yeah. From the API standpoint, especially earlier when they were still running Metal on the same graphics chips that everyone else was using - obviously, the hardware could do the things that OpenGL supports. Because they were doing them in OpenGL. And so, obviously, the feature set was kind of similar to newer OpenGL.
The OpenGL Apple supported was just missing major features. You've probably heard of compute shaders, right? The ability to run random things that aren't graphics on your GPU. Yeah, Apple's just missing that from its OpenGL because it got introduced after - I think Apple abandoned OpenGL in 2009. Pretty much everything since then, they don't have. Yeah. Metal. If I had to describe it, I'd say it's most similar to a modernized DirectX 11. If you think of how Vulkan and DirectX 12 are very different from DirectX 11 - Metal is kind of like if you tried to bring the important things that are in Vulkan and DirectX 12 to DirectX 11 instead of doing a whole new API. That would be my description of Metal. [00:24:54] JN: Yeah. That sounds very Apple-y to me. [00:24:56] TK: Yeah. The other very Apple thing that they did is they just were like, "Anything legacy, goodbye." And so, everything that you're kind of recommended not to use in the future from DirectX. Things like geometry shaders is a really big one. Thankfully, this didn't really affect PCSX2 too much. But, yeah. Geometry shaders was the big one. Metal was just like, "Nope. No geometry shaders." Because, previously, everyone was like, "Right. They added geometry shaders." And then they turned out to be like really slow. And so, they're like, "Don't use geometry shaders. But they're still here." And so, Apple's like, "Nope. No geometry shaders. Sorry," which caused a lot of people a lot of pain. Including us for a bit. Because PCSX2, while we didn't use geometry shaders for too much, we did use them for a few things. And so, that was probably the last thing. That's what took the longest to come to the Metal renderer. For a long time, we didn't have the ability to upscale points and lines in the Metal renderer, because that required geometry shaders. And by required, I mean, our code for it was using geometry shaders.
We obviously didn't require them, because we switched off of them. [00:25:57] JN: Right. As in, not like an actual law. The only way to implement that was with geometry shaders, which is what you were doing. [00:26:02] TK: Yeah. It was a very common thing especially. Because it's the straightforward thing. Right? You have a point. And then you send your point to the geometry shader, which turns your point into four vertices of a triangle. And that sounds like the kind of thing that a geometry shader would be good for, because that's what they were for. Right? But it turns out, apparently, they just weren't that fast at even simple things like that. It was still faster to - if you're wondering, the new version, which I actually learned from an AMD presentation, on how to do this faster than using geometry shaders on all GPUs, is in your vertex shader - you take advantage of the fact that newer GPUs can just kind of load from memory however they want. Modern GPUs either have or pretend to have dedicated hardware for loading vertex data. And so, then you just ignore that hardware, or the pretending of that hardware as it may be - I think AMD GPUs only pretend to have it. I'm not sure. Apple GPUs definitely only pretend to have it. You ignore that hardware if it even exists. And just manually fetch the vertices. And so, instead of using a shader to expand one point into four vertices, you say there's four vertices per point and then you divide the vertex index by four. And so, the four adjacent vertices load from the same point, the same single point's data. And then just offset it. And it ends up working out and being faster than geometry shaders, at least according to AMD's presentation that I watched. I never actually benchmarked it personally. [00:27:31] JN: All the low-level details of graphical techniques always just seem like witchcraft. That's awesome. Yeah. Good to understand the whole Apple API journey.
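The divide-by-four trick can be sketched outside of shader code. Here's the indexing logic in Python, standing in for what each vertex shader invocation would compute; the names and corner offsets are illustrative:

```python
# Expanding N points into N screen-aligned quads without a geometry
# shader: draw 4 * N vertices, and have each "vertex shader" invocation
# manually fetch its source point with vertex_index // 4, then pick a
# corner offset with vertex_index % 4.

CORNERS = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]

def expand_vertex(points, vertex_index, size=1.0):
    px, py = points[vertex_index // 4]   # manual vertex fetch
    dx, dy = CORNERS[vertex_index % 4]   # which corner of the quad am I?
    return (px + dx * size, py + dy * size)

points = [(10.0, 10.0), (20.0, 5.0)]
quad0 = [expand_vertex(points, i) for i in range(4)]
print(quad0)  # four corners surrounding the point (10.0, 10.0)
```

Four adjacent vertex indices read the same point's data and differ only in their corner offset, which is exactly the "ignore the vertex-fetch hardware and load it yourself" idea.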
It's very Objective-C to me, what you said about going with making something new that's kind of the old thing with all the things imported. One of the things we've heard from emulator devs in the past, and I think that you mentioned when we first started talking, was, in your journey in working on the emulator and especially on the GS, you obviously come across weird ways the games have used it. What are your favorite cursed games that you've had to deal with on the emulator? [00:28:03] TK: Yeah. Let's see. I think we'll go with SOTC. Shadow of the Colossus. There we go. Shadow of the Colossus. Yes. They have this bloom effect that they run on the GS. And the way they do it is, over each frame, they accumulate this texture. I don't remember exactly. But I think what they do is they take a silhouette of all of the things they want to put this bloom effect on and they make that white. And then they spread it around by just rendering it slightly above, slightly to the left, slightly to the right, slightly down. Because, remember, they don't have shaders. Right? They don't have shaders. You can't just sample this texture in three places. They just render it four or five times over this image. The GS had a lot of - it didn't have shaders. But it had a lot of fill rate. The ability to render a lot of pixels very quickly, for the time. And so, that's just what they did. The fun part is - they accumulate this texture. And their calculation was set up so that it would - right? To get their bloom effect right. The white part would just kind of - because each frame, they blur it slightly more. It would come out in kind of like a halo, right? Because that's the effect you want. Right? But as it turns out, the thing preventing its spread from continuing too far was the fact that the PlayStation 2's blending rounds down. Right? When you do the blend - remember, they're spreading this by taking the image and just rendering it with 10% opacity many, many times.
And so, each time you do that calculation for like, "Oh, 10% of this one. Plus 90% of this one. Put them together." Most PC GPUs, as far as I know, will round to nearest. If it's above 50% of the way to the next value, they round up. If it's below 50% of the way, they round down. But on the PlayStation 2, it truncates. Even if it's 99% of the way to the next value. All the way back down. And so, the only thing preventing this bloom from just exploding across the screen and turning the entire thing white is the fact that the blending between these values rounds down. And the fun part is that on PC GPUs, these days, everything is shaders. Everything is shaders. It's all programmable. Except the blending. The blending is not programmable - except for on Apple GPUs, as it turns out, and Intel ones. But on AMD and Nvidia GPUs, the blending is not part of the shader. The shader outputs a color and then it goes into a special hardware unit that combines the color from the shader with the color that's already in the texture. Which means that it's not very easy to modify the - everything in the shader, it can just be like, "Oh, we'll just calculate it slightly differently." But once it's sent to the hardware blend unit, the hardware blend unit is now in charge. And we can't really do much about it. And so, Shadow of the Colossus is, I think, one of the more famous games. But we have lots of fun with blending on PS2. And so, we actually had a whole bunch of attempts at trying to make this work better. One of the things was the PC blending. It allows you to - the main calculation is: the alpha value, where one would mean fully opaque, times the source's color, plus one minus alpha times the destination color of the texture.
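The rounding difference is easy to demonstrate with that blend equation on an 8-bit channel. A sketch with made-up numbers; the point is how truncation versus round-to-nearest behaves under repeated low-alpha blending like the bloom accumulation above:

```python
# result = alpha * src + (1 - alpha) * dst, on an 8-bit channel.
# The PS2 truncates the blend result toward zero; most PC blend
# hardware rounds to nearest. Under repeated low-alpha accumulation,
# truncation keeps eating the fraction, which is what stops the
# bloom from spreading.

def blend(src, dst, alpha, mode):
    exact = alpha * src + (1.0 - alpha) * dst
    return int(exact) if mode == 'truncate' else round(exact)

def accumulate(mode, src=16, dst=0, alpha=0.1, passes=50):
    for _ in range(passes):
        dst = blend(src, dst, alpha, mode)
    return dst

print(accumulate('truncate'))  # settles at 7 -- the PS2-style result
print(accumulate('nearest'))   # settles at 12 -- noticeably brighter
```

Same equation, same inputs; only the final rounding step differs, and the accumulated value ends up visibly different.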
And so, one of the things we tried was like, "Well, what if we did - we can't mess with the value once it's gone to the blend unit." Because the big issue here is that the shaders run super in parallel. A modern Nvidia or a big AMD GPU has 10,000, something like that, CUDA cores, or stream processors, or whatever each manufacturer calls it. 10,000 of these mini-CPU-ish things. They're not really a full CPU. I mean, they're closer to a single lane of a SIMD unit. But anyway. 10,000 things that'll be calculating on pixels at once. And then the way GPUs handle getting good memory bandwidth is by just running a whole bunch of things at the same time on each one of those. Not only are there 10,000 calculations happening at once, but there are also probably about 10 times that many pixels that are just waiting for their memory accesses to finish. You're looking at like 100,000 or something pixels that are in flight at once. And if your triangles overlap and they need to blend with each other, that needs to happen in a specific order. If you put the first pixel down and then the second on top of that, if they're blending with each other, that'll look different than if you put the second pixel first and the first one on top. And so, this is the reason that they don't want shaders being able to try to load from the texture that you're rendering to. Because then they're going to have to order themselves, which would be not great. And so, this is the reason that you can kind of do whatever you want as long as you're not looking at the texture you're rendering to. But once you need something that looks at it, you have to send it to this special hardware blend unit that figures out what order everything needed to be in. Fixes up. Reorders them if it needs to, or however it does it. Who knows? That's the manufacturer's deal. Not ours. And then blends them in order. And so, we can do whatever we want up until then.
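The ordering problem described here is easy to see with two translucent fragments landing on the same pixel. A tiny illustrative sketch (the values are invented):

```python
def over(dst, src, alpha):
    # Standard "source over" blend: alpha*src + (1 - alpha)*dst.
    return alpha * src + (1.0 - alpha) * dst

background = 0.0
# A white fragment then a black fragment, both at 50% opacity...
white_then_black = over(over(background, 1.0, 0.5), 0.0, 0.5)  # -> 0.25
# ...versus the same two fragments in the opposite order.
black_then_white = over(over(background, 0.0, 0.5), 1.0, 0.5)  # -> 0.5
```

Because the two orders give different results, blending has to respect submission order, which is exactly why it lives in a fixed-function unit rather than in the massively parallel shader cores.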
And so, we're like, "Well, what if we do that first source-color-times-source-alpha multiplication in the shader? And then we can subtract a little offset to do the opposite of what the PC does." The PC rounds at 50%. What if we subtract half a texture value? We just bring it slightly lower, so that when it does the 50% rounding, it's now actually rounding on a value that's slightly smaller than what it originally would have been. And then it'll hopefully round closer to the correct way. That was the hope. [00:33:43] JN: The hope? Did that work out? Okay. [00:33:44] TK: It worked on AMD GPUs. It turns out there's no requirement that the shader output have any more bits in it than the actual texture it's going to. And so, on Nvidia GPUs, it turns out they were using half-precision floating point - 16 bits of data with 11 bits of precision or something like that. And they were truncating it right back down to eight bits before they sent it to the blend unit, to save a little - I'm sure it saves a little bit of bandwidth. But it completely undid everything we were trying to do. It just undid it. We do have SOTC looking nice on AMD and Intel. But I think on Nvidia, you have to raise the blending accuracy setting. We have a PCSX2 - the big hammer for this is there's a blending accuracy toggle. And as you bring it up, we switch more and more draws to software blending, where we really do actually load the value that's currently in the texture. Look at it. Combine them and put it back. And we make GPUs really, really unhappy. Because to do that, we of course now have to order all of these pixels. And in software, we don't really know which pixels are and aren't ordered relative to each other. Our big hammer approach to this is that we draw one triangle at a time. And after every triangle, we tell the GPU, "Please flush all of your caches and make sure that the shader can read the value that was just rendered. Okay. One more triangle. All right. Flush all the caches again."
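As a rough mental model of that big hammer, here is a toy loop over a 1-D framebuffer. Each "draw" reads the destination the way the shader would, blends with truncation like the GS, and writes back; the comment marks where the real renderer would flush caches between triangles. All names and values here are invented for illustration - PCSX2 does this on the GPU, not in Python.

```python
framebuffer = [0] * 8  # toy 1-D render target

def software_blend_draw(pixels, src, alpha):
    """One 'triangle': the shader reads the render target, blends, writes back."""
    for x in pixels:
        dst = framebuffer[x]                                      # read what's there
        framebuffer[x] = int(alpha * src + (1.0 - alpha) * dst)   # truncate like the GS
    # ...here the real renderer would tell the GPU to flush caches / insert
    # a barrier, so the next triangle's shader sees these writes.

# Three overlapping "triangles" drawn strictly one at a time:
for tri in (range(0, 6), range(2, 8), range(4, 8)):
    software_blend_draw(tri, 200, 0.5)
```

The strict one-triangle-at-a-time ordering is exactly what makes the results correct and what makes GPUs miserable: each draw must fully finish before the next may start.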
It's very expensive. GPUs hate it. But it does get you the correct pixel values. [00:35:15] JN: Incredible. [00:35:16] TK: And so, we have this big blending accuracy slider. It's a drop-down menu, but with like six different options. And the only thing those different options do is they increase the number of triangles that we apply our big hammer to. Yeah. On the basic mode, we track down what most games tend to do that ends up requiring the software blending approach. And we only do it for those. And that usually only adds a very small overhead. Because when you're only doing it for like 50 triangles, it's like, "Oh, well. Oh, well." But then you go to medium and then high. And then on high, some games are looking at 10,000 of these single-triangle, followed by a barrier, followed by a single triangle, followed by a barrier sequences. And that's when - AMD GPUs especially hate this. And you get 10 FPS or something. Some terrible, terrible frame rate on even big, powerful GPUs. And they use very little power while they're doing it. Because most of the GPU is just sitting there waiting. It always looks kind of funny. It's like, "Oh, yeah. My massive GPU is using like 30 watts. But it's trying really hard. But it's only using 30." [00:36:19] JN: It's really suffering. But it's using no power to do it. [00:36:20] TK: Yeah. [00:36:22] JN: Okay. I guess with all that in mind, it seems like emulating the PlayStation 2 is just really hard. How close to true-to-original-console would you say we are overall on PlayStation 2 emulation? [00:36:33] TK: It really depends on how true you need it. Right? [00:36:35] JN: Yeah. Yeah. Yeah. [00:36:36] TK: One of the things we do have, we have a software renderer, which is where we do all of the pixel work - everything that you would normally do on a GPU, we just do it on the CPU. And especially for a long time - it's gotten a lot better now. We properly detect a lot more games' weird things.
So we're a bit better at that now. But in PCSX2 1.6, there were so many games that just wouldn't work well on the hardware renderer. Unlike, for example, Dolphin's software renderer, which is kind of meant for developers to just verify things, the PCSX2 software renderer is meant for actual users to use. Which means it's not actually fully accurate, which is a bit funny. But it uses memory the same way that the real GS does. And that helps a lot of things. And so, that's one of the ways you can get a bit closer, is that you use the software renderer. It runs full speed in many games, especially on a modern Ryzen 5000 series. Most games will run full speed on the software renderer, which is pretty cool. And that gets you much better support. These days, they're a lot closer. But there's still definitely games that run better on the software renderer. Then from the EE side of things, where we have the synchronization and whatever, there's a specific game. The engine, I think it's called the Blue Shift Engine. Is it Marvel Rise of something? There's a specific game that will kick off a calculation that happens cycle-timed, in parallel between the EE and both of the two VUs. And, hilariously, on the issue for this game, we were discussing it and one of the developers from that game came over and posted the code that they had written that we were screaming about internally. They posted it to their Twitter. They posted a link on our thing. And so, I got to actually go look at the actual source code for that game. And it was like, "Wow." Yeah. It runs on the EE, cycle by cycle. The EE is a superscalar MIPS processor and it is able to run two instructions at once, which I'm sure at the time was amazing. Nowadays, CPUs run 12 - or maybe not quite 12, but 8-ish instructions at once. But at the time, two instructions at once. Wow. Right? It can run two instructions at once as long as they don't conflict with each other.
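That "as long as they don't conflict" rule amounts to a data-hazard check between the two candidate instructions. A minimal sketch, with an invented instruction encoding:

```python
def can_pair(first, second):
    """True if the two instructions can issue on the same cycle:
    the second must not read, or overwrite, the first's destination register."""
    return (first["dst"] not in second["src"]
            and first["dst"] != second["dst"])

mul_a = {"src": {2, 3}, "dst": 4}  # r4 = r2 * r3
mul_b = {"src": {3, 5}, "dst": 6}  # r6 = r3 * r5 -- independent, can pair
mul_c = {"src": {3, 4}, "dst": 6}  # r6 = r3 * r4 -- reads r4, must wait
```

Real dual-issue hardware has more constraints (issue ports, instruction types), but this dependency check is the core of it.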
And so, they start by running one instruction to start VU0 and one instruction to start VU1, making sure that they happen on the exact same cycle. And then they counted the cycles for every instruction from there, so they know exactly which instruction on the CPU is running at the time of each instruction running in each of the VU programs. And so, then the CPU can just kind of be like, "Oh, I know the VU will be done with this. Let me just yoink it right over there." And then have VU0 just yoink from the VU1 registers. Because VU0 can kind of just reach into VU1's registers if it really wants to. [00:39:17] JN: Right. Right. And so, when you mentioned earlier that the VU exposes its timings to programs, this is where that comes in. [00:39:23] TK: Yeah. They kind of just reach into each other's stuff. And then they expect to obviously be running two instructions from the EE, one from VU0, one from VU1. Two more instructions from the EE. One from VU0. One from VU1. And currently, our best is to synchronize after every eight instructions or whatever. And even that's really slow. The current opinion on that one, from one of our other developers, was like, "I don't think - maybe in 10 years. Probably not in 10 years." But, yeah. There's always going to be some games that are too timing-sensitive for us. [00:39:54] JN: Yeah. Absolutely. Yeah. All right. Weird VU timing stuff. [00:39:59] TK: Right. Yes. On most CPUs - I mentioned modern CPUs can run 8, 10 instructions at once. The way they do this is they kind of inspect your program pretty much as they're running it. And they're like, "Oh, hey, you wanted to do this instruction and this other instruction. And this instruction is multiplying the values from register two and register three and putting the result in register four. This one's multiplying the values from three and five and putting it in six." Well, there's no reason I couldn't just run both of those at once. Right?
And so, it just does. And it just does this across - I think modern CPUs are tracking hundreds of instructions at the same time to figure out which ones are ready to run based on having all of their inputs ready. But the important thing that they try to make sure of is that, no matter what they do internally, to make developers not go crazy, they act as if they ran each instruction one at a time, in order. They only run instructions out of order if they are able to undo everything if something explodes or whatever, to make sure that they act as if they were running each instruction one at a time, in order. And from a developer's standpoint, that's nice. Because I don't have to think about what my CPU is going to be running out of order as I write the instructions for it, if I were writing assembly. Obviously, most people aren't writing assembly these days. But the compiler doesn't have to worry about this. Whatever. Everyone's happy. But there's a lot of circuitry being used to track all that stuff. What if you just didn't have it? The PS2, they have a pipelined vector unit that can do operations on four floating-point values at once. And so, for the most part, they actually do have this kind of tracking on it. And so, if you say, "Right." You're like, "Oh, let me multiply the values in registers two and three and put the result in four." And then, instead of saying three and five and put the result in six, what if you said three and four and put the result in six? Well, now you have to wait for the first one to finish before the next one can start. Because it needs the output of the previous one. And so, for multiplication on a VU, that does happen. But then you go to division, and you're like, "Okay, let me divide this by this." And there's actually only one division output register, Q. And so, the result goes in Q. And then what if you read Q and use that for something else? Well, it says, "Go ahead. You can just read it.
The value in there is the one from the division that happened - the last division to have actually finished." By the way, divisions take seven cycles. Seven cycles after you start a divide, the new division result will appear in that division register and overwrite the old one. But until then, you can just yoink out the previous value. Hey, why not? It saves them some circuitry to track this. And it also means that you kind of have this one extra register's worth of data. Because you can start the calculation that targets the division register and then take the result, instead of having to take it first and then having to store it somewhere. And so, everyone's happy, except for all of the people who are trying to actually write code and schedule things. Because they realize now that they have to kind of - it makes things painful when you're writing code. Because now you're like, "Okay, I start my division. And then I do some other stuff. And then I get my division result." Which means I have to have something else to do while I'm waiting for that division. From a programmer's standpoint, you can get more performance. But you have to work a lot more as a programmer. But then from an emulator standpoint, we now have to track all of the things that are in flight. Because, of course, the people who were writing the code clearly were tracking - in their heads, or in a helper program, or something - how many cycles everything was taking, and how many cycles it had left before it appeared in the division result register and overwrote the previous thing. We have to make sure that we do the same thing now. Right? And so, yeah, the solution for that, when we do our recompiler for the VUs, is we store it as part of the - normally, when you're doing it, you're like, "Oh. Well, I'm going to recompile the code at this address."
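A toy model of that Q-register behaviour, using the seven-cycle latency just described (the class and method names are invented, and the real VU pipeline has much more going on):

```python
class VUDivUnit:
    """Single division result register (Q): a new DIV overwrites Q only after
    its latency elapses; until then, reads of Q return the previous result."""
    DIV_LATENCY = 7  # cycles, per the discussion above

    def __init__(self):
        self.q = 0.0         # last completed division result
        self.pending = None  # (cycles_left, value) of an in-flight DIV

    def div(self, a, b):
        self.pending = (self.DIV_LATENCY, a / b)

    def tick(self):
        if self.pending is not None:
            cycles_left, value = self.pending
            if cycles_left == 1:
                self.q, self.pending = value, None  # result lands in Q
            else:
                self.pending = (cycles_left - 1, value)

    def read_q(self):
        return self.q  # always readable -- possibly the stale value

vu = VUDivUnit()
vu.div(10.0, 4.0)
early = [vu.tick() or vu.read_q() for _ in range(6)]  # still the old Q (0.0)
vu.tick()
late = vu.read_q()  # seven cycles in: the new result, 2.5
```

An emulator has to carry exactly this kind of "cycles left until Q changes" state around, because games were scheduled against it on purpose.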
But now you're like, "I'm going to recompile the code at this address with five cycles left on the Q register and three cycles left on the -" there's a sine, cosine thing, and that has its own output register. Three cycles left on that one. And then we will compile a different version of the code if you enter that block with a different number of cycles remaining on the various registers. And so, that kind of works out fairly well for the VU-only code. But remember how I mentioned that the EE can just execute VU instructions on VU0? Well, as it turns out, we don't have quite that much coordination between our EE recompiler and the VU recompiler. At the moment, that just doesn't work. And at the moment, our best solution is that we can at least detect when it might be happening. And then we put up a big warning message and say, "If your game breaks, you should contact the PCSX2 devs and they'll make a patch for your game to reorder the instructions. It won't work on a PS2. But it will work on PCSX2." Hopefully, at some point, we'll get that actually working. But at the moment, that's how things are going. [00:45:04] JN: Yeah. Actually, that brings me around to another question I had, which I guess is just about how you work as a team. I feel emulators are in this really weird space where your users aren't necessarily - they don't necessarily understand their console. They're not necessarily technical. They just want to play PlayStation 2 games. They're kind of treating you as a product. But all of it's a very technically intense, case-by-case basis. How do you, I guess, go about thinking about progress of the emulator as a team? Are you very driven by what game issues are being reported? Is it overall accuracy to the platform? How do you approach this? [00:45:40] TK: Yeah. One of the big things about this is a lot of people are working on this because they enjoy working on it.
And not because they're being paid to work on it or something. In fact, a lot of the PCSX2 devs explicitly don't want to accept money because they don't want to feel they have to - [00:45:56] JN: It becomes a different thing when you're paid for it. Right? [00:45:58] TK: Yeah. They don't want the responsibility that comes with that. Yeah. Part of it is kind of just that people work on the things that they're interested in working on. And so, some people care more or less about making sure that when a bug report comes up, it gets - we do have a bunch of people who are at least triaging those for us. But trying to choose which actual games to prioritize, trying to make them work, is kind of up to each developer and what games they care about. Or there are these bigger issues - how many games are affected by that? Kind of like, each person will decide for themselves how they want to do that. Yeah. We communicate in a Discord server. We know each other pretty well. Or at least as well as we would as a bunch of people on the internet. And I feel like we usually get along pretty well. And so, that's at least good. [00:46:50] JN: Yeah. That makes a lot of sense. Yeah. I guess, yeah, I think it's particularly interesting with PCSX2. Because I think the project website makes it look like such a professionally-run project. It's always fun to hear how different [inaudible 00:47:02] projects actually happen behind the scenes. [00:47:04] TK: Even that. That website's pretty new. And up to two years ago, it was a website that looked kind of like it came out of 2005, probably. Because it did. Yeah, we have one of the - I don't actually know if they're newer on the team. I think they've been on the team longer than I have, which means I don't know how they - actually, I don't think they were someone who worked too much on the emulator. But they were like, "Hey, you know what? I want to work on the website." And they made this fancy new website for us.
And now our website looks at least somewhat modern. Actually, it's very modern. Looks very nice. [00:47:35] JN: Along the way - aside from the core emulator, is there a suite of tools that you all have built out internally to, I guess, debug or work towards progress on the emulator? [00:47:44] TK: Yeah. I think each person kind of uses their own different things. Since I'm on macOS, there's really only one - I guess I should mention, this is from the perspective of debugging graphics stuff specifically. Apple has a general graphics debugger for their Metal API. And so, I often use that. And then I know a lot of people on the Windows side will use a program called RenderDoc, which does a very similar thing. That can be used with any game that's using, for RenderDoc, Vulkan or DirectX 11 or 12. It kind of just - you capture a frame. It takes all of the draw calls that made up that frame and kind of puts them in a list. And then you can kind of just go through them and look at each one in turn and be like, "Oh, here's a draw call that did this." But PS2 games tend to be kind of unruly in how they do this, because of the way the PlayStation 2 works. And so, another alternative we have is a draw dumping system, where you can check a bunch of boxes in the emulator and, for every single draw the game does, it will create a new PNG in a folder and it'll just number them. And in fact, it'll make not just a single PNG. I think it'll make seven PNGs and maybe two text files saying stuff about the draw. Because it's the kind of stuff that you would see in RenderDoc, except all of the screens from RenderDoc are PNGs and text files. And it just fills your folder with like gigabytes of PNGs. And then you scroll through them with the icon view of your file browser and go through. I've used that a bunch as well. It's useful for tracking down lower-level things of what the game - the question is like, is the game doing something really weird?
Or, "Oops. I did a dumb thing in my graphics code." And the Xcode one - or RenderDoc - is better for the, "Oops. I did a dumb thing in my graphics code." Whereas if the game is doing something especially weird, it's often easier to use the draw dumping. Because that's closer to the PS2 side of things. And so, different people will use different amounts of draw dumping versus the third-party tools like RenderDoc. [00:49:36] JN: Yeah. I've not heard of RenderDoc. That looks awesome. [00:49:38] TK: Yeah. RenderDoc is very cool. And then from a CPU perspective, I think each person who is kind of working on - especially with the recompilers and stuff - would kind of do their own thing. Which, for better or for worse, I think most of them didn't actually make their way into the emulator. The tools for debugging that stuff, that is. Some of it did. But not all of it. Yeah. I know we have some integration with some of the Intel profiler tools as well. We have a thing where, if you flip on a compiler switch, it'll try to send information about the recompiled code. We emulate the CPUs by taking the instructions for that - the MIPS instructions - and generating the equivalent for the computer it's running on, which tends to confuse tools, like profilers and things. And so, we have a thing to call a special method on Intel's profiler and tell it, "Oh, by the way. We just made a new function over here. Here's its name." And then that way, the profiler can actually name it properly and make things look nice. [00:50:40] JN: Nice. Awesome. Very cool. Final question. Is there anything coming out or anything you're working on for the emulator that you're excited about? Or anything you're able to tell us that's coming up in the future? [00:50:48] TK: I think, more recently, I've been trying to do some stuff with the interface. And we just got the ability to - well, I guess we always had it.
But it's actually working well to translate the UI into lots of different languages now. But there are a few things that are currently missing - some of the onscreen stuff doesn't work very well with, like, Arabic. That's definitely something I would like to work on: trying to get it to be able to show Japanese and Arabic characters in the onscreen UI, which we currently can't. [00:51:19] JN: Right. Yeah. I imagine Japanese, especially for a Sony console, will be very impactful. That's awesome. Cool. Well, Tellow, thank you so much for joining me today. What a piece of hardware. Thank you for running me through everything. Yeah. Thank you for working on such a wonderful emulator. [00:51:31] TK: Yeah. Well, it was a pleasure doing this interview. [END]