EPISODE 1810

[INTRO]

[0:00:00] ANNOUNCER: David A. Patterson is a pioneering computer scientist known for his contributions to computer architecture, particularly as a co-developer of reduced instruction set computing, or RISC, which revolutionized processor design. He has co-authored multiple books, including the highly influential Computer Architecture: A Quantitative Approach. David is a UC Berkeley Pardee Professor Emeritus, a Google distinguished engineer since 2016, the RIOS Laboratory Director, and the RISC-V International Vice Chair. He received the 2017 Turing Award together with John L. Hennessy, quote, "For pioneering a systematic, quantitative approach to the design and evaluation of computer architectures with enduring impact on the microprocessor industry." In this episode, he joins Kevin Ball to talk about his life and career.

Kevin Ball, or KBall, is the Vice President of Engineering at Mento, and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action Discussion Group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc.

[EPISODE]

[0:01:26] KB: It is my absolute honor today to welcome Turing Award winner, David Patterson, to the show.

[0:01:31] DP: Thanks for having me.

[0:01:32] KB: I'm excited to have you, and I have a bunch of questions. I love getting to geek out with people on stuff. But I'm curious before we start, you have a long, illustrious background, and you've been in a whole bunch of different domains of the tech industry. How do you introduce yourself these days? What do you bring forward?

[0:01:51] DP: I think I usually start with: I was a Berkeley professor of computer science for four decades. Eight and a half years ago, I started working for Google, so I've almost got a decade at Google. So, I've got a half-century of experience in the field.

[0:02:04] KB: I love the encapsulation as a half-century. The field isn't that much older than that.

[0:02:10] DP: Yes, the field's about as old as I am. It started in the late forties with the very big machines, the ENIAC and things like that. When I first decided to study computer science, I got no reaction from my relatives. It was not as if, in the 1960s, everybody said computing was the future. It was just this kind of side thing that few people were interested in, and now my relatives think I was very wise in the field that I picked.

[0:02:34] KB: Yes, absolutely. So, the subject that you won the Turing Award on was related to RISC and MIPS and that sort of domain. I saw that you've been deeply involved with the RISC-V project, or at least you wrote the book, or one of the books, on it. So, I'd love to dig in a little bit and get your perspective on RISC-V, what it means, and what people are trying to do with it.

[0:02:57] DP: All right. When I explain it to my relatives, what I say is: when software talks to hardware, there's a vocabulary, and the technical name for that vocabulary is the instruction set. The words of that vocabulary are like the keys of the calculator: add, subtract, multiply, divide, and all that stuff. So, the so-called reduced instruction set computer debate, which we had in the 1980s, was about what's the best instruction set or vocabulary for microprocessors. The prevailing philosophy was that it should be very sophisticated, the idea being to get closer to software.
And John Hennessy and I, who shared the Turing Award, and people at IBM, argued that that was the wrong model, that we should instead keep the instructions relatively simple, and that's the "reduced" in reduced instruction set, and that the compiler would be able to map programs into it. If we kept the instruction set simple, we could iterate on it faster, it could potentially be more power efficient, and things like that. So, that was the debate. What it came down to is, for the more sophisticated instruction sets, you can think of them as polysyllabic words in the vocabulary, you needed fewer of them to execute the program, but they might run more slowly. What was that ratio? It turned out that the RISC style tended to execute about 30% more instructions, but we could execute them about four or five times faster. So, that was the net win. That's RISC in 60 seconds.

[0:04:26] KB: RISC in a nutshell. Yes, absolutely. It has continued to evolve. I would say, probably, if we look at processors shipped today, the debate is over. Essentially, RISC has more or less won.

[0:04:37] DP: Yes. There have been new instruction sets over the decades, but nobody's tried to go back to that old approach. Even the very sophisticated ones are still similar at the core to the very RISC I that we did at Berkeley in the 1980s.

[0:04:49] KB: Yes. So, looking now at RISC-V, relatively recently the first version was finalized and out in the world. What is different? I think I saw in a brief that it learned a set of key lessons from previous generations and avoided some key mistakes. So, what do those look like?

[0:05:06] DP: Well, we did four RISC architectures at Berkeley in the 1980s, and Hennessy did a couple called MIPS at Stanford. What happened was, the origin story for RISC-V is, in 2010, we were going to do research around parallel computing that was sponsored by Intel and Microsoft. This was the switch over from single-core to multi-core, and that was what was funded. We were going to need an instruction set to do our research. We could see that Moore's law was slowing down and thought what the future looked like is you'd have a core instruction set, and then you'd add special-purpose instructions for specific domains. That's what we thought was going to happen. So, we needed a core. It would make sense to use x86, kind of based on who our sponsor was, but A, it's ghastly and hard to extend, and Intel wouldn't let us use it, nor could we use ARM. ARM was popular, but you couldn't extend it, so we had to invent our own. So, led by my colleague, Krste Asanović, and two grad students, Yunsup Lee and Andrew Waterman, they said, "Let's do a brand-new RISC architecture, applying what we've learned over the last 30 years about things that were probably mistakes back then and what we've learned since." Because we did four RISC architectures in the 1980s, they called it RISC-V. So, I think the fact that they called it RISC-V gets me more credit than I deserve. But the idea was, as researchers, we thought everybody would want to use it. Everybody in academia would want to use it. Other researchers would want to use it. So, we made it available, in the Berkeley tradition that things are open source, and that's how it got started. About four years later, we were using it in our classes and our research. It didn't get really - other universities didn't really pick it up, but it was out there. But we started getting comments: Why did you change the instruction set from the fall semester to the spring? Why did you make those changes?
Why do you care what we're doing in our classes and our research? Then, in talking to them, we found out there was this thirst for an open architecture that people could use, rather than having to get a proprietary architecture. Once we realized that, we thought that was a great idea and wrote a paper, kind of inspired by the Linux idea that software should be free, which we called Instruction Sets Should Be Free. Soon after, we started a foundation, around 2014 to 2015. Now, 10 years later, fortunately, it's actually caught on. Like the enthusiasm around open-source software, this has a kind of religious fervor, a philosophical attraction. Same thing for the open architecture. Most of the people involved really like the idea that potentially we could have a lingua franca across all computing, from the biggest to the smallest. If we could have one, it had better be open. As we've seen, proprietary instruction sets are tied to the fortunes of those companies, and who would have thought Intel would ever be vulnerable, right? That just seemed impossible. But yes, x86 is tied to that company, which is having difficulties, and there have been many other instruction sets that have gone away because of the fortunes of their companies. So, there's the idea of an open architecture that's kind of community-oriented. It's a standard. It's not like Linux, which is an implementation. It's a standard like USB or something like that, and there's a lot of enthusiasm around it. If you look at the very core of that instruction set, it's similar to RISC I. But since we have so many transistors today, and we had that idea of adding features when you need them, RISC-V allows optional features for all kinds of applications, like encryption, or machine learning, or things like that. Options that you can add. But at its core, it's the same RISC philosophy.

[0:08:48] KB: Let's maybe talk some about that extensibility, because I think one of the things going on right now, with Moore's Law ending and a bunch of these other things, is we're sort of moving into a world of domain-specific accelerators. Probably the biggest one of those is around machine learning, people using GPUs, or TPUs, and things like that. But one, what does that look like for people using RISC-V if they want to tap into a domain-specific accelerator? You said it's open, they can do that. How does that actually end up playing out, and where are you seeing this happen?

[0:09:18] DP: Yes. So, what I've ended up doing at Google for almost a decade is working on domain-specific accelerators for machine learning. When I retired from Berkeley and wanted to keep my hands in the technology, I thought Google would be an interesting place. I had gone there on sabbatical recently. They just kind of, because of my experience, had me report to Jeff Dean, who is kind of a famous software engineer, and I think just because of my stature. He was in the machine learning part of Google. Not that I thought this was something I wanted to do, but I didn't have any strong opinions. He was in Google Brain, which he founded. Jeff was a big and early believer in the potential of machine learning and AI. He was one of the first movers in that, and so Google was a first mover in that. So, I ended up spending the last eight years learning all about domain-specific accelerators for ML and AI. Now, RISC-V in particular, its first foothold has been in embedded computing. So, there are DSP extensions, digital signal processing extensions.
There are compressed-instruction extensions to keep the code size smaller, and then the extensions for machine learning, which came kind of to my surprise. I was involved in the IEEE floating-point format standard, which was in the 1980s. Before that, different computers had different floating-point formats, and so imagine porting floating-point programs when the floating point didn't do the same thing. So, it was a real kind of mess on big computers, and the floating-point standard was set up around, wow, microprocessors are starting to have floating point, let's standardize it. Thank God we did, but I thought the data types were settled, which were single precision and double precision. Maybe down the line, there'd be bigger things, but that was it. Well, to the surprise of many, for machine learning it's not, whatever it is, 50 bits of precision that you need. The range is very important, but the precision isn't all that important. So, Google created a new floating-point format, which I never expected to happen in my lifetime, given the standard. At first, it was a 16-bit format that they called brain float, which is a different format, named that because that's where it was done, in Google Brain. Since then, Google and NVIDIA and other companies have made them even smaller. So, 8-bit floating point, 4-bit floating point. With 4 bits, there are only 16 values. How can that be floating point? But amazingly enough, people are figuring out ways to get AI done with extremely narrow data types. Data types have been a significant area of innovation in architecture. I think when Bill Dally of NVIDIA talks about the giant gains that we've made in machine learning, he credits data types as one of the significant ones. So, if you had an instruction set that you couldn't change, you'd be in big trouble. You need to be able to span these data types. Those are examples of the type of extensions going on.

[0:12:19] KB: That's a great example. There are all these interesting papers around, yes, trading off between the number of parameters versus the precision of the parameters. I saw somebody doing essentially, they call it, like, one and a half bits, where each parameter was basically just one, zero, or not set. It makes a ton of sense that being able to bake that in at the instruction set level, because this is extensible, allows you to experiment in those domains.

[0:12:44] DP: Yes. So, if you're a computer designer, what you love is people who say, "I can't get my work done. I need a much faster computer." You're not happy with people who say, "Yes, things are fast enough. My laptop is good enough. I'm going to keep it until it breaks. I'm going to keep it a decade," right? You're not a fan of those people. The machine learning people are voracious. If you gave them a factor of 10 tomorrow, they'd say thank you and then, like, what's next, right? They can use up everything. And it's a brand-new area. We don't know the best architecture for machine learning and AI. It's a wide-open space. So, this numerical analysis has these two phases: training and what's called serving, or inference. Training is kind of like sending your kid to college. It takes a long time, it's expensive, you spend a lot of money. But once they're educated, they go out into the world and hopefully answer questions much more quickly than it took them to learn that material. So, there are two kinds, and the numerical precision can be different.
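To make the brain float idea above concrete, here is a minimal sketch in Python with numpy of how bfloat16 relates to float32: it keeps float32's 8-bit exponent (so the same dynamic range) but only 7 mantissa bits (so far less precision). The helper name and the truncate-instead-of-round shortcut are illustrative assumptions, not how any particular chip does it.

```python
import numpy as np

def to_bfloat16(x):
    """Simulate bfloat16 by zeroing the low 16 bits of a float32.

    bfloat16 keeps float32's 8-bit exponent (same range) but only 7 mantissa
    bits (much less precision). Real hardware usually rounds to nearest;
    plain truncation keeps this sketch simple.
    """
    a = np.asarray(x, dtype=np.float32)
    bits = a.view(np.uint32) & np.uint32(0xFFFF0000)  # keep sign, exponent, top 7 mantissa bits
    return bits.view(np.float32)

print(to_bfloat16(3.0e38))     # ~3.0e38   -> huge values still representable (range kept)
print(to_bfloat16(1.2345678))  # 1.234375  -> only about 3 significant decimal digits survive
```

Running it shows the point Patterson makes: a value near the top of float32's range survives nearly intact, while 1.2345678 collapses to roughly 1.234375.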
There's what you train in, and then what they call quantization: maybe you train it using 8 bits, and you'll be able to serve it or do inference at narrower bit widths. There are papers that talk about ternary, three states, like one-and-a-half-bit serving. It's amazing. What's exciting is, maybe in the 1990s, when Intel x86 dominated everything, it was kind of boring. Now there's this wide-open area on both the software side and the hardware side where it's not clear what's going on, and it obviously has gigantic commercial impact, as those of us who watch NVIDIA stock can see. So, it's pushing the state of the art in both software and hardware, and simultaneously being delivered to people and having an impact on people's lives. It's a very interesting time.

[0:14:34] KB: I'd love to dig in a little bit on what you're talking about there in terms of the separation of inference and training. Because I think the vast majority of the industry is using off-the-shelf GPUs; they're using the same hardware for both. I think I saw a paper talking about Google's TPUs, which you all custom-designed, actually having some focus on inference versus training. What are the different parameters that you're optimizing there?

[0:14:57] DP: So, what's happened in the machine learning community? The big thing, kind of a breakthrough, was in 2017, when Google came up with this new model called the Transformer. The specific idea is that, if you think of an image, you can pay attention to different pieces of the image and do more computation there rather than uniformly. In fact, the actual title of the paper that introduced the Transformer is "Attention Is All You Need." This has proven to be a breakthrough model, and what's happened in the last seven or eight years is people just pushing that model and expanding the number of parameters dramatically and all that stuff. So, what's happening simultaneously, besides the split of training and serving, is this rapid increase in the size of these models. What that turns into, typically, is memory capacity. It used to be, well, I think the first Transformer model might have had a hundred million parameters, some number like that, and people have already gone to billions and are talking about hundreds of billions of parameters. Going back to what we were saying about data types, these parameters could be 16-bit, this brain float 16, or 8-bit, and people would like to get them into 4 bits to shrink the memory capacity. It's not only that they take up less memory space. You also get more memory bandwidth. You can fetch more of them per second. That's the shift that's going on, driven by this increase in these so-called large language models. Now, for serving, it tends to be - for training, it's really computationally intensive. If you look at our textbooks, there's a phrase that's called arithmetic intensity, which is the number of operations per byte fetched. So, for training, it could be hundreds of operations per byte fetched. Hundreds of floating-point operations per byte fetched. For serving, it tends not to have that high an arithmetic intensity. What that means is, it's more memory-bound. Serving tends to be more memory-oriented. Training tends to be more compute-bound. Given that - well, first of all, you can do serving on a training chip. If you can do training, you've got everything you need to do serving. But by specializing, you can reduce the costs, and you can reduce the energy and the carbon footprint by doing that.
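As a rough illustration of the arithmetic intensity idea Patterson describes above, here is a small Python sketch that counts operations per byte for an idealized matrix multiply. It assumes each matrix is moved to or from memory exactly once and ignores caches and tiling; the function name and the 4096-wide example sizes are made up for illustration.

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """Operations per byte fetched for an (m x k) @ (k x n) matrix multiply.

    Idealized model: 2*m*k*n multiply-add operations, and each of the three
    matrices moved between memory and the chip exactly once.
    """
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

# Training-style: a big batch of activations against a 4096x4096 weight matrix.
print(gemm_arithmetic_intensity(m=1024, k=4096, n=4096))  # ~680 ops/byte -> compute-bound

# Serving-style: a single token (batch of 1) against the same weights.
print(gemm_arithmetic_intensity(m=1, k=4096, n=4096))     # ~1 op/byte   -> memory-bound
```

With a large batch, the ratio lands in the hundreds of operations per byte, which is the training-like, compute-bound case; with a batch of one, it drops to roughly one operation per byte, the serving-like, memory-bound case.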
So, you can have a smaller chip or reduce the size. All these chips have big matrix multiply units; you can get away with a smaller matrix multiply unit. You can probably get away with a lower-power chip, and try and focus on memory-centric architecture design for serving. I would say right now, NVIDIA rules the world on training. I think their favorite solution to how do you do serving is, well, you use our old training chips to just serve. That's their advice. That may benefit them economically, but that's their advice.

[0:17:46] KB: It might be just talking their book a little bit there.

[0:17:48] DP: Yes. So, if you think of it historically, think about how the PC was a duopoly between Intel and Microsoft. Right now, it's pretty much NVIDIA and NVIDIA, but they've focused on the training side and very high-powered, very big chips. On the serving side, there's more opportunity, I think, for hardware people to innovate. With these trade-offs I talked about, maybe you get away with a smaller chip, or you are closer to the memory system so you get good memory bandwidth. Those are kind of the examples there. Again, in this still-evolving era, certainly NVIDIA is the dominant thing that's commercially available. Some companies like Google have built their own training chips, and Google has also built special versions of them for serving.

[0:18:38] KB: You mentioned something there that I'd like to dive into a little bit further. So, a piece of it being around becoming memory-bound. I know this is a domain where the ratios of compute to memory have been shifting over time; even on my laptop, if I'm running into trouble, it's almost always because I'm running into memory constraints. I saw a paper that you were involved with recently around memory-centric computing and redesigning the ways that we set up at least cloud servers to be focused around memory-centricity rather than processor-centricity. Can we maybe talk about what that looks like and what that means?

[0:19:15] DP: Yes, I think the original tagline of that paper was, the CPU is not central anymore. So, we're kind of used to measuring software by arithmetic operations, being very computation-centered. But increasingly, we're bound by memory, either by memory capacity or by memory bandwidth. And what's happened over time with the slowing of Moore's law is that the rapid improvement in DRAM memory, which we saw for decades in the last century, was like clockwork: four times the capacity every three years. Today, we've gone from 4x every three years to more than a decade between the 8-gigabit DRAM and the 32-gigabit DRAM. So, it's really slowing down. Then, along with that slowdown in capacity, the bandwidth isn't improving as rapidly. One of the famous admonitions in computer architecture was by Gene Amdahl. He wrote a kind of one-and-a-half-page paper that stated what he thought was fairly obvious, and it has since become called Amdahl's Law: if you have a pie and there's a part of the pie that you accelerate and the rest of it that you don't, the part you don't touch limits how fast you go. If you're going to make two-thirds of the pie go infinitely faster, you're only going to go three times faster, because there's one-third you don't touch. People call it this sad law, because architects run into it all the time. They get very excited, "Oh, look what I figured out. I can make matrix multiply go 10 times faster. Very exciting." And then there's all the rest of it that doesn't go any faster.
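Amdahl's Law, as Patterson states it above, fits in one line of code. A minimal sketch follows, using the two-thirds example he gives; the function name and the extra 10x matrix-multiply case are made up for illustration.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only `accelerated_fraction` of the work
    gets `factor` times faster (Amdahl's Law)."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# Patterson's example: two-thirds of the pie goes infinitely faster,
# but the untouched third caps the whole thing at about 3x.
print(amdahl_speedup(2 / 3, float("inf")))  # ~3.0

# A hypothetical 10x-faster matrix multiply that is half the total runtime:
print(amdahl_speedup(0.5, 10))              # ~1.82, nowhere near 10x
```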
So, what's happening is, as Moore's law is starting to slow down, the logic is still getting pretty good. The actual arithmetic units are okay, but the cache memory technology, which is called static RAM, is not improving very much, and then the DRAM, which is a separate kind of technology, is also improving much more slowly. What they've done to try and boost up the memory, especially for these accelerators, is go to a novel packaging scheme, and it's aptly named high-bandwidth memory. It's actually a little physical memory where you stack the dice on top of each other. There's a stack of dice, like 4 or 8, and they're trying to get to 12 or 16 dice, and you have multiple stacks right around the computation unit, so they're very close, and they've got a thousand wires. The standard DRAMs would have 64 wires; this is a thousand wires in these little stacks. It's very specialized memory, and it's kind of at the heart of all these accelerators, to try and provide the bandwidth that you need, particularly for training and for inference. Again, not only is this technology right at the center of this area, it's also in the business pages as we go along. SK hynix can successfully make it with eight chips and is delivering it. Samsung, which was, I think, the original creator of high-bandwidth memory, is having difficulties delivering on this technology. So, as a result, SK hynix is doing well in the market. The AI stuff is tied directly to the business pages and stock prices that we see.

[0:22:39] KB: Absolutely. Yes, it's really driving it. Yes, Amdahl's Law is, I think, intimately familiar to any software engineer who's been told premature optimization is the root of all evil, right? "Oh, I got this loop going really fast. Why is my code not going any faster?"

[0:22:53] DP: Yes. I think it's the law of diminishing returns. When Gene Amdahl wrote it, he was talking about parallel computing, right? He says, "There's a part that you can make parallel. Great. The part that you don't make parallel will limit how much performance you're going to be able to deliver." It was obvious to him as a really smart computer designer; he just had to write it out, because people were getting very excited about these parallel processors and ignoring the part they weren't speeding up. But it's this law you keep running into over your career. Like, "Oops, yes, screwed up again, Amdahl's Law."

[0:23:26] KB: Yes. Okay. So, with the accelerators, that's really interesting, right? They're packaging huge amounts of memory going close to it. I think I saw another -

[0:23:33] DP: Yes, what's interesting is, you can actually plug a lot more DRAMs into your PC than you can plug in next to an accelerator, the so-called dual inline memory modules; you can put dozens of them in. Because of the physical distance to get those 1,000 wires, it's on a special package. So, the downside of the stacks is the capacity isn't very high, because there aren't that many dice, and there aren't that many stacks relative to the DIMMs. Incredibly fast, but very limited capacity. So, how does that constrained capacity of HBM go with what I claimed was billions of parameters or hundreds of billions of parameters? Yeah, that's the problem. You need a lot of GPUs to get enough capacity to be able to solve these big problems, because of the limited memory capacity of the HBM stack. Well, it's challenging. I guess if you're selling GPUs, it's not challenging.

[0:24:28] KB: Just buy more. There you go.

[0:24:31] DP: Yes. Not a problem.
[0:24:32] KB: Continuing to dig into this sort of memory and data centricity, I think I saw something around building out kind of database architectures around a memory pool, rather than, once again, being sort of CPU-centric. What does that end up looking like? I'm interested in this shift of, if our constraints are now memory, which they've been on and off again, but increasingly, like, data has so much inertia, that is the fundamental constraint that's not getting faster, per Amdahl's Law. How do we shift our hardware architectures and our software to better manage that?

[0:25:04] DP: Yes, so the paper that you're talking about is actually being presented at CIDR, which is a database conference. I can't remember what it stands for. I was at a workshop in Germany where they had a few architects and a lot of database people, and we got together and talked about this memory-centric approach, and among the authors of the paper there are a lot of database people who are better informed about this than I am. But one of the enabling technologies is this new kind of successor to the PCIe bus; it's called CXL. The idea is to be able to create a coherent address space across several servers. What we argue in that paper is, one of the downsides of such accelerators in the past is that you needed memory bandwidth for these accelerators, but you would have to put a lot of physical DRAM with each of these accelerators. If you didn't use it well, it would be very expensive. So, what we argue is that CXL, which is standard on all the servers, allows you to very easily have pools of DRAM that would be shared across CPUs, rather than having to put a lot of DRAM at every physical one. So, given a pool of CPUs, and maybe accelerators, database accelerators make more sense here, you only need to use memory from the pool, rather than having to justify the cost of having DRAM that can only be used in these narrow situations. So, I'd say the bottom line of that paper is, pooling of DRAM is realistic. We should be thinking of being more memory-focused, thinking of the problem as how do we get access to the data, rather than thinking of computing as the focus of what's being done.

[0:26:51] KB: Yes. Well, and I wonder, and you said you're not as familiar on the software side, so redirect me if I'm going off into an area that isn't in your domain. But does that require the software explicitly managing the memory? Because I feel like a lot of, right now, at least higher-level software can optimistically ignore the memory hierarchy. Then, you have to start being aware of it when your performance falls down, and you're saying, "Oh, I've broken cache locality or something like that." But when you have this -

[0:27:21] DP: Yes, this is a little different. So, how do computers work? How could you have gigahertz processors and this DRAM that takes 100 nanoseconds? How do you make that work? Well, caches were invented, and we just keep sticking in levels of caches, trying to hide that, to give the illusion that you have this incredibly large memory and it's incredibly fast, and it's up to the hardware memory hierarchy to hide that from the programmers. For some tasks, that illusion works pretty well. But for others, it doesn't. So, programmers do need to be aware. I think this one isn't so much the memory hierarchy as it is just the memory capacity, which I think programmers have had to worry about for a long time.
But this idea that, well, I think the argument is, if you rearchitect your software to be aware of the pooling of memory across different servers, you have this ability to get a lot better cost performance for data-intensive applications.

[0:28:21] KB: Yes, that would make sense just in terms of being able to trade off more memory versus CPU and not having to stack them. I was also wondering, if you make memory ownership visible to the software, could you hand off between different processes without having to do a mem copy? Where you basically say, like, "Here's your address, go"?

[0:28:42] DP: Yes, that's kind of this distributed shared address space. There's a whole bunch of CXL protocols, but I think the latest CXL protocols will allow that type of access to the shared memory. But it's different; it's shared between different servers, and that used to be an impossible thing to do. And the way that servers are constructed, there's just a limit to the amount of DRAM you can stick onto any one server. But using CXL, you can get this illusion of having much more memory for individual pieces. So, it's kind of a practical way to get a tremendously bigger memory footprint without the gigantic cost of some kind of supercomputer.

[0:29:23] KB: Yes, I remember earlier in my career, I was involved with a lot of high-performance computing and stuff and shared memory models. If you wanted to do it across lots of CPUs, you had to go to SGI, which had these, like, mega things, or something like that. Anything else, you were passing messages essentially over a network bus in some way. So, it sounds like, in some ways, this could enable shared-compute, shared-memory-style programming models, but across some number of servers.

[0:29:49] DP: Yes. So, that was John Hennessy and I. John Hennessy did this project called DASH, which was a big shared-memory processor. This was when this kind of switch to parallel computing was in the air. We had a contrasting project at Berkeley called the Network of Workstations. This was in the nineties, I guess. I think that's right. John's thought was, well, the hardest problem with parallelism is programming. If we keep it a coherent shared address space, that's going to make it easier. We at Berkeley said, well, that's one way to go, but rather than have these big servers with this interconnect to be able to provide that, it's tremendously more cost-effective if we could use, you know, we said workstations, but PCs. Put a bunch of them together, use local area networks to connect them, and that's going to be - the cost performance of that is going to be amazing. So, how did that settle? Well, what happened is, the internet came along, and for internet services, the parallelism was based on the number of people, not on a single program that you had to parallelize across a hundred processors. So, there was a huge throughput demand. What ended up happening is, the network of workstations won. All the Internet services standardized around that. John's model was more efficient in DRAM, but the price of DRAM at that time was so low, and you could scale up, and it was also very reliable, in that a single server could fail and the software could keep working around it. So, it was much more reliable, much more scalable, and much cheaper than the coherent address space model that those SGI machines used. It's an example of how do we settle debates in computer architecture?
Well, we get companies to spend hundreds of millions of dollars or billions of dollars to put it in the marketplace, and then we fight it out, and then, "Oh, that side won." That's how we do it.

[0:31:49] KB: Absolutely. Well, and as you highlight, a huge amount of the demand in compute growth for many, many years was Internet-driven, essentially embarrassingly parallel. You just split it out across things, stateless servers that maybe access some shared state. Then, you've got databases or things like that that have to actually have that coherent view of state. Now, we're at a scale where that's not super cost-effective, hence needing this pooled memory for databases. Are there other domains? We've talked about machine learning, particularly at inference time, being very memory-intensive. Databases are another one. What other domains do you think this type of memory centricity is likely to make a lot of sense in?

[0:32:26] DP: Yes, that's kind of the question when people talk about domains. So, my examples have always been, besides being interested in data analytics and databases, I don't think I have any other areas that are obviously memory-intensive. My guess is, that will be increasingly the problem going forward for lots of applications, but it's hard to - I'm not sure. I don't think I have any great examples.

[0:32:51] KB: Okay. No worries. So, looking at this, and looking at - so much is being driven right now by machine learning, and as you highlighted, like, the folks doing machine learning, they'll take as much as you can throw at them. I saw something recently where one of these machine learning coding tools, Cursor, was saying, "Hey, Anthropic's throttling us because they are literally out of GPUs. They can't run enough inference." Like, everybody is struggling in that domain. But I'm curious, what do you see as the big unsolved problems for the next five or 10 years?

[0:33:24] DP: Well, one of the questions I had in 2016, when I joined this machine learning organization, was, "Well, we'll see how significant this is." It's very hard to see when you're in the middle of it whether this is a paradigm shift or not. Retroactively, it's much easier. Like a decade later, you look back, "Wow." And in my career: the microprocessor, the Internet, maybe mobile phones, smartphones. When you look back, you say, "Wow, that was a giant change in our technology base." So, now that I've been there eight years, I think this is one of those things.

[0:34:04] KB: Agreed, this is a big paradigm shift. I think that's becoming more and more clear. What are the still big unsolved problems here?

[0:34:11] DP: Yeah, well, because we're at this - well, I'd say a couple of things. We're not at an upper bound on intelligence. We're not at, "You know what, that's good enough." I think with intelligence, I don't know if there is going to be an upper bound, but we're certainly not there yet. There are things where it screws up. There are things where we'd like it to be a lot smarter. So, I think on the machine learning side, can we figure out how to deliver something useful that people can depend upon? You can ask it questions and trust its answers. Can we deliver that economically? Can we reduce the carbon footprint of our solutions as well? That's a topic I ended up being involved in quite a bit, the carbon footprint of this. So, I think if we only focus on the part of the industry that's machine learning, there are giant challenges there. What's the best architecture?
Can we improve the algorithms to reduce the cost of training and serving? Can we come up with architecture ideas and hardware ideas that can reduce the cost of serving? There's just a huge set of problems there. So, it leaves us in this very exciting time. Like I said, over my career, there have been kind of boring times. This is not one of them. What's great is, if you're a researcher, then if you've got a good idea, people are anxious to hear it. There are times when everybody's making a lot of money and it's kind of boring. It's hard to get people's attention, because they're making a lot of money, and they don't need to change anything. That's not where we are today. So, it's a very exciting time for people who are interested in hardware to get into this space, or into algorithmic advances.

[0:35:54] KB: I'd be curious to dig a little bit deeper on the carbon footprint side. I saw something recently that, like, Microsoft is projecting they're spending $80 billion on data centers this year. I haven't seen similar numbers from Google and Amazon, but I know that they're also deeply investing in capacity for this. There's a lot of worry out in the world around what's the environmental impact, what's the energy impact. So, can you talk a little bit about the research you did in that domain?

[0:36:20] DP: Yes, I've worked a lot on this topic. I got started in it because, I guess in 2021, there were papers coming out that were making alarming claims. I would ask my friends in machine learning at Google, like, "Is this true?" So, there was a paper that came out in IEEE Spectrum, which is the flagship magazine of the IEEE, one of the big organizations in the world. It was 2021, and it said, I think, that by 2024, training a model would cost $100 billion and produce as much emissions as the City of New York in one month. Like, oh my God, is that true? We started investigating it, and we found that there was a particular paper that inspired these concerns. It was actually a paper by a group at the University of Massachusetts who were trying to guess what it cost for one of the Google projects. So, in machine learning, there's a thing that you do to try and find better models, more efficient models. That's called neural architecture search, where you're using kind of machine learning to find better, more efficient models. It was actually a more efficient version of the Transformer model, called the Evolved Transformer. So, they tried to estimate what the carbon emissions of that search were. They didn't have internal Google information. There's a thing that, in our paper, we call the four M's, which affect the costs: there's the model itself, there's the machine it's running on, and then the third M is mechanization, which is an M word that means how efficient your data center is. And the fourth one, which was the big surprise, was map. Because the cleanliness of the energy is highly dependent on geography, on where you are. If you're near a hydroelectric dam, or solar, or wind, the energy is going to be much cleaner. So, those are the four M's. They didn't have access to the four M's, so they did an estimate based on averages. That was fine; they were about a factor of five higher than Google was, because we had optimized a bunch of those things. Unfortunately, they misunderstood how we did the neural architecture search. So, they were off by another factor of 18. We did a small proxy model to search the space, and they [inaudible 0:38:45].
The paper itself that everybody based their work on was off by a factor of about 90, too high. But then, they misunderstood what the paper was about. They didn't realize it was searching for a new model, which you only do occasionally; then you put the model out, you publish it, put it on GitHub, and people download it and use it thousands of times. The people who read that paper thought that was the training of the model. Not surprisingly, it takes more than a thousand times as much energy to find a model as to train one. So, you multiply it together, and the conclusion was too high by a factor of about 120,000. That's how you end up with claims like it's going to cost $100 billion and emit as much as New York. Now, the problem was, once we knew that, how do we get the word out? There's no real good mechanism, because this paper appeared in a conference, to say, "Hey, by the way, remember that paper everybody's citing? It's off by 100,000." I tried to go around and give talks, but it's basically an unsolved problem. So, that's what got me into this space. I think what happens today is people just have a hard time, because you hear these numbers, you talk about tons of emissions, and it's hard to put it in perspective. So, there's something called the International Energy Agency, which is like an organization for 50 countries. What they said recently was, how much energy is going into data centers? Well, it's about 1%. 1% of all the electricity is going into data centers. That doesn't include the Internet, doesn't include crypto. This is kind of Amazon and Google data centers. Then, AI is only a piece of that. Our measurements were that it was about 15%. So, less than a quarter of 1% is AI. That's where we are today. This IEA, they said, looking into the future, even with strong growth, it'll get bigger. But compared to the other things that are going on, it's not that big a deal. Like air conditioning: air conditioning is going to grow a lot in the next five or 10 years, and it'll be a much bigger energy driver. Also, they talked about highly electrical things like aluminum plants, and just plain old economic growth was going to be another thing. When you put things in perspective, it's going to grow, but it's small relative to the other things that are going on. But it's very hard to communicate that kind of message today, because people will notice things. Because what happens is - that's the worldwide average. But in particular regions, if somebody builds a lot of data centers in the same region, then that utility company could be taxed, and that's new. So, you'll see some cities where, wow, they want to build a lot of data centers here, we have a limited energy supply there, and that's going to tax it. It's kind of a local problem, but if you have the big picture, it's probably not a global problem. But nevertheless, the growth is so much, and companies want to be able to build these data centers, that they're investing in or have made plans for nuclear energy. So, these so-called small modular reactors, or even fusion, which is hard to believe.

[0:41:55] KB: I saw that, yes. I'll believe it when I see it, but that would be incredible.

[0:41:59] DP: Well, yes. The other thing that's in the press all the time is quantum computing. One of my friends said, "There's a chance that fusion is going to work before quantum computing."

[0:42:09] KB: Fusion's been one of those that we're going to solve it in the next 10 years for about the last 50, right?

[0:42:14] DP: Right.
But people are more optimistic in the last five years that we're five years away, whereas for decades, we've been 10 years away. They are using ML to help figure this out some, but it does seem like - I can't believe it's going to be real. Obviously, if fusion happens, it'll be amazing. One of my nieces is a nuclear engineer. The small modular reactor thing - we've had reactors in submarines for decades. They're training 18-year-olds to operate nuclear reactors. So, there's another part of the space that technically could be very helpful, but the sociology of putting reactors into neighborhoods may be a bridge too far.

[0:43:02] KB: Yes. No, absolutely. I think, in that power domain - actually, first off, let me restate in case anybody missed it. If you're worried about the environmental impact, the paper that you're probably tracing your worries back to is off by 100,000.

[0:43:18] DP: And for papers in academia, we count what we call citations. Its citations keep going way up. Our paper that says, "By the way, there's a little flaw there," is way behind. We're not catching up. I was talking to a guy who does data center stuff, and he says, "That's how it works. If somebody makes a mistake in a calculation, and people read about it, wow, I didn't know it was that bad. Then, that's news." Once it's news, it is very hard to fix. Even today, if you were to search Google about what's the cost of training, you might get this so-called "same as five car lifetimes" figure. That's that original paper that's off by 120,000. But Google will help you find that erroneous result.

[0:44:00] KB: So, another topic that I like to talk about with folks who - you mentioned, you have five decades of experience here, almost as long as the industry. You've just said something: you've seen boring times, you've seen exciting times, and we're in one of those exciting times. I'm curious, with that perspective, what do you see going on in the evolution of the tech industry right now? What is changing the most? What's staying the same? What are you excited about?

[0:44:28] DP: Well, right now, it's like asking right after the microprocessor had been invented. I asked some of my grad student friends - I was a grad student when it was invented - and they told me I said I thought it was a big deal, which I'm very happy I said.

[0:44:43] KB: Got that one right.

[0:44:44] DP: Yes, because it was kind of a toy, you know, actually, in the early years. There were real computer conferences, and they also had these kind of pretend computer conferences, where these toy computers were. They weren't real computers. But yes, because it's right now, it's kind of hard to see around the AI stuff. What's very exciting that's got nothing to do with AI? I think the quantum computing stuff isn't really about AI. Quantum computing is at, like, almost zero degrees Kelvin. AI success is machine learning, lots of data. So, quantum computing and lots of data don't really fit together. But I think what Jensen said was pretty plausible. What did he say? It was, "15 years is too early, 30 years is too late, 20 years." So, it may have a big impact on pieces of computing a decade or two from now, say, but we won't have quantum cell phones. This is big science, and there will be some problems that it can solve that we thought were unsolvable. But most things, as far as we know right now, will still run on regular, general-purpose computers, including, as far as we know, AI and machine learning.
I'm not a person who thinks that's going to be this gigantic paradigm shift. It'll be an amazing accomplishment to do it, an amazing scientific accomplishment, but I don't see it as changing the industry. We just don't know how far AI is going to go, how much of our own technology we will throw away and replace with AI. For the people closer to AI, like in the vision community - I have friends in the vision community - once there was that tipping point with AlexNet and the ImageNet competition, where it won, within two years everybody had abandoned what they were doing. They completely changed all their courses. So, everything went away, and it became machine learning. It was absolutely revolutionary. How much of computing technology will that happen to? Or where are the walls, where can it happen? Because when it does happen, it's hard to see the limits. It's hard to see how far it can go. We don't know how far that wave of AI is going to spread into the computing field, but it's definitely affecting lots of pieces. People are talking about using large language models to make it easier to do what's called the register transfer language of hardware design, which is not something I would have seriously thought about. So, it's hard to ignore how pervasive AI is going to be, but we don't know how wide. If we knew the answer to that, I think we'd have a better understanding. But there's this chance it's just going to infect all of our underlying technologies, doing things very differently than we've done in the past. So, it's hard to predict, given how extensive this change is that we might be undergoing in the next few years.

[0:47:42] KB: Yes, absolutely. I feel like it's already dramatically changing the way people do software development. You can write software with this thing, and it does a pretty good job, and can speed you up dramatically. But it also shifts the types of architectures you want to write, which then changes the education path of what type of software is good software and how it works.

[0:48:06] DP: Yes. I was actually involved in another paper, kind of out of my depth, that some friends wanted to write. Shaping AI for the public good is the theme, and I think the title is Shaping AI for Billions; it's an arXiv paper. I actually did an article in The Economist about that topic. But we think a great use of the technology is to act as a kind of assistant with the human involved. So, there's a human expert, and it's working in conjunction with the human expert, which makes the human more valuable economically. So, rather than getting rid of hundreds of programmers, let's make hundreds of programmers tremendously more productive. That paper talks about people who worry about job loss. Well, in economics, there's this question of whether the product you're making is elastic or inelastic. What that means is, if you make it more efficient and cheaper, what's going to happen? Well, if it's agriculture, you're making food, and there's only so much food that people can eat. Probably, the number of jobs will go down. But in topics like software, where in the past it's made the number of jobs go up, even though you bring the price down, there's tremendously more demand for it. We think, if research focuses on elastic fields and improving human productivity, that could have this very positive effect.
But I certainly see it right now, that if you're an expert, and you can use AI to make yourself much more productive, and you can see when it screws up, it's a very powerful technology. On the other hand, if you're a novice, how do you avoid the hallucinations or the mistakes it makes from having you do something that's embarrassing? So, it looks like that's where it is right now. Who knows where it's going to be in a few years? But I think this reinventing of how we write software and how we design hardware is an example of the extensiveness of where it might go.

[0:50:04] KB: Yes. I think you're spot on. We're entering a period of software abundance, but there does not seem to be - software is like dreams made to life. We have no shortage of dreams. There's always more software to write.

[0:50:16] DP: To me, it's this, what I talked about: there's no, "That's smart enough, we don't need to go any further." I think the same thing for software quality. I mean, over my career, the embarrassing stuff is how insecure our technology is. Enabling people you've never heard of in Lithuania to steal money from your grandmother - we helped make that happen. So, if we could make a serious dent in the security problem, that would be amazing if that could happen. Maybe AI can help us. I mean, that's an example of a quality improvement that would be wonderful for our field.

[0:50:50] KB: So, we're coming to the end of our time together. Is there anything we haven't talked about that you would like to share with folks?

[0:50:57] DP: Let's see, upcoming, I think one of the - in the embodied carbon space, I'm very proud of - I think, within a couple of weeks, Google's going to publish a paper about the two parts of the carbon footprint: what it emits when it's operating and what it costs to manufacture. People have speculated a lot about how expensive it is to manufacture chips. We're going to release a paper or a blog, I think. We will have done what's called a life cycle analysis. The title of the paper is Cradle to Grave. So, the whole lifetime. We'll have the data out there to show how expensive it is to build these AI accelerators like GPUs, what the carbon emissions associated with them are, and how the operational piece compares to the manufacturing piece. That'll be the first time that data is out. It was nice that Google has both environmental experts and computer experts, and we collaborated together. I hope it will come out over the next month. Then, on the other paper, about shaping AI, we tried to make a thoughtful paper under the assumption that there are people who are open-minded and would like to hear about AI. It's less clear to me now, based on the kind of Twitter-like instant reactions based on one sentence in the paper, how many open-minded, thoughtful people are out there. But I think, hopefully, if you read it, it'd give you something to think about. We tried to talk about the upsides and the downsides, not just the upsides of AI, and not only the downsides.

[0:52:26] KB: Yes. There are a lot of outrage merchants out there.

[0:52:30] DP: Yes. There was this - Andy Konwinski is a former PhD student. We did this paper a long time ago on cloud computing. It's not as controversial now, but it was controversial when we wrote the paper. Like, everything's cloud computing, there's nothing there. So, we kind of explained it: what are the upsides, what are the downsides, how researchers could make it better. We wrote that paper.
He said, we need to do that again about AI, like you said, because of the outrage merchants. So, I don't know if we're making a dent in the conversation, but we gave it a shot.

[0:52:57] KB: I hope it gets better uptake than the carbon footprint correction.

[0:53:01] DP: Yes, that would be a goal. Can we get as many citations as the paper that's off by 100,000? If we got there, that would be success.

[0:53:11] KB: Awesome.

[END]