EPISODE 1803

[INTRODUCTION]

[0:00:00] ANNOUNCER: Jack Dongarra is an American computer scientist who is celebrated for his pioneering contributions to numerical algorithms and high-performance computing. He developed essential software libraries like LINPACK and LAPACK, which are widely used for solving linear algebra problems on advanced computing systems. Dongarra is also a co-creator of the top 500 list, which ranks the world's most powerful supercomputers. His work has profoundly impacted computational science, enabling advancements across numerous research domains. Jack received the 2021 Turing Award for "pioneering contributions to numerical algorithms and libraries that enabled high-performance computational software to keep pace with exponential hardware improvements for over four decades." He joins the podcast with Sean Falconer to talk about his life and career.

This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:01:08] SF: Jack, welcome to the show.

[0:01:09] JD: Yeah. Thanks very much. It's a pleasure to be here with you.

[0:01:12] SF: Yeah. Thanks so much for being here. You've spent a lot of your career working on high-performance computing. First of all, and maybe it seems like a basic question, but I think it's probably a good spot to start: what defines high-performance computing, and is that a moving target as the mainstream computers we use every day become more powerful over time?

[0:01:32] JD: Well, that's exactly right. High-performance computing, or what I would call supercomputers, are usually defined as the fastest computers at any given time. The marker is time. As time goes on, these computers change, of course, and they get faster, and things that were supercomputers, let's say, three or five years ago are no longer considered supercomputers. They're replaced by the next generation of machines. Supercomputers are fast in terms of floating-point operations, adds and multiplies, and they're characterized by being quite expensive as well. The fastest computer that we have today is a machine that's located at Lawrence Livermore National Laboratory. It's the fastest, by some metric, and the cost of that computer is about 600 million dollars. That computer is a supercomputer today, but I would say five years from now, it's going to fall off and not even be considered one of the fastest computers. There's an investment that has to be made if you're interested in having a supercomputer, and that supercomputer has to be replaced with that frequency.

[0:02:48] SF: Yeah. How did you first get interested in this research area?

[0:02:52] JD: Well, let's see. I guess I wanted to be a high school science teacher. I went to college to do that. In my last year as an undergraduate, I was encouraged by my physics professor to apply for an internship at Argonne National Laboratory. Argonne National Laboratory is a Department of Energy laboratory located just outside of Chicago, and that's where I was going to school. I applied for this position, and I received word that I got the appointment. The appointment was to spend one semester with a scientist. I worked at Argonne National Lab during my last semester, and that was transformational. It changed everything in terms of my outlook. I no longer wanted to be a science teacher. I felt I had the ambition to go into research and to take on this challenge, let's call it, of working at a national laboratory.
I worked there for the last semester, and then I decided to switch and become a computer scientist. I applied and was accepted at Illinois Institute of Technology in Chicago, in the computer science program, as a master's student. I worked for my master's degree, and Argonne National Laboratory offered me a position working there one day a week. I lived in Chicago, went to school at IIT, and then worked at Argonne National Lab one day a week. After receiving my master's degree in computer science, I decided I didn't want to go on and get a PhD right away, and Argonne offered me a full-time position at the lab, working alongside the researchers there in what was called the Applied Math Division. That was, again, a wonderful experience, working with experts designing software for solving certain mathematical problems, and that was something which drove me. We had frequent visitors to Argonne National Lab from outside. Visitors from various universities would come and spend a day, or a week, or a month working at Argonne alongside the researchers in the group I was part of. They encouraged me to go back to school and get a PhD. I went back to school as the student of one of the people who gave me that encouragement, and that was at the University of New Mexico. I went there and worked on my PhD. While I was there, I had the opportunity of working at Los Alamos National Lab. Los Alamos is another Department of Energy laboratory located maybe a couple of hours north of Albuquerque, where the University of New Mexico is located. I worked there, and that was, again, a wonderful experience. Both Argonne and Los Alamos had supercomputers, machines that were at the top rank at that time. Los Alamos had just acquired a computer called the Cray-1, and that computer was different in terms of its architecture. It had vector instructions. I had an opportunity to experiment and, basically, to play with this computer that was going to be used for scientific computations. Through that opportunity, I was in a position to use it and to develop ideas and methods that would work well on that computer. That was, again, a wonderful experience. Then I received my PhD in 1980 and went back to Argonne National Lab as a researcher and worked there for a number of years, until transitioning to Tennessee, where I am today.

I've had just a few jobs in my life. I like to say I've had three jobs. One job was at Argonne National Lab, and that was followed by the job at the University of Tennessee and Oak Ridge National Lab, where I am today. Before the job at Argonne, I made pizzas. Those are the three positions I claim to have held: pizza maker, then researcher at Argonne National Lab, and then finally, professor at the University of Tennessee, working as a researcher at Oak Ridge National Laboratory.

[0:07:09] SF: Well, it's wonderful that early on in your life you were able to find something you fell in love with and build a long career around it. In terms of motivation, the fact that what defines a high-performance computer or supercomputer is such a moving target, where the state of the art today is yesterday's news three years from now, is that something that's helped keep you motivated to focus on this field throughout your career?

[0:07:37] JD: Oh, yeah. It's exciting to see new architectures and then try to understand how they can effectively be used to solve problems.
The way I look at it is that I've helped design numerical libraries. These are software components which are used by other applications, and they are basic in terms of the operations that they do. Those libraries basically have to be reorganized, or rewritten, refactored, every 10 years. That refactoring is caused by the architecture changes.

If I go back in terms of where we came from, we had scalar computers, that is, computers that executed a single stream of instructions, one operation at a time. Those machines were replaced with vector computers. Instead of just operating on one number, adding one number to another number, they operated by taking two vectors and adding them together, let's say, to produce a result. You issue one instruction and that has an effect across this array of data, and that allows things to run much, much faster in terms of the flow of the data through the system. Vector computers forced a revolutionary change in the way the software was written to adapt to them.

Those vector computers were special purpose machines for scientific computations. They were very expensive, and only a limited number would be manufactured. What happened to the computing area was that we had this incredible improvement in the performance of microprocessors. We had this thing that I'll refer to as the attack of the killer micros. Those microprocessors became faster, more powerful, and were able to basically do the same function as those special purpose scientific computers that were characterized by vector computing. Microprocessors became the basic commodity component which was used in our supercomputers. Microprocessors then took on the characteristic of being put together in a parallel context. Scientific computers evolved to use these microprocessors, aggregated together in a parallel computer, to help solve problems. We had computers which had maybe 10, or 20, or 100 of these microprocessors together, communicating, passing messages back and forth over a high-speed network to allow them to effectively solve problems. That caused a major change in terms of the software; using that large number of processors together forces us to reorganize the algorithms so they can effectively do that. Those microprocessors were then aggregated further, and we reached the point where we had thousands of processors being used to help solve our problem. Multi-core came in and that, of course, added to the complexity, I'll say, but also to the layering of how these computers were effectively used.

In the end, the supercomputers that we have today are based on, I'll say, commodity processors; the basic organization is around the X86 instruction set. Intel and AMD have licensing rights for that instruction set. Our commodity processors use that, and our supercomputers basically have that as the core of their processing. Today, they're augmented by GPUs. Graphical processing units have been added to the mix to help boost the ability to do floating point operations. We have machines today which are hybrid, having commodity components plus GPUs, which are effectively used. Again, with each of these changes, it causes us to rethink the algorithms, rethink the software, rethink the numerical libraries so they can effectively be used on this architecture.
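To make the scalar-versus-vector distinction concrete, here is a minimal sketch (an illustration, not code from Dongarra's libraries) of the kind of kernel those libraries are built from, an AXPY update y = a*x + y. On a scalar machine each trip through the loop issues one multiply and one add; a vector machine, or a modern SIMD unit or GPU, applies the same multiply-add across a whole block of elements per instruction, which is why the software had to be reorganized around regular array operations like this one.

```c
#include <stddef.h>

/* AXPY: y[i] = a * x[i] + y[i].
   Written as a plain scalar loop, this issues one multiply and one add per
   element. Vector hardware (or a compiler auto-vectorizing for SIMD/GPU
   targets) performs the same update on many elements per instruction, so
   numerical libraries are organized so that their inner loops look like
   this kind of regular, independent array operation. */
void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```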
[0:11:58] SF: In terms of solving these hard science problems that motivate a lot of the work around supercomputers, is there a particular advantage that you get trying to solve them with something like a supercomputer, versus being able to use some distributed network on the cloud over multiple computers and split that work in parallel? Are there certain things that you just can't split in that fashion?

[0:12:23] JD: Right. In general, the computations that are done on these supercomputers are very data intensive. That is to say, they do a lot of arithmetic and then they do a lot of communication, transferring information from one part of the machine to another part of the machine. That kind of thing is going on continuously in these computations. If we think about a cloud-based system, the cloud-based system incurs a certain overhead associated with the latency of moving data over large distances and the bandwidth associated with that. If we think about a truly distributed machine, let's say, a machine that puts together components in various locations to create a virtual computer, that would not work very well as an architecture for a machine that would be used for large-scale scientific computing.

If we think about a cloud-based setup, take Amazon as an example, using Amazon to do the computation on a single Amazon site which had high performance processors, plus, perhaps, graphical processing units, that could be used to do these computations. The question then is that we have to move the data from where it's located over to the cloud-based service, do the computation there, and then drag the results back to the home base. Moving data is a very expensive operation, and you don't want to do that very often. Using a cloud-based system, you may get locked into that cloud-based system to do the computations over some number of years, rather than transporting the data back and forth, because that's a very expensive operation itself. As for the cost, if we think about buying a computer that would be on-premises, as opposed to using a cloud-based system, there have been many studies that look at the cost of doing that, and those studies usually conclude that on-premises computing is a better financial arrangement. By better, I mean, perhaps, by a factor of two over using a cloud-based system, if you add everything up in terms of doing the computation. Again, the cloud providers are providing the service and they set the pricing, so that's based on them setting a price that's not competitive, in some sense, with on-premises computing for the large-scale computations that might go on.

[0:15:07] SF: You mentioned data movement and communication there. Even at the hardware level, I think that's probably one of the biggest challenges to essentially scaling up the number of flops that these machines are capable of. What are some of the approaches to reduce the amount of communication that's happening at the hardware level?

[0:15:25] JD: Right. You've hit on really the bottleneck, or the point of contention, on these computers. The thing which is most expensive on our computers is data movement. It's not the floating-point operations. It's not the computation that we do, it's the movement of data to the place where the computation is going to take place.
That turns out to be the biggest bottleneck. Our machines today are really over-provisioned for floating point operations. They have too much capacity and we can't get the data to them, so in future designs, I'll say, we need ways of trying to overcome that memory bottleneck. That really is the biggest challenge that we have. There are a lot of ideas floating around about how we could overcome that, ideas like processing in memory, where we take processors and embed them in the memory itself, so that the processors are very close to the point at which the data is located and data can then flow into those processors. We also have to come up with other techniques to avoid the latency of moving data. One of the techniques that is used is organizing the computations around a directed acyclic graph. A directed acyclic graph has the ability to uncover the maximum amount of parallelism in a computation, so we can exploit that parallelism and hopefully get a much better solution. This issue of moving data is really critical in terms of the efficiencies that we see on our computers.

I'll let you in on a dirty little secret that we have, and that is if we take a look at the peak performance of our supercomputers, and we look at the applications that are running on those supercomputers and the performance that we actually see from those applications, we're getting roughly 10% of the peak performance on these supercomputers, and that's a result of moving data. We can't move the data to the processors effectively, and that's causing this low efficiency in terms of our applications.

[0:17:39] SF: Why do we end up with this case where things are over-provisioned for the floating-point operations?

[0:17:44] JD: Right. It comes about for a number of reasons. We have certain ways of measuring the performance of machines. I created a benchmark a number of years ago, and that benchmark is called the LINPACK Benchmark. The LINPACK Benchmark solves a problem: a system of linear equations. That kind of system of linear equations is used in many applications. This benchmark was created back in the late 70s, when floating-point operations were very expensive, and as a result it mimicked the way real applications would perform. As the hardware changed over time, the benchmark has come to reflect less and less how real applications behave, and architects try to do very well on this benchmark, so they add instructions and design things to run it very efficiently. The benchmark has at its core a matrix multiply, two matrices being multiplied together. That's an operation which can be highly optimized. The hardware on our machines and on our GPUs can do that operation very efficiently. That operation has been optimized to a level which gets very close to the peak performance. If our applications did that operation, we would match the peak performance. Unfortunately, matrix multiply is not the way in which we tackle problems today. We use other techniques which don't have that ability. Matrix multiply has the property of moving N squared pieces of data and doing N cubed operations on them. That's a surface of data and a cube of floating-point operations. That gives rise to a situation where you move relatively little data and do a lot of operations on the data that you move: N squared data movement, N cubed operations on the data.
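As a rough worked version of that ratio (my arithmetic, not a figure quoted in the episode): multiplying two n-by-n matrices touches on the order of n squared numbers but performs on the order of n cubed floating-point operations, so the number of flops per word moved grows with n and the floating-point units can be kept busy.

```latex
% Arithmetic intensity of dense matrix multiply, C = A*B with n x n matrices:
%   words moved ~ 3n^2  (read A and B, write C)
%   flops       ~ 2n^3  (n^3 multiplies and n^3 additions)
\[
  \text{arithmetic intensity} \;=\; \frac{\text{flops}}{\text{words moved}}
  \;\approx\; \frac{2n^{3}}{3n^{2}} \;=\; \frac{2n}{3},
\]
% which grows with n, so the cost of data movement can be amortized
% over many floating-point operations.
```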
That is something which would be ideal for an algorithm, if it just did that. Today's algorithms, unfortunately, don't just do that operation. What do we actually do on our supercomputers? Our supercomputers are used to solve scientific problems, and those problems span a range: weather forecasting, climate modeling, applications that are trying to optimize combustion engines, nuclear reactors; some are used for nuclear weapon design. All of those things usually center on solving a three-dimensional partial differential equation. Solving that three-dimensional PDE leads to a system of linear equations, and that system of linear equations is not dense, it's sparse. By sparse, I mean it has very few nonzero elements; we're trying to solve a system of equations where the matrix itself has a lot of zeros in it. In order to solve that system, we use what's called an iterative method. That iterative method has the property of doing really N operations on N pieces of data. We have to move N pieces of data and then do N floating point operations on them. That's the basic pattern in which these algorithms operate, and that's very unlike what was done for a dense matrix, where we had N squared pieces of data and N cubed operations on them, moving a little bit of data and doing a lot of operations. With the PDEs, unfortunately, we're in a situation where we just move N pieces of data and do N operations on them, and that causes this less than 10% of peak performance, because the rate at which we can move data is very poor compared to the rate at which we can do the floating-point operations. We're basically starving the floating-point potential of these machines. The floating-point units of these machines are being starved for data, waiting for the information to flow to them.

[0:22:04] SF: Yeah. Back to LINPACK and where that started, there's a saying in business that you optimize the things that you measure, so be careful about what you measure, because you end up creating these biases. Essentially, it sounds like that's what's happened there. Is the fix for that, essentially, that there should be more than one measurement, or KPI, that is used to benchmark supercomputers?

[0:22:28] JD: Right. Of course, the best benchmark is the application that you have. If you have an application that you're intending to run, that would be the thing to really benchmark. The ideal would be multiple benchmarks that span a space of applications. In the past, we developed LINPACK and we thought that was a good measure. Today, it's not a very good measure of how our computers really operate, so we've developed other benchmarks. I mentioned solving PDEs and this iterative process, so we've developed another benchmark called the HPCG Benchmark, the High-Performance Conjugate Gradients Benchmark. It uses an iterative algorithm. It matches, or tries to imitate, what applications do in solving that three-dimensional partial differential equation. It's trying to get a handle on what the performance is and what the bottlenecks are on machines today, on problems that are important to solve. That's another benchmark we have that can be used to improve our understanding of how these machines can be used and where the bottlenecks are in the machines themselves. We need more of those things. That would be the ideal situation.
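For contrast with the dense case, here is a minimal sketch (an illustration, not code from HPCG itself) of the kernel at the heart of these iterative solvers, a sparse matrix-vector product in compressed sparse row (CSR) form. Each nonzero fetched from memory feeds roughly one multiply-add, so the flop-to-data ratio stays near one, and memory bandwidth, not floating-point capability, sets the pace.

```c
#include <stddef.h>

/* Sparse matrix-vector product y = A*x, with A stored in CSR format:
   the nonzeros of row i live in val[row_ptr[i] .. row_ptr[i+1]-1],
   with their column indices in col[]. Every value loaded from memory
   is used for just one multiply-add, so the kernel is bound by memory
   bandwidth rather than by floating-point speed. */
void spmv_csr(size_t n, const size_t *row_ptr, const size_t *col,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            sum += val[k] * x[col[k]];  /* two flops per nonzero loaded */
        }
        y[i] = sum;
    }
}
```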
[0:23:50] SF: Right. Do you think that will, essentially, change or lead to an overhaul of the top 500 list that people use to measure supercomputers today?

[0:23:59] JD: Right. We have this top 500 list, which measures the 500 fastest computers. In retrospect, 500 may be too many, but okay, that's what we have. We have all that data. I view the top 500 as giving us a handle on trends, and it plots things in a nice way, so we can see where we are today, at least in terms of what we think of as the theoretical peak performance for these machines. If you don't do well on the top 500, you're probably not going to do well on other applications. That's one way to look at it. We have all this data, and it's a good record to keep. I don't want to lose that top 500 data. I want to augment it with other benchmarks. Again, we have this thing called HPCG, which measures another aspect of the computers. It really is trying to get a handle on this data movement, to see how well we do with that, and that augments the performance picture. There's another list; it doesn't have quite 500 machines on it, only a few hundred. It shows the difference between the peak performance, or the LINPACK Benchmark, and this measure which is more realistic in terms of what our applications can do today. We need to develop others. I shouldn't say we just have two; we should develop other benchmarks, indeed.

[0:25:25] SF: Yeah. Related to that, I read your paper on high-performance computing challenges and opportunities, which I encourage anybody listening to check out as well. One of the things that you talk about in there is how there's this shift, essentially, to AI-driven workloads, a lot of that happening at these cloud vendors. Typically with AI-driven workloads, you're talking about 16-bit arithmetic, versus 64-bit floating-point arithmetic. How does that change the way that you need to think about measuring the performance of high-performance computing?

[0:25:58] JD: AI is a tremendous force in terms of computing and in terms of our understanding of how things are happening. It's a great tool, we use it all the time, and it's being used by all of the applications to really strengthen their ability. Going back to the situation that we had, I'll say in the old days, we had computers which had 32-bit and 64-bit floating point operations. Those were the basic things that we had to work with. With GPUs, we had another level coming in: 16-bit floating point operations. NVIDIA provides us with 16-bit. IEEE has a standard for doing 16-bit floating point operations, and that is implemented in the hardware. Google came out with something called BF16, which is another representation of 16-bit floating point and has a slightly different configuration of how many bits are in the exponent and how many bits are in the fraction. Google's format probably is in a better position for doing these computations. It gives up a little accuracy, but it has a wider dynamic range than what the IEEE format provides. Then NVIDIA, in its more recent generations of hardware, has 8-bit floating point arithmetic. We went from 64-bit to 32-bit to 16-bit to 8-bit floating point arithmetic.
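For reference, the formats mentioned here differ in how they split their bits between exponent (dynamic range) and fraction (accuracy); the widths below are the standard format definitions rather than figures quoted in the episode.

```latex
% sign / exponent / fraction bits for the formats discussed
\[
\begin{array}{lccc}
\text{format} & \text{sign} & \text{exponent} & \text{fraction}\\
\text{FP64 (IEEE binary64)} & 1 & 11 & 52\\
\text{FP32 (IEEE binary32)} & 1 & 8 & 23\\
\text{FP16 (IEEE binary16)} & 1 & 5 & 10\\
\text{BF16 (bfloat16)} & 1 & 8 & 7\\
\text{FP8 (E4M3 / E5M2)} & 1 & 4\;/\;5 & 3\;/\;2
\end{array}
\]
% BF16 keeps the FP32 exponent width, which is why it trades accuracy
% (fraction bits) for a wider dynamic range.
```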
That lower precision is being driven by AI. It's being driven by how the neural networks work. In the neural networks, there's a forward propagation and a back propagation. The forward propagation through the neural network requires slightly higher precision for the weights and the activations. The gradients in the back propagation require a high degree of dynamic range, so the exponent needs more bits, and NVIDIA has two formats for 8-bit floating point arithmetic as a result of that. Those operations in 8-bit and 16-bit run very fast compared to what we can do in 64-bit floating point operations. That's caused us to rethink how we do our algorithms: can we leverage those lower precision computations in the algorithms themselves? Again, our algorithms have traditionally been written with 64-bit and 32-bit. Now, we're looking at using 16-bit and maybe even 8-bit floating point operations to help our computations work through their process. The way to think of this is that we're trying to leverage the lower precision to get an approximation of the solution, and then use the higher precision to reinforce, or to increase, the accuracy of the solution that we obtain. We do something very fast in lower precision and then do something slower in higher precision to refine the solution, to get it up to a point where it's acceptable for our computations. There's a whole area of research going on today in this mixed precision, trying to leverage lower precision computations to get higher speed and then pass the result off to a second stage which refines the solution to get this higher precision.
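Here is a minimal, self-contained sketch of that mixed-precision refinement idea (my illustration, not code from Dongarra's libraries): the correction step is solved in single precision, standing in for the fast low-precision hardware, while the residual and the accumulated solution are kept in double precision.

```c
#include <stdio.h>
#include <math.h>

#define N 3

/* Solve A*x = b with naive Gaussian elimination entirely in single
   precision (no pivoting; fine for this small, well-conditioned example).
   This stands in for the fast, low-precision stage that would run in
   FP16/FP32 on a GPU. a and b are overwritten as scratch space. */
static void solve_single(float a[N][N], float b[N], float x[N])
{
    for (int k = 0; k < N; k++) {               /* forward elimination */
        for (int i = k + 1; i < N; i++) {
            float m = a[i][k] / a[k][k];
            for (int j = k; j < N; j++) a[i][j] -= m * a[k][j];
            b[i] -= m * b[k];
        }
    }
    for (int i = N - 1; i >= 0; i--) {          /* back substitution */
        float s = b[i];
        for (int j = i + 1; j < N; j++) s -= a[i][j] * x[j];
        x[i] = s / a[i][i];
    }
}

int main(void)
{
    /* Small test system, kept in double precision. */
    const double A[N][N] = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};
    const double b[N]    = {1, 2, 3};
    double x[N] = {0, 0, 0};

    for (int iter = 0; iter < 5; iter++) {
        /* High-precision residual r = b - A*x and its norm. */
        double r[N], rnorm = 0.0;
        for (int i = 0; i < N; i++) {
            double s = b[i];
            for (int j = 0; j < N; j++) s -= A[i][j] * x[j];
            r[i] = s;
            rnorm += s * s;
        }
        printf("iter %d, residual norm %.3e\n", iter, sqrt(rnorm));

        /* Low-precision correction step: solve A*d = r in float. */
        float Af[N][N], rf[N], df[N];
        for (int i = 0; i < N; i++) {
            rf[i] = (float)r[i];
            for (int j = 0; j < N; j++) Af[i][j] = (float)A[i][j];
        }
        solve_single(Af, rf, df);

        /* Accumulate the correction into the double-precision solution. */
        for (int i = 0; i < N; i++) x[i] += (double)df[i];
    }
    return 0;
}
```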
[0:29:31] SF: Do you feel like that is the most, I don't know, practical path currently to reaching the next generation of supercomputers?

[0:29:41] JD: I would say that it's a viable path that's being explored and being used today, because the hardware is there, and the hardware is there not because of the scientific computations that we do in HPC; it's there because of AI. We have to try to use that to get better accuracy and to also get better speed. Understand that with lower precision, there's less communication. We have less memory traffic, because we're communicating, let's say, 16-bit words instead of 64-bit words. That helps in terms of the data movement. We have a smaller memory footprint as well if we're storing data in 16-bit. We store less. The arithmetic operations also go faster. Going from 64-bit to 32-bit, there's a factor of two speed increase. If you go to 16-bit, there's another factor of two, so we get much, much faster speeds in terms of our floating-point operations. We reduce memory traffic, so our data movement is reduced; we don't have so much of a memory bottleneck and we can carry on our computations.

I would say there's a concern, though, and the concern is that because AI is so important, hardware manufacturers are providing those 16-bit, 8-bit, and 32-bit operations, but the newer products, and in fact the NVIDIA products, have gotten to the point where 64-bit floating point on the new hardware is really not very efficient, so it doesn't run very well. It runs slower than what we were seeing with the previous generation. NVIDIA has a number of products. The current one is called Hopper. The new one is called Blackwell, which is just about to be released. Just to put this in perspective: on Hopper, the current generation that we have, the 64-bit operations run faster, in terms of operations per second, than on the newer processor, Blackwell. But in terms of 32-bit, 16-bit, and 8-bit floating-point operations, Blackwell really exceeds the performance of Hopper. Less effort is being put into the 64-bit floating-point operations, and more efficiency is being gained with the lower precision operations, which are really there for AI purposes. The scientific community needs to focus on the 32-bit and 16-bit, or mixed precision.

[0:32:25] SF: Right. What do we give up by over-focusing on the AI-specific workloads, versus being able to continue to improve the performance of the 64-bit operations?

[0:32:37] JD: I would say that it's leading us to a point where the scientific users who need the accuracy are not going to see improvements in the next generation of machines. If they can get by with less accuracy, or if they can get by with a mixed precision algorithm, then they would be able to see the advantages of that newer architecture. It's going to cause a shift again, where people have to rethink how they implement things. There's a refactoring that goes on, redesigning things around the architectures that we have. As I mentioned earlier, we have to redesign our software as the architectures change, and the architectures change in a radical way, I'll say, every 10 years. There's that reinventing, or rediscovery, that has to go on to refactor the software to effectively embrace the hardware that's being presented.

From a pessimistic standpoint, the way I sometimes characterize it is that we ask for a supercomputer, we ask for it to have a certain peak performance, and we have a certain amount of money that we're willing to pay for it. We bid a computer based on those things, the peak performance and the money that we have. The result is that manufacturers produce a machine which matches that peak performance and comes in at the budget that we have. But being able to use that machine requires a tremendous effort in terms of putting together algorithms that can fit onto the architecture that we have and trying to really embrace what's there. A better way of designing a machine would be using what's called co-design. Co-design says, we get the architects together with the application people, with the software people, and with the mathematicians, and we design a machine that can be effectively used by that group, a special purpose machine for the scientific applications. But unfortunately, that would be too expensive to manufacture. Very few of those machines would be produced, and they would go the way of the old vector computers, which are considered dinosaurs today.

[0:35:01] SF: Are there some people that are using this end-to-end co-design approach?

[0:35:08] JD: Oh, sure. Yeah. Co-design was done, or was tried, with the big supercomputers that we have working at the national laboratories. The Department of Energy is highly engaged in using high-performance computing for solving its most challenging problems, and it invests up to 600 million dollars per computer. There are three large exascale computers; exascale means 10 to the 18 floating-point operations per second, and those are 64-bit floating-point operations. Those machines are at Oak Ridge National Laboratory, Argonne National Laboratory, and Lawrence Livermore National Laboratory. They were recently manufactured and put in place.
They have co-design in the architecture, trying to match the architecture with the applications, but it's hard to redesign a machine and still have it come in at the appropriate budget. Again, all the machines are based on commodity processors; the X86 architecture from Intel and AMD provides the basic instruction set on those machines, and each of those machines has GPUs. The Argonne machine is using Intel processors, and Intel has a GPU which is being used with them. Oak Ridge and Livermore have AMD processors with AMD GPUs associated with them. Those machines are being used to drive our large-scale scientific computations today.

[0:36:49] SF: Yeah. I wanted to dive in a little bit on this topic of exascale. As you mentioned, it's 10 to the 18 flops. I still remember when the first 1 gigahertz processor came out; you're talking about 10 to the 9. Now modern chips in the latest iPhone are starting to hit 10 to the 12, the teraflop range. Then there's still petaflop, and then exaflop. What does it take to reach that raw computational power? Is it all about the architecture, or are there other things that have to go on?

[0:37:19] JD: Well, of course, it's based on architecture. The architecture has to be able to support that. Think about our computers today; those computers are potentially incredible in terms of their capability. Think about the machine at, let's say, Livermore National Lab. It has a peak performance of 2.7 exaflops. That's the theoretical peak performance for 64-bit floating point operations. That machine has a large number of processors and a large number of nodes. Each node of the machine is composed of three AMD CPUs, commodity AMD CPUs, each of which has eight cores, so three eight-core CPUs, plus three AMD GPUs. Those are put together and that's considered a node of the machine. The full machine has over 11,000 of those nodes. The nodes are connected by a high-speed interconnect that allows them to pass information from one node to another. We have to connect 11,000 nodes in this machine to effectively put everything together.

The machine consumes 34 megawatts of power. What's a megawatt? At my home here in Tennessee, if I use one megawatt over the course of a year, I'll get a bill from the electric company for one million dollars. A megawatt year is about a million dollars. This machine at 34 megawatts, that translates into 34 million dollars a year just to turn the thing on. You can see how these machines are enormous. The floor space is about two tennis courts. Think of two tennis courts covered with these 11,000 nodes and an interconnect, trying to make everything work correctly.

I mentioned that it has a peak performance of 2.7 exaflops, but that's for doing 64-bit floating point operations. If your application can use 16-bit, and again, 16-bit is in the GPUs, which are really there to help with AI computations, machine learning, and data analysis, you can reach 17 exaflops. That's the difference between 64-bit and 16-bit floating point operations, 2.7 compared to 17. There's a big potential there if you can use that power. Now, the machine has all of these components in it, but the GPUs are really where the performance comes in. 99% of the performance of this system is based on the GPU performance. It's critical that the applications run in parallel; there are 11,000 nodes. It's critical that the applications use the GPUs, because 99% of the performance is coming from the GPUs. And it would be ideal if the applications could use mixed precision and benefit from some of the 16-bit floating-point potential of this architecture. These are incredible systems. I think of these machines as similar to the Webb telescope or the Hubble telescope. Each is a tremendous instrument for scientific discovery. We need to use them effectively to reach the potential that they really have.
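The figures in that answer tie together with one line of arithmetic (simply restating the numbers quoted above):

```latex
% Energy cost and precision headroom for the Livermore machine,
% using only the figures quoted above.
\[
  34\ \text{MW} \times \$1\,\text{million per MW-year} \approx \$34\ \text{million per year},
  \qquad
  \frac{17\ \text{exaflops (16-bit peak)}}{2.7\ \text{exaflops (64-bit peak)}} \approx 6.3\times.
\]
```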
[0:41:01] SF: Thinking about the future of HPC and moving beyond some of the types of architectures you're talking about, there are things like quantum computing, which there's a lot of hype around, and Google and IBM have big investments there. What are your thoughts on the potential of quantum? Is it overhyped?

[0:41:19] JD: Of course. All new things are overhyped. We have to give them some leeway here on the hype. Quantum computers have great potential for the future. They're not going to replace the conventional computers that we have; they're going to augment them. It's a great area for research, is the way I look at it. As for the hype, I have to assume that's going to happen; most radical changes we've seen have gone through the kind of hype that we're experiencing with quantum. There will be some disappointments, because the hype will not be reached, of course. But there's great potential. Today, we really have only a handful of algorithms, just a few that can be expressed in terms of the use of quantum computing, and we need to develop more algorithms. We need to understand where that fits in terms of the way in which we are doing our computations.

I think about computers today, the physical computers: we have CPUs, which are commodity processors, and they're augmented by GPUs, which have this ability to do AI computations and the lower precision very, very efficiently, and we potentially have the ability to augment that with a quantum-based device that we can add on. I see the supercomputers of the future having, again, commodity processors, GPUs, and other things attached, one of which could be a quantum computer. Another thing we might think about is a neuromorphic computer. We might think about an optical computer. We might think about some DNA-based computational device, all being used in some way where it fits the applications that we intend to run on that machine. Putting a quantum computer on a supercomputer today is a good experiment. It's good for research. It's good for those applications that could benefit from that quantum capability, but the quantum computer is not going to replace the computers that we have today. It's not going to replace my laptop, in my lifetime anyway. I think it's a great research endeavor. It's overhyped, but we have to give them some leeway in terms of the hype. It's a great area for new ideas to emerge.

[0:43:41] SF: Yeah. I remember hearing about DNA-based computing, probably when I was in graduate school, which is now 15-plus years ago, but I don't think I've heard anything since. What is the state of the art there?

[0:43:53] JD: Yeah. I mentioned it, but I don't really know what the basis is for DNA computing today, or where it sits in terms of the potential of doing anything really reasonable. I'm not an expert in DNA computing.

[0:44:06] SF: Okay. Yeah, I was just curious.
Looking back on your career, you've made some really big contributions to the computer science research world that have had long-lasting impact. What do you think is the project that you're either most proud of, or that you feel had the most impact?

[0:44:28] JD: Well, I like to think I've contributed to three areas. I've contributed to the design of numerical libraries for dealing with linear algebra. Those libraries have undergone changes at every major architectural shift; they've tried to follow the architectures that were coming out. Again, these new architectures emerge every 10 years, and the software and the libraries have to change, or adapt, to the architectures that are there. One of the things I feel I've contributed is this refactoring of the libraries that has followed the evolution of the hardware, trying to match the hardware in some way.

The second thing I've contributed to is helping to design and come up with a community-based way of doing message passing. In the early days, we had computers that were parallel and that had to pass information from one point of the computer to another point. Each manufacturer had their own way of doing that. Each research group had their own way of implementing that. We had what I think of as the Wild West, where we had many ideas and many things that were tried, and I was able to help put the community together to come up with a standard for doing message passing for scientific computations. That standard is called MPI, the Message Passing Interface. A small group of talented researchers got together and within a year and a half put together something that became the community standard, which is in use today and has been used for the last 20 to 25 years in doing our computations. That's a major thing that has had an impact on the fabric of how our scientific computing is done.

The third thing relates to performance evaluation. Again, I was instrumental in putting in place this thing called the LINPACK Benchmark, which is the basis for the top 500 that we put together, and that has evolved into other testing mechanisms. I mentioned the HPCG Benchmark, which does something more relevant to our computations today, and that has actually evolved into another benchmark which uses mixed precision, trying to leverage the 64-bit, 32-bit, and 16-bit computations in a way that exposes what the potential is if applications can effectively use that level of performance. I would say those are the three things: the numerical libraries that have followed the architecture trends, putting in place a standard which is used by the community for doing message passing, and then an effort at performance evaluation through benchmarks.

[0:47:36] SF: Awesome. Well, Jack, thanks so much for being here. I feel like I could pick your brain all day, but I'm sure you have other things that you need to do besides talk to me. It's been a real honor.

[0:47:44] JD: Great. Very good, Sean. Thanks for the opportunity.

[0:47:47] SF: Cheers.

[END]