EPISODE 1918

[INTRODUCTION]

[0:00:01] ANNOUNCER: Formal methods are a branch of mathematics and computer science focused on proving the correctness of systems, and they have long promised a more rigorous foundation for software. However, their complexity has kept them confined to a small community of specialists. That is now changing as agentic AI systems take on increasingly autonomous roles. The question of how to define and force and verify what those agents are allowed to do has become urgent, and automated reasoning is emerging as a critical part of the answer.

Byron Cook is a VP and distinguished scientist at AWS, a professor at University College London, and a program manager at DARPA. He founded the automated reasoning group at AWS over a decade ago, where his team built the foundations behind products like IAM Access Analyzer, VPC Reachability Analyzer, and Bedrock Guardrails. In this episode, Byron joined Sean Falconer to discuss how automated reasoning works and why it scales so well with AI, the rise of neuro-symbolic approaches that combine formal logic with large language models, what it means to formally specify agent behavior using temporal logic, and why the convergence of agentic AI and formal methods may represent one of the most significant shifts in how software is built and verified.

This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:01:44] SF: Byron, welcome to the show.

[0:01:45] BC: Thank you very much. Thanks for having me.

[0:01:47] SF: Yeah, absolutely. I'm looking forward to this. We're going to probably cover a lot of ground today. You've had a pretty rich history. One of the things I wanted to start off with was, so you're a VP and distinguished scientist at AWS. You've been there over 11 years. You're also a professor at UCL and a PM for DARPA. My first question, I guess, was like, how is all that possible? This seems like a lot of jobs to balance. How do you manage wearing those three different hats simultaneously?

[0:02:14] BC: Cognitively, the same job, right? I think in each role, I'm trying to identify really big things that other people can't wrap their head around, that maybe they don't think are possible. I'm trying to move that seemingly impossible to the possible, and then trying to de-risk that. So, either trying to establish basis for confidence, or prototype stuff. That often involves getting people in the orbit of the scientific world and the startup world and the university world to find the right people with the right tools to prototype something and then simultaneously write some sort of report, or some sort of PowerPoint document, or something in the language of the institution to help them appreciate what's possible and to give your leaders a lever to pull on to prioritize these kinds of things.

It's all the same. It's the difference between Spanish and Portuguese. You're using slightly different nouns and verbs, but largely they're in the Indo-European family of languages. You're doing the same thing in these large organizations. They all speak the same language. Then even if I'm in Amazon, I'm at any given time have three or four things going on, some sort of context switching between those things anyways.

[0:03:23] SF: Was that something that you're always interested in, trying to tackle these types of problems that seem impossible?

[0:03:29] BC: I saw my father work between institutions and basically see the community as his stakeholders, or his customers, or the people he was working with and cross-organizational boundaries. In my life, I'm working on cars, or doing music, or these kinds of things. People are floating in and out of organizations and you're just trying to solve problems together. I've always seen the organizational boundaries as relatively porous. Then I got really interested in logic and it's a very small crowd of people. There's maybe 3,000 people in the world who specialize in this area. We're always at different institutions, different places, meeting up with each other. It's like bands. You tend to, with three or four people realize there's something really cool to go after. You go after it for two or three papers and then after that, then people often drift off and do the next thing.

To me, it's very fluid and the - or the particular details of the organization don't really matter. I'm just usually be in the place that's the best place for me to be in to help the community figure out how to solve problems.

[0:04:27] SF: Yeah. One of these things is one of the seemingly impossible things that you were able to show was potentially possible was around, especially you built a tool that could automatically prove whether certain types of programs could terminate and making a significant contribution, essentially, to the halting problem. I guess, what made you believe that something like that was solvable, given that, I say, conventional wisdom and even now in Turing you said otherwise?

[0:04:55] BC: Yeah. There is a theorem that the problem is undecidable. That's a very meaningful result, right? That means that you cannot build a tool that can take any program in as an input and always itself terminating with an answer and always giving a right answer. You have to cut a corner somewhere. You can cut the corner by saying, well, relax, the restriction always has to give an answer. We'll be okay with maybe the tool not giving an answer from time to time. They will say, well, maybe we can get that - the occurrences of it not giving an answer down to some level where we're actually still getting value from the tool. That's reframing the theoretical result from Gödel and Turing is fundamental and absolutely staggering. On the other hand, there's lots of things that are unsafe that we do all the time. Driving is unsafe, but we get in the car, and if we get to our destination, then we know we got there safely.

The insider around that area is really interesting. I was actually working in the windows kernel team and I was carpooling with a person, a developer in the windows kernel team, and he kept - when we're talking about things that should hold of the kernel, he kept using the word eventually. Like, if this happens, that eventually should happen. I realized, that is the termination question is like, termination in the word 'eventually' have - there's a theoretical result that connects them and saying they're essentially the same thing. I realized that was an important problem. Then I began thinking about what tools could we put together from existing pieces we already had to do something quite quickly for device drivers. Then that was successful. Then that turned into reasoning about biological systems and it turned into reasoning about non-blocking concurrency, and so on.

[0:06:30] SF: Yeah. Was the original focus on device drivers, was that partly motivated because at that time, device drivers was one of the biggest culprits of [inaudible 0:06:38] failures in operating systems?

[0:06:40] BC: Yeah. In the 1980s, if your computer crashed, it was like, well, it's sad you lost your little word processing document, or whatever you're working on. But as we went to data centers, the difference between crashing and not crashing was the difference between the customer choosing your product or not. It became important to root out those crashes. It turned out that in the early 2000s, 85% of crashes in the operating systems were actually due to device drivers. It became an important problem to work on.

[0:07:06] SF: Wow. Yeah. Yeah. Talk about that. You solved that. This is a high ROI, I guess, needless to say.

[0:07:11] BC: Yeah. They were really neat, because the device drivers were not actually that big. They're 80, 200,000 lines of code, or 80,000 lines of code. They'd have 60 loops. The data structures they were operating over are actually relatively simple. They were cyclic W linked list. In many ways, it was actually a slightly - it was an easier problem than a marginal problem. Then, we're going to do some really neat things that were tuned to that domain, which allowed us to bypass some of the - not theoretically, but we practically bypassed some of those challenges around undecidability.

[0:07:45] SF: Yeah. In a lot of ways, you're bounding this unbounded problem, right, to a more specific class of the problem?

[0:07:51] BC: Yeah. You can't solve the problem in generality, so you're going to have to identify niche areas where while it's still my time out from time to time, it's not going to time out enough to cause sufficient pain to the person using it for their niche domain. The only hope is to get as close as you can to the person who's with the understanding of the niche problem to understand the dynamics of their problem. That happened again with reasoning about genetic regulatory pathways, where I was like, "Oh, I see." Usually, they have this shape and usually, they're of the size, usually, they're asking this kind of question. Actually, we could do something that 95% of the time is actually going to give them an answer and maybe 5% of the time it says, "Sorry, I can't answer the question."

[0:08:31] SF: Okay. You've founded the Automated Reasoning Group at AWS, over a decade ago. What was that original pitch to the Amazon leadership for them to be excited and want to believe in this?

[0:08:42] BC: Let's see. I had exhausted, I mean, I'd written all the research papers. I was like, how much more can I do in this ivory tower research environment that I was in at the time? I was looking for a place to go where I could maybe have more breathtaking impact with the work I was doing. I went and talked to a bunch of potential employers and figured out which one I would want to work for. I think that AWS is pretty happy to have me, because they - it was pretty risk-free, right? I was going to join. They're only hiring one person. I reported directly. This is, I think, if it didn't work out, then I would exit. I would imagine that would be the discussion that they were having like, how risky is this hire?

Amazon, I think, does a really good job of traditionally is incrementally investing in things that are working. They look for signs of direction of change on things that are going well, and they have a whole language around year over year change, and so on. They look for bets that are going well, and then they continue to feed the bets and double down on bets. It's a whole calculus of what to double down on and how to prioritize things. Once I got in the system, then Amazon just naturally went through its mechanisms of prioritization and it just grew like a weed, really.

[0:09:57] SF: Given that AWS runs critical infrastructure for millions of customers, going back to some of your work with the drivers and proving whether they can run successfully or not, are there similar class of problems around IAM policies and network configs, and things where if you mess those up, it can go really bad as someone operating this type of infrastructure? Is there a way to, essentially, apply some of your work to those areas?

[0:10:23] BC: Yeah. That's what we did. When I first joined, I identified - I went and chatted with a bunch of people and looked at a bunch of documentation and identified the top 20 security concerns, or issues, or close calls, and so on, and then prioritized them by what I thought the kinds of things that I had done in the past could help on. Identified four or five of them. Hired a few of my friends, and then we went after them. Yeah, you identified two of them. One is reasoning about the semantics of AWS policies. Another one is the virtual networks, because if you get those wrong, then you're going to have a really bad day.

Then we also worked on low level code, like virtualization and cryptography. Then we also worked on the distributed systems protocols. How do you, for example, maintain secrets across a distributed system, such that they're highly available and durable, and so on? The thinking around protocols and distributed systems was another area. Then each one, we went after with great gusto. Then the idea is that if any one of those projects were successful, then we'd be successful. Fortunately, but unfortunately, they were actually all successful.

Then I spent two or three years really holding on to many high-voltage wires as I was trying to get each of those projects - They were all quite hungry to move to the next level. I found myself to be the bottleneck on the communication around how to grow those things and how to get them into the right place. It took me a couple of years to get out of that.

[0:11:45] SF: Yeah. How's that work? I'm assuming, your team and yourself, especially when you were, essentially, an IC function in this role, until you hired some of your friends and so forth. You're doing this, it's like research project, I assume. Then maybe you're doing the proofs. Maybe there's a POC involved. How do you go from that, essentially, to this thing running in production and serving millions of people?

[0:12:05] BC: One piece of advice I got when I first started was the person was like, "Hey." Then previous employers, you've gone for these moonshots, right? You have these crazy big things. What you want to do here is you want to find a moonshot ladder. You want to find every quarter, every six months, you want to deliver something that if people don't have any context on your bigger picture, why you're doing it, what's possible. In fact, you probably don't even want to tell them the bigger picture. Deliver that. But then, you know that each piece, each delivery adds up to the bigger moonshot that you're going after. Focus on that way. Amazon is a very "show me" place, where people believe it when they see it.

Rather than talking about what you can do, deliver something, maybe a lot smaller than you normally would, build trust through that and then iterate, iterate, iterate. The projects that we were very successful on were those that we were able to break up the bigger moonshot into a much smaller set of steps. For those that we weren't able to, they languished. It was only once we'd built trust and built momentum on some of these others that we were able to come back to this bigger things and take them on again.

[0:13:07] SF: That makes sense. Just even my experience having worked at a variety of different startups, when you start in a new role, a lot of times, it's like, how do you build to something big, but also, how do you show value as soon as possible? It's much easier to do that with a smaller, more constrained project even. Then as you build trust, people believe in maybe the bigger thing in the longer-term deliverable.

[0:13:28] BC: Yeah. Yeah, exactly.

[0:13:30] SF: Is there any challenges around, from an engineering perspective, where when you're thinking about formal verification and you're doing that in the lab, that doesn't necessarily need to be as fast as it might need to be when you're actually delivering this type of stuff in production? Is there ever a challenge where the formal verification, perhaps in the lab environment is slow and then how do you engineer it to something that can run in production at scale?

[0:13:55] BC: There's actually another problem that's bigger. I'll answer your first question, but then I'll address the other question, that's actually more on my mind. In the lab, you're often relatively limited to the resources you have. You have maybe a couple grad students and a few laptops, or something like that, right? These days, maybe you have a bit of cloud, but you don't really have the resources you would have. It turns out, there's not been a lot of science, but the science is really interesting in growing on how to distribute formal reasoning. 

There's a thing, for example, called MATLAB Mono, where they distribute propositional satisfiability solving. That's really powerful. If you have a cloud scale, I can see a team to operate a cloud service and you have the resources you have at AWS to run big machines, then suddenly, actually, it's much, much faster. The performance of the reasoning isn't the big gotcha, and it's getting faster and faster. I'm not worried about that. What is a problem though, is scaling the - Whenever you're doing reasoning about systems, you have to articulate what you want to hold of the system. Like, is it all data at rest is encrypted? Is it we never log credentials? What property do you want to hold of the system? You have to do that in a way where you get the details right and you have to do it in a relatively rigid setting. You have to use a temporal logic, or something like that to express it. There's a very limited number of people who can do that.

Then the other pieces that you have to be tasteful about what you're not going to prove, right? You're not going to go and prove the correctness of the language runtime, the compiler, the microprocessor the program is running on. You have to basically build assumptions that the reasoning relies on, and making those decisions and articulating those, it's a lot about how do you do the encoding, such that the tools work. Again, that's a fairly small set of people who can do that.

I'd say, the big challenge, oh, and then culturally, your average person who knows about formal reasoning is not going to be very Amazonian, if you will. There's a big cultural gap between people who are relatively rigid, logic. When they looked at a lot of opportunities in the world, they said, "No, I'm going to do mathematical logic instead." What interests them and what drives them is really different than what you're going to find with a person who joined Amazon, because they're interested in the cloud, or other dimensions. Culturally, they're a little bit different.

Bringing all those pieces together, finding a manager who can run scientists, who understands Amazon, but also understands the science community. Then another big one is finding people who can drive the tools, or knew the encoding. Thankfully, the problem around the encoding is with AI tools these days, actually is getting dramatically easier. With the idea of neuro-symbolic AI, where you combine formal reasoning with the neuro-inspired techniques, like transformer models, suddenly you have the ideas of auto-formalization. There's a whole bunch of new building blocks that we can use to really scale this activity out. That's actually turned out to be tremendously helpful quite recently.

[0:16:43] SF: Yeah. I wanted to get into the neuro-symbolic AI, which you mentioned there. But I guess, for the audience that's probably not necessarily super familiar with it, what is the definition of that area? Is the neural, essentially, just you're leveraging LLMs in this case, and that is serving as the neural layer, and then you're combining that with more traditional symbolic reasoning?

[0:17:05] BC: I mean, you're probably going to pull five experts in the area and they're probably going to argue about the definition. Broadly speaking, in my mind, it's the combination of a symbolic, where you're pushing symbols to represent the infinite. X less than Y is the set of all pairs of integers, if you will, such that the first is less than the second, right? That's an infinite set. When you begin talking about formulae with variables and you're pushing symbolic formula, you're doing symbolic reasoning. That's the symbolic, or neuro-symbolic. Then the neuro is these, yeah, these - I mean, LLMs is absolutely the classic. That's the big hot thing right now, right, for sure, but more statistical or Bayesian thinking. Then when you combine them, you can get - depending on how you combine them, you get different advantages and disadvantages.

It depends on - You can use formal reasoning to synthesize training data, to make your language model, for example, much more powerful. Actually, a lot of model providers do that today. They use the lean theorem prover, for example, and synthesize a lot of true and false theorems and then will train over that data and then you'll find that the language model is actually better at recipe design, and so on. Then the other extreme is you could side-car a formal reasoning tool on the output of the language model and then either ask the language model to express it in the formal language, or do some translation. That's the other extreme.  Then there's middle grounds, where you try and integrate symbolic reasoning into the inference procedure itself. You see a lot more of the two extremes and not as much in the middle right now.

[0:18:36] SF: I mean, from a, I don't know, big picture perspective, if you're trying to achieve AGI, or something like that, it seems like you would need at least both sorts of approaches, where LLMs are very good at a certain class of problems and then these more formal reasoning, things that you can define as correct are going to be best used for a different class of problems. They're complementary. If you can put them together, it feels like that would be a necessary step to reach some sort of more general intelligence across a variety of different problem sets.

[0:19:05] BC: I'm very much an armchair cognitive scientist. I'm not a practicing one. I am a specialist in my branch of AI, if you will. You'll want to take what I say with a grain of salt. I like your characterization that it's a necessary, but not sufficient condition for whatever AGI means, with breathtaking AI that's hard to know if this is a human or not. I would think you need both logical and the transformer type stuff. The thing I'll say is that for years, the propositional satisfiability solving, right, is this classic NP complete problem.

The tools for propositional satisfiability solving are far, far, far exceeded human capability, like years and years and years ago, and they get better and better and better. You can take the entire design of a nuclear power plant, or a railway switching, or whatever end with millions of variables, millions of connectives, ands and ors, and so on, and show there are no satisfying assignments, or fine satisfying assignments in those spaces. You could collect the entire humanity and ask them to solve it and they wouldn't be able to solve it in that time.

It's like, in a sense, it's far exceeded human capability long ago, but they were extremely brittle to use. To get the answer, required you to understand logic. Then if you go to a satisfiability module, a theory solver, a lean, or Isabelle, these kinds of tools, you needed even more and more expertise just to ask your question.

[0:20:29] SF: Yeah.

[0:20:29] BC: That was typically the real bottleneck for these -

[0:20:32] SF: Yeah. Does pairing that with an LLM, where you can have a natural language interface in these, that I assume significantly democratizes the access to it?

[0:20:41] BC: There's all kinds of tricks you can pull now. You can call language model multiple times to do translation from natural language to logic, and then you can use a theorem prover now to prove equivalence of the different attempts. The more you do that, so the more confident you got the translation right. Once you've got that into a formula, you can walk the formula. You can use the propositional structure, or the logical structure in the formula to identify interesting corner cases and then articulate them back in natural language. You can help someone who doesn't understand logic know what the formula is saying, and then you can go from natural language to logic.

Then now, we've got those pieces of equipment. There's just so much you can do around, like if your agentic workflows, if you're ruling out bad behaviors from agents and then you decide later on, you don't like the formula that describes the safety range, you can then modify that formula and then go replay all the agentic actions and see what would have happened, and just all kinds of stuff you can do that wasn't possible before.

[0:21:39] SF: Yeah. I mean, given historically, the hardest part of some of these formal methods was the human bottleneck of writing the specs. How is this transforming some of the - even the things where you found success in previously at AWS around the formal verification around policies and things like that, is it changing that work as well?

[0:21:59] BC: One choice we made when I started the automated reasoning group, I remember it was Ken McMillan as another scientist in my space. It was just before I joined Amazon, I had told him I was going to join Amazon and I was having coffee with him and he joked, "Man, I'd love to get out of program verification." Actually, that idea stuck with me. When I joined Amazon, the policy reasoning and the network, VPC Reachability Analyzer, that's an AWS product and IAM Access Analyzer, or S3 Block Public Access, those are all based on these tools that are actually combinatorial reasoning. They're actually, they are decidable, they're not undecidable.

The real trick is to do combinatorial reasoning. They're the tools are propositional satisfiability, or satisfiability modular theory solvers. They're just unbelievably fast. No LLM will ever beat them. What the LLMs do now is they actually call those tools. That hasn't really changed. What has changed is rethinking about programs. Because now if you have loops, right, and you're now worrying about termination of the loops, or you're worrying about finding an inductive invariant over the loop, or you're worrying about concurrencies, you're looking for a live guarantee constraints, now the algorithms, the procedures that existed before were pretty brittle. Now you can just ask the LLM - when the LLM is writing the code, you can just say, "Can you find the inductive variant for me at the same time?" And it does a pretty good job.

Then what we've seen with recent advances in the models, in the past year is really breathtaking. Whereas before, we had five humans who could drive these theorem provers, those humans are now really excited, because they're able to let the models go and do the proof search for agentic tools, go find the proofs for them, and they run thousands of these jobs at the same time. It's not only 10X, or 100X. It's 1000X productivity gains from that small seed of individuals who can drive these things.

The amazing thing is because these tools, like lean theorem prover, they are yes or no tools, right? You pass them very complicated things and they will say yes or no. It's like, the scaling is unbelievable. As opposed to, I don't know, poetry or other areas where the yes - it's not as easy to say yes/no.

[0:24:07] SF: It seems like, there's a couple of different, or there's a spectrum of approaches where you could go by an LLM, or a reasoning model with these formal methods, where it's like, verifying AI output, or even some of the stuff where the LLM can rely on the formal method and trigger that to handle certain logical reasoning problems, or even using AI to make the formal methods easier to use and democratizing it. Out of that spectrum, what part of that, I think, excites you the most? Where do you see the most opportunity?

[0:24:38] BC: Well, I mean, there's just nothing but open field for the opportunity. Actually, all of them. I mean, it's just absolutely unbelievable. I just really didn't think it would happen in my lifetime that previously, I had four or five friends of mine who I'd worked within the past in different institutions and I brought them to Amazon. Basically, their job was to sit at a terminal and drive these tools, and they were the neuro, right? They had read all the proofs and understand all the tools and they could run these things. Suddenly, it's just like, you get unbelievable capacity from them. I actually don't see a ceiling for that productivity.

The thing that I'm probably most interested in intellectually is agentic safety to - I think of agents as allowing us to scale things that were. We only had socio-technical mechanisms before. Now you have a choice of socio-technical, or these agentic tools, and there's a disadvantages and advantages to them, right?

[0:25:32] SF: What do you mean by socio-technical?

[0:25:34] BC: A socio-technical mechanism is a mechanism where you are inserting people at certain stages as a protocol to raise the bar, or make decisions, right? The police is a socio-technical mechanism. I don't want to go to jail. Therefore, there are certain things I won't be doing, because I don't want to go to jail. My employer can call the police. Whereas, agents aren't afraid. They don't fear jail, whatsoever.

[0:25:58] SF: That might be the actual responsibility of the engineer at this point. If the AI is writing the code, you can't send the AI to jail, but you can send the person who sent the instruction to jail, basically.

[0:26:08] BC: Right. That's where this formal stuff comes in. If now, what we're doing is if we ask the human to look at the output, if we say, "Hey, you're responsible for the output always," in some sense, you haven't really scaled the person. Now they're just going to hit cognitive overload. Imagine they have a thousand agents writing code on their behalf and their entire job is just to read the output of agents, they're going to miss things for sure. What you really want is you want to actually write down. Whereas, maybe we weren't before, because there wasn't an advantage to doing it. Actually, write down what you want to hold. What is availability? What is confidentiality? What is integrity? What is privacy? What is sovereignty? What is durability? What are these concepts? Can you write them down in a linear temporal logic, or computational tree logic (CTL)? What is the formalism for which you would want to express these things, and then can use these auto formalization techniques to get them right.

Then now, can you, in an agentic execution environment, before allowing the commands, or programs that have been synthesized by the agent's LLM, allow them to go execute, can you now go statically check that those things won't violate the confidentiality, integrity, durability, availability rules and then allow it to go? If you can get that right, then now agents just become unbelievably powerful for us. It's a very humanistic view, because it's still saying, it's a more declarative programming model. You get to say, we in natural language in a room can decide, this is what availability is. This is what integrity is. This is what confidentiality is.

Then if things happen in the future that we don't like, then we can refine that definition together. Then one could imagine best practices where they're open source and like, oh, this is what these other enterprise organizations use for confidentiality. We should copy those. For compliance, we can now talk to the auditor and say, "Well, here's our definition of confidentiality." Then, I think really unlocks a lot of a scaling dimension that's just previously unheard of.

[0:28:06] SF: The first problem is that if everybody's running a ton of different agents, at some points, people are going to miss something, because they're content switching all the time. If you're doing this for engineering, you're spitting out a ton of code. There's no way you can hand review every single line of code and make sure that it's right. Even if you did, you still might make mistakes. Now you're shifting some of that problem to trying to do, like spec this out beforehand in terms of around what the expectations are. Are you shifting that problem to - you have to be able to pre-determine all the situations, where the agent might get itself into trouble, right? How do you start to formalize that and also, I guess, bound that problem down to something that doesn't take forever to essentially define?

[0:28:50] BC: The answer is yes, and there's a quote that's - this is related. Einstein apparently said, "Everything should be as simple as possible, but no simpler." Yeah, ultimately, we are still having to decide as a society, or as a leadership team, or as a family, what's allowed and what's not allowed. Actually, organizations that scale tend to articulate these things down, right? Amazon, we have the leadership principles. That really helps Amazon scale a lot. When I'm going through an uncertain time, or something's happening and I can't just go call my boss every time I'm confused, how am I making these decisions? I rely on the leadership principles to give me guidance. In an agentic setting where you don't have human judgment, you will want to be rather more concrete, but then you can use symbolic formulae to describe the infinite set of things you're okay with and not okay with.

Then using that, the formula, you can then find the interesting witnesses of what's allowed and not allowed to show them to a committee, to the society, to a leadership team, or to a family, these are the kinds of things that will be allowed or not allowed. How are you feeling about that? And run those things ahead of time. I think you can't make it simpler than that. You ultimately do want humans to set the policy on what is the context in which we're going to let agents just rip and what are the boundaries for which we don't want them to cross. For like, we don't want them to send money to certain countries, or buy certain kinds of stocks. I think it's an exciting opportunity. I think it is the scaling that you've got glimpses of and really large successful organizations, but it's a way of thinking that a lot of people haven't thought about before, but it's an opportunity now for scaling yourself, your family and organizations.

[0:30:34] SF: A lot of times in either, whether you're a model provider, or you're building some sort of agent system, you'll use things like the system prompt to put in certain guardrails around like, hey, this agent should be able to answer certain types of question. This sounds like a more formalized version of that, where it's not just about the formalization of the spec, but also when the agent does something, being able to go back to the spec and essentially, check in a formal fashion, whether you're allowed to do that thing or not. Is that a fair characterization?

[0:31:07] BC: Yeah. My mental model is definitely answers for which there are right or wrong answers. On the Family Medical Leave Act, or New York City Department of Buildings, there are questions for which like, no, you are not allowed to put the air conditioner there. Or, in a two-family building, you do need sprinklers. There's all kinds of like, when you look at the codes, it's like, the conjunction of codes from different years and often, there's conflict between them. Often, these rule systems that we have in society often conflict. Now is a really great time to find all those things, identify the conflicts, decide on the answers in a fair and equitable way around what our answer is going to be, and then allow agents to take action, or humans to take action.

But I definitely have in my mental model, those questions for which there are right or wrong answers. Like, should we send this money here? Does this violate your tax strategy? Those are right, or wrong answers. Then there's stuff that's more like, is that a bad thing to say? Is that a bad word? These things are a little more on the statistical spectrum. I think that you could probably use some of these same mechanisms, but that's - I have a harder time wrapping my head around, is this a good poem?

[0:32:16] SF: I think that's the hard part is the things are subjective. I've actually, I've written a little bit about this where I think one of the reasons why we see success in engineering really being the tip of the spear with not only agents, but models in general is because when you code, you have a lot of built-in guardrails in terms of whether something works or not. There's deterministic ways, like if you generate code, you could compile it, you can run it against tests. Those are some form of verification of whether there's correctness behind what you generated. But if I'm generating a poem, or even, I don't know, a press release about a product launch, the quality controls are much more subjective and it's hard to essentially put some formal verification around it to determine whether it's good or not.

Then the feedback loop in engineering is also very fast, where you could compile it, pass the test, even run it. Then if errors happen, feed the error back into the system, self-correct and generate new code and do that in essentially a loop. But it's hard to do that in a lot of domains, because by the time you know something's wrong, the space between generation and knowing it's wrong could be a really long period of time.

[0:33:24] BC: Yeah, I agree. I think I'm fortunate to be working on that side of the spectrum I'm on, because I'm just enjoying huge benefits. Whereas in music or poetry, it's harder. I will say, that there is a boundary that's not 100% clear and there's probably pretty exciting work there. It'll be more statistical in nature. You'll be like, I'm relatively confident that this discussion, even if it's translated into some Vulcan, or some language that's not natural, that's not English, or something like that, the argument that would be made still ultimately roots out to chemistry and then chemistry is a formal system.

I think in some applications, if there's a formalism hiding down somewhere, then there's probably something we can do using this. Yeah. I mean, for me, it's a mathematics, or reasoning about correctness of programs, or reasoning about policies, or security. It's like, we're well on the side of the spectrum that we're happy about.

[0:34:18] SF: Yeah. I think a lot of probably B2B type enterprise problems fit into that class structure, versus the just open-ended -

[0:34:26] BC: Exactly.

[0:34:27] SF: - you ask anything and get any possible answer.

[0:34:29] BC: Yeah. The addressable market for answering questions about correctness according to law and tax strategy across agentic financial systems is so big that it's worth going after. Even if we can answer questions around a poem being a good poem.

[0:34:48] SF: Yeah, healthcare probably similar insight. With an agent, there's going to be at least two big safety concerns, or trust concerns with those. One is that the agent might take action, or access data that just shouldn't, or is unexpected. Then, also, it could hallucinate some answer to some question, or something like that, that also leads to a misleading result. Is what you're talking about possible to potentially address both of those problems, or does it bias towards one more than the other?

[0:35:22] BC: In the space of chat bots, or that kind of thing, we have we have a product called automated reasoning checks and bedrock guardrails. It helps you formalize your domain, like a Family Medical Leave Act, or your vacation policy, or your travel policy. Then in an inference time, setting helps you remove incorrect statements and only produce correct statements and can explain why the answers are correct. That allows it not only for you to remove incorrectness due to hallucination when you're dealing with your customers, but it also allows you to understand your rules, because there may be cases where your rules were wrong and the customers like that can't be right. You can use the feedback from the customers to identify actually, where you got things wrong, which is actually fairly common, when people bring their policies in from the text file to now being used in an LLM-based situation.

The other one around, basically, where we regularly analyze using automated reasoning tools and entire AWS services looking at where data flows to and from. We don't track the data itself, but we can track the flows being using statically, analyzing both code and the infrastructure of AWS to show where data flows to and from. Then that allows us to make pretty high-fidelity decisions around risk, around what information we're using to make decisions. I think that's an area that the broader IT community can use, too, right?

If you're building agentic generative AI-based solutions that plugs into databases, data lakes these kinds of things, you can analyze your service and determine where data is flowing to and from. Then if there's certain pieces of information that shouldn't be contributing to the answers, say in a medical domain, if you don't want the agent to leak other customers, PII, for example, then that's the thing you can establish and maintain as a variant using the automated reasoning tools.

[0:37:26] SF: One of the things that you said earlier was that there's only maybe 2,000 or 3,000 people that specialize in this area of automated reasoning. It seems like, in recent years, automated reasoning has broken somewhat into the mainstream. I guess, why now? What has changed in particular for this to suddenly have more visibility as an area that's important for building successful software?

[0:37:50] BC: I'll give you this short-term answer, and there's actually a little bigger context. I mean, it's agentic AI. 10 years ago, if I had described what I did on a podcast like this, I think most people - a much smaller set of people would track even what we're even talking about. I think what's really amazing about generative AI is that as a society, we're going through an art of the possible exercise, right? There were tools that use natural language that were very easy to call. You just went to a website and people were like, "Oh, wow." They realized that computers could answer questions, very deep questions and reflect, and so on. They weren't always right, but they could do it, right? Then, it's a small delta from that to like, how do I make sure these answers are correct? Which is what automated reasoning has always been.

[0:38:32] SF: Right. Yeah.

[0:38:33] BC: Before, you were trying to talk about, imagine a computer could tell you, and they're like, "I don't even know what you're talking about." Just out of the gate, the presupposition of the work was just like, lost them in the vast majority of people. It didn't click for people. It was a very small set of people whose brains were wired to be interested in that problem. So then, people just self-selected. I think that has changed.

The bigger picture is that the inventors of computing were all formal methods people, right? Alan Turing wrote a paper called Checking a Large Routine, which shows you how to prove a correctness of a program. von Neumann was a proof - all of these people from the 1950s and 40s were mathematicians, who knew how to do proofs. Very clearly in their mind, they understood that programs mapped to Turing machines, or automata. It was abundantly clear. It was not even worth mentioning. 

It was only when the IT industry, when computers became commercial mainframes and then mini-computers, and so on, that the branches of the mathematical view of programs and how to reason about programs and then IT industry divided. You meet formal reasoning experts, particularly in Europe who have never written a computer program in their life. They don't know how computers work, but they understand the mathematics of Turing machines. Then on the other hand, people who write computer programs all their life, but don't even know it's possible to prove correctness of programs. The worlds have divided.

Then in AI, you saw the same thing, right? AI, one of the first applications of when the first IBM commercial computers was actually to automate the reasoning from Principia and Mathematica, right? Automated reasoning was one of the first non-military applications of computers. Then AI, John McCarthy coined the term AI. He was a formal reasoning person. In the very beginning, AI was the mixture of these Bayesian, cognitive, logical, the mix of those ideas and people were very excited and it was very heady days. Then you had this series of AI winters, where government funding would dry up for these areas. The response was for people to get much rather more specialized.

What happened is the automated reasoning folks, the people in the logical view found very niche applications that they could solve with their kinds of tools and were successful there, and that's how I grew up. Then other people went to recognizing numbers for depositing checks. There was another set of applications for some of the other techniques were going to work better. The areas grew apart. What's happened with generative AI and now agentic AI is that the money, but also the disruption, the opportunity has really brought all the communities - the communities are coming back together. There are certain members of those crowds that are bilingual that can talk between all those communities and they are just worth their weight in gold.

[0:41:23] SF: Yeah, it's interesting. I think this whole area is bringing back a lot of things that have had periods of time where they were - there was a big research backing behind it. I think ontologies, certainly, have been something that - I did my PhD in the knowledge engineering space with ontologies and worked on ontology mapping problems. Then no one was talking about that for a decade. Then now it's come back in a big way. There's a bunch of different things. Even in distributed computing as well, where if you have agents that are making all kinds of network calls, you're going to run into distributed systems problems, and that's really, I think brought to the forefront the idea of durable execution and durable functions and all this type of stuff that historically has been a little bit more niche. Certainly, companies invested in that, but it wasn't everybody's problem. Now suddenly, it's everybody's problem.

[0:42:12] BC: At Amazon, it's funny, because sometimes very big names will join Amazon and they'll hear about the automated reason and be like, "Oh, we tried that at my company 20 years ago. Didn't work." We're like, well, maybe that was 20 years ago. Things change. I think it's like music. You see ideas, old ideas, or fashion, right? Things come back and they get integrated in different ways. Yeah. Pendulum swing. Ideas go out of fashion and back into fashion. It's actually frankly good, because I think it's good to get a mixture of ideas and people confused about them, and some people really advocating for some, and some people advocating for other, and then finding combinations. It gives diversity of thought, which actually thinks helps us solve problems in better ways. It's pretty exciting, actually.

[0:42:55] SF: Yeah, absolutely. I mean, I think sometimes someone with a different perspective can break through in a problem area that in a new way that no one's really considered before, because the people who work in that area are walked into a certain way of thinking.

[0:43:07] BC: Totally. Yeah. If you had put me on that speech-to-speech translation problem with mathematical logic, I would have never gotten anywhere. But I would have kept insisting I had the right way to do it, right? I would've been like, "Give me another year. Give me another year." Absolutely, would have never solved it.

[0:43:21] SF: Yeah. You would have just kept hammering that nail and eventually, wouldn't have got it. For software engineers that are listening and interested in starting to apply automated reasoning, or learning more about it, where do they begin? What tools, I don't know, books, practices, would you recommend for them?

[0:43:38] BC: The ground is moving under our feet. The advice I would have given a few years ago is actually a little bit different than what I'd give right now. We've open-sourced something called Strata, which is a intermediate representation for programs that allows you to represent Python and allows you to represent Java and Rust, and so on. But it unifies to the logical representation, which ultimately, the semantics of something called lean. Lean, there's a theorem prover called lean. There's many theorem prover, actor theorem prover. So, Isabelle, HOL Light, Lean, so on. But lean is the one that many mathematicians have clicked to. That becomes the lingua franca for them. Because of language model training, so language model providers discovered this data and begin training over it and saw the value of this data. Then they've now began to synthesize data using lean. For example, DeepSeek did that. Reinforcement lending together with lean is very powerful concept.

[0:44:32] SF: Yeah. Yeah.

[0:44:34] BC: Lean has become a relatively standard tool in that space. I would go use lean. It's a mathematicians' tool. It requires you to traditionally, to hold its hand. Automated reasoning, it's almost a misnomer, right? It's like, very far from automated. I mean, it automates the checking of the argument you found. Whereas these tools, like a propositional satisfiability solver, like CDCL is fully, fully, fully automated. But it's only solving the combinatorial problem. Whereas, lean is solving an undecided - it's trying to solve an undecidable problem with your assistance.

What's really amazing these days is the language models are really good at running these tools. If you have lean together with a language model, now you're in business. Now you can do some really interesting things. If you want to reason about programs, one way to do that is to translate the programs from Rust into lean via Strata, which we've open sourced, and then to reason and lean. Then one could imagine products are coming - There's startups in the AWS. We're all building products in this space. There will be various products. I love for people to use AWS ones, but there's other startups that are building stuff, too.

[0:45:47] SF: Yeah. I mean, that's amazing. I think that's one of the things I'm most excited about in terms of like, I don't know, these days that we're living in right now is the accessibility suddenly to all these different areas. You essentially have a way to get access to information that you'd never had access to. It reminds me of when I was a teenager and the Internet became something. Suddenly, do not have to go to a library to look up every information. You'll just be able to look that up on the Internet. Even those early days where there's a lot less websites and stuff like that was just amazing and really opened up the world for me at that time. I think we're seeing that again.

[0:46:21] BC: Totally. Yeah. When I look at - I used to work on Volkswagen motors, like air-cooled 1960s in high school. That's what I did. The things I had to do to learn how to fix problems when I didn't know what I was doing were amazing, right? Then when I look at my son and how he can use online videos to learn how to use equipment, music equipment, it's incredible. Then the language models are just another step forward. It's pretty incredible.

[0:46:48] SF: Is there anything else you'd like to share?

[0:46:50] BC: People should follow me on LinkedIn. I try and post - If they're interested in stuff they're hearing here, I post - try and post just a few things, but they should be of interest to people who are tracking formal methods. I'm Byron Cook. Yeah, I don't know. Probably not. No, I'm just kidding. Keep doing it.

[0:47:07] SF: All right. Yeah. Byron Cook. I'm going to follow you right now. There we go. Awesome. Well, thank you so much for being here. I really enjoyed this. This was fantastic.

[0:47:13] BC: Yeah, it's a pleasure. Thanks for your time.

[0:47:15] SF: Cheers.

[END]
SED 		Transcript

	(c) 2026 Software Engineering Daily	1