EPISODE 1684 [INTRODUCTION] [00:00:00] ANNOUNCER: Kentik is a network observability platform that focuses on letting users easily ask questions and get answers about their network. Avi Freedman is the CEO of Kentik. And he joins the podcast to talk about the platform, his observability philosophy, the role of AI in observability, and much more. This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His bestselling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of his podcast, Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com. And see all his content at leeatchison.com. [INTERVIEW] [00:01:08] LA: Avi, welcome to Software Engineering Daily. [00:01:11] AF: Thanks for having me, Lee. [00:01:13] LA: You call Kentik a network observability platform. What exactly do you mean by those three words? [00:01:20] AF: We say platform because I just hate it when people call it a tool. Because it sounds like what my grandfather used to yell at me for not putting back in the right slot in the toolbox. Network is - because we are very focused on the network, we do see other parts of the infrastructure and application context. But most of our practitioners make the network go, which enables us to be talking now and enables all the packets and revenue to flow on the internet. And observability is being able to get enough data that you can actually see the effects and drive back to what the cause is and fix things. Because there's vendor bugs. There's people going crazy. There's applications going crazy. There's people going crazy on applications. There's all sorts of things that create problems for security, performance, reliability, revenue. And, again, platform is - we've always been SaaS, big data-based, API, keeping everything. Sort of not federated appliances - it's a platform that people use. And we have multiple products on top of it. That's the three-word breakdown. [00:02:20] LA: You mentioned all three of these. But when I think observability, I think of networking, infrastructure, and application. And, obviously, personally, people who know me know I have experience in the application performance monitoring space with New Relic previously. But between those three, there are people who say all you need to do your job of monitoring applications is at least one of those three. But is that a true statement? Or do you really need all three? [00:02:50] AF: I think you do need all three. And the way I break it down - and, by the way, New Relic uses our agents to bring network data into New Relic. But the network engineers are not going to find the workflows that let them do capacity planning, or cost analysis, or figure out why there's network issues. The goal of - say, at the application level, I guess people prefer not to be called APM anymore. At the app level, I think every app player wants to be able to see, is it a network problem? But are they going to be the place where the network practitioner is going to live and debug the issues? I don't think so. And the same thing for Kentik. It's critical that we understand what applications are flowing over the network and if there's an issue. Is it the network or the application?
But you're not going to be debugging your application in Kentik. We have eBPF data. We have it with latency. We can say, "You know, look, there's an issue. And it is not the network. Because it is only this endpoint on this application in these locations." But we would need the traces. We would need to understand the semantics of all that. And it's just not the way we're orienting. I think all of them are important. And I would focus most on network or applications rather than infrastructure. Because a lot of the infrastructure data is visible sort of up at the application or down at the network, or the network converges with infrastructure. But this is probably true for infrastructure also. There's people that - especially if - I guess people still use SAN, or SAN over Ethernet, or object storage. I mean, there's still compute and infrastructure. There's still more of the SPOG type stuff. The single pane of glass, which is where a lot of those folks live. But I'm sure there's some platforms that do observability. They're designed for the deep debugging, especially on the eBPF-y side of modern infrastructure for that. And I think it's critical they all see the other layers, which really the old federated appliance generations don't. But where the practitioner is going to live, I think that's still going to be different. And that's why we've invested in telemetry pipelining and both sending data to the other layers like we do with New Relic, and Sumo, and other folks. Or taking their context so that the entity model of New Relic, we can incorporate that in so that we know when we see IP addresses. You're not looking at IP addresses and ports. You're looking at applications that you can talk to your New Relic-using counterparts about. [00:05:08] LA: Right. It's kind of the same as like the - you talk about t-shaped developers that have a broad knowledge of everything and in-depth knowledge in one area. You really need your observability to be t-shaped as well. It needs to understand everything but have a specialty in one area. And your specialty is networking. [00:05:26] AF: That's true. I think that everyone has a goal of having many T's deep. I just don't think many people have accomplished it. I mean, if you look at the players in the observability market, I just don't think that's where those practitioners live today. [00:05:38] LA: Yeah. I think that's very fair. I think that's a good way to put it. Whether the tool supports the use cases or not, that's not where people are going to get the data. [00:05:47] AF: That's true. And we prefer platform. But, yes, whether the tool does or not. [00:05:51] LA: I'm sorry. I used the word tool. [00:05:53] AF: No. It's okay. It's okay. [00:05:53] LA: Okay. Platform. That's fair. [00:05:56] AF: Everyone does. Yeah. [00:05:58] LA: As I looked, I did research. And I'll be fair, I didn't know very much about Kentik before I agreed to do this interview. I knew a little bit about you obviously from New Relic. But not an awful lot. But as I did the prepping and the research for this episode, I noticed that your interface appears to be - I guess what I would say is AI-focused. In other words, you tend to do plain English sentences to create performance charts. Things like that. Is that deliberate and is that a goal of Kentik? Or is that just kind of a, "Oh, easy to do. So we did it on the side?" [00:06:32] AF: Kentik has always had a focus on letting people ask whatever questions they want.
And if you go back to nine years ago when we started, that was what the super nerds articulated who were our first customers, like, "I can finally answer my questions." And I think we're aligned with some of the early Honeycomb thinking on this also, which was like there are a set of people where they need to do that. Over time, we've oriented towards making it so you don't need to be as much of a data dreamer, as what I call it, to be able to ask those questions. And we want to be able to surface what the questions are that you should be asking based on what we see. The natural language side is newer. In fact, again, I'll give props to New Relic. [inaudible 00:07:10] much better than when we had a SQL interface. It was a hack job because we use foreign data wrappers to get to our back end as part of what we do. It was like windowing was not a primitive. And it was awkward to do that. And we took it away actually because most people just wanted APIs. And the UI, we've oriented to be an API explorer also. But the NLP, what we call journeys, where the system will say maybe you next want to ask this. Maybe you next want to ask this. Taking it from API to natural language, that's relatively newer. We don't view that as a new product. We just view it as an extension of the platform, which has always been focused on showing you things you might want to ask. And then making it easy for you to dig in. [00:07:51] LA: Yeah. That makes sense. You've essentially switched from "allow the user to ask any question" - or not switched, but enhanced beyond "let the user ask any question" - to "here are the questions you should be asking right now based on what I'm seeing." [00:08:06] AF: Exactly. And we have a lot of privacy issues. For example, we've always done machine learning for anomaly detection. But it's always within a customer's profile. But some of the what questions should you be asking, that is data that we have across the platform. Because we have tens of thousands of people using Kentik to do debugging. And we can strip out what is unique to that company and say, "When someone does this, they then do that." And then sort of make suggestions. And it's a technique that we started applying a couple of quarters ago and people are really loving. [00:08:37] LA: Cool. AI is - I'm trying to think if this is a valid statement. AI is central to the user experience of your product. I mean, as opposed to like a New Relic, which is dashboard-based and, yes, you can do AI as well, you're less dashboard-focused and much more communication-focused. [00:08:56] AF: No. I think that's maybe the newer areas of the product that's true. But if you look at the number one areas of the product, it is the dashboarding, the data explorers. We have workflows that help people do capacity planning, and cost auditing, and interconnection management. We used to call it "I built SolarWinds." It's like super simple dashboards that are primitives where you can't get lost. If you're an SRE and you don't want to think about what are network dimensions - and we have thousands of dimensions between all the enrichments that we do, you could go, "Oh, should I do bytes, or bits, or packets, or flows per second? Or is this a dimension or a metric?" I mean, there's a lot of stuff you can get into, which we allow people to do because we have all the data underneath.
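[EDITOR'S NOTE] A minimal sketch of the "journeys" idea Avi describes above - suggesting the next question from anonymized sequences of what other users asked. The session data, query-type names, and suggestion logic here are illustrative assumptions, not Kentik's implementation.

```python
# Toy "what should I ask next?" suggester: count transitions between
# anonymized query types across many debugging sessions, then recommend
# the most common follow-up to the query a user just ran.
# The session data below is hypothetical, for illustration only.
from collections import Counter, defaultdict

sessions = [
    ["top_talkers", "traffic_by_port", "path_trace"],
    ["top_talkers", "traffic_by_port", "bgp_events"],
    ["ddos_alert", "top_talkers", "traffic_by_port"],
]

transitions = defaultdict(Counter)
for session in sessions:
    for current_q, next_q in zip(session, session[1:]):
        transitions[current_q][next_q] += 1

def suggest_next(query_type, k=2):
    """Return the k most common follow-up queries seen across sessions."""
    return [q for q, _ in transitions[query_type].most_common(k)]

print(suggest_next("top_talkers"))  # e.g. ['traffic_by_port']
```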
But it's still a little bit more visual. And the natural language, and doing even better on the insights and root cause - that's core, but not the primary method that people are using today. I'd like that to switch later this year. I think it's probably going to be sometime next year that the chat ops and interfacing that way will become primary though. [00:10:03] LA: Is this the primary way AI is useful in observability? I kind of have my opinion on this as well too, obviously, and so do companies like New Relic, et cetera. But let's have a conversation here. [00:10:17] AF: That's an interesting question. There are ways you could use AI that I think are not the way to go. And some other companies have been very powerful building complex technology around what data you should throw away. That's not my religious bias. Because when we started, we were competing with a company called Arbor that did this thing called managed objects that are basically rollups. It was like the customers really understood, "There's a problem. I'm debugging it. I didn't tell it I needed to ask this question. And I don't have the data. And I cannot do my job." Since we started, this has been very clear - and even though we started first, I must give Honeycomb props for popularizing this. This Honeycombian view of "if you don't have the data, you can't do your job" has been very core to our user base for a long time. And if you think about it, IP addresses, there's a lot of them. Now add in ports, convolved. Now add in IPv6. It breaks cardinality for any kind of time series database. I've seen people, for example, use AI to figure out what data to throw away. I don't view that as the best use of AI in observability. What I think it's best used for, and by AI, I don't just mean LLM, but broadly, is what should I be looking at? What are possible root causes? And then, eventually, what are possible things that I should consider doing? And I actually think that co-pilot type stuff is helpful and everyone should do it. I don't think anyone's thinking about making a product without doing that. But I think it's that next level that's most interesting. And I have a bias, I don't believe LLMs alone will get you there. I think you actually have to have a semantic model of what those three Ts are - application, infrastructure, and network - to be able to do that well combined with LLMs. [00:12:02] LA: You can't do that with LLMs alone. Is that what you're saying? [00:12:06] AF: Yes. Because LLMs - I'm not a fan of the term hallucination. I was talking with someone, one of our customers yesterday, who calls it nonsense. And I think that's a better - it's a bad prediction. LLMs sometimes make bad predictions because they have no clue about what they're actually talking about. They're just predictors. But LLMs do all sorts of magical stuff and give you things you wouldn't have thought of. And that's great. But you need to have a bullshit detector on the other end, which comes back to the old-school AI of a semantic model of what this stuff is. And then, yes, I think it's huge for observability. Any kind of analytics, any kind of human augment, I think it's a key part of what people should be building. What do you think? [00:12:45] LA: Well, yeah. I absolutely agree. But I think a key place that AI - and I'll use the term AI. Not LLM - for similar but different reasons. AI is much broader than large language models. But I think AI is underutilized in its ability to find the needle in the haystack.
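[EDITOR'S NOTE] A back-of-the-envelope illustration of the cardinality point Avi makes above. The environment sizes and the "comfortable" series budget are assumed numbers for illustration only.

```python
# Rough arithmetic for why per-(source IP, dest IP, dest port) time series
# explode: even a modest environment overwhelms a typical metrics TSDB,
# which is why flow-style systems keep raw records instead of pre-rolled
# series. All numbers here are illustrative assumptions.
active_src_ips = 50_000
active_dst_ips = 50_000
observed_ports = 1_000

worst_case_series = active_src_ips * active_dst_ips * observed_ports
print(f"worst-case unique series: {worst_case_series:,}")   # 2,500,000,000,000

comfortable_tsdb_series = 10_000_000   # assumed comfortable ceiling
print(f"over budget by: {worst_case_series / comfortable_tsdb_series:,.0f}x")

# And IPv6 makes the key space itself astronomically large:
print(f"IPv6 addresses: 2**128 = {2**128:.3e}")
```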
And I think about observability from the standpoint of tell me when something's going wrong so I can fix it. Then tell me what I need to fix. Well, telling me what I need to fix, I think there's some AI use cases there. But they're not nearly as interesting as the ones along the lines of tell me when I need to be woken up and do something. And so, I think this smart analysis of knowing that something's wrong and something's going on. And I need someone to take a look at it for me. That sort of analysis is really what AI is good at. Analyze the tons and tons of data and let me know that something important is going on. But I don't think it's as useful for the, "And here's what's broken and how to fix it," sort of approach. [00:13:52] AF: I would agree with that if you add today at the end. We don't make the claim that we're doing this today either. And there's a lot of vendors out there in the network space generally that say this is a solved problem. All the vendors say that this is a solved problem. And then everyone's like, "Oh, my God. I'm behind." And it's like no one's doing this. No one's doing this. It's like everyone thinks that everyone else is having a happier life. And, in fact, we all have the same misery. I think that we will get there. Again, my bias is with adding in semantic models. But I would agree with you today. What should I pay attention to? And then, yeah, there's still some humans required in all this. Distributed systems are complex as hell. And where you see problems is often not where the problem actually is. And sometimes it needs a little bit of that. But I think we will get more and more towards "there is a problem" that is high quality. Not false positives. That I think we're pretty much there. There's good things being done - not for every absolute kind of problem, but people are doing that well. The next step to me is what could the problem be. And I'm pretty convinced that over the next couple years there'll be some verticalized, really good solutions to that. I think we'll be one of them. And then the third stage is what should I do about it. And that's even harder. But, also, I think possible. And not 10 years out. [00:15:10] LA: You think in less than 10 years, we'll be able to have AI systems detect a problem, understand what's going on, and repair it? [00:15:21] AF: Okay. I didn't say that. [00:15:21] LA: Okay. Okay. Then let's clarify. [00:15:24] AF: Let me give you an analogy. In Kentik, one of the first kinds of anomaly detection we did was for denial of service attacks. From the very beginning, we had people - and we could do a lot of things based on that. We could just tell the network to not do something. We could signal to scrubbers that will divert the traffic and do it. We could send it to the cloud if you don't have enough capacity. From the very beginning, we had people like, "Oh, Kentik is awesome. We will completely program our network to be automated based on this." And we were like, "Okay. We're a crazy little startup. But sure." But a lot of our customers, they just wanted the big red button. We open a ticket. We show them our analysis. And we give them the link so that the human in the loop says, "Okay, go do this." And we've already brought it to them. They don't need to go compose some API, or type CLI, or whatever. I think that's going to be the major first stage is the recommendation that a human audits. The complete closed loop, I don't know whether we'll be there in 10 years.
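[EDITOR'S NOTE] A toy sketch of the detect-recommend-approve flow Avi describes above - an anomaly is flagged against a moving baseline and a mitigation is proposed for a human to approve (the "big red button") rather than applied automatically. The smoothing factor, spike threshold, traffic numbers, and recommendation text are assumptions, not Kentik's detection logic.

```python
# Toy anomaly detector + human-in-the-loop recommendation for a DDoS-style
# traffic spike. Keeps an exponentially weighted baseline of bits/sec and,
# when the current sample far exceeds it, emits a recommended action that
# an operator must approve rather than applying it automatically.
ALPHA = 0.1          # EWMA smoothing factor (assumed)
SPIKE_FACTOR = 5.0   # flag when rate exceeds 5x the baseline (assumed)

def detect(samples_bps):
    baseline = samples_bps[0]
    for bps in samples_bps[1:]:
        if baseline > 0 and bps > SPIKE_FACTOR * baseline:
            yield {
                "observed_bps": bps,
                "baseline_bps": round(baseline),
                "recommendation": "divert target prefix to scrubbing",
                "requires_human_approval": True,   # the "big red button"
            }
        baseline = ALPHA * bps + (1 - ALPHA) * baseline

traffic = [1.0e9, 1.1e9, 0.9e9, 1.0e9, 9.5e9, 1.0e9]   # made-up bits/sec
for alert in detect(traffic):
    print(alert)
```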
Because I really have a belief that AI techniques will help humans do their jobs. Not replace their jobs. Because, again, distributed systems are really complex. I think maybe 10 years, maybe more of it will be automated. But in the next three to five years, I think it's going to be more recommending to let humans be their awesomest and most useful. [00:16:39] LA: And in that, we're in complete agreement. Let me try a different phraseology and see whether or not you come up with - whether or not it means the same thing or if it means something different. Is AI assistive or transformative? And then you can apply today and tomorrow if you want to. Or put another way, and I think this is equivalent, but is AI something that helps you grow? Or is AI something that will replace certain things? [00:17:07] AF: I think that's better. Because I think of it as both assistive and transformative. But I think it's something that helps you grow. Not something that helps you replace. I mean, I was never diagnosed with it. I've never medicated. I very clearly have ADHD of some form. As an adult, I can hire people to do my homework. When it's boring, I have a certain amount of difficulty focusing. AI lets me achieve my potential. I actually should use AI tools more and don't, because I'm too busy - too busy to figure out what I should use. But what I see people doing with AI absolutely could be helping me do the fun, hard stuff that I like doing that would be more valuable to the business versus the dumb, boring stuff. And I'm convinced that I'll be able to say, "Scrape the MX website and stick that stuff into Excel," rather than having to download and do all that. And that will be transformative because it will allow me to focus more on what I enjoy doing and have more focus time. That's the way I view it. Now there are jobs, which I think will eventually be better done by AI. But we've had a history of things like that. And then that's the societal question, is what kind of investment do we want to make in people? And how do we have the educational system educate people about what they need to be thinking about to have a career versus a J-O-B? [00:18:24] LA: That makes a lot of sense. I hear a lot of my non-tech friends and non-tech community that are so afraid of AI right now because they're afraid their job's going to go away. And what I like to say is look at bringing computers into the newsroom. People thought that was going to cause jobs to go away. And, no. It just enhanced what you're able to get out of a newsroom. That's a big example. And there's a hundred years of a thousand different examples of places where technology was going to replace people or other things and it never did. And what it did is it transformed and created other opportunities. And so, your specific job may be very different 10 years from now because of AI. But that doesn't mean you're not important 10 years from now because of AI. [00:19:17] AF: Well, we owe it to - just as we owed it to steelworkers and before that, seamstresses. I don't know what the masculine of seamstress is. But seamsters. Yes. The people that do things that are being replaced by, whether it was looms or whatever - that's a separate conversation, what do we as a society want to do? And [inaudible 00:19:36] facilitate retraining. But even better yet, set the expectations earlier on that part of being a human in this world is deciding what areas to focus your creativity on. I think that's going to be increasingly important.
Just as having a good bullshit detector is important in this world of AI. But if you want to think about things to worry about, I would put - people used to worry about the singularity. Magically, a bunch of computers would exist and suddenly they'd become Skynet. That was never technically based in reality. Whether we live in a simulation is like one of the most useless things in my opinion for people to worry about. Because it doesn't matter. At the high level, if I were inclined to worry, which I'm not a worrier, I would worry about the gray goo problem much more than the Skynet problem. [00:20:22] LA: Tell me what that is. [00:20:23] AF: That is - imagine I'm at a science fiction convention right now. My wife and I met [inaudible 00:20:29] science fiction fan for some time. The gray goo problem is - imagine self-replicating carbon nanobots that are programmed so the while(1) loop in the nanobot eats all the carbon in the universe. Boom. The gray goo is the sludge that is left over after all the nanobots reproduce and you have all the carcasses of the nanobots and they've just eaten everything. If we had the wrong kinds of either computational biological interface or, I mean, effectively, the nanotechnology, which 20 years ago people thought was going to be the big thing now. That could cause problems. And we've already got gene editing and stuff like that. I'm not opposed to playing with all this stuff. I just know how bad people are at writing software and how good computers are at doing exactly what you tell them. And if you instantiate that in the physical world in a bad way, I'd be much more worried about that than Skynet. That's all I'm saying. [00:21:23] LA: Okay. It makes sense. It makes sense. We've talked an awful lot about AI. But I know one of the other things we also wanted to talk about is cloud. Because one of the things you as a company are focused on is trying to be cloud-agnostic in the things that you do. You're multi-cloud by your very nature. How does Kentik help solve the cloud diversity management issues that are associated with multiple cloud providers? [00:21:52] AF: I'll include that with sort of the classic cloud, which is switches and stuff. Sort of let me say a private cloud, right? We see customers thinking of it more as hybrid cloud than multi-cloud for various reasons. But if you think about it from a networker's perspective, which is where we start from, and then there's some other layers too, cloud networking is just traditional networking, which is ACLs, and routing tables, and tunnels with weird names and other people's buggy software. But I mean, okay, you could have machines in multiple VPCs in Azure. And you can't in Amazon. And these guys call it - it's not a VPC, it's an XPC. And it's like - and so, how do people look at this stuff when, ultimately, they need to figure out, "Can this VPC talk to that VPC? Am I breaking my workload by putting the wrong policy on? Am I breaking my bill by sending this stuff?" If you want to make HA in the cloud, even multiple availability zones in a region, you might get a pretty big bill if you do the wrong kind of active-active with whatever your distributed system layers are. We give one consistent way of looking at the topology of your infrastructure, whether it's the cloud that you're making yourself - and that's going to be VLANs, and ACLs on a router, and tunnels that are GRE, or IPIP, or IPsec, or something.
Or the cloud equivalent of those with gateways, and cloud firewalls, and VPCs, and security policy groups, and things like that. If you look at that data in an infrastructure tool, it looks like a bunch of syslog records. But that doesn't help you actually build the map of it. And so, we build maps for people. We do look at the performance by testing from place to place. We look at whether your policies might have broken something. And we try to do that in a consistent way. I would say the on-prem side doesn't look exactly like the cloud side. But you can do all those things in Kentik. And that's getting unified. That's like from a - that's a lot of the cases where the - whatever you want to call them, DevOps, [inaudible 00:23:54], SREs, even not quite as much over to the - not developers, but the application operations teams, will come into Kentik and say, "Okay. Well, again, get a picture of it versus just looking at it sort of syslog-y. And then with the application context of what's going on inside those." That's like the number one use case. Did I break something? Is there a performance problem? Number two is cost. Just because you can blow up your cost quite a bit. I'm sure you've seen - was it Microsoft that went first? No. Google went first, I think, and said there's no fees to leave our jail. And then the other two - [00:24:27] LA: I think it was - the other two followed suit. Yeah. And Amazon was the last. [00:24:30] AF: Right. But they both - I mean, substantively, Microsoft and Amazon agreed at basically the same time. [00:24:36] LA: Exactly. [00:24:36] AF: But that tells you that network bandwidth in the cloud is pretty expensive, certainly compared to cost - it's more than 10 times what wholesale bandwidth costs. [00:24:45] LA: The number one problem with hybrid and multi-cloud solutions was data transfer costs because of those artificial numbers. Yes. Yeah. Absolutely. [00:24:54] AF: And, again, like how you configure your database replication and your Kafka makes a pretty big difference. You get it wrong and you can have a much bigger bill. And so, that's a use case for us too. And then because we keep all the data, security people - in the network-network, we're more operationally focused except for denial of service attacks and some forensic use cases. But we haven't really traditionally dealt with firewall logs because they're pretty low granularity and there's firewall management stuff. In the cloud, in Amazon and Microsoft, VPC flow logs that show you traffic are also firewall logs. Not really - but in the sense that they show you permit/deny based on policy. They're not really stateless firewalls. And that has brought in more security people to our front door on the cloud side than we have traditionally. There's some security governance and compliance use cases there too. [00:25:42] LA: Let's talk a little bit more about security. How do you help those customers that come to you with the security use cases? Besides the fact that you just happen to do the same sort of reading of logs that you do on-premise. It helps you with security in the cloud. But it doesn't help you on-premise. Is it more involved than that? And do you do more? And do you plan on doing more? [00:26:03] AF: I will actually just quantify. Because sometimes I'm a little too precise coming from nerdom. We actually do have a lot of security users of the on-prem stuff. But on-prem, the router-generated traffic logs don't give you logs for denies. Whereas firewalls do. On-prem, we get less of the firewall logs.
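[EDITOR'S NOTE] A small sketch of the "flow logs are also sort of firewall logs" point from earlier in this answer: tallying rejected flows by destination port from records laid out in the commonly documented AWS VPC Flow Log default (version 2) field order. The sample records are fabricated, and a real pipeline would read from S3 or CloudWatch rather than an inline string.

```python
# Count REJECTed flows per destination port from VPC-flow-log-style records.
# Field order follows the widely documented default v2 format:
# version account-id interface-id srcaddr dstaddr srcport dstport
# protocol packets bytes start end action log-status
from collections import Counter

sample_log = """\
2 123456789012 eni-0a1b2c3d 10.0.0.5 10.0.1.9 44321 443 6 10 8400 1690000000 1690000060 ACCEPT OK
2 123456789012 eni-0a1b2c3d 203.0.113.7 10.0.1.9 51515 22 6 1 60 1690000000 1690000060 REJECT OK
2 123456789012 eni-0a1b2c3d 203.0.113.8 10.0.1.9 51516 22 6 1 60 1690000000 1690000060 REJECT OK
"""

rejects_by_port = Counter()
for line in sample_log.splitlines():
    fields = line.split()
    dst_port, action = fields[6], fields[12]
    if action == "REJECT":
        rejects_by_port[int(dst_port)] += 1

print(rejects_by_port.most_common())   # [(22, 2)]
```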
If we get them, we can do stuff. But we're not really seeking that out for, I guess, again, historic reasons. But in cloud where it converges, it's been a little bit of a driver. Which, again, yeah, was a little bit of a thinker about what we should be doing on-prem. Where we started with security was around DDoS, which I would really call an availability problem. Because if I'm DoSing you, it's like I'm kidnapping you. Now it might make it easier to pick your pocket, and there are people that have done DDoSes to cover up other security things they're doing. But a lot of our customers take DDoS out of a security budget. Technically, it's a security use case. But the second thing is just - again, because we have an observability view of the world, we keep all the data. That makes us useful as a forensic tool. And about 25% of our users have security in their title even though we don't primarily sell to that crowd. Because what we do is relatively more limited. Threat detection - we can take their threat feeds and make them usable. Lateral movement detection - because, you can see, we have primitives for, like, how many things was this thing talking to. And then you could move beyond that. And then, again, in the cloud, there's both availability. Like if it was dropped, is that because the policy is misconfigured? And the forensic side, because a lot of the people that used to do stuff with firewall logs now have their VPC flow logs and they come to us in the front door. We are not primarily focused on that. We don't go to Black Hat. We don't go to RSA. That will change over time. But it's not a major investment for us this year. There are some governance and compliance integrations where people want to know, "Just tell me -" as you were talking about, surface the insight that we'll tell people, "Hey, did you know that this internal gateway is talking to Iran?" You probably don't want that to happen. But that'll get broader and broader as we get more towards the intent metadata on the security side. And those are things that we are investing in. But for us as a 200-person company to open up our selling and supporting to security groups as a second thing, it's probably going to be a next year thing. Just when we look at our investments, this year we're very focused on metrics for the network, integrating everything across the platform. Some of the new verticals that we've opened up and been growing in the traditional enterprise - which really drives this hybrid cloud focus for us. Now there's a separate question, which is are security and operations going to converge? That's a really interesting question. And when? [00:28:40] LA: That actually is a great question. Why don't you give me your thoughts on that one? [00:28:46] AF: The answer is they're already more converged than people think. But people talk about the same thing using different words because they're different cultures. And so, I think it's actually a cultural problem that we need to solve that will help the technology and platform integration side. And I get confused about when it happened. Because when I grew up, there was just nerd. There wasn't like security nerd, or sysadmin nerd, or application nerd, or architecture nerd. It's become very different over the last 10 or 20 years. And we can see it. As much as we can help bridge the divide is good. But I still see a lot of that gap being bigger than it should be. I don't have the easy answer to that.
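[EDITOR'S NOTE] A toy version of the lateral-movement primitive Avi mentions above - "how many things was this thing talking to" - counting distinct peers per source and flagging unusually high fan-out. The flows and threshold are made up for illustration.

```python
# Flag hosts whose fan-out (count of distinct peers they talk to) is far
# above their siblings' - a crude lateral-movement signal. Data is fabricated.
from collections import defaultdict

flows = [
    ("10.0.0.5", "10.0.1.9"), ("10.0.0.5", "10.0.1.9"),
    ("10.0.0.8", "10.0.1.9"),
    # 10.0.0.66 suddenly talks to many internal hosts:
    *[("10.0.0.66", f"10.0.2.{i}") for i in range(1, 41)],
]

peers = defaultdict(set)
for src, dst in flows:
    peers[src].add(dst)

FANOUT_THRESHOLD = 20   # assumed alerting threshold
for src, dsts in peers.items():
    if len(dsts) > FANOUT_THRESHOLD:
        print(f"possible lateral movement: {src} contacted {len(dsts)} hosts")
```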
But I think they will continue. They are starting to converge, especially as budgets are saying, "Why do we have these multiple platforms that look like they do the same thing?" But I would say it's still pretty early from what I'm seeing. [00:29:34] LA: Yeah. Yeah. That's fair. That's fair. Where are most of your inroads into your customers coming from? We talked about security. Are they DevOps? Are they developers? Are they pure ops? Are they SREs? Is it management? Where's your sweet spot? That's the word I was looking for. Where's your sweet spot user? [00:29:57] AF: Generally, we get engaged because architects or other senior sort of engineering or operations technologists that have either infrastructure or network in their title are looking either to modernize old Windows software and federated appliances or to have a complete observability strategy that includes the network side of things. Now we take into consideration their operations teams in service providers. Even their product and sales teams sometimes. Because there's some business intelligence use cases for it. But the people we're engaging with are generally senior technologists. They're looking to solve problems for themselves or groups closer to them. But in every conversation, we're considering the broader enterprise and the SRE teams that need to see things. Because we're SaaS, the security teams get involved for reviewing it anyway because they need to approve sending the data. They see it even if we're not selling to them and they're like, "Oh, that's interesting." And the part that Kentik has added a lot of focus on is the operations teams. Again, not the people that know what questions to ask. But the people that need to be shown. They look at alerts. They do runbooks. That world where, again, they're not yet ready to ask that question via NLP, they're looking at insights and data. We focus a lot on them. But their leader is not the one usually reaching out to us. It's usually us with their architect, and technical, and operations, and engineering leaders thinking about them as a whole enterprise. And then developers, not really. We can take eBPF data. It's a great way to shine light on the network. And we have app teams that use that data. But they also often have that data somewhere else. They're only really using it to debug the network if it's in Kentik usually today. Where that's bridging is Kubernetes visibility with it. We have some things there where we do have app teams using it. But, again, not developers really today. [00:32:03] LA: Yeah. Is Kubernetes networking or infrastructure? That's the million-dollar question. [00:32:08] AF: Well, what is network is an interesting question. Is this the CNI network? Is it a network mesh? Is it a service mesh? Where is your telemetry, and policy enforcement, and all that going to be? It's really confusing. I mean, even for some very senior technologists in the enterprise. I mean, we're going to do sidecars. We're not going to do sidecars. It's all shifting around and moving. It's good. Keeps you from being bored. [00:32:29] LA: And it's going to get more confusing before it gets less confusing. That's for sure. [00:32:34] AF: Yet, there's absolutely people that say there is no network. The network's not important. The service meshes are just going to talk to each other. I'm like, "ESP?" There are still packets.
Now it's also true, we're not growing that many more people that think about OSPF, and EIGRP, and BGP and actually could go build a cloud network stack on top of some traditional stuff. How do we help those people and other people to be able to do these advanced network diagnostics? That's absolutely important. And we're not going to be able to do it without the right applications of AI. The graying of that class of people is absolutely important. Because there are people going into it. But not nearly as many as in some of these other fields. [00:33:17] LA: Right. Right. I agree with you there. I think that's a place where we're going to run short on talent in the very near future - on low-level networking concepts and the applicability of that to higher-level concepts. [00:33:31] AF: And it's happening on the SRE side too. I mean, it's hard to find people that know you shouldn't just delete a billion files all at once. Because the system is going to - I mean, Unix has been amazing. I've always been a Unix fan. But even though I didn't like VMS and other mainframe-y type operating systems, there's some stuff that it did really well to help you run applications and support distributed systems, which Unix is just not very good at. It gets harder and harder to find people that have a good mental model of what happens inside the machine and operating system. And that's also going to be important for observability to help guide people with. [00:34:05] LA: Valid point. Valid point. One of the other things that I've heard about you is that you actually don't use the cloud very much directly within Kentik. It's that most of what you do is not in the cloud but is on-premise. Is that true? And what's the rationale behind that? Why are you more of an on-premise versus a cloud company? [00:34:27] AF: That's absolutely true. And the reason is because I've always thought the company should be able to make money. We're not currently a profitable company. But we have a very good gross margin, the cost to operate the service. Because we don't use Java in our databases. We wrote our own column store and streaming databases. And because we run it on-prem. We do talk to the major clouds and we always show them our economics. And they're initially often very enthusiastic about getting our business. And then they look at it and like, "No. We're not going to be able to do this." It doesn't mean they don't want the business. But here's the workload, which is not very optimal for cloud. Always-on heavy compute and memory that only grows over time. Not bursty. And you can just do the math. It's actually pretty simple. Forget the network cost. A $30,000 - actually, now it's a $20,000 - computer that I can buy and get financing on over three years costs me a couple hundred dollars a month to host. And I have the math on that. And the bandwidth is almost free nowadays. Versus it'd be a couple thousand a month, 3,000 a month in the cloud. And so, it doesn't take much to do the math to say - now, we do use cloud. Because we have performance testing agents all over the place. And we have to have that in cloud. Not only outside the cloud. But also in cloud. We do use cloud because we bear the burden of taking the telemetry and egressing it to our infrastructure. We pick up permissions, and go scrape metrics, and take VPC flow logs and send them. And our CI/CD is in the cloud. Because that's a bursty infrastructure workload. That makes a lot of sense.
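[EDITOR'S NOTE] The back-of-the-envelope math Avi walks through above, written out with his round numbers: a roughly $20,000 server financed over three years plus a couple hundred dollars a month of hosting, versus roughly $3,000 a month for a comparable always-on cloud footprint. The exact figures are approximations from the conversation, not quotes, and financing interest is ignored.

```python
# Owned-hardware vs. cloud monthly cost for a steady, always-on workload,
# using the rough numbers from the conversation (approximations, not quotes).
server_price = 20_000          # purchase price, financed
finance_months = 36            # three-year term, interest ignored
colo_per_month = 250           # "a couple hundred dollars a month to host"
cloud_per_month = 3_000        # comparable always-on cloud instance

owned_per_month = server_price / finance_months + colo_per_month
print(f"owned: ~${owned_per_month:,.0f}/month")             # ~$806/month
print(f"cloud: ~${cloud_per_month:,.0f}/month")
print(f"3-year delta: ~${(cloud_per_month - owned_per_month) * finance_months:,.0f}")
```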
We haven't used cloud queuing, and databases, and things like that because we haven't wanted to get locked in. Because who knows who we might partner with in the future? And because we do on-premises deployments. We have customers that will not egress their telemetry. And we don't ship our software to customers. But we will run - we push a button and turn on a net-booted bare metal containerized infrastructure in someone's infrastructure if that's what they want. And we would run on their VPCs if that's what they want. But the economics are always - because we're a high-speed scanning database, it's just not actually very - the data side of it is not very cloud cost friendly either versus running our own. The day we can pack it all up and use cloud more economically, we'd be happy to. [00:36:51] LA: That makes a lot of sense. What I always talk about - I obviously write a lot about cloud, and cloud usage, and cloud costing. And one of the things I always talk about is the people who don't find the cloud cost-effective are the ones who are either incorrectly using it because they are taking a statically configured application, lift and shifting it to the cloud and running it as a statically configured - exactly. And have the same waste that goes on in the cloud as they do on-premise. But the cloud waste is more expensive than the on-premise waste. Or it's someone whose load doesn't dramatically change over time and doesn't spike, such as an observability platform. Certainly, that was true at New Relic. I imagine it's true with you guys too. You have a constant load of data coming in all the time. [00:37:40] AF: Right. Exactly. And our spikes are single-digit percent because we're at scale. We're taking tens of millions per second. When someone spikes a million, it's like, "Okay, we just handle it." We generally try to stay over-provisioned enough that it works fine. And, of course, we add customers. And we can predict that. We're always growing. But to give you another view of it - because a lot of people say, "Well, you must not be counting in the effect of people." Our numbers, we're 200 people. Tens and tens and tens of millions of dollars of revenue. But not hundreds. But we're big enough that I can say, yeah, our six-person operations - or seven people. Yeah, it's like we don't have other magic people. But we're also not like visiting the data center even monthly. We have someone that goes and pops some disks every so often that spends two hours that's on contract that - but it's Qantas. And, yeah. But I never say never. My assumption is the economics will flip at some point. And we also have a lot of customers that do bare metal. And one of them might make us an offer at some point. Because, again, if you can predict your workload, even a lot of them will do flexible bare metal for days at a time. Just not seconds. But days. Yeah. It's not a religious bias. It's just what makes most economic sense right now. [00:39:00] LA: That's good to hear. I think a lot of people hear some arguments saying the cloud's too expensive. But they don't realize that it really depends on how you use it. And the loads that you apply and what you need to do. And so, if you're a standard e-commerce application or e-commerce store that has daily variability, weekly variability, annual variability, spikes in traffic, all of them, cloud's great for you.
But if you're something like an observability platform with a constant load, it's not necessarily - [00:39:30] AF: And there's an argument that rather than running our own Kafka, which is the only Java thing we use - and I guess we could look at Redpanda because that's not Java, but that's got some other limitations. But rather than running our own Kafka or our own MySQL - well, we run Postgres, which we've stuck with for now because we still do some foreign data wrapper stuff - that we should use a cloud-managed thing. Or - we're not a heavy Redis user, but that we should be using a cloud key-value store. But, again, the issue there is that becomes difficult for flexibility. Unless you're going to go build an abstraction layer. I'm never going to do it. But on my list of things that someone needs to do, which people have tried occasionally, is without building a whole PaaS, what would the right abstraction layer be to be able to build distributed systems flexibly and migrate between the clouds to use the minimal covering key-value store, object store? I'm surprised there hasn't been something. Maybe there is. But I haven't seen it. [00:40:29] LA: There's been a lot of attempts. But nothing there yet. [00:40:31] AF: Right? Because everyone starts and then they build a PaaS, which is not what we're looking for. [00:40:36] LA: This has been a great conversation. I want to thank you for your time and your great answers to these questions. My guest today has been Avi Freedman, the CEO of Kentik. Avi, again, thank you very much for joining me on Software Engineering Daily. [00:40:48] AF: Thank you very much, Lee. It's been a pleasure. [END]