[0:00:00] ANNOUNCER: Observability is becoming an increasingly competitive space in the software world. Many developers have heard of Datadog and New Relic, but there are a seemingly countless number of observability products out there. Costa Tsaousis is the founder and CEO of Netdata. His goal was to build an open-source platform that was high-resolution, real-time, and easily scalable. Netdata is the result. It's relatively new to the crowded observability space, but it's grown into a major presence. Costa joins the show to talk about the design philosophy of Netdata and how it inverts a common observability design pattern.

This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His bestselling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of his podcast, Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.

[EPISODE]

[0:01:27] LA: Costa, welcome to Software Engineering Daily.

[0:01:29] CT: Hi. Nice to meet you.

[0:01:30] LA: We worked together a couple of years ago, but it's been a while, and it was great, when I got ready for this interview, to go back and see all the things you've accomplished in the last couple of years. Hopefully, we're going to get into that. But why don't we start out with, what is Netdata? How does it differ from other mainstream observability companies? I'm talking about companies like Datadog, New Relic, and Dynatrace. How does Netdata differ from those?

[0:01:59] CT: So, we're trying to solve the common struggles. The common struggles are that it's hard to set up, it takes a tremendous amount of time to understand what you have to do, create the dashboards, understand the metrics and the like, create the alerts. That's one part. The second part is how to scale it. How to make this perform at a reasonable cost. Also, how to make it as real-time as possible, let's say, because real-time and high-resolution metrics are really important, especially in today's environment, where even the microseconds count.

The idea with Netdata was mainly to make monitoring work out of the box. So, unlimited metrics, high-resolution, per-second data collection as the standard for all metrics, easy scalability, distributed. No centralized server, not a single thing that has all your metrics and takes all the load. You install Netdata all over the place and you configure it if you want, if you have a few [inaudible 0:03:02] servers, or if you want to offload production systems. Of course, you can have centralization points. But you can have as many as you want.

The next thing was to put inside the tool all the knowledge that is required. So, the understanding of each metric, what metrics do, what they mean, how they correlate, how they should be presented, all of these are inside the tool. You don't have to cherry-pick. You don't have to know the metrics beforehand. We have a lot of users that actually use the tool to understand the metrics the other way around. This is opposite to the monitoring textbook. You know that better than me.
So, the monitoring textbook says, "Okay, you have to have a deep understanding of the metrics, you have to know what you want to monitor." Then, you have to collect metrics, make dashboards or alerts, and so on. Netdata is the opposite, exactly the opposite. So, you don't know the metrics. You don't know anything. You just install it. It comes up with thousands of metrics, and hundreds of charts, and amazing real-time dashboards by itself. And you learn and understand the infrastructure while you browse it. While you investigate, troubleshoot or explore the infrastructure, you actually – you have an aha moment. Oh, this thing has also this metric. Well, that's great. It's the opposite way.

[0:04:25] LA: So, it helps teach you what your application really is doing. You don't have to be a metrics expert in order to understand it.

[0:04:32] CT: Exactly. So, the whole point is to allow everyone to have a monitoring solution that is comprehensive, and gives, let's say, a 360 view on the infrastructure they have, similar to what Fortune 500 companies have. But you get it for free. It's there. You just install it. You didn't do anything.

[0:04:54] LA: So, ease of setup, ease of learning and understanding how it works, ease of scalability, and real-time. That's really the value proposition that you focus on.

[0:05:05] CT: Yes. At the same time, we solve a number of secondary problems. Monitoring is like, most people build their own monitoring themselves. So, they use tools like Prometheus, Grafana, and the like, and most monitoring systems are like this. They allow you to customize the monitoring system, of course, but you also have to configure everything by hand, yourself. This is a requirement for most of them. The problem with this is that once you start doing this, the monitoring itself has a life cycle, like software development. So, in order to create a new dashboard, you have to see how you can collect the metrics, how much storage you need, what queries you need in order to render them, verify that the dashboard is okay, et cetera, et cetera. You understand that this is very problematic, especially when there is fire. When there is fire, you actually want something to visualize the metrics now. Now is the time that you need them. If at that point, you have to say, "Oh, wait a moment, we don't collect this metric. Let's figure it out. Let's have someone collect it to visualize this metric, and then we can continue the investigation." You don't want that.

[0:06:17] LA: That doesn't work when your application is down.

[0:06:21] CT: Yes. So, the idea is that we try to innovate. Let's take, for example, as I said before, Grafana. Grafana and [inaudible 0:06:29] tools are amazing tools. But let's take a chart. You see a chart, and this chart comes from a number of servers, and it has a number of components in them, et cetera, et cetera, et cetera, et cetera. If the developer of the dashboard has not baked into the dashboard the ability for you to understand where the data are coming from, the only tool you have is to actually see the query. So, you have to interpret the query to understand if the query is right. If it matches the right data. Then, you never have the ability to actually see where the data are really coming from. Okay, the query matches these labels, and this stuff, and these metrics, et cetera, et cetera. But how many metrics are actually matched? You can't see.
What we did is, on the user interface, we added a number of menus on every chart to actually allow people to understand where the data are coming from. So, every component, every server, every label has counters next to it, showing how many metrics matched for each of them, together with some statistics. This improves understanding. So, you see a Netdata chart, and you understand that this is where the data are coming from. I see it. It's not a query. It's not something cryptic. I can see it. It's just a few drop-down menus and I can understand if all the data are there, or which data are there.

[0:07:57] LA: What's the source of that data? Is that auto-discovered during the initial setup process? Or is this some rich data source as far as what the meaning of the data is that you are collecting, right? Where do you get that from?

[0:08:10] CT: So, the idea is the following. Data come in, we have our own collectors, and then we use OpenMetrics exporters, like from [inaudible 0:08:18], to actually collect metrics. In a Kubernetes environment, the metrics that exist in the microservice environment, there is a registry there. It is etcd, or Consul, or something else that knows all the endpoints of the infrastructure. So, we scrape that. We collect all the data from the registry, and then we know where to connect and where to collect metrics.

But with Netdata, when it is in standalone mode, you install it on a server, not on a Kubernetes environment. So, there is no registry. It auto-detects everything on the node [inaudible 0:08:52]. So, you install Netdata on every node. That's the idea. On every server, you have only one Netdata instance. This Netdata instance will automatically figure it out. That's, okay, "Postgres is running here, or MySQL, or NGINX, or [inaudible 0:09:07] proxy," or whatever it runs, and it will start collecting metrics from it. In some cases, let's assume that we detected a PostgreSQL. We connect to it, but it refuses. In this case, you need to configure Netdata, to give it the username and the password to connect to the SQL server to get the data. But in most of the cases, this is not needed. So, for web servers, for example, and even for databases that by default allow localhost access without a password, Netdata auto-detects that. Now, we're adding the ability to configure all our collectors – we have 800 collectors like that – from the UI instead of changing configuration files, so you don't need to edit configuration files.

What I didn't tell you, and this is important, is that especially for infrastructure monitoring, there are a lot of metrics that are constant. Take errors, for example. Errors are usually zero, and if we're talking about hardware, so you have infrastructure on-prem and you have hardware, there are a ton of metrics that the Linux kernel exposes, and the various sensors, that are just zero. So, memory errors, EDAC errors, PCIe AER errors. So, a ton of them, but they are just zero. What we do is that we collect everything from everywhere. But as long as it is zero, we do nothing. We monitor it, we collect it, but you don't see a chart about it. There is nothing. The chart will automatically appear and an alert will automatically be attached to that chart once one error appears.

[0:10:49] LA: Once the first real data appears, is when you can start doing that.

[0:10:52] CT: Yes.
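As a concrete illustration of the PostgreSQL case described above: when auto-detection fails because credentials are needed, the collector is typically given a connection string through a small per-collector configuration (or, increasingly, through the UI). A minimal sketch, assuming Netdata's go.d PostgreSQL collector and its usual config location – exact paths and keys may differ between versions:

    # /etc/netdata/go.d/postgres.conf (illustrative only)
    jobs:
      - name: local
        # connection string with the username/password Netdata should use
        dsn: 'postgres://netdata:examplepassword@127.0.0.1:5432/postgres'

When the database already allows passwordless localhost access, no configuration at all is needed; the collector is auto-detected as described above.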
[0:10:53] LA: So, let's talk about the different types of information that you can collect and where you focus. Now, when I think about observability, there's three main categories of data that I usually think about. I think about events and logs, things that are happening to your system. I think about metrics, which, the way I describe metrics is, that's the thing that describes a point-in-time state of the system. Then, there's traces, like end-to-end request tracing. Where do you focus in that space?

[0:11:26] CT: So, we focus mainly on metrics. Netdata, traditionally, is a metrics tool. But also, traditionally, we convert logs in real-time into metrics. If you have a web server log file, for example, Netdata will provide amazing dashboards for this, for everything that it contains. Out of the box, you don't need to do anything. You will have a full dashboard. All the errors, the redirects, the 500s, protocols, endpoints, everything.

Now, recently, we started shipping also a logs query system with Netdata. Today, for example, you can query the journal files of the system, the systemd journal files. Actually, we have made also a few PRs to the systemd repository to improve its performance, because it's slow. So, the idea there is that from within the same UI, the fully automated UI of Netdata, to be able to query logs that we don't have – we don't store the logs. So, you may have an Elasticsearch, or you may have a Splunk. Or you may have a [inaudible 0:12:34] key. Whatever you have, the idea is that you should be able to query these things from within Netdata. For the moment, we have implemented the systemd [inaudible 0:12:45]; we did this development just last month. Actually, this is not released yet. We're going to release it in the next release. And we plan to have, in the next couple of months, support for all the others. So, [inaudible 0:12:55], Elastic, et cetera. We also have a PR where we have our own log management system. Currently, we are stress-testing it. This is probably going to be merged in the next couple of months.

[0:13:09] LA: You'll be well into the event and log management space, but not the tracing space. Just staying out of that space, or no traces?

[0:13:17] CT: No traces.

[0:13:18] LA: That makes a lot of sense.

[0:13:20] CT: You have to be very good at something and then progress.

[0:13:24] LA: Absolutely. One of the things I talked about on other podcast episodes, anyone who's listening to the podcast has heard me say this, but I find that observability companies tend to be excellent in one of those three, okay in more than one. They want to do all of them, but they're never great in all of them. So, companies like New Relic, which is the one I'm the most familiar with. They obviously started as a metrics company. They added events, then they added logging, and now they added tracing. And well, guess what, they still are really great at metrics. They're okay at events, but they're not quite the same thing as other tracing companies yet. There are other companies that do tracing much better than they do. Same thing with those tracing companies, they tend not to do events and metrics as well. Datadog started out as an OS infrastructure monitoring company and moved into the APM metric space, and they're okay there, but they're not great. Every company has their focus, which is great, and it's good to see what your focus is. Do you focus more on, would you say, what I would call infrastructure metrics, more so than application metrics? Or do you do both?

[0:14:34] CT: You have to do both. It's a similar thing, so you do both, and Netdata does both.
Our focus, however, is mainly on the standardization of monitoring of packaged applications, and operating systems, and operating system components, and the like. Most of the value you can get out of Netdata is in this area, mainly because these are fully automated, so you don't have to do anything. Of course, you can do your own application, and we collect OpenMetrics. We have StatsD. We have an API where you can push metrics to Netdata. But the customization needed, you have to do it. We don't know your metrics. It cannot be done, because it's yours. So, to understand the complexity: in order to integrate an application, to collect metrics from an application, half of the time is spent in correlating the metrics together, understanding the metrics, deciding the kind of presentation that the user should see at the end. So, this is a big part of the analysis that needs to be done in order to have effective and efficient monitoring for troubleshooting and the like, creating alerts, et cetera. For APM, this is required, and the tools that provide a lot of customization features are the best there. Of course, we have custom dashboards, and of course, you can correlate stuff. Of course, you can do all this stuff. But our key strength is if you need the monitoring today.

The good thing with APM is that Netdata tries to figure it out. So, even if you put your own metrics, Netdata will do its best to provide automatic dashboards and correlate the stuff by itself. What I didn't tell you is that we have an ML component in Netdata, and this is actually open-source. So, Netdata trains an ML model, actually multiple ML models, per metric collected. No matter how many metrics are collected, this is [inaudible 0:16:37]. We don't train somewhere else and push the trained models, the results.

[0:16:42] LA: You're doing the training at the customer site.

[0:16:44] CT: Right on the spot. Now, this means that we learn the metrics on the exact use case, on the exact load, on the exact whatever it is, the exact environment. This allows Netdata to visualize anomalies on every chart. So, every chart has anomalies on Netdata, an anomaly rate. And this is automatic on every chart. There is an anomaly rate above every chart, that actually shows you what's the anomaly rate for the timeframe you're viewing the chart.

Now, the good thing is that we can correlate metrics based on anomalies. Let's assume that you have, for example, a server that is sitting idle for a day. Nothing happens there. Then, you log in. You just log in. You SSH to it. You understand that at that point in time, a lot of metrics, from disk, from network, from the SSH server, application metrics, everything changed a bit. All these are anomalies, and Netdata has the ability to correlate all these on the fly. So, although we have not predefined how these correlate – APM metrics, for example, correlate – the application will figure it out based on anomalies.

[0:17:55] LA: Let me see if I get this correct. You're able to tell that this spike in CPU that occurred just now happened because an additional person logged in, for instance, and you're able to do that correlation. So, not generate an alert, because who cares, someone just logged in. It's not really a problem. It's a normal event.

[0:18:15] CT: Yes. We have plugins that collect utilization, resource utilization per application. Also, we have an eBPF collector that collects also all the system calls that applications do.
And if they are successful, or not, et cetera. All these are standard metrics. So, we know, when Netdata is installed on a computer or on a server, we know all processes, what they do, what CPU utilization they have, we know all this stuff. Now, when you SSH to a server that you never did before, you understand that the CPU consumption of the SSH server, sshd, spikes. A new process is created, your shell or whatever it is, memory consumption increases, network traffic increases. Now, all these, based on – because the server has trained the model that said, "I am sitting idle. I don't do anything." Suddenly, all of these are anomalies. So, by just doing this, you can see immediately what affected the thing. You SSH to a server, and suddenly you are going to see that the storage moved, the disk moved, the memory increased, the CPU consumption increased, these processes. So, this is fully automatic. You just highlight any region, and it gives you, rated by anomaly rate, a short list of all the metrics that were anomalous in that timeframe.

We have also another tool that we call metric correlations. What this tool does is that you highlight a spike or a dive on a chart, any chart, it doesn't matter. Let's assume that it is your business sales, or your web server requests per second, or whatever, it doesn't matter. So, you see a spike or a dive. You highlight the thing, and then the engine – we implemented a scoring engine inside Netdata – goes through all the metrics, all of them, across all the servers, to find out what looks similar to that thing. What changed that significantly in the timeframe that you have highlighted. Similar to what you have highlighted. It gives you a short list, a scored list, of all the metrics across your network that look similar. It may find a spike when you have highlighted a dive, and vice versa, because it looks for changes. Something that changed similarly, not specifically up or down. So, the idea is that using this tool, you can see, for example – your aha moment is there. You highlight a dive on your web server, and you see, for example, that your database was slow, or the storage was slow, or the network had some issues, or whatever it was. That was the idea. So, these tools correlate metrics independently of the configuration. These are just mathematics.
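To make that scoring idea concrete, here is a minimal, purely illustrative sketch in Python of the concept: compare each metric's behavior inside a highlighted window against a baseline just before it, and rank by the magnitude of the change, regardless of direction. The names and the scoring function are hypothetical; Netdata's actual engine is implemented differently.

    # Toy sketch: rank metrics by how much they changed in a highlighted window,
    # compared to a baseline window immediately before it. Illustration only.

    def change_score(baseline, window):
        """Return a 0..1 score of how different the window is from the baseline."""
        base_avg = sum(baseline) / len(baseline)
        win_avg = sum(window) / len(window)
        denom = max(abs(base_avg), abs(win_avg), 1e-9)
        return min(abs(win_avg - base_avg) / denom, 1.0)

    def correlate(metrics, start, end):
        """metrics: dict of name -> list of per-second samples.
        Returns metrics sorted by how much they changed inside [start, end)."""
        baseline_start = max(0, start - 4 * (end - start))  # baseline: 4x the window, just before it
        scores = {}
        for name, samples in metrics.items():
            baseline, window = samples[baseline_start:start], samples[start:end]
            if baseline and window:
                scores[name] = change_score(baseline, window)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Example: a dive in web requests ranks together with a spike in query time,
    # while an unchanged metric scores near zero.
    metrics = {
        "web.requests_per_sec": [100] * 40 + [20] * 10,
        "db.query_time_ms":     [5] * 40 + [50] * 10,
        "cpu.idle_percent":     [90] * 50,
    }
    print(correlate(metrics, 40, 50))

Scoring by the size of the change, rather than its direction, is what lets a dive in one metric surface a matching spike in another, as described above.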
[0:21:08] LA: That's great. That's great. So, I want to talk more about that if we can, but I think we need to understand a little bit about how Netdata works and how it differs from other companies before I think we can understand that fully. Usually, about now in an interview, one of the questions I'll ask an observability company is, are you SaaS? And usually the answer is pretty clear. "We're on-prem", or, "Yes, we're SaaS. What we'd call mostly SaaS. We have an agent, but mostly SaaS." But you're a little different. You're mostly on-prem, and partially SaaS, versus mostly SaaS and partially on-prem. That's kind of backwards. Do you want to talk about that a little bit, and exactly how your architecture works from that standpoint?

[0:21:52] CT: So, as I said, Netdata is a distributed monitoring system. So, you install Netdata agents all over the place, and the same software can also act as a centralization point. This is open-source software. You install it on your servers, you put some centralization points as needed, and this is it, you are done. Now, because this is distributed, it needs some orchestration. So, at some point, you need to ask, for example: which are all the metrics on my entire network?

We have something, let's say, a registry or an orchestrator of the entire thing. We call this Netdata Cloud. This is a free offering. So, we give it to the community for free. Of course, Netdata can be used without it. Netdata Cloud now, what it knows is how many servers you have, if they are online, what metrics they have – the names, not the values. Just the keys, the names, that you do collect system CPU today, or whatever the metrics you have. Based on this, it can create dashboards across your infrastructure by querying the individual servers. So, always, when you see a chart or an alarm, it comes from your on-prem software. But if you want to view, for example, your infrastructure from anywhere. So, your infrastructure is private, but you want to be at your home, log in and see it, where do you go? You go to a SaaS offering. Now, we have a SaaS offering that is free, and that is the baseline, for example, for everyone. It's free forever. We ensure that there is no cost involved. Then, on that SaaS offering, we offer features that enterprises mainly need. These are role-based access, or the thing that I said, single sign-on. Stuff that enterprises need, and these are on the paid plan.

Now, this works well. Today, let's say, this allows Netdata to be a lot more cost-efficient than anything else. So, you don't have dedicated infrastructure on your premises. You install Netdata everywhere. Not dedicated, so we just use resources that are spare inside your servers. At the same time, with a very thin layer, we are able to provide full infrastructure-level views without any additional effort.

[0:24:19] LA: This makes sense for, when we talk about, like, server monitoring, for instance. It works well for that, because you put an agent on the server and it collects the data on the server, the data is stored on the server, your data plane is entirely in your own premises. And the only thing that goes on in the cloud is the correlation of what data you're collecting and the management of that –

[0:24:42] CT: It's like a registry.

[0:24:42] LA: – and creation of the dashboards.

[0:24:43] CT: A registry. It's a registry.

[0:24:46] LA: Registry. Right. But the data for those dashboards still comes directly from your infrastructure.

[0:24:53] CT: Yes, [inaudible 0:24:53] alarms. So now, we're releasing a mobile app, so you can receive, for example, alarms on the app.

[0:24:59] LA: So, alerts that come in are generated by your own infrastructure, on your own premises. Whether that's in the cloud or on-premise, it doesn't matter. It's your own infrastructure.

[0:25:10] CT: Yes. The idea with the implementation that we have is that it is distributed, and they don't need to be in the same place. So, you may have a server in China, and one in Europe, and another one in the US. We don't care. As long as – and these are private. All of these servers are inside private LANs. It's okay. They will connect to Netdata Cloud, the registry. And via the registry, you will be able to access your servers. So, you do not have to configure a registry yourself. You do not have to expose it to the Internet. And for everything that needs to be done, for example, the mobile app notifications, or sending PagerDuty notifications, or whatever these are, the registry does it. So, the agent sends these events to the registry and the registry sends them to the proper endpoint.
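For reference, connecting an agent to that registry is done with an outgoing connection from the node, typically by "claiming" it to a Netdata Cloud space. A hypothetical example of what this looks like – the token, the room, and the exact script name depend on your space and Netdata version:

    # Run on the monitored node; it opens an outbound connection to Netdata Cloud.
    sudo netdata-claim.sh -token=YOUR_CLAIM_TOKEN -rooms=YOUR_ROOM_ID -url=https://app.netdata.cloud

No inbound ports need to be opened on the node; as described above, only the outgoing connection to the registry is required.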
[0:26:04] LA: Makes perfect sense. So, you store the data on-premise, you generate the alerts and the events on-premise, and you send them up and magic occurs.

[0:26:14] CT: It's a transport. Of course, there is code there. So, this Netdata Cloud, there is code, because imagine this: you are on a dashboard where you have two servers. One server is in China, the other is in the US. Great. Now, you want a dashboard with every chart coming from both. Who merges the data? These servers send the data to Netdata Cloud, and Netdata Cloud, without storing them, on the fly, merges them and sends them to your browser.

[0:26:42] LA: You tunnel the requests, is essentially what you're doing.

[0:26:45] CT: So that you can access, you can have a unified view of your data, no matter where you are, and no matter where your servers are.

[0:26:53] LA: That's how you avoid VPN issues, firewalls, and things like that, as well. You don't have to worry about any of those things.

[0:26:59] CT: Nothing, you don't need anything. It just means these servers are in private LANs, and they just need to do an outgoing connection to the registry. That's it.

[0:27:06] LA: Two questions that come up. One of them is, so now you're storing the data, and the events creation all occurs on-prem. That means the machine learning algorithms and the AI that you're doing with the data have to also occur on-prem. Correct?

[0:27:25] CT: Exactly.

[0:27:25] LA: Okay. So, within the server, you're doing all that processing. Now, how much of a load is that for applications? Do applications need to plan for that? And rather than having 50 servers, they might need 52 servers because they need to generate capacity to handle this machine learning that goes on with the metrics?

[0:27:44] CT: Okay, great question. So, with ML and health and storage enabled, everything enabled, Netdata needs 5% of one CPU core.

[0:28:00] LA: That's very efficient.

[0:28:01] CT: If this is a lot – there are many servers that are sensitive, where this is not allowed; for example, if you have a master database, you may not want this – okay. What you do in this case is that you put a parent server. Same software, open source. You put a parent server and you stream the data from the database server to the parent. Now, the child does not have anything there. It uses just one CPU, 1% of a single core, for its own data collection and the streaming and the like, and it offloads everything. ML, alerts, dashboarding, storage, everything is offloaded to the parent. This way, you have full flexibility.

[0:28:46] LA: So, you can keep it separate and isolated, so it doesn't affect the performance of your application.

[0:28:51] CT: Yes. And the parents are also used if you have a DMZ, so you don't want your servers, for example, to connect to the Internet directly. That's okay. Put a parent, as a border controller, as a border gateway. So, you put a parent there. On one side, it is connected to your network. On the other side, it's connected to the internet, and you have a broker there protecting your servers, and only [inaudible 0:29:17] connection to the cloud and nothing else.

[0:29:20] LA: Nothing else. Just the one server?

[0:29:22] CT: Yes.
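As a sketch of the parent/child streaming just described, the relationship is configured in the agent's streaming configuration. A minimal, illustrative example, assuming the standard stream.conf layout – keys and paths may differ between versions:

    # Child (the production node), /etc/netdata/stream.conf:
    [stream]
        enabled = yes
        destination = parent.example.com:19999
        api key = 11111111-2222-3333-4444-555555555555

    # Parent (the centralization point), /etc/netdata/stream.conf:
    [11111111-2222-3333-4444-555555555555]
        enabled = yes

The child then only collects and streams, while storage, ML, alerts, and dashboards run on the parent, as described above.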
[0:29:22] LA: Yes. So, I think this answers my second question, too, which was, how do you deal with cloud-owned components? So, say you're wanting to monitor RDS, for instance, in the AWS world. How do you do that? You do the same sort of model there. You launch a separate server and have it collect the data.

[0:29:40] CT: You have two options. If all you have is a cloud-managed service, most likely, you will have metrics in CloudWatch or something else. Of course, this has a cost. This is expensive. Not for us, you pay AWS or the cloud provider. Another option is, if you have a server in AWS, you install Netdata on it and you connect to your RDS as a remote database server. So, you give it your Postgres credentials to collect statistics. And now this server on AWS knows exactly what is happening on your server. Actually, this is the preferred way, not only because of the cost – this is a lot cheaper – but also because you have full visibility on all metrics. On CloudWatch, not all metrics are available. But when you connect Netdata directly to the database server, you can collect all of the metrics that are available, and per second.

[0:30:34] LA: All the database metrics. You can't necessarily get server-level metrics, but you can certainly get –

[0:30:40] CT: Even per table, per index, per database instance, per everything. Now, this is not expensive. So, the cheapest VM, let's say, on AWS, even the free one, the one that is given for free, is perfectly capable of monitoring a few database servers, or some shared components there, managed by the cloud provider. Once you do this, this is a normal agent. So, as far as we are concerned, this is a local database. It's not on the same host, but it's a local database to that node. So, once you do this, then everything will work. You can merge, you can have a combined dashboard from two PostgreSQL servers, for example, one in AWS and another in Azure. You can see them as one in Netdata, because everything is the same, just different data sources for us.

[0:31:34] LA: Cool, that makes a lot of sense. That explains the cloud monitoring aspect as well, too. So, is it right to say that you're mostly an on-prem system with the SaaS overlord, your registration system –

[0:31:48] CT: I have to tell you that we have Fortune 500 companies that use Netdata that require to have Netdata Cloud on-prem. We do offer this option. Netdata Cloud is not free software, unlike the agent. It's a thin layer compared to the agent. But if you need it on-prem, then you are a business, and we have to make money somehow. So, this is actually one of the key revenue sources that we have, the cloud on-prem.

[0:32:17] LA: So, your agents are open source. The Netdata Cloud is where you make your money, but your core is an open-source offering.

[0:32:26] CT: So, the agent is fully capable of being used as a standalone system by itself. It doesn't need it. It doesn't require the cloud. If you have a parent, if you centralize your metrics to one agent, then full infrastructure-level dashboards are available there. Everything is there. Even the alerts will be centralized, everything will be there. But if you want to scale further, so you need many, many, many parents, because you have dispersed data centers all over the world, et cetera, or the infrastructure is too big, it cannot fit in a single server, then you need the cloud. If you want it on-prem, then we have an offering even for on-prem.

[0:33:06] LA: One of the things you say on your website is that you're one of the most starred open-source projects on GitHub. That's a pretty amazing statement. Do you want to talk about that a little bit? What exactly does that mean?

[0:33:18] CT: Yes. So, when I released Netdata – this is a funny story. I had been developing Netdata for a couple of years, and a few friends of mine knew about it.
We used it in production, at the company I was working for; we were trying to solve problems with cloud providers and the like. Successfully, we did solve them with Netdata. So, I am saying, "Okay, this is good. I have worked a couple of years on it. And now it's time to give it to the world." So, I press the release button on GitHub, and nothing happens. Nothing. Okay, let's talk about it. Let's write a blog post. So, I wrote a blog post and I asked a few online Linux-related, sysadmin-related, and DevOps-related magazines to write about it. They refused. "We don't care." So, one morning, I say, "Okay, let's go to Reddit and post there." Let's say, guys – this post is still there. "It was nice. I wrote this thing. It probably can be helpful for you. Please check it if you like it. Thank you very much. Bye." And Netdata managed to get 10,000 GitHub stars in two weeks. Two weeks. It's probably the fastest-growing GitHub project ever. It was featured also in the GitHub Octoverse for that year.

Today, we have about 65, if I remember correctly, 65,000 GitHub stars. So, this behavior is usually seen on, let's say, Google projects. Google projects usually get that many stars, when they have this kind of activity. Also, Netdata is one of the most starred projects in the CNCF landscape, although we are not incubating in CNCF. CNCF wants to take control of the project. They want to control what's the development, what's the future of the project. So, we don't want this to happen. But Netdata is in the CNCF landscape. I think we are about, these days, to pass in front of Elastic. We are the third or fourth, after some libraries.

Building a community, I have to tell you, was fun. Building this kind of community, because most of the hard work has been done after I realized how much people needed such a solution. Because the problem of the company I was working for, it was done, we fixed it. So then, when I realized that people really need this kind of solution, then I said, "Probably, this is what we have to do. This is how we need to progress further."

[0:36:04] LA: One of the things that open-source companies, companies that have solid open-source creds and a solid open-source foundation, will often struggle with is monetization. Right? How do you take that open-source product and make money out of it? And usually, it involves some sort of enterprise play or customer support offering or something like that. Sounds like you went the enterprise option. But what capabilities? Besides the standard enterprise capabilities, like single sign-on, those sorts of things, what capabilities do you provide in your paid model that give value over and above the open-source model? What do people get when they buy into your enterprise model?

[0:36:50] CT: Okay. So, there are many models, as you probably know. There are many models around open source, to sell services or to sell an enterprise version of the software. What I tried to do here is something that has not been done before. What I wanted is Netdata, the core of Netdata, to be open source, fully open source, without an alternative. So, no enterprise version of it, nothing. I am a big fan of open source myself. I believe that everything that I have become in my career is because of open source.
Personally, I believe that open source is one of the miracles of the world, and 100 years from now or so, probably people, when they look back, among the other miracles that are happening around us, I believe that open source will be one of the major breakthroughs of humanity for this period. So, I love it, and I want to give back. I think that I took really a lot from the open-source communities, so I want to give back. This way, I believe that Netdata, the agent, is a gift to the world. It's the core of Netdata. Everything we do, I want it open source, always free, no enterprise version of it. That's it. It's the best and it's free.

Now, in order to make money, you have to figure out for which features people will pay. Now, this is hard for us, mainly because everything is done by the agent. You understand that all the core features of the monitoring are free, are inside the agent. Even ML, we put it in the agent. The dashboards, the query engine, the health engine, everything is in the agent. So, in order to make money, you need to find something that open-source users may not need that much, but enterprises are willing to pay for. This does not need to be something heavy in monitoring terms. It may be convenience. It may be something that you can live without if you are not a company, but you must have if you are a company. Collaboration features. So, we are trying to find the right mix, the right separation, let's say, between the open-source and the core monitoring features that we want always free, and the must-haves for companies, so that they are willing to pay.

Of course, this is an experiment. We don't know yet if we're going to succeed. This is an experiment. I think, however, we see amazing love from a lot of big companies for Netdata. The good thing is that they see – so the value they find in Netdata, I can't name them. But the idea is that even if you have Fortune 500 companies that have a monitoring team, they find that Netdata is actually more accurate, and provides more insights, and has more alerts than what they have built so far. This is great. This is our success. So, these companies are willing to pay to get the single sign-on, and the peace of mind, the increased security. They pay for this stuff. So, this is an experiment. We're trying to figure out where exactly the separation will be.

[0:40:15] LA: So, we're going to have to have a follow-on episode in a couple of years to see how your experiment has been going and how it's changed and adjusted over time. I know we had this conversation a couple of years ago as well, and we talked about this. It's great to see how it's matured over time. I know, it'll be that much more mature.

[0:40:35] CT: Our commitment is that the core monitoring functionality should always be free. So, the core monitoring functionality.

[0:40:43] LA: And that differs from a traditional PLG model, a product-led growth model, where it's like, we'll give it away for free, but make it so that if people really like the product, then simply by using it, they'll want to start paying. This really is something different. This really is the "enterprises pay, normal users don't pay" model, where you give everything that a normal, free-tier user would want for free, forever, but you charge for the things that the people who have money are able or willing to pay for, that these community-driven users don't really care about.
Like you say, enterprise-level collaboration tools, single sign-on security, those sorts of things, and those are the things you charge for.

[0:41:27] CT: The key thing here, Lee, is for this not to happen on the core monitoring features. For example, I don't want this, because I want the free user, or the user that cannot pay, at the end of the day, to have ML, to have metric correlations, to have proper troubleshooting tools, because we are also, at the same time, trying to improve the situation for everyone. So, I don't want to take this out. The opposite, I want to give more for free.

[0:41:55] LA: That makes sense. This actually leads me to my final question, which is really, where are you going next? But I also want to prime it specifically with the phrase, Generative AI, to see if that brings anything into your plans. If there's anything you're planning on doing with Generative AI. But in general, I'd like to know, what are you looking at? What's next for Netdata?

[0:42:20] CT: Especially for the AI, we are trying hard. There is a presentation from Google at SREcon. I will be at SREcon in a couple of weeks this year. But a couple of years earlier, there was a guy from Google who was at SREcon, and the title of his presentation was that all our ML ideas are bad – about monitoring, of course, and observability. And he explained that ML and monitoring don't go together. It's very hard, mainly because every server is unique. The workload – so, even if you have two servers that are exactly the same, same data, same software, same hardware, same applications, everything the same, the workload determines what the metrics will do at the end of the day. So, it's very hard to go and train models and say to the models what is good, what is bad, and how to detect anomalies, and all this kind of stuff.

We tried hard to find a way for ML to be useful. In Netdata, it is unsupervised, so you don't do anything. It just sits there, detects anomalies, colors this ribbon above every chart; we have a scoring engine that you can query on the fly. But that's it. So, the idea is how to turn ML not into a decision-making thing – it's not mathematics to make it true or false, it is an alarm, it's not an alarm – but into a consultancy. So, while you're troubleshooting, ML in Netdata is there to assist you. It will not decide to wake you up at 3am. It will not. We have some cases where it's obvious that it should. For example, you have an anomaly rate on 2,000 metrics on your network, all concurrently. Wait a moment. Something is hitting you. You have an attack, something is broken, something very bad is happening. So, we're trying to find how we're going to make the most out of ML.

But Netdata is a lot more than that. The ease of use that we're trying to give, because my goal is for everyone in this world to have a monitoring system comparable to the one that Facebook has, that Netflix has, that Tesla has. How you can have a monitoring system as capable as the one that Fortune 500 companies have. This is what we're trying to do with Netdata. This is what we're trying to give to the world. For this to happen, we have a lot of fronts. Really, a lot of fronts. More integrations, more interoperability, better dashboarding, more transparency. So, I think, and this is one of the things that I was always discussing with our investors, it's a big project. It has 800 integrations, it is grassroots, it has a big community. They need support.
We need to understand all the use cases out there, and how to translate all these needs into software, and implement it, and give it to the world. Overall, I think that what we're doing with ML is in the right direction. So, we are going step by step into the things that will make ML really useful for monitoring. Then, on the other side, I think that for the next couple of years, it is mainly to smooth, to polish everything that we have, and improve it. Improve interoperability, because in monitoring, you need to be interoperable. You cannot do everything alone. You have to.

[0:45:53] LA: Right. Cool. So, I really look forward to seeing what comes next and how you work. It's interesting, the take on machine learning and AI, and your whole emphasis there is very different than what I hear from other companies. So, it'll be interesting to see how that works, and hopefully it will work well, and I wish you the best of luck.

[0:46:13] CT: Thank you very much.

[0:46:13] LA: My guest today has been Costa Tsaousis, who is the Founder and Chief Executive Officer at Netdata. Costa, thank you so much for being on Software Engineering Daily.

[0:46:27] CT: Thank you very much for hosting me.