EPISODE 1717

[INTRODUCTION]

[0:00:00] ANNOUNCER: Traditionally, security checks and testing are performed towards the end of the software development lifecycle. However, discovering vulnerabilities at that stage can be costly and time-consuming. This observation has led to the shift-left movement, which advocates for implementing security testing earlier in the software development process. HoundDog.ai is a startup focused on software to enable shift-left security practices. Amjad Afanah and Sudipta Mukherjee are co-founders of HoundDog, and they join the show to talk about their company.

Gregor Vand is a security-focused technologist and is the founder and CTO of Mailpass. Previously, Gregor was a CTO across cybersecurity, cyber insurance, and general software engineering companies. He has been based in Asia-Pacific for almost a decade and can be found via his profile at vand.hk.

[INTERVIEW]

[0:01:02] GV: Hi, Amjad and Sudipta. Welcome to Software Engineering Daily.

[0:01:06] AA: Thank you for having us.

[0:01:07] GV: Yeah, it's great to have you both here from HoundDog.ai. Love to maybe just kick off with hearing about both of your backgrounds. Maybe Amjad, do you want to start off — where have you come from in the journey to HoundDog.ai?

[0:01:23] AA: Yeah, I appreciate it. Thank you for having us. My journey started back in 2006, when I basically finished college. I moved to the Bay Area and started working for enterprise software companies. I first joined Oracle and then VMware. In both capacities, I was a product manager, basically managing enterprise software product lines that had to do with application monitoring and cloud automation. My entrepreneurial journey started when I met a very smart engineer at VMware. This was back in 2014, when Docker was rising as a disruptive technology that was helping developers deploy applications more easily. We figured that, hey, we're at VMware, we know how enterprise applications should be deployed. Maybe we can extend our knowledge to the world of Docker containers. We started our first company, a company called DCHQ. A year and a half into it — it was a very competitive space, a lot of different container orchestration tools and cloud management vendors. We were very proud of what we built. We integrated with 15 different clouds and really simplified the containerization of existing enterprise applications, as opposed to greenfield, microservice-type applications. Then a year and a half into that, we actually ended up getting acquired, by a company in the hyper-converged infrastructure space that was trying to pivot into cloud management. I ended up staying with that company for a couple of years.

Then my introduction to the world of cybersecurity was back in 2018, when that same co-founder and I decided to start a company that provided the first API security scanning technology. The idea was to basically scan your OpenAPI spec, or Swagger file, and we would generate a whole bunch of tests that would look into SQL injection, denial-of-service attacks, all that stuff. The thing that distinguished us was around access control, right? Because a lot of the API attacks that happen are because of privilege escalation — somebody having access to more APIs, or more things, than they should have. We would basically offer our customers a pen-test type of experience, where they would provide credentials for the different roles they have in their application.
We would generate tests that would basically mimic what kinds of access they had to the various API endpoints using these different roles. Then they would see, "Oh, okay. Admin has more access than it should have," and they would remediate accordingly. Then I ended up leaving that company for different reasons, mainly because I was recruited by the self-driving car company called Cruise. That was supposed to be the next OpenAI. Unfortunately, it didn't pan out accordingly. They basically wanted somebody to manage the developer experience and infrastructure services product management team. I stuck around for two and a half years and then decided that, hey, I want to get back into cybersecurity. Back in 2022, I joined a company called Cyral, a company in the data security space. I joined as the VP of product. Really, it was my time at Cyral that inspired me to start HoundDog, and I can get into the details later about some of the lessons I learned from the security and privacy teams while at Cyral that helped formulate a lot of the core thesis that we started HoundDog with. That was my high-level journey. I'll turn it over to Sudipta to talk about his journey.

[0:05:09] SM: Yeah. My journey started around, I think, 20 years back. I have been programming different kinds of software since then. I'll do a reverse-chronological tour of my journey. In recent years, the last decade or so, I have been involved in a lot of compiler and tooling work around programming language development. I worked as a compiler backend engineer, developing compilers for languages that are defunct now, but are still out there. I had written a compiler for a language called AppBuilder — not the Google one. It's a COBOL DSL. Essentially, we wrote the compiler in C#, emitting MSIL. That was my initial touch on compilers. Then I had a knack for metaprogramming — writing code that identifies glitches in code bases. I wrote a book called Source Code Analytics with Roslyn and JavaScript Data Visualization, using Microsoft's Roslyn and a JavaScript visualization engine. That was in 2016, and it was received well. Since then, I had been experimenting with different kinds of metaprogramming approaches — not just identifying bad programming practices, or identifying things in the code, but also clustering developers, using machine learning and programming language techniques. I always wanted to work at the intersection of ML and PL. When I got the opportunity to work with Amjad, that was a very happy accident — a win-win situation, and I readily joined. That's the brief history with programming language tools. Before that, I had worked on enterprise software, developing lots of other things. I won't go there. It's a long history.

[0:07:14] GV: Yeah. Wow. Two, I would say, quite varied backgrounds. I think, Amjad, your background tallies a bit closer to mine, maybe, especially where you talk about coming to cybersecurity at that point. I think that's super interesting, because a lot of people assume you either start in cybersecurity and that's your life, or you're not in it at all. Actually, I think some of the best tooling and platforms in cybersecurity now come out of people that have transitioned to the cybersecurity side, from watching it from the sidelines a bit and then going, "You know what? Actually, I think we can do something better here."
Yeah, super interesting from both of you. The platform, it's called HoundDog. I'd love to, A, just understand the name. If you Google HoundDog, a couple of platforms come up, so I just want to be super clear: it's HoundDog.ai that we're talking about today. I guess, where's the name from? Then also, at a high level, what is HoundDog.ai? What problems did you want to solve through this? Why did you think, we want to solve this problem, or these problems, through a venture?

[0:08:15] AA: Right. Right. Yeah. Starting with the name — it's actually really hard to come up with a unique name. There seem to be a lot of dogs in the cybersecurity space. But it's understandable, because a dog is a loyal companion that hunts for things and helps you discover things and identify things. I figured, yeah, HoundDog is the perfect name for the animal that's going to help us hunt for these data security vulnerabilities that we're looking for. Basically, when we started last year, there was an insurance company that had hounddog.com, and I was like, "Okay, that's fine." Then just recently, I discovered there's hounddogai.com — some financial modeling tool. It just appeared out of nowhere.

[0:09:07] GV: It happens. It happens to the best of us.

[0:09:10] AA: Exactly. Exactly. Yeah. In terms of what we're trying to solve — I was talking about my days at Cyral. That's really what inspired me to start this company. Cyral is a data security company. Like many others in that space, it basically helps you discover and classify your structured data in production environments, and then apply access controls accordingly. While I was at Cyral, the feedback I was getting from a lot of security and privacy teams was that, yeah, the product and that approach make sense. What is missing is a more proactive approach, because everything that's out there is very reactive, right? It's assuming that you already have something in production; you have to discover it, classify it, and then apply the access controls. What if you really wanted to prevent, like, personally identifiable information from leaking into logs, or files, or third-party systems from the start, right? Before the ripple effect happens — because when you leak something into logs, it's not just the logs. There's a whole bunch of systems that could potentially ingest logs. There's Sentry, there's Datadog for observability. There are the SIEM tools, like Sumo Logic and Splunk. There are backup systems. Many, many different systems that may end up having this sensitive data exposed in them, right? That was one of the first insights, from a security standpoint.

From a privacy standpoint, it's actually a really big mess. For GDPR compliance, there was a survey that was done by one of the data privacy platforms. Basically, they asked a question, which is: what's the level of effort needed for you to discover all the systems that are handling user data and then come up with a unified data map? 58%, the overwhelming majority, said that it takes multiple teams, an external vendor, and at least one year to complete that task. One year. They spend the whole year between manual surveys, checking with different teams, and understanding what data flows they're processing and all that stuff, just to create that unified data map that is needed for documenting processing activities for compliance.
Then in addition to that, there was another question asking, how often do you add systems that process user data? The majority of the respondents said, weekly. It's not a one-off thing. It's something that continues to change, and it takes a long time, because of the manual approach that they're taking. We figured, hey, what if we build a code scanner that really understands the data flows that are handling sensitive data? It could be used for two use cases. One is flagging and preventing these PII leaks from the start, and the other is helping you document processing activities in a way that keeps up with the changes in the code base, so you don't have to go through all that manual process, basically.

[0:12:20] GV: Yes. Okay. PII is the headline there — as you say, two use cases, effectively: being able to protect against leaks, and then also being able to keep on top of where the PII is and what those data flows are, especially for compliance in the run-up to a SOC 2 or something along those lines. Then just briefly, what is HoundDog.ai not? What is it not solving, just to clear that one up?

[0:12:47] AA: Yeah. One of the questions that keeps coming up when we talk about detecting sensitive data flows is the thing people are accustomed to knowing, which is what's already available in the SAST scanners, or the code scanners, today. That's exposed secrets detection. That's what we're not doing, because it's a different problem. Exposed secrets detection is looking for actual exposed API tokens, or passwords, that you've embedded in the code. Whereas what we're doing is understanding the code logic that's handling sensitive data. You may be creating a function, or a method, that is about to handle social security numbers. We will intelligently know — based on the naming of the different things and the various analysis techniques that Sudipta can drill down into — that, okay, this function, with a certain degree of confidence, is handling social security numbers. Then we track it across the code base to see, oh, okay, what are you doing with it? Are you exposing it in logs? Are you saving it in plain text in a file? Are you sending it to Sentry, or Datadog? That's what we're doing. We're not doing the secrets detection, because, to be honest, it's become a commoditized feature that every SAST scanner is now offering.

[0:14:03] GV: Yeah, I think that's a really good thing to point out. They could sound like the same problem to solve, but they really are not. As you say, secrets detection has definitely been around for a bit longer, and there are some really great companies out there that specialize in that — TruffleHog comes to mind. That's another animal in there. Not a dog, but a hog.

[0:14:20] AA: Right.

[0:14:21] GV: Yeah. Let's maybe get into the weeds a bit of how the platform works in detail. I think the high level here is that it's both static analysis and AI. I'd love to understand, where do you see the static analysis being the powerful piece? Then, I mean, it sounds obvious, but I'm going to ask why also go to AI as well, and how do those two approaches work together? What are the limitations of each? I think in this context, there's just a lot to unpack there.

[0:14:53] SM: Okay. HoundDog.ai, the platform, if you will, has a lot of moving parts. Largely, there are two big moving parts. One is the scanner. Obviously, the other one, as you mentioned, is the AI component.
When we refer to the scanner, we refer to the static analysis component. When we refer to AI, we refer to the OpenAI integrations. What the scanner does is typical of static analysis. It gets the source code, parses it, and generates a syntax tree. It kind of goes back and forth in that syntax tree to identify the tokens, and also to identify where exactly those tokens are used inside a program. A function call, in terms of static analysis, is called an invocation — so, where exactly these tokens are utilized in a function call. Based on that, it determines whether some PII data is being leaked to logs or not.

Let's take an example, just to make it much clearer. Let's say I have a function call, something like logger.log. Inside that, I may say something like, "this is the bank account," and then I pass a variable called bank account number. What the scanner sees is that logger.log is a function call. Then it has the function arguments — a string literal saying something is there, and also the bank account number as a variable. It knows that the bank account number is obviously PII — or PIFI in this case, personally identifiable financial information. Then it flags that invocation as a vulnerable call, because we know that logging sensitive data is a vulnerability.

This is an obvious example, but our scanner is smart enough to identify non-obvious examples. Let's say you are doing some programming and you write, string bank account equals get bank account number, from somewhere. Down the line, somewhere you have said, string X is equal to bank account. Now, X also has the details about the bank account. Then somewhere you say logger.log something — instead of bank account, you just pass X. Our scanner also detects X, using taint analysis. It's called taint analysis because, in terms of static analysis, the bank account is called a tainted variable — it carries data which is not supposed to be there, and that's why it is called tainted. Whenever it is assigned to some other variable, that variable also becomes tainted. If you will, it's like a ladder, joining the pegs together. X also becomes tainted, and the scanner knows that, identifies that, and keeps track of all such assignments. This is a trivial example, but it can happen across different function calls and all of that.

How does it know that the bank account is actually sensitive data? Because we maintain a battle-tested list of regular expressions to identify variable names, function names, token names, class names, or whatever, to know what particular type of sensitive data it is. For bank account number, we have a regular expression. Similarly, for, let's say, blood pressure, we have a regular expression, and so on and so forth. We have rigorously tested these by scanning multiple repositories, and we now know that they give, if not 100%, around 99% correct results all the time. Since the basics are correct, all the tainted-variable detection is also obviously correct. So, let's say it's not just bank account number — it says something like, customer bank account details. The variable name is customerBankAccountDetails.
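To make that concrete, here is a rough sketch, in Java, of the two ideas Sudipta describes: a curated list of regular expressions mapping identifier names to sensitive-data categories, and taint propagating through assignments into a log sink. All names and patterns here are hypothetical illustrations of the general technique, not HoundDog's actual rules or implementation.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;
    import java.util.regex.Pattern;

    // Toy sketch of regex-based classification plus taint propagation.
    public class TaintSketch {

        // Curated category -> identifier-name patterns (illustrative only).
        static final Map<String, Pattern> CATEGORIES = Map.of(
            "BANK_ACCOUNT",
                Pattern.compile("(?i)(customer_?)?bank_?acc(oun)?t_?(num(ber)?|details|id)?"),
            "SSN",
                Pattern.compile("(?i)(social_?security(_?number)?|ssn)"));

        static Optional<String> classify(String identifier) {
            return CATEGORIES.entrySet().stream()
                .filter(e -> e.getValue().matcher(identifier).matches())
                .map(Map.Entry::getKey)
                .findFirst();
        }

        public static void main(String[] args) {
            // Pretend the syntax tree walk found these three statements:
            //   String bankAccount = getBankAccountNumber();
            //   String x = bankAccount;
            //   logger.log("this is the bank account: " + x);
            Map<String, String> taint = new HashMap<>(); // variable -> category

            // 1) Declaration: the name matches a pattern, so it's a taint source.
            classify("bankAccount").ifPresent(c -> taint.put("bankAccount", c));

            // 2) Assignment x = bankAccount: taint propagates to the alias.
            if (taint.containsKey("bankAccount")) {
                taint.put("x", taint.get("bankAccount"));
            }

            // 3) A log sink receiving a tainted variable becomes a finding.
            if (taint.containsKey("x")) {
                System.out.println("FINDING: " + taint.get("x")
                    + " flows into logger.log via variable 'x'");
            }
        }
    }

The same classify call would also match an identifier like customerBankAccountDetails, which is the case Sudipta picks up next.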
We also know that this is a variable of interest. Then, if somewhere down the line the developer assigns that customer bank account details to some other variable, we also know that it is tainted. That is the capacity of our static analysis, using the taint analysis technique that I just described.

We also do something like inter-procedural analysis. Let's say I have a type, and that type exposes some methods — public methods. Let's say I have a class called Student, and that Student has a function called get gender, for example. Then I say sentry dot capture-something, and I pass some object of that Student class, dot get gender. We know that get gender is a method of the Student class that actually exposes the gender, and it is going to Sentry. The scanner is very smart that way, because it knows exactly how and where different kinds of things happen.

Where we needed the AI capability is that regex can only do so much, because we can write a regex for things that we know we are looking for, right? For example, if I know that I have to look for blood sugar levels in a variable declaration, I can write a regular expression to represent the variable names that would possibly be used to define blood sugar levels. I cannot write a regular expression for things that are unknown to me, but are possibly PII, or PIFI, or PHI. In the wild, there will be much more sensitive data that we do not know about right now. Where we take the help of OpenAI, we don't send the code. What we do is send the tokens, and we have simplified the problem for the AI to solve. It's like a classic binary classification machine learning problem: we basically give OpenAI some tokens and ask it to identify — can you tell us if this represents sensitive data? If it does, which category do you think it belongs to, and what is your confidence about that? Then it gives us that result back.

Then we take that data back to the scanner, and it helps the scanner, because the AI does not do this traversing of trees and such things. It just gives us a binary answer — this is sensitive, that is not. As in our example, instead of X equals bank account number, it can be X equals something else that we do not know about, but that probably is a bank account number. Now we know that it is a bank account number, and all the other things that I discussed will keep on working. That's where the happy marriage happens between AI and static analysis.

[0:22:37] GV: Yeah. That's awesome detail. I think you explained really well how the AI is coming in here — much more for the contextual spots. Because, as you say, regular expressions can be used where you're very sure of a format, and that format really follows a pattern, and that's a much more robust way of looking at it.

[0:22:56] SM: Yeah. I forgot to mention a couple of interesting points. One is that regex is a nightmare to maintain, obviously. Nobody knows how to write them well. That's one thing. The second thing, an advantage of using AI, is that we can capture foreign-language details. Some code bases that we scanned had foreign-language variable names in them. Normally, we see that English is used for coding, but we have seen that some code bases had some foreign language. Maybe, I don't know, it can be a gender in some other language.
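Sketched in Java, that round trip might look roughly like the following. The model name, prompt wording, and category list are illustrative assumptions, not HoundDog's actual integration — the point is that only identifier tokens are sent, never source code, and the answer feeds back into the taint tracking shown earlier. Note the French-ish identifier, which previews the point Sudipta makes next.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Rough sketch of the token-classification round trip (illustrative only).
    public class TokenClassifierSketch {
        public static void main(String[] args) throws Exception {
            String apiKey = System.getenv("OPENAI_API_KEY");

            // Identifier names harvested from the syntax tree; no code is sent.
            String tokens = "numeroDeTelephone, custAcctNo, retryCount";

            // Hypothetical prompt; the real wording and categories may differ.
            String prompt = "For each identifier, say whether it likely holds "
                + "sensitive data (yes/no); if yes, give a category (PII, PHI, "
                + "PIFI) and a confidence from 0 to 1. Identifiers: " + tokens;

            String body = "{\"model\": \"gpt-4o-mini\", \"messages\": "
                + "[{\"role\": \"user\", \"content\": \"" + prompt + "\"}]}";

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

            // A "yes" answer turns the token into a taint source, and the
            // tree traversal / taint propagation takes over from there.
            System.out.println(response.body());
        }
    }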
I have seen phone numbers named in French, for example. We can identify those using AI, because, obviously, AI knows all the languages on the planet.

[0:23:46] GV: Yeah, I think that's great context. A couple of anecdotes on this side, I guess. As you say, regex is very hard to maintain. It can also lead to some strangely catastrophic outcomes if it's not written properly. I remember, a long time ago, a Rails app that kept crashing. It was crashing literally because of a regex. I remember this clear as day, because it was the first time Stack Overflow had saved my life. Someone who was an absolute regex expert — that was just all they did all day — said, "I can look at that and say, what's happening here is called catastrophic backtracking." I was like, oh, my goodness. What the hell is that?

Then a project I'm working on, for example — we're looking for one-time codes to help the user. But a one-time code looks exactly like a zip code. You can only regex that terribly, so it's the context where this number sits that starts to bring out what it is. I think it's just another example of exactly what you've been saying, Sudipta.

There's a bit here where, if you go on the hounddog.ai website, you talk a lot about CWEs. Maybe could you just give some context — what is a CWE, and how is HoundDog staying on top of the latest of these? Maybe just speak a bit to that.

[0:25:02] AA: I'll take a crack at this, and then Sudipta, if you want to add anything, let me know. CWE is the Common Weakness Enumeration, maintained by MITRE. They're basically classes of vulnerabilities across every category you can think of. OWASP is another set of security categories, and in their documentation, they reference these CWEs as well. There's a bunch of them that reference sensitive data exposure, so we are focusing on the ones where sensitive data is stored in plain text in places where it shouldn't be, like logs, files, cookies, JWT tokens, local storage. Then there's one CWE that doesn't specifically mention third-party systems; it's basically a CWE relating to sensitive data exposed through sent data. It's like a big uber category — a catch-all for anything that's being sent outside of the application to places where it shouldn't be. For us, we interpret that as the third-party applications that shouldn't really be storing highly sensitive data. In terms of keeping up, we basically stay up to date with every version update of OWASP, or the CWEs, and that's how we stay up to date there.

[0:26:37] GV: Yeah. That was good context. I think a lot of developers may have heard of CVEs, but this is quite different, actually — it's guidance on what types of data, etc., constitute a problem, effectively, and keeping on top of those.

Let's maybe just look at the developer experience here — all of our listeners are developers. If they were to start wanting to integrate HoundDog into their flows, what does that look like? I mean, we've mentioned shifting left a lot here, which makes a lot of sense. Understandably, listeners might be a little bit fatigued by hearing shift-left, especially in the context of security, so I think it's great to contextualize that. How is this shifting security left? How does it look in a developer's workflow? How is that easy for them to integrate?

[0:27:26] AA: Yeah, yeah.
Again, Sudipta, feel free to chime in. Basically, the most common workflow right now is through continuous integration. The scanner is available as a Docker container, and you can run it as a CLI, so you can do what we call a point-in-time analysis, which gets you two types of output. One is what we call a sensitive data map, which maps out all the sensitive data flows in your code base. Sudipta built these amazing visualizations, where you can actually track the sensitive data flows through all the files and the data sinks where the data actually ends up. That's one type of analysis. Then the other is, obviously, the vulnerabilities, and they're managed by a set of vulnerability rules. An example of a rule is "sensitive data exposed in logs." That's one rule. We have a whole bunch of them that, like I said, check if the data is going to logs, files, third-party systems, and so forth.

Obviously, we want this to be a continuous detection process. Right now, our customers are using it as part of the CI workflow. Actually, before we launched our cloud platform, most of our customers had the CI workflow output the results directly into the security dashboards they were already using. GitHub has an advanced security dashboard. GitLab has something called the vulnerability report. They didn't even have to log into our own cloud platform to do anything. It's just a scanner that runs in their CI workflow and magically pushes these things into a security dashboard that consolidates all the issues from all the other code scanning tools they use.

In the future, we would love to shift it even more left and make it part of the IDE. But I think right now, it's a good middle ground where it's not bothering developers that much, because we also feel that in the IDE, they may not really have things figured out yet. They may just be testing things. We don't want to start flagging sensitive data issues there. But if you're merging to a main branch, that's definitely an issue, and we want to flag it and make sure it doesn't go all the way to production. Because that's what we're trying to stop — the PII leaks from going into production, because it's a really big nightmare to fix once it's already in production.

[0:29:53] GV: You hit the nail on the head there. Shifting left in this context is a balancing act. Because, yeah, if you're in the IDE, then, well, how many other tools looking at other things are also there? Then suddenly, the developer experience is horrible, because I'm just writing code and being told every five seconds that something's wrong, and that's not fun. I think the CI is a great place to have it right now, and then you can decide where to put that information as well. Because, as you called out, it's not just the developer that wants to know about that. It's always going to be other people in the org that want to know about it, and that's a great place to then message it out from.

Companies are going to have to be able to actually customize this to some degree as well. I mean, you have libraries that understand what PII looks like in different contexts. This must be huge — geographically, industry by industry. Maybe could you give some context on how broad that is right now in the libraries that are there, and then what's the process for, say, a developer to actually customize sensitive data types that are very specific to their case?
[0:31:04] AA: I'll just add, basically, to what Sudipta was saying: we actually initially had more data definitions. We had around 200. We intentionally narrowed it down to just 60, on purpose, to make sure that whatever regex definitions we have are going to yield real true positives. To Sudipta's point, it's upwards of 99% accuracy. Because we're not going to have too many combinations of what you could call social security, we're just going to narrow it down to the bare minimum that we are 100% sure is something a developer would use for a function that handles social security numbers. Our vision is that this AI workflow is going to basically remove the need to maintain additional regexes. Because, ultimately, our buyer is actually not the developers. Our buyer is the security team. Putting the onus on the security team to already know what functions, methods, and variables the developers are writing — that's defeating the whole purpose. If they already know what functions and methods are handling sensitive data, then they don't need us. That's where we're hoping that this AI thing is going to alleviate the pain of having to maintain regexes.

But there are use cases and edge cases. For example, we've talked to financial institutions where they intentionally obscure the names of the variables and the functions. Instead of calling something the bank account ID, or whatever, they may intentionally call that variable XYZ, to increase the level of security that they have within their practices. In those cases, we do give them the opportunity to define their own regexes and say, XYZ actually translates to bank account ID — the sketch below shows the idea — and then we can actively look for what they've defined in our system. But we're hoping that with AI, for the majority of the use cases, the security teams don't have to lift a finger. It's just AI magically discovering these things for them, and they don't have to maintain regexes themselves.

[0:33:27] GV: Yeah, that's a super interesting point there on the variable obfuscation, basically. It's a smart thing to do, but then, in this context, suddenly there's a double-edged problem, if you want to call it that. It's great to call that out.

If we look at — you talk about the security team and then, obviously, the buyer. I've certainly experienced that a lot of these products that are trying to address a problem in security get bought, often, basically for compliance. Compliance tends to mean that something has to be produced to say, we've produced this thing, so now we can tick the box that says we're in compliance. The shift-left bit is the fact that we can catch those things earlier, so that we get the tick at the end of the day. How is HoundDog helping to that end in terms of compliance artifacts — records of processing activities, for example, for GDPR? Is that one of the key outputs, per se, of the platform?

[0:34:32] AA: Yeah, 100%. Of the two use cases that I was describing, the security one — where we stop PII leaks at the source — that use case is actually not driven by compliance. That use case is really driven by security teams trying to avoid the nightmare.
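Picking up the customization point from a moment ago: a custom definition is essentially one more entry in the category-to-pattern map from the earlier sketch. The format here is made up for illustration and is not HoundDog's actual configuration.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    // Illustrative only: a team whose code deliberately obscures names can
    // register its own pattern so the scanner treats "xyz" as a bank account ID.
    public class CustomDefinitionSketch {
        public static void main(String[] args) {
            Map<String, Pattern> categories = new HashMap<>();
            // Built-in style definition, as in the earlier sketch.
            categories.put("BANK_ACCOUNT", Pattern.compile("(?i)bank_?acc(oun)?t.*"));
            // Customer-defined mapping for an intentionally obscured name.
            categories.put("BANK_ACCOUNT_ID", Pattern.compile("(?i)xyz(_?id)?"));

            String identifier = "xyz"; // obscured variable found in the code
            categories.forEach((category, pattern) -> {
                if (pattern.matcher(identifier).matches()) {
                    System.out.println(identifier + " -> " + category);
                }
            });
        }
    }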
Usually, the ones where a light bulb moment happens when you're talking to them are actually security teams who have experienced PII leaks already, and they know the amount of suffering they had to go through. We estimated that it takes at least 80 hours between having to do the code updates and — let's say something leaked into a log — having to review the access logs to see who actually accessed the logs, who actually saw this PII data. Then, depending on what kind of PII was leaked, they may need to do risk assessments, and they may need to notify customers if actual customer PII was leaked and then got compromised. It's a huge nightmare. That one, I would say, is actually not driven by compliance, because security audits don't have clear-cut criteria. We're going through a SOC 2 audit right now, and I'll be honest, it's not very clear-cut as to what you must and must not have. It's very lenient, let's say.

On the privacy side, GDPR is different. If you want to summarize GDPR in one sentence, it's all about: what data are you collecting? Who are you sharing it with? If a user wants to delete that data, or get access to it, can you give it to them? It's all about data, data, data, data. You really need to know what data this application is processing. Privacy teams are even further removed from developers than security teams. With security, at least you have an AppSec team. They're involved a little bit in the reviews and all that stuff. But privacy is even farther removed. They get involved every once in a while, when there's a huge feature and they do a design review. But they're not really embedded into the development process, and they're often surprised by things. They're often like, "Oh, man. We need to update these data maps. We need to do this." This is giving them an opportunity to rely on a source of truth that's continuously changing — because code is continuously changing — but they can have the peace of mind of knowing that, okay, even if it's changing, I'll know what data flows are being updated and changed, and I can easily update my data maps accordingly.

[0:37:17] GV: Yeah. I mean, talking about GDPR — I need to remind myself, even — GDPR was a European regulation, and it's worked its way into some areas of the US. For example, California has a pretty close version of it. I guess, is part of what you're able to provide data residency tracking? Is that one of the big pieces here, where you've got companies with different arms of the business, and they need to be able to track, like, okay, it's not a compliance problem if this PII leaks in this region — even though it's bad, it's just not going to cross any laws, effectively. But actually, if it was to leak via this other country that has GDPR, that's a problem. Is that something that HoundDog helps with as well?

[0:38:06] AA: Not exactly, I'll be completely honest. Because our source of truth is the code base, it's very hard to know where the data will actually end up. From the code base, you may have data flows that basically store data in a database. What you end up doing in production — in terms of replicating the database, or having replication happening in different regions and all that stuff — we can't see that. Because we're just looking at the source code, we can't see what kind of replication technology you're using in production.
What we are solving is this even harder problem, which is: hey, what data do we even collect to begin with, and what level of confidence do we have around the data flows and the data map that we already have? Because code is changing. Our prime customers sit at the intersection of two things: one, they're handling sensitive data, and two, they have an application that's continuously changing. In this day and age, most companies are in that bucket, because every company is becoming a technology company, every company is building applications, and no application stays constant. You have to update it. You have to make changes. You have to introduce features. You have to stay ahead of the innovation. With these changes, you're introducing risk, and that risk needs to be reflected somehow in the data map that they have to produce for GDPR.

[0:39:35] GV: Yeah. That makes a lot of sense. Actually, Sudipta, to come back to you in terms of developer experience, something I didn't touch on earlier is that the platform supports a lot of programming languages. You were quite deliberately giving generic examples of "function X, Y" — and at the same time, every language is very different in terms of how that looks. Could you maybe just — I mean, which languages would you say you have full support for today, and what were the challenges? Has it been a challenge to keep up across different languages as well?

[0:40:09] SM: Yeah, thanks for asking. We now have support for some of the major commercial languages. We started with Java. Then we have support for C#, in terms of programming languages. We also support SQL, and we also have OpenAPI. Also, we have GraphQL queries — we scan those, too. We also scan embedded SQL statements. Let's say you are writing a Java function, a Java class, and you have some embedded SQL query as a string literal. When I say literal, it means just a constant string. You can have a string, like a customer insert command, and then you write the command to insert the customers. The query inside that string is pure SQL, and it will also touch the sensitive data, like the names of the customers. Now we can track that exactly — that is what scanning embedded SQL means. The sketch below shows the shape of it.

Coming back to your point, we are working on Python and Kotlin. Those will be out pretty soon. I don't want to commit to a date, but yeah, it should be pretty soon. Semantically, for what we do across all the languages that we support, a function call is a function call. Thankfully, the semantics of a function call, or invocation, are the same across all the programming languages. If you call a function in JavaScript, that's exactly the same, in terms of semantics, as when you call it in C# or Java. There is no difference. Static analysis is a huge area, but the part of static analysis that we are using for our scanning capability does not depend on the language semantics that change. Obviously, some language semantics change across languages, as you mentioned — variable declarations, or array declarations, are not the same — but we are not really bothered by that. In other words, our scanner can detect what it needs without relying on the parts that differ.

[0:42:37] AA: Sudipta mentioned Java and C#. We also support JavaScript and TypeScript.

[0:42:42] GV: Yeah.
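For the embedded SQL case Sudipta described, the pattern is roughly this — a plain Java string literal that a scanner can parse as SQL and mine for sensitive columns. All names here are hypothetical.

    // Hypothetical example of embedded SQL inside Java. A scanner that parses
    // this literal as SQL can flag customer_name and bank_account_number as
    // sensitive columns and track where the bound values come from.
    public class CustomerDao {
        static final String CUSTOMER_INSERT_COMMAND =
            "INSERT INTO customers (customer_name, bank_account_number) "
            + "VALUES (?, ?)";
    }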
I mean, that's something I'm excited about. I'm mainly TypeScript.

[0:42:45] AA: Yeah. We also support TypeScript and JavaScript, as I mentioned. Whatever we are doing, it is the same for those. The way sensitive data is handled may look different, but it is actually the same, because invocations are exactly the same everywhere. Also, for understanding the different classes and types, the scanner deploys different parsers, and we also use something called Tree-sitter, which gives us a way to parse things easily, so the data model of the scanner remains almost the same. It does not change. Does that make sense to you?

[0:43:35] GV: Yeah. No, that makes a lot of sense. I think it was mentioned that there's a really nice dashboard — even though we talked about the CI side, there is a dashboard to identify data flows, etc. I'm curious, because I think this is where security platforms almost have two roles. One is literally saying, hey, this thing today is a problem. The other side is actually being able to educate through the platform — being able to say, this is a problem, but by the way, here's why this is a problem, and here's how you might want to approach this differently in the future. It might not be there today, but I'm curious, is that something you've thought about in terms of what gets presented back to the user — in this case, the developer user?

[0:44:21] AA: Yeah. I would just say, basically, that what we have today is that we highlight these vulnerabilities, but as part of onboarding customers, we also walk through the best practices. To be honest, when it comes to sensitive data, the best practices are almost already known. You either omit the data — if you don't need it, you don't need to send it to Sentry, or Datadog. You don't need the personally identifiable information to do your debugging and that stuff. In most cases, developers are doing things to simplify and enhance the debugging experience. Sometimes they overshare. It's just educating them on what's enough. We don't want to hinder your debugging experience. We want you to get all the info you need to debug, but you don't need extra information that is going to put the whole company at risk. You either omit it, mask it, or encrypt it. Basically, that's what it boils down to. Those are our remediation strategies. We present these things, and we've never had pushback from developers saying, "Oh, no. That doesn't make sense." It's usually, "Oh, no. I didn't know we were exposing an API token in logs." It's usually that surprise. I think it's going to take some time for developers to really embed a lot of these best practices in their development experience.

[0:45:53] SM: I will add to that. Actually, the CWE site also has recommendations about how to deal with a specific violation. In our presentation in the dashboard, we show that recommendation as well — we pull that information from CWE. And as Amjad was mentioning, we have these vulnerability rules. We have taken those recommendations from CWE and put them into our vulnerability rules, so they surface whenever users see a violation. Each rule has a do and a do-not, and they can see exactly how to fix it.
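As a flavor of what a do and do-not pair for a "sensitive data exposed in logs" rule can look like — an illustrative Java example following the omit/mask/encrypt guidance above, not HoundDog's actual rule text:

    import java.util.logging.Logger;

    public class MaskingSketch {
        private static final Logger logger = Logger.getLogger("app");

        // DO NOT: the raw value lands in the log, and in every system
        // that ingests logs downstream (Sentry, Datadog, SIEM, backups).
        static void logUnsafe(String ssn) {
            logger.info("created user with ssn=" + ssn);
        }

        // DO: omit the value, or mask it so the log line is still
        // useful for debugging without exposing the full identifier.
        static void logSafe(String ssn) {
            String masked = "***-**-" + ssn.substring(ssn.length() - 4);
            logger.info("created user with ssn=" + masked);
        }

        public static void main(String[] args) {
            logSafe("123-45-6789"); // logs ssn=***-**-6789
        }
    }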
Obviously, that can be extended to include code samples in the future, to show them exactly what is meant. That would be even more explanatory.

[0:46:41] GV: Yeah. I think that's really powerful — bringing through those definitions and explanations. That's a really fantastic resource that's there through the CWEs, and to bring it right in front of the developer, or whoever it is looking at it, I think that's super powerful. I worked on a different problem, but it's the same idea — attack surface management. It was bringing through CVEs and actually having to help the user understand them. CVEs can sound bad — "this is high priority" — but trying to contextualize that for the given customer or company was an interesting challenge, to help educate them. Because otherwise, everything can look urgent, or everything can look completely benign. I imagine you have some of the same things there as well.

[0:47:24] AA: Absolutely, yeah. Exactly. Basically, we give the security or engineering team the knobs to mute certain types of issues, or ignore some types of issues. Ultimately, what defines the severity of an issue for us is the sensitivity of the data. They can also change the sensitivity. They'll say, hey, social security is not highly sensitive for us — I'm just giving an example. They can make those changes to really tune it for their use case, basically.

[0:47:58] GV: Yeah. That's super powerful and helpful. We're coming to the end of the episode. We're recording here towards the end of May, and I believe it's just been announced that you've secured seed funding. That's super exciting. I'd love to hear just a little bit about what that's going to enable for HoundDog, and where do you see the next six to 18 months going?

[0:48:23] AA: Yeah. Thank you, thank you. We were basically operating in stealth for the last year or so. We finally emerged from stealth and announced our funding today. Basically, with that money — we already have a few customers, but we want to double down on our go-to-market. We're going to be participating in events that we wouldn't have been able to do otherwise when we were in stealth. We're sponsoring Black Hat. We're attending the OWASP AppSec conference, and we're going to be out there. Then on the product side, of course, Sudipta already alluded to automated, AI-driven fixes, which is the direction most SAST scanners are going in, and it makes sense that these fixes shouldn't be something that developers have to think about. It could be in the form of a pull request that just has the fix, and they just approve it. I think we're moving in that direction, in addition to supporting more languages and, basically, continuing to support our customers that way.

[0:49:23] GV: Awesome. I mean, does additional hiring come into this as well? You might not have had a chance to think about it so far. But we've got a lot of developer listeners out there, so is that something you're thinking about?

[0:49:34] AA: Actually, on the development side, we're good, I'll be honest. But we may be hiring on the sales side — a sales development representative — next year. I'm sorry to the developers out there. But it is just a seed round. Hopefully, next year, when we raise our Series A, we'll be actively hiring.

[0:49:51] GV: Yeah. No, fantastic. Don't bombard Amjad with your LinkedIns right now, but keep an eye on the company.
I think, as a developer, the best thing is going to be to check out the platform. I believe you're in GA now — that's part of the announcement. Where's the best place for developers to head to and get up and running?

[0:50:09] AA: Yeah. If you go to HoundDog.ai, we have a free version of the scanner that basically gives you that beautiful sensitive data map, and you can even plug in your own OpenAI key to get the power of AI to discover everything in your code base. That's completely free. All you have to do is just fill out a form. I know developers hate filling out forms, but from a CEO perspective, I just need to know who's trying out our product. You can use your personal email address and basically get access to the Docker container. We have extensive documentation publicly available on our website. Please, feel free to try our free scanner.

[0:50:46] GV: Awesome. Well, it has been fantastic to have both of you here today. I think you've given so much detail, both on the technical side from yourself, Sudipta, and Amjad, you've given a ton of insights, which I think helps developers and non-developers understand this space, and just pulls back the curtain a little bit to understand how this all comes together. Really appreciate that. Obviously, good luck with the funding now in place, and I hope we get to catch up again in the future.

[0:51:16] AA: It was an awesome conversation. Thank you so much, Gregor.

[0:51:18] SM: Thank you for having us.

[END]