EPISODE 1695

[INTRODUCTION]

[0:00:01] ANNOUNCER: A deepfake is a synthetic media technique that uses deep learning to create or manipulate video, audio, or images to present something that didn't actually occur. Deepfakes have gained attention in part due to their potential for misuse, such as creating forged videos for political manipulation, or spreading misinformation. Ryan Ofman is a Lead Engineer and Head of Science Communication at DeepMedia, which is a platform for AI-powered deepfake detection. He joins the show to talk about the state of deepfakes, their origin, and how to detect them. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:00:53] SF: Ryan, welcome to the show.

[0:00:54] RO: Hey, thank you so much for having me. Excited to be here.

[0:00:57] SF: Yeah, I'm excited to talk about deepfakes today. DeepMedia is a company that's focused on deepfake technology, both detection and creation for media purposes. What's the background on the actual company? When did it start, and what inspired the creation of DeepMedia?

[0:01:14] RO: Absolutely. DeepMedia has been around since 2017. I would say the catalyzing event that really started DeepMedia is when our founder, Rijul Gupta, saw his very first deepfake. You might have actually seen this one. It was Barack Obama on one side and Jordan Peele on the other. Jordan Peele was really just doing an incredible Barack Obama impression, and it used some very early AI generative models to make it seem like Barack Obama was saying all the things that Jordan Peele was saying. From that very moment, he knew that this was going to change the world. If we couldn't trust what we were seeing, then what could we trust? From that moment, he founded DeepMedia AI, which has worked on a variety of generative and detection products that are ultimately striving to make sure that this technology is used for good and to mitigate the cases in which it's going to be used for misinformation.

[0:02:08] SF: Yeah, it's really interesting. I think that was the deepfake that broke the Internet in the early days. That was very early, and I think it woke up a lot of people to what this could really mean, especially as these things started to get better. What is one of the most impressive deepfakes that you've seen since then?

[0:02:26] RO: That's a great question. A lot of times, deepfakes get really lumped in with a lot of the misinformation. Perhaps the most prominent example was a deepfake that went around of President Zelenskyy of Ukraine, announcing that he was surrendering the war effort against Russia. Now, this was a pretty low-quality fake at the time, but in terms of its reach and the speed with which it propagated, it's certainly among the most impressive in terms of impact. In terms of technology, it's oftentimes a lot more mundane examples. There's someone named Deepfake Tom Cruise, who makes a lot of videos on TikTok. What he does is very interesting. He uses a very highly trained model, specifically on his face and specifically on images of Tom Cruise, to put together a really, really convincing, compelling, and powerful fake that he almost entirely uses just for comedy and parody content. Nonetheless, to be able to see technology like that work with near flawless accuracy in real-time, it really opened our eyes to how scary and how powerful this technology is and is going to continue to get.
[0:03:31] SF: In terms of the history of deepfakes, there's, of course, the one that you mentioned with Barack Obama that I think really blew things up. Do things start earlier than that? Or did things really kick off in 2017?

[0:03:45] RO: That's another great question. The origin and the history of the deepfake is a very interesting case study into how concepts and ideas are brought up in the era of the Internet. The reason I say that is because the word deepfake was just the username of a Reddit user, who posted some of the first deepfake videos. He then proceeded to create the r/deepfakes subreddit. That single user decided that that portmanteau of deep learning and fake really defined the genre of what a deepfake was going to be moving forward. Way before Jordan Peele, around 2017, users on Reddit were already using very basic versions of these generation models to create deepfakes, mostly of celebrities, mostly in comedy contexts.

[0:04:30] SF: What are some of the different techniques, I guess, for creating a deepfake?

[0:04:34] RO: It's a very interesting question. It's changed a lot over the years. But I would say that there are three key technologies that form the basis of what we call modern deepfake generation. That is generative adversarial networks, GANs, diffusion models, and transformers. Now, while all of these work in a slightly different way, essentially, they're taking millions, if not tens of millions of samples of real human beings and slowly learning some of the characteristics and features of what we look like when we're talking. What do our face movements look like, what do our mouth movements look like. Over iterations of reinforcement learning, doing this cycle of learning, trying, guessing, getting things wrong and right, and then doing it a million more times, these models gain a very real and powerful insight into what human beings look like when they're moving. Something that is unique about the more modern approach to deepfake generation is the use of transformers, which deploy something called self-attention, which means that they can auto-correct themselves as they begin to learn how to generate this content. Even more so, they can start to identify the individual features and extract what human beings look like from all of these samples. While a model that might be trained on 10,000 samples might understand that a human being moves while he speaks, or that our eyes move dynamically as we speak, as well as our mouths and our throats, a more robustly trained model might even be able to pick up the nuances of how my eyebrows move when I speak, how my ears move when I speak, and how the greater surroundings of my body move as I'm beginning to engage in regular human conversation. It's really a robust reinforcement learning model, trained on millions and millions of samples of real human beings, and slowly building back from nothing what a human being looks like when they're speaking.

[0:06:31] SF: In terms of the reinforcement learning cycle, how does it know how to auto-correct? How does it know whether it's moving in a better direction, or a worse direction?

[0:06:40] RO: It's another very interesting question. The answer is it cheats. We give it thousands of samples of labeled data, so we know it's real, or we know it's fake. We provide that information to the model as it's learning. It plays a little bit of a game with itself, right? It takes a piece of information and it makes a guess.
At the beginning, it's basically just going to be guessing randomly. Real, fake, real, fake. But because we have the answers, we can then give it feedback. Actually, that was real. Actually, that was fake. From this process of guessing, and what we call adversarial training, it begins to understand what are those characteristics that are in all of the fake labeled data, and what are the characteristics that are in the real data that I'm observing?

[0:07:28] SF: Then, I guess, when it comes to trying to generate a compelling deepfake, you want to, essentially, create data that it knows fits the real data more closely than the fake data.

[0:07:41] RO: Absolutely. We might call that fine-tuning. When you've generated a powerful and robust model that's seen a lot of human beings, but now you're trying to cater it to your specific use case. Maybe I want to fool someone on a video call. Maybe I want to fool someone in real time, or maybe I have a little bit more time to do post-processing of whatever I create. These are the decisions that go into how you fine-tune some of these models to your specific purposes. That gets into a lot of the tricky gray area here, which is how do we stop people from fine-tuning and publishing these fine-tuned models that are used for really unethical purposes?

[0:08:19] SF: In terms of creating a deepfake, does it help if I look somewhat like the person that I'm trying to, essentially, pretend to be, versus looking drastically different? Would it be easier for me to create a deepfake of another white person with brown hair, versus turning myself into a woman of a different cultural background, or something like that?

[0:08:42] RO: Totally. Another really interesting question. The answer is yes, but maybe not for the reasons you might think. A lot of our early research in developing deepfake detection data sets, and the same is absolutely true for deepfake generation data sets, has been doing analysis of the racial, ethnic, age, gender, and emotional background of the participants in these data sets. What we found isn't horribly surprising, but it should be horribly upsetting, which is it's a lot of white men. For that reason, generating models that look like those people that they were trained off of is certainly easier. But it's less, I believe, because you would resemble them, and more so because a bulk of that training data resembles that person who you're trying to deepfake.

[0:09:26] SF: Right. There's a bias already in the data set. Then it's essentially better at recreating that bias than something else.

[0:09:33] RO: Absolutely. Something we've really tried to do is a technique of upsampling and downsampling our data, to try to provide a pretty equal amount of training data across races, ages, genders, things like that, to hopefully try to mitigate that bias, especially in our detection efforts.

[0:09:49] SF: In terms of using deepfakes for a malicious need, or something like that, like I want to pretend to be someone on a Zoom call, so I can trick somebody into doing something within that company that they shouldn't be doing. Does that need to be a video playback? Or could I actually do something that is essentially semi real-time, where I can be reactive to whatever the questions are and actually fake my presence in some way?
[0:10:14] RO: Well, I can tell you, there are some people who are trying to do a very real-time approach, but you've isolated really, perhaps, the most problematic use case, which are these semi real-time audio generators. Specifically, if your camera is off, it can get even easier, because you can generate that audio very much in real time. As for deepfake videos, you can absolutely have some somewhat real-time solutions, in which you can give active answers to these questions in the voice and the face of another person.

[0:10:45] SF: Then in terms of, we're talking about some scary use cases of deepfakes. I think that's a lot of the times where people's minds go when they hear the word deepfake. It's like, "Oh, someone's going to pretend to be me, or I'm going to get a call from my CEO and it's not really my CEO," or something like that. How do deepfakes potentially change the world for the better? What are the positive use cases in the technology?

[0:11:12] RO: Totally. There's a lot of really exciting ones that are absolutely going to change the world. Something that we've spent some time researching as a company, and was part of how we originally were founded as a company, is this idea of universal translation. Can we use AI generation models to make it look and sound like I'm speaking a language that I have no understanding of? Can we make news that goes on in different languages perfectly encapsulate that English translation, so that when I'm watching, I don't even need subtitles? Not only do I hear the language in English, but I can see their mouth movements and they match up with that English translation. This is an AI-based generation technique that is going to change the world, and it is going to be the future of universal translation. These are use cases that are really going to bring people together. It's going to break down language barriers. It's going to break down cultural barriers. It's going to use this technology to bring people together. These are absolutely cases that exist and cases that are going to continue getting better, until they really have an impact on the very way we interact with each other.

[0:12:18] SF: Can you talk a little bit about how that technology works? If I need to, essentially, be able to translate the things that we're talking about right now, as well as the video, into, I don't know, Chinese or something like that, then not only do I need to do the actual translation part of it, but I also need to, presumably, manipulate my lips and stuff like that, so that it looks like I'm actually speaking Chinese, versus being a bad dub.

[0:12:45] RO: Yeah, absolutely. There's a couple of ways we go about it. Firstly, there's a concept known as phoneme analysis, which is what do our mouths look like when we make certain sounds? That's very much at the center of what a lot of universal translation and dubbing efforts do. Firstly, we can break down the translated audio into its given sounds, and because our models have been trained on hundreds, if not thousands of hours of individuals speaking all of these different languages, we can map what individual phonemes, individual parts of words and individual sounds correspond to when it comes to mouth movement. At the end, what we have is a pretty accurate representation of what the mouth movements would genuinely look like for this translated audio. Then we can use one of our generation and dubbing models to make the corresponding mouth look like it's making those sounds.
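To make the phoneme-to-mouth-movement idea concrete, here is a minimal Python sketch of the mapping step Ryan describes. The phoneme inventory, the viseme (mouth-shape) table, and the alignment timings are hypothetical placeholders; DeepMedia's actual dubbing models are not public, so this only illustrates the general technique of turning time-aligned phonemes from translated audio into a per-frame mouth-shape track that a lip model could then render.

```python
# Minimal sketch: phonemes from translated audio -> per-frame mouth shapes (visemes).
# The phoneme list, viseme table, and timings below are illustrative placeholders,
# not DeepMedia's actual pipeline.

from dataclasses import dataclass

# Tiny, illustrative phoneme -> viseme (mouth shape) table.
PHONEME_TO_VISEME = {
    "AA": "open_jaw",      # as in "father"
    "IY": "wide_smile",    # as in "see"
    "UW": "rounded_lips",  # as in "blue"
    "M":  "closed_lips",   # bilabial closure
    "F":  "lip_to_teeth",  # labiodental
}

@dataclass
class AlignedPhoneme:
    phoneme: str
    start: float  # seconds into the translated audio
    end: float

def phonemes_to_viseme_track(aligned, fps=30):
    """Convert time-aligned phonemes into a per-frame viseme track.

    Each video frame gets the mouth shape of whichever phoneme is active at
    that frame's timestamp; a lip generation model would then render (or
    reinsert) a mouth matching that shape.
    """
    duration = max(p.end for p in aligned)
    n_frames = int(duration * fps) + 1
    track = ["neutral"] * n_frames
    for p in aligned:
        shape = PHONEME_TO_VISEME.get(p.phoneme, "neutral")
        for frame in range(int(p.start * fps), int(p.end * fps) + 1):
            if frame < n_frames:
                track[frame] = shape
    return track

# Usage sketch: a forced aligner (not shown) might emit this for "see you".
aligned = [AlignedPhoneme("IY", 0.00, 0.20), AlignedPhoneme("UW", 0.20, 0.45)]
print(phonemes_to_viseme_track(aligned)[:8])
```

In a real system the viseme track would drive a trained lip model rather than a lookup table, but the core idea, translated audio broken into sounds and each sound mapped to a mouth movement per frame, is the same.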
[0:13:41] SF: Is the output going to be, essentially, if I think about a video recording, then we have a bunch of images that are laid out in a stream that we're going to play at a certain frame rate. Then when you're mapping those phonemes to the mouth movements, are you only adjusting, essentially, the part of the image that's related to the mouth? Or are you regenerating the entire image from scratch and making the parts of it look like the original person, but essentially, it's a net new version of whatever that original version was?

[0:14:13] RO: That's really interesting. The answer is that people are trying both approaches. What's called lip reinsertion, Wav2Lip, and things like that seem to be the winner of this generation war when it comes to how you do translation. That is, we take just your mouth, we make it adjust to those different movements and phonemes, and then we reinsert it into the larger-scale video. Your eyes might look exactly the same before and after, and it's just changing your mouth movements. But that's not often the way that full-face deepfakes work. If I was speaking as another individual, it's going to regenerate my entire face. Multiple approaches are being tested. I really believe, especially for translation, this Wav2Lip-style lip reinsertion is going to be the future for translation.

[0:14:58] SF: Could you see it going even a step further? Depending on your background, people use hand gestures, or other things. Communication isn't purely the words coming out of my mouth, or even the movements of my mouth. It's also the other things I'm doing with my body. A lot of times, some of that can be cultural, or based on where you grew up and stuff like that. If we wanted to make it feel truly authentic, I would want to match, potentially, some of my body movements to whatever the language is as well.

[0:15:28] RO: You're on the right track. That's absolutely the next step. Certainly, it's impressive to make it look like someone is saying something. But to be able to make it look like their whole body is saying and corresponding with the words coming out of their mouth is the next step in bridging that gap towards true and accurate realism. That's certainly something that people are focused on. Things like hand movements, body movements, even things like the flow of blood to our face when we're speaking are very much the next things we're working to train AI models to accurately represent.

[0:16:00] SF: How good is this technology right now? Will we be able to take the conversation that we're having and, essentially, put out a version of it in Korean, and put out a version of it in Russian, and so forth?

[0:16:11] RO: Absolutely. Without getting into too much detail, there are already content creators who are doing this, and they're doing it quite well. It certainly isn't going to be perfect. If I told you ahead of time that something was AI dubbed, I'd start looking at your teeth, I'd start looking at your body movements, and I could probably figure it out. But if I didn't know that and I was just scrolling on my phone, it is already good enough that you would never know that was not my native language.

[0:16:39] SF: Yeah. I'm curious your thoughts on how this could impact something like the movie industry. I hear a lot of people, especially actors and stuff like that, talk about how movies have changed so much in terms of the movies that release in theaters.
We now no longer have, essentially, DVD rentals to help prop up smaller-budget dramas. A lot of those things were made in the 90s. Movies like Good Will Hunting don't really get created today, because generally, it's going to be more comic book movies, things that play well internationally, because they're not dialogue heavy. But if suddenly, you could essentially create a version of the movie that feels more authentic, that's more pleasant to watch, because you're not watching with a bad dub, or reading subtitles, or something like that, is this something that could potentially change movies back? I mean, it is a big question. But I'm curious to hear your thoughts on how this could impact something like that.

[0:17:33] RO: Yeah, it's a big question and it's an important one, because it's coming and it's going to be here. The answer is, I think we're already seeing a transition, with movies from all over the world gaining common notoriety. Whether it's Parasite, or movies coming from Korea, or whether it's movies that are made in Europe, in French, or any other language, we're already seeing things like that come to the United States and do very well commercially. It is only going to increase in its frequency and its effectiveness when we're able to do this kind of dubbing. Now, I think the reason that there's fair hesitance from Hollywood is that it has been such a difficult, skill-intensive task. To try to take away some of the mouth movements that an individual might make, or to try to reassign some of the individual hand gestures, or the gesticulation that an individual might give in a movie, is a hard task. We're rapidly seeing that AI is able to pretty accurately do that translation effort, without taking anything away from the individual acting performances. I absolutely believe that we are going to rapidly see a development of AI-dubbed movies and AI-dubbed content. I think it's going to do a lot of good for bringing unbelievable pictures and unbelievable talent from around the world to whole new audiences that may never have even considered it before.

[0:18:59] SF: Do you see that people also react in a negative way in those industries to this type of technology? Do they see that as a potential threat? Or is it more seen like, hey, this is just a net new way of creating art, similar to photography versus doing a painting, or something like that?

[0:19:17] RO: I think both of them exist at the same time. I think the excitement is carried along with a very reasonable and understandable ounce of fear. It's not just fear that people's jobs are going to be lost, because there's always going to be the need for people who are using these AI tools to work on these general projects. I think it's more so this idea that this AI technology is coming, it's new, and it's going to require a lot of adaptation. Perhaps it's going to make it more difficult to be an actor, or to be a consumer. But I think there's a lot of optimism that is the other side of that coin, which is it's going to make things like movies that are not from the US so much more accessible and give them a whole new platform to succeed. It's a difficult balance, where there's a healthy amount of fear for this technology and the impact it might have on the film and television industry, along with the excitement for the way that it's going to revolutionize the way people work and the way in which these kinds of content can be distributed in a global environment.
[0:20:24] SF: Where does the training data for all this come from? You're talking about training on fairly massive amounts of data, especially when we get into this AI-based dubbing, then presumably, you need data from lots of different languages as well. Where does that all come from?

[0:20:42] RO: It's a great question. The answer is a wide range of different sources. Largely, open-source models, some that we've collected and generated in-house, through paying individuals for their time and their likeness. But a lot of them come from open-source data sets that are collected from open-source, publicly accessible movies, films, audiobooks. We try our very best at DeepMedia, particularly, to be very mindful of the individual likenesses that we are using when generating our training data. Fortunately, we are not the only ones in the space, and there is a lot of publicly available and commercially licensable data that exists and can be used for this kind of training.

[0:21:22] SF: Can you walk me through what that training pipeline looks like? Are you using multiple models at different parts, essentially, to solve for different problems? Can you go into a little bit of detail about how all that stuff works?

[0:21:35] RO: Absolutely. First and foremost is a very robust and powerful pre-processing step. When we're taking, say, a deepfake video, we break it up into every individual frame and we break it up into all of the different faces that are going to be present within that individual video. This allows us to get a really clean and concise look at the faces and the individuals present. This then allows our detectors to have a really, really good understanding of what they should be looking for when they're looking to detect a deepfake. We hone in on those individual faces, we take hundreds of frames for every video, such that we can understand specifically the facial movements, the dynamic change in faces over time, and the individual audios that we can extract, clean and isolate from digital content. This allows us to really robustly and really cleanly train our models on data that has been isolated, selected and finely tuned for our detectors to understand it. Then the other side of that coin is very much, what are the detectors we use to actually detect deepfakes? The answer is that we are constantly testing all of the newest state-of-the-art detectors to try to find ones that are good at different techniques. We, as well as researchers in the field, have found that there is no silver bullet detector. There is no single deepfake detector that is amazing at a wide range of deepfakes. There are detectors that are really good at honing in on individual human faces. There are detectors that are really good at isolating the difference between a foreground and a background, and so we are constantly testing different architectures and how we can use them in combination to come to very accurate and robust measures of whether or not something has been AI manipulated.

[0:23:22] SF: With these detectors, are you essentially creating some sort of aggregate metric, where above some threshold it's most likely fake, and below some threshold it's good, and maybe there's a gray area in there, similar to how some fraud detection services work, or even spam detection? Obviously, this is more complicated.

[0:23:43] RO: Very much so.
It's a particularly tricky issue when it comes to deepfake detection, because the last thing that we want to be doing as a company is telling you that something is fake when it's real. Not only does that degrade trust in the platform, but it degrades trust in the very content that we're looking at. It's crucially important that we are able to detect the fakes, so that we can ensure that we can trust the real things that we're seeing. For that reason, it is very much a system of isolating thresholds using popular sources of deepfakes, deepfakes from Instagram, deepfakes from Facebook and social media, and using those to create thresholds that say, well, there are significant signs of deepfake manipulation in this content, there are some signs of deepfake manipulation in this content, there are minimal signs of deepfake manipulation in this content, or there are no signs of deepfake manipulation in this content. As well as detecting fakes, we are trying to inform the user of the likelihoods and evidence that have led us to the conclusion that something might have been AI manipulated.

[0:24:49] SF: There's some explanation, essentially, attached to the decision that this is fake?

[0:24:53] RO: We fundamentally believe that, especially at this point, having humans integrated into the detection process is fundamentally important and crucial to make sure that these detectors, while amazing pieces of technology, are being used to the best of their abilities and are being used appropriately.

[0:25:11] SF: Who are the customers in this? Who has such a need at this moment to do regular deepfake detection?

[0:25:19] RO: There's a couple key areas of customers that we interface with and who are generally interested in this technology. One is people who are trying to prevent fraud, the banking industry, the security industry, who are trying to prevent people from calling their bank with a deepfaked voice and getting information that they shouldn't have access to. The second and third, which I would say are the biggest players in the space right now, are the big social media companies, who are working to deal with the fact that deepfakes are being proliferated at unbelievable rates on their platforms. I can assure you that these companies want you to trust what you can see on their platforms, and they have very much a vested interest in making sure that these platforms remain safe and trustworthy. They have partnered with us, as well as with general deepfake detection strategies, to try to keep these social media platforms accountable and safe. It's certainly in both of our best interests to do so. Then the last interface that we have with customers is with the United States government, which is very interested in keeping information and telecommunications safe and secure. The US government is someone we've partnered with very extensively to work on building and collecting these robust detection solutions, to ensure that we can keep the transfer of information, both abroad and domestic, safe and secure. I can tell you, as we've gone into a global election year with dozens of elections taking place across the globe, every month we've seen the need for accurate and scalable deepfake detection skyrocket.

[0:26:56] SF: Are there certain types of deepfakes that are easier, I guess, to detect than others? If it's audio only, does that make it easier than if it's video and audio, or is there no difference?

[0:27:08] RO: It depends on what detectors you're using and it depends on the quality of the video or audio.
Even more so, I think, than it depends on the modality. Lower-quality audios, or audios that don't have a lot of background noise, where we can really hone in on the voices, are particularly easy. Though, we do have a pretty powerful voice isolation system. Videos in which there's only one or two people and a static background are a lot easier to detect than a video in which there's 10, 20, 30 people. These are the reasons we've built individual face isolation solutions, which can be used to isolate individual faces in a video to overcome some of the difficulty in dealing with content in which there are many participants.

[0:27:49] SF: Then what about, and this may be something that you're focused on, but one of the challenges, especially when I think about some of the things that happened around fake news several years ago, it's not just the fact that there's fake news, but it's also just polluting the environment in which people are consuming news, so that real news gets hidden. Is there potentially a risk of that when it comes to deepfakes? It's not so much that maybe we get really good at actually detecting them, but we can auto-generate so much of it that the garbage hides the real content?

[0:28:22] RO: Absolutely. You've isolated, perhaps, the most dangerous characteristic of a deepfake, and why we feel it's so crucial to implement detection solutions, as well as policy from the top down, to protect against the propagation of these deepfakes. It almost doesn't matter if there's a hundred deepfakes, or a hundred thousand deepfakes. The fact that any piece of content you watch could be deepfaked, and there aren't solutions in place to tell you that, degrades trust in all content. Absolutely, the more deepfakes there are, it's almost not the problem of the deepfakes themselves, but the impact it has on the non-deepfaked content, the trust that it degrades, and the ability of the general platforms to share real and genuine news that needs to get out to people. It is very much a problem, and it is only growing as the spread of deepfakes continues proliferating, and we've seen it pretty dramatically every year since their invention.

[0:29:20] SF: How accessible is the technology? Obviously, people at DeepMedia have access to this technology, but if I'm a bad actor, is this something, even besides the technology that you have, essentially, other things that maybe are available, can I get my hands on that and start generating this stuff in a compelling enough way that I can, essentially, maybe socially engineer my way into a system?

[0:29:42] RO: Without question. The answer is they are readily available and easy to use and download. I'll tell you, perhaps, a bit of a troubling story. It's something that has really stayed with me and haunts me in a certain way. I'm from Los Angeles and I grew up going to high school right around that area. Over the last year, a couple months ago, a very, very troubling incident went down at a local Beverly Hills middle school. Basically, one of the students, perhaps working with some of the other students, came across one of these deepfake generation models for free, totally open source, likely on GitHub. They used it to generate nude face-swapped photos of other individuals in their class. Now, not only is that a horrible violation of privacy, but it begs the question, how did a middle schooler gain access not only to the base technology, but to the highly specifically trained models that are used to create this deepfake pornography?
It so emphatically highlights how easy some of this technology is, and how accessible it is to people even without a hugely robust understanding of the tech itself. It even more so highlights the need for regulation and protection that stops individuals from being able to so easily access this dangerous technology.

[0:31:00] SF: What's the current status on regulating access to some of this stuff? I feel like, in the world of AI, the EU passed an AI act probably about a month ago or so now, but the US typically runs a little bit behind the rest of Europe, certainly, when it comes to putting some of this stuff in place.

[0:31:19] RO: It's hard, because not only is the deepfake landscape changing so robustly, but so too must the policy surrounding it. I can tell you that we at DeepMedia are working very hard with legislators, senators, and people in the Congress of the United States to try to pass common sense regulation and policy surrounding this. I can tell you that individual subcommittees within the Senate are working on and have already proposed policies that will at least do something, hopefully, to mitigate the spread of harmful deepfake misinformation and deepfake pornography. Though, it's a really, really tough system. Because the technology is so new and because there is so much that is rapidly changing, passing policy is a really difficult challenge. But I can tell you, we at DeepMedia are working tirelessly with our senators to try to come up with common sense legislation that is adaptable, scalable, and really tries to strike at the root of the problem with deepfakes.

[0:32:17] SF: What would that mean? Let's say that a regulation passed tomorrow, how would that essentially get enacted in a way that is controlling access to some of this technology, so that someone in middle school couldn't just download it and start creating deepfakes that really humiliate people?

[0:32:35] RO: There's a number of solutions that are crucially important for addressing this issue head on. First and foremost is repercussions for actors who are using this technology to defame, to defraud, and to violate individual privacies. There is simply no unified regulation right now that properly addresses repercussions for individuals who use this technology. Second, we need to do a better job of considering how we open source our technology. It's one thing to have a deepfake architecture, or a generative AI architecture, available on GitHub. It is a very different thing to have a pre-trained model, which has been trained on thousands of pieces of illegally collected data, accessible on GitHub for free as well. It's about enacting effective regulations for repercussions, for creating and propagating this kind of deepfake information. It's about ensuring that we are adequately protecting individuals from the open sourcing of a lot of these models that have been trained for very unethical purposes. Lastly, it is about using detection and alternative security solutions to make sure that when these deepfakes are propagated, they are easily detected and can easily be deemed to be fake. These are things like adding watermarks, doing cryptographic hashing to make sure that if something has been faked and it has already been uploaded, we can isolate it and find the individual videos that might have originated the fakes.
It's these things, as well as regulation and deepfake detection, that will ultimately, only together, work to stop the spread of deepfake-based misinformation, and keep people safe from the abuses of this very powerful technology.

[0:34:23] SF: Yeah. I guess, if something like that was put in place, it wouldn't necessarily fully stop bad actors from having access to some of this stuff, but it would make it harder for just a random person who can search GitHub to, essentially, pull down something and in 10 minutes have some compelling deepfake put online and maybe hurt somebody.

[0:34:41] RO: Exactly. And it's going to make people think a lot harder about what they're doing before they're doing it. A lot of these actors are not entirely malicious. They're trying to generate comedy and parody, and then there's a lot of people who are using this technology to very, very concretely violate individuals' privacies. We want to make those people have to second-guess their work, by understanding the repercussions if they're caught for their actions, and by knowing that there is a robust and expansive defense strategy in place for detecting these deepfakes and finding the people who propagated them.

[0:35:18] SF: In terms of detecting the deepfakes, is that something where you're constantly trying to play catch-up to how some of the stuff gets created? Similar to, I don't know, detection of performance-enhancing drugs, or even the old days of virus detection in the late 90s and early 2000s, where a lot of times you got a virus, and then Norton would go and update and roll out the change, but then there would be another, essentially, version of that that would come along a week later.

[0:35:46] RO: I actually think the virus analogy is a very good one. Things are rapidly changing in the deepfake space. Our ability to build accurate and robust detectors has come from our willingness to participate in this cat and mouse game. To understand that there are new generator types that are coming out all the time. Within a week or two, we are generating and testing data on these new generators. Your ability to detect deepfakes today matters almost exclusively based on how well you're going to be able to detect them tomorrow. And so, we are constantly innovating in creating new detection models based on the newest generation models, and using generation to guide our detection approach has been what's allowed us to stay on top of the rapidly changing space that is deepfake generation.

[0:36:34] SF: The idea there is something comes out that's new, so you basically make the assumption that, "Hey, if I wanted to create deepfakes to do something bad, basically, here's some ways that I might leverage this new technology to be able to do that." Then, you've got to work backwards from that in terms of how you would, essentially, detect someone using that technology in that way?

[0:36:55] RO: Exactly. Then, first, we try to do that, to make sure that we can detect it generally. Then, we actually will generate samples using non-real faces and data that can't be abused in any way, and put them in our training sets, so that in the future, not only can we detect this kind of deepfake technique generally, but we have that technique specifically in our training data for our detectors. It's a two-step approach that ultimately lets us really, really robustly and rapidly adapt to the ever-changing landscape that is deepfake generation.
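As a rough illustration of that generation-guided detection loop, here is a minimal Python sketch. The names (new_generator, synthetic_faces, consented_real_clips, detector) are hypothetical placeholders rather than DeepMedia's code; the sketch only shows the idea of producing labeled fakes from a newly released generator, using non-real source faces, and folding them into a detector's training set alongside real samples.

```python
# Minimal sketch of "generation guides detection": when a new generator appears,
# create labeled fake samples from non-real (synthetic) faces and mix them into
# the detector's training data. Interfaces here are hypothetical placeholders.

import random

def generate_synthetic_fakes(new_generator, synthetic_faces, n=1000):
    """Produce labeled fake samples from a newly released generator,
    using only non-real source faces so no one's likeness is abused."""
    samples = []
    for _ in range(n):
        face = random.choice(synthetic_faces)
        samples.append({
            "media": new_generator(face),
            "label": "fake",
            "generator": getattr(new_generator, "__name__", "unknown"),
        })
    return samples

def update_detector_training_set(training_set, real_samples, new_fakes):
    """Mix newly generated fakes with real samples so the detector learns
    the new generator's artifacts without drifting on real data."""
    training_set.extend({"media": m, "label": "real"} for m in real_samples)
    training_set.extend(new_fakes)
    random.shuffle(training_set)
    return training_set

# Usage sketch (placeholders, not real objects):
# new_fakes = generate_synthetic_fakes(latest_diffusion_model, synthetic_faces)
# training_set = update_detector_training_set(training_set, consented_real_clips, new_fakes)
# detector.fit(training_set)  # retrain or fine-tune the detection model
```

The design point is the two steps Ryan describes: first confirm the new technique can be detected generally, then bake samples of that specific technique into the detector's training data so future versions of it are caught quickly.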
[0:37:30] SF: When it comes to creating a deepfake, is it easier for me, let's say, I'll make myself the guilty party, to make myself potentially look like someone where, presumably, there's a lot of data about that person? Like, if it's Tom Cruise, there's going to be millions of pictures of Tom Cruise, versus trying to make myself look like someone that maybe doesn't have a presence online, or has a very limited presence online?

[0:37:54] RO: It's a double-edged sword there. Because yes, it is absolutely easier to generate a deepfake of someone who has a large presence online, say, President Biden. But it's also a lot easier to detect individuals who have a large presence online, because we can include them in our training sets. Without even trying, I can assure you that we have prominent political leaders in our training set for our deepfake detectors. Certainly, it will be easier to generate deepfakes of those individuals. But in our case, it might also be easier to detect them.

[0:38:27] SF: Then in terms of, what about the notion of a one-shot deepfake? How do you cut down on the amount of data needed to actually create some of these deepfakes?

[0:38:37] RO: Yeah. It's another really interesting question. The concept of a one-shot deepfake certainly does not exist in an effective way as of yet. But things like the Microsoft VASA model, or things like text-to-video solutions, like Sora, promise to implement these potentially more one-shot training systems. The way in which you generate those is by building really, really robust generation models that have already seen hundreds, if not thousands of deepfakes, and then using things like Wav2Lip-style facial translation to directly swap my face onto another face, rather than slowly and meticulously building up a deepfake model that can alter my face. Things like face swap models are very powerful at creating one-shot solutions. Because of that, we've had to do a lot of work to be able to detect one-shot solutions, especially because these are particularly problematic for things like Zoom calls, and for things like deepfake fraud.

[0:39:38] SF: Is there a point where the deepfake technology just gets so good that you really can't tell the difference between something that's generated through the deepfake versus something that is real?

[0:39:50] RO: Yeah. I think we very well might reach a place in which that is the case, which is why we had to make the transition to AI-based deepfake detection. Because though I can envision a world in which our deepfake generators get so good that they're nearly indistinguishable from a human being, I am already seeing a world in which our deepfake detectors, using these same generative AI principles, are getting so good that they can distinguish it even if our human eyes can't. That's why it's so important that we are utilizing the very same tools that are used to generate these deepfakes to detect them as well. Though we might not be able to tell the difference, AI can speak the language of AI. Over the course of millions of different samples, our detectors can detect things that even the human eye can miss.

[0:40:42] SF: Yeah. I guess, with the generators, there's going to be some kind of pattern, or fingerprint, on it that only, essentially, a detector using the same technology would be able to pick up on, versus a person, we're seeing something probably at too high a level to really be able to detect that pattern that's there.

[0:41:01] RO: Yeah.
We can learn from our detectors. We always try to say that with our detection solutions and our detection platforms, we're trying to detect deepfakes, but we're also trying to teach the user the individual skills that are necessary to detect a deepfake. I can give you a couple of examples. When we first built our deepfake image detector, to try to take images from Midjourney, Stable Diffusion, Google's Imagen, Adobe's Firefly, and be able to detect them accurately and robustly, we started to notice some things. The skin in deepfakes to this day, specifically in deepfake images, is very waxy and unmoving. The difference between the foreground and the background can be very dramatic and very jarring. Light doesn't quite propagate through the entire space in a deepfake-generated image like it does in the real world. Building detailed and robust heatmaps that tell us what our detector is seeing, what is causing it to indicate that something is a fake, allows the humans to stay in the loop and to communicate with the deepfake detectors to ultimately come to the best conclusions they can about the nature of content.

[0:42:09] SF: You're talking about individuals being able to figure out these deepfakes. What are some ways that individuals can protect themselves from being potentially victims of deepfakes?

[0:42:19] RO: First and foremost, I always like to say, I grew up in the early 2000s, the age of the Internet. My parents always told me, "Don't believe everything you read on the Internet." That was pretty good advice. I think our next generation is going to have to grow up with a parallel set of advice, which is don't believe everything you see, or you hear, on the Internet. A healthy level of skepticism is the first layer of protection when you're trying to identify deepfakes yourself. Second is to crucially consider the individual who's trying to communicate with you, whether that is on social media, or on a robo scam call, and to think about, is there a human being behind this? Is there a reason that I might be manipulated here? To consider that as you're considering the piece of content that you're interfacing with. Third is to learn the signs. If something feels off, if the hands don't seem accurate, if the facial movements seem just a little bit too blurry, these are common indicators that something might have been manipulated. It's a combination of a healthy level of skepticism, and a little bit more attention to why someone might be trying to interface with me the way they are. Why might I be getting an email from an address, or a call from a number, I've never seen before? Why might I be getting a spam call? Why might I be on a conference call with my CEO, who's saying things I can't imagine he would actually want to be saying? And why might I understand and see things in this image, audio, or video that don't quite seem right to me? Understanding, thinking through, and being critical of these common human reactions are the best way to keep yourself safe.

[0:43:59] SF: Do you think that businesses will need to start to change, or put policies in place that help protect themselves as well? A lot of times when it comes to data breaches and other hacks that happen to companies, it's a human that is the weak point, like a customer support person gets tricked into giving someone access to an account that they shouldn't have.
There's a lot of social engineering that goes on, but social engineering becomes even more compelling when suddenly, you can sound like an individual, or something like that, within the company.

[0:44:28] RO: Absolutely. I can tell you that there are already policies that are being put in place by both security and non-security companies to try to mitigate some of these concerns. Things like facial analysis, things like new levels of security that you need to pass, beyond voice and fingerprint, to be able to get your sensitive information. Things like personal questions that are being added to interview and security procedures. There are already protocols that are being put in place by individual companies to work to mitigate some of the harms and potential dangers of these deepfake manipulations. It's crucially important that we work with the individual humans who are now in charge of determining the realness of something, in a world where what is real is rapidly changing.

[0:45:16] SF: Yeah, it's like in Harry Potter. For Polyjuice Potion detection, we need to introduce some personal question that only that person would actually know the answer to.

[0:45:24] RO: Exactly. They have the right idea.

[0:45:28] SF: Well, as we start to wrap up, is there anything else you'd like to share?

[0:45:31] RO: There are interesting questions that get at me from both an ethical and an interest perspective when it comes to deepfakes. Questions like, what is an individual likeness, and how can we take a likeness and propagate it into the future? Some prominent examples that we're seeing are the use of AI-generated artists in songs. I mean, we just saw a song by Drake, which used an AI-generated version of Tupac and Snoop Dogg. Is that a violation of privacy, or is that a way to honor the identity and historical legacy of individual artists? There's a particularly prominent scandal where a deepfake company had offered Bruce Willis a pretty expansive contract to give his likeness in perpetuity, such that he might be deepfaked into the future and his likeness could continue to be used in content. Now, that contract never went through, but it raises a very interesting question of what privacy and what license do we give to our own likenesses. How do we protect individuals from having their likeness repurposed in ways that they might have never imagined?

[0:46:36] SF: Yeah. I think there's a lot of difficult questions that go beyond even just, hey, people are using this technology to socially engineer a company, or an individual, to get into the business, to these hard ethical questions of, what does it mean for me to own my personal identity? Can I license that to someone? If someone creates a likeness of a celebrity that's passed away, what does that mean? Who gets paid? All these types of things are really, really difficult. The technology is moving so fast, I don't think people have even thought through these questions, let alone -

[0:47:10] RO: The answers.

[0:47:10] SF: - what it means and the answers. Yeah.

[0:47:12] RO: Another thing I often think about in that regard is that so much of the United States and the global legal system is based upon observation. Some of the best evidence you can give someone is a video of someone committing the crime. In a world in which we can't trust video and we can't trust audio, how do we work within our existing legal frameworks to ensure that deepfakes are not being used to deceive?
The answer is deepfake detection, deepfake labeling, and robust policies surrounding the use of AI-generated tools.

[0:47:45] SF: Have there been any example cases in courts that you know of, where some sort of deepfake video was used and they actually got caught?

[0:47:54] RO: Not that I can speak of right now, but I can tell you that we will be paying attention to the news over the coming months, and hopefully, there will be some information coming out about cases in which, at least, there have been people suggesting that the videos might have been deepfaked.

[0:48:08] SF: Yeah, that's a scary thought.

[0:48:09] RO: It's a scary thought.

[0:48:10] SF: Yeah. Well, Ryan, this was really fascinating. I think it's a really, really interesting area that I've certainly started to dig into and want to learn more about, so I really appreciate you joining us and sharing some of the stuff that's happening at DeepMedia. I also really appreciate that you talked about some of the pros of this technology, where we could essentially see positive aspects of this. Because with most technologies, there's positive, but there are always, essentially, people who are going to abuse the technology, too. It doesn't necessarily mean that the technology itself is bad.

[0:48:42] RO: Generative AI is going to change the world. It's going to make communication easier and it's going to bring people together, but it is fundamentally imperative that we use things like deepfake detection solutions and use the technology to safeguard against those potential abuses. But it's going to change the world. The world that we're moving into is an exciting, though a bit of a scary, place.

[0:49:04] SF: Yeah, awesome. Well, thanks again, Ryan, and cheers.

[0:49:08] RO: Thanks so much, Sean. It's been awesome.

[END]