EPISODE 1613 [INTRODUCTION] [0:00:00] ANNOUNCER: Hugging Face was founded in 2016 and has grown to become one of the most prominent ML platforms. It's commonly used to develop and disseminate state-of-the-art ML models and is a central hub for researchers and developers.  Sayak Paul is a machine learning engineer at Hugging Face and a Google Developer Expert. He joins the show today to talk about how he entered the ML field, diffusion model training, the Transformer-based architecture and more.  This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.  [INTERVIEW] [0:00:46] SF: Sayak, welcome to the show. [0:00:48] SP: Hey, Sean. Thanks for having me. Really great to be here. [0:00:51] SF: Yeah. Thanks so much for being here. I'm excited to talk to you. To start, how did you get into ML engineering?  [0:00:59] SP: Well, I like to work at the intersection of engineering and research. Let me kind of rephrase the question into how did I get into ML in the first place. Well, as far as I can remember, it was late 2015. I was still in my undergrad and I had to choose an elective between pattern recognition and machine learning, and microprocessors. And I ended up choosing pattern recognition and machine learning because I had attended a couple of sessions that were given at my university by some of my university seniors. And the subject really appealed to me and I was very intrigued to find out more about it. Because it was interesting to me that we can design systems that can beat the Turing Test. And from a very nascent stage, I have been into computer science. I've been a computer science nerd. I've been programming since I was probably 12.  The overall idea of designing systems that can beat the Turing Test was super-duper appealing to me. And from there on, I started studying the subject. And the more I studied, the more interested I became. That's how I got into ML. And from there on, I have just not looked back. [0:02:14] SF: And then how did you end up at Hugging Face?  [0:02:17] SP: Yeah, that was quite a fast-forward. But sure. Why not. 2017 is the year when I completed my undergrad. And after completing my undergrad, I actually did not - I couldn't. Not did not. I couldn't take an ML-heavy job for whatever reason. It was a regular software engineering job. But I was kind of adamant about making my career out of machine learning because I was still very interested in pursuing the field wholeheartedly.  From there on, it became a challenge for me to get hired for jobs that actually required ML expertise and not regular software engineering expertise. I started contributing to open-source projects that required a bit of software engineering and a bit of ML expertise.  And fast-forward to 2021, I was at this Australian startup called Carted. And there I was using this library called Transformers, which is primarily maintained at Hugging Face. And I really liked working with the library. And at that point in time, the library was expanding its efforts for computer vision-related models. Because at that point in time, the library was solely known for its natural language processing capabilities. But not so much for computer vision.  And once I found that out, and since I was also kind of interested in the computer vision domain, it felt like a very natural thing for me to be able to contribute to the library. 
And since Transformers is very much a community-driven library and they welcome open-source contributions, I immediately fell for that and I started contributing to the library.  And after making quite a few contributions, it felt like, hey, what if I could do this full-time? I reached out to one of the folks on the team and then I had to take the regular path of interviews and so on. And here I am. [0:04:13] SF: Fantastic. Yeah. I imagine things have changed quite a bit since you first graduated, when you were having a hard time finding ML-related jobs. It's probably a little bit different now with the explosion in AI. People are desperately trying to find anybody with any kind of machine learning experience.  You mentioned open source there. And I know you have a long history as an open-source and community contributor. You're part of Google's GDE program. You write and create a lot of educational content. I highly recommend checking out Sayak's blog for anyone listening who is interested in our conversation. But why do all this extra work? Why are open source and community something that's important for you to do on top of your presumably full schedule as a full-time engineer?  [0:05:01] SP: That's an interesting question. At my current job, I get paid for working on open source full-time. But that wasn't the case until I joined Hugging Face. Back at Carted or at my previous organizations, I had to sort of take time to voluntarily contribute to open source. And there are a couple of reasons why. I like to think of myself as an Internet-taught machine learning practitioner. And I have referred to other people's projects, other people's open-source contributions to hone my skills. I felt like it's kind of a soft responsibility for me to also try to give back to the community. Plus, also, working on challenging projects and sharing them out in the open gives you a certain sense of confidence. Because, one, you get feedback from relevant and accomplished folks. Second, it helps others. Potentially, it helps others. Third, it acts as a sort of self-reference. Maybe in the future you would like to refer to your own work to quickly brush up on some basics or just to keep track of how much progress you have made over the past few years.  That's why open source is very important to me. Plus, you get a lot of exposure when you contribute to really popular open-source projects. Because that way your pull requests get reviewed by really accomplished folks. And that is probably a very organic way of being able to work with some of the most exceptional engineering minds across the globe. That's why I think it's very important for me. [0:06:41] SF: That's a really good attitude to have as a young engineer: being willing to put yourself out there and get that feedback. And I think that's certainly a way to get better, if you have your code in public and it's being reviewed by some of the best engineers in the world. Inevitably, if you're open to receiving that kind of feedback, you're going to get better. And that's how you're going to end up hopefully reaching the same level as some of those people. [0:07:07] SP: Yeah. Plus, I want to actually add on top of what you just said. And what you just said is actually spot-on. That's exactly where I'm at in terms of mental frequency. I also want to add, you also develop quite a few skills. Because if you are into machine learning, you probably haven't focused on software engineering that much. 
But when you are contributing to a popular ML framework, maybe such as TensorFlow, or JAX, or PyTorch, or whatever, you actually need to also focus on the test side of things. You also need to focus on the integration side of things. You also need to focus on code readability, code cleanliness and stuff like that.  Overall, I think all of these things collectively make you not just better at your ML practice. They also help you become a better engineer. Overall, I treat this as a major gain for your own self rather than anyone else.  [0:08:04] SF: Yeah. Absolutely. I mean, it's setting a high bar for yourself that you might not have when you're doing just side projects where, presumably, you're kind of the only one looking at the code base. And it's also teaching you sort of core skills that are valuable when you go and actually work as an engineer in a workplace. You have to write tests. You have to write clean code if people are going to review it and so forth. It gets you used to that sort of pattern of behavior as well. And you're going to ultimately produce a better product and be more ready for the workplace once you get there, even if you're sort of early in your career. And then what helped you the most with sort of getting started and contributing to these communities?  [0:08:45] SP: Well, it was definitely a bit intimidating for me. And I think when you are starting, things will feel a bit intimidating. But yeah, I'll cover that aspect in a moment. I was fortunate enough to have discovered some of the most amazing massive open online courses, such as deeplearning.ai's deep learning specialization. Then the fast.ai deep learning course and so on, where they take a very hands-on approach, where they show you code and they try to walk you through a very non-trivial piece of code line by line. And they also tell you a lot about the importance of communities and how software frameworks for machine learning have evolved over the past few years. That kind of motivated me to start thinking about how I can make myself useful.  For starters, I decided to work on some tutorials that I felt didn't exist back then. I would challenge myself to work on tutorials that are technically sort of demanding. I would explain the concept and then try to implement that concept in my own way, trying to maintain a balance between code readability, cleanliness and efficiency.  Working on tutorials gave me a lot of confidence. And then after developing that confidence in technical efficiency, I felt more confident about contributing directly to the library code bases. And that required a separate set of skills. Because when you are contributing code to a library, you probably need to be a very good reader of code. You need to be able to read code. You need to be able to understand why the code is written that way. You probably have to do a lot of intelligent guessing. But that skill is very valuable. At that point in time, I felt confident enough to contribute directly to library codebases, open up pull requests and so on.  Just to give you a chronology of how I went about it, I started by working on tutorials that felt challenging enough for me. And then I slowly proceeded towards contributing code to library code bases.  [0:11:04] SF: Yeah. I think you made a good point too about sort of learning how to dive into one of these existing, maybe fairly large codebases and be able to read and kind of get your bearings. 
Because that's also a core skill of walking into any organization as a software engineer. Usually, you're not starting sort of net new. Maybe that happens some of the time. But a lot of times, you're diving into an existing codebase that could have been there for like a decade. And you kind of need to figure out what's going on here with the available resources that you have. Whether that be documentation or maybe leveraging other people on the team.  But a big part of getting ramped up is just understanding what's going on so you can figure out where and how you're going to make contributions in a way that isn't going to break things and also is consistent with whatever that business happens to have in place for how they develop code and develop new types of features. [0:11:54] SP: Yeah. Yeah. Yeah. Absolutely. And another thing is, this part, if you are not used to it, at least that was the case for me, it will feel intimidating. I'm not going to pretend otherwise. It will feel intimidating. But I think if you turn that feeling of intimidation into an opportunity for you to learn and improve, I think you will end up just nailing it. Whenever I felt overwhelmed, I instead turned that into a learning opportunity, which really helped me.  [0:12:24] SF: Do you think that open-source contributions get enough credit when someone is applying to a job and it comes to the interview process? Certainly, if you're going to interview with like a FAANG company, it's not that they would discredit your open-source contributions. But you're still going to have to jump through the normal sort of solve-this-toy-problem type of interview process.  And even though you've made public contributions to open source, where they could actually go and look at real engineering work that you've done, they instead kind of rely on this set of problems that they're going to have you solve. Do you feel like there needs to be more emphasis or credit given to those that are trying to get into industry or land a new job based on their open-source contributions?  [0:13:08] SP: I think there could be a couple of things. For example, instead of just focusing on the regular data structures and algorithms rounds, you could maybe actually ask the candidate to work on projects that would be similar to what they would do after they join the organization they are interviewing for. That might be helpful, one.  Second, of course, if the candidate has got a plethora of open-source contributions, you already have their engineering work in front of you and you should be able to judge if they're meeting your expectations from there. And I think the engineering rounds could actually consider having just a one-on-one chat with the candidates where the candidates walk the interviewers through some of the most interesting projects they might have done. And what were the challenges? And how did they approach the solutions in the first place?  Because if you do something like that, that helps you to gauge the maturity level of the candidate. Because it's all one-on-one and you are able to instantly see if there's maturity, if the thought process has the maturity that you're looking for. I think these two things could be added to the regular interviews that are conducted by the FAANG companies these days. [0:14:29] SF: There's also something there where you're sort of testing their depth of understanding of what they produced as well. 
I think that if someone truly understands or was really the owner of a project, they can dive into the details of every part of the decision-making process, if a lot of thought went into it. And that's, like you said, a good way of sort of testing the maturity of the engineer or the candidate. [0:14:51] SP: Yeah. Even if that's not the case. Even if, let's say, the candidate does not have a lot of open-source projects to discuss, maybe just give them a toy project that closely resembles a problem that they would probably work on after they join the organization. And once the candidate has a solution, they should be able to walk you through each and every step of the solution that they devised. Either works, I think. [0:15:18] SF: Yeah. Actually, one of the best on-site interview experiences I ever had, I still had to do some regular sort of whiteboarding-type problem-solving interviews. But the bulk of the interview process was they gave me three or four hours or something like that and they gave me an existing codebase. And then I had to extend the codebase to support this chat interface that was similar to what their actual product was. And I could do whatever I wanted in that period of time.  And then even for the second half of it, they brought in another engineer that worked at the company to pair program with me and then I presented what I did at the end. I thought it was a really innovative way to test the candidate. Because they were still using sort of the traditional problem-solving, code in front of somebody on a whiteboard or at the keyboard type of problem-solving. But they also had this kind of open-ended problem-solving aspect where, basically, the limitation was your imagination and how fast you could put some of this stuff together. And I thought it was a really neat way to test a candidate, and essentially different from other types of experiences I've had.  [0:16:19] SP: I think that's quite spot on. And especially for roles such as machine learning engineering, it's not just about traditional software engineering. There's actually an ML component inside of it. You can be a Dennis Ritchie. You can be a Thomas Cormen. But you still may not understand how dropout works. You may still not be able to explain to me how to counter overfitting. That's not very helpful if you are joining an organization as a machine learning engineer. I think it should strike the right balance between how to write good code and how to go about problem-solving in general. But it should also involve some elements from the ML world as well.  [0:16:58] SF: Mm-hmm. Fantastic. Yeah. I think that's great. Let's start to talk a little bit about AI and some of the work that you're doing at Hugging Face. There's of course a lot of hype around LLMs and generative AI right now. And I think, primarily, when people think of generative AI, they think of these LLM-type models like GPT and a whole host of other ones, probably in part because of the popularity of ChatGPT and things like GitHub Copilot. But there's also generative AI designed to actually generate images from text prompts. I want to focus on that particular area. What are the approaches in AI to generating an image? And how does that compare, or how is it different, than essentially generating something like text?  [0:17:45] SP: Yeah, sure. Image generation has been around in the computer vision field for a while now. 
I think that realm of creativity first got kickstarted with something called generative adversarial networks back in 2014. And then it has seen many, many iterations of innovation.  And these days you almost don't hear about progress around generative adversarial networks. But generative adversarial networks were the ones that actually kickstarted the entire evolution around hyperrealistic image generation, where you take a look at a face image and you will find it very hard to believe that the face is actually generated using a neural network.  Generative adversarial networks are the ones that kickstarted this entire revolution. But I think it's safe enough to say that they are kind of becoming dead now. Now we have an entirely new breed of model family, which is called the diffusion model family. And if anyone's interested, things like DALL-E 2, DALL-E 3, Stable Diffusion, all of these things are based on diffusion.  There are a couple of different approaches, such as generative adversarial networks and diffusion. Then there are variational autoencoders. Then there are normalizing flow-based networks and so on. I don't think those were ever that popular. But there's that. But amongst all the approaches, GANs and diffusion models clearly stood out. But with the current trends, I don't think GANs compare even remotely to the kind of performance that we are seeing with diffusion models. [0:19:28] SF: What is the limitation of GANs? What is the diffusion model doing better than GANs so that it's become the most popular approach?  [0:19:37] SP: Well, first and foremost, generative adversarial networks are notoriously hard to train despite all the technical innovations made in order to stabilize their training process. They are nowhere close to how it is with diffusion models. Diffusion models, on the other hand, are much easier to train. They are much more controllable.  And when I say much more controllable, I essentially mean if you wanted to extend the use of diffusion models so that they can take images as input, not just natural language text prompts, you can do that very easily and so on. Plus, they are much more flexible. If you want to extend diffusion models to solve the task of inpainting for you, it's fairly doable.  The kind of controllability, the kind of flexibility, the kind of high-fidelity generations that diffusion models offer, these are some of the most highly talked about aspects of diffusion models, which clearly position them as the superior alternative to GANs.  But on the other hand, I would also like to point out the slow and iterative nature of diffusion models. Diffusion models are not one-shot, unlike generative adversarial networks. They also suffer from slow inference latency. But you know how the field of deep learning evolves. We already have techniques to sort of mitigate that problem. We are not there yet. I mean, when I say we are not there yet, I essentially mean GANs can produce hyper-realistic images as well, probably in an unconditional manner, probably not using textual prompts. But their generation is one shot. You don't have to run iterations to get to a fairly photorealistic image. But that's not the case with diffusion models. But we are slowly getting there to close the gap between the inference speed offered by GANs and the current inference speed offered by diffusion models. [0:21:32] SF: Okay. And then you mentioned easier training with diffusion models. Can you walk me through what that process consists of? 
What does the diffusion model training pipeline look like?  [0:21:43] SP: Yeah. Sure. I like to think of diffusion model training as the following. You would like to think about the process as some sort of denoising, where you start with pure random noise drawn from a Gaussian distribution. And over a period of time, you try to denoise that random noise so that it becomes a photorealistic image.  You do that over a period of time and you frame the problem as denoising, as I mentioned. And during the course of this denoising process, if you also condition this denoising process on some sort of text embeddings, which are computed from prompts like "white fire monster in a snowball room", you also equip the denoiser so that it can accept natural language prompts. And at the end of the denoising process, it can generate images from natural language input, which is super cool. [0:22:41] SF: What is the denoising process? How do you actually remove noise from the original Gaussian distribution?  [0:22:48] SP: What happens is you first start with pure random noise and then you employ a network. Let me walk you through. You start with a clean image and then you introduce some amount of noise. Now you have a noisier version of the image, right? Now what you do is you employ a denoiser network to predict the amount of noise that was added. And you repeat this process over a period of time so that the network learns the reverse diffusion process, where it is initially fed with pure random noise and it can successfully denoise that pure random noise into something photorealistic. That's the process. [0:23:24] SF: Is the training material for a diffusion network essentially a bunch of images, and then you're adding some randomness in the form of noise to those images? And then you're training this network to essentially recognize the noise that you added to those images so that, in the future, it can essentially make certain predictions in terms of what kind of image you want to generate based on this originally noisy set of training material?  [0:23:50] SP: Yeah. Yeah. Yeah. That's about right. Yeah. 
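To make the noise-prediction framing above concrete, here is a minimal training-step sketch written against the Diffusers building blocks. It is illustrative rather than the library's official training script: the checkpoint name, the batch shapes, the learning rate, and the random tensors standing in for real image latents and text embeddings are all assumptions.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel, DDPMScheduler

# Placeholder checkpoint; any Stable Diffusion 1.x-style repo with this layout works.
repo = "CompVis/stable-diffusion-v1-4"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")     # the denoiser
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")  # the noise schedule
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

# Stand-ins for one training batch: clean image latents and the text embeddings
# produced by a frozen, pre-trained text encoder (CLIP, in Stable Diffusion's case).
latents = torch.randn(4, 4, 64, 64)
text_embeddings = torch.randn(4, 77, 768)

# 1. Sample Gaussian noise and a random timestep for each example.
noise = torch.randn_like(latents)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))

# 2. Forward diffusion: corrupt the clean latents with that noise at those timesteps.
noisy_latents = scheduler.add_noise(latents, noise, timesteps)

# 3. The denoiser predicts the noise that was added, conditioned on the text embeddings.
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample

# 4. Optimize a simple mean-squared error between predicted and true noise,
#    then repeat this step over the whole dataset.
loss = F.mse_loss(noise_pred, noise)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```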
[0:23:52] SF: In terms of the training material, is it really just like a large amount of images that you're using? How do you eventually map text prompts to the images? How does that set get created?  [0:24:07] SP: Yeah. Sure. For unconditional generation, where you just start the diffusion process with some random noise probably drawn using some kind of initial seed, you do not need text prompts. But unconditional image generation is probably not that useful. You would probably want to generate images from natural language inputs. Because those things are much more useful.  What you do is, in your training set you have an image and you have a caption that accurately describes the image. You have that in large quantities. But also, the quality should matter. You wouldn't want to have image and caption pairs where the captions are not at all aligned with their images. Because wrong captions can actually make your model performance worse.  You have a bunch of pairs of images and captions. And to be able to train your diffusion model, you employ a pre-trained text encoder to be able to compute some kind of intermediate representations of the prompts. And then while you are training your diffusion network, you basically pass those text embeddings to sort of condition the training process so that it learns how to denoise a noise vector, which is also conditioned on the text embeddings that you computed using a pre-trained text encoder. [0:25:30] SF: And in terms of these image-to-caption pairs, how are those generated? Where do those come from?  [0:25:35] SP: There are a couple of approaches here. Let's say you have got a large set of images. You could potentially leverage an automatic image captioner model like LLaVA, BLIP-2 or even GPT-4 to describe the images in terms of captions. You could also scrape the internet for images. And usually, while you are scraping the internet for images, you would probably also want to scrape the alt text that you get within the HTML tags of the images. And those become your weak captions to basically weakly describe the images. Those are of course noisy. You of course need to run some kind of data filtering so that the image-caption pairs are solid and tight. These are some of the approaches that are usually followed these days. [0:26:27] SF: And then in terms of going from prompt to creating an image, what's that process? I'm assuming the prompt essentially gets vectorized as some sort of embedding and then you're doing some sort of vector similarity search to find some collection of images that are similar to what the prompt is. Can you walk through that process of going from prompt to actually generating a new type of image?  [0:26:51] SP: Oh, yeah. Sure. You have your text encoder, which will get you the intermediate representation. Or, as we like to call it, the embedding from your input prompt. You have the prompt embedding. And then you start with randomly-sampled Gaussian noise. And then you pass those two things, the prompt embeddings and the initially-sampled noise, to the denoiser diffusion network, which will get you a refined kind of noise.  And then you again pass the refined noise back to the diffusion network along with the pre-computed prompt embedding, and you sort of repeat this process for 20 to 30 steps until you get the kind of image that you are expecting. 
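The iterative loop just described can be written out explicitly with the same building blocks. This is a hedged sketch, not a production sampler: the checkpoint, the prompt, the step count and the choice of DDIM scheduler are assumptions, and classifier-free guidance is omitted for brevity. In practice a ready-made pipeline (shown later in the conversation) wraps all of these steps.

```python
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

repo = "CompVis/stable-diffusion-v1-4"  # placeholder checkpoint
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")

# 1. Encode the prompt once with the frozen, pre-trained text encoder.
prompt = "a photo of a tiger surrounded by a cat and an elephant"
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

# 2. Start from pure Gaussian noise (in latent space, for Stable Diffusion).
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

# 3. Repeatedly refine the noise, conditioned on the prompt embeddings.
scheduler.set_timesteps(25)  # roughly the "20 to 30 steps" mentioned above
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# `latents` now encodes the final image; a VAE decoder maps it back to pixels.
```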
[0:27:32] SF: What are some of the hard and unique problems about object generation, in terms of creating an image, or creating a 3D model, or whatever it is, versus text? What are some of the unique hard problems in the space?  [0:27:45] SP: Well, first and foremost, it's multimodal in nature. In the text generation arena, we are only dealing with text. You are prompting a language model and you are expecting some kind of text in return. Unlike multimodal models, which can operate both on images and text. For example, CLIP and so on. You could do reverse image search. You could tell, "Hey, I have a search query. Now go to my database of images and tell me the set of images from my database which closely match the input search query." Those are different things. But for image generation from language input, you need to be able to solve the problem of object and variable binding really nicely. Let's say your prompt has got distinct nouns. Dogs, cats, tigers and so on. And you would ideally expect a text-to-image diffusion model to generate all of those objects, but in a coherent manner, so that it adheres to the natural language prompt.  Let's say your prompt is a photo of a tiger surrounded by a cat and an elephant. Now the generated image should actually adhere to all the relationships that are described in the prompt. It shouldn't be like you are generating maybe - I don't know. A bird instead of a tiger or something like that.  You see, from the natural language description, you are having to bind some kind of an object-entity relationship so that all the objects that should be present in the generated image also have some kind of adherence to the input prompt. That's why it's different. That's why it's probably more difficult as well.  [0:29:31] SF: Are there also challenges around essentially having high-quality data for training versus text? Text, clearly, there's a lot of digitized text at this point. Books. Things people have written. And we have a lot of digital images as well. And as you mentioned, you can use the alt tags if someone's done a good job of actually labeling them using their alt tags. But is quality of input a further restriction or challenge in the space versus text?  [0:29:58] SP: Yeah. I think we also saw a couple of works that focus on improving the performance of language models by improving the data quality. Data quality is always paramount. If you feed in garbage, your outputs will be garbage. That's like the zeroth rule of developing any production-grade machine learning system. The data quality for text-to-image diffusion models is also quite paramount. And at this point in time, all the superior works that I have seen make use of really high-quality data.  For example, Stable Diffusion XL, they focus on building a really good data set. Then there are works like PIXART-α. They also highlight the importance of having a really good quality data set, where you not only have to focus on the quality of the images, but also the prompts that describe those images. Data quality is paramount. [0:30:55] SF: Is that the main differentiator between the various diffusion models that exist right now? There's DALL-E, Stable Diffusion, Craiyon. There are a lot of different approaches or models available to people. Is the primary differentiator the quality of the data? Or are there other things as well?  [0:31:12] SP: No. Architectural differences are there. The way you train these diffusion models, that's also a major differentiator. For example, Imagen is a really popular text-to-image diffusion model from Google. But it also differs from Stable Diffusion XL potentially with respect to data. But the main differences are in how they are architected. Their neural network architectures differ, and how they were trained on the data sets. [0:31:37] SF: What kind of computational resources are required for training and using these types of models? And are there techniques that help make them more scalable?  [0:31:45] SP: Well, the scalability part is still an active area of research. There are scaling studies in the language modeling world that focus on how we can scale these models effectively and efficiently when provided with more data and compute. But the computational resources can be quite demanding because we are now operating on the continuous space.  But with training paradigms like latent space diffusion models, you can cut down the computational requirements quite a bit. You need clusters of GPUs to train these models. The recent innovation around this realm would be PIXART-α, which drastically reduces the amount of compute you need to train state-of-the-art text-to-image diffusion models. 
Like somewhere around 32 days on A100s, if I remember correctly. But yeah, that's the ballpark we are talking about. [0:32:42] SF: What was the innovation there? What led to this performance gain that you're mentioning?  [0:32:46] SP: Well, first and foremost, the neural network architecture. Instead of using a UNet-based architecture, they used a Transformer-based architecture, which is known to be more efficient and more scalable when it comes to neural network architectures. That's where the major chunk of efficiency comes from, I would like to say. Because it's Transformers. They are efficient and much more effective at many learning tasks and so on. That's one.  Second would be the use of latent space diffusion models, where instead of operating directly on the pixel space of images, you are operating on the latent space. What you are doing is you are leveraging some kind of a pre-trained variational autoencoder, which will give you an intermediate representation of the images. And you apply all the diffusion-related training on the intermediate representation instead of the actual pixels of the images.  And you could think about it in the following manner. Let's say you want to train on 1024 by 1024 images. That's a lot of pixels, by the way. About a million pixels right there. And it will be computationally very challenging to train directly on a million pixels. Instead, what you do is you reduce the dimensionality to somewhere around 128 by 128 by making use of a variational autoencoder. And now the representation space has immediately reduced. And henceforth, it's much less computationally demanding. And hence, the efficiency. 
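As a quick illustration of that dimensionality reduction, the sketch below pushes a stand-in 1024 by 1024 image through a pre-trained variational autoencoder using Diffusers. The specific SDXL VAE checkpoint and the random tensor standing in for a real image are assumptions made for the example.

```python
import torch
from diffusers import AutoencoderKL

# The VAE used by Stable Diffusion XL (one plausible checkpoint choice).
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")

# Stand-in for a real 1024x1024 RGB image scaled to [-1, 1]: about 3.1 million values.
image = torch.rand(1, 3, 1024, 1024) * 2 - 1

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()

print(latents.shape)  # torch.Size([1, 4, 128, 128]) -> roughly 65 thousand values
# Diffusion training and sampling then operate on these latents; the matching
# decoder (vae.decode) maps the final latents back to pixels.
```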
[0:34:18] SF: Yeah. I see. Essentially, they're using some clever tricks to help reduce the complexity of the representation space of the image so that they're reducing the computational resources involved with actually processing and training these types of models. [0:34:33] SP: Yeah.  [0:34:33] SF: You mentioned there's some unique hard problems in the space. But what about if I'm just using these models myself as a user? Is there a difference in terms of making use of diffusion models as someone who wants to leverage image generation versus using, say, an LLM? Is this harder to use as, essentially, a consumer of the model versus a consumer of an LLM?  [0:34:58] SP: I don't think so. Because even if you wanted to dabble in the LLM field, you would need to know about certain things. Let's say the temperature parameter. The top-p parameter. Then you would need to learn how to be good at prompt engineering. Then the number of new tokens that you want the language model to generate in order to probably optimize your costs and so on.  Similarly, in the diffusion world, you would need to care a lot about the number of inference steps that you need to get to a good image. The kind of scheduler that you are using. The amount of guidance scale that you're using and so on. I wouldn't say it's a whole lot different in comparison to the language modeling world.  [0:35:39] SF: You're a maintainer of the Diffusers library for Hugging Face. Can you talk a little bit about some of the applications that people build using the library?  [0:35:49] SP: Yeah. Sure. I'm one of the maintainers. I'm joined by other folks, Patrick, Andrew. Huge shout out to them for helping me all along. Some applications include, of course, things like ControlNets, where you want the generation process to be conditioned not just on a natural language prompt, but also, let's say, some kind of pose.  Let's say you have a pose image and you want to generate an image that would have close adherence to the pose image. But you also want to condition the generation process with some kind of language prompt input. That's one of the applications. Then you could, you know, create really cool GIFs out of static images. And this could be helpful for promotional materials and stuff like that.  Third, you could build variations of images. Let's say you work in the e-commerce domain and you want to generate variations of a particular product image in order to help your marketing campaign and so on. That's something you could do. And then maybe you want the generation to be a bit more subject-driven. Let's say you have got fashion figures and you want the fashion figures to appear in interesting contexts. Maybe in different apparel and so on. You could use the Diffusers library to quickly implement such a use case. [0:37:12] SF: How does someone use the library? Can you kind of walk me through the equivalent of like a "Hello, World!" but with the Diffusers library?  [0:37:20] SP: You install the library. You install the library because the library is an open-source Python library. You can just hit pip install diffusers. And most of our pipelines should run on a free-tier Colab notebook so that you do not have to acquire costly GPUs to be able to play with the library. Of course, training might require beefier GPUs. But just for running inference, a free-tier Colab notebook should probably cut it for you for most cases.  After installing the library, you basically pick a model. Maybe it's a Stable Diffusion model. Maybe it's [inaudible 0:37:57]. Maybe it's UniDiffuser. You basically pick a model. You can refer to the documentation to know which model to pick for which task. And then you initialize a diffusion pipeline. And then you call the pipeline on your language input. Four, five lines of code and done. [0:38:15] SF: In terms of picking a model, I've seen some of these guidelines as well, of like certainly certain types of models perform better for certain types of tasks. Or they might be really focused on a particular task. Do you think over time there'll be sort of a convergence of some of these approaches where there'll be less of a dependence on someone kind of knowing which model to approach a problem with or having to test multiple models to get the best result?  [0:38:41] SP: I think it's always a bit evidence-driven. It's always a bit empirical. Because it also depends on the context and the kind of problem that you're trying to solve. If you are trying to optimize the inference latency, then you will have a certain kind of requirement. But if you are optimizing for the ultimate quality with a good level of photorealism, your requirements would be different.  Accordingly, you will have to settle on a model that gives you a good balance between the two. Yeah. But you might probably also be looking for models that are better at following complex prompts. But you might have use cases that do not need models that are better at following complex input prompts. It's very contextual, I would say. It's very problem-dependent. The closer you are to the problem statement, I think the better it would be for you to take a decision on which model you should use.  
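For reference, the "four, five lines of code" workflow described above looks roughly like the sketch below. The checkpoint, prompt and parameter values are illustrative choices rather than recommendations, and a GPU (for example, a free-tier Colab one) is assumed to be available.

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline

# Pick a model from the Hub; Stable Diffusion XL is one example.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA GPU is available

image = pipe(
    "an astronaut riding a horse on the moon",  # any natural language prompt
    num_inference_steps=25,  # the inference-steps knob mentioned earlier
    guidance_scale=7.5,      # the guidance-scale knob mentioned earlier
).images[0]
image.save("astronaut.png")
```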
[0:39:41] SF: And then in terms of getting this kind of empirical evidence of which approach to take, how do people go about actually collecting that empirical evidence and testing these things? Is it really down to relying on a human in the loop to evaluate? Or are there automated ways of evaluating which model is performing better for a particular task?  [0:40:02] SP: That's a really good question. Long story cut short, nothing beats human annotators and human evaluations. If you can manage a team of human annotators within your company, that's the gold standard. No one can beat it. Because those human annotators will be actually aware of the kind of problem that you are solving for. And they will probably give you the best level of annotation. But if it's not possible, there are some automated ways to evaluate text-to-image models, like the T2I-CompBench benchmark. Then there's the Holistic Evaluation of Text-to-Image Models from Stanford and so on.  But I would like to really repeat, nothing beats human evaluations. If your organization can manage a separate and dedicated team of human annotators, always go for that approach. Otherwise, go for other benchmarks like T2I-CompBench and so on. [0:40:58] SF: Yeah. Because I would think that, at this point in the world of AI, it can be a little bit tricky to even know if you've made - you retrained your model, you continue to evolve it. Are you actually moving things in the right direction or not if you don't have some way of actually empirically testing it? [0:41:14] SP: Yeah. Yeah. Yeah. Absolutely. And by actually putting humans in the loop, you also get to collect preference data. And later, that preference data can be used for things like reinforcement learning from human feedback, where you optimize your model for a certain kind of tone and so on, which could be beneficial for your business in the long run.  Because, for example, Midjourney does it all along. For example, if you generate an image within the Midjourney bot, it asks you to select a particular image, which it then upsamples or generates variations for, right? And by clicking a particular image within a pool of different images, you actually give your preference data to Midjourney. It's quite valuable. [0:42:00] SF: And what's your day-to-day like as someone working on the Diffusers library? Is your day-to-day as an engineer working in the ML space functionally different in some way than perhaps working as a more conventional software engineer working on a consumer-facing application? [0:42:20] SP: I like to decouple my day job into three main things. First, the library maintenance part, where I get to contribute potentially very impactful features to the library. Making sure it's backwards compatible and so on. Responding to issues. Reviewing pull requests and so on. And then the second point would be about training and eliciting diffusion models. And third would be to take part in applied research conversations internally and also in a collaborative manner with the rest of the research community.  It's a bit different, I would say. Because I have to sometimes be on top of the diffusion research to be able to implement something. But also, at the same time, I also have to think hard about the code quality and the way I implement a particular component within the library so that it respects the philosophy of the library, so that it is backwards compatible, so that it maximizes the user experience for a lot of our users.  [0:43:26] SF: Yeah. 
It sounds like there are some components of your day job that are probably very familiar to anybody that's worked in software engineering. You're probably, you know, running sprints, writing tests to sort of maintain the library and certain code quality expectations and so forth. But at the same time, you also need to stay on top of everything that's sort of going on in the space so that you can train models. Also, contribute from like a research thought leadership standpoint. And also, just kind of have your bearings about what's actually going on in the space so you can implement the right things or incorporate those into the library as needed. [0:44:02] SP: Yeah. Basically, having the ability to prioritize one thing over a bunch of different things. Because not everything happening in the diffusion space will make it to the library. We have to be very careful about what we end up integrating within our library so that it remains manageable as far as maintenance goes and so on and so forth. For that, it helps to have a very general and collective overview of the field.  [0:44:31] SF: Yeah. It makes sense. All right. Well, Sayak, thank you so much for being here. And again, for those listening, if you want to learn more about the space or more about Sayak's work, I highly recommend his blog. We'll figure out a way to include that in the show notes as well. But he writes a lot of great material that deep dives into different topics, and I think it's a great way to learn if this is something that's interesting to you. Thank you so much, Sayak, again, for being here. And cheers. [0:44:55] SP: No problem. Thanks for having me. [END]