EPISODE 1700

[INTRO]

[0:00:00] ANNOUNCER: The majority of enterprise data exists in heterogeneous formats such as HTML, PDF, PNG, and PowerPoint. However, large language models do best when trained with clean, curated data. This presents a major data cleaning challenge. Unstructured is focused on extracting and transforming complex data to prepare it for vector databases and LLM frameworks. Crag Wolf is head of engineering and Matt Robinson is head of product at Unstructured. They join the podcast to talk about data cleaning in the LLM age.

This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[EPISODE]

[0:00:50] SF: Crag and Matt, welcome to the show.

[0:00:52] CW: Hi, Sean. Thanks. Great to be here.

[0:00:54] MR: Hi, Sean. Thanks for having us.

[0:00:56] SF: Yes. Thanks so much for being here. I like the dynamic of talking to a duo. Anything to get a little bit of a different flavor in terms of the conversation that can go on, and so forth. Thanks for being here. Why don't we start off with some introductions for the audience? Who are you? What do you do? Why don't you start, Crag?

[0:01:12] CW: Sure. Yes. So, I'm Crag, head of engineering at Unstructured. I have been in the industry a while. Previously, I had worked together with Matt and founder Brian at primer.ai and led a couple of engineering teams there. But that's me in a nutshell.

[0:01:29] SF: All right. Matt, who are you and what do you do?

[0:01:32] MR: Hi. I'm Matt Robinson. I'm head of product at Unstructured. Depending on the day, I do work across product and engineering. I come from an ML, NLP-type background. When we first started, I did a lot of work on our open source package, and now I do all sorts of various and sundry things.

[0:01:50] SF: Well, such is the nature of a startup, we all wear many, many hats, I think. I used to say, when I was a founder, that I did everything from writing most of the code, to selling the product, to taking out the garbage on a daily basis, essentially. So, we're talking unstructured data. I'm excited to talk about this, and also about your company, Unstructured. I feel like I've had a lot of conversations on this topic recently, both in my professional life and also in my podcasting life. Obviously, there's been this huge growth around Generative AI and LLMs over the last year and a half, which I think is why there's been some heightened level of conversation and focus on this. But the reality is, most of the world's data is unstructured, something like 80% to 90%, and we really haven't had technology to unlock it. We have data lakes and things like that, but they take a significant amount of maintenance and manual work to actually create the metadata around that unstructured data and make it something that you can actually leverage. So, most companies are sitting on a mountain of this stuff that's locked away. It could be in an S3 bucket, it could be their email. Essentially, they have tons and tons of this information. How do we get the data in a form that's actually useful without a bunch of engineering and manual work? How do we get beyond all this manual work that's involved with actually allowing us to query it, use it for analytics, use it for machine learning, and so on?

[0:03:16] CW: Yes. I mean, I'll jump in there first.
I'd say this data has been around for a couple of decades now, I guess. There's emails, there's Office documents, whether they're Microsoft Office, or Google-type documents, or other. There's video recordings. And they've all kind of lived separately from what we would refer to as structured data, like databases with very strict schemas, the kind that powers business intelligence. It's like a totally separate universe from the knowledge that may actually exist in the company, which really lives more on what we call the unstructured side of the house.

So, I just wanted to double-click on what you're saying here, and that is, it is so crucial now to be able to unlock the value inside of all of that unstructured data. The state of the art now with LLMs and multimodal LLMs is that you present them with large amounts of text, or actual images and documents, and you're able to ask them questions about it for whatever knowledge you might be interested in.

I guess maybe I'm jumping ahead a little bit here, but a key thing in order to get there is, well, you can't throw all your documents at an LLM at once. So, there's a key step, which is retrieval: finding the relevant documents to narrow in on answers. How do you go from mountains of unstructured data to being able to support retrieval of the relevant types of documents? That's exactly what Unstructured, the library and the company, is all about, and I'm happy to go further there. But at a high level, absolutely.

[0:05:05] SF: Yes. I mean, I think you call out an important thing there, which is that historically, we've treated these two types of data as completely separate things. We've had a reasonable way of managing structured data, things that we could put in rows and columns, or even semi-structured data, like JSON or a CSV file or something like that, and we've been able to take action on it as either engineers or product people as a business. But then we have this other stuff that just lives over in the soup, and we don't really know how to do stuff with it. Or you have to do stuff in a very point-by-point, bespoke process. I think, if you ask a random person who's not in technology what they think a database does, they probably think about it as, well, that's a place where I can go and ask a question, and I'll get an answer. But that's not actually the reality of a database. It's much more rigid than that. But now, I think we're starting to inch towards something like that being the reality, where we can ask a question, and we get some sort of answer from being able to appropriately retrieve the data and combine that with things like LLMs. Matt, did you have something to add there?

[0:06:16] MR: Yes. You brought up a really good point about the traditional structured databases. We have a way to deal with those, and, well, actually, a lot of what we do is abstract away a lot of that ad hoc work that you used to do, and take all of that unstructured data and put some structure to it, so that it's easy to load into something like a database, whether it's a vector database or some other kind of database. Really, at the end of the day, a lot of what we're doing is abstracting that away, and then making it so that you don't have to care what the source document was. Downstream, you can process it all in the same way. I'll talk about a couple of examples.
So first, think about if you have an HTML file versus a Word document. Well, there's structure to both of those documents, but they're in different languages, different schemas. One is HTML, while the other is a pile of different XML documents. To get that into a format that you can then deal with downstream in the same way, you have to go through and normalize all of those elements. There could be an h1 heading in the HTML file. There could be a section header in the Word document, and you want to not care about whether it came from one or the other. So, normalizing those elements is a key part of what we do. Once they're normalized, it doesn't matter whether it came from the Word document or the HTML file. You can just treat it the same.

Then, the same goes for stuff like PDFs, where we apply models too. In other cases, we're using cues like XML tags, and HTML tags, and that sort of thing to figure out structure. Same deal on PDFs and images, we're just using visual cues. But at the end of the day, that all gets normalized to the same schema. Once it's in that schema, you can chunk it, you can load it, you can do all sorts of different stuff, and it really doesn't matter what the source document is. That's really where the power of what we're doing comes from.

[0:08:12] SF: What is the output of this process? What is that normalized view of the data once you've essentially pre-processed it?

[0:08:16] MR: There's really three core things that we're doing. One, we're pulling out the raw text content from the document. Two, we're structuring the document. So, we're basically trying to come up with, in natural reading order, "Hey, here's the heading. Here's some narrative text. Here's some list elements." We're structuring the documents into what we call document elements. Then third is pulling out metadata from the documents. That really falls into two categories. One is information that we pull out of the document itself, so stuff like page number, last modified date, that sort of thing. Then, some that we actually infer from the structure of the document itself, so maybe this piece of narrative text is a child element of this heading, for example. We pull all of those elements out. In the end, for us, it's JSON output that's just a list of objects with all of these elements in there. What that allows us to do is easily chunk for LLM applications, and the metadata allows for things like filtering for hybrid search and some more complex LLM operations. That's some of the more recent LLM app work we're doing.

[0:09:29] SF: Maybe we can back up a little bit. Walk me through a use case, from essentially where I'm starting as someone using Unstructured, to the place where I'm generating this normalized output.

[0:09:44] CW: Sure. I mean, a really simple use case would be, I have a Google Drive. There's a shared Google Drive for my organization, and I want to be able to query the information in that Google Drive. So, we process all those documents and put them through the transform step that Matt was referring to, where we get the same structured JSON for each document, regardless of the type it is, and that could be a native Google Doc as well. From there, the rest of the pipeline basically looks the same. Chunk it. There are various chunking strategies that you could use, and you could experiment with different chunk sizes.
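As a rough illustration of the document-element output Matt describes, here is a minimal sketch using the open-source unstructured library. The filename is hypothetical, and the fields shown are representative of the schema rather than exhaustive:

```python
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json

# Partition any supported file type into a list of normalized document elements.
elements = partition(filename="quarterly-report.docx")

for element in elements[:3]:
    # Each element carries a category (e.g. Title, NarrativeText, ListItem),
    # the extracted text, and metadata such as filename and page number.
    print(element.category, "->", element.text[:60])
    print(element.metadata.to_dict())

# Serialize the full element list to the JSON representation discussed above.
elements_to_json(elements, filename="quarterly-report.json")
```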
Again, the vector database that we're writing to, which is probably going to be a vector database that supports semantic search, is also configurable. At the end of the day, it's a fairly straightforward ingestion process, and that data is now ready to be queried for the simple ChatGPT-style use case of, I just want to chat with my documents and ask questions about them.

[0:10:52] SF: So, in the scenario that we're talking about, where we're processing files from Drive, what is the performance like? Am I doing that in batch? When I add new files, is there a real-time update? How does some of that stuff work?

[0:11:04] CW: For sure. Initially, it'll probably be a batch to load your 10,000, 100,000, million docs. That's something the platform handles - we basically allow users to specify their Google Drive location and all the parameters they need in the pipeline. The first time around, it's probably going to take a little bit longer, given the number of documents. But then there's basically an option to keep the vector database warm, so that we keep looking for new documents. That vector database is always kept up to date, with a little bit of latency perhaps, with the source documents themselves.

[0:11:42] SF: What's happening behind the scenes in order to process these docs and map them to this normalized output? Essentially, you're pre-processing the files to normalize them into this output. Then, from there, you're applying chunking strategies to build potentially something like an embedding model?

[0:12:04] MR: We ourselves don't build embedding models. We're typically preparing data for NLP-type use cases. Somebody might prepare a bunch of domain-specific documents using our file transformation technology, for example, for the purpose of training an embedding model. We do, in some cases, interact with those models, typically on the way to a vector database. We have integrations for that. So, if you wanted to load your documents, for example, from SharePoint into a Weaviate vector database, we would potentially call an OpenAI embedding model or something along those lines ahead of loading those documents. But our use case is on the pre-processing side, for the most part.

You asked a little bit about what's going on under the hood when we do that. There's really two categories of methods that we use for data pre-processing. One is rules-based parsers. Then, we also have model-based parsers. For the rules-based parsers, that's going to be a lot of, honestly, very tedious logic for taking those source documents, whether it's a PowerPoint, or a Word doc, or HTML, and then mapping them to that common schema. Fortunately, those run really fast. It's well under a second per page, usually around 0.1 seconds per page or faster.

The tougher ones are always the PDFs and the images, which are model-based workloads. There are really two ways we do that. One is with object detection models. So, the same sort of model that you would use to say, "Hey, identify this car in an image," or something like that. We're doing the same thing. We're just identifying page elements based on visual cues instead of identifying the car in that case. We also have models that we're experimenting with on that end, called vision transformers, that act sort of similar to a text transformer, except instead of text in and text out, it's image in and then text out, with the text, in this case, being structured JSON.
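To make the rules-based versus model-based split concrete, here is a small sketch of how that distinction surfaces in the open-source library's strategy option. The filenames are hypothetical and the choice of strategy values is illustrative:

```python
from unstructured.partition.auto import partition

# Office-type formats route to fast, rules-based parsers under the hood.
docx_elements = partition(filename="meeting-notes.docx")

# For PDFs and images, "fast" attempts rules-based extraction of embedded text,
# while "hi_res" runs the slower model-based (object detection) pipeline,
# which is what recovers layout, reading order, and tables from scanned pages.
pdf_elements = partition(filename="scanned-contract.pdf", strategy="hi_res")
```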
The model-based workloads run a little bit slower, because they're heavier duty. They run at about three to five seconds per page for us on CPUs.

[0:14:14] SF: Okay. If we're thinking about a RAG pipeline, where eventually I want to create a bunch of embeddings and store them in a vector database, where Unstructured sits in that process is really at the beginning, where it's essentially the head of the ingestion pipeline, cleaning up all this data to make it ready for chunking.

[0:14:34] MR: Exactly. We think of ourselves as the connection between all of your messy source documents, where all of your organizational knowledge lives, and the LLM. So, you're exactly right. We sit at the top end of the funnel for a RAG application.

[0:14:46] SF: Yes. Data quality is such an important aspect of building anything in machine learning and AI. So, what are some of the performance gains that you're seeing people get by being able to properly pre-process the documents and generate this normalized view, versus whatever else they might be stuck doing on their own?

[0:15:10] CW: I mean, I'd say, as we've mentioned, there's really a broad range of document types to support, not to mention different sources, where the document type itself may be native to the source, for instance, a SharePoint document. So, the critical thing for a lot of users is really that we're segmenting that data correctly. So, for the schema that Matt was talking about before, we are able to, number one, cleanly identify the text, if there's text in the documents, if they're Office-type documents. Two, natural reading order. For multi-column docs, if you're just running raw OCR, or using other transformation pipelines, multi-column documents will often trip up the outputs there. Then, additionally, another big subcategory where the accuracy matters a lot is tables. If there's an embedded table within a PDF, or a Word doc, or HTML, having an accurate representation of that in the JSON output itself. So, we're always tracking our scores versus some, I guess, maybe undisclosed competitors at this point. But we feel that we have the edge across all document types, and we encourage users to verify that themselves against our API.

[0:16:35] SF: What is the output when you're doing something like processing a table? What are you actually generating on the other end?

[0:16:40] CW: Yes, great question. In the JSON schema itself, it's an ordered list of elements from the document, in natural reading order. If there's a table, we have an element type called table. For the actual cell structure and the content within those cells, there's an additional metadata field called text_as_html. We happened to choose HTML as the representation for tables because it covers cells, and rows, and column spans, and row spans. It could have been something else. But that allows end users - and the end user could be an LLM in a RAG application as well - to then query that table. So, that's basically what's going on in terms of the JSON transformation.

[0:17:29] SF: What about processing a PDF where the actual pages in the PDF are images of text? What's going on there in order to convert that into, I'm assuming, a text representation?

[0:17:41] CW: I mean, I'm really glad you asked that.
PDFs are such - there's such a variety of types, or subtypes maybe, of PDFs, where they're all embedded text, or they're all scanned pages, or they're a mixture of embedded text and images per page. So, in the case of the scanned page, we treat it as an image, basically, and we pass it through an object detection layer. We're able to identify tables, headings or titles, paragraphs or narrative text within the page itself, and pass them through the rest of the pipeline, as we would any other content that's coming through, until we get those outputs. So, if there's a table within that image, we'll recognize it as a table, we'll take the OCR steps that are needed, and we'll basically extract out the content at the cell level for that table.

[0:18:36] SF: You mentioned earlier that you're starting to experiment with being able to convert actual images as well into a textual representation. So, is the output essentially a human-style description of the image?

[0:18:50] MR: Yes. That would be the idea there. The idea is that you have documents, and there could be an image in the document, or maybe a figure or something like that with a caption, and you want to be able to retrieve that as part of your similarity search. There are really two methods we're experimenting with for this right now. One would be, you get your image or your figure or something along those lines, and then kick that out to a computer vision model and get back a text description of it. Now, it's like, "Hey, this is a graph and it's going up and to the right and it describes GDP," or something along those lines. Or, "Hey, this is a picture of a skier skiing down a mountain." Now, with that information in the text key in our JSON output, you can load that into the database and essentially treat it as text when you're doing your similarity search. Then, if you want, you can pull back images along with text in your LLM application.

The other, easier way that we're doing that is just OCR-ing the image. So, if you have a picture of a stop sign or something like that, and you want to OCR that and get "stop" out of there, that's something that we do as well. More future-facing, there are also multi-modal retriever embedding-type models that we might experiment with. But we're very practical. We do the simple stuff well first, which gets our users unlocked and going, and then we expand out from there as we get further along.

[0:20:22] SF: Yes. Absolutely. I would think the biggest hurdle for businesses right now is just taking their textual data that's locked away in doc files, and PDFs, and all these different formats, and being able to either bring that into an LLM for whatever sort of application they're building, or just build a better retrieval system, so that they have a single source of truth for helping people find answers to various internal information, and so forth. In terms of the data processing steps to get something RAG-ready, you have cleaning, there's chunking, summarizing, embeddings. Is it primarily Unstructured handling the cleaning part of that? Or is there some level of support for chunking? I know that you're not necessarily interested in building your own AI models for embedding. You're going to leverage something else.
But where does that begin and end, I guess, in the process?

[0:21:23] MR: Yes. There's actually a lot we do on the way to the vector database that helps with retrieval once it's there. Chunking is actually a big one. As Crag brought up earlier, we're pulling out all of these document elements, and we're doing our best to make those really, really accurate. At the end of the day, we're not just doing that for fun - although it is very fun. There's all sorts of stuff that we can do with that information to help prepare data for the LLM app.

One of those is chunking. So, you can think of it as, we've got this structured information about the document, in reading order. When you think about chunking, traditionally, how you do that is you take a great big, long document, and then you just chop it up: "I'm going to chop every X tokens or X characters." By breaking down the document into its elements first, we can actually chunk in a different way. Instead of chopping up the document, you can actually build chunks up from the atomic document elements. What that means is we can identify a title, or a subheading, or a list with bullets underneath, and say, "Hey, this stuff belongs to the same section," and then group that together into the same chunk, because we have structural information about the document. What that means on the other end is, when you're doing the retrieval for the LLM application, you're able to get back content that belongs to the same section. So, it's going to be more coherent, and you're going to get more relevant context to inject into the prompt. Ultimately, you're going to get a better response on the other side.

One of the other things we enable by doing this pre-processing is rapid experimentation. Really, at the end of the day, the most expensive part is going from the raw document to this standardized JSON schema, especially in the case of PDFs and images, because those are model-based workloads. Once you have it in that schema, we chunk on top of that schema, so that's really inexpensive. You can experiment with 10, 20, 30 different chunking techniques, with different parameters, really, really quickly. We're finding that that's really important because, with this being a very new field, there's not necessarily a playbook out there yet, and people really do need the ability to rapidly experiment. So, we're not just doing the data pre-processing. It also enables rapid experimentation with a lot of the stuff that impacts LLM results.

[0:23:47] SF: What are some of the common chunking strategies that people use?

[0:23:50] MR: Two of the common ones that we're seeing right now, from our standpoint, are, one, chunking based on structural and hierarchical information. That's one that we've been working with heavily. Another really interesting one that we've seen used heavily, and that we support, is similarity-based chunking. What that looks like is, you start constructing a chunk - in our case, we're adding elements to the chunk as we go along - and you essentially set a threshold. If the content switches dramatically enough between document elements, you say, "Hey, this content is not similar enough anymore. We're going to break off into a new chunk." We've seen a lot of success with that. Actually, we have an arXiv paper floating around out there that our ML research team put together. It's a really great and interesting paper that outlines some of the performance differences between the various chunking strategies, too.
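A minimal sketch of the element-based chunking Matt describes, using the open-source library's chunk_by_title function. The filename is hypothetical and the parameter values are illustrative, not recommendations:

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# The expensive step: raw document -> normalized elements.
elements = partition(filename="employee-handbook.pdf")

# The cheap, repeatable step: build chunks up from the elements, starting a
# new chunk at each title/heading so a chunk stays within one section.
chunks = chunk_by_title(
    elements,
    max_characters=1000,             # hard cap on chunk size
    combine_text_under_n_chars=200,  # fold very small sections into neighbors
)

# Re-chunking with different parameters only repeats this last, fast step.
```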
[0:24:44] SF: Okay. Right now, there's a lot of hype around long context windows, and it feels like that's the new hotness for 2024 in the world of LLMs, and people ask, is RAG dead? Because now you have these long context windows and so forth. Does that have any impact? Let's say we have, essentially, an unlimited context window, or really, really big context windows. Does that impact at all the strategies that you might use from a chunking perspective?

[0:25:15] CW: I mean, it probably would, to some extent. I think where it gets interesting for RAG is - there are, of course, the fundamentals of RAG: retrieve K documents and present them to an LLM. The more documents that you're able to present to the LLM, given the long context window, maybe the less you have to really fret about getting re-ranking right after retrieval. I think the longer context windows, if anything, are just going to supercharge RAG applications in the future, because in some sense, it'll be even less old-fashioned data engineering and trial and error to go through if you can lean a bit more on the LLM and say, "Look, here's the 500 documents. Just give me an answer." But at the same time, I don't think infinite context windows, as a thing, are something that's going to fundamentally disrupt RAG in terms of, say, the next three years. If you have tens of millions of documents to work with, you're not going to be able to throw them all at a single LLM. We can talk about fine-tuning LLMs, which is a separate discussion, of course. But that's versus, in one shot, "Here's all the data, give me an answer."

[0:26:40] SF: Yes. I mean, even from a practical computer science perspective, how realistic would it be, even if you had an infinite context window, to grab all your documents and shove them all in? I guess it depends on what the use case is. Maybe an agent, where you don't need a real-time response, is different? But if you're actually doing something that's more of a copilot, chatbot experience, you're not going to be able to put that much information across the wire to your LLM and get a response back in a reasonable amount of time.

[0:27:09] MR: Yes. You would have to do that every time you ran a prompt, too, and that obviously gets prohibitively expensive pretty fast. Here, I'm talking about, "Hey, here's my Fortune 500 company, and I've got 500,000 PowerPoint slides," or something like that. Even beyond some of the other performance considerations, there are still lost-in-the-middle type issues with long context windows and that sort of thing. But even if you solve those, loading all of your organization's knowledge base into a prompt every time you want to ask a question just gets really infeasible really fast.

[0:27:45] SF: Yes. It seems like a very hacky solution, at the very least, if that was even possible. If we weren't using something like Unstructured, and we needed to build our RAG application, what are the current pieces of technology people are cobbling together to solve that problem, to essentially get their data RAG-ready?

[0:28:08] MR: It's really a mix of stuff. From our perspective, what we do really well is we handle all of the different document types, and we do all of this end to end.
If you're not using a tool like Unstructured, you're typically cobbling together a combination of cloud document processing solutions, like AWS Textract or something along those lines. But that's not going to do every document type. So, it's, "Hey, I've got to pull out the python-docx library, or maybe I've got to pull out lxml for XML files," and cobble all of that stuff together. At that point, you're also still coming up with your own schema for normalizing all of that sort of stuff. We also handle the loading into the vector database, and the connections to the upstream data sources, and that sort of thing. So, if you wanted to do this by hand, the tools are out there. It's just a pain in the neck, and nobody wants to spend all their time doing that. At the end of the day, people are a lot more interested in working on the LLM app, which, if you have domain expertise, is where your time is going to be better spent anyway. Yes, we do a lot of that for you.

Really, the origin of this company is, like Crag mentioned, that we had been working at an NLP company before this. What we would notice is, every time we would go on a client engagement, they'd be really excited to do the NLP, and we'd be like, "Hey, we're excited too. Where are your documents?" It'd be a whole huge mess. We had written these one-off solutions to this, very tediously, so many times that it just seemed like a really great idea to put together a tool to tie it all together.

[0:29:41] SF: Yes. It makes a ton of sense. The timing is really good too, given the state of the world. There are different flavors of Unstructured in terms of how people use this. There's essentially a SaaS-based API, and it looks like you have an enterprise-level version of the product. Can you talk a little bit about that? What are the different usage models?

[0:30:04] CW: Yes. To talk about the usage models, I guess we think of there really being three use cases, or three tiers of users. So, the open source community, of course, is active, and the open source can be great for proofs of concept, whether for students, or data scientists, or data engineers at an enterprise, like a Fortune 500 company. However, once you kick the tires and validate things are working, you may want a little bit better performance. It's like, okay, this seems to be working, but this table was not quite what I'd expect, or it's missing - this output just doesn't feel quite right. For the API itself, we do have proprietary models that basically have an edge in performance over the open-source offering. So, for data engineers that are really comfortable running their own pipelines, or already have existing pipelines and are doing RAG entirely themselves, the API is a great way to go.

We believe most users are not going to want to have to manage these data pipelines themselves to get their data RAG-ready. That's the platform product that's in private beta, which is basically ingest ETL for LLMs. It has an admin UI. You basically configure your sources, and the different parameters we've been talking about regarding chunking and embedding. You set up any number of workflows that you want to. They could be across different data sources, going to different locations, or going to the same location. Then, that's basically all taken care of for you, and you can just worry about building the end-user application. The data is already ready for retrieval.
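For the API tier Crag mentions, a hosted partitioning endpoint is typically called over plain HTTP with the file attached. This is only a hedged sketch: the URL, header name, and parameter name below are placeholders to be replaced with the values from Unstructured's current documentation, not the actual endpoint details:

```python
import requests

# Placeholder values - substitute the real endpoint and header from the docs.
API_URL = "https://example-unstructured-endpoint/partition"
API_KEY = "your-api-key"

with open("quarterly-report.pdf", "rb") as f:
    response = requests.post(
        API_URL,
        headers={"x-api-key": API_KEY},           # header name is an assumption
        files={"files": ("quarterly-report.pdf", f)},
        data={"strategy": "hi_res"},               # parameter name is an assumption
    )

response.raise_for_status()
# The response body is the same list-of-elements JSON the library produces.
elements = response.json()
print(len(elements), "elements")
```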
[0:31:49] SF: In that scenario, I basically have my source data, and then I'm pointing it at a vector database that I'm in control of, and I tell you the input and output, and everything works like magic?

[0:31:58] CW: So, the source would more likely be something like a Google Drive, an S3 bucket, a SharePoint. It could be a traditional database as well. The outputs, or the destinations that you also configure, would be something more like one of the standard vector DBs, like Weaviate, Mongo, Chroma, Pinecone. Or if you just want to store the JSON outputs, that's also a legitimate use case - it could just be a blob storage location, like an S3 bucket.

[0:32:26] SF: So, I guess, in the conventional structured data world, this would be a little bit similar to a Fivetran or something like that, where you're connecting it to a bunch of structured data sources in order to dump into your warehouse?

[0:32:37] CW: Yes, that's exactly right. We're laser-focused on the unstructured use case. There are fundamentally some differences in the abstraction layers compared to going from, say, structured data to structured data. We're just very much focused on getting that right for our users, but also not exploding the use case, or exploding the complexity, for the users that need to operate this as well.

[0:33:03] SF: Yes, makes sense. When it comes to document processing, we've talked a little bit about some of the challenges with tables and images. What are some of the other things that make processing and cleaning this type of data particularly difficult?

[0:33:18] CW: There's all sorts of stuff that we could run down the list on, everything from encodings. We've got people processing documents in all sorts of different languages, like, "Hey, I tried to process it with this character set, and it came through weird," that sort of thing. One of the ones in Word documents specifically - and as far as I know, this is only a Word document thing - is you can actually make tables that aren't really tables. You can make them T-shaped and that sort of thing. We just recently worked through an issue where, if you would just slightly wiggle the cell length with your mouse, it would mess things up, and it was no longer an actual table. So, it would break things in our Word document pipeline. We probably have a thousand things like that, just little things that you would never think about if you weren't pre-processing documents every day. Of course, you have the more, I guess, interesting modeling problems. How do you extract and transform tables? How do you do forms well? That sort of thing. But the file transformation part really is, in itself, a full-time job. There are so many nuances to all these different document types. When you're doing all of that, you just run into all sorts of stuff all the time.

[0:34:33] SF: Is there a big difference when it comes to performance around accuracy and being able to pull out some of this information based on language, like English versus French versus Chinese or something like that?

[0:34:47] MR: I'd say we're extractive, and so we're not as sensitive to language differences, for example, as a downstream NLP model would be. I would say probably the biggest challenge for us with non-English languages is really languages where there are formatting differences.
So, if it's a top-to-bottom language, or a right-to-left language instead of left-to-right, or if it's just a document set from somewhere else where the format of a document will typically be set up differently - those formatting-type differences tend to trip us up more than language differences do. Language differences could potentially trip us up a little bit on some of the element detection-type stuff, too. But it's really format, more than language, that we spend more time dealing with.

[0:35:32] SF: How does testing work with this? How do you know, when you make changes, that you're actually moving the product in a better direction?

[0:35:40] CW: Yes, that's a great question. There are multiple layers in which we're ensuring that we are moving in the right direction. It begins with the open source. Just to add to what Matt was saying, we also get contributions for bug fixes in the open source that carry through into the product. With the open source, of course, you have CI that runs hundreds of real documents through the library, to ensure, one, that it's not breaking. But we also have a low bar - or, I should take that back, a small number of documents - to ensure that our accuracy scores are not regressing, and that can reflect reading order, or that we're capturing all the words in the document.

Internally, for the product, we're doing a bit more. We're tracking against a dataset that we think is more enterprise-y, in terms of having a large number of PDFs represented, but not only PDFs, and a good number of tables. So basically, our CI will break across those hundreds of documents if our score regresses. That's another thing we're doing. Then finally, before cutting a release, we check not just those hundreds, but thousands of documents, because we support 25 different document types out of the box, and it's a lot of surface area. We were talking about PDFs before - there are the subtypes of PDFs to think about, and different languages that we've also touched on. So, we do have extra tooling there as well, to ensure the quality of our releases.

[0:37:27] SF: You said you support 25 different document types. Is slides one of those?

[0:37:31] CW: Yes, PowerPoints. PowerPoints, whether they're the old, really old-fashioned ones, like 2007, or the more modern PPTX. Or Google Slides as well. If you're processing through Google Drive, the Google Slides basically get converted to the PowerPoint format and then get passed through the pipeline.

[0:37:51] SF: Okay. Then, with these documents that you're using as part of the testing in the CI, do you essentially know what the ideal result is, because some person or group of people has done the work to hand-normalize those, and then you're comparing against the essentially human-generated ones?

[0:38:09] CW: Yes, that's right. We have a ground truth document set that we had human-labeled. That's what we're making our comparisons against as we're running through CI.

[0:38:19] SF: How long does that take - going through those hundreds of documents as part of the CI? Is that something that slows down the build process in any significant way?

[0:38:29] CW: I would say not in any very significant way. I mean, CI can always be a little bit faster, but there's nothing stopping people from working concurrently and then automatically merging their PRs. We're using GitHub internally.
Once they pass the tests - and there's no human check on, "Oh, you've regressed on the scores," that would just fail CI - no one has to babysit to make sure that it's going to be merged.

[0:38:58] SF: In the actual production system, where can scale start to become an issue or challenge? Where do you have to think about, essentially, your scalability issues?

[0:39:06] CW: I mean, there are two levels. We have two products, right? We have the API and the platform, so both. I think the API is a much simpler thing. It's just a single endpoint. In terms of supporting that, it's mainly supporting auto-scaling, and also being able to quickly stand up additional clusters if needed as volume increases. So, we feel really confident about being able to handle any scaling challenges on the API side. The platform has a similar issue of processing a million documents and how those progress through the pipeline, but we have auto-scaling steps along the way. It's all taking place in Kubernetes with KEDA autoscalers. You can think of it as the linear progression we've described. It can be a bit more complicated than that, but we have autoscalers ready to go as there are documents available to process.

[0:40:07] SF: Then, are you using some sort of queueing system, essentially, when these requests come in? I'm assuming, with a batch of requests, they give you an endpoint, you're going to hold on to that and then go through some process? How does failover work if you can't reach the documents?

[0:40:22] CW: To your point, a document can fail at any stage - maybe the embedding endpoint just disappears, stops working for whatever reason. A key part of this is transparency into where documents are in a current workflow, in a current job, and the status of those documents along the way. Under the hood, there's some queuing for jobs and workflows themselves. But we're also keeping track of the state of all the documents in a central place for easy access. Without going into too much detail, it would basically be per workflow. So, it's easy to bubble up, but it can also scale to thousands, or tens of thousands or more, of workflows.

[0:41:11] SF: If there is a failure that happens, for whatever reason, do I have to kickstart that from scratch again, or can I pick up where things may have been left off?

[0:41:22] CW: Also a great question. So, there's a certain amount of backoff and retry baked in, of course, as would be standard. If you re-run any job, we can see the state of each document, so that if it's processed already, it's basically a no-op. We just go back and pick up where we left off.

[0:41:42] SF: Yes. I guess you probably know which docs you've already processed, so you don't need to process them again, essentially. You can skip over those.

[0:41:48] CW: Yes, exactly. Like Matt was saying before, the most expensive step, especially for images and PDFs, is that initial transform of going from the raw document to the structured JSON. So, if you've already completed that step, and you broke further down the line, well, on the retry, we just pick up further down the line. We can leverage the outputs that already exist and not blow a bunch of compute reprocessing that document.

[0:42:17] SF: Right. That makes sense. As we start to wrap up, is there anything else you'd like to share?

[0:42:24] MR: Nothing else for me.
Just really appreciate you having us on the podcast here. Yes, really, really great time.

[0:42:28] SF: Great. Anything from you, Crag?

[0:42:30] CW: Thanks so much for having us on. Maybe a quick plug for our Slack. We have an open Slack community. It's really easy to find - just look for the Unstructured GitHub project or the unstructured.io website. We were talking about the variety of document types, and you really get a sense for it from the questions people are throwing out there. So, it's fun to be there observing anyway, but we'd like to welcome the conversation and the community. So, if you're interested, please join the Slack.

[0:43:02] SF: Yes. I get people all the time - engineers I've worked with in the past, or young engineers - asking, "How do I get essentially my first job in the AI, LLM world?" There are so many of these great open source projects and communities that you can just jump into, and that's really the easiest way: get in there, get your hands dirty, and start contributing.

[0:43:26] CW: Completely. I mean, I think the Unstructured open-source library is a great example of that. Just look through the GitHub issues. Some are not terribly complicated and are very approachable for engineers that are early in their career. So, absolutely a great place for engineers to get into the field.

[0:43:46] MR: Yes. We try to mark some of those good first issues for people trying to do that, too. If you're interested, check out our GitHub, look for a fun issue to work on, and we'd be happy to review it.

[0:43:55] SF: Awesome. Well, Crag and Matt, thanks so much for being here. I really enjoyed it.

[0:44:00] MR: Thank you.

[0:44:00] CW: Thanks.

[END]

(c) 2024 Software Engineering Daily