The Challenge of AI Model Evaluations with Ankur Goyal

Evaluations are critical for assessing the quality, performance, and effectiveness of software during development. Common evaluation methods, such as code reviews and automated testing, help identify bugs, ensure compliance with requirements, and measure software reliability.

However, evaluating LLMs presents unique challenges due to their complexity, versatility, and potential for unpredictable behavior.

Ankur Goyal is the CEO and Founder of Braintrust Data, which provides an end-to-end platform for AI application development with a focus on making LLM development robust and iterative. Ankur previously founded Impira, which was acquired by Figma, and he later ran the AI team at Figma. Ankur joins the show to talk about Braintrust and the unique challenges of developing evaluations in a non-deterministic context.

Sean’s been an academic, startup founder, and Googler. He has published works covering a wide range of topics from AI to quantum computing. Currently, Sean is an AI Entrepreneur in Residence at Confluent where he works on AI strategy and thought leadership. You can connect with Sean on LinkedIn.

Please click here to see the transcript of this episode.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Sponsors

This episode of Software Engineering Daily is brought to you by Capital One.

How does Capital One stack? It starts with applied research and leveraging data to build AI models. Their engineering teams use the power of the cloud and platform standardization and automation to embed AI solutions throughout the business. Real-time data at scale enables these proprietary AI solutions to help Capital One improve the financial lives of its customers. That’s technology at Capital One.

Learn more about how Capital One’s modern tech stack, data ecosystem, and application of AI/ML are central to the business by visiting www.capitalone.com/tech.
