Similarity Search with Jeff Johnson

• Goal: maintain query-time guarantees while performing approximate search with a learned metric.
• Main idea:
  – Learn a Mahalanobis distance parameterization.
  – Use it to shape the distribution from which random hash functions are selected.
• Result: LSH functions that preserve the learned metric, enabling approximate nearest-neighbor search with existing methods.
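As a rough sketch of the idea above (my illustration, not code from the episode): a learned Mahalanobis metric M can be factored as M = Lᵀ L, and the Mahalanobis distance between x and y then equals the Euclidean distance between Lx and Ly. So applying ordinary random-hyperplane LSH to the transformed points gives hash functions that respect the learned metric. The matrix `L` here is a random stand-in for a learned factor.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_bits = 4, 8
L = rng.standard_normal((d, d))      # stand-in for a learned factor of M = L^T L

# One random hyperplane per hash bit.
planes = rng.standard_normal((n_bits, d))

def hash_point(v, planes):
    """Sign-based LSH: one bit per random hyperplane."""
    return tuple(bool(b) for b in (planes @ v) > 0)

def mahalanobis_hash(x):
    # Transform into the learned space, then apply ordinary LSH;
    # points close under the learned metric tend to collide.
    return hash_point(L @ x, planes)

x = rng.standard_normal(d)
print(mahalanobis_hash(x))
```

The transform-then-hash trick works because the learned metric is exactly the Euclidean metric in the transformed space, so standard LSH guarantees carry over unchanged.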

Querying a search index for objects similar to a given object is a common problem. A user who has just read a great news article might want to read articles similar to it. A user who has just taken a picture of a dog might want to search for dog photos similar to it. In both of these cases, the query object is turned into a vector and compared to the vectors representing the objects in the search index.
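The core operation described above can be illustrated with a toy brute-force search (my sketch, not from the episode): embed objects as vectors, then rank the index by similarity to the query vector. Exact search like this scales linearly with index size, which is exactly the cost that approximate methods such as LSH are designed to avoid.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1000 indexed objects, each represented as a unit-length 64-d vector.
index = rng.standard_normal((1000, 64))
index /= np.linalg.norm(index, axis=1, keepdims=True)

def top_k(query, k=5):
    """Exact cosine-similarity search by brute force over the whole index."""
    q = query / np.linalg.norm(query)
    scores = index @ q               # cosine similarity against every object
    return np.argsort(-scores)[:k]   # indices of the k most similar objects

print(top_k(index[42]))              # object 42 is its own best match
```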

Facebook contains a lot of news articles and a lot of dog pictures. How do you index and query all that information efficiently? Much of that data is unlabeled. How can you use deep learning to classify entities and add more richness to the vectors?

Jeff Johnson is a research engineer at Facebook. He joins the show to discuss how similarity search works at scale, including how to represent the data and the tradeoffs such a search engine makes among speed, memory usage, and accuracy.

Notes: Jeff’s blog post about similarity search

Software Weekly

Subscribe to Software Weekly, a curated weekly newsletter featuring the best and newest from the software engineering community.