Thumbtack Infrastructure with Nate Kupp
Thumbtack is a marketplace for real-world services. On Thumbtack, people get their house painted, their dog walked, and their furniture assembled. With 40,000 daily marketplace transactions, the company handles significant traffic.
On yesterday’s episode, we explored how one aspect of Thumbtack’s marketplace recently changed, going from asynchronous matching to synchronous “instant” matching. In this episode, we zoom out to the larger architecture of Thumbtack, and how the company has grown through its adoption of managed services from both AWS and Google Cloud.
The word “serverless” has a few definitions. In the context of today’s episode, serverless is all about managed services like Google BigQuery, Google Cloud PubSub, and Amazon ECS. The majority of infrastructure at Thumbtack is built using services that automatically scale up and down. Application deployment, data engineering, queueing, and databases are almost entirely handled by cloud providers.
For the most part, Thumbtack is a “serverless” company. And it makes sense–if you are building a high-volume marketplace, you are not in the business of keeping servers running. You are in the business of improving your matching algorithms, your user experience, and your overall architecture. Paying for lots of managed services is more expensive than running virtual machines–but Thumbtack saves money from not having to hire site reliability engineers.
Nate Kupp leads the technical infrastructure team, and we met at QCon in San Francisco to talk about how to architect a modern marketplace. This was my third time attending QCon and as always I was impressed by the quality of presentations and conversations I had there. They were also kind enough to set up some dedicated space for podcasters like myself.
The most widely used cloud provider is AWS, but more and more companies that come on the show are starting to use some of the managed services from Google. The great news for developers is that integration between these managed services is pretty easy.
At Thumbtack, the production infrastructure on AWS serves user requests. The log of transactions that occur get pushed from AWS to Google Cloud, where the data engineering occurs. On Google Cloud, the transaction records are queued in Cloud PubSub, a message queueing service. Those transactions are pulled off the queue and stored in BigQuery, a system for storage and querying of high volumes of data.
BigQuery is used as the data lake to pull from when orchestrating machine learning jobs. These machine learning jobs are run in Cloud Dataproc, a managed service that runs Apache Spark. After training a model in Google Cloud, that model is deployed on the AWS side, where it serves user traffic. On the Google Cloud side, the orchestration of these different managed services is done by Apache Airflow, an open source tool that is one of the few pieces of infrastructure that Thumbtack does have to manage themselves on Google Cloud.
To find out more about the Thumbtack infrastructure, check out the video of the talk Nate gave at QCon San Francisco, or check out the Thumbtack Engineering Blog.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.