Tinybird is making OLAP feel like Postgres
These days, it’s easy to relegate “analytics” to the world of reporting and business intelligence, where batch ETL, complex queries, and scheduled reports are powered by the dominant cloud data warehouses.
But a new trend is emerging.
“Realtime analytics” is a new category of solution to a growing class of problem: product companies want to build data analytics back into the products themselves, and backend developers need a better way to serve low-latency APIs on top of big data queries that take seconds or minutes to run in a typical cloud data warehouse.
New databases have been built to help solve this class of problem. There are open source options like Druid, Pinot, and ClickHouse, and commercial newcomers like Rockset. Cloud giants have their own offerings as well (consider the combination of Kinesis, S3, Glue, and Athena on AWS).
But while faster databases are necessary for low-latency analytics, they’re not the whole picture.
Developers don’t just need a database; they need a better way to work with large amounts of data. The transition from row-oriented OLTP databases to these newer column-oriented OLAP databases can be uncomfortable for developers used to working on mature open-source databases with a strong community of support, many managed offerings, and a wide ecosystem of integrations. These newer technologies often lack the more robust trappings that databases like Postgres or MySQL have, and that comes at a cost to developer experience and productivity. On top of that, developers can’t start their work until data teams set up the heavy infrastructure required to host the massive datasets that drive teams toward OLAP in the first place.
Tinybird is trying to change that. Fundamentally a platform for developer productivity, Tinybird gives backend developers the speed and performance of an OLAP database (they use ClickHouse as their primary datastore) but with a development experience that’s more akin to Heroku Postgres.
We’ll look at some real-world applications for Tinybird, but before we do, let’s take a look at how Tinybird approaches the problem of building high-performance applications on top of large and streaming data sets.
Tinybird offers a fully managed database as a service
Developers choose a data store for their application or service based primarily on ease of use. Performance is a prerequisite, yes, but when performance is comparable, developers will choose the tools that make them more productive.
Tinybird’s most immediate value is as a serverless ClickHouse. It offers a generous free tier (1000 requests per day, unlimited processing, and up to 10 GB of storage), and pro-level customers can scale without constraints on storage, compute, or request frequency.
Most Tinybird users will operate on a multi-tenant cluster where data is fully segregated, but enterprise customers can get access to discrete deployments with dedicated clusters.
Further, Tinybird wants its users to get the most out of ClickHouse as cost-effectively as possible. Depending on which service agreement you choose, Tinybird can provide a dedicated, named Data Engineer and an open Slack channel for support and optimization. Even non-enterprise level customers can get the same treatment as a part of a “jumpstart” package.
Today the multi-tenant service is hosted in the US and Europe using GCP. Tinybird recognizes that customers do have preferred cloud providers, and they’ve begun adding enterprise support for alternative cloud providers and expanded offerings for data locality. With this approach, enterprise customers can avoid unnecessary data egress for better security and cost control.
Within the platform, developers can segment their work through collaborative Workspaces. These can be used to divide production and development environments, or to separate access based on role. Tinybird Data Sources can be shared across Workspaces in the same cluster, so different teams with different roles in the data journey can work independently on shared infrastructure.
OLAP performance for freshness, concurrency, and low-latency
Production APIs generally need to have three things to meet the demands of product users expecting a good experience:
- They should serve responses based on the freshest data.
- They should be able to handle many concurrent users.
- They should respond in milliseconds.
For Tinybird, this meant choosing a primary data store that could handle high-frequency ingestion, perform real-time aggregations and materializations, and serve many requests exceptionally quickly without compromising reliability and performance.
With these requirements in mind, the founding engineers at Tinybird chose ClickHouse. While still relatively young (the first open source release was in 2016), ClickHouse is already battle-proven with household brands, and it meets the three requirements above. Queries are exceptionally fast, it supports high-frequency inserts thanks to its use of a variant of the Log-Structured Merge (LSM) tree, and its columnar storage makes it particularly adept at running complex analytical queries with filters and aggregations in realtime.
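To make the columnar point concrete, here is a minimal Python sketch contrasting a row-oriented layout with a column-oriented one. It is a toy illustration only, not how ClickHouse stores data on disk, and the table and field names are made up for the example.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"user_id": 1, "country": "ES", "amount": 10},
    {"user_id": 2, "country": "US", "amount": 25},
    {"user_id": 3, "country": "ES", "amount": 5},
]

# Column-oriented layout: each column is stored as its own array.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_for_country(cols: dict, country: str) -> int:
    """Sum `amount` for one country, reading only the two columns involved."""
    return sum(
        amount
        for value, amount in zip(cols["country"], cols["amount"])
        if value == country
    )


es_total = total_for_country(columns, "ES")
```

The filter and the aggregation touch only the `country` and `amount` arrays and never read `user_id`. On wide tables with many columns, that is the property that lets a columnar engine scan far less data per analytical query.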
Engineers at Tinybird actively contribute to the ClickHouse code base, but the platform they’ve built also wraps the core engine with a raft of features that provide value to developers seeking speed and productivity.
Making ClickHouse comfortable
The Tinybird founders like to joke that ClickHouse is like an F1 car. If you have a big team, many hours behind the wheel, and a lot of resources, you can safely drive it very fast.
But if you’re like most, you’re looking for an “everyday driver” that can still match the performance of an F1 machine. Car buffs will think of the McLaren Senna or the Ferrari F50. Still insanely fast, but with leather seats and air conditioning.
As such, Tinybird focuses heavily on improving the developer experience with ClickHouse by softening some of its rougher edges.
For example, while ClickHouse is known for its ability to handle an insanely high rate of insertions, it is notoriously bad at handling duplicates and change data capture. It’s not a transactional database, so it’s not designed to delete and replace rows as records get updated. Tinybird has invested in making it easier for developers to deduplicate data, and they provide support and documentation to guide users through best practices for making the transition from transactional to analytical workflows. They’ve even started an open source ClickHouse Knowledge Base to galvanize the community around these concepts and other performance tips and tricks.
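A common way to handle updates in an append-only store is to insert every change as a new row and keep only the latest version of each key when reading or merging. This is similar in spirit to ClickHouse’s ReplacingMergeTree engine. The sketch below shows that idea in plain Python; it is not ClickHouse or Tinybird code, and the field names (`user_id`, `plan`, `updated_at`) are hypothetical.

```python
from typing import Iterable


def latest_per_key(rows: Iterable[dict], key: str, version: str) -> list:
    """Keep only the highest-version row for each key (append-only dedup)."""
    latest: dict = {}
    for row in rows:
        current = latest.get(row[key])
        # A row replaces the stored one only if its version is at least as new.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical change-data-capture events: every update is appended as a new row.
events = [
    {"user_id": 1, "plan": "free", "updated_at": 1},
    {"user_id": 2, "plan": "free", "updated_at": 1},
    {"user_id": 1, "plan": "pro", "updated_at": 2},
]

deduplicated = latest_per_key(events, key="user_id", version="updated_at")
```

Nothing is deleted or rewritten in place: the old row for user 1 is still in `events`, and the deduplicated view is derived from it. That is the main mental shift when moving from transactional to analytical workflows.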
It still uses SQL
One welcome carryover from traditional OLTP databases that ClickHouse brings is the primacy of SQL as a query language. There isn’t a backend engineer in the world who doesn’t have some level of SQL knowledge, and Tinybird leverages this fact with ClickHouse.
But pure SQL alone isn’t enough. Not only does Tinybird support the ClickHouse flavor of SQL – which includes specialized functions for aggregations, filtering, and time series – but Tinybird also includes a templating language to create even more flexibility in constructing queries.
Originally crafted to support dynamic query parameters on the endpoints that developers can publish from the SQL queries they create in Tinybird, the templating language supports a host of other features including local variable definitions, dynamic aggregations, filter arrays, SELECT if statements, variable column selection, and more. This gives backend developers the power to build APIs that extend beyond the limits of vanilla SQL.
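The core idea behind dynamic query parameters is that the shape of the SQL depends on which parameters a request supplies. The sketch below illustrates that in plain Python against an in-memory SQLite database. It does not use Tinybird’s template syntax, and the `events` table and the `country` and `since` parameters are assumptions made up for the example.

```python
import sqlite3


def build_query(params: dict) -> tuple:
    """Assemble SQL whose filters depend on which request parameters are present."""
    sql = "SELECT country, COUNT(*) AS visits FROM events"
    clauses, values = [], []
    if "country" in params:  # optional filter, added only when requested
        clauses.append("country = ?")
        values.append(params["country"])
    if "since" in params:  # optional lower bound on the timestamp
        clauses.append("ts >= ?")
        values.append(params["since"])
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " GROUP BY country ORDER BY country"
    return sql, values


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [("ES", 1), ("ES", 5), ("US", 7)]
)

# No parameters: the query aggregates over everything.
all_rows = conn.execute(*build_query({})).fetchall()
# Two parameters: both filters are spliced into the WHERE clause.
recent_es = conn.execute(*build_query({"country": "ES", "since": 3})).fetchall()
```

Parameter values are bound with placeholders rather than concatenated into the SQL string, which is the safe default whenever query text is assembled from request input.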
Additionally, any data analyst or engineer who has written queries on a data warehouse knows that the queries can become quite complex, featuring nested CTEs and subqueries that can make it difficult to isolate performance issues. To address this challenge, Tinybird created the concept of Pipes.
Pipes are how you query data in Tinybird. Instead of building one complex spaghetti query in a single editor, you develop a query as a chain of composable SQL nodes, and every node you write in a Pipe can query over the results from prior nodes. This modular, reusable approach makes it easier to understand queries and their logic.
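One way to picture a chain of SQL nodes is as a series of named steps where each step selects from the ones before it. The Python sketch below compiles such a chain into a single query using CTEs and runs it on SQLite. It is an illustration of the concept only, not Tinybird’s implementation; `compile_pipe`, the node names, and the `orders` table are all hypothetical.

```python
import sqlite3


def compile_pipe(nodes: list) -> str:
    """Turn an ordered list of (name, sql) nodes into one query.

    Every node but the last becomes a CTE, so later nodes can select
    from earlier ones by name. The last node produces the final result.
    """
    *upstream, (_, final_sql) = nodes
    if not upstream:
        return final_sql
    ctes = ", ".join(f"{name} AS ({sql})" for name, sql in upstream)
    return f"WITH {ctes} {final_sql}"


pipe = [
    ("paid_orders", "SELECT customer, amount FROM orders WHERE status = 'paid'"),
    ("revenue", "SELECT customer, SUM(amount) AS total FROM paid_orders GROUP BY customer"),
    ("top_customer", "SELECT customer, total FROM revenue ORDER BY total DESC LIMIT 1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ana", 30, "paid"), ("bob", 50, "refunded"), ("bob", 20, "paid"), ("ana", 5, "paid")],
)

result = conn.execute(compile_pipe(pipe)).fetchall()
```

Because each node is a small, named query, it can be run and inspected on its own, which is what makes it practical to find the step where a slow or incorrect result is introduced.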
On top of that, Tinybird offers performance metrics for every node and API endpoint, so developers can easily detect and remedy speed-killing SQL.
A unified platform to query streams and dimensions
Of course, making SQL more exciting isn’t all that useful without data to query. To that end, Tinybird has been especially focused on providing adapters and native connectors to ingest data from many different data sources. These support commonly used formats, mechanisms, and tools for both batch and streaming ingestion.
For example, Tinybird’s native Kafka connector makes it trivial to set up cost-effective, persistent storage for data published on Kafka topics.
But Tinybird also recognized that Kafka, like ClickHouse, can be a pain to manage. Many development teams would prefer not to use Kafka to manage streaming data. As such, the most common way that developers send events to Tinybird is using a high-frequency ingestion REST API that Tinybird created to accept a single row or thousands in a simple JSON payload. Because it’s just an HTTP request, developers on every part of the code base, whether frontend or backend, can easily generate and stream events data to Tinybird. It can even support data ingestion via a webhook.
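As a rough sketch of what sending events over HTTP can look like, the Python below batches rows into a newline-delimited JSON body and wraps them in a POST request. The URL, the `page_views` Data Source name, the token, and the event fields are placeholders assumed for illustration, as is the newline-delimited body format; consult Tinybird’s documentation for the actual endpoint and payload details.

```python
import json
import urllib.request

# Assumptions for illustration: the host, path, Data Source name, and token
# below are placeholders, not values taken from this article.
EVENTS_URL = "https://api.tinybird.co/v0/events?name=page_views"
TOKEN = "YOUR_TOKEN"


def build_events_request(rows: list) -> urllib.request.Request:
    """Encode rows as newline-delimited JSON and wrap them in a POST request."""
    body = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    return urllib.request.Request(
        EVENTS_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )


request = build_events_request([
    {"timestamp": "2022-06-01T12:00:00Z", "path": "/pricing", "user_id": 42},
    {"timestamp": "2022-06-01T12:00:01Z", "path": "/docs", "user_id": 7},
])
# urllib.request.urlopen(request) would send the batch; it is left out here
# so the sketch runs without network access or credentials.
```

The same function handles one row or thousands, since the batch size only changes the number of lines in the body.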
For dimensional data, Tinybird also supports ingestion from files – local, remote, or in S3 Buckets/Google Storage – or through integrations with a data warehouse.
Tinybird recognizes the value of being able to query streaming data and enrich it with dimensional data from diverse sources, and they expect to continue growing their support of native connectors.
A built-in publication layer
What makes Tinybird particularly special, however, is its abstraction of the publication layer. Developers in Tinybird can instantly publish their SQL queries as RESTful API endpoints in a single click or CLI command.
As mentioned above, the templating language lets you introduce dynamic query parameters to your endpoints, and row-level security filters implemented as SQL expressions let you easily and programmatically generate authorization tokens for individual end users.
Every API generated in this manner automatically includes documentation that follows the OpenAPI 3.0 specification.
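From the consumer’s side, a published endpoint is just a URL with query parameters. The sketch below builds such a URL in Python. The host, the path pattern, the `top_pages` pipe name, and the `date_from` and `limit` parameters are assumptions for illustration; the endpoint’s generated documentation is the authoritative source for its real URL and parameters.

```python
from urllib.parse import urlencode

# Assumed host and path pattern; replace with the values shown in the
# endpoint's generated documentation.
API_HOST = "https://api.tinybird.co"


def endpoint_url(pipe_name: str, token: str, **params) -> str:
    """Build the request URL for a published endpoint with dynamic parameters."""
    query = urlencode({"token": token, **params})
    return f"{API_HOST}/v0/pipes/{pipe_name}.json?{query}"


# Hypothetical endpoint and parameters.
url = endpoint_url("top_pages", "READ_TOKEN", date_from="2022-06-01", limit=10)
```

Any HTTP client, frontend or backend, can then fetch this URL and receive the query result as JSON.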
Tinybird also includes an observability layer on both Data Sources and published endpoints, and in true Tinybird fashion these are implemented themselves as Tinybird Data Sources, so they can be queried over and published as APIs in the same manner as the data you ingest through their native connectors.
Ideal use cases
Tinybird is focused on helping developers build applications on top of large datasets. As such, they find themselves supporting a wide range of use cases. Any scenario that involves publishing endpoints on streaming data – especially where that data must be transformed, materialized, or enriched with dimensional data on the fly – is a good fit. That could range from ITOps, to dynamically personalizing user experiences, to realtime stock or crypto trading, to log analytics, and much more.
While the platform is geared toward developers, even those who span that developer/business boundary are likely to get value from the service. After all, SQL is a commonly understood language, and REST-based tooling to help turn APIs into actionable data or dashboard views is more accessible these days.
Optimization will take more engineering expertise, particularly when complex data models are involved or when you are trying to minimize the cost of running analytics over terabytes of data. Tinybird’s support team can help here, even for businesses or individual developers without data engineering support.
Tinybird recognizes the challenge of shifting from a transactional, batch-centric mindset to an analytical, real-time approach. But they’re doing their best to make it incredibly comfortable for developers of all stripes to build faster analytical APIs – and with faster development cycles.