Cloud Database Performance: Scaling MongoDB Atlas

Article Friday, December 14 2018

The database, the cornerstone of many software products, is increasingly hosted in the cloud. The previous MongoDB Atlas: Database as a Service article was focused on the challenges of migrating to the cloud from an on-prem architecture and how a Database-as-a-Service solution like MongoDB Atlas can immensely alleviate these problems. This article will focus specifically on database performance on cloud and the performance challenges that come with cloud databases.

SQL vs. NoSQL

The History of NoSQL

A deeper dive into SQL and NoSQL solutions and a comparison of their performance is necessary. The relational model, typically known as the SQL model, has been around since the 1970s. This model relies on tables and operations following from relational algebra principles to query the data stored in the tables. The relational model is quite fit for databases stored on a single node, since most operations require columns from different tables, and this leads to joins of tables.

However, as data started growing, SQL has had scaling challenges. Distributing the data was the solution with different nodes holding different pieces of the data. Such distributed database systems, based on relational database management system (RDBM), were an operational and performance pain.

SQL databases usually scale vertically. When made to scale horizontally, multiple nodes have to act as a single server holding the database, multiple databases spread across the nodes have to communicate with each other, and additional code had to be written to aggregate the results from these different databases. Distributed queries with join operations on tables residing on different nodes becomes very slow and scaling becomes a performance bottleneck.

When Eric Brewer introduced the CAP theorem in 1998, stating that a networked shared data system cannot have all three desired properties of Consistency, high Availability, and Partition tolerance at the same time, the distributed database systems relying on the relational model took a hit.

While CAP theorem would be interpreted later as having to make certain trade-offs between consistency and availability for the specific application, the theorem still led to developers thinking outside the box and the creation of new architectures. This burst of innovation, coupled with concerns about scalability, geo-distribution, and agile methodology, rose to the creation of NoSQL solutions and they started proliferating in the late 2000s.

While there are many types of NoSQL data models, this article will focus on document databases, the model used by MongoDB.

How NoSQL Databases Work

NoSQL databases work great for distributed systems. The document-based nature of NoSQL database systems makes it easy to scale out because of breakaway from foreign keys and interdependencies between these tables in SQL database systems, which greatly limits how SQL databases scale. While it’s possible to make use of the techniques mostly associated with schemaless structures in relational databases through custom fields and embedded structured data, these techniques do not go along well with SQL.

The denormalization inherent in NoSQL, meaning that information about a certain object is not spread across multiple tables and rows, but is instead kept within a record, eliminates multiple reads and writes when reading or writing data. This is because all the data exists in one location and not across multiple tables. This brings a huge performance boost, and a great simplicity from the developer’s perspective, as developers think of data as objects rather than relationships in tables, which denormalization achieves.

MongoDB achieves horizontal scaling through multiple nodes with a mechanism called sharding. Sharding essentially divides a collection into multiple nodes and redirects incoming queries to the appropriate node. MongoDB also uses replica sets to achieve redundancy and high availability. In this model, data is replicated to multiple nodes, at least three, to achieve redundancy and successful failover practices in the case of a node failure. Each shard in the MongoDB architecture can be a replica set which ensures high availability for the whole collection.

Scaling Query Performance

The most fundamental construct that improves performance is indexes. In MongoDB, without an index, the database has to perform a scan across all its documents to find the matching documents, known as a collection scan or a table scan. This operation for a single lookup will grow linearly with the number of documents in the collection and will become inefficient. The solution lies in indexes.

Indexes hold the data on the indexed field in an easy-to-traverse, ordered form. When a query is executed, if an appropriate index exists, the database can operate on the indexes rather than doing a scan through all documents to find the matching subset of documents.

MongoDB already has a primary index on its _id field, and the developers can add additional indexes of multiple forms. MongoDB supports compound indexes, allowing indexing on multiple fields of a document, which corresponds to more specific queries searching on these multiple fields much faster. Multikey, geospatial, or full-text indexes are all supported in MongoDB.

Why not index on all fields, if it improves query performance? Indexing incurs a write penalty. Every time a write operation is performed on a document with an indexed field the index has to be updated as well. With multiple unnecessary indexes, the number of I/O operations in the form of writes to the disk becomes a performance bottleneck.

Finding the right indexes is not an easy task. If a collection is updated frequently for a period of time, having too many indexes slows the database down. If the same collection becomes read heavy after a while, the lack of indexes make for longer response times. This requires heavy monitoring and dynamic changes to the model.

Index Recommendations and Performance Advisor

MongoDB Atlas, the fully managed database-as-a-service solution of MongoDB, gives index recommendations based on query performances. Atlas’ Performance Advisor monitors the slow-running queries, which is a metric that can be defined by the developers — e.g., a query whose execution takes 150 ms or more — and makes suggestions on adding indexes to the collection, even providing the necessary code snippet to create the index.

Building an index on an existing, populated collection can be challenging. MongoDB offers a run-in-background option for indexing populated collections that does not affect the performance or the availability of the database.

Unused indexes create additional performance overhead and detecting when an index is no longer required is critical. Gathering statistics for each index can be hard if done manually, but MongoDB allows index statistics aggregation to monitor the indexes. These statistics are also available through MongoDB Compass, the intuitive UI solution for MongoDB.

Other than indexes, having optimized queries is a necessary step in improving query performance of a database. Queries can be optimized by developers when writing specific queries that a certain user action fires. However, optimizing queries and analyzing whether a query is optimized usually requires a lot of manual work.

The considerations when optimizing a query include the used indexes, query execution time, number of documents read, and number of documents returned. MongoDB optimizes queries automatically. By periodically running different query plans and keeping the empirical results as a cached query plan, MongoDB determines which indexes to use to optimize performance.

Scaling Databases Globally

Scaling your cloud database as your business grows without impacting performance is hard. If users from all around the world are using a product, the performance across these users should be consistent with stable performance and low latency.

What can be done? One option is deploying different databases for multiple regions around the world, in different datacenters, and replicating data between them as necessary. However, this connection is financially expensive and operationally hard. From a consistency point of view, it’s not easy to keep the databases in sync.

This is where a Database-as-a-Service solution can really make an impact on your business and operations. MongoDB Atlas has a new feature called Global Clusters to enable low-latency reads and writes from anywhere. With Global Clusters, local read and write operations are faster since the primary nodes for a region are located nearby. With multiple zones around the world, each zone can keep a read-only replica of the other zones to ensure low-latency reads from other regions as well.

These read-only replicas from other zones are fully configurable with MongoDB Atlas, but can also be automatically deployed for every region. The added benefit of configuration is ensuring data residency and compliance with data protection laws.

Monitoring Performance

An important feature in cloud databases is the need for monitoring. For cloud databases, numerous third-party solutions exist for logging and monitoring. However, these tools need to be configured to get a comprehensive overview of a cluster. MongoDB Atlas provides a comprehensive Metrics section to monitor and understand performance issues inside a deployed cluster.

Give MongoDB a try

A Database-as-a-Service solution offers a great relief both for companies that are just getting started and for companies that want to ease their database management efforts, without sacrificing performance. You can get started with MongoDB Atlas here, starting with a free account with 512 MB storage for experimentation purposes.

Gokhan Simsek

Eindhoven, The Netherlands

Gokhan is a computer science graduate, currently pursuing a MSc. degree in Data Science at Eindhoven University of Technology. He’s interested in big data, NLP, and machine learning.