Challenges of Multi-Cloud and Hybrid Monitoring
A single pane of glass that shows what is happening across an organization’s IT operations has been a long-standing goal for many organizations, and the goal makes a lot of sense. Without a clear end-to-end picture, it is hard to locate a problem, or to determine whether something happening upstream is creating significant knock-on effects.
When we have these high-level views, we are, of course, aggregating and abstracting details, so the ability to drill into the detail from a single view is an inherent requirement. The problem comes when we have distributed our solutions across multiple data centers, cloud regions, or even across multiple vendors.
The core of the challenge is that our monitoring through logs, metrics, and traces accounts for a significant amount of data, particularly when it isn’t compressed. An application that is chatty with its logs or hasn’t tuned its logging configuration can easily generate more log content than the actual transactional data. The only reason we don’t notice it is that logs are generally not consolidated, and log data is purged.
When it comes to handling monitoring in a distributed arrangement, consolidating our logs potentially means egressing a lot of traffic from a data center or cloud provider, and that costs money. Cloud providers typically don’t charge for inbound data, but depending upon the provider, data egress can be expensive; with some providers, it even costs to transmit data between regions. Even for private data centers, the cost exists in the form of bandwidth of connectivity to the internet backbone and/or the use of leased lines. These prices also vary around the world.
The following diagram provides some indicative figures from the last time I surveyed the published prices of the leading hyperscalers, and the on-premises costs are derived from leased line pricing.
This raises the question of how on earth you create a centralized single pane of glass for your monitoring without incurring potentially significant data costs. Where should the data be consolidated? What does this mean if you use a SaaS monitoring solution such as Datadog?
There are several things we can do to improve the situation. Firstly, let’s look at the logs and traces being generated. They may help during development and testing, but do we need all of them in production? If we’re using logging frameworks, are log entries correctly classified as Trace, Debug, and so on? Where applications use logging frameworks, we can tune the logging configuration to deal with the situation when one module is particularly noisy. But some systems are brittle, some teams are nervous about modifying any configuration, and some 3rd-party support organizations will void support agreements if any configuration is modified. The next line of defense is to take advantage of tools such as Fluentd, Logstash, or Fluentbit (which brings with it full support for OpenTelemetry). We can introduce these tools into the environment near the data source so that they can capture and filter the logs, traces, and metrics data.
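As a minimal sketch of this idea, a Fluentbit pipeline placed near the source can drop low-value records before they ever leave the data center. The file paths, tag names, and the assumption that records carry a `level` field are all illustrative, not taken from the original text:

```ini
# Hypothetical Fluentbit (classic .conf) pipeline: tail application logs
# locally and discard DEBUG/TRACE entries before forwarding.
[INPUT]
    Name    tail
    Path    /var/log/app/*.log     # illustrative path
    Tag     app.logs

[FILTER]
    Name    grep
    Match   app.logs
    # Exclude any record whose 'level' field matches DEBUG or TRACE.
    Exclude level (DEBUG|TRACE)

[OUTPUT]
    Name    forward
    Match   app.logs
    Host    aggregator.internal.example   # illustrative in-region aggregator
    Port    24224
```

This keeps the filtering decision out of the application’s own logging configuration, which matters for those brittle or 3rd-party-supported systems mentioned above.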
The way these tools work means they can consume, transform, and send logs, traces, and metrics to the final destination in a format that most systems can support. Further, Fluentd and Fluentbit can easily be deployed in fan-out and fan-in topologies, so scaling the processing of the data can be done easily. We can also use them as relays, funneling the data through specific points in a network for added security.
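The relay role described here amounts to a node that accepts the forward protocol from many in-region collectors and re-emits a single, encrypted stream toward the center. A hedged sketch in Fluentbit classic config, with hostnames assumed for illustration:

```ini
# Hypothetical fan-in relay: many local Fluentbit/Fluentd agents forward
# here, and only this node egresses traffic toward the monitoring center.
[INPUT]
    Name    forward
    Listen  0.0.0.0
    Port    24224

[OUTPUT]
    Name    forward
    Match   *
    Host    central-relay.example.com   # illustrative egress point
    Port    24224
    tls     on                          # encrypt the one egress stream
```

Concentrating egress through one such node is what reduces the network exposure discussed next.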
As you can see in the following diagram, we’re mixing Fluentd and Fluentbit to concentrate the data flow before allowing it to egress. In doing so, we can reduce the number of points of network exposure to the internet. This strategy shouldn’t be the only mechanism used to secure data transmission, but it can certainly be part of an arsenal of security measures. The concentration points can also act as failsafes in the event of connectivity issues.
As well as filtering and channeling the data flow, these tools can also direct data to multiple destinations. So rather than throwing away data that we don’t want centrally, we can consolidate it into an efficient time-series data store within the same data center or cloud and send on only the data that has been identified as high value. When investigating an issue, this gives us two options:
- Identify the additional data needed to enrich the central aggregated analysis and ingest just that additional data (and possibly further refine the filtration for the future).
- Implement localized analysis and incorporate the resultant views into our dashboards.
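The split between “keep everything locally, egress only the high-value subset” can be expressed in one pipeline. In this sketch, records are duplicated to a local store while only those matching a severity rule gain a second tag that routes them to the center; the store choice (Loki), field name, and hostnames are assumptions for illustration:

```ini
# Hypothetical dual-destination routing in Fluentbit.
[FILTER]
    Name          rewrite_tag
    Match         app.logs
    # Re-emit ERROR/FATAL records under central.*; 'true' keeps the
    # original record so the local store still receives everything.
    Rule          $severity ^(ERROR|FATAL)$ central.app.logs true
    Emitter_Name  re_emit

[OUTPUT]
    # Everything lands in an in-region store (Loki used as an example).
    Name    loki
    Match   app.*
    Host    loki.local.example

[OUTPUT]
    # Only the high-value subset egresses to the monitoring center.
    Name    forward
    Match   central.*
    Host    monitor.center.example
    Port    24224
```

The local copy is what makes the first option above (ingesting extra data later) possible without re-collecting anything from the applications themselves.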
Either way, you have access to the additional information. I would opt for the former: I’ve seen situations where local data stores were purged too quickly by local operational teams, and data such as traces and logs compresses well in greater volume. But remember, if the logs include data that may be sensitive to location, pulling them to the center can raise additional challenges.
While the diagram shows the monitoring center as being on-premises, it could equally be a SaaS product or one of the clouds. Where the center should be comes down to three key criteria:
- Any data constraints in terms of the ISO 27001 view of security (integrity, confidentiality, and availability).
- Connectivity and connectivity costs. This will tend to bias the location for monitoring to where the largest volume of monitoring data is generated.
- Monitoring capability and capacity – both functional (visualize and analyze data) and non-functional factors, such as how quickly inbound monitoring data can be ingested and processed.
Adopting a GitOps strategy helps ensure consistency in configuration, and therefore in data flow, across software that may well be deployed in multiple data centers, cloud regions, and possibly even multiple cloud vendors. If we identify changes to the filters (to remove or include data coming to the center), those changes can be rolled out so that every monitoring source remains consistent in its configuration.
Incidentally, most stores of log data, be they compressed flat files or databases, can be processed by tools like Fluentd not only as a data sink but also as a data source. So it is possible, through GitOps, to distribute temporary configurations to your Fluentd/Fluentbit nodes that harvest and bulk-move any newly required data from those regionalized staging stores to the center, rather than manually accessing and searching them. If you adopt this approach, we recommend creating templates for such actions in advance and using them as part of a tested operational process. If such a strategy were improvised at short notice as part of a problem remediation activity, you could accidentally harvest too much data or impact current active operations; it needs to be done with an awareness of how it can affect what is live.
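A templated backfill configuration of the kind recommended above might look like the following sketch, distributed via GitOps and then withdrawn once the harvest completes. The paths, date range, and `severity` field are hypothetical placeholders to be filled in per incident:

```ini
# Hypothetical temporary backfill pipeline: re-read a regional staging
# store from the beginning and ship only a narrow severity slice to the
# center. Remove this configuration once the harvest is complete.
[INPUT]
    Name            tail
    Path            /data/staging/app-*.log   # illustrative staging path
    Read_from_Head  true                      # replay existing files, not just new lines
    Tag             backfill.app

[FILTER]
    Name    grep
    Match   backfill.app
    # Keep only the records the investigation actually needs.
    Regex   severity (ERROR|WARN)

[OUTPUT]
    Name    forward
    Match   backfill.*
    Host    monitor.center.example
    Port    24224
```

Keeping the filter tight is the safeguard against the “harvest too much data” risk: the template forces each use to state explicitly which slice of the staging store is being pulled.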
Hopefully, this will help offer some inspiration for cost-efficiently handling hybrid and multi-cloud operational monitoring.