Thanos Receiver Deep Dive - Joel Verezhak, Open Systems

CNCF [Cloud Native Computing Foundation]

28 Mar 2024 · 24:23

Summary

TLDR: The speaker discusses their experiences and challenges with Thanos, a metrics backend system, at the first ThanosCon. They explain how Thanos is deployed at Open Systems and detail an incident in which a runaway cascade of receiver failures caused a major outage. The talk covers technical aspects such as remote-writing metrics, hash rings, and replication. The speaker shares insights from debugging the issue, emphasizes the importance of resilience in multi-tenant pipelines, and suggests potential improvements for future stability.

Takeaways

🌐 The speaker is a system engineer at Open Systems, a company that offers managed connectivity services and has recently started integrating cloud solutions with legacy systems.
🔍 The talk is a deep dive into the Thanos receiver, the component that accepts remote-written metrics in Thanos, which the speaker's company has been using for about two to three years.
📈 The Thanos receiver forms a hash ring to distribute metrics across different receivers, which is essential for scaling metrics in a distributed system.
🚨 An incident occurred where the Thanos receiver became unstable, causing a cascade of failures that affected multiple tenants and led to alerting issues.
🔄 The incident highlighted the importance of replication in Thanos, which allows for multiple replicas of metrics to ensure data availability even if some receivers fail.
🛡️ Multi-tenancy was implemented to isolate tenants and prevent a single tenant's actions from affecting others, which is crucial for maintaining system stability.
🔍 The incident was resolved by identifying and mitigating a 'monster query' that overwhelmed the receivers, demonstrating the need for robust error handling and query management.
🔧 The speaker suggests using configuration options like `store.limits` to prevent similar incidents by limiting the number of series a receiver can fetch, balancing user needs with system stability.
🔄 Dynamic scaling of receivers was found to be less stable in their specific use case, leading to a preference for a static replication factor of three.
📚 The speaker encourages revisiting the error bubbling approach in Thanos to better handle errors when multiple receivers are involved, and exploring new tenant queries for enhanced protection.
🏗️ Building a resilient multi-tenant pipeline is challenging and requires continuous learning and adaptation from incidents, as demonstrated by the Thanos receiver incident.
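
The hash ring and replication described in the takeaways can be illustrated with a small sketch. This is a simplified modulo-based ring written for illustration, not Thanos's actual implementation; the function names, the receiver names, and the choice of SHA-256 are all assumptions. It shows how a tenant plus label set can map deterministically to one receiver, and how a replication factor of three picks two further receivers on the ring.

```python
import hashlib


def series_hash(tenant: str, labels: dict) -> int:
    """Return a stable 64-bit hash of the tenant and its sorted label set.

    Sorting the labels makes the hash independent of insertion order.
    SHA-256 is an arbitrary choice for this sketch.
    """
    canonical = tenant + "|" + ",".join(
        f"{name}={value}" for name, value in sorted(labels.items())
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_receivers(tenant: str, labels: dict, receivers: list,
                   replication_factor: int = 3) -> list:
    """Pick the receivers that should store one series (hypothetical helper).

    The hash selects a starting position on the ring; the following
    positions hold the additional replicas.
    """
    if not receivers:
        raise ValueError("the hash ring has no receivers")
    replicas = min(replication_factor, len(receivers))
    start = series_hash(tenant, labels) % len(receivers)
    return [receivers[(start + i) % len(receivers)] for i in range(replicas)]


if __name__ == "__main__":
    ring = [f"receive-{i}" for i in range(5)]  # hypothetical receiver names
    labels = {"__name__": "up", "instance": "host-1"}
    print(pick_receivers("tenant-a", labels, ring))
```

Because the mapping depends on the number of receivers, adding or removing a receiver in this naive scheme moves many series to different nodes. That is one reason a ring that changes often can be harder to keep stable than a static one.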

Q & A

What is the primary focus of the talk at ThanosCon?
-The primary focus of the talk is a deep dive into the Thanos receive component, discussing its implementation, issues, and resolution strategies based on real-world incidents.
What does the company 'Open Systems' offer?
-Open Systems offers managed connectivity services, including firewalls, proxies, and cloud solutions. They deploy physical devices globally and are transitioning to cloud-based services.
Why is the Thanos receive component crucial for Open Systems?
-The Thanos receive component is crucial for Open Systems because it allows them to remote write metrics from 10,000 hosts worldwide, enabling scalable metrics collection and processing.
What incident led to the deep investigation discussed in the talk?
-The incident involved Thanos receivers becoming unstable and causing a runaway cascade of failures. This resulted in significant metric and alerting issues, prompting a deep investigation.
How does the receive component in Thanos work?
-The Thanos receive component forms a hash ring in which each label set is hashed to a specific receiver. A receive controller running in Kubernetes monitors the receivers' health and dynamically adjusts the hash ring when one becomes unhealthy.
What was the cause of the catastrophic failure observed in the Thanos receivers?
-The catastrophic failure was triggered by a monster query, leading to a chain reaction of errors across multiple tenants and causing the receivers to become unstable.
What is a '409 Loop' and how did it affect the system?
-A '409 Loop' refers to a continuous cycle of HTTP 409 (Conflict) errors, in which the Thanos receivers repeatedly failed and retried requests, leading to persistent system instability.
What measures did the team take to mitigate the impact of the monster query?
-The team applied the `store.limits` configuration option to cap the number of series a receiver can fetch, effectively blocking the monster query from causing further issues.
What are the benefits of enabling replication in the Thanos receive component?
-Enabling replication ensures that metric writes are replicated to multiple receivers. If one receiver goes down, others can still provide the data, ensuring continued functionality and resilience.
What lessons were learned from the incident discussed in the talk?
-The main lessons learned include the importance of protecting Thanos receivers, the complexity of building a resilient multi-tenant pipeline, and the need to revisit error handling and replication strategies to prevent similar incidents.
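
The idea behind the series limit mentioned in the answers above can be sketched as a guard that aborts a request once it touches too many series. This is an illustration of the concept only, not how Thanos implements `store.limits`; `limit_series`, `SeriesLimitExceeded`, and the convention that a limit of 0 disables the check are assumptions made for this example.

```python
from typing import Iterable, Iterator


class SeriesLimitExceeded(Exception):
    """Raised when a single request touches more series than allowed."""


def limit_series(series: Iterable[dict], max_series: int) -> Iterator[dict]:
    """Yield series until the limit is exceeded, then fail the request.

    A max_series of 0 disables the check (an assumption of this sketch).
    Failing early keeps one oversized query from exhausting the receiver.
    """
    if max_series < 0:
        raise ValueError("max_series must not be negative")
    for count, item in enumerate(series, start=1):
        if max_series and count > max_series:
            raise SeriesLimitExceeded(
                f"request touched more than {max_series} series"
            )
        yield item


if __name__ == "__main__":
    matched = [{"__name__": "up", "instance": str(i)} for i in range(10)]
    try:
        list(limit_series(matched, max_series=5))
    except SeriesLimitExceeded as err:
        print(f"query rejected: {err}")
```

The trade-off is the one the talk points to: a limit set too low rejects legitimate queries, while a limit set too high no longer protects the receivers.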