Bloomberg's Journey to a Multi-Cluster Workflow Orchestration Platform - Yao Lin & Reinhard Tartler
Summary
TLDR: In this presentation, Reinhard Tartler and Yao Lin from Bloomberg's Cloud-Native Compute Services Group discuss their journey to a multicluster workflow orchestration platform. They cover the challenges of managing static and dynamic resources across data centers and the solutions they adopted, emphasizing resiliency and consistency. The team introduces an approach that uses 'kine' and a relational database with streaming replication to achieve multi-data center resiliency while retaining cloud-native properties. The talk concludes with a Q&A session covering user adaptation, version inconsistencies, and the potential use of Karmada for managing federated clusters.
Takeaways
- Reinhard Tartler and Yao Lin presented on Bloomberg's journey to a multicluster workflow orchestration platform.
- The presentation discussed the evolution and unique challenges of Bloomberg's internal workflow orchestration platform, which is used by engineers for various tasks including machine learning and financial analysis.
- The platform is built on Argo Workflows, a CNCF project that provides core functionalities like a controller, custom resource definitions, and a user interface for visualizing workflow progress.
- Bloomberg, being a data company, has stringent data access requirements and needs to ensure production stability, which influenced the design of their multicluster solution.
- The platform had to address cross-data center resiliency, requiring multiple installations of Argo Workflows, which added complexity in terms of cognitive load and potential failure modes.
- To manage workflow runs and static resources, Bloomberg introduced a management service that provides reliable placement decisions and a user interface for various management tasks.
- The solution uses a combination of Kubernetes' API server and a relational database with streaming replication to ensure cross-data center consistency and handle cluster changes.
- Kine was used to create a virtual control plane spanning multiple data centers, allowing the platform to manage static resources like ConfigMaps and workflow templates across clusters.
- The architecture change was designed to be non-disruptive to existing users, with a transitional API layer to handle the shift from the old to the new system.
- The team considered tools like Argo CD and Kafka but found their custom solution to be a more cost-effective and simpler fit for their specific requirements.
- The management service acts as a gatekeeper, accepting user requests, making placement decisions, and ensuring resources are consistently deployed across all participating clusters.
Q & A
What is the main focus of Bloomberg's presentation?
-The presentation focuses on Bloomberg's journey to a multicluster workflow orchestration platform, discussing the challenges and solutions they encountered while developing their internal workflow orchestration system.
What is the purpose of the workflow orchestration platform at Bloomberg?
-The workflow orchestration platform at Bloomberg is designed to provide general utility compute for run-to-completion batch jobs, catering to internal engineers who use Bloomberg infrastructure and data.
Why is the Argo Workflows project significant for Bloomberg's platform?
-Argo Workflows is significant because it provides the core functionality that Bloomberg's clients primarily use. It includes a controller, custom resource definitions for workflow steps, and a user interface for visual observation and debugging.
What are some of the use cases for Bloomberg's workflow orchestration platform?
-The platform is used for various purposes such as machine learning orchestration, custom CI/CD solutions, maintenance tasks on physical and virtual infrastructures, and financial analysis tasks to build processing pipelines.
How does Bloomberg ensure data security and access requirements for its workflow orchestration platform?
-Bloomberg ensures data security and access requirements by providing UIs and APIs for users to manage workflows, keeping track of pods, inputs, and logs. It also integrates tightly with Bloomberg's standard approval processes for production stability.
What challenges did Bloomberg face in implementing a multicluster solution?
-Bloomberg faced challenges such as maintaining cross-data center consistency, handling the addition or removal of clusters, and managing the cognitive load and failure modes associated with multiple installations of Argo Workflows.
What is the role of the Management Service in Bloomberg's workflow orchestration platform?
-The Management Service acts as an intermediary that offers an API for workflow submission and deployment of static resources. It provides reliable placement decisions, a user interface for onboarding and management, and helps maintain consistency across data centers.
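A placement decision of the kind described above can be pictured with a minimal sketch. The talk summary does not say which policy Bloomberg uses, so the least-loaded rule, the `Cluster` fields, and the cluster names below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    # Hypothetical view of a workload cluster as seen by the management service.
    name: str
    data_center: str
    healthy: bool
    running_workflows: int


def place_workflow(clusters):
    """Return the healthy cluster with the fewest running workflows."""
    candidates = [c for c in clusters if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy cluster available for placement")
    # Tie-break on the name so the decision is deterministic.
    return min(candidates, key=lambda c: (c.running_workflows, c.name))


clusters = [
    Cluster("wf-dc1-a", "dc1", healthy=True, running_workflows=120),
    Cluster("wf-dc2-a", "dc2", healthy=True, running_workflows=45),
    Cluster("wf-dc2-b", "dc2", healthy=False, running_workflows=0),
]
print(place_workflow(clusters).name)  # wf-dc2-a
```

Keeping the decision in one service means users submit a workflow once and never need to know which cluster, or which data center, ends up running it.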
How does Bloomberg handle the deployment and management of static resources across multiple clusters?
-Bloomberg uses a combination of a Management Service API and a database with streaming replication to ensure cross-data center consistency. The API accepts requests, makes placement decisions, and writes the result to the database, from which a syncer agent in each workload cluster pulls the resources it needs to apply.
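The pull step performed by the agent in each workload cluster amounts to comparing desired state with actual state. The sketch below is an assumption about how such a comparison could look, not Bloomberg's implementation: `plan_sync`, the `(kind, namespace, name)` key, and the sample manifests are hypothetical, and plain dictionaries stand in for the database and the cluster.

```python
def plan_sync(desired, actual):
    """Compare desired resources with those present in one cluster.

    Both arguments map (kind, namespace, name) to a manifest dict.
    Returns a sorted list of (action, key) pairs.
    """
    actions = []
    for key, manifest in desired.items():
        if key not in actual:
            actions.append(("create", key))
        elif actual[key] != manifest:
            actions.append(("update", key))
    for key in actual:
        if key not in desired:
            actions.append(("delete", key))
    return sorted(actions)


# Desired state as it might be read from the replicated database.
desired = {
    ("ConfigMap", "team-a", "settings"): {"data": {"mode": "prod"}},
    ("WorkflowTemplate", "team-a", "etl"): {"spec": {"entrypoint": "main"}},
}
# State currently present in the workload cluster.
actual = {
    ("ConfigMap", "team-a", "settings"): {"data": {"mode": "dev"}},
    ("ConfigMap", "team-a", "stale"): {"data": {}},
}

for action, key in plan_sync(desired, actual):
    print(action, key)
```

Because the plan is derived from a full comparison rather than from individual events, an agent that was offline for a while converges to the same state on its next run.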
What is the significance of using a modern relational database like PostgreSQL for Bloomberg's platform?
-PostgreSQL is used for its streaming replication feature, which ensures high availability and consistency across data centers. It allows for a leader database to accept write requests and stream transactions to replicas, maintaining consistency and serving read requests.
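The leader and replica roles can be illustrated with a toy model. This is not PostgreSQL code and uses no PostgreSQL API; the `Leader` and `Replica` classes are hypothetical and only mimic the idea of an ordered log that replicas apply in sequence.

```python
class Leader:
    """Accepts writes and keeps them in an ordered log."""

    def __init__(self):
        self.log = []

    def write(self, key, value):
        self.log.append((key, value))
        return len(self.log)  # log position after this write


class Replica:
    """Applies the leader's log in order and serves reads from local state."""

    def __init__(self):
        self.applied = 0
        self.state = {}

    def catch_up(self, leader, limit=None):
        # Apply up to `limit` pending entries, or all of them if limit is None.
        end = len(leader.log)
        if limit is not None:
            end = min(end, self.applied + limit)
        for key, value in leader.log[self.applied:end]:
            self.state[key] = value
        self.applied = end

    def lag(self, leader):
        return len(leader.log) - self.applied


leader = Leader()
replica = Replica()
leader.write("configmap/settings", "v1")
leader.write("workflowtemplate/etl", "v1")
leader.write("configmap/settings", "v2")

replica.catch_up(leader, limit=2)
print(replica.lag(leader))  # 1
replica.catch_up(leader)
print(replica.lag(leader))  # 0
```

The model shows why a replica can briefly serve slightly older data: until it has applied the whole log, its reads reflect an earlier position of the leader.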
How does Bloomberg's solution with k3s and kine differ from traditional Kubernetes setups?
-Bloomberg uses kine, an etcd shim from the k3s project, to create a virtual control plane that spans multiple data centers, with a relational database as the storage backend instead of etcd. This setup allows for managing resources across clusters without the need for a full control plane in each data center.
What are some of the benefits of Bloomberg's approach to multicluster management?
-The approach allows for cost efficiency, as it is less expensive than stretching a typical etcd across data centers. It also provides a simpler solution for managing static resources and can be integrated without exposing users to implementation details.
How does the new system impact users in terms of interacting with Argo Workflows?
-Users are not exposed to the new system's implementation details. A Management Service with a custom API is designed to accept Argo native manifests and persist them in a way that is transparent to the user, maintaining a seamless experience.
What considerations did Bloomberg take into account regarding version inconsistencies across clusters?
-Bloomberg ensures that the CRDs themselves do not change significantly between minor versions of Argo, avoiding major inconsistencies. They are also considering the process of updating to newer versions to maintain consistency.
Why did Bloomberg decide not to use Argo CD for managing static resources?
-Bloomberg found that the GitOps model of Argo CD was not a good fit due to their specific data access requirements and platform setup. They needed a solution that could handle namespaces and access controls in a way that integrated well with their existing system.
Is there a plan to use Karmada (a CNCF project for multi-cluster Kubernetes orchestration) in the future?
-Bloomberg is considering using Karmada for managing dynamic resources, but the presentation focused on static resources. They are evaluating the complexity and cost-benefit trade-offs of using Karmada and may decide to use it in the future.