Morcor - Co-Location of Mixed Workloads at Uber | Amite Bose

@Scale

27 Oct 2023 · 21:51

Summary

TLDR: Amit from Uber discusses 'Project Marker', an initiative to co-locate diverse workloads on Uber's compute platform to improve resource utilization and reduce hardware costs. By dynamically partitioning each cluster into stateless and batch partitions and resizing them based on live load, Uber has integrated Peloton and YARN so that data analytics jobs run on freed-up hosts without migrating users to a new platform. The talk highlights the importance of minimizing disruption, maintaining system resilience, and addressing organizational challenges in achieving co-location.

Takeaways

🚀 Uber's Project Marker aims to co-locate different kinds of workloads on the same compute platform to increase resource utilization and save costs.
🌐 Uber operates at a massive scale, conducting more than 16 billion trips per day across 10,000 cities and six continents.
🛠 The company has over 3500 microservices that are interdependent and critical for maintaining a seamless user experience.
📊 Uber's Big Data infrastructure processes vast amounts of data generated by trips and deliveries to drive business intelligence and machine learning.
🔄 Project Marker's goal is to converge stateless microservices and data analytics jobs onto a single platform to optimize resource usage.
📈 The CPU utilization graph of Uber's clusters shows significant fluctuations due to demand spikes, indicating potential for better resource management.
💡 Same cluster co-location was chosen over same host co-location to address the challenge of isolating resource-intensive batch jobs from stateless jobs.
🔄 Dynamic partitioning within clusters allows for adjusting the balance between stateless and batch workloads based on real-time demand.
🛠️ Load-aware placement and proactive monitoring are used to mitigate resource contention and ensure stable service performance.
🔗 The collaboration between Peloton (compute platform) and YARN (data platform) enables efficient use of freed-up hosts without migrating users to a new platform.
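The dynamic partitioning idea from the takeaways can be illustrated with a minimal Python sketch. It is not Uber's implementation: the function name `plan_partitions`, the `headroom` safety margin, and all numbers are assumptions chosen for illustration. It only shows how a predicted stateless CPU demand could be turned into a split between stateless and batch hosts.

```python
import math


def plan_partitions(total_hosts, cores_per_host, predicted_stateless_cores,
                    headroom=0.25):
    """Return (stateless_hosts, batch_hosts) for one cluster.

    Hypothetical policy: size the stateless partition for the predicted
    demand plus a safety margin, and hand every remaining host to batch.
    """
    if total_hosts < 0 or cores_per_host <= 0:
        raise ValueError("total_hosts must be >= 0 and cores_per_host > 0")
    # Cores the stateless partition should be able to serve, with headroom.
    needed_cores = predicted_stateless_cores * (1 + headroom)
    # Round up to whole hosts, but never beyond the size of the cluster.
    stateless_hosts = min(total_hosts, math.ceil(needed_cores / cores_per_host))
    return stateless_hosts, total_hosts - stateless_hosts


# Peak demand: most hosts stay in the stateless partition.
peak = plan_partitions(100, 32, 1600)
# Off-peak demand: hosts are released to the batch partition.
off_peak = plan_partitions(100, 32, 640)
print(peak, off_peak)
```

Rerunning such a function periodically with a fresh prediction gives the peak and off-peak behaviour described above: the stateless partition grows when demand rises and shrinks when it falls.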

Q & A

What is the main goal of Project Marker at Uber?
-The main goal of Project Marker is to increase resource utilization and reduce costs by co-locating stateless microservices and batch data analytics workloads on the same cluster, leveraging unused capacity.
Why did Uber decide to converge its stateless microservices and batch data workloads onto a single platform?
-Uber converged these workloads to utilize unused capacity during times of low demand for microservices, allowing the company to run batch data jobs and reduce the need for additional hardware, thereby achieving cost savings.
What infrastructure does Uber use to manage its stateless microservices and batch data jobs?
-Uber uses Mesos and Peloton for managing stateless microservices, and Hadoop and YARN for batch data jobs. Project Marker aims to co-locate these workloads on the same cluster.
How does the concept of 'over-commitment' work in Project Marker?
-Over-commitment involves advertising more CPU cores than a host actually has, allowing Uber to pack stateless microservices onto fewer hosts, freeing up other hosts to run batch data jobs.
What are the two types of co-location methods discussed in the presentation?
-The two types of co-location methods are 'same host co-location', where stateless and batch jobs share the same physical hosts, and 'same cluster co-location', where stateless and batch jobs are run on different hosts within the same cluster.
Why did Uber choose same cluster co-location over same host co-location?
-Uber chose same cluster co-location because running stateless and batch jobs on separate hosts sidesteps the problem of isolating their resources on a shared host. This made it easier to implement without risking performance degradation for stateless microservices.
How does Uber adjust partition sizes in response to changing workloads?
-Uber dynamically adjusts partition sizes based on current and predicted CPU utilization. During peak times, more capacity is allocated to stateless microservices, while during off-peak times, capacity is shifted to batch jobs.
What are some challenges Uber faces when moving hosts between partitions?
-Moving a host between partitions disrupts whatever is running on it. Returning a host to the stateless partition means killing its batch jobs, and shrinking the stateless partition means relocating microservices, so less critical ones are selected to move first.
How does load-aware placement help mitigate resource contention in Project Marker?
-Load-aware placement distributes jobs evenly across hosts, reducing the risk of resource contention and hotspots. If a host becomes overloaded, services can be proactively moved to more lightly loaded hosts.
How does Uber ensure minimal disruption when taking hosts back from the batch partition for stateless workloads?
-Uber uses techniques like selectively killing less critical jobs and adjusting placement strategies to minimize disruption, ensuring that high-priority services continue running smoothly even when hosts are reclaimed.
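The over-commitment and load-aware placement answers above can be sketched together in a few lines of Python. This is an illustrative model only, not Peloton's scheduler: the `Host` class, the `place` and `hot_hosts` helpers, the 1.5 over-commit ratio, and the 0.8 hotspot threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    physical_cores: int
    overcommit_ratio: float = 1.5  # hypothetical ratio, not from the talk
    allocated_cores: float = 0.0   # cores promised to placed services
    used_cores: float = 0.0        # measured live CPU usage

    @property
    def advertised_cores(self):
        # Over-commitment: advertise more cores than physically exist.
        return self.physical_cores * self.overcommit_ratio

    @property
    def utilization(self):
        return self.used_cores / self.physical_cores


def place(hosts, requested_cores, expected_usage):
    """Place a service on the least utilized host with advertised room left."""
    candidates = [h for h in hosts
                  if h.allocated_cores + requested_cores <= h.advertised_cores]
    if not candidates:
        return None  # nothing fits; the stateless partition would need to grow
    target = min(candidates, key=lambda h: h.utilization)
    target.allocated_cores += requested_cores
    target.used_cores += expected_usage
    return target


def hot_hosts(hosts, threshold=0.8):
    """Hosts whose live utilization exceeds the (assumed) hotspot threshold."""
    return [h for h in hosts if h.utilization > threshold]


hosts = [Host("a", 32, allocated_cores=30, used_cores=16),
         Host("b", 32, allocated_cores=40, used_cores=8)]
chosen = place(hosts, requested_cores=8, expected_usage=2)
print(chosen.name, [h.name for h in hot_hosts(hosts)])
```

Choosing by live utilization rather than by allocated cores is the point of the sketch: host `b` has more cores promised than host `a` but is less busy, so it receives the next service as long as its advertised capacity allows.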