Nielsen: Processing 55TB of Data Per Day with AWS Lambda
Summary
TL;DR: In 'This is My Architecture,' Boaz interviews Opher from Nielsen Marketing Cloud about DataOut, a system within their data management platform. DataOut processes 250 billion events daily, using Spark on EMR clusters and Lambda functions to deliver segmentation data for ad campaigns. The challenges include scaling to handle massive data volumes, rate limiting uploads so that partner servers are not overloaded, and controlling costs, which have been reduced from $7.70 to $4.25 per billion events through code improvements and connection management.
Takeaways
- Nielsen Marketing Cloud is a data management platform that processes marketing segmentation data for campaigns.
- DataOut, a system within Nielsen Marketing Cloud, receives files from other parts of the system and processes them for ad networks.
- The system handles a massive scale, processing 250 billion events daily, which equates to 55 terabytes of data on its peak day.
- Data files are initially written to an S3 bucket, processed with a Spark EMR cluster, and then sent to Lambda functions for final formatting and upload to ad platforms.
- Lambda functions are responsible for the final upload of data to over 100 ad networks, demonstrating a serverless architecture.
- Metadata about files and their management is stored in a Postgres RDS database, which is updated by a work manager Lambda.
- Rate limiting was introduced to prevent overwhelming partner servers, with intelligent decisions made by the work manager Lambda based on file size and event count.
- Cost management is a priority, with the system costing approximately $1,000 per day, or $300,000 per year, and ongoing efforts to reduce costs.
- Optimization of Lambda functions' memory footprint and runtime has led to cost savings, translating code improvements directly into cost efficiency.
- The system has a dynamic scaling capability, automatically adjusting to data influx without manual intervention, thanks to the serverless nature of Lambda functions.
- An internal DDoS-like attack was mitigated by introducing a queue buffer to manage the volume of Lambda invocations reporting back to the database.
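The queue buffer in the last takeaway can be pictured with a small sketch. It is only an illustration: Python's in-process `queue.Queue` stands in for whatever queue service the team actually used (the summary does not name it), and `StatusBuffer`, `report`, and `flush` are hypothetical names. The idea is that many workers enqueue their status reports and a single consumer writes them to the database in batches, so the database sees a few larger writes rather than a burst of simultaneous connections.

```python
import queue


class StatusBuffer:
    """Buffers worker status reports and flushes them in batches (illustrative only)."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self._queue = queue.Queue()
        # Stand-in for the database: one list entry per batched write.
        self.db_writes = []

    def report(self, file_id, status):
        # Called by each worker instead of opening its own database connection.
        self._queue.put((file_id, status))

    def flush(self):
        # Drain the queue, writing at most batch_size reports per database call.
        written = 0
        while not self._queue.empty():
            batch = []
            while len(batch) < self.batch_size and not self._queue.empty():
                batch.append(self._queue.get())
            self.db_writes.append(batch)
            written += len(batch)
        return written


buffer = StatusBuffer(batch_size=100)
for i in range(1000):  # 1,000 workers finish at about the same time
    buffer.report(f"file-{i}", "uploaded")
buffer.flush()
print(len(buffer.db_writes))  # 10 batched writes instead of 1,000 separate ones
```

The batch size here is arbitrary; in practice it would be tuned to what the database can absorb comfortably.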
Q & A
What is Nielsen Marketing Cloud?
-Nielsen Marketing Cloud is a data management platform that prepares marketing segmentation data for use in campaigns.
Can you describe the role of the DataOut system in Nielsen Marketing Cloud?
-DataOut is a system within Nielsen Marketing Cloud that processes incoming files, performs transformations and formatting, and uploads the processed data to ad networks that are partners of Nielsen.
How does the DataOut system handle the scale of data it processes?
-The DataOut system processes about 250 billion events a day, utilizing an S3 bucket for storage, a Spark EMR cluster for processing, and Lambda functions for final formatting and uploading to ad platforms.
What is the significance of the S3 bucket in the DataOut system?
-The S3 bucket is used for storing the incoming files with segmentation data, which are then processed and written to another S3 bucket before being sent to Lambda functions for further processing.
How does the system manage the metadata about the files?
-The metadata about the files and their management is written to a Postgres RDS database, which is updated by a work manager Lambda that reads the information and makes decisions on file processing.
What is the role of Lambda functions in the final stage of the DataOut system's data processing?
-Lambda functions perform the last part of the work, which includes the final formatting of the data and uploading it to the ad platforms.
How does the system handle rate limiting to prevent overwhelming partner servers?
-The system has a rate limiting mechanism where the work manager Lambda makes intelligent decisions based on file size and event numbers to limit the rate of data sent to partner networks, preventing server overload.
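One way to picture this kind of decision is a per-partner event budget for each scheduling cycle. The sketch below is an assumption about how such a rule could look, not the logic Nielsen actually runs: `PendingFile`, `select_files`, and the budget values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PendingFile:
    name: str
    partner: str
    event_count: int


def select_files(pending, events_per_cycle):
    """Pick files to dispatch this cycle without exceeding each partner's budget.

    events_per_cycle maps a partner name to the maximum number of events
    to send to that partner per cycle (an assumed, illustrative limit).
    """
    remaining = dict(events_per_cycle)
    selected, deferred = [], []
    for f in pending:
        budget = remaining.get(f.partner, 0)
        if f.event_count <= budget:
            remaining[f.partner] = budget - f.event_count
            selected.append(f)
        else:
            deferred.append(f)  # stays in the backlog for a later cycle
    return selected, deferred


pending = [
    PendingFile("a.gz", "partner-x", 600),
    PendingFile("b.gz", "partner-x", 500),
    PendingFile("c.gz", "partner-y", 300),
]
selected, deferred = select_files(pending, {"partner-x": 1000, "partner-y": 1000})
print([f.name for f in selected])  # ['a.gz', 'c.gz']
print([f.name for f in deferred])  # ['b.gz']
```

A real scheduler would also need a policy for a single file that exceeds the budget on its own; this sketch simply defers it.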
What challenges did the DataOut system face regarding scaling up and down?
-The system needs to scale up and down throughout the day as data volumes vary, with peak hours requiring about six terabytes of data processing compared to one terabyte during the lowest hour.
How does the serverless architecture of Lambda functions benefit the DataOut system?
-The serverless architecture allows the system to automatically scale up and down based on data volume without manual intervention, providing cost-effective scalability.
What measures did Nielsen Marketing Cloud take to address cost concerns in the DataOut system?
-Nielsen Marketing Cloud focused on optimizing the efficiency of Lambda functions, reducing memory footprint, and adjusting the number of HTTP connections to lower costs, achieving a reduction from $7.7 to $4.25 per billion events.
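The figures quoted above can be checked with a little arithmetic. The per-billion costs and the daily event volume come from the summary; the memory, duration, and invocation numbers in the second half are made-up values, used only to show why a smaller memory footprint or shorter runtime lowers the bill (Lambda compute is billed in GB-seconds, that is, configured memory multiplied by execution time).

```python
EVENTS_PER_DAY_BILLIONS = 250  # from the summary
COST_BEFORE = 7.70             # USD per billion events, from the summary
COST_AFTER = 4.25              # USD per billion events, from the summary


def daily_cost(cost_per_billion, billions_per_day=EVENTS_PER_DAY_BILLIONS):
    """Daily cost in USD at a given per-billion-events rate."""
    return cost_per_billion * billions_per_day


def gb_seconds(memory_mb, duration_s, invocations):
    """Billable Lambda compute: memory (GB) x duration (s) x invocations."""
    return (memory_mb / 1024) * duration_s * invocations


print(round(daily_cost(COST_BEFORE), 2))       # 1925.0
print(round(daily_cost(COST_AFTER), 2))        # 1062.5
print(round(1 - COST_AFTER / COST_BEFORE, 3))  # 0.448, roughly a 45% reduction

# Hypothetical numbers: halving memory at the same runtime halves GB-seconds.
print(gb_seconds(1024, 2, 1000))  # 2000.0
print(gb_seconds(512, 2, 1000))   # 1000.0
```

At $4.25 per billion events, 250 billion events a day comes to about $1,062, which is consistent with the "approximately $1,000 per day" figure in the takeaways.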
How does the system ensure that costs are kept under control?
-The system measures costs and has a goal to reduce them, using a combination of code optimization, memory footprint reduction, and connection management to achieve cost savings.