Introduction to Big Data Architecture

Software Architecture Academy

29 Mar 2022 09:17

Summary

TLDR: This video introduces Big Data architecture using a generic reference model. It covers key layers like data ingestion, storage, analytics, consumption, and governance, while detailing the sources of big data, including structured, semi-structured, and unstructured data. The video emphasizes batch and real-time processing flows, explaining how data moves across these layers. It also touches on critical components such as relational and NoSQL databases, AI models, and knowledge graphs, concluding with a discussion on governance to ensure data quality and security. Ideal for anyone looking to understand the structure and operation of big data systems.

Takeaways

😀 Big Data systems ingest data from diverse sources, including structured, semi-structured, and unstructured data types.
😀 Structured data comes from traditional systems like data warehouses, CRMs, and APIs, while semi-structured data includes NoSQL databases, logs, emails, and files like JSON and XML.
😀 Unstructured data includes multimedia (images, videos), social media streams, sensor data, and text-based data such as web pages or PDFs.
😀 Big Data architecture consists of five key layers: ingestion, storage, analytics/serving, data consumption, and governance.
😀 The ingestion layer includes batch and real-time components that bring data into the system.
😀 The storage layer uses relational databases, NoSQL databases, and distributed file systems like HDFS (Hadoop Distributed File System) for fault-tolerant storage.
😀 The analytics or serving layer hosts models, engines, and analytics tools to process and provide data outputs for consumption.
😀 The data consumption layer includes end-user applications like BI dashboards, reporting tools, and real-time alerting systems.
😀 The governance layer covers data quality, security, cataloging, auditing, and metadata management, which protect the system's integrity and are critical to its success.
😀 Data flows in Big Data architectures can be batch-based (scheduled processing) or real-time (in-memory processing), each with specific use cases like historical analysis or real-time alerts.
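
The difference between the two flows in the last takeaway can be illustrated with a minimal Python sketch. It is not taken from the video: the sensor readings, the function names `run_batch` and `run_stream`, and the alert threshold are all hypothetical. The batch function works on stored history in one scheduled pass, while the streaming function inspects each event in memory as it arrives.

```python
from collections import defaultdict
from typing import Callable, Iterable


def run_batch(stored_readings: Iterable[dict]) -> dict[str, float]:
    """Batch flow: a scheduled job computes the average value per
    sensor over the full stored history (historical analysis)."""
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for reading in stored_readings:
        sums[reading["sensor"]] += reading["value"]
        counts[reading["sensor"]] += 1
    return {sensor: sums[sensor] / counts[sensor] for sensor in sums}


def run_stream(events: Iterable[dict], threshold: float,
               alert: Callable[[dict], None]) -> None:
    """Real-time flow: each event is checked in memory as it arrives,
    and an alert fires immediately when the threshold is exceeded."""
    for event in events:
        if event["value"] > threshold:
            alert(event)


# Hypothetical sample data used by both flows.
readings = [
    {"sensor": "a", "value": 10.0},
    {"sensor": "a", "value": 20.0},
    {"sensor": "b", "value": 90.0},
]

averages = run_batch(readings)                      # scheduled, whole data set
alerts: list[dict] = []
run_stream(readings, threshold=80.0, alert=alerts.append)  # per event
```

In a real system the batch job would read from the storage layer on a schedule and the stream would come from a messaging or streaming component; here both read the same in-memory list to keep the sketch self-contained.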

Q & A

What are the three main types of data sources mentioned in the video?
-The three main types of data sources are structured, semi-structured, and unstructured data sources.
Can you give examples of structured data sources?
-Structured data sources include traditional systems like data warehouses, data marts, OLTP systems, ODS containing organizational data, CRM and ERP systems, and exposed APIs.
What constitutes semi-structured data according to the video?
-Semi-structured data includes NoSQL databases, log files, emails (with structured headers and unstructured body), spreadsheets without schema, and files such as HTML, CSV, JSON, and XML.
What are some examples of unstructured data?
-Unstructured data includes multimedia content like images, audio, and video, social media streams, messaging data (SMS, WhatsApp, chat), geospatial or map data, sensor data from smart devices, websites, web pages, online news articles, and PDFs.
What are the main layers of a generic big data architecture?
-The main layers are the ingestion layer, data storage layer, analytics (or serving) layer, data consumption layer, and the big data governance layer.
What components are typically found in the ingestion layer?
-The ingestion layer typically has batch and real-time (streaming) components, which allow it to collect and process data from various sources.
What types of storage are used in the data storage layer?
-The data storage layer can use traditional relational databases, NoSQL databases, and distributed file systems like HDFS (Hadoop Distributed File System) to store structured, semi-structured, and unstructured data.
What functions does the analytics or serving layer provide?
-The analytics or serving layer runs models and engines to provide outputs such as statistical and AI models, recommendation engines, knowledge graphs for link analysis, and pre-computed views for user queries.
How does real-time data processing differ from batch processing?
-Batch processing is periodic and scheduled, handling historical data and sending it to the analytics layer before consumption. Real-time processing handles data immediately using in-memory computation and streaming, sometimes without a dedicated analytics layer for fast-response use cases like alerts.
Why is the big data governance layer critical?
-The big data governance layer ensures data auditing, security, quality control, metadata management, and overall management practices. Without proper governance, a big data system has a high risk of failure.
What are the two main variants of real-time processing in big data architecture?
-The two variants are real-time processing with a dedicated serving layer for analytical capabilities, and real-time processing without a dedicated serving layer for fast-response use cases like alerts.
What role does overarching processing play in a big data system?
-Overarching processing refers to batch and real-time processing that moves data across layers, enabling both scheduled and instantaneous data handling for analytics and consumption.
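
The two real-time variants described in the answers above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the video's implementation: the function `handle_stream`, the event fields `type` and `level`, and the use of a `Counter` as a stand-in for a serving-layer view are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def handle_stream(events: Iterable[dict],
                  serving_view: Optional[Counter] = None,
                  alert_level: str = "critical") -> list[dict]:
    """Process events one at a time and return the alerts to consume.

    Variant 1 (serving_view given): every event also updates a
    pre-computed view that consumers can query for analytics.
    Variant 2 (serving_view is None): events bypass the serving
    layer and only fast-response alerts reach consumption.
    """
    alerts = []
    for event in events:
        if serving_view is not None:
            serving_view[event["type"]] += 1  # keep the view current
        if event["level"] == alert_level:
            alerts.append(event)              # fast path to consumption
    return alerts


# Hypothetical events.
events = [
    {"type": "login", "level": "info"},
    {"type": "disk", "level": "critical"},
    {"type": "login", "level": "info"},
]

view = Counter()
alerts_with_serving = handle_stream(events, serving_view=view)  # variant 1
alerts_without_serving = handle_stream(events)                  # variant 2
```

Both calls produce the same alerts; the difference is that the first also leaves behind a queryable view (event counts per type), which is the kind of analytical capability a dedicated serving layer adds.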