What is Big Data? - Computerphile
Summary
TLDR: The video explores the concept of 'big data', noting that there is no precise definition and that data is considered 'big' when it is too large to handle with traditional methods. It introduces the 'five V's' of big data: Volume, Velocity, Variety, Value and Veracity. Techniques for handling large volumes of data are discussed, such as using multiple computers and frameworks like MapReduce and Apache Spark. The video also mentions the importance of security and privacy for personal data in the context of big data.
Takeaways
- 📊 The concept of 'big data' has no precise definition; data is considered 'big' when it is too large to handle with traditional methods.
- 🖥️ If data cannot be processed or stored on a single computer, it is probably 'big data'.
- 🚀 As computers evolve, the threshold for what counts as 'big' keeps shifting.
- 🔍 Handling 'big data' requires new methods, such as using multiple computers in parallel with frameworks like MapReduce.
- 📈 The 'five V's' of big data are Volume, Velocity, Variety, Value and Veracity, which describe common characteristics and challenges.
- 🌐 'Velocity' refers to how quickly data is generated, as in the case of Facebook, which requires real-time processing.
- 📚 'Variety' covers both structured and unstructured data, including text, images, audio and video.
- 💡 'Value' refers to the need to extract useful information from the data, such as patterns or insights that support decision-making.
- 🔒 'Veracity' is crucial: it concerns how trustworthy and reliable the data is, allowing for possible biases and faulty measurements.
- 🔄 Handling 'big data' involves distributing data and computation across multiple computers, which allows more efficient scaling.
- 🛠️ 'Big data' systems usually follow a standard workflow that includes data ingestion, storage, processing and visualisation.
- 🔍 Pre-processing is an important step before the main processing, especially for unstructured data and for reducing redundancy.
Q & A
What is big data and how is it defined?
-Big data refers to datasets so large that they cannot reasonably be handled with traditional methods, such as processing or storing them on a single computer. The exact definition varies and depends on the processing and storage capacity of current computer systems.
What are the 'five V's' of big data and what do they represent?
-The five V's of big data are Volume, Velocity, Variety, Value and Veracity. They represent the common characteristics and challenges of handling large amounts of data, including the quantity of data, the speed at which it is generated, the diversity of formats, the usefulness of the data and the trustworthiness of the information.
How does the concept of big data relate to the processing capacity of a single computer?
-Big data involves volumes of data that cannot be processed or stored on a single computer. As processing and storage capacities grow, the threshold for what counts as 'big' changes too.
What is the MapReduce technique and how does it relate to big data?
-MapReduce is a programming framework for processing large volumes of data in a distributed way. It splits the data and processes it in parallel across multiple computers, improving efficiency and the ability to handle very large datasets.
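As a rough sketch of the map, shuffle and reduce phases described above, here is a word count in plain Python; the chunks, function names and data are invented for illustration, and this is not how an actual Hadoop or Spark job is written.

```python
from collections import defaultdict
from itertools import chain

# Toy dataset split into chunks, as if each chunk lived on a different machine.
chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog barks",
]

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    """Shuffle: group all emitted values by key (the word) across every chunk."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

mapped = [map_phase(c) for c in chunks]   # in a real cluster, each chunk runs on its own node
counts = reduce_phase(shuffle(mapped))    # results are combined after the shuffle
print(counts)                             # {'the': 3, 'quick': 2, 'brown': 1, ...}
```

The point is that each map call only needs its own chunk, so the work can run in parallel; only the much smaller per-word counts have to be brought back together.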
Why is 'Velocity' important in the context of big data?
-Velocity refers to how quickly data is generated, which calls for real-time processing to manage and analyse it effectively, as with social media platforms or live sensor feeds.
How is the 'Variety' of data handled in big data?
-Variety refers to the diversity of data formats, including structured and unstructured data. Handling it requires tools that can process and extract useful information from different types of data, such as text, images, audio and video.
What does 'Value' mean in big data and why is it important?
-Value in big data refers to the importance of obtaining useful information or knowledge from the collected data. It is crucial for decision-making and for analysis that lets organisations understand and improve their operations.
What is 'Veracity' and how does it affect data handling?
-Veracity refers to the trustworthiness and accuracy of the data. It is essential to assess data quality, identify biases and missing values, and understand how reliable the data sources are in order to guarantee valid analysis.
How is the problem of data volume addressed in big data?
-To deal with volume, the data is distributed across multiple computers in a cluster, which enables parallel storage and processing, makes scaling out easier and keeps hardware costs down.
What is a computer cluster and how does it help with big data?
-A computer cluster is a set of interconnected machines that work together to process and store data in a distributed way. This improves the capacity to handle large volumes of data and provides fault tolerance and scalability.
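A very simplified sketch of how records might be spread and replicated across a cluster; the node names, replication factor and record key are made up for illustration, and real systems such as HDFS manage this placement for you.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster members
REPLICATION_FACTOR = 2                            # each record is kept on two nodes

def nodes_for(record_key: str) -> list:
    """Hash the key to pick a starting node, then take the next
    REPLICATION_FACTOR nodes so that losing one node loses no data."""
    h = int(hashlib.md5(record_key.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

print(nodes_for("lorry-42/2024-01-01T12:00"))  # e.g. ['node-1', 'node-2']
```

Scaling out then just means adding entries to the node list; rebalancing the data already stored is the hard part that cluster frameworks handle.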
What are the phases of the standard workflow in a big data system?
-The standard workflow includes data ingestion, storage, processing (either in batches or in real time), and possibly a pre-processing step before the data is analysed to extract valuable insights.
Outlines
📊 Introduction to Big Data
The first paragraph introduces the concept of 'big data' and explains that there is no precise definition. Data is considered 'big' when traditional methods, such as processing or storing it on a single computer, are no longer viable. Advances in technology keep shifting that threshold, and new methods, such as MapReduce, are needed to handle large volumes of data. The 'five V's' of big data are also introduced: Volume, Velocity, Variety, Value and Veracity, which describe common characteristics and challenges when working with large datasets.
🔧 Distributed Processing and Storage
This paragraph focuses on how to manage large volumes of data using distributed file systems such as the Hadoop Distributed File System. It discusses the importance of fault tolerance and data replication for reliability. It also describes the standard pipeline in big data systems, which includes data ingestion, storage on a cluster, parallel processing, and limiting data movement to improve efficiency. Different processing approaches are mentioned, such as batch processing and real-time processing, and pre-processing is highlighted as a way to structure unstructured data and reduce redundancy.
🚀 Real-Time Data Processing
The third paragraph explores real-time data processing with technologies such as Spark Streaming and Apache Flink, which make it possible to handle live data streams. It is illustrated with the example of lorries constantly sending sensor data, a real-time source of information. It highlights the need to process data incrementally and the importance of efficient distributed processing to handle large volumes of data effectively.
Keywords
💡Big Data
💡MapReduce
💡Five Vs
💡Volume
💡Velocity
💡Variety
💡Value
💡Veracity
💡Distributed Storage
💡Data Locality
💡Pre-processing
Highlights
Big data is defined by the inability to process data with traditional methods, such as on a single computer.
The concept of 'big' is dynamic, changing with advancements in computer capacity and speed.
Big data is often managed using the MapReduce framework, which involves splitting data and processing it across multiple computers.
The 'five Vs' of big data are Volume, Velocity, Variety, Value, and Veracity, encompassing key characteristics and challenges.
Volume refers to the size of the dataset, which is a fundamental aspect of big data.
Velocity indicates the speed at which data is generated, such as the constant stream of data from social media platforms like Facebook.
Variety addresses the range of data formats, from structured to unstructured, including images, audio, and video.
Value is about extracting meaningful insights or patterns from the collected data, enhancing understanding or decision-making.
Veracity concerns the trustworthiness and reliability of data, considering potential biases and inaccuracies.
Distributed storage and computation are common approaches to managing the vast amounts of data in big data systems.
Frameworks like Hadoop Distributed File System provide management for data storage and fault tolerance in big data clusters.
Big data systems often employ a standard workflow involving data ingestion, storage, processing, and sometimes pre-processing.
Data locality is a principle in big data processing that minimizes data movement across networks, keeping computation close to the data.
Batch processing and real-time processing are two methods for handling data, with the latter being suitable for high-velocity data streams.
Pre-processing may involve structuring unstructured data or reducing data redundancy to simplify further processing.
Techniques for reducing data redundancy can help to distill large datasets into more manageable and representative forms.
Big data streaming technologies like Apache Spark and Apache Flink facilitate real-time data processing and analysis.
The practical applications of big data are vast, including fraud detection, fleet management, and extracting valuable insights from transaction data.
Transcripts
Today we're going to be talking about big data. How big is big?
so
Well, first of all, there is no precise definition as a rule. So kind of the standard thing that people would say is
When we can no longer reasonably deal with the data using traditional methods
So that we kind of think what's a traditional method? Well, it might be can we process the data on a single computer?
Can we store the data on a single computer? And if we can't then we're probably dealing
With big data, so you need to have new methods in order to be able to handle and process this data
As computers get faster, with bigger capacities and more memory and things, the concept of what counts as big is changing, right?
So kind of, but a lot of it, as I'll talk about later, isn't really how
much power you can get in a single computer.
It's more how we can use multiple computers to split the data up, process everything and then bring it back together, like in the MapReduce framework.
Then, with big data,
there's something called the five V's, which kind of defines some features and problems that are common amongst any big data things.
We have the five V's, and the first three, I think, were defined in 2001,
with the other two added later. So first of all, we've got volume. This is the most obvious one:
it's simply how large the dataset is. The second one is
velocity.
So a lot of the time these days huge amounts of data are being generated in a very short amount of time
So you think of how much data Facebook is generating people liking stuff people uploading content that's happening constantly
All throughout the day the amount of data they generate every day
It's just huge basically so they need to process that in real time
And the third one is variety
Traditionally, the data we would have, we would store it in a traditional single database, and it would be in a very structured format.
So you've got columns and rows, and every row would have values for the columns. These days
We've got data coming in in a lot of different formats
So as well as the traditional kind of structured data, we have unstructured data
So you've got stuff coming in like web clickstreams, we've got like social media likes coming in
We've got stuff like images and audio and video
So we need to be able to handle all these different types of data and extract what we need from them
And the fourth one is
value.
Yeah, so there's no point in us collecting huge amounts of data and then doing nothing with it
So we want to know what we want to obtain from the data and then think of ways to go about that
So something some form of value could just be getting humans to understand what is happening
In that data. So for example if you have a fleet of lorries
They will all have telematics sensors in them, so we're collecting sensor data of what the lorries are doing.
So it's of a lot of value to the fleet manager to then be able to easily
visualize the huge amounts of data coming in and see what is happening. So as well as processing and storing this stuff,
we also want to be able to visualize it and show it to humans in an easily understandable format.
Or the value could just be finding patterns with machine learning algorithms from all of this data.
And then the fifth and final one is
veracity. This is basically how trustworthy the data is, how reliable it is.
So we've got data coming in from a lot of different sources
So is it being generated with statistical bias?
Are there missing values? If we think, for example, of the sensor data, we need to realize that maybe the sensors are faulty.
They're giving slightly off readings
So it's important to understand how
reliable the data we're looking at is. And so these are kind of the five
standard features of big data. Some people try and add more: there's another seven V's of big data, there's ten V's of big data.
I see. I'm sure they will keep going up and up.
They add things like vulnerability. So
obviously when we're storing a lot of data, a lot of that is quite personal data,
so making sure that's secure. But these are kind of the five main ones.
The first big thing with big data obviously is just the sheer volume.
So one way of dealing with this is to split the data across multiple computers
So you could think okay. So we've got too much data to fit on one machine. We'll just get a more powerful computer
We'll get more CPU power. We'll get larger memory
That very quickly becomes quite difficult to manage, because every time you need to
scale it up again, because you've got even more data, you need to buy a new computer or new hardware.
So what tends to happen instead, and what you'll see all companies do, is just have a cluster of computers.
So rather than a single machine
They'll have, say, a massive warehouse,
basically,
with loads and loads and loads of computers. And what this means we can do is distributed storage.
So each of those machines will store a portion of the data, and then we can also
do the computation split across those machines. Rather than having one computer going through,
I don't know, a billion database records, you can have each computer going through a thousand of those database records.
If we take a really naive way of doing it and say, okay, let's do it alphabetically: loads more records come in for, say, Z,
that's easy, stick them on the end. Loads more records come in for P, that's right in the middle, right? How do you manage that?
and so there's
Computing frameworks that will help with this
So for example, if you're storing data in a distributed fashion, there's the Hadoop Distributed File System,
and that will manage kind of the cluster resources and where the files are stored, and those frameworks will also provide fault tolerance and reliability.
So if one of the nodes goes down, then you've not lost that data; there will have been some replication across other nodes.
So that yeah losing a single node isn't going to cause you a lot of problems
And what using a cluster also allows you to do is whenever you want to scale it up
All you do is just add more computers into the network and you're done and you can get by on
relatively cheap
hardware, rather than having to keep buying a new supercomputer. In a big data
system there tends to be a pretty standard workflow.
So the first thing you would want to do is have a mechanism to
ingest the data. Remember, we've got a huge variety of data coming in; it's all coming in from different sources,
so we need a way to kind of aggregate it and move it on further down the pipeline.
There are some frameworks for this: there's Apache Kafka and Apache Flume, for example, and loads and loads of others as well.
So basically aggregate all the data and push it on to the rest of the system.
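As a hedged sketch of that ingestion step, here is how a sensor reading might be pushed onto a Kafka topic using the kafka-python client; the broker address, topic name and message fields are assumptions, and Flume or other tools could fill the same role.

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address and topic name; adjust for a real cluster.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

reading = {"lorry_id": 42, "speed_mph": 56.3, "heading": "N", "ts": time.time()}
producer.send("lorry-telemetry", reading)  # hand the record to the ingestion layer
producer.flush()                           # make sure it has actually been sent
```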
so then the second thing that you probably want to do is
Store that data so like we just spoke about the distributed file system
you store it in a distributed manner across the cluster. Then you want to
Process this data and you may skip out storage entirely
So in some cases you may not want to store your data
You just want to process it use it to update
Some machine learning model somewhere and then discard it and we don't care about long-term storage
So you're processing the data, again doing it in a distributed fashion using frameworks such as MapReduce or Apache Spark.
Designing the algorithms to do that processing requires a little bit more thought than maybe doing a traditional algorithm. The frameworks
will hide some of it, but you need to be thinking that even if we're doing it through a framework,
we've still got data on different computers. If we need to share messages between these computers during the computation,
it becomes quite expensive if we keep moving a lot of data across the network.
So it's about designing algorithms that limit moving data around, and that's the principle of data locality.
So you want to keep the computation close to the data.
Don't move the data around.
Sometimes it's unavoidable, but we limit it. So the other thing about processing is that there are different ways of doing it.
There's batch processing
So you already have all of your data, or whatever you've collected so far.
You take all of that data across the cluster you process all of that get your results and you're done
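A minimal batch-processing sketch using PySpark, in the spirit of the MapReduce-style processing mentioned above: the whole collected dataset is read, processed across the cluster, and written out once. The input and output paths are hypothetical and a working Spark installation is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-word-count").getOrCreate()
sc = spark.sparkContext

# Read everything collected so far, process it in one go, write the results, done.
lines = sc.textFile("hdfs:///data/logs/*.txt")        # hypothetical input path
counts = (lines.flatMap(lambda line: line.split())    # map each line to its words
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))       # sum the counts per word
counts.saveAsTextFile("hdfs:///data/word-counts")      # hypothetical output path
spark.stop()
```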
The other thing we can do is real-time processing. So again because the velocity of the data is coming in
we don't want to constantly have to take all the data we've collected,
process it, get results, and then, when we've got a ton more data,
do the same again: get all the data, bring it back, process all of it.
So instead we would
do real-time processing, so as each data item arrives,
We process that we don't have to look at all the data we've got so far. We just incrementally process everything
And that's coming up in another video when we talk about data streaming
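To make "process each item as it arrives" concrete, here is a tiny pure-Python sketch that keeps a running count and mean over a simulated sensor stream instead of reprocessing everything collected so far; the stream source and fields are invented, and a real pipeline would use something like Spark Streaming or Apache Flink.

```python
import random
import time

def sensor_stream(n=10):
    """Simulate lorry speed readings arriving one at a time."""
    for _ in range(n):
        yield {"lorry_id": 42, "speed_mph": random.uniform(40, 70)}
        time.sleep(0.1)  # pretend a new reading turns up every so often

count, running_sum = 0, 0.0
for reading in sensor_stream():
    # Incremental update: only the new item is touched, never the full history.
    count += 1
    running_sum += reading["speed_mph"]
    print(f"after {count} readings, mean speed = {running_sum / count:.1f} mph")
```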
So the other thing that you might want to do before processing is something called pre-processing. Remember I talked about unstructured data,
so maybe getting that data into a format that we can specifically use for the purpose we want.
That would be a stage in the pipeline before processing. The other thing with huge amounts of data is
there's likely to be a lot of noise and a lot of outliers, so we can remove those.
We can also remove redundant instances. So if you think we're getting a ton of instances in and we want to run a machine learning algorithm,
there'll be a lot of instances that are very, very similar. An instance is, say, in a database,
like a single line in the database. So for the sensor readings it would be everything for that
lorry at that point in time: its speed, the direction it's travelling. Reducing the number of instances is about reducing the granularity.
So part of it is saying,
rather than storing data for a
continuous period of time, so every minute for an hour, if those states are very similar across that period we can just say, okay, for this
period this is what happened, and put it in a single line. Or we could say, for example, for a machine learning algorithm, if there are
instances with very, very similar features and then a very, very similar class,
we can take a single one of those instances and that will suitably represent
all of those instances. So we can very, very quickly reduce a huge dataset down to a much smaller one
by saying there's a lot of redundancy here, and we don't need a hundred very similar instances
when one would do just as well.
So if you've got a hundred
instances and you reduce it down to one, does that not have an impact on how important those instances are in the scheme of things?
Yes, so there are techniques
that deal with this. Some of them would just purely say, okay, now this is a single instance, and
that's all you ever know. Others of them would
have a weighting,
so some way of saying this is a more important one, because it's very similar to a hundred others that we got rid of; this one's
really not as important, because there were only a few others that were similar to it. So we can weight instances to kind of reflect their importance.
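A toy sketch of that instance-reduction-with-weights idea: near-duplicate rows are collapsed into one representative whose weight records how many originals it stands for. The "round the speed" similarity rule is an arbitrary choice made up for the example.

```python
from collections import Counter

# Hypothetical sensor instances: (speed in mph, heading)
instances = [(56.2, "N"), (56.4, "N"), (55.9, "N"), (30.1, "E"), (56.1, "N")]

def bucket(instance):
    """Treat instances as 'very similar' if they round to the same values."""
    speed, heading = instance
    return (round(speed), heading)

# Collapse each bucket into one representative, weighted by how many it replaces.
weights = Counter(bucket(i) for i in instances)
reduced = [{"instance": rep, "weight": count} for rep, count in weights.items()]
print(reduced)
# e.g. [{'instance': (56, 'N'), 'weight': 4}, {'instance': (30, 'E'), 'weight': 1}]
```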
There are specific frameworks for big data streaming as well,
so there are technologies such as the Spark Streaming module for Apache Spark, or newer ones such as
Apache Flink, that can be used to do that. So they kind of abstract away the
streaming aspects of it, so you can focus
just on what you want to do, without thinking about all this data coming through very fast. Obviously
my limited brain is thinking streaming relates to video, but you're talking about just data that is happening in real time. Is that right?
Yes, so,
going back to the lorries: as they're driving down the motorway, they may be sending out a sensor reading every
minute or so, and
that sensor reading goes back, so we get all the sensor readings from all the lorries coming in as a data stream.
So that's kind of a very quick roundup of the basics of big data, and there are a lot of applications of this, obviously.
Banks will have huge volumes of transaction data that you can extract patterns of value from, and see what is normal, so they can do
kind of fraud detection on that. Again, the previous example of fleet managers understanding what is going on.
Basically any industry will now have ways of being able to extract value from the data that they have. So in the next video we're
going to talk about data stream processing, and more about how we actually deal with the problems that real-time data can present us
over very, very large files.
This kind of computation is a lot more efficient if you can distribute it, because doing this map phase of saying, okay,
this is one occurrence of the letter A, that's independent of anything else, and so mostly
what you're interested in, you're probably only interested when a button is pressed or so on, the only times...