What is Big Data? - Computerphile

Computerphile
15 May 2019, 11:52

Summary

TLDR: The video explores the concept of 'big data', noting that there is no precise definition: data counts as 'big data' when it is too large to handle with traditional methods. It introduces the five Vs of big data: Volume, Velocity, Variety, Value and Veracity. It discusses techniques for handling large volumes of data, such as using multiple computers and frameworks like MapReduce and Apache Spark. The video also mentions the importance of security and privacy of personal data in the context of big data.

Takeaways

  • 📊 The concept of 'big data' has no precise definition; data is considered 'big' when it is too large to handle with traditional methods.
  • 🖥️ Not being able to process or store the data on a single computer is a sign that you are probably dealing with 'big data'.
  • 🚀 As computers evolve, the threshold for what counts as 'big' is constantly changing.
  • 🔍 Handling 'big data' requires new methods, such as using multiple computers in parallel with frameworks like MapReduce.
  • 📈 The 'five Vs of big data' are Volume, Velocity, Variety, Value and Veracity, which capture common characteristics and challenges.
  • 🌐 'Velocity' refers to how quickly data is generated, as in the case of Facebook, which requires real-time processing.
  • 📚 'Variety' covers both structured and unstructured data, including text, images, audio and video.
  • 💡 'Value' refers to the need to extract useful information from the data, such as patterns or insights for decision-making.
  • 🔒 'Veracity' is crucial, since it concerns how trustworthy and reliable the data is, taking possible biases and measurement faults into account.
  • 🔄 Handling 'big data' involves distributing data and computation across multiple computers, which allows more efficient scaling.
  • 🛠️ 'Big data' systems usually follow a standard workflow covering data ingestion, storage, processing and visualization.
  • 🔍 'Pre-processing' the data is an important step before the main processing, especially for unstructured data and for reducing redundancy.

Q & A

  • What is big data and how is it defined?

    -Big data refers to datasets so large that they cannot reasonably be handled with traditional methods, such as processing or storing them on a single computer. The exact definition varies and depends on the processing and storage capacity of current computer systems.

  • What are the 'five Vs' of big data and what do they represent?

    -The five Vs of big data are Volume, Velocity, Variety, Value and Veracity. They represent the common characteristics and challenges of handling large volumes of data: how much data there is, how quickly it is generated, the diversity of formats, how useful the data is and how trustworthy the information is.

  • How does the concept of big data relate to the processing capacity of a computer?

    -Big data involves volumes of data that cannot be processed or stored on a single computer. As processing and storage capacities grow, the threshold for what counts as 'big' also shifts.

  • What is the MapReduce technique and how does it relate to big data?

    -MapReduce is a programming framework for processing large volumes of data in a distributed fashion. It splits the data and processes it in parallel across multiple computers, improving efficiency and making it possible to handle large datasets.

  • Why is 'Velocity' important in the context of big data?

    -Velocity refers to how quickly data is generated, which calls for real-time processing solutions to manage and analyze it effectively, as with social media platforms or real-time sensors.

  • How is the 'Variety' of data handled in big data?

    -Variety refers to the diversity of data formats, including structured and unstructured data. Handling it requires tools that can process and extract useful information from different kinds of data, such as text, images, audio and video.

  • What does 'Value' mean in big data and why is it important?

    -Value in big data refers to the importance of obtaining useful information or knowledge from the data that is collected. It is crucial for decision-making and for analysis that lets organizations understand and improve their operations.

  • What is 'Veracity' and how does it affect data handling?

    -Veracity refers to how trustworthy and accurate the data is. It is essential to assess data quality, identify biases and missing values, and understand how reliable the data sources are in order to guarantee the validity of the analysis.

  • How is the problem of data volume addressed in big data?

    -To address data volume, techniques such as distributing the data across multiple computers in a cluster are used, which allows parallel storage and processing, makes scaling out easier and keeps costs down.

  • What is a computer cluster and how does it help with big data?

    -A computer cluster is a set of interconnected machines that work together to process and store data in a distributed way. This increases the capacity to handle large volumes of data and provides fault tolerance and scalability.

  • What are the phases of the standard workflow in a big data system?

    -The phases of the standard workflow include data ingestion, storage, processing (either in batches or in real time), and possibly pre-processing before the data is analyzed to extract valuable information.

Outlines

00:00

📊 Introduction to Big Data

The first section introduces the concept of 'big data' and explains that there is no precise definition. Data is considered 'big data' when traditional methods, such as processing or storing it on a single computer, are no longer viable. As technology evolves this threshold keeps shifting, and the importance of new methods, such as MapReduce, for handling large volumes of data is emphasized. The 'five Vs of big data' are also introduced: Volume, Velocity, Variety, Value and Veracity, which are common characteristics and challenges in handling large datasets.

05:06

🔧 Distributed Processing and Storage

This section focuses on how to manage large volumes of data using distributed file systems such as the Hadoop Distributed File System. It discusses the importance of fault tolerance and data replication for reliability. It also describes the standard pipeline in big data systems, which includes ingesting the data, storing it on a cluster, processing it in parallel and limiting data movement to improve efficiency. Different processing approaches are mentioned, such as batch processing and real-time processing, and the importance of pre-processing is highlighted, both to give unstructured data a usable shape and to reduce redundancy.
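
As a rough sketch of that standard workflow in plain Python (the function names, the lorry-speed records and the in-memory list standing in for a distributed file system are illustrative assumptions, not anything shown in the video):

```python
def ingest(sources):
    # Ingestion: pull raw records from several sources into one stream.
    for source in sources:
        yield from source

def preprocess(records):
    # Optional pre-processing: drop malformed readings before they go further.
    return [r for r in records if r.get("speed_kmh") is not None]

def store(records, storage):
    # Storage: a real system would write to a distributed file system;
    # here an in-memory list stands in for the cluster's storage layer.
    storage.extend(records)
    return storage

def process(records):
    # Processing: extract a simple piece of value, the average speed per lorry.
    totals = {}
    for r in records:
        s, n = totals.get(r["lorry"], (0.0, 0))
        totals[r["lorry"]] = (s + r["speed_kmh"], n + 1)
    return {lorry: s / n for lorry, (s, n) in totals.items()}

sources = [
    [{"lorry": "A", "speed_kmh": 80}, {"lorry": "B", "speed_kmh": None}],
    [{"lorry": "A", "speed_kmh": 90}],
]
cluster_storage = []
stored = store(preprocess(ingest(sources)), cluster_storage)
print(process(stored))  # {'A': 85.0}
```

In a real deployment each stage would be a separate distributed component, for example Kafka or Flume for ingestion, HDFS for storage, and MapReduce or Spark for processing, as described in the transcript.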

10:11

🚀 Real-Time Data Processing

The third section explores real-time data processing, with technologies such as Spark Streaming and Apache Flink that make it possible to handle live data streams. It is illustrated with the example of lorries constantly sending sensor data, a real-time source of information. It highlights the need to process data incrementally and the importance of efficient distributed processing to handle large volumes of data effectively.
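
As a toy illustration of that incremental, per-item style of processing (plain Python standing in for the idea, not the actual Spark Streaming or Flink APIs), a running average updated as each sensor reading arrives might look like this:

```python
from collections import defaultdict

class RunningAverage:
    """Streaming-style state: updated once per arriving reading, never
    recomputed over the whole history."""

    def __init__(self):
        self.totals = defaultdict(lambda: [0.0, 0])  # lorry id -> [sum, count]

    def update(self, lorry_id, speed_kmh):
        total = self.totals[lorry_id]
        total[0] += speed_kmh
        total[1] += 1
        return total[0] / total[1]  # current average after this reading

state = RunningAverage()
# In a real system these readings would arrive continuously over the network.
for lorry, speed in [("lorry-1", 80.0), ("lorry-2", 65.0), ("lorry-1", 90.0)]:
    print(lorry, "average speed so far:", state.update(lorry, speed))
```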

Keywords

💡Big Data

Big Data refers to datasets so large and complex that they cannot be handled by traditional software systems. The script notes that there is no precise definition, but data is considered 'big data' when it is too large to be processed or stored on a single computer. This concept is central to the video, which discusses how such massive volumes of data are handled and processed using newer methods.

💡MapReduce

MapReduce is a programming framework for distributed parallel processing, used to handle large datasets. The script mentions it as an example of how data can be split up and processed on multiple computers before the results are gathered back together, which is crucial for handling big data efficiently.
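
As a minimal sketch of that split-then-gather idea, here is a word count written in plain Python; the chunk lists, function names and single-process execution are illustrative assumptions rather than a real cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Map: each node emits a (word, 1) pair for every word in its chunk.
    return [(word, 1) for line in chunk for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each word once the pairs are grouped by key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each "node" holds one chunk; on a real cluster the map calls run in parallel.
chunks = [["big data is big"], ["data moves fast", "big data has value"]]
mapped = [map_phase(c) for c in chunks]
print(reduce_phase(chain.from_iterable(mapped)))
# {'big': 3, 'data': 3, 'is': 1, 'moves': 1, 'fast': 1, 'has': 1, 'value': 1}
```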

💡Five Vs

The Five Vs are a set of characteristics that capture the common features and challenges of working with big data: Volume, Velocity, Variety, Value and Veracity. These Vs are fundamental to understanding the different aspects that have to be considered when working with big data, as described in the script.

💡Volume

Volume refers to the amount of data that is generated and stored. It is one of the Five Vs and is central to big data, since it calls for technologies that can store and process very large amounts of information, as discussed in the video.

💡Velocity

Velocity refers to how quickly data is generated and processed. It is another of the Five Vs and matters in big data because it requires systems that can handle incoming data in real time, as mentioned in the script with the Facebook example.

💡Variety

Variety refers to the diversity of formats and types of data handled in big data, including structured, semi-structured and unstructured data. It is one of the Five Vs and is key to understanding how different kinds of data are processed, as described in the script.
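
A small, assumed example (not from the video) of what handling variety can look like in practice: mapping structured, semi-structured and unstructured inputs onto one common record shape before further processing:

```python
import csv
import io
import json

def normalize(raw, kind):
    # Map different incoming formats onto one common record shape.
    if kind == "csv":   # structured: fixed columns and rows
        row = next(csv.reader(io.StringIO(raw)))
        return {"user": row[0], "action": row[1]}
    if kind == "json":  # semi-structured: named, possibly nested fields
        doc = json.loads(raw)
        return {"user": doc["user"]["id"], "action": doc["event"]}
    # unstructured free text: keep it, but in the same shape as everything else
    return {"user": None, "action": raw.strip().lower()}

print(normalize("alice,like", "csv"))
print(normalize('{"user": {"id": "bob"}, "event": "upload"}', "json"))
print(normalize("  Shared a video  ", "text"))
```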

💡Value

Value refers to the usefulness of the data, that is, the information or insight that can be extracted from it. It is one of the Five Vs and is essential in big data, since the end goal is to obtain valuable knowledge from the data, as mentioned in the script with the example of the sensors in lorries.

💡Veracity

Veracity refers to the trustworthiness and quality of the data. It is one of the Five Vs and is crucial in big data, since it means ensuring that the data being used is accurate and reliable, as discussed in the script with the example of faulty sensors.

💡Distributed Storage

Distributed storage means storing data across multiple computers or nodes rather than on a single machine. It is standard practice in big data for handling large volumes of information, as described in the script when discussing how volume is managed.
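
One common way to decide which node holds each record is hash partitioning, which spreads keys evenly and avoids the problem the transcript raises with splitting records alphabetically. The sketch below is a toy single-process illustration; the node count and key names are made up:

```python
import hashlib

NUM_NODES = 4  # illustrative cluster size

def node_for(record_key: str) -> int:
    # The key alone determines the node, so any machine can locate a record
    # without a central index, and keys spread roughly evenly across nodes.
    digest = hashlib.md5(record_key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

for key in ["lorry-17", "lorry-42", "sensor-9981", "zebra"]:
    print(key, "-> node", node_for(key))
```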

💡Data Locality

Data locality is a principle in the design of big data algorithms that seeks to minimize data movement between compute nodes. The script mentions it as a strategy for designing efficient algorithms that reduce network overhead when processing large datasets.
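
A toy, assumed illustration of the principle: each "node" reduces its own partition where the data lives, so only small per-node summaries cross the network rather than the raw records:

```python
from collections import Counter

# Three "nodes", each holding its own partition of the raw event data.
partitions = [
    ["like", "click", "like"],    # node 0
    ["click", "click", "share"],  # node 1
    ["like", "share"],            # node 2
]

def local_count(partition):
    # Runs where the data lives; the result is a tiny summary, not raw data.
    return Counter(partition)

local_summaries = [local_count(p) for p in partitions]  # no raw records moved
global_counts = sum(local_summaries, Counter())         # only summaries combined
print(global_counts)  # Counter({'like': 3, 'click': 3, 'share': 2})
```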

💡Pre-processing

Pre-processing is the preparation of data before analysis, which can include cleaning, transforming or reducing the information. It is important in big data to ensure the data is in a suitable format for analysis, as described in the script.
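
As a sketch of the redundancy-reduction step described in the transcript, near-identical instances can be collapsed into a single representative carrying a weight; the tolerance value and the simple quadratic loop here are assumptions made for illustration:

```python
def deduplicate(instances, tolerance=1.0):
    # Collapse near-identical instances into (representative, weight) pairs, so
    # a hundred almost-equal sensor readings become one weighted row.
    representatives = []  # list of (instance, weight)
    for inst in instances:
        for i, (rep, weight) in enumerate(representatives):
            if all(abs(a - b) <= tolerance for a, b in zip(inst, rep)):
                representatives[i] = (rep, weight + 1)
                break
        else:
            representatives.append((inst, 1))
    return representatives

# (speed_kmh, heading_deg) readings taken a minute apart; most are near-identical.
readings = [(80.0, 90.0), (80.4, 90.2), (79.8, 89.9), (55.0, 180.0)]
print(deduplicate(readings))  # [((80.0, 90.0), 3), ((55.0, 180.0), 1)]
```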

Highlights

Big data is defined by the inability to process data with traditional methods, such as on a single computer.

The concept of 'big' is dynamic, changing with advancements in computer capacity and speed.

Big data is often managed using the MapReduce framework, which involves splitting data and processing it across multiple computers.

The 'five Vs' of big data are Volume, Velocity, Variety, Value, and Veracity, encompassing key characteristics and challenges.

Volume refers to the size of the dataset, which is a fundamental aspect of big data.

Velocity indicates the speed at which data is generated, such as the constant stream of data from social media platforms like Facebook.

Variety addresses the range of data formats, from structured to unstructured, including images, audio, and video.

Value is about extracting meaningful insights or patterns from the collected data, enhancing understanding or decision-making.

Veracity concerns the trustworthiness and reliability of data, considering potential biases and inaccuracies.

Distributed storage and computation are common approaches to managing the vast amounts of data in big data systems.

Frameworks like the Hadoop Distributed File System manage data storage and provide fault tolerance in big data clusters.

Big data systems often employ a standard workflow involving data ingestion, storage, processing, and sometimes pre-processing.

Data locality is a principle in big data processing that minimizes data movement across networks, keeping computation close to the data.

Batch processing and real-time processing are two methods for handling data, with the latter being suitable for high-velocity data streams.

Pre-processing may involve structuring unstructured data or reducing data redundancy to simplify further processing.

Techniques for reducing data redundancy can help to distill large datasets into more manageable and representative forms.

Big data streaming technologies like Apache Spark and Apache Flink facilitate real-time data processing and analysis.

The practical applications of big data are vast, including fraud detection, fleet management, and extracting valuable insights from transaction data.

Transcripts

play00:00

Today we're going to be talking about big data. How big is big?

play00:03

so

play00:04

Well, first of all, there is no precise definition as a rule. So kind of the standard, what people would say, is

play00:12

When we can no longer reasonably deal with the data using traditional methods

play00:16

So that we kind of think what's a traditional method? Well, it might be can we process the data on a single computer?

play00:23

Can we store the data on a single computer? And if we can't then we're probably dealing

play00:27

With big data, so you need to have new methods in order to be able to handle and process this data

play00:35

As computers are getting faster, with bigger capacities and more memory and things, the concept of what counts as big is changing, right?

play00:42

So kind of, but a lot of it, as I'll talk about later, isn't really how

play00:48

Much power you can get in a single computer

play00:50

It's more how we can use multiple computers to split the data up, process everything and then bring it back, like in the MapReduce framework

play00:58

Then, moving on, with big data

play01:00

There's something called the five Vs, which kind of defines some features and problems that are common amongst any big data things

play01:07

We have the five Vs, and the first three that were defined... I think these were defined in 2001

play01:11

So that's kind of how they came about. So first of all, we've got volume. So this is the most obvious one

play01:18

It's just simply how large the dataset is. The second one is

play01:24

velocity

play01:25

So a lot of the time these days huge amounts of data are being generated in a very short amount of time

play01:31

So you think of how much data Facebook is generating people liking stuff people uploading content that's happening constantly

play01:37

All throughout the day the amount of data they generate every day

play01:40

It's just huge basically so they need to process that in real time

play01:44

And the third one is variety

play01:47

Traditionally the data we would have and we would store it in a traditional single database. It would be in a very structured format

play01:53

So you've got columns and rows, every row would have values for the columns. These days

play01:57

We've got data coming in in a lot of different formats

play01:59

So as well as the traditional kind of structured data, we have unstructured data

play02:03

So you've got stuff coming in like web clickstreams, we've got like social media likes coming in

play02:08

We've got stuff like images and audio and video

play02:12

So we need to be able to handle all these different types of data and extract what we need from them

play02:17

and the fourth one is

play02:20

value

play02:23

Yeah, so there's no point in us collecting huge amounts of data and then doing nothing with it

play02:28

So we want to know what we want to obtain from the data and then think of ways to go about that

play02:33

So something some form of value could just be getting humans to understand what is happening

play02:38

In that data. So for example if you have a fleet of lorries

play02:42

They will all have telematics sensors in them, collecting sensor data of what the lorries are doing

play02:47

So it's of a lot of value to the fleet manager to then be able to easily

play02:51

Visualize huge amounts of data coming in and see what is happening. So as well as processing and storing this stuff

play02:57

We also want to be able to visualize it and show it to humans in an easily understandable format

play03:01

Other value stuff is just finding patterns, machine learning algorithms, from all of this data

play03:06

So then the fifth and final one is

play03:08

Veracity. This is basically how trustworthy the data is, how reliable it is

play03:12

So we've got data coming in from a lot of different sources

play03:15

So is it being generated with statistical bias?

play03:18

Are there missing values? If we think, for example, of the sensor data, we need to realize that maybe the sensors are faulty

play03:24

They're giving slightly off readings

play03:26

So it's important to understand how?

play03:28

Reliable the data we're looking at is and so these are kind of the five

play03:31

Standard features of big data. Some people try and add more; there's another seven Vs of big data, and a ten Vs somewhere

play03:38

I've seen. I'm sure we will keep going up and up

play03:40

They add in things like, I don't know, vulnerability. So

play03:45

Obviously when we're storing a lot of data a lot of that is quite personal data

play03:49

So making sure that's secure but these are the kind of the five main ones

play03:52

The first thing with big data obviously is just the sheer volume

play03:55

So one way of dealing with this is to split the data across multiple computers

play04:01

So you could think okay. So we've got too much data to fit on one machine. We'll just get a more powerful computer

play04:06

We'll get more CPU power. We'll get larger memory

play04:10

that very quickly becomes quite difficult to manage because every time you need to

play04:13

Scale it up again because you've got even more data, you have to buy a new computer or new hardware

play04:18

So what tends to happen instead, in all like these big companies, is they'll just have like a cluster of computers

play04:24

So rather than a single machine

play04:27

They'll have, say, a massive warehouse

play04:31

basically

play04:31

Full of loads and loads and loads of computers, and what this means that we can do is we can do distributed storage

play04:37

so each of those machines will store a portion of the data and then we can also

play04:42

Do the computation split across those machines rather than having one computer going through?

play04:47

I don't know, a billion database records, you can have each computer going through a thousand of those database records

play04:53

If we take a really naive way of saying, right, OK, let's do it alphabetically. A load more records come in for, say, Zed

play04:59

That's easy, stick it on the end. A load more records come in for P, they're somewhere in the middle, right? How do you manage that?

play05:06

and so there's

play05:08

Computing frameworks that will help with this

play05:09

So for example, if you're storing data in a distributed fashion, there's the Hadoop Distributed File System

play05:16

And that will manage kind of the cluster resources where the files are stored and those frameworks will also provide fault tolerance and reliability

play05:23

So if one of the nodes goes down, then you've not lost that data. There will have been some replication across other nodes

play05:30

So that yeah losing a single node isn't going to cause you a lot of problems

play05:34

And what using a cluster also allows you to do is whenever you want to scale it up

play05:38

All you do is just add more computers into the network and you're done and you can get by on

play05:44

relatively cheap

play05:46

Hardware rather than having to keep buying a new supercomputer in a big data

play05:51

System there tends to be a pretty standard workflow

play05:53

so the first thing you would want to do is have a means to

play05:58

Ingest the data remember, we've got a huge variety of data coming in. It's all coming in from different sources

play06:04

So we need a way to kind of aggregate it and move it on further down the pipeline

play06:08

So there's some frameworks for this. There's Apache Kafka and, like, Apache Flume for example, and loads and loads of others as well

play06:17

So basically aggregate all the data push it on to the rest of the system

play06:22

so then the second thing that you probably want to do is

play06:26

Store that data so like we just spoke about the distributed file system

play06:30

you store it in a distributed manner across the cluster. Then you want to

play06:34

Process this data and you may skip out storage entirely

play06:38

So in some cases you may not want to store your data

play06:40

You just want to process it use it to update

play06:43

Some machine learning model somewhere and then discard it and we don't care about long-term storage

play06:48

So you're processing the data, again doing it in a distributed fashion using frameworks such as MapReduce or Apache Spark

play06:54

Designing the algorithms to do that processing requires a little bit more thought than maybe doing a traditional algorithm with the frameworks

play07:01

They'll hide some of it, but you need to be thinking that even if we're doing it through a framework

play07:06

We've still got data on different computers if we need to share messages between these computers during the computation

play07:12

It becomes quite expensive if we keep moving a lot of data across the network

play07:16

So it's designing algorithms that limit data movement around and it's the principle of data locality

play07:23

So you want to keep the computation close to the data?

play07:27

Don't move the data around

play07:28

Sometimes it's unavoidable, but we limit it. So the other thing about processing is that there's different ways of doing it

play07:34

There's batch processing

play07:35

So you already have all of your data, or whatever you've collected so far

play07:39

You take all of that data across the cluster you process all of that get your results and you're done

play07:45

The other thing we can do is real-time processing. So again because the velocity of the data is coming in

play07:50

We don't want to constantly have to take all the data to date

play07:53

Reprocess it, get results and then we've got a ton more data

play07:56

And want to do the same: get all the data, bring it back, process all of it

play08:01

So instead we would

play08:03

Do real-time processing so as each data item arrives?

play08:07

We process that we don't have to look at all the data we've got so far. We just incrementally process everything

play08:14

And that's coming up in another video when we talk about data streaming

play08:18

So the other thing that you might want to do before processing is something called pre-processing remember I talked about unstructured data

play08:24

So maybe getting that data into a format that we specifically can use for the purpose we want to so

play08:30

That would be a stage in the pipeline before processing the other thing with huge amounts of data

play08:35

There's likely to be a lot of noise a lot of outliers so we can remove those

play08:40

We can also remove redundant instances. So if you think we're getting a ton of instances in and we want a machine learning algorithm

play08:46

There'll be a lot of instances that are very very similar see an instance is say in a database

play08:51

It's like a single line in the database. So for an HGV sensor reading it would be everything for that

play08:57

Lorry at that point in time, say its speed, direction it's travelling. Reducing the number of instances is about reducing the granularity

play09:03

so part of it is saying

play09:05

if we say, rather than storing data for a

play09:08

Continuous period of time so every minute for an hour if those states are very similar across that we can just say okay for this

play09:14

period this is what happens and put it in a single line or we could say for example a machine learning algorithm if there's

play09:22

Instances with very very similar features and then a very very similar class

play09:26

We can take a single one of those instances and that will suitably represent

play09:30

All of those instances so we can very very quickly reduce a huge data set down to a much smaller one

play09:36

By saying there's a lot of redundancy here and we don't need a hundred very similar instances

play09:40

When one would do just as well

play09:42

So if you've got a hundred

play09:44

Instances and you reduce it down to one, does that not have an impact on how important those instances are in the scheme of things?

play09:52

Yes, so techniques

play09:54

That deal with this stuff. Some of them would just purely say okay now this is a single instance and

play10:01

That's all you ever know others of them would

play10:04

Have, yeah, have a weighting

play10:05

So some way of saying this is a more important one because it's very similar to 100 others that we got rid of; this one's

play10:10

really not as important because there were only, like, three others that were similar to it, so we can weight instances to kind of reflect their

play10:16

Importance. There are specific frameworks for big data streaming as well

play10:20

so there's technologies such as the Spark Streaming module for Apache Spark, or there's newer ones such as

play10:27

Apache Flink that can be used to do that. So they kind of abstract away from the

play10:31

streaming aspects of it so you can focus

play10:34

Just on what you want to do, without thinking about all this data coming through very fast, obviously

play10:38

My limited brain is thinking streaming relates to video. But you're talking about just data that is happening in real time. Is that right?

play10:45

yes, so

play10:47

Going back to the lorries: as they're driving down the motorway, they may be sending out a sensor reading every

play10:53

minute or so and

play10:55

That sensor reading goes back, and we get all the sensor readings from all the lorries coming in as a data stream

play10:59

so that's kind of a very quick roundup of the basics of big data, and there's a lot of applications of this obviously. So

play11:06

Banks will have huge volumes of transaction data that you can extract patterns of value from, and see what is normal, and they can do

play11:13

Kind of fraud detection on that again. The previous example of fleet managers understanding what is going on

play11:19

basically any industry will now have ways of being able to extract value from the data that they have so in the next video we're

play11:25

Going to talk about data stream processing, and more about how we actually deal with the problems that real-time data can present us

play11:33

over very, very large volumes

play11:35

This kind of computation is a lot more efficient if you can distribute it, because doing this map phase of saying, okay

play11:41

This is one occurrence of the letter A, that's independent of anything else. And, see, most

play11:46

Interestingly, you're probably only interested when a button is pressed or so on, the only times positive
