Make Your Pandas Code Lightning Fast

Rob Mulla
13 Mar 2022 · 10:37

Summary

TL;DR: In this video, Rob demonstrates a technique that significantly speeds up code written with Pandas, Python's library for data manipulation. He poses a problem of calculating rewards for fictitious people based on a set of conditions and solves it three ways: looping over rows, using the apply function, and, most efficiently, using vectorized operations. The video emphasizes the importance of vectorization for large datasets, showing run time drop from about 3.4 seconds per run to about 1.57 milliseconds.

Takeaways

  • 🐼 Pandas is a crucial Python package for data handling and exploration.
  • 🚀 A simple trick can significantly speed up Pandas code, making it essential for large datasets.
  • 👋 Introduction to the presenter, Rob, who specializes in Python coding and machine learning videos.
  • 📈 The demonstration involves creating a random dataset with fictitious people's data for the example.
  • 🔢 Data includes ages, time in bed, and sleeping percentages, along with categorical features like favorite and hated foods.
  • 💡 The script introduces a problem of calculating rewards based on conditions using Pandas.
  • 🔄 Three methods are presented for solving the problem: looping, using the apply function, and vectorized operations.
  • ⏱️ Timing tests show that vectorized operations are the fastest, running 2000 times quicker than looping.
  • 🔧 The script emphasizes the efficiency of vectorized functions over looping or applying functions in Pandas.
  • 📚 The importance of using vectorized functions in Pandas is highlighted for performance optimization.
  • 👋 The video concludes with a reminder to use vectorized functions and a sign-off until the next video.

Q & A

  • What is the main topic of the video?

    -The main topic of the video is to demonstrate a trick to speed up Pandas code in Python for working with datasets, which can be essential when dealing with larger datasets.

  • Who is the presenter of the video?

    -The presenter of the video is Rob, who makes videos about coding in Python and machine learning.

  • What is the initial method discussed for solving the reward calculation problem in Pandas?

    -The initial method discussed for solving the reward calculation problem is looping over each row of the dataset and applying the reward calculation.

  • How long does the looping method take to run according to the video?

    -The looping method takes approximately 3.4 seconds per run on the 10,000-row example dataset, which is considered slow.

  • What is the second method introduced to improve the efficiency of the code?

    -The second method introduced is the use of the 'apply' function in Pandas, which is more efficient than looping over each row.

  • How much faster is the 'apply' function compared to the looping method based on the video?

    -The 'apply' function is significantly faster, taking an average of 189 milliseconds per run, which is a substantial improvement over the looping method.

  • What is the key to speeding up Pandas code as mentioned in the video?

    -The key to speeding up Pandas code is using vectorized functions, which operate on the entire dataset at once rather than row by row.

  • How long does the vectorized method take to run according to the video?

    -The vectorized method takes approximately 1.57 milliseconds per run once data generation is excluded (6.91 milliseconds with it), which is significantly faster than both the looping and 'apply' methods.

  • What is the fictitious problem presented in the video for demonstrating the code speedup?

    -The fictitious problem is to calculate a reward for each person in a dataset based on conditions on their time in bed, percentage of time sleeping, and age. The reward is either their favorite food or their hated food.

  • What are the conditions for giving a person their favorite food as a reward according to the problem presented?

    -A person will receive their favorite food as a reward if they are in bed for more than five hours and sleep for more than 50% of the time, or if they are over 90 years old.

  • What is the advice given in the video for optimizing Pandas code?

    -The advice given is to use vectorized functions whenever possible and to avoid iterating or looping over datasets unless necessary. The video also shows that 'apply' is faster than an explicit loop.

Outlines

00:00

🐼 Introduction to Pandas Optimization

In this introductory paragraph, the speaker, Rob, introduces the video's focus on optimizing Pandas code for efficiency. He emphasizes the importance of Pandas in Python for data manipulation and exploration, and hints at a significant performance boost achievable through a simple trick. Rob invites viewers to subscribe and follow for more content on Python coding and machine learning. The setup involves importing Pandas and NumPy, creating a DataFrame with random data about fictitious people, including their ages, time in bed, sleeping percentages, favorite foods, and disliked foods. The paragraph concludes with the introduction of a sample problem to be solved using Pandas, which involves calculating rewards based on certain conditions.
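
The setup described above could look roughly like the following sketch. The column names, the 0 to 9 hour range for time in bed, and the `seed` parameter with `numpy.random.default_rng` are assumptions for illustration; the video only states what kinds of values are generated.

```python
import numpy as np
import pandas as pd


def get_data(size=10_000, seed=None):
    """Build a random DataFrame of fictitious people (column names are assumed)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "age": rng.integers(1, 101, size),          # whole years, 1..100
        "time_in_bed": rng.integers(0, 10, size),   # hours; this range is an assumption
        "pct_sleeping": rng.random(size),           # fraction in [0, 1)
        "favorite_food": rng.choice(["pizza", "taco", "ice cream"], size),
        "hate_food": rng.choice(["broccoli", "candy corn", "eggs"], size),
    })


df = get_data(size=5, seed=0)
print(df)
```

Passing a seed makes the random dataset reproducible, which helps when comparing the results of different methods on the same data.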

05:06

🔄 Exploring Pandas Code Efficiency Levels

This paragraph delves into the process of optimizing Pandas code by comparing three different methods of applying a function to a DataFrame. The speaker begins by demonstrating the slowest method, which involves looping over each row using the iterrows function and applying a custom reward calculation. The process is then timed using the 'timeit' tool in Jupyter to measure its efficiency. The second method introduced is the use of the 'apply' function, which is shown to be significantly faster than looping. Finally, the paragraph hints at the most efficient method, vectorized functions, which are promised to provide even greater speed improvements in the subsequent content.

10:13

⚡ The Power of Vectorization in Pandas

In the concluding paragraph, the speaker demonstrates the power of vectorization in Pandas to achieve high-performance data manipulation. He contrasts the previously discussed looping and applying methods with vectorized operations, which apply conditions across the entire DataFrame simultaneously. The speaker provides a step-by-step guide on how to implement vectorized functions, showing how to replace a loop with a set of conditions that return a boolean array indicating which rows meet the criteria. The resulting 'reward' is then calculated based on these conditions. The efficiency of vectorization is highlighted by timing the code, which shows a dramatic reduction in execution time compared to the previous methods. The speaker concludes by advocating for the use of vectorized functions whenever possible to enhance Pandas code performance.
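
A minimal sketch of the vectorized approach described above, using a tiny hand-written DataFrame so the result is easy to check by eye. The column names are assumptions, and the age test uses `>= 90` to match the row-wise function described in the transcript.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 95, 40],
    "time_in_bed": [8, 2, 4],
    "pct_sleeping": [0.9, 0.1, 0.8],
    "favorite_food": ["pizza", "taco", "ice cream"],
    "hate_food": ["eggs", "broccoli", "candy corn"],
})

# Start by giving everyone their hated food.
df["reward"] = df["hate_food"]

# Boolean Series: True where a person qualifies for their favorite food.
qualifies = (
    ((df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5))
    | (df["age"] >= 90)
)

# Overwrite the reward only on the qualifying rows.
df.loc[qualifies, "reward"] = df["favorite_food"]

print(df["reward"].tolist())  # ['pizza', 'taco', 'candy corn']
```

Each comparison is evaluated on a whole column at once, so no Python-level loop over rows is needed. The parentheses around each comparison are required because `&` and `|` bind more tightly than `>` in Python.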

Keywords

💡Pandas

Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is widely recognized for its ability to handle large datasets and is a cornerstone in the Python data science ecosystem. In the video, Pandas is the main tool discussed for speeding up data processing tasks, and the script demonstrates how to optimize code efficiency when using this library.

💡Efficiency

Efficiency in the context of the video refers to the ability to perform tasks with the least amount of resources, such as time or processing power. It is a key focus of the video, as the presenter aims to show viewers how to make their Pandas code run faster. The script provides examples of different methods to calculate rewards in a dataset, with the goal of increasing the efficiency of the code.

💡Dataset

A dataset in the video is a collection of data, typically used for analysis. The script creates a fictitious dataset using random data to demonstrate the concepts being taught. The dataset consists of information about fictitious people, including their ages, time in bed, and sleeping percentages, which are used to calculate rewards based on certain conditions.

💡Jupyter Lab

Jupyter Lab is an open-source, interactive development environment that supports over 40 programming languages, including Python. It is used in the video for demonstrating the Pandas code. The presenter operates within a Jupyter Lab notebook to import libraries, create datasets, and write functions to process data.

💡NumPy

NumPy is a fundamental package for scientific computing with Python. It provides support for arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. In the video, NumPy is used in conjunction with Pandas to generate random data for the dataset.

💡DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types in Pandas. It is similar to a spreadsheet or SQL table, and is used extensively in the video to manipulate and analyze the dataset. The script demonstrates creating a DataFrame and applying functions to it for the reward calculation.

💡Function

In the context of the video, a function is a block of organized, reusable code that is used to perform a single, related action. The script introduces several functions, such as 'get_data' for creating the dataset and 'reward_calc' for determining the reward based on conditions. Functions are key to making the code more efficient and modular.
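
As an illustration, the reward logic could be written as a row-wise function along these lines. The name `reward_calc` comes from the video; the column names are assumptions.

```python
import pandas as pd


def reward_calc(row):
    """Return the reward for one person (one DataFrame row)."""
    # Anyone aged 90 or over gets their favorite food regardless.
    if row["age"] >= 90:
        return row["favorite_food"]
    # More than 5 hours in bed and asleep for more than half of that time.
    if row["time_in_bed"] > 5 and row["pct_sleeping"] > 0.5:
        return row["favorite_food"]
    return row["hate_food"]


person = pd.Series({
    "age": 34,
    "time_in_bed": 8,
    "pct_sleeping": 0.4,
    "favorite_food": "taco",
    "hate_food": "eggs",
})
print(reward_calc(person))  # eggs
```

The example person was in bed long enough but slept less than half of that time, so they receive their hated food.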

💡Looping

Looping is a programming concept where a block of code is executed repeatedly for a certain number of times or until a condition is met. In the video, the presenter initially uses looping to iterate over each row of the DataFrame to apply the reward calculation, which is later compared with more efficient methods.
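
A sketch of the looping approach on a three-row DataFrame, assuming the column names shown. Pre-creating the `reward` column is an addition in this sketch, not something the video describes; it keeps the column's dtype handling predictable.

```python
import pandas as pd


def reward_calc(row):
    if row["age"] >= 90:
        return row["favorite_food"]
    if row["time_in_bed"] > 5 and row["pct_sleeping"] > 0.5:
        return row["favorite_food"]
    return row["hate_food"]


df = pd.DataFrame({
    "age": [25, 95, 40],
    "time_in_bed": [8, 2, 4],
    "pct_sleeping": [0.9, 0.1, 0.8],
    "favorite_food": ["pizza", "taco", "ice cream"],
    "hate_food": ["eggs", "broccoli", "candy corn"],
})

# Pre-create the column with object dtype so assigning strings needs no dtype change.
df["reward"] = None

# iterrows() yields (index, row) pairs; each row is a Series.
for index, row in df.iterrows():
    df.loc[index, "reward"] = reward_calc(row)

print(df["reward"].tolist())  # ['pizza', 'taco', 'candy corn']
```

Every iteration builds a new Series for the row and performs a separate label-based write, which is a large part of why this approach is slow on big DataFrames.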

💡Apply Function

The apply function in Pandas is used to apply a function along an axis of the DataFrame. It is showcased in the video as a more efficient alternative to looping for applying the 'reward_calc' function to each row of the DataFrame. The script demonstrates the use of apply to speed up the code execution.
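
A sketch of the same calculation with `DataFrame.apply`, again assuming the column names shown. Passing `axis=1` makes pandas call the function once per row.

```python
import pandas as pd


def reward_calc(row):
    if row["age"] >= 90:
        return row["favorite_food"]
    if row["time_in_bed"] > 5 and row["pct_sleeping"] > 0.5:
        return row["favorite_food"]
    return row["hate_food"]


df = pd.DataFrame({
    "age": [25, 95, 40],
    "time_in_bed": [8, 2, 4],
    "pct_sleeping": [0.9, 0.1, 0.8],
    "favorite_food": ["pizza", "taco", "ice cream"],
    "hate_food": ["eggs", "broccoli", "candy corn"],
})

# axis=1 passes each row (as a Series) to reward_calc and collects the results.
df["reward"] = df.apply(reward_calc, axis=1)

print(df["reward"].tolist())  # ['pizza', 'taco', 'candy corn']
```

The function is still called once per row in Python, so `apply` avoids the per-row write of the loop but is not a vectorized operation.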

💡Vectorized Functions

Vectorized functions in Pandas are functions that operate on entire arrays rather than iterating over them row by row. They are highly efficient and are recommended in the video as the fastest way to process data in Pandas. The script illustrates how to use vectorized functions to calculate rewards in a much faster manner compared to looping or apply.

💡Performance

Performance in the video refers to how quickly and efficiently code executes. The presenter is focused on improving the performance of Pandas code by comparing different methods of processing data, such as looping, applying functions, and using vectorized operations. The script emphasizes the significant speed improvements gained by using vectorized functions.
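
The video times each method with the timeit magic in Jupyter. The sketch below uses the standard-library `timeit` module instead so that it runs outside a notebook. The helper names `with_loop`, `with_apply`, and `with_vectorized`, the column names, and the 1,000-row size are assumptions for illustration, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd


def get_data(size=1_000, seed=0):
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "age": rng.integers(1, 101, size),
        "time_in_bed": rng.integers(0, 10, size),
        "pct_sleeping": rng.random(size),
        "favorite_food": rng.choice(["pizza", "taco", "ice cream"], size),
        "hate_food": rng.choice(["broccoli", "candy corn", "eggs"], size),
    })


def reward_calc(row):
    if row["age"] >= 90:
        return row["favorite_food"]
    if row["time_in_bed"] > 5 and row["pct_sleeping"] > 0.5:
        return row["favorite_food"]
    return row["hate_food"]


def with_loop(df):
    df = df.copy()
    df["reward"] = None  # object column, filled row by row below
    for index, row in df.iterrows():
        df.loc[index, "reward"] = reward_calc(row)
    return df["reward"]


def with_apply(df):
    return df.apply(reward_calc, axis=1)


def with_vectorized(df):
    reward = df["hate_food"].copy()
    mask = ((df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5)) | (df["age"] >= 90)
    reward.loc[mask] = df.loc[mask, "favorite_food"]
    return reward


data = get_data()
timings = {}
for func in (with_loop, with_apply, with_vectorized):
    # Best of three single runs, in seconds.
    timings[func.__name__] = min(timeit.repeat(lambda: func(data), number=1, repeat=3))

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Generating the data once, outside the timed functions, keeps data creation out of the measurements. The video notes that data creation accounted for much of the 6.91 milliseconds first reported for the vectorized version.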

Highlights

Pandas is a popular Python package for data exploration and manipulation.

Efficiency improvements can be made in Pandas code with certain adjustments.

The video demonstrates a trick to make code run 2000 times faster.

The importance of efficiency is emphasized with larger datasets.

Introduction of the presenter, Rob, who makes coding and machine learning videos.

Importing necessary libraries: pandas as pd and NumPy as np.

Creating a DataFrame with random data for a fictitious dataset.

Inclusion of categorical features such as favorite and hated food.

Wrapping data creation in a function called 'get_data'.

Formulating a problem to assign rewards based on conditions in the dataset.

The 'reward_calc' function determines rewards based on time in bed and sleep percentage.

An additional condition for age over 90 years to always receive favorite food.

Three methods to apply the reward calculation: looping, apply function, and vectorized functions.

Looping over each row as the initial, but slow, method.

Using the apply function as a faster alternative to looping.

Vectorized functions as the most efficient method for speeding up Pandas code.

Demonstration of the significant speed difference between looping, applying, and vectorizing.

The recommendation to use vectorized functions whenever possible in Pandas.

A conclusion summarizing the key takeaway on optimizing Pandas code with vectorization.

Transcripts

play00:00

Pandas is an extremely popular Python package for working with and exploring datasets. It's an

play00:06

essential tool for anyone interested in working with data in Python. And while many love and use

play00:11

it every day, I'm surprised when I come across code that can be sped up in its efficiency just

play00:17

by making a few adjustments. In this video, I'm going to show you a trick that in our example

play00:23

makes the code run 2000 times faster. And as you start working with larger and larger datasets,

play00:29

this trick is going to become essential. My name is Rob, and I make videos about coding in Python

play00:35

and machine learning. If you do enjoy these videos, please consider subscribing,

play00:40

liking the video and giving me a follow on Twitch. Alright, so let's look at speeding up some pandas

play00:46

code. Okay, I hope you're excited to speed up some code. Here I am in a Jupyter lab notebook.

play00:52

And I am just going to start by making some imports. So of course, we want to import pandas

play00:58

as pd and import NumPy as np. And then we're going to create our data set. We're only using

play01:06

our data set as an example. So we're actually going to create some random data. And we're going to

play01:11

create that as a data frame. Like this, our data is going to be about fictitious people, we'll give

play01:19

these people random ages using NumPy random, let's give them a random integer between one and 100.

play01:28

And let's make all of these the same size, which will be 10,000. We'll also give the time in bed

play01:36

for these people, also generated randomly, and we're going to give the percentage of time sleeping.

play01:46

Now here we'll give some categorical features. So we'll give the favorite food of the person.

play01:53

Let's give them pizza, taco, and ice cream. And we'll give them some food that they hate. So the

play02:01

hate food, a random choice of broccoli (that's right, we need to add the size), candy corn,

play02:13

and eggs. And we can do this and see that we have our data set here with random data for the

play02:21

people that we are simulating. But we'll wrap this in a function. So we can call it, let's call it

play02:26

get data and add this size as a parameter and have it return this data frame. So now if we call

play02:35

get data, we get our random data set. So we're going to make a fictitious problem up that we're

play02:41

trying to solve with pandas here. The problem is we're taking the data that we have here,

play02:47

and we're going to say, given some conditions, we want to give each person a reward. So reward

play02:55

reward calculation. If they were in bed for more than five hours, and they were sleeping for more

play03:11

than 50% of the time, or point five, we give them their favorite food. Otherwise, we give them their

play03:25

hate food. And just to make it a little trickier, we're also going to say, if they are over 90 years

play03:33

old, give their favorite food regardless. They've lived to be 90 years old, so they deserve their

play03:43

favorite food. So we can write this. So we can write out this problem or this calculation in

play03:49

the form of a function that we will eventually apply to our data frame, we can do it pretty

play03:56

easily. Let's call this reward calc. And we'll give it a row or a person from our data frame.

play04:06

Writing our problem out as code, we would say if the age of the person is greater than or equal to 90,

play04:13

we will return their favorite food. If the row time in bed is greater than five, and the percent

play04:28

time sleeping is greater than point five, then we return their favorite food. And then otherwise,

play04:39

we're going to return the hate food. So this is the sort of thing you come across a lot when working

play04:45

with pandas data sets: you want to apply some sort of a function or logic across every row in the

play04:52

data set. And we're going to do this three different ways from slowest to fastest. Now the first one

play04:57

we'll call level one, which is looping. This was always my first way of trying to solve this sort

play05:06

of problem is looping over each row of the data set and applying our known reward calculation.

play05:14

Now what we're going to find is it's not necessarily very fast, but let's give it a try. So we're going

play05:19

to first call our get_data function, then we're going to iterate over each row using the iterrows

play05:25

function. So we're going to say for index, row in the data frame's iterrows(). And then for each

play05:33

row, we'll call our reward calc. And we need to store this into our data frame at the index

play05:40

location that we are at. So we'll use .loc with the index, call the column reward, and store it.

play05:50

You can see it's running now. And there it's finished. So we want to time this and see how

play05:56

long it actually takes. And there's this nice magic tool that we can run on the top of our Jupyter

play06:03

cell called timeit (%%timeit), and this will run it multiple times and give us the average amount of time it

play06:09

took to run. It's nice to time things this way, because then you get an idea over multiple times,

play06:17

as opposed to just one try. And it's finally done. All right, we see it ran for seven runs.

play06:24

And each run was about 3.4 seconds. That seems pretty slow. Let's make it faster. So we're going

play06:30

to go into level number two, which is the apply function, you may have heard of using the apply

play06:36

function, it can be very useful. In this case, we are going to just take our data frame, and we'll

play06:42

apply this reward calculation function that we created. And we're going to do it on the axis

play06:50

equals one, so that we make sure it runs through each row. And it's essentially going to do the

play06:55

same thing as our loop, but more efficiently. So let's store this as reward. Let's also

play07:04

get a new data frame every time and time it like we did above. All right, so it's done. So it ran

play07:12

seven times and averaged 189 milliseconds. We're already seeing a huge speed improvement by using

play07:21

apply instead of iterating over each row. But there's more, let's make it even faster. Let's

play07:27

go into level three. And this is the key to speeding up most of your slow pandas code.

play07:33

And that's using vectorized functions. Vectorized functions work very efficiently: they're when

play07:41

you apply functions like this across the whole data set, instead of each row. I'm going to show

play07:47

you how we do that here. So what we would do is instead of using our pre built function for each

play07:53

row, we actually apply each of these conditions to the whole data frame itself. Let's write each of

play08:00

them up. So we're going to say the percent sleeping, remember, needs to be greater than 0.5 for it to

play08:08

be true. And the time in bed needs to be greater than five. Or if the age is greater than 90. So

play08:22

this is a vectorized version of the same code that we wrote above. But it's going to run a lot faster.

play08:28

And what it will provide us back with is an array of true or false for if it meets this condition.

play08:35

Now what we're going to do is we're going to call our get data function again, we're going to call

play08:40

reward is equal to the hate food, except for when these conditions are met. So we've already filled

play08:51

the reward in with all hate food, but we're going to locate when these conditions are met, and when

play08:58

they are, we'll give the reward of the favorite food. And let's just split these lines up. So it's

play09:06

a little bit easier to follow. And let's go ahead and run timeit on this. Wow, that's a lot faster.

play09:13

You can see it's 6.91 milliseconds. And actually a lot of the time that was taken to run this

play09:20

was just in the get_data function. So if I take the get_data call out of this, it's 1.57 milliseconds.

play09:29

Let's just quickly plot the differences in the time that each takes. And I have some data saved

play09:35

off here from a previous run. These are the results from a previous run, where I had the

play09:41

milliseconds that looping, applying, and vectorizing each took. Let's set our index to the type and plot this

play09:49

as a bar plot, and we can see here the time it took to run each reward type. The huge jump

play09:58

down was by changing from a loop to apply. But then you can see that going from apply to using

play10:04

vectorized functions made it even faster. So the key is, whenever you're writing functions on pandas,

play10:12

try to vectorize as much as you can. It's not always the case that you're able to do it. But

play10:19

when possible, use a vectorized function. I hope you enjoyed this quick video showing you how you

play10:25

can speed up your pandas code. Always use vectorized functions when you can. Don't iterate or loop over

play10:31

it unless you need to. Thanks for watching and I'll see you in the next video.

Related Tags
Pandas Optimization, Python Coding, Data Efficiency, Machine Learning, Coding Tutorial, Performance Tricks, Data Analysis, Jupyter Notebook, Rob's Tips, Vectorization Techniques