25 Nooby Pandas Coding Mistakes You Should NEVER make.

Rob Mulla
7 Sept 2022, 11:29

Summary

TL;DR: This video educates new pandas users on 25 common mistakes to avoid for efficient data manipulation. It covers issues like unnecessary CSV indexing, using spaces in column names, underutilizing the query method, and misusing in-place operations. The script emphasizes the importance of leveraging vectorization, proper data type settings, and pandas' built-in functions for tasks like plotting and string manipulation. It also advises against practices like creating duplicate columns and encourages the use of categorical data types for efficiency.

Takeaways

  • 📝 Avoid writing to CSV with an unnecessary index; set `index=False` when saving or set an index column when reading in.
  • 🔑 Use underscores instead of spaces in column names for easier access and querying.
  • 🔍 Leverage the `query` method for powerful data frame filtering instead of verbose boolean-mask indexing.
  • 📝 Use the `@` symbol to incorporate external variables in pandas queries without manually formulating strings.
  • ❌ Avoid using `inplace=True` as it may be deprecated; explicitly overwrite data frames with modifications instead.
  • 🔁 Prefer vectorized functions over iterating over rows for performance and cleaner code.
  • 📉 Use vectorized operations instead of `apply` when possible for efficiency.
  • 📚 Remember that slices of a data frame may be views, not independent copies; use the `copy` method to ensure independence from the original data frame.
  • 🔄 Encourage chaining commands for transformations to avoid creating multiple intermediate data frames.
  • 📅 Set column data types properly, especially for dates, to ensure correct parsing and usage.
  • 👍 Use boolean values instead of strings for conditions and comparisons; map strings to booleans if necessary.
  • 📈 Utilize pandas' built-in plotting methods for quick and easy data visualization.
  • 🔠 Apply string methods directly to entire columns using `.str` method for consistency and simplicity.
  • 🔄 Avoid repeating data transformations; write a function for the pipeline and apply it to each data frame.
  • 🔄 Use the `rename` method with a dictionary for a cleaner way to rename columns.
  • 👥 Use `groupby` for aggregations based on groupings in the data frame instead of manual filtering and calculation.
  • 🔢 Calculate changes like percent difference using built-in pandas methods like `pct_change` and `diff`.
  • 📊 Consider saving large data sets in formats like parquet, feather, or pickle for efficiency and space.
  • 🖥️ Explore the `style` attribute of pandas data frames for extensive HTML formatting capabilities.
  • 🔗 Explicitly set suffixes when merging data frames to avoid confusion with default suffixes.
  • 🔄 Use the `validate` parameter in pandas merge to automatically check for correct merge types.
  • 📏 Break down long chained commands into readable lines for better code maintenance.
  • 📊 Convert columns with few unique values to categorical data types for memory efficiency and speed.
  • 🔍 Check for and remove duplicate columns when concatenating data frames to avoid confusion.

Q & A

  • Why is writing to a CSV with an unnecessary index considered a mistake in pandas?

    -Writing to a CSV with an unnecessary index is a mistake because, by default, pandas includes the index without a column header name when writing to CSV, which is redundant if the index contains no valuable information. It also causes confusion when the CSV is read back in and an extra 'Unnamed: 0' column appears.
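A minimal sketch of both fixes (the file names and columns here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021], "time": [9.8, 9.7]})

# Fix 1: drop the index when writing
df.to_csv("times.csv", index=False)
clean = pd.read_csv("times.csv")

# Fix 2: if the file was already written with an index,
# tell read_csv which column holds it
df.to_csv("times_with_index.csv")  # index included by default
restored = pd.read_csv("times_with_index.csv", index_col=0)

print(clean.columns.tolist())     # ['year', 'time'] (no 'Unnamed: 0')
print(restored.columns.tolist())  # ['year', 'time']
```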

  • What is the recommended way to handle column names with spaces in pandas?

    -It is recommended to use underscores in place of spaces for column names in pandas. This allows for easier access using dot syntax and simplifies querying of the columns.
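A short sketch of the rename plus the dot-syntax payoff (column names invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ada", "Rob"], "Best Time": [9.8, 9.7]})

# Swap spaces for underscores across every column name
df.columns = df.columns.str.replace(" ", "_")

# Dot syntax now works, and query() can reference the column directly
fastest = df.Best_Time.min()
sub_9_75 = df.query("Best_Time < 9.75")
print(fastest)  # 9.7
```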

  • Why is the query method in pandas preferred over manually constructing query strings?

    -The query method is preferred because it allows for writing powerful queries in a more concise and readable manner. It is especially helpful for complex query criteria and avoids the need for manual string concatenation or f-strings.

  • How can external variables be used in pandas queries without manually constructing the query strings?

    -External variables can be used in pandas queries by simply prefixing the variable name with the '@' symbol. This allows pandas to access the variable directly within the query.
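For example (the variable and column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"year": [1999, 2005, 2012], "time": [10.2, 9.9, 9.6]})

cutoff = 2000  # an external Python variable

# No f-string or concatenation needed: @ reaches out to the variable
recent = df.query("year > @cutoff")
print(recent["year"].tolist())  # [2005, 2012]
```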

  • What is the general consensus on using the `inplace` parameter in pandas?

    -Using the `inplace` parameter is generally frowned upon in the pandas community, and the core developers even plan to remove this functionality altogether. It is better to explicitly overwrite the data frame with the modifications.
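The explicit style looks like this (a toy frame, assuming a numeric `time` column):

```python
import pandas as pd

df = pd.DataFrame({"time": [9.8, None, 9.7]})

# Discouraged: df.fillna(0, inplace=True)
# Preferred: reassign the result explicitly
df = df.fillna(0)
df = df.reset_index(drop=True)
print(df["time"].tolist())  # [9.8, 0.0, 9.7]
```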

  • Why should iterating over rows in a data frame be avoided when vectorization is an option?

    -Iterating over rows should be avoided in favor of vectorization because vectorized operations are not only cleaner and more readable but also typically faster, especially for large data sets.
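The year example from the transcript can be sketched like this (data invented):

```python
import pandas as pd

df = pd.DataFrame({"year": [1998, 2001, 2010]})

# Avoid: flags = [row.year > 2000 for _, row in df.iterrows()]
# Vectorized: one comparison over the whole column
df["after_2000"] = df["year"] > 2000
print(df["after_2000"].tolist())  # [False, True, True]
```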

  • What is the main advantage of using the apply method over iterating over rows in a data frame?

    -The apply method allows you to run any function across an axis of your data frame, which is usually more efficient than iterating over each row. However, when vectorization is possible, it is still the preferred approach for performance reasons.
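Both versions of the video's squaring example, side by side (toy data):

```python
import pandas as pd

df = pd.DataFrame({"year": [2, 3, 4]})

# apply: calls a Python function once per element
squared_apply = df["year"].apply(lambda y: y ** 2)

# vectorized: one operation on the whole array; same result, faster
df["year_squared"] = df["year"] ** 2
print(df["year_squared"].tolist())  # [4, 9, 16]
```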

  • Why should a slice of a data frame not be treated as a new data frame?

    -A slice of a data frame should not be treated as a new data frame because modifications to the slice will raise a `SettingWithCopyWarning`, since pandas cannot guarantee whether the change affects the original data frame or an independent copy. To avoid this, it's best to use the copy method to create a true copy of the data frame slice.
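A sketch of the `df_fast` example with an explicit copy (values invented):

```python
import pandas as pd

df = pd.DataFrame({"time": [9.5, 10.5, 9.8]})

# .copy() makes the slice independent, so edits neither warn
# nor touch the original frame
df_fast = df.query("time < 10").copy()
df_fast["flagged"] = True

print("flagged" in df.columns)  # False: original untouched
```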

  • What is the recommended approach when making multiple transformations to a data frame?

    -It is recommended to use chaining commands where all transformations are applied in a single sequence rather than creating multiple intermediate data frames. This approach is more efficient and results in cleaner code.
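One way the chained style can look (the steps and column names are chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Race Time": [10.2, 9.5, None]})

# All transformations in one chain, no df1/df2/df3 intermediates
result = (
    df
    .rename(columns={"Race Time": "race_time"})
    .dropna()
    .query("race_time < 10")
    .reset_index(drop=True)
)
print(result["race_time"].tolist())  # [9.5]
```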

  • Why is it important to properly set column data types in pandas?

    -Properly setting column data types is important because it ensures that the data is stored and processed in the most efficient way. Incorrect data types can lead to performance issues and may cause errors in data processing and analysis.
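The date-parsing case from the video, in both forms (inline CSV text is used so the sketch is self-contained):

```python
from io import StringIO

import pandas as pd

csv_text = "date,time\n2022-09-07,9.8\n2022-09-08,9.7\n"

# Option 1: parse the column while reading
df = pd.read_csv(StringIO(csv_text), parse_dates=["date"])

# Option 2: convert after the fact
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)  # datetime64[ns]
```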

  • How can boolean values be represented in pandas instead of using string values?

    -Boolean values should be represented using actual boolean types (True or False) rather than string values like 'yes' or 'no'. This can be achieved by casting the string values to booleans when creating a new column or by mapping string values to booleans if they already exist in the data set.
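Both routes to a real boolean column (the `sub_10` name follows the video's example; the data is invented):

```python
import pandas as pd

df = pd.DataFrame({"time": [9.5, 10.5]})

# Create the column as a boolean from the start
df["sub_10"] = df["time"] < 10

# Or map existing 'yes'/'no' strings onto booleans
df["sub_10_text"] = ["yes", "no"]
df["sub_10_mapped"] = df["sub_10_text"].map({"yes": True, "no": False})
print(df["sub_10_mapped"].tolist())  # [True, False]
```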

  • What are some benefits of using pandas built-in plotting methods over manual plotting?

    -Pandas built-in plotting methods provide a quick and easy way to visualize data directly from a data frame. They are more convenient and often result in better-formatted plots compared to manually setting up a plot using matplotlib or other plotting libraries.

  • Why is it recommended to use the string method 'upper' on the entire array instead of applying it to each string individually?

    -Using the string method 'upper' on the entire array is recommended because it is more efficient and results in cleaner code. It applies the method to all elements in the column without the need for explicit iteration or looping.
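The whole-column version from the video (names invented):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ada", "rob"]})

# No apply() needed: the .str accessor broadcasts the string method
df["name_upper"] = df["name"].str.upper()
print(df["name_upper"].tolist())  # ['ADA', 'ROB']
```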

  • What is the best practice for avoiding repeated code when creating data pipelines in pandas?

    -The best practice is to write a function for the data pipeline that can be applied to each data frame. This ensures consistent processing and makes the code easier to read and maintain.
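A sketch of such a pipeline function (the cleaning steps here are placeholders, not the video's exact ones):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Shared pipeline applied identically to every data frame."""
    return (
        df
        .rename(columns=lambda c: c.lower().replace(" ", "_"))
        .dropna()
        .reset_index(drop=True)
    )

df_a = clean(pd.DataFrame({"Race Time": [9.8, None]}))
df_b = clean(pd.DataFrame({"Race Time": [10.1, 9.9]}))
print(df_a.columns.tolist())  # ['race_time']
```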

  • Why is it more efficient to use the group by method for aggregations rather than looping over rows?

    -The group by method is more efficient because it allows for aggregations to be performed on groups independently in a single operation, eliminating the need for explicit looping and reducing the potential for errors.
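The men's/women's example from the video can be sketched as (times invented):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["men", "men", "women", "women"],
    "time": [9.8, 9.6, 10.6, 10.5],
})

# One aggregation per group, no manual filtering
fastest = df.groupby("group")["time"].min()

# Several aggregations at once
stats = df.groupby("group")["time"].agg(["mean", "count"])
print(fastest.to_dict())  # {'men': 9.6, 'women': 10.5}
```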

  • What are some advantages of using the percent change and diff methods in pandas over manual calculations?

    -The percent change and diff methods in pandas provide built-in functionality for calculating changes in a series, which is more efficient and less error-prone than manual calculations. They also integrate seamlessly with pandas data structures.
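For example, on a toy series of times:

```python
import pandas as pd

times = pd.Series([10.0, 9.5, 9.5])

diffs = times.diff()          # NaN, -0.5, 0.0
changes = times.pct_change()  # NaN, -0.05, 0.0
print(diffs.tolist())
```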

  • Why might saving large data sets as CSVs be inefficient, and what are some alternative file formats?

    -Saving large data sets as CSVs can be inefficient due to slow write speeds and large disk space usage. Alternative file formats such as parquet, feather, and pickle files are more efficient, retain data types, and can be faster for both reading and writing.
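A pickle round-trip is the easiest to demonstrate without extra dependencies (parquet and feather work the same way via `to_parquet`/`to_feather`, but need pyarrow installed):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2022-09-07"]), "time": [9.8]})

df.to_pickle("times.pkl")
restored = pd.read_pickle("times.pkl")

# Unlike CSV, the datetime dtype survives the round trip
print(restored["date"].dtype)  # datetime64[ns]
```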

  • How can conditional formatting be achieved in pandas data frames without reverting to Excel?

    -Conditional formatting can be achieved in pandas data frames using the style attribute, which allows for extensive formatting options when the data frame is displayed as HTML. This provides a powerful alternative to Excel for data presentation.

  • What is the purpose of setting suffixes when merging two data frames in pandas?

    -Setting suffixes when merging two data frames helps to differentiate columns that appear in both data frames but are not used for merging. This avoids confusion and makes it easier to track the origins of the columns in subsequent data processing.
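For instance (the suffix names are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "time": [9.8, 9.7]})
right = pd.DataFrame({"id": [1, 2], "time": [10.1, 10.0]})

# Without suffixes= this would produce time_x / time_y
merged = left.merge(right, on="id", suffixes=("_men", "_women"))
print(merged.columns.tolist())  # ['id', 'time_men', 'time_women']
```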

  • How can the validate parameter in pandas merge help ensure the integrity of a one-to-one match?

    -The validate parameter in pandas merge automatically checks for different merge types and throws a merge error if the validation fails, ensuring that the merge is a one-to-one match and maintaining data integrity.
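A sketch of the failure mode it catches (toy frames with a deliberately duplicated key):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "a": [10, 20]})
right = pd.DataFrame({"id": [1, 1], "b": [1, 2]})  # id 1 appears twice

# validate raises instead of silently duplicating rows
try:
    left.merge(right, on="id", validate="one_to_one")
    merge_rejected = False
except pd.errors.MergeError:
    merge_rejected = True
print(merge_rejected)  # True
```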

  • Why is it important to avoid stacking chained commands into one line of code in pandas?

    -Avoiding stacking chained commands into one line of code improves readability. By wrapping expressions in parentheses and splitting the code so that each line has one component of the expression, the code becomes easier to understand and maintain.

  • What are the benefits of using categorical data types in pandas for columns with a limited number of unique values?

    -Using categorical data types for columns with a limited number of unique values reduces memory usage and can significantly speed up operations on large data sets, as categorical data types are more memory-efficient and optimized for such cases.
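A quick way to see the saving on a toy column (exact sizes vary by pandas version):

```python
import pandas as pd

df = pd.DataFrame({"group": ["men", "women"] * 1000})

as_object = df["group"].memory_usage(deep=True)
df["group"] = df["group"].astype("category")
as_category = df["group"].memory_usage(deep=True)

print(as_category < as_object)  # True: categories store codes, not strings
```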

  • How can duplicate columns be identified and removed when concatenating data frames in pandas?

    -Duplicate columns can be identified by calling `duplicated` on the data frame's column index (`df.columns.duplicated()`), which returns a boolean mask that can be used to filter the duplicates out, ensuring a clean and error-free data frame.
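A sketch of the check-and-drop (the frames are invented; the exact one-liner in the video may differ):

```python
import pandas as pd

a = pd.DataFrame({"year": [2020], "time": [9.8]})
b = pd.DataFrame({"year": [2020], "place": [1]})

combined = pd.concat([a, b], axis=1)
# combined.columns is now ['year', 'time', 'year', 'place']

# Keep only the first occurrence of each column name
deduped = combined.loc[:, ~combined.columns.duplicated()]
print(deduped.columns.tolist())  # ['year', 'time', 'place']
```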

Outlines

00:00

📝 Common Mistakes in Pandas Usage

This paragraph discusses the common mistakes made by new pandas users, which can be improved for better code implementation and readability. It covers issues like writing to CSV with unnecessary indexes, using spaces in column names, not using the query method, formulating query strings with string methods, using 'in place' incorrectly, iterating over rows when vectorization is possible, and misusing the apply method. It also touches on treating data frame slices as new data frames, creating multiple intermediate data frames, not setting proper column data types, and using string values instead of booleans.

05:02

📈 Enhancing Data Manipulation with Pandas

The second paragraph focuses on enhancing data manipulation techniques in pandas. It advises against manually applying string methods and renaming columns, and instead suggests using pandas' built-in methods for efficiency. The paragraph also emphasizes the importance of leveraging pandas' built-in plotting methods, avoiding repetition in data transformations, and using the group by method for aggregation. It also warns against looping over rows for aggregations and using loops to calculate value changes, suggesting the use of built-in pandas functions instead.

10:05

🔧 Advanced Pandas Techniques and Best Practices

The final paragraph delves into advanced pandas techniques and best practices. It advises against saving large datasets as CSVs due to inefficiency and recommends alternative file formats like parquet, feather, and pickle. It also discusses the use of the style attribute for conditional formatting, setting suffixes when merging data frames, and validating merges to ensure data integrity. The paragraph concludes with advice on avoiding overly compact code, utilizing categorical data types for efficiency, and checking for duplicate columns when concatenating data frames.

Keywords

💡Pandas

Pandas is an open-source Python library used for data manipulation and analysis. It provides data structures and functions needed to manipulate structured data, making it a fundamental tool for data scientists. In the video, the main theme revolves around common mistakes made by new users of the Pandas library, which indicates the importance of understanding Pandas for effective data handling.

💡CSV

CSV stands for Comma-Separated Values, a file format used to store tabular data, typically in plain text. The video mentions writing to a CSV file with an unnecessary index as a common mistake, highlighting the need to understand how Pandas handles CSV operations to avoid such issues.

💡Index

In Pandas, an index is a label for rows in a DataFrame and serves as a way to identify and order data. The script points out that including an unnecessary index when writing to a CSV can be a mistake, especially when the index does not contain valuable information.

💡Column Names

Column names in a Pandas DataFrame are used to identify and access the data within each column. The video emphasizes the preference for using underscores instead of spaces in column names to avoid issues with accessing columns using dot syntax and to simplify querying.

💡Query Method

The query method in Pandas allows for filtering a DataFrame based on a query expression. The video script suggests that new users often overlook this powerful feature, which can simplify complex filtering tasks.

💡In-Place Operation

In Pandas, in-place operations are those that modify the data in the original DataFrame without creating a new one. The script warns against using the in-place option due to its potential to overwrite the DataFrame and the fact that it may be removed in future versions of Pandas.

💡Vectorization

Vectorization in Pandas refers to the ability to apply operations to entire arrays rather than iterating over individual elements. The video script highlights the inefficiency of iterating over DataFrame rows when vectorized functions are available, which can lead to cleaner and faster code.

💡Apply Method

The apply method in Pandas is used to apply a function along an axis of a DataFrame. While it can be useful, the video emphasizes that vectorized functions are generally preferable for performance and readability, especially when working with large datasets.

💡Data Types

Data types in Pandas define the kind of data stored in each column of a DataFrame, such as integers, floats, strings, or datetime objects. The video script points out the importance of properly setting these data types to ensure accurate data handling and parsing.

💡Boolean Values

Boolean values represent two states: True or False. The script advises against using string values like 'yes' or 'no' to represent boolean conditions, instead recommending the use of actual boolean data types for clarity and functionality.

💡Plotting

Plotting in Pandas refers to the process of creating visual representations of data. The video mentions that while manual plotting is possible, Pandas has built-in plotting methods that simplify the process and can be used for quick visualizations.

Highlights

Avoid writing to CSV with an unnecessary index when the DataFrame index contains no valuable information.

Using spaces in column names can lead to issues, including the inability to access the column using dot syntax.

Leverage the query method for powerful filtering of DataFrames instead of basic syntax.

Pandas queries can access external variables without manually formulating query strings.

Avoid using `inplace=True` as it is discouraged and may be removed in future Pandas versions.

Prefer vectorized functions over iterating over DataFrame rows for efficiency.

Use vectorized functions instead of the apply method when possible for performance gains.

Be cautious with DataFrame slices as they may not be independent and can lead to SettingWithCopyWarning messages.

Chaining commands is encouraged over creating multiple intermediate DataFrames for transformations.

Manually setting column data types is often necessary for correct parsing and should be done using pandas functions.

Use boolean values instead of strings for conditions like 'yes' or 'no' in DataFrame columns.

Utilize Pandas' built-in plotting methods for quick and easy data visualization.

Apply string methods directly to entire columns using the 'str' accessor in Pandas.

Avoid repeating data transformations; instead, create a function for the pipeline and apply it to each DataFrame.

Use the rename method with a dictionary for a cleaner way to rename DataFrame columns.

Use the group by method for aggregations based on column values instead of manual filtering and calculation.

Avoid looping over DataFrame rows for aggregate calculations; group by aggregations are more efficient.

Use built-in functions like 'pct_change' and 'diff' for calculating value changes instead of manual loops.

Consider saving large datasets in formats like parquet, feather, or pickle instead of CSV for efficiency.

Use the style attribute of Pandas DataFrames for extensive formatting when displaying as HTML, an alternative to Excel.

Explicitly state suffixes when merging DataFrames to avoid confusion with default '_x' and '_y' suffixes.

Utilize the 'validate' parameter in Pandas merge to automatically check for correct merge types and avoid errors.

Avoid stacking all chained commands into one line of code for readability; split expressions for clarity.

Use categorical data types for columns with a limited set of values to save memory and improve performance.

Be aware of potential duplicate columns when concatenating DataFrames and use methods to check and remove them.

Transcripts

play00:00

In this video, I'm going to go over my list of 25 mistakes that new pandas users often make.

play00:06

In most of these cases, the code will still run, but there's a better way to implement

play00:10

the same functionality. These mistakes will also be a dead giveaway to anyone reading your code

play00:16

that you're new to the library. Number one, writing to a CSV with an unnecessary index.

play00:22

This mistake is often made when the pandas data frame index contains no valuable information.

play00:28

By default, when writing to CSV, this will include the index without a column header name.

play00:33

This mistake becomes more obvious when the same CSV is read in and an unnamed zero column is

play00:40

included. You can avoid this mistake by setting index equals to false when saving to CSV,

play00:46

or alternatively setting an index column when reading in the CSV that contains an unnamed

play00:52

index. Number two, using column names that include spaces. At first glance, using column names with

play00:59

spaces may be seen as a good thing. However, there are many issues that arise when you use

play01:04

column names that include spaces. One of the biggest being that you lose the ability to

play01:09

access the column using dot syntax. It's preferable to use underscores in place of spaces in column

play01:15

names. And you can see here, because we've done so, you can access this column using the dot syntax.

play01:21

It also makes querying these columns much easier. Number three, not leveraging the query method.

play01:28

Often you want to filter your data frame to a subset. There's nothing wrong with the syntax

play01:32

chosen here. However, many new users are unaware that you can write powerful queries using the dot

play01:38

query method. This becomes especially helpful the more complex your query criteria becomes.

play01:44

Number four, using string methods to formulate your query strings. Many times you have a variable

play01:50

that you want to query on. It's common then to see these string queries created manually,

play01:55

either by using f-strings or by concatenating strings. But this isn't necessary because

play02:00

pandas queries can access external variables by simply using the at symbol before the variable

play02:06

name. Number five, using in place equals true. Now I can understand how this one would be confusing

play02:12

to new users. As many built-in methods like fillna and reset_index have the inplace option,

play02:18

setting in place to true will overwrite the data frame with the changes. However, using in place

play02:24

is generally frowned upon and the pandas core developers even plan to remove this functionality

play02:30

altogether. It's better to instead explicitly overwrite with the modifications. Number six,

play02:35

iterating over the rows in a data frame when vectorization is an option. This is a big one

play02:41

that you see a lot with new users of pandas. In this example, if we wanted to determine the rows

play02:46

that contain a year greater than 2000, we could iterate over each year. However, it's much more

play02:52

preferable to use vectorized functions here the greater than can be applied to the entire year

play02:58

column and stored as a result. Number seven, using the apply method when vectorization is an option.

play03:05

The apply method allows you to run any function across an axis of your data frame. While this

play03:10

is usually better than iterating over each row in the data frame, it's still preferable to use

play03:15

vectorized functions when possible. For instance, here we're creating a new column which squares

play03:21

the year value and here we have the vectorized version where we apply the square to the entire

play03:26

array. This is not only cleaner but will run faster. Number eight, treating a slice of a data

play03:32

frame as if it were a new data frame. In this example, we're filtering our data frame for times

play03:38

under 10 and storing this as df_fast. However, when we modify this new data frame, we will see

play03:44

a SettingWithCopyWarning. This warning occurs because our new modifications are actually

play03:50

being applied to a slice of our old data frame. So when you do want to create a new data frame

play03:55

based on a subset of your initial data frame, it's best to use the copy method. This by default will

play04:02

create a deep copy and any edits to your new data frame will not impact the initial data frame.

play04:08

Number nine, creating multiple intermediate data frames when making transformations. It's not

play04:13

uncommon to see code like this where each step of the process is then written to a new data frame

play04:19

variable. There are many reasons why this is not ideal. It's instead encouraged to use chaining

play04:24

commands where all the transformations are applied once. Number 10, not properly setting column

play04:31

dtypes. Each column in a pandas data frame has a specific data type. When reading in data, pandas

play04:37

will try its best to parse these types. However, as you can see here, this date column is represented

play04:43

as an object. In many instances, you'll have to manually set these dtypes. In this case,

play04:48

we could correctly set this column to a datetime format by using parse_dates within read_csv.

play04:55

Alternatively, we could manually set this dtype using the pandas to_datetime method. Number 11,

play05:01

using the string value instead of a boolean. In this case, we've made a new column called sub 10,

play05:07

which is yes when the time value is less than 10. Instead of using text to represent something that

play05:14

could be true or false, you should cast these as a boolean value. This can be done when you create

play05:19

the column or if your data set already has values like this, you can map them to true or false.

play05:25

Number 12, not leveraging pandas built in plotting methods. Often you'll find yourself in a situation

play05:31

where you want to make a quick plot of the data in your pandas data frame. This can be done by

play05:37

creating a matplotlib subplot and plotting the data manually. Pandas already has a lot of this

play05:42

functionality built into its plot method. Number 13, manually applying string methods. Do you have

play05:48

a column that contains string values and you want to apply a string method like uppercase? You might

play05:54

think this is a situation where you need to apply the uppercase method across the column. Pandas

play06:00

actually has string methods where you can apply any string method to the entire array just by

play06:05

calling str and then your command, in this case upper. Number 14, repeating commonly used data

play06:13

transformations. In this example, we're reading in one data set, performing a number of transformations

play06:19

to it, and then doing the exact same thing to a different data frame. It's generally best practice

play06:24

to not repeat code unless you need to. Especially when creating data pipelines, it's preferred to write

play06:29

a function for that pipeline which you then can apply to each data frame. Not only does this make

play06:35

your code easier to read, but ensures that the same processing is done identically on both data

play06:40

frames. Number 15, manually renaming columns. It is possible to rename columns by providing a list

play06:48

of new names. The preferred and much cleaner way to do this is to use the rename method and provide

play06:54

a dictionary to the columns variable with the old and new names. Number 16, aggregating by groups

play07:02

manually. In this case, we have a data set with both men's and women's times and we want to return

play07:08

the lowest men's and women's value. It's possible to filter on this grouping column and calculate

play07:15

the minimum value for each. This is exactly what the group by method is for. It allows you to

play07:22

select a column or columns to group the data on and then any aggregations will be done to those

play07:28

groups independently. Number 17, looping over the rows in a data frame to create aggregates.

play07:35

Similar to the last mistake, when creating multiple group by aggregates, you'll see code that

play07:40

iterates over each row in the data frame storing the results after each iteration. The same results

play07:46

can be calculated by a simple group by aggregation. Grouping this way also allows you to provide

play07:52

multiple ways to aggregate the data. In this case, mean and count, but you can also provide things

play07:58

like maximum, minimum, and standard deviation. Number 18, using a loop to calculate how a value

play08:04

changes. In this example, we're calculating the percent change and the difference between the

play08:10

time columns in each row of the data frame. You might be catching on to a trend now, but there's

play08:15

actually a built-in function for doing things like this. You can use the percent change and diff

play08:20

methods to calculate the change in this pandas Series. Number 19, saving large data sets as CSVs.

play08:28

When working with Pandas, eventually you'll get to the point where you need to save the data to disk.

play08:33

CSV is one of the most common file formats to save data, but especially with large data sets,

play08:39

this can be very slow and take up a lot of space on your hard disk. Pandas has built-in methods to

play08:45

save to many different file types, including parquet, feather, and pickle files. These file

play08:50

formats also retain the data types of your data frame, which saves you from having to set them

play08:55

manually when reading in the file. Number 20, switching to Excel for conditional formatting.

play09:01

New Pandas users may find themselves switching back to Excel to do things like conditional

play09:06

formatting. You might be surprised to know that Pandas data frames have a style attribute,

play09:11

which allows you to do extensive formatting to your data frame when you display it as HTML.

play09:16

This type of styling can be extremely powerful and covers almost anything you would want to do

play09:21

in Excel. Number 21, not setting suffixes when merging two data frames. When merging two data

play09:28

frames on a specific column or columns, any columns that appear in both data frames but are not being

play09:34

used to merge will be given the default '_x' and '_y' suffixes. By explicitly stating the

play09:41

suffixes in your merge, you can more easily track what these columns are later in your data

play09:47

processing. Number 22, manually checking after merging two data frames. There may be cases when

play09:53

you're merging two data frames and want to confirm that the merge is a one-to-one match. You can check

play09:59

for this by comparing the lengths of the merged data frame with the initial data frame. Pandas

play10:04

merge has a validate parameter which will automatically check for different merge types.

play10:09

This will throw a merge error if the validation fails. Number 23, stacking chained commands into

play10:16

one line of code. Method chaining is a great feature in Pandas, but your code can get really

play10:21

unreadable if it's all in one line. By wrapping your expression in parentheses, you can split

play10:27

your code so that each line has one component of the expression. This makes it a lot more readable.

play10:33

Number 24, not using categorical data types. In this example, we have a grouping column which

play10:39

contains only two potential values. Instead of storing columns like this as a string object,

play10:45

it's better to store them as a categorical data type. Categorical data types take up less space

play10:51

in memory and can make operations much faster on large data sets. Number 25, creating duplicated

play10:57

columns. This issue can arise when concatenating two data frames. As you can see here, the year

play11:03

column appears twice in this data frame. And if you don't know this is possible, this can be really

play11:08

confusing and hard to debug. Pandas does have a flag that can be set which will alert you when

play11:13

duplicate labels occur. You can also solve this problem by using this line of code which will

play11:18

check for duplicated columns and remove them. Thanks for watching this video. If there are

play11:22

any mistakes that I missed, please let me know in the comments below. And don't forget to like and

play11:26

subscribe. See you next time.
