Lecture 02

IIT KANPUR-NPTEL
20 Jul 2023 · 48:37

Summary

TL;DR: This video script offers an in-depth tutorial on data handling and cleaning in R, covering best practices for setting working directories, reading and writing CSV and Excel files, and summarizing data. It delves into data cleaning, including handling NA values and outliers, and transforming data frames between wide and long formats. The script also explores merging data frames, creating new variables, and summarizing data, providing a comprehensive guide for data manipulation in R.

Takeaways

  • 📁 Set the working directory in R to manage data input and output effectively.
  • 📂 Use `read.csv` or the import dataset feature to read data from CSV files into R.
  • 🔍 View data dimensions and summaries to understand the structure and content of the dataset.
  • 🔎 Employ commands like `head`, `tail`, `str`, and `names` to inspect and explore data within R.
  • 🛠 Change column names in a data frame using the `names` function for better data management.
  • 🧼 Clean data by handling NA (missing) values using functions like `is.na` and `na.omit`, and the `na.rm` argument.
  • 🔄 Convert data frames between long and wide formats using the `reshape2` package for different analytical needs.
  • 🔗 Merge data frames using the `merge` function, specifying the key columns that should match between datasets.
  • 📊 Create new variables within a data frame by performing calculations or transformations on existing data.
  • 📈 Summarize data using built-in functions like `summary`, `str`, and custom summaries with `summarize`.
  • 📚 Utilize libraries like `car` for accessing additional datasets and functions to enhance data analysis in R.

Q & A

  • What is the recommended best practice for setting the working directory in R?

    -The best practice is to set the working directory at the beginning of a session by using the 'setwd()' function, specifying the directory where you want to read and write data. This can be done through the R console or by copying and pasting the command from the console for future reference.
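
A minimal sketch of this workflow; the directory path below is a hypothetical placeholder:

```r
# Set the working directory once per session (placeholder path).
setwd("C:/Users/me/r-projects/data-handling")

# Confirm where R will now read from and write to.
getwd()
```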

  • How can you read a CSV file into R?

    -You can read a CSV file into R by using the 'read.csv()' function and providing the name of the file as an argument. Alternatively, you can use the 'import dataset' feature from the R interface, selecting 'from text', and then choosing the CSV file.
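
A short sketch of both ideas; the salary.csv file name comes from the lecture, while the readxl call is just one common way to read Excel files, not necessarily what the lecture used:

```r
# Read a CSV file; header = TRUE treats the first row as column names,
# and sep = "," is the standard CSV separator.
data <- read.csv("salary.csv", header = TRUE, sep = ",")

# For an Excel file, the readxl package is one option (assumption):
# library(readxl)
# data_xl <- read_excel("salary.xlsx")
```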

  • What are some ways to view the structure and initial observations of a data frame in R?

    -You can view the structure of a data frame using the 'str()' function, which shows the names and types of variables. To see the initial observations, you can use the 'head()' function, which displays the first six rows by default, or you can specify a different number to see more or fewer rows. The 'summary()' function provides a statistical summary of each variable in the data frame.
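
A compact sketch of these inspection commands, assuming the data frame is called `data`:

```r
dim(data)       # number of rows and columns
head(data, 8)   # first eight rows (head() shows six by default)
str(data)       # variable names and types
summary(data)   # per-variable statistical summary
```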

  • How can you change the names of columns in a data frame in R?

    -You can change the names of columns in a data frame by using the 'names()' function and assigning new names to the desired columns. For example, 'names(dataFrame)[1] <- "NewName"' will change the name of the first column to 'NewName'.
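
A sketch mirroring the renaming done in the lecture; the exact replacement names are approximations of the ones used there:

```r
names(data)[1]   <- "Name"                           # rename the first column
names(data)[2:4] <- c("Title", "ID", "Agency.Name")  # rename columns two to four
names(data)                                          # confirm the new names
```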

  • What is the purpose of the 'is.na()' function in R, and how is it used?

    -The 'is.na()' function in R is used to check for missing values (NA) in a dataset. It returns a logical vector indicating whether each element is NA or not. It is often used in combination with other logical operators to filter out or handle NA values during data analysis.

  • How can you handle NA values when performing comparisons or calculations in R?

    -When performing comparisons or calculations, you can use the '&' (and) and '|' (or) operators in combination with 'is.na()' to include or exclude NA values. For example, 'x > 2 & !is.na(x)' will compare values greater than 2 while excluding NA values.
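
A small sketch of this comparison logic, using a toy vector with one NA:

```r
x <- c(1, NA, 3, 0, 5)

x > 2               # second element is NA, so the comparison returns NA there
x > 2 & !is.na(x)   # drop NA from the comparison: FALSE FALSE TRUE FALSE TRUE
(x == 0 | x == 2) & !is.na(x)   # same idea with an OR condition
```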

  • What is the difference between using 'na.omit' and 'complete.cases' to remove NA values from a data frame?

    -'na.omit' removes all rows with any NA values in the entire data frame, while 'complete.cases' can be used to select only those rows that do not contain any NA values in any column.
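
A toy example of the difference, using a small data frame with NAs in both columns:

```r
df <- data.frame(a = c(1, NA, 3),
                 b = c("x", "y", NA))

na.omit(df)               # drops every row containing at least one NA
df[complete.cases(df), ]  # same rows kept, via a logical row index
df[!is.na(df$a), ]        # column-specific: drop only rows where 'a' is NA
```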

  • How can you replace NA values in a data frame with a specific value, such as zero?

    -You can replace NA values with a specific value by using the 'is.na()' function in combination with an assignment operator. For example, 'dataFrame[is.na(dataFrame)] <- 0' will replace all NA values in 'dataFrame' with zero.
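
A minimal sketch of the replacement, on a small all-numeric data frame:

```r
df1 <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))

sum(is.na(df1))        # count NA cells (2 here)
df1[is.na(df1)] <- 0   # replace every NA with zero
sum(is.na(df1))        # now 0
```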

  • What is the purpose of the 'unique()' function in R, and how is it used?

    -The 'unique()' function in R is used to extract the unique, non-duplicate observations from a vector or data frame. It is used to remove non-unique or redundant values, ensuring that each observation is only counted once.
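
A sketch of the duplicate-removal step described in the lecture, assuming the salary data frame from earlier is stored in `data`:

```r
data_3 <- rbind.data.frame(data, data[1:500, ])  # append 500 duplicate rows
dim(data_3)                                      # 500 more rows than the original

data_4 <- unique(data_3)                         # keep only distinct rows
dim(data_4)                                      # back to the original row count
```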

  • How can you select specific columns and rows from a data frame in R?

    -You can select specific columns by using their numeric indices or names within the data frame, and specific rows can be selected using indexing with row numbers. For example, 'dataFrame[4:10, 3:5]' will select rows 4 to 10 and columns 3 to 5.
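
A few selection patterns on the built-in iris data, matching the examples discussed in the lecture:

```r
head(iris[, 3])         # a single column by index
head(iris[, c(3, 5)])   # columns 3 and 5
head(iris[, 3:5])       # columns 3 through 5
iris[4:10, 3:5]         # rows 4 to 10 of columns 3 to 5
head(iris[, c("Species", "Petal.Width")])  # selection by column name also works
```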

  • What is the process of transforming a data frame from wide format to long format in R?

    -The process involves using the 'melt' function from the 'reshape2' package. This function combines the wide variables into a single column, with an identifier column indicating the original variable name. The result is a long format data frame where each observation is represented in a single row.
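
A small wide-to-long sketch with reshape2; the data frame here is a toy stand-in for the speed data built in the lecture:

```r
library(reshape2)

speed <- data.frame(ID = 1:3, run = c("A", "B", "C"),
                    speed.1 = c(10, 12, 14),
                    speed.2 = c(11, 13, 15))

# ID and run stay fixed; the speed.* columns are stacked into one column.
long <- melt(speed, id.vars = c("ID", "run"),
             variable.name = "speed", value.name = "value")
long
```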

  • How can you merge two data frames in R based on a common variable?

    -You can merge two data frames using the 'merge()' function, specifying the common variable with 'by.x' and 'by.y' arguments. The type of merge (e.g., inner, outer) can be controlled using the 'all.x', 'all.y', and 'all' parameters.

  • What is the significance of using different merge types (inner, outer) when combining data frames in R?

    -An inner merge combines data frames based on the intersection of common keys, including only observations that are present in both data frames. An outer merge includes all observations from both data frames, filling in missing values with NA where keys do not match.
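
A sketch of the merge variants using toy movie collections (the numbers are made up):

```r
domestic <- data.frame(Name = c("Avengers", "Dark Knight", "Hunger Games"),
                       Domestic = c(620, 530, 400))
foreign  <- data.frame(Name = c("Avengers", "Dark Knight", "Ice Age"),
                       Foreign = c(890, 470, 700))

merge(domestic, foreign, by = "Name")               # inner merge: common movies only
merge(domestic, foreign, by = "Name", all = TRUE)   # outer merge: NAs fill the gaps
merge(domestic, foreign, by = "Name", all.x = TRUE) # keep every row from 'domestic'
```

If the key column had different names in the two data frames, `by.x` and `by.y` would name them separately, as described in the answer above.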

Outlines

00:00

📁 Setting Working Directory and Reading Data

The paragraph begins with instructions on setting the working directory in R, emphasizing best practices for data handling. It explains how to navigate to a session, choose a directory, and execute a command to set the working directory, which can be reused for convenience. The speaker then demonstrates two methods for reading a CSV file named 'salary.csv': using the 'read.csv' command and the 'import dataset' function from a menu interface. Options for reading data with specific parameters like headers, row names, and separators are discussed. The paragraph also covers how to view the dimensions of the data frame, summarize data with various commands, and check the structure of data using 'str'. Additionally, it explains how to view initial or specific rows of data using 'head' and how to change column names using the 'names' function.

05:04

🔍 Data Inspection and Cleaning Observations

This paragraph delves into the inspection and cleaning of data in R. It starts by discussing how to print and handle variables containing NA (missing) values. The use of logical operators and the 'is.na' function to compare values while excluding NAs is explained. The paragraph also covers the use of '&' (and) and '|' (or) operators for conditional statements involving NA values. Practical examples of introducing NAs into a dataset, checking for their presence, and replacing them with zeros are provided. The importance of understanding the impact of NA values on data analysis, such as during the calculation of medians and means, is highlighted, and solutions like 'na.rm' are introduced to address this issue.

10:04

🗂 Advanced Data Handling Techniques

The paragraph presents advanced techniques for handling data frames in R. It discusses the creation of data frames using vectors and the removal of NA observations using subsetting and functions like 'na.omit'. The paragraph also addresses the impact of removing NA observations on the dataset and provides an example of how to handle this using the 'complete.cases' function. It introduces the 'car' library and its 'Freedman' dataset, demonstrating how to compute the median and mean while ignoring NA values. The use of 'na.omit' for drastic NA removal and the implications on data analysis are also discussed. Additionally, the paragraph covers the identification of non-unique values and the use of the 'unique' function to remove them.

15:05

📊 Data Frame Manipulation and Summary

This paragraph focuses on the manipulation of data frames and summarization of data in R. It starts with the selection of specific columns and rows from a data frame, using both numeric indices and column names. The creation of new variables within a data frame, such as 'petal ratio' and 'sepal ratio', is demonstrated. The paragraph also explains how to extract observations based on conditions and summarize data using the 'summary', 'str', and 'brief' commands. The use of 'summarize' for user-defined summaries and the handling of factor variables, including the assignment of new levels, are also covered. The paragraph concludes with an example of transforming data frames between long and wide formats using the 'reshape2' package.
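
A sketch of the new-variable and user-defined summary steps described here; `dplyr::summarise` is used as one common way to express the custom summary, which may differ from the exact summarize function used in the lecture:

```r
library(dplyr)

iris2 <- iris
iris2$Petal.Ratio <- iris2$Petal.Length / iris2$Petal.Width  # new ratio variables
iris2$Sepal.Ratio <- iris2$Sepal.Length / iris2$Sepal.Width

# User-defined summary: means and standard deviations of two variables.
summarise(iris2,
          Petal.Length.mean = mean(Petal.Length),
          Sepal.Length.mean = mean(Sepal.Length),
          Petal.Length.sd   = sd(Petal.Length),
          Sepal.Length.sd   = sd(Sepal.Length))
```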

20:07

🔄 Transforming Data Between Long and Wide Formats

The paragraph discusses the process of transforming data frames between long and wide formats using R. It begins by constructing a wide format data frame with multiple speed observations and corresponding identifiers. The use of the 'melt' function from the 'reshape2' package to convert the wide format into a long format is explained, with the 'ID' and 'run' variables being fixed during the transformation. The paragraph then demonstrates how to revert back to the wide format using the 'dcast' function, specifying the variables to be fixed and adjusting the 'speed' variable accordingly.

25:07

🔍 Merging Data Frames in R

This paragraph explores the merging of data frames in R, focusing on various properties of the 'merge' command. It starts by creating two data frames representing domestic and foreign movie collections. The paragraph then demonstrates the process of merging these data frames on the basis of movie names, highlighting the differences between inner and outer merges. The use of the 'all = TRUE' option for outer joins and the selection of specific variables for merging are discussed. The paragraph concludes with examples of how to handle different movie names in each data frame and the implications of merging on the final dataset.

Keywords

💡Data Handling

Data Handling refers to the processes involved in managing data within a system, including reading, writing, storing, and manipulating data. In the video, Data Handling is the overarching theme, as the script discusses various operations such as setting a working directory, reading CSV files, and summarizing data, which are all essential aspects of handling data in R.

💡Working Directory

A Working Directory is the current directory R is using for reading and writing data. The script emphasizes the importance of setting a working directory as a best practice in data handling, allowing users to easily manage where their data is stored and retrieved from within their R sessions.

💡CSV File

CSV stands for Comma-Separated Values and is a file format used to store tabular data, where each line represents a row and commas separate the values in each row. The script mentions reading a 'salary.csv' file as an example of data input, demonstrating one of the common tasks in data handling.

💡Data Frame

A Data Frame is a two-dimensional data structure in R used to store and manipulate data. The script frequently refers to data frames, such as when discussing the dimensions of data, changing column names, and summarizing data, highlighting the central role data frames play in data analysis.

💡NA Values

NA stands for 'Not Available' and represents missing data in R. The script discusses handling NA values, which is a critical part of data cleaning. It provides examples of how to identify and replace NA values, ensuring the accuracy and completeness of the data analysis.

💡Data Cleaning

Data Cleaning involves the process of detecting, diagnosing, and correcting data quality issues to improve the usability of data. The script covers various aspects of data cleaning, such as dealing with NA values, removing non-unique values, and changing column names to ensure the data is accurate and ready for analysis.

💡Vector

A Vector is a fundamental data structure in R that stores values of the same data type. The script mentions vectors in the context of creating conditions and handling NA values, illustrating how vectors are used to perform operations on data elements.

💡Merging Data Frames

Merging Data Frames is an operation that combines two data frames into one based on a common variable. The script discusses different types of merges, such as inner and outer merges, and demonstrates how to perform these operations using R, which is essential for integrating data from different sources.

💡Factor Variables

Factor Variables are categorical variables used in R, which store data as a set of labels or categories. The script explains how to work with factor variables, including adding new levels and removing unused levels, which is important for preparing data for statistical analysis.
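
A short sketch of the level-handling workflow described above, using the Cars93 data from MASS (the lecture loads it via the UsingR library, which pulls MASS in):

```r
library(MASS)                    # Cars93 lives here

d <- Cars93[1:3, 1:4]            # small slice: Manufacturer, Model, Type, Min.Price
d$Model <- droplevels(d$Model)   # drop levels inherited from the full dataset
levels(d$Model)                  # only the levels actually in use remain

# Assigning an unseen value to a factor gives NA unless the level exists first.
levels(d$Model) <- c(levels(d$Model), "A3", "A4", "A6")
d[3, "Model"] <- "A3"
d
```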

💡Reshape Data

Reshaping Data refers to changing the structure of a data frame from a wide format (many columns) to a long format (fewer columns) or vice versa. The script describes the use of the 'melt' and 'dcast' functions from the 'reshape2' package to transform data frames between these formats, which is useful for certain types of data analysis.
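
A complementary sketch of the long-to-wide direction with `dcast`; the long data frame below is a toy version of the one `melt` produces in the lecture:

```r
library(reshape2)

long <- data.frame(ID    = rep(1:2, each = 2),
                   run   = rep(c("A", "B"), each = 2),
                   speed = rep(c("speed.1", "speed.2"), times = 2),
                   value = c(10, 11, 12, 13))

# ID and run stay fixed; the levels of 'speed' become the new wide columns.
wide <- dcast(long, ID + run ~ speed, value.var = "value")
wide
```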

Highlights

Setting a working directory in R is a best practice for data handling and cleaning.

Demonstrated two methods for reading CSV files in R: using 'read.csv' command and the 'Import Dataset' feature.

Explained how to adjust column names in a data frame to improve data organization.

Introduced the 'head', 'tail', 'str', and 'summary' commands for data examination in R.

Discussed the importance of checking for NA (missing) values in data and provided methods to handle them.

Illustrated the use of logical operators to compare and filter data while accounting for NA values.

Provided examples of data cleaning by introducing NA values and replacing them with zeros.

Described how to remove non-unique values from a data frame to streamline data sets.

Taught how to select specific columns and rows from a data frame using R's indexing system.

Showed how to create new variables within a data frame based on existing data.

Covered the extraction of observations based on specific conditions for focused data analysis.

Explained how to perform user-defined summaries to compute statistics like mean and standard deviation.

Highlighted the use of the 'library(car)' for accessing additional datasets in R.

Discussed the manipulation of factor variables, including adding and dropping levels.

Demonstrated the transformation of data frames from wide to long format using the 'reshape2' package.

Covered the process of merging data frames in R, including inner, outer, and one-sided joins.

Concluded with a summary of data frame handling techniques learned throughout the module.

Transcripts

play00:13

We will discuss Data Handling and Data Cleaning with R.

play00:18

As a best practice, I would recommend that we set our working directory where we want

play00:25

to read and write data.

play00:26

To start with, go to this session, set working directory, choose directory and then select

play00:34

the appropriate folder where we would like to set our working directory and click open.

play00:40

Notice that a command will appear on your console window.

play00:43

You can copy paste it for future references.

play00:46

So, whenever you are working in the future, you can simply run this command to set the working

play00:51

directory at appropriate place.

play00:54

Now as an example, we will read a CSV file.

play01:00

We have already seen how to read and write data in here.

play01:03

So, we will simply write this read.csv, give the name of the file which is salary.csv,

play01:10

run this command and data will be read.

play01:14

Another way to read this data is go to this import dataset, go to from text.

play01:22

You can click on this salary file, click on open and it will read the data for you.

play01:28

So that is another way to read the data.

play01:30

You can see the format, the way in which data is read.

play01:33

You can select the correct options if header is yes or row names, separator and all.

play01:38

You can see the permutations and combinations.

play01:41

The way the file will be appearing, you can see here in the data frame.

play01:44

If you click on import, the data will be read.

play01:46

So this is another way to read the data.

play01:48

In case it was an Excel file, you can import dataset from Excel and then again with same

play01:54

procedure browse and read the data.

play01:57

Again select the data and read it.

play01:59

So, this is how we will read the data.

play02:01

As a next step, we would like to see the dimensions of data.

play02:08

So there are 14,017 rows or observations and seven columns or variables.

play02:13

So this is the dimension of data frame.

play02:16

Most of these commands we have already seen in the fundamentals of our module.

play02:19

As a next step, we will summarize the data.

play02:22

So we can see all the variables, their nature and various other aspects.

play02:28

You can also use brief command to have a look at this data.

play02:33

Brief command will give you broad overview of data.

play02:38

You can see that how data looks from starting to end.

play02:45

It starts from first three rows that are visible and last two rows.

play02:49

You can also have a look at the head of the data with head command.

play02:52

You can see initial six elements using head commands.

play02:56

If you want to see more or less, you can select head data, maybe eight and you'll see the

play03:01

initial eight elements.

play03:03

Moreover, you can also see the structure of data with str command.

play03:05

It will give you the names of variables like name, their nature, which is character or

play03:13

numeric. We can see all the variables here: agency, gross pay, annual salary

play03:18

and so on.

play03:19

So, we get the names of the data.

play03:20

You can also check the column name.

play03:21

So simply by writing colnames, you can check all the column names that are there.

play03:25

So you can see there are seven columns name, job title, agency ID and so on up till gross

play03:29

pay.

play03:30

As a first step, you'd like to change the names of some of these columns.

play03:33

So, if you want to change the names of these columns, let's see how to do that.

play03:37

So let's say we'd like to change the name of the first column

play03:42

from name in small to name in caps like this.

play03:46

Also, probably you would like to change the names of some other columns starting from

play03:51

two to four, multiple columns and you'd like to change the name maybe from job title to

play03:58

title.

play03:59

So earlier name was job title for the column number two.

play04:02

I want to keep it title.

play04:05

Then there's a column called agency ID, which I would like to convert to ID and then there

play04:11

is agency column, which I would like to convert to agency name.

play04:15

So I'll run this command.

play04:17

Notice now the new names of the data variable.

play04:22

So all the names are changed as we wanted.

play04:25

You can also see the head command and you will see that the names are changed as you

play04:29

wanted them to be.

play04:30

So with this, we have understood how to input the data, see its structure, summarize it

play04:37

and change the column names.

play04:38

In the next video, we'll talk about cleaning the observations in the data frame.

play04:43

Now, we'll talk about cleaning the observations and cleaning the

play04:51

data frame.

play04:52

Let's consider two vectors.

play04:54

We have already seen what a vector of variable is.

play04:58

It contains NA value also.

play05:03

So it contains certain values along with NA observation.

play05:07

So if I print this variable, I'll get those variables, including the NA observation.

play05:11

Now if I run this command, X greater than two, notice that the second element of the output

play05:18

is NA, which means R is not able to compare the second value with two.

play05:24

How to go about it?

play05:25

So while comparing X greater than two, we'll also use an interesting command and which

play05:31

is given by ampersand symbol and then we'll use is.na

play05:36

What is is.na?

play05:38

is.na checks whether a value is NA or not.

play05:41

Let's look at this.

play05:42

So if I put is.na along with NA, it will give true and if I use exclamatory mark, which

play05:47

is the symbol for not (negation), it will give false.

play05:51

So I want to make this comparison of X greater than two, but I want to do it only with those

play05:56

values that are not NA value.

play05:58

So I'll put exclamatory mark, that means the value should not be NA and then run this along

play06:04

with ampersand.

play06:05

It says that the value has to be not NA and then the comparison with two has to be made

play06:11

and notice now that NA value is ignored.

play06:14

So instead of that NA, it is compared with two and we can get all the false, false, false.

play06:20

So, it has considered that NA is also not equal to two and it gives me a false value.

play06:26

Now let's say I wanted to check whether X is not only equal to zero, but also I want

play06:31

to put another condition which is with or operator, or operator is either one of them

play06:36

can be true along with X equal to two.

play06:38

So if I run this, I get seven true, falses including one NA, we know because this second

play06:45

observation is NA.

play06:46

To account for that NA observation, we will write an additional command which is again

play06:50

same as and exclamatory mark is not NA and this will account for that NA observation

play06:56

and we will get false instead of NA.

play06:58

Similarly, there are some other checks; for example, many times the value is NaN, that is,

play07:02

not a number.

play07:03

For example, something like zero divided by zero would be NaN for R. So if I check is.nan of zero by zero,

play07:08

that is true.

play07:10

Sometimes the value is infinite.

play07:12

So if the value is infinite, again similar operation, the same procedure we can select

play07:17

is not infinite.

play07:19

Let's say one by zero which is infinite value.

play07:21

So it will give me true.

play07:24

So this way we can handle NA observations, observations that are not available and so

play07:28

on.

play07:29

Let's take one example.

play07:30

So originally we had that data which we read.

play07:34

Now let me make a copy of this data so that we do not disturb the original data.

play07:39

So, we read it in data underscore one.

play07:42

Now let's play around with this data underscore one a little bit.

play07:45

So for this data underscore one, let's put some values, maybe thousand row, this value

play07:54

represents the element at row one thousand, column five.

play07:57

Let's set it to NA.

play07:59

Similarly, I'll also set the row 3000 position in the second column to NA and then again, I

play08:07

will set maybe the row 4000 element in the third column to NA.

play08:13

Let's run these commands.

play08:14

So there are three NAs that I have introduced in the original data.

play08:17

If you want to check whether there were any NAs in the original data, how do we check?

play08:21

A very simple way would be to check is dot NA data underscore one.

play08:26

First we'll check the original data, and I will add a sum around it.

play08:32

So all the NA observations are flagged here as TRUE.

play08:35

As we discussed earlier, true is equal to one.

play08:38

So all the NA observations will be extracted from data and notice if I run the sum, it

play08:42

is zero.

play08:43

That means there are no NA observations in the original data.

play08:45

However, now that we have introduced three NA observations in the new data, if I check

play08:52

the same command, I'll get the summation as three because three NAs have been introduced.

play08:56

So now we have three NAs in this data.

play08:58

There are other ways to check that.

play09:00

So for example, if I write all of not is dot NA, it checks whether all the observations are

play09:07

not NA and it tells me FALSE.

play09:09

That means there are some observations that are NA.

play09:13

If I would have run this on the original data, this would have given me true.

play09:16

That means none of the observations were NA in the original data.

play09:19

The new data which is data underscore one carries those NA observations.

play09:23

Now let's say you want to replace all those NA observations with zero.

play09:27

A very simple solution to do that is this form that you have already seen inside data

play09:33

underscore one.

play09:34

I filter out those observations that are NA with this simple command and all those

play09:39

observations I set as zero.

play09:42

That is one way to do that.

play09:43

When I do that, all those observations will be set to zero.

play09:45

And if I take the summation, now I take the summation of is dot NA, it will be zero.

play09:50

So all those observations are zero or I can also check this one.

play09:53

All observations are not NA and yes, they are not NA.

play09:57

So all the observations are replaced by zero.

play09:59

You can replace it by any other notation if you want or in similar manner, we can process.

play10:04

So this was some of the data cleaning observations.

play10:06

We will further move to some of the other ways to handle the data in next set of videos.

play10:11

You can see some of the examples on how to handle NA observations.

play10:16

Consider this data frame, we will create a data frame for the vector.

play10:21

We will create this data frame with the data dot frame command where element one is a vector

play10:29

which contains an NA value and two numerics.

play10:35

The next element B is another character vector which contains two characters, one NA and

play10:44

another character.

play10:46

So this is another vector, and the combination of these vectors is provided in the DF data frame.

play10:53

So we can see what is in the DF data frame.

play10:55

It carries two NAs, one is the numeric variable NA and one is the character variable NA.

play10:59

Now one way to deal is individually remove these NAs from individual columns.

play11:05

For example, I can subset DF.

play11:07

There are different ways to do it.

play11:08

As you see, we use the subset command and I will select only those observations that are not NA in

play11:17

variable A in the data frame DF.

play11:20

If I do that, notice the resulting output removes NA row, row which contained NA in

play11:27

variable A which was a numeric variable and the remaining variable or remaining data frame

play11:32

is a 2 x 2 data frame instead of the 3 x 2 original data frame.

play11:37

Similarly, we can do the treatment for column B and if I do that for B, again a 2 x 2 kind

play11:44

of data frame will emerge where the NA observation, the row corresponding to NA observation in

play11:51

column B which is a character vector is removed.

play11:55

However, if you want to remove all the NA observations in the entire data frame, a more

play12:02

comprehensive treatment is required, although many times it is not advised.

play12:05

So you can run this subset DF and then you can write complete dot cases which will only

play12:13

select complete cases in the data frame and if you run this, notice only one row is left

play12:19

and NA's rows that contained NA in variable A and B both are removed.

play12:25

Another command which has a similar effect is na.omit; it will also have a similar

play12:30

effect you can use.

play12:31

So this will remove NA observations in our data frame.

play12:34

We will work it a little bit more on NA.

play12:40

So let's use a library car.

play12:45

It contains a number of useful databases.

play12:47

So we use this library car.

play12:49

It contains a dataset called Freedman.

play12:51

We will work on this Freedman data.

play12:52

This Freedman data seems to carry some NA observations.

play12:57

Again we will do all the similar commands like str on Freedman and so on to check the

play13:03

structure and other properties of the data.

play13:05

We can summarize it also.

play13:07

As when I summarize it, these are the observations.

play13:16

Now if I compute the median of this Freedman data, kindly have a look at the summary data.

play13:21

Look at the density variable.

play13:23

Notice that there are 10 NA observations.

play13:25

Let us see what is the impact of these NA's.

play13:29

Let us compute the median of the Freedman density data.

play13:33

We know how to compute that.

play13:35

If I compute the median, I will get an NA.

play13:38

A simple treatment to this kind of problem is to use median and then R provides this

play13:45

functionality to use na.rm equal to TRUE.

play13:52

If I do that, R ignores the NA observations, not available observations and gives us the

play13:59

median value without considering those NA values.

play14:02

Similarly, if you compute the mean of density, again you will get an NA because there is

play14:08

a NA observation.

play14:09

So you compute mean with na.rm equal to true and you get the mean.

play14:11

So this is one good solution, one shortcut to handle NA in your observations.

play14:18

A more drastic way to handle this problem is to create Freedman.good from the Freedman data and

play14:26

remove all the NA observations, as we have seen earlier, with na.omit.

play14:30

While this kind of treatment removes all the NA observations, as you will see if I compute

play14:36

the summary measure of this Freedman.good data.

play14:38

However, it is a rather more drastic treatment because despite the fact that there may be

play14:45

only one NA in a particular variable, the entire row will be removed.

play14:49

Even rows in which just a single NA observation is present will be removed.

play14:55

Another way to do the same kind of treatment is to, let's say, take this Freedman data.

play15:05

Let's create a Freedman underscore not available variable and in this variable we will use

play15:12

Freedman data.

play15:14

Again we will use the exclamation mark to pick the observations that are not complete, that is, that carry NA.

play15:26

This is the procedure to identify exactly what are the

play15:33

observations that carry NA values.

play15:35

So, we will use complete.cases.

play15:36

Since we want to extract all the observations that carry NA, we will put an exclamation

play15:41

mark. Notice that in this not-available data frame, we have

play15:49

all the observations that have some kind of NA value.

play15:53

Let me print that.

play15:55

So notice two columns, population and density column, 10 NA observations are up there.

play15:59

Because there are some other columns like non-white, crime and all, they carry some

play16:04

value.

play16:05

So depending upon your requirement, whether you are interested in retaining these values

play16:08

from the nonwhite and crime variables, you may decide whether to use na.omit.

play16:14

If you use na.omit, then all these rows will also be removed from the table.

play16:18

We will take one more example of how to work with NA values.

play16:25

We will use the UsingR library for that.

play16:30

So we will make use of the UsingR library.

play16:35

In this library, there is a babies database.

play16:41

So let's extract this babies database.

play16:44

From this database, there is a DWT column, which is the weight column, DWT, which is

play16:55

dad's weight.

play16:56

We will assign this to a variable x.

play17:01

And in this variable, there are certain values which are outliers.

play17:04

Let's summarize this x variable and there we will see there are certain values which

play17:10

are outliers, which are coded as 999.

play17:13

Now this 999 may appear to be a numeric, but from a priori knowledge, we know that this

play17:18

is outlier.

play17:19

So how to handle this outlier?

play17:20

Let's say you want to decide, you decide to add NA or replace these 999s with NA.

play17:26

A very simple way to use this is this kind of command and then assign NA values.

play17:33

So this will assign NA values to all the 999 values.

play17:38

Now once I do that, the values will be replaced by NA.

play17:42

So again, similar problem with the use of NA variable will emerge.

play17:44

So if I compute range of x, it will be NA.

play17:47

If I compute some summary of x, notice there are some NA values, but still we get some

play17:57

of the good summary measures like minimum, median, mean and so on.

play18:00

For range also, we can make use of na.rm equal to TRUE, which is very useful.

play18:06

So we get the range.

play18:07

So in this way, we handle the NA values in our data frame and vector of variables.

play18:17

Examine how to remove non-unique values.

play18:20

Recall that we had original data, salary data, which was saved into data variable.

play18:28

We can check the head of this variable again.

play18:30

So this was our salary data.

play18:32

Let's create a copy of data as data underscore 2 and save the values here.

play18:36

Simple assignment operation.

play18:37

Now what we'll do is and notice the dimension of this data, it should be same as original

play18:43

data, 14017 observations.

play18:44

Now let's create another data which carries data underscore 3, which carries not only

play18:50

this data underscore 2, but also some values which are there in data underscore 2, starting

play18:58

from 1 to 500.

play18:59

So all the columns and row numbers 1 to 500.

play19:03

So we are adding row wise with rbind dot data dot frame to create a new variable data underscore

play19:08

3.

play19:09

Now it should be quite obvious to us that this data underscore 3 will carry those redundant

play19:14

or non-unique values at the end of it, which are exactly the values of row number 1 to

play19:20

500 from the data underscore 2 data frame.

play19:23

So if you look at the data dimensions of this data underscore 3, they are 500 more than

play19:28

the original data frame, which is data underscore 2.

play19:31

They are now 14,517.

play19:34

Column number will remain same of course.

play19:35

Now what we want to do is we want to remove these non-unique values and a very easy way

play19:40

to do that, let's say we create a new unique data frame with a data underscore 4 and unique

play19:47

function can be used on data underscore 3; we will run this command.

play19:51

And now if you notice dimension of data underscore 4, it should be same as 14017.

play19:57

So it carries only the unique observations and all those non-unique redundant observations

play20:01

that were repeated, because at the end of it, we added row 1 to 500 from the data underscore

play20:07

2 data frame, all those redundant non-unique observations are removed.

play20:12

So with this, we conclude the discussion on the removal of non-unique observations and handling

play20:21

data frames.

play20:25

Let's start with selection of columns and rows.

play20:29

Selection of column and rows in R. Although we have already seen this in bits and pieces

play20:37

earlier.

play20:38

So let's say there is an iris data in R, there are certain columns and each column carries

play20:44

certain observations which comprise rows.

play20:47

So let's say I want to select column number 3, a very simple command like this will extract

play20:53

the column number 3 for me.

play20:54

I can use this head to see the initial elements that I have extracted.

play20:59

So these are the initial elements.

play21:01

If I want to extract multiple column elements, let's say column 3 and 5, I can do it like

play21:08

this.

play21:09

Column number 3 and 5 are extracted.

play21:11

If I want to extract all the columns from 3 to 5, a very simple command, 3 to 5, we

play21:17

have seen this notation, will extract column number 3, 4, 5.

play21:20

I can check the head of this column number 3, 4, 5.

play21:23

It gives me 3 columns.

play21:25

Now you want to extract not only specific columns but specific row number elements also

play21:31

from those specific columns.

play21:32

Let's say you want to extract row number 4 to 10 for column number 3 to 5, then this

play21:38

kind of notation or this kind of command will extract row number 4 to 10 for columns 3,

play21:44

4, 5, which is petal length, width and species.

play21:47

Although you can also extract columns with their names, but generally that is not so

play21:52

advisable.

play21:53

You can extract, let's say you want to extract column number species and another is petal

play22:00

width.

play22:01

You can also do that, however, it has problems because you need to remember the spellings

play22:08

and there you may make some mistakes.

play22:11

So it is better to use numeric notation, but you can use this.

play22:14

So for example, you have species, species, I need to remember the name exactly, so I

play22:19

need to use species and petal dot width exactly the same name.

play22:25

I need to remember with exact spelling and then I can extract this, these two.

play22:30

Better would be to use head so that we can see the initial elements, so I can extract

play22:36

that.

play22:37

Now, we will come to the next step, which is creation of new variables inside data frame.

play22:46

So the way to do it, let's say you want to create a new variable called petal dot ratio

play22:53

and this variable is equal to ratio of petal dot length divided by petal width and you

play23:05

want to create another variable called sepal dot ratio, sepal length divided by sepal width

play23:16

and you run that as well.

play23:17

Now if I check the header, I will find that new columns are added, if I check that you

play23:23

have sepal ratio and petal ratio available.

play23:25

So there is a minor spelling mistake which I need to correct, so you get the sepal ratio

play23:30

and petal ratio variable here.

play23:32

Extracting observations based on conditions and summarizing the observations.

play23:37

So let's start with extracting observations.

play23:40

Let's say in the iris data, I want to extract those observations petal width of more than

play23:48

0.5 along with another condition and that condition should hold true where the species

play23:56

is it should have, let me check the head of this and I want a data where petal width is

play24:04

greater than 0.5 and species should be; I am going to use an ampersand to create that effect,

play24:14

species should be equal to equal to setosa.

play24:19

So now with this, I need to add a comma so that all the columns are selected.

play24:23

So with this all the variables where petal width is greater than 0.5 and species equal

play24:31

to setosa will be selected, pardon me for the spelling mistake, it should be setosa.

play24:37

And so now if I run this, I can see the row that has been extracted for which petal width

play24:44

is greater than 0.5 and species equal to setosa.

play24:48

Similarly, I can create same effect with subset command as well, with subset I will specify

play24:53

the data, I will specify the petal width to be greater than 0.5 and I will use the ampersand

play25:00

operator to create the and effect where I am saying that species should be equal to

play25:07

equal to setosa.

play25:09

So with this I will have the similar effect and the same row although it seems there is

play25:14

only one row which will be extracted.

play25:16

So now we will move on to summarizing observations.

play25:23

So a very basic summary measure for any data frame is summary, we can see the nature of

play25:27

all the variables, their summary measure depending upon whether they are numeric, character or

play25:32

so on.

play25:33

We can also use structure command to get a sense of variables and we can also use brief

play25:39

command which will give us a sense of variables.

play25:43

Some initial three rows and the closing two rows, 149 and 150, give us a sense of the variables.

play25:50

Now you can create a user defined summary also let us say you want to summarize in this

play25:54

fashion.

play25:55

You can use this summarize command and you can tell R that you want to summarize iris

play26:00

data frame.

play26:01

Inside iris data frame you want to create a mean variable let us say petal dot length

play26:08

dot mean which is the mean of petal length.

play26:11

So you can create a mean of petal length variable.

play26:14

So we have a petal length variable for which we want to create a mean petal length mean.

play26:19

So maybe I want to create a mean variable for sepal length again maybe you also want

play26:26

standard deviation for these two variables, so I will use sd for standard deviation.

play26:33

Now if I run this command, notice the output in the console: you have the mean of sepal

play26:39

length and petal length you have standard deviation of sepal length and petal length

play26:43

they are produced.

play26:44

So in this way I can create more variables with a more user defined or user desired summary.

play26:49

How to work with data frames.

play26:51

We will install library car which we have already used.

play26:56

As a first step we will make use of Davis data which is already inbuilt inside the car

play27:02

package.

play27:03

There are 200 rows and 5 columns.

play27:06

If we check the head of tables these are some of the elements gender column, weight, height,

play27:12

reported weight, reported height.

play27:13

Now as a first step we will create another variable which is a data frame element.

play27:20

This data frame is expected to have same dimensions as Davis data.

play27:25

So we will use the following command matrix and number of rows same as Davis data and

play27:34

number of column also same as Davis data.

play27:38

So we will create this variable.

play27:40

Let's see what is inside this data.

play27:41

This data carries 200 rows and 5 columns.

play27:45

We can check the dim of it.

play27:47

So this is a new variable that we have created to store the observations for practice.

play27:51

Also we'll give the name to the variables inside Davis.

play27:55

So currently if you look at the head output you will notice that there are no names automatically

play28:01

by default x1 x2 are provided.

play28:03

So we'll create some names for this.

play28:05

Since we are planning to use Davis data to fill this variable we'll use similar names

play28:09

that are gender, weight, height, reported weight, reported height.

play28:15

So this is our new variable.

play28:17

You can see the head now it will be changed with the new names that we have created.

play28:22

Now let's assign the values of Davis variable inside this.

play28:25

So we'll assign the values.

play28:26

So this is our output variable dollar gender.

play28:30

So we'll assign the gender variable from Davis data.

play28:33

Similarly we'll assign the weight variable, height variable, reported weight variable,

play28:38

and we'll also assign the reported height variable.

play28:41

So in this fashion we have created this output variable which has similar dimensions as Davis

play28:46

data and for practice we have assigned this output variable same data or same variables

play28:51

as Davis variable like this.

play28:53

So it has gender, weight, height, reported weight and reported height variables.

play28:57

So in this video we have learned how to create a data frame, how to assign values to that

play29:02

data frame, set its dimensions and then assign values from different sources.

play29:08

Learn how to deal with factor variables and we'll also perform some operations on data

play29:14

frames.

play29:15

So we'll learn working with factor variables.

play29:19

We'll make use of the library UsingR, in which there is a Cars93 dataset which we'll

play29:27

make use of.

play29:28

We'll make use of this Cars93 dataset, which carries various attributes of cars.

play29:34

Now as a starting point let's create a sub data frame or extract certain elements of

play29:40

Cars93.

play29:41

Let's take the first three rows and, out of all the columns, the first four columns.

play29:46

So this is a three cross four (3 x 4) kind of data.

play29:49

Let's see what is inside this.

play29:50

It carries four columns manufacturer, model, type and minimum price and three elements

play29:55

for each column or variable.

play29:56

Now we can see the structure of this new data small data frame.

play30:01

We can also summarize this small data frame.

play30:03

As a starting point let's assign some NA values to this.

play30:07

So I'll take the third row, fourth column.

play30:10

I'll assign an NA value, and to the first row, first column, I'll also assign an NA value.

play30:17

Now if you print this data frame D notice the first column manufacturer first value

play30:22

is NA.

play30:23

Similarly fourth column minimum price last value is NA.

play30:26

In this sub data frame if you want to add let's say some new elements.

play30:31

Let's say you want to add to the third list you want to change the values that are there

play30:36

in column two and four and you want to change them to new elements maybe A3 to the second

play30:42

column and 30 to the fourth column should be easy right.

play30:46

However, notice that it gives you an NA warning and if you print D, notice that in the second column an NA at the

play30:51

third row is created.

play30:53

The reason being if you notice the class of D model it's a factor variable so it has some

play30:59

levels what are these levels.

play31:01

We can check those levels.

play31:02

Levels D $ model.

play31:05

So it has some levels.

play31:06

There are a number of levels, which is because these levels have stuck around because we extracted

play31:11

it from the original Cars93 data, so the original levels are sticking.

play31:13

So let's first remove the unused levels which is quite simple.

play31:16

I'll use this droplevels command, droplevels on D dollar Model, and if I run this command all

play31:25

the unused levels will be removed and then I will run this level command to see the remaining

play31:31

levels which are currently used.

play31:32

So if I run this you will find the levels that are currently used, such as Integra and Legend;

play31:37

only the used levels are remaining; all the unused levels have been removed now.

play31:40

The problem is because this is a factor variable which is using certain levels which are specified

play31:44

if I add new levels like A3 they are not added and instead they create NA values because

play31:50

R is confused that this level is not specified.

play31:52

So how to specify new levels.

play31:54

So we'll try and specify some new levels to this model variable.

play31:58

Please note we'll not ignore the earlier levels so we'll first create the existing levels

play32:04

with the model variable.

play32:05

So these are the existing levels that will make use of and in addition we'll add certain

play32:09

more levels.

play32:10

We'll add probably A3.

play32:11

We'll also add A4 and we'll maybe adding A6 also.

play32:16

Now that we have assigned these levels if I check the levels of model variable it will

play32:21

have three levels added and now we have five levels.

play32:24

Now if we run the original command which created NA, which is this D[3, c(2, 4)], where we are trying

play32:30

to assign A3 to the third element of second column and 40 to the third element of fourth

play32:35

column there is no NA creation and you can see A3 has been assigned to model because

play32:39

we already specified it as a level.

play32:42

So this is how you add level.

play32:44

Let's say now you want to add a fourth column and there are multiple ways to do it.

play32:48

One way is to use this index notation where I use D4 and then say I simply add all the

play32:53

four elements: first element, let's say the name of the car, which is Audi, then one level which

play32:59

is A4 since we have already specified A4 as a level this will not create any trouble then

play33:03

type as midsize and price minimum price is 35.

play33:07

So if I do that a new row is created, the fourth row, and we can see the elements Audi, A4, midsize,

play33:11

35.

play33:12

Another way to do the same thing is to use rbind and I can create D equal to rbind and

play33:19

with rbind I can again write the same elements rbind D original D and with this original

play33:25

D I'll add the new set of values Audi, A4 and 35; this will also have a similar effect.

play33:31

So if I run this command I'll again get the new variable D which has the next element

play33:36

which is Audi, A4, midsize; since I have already added the fourth row, this will be

play33:40

added in the form of fifth row.

play33:41

So this is another way to do it.

play33:43

Now let's say you want to create a new column fifth column which is a multiple of minimum

play33:48

price let's say D dollar minimum price multiplied by 1.3 a new column will be created we can

play33:55

print D V5 is created if you want to give it a name you can select a name that you find

play34:00

useful let's say you pick a name of column D fifth column and call it mod price.

play34:06

So a mod price name will be assigned if I print D instead of V5 we have now modified

play34:11

price or mod price.

play34:12

Another more simpler way to do that would be simply use D dollar mod price I give it

play34:17

a name mod price and then assign the same value which is D dollar minimum price into

play34:22

1.3 this will also have the same effect and a new column mod price will be created.

play34:26

Another easy way to do the same effect is within function within function transformation

play34:30

will be made inside data frame D and within D we are assigning or creating a new mod price

play34:39

equal to minimum price into 1.3, which is also quite simple.

play34:43

So if I run this again I'll get a new variable D which is mod price is created.

play34:49

So this is how you transform the value work out with vectors and transform variables inside

play34:56

a data frame.

play34:57

Transforming data frames between long and wide format.

play34:59

So we'll transform the data frames across long and wide format.

play35:09

So let's start with a simple construction of data frame.

play35:12

Let's create a number of variables first let's say variable speed dot 1 this may be first

play35:18

observation for different speed let's have number of values here.

play35:25

So these are hypothetical values of speed observations for different vehicles A, B,

play35:32

C, D, E, F. So these are some of observations.

play35:36

Similarly we'll have another set second set of observations as speed 2 with hypothetical

play35:40

values here may be.

play35:51

Similarly we'll have third observation for the speed variable again some random values

play36:24

like 800.

play36:29

We'll have fourth set of.

play36:32

So objective here is to create a rather wide format and then we'll see whether we are able

play36:38

to in the interest of time we'll keep the value the same.

play36:43

So there are five speed variables: speed.1, speed.2, speed.3, speed.4 and speed.5.

play36:48

Another there is ID variable which identifies the units 1, 2, 3, 4, 5, 6.

play36:55

Then we have one variable which may give the name.

play37:00

Let's call it A, B, C, D, E, F.

play37:23

So these are the variables: five speed variables, plus the ID and run variables.

play37:29

Now let's combine them under the variable speed.

play37:35

And give it a name.

play37:38

Let's combine them with the command cbind.data.frame: first the ID variable,

play37:45

then the run variable, then speed.1, speed.2, speed.3, speed.4 and speed.5.

play38:00

So, these are the variables that we have created speed variable.

play38:05

Then you have head of speed.

play38:09

We can see the variables ID and their runs.

play38:12

We can see the summary.

play38:19

Structure.

play38:26

We can see the variable.

play38:27

It's a rather wide data frame.

play38:29

In order to make this a long data frame we'll take the help of package reshape2.

play38:36

So this reshape2 package will be added with the library command.

play38:40

We'll add this reshape2.

play38:42

And now we are going to go.

play38:45

So the way it works, we'll create a long data format by using melt function.

play38:50

The melt function will take this wide data format, which is speed, and then we'll

play38:58

give the ID of variables that are to be fixed.

play39:02

So we'll give the names speed.

play39:06

So first two variables.

play39:07

So we are giving the name of first two variables.

play39:09

We want them to be fixed.

play39:10

These are ID and run variables and the variable which we.

play39:17

The variable which we want to create a more long data frame.

play39:22

This is the speed variable.

play39:23

So all the five speeds we want to put them in one variable called speed.

play39:29

So if I run this command.

play39:31

Now if I run the command notice how long variable appears now.

play39:41

So in this all the values are put in the value column and all the speed variables speed.1

play39:48

speed.2 and so on are separated now.

play39:51

So you can see that speed.1 speed.2 and so on they are combined into one column which

play39:55

is speed and their values are put in value.

play39:57

So now this is rather long data frame.

play39:59

So now also we get the interpretation when we are saying long and wide.

play40:02

And now we'll try to get back to our original wide data frame.

play40:06

How to do that for that we'll create a new data frame called wide.

play40:09

We'll use this dcast.

play40:11

dcast and we'll specify that we want the long data frame.

play40:16

The variables ID plus run need to be fixed and the variable speed needs to be adjusted.

play40:21

If I do that look at the head of wide data frame.

play40:28

So this is how we work upon wide and long data frames different context required different

play40:34

formats maybe long or wide.

play40:36

In this video we'll talk about merging data frames.

play40:40

Merging different data frames and various properties of the merge command.

play40:44

So we'll talk about merging data frames.

play40:49

Let's create two data frames.

play40:52

We create variable v1 and these are movies, and they are domestic collections.

play40:57

First we'll create that.

play40:59

So the movies are The Avengers, Dark Knight, The Hobbit, Hunger Games and Skyfall.

play41:30

And then v2 which is their foreign collection.

play41:33

Maybe let's put some hypothetical numbers.

play41:45

So we are putting some hypothetical numbers here.

play41:59

Now let's combine them.

play42:10

So we'll combine them.

play42:11

We'll give them a name domestic v1 and v2.

play42:20

So we have combined them.

play42:23

Let's see their names.

play42:26

Use head domestic.

play42:28

If you want to create more appropriate names you can use col names with Name, Domestic.

play42:44

Now we have created the name.

play42:47

So now if I run the head command, I will get the new data frame.

play42:50

Probably I'll adjust a little bit so that it is visible.

play42:58

Next we'll create foreign collections variable.

play43:03

So again we'll start with the same process.

play43:05

We'll create v3 which is equal to and we'll use the movie's name.

play43:09

Probably we'll change the movie's names a little bit.

play43:12

So this time around we'll use a little bit difference we'll make.

play43:16

So probably we'll remove one of the movies, Hunger Games, and add Ice Age.

play43:21

Probably we'll change this.

play43:25

And then we'll add the collections.

play43:28

The idea is to create two data frames with slightly different movie names.

play43:34

And their collections and then try to merge them and see how it works out.

play43:38

So, this time around we'll add again since it is hypothetical case so we'll for collections

play43:44

we’ll use some again same hypothetical numbers, so it doesn't matter the numbers don't matter

play43:51

here.

play43:52

So, then we have foreign.

play43:54

So we'll give it a name foreign.

play44:09

Equal to v3 comma v4.

play44:11

So these are foreign we can check the head.

play44:14

Again we'll switch the names.

play44:20

Please notice this time around while giving the names I will be using a slightly different

play44:26

notation; so instead of the exact Name used earlier, we'll use lowercase, not caps.

play44:33

So notice, instead of the Name variable earlier, I'm using name with a small n, and

play44:39

then foreign.

play44:40

So this name will be our joining variable but I'm using a different syntax.

play44:47

Let's see the head.

play44:49

So now this is our name variable so let's create the final variable which is merge and

play44:55

notice how I create this variable so final variable, which is using merge command.

play45:05

I'm merging the domestic variable with foreign variable and notice by dot x so first is domestic

play45:16

variable so by dot x equal to name.

play45:21

And the second variable is foreign so by dot y equal to name but please notice here name

play45:30

I'm giving with n as a small and I'll add them up.

play45:33

Now this is a rather awkward case here; to make it simpler, it is always advisable

play45:39

to use the same syntax so like this capital name again.

play45:44

And if I do that now the name is foreign, I need not give the second thing.

play45:48

I can simply put it like this; or rather, instead of this I will write a new command. In fact,

play45:55

if I had written it like this, with capital Name, I'll execute this.

play46:01

It will be better if I keep this command.

play46:07

So now if I run this, things are more simple now: I need

play46:13

not give two names I can simply use this.

play46:18

And head final so you can see it has merged the data let's see what exactly has happened.

play46:27

So if you notice if I do this kind of command, it has merged all the movie names, and this

play46:34

kind of merge is sort of intersection.

play46:37

So if you notice two movie names are missing one is Ice Age and one is Hunger Games.

play46:43

Now since one of the movies was not present in one data frame while other was present

play46:47

in other, so they are excluded it's the sort of intersection of merge or inner merge.

play46:54

Let's do the outer merge that is also doable.

play46:58

So, for that, instead of this command I will use all = TRUE, and now if I run this command

play47:10

then notice there is Ice Age movie there and NA is in the domestic because it was not there

play47:19

in domestic.

play47:20

Similarly Hunger Games NA is in foreign, but all the movie names are present now.

play47:25

So this is called outer merge.

play47:28

Then you can also decide whether you want to merge based on one variable or the for

play47:32

example if I write all dot x equal to TRUE, in that case all the domestic rows will be taken and

play47:38

those that are not present in domestic will be ignored.

play47:42

Similarly if I do it for all dot y then all values that are available in foreign will

play47:49

be considered and domestic will be ignored.

play47:52

So now we conclude with the merging aspect of data frames and in this complete module

play47:57

we learned how to clean and handle data and a more complex form of data which is data

play48:02

frame.

play48:04

Thank you.


Related Tags

Data Handling, R Programming, CSV Files, Excel Data, Data Cleaning, NA Values, Data Frames, Merging Data, Data Analysis, R Tutorial