Data Management in R

January 27, 2020

Introduction

Why Manage Data in R

Today’s Topics

Importing data
Subsetting data
Creating variables
Editing variables
Removing variables
Sorting data
Merging data
Reshaping data
The apply family
Version Control
Github

Importing Data

Usually you will use read.csv()

read.table() are generalizations for their data formats

We can read any data from a website or the desktop

Importing Data

If we use some Census data from the web source,

census <- read.csv("http://myaddress/address2/mydata.csv")

Importing Data

calif_penn <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/13/hw/01/calif_penn_2011.csv")
head(calif_penn)

##   X    GEO.id2 STATEFP COUNTYFP TRACTCE POPULATION LATITUDE LONGITUDE
## 1 1 6001400100       6        1  400100       2937 37.85942 -122.2304
## 2 2 6001400200       6        1  400200       1974 37.84829 -122.2495
## 3 3 6001400300       6        1  400300       4865 37.84027 -122.2544
## 4 4 6001400400       6        1  400400       3703 37.84845 -122.2573
## 5 5 6001400500       6        1  400500       3517 37.84894 -122.2647
## 6 6 6001400600       6        1  400600       1571 37.84223 -122.2649
##                               GEO.display.label Median_house_value
## 1 Census Tract 4001, Alameda County, California                 NA
## 2 Census Tract 4002, Alameda County, California             909600
## 3 Census Tract 4003, Alameda County, California             748700
## 4 Census Tract 4004, Alameda County, California             773600
## 5 Census Tract 4005, Alameda County, California             579200
## 6 Census Tract 4006, Alameda County, California             480800
##   Total_units Vacant_units Median_rooms Mean_household_size_owners
## 1        1425          162          6.5                       2.02
## 2         929           37          6.0                       2.53
## 3        2655          134          4.6                       2.45
## 4        1911           68          5.0                       2.04
## 5        1703           71          4.5                       2.66
## 6         781           65          4.8                       2.58
##   Mean_household_size_renters Built_2005_or_later Built_2000_to_2004
## 1                        1.59                 4.6                9.3
## 2                        1.81                 0.0                1.2
## 3                        1.66                 0.0                0.0
## 4                        2.19                 0.0                0.2
## 5                        1.72                 0.0                0.2
## 6                        2.18                 0.0                0.0
##   Built_1990s Built_1980s Built_1970s Built_1960s Built_1950s Built_1940s
## 1        50.9         2.5         4.8         1.3        13.9         2.8
## 2         0.0         1.3         6.1         6.5         1.0        10.8
## 3         2.3         3.2         5.2         8.3         5.3         7.8
## 4         1.3         0.0         4.9         4.3         8.0        10.4
## 5         1.1         1.9         3.7         5.8         6.0         7.5
## 6         1.2         1.4         1.0         6.5        19.7        17.0
##   Built_1939_or_earlier Bedrooms_0 Bedrooms_1 Bedrooms_2 Bedrooms_3
## 1                   9.9        3.6        5.6       11.9       40.6
## 2                  73.2        3.0       16.4       27.4       34.4
## 3                  68.0       11.5       28.4       29.2       20.4
## 4                  71.1        5.2       27.7       33.7       21.9
## 5                  73.8        4.9       30.2       38.1       19.3
## 6                  53.1        3.5       20.4       40.1       30.7
##   Bedrooms_4 Bedrooms_5_or_more Owners Renters Median_household_income
## 1       31.6                6.7   81.2    18.8                  156250
## 2       17.5                1.2   66.0    34.0                  111667
## 3        7.9                2.7   45.1    54.9                   66094
## 4        7.3                4.2   45.0    55.0                   87306
## 5        5.4                2.1   43.6    56.4                   62386
## 6        4.6                0.8   51.0    49.0                   55658
##   Mean_household_income
## 1                237805
## 2                195229
## 3                105877
## 4                106248
## 5                 74604
## 6                 73933

Importing Data

If we want to use data from our desktop, we should include a file path (ex. c:/)

MyData <- read.csv("c:/TheDataIWantToReadIn.csv", header=TRUE)

Remember that you must use the forward slash / or double backslash \ in R!

The Windows format of single backslash does not work

header=TRUE means the first row contains the names of the variables.

Importing Data

We will use built-in dataset in R

R comes with several built-in datasets
Generally used as demo data for playing with R functions

data()

Importing Data

mtcars : Motor Trend Car Road Tests

The data was extracted from the 1974 Motor Trend US magazine
The Data consists of fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models)

cars <- mtcars

Let’s Get Familiar With Our Data

head() returns the first 6 lines of the data

tail() returns the last 6 lines of the data

head(cars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let’s Get Familiar With Our Data

dim() provides the number of rows and columns of the data

ncol() returns the number of columns of the data

nrow() provides the number of rows of the data

dim(cars)

## [1] 32 11

ncol(cars)

## [1] 11

nrow(cars)

## [1] 32

Let’s Get Familiar With Our Data

colnames() provides the column names of the data

rownames() provides the row names of the data

colnames(cars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

rownames(cars)

##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

Let’s Get Familiar With Our Data

str() tells us the type of each column and gives us a preview of the data.

str(cars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Subsetting Data

In R, subsetting is done with [

The , seperates rows to select from columns to select

In [a, b], a indicates a-th row, b indicates b-th column

cars[1,1]

## [1] 21

Subsetting Data

head(cars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Subsetting Data

When one of these is empty, it keeps all values

data[a,] gives an a-th row with the whole column values
data[,b] gives a b-th column with the whole row values

cars[1,]

##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4

cars[,1]

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

Subsetting Data

For column variable, we can use the name of a variable

The $ is used to select a column of a data frame

cars$hp

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230
## [18]  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109

cars[,4]

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230
## [18]  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109

Subsetting Data

cars$hp

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230
## [18]  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109

cars$Hp

## NULL

Subsetting Data

We can select the several rows and columns at the same time

Let’s find mpg (Miles/gallon; column 1) and hp (Horsepower; column 4)` of the cars

cars[,1,4] # ?

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

Subsetting Data

cars[,c(1,4)]

##                      mpg  hp
## Mazda RX4           21.0 110
## Mazda RX4 Wag       21.0 110
## Datsun 710          22.8  93
## Hornet 4 Drive      21.4 110
## Hornet Sportabout   18.7 175
## Valiant             18.1 105
## Duster 360          14.3 245
## Merc 240D           24.4  62
## Merc 230            22.8  95
## Merc 280            19.2 123
## Merc 280C           17.8 123
## Merc 450SE          16.4 180
## Merc 450SL          17.3 180
## Merc 450SLC         15.2 180
## Cadillac Fleetwood  10.4 205
## Lincoln Continental 10.4 215
## Chrysler Imperial   14.7 230
## Fiat 128            32.4  66
## Honda Civic         30.4  52
## Toyota Corolla      33.9  65
## Toyota Corona       21.5  97
## Dodge Challenger    15.5 150
## AMC Javelin         15.2 150
## Camaro Z28          13.3 245
## Pontiac Firebird    19.2 175
## Fiat X1-9           27.3  66
## Porsche 914-2       26.0  91
## Lotus Europa        30.4 113
## Ford Pantera L      15.8 264
## Ferrari Dino        19.7 175
## Maserati Bora       15.0 335
## Volvo 142E          21.4 109

Subsetting Data

We can use the names of variables

cars[,c("mpg", "hp")]

##                      mpg  hp
## Mazda RX4           21.0 110
## Mazda RX4 Wag       21.0 110
## Datsun 710          22.8  93
## Hornet 4 Drive      21.4 110
## Hornet Sportabout   18.7 175
## Valiant             18.1 105
## Duster 360          14.3 245
## Merc 240D           24.4  62
## Merc 230            22.8  95
## Merc 280            19.2 123
## Merc 280C           17.8 123
## Merc 450SE          16.4 180
## Merc 450SL          17.3 180
## Merc 450SLC         15.2 180
## Cadillac Fleetwood  10.4 205
## Lincoln Continental 10.4 215
## Chrysler Imperial   14.7 230
## Fiat 128            32.4  66
## Honda Civic         30.4  52
## Toyota Corolla      33.9  65
## Toyota Corona       21.5  97
## Dodge Challenger    15.5 150
## AMC Javelin         15.2 150
## Camaro Z28          13.3 245
## Pontiac Firebird    19.2 175
## Fiat X1-9           27.3  66
## Porsche 914-2       26.0  91
## Lotus Europa        30.4 113
## Ford Pantera L      15.8 264
## Ferrari Dino        19.7 175
## Maserati Bora       15.0 335
## Volvo 142E          21.4 109

Subsetting Data

Let’s select cars with V-shaped Engine

vs contains the type of engine information

0 = V-shaped, 1 = Straight

cars$vs

##  [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1

table(cars$vs)

## 
##  0  1 
## 18 14

table() gives us counts at each levels

Subsetting Data

cars$vs==0

##  [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [23]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE

It contains TRUE/FALSE information of 32 cars,

which is related to row.

Subsetting Data

cars[cars$vs==0, ]

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

nrow(cars[cars$vs==0, ])

## [1] 18

Subsetting Data

Quiz!

Select cars
- V-shape Engine
- more than 18 mpg
How?

Subsetting Data

Answer~

V-shape Engine + more than 18 mpg

cars$vs==0 & cars$mpg > 18

##  [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

Subsetting Data

V-shape Engine + more than 18 mpg

cars[cars$vs==0 & cars$mpg > 18,]

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

Creating Variables

Create a new variable hp_per_wt about horsepower per weight.

Put new variable within dataset using cars$

cars$hp/cars$wt

##  [1] 41.98473 38.26087 40.08621 34.21462 50.87209 30.34682 68.62745
##  [8] 19.43574 30.15873 35.75581 35.75581 44.22604 48.25737 47.61905
## [15] 39.04762 39.63864 43.03087 30.00000 32.19814 35.42234 39.35091
## [22] 42.61364 43.66812 63.80208 45.51365 34.10853 42.52336 74.68605
## [29] 83.28076 63.17690 93.83754 39.20863

cars$hp_per_wt <- cars$hp/cars$wt

Creating Variables

head(cars[,8:12])

##                   vs am gear carb hp_per_wt
## Mazda RX4          0  1    4    4  41.98473
## Mazda RX4 Wag      0  1    4    4  38.26087
## Datsun 710         1  1    4    1  40.08621
## Hornet 4 Drive     1  0    3    1  34.21462
## Hornet Sportabout  0  0    3    2  50.87209
## Valiant            1  0    3    1  30.34682

Creating Variables

We could also use:

cbind is used to combine by columns

cars <- cbind(cars, cars$hp/cars$wt)
cars[1:3,]

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_per_wt
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  41.98473
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  38.26087
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  40.08621
##               cars$hp/cars$wt
## Mazda RX4            41.98473
## Mazda RX4 Wag        38.26087
## Datsun 710           40.08621

Creating Variables

colnames(cars)

##  [1] "mpg"             "cyl"             "disp"           
##  [4] "hp"              "drat"            "wt"             
##  [7] "qsec"            "vs"              "am"             
## [10] "gear"            "carb"            "hp_per_wt"      
## [13] "cars$hp/cars$wt"

colnames(cars)[13] <- "hp_per_wt2"

Creating Variables

Using cbind, we could put the variables at the specific region

cars2 <- cars[1:3,c("hp", "wt")]
cbind(cars2, cars2$hp/cars2$wt)

##                hp    wt cars2$hp/cars2$wt
## Mazda RX4     110 2.620          41.98473
## Mazda RX4 Wag 110 2.875          38.26087
## Datsun 710     93 2.320          40.08621

Creating Variables

cbind(cars2, cars2$hp/cars2$wt)[,c(3,1,2)]

##               cars2$hp/cars2$wt  hp    wt
## Mazda RX4              41.98473 110 2.620
## Mazda RX4 Wag          38.26087 110 2.875
## Datsun 710             40.08621  93 2.320

Creating Variables

Let’s add hp_per_wt next to hp.

cars3 <- cbind(cars, cars$hp_per_wt)[,c(1:4, 13, 5:12)]
colnames(cars3)

##  [1] "mpg"        "cyl"        "disp"       "hp"         "hp_per_wt2"
##  [6] "drat"       "wt"         "qsec"       "vs"         "am"        
## [11] "gear"       "carb"       "hp_per_wt"

cars3[1:3,]

##                mpg cyl disp  hp hp_per_wt2 drat    wt  qsec vs am gear
## Mazda RX4     21.0   6  160 110   41.98473 3.90 2.620 16.46  0  1    4
## Mazda RX4 Wag 21.0   6  160 110   38.26087 3.90 2.875 17.02  0  1    4
## Datsun 710    22.8   4  108  93   40.08621 3.85 2.320 18.61  1  1    4
##               carb hp_per_wt
## Mazda RX4        4  41.98473
## Mazda RX4 Wag    4  38.26087
## Datsun 710       1  40.08621

Editing Variables

hp_per_wt seems to be too long

cars[1:3,7:ncol(cars)]

##                qsec vs am gear carb hp_per_wt hp_per_wt2
## Mazda RX4     16.46  0  1    4    4  41.98473   41.98473
## Mazda RX4 Wag 17.02  0  1    4    4  38.26087   38.26087
## Datsun 710    18.61  1  1    4    1  40.08621   40.08621

Editing Variables

Let’s change this variable shorter

round rounds the values to specified number of decimal points

cars$hp_per_wt <- round(cars$hp_per_wt, 3)
cars[1:3,7:ncol(cars)]

##                qsec vs am gear carb hp_per_wt hp_per_wt2
## Mazda RX4     16.46  0  1    4    4    41.985   41.98473
## Mazda RX4 Wag 17.02  0  1    4    4    38.261   38.26087
## Datsun 710    18.61  1  1    4    1    40.086   40.08621

Editing Variables

wt is weight (1000 lbs)

Let’s convert that into weight per lb

cars$wt <- cars$wt * 1000
cars$wt

##  [1] 2620 2875 2320 3215 3440 3460 3570 3190 3150 3440 3440 4070 3730 3780
## [15] 5250 5424 5345 2200 1615 1835 2465 3520 3435 3840 3845 1935 2140 1513
## [29] 3170 2770 3570 2780

cars[1:3,]

##                mpg cyl disp  hp drat   wt  qsec vs am gear carb hp_per_wt
## Mazda RX4     21.0   6  160 110 3.90 2620 16.46  0  1    4    4    41.985
## Mazda RX4 Wag 21.0   6  160 110 3.90 2875 17.02  0  1    4    4    38.261
## Datsun 710    22.8   4  108  93 3.85 2320 18.61  1  1    4    1    40.086
##               hp_per_wt2
## Mazda RX4       41.98473
## Mazda RX4 Wag   38.26087
## Datsun 710      40.08621

Editing Variables

We would like to standardize the variable qsec (1/4 mile time)

Standardize of X : $\frac{X-Mean(X)}{Sd(X)}$

cars$qsec2 <- (cars$qsec - mean(cars$qsec)) / sd(cars$qsec)

Editing Variables

We know that vs is related to the shape of engine

0 = V-shaped, 1 = Straight

Let’s change (0, 1) to (“V”, “S”)

How?

Editing Variables

cars$vs[cars$vs=="0"]

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

cars$vs[cars$vs=="0"] <- "V"
cars$vs[cars$vs=="1"] <- "S"
cars$vs

##  [1] "V" "V" "S" "S" "V" "S" "V" "S" "S" "S" "S" "V" "V" "V" "V" "V" "V"
## [18] "S" "S" "S" "S" "V" "V" "V" "V" "S" "V" "S" "V" "V" "V" "S"

Removing Variables

I don’t think we will need hp_per_wt, hp_per_wt2, and qsec2, so let’s remove those columns

- tells R to remove this column number

colnames(cars)

##  [1] "mpg"        "cyl"        "disp"       "hp"         "drat"      
##  [6] "wt"         "qsec"       "vs"         "am"         "gear"      
## [11] "carb"       "hp_per_wt"  "hp_per_wt2" "qsec2"

cars <- cars[,-14]

Removing Variables

cars[1:3,]

##                mpg cyl disp  hp drat   wt       qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2620 -0.7771651  V  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2875 -0.4637808  V  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2320  0.4260068  S  1    4    1
##               hp_per_wt hp_per_wt2
## Mazda RX4        41.985   41.98473
## Mazda RX4 Wag    38.261   38.26087
## Datsun 710       40.086   40.08621

Removing Variables

We can remove multiple variables at the same time

cars <- cars[,-c(12,13)]
cars[1:3,]

##                mpg cyl disp  hp drat   wt       qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2620 -0.7771651  V  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2875 -0.4637808  V  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2320  0.4260068  S  1    4    1

Sorting Data

Let’s sort mpg of cars data

cars <- mtcars
cars$mpg

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

sort(cars$mpg)

##  [1] 10.4 10.4 13.3 14.3 14.7 15.0 15.2 15.2 15.5 15.8 16.4 17.3 17.8 18.1
## [15] 18.7 19.2 19.2 19.7 21.0 21.0 21.4 21.4 21.5 22.8 22.8 24.4 26.0 27.3
## [29] 30.4 30.4 32.4 33.9

By default, sorting is ascending

Sorting Data

If you want to sort it into descending order…

sort(cars$mpg, decreasing = TRUE)

##  [1] 33.9 32.4 30.4 30.4 27.3 26.0 24.4 22.8 22.8 21.5 21.4 21.4 21.0 21.0
## [15] 19.7 19.2 19.2 18.7 18.1 17.8 17.3 16.4 15.8 15.5 15.2 15.2 15.0 14.7
## [29] 14.3 13.3 10.4 10.4

Sorting Data

We use order function to sort a data frame

order gives the indices of the vector in sorted order

a <- c(100, 10, 1000)
order(a)

## [1] 2 1 3

10 is the smallest element in a, which is the second element, So 2 comes first in the output of order(a)
100 is the second smallest element and the first element in a, so 1 comes second

Sorting Data

We can use the output of order(a) to reshuffle a

a[order(a)]

## [1]   10  100 1000

Sorting Data

Now sort the whole data cars into ascending order on mpg

head(cars[order(cars$mpg),])

##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8

Sorting Data

If you want to sort data descending order on mpg,

- (minus sign) could be useful

head(cars[order(-cars$mpg),])

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

Merging data

Let’s merge two data frames

We can start from two simple datasets

simple.cars <- cars[,c(1,2)]
simple.cars2 <- cars[,c(10,11)]
head(simple.cars)

##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6

Merging data

head(simple.cars2)

##                   gear carb
## Mazda RX4            4    4
## Mazda RX4 Wag        4    4
## Datsun 710           4    1
## Hornet 4 Drive       3    1
## Hornet Sportabout    3    2
## Valiant              3    1

Merging data

Two datasets have the same rowname

Merge them by rowname

merge(simple.cars, simple.cars2, by = "row.names")

##              Row.names  mpg cyl gear carb
## 1          AMC Javelin 15.2   8    3    2
## 2   Cadillac Fleetwood 10.4   8    3    4
## 3           Camaro Z28 13.3   8    3    4
## 4    Chrysler Imperial 14.7   8    3    4
## 5           Datsun 710 22.8   4    4    1
## 6     Dodge Challenger 15.5   8    3    2
## 7           Duster 360 14.3   8    3    4
## 8         Ferrari Dino 19.7   6    5    6
## 9             Fiat 128 32.4   4    4    1
## 10           Fiat X1-9 27.3   4    4    1
## 11      Ford Pantera L 15.8   8    5    4
## 12         Honda Civic 30.4   4    4    2
## 13      Hornet 4 Drive 21.4   6    3    1
## 14   Hornet Sportabout 18.7   8    3    2
## 15 Lincoln Continental 10.4   8    3    4
## 16        Lotus Europa 30.4   4    5    2
## 17       Maserati Bora 15.0   8    5    8
## 18           Mazda RX4 21.0   6    4    4
## 19       Mazda RX4 Wag 21.0   6    4    4
## 20            Merc 230 22.8   4    4    2
## 21           Merc 240D 24.4   4    4    2
## 22            Merc 280 19.2   6    4    4
## 23           Merc 280C 17.8   6    4    4
## 24          Merc 450SE 16.4   8    3    3
## 25          Merc 450SL 17.3   8    3    3
## 26         Merc 450SLC 15.2   8    3    3
## 27    Pontiac Firebird 19.2   8    3    2
## 28       Porsche 914-2 26.0   4    5    2
## 29      Toyota Corolla 33.9   4    4    1
## 30       Toyota Corona 21.5   4    3    1
## 31             Valiant 18.1   6    3    1
## 32          Volvo 142E 21.4   4    4    2

Merging data

merge(x, y, by, by.x, by.y)

x and y are two data frames
by, by.x, by.y specifies the columns used for merging

What if rownames don’t match?

In usual cases, we need a common column to merge two datasets

Merging data

simple.cars <- data.frame("id" = rownames(cars), 
                       "mpg" = cars$mpg, "hp" = cars$hp)

simple.cars2 <- data.frame("car.type" = rownames(cars), 
                        "vs" = cars$vs, "am" = cars$am) 

head(simple.cars)

##                  id  mpg  hp
## 1         Mazda RX4 21.0 110
## 2     Mazda RX4 Wag 21.0 110
## 3        Datsun 710 22.8  93
## 4    Hornet 4 Drive 21.4 110
## 5 Hornet Sportabout 18.7 175
## 6           Valiant 18.1 105

Merging data

head(simple.cars2)

##            car.type vs am
## 1         Mazda RX4  0  1
## 2     Mazda RX4 Wag  0  1
## 3        Datsun 710  1  1
## 4    Hornet 4 Drive  1  0
## 5 Hornet Sportabout  0  0
## 6           Valiant  1  0

Merging data

merge(x = simple.cars, y = simple.cars2, by.x = "id", by.y = "car.type")

##                     id  mpg  hp vs am
## 1          AMC Javelin 15.2 150  0  0
## 2   Cadillac Fleetwood 10.4 205  0  0
## 3           Camaro Z28 13.3 245  0  0
## 4    Chrysler Imperial 14.7 230  0  0
## 5           Datsun 710 22.8  93  1  1
## 6     Dodge Challenger 15.5 150  0  0
## 7           Duster 360 14.3 245  0  0
## 8         Ferrari Dino 19.7 175  0  1
## 9             Fiat 128 32.4  66  1  1
## 10           Fiat X1-9 27.3  66  1  1
## 11      Ford Pantera L 15.8 264  0  1
## 12         Honda Civic 30.4  52  1  1
## 13      Hornet 4 Drive 21.4 110  1  0
## 14   Hornet Sportabout 18.7 175  0  0
## 15 Lincoln Continental 10.4 215  0  0
## 16        Lotus Europa 30.4 113  1  1
## 17       Maserati Bora 15.0 335  0  1
## 18           Mazda RX4 21.0 110  0  1
## 19       Mazda RX4 Wag 21.0 110  0  1
## 20            Merc 230 22.8  95  1  0
## 21           Merc 240D 24.4  62  1  0
## 22            Merc 280 19.2 123  1  0
## 23           Merc 280C 17.8 123  1  0
## 24          Merc 450SE 16.4 180  0  0
## 25          Merc 450SL 17.3 180  0  0
## 26         Merc 450SLC 15.2 180  0  0
## 27    Pontiac Firebird 19.2 175  0  0
## 28       Porsche 914-2 26.0  91  0  1
## 29      Toyota Corolla 33.9  65  1  1
## 30       Toyota Corona 21.5  97  1  0
## 31             Valiant 18.1 105  1  0
## 32          Volvo 142E 21.4 109  1  1

Merging data

What if some rows in x do not match rows in y?

There are only two common rows in simple.cars3 and simple.cars4.

simple.cars3 <- simple.cars[1:5,]
simple.cars4 <- simple.cars2[4:8,]

Merging data

simple.cars3

##                  id  mpg  hp
## 1         Mazda RX4 21.0 110
## 2     Mazda RX4 Wag 21.0 110
## 3        Datsun 710 22.8  93
## 4    Hornet 4 Drive 21.4 110
## 5 Hornet Sportabout 18.7 175

simple.cars4

##            car.type vs am
## 4    Hornet 4 Drive  1  0
## 5 Hornet Sportabout  0  0
## 6           Valiant  1  0
## 7        Duster 360  0  0
## 8         Merc 240D  1  0

Merging data

Only Hornet rows(4-5) are common in this case

Then, what happens if we merge these two dataset?

merge(simple.cars3, simple.cars4, by.x = "id", by.y = "car.type")

##                  id  mpg  hp vs am
## 1    Hornet 4 Drive 21.4 110  1  0
## 2 Hornet Sportabout 18.7 175  0  0

Merging data

If you want to keep all data from both,

merge(simple.cars3, simple.cars4, by.x = "id", by.y = "car.type", all = TRUE)

##                  id  mpg  hp vs am
## 1        Datsun 710 22.8  93 NA NA
## 2        Duster 360   NA  NA  0  0
## 3    Hornet 4 Drive 21.4 110  1  0
## 4 Hornet Sportabout 18.7 175  0  0
## 5         Mazda RX4 21.0 110 NA NA
## 6     Mazda RX4 Wag 21.0 110 NA NA
## 7         Merc 240D   NA  NA  1  0
## 8           Valiant   NA  NA  1  0

Using all, we can keep all data and NA fills for missing matches on either side

Reshaping data

t() is used to transpose a matrix or a dataframe

rownames become column names

cars[1:3,]

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

t(cars)[1:3,]

##      Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
## mpg         21            21       22.8           21.4              18.7
## cyl          6             6        4.0            6.0               8.0
## disp       160           160      108.0          258.0             360.0
##      Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
## mpg     18.1       14.3      24.4     22.8     19.2      17.8       16.4
## cyl      6.0        8.0       4.0      4.0      6.0       6.0        8.0
## disp   225.0      360.0     146.7    140.8    167.6     167.6      275.8
##      Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## mpg        17.3        15.2               10.4                10.4
## cyl         8.0         8.0                8.0                 8.0
## disp      275.8       275.8              472.0               460.0
##      Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
## mpg               14.7     32.4        30.4           33.9          21.5
## cyl                8.0      4.0         4.0            4.0           4.0
## disp             440.0     78.7        75.7           71.1         120.1
##      Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
## mpg              15.5        15.2       13.3             19.2      27.3
## cyl               8.0         8.0        8.0              8.0       4.0
## disp            318.0       304.0      350.0            400.0      79.0
##      Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
## mpg           26.0         30.4           15.8         19.7            15
## cyl            4.0          4.0            8.0          6.0             8
## disp         120.3         95.1          351.0        145.0           301
##      Volvo 142E
## mpg        21.4
## cyl         4.0
## disp      121.0

Reshaping data

reshape function is used for repeated measurements

We need new dataset : Indometh

Indometh is data about pharmacokinetics of indometacin

head(Indometh)

##   Subject time conc
## 1       1 0.25 1.50
## 2       1 0.50 0.94
## 3       1 0.75 0.78
## 4       1 1.00 0.48
## 5       1 1.25 0.37
## 6       1 2.00 0.19

Reshaping data

dim(Indometh)

## [1] 66  3

summary(Indometh)

##  Subject      time            conc       
##  1:11    Min.   :0.250   Min.   :0.0500  
##  4:11    1st Qu.:0.750   1st Qu.:0.1100  
##  2:11    Median :2.000   Median :0.3400  
##  5:11    Mean   :2.886   Mean   :0.5918  
##  6:11    3rd Qu.:5.000   3rd Qu.:0.8325  
##  3:11    Max.   :8.000   Max.   :2.7200

Reshaping data

Subject : Each of the six subjects were given indometacin
time : times at which blood samples were drawn (total 11 time points)
conc : plasma concentrations of indometacin

Reshaping data

Indometh[1:15,]

##    Subject time conc
## 1        1 0.25 1.50
## 2        1 0.50 0.94
## 3        1 0.75 0.78
## 4        1 1.00 0.48
## 5        1 1.25 0.37
## 6        1 2.00 0.19
## 7        1 3.00 0.12
## 8        1 4.00 0.11
## 9        1 5.00 0.08
## 10       1 6.00 0.07
## 11       1 8.00 0.05
## 12       2 0.25 2.03
## 13       2 0.50 1.63
## 14       2 0.75 0.71
## 15       2 1.00 0.70

Reshaping data

From reshape function, we can get “wide” data and “long” data

“wide” means putting repeated measurements in seperate columns
“long” means putting repeated measurements in the same columns

Reshaping data

Let’s get “wide” data

wide <- reshape(Indometh, idvar = "Subject", direction = "wide")
wide

##    Subject conc.0.25 conc.0.5 conc.0.75 conc.1 conc.1.25 conc.2 conc.3
## 1        1      1.50     0.94      0.78   0.48      0.37   0.19   0.12
## 12       2      2.03     1.63      0.71   0.70      0.64   0.36   0.32
## 23       3      2.72     1.49      1.16   0.80      0.80   0.39   0.22
## 34       4      1.85     1.39      1.02   0.89      0.59   0.40   0.16
## 45       5      2.05     1.04      0.81   0.39      0.30   0.23   0.13
## 56       6      2.31     1.44      1.03   0.84      0.64   0.42   0.24
##    conc.4 conc.5 conc.6 conc.8
## 1    0.11   0.08   0.07   0.05
## 12   0.20   0.25   0.12   0.08
## 23   0.12   0.11   0.08   0.08
## 34   0.11   0.10   0.07   0.07
## 45   0.11   0.08   0.10   0.06
## 56   0.17   0.13   0.10   0.09

Reshaping data

Let’s go back to “long” data

reshape(wide, direction = "long")

##        Subject time conc.0.25
## 1.0.25       1 0.25      1.50
## 2.0.25       2 0.25      2.03
## 3.0.25       3 0.25      2.72
## 4.0.25       4 0.25      1.85
## 5.0.25       5 0.25      2.05
## 6.0.25       6 0.25      2.31
## 1.0.5        1 0.50      0.94
## 2.0.5        2 0.50      1.63
## 3.0.5        3 0.50      1.49
## 4.0.5        4 0.50      1.39
## 5.0.5        5 0.50      1.04
## 6.0.5        6 0.50      1.44
## 1.0.75       1 0.75      0.78
## 2.0.75       2 0.75      0.71
## 3.0.75       3 0.75      1.16
## 4.0.75       4 0.75      1.02
## 5.0.75       5 0.75      0.81
## 6.0.75       6 0.75      1.03
## 1.1          1 1.00      0.48
## 2.1          2 1.00      0.70
## 3.1          3 1.00      0.80
## 4.1          4 1.00      0.89
## 5.1          5 1.00      0.39
## 6.1          6 1.00      0.84
## 1.1.25       1 1.25      0.37
## 2.1.25       2 1.25      0.64
## 3.1.25       3 1.25      0.80
## 4.1.25       4 1.25      0.59
## 5.1.25       5 1.25      0.30
## 6.1.25       6 1.25      0.64
## 1.2          1 2.00      0.19
## 2.2          2 2.00      0.36
## 3.2          3 2.00      0.39
## 4.2          4 2.00      0.40
## 5.2          5 2.00      0.23
## 6.2          6 2.00      0.42
## 1.3          1 3.00      0.12
## 2.3          2 3.00      0.32
## 3.3          3 3.00      0.22
## 4.3          4 3.00      0.16
## 5.3          5 3.00      0.13
## 6.3          6 3.00      0.24
## 1.4          1 4.00      0.11
## 2.4          2 4.00      0.20
## 3.4          3 4.00      0.12
## 4.4          4 4.00      0.11
## 5.4          5 4.00      0.11
## 6.4          6 4.00      0.17
## 1.5          1 5.00      0.08
## 2.5          2 5.00      0.25
## 3.5          3 5.00      0.11
## 4.5          4 5.00      0.10
## 5.5          5 5.00      0.08
## 6.5          6 5.00      0.13
## 1.6          1 6.00      0.07
## 2.6          2 6.00      0.12
## 3.6          3 6.00      0.08
## 4.6          4 6.00      0.07
## 5.6          5 6.00      0.10
## 6.6          6 6.00      0.10
## 1.8          1 8.00      0.05
## 2.8          2 8.00      0.08
## 3.8          3 8.00      0.08
## 4.8          4 8.00      0.07
## 5.8          5 8.00      0.06
## 6.8          6 8.00      0.09

Reshaping data

reshape(data, idvar, direction)

idvar needs to identify multiple records from the same individual
direction matches to either “wide” or “long”

The Apply family

Apply function

Let’s find the mean of these variables

cars2 <- cars[,c("mpg", "disp", "hp", "wt", "qsec")]
head(cars2)

##                    mpg disp  hp    wt  qsec
## Mazda RX4         21.0  160 110 2.620 16.46
## Mazda RX4 Wag     21.0  160 110 2.875 17.02
## Datsun 710        22.8  108  93 2.320 18.61
## Hornet 4 Drive    21.4  258 110 3.215 19.44
## Hornet Sportabout 18.7  360 175 3.440 17.02
## Valiant           18.1  225 105 3.460 20.22

Apply function

mpg.mean <- mean(cars2[,1])
disp.mean <- mean(cars2[,2])
hp.mean <- mean(cars2[,3])
wt.mean <- mean(cars2[,4])
qsec.mean <- mean(cars2[,5])

c(mpg.mean, disp.mean, hp.mean, wt.mean, qsec.mean)

## [1]  20.09062 230.72188 146.68750   3.21725  17.84875

Apply function

We can do this with easy and simple way!

apply(cars2, 2, mean)

##       mpg      disp        hp        wt      qsec 
##  20.09062 230.72188 146.68750   3.21725  17.84875

Apply function

We can apply several functions

median, sum, min, max, or even user-defined function

apply(cars2, 2, max)

##     mpg    disp      hp      wt    qsec 
##  33.900 472.000 335.000   5.424  22.900

Apply function

apply(X, margin, function)

X : data
margin : 1 indicates rows, 2 indicates columns
function : the function to be applied

Version Control

Version control is a system that records changes to a file or set of files over time. This way you can recall specific versions later.

For the examples in this workshop, I used another consultant’s presentation.
I went ahead and made many changes to fit the parameters of this workshop.
All of this was created in R as an Rmarkdown file.
What if I want to go back to the original version?

Version Control

This is particularly important for
- Working on data, code, or projects in teams
- Building open source software
- Generating easily replicable research

Github

One easy way to utilize version control software is through Github.

If you don’t already have a github account, proceed to this link and make one:

https://github.com/

Github

Let’s look at an example here:

https://github.com/SDeMora/cnn_transcripts

This is a repo for some python code that scrapes CNN transcripts.
We can see the original repo that was created by “notnews” under “forked from”.
We can also see that it was me who forked the repo to make some of my own edits.

Github

How can I integrate R easily with Github?

First, see if you already have Git on your computer by going to the Terminal and typing “which git”. If it responds with a file path, you’re golden! No need to download it.
If you don’t have it, for MAC users type in:
- git - -version
- git config
- Accept the download.
- You may need to use this if it doesn’t work: xcode-select - -install

Github

For Windows:
- https://gitforwindows.org/

Github

Time to introduce ourselves!

Go to the shell (Terminal) and type in our information like this:

git config - -global user.name ‘Jane Doe’
git config - -global user.email ‘jane@example.com’
git config - -global - -list

Github

Let’s create our first repo!

Open Github.
Click the green “New” button to create a new repo.
Give it a name!
Choose “Initialize this repository with a README”
Click: Create repository.

Github

How can we link it with our machines and R?

Go to the repo.
Click “Clone or download”
Copy the URL.
Go to RStudio on your computer.

Github

Once you’re in RStudio…

Click “File” –> “New Project”
Choose “Version Control”
Choose “Git”
Paste the URL from github that you copied previously.
Choose a location for your project!

Github

Pull.
Edit.
Commit.
Push.
Forking.
Branching.