Introduction to DS - A First Look at R, Part 2

Posted on 2023-09-24

Data scientists spend up to 80% of their time on data cleaning and 20% on actual data analysis.

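Unlike the previous part, this part focuses on data frames rather than vectors. We introduce the concept of tidy format and the tidyverse, an open-source collection of packages for manipulating tidy data effectively.

Before starting, run install.packages("tidyverse") in the RStudio REPL to install the tidyverse package.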

This article is a self-administered course note.

It will NOT cover any exam or assignment related content.

Tidy Data

A tidy dataset provides a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).

How do we organize the structure of data according to its meaning (semantics)? Generally speaking, datasets are a collection of values, either quantitative or qualitative. These values are organized in 2 ways:

Variables - all values that measure the same underlying attribute across units.
Observations - all values measured on the same unit across attributes.

We say a dataset is tidy if:

each row represents one observation.
columns represent the different variables available for each of these observations.
each cell is a single value.

Consider this example: the data below shows the fertility rates of Germany and South Korea for 1960-1962.

#>       country 1960 1961 1962
#> 1     Germany 2.41 2.44 2.47
#> 2 South Korea 6.16 5.99 5.79

This dataset is clearly not tidy (in other words, it is messy), because:

Each row includes several observations.
One of the variables, year, is stored in the header.

#>       country year fertility
#> 1     Germany 1960      2.41
#> 2 South Korea 1960      6.16
#> 3     Germany 1961      2.44
#> 4 South Korea 1961      5.99
#> 5     Germany 1962      2.47
#> 6 South Korea 1962      5.79

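Reorganizing the data makes it tidy, as in the table above: each observation occupies one row, and year and fertility are extracted as variables and stored in columns. From here, we can use the functions provided by the tidyverse to manipulate tidy data.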
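
The reshaping step itself can be sketched in code. This is only a minimal sketch and not part of the course notes: it assumes the tidyr function pivot_longer() (installed with the tidyverse), and the object names wide and tidy are made up for illustration.

```r
library(tidyr)
library(dplyr)

# The messy (wide) table; check.names = FALSE keeps "1960" etc. as column names.
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE
)

# Move the year headers into a `year` column and the cell values into `fertility`,
# then sort by year and country to match the layout of the tidy table.
tidy <- wide |>
  pivot_longer(cols = -country, names_to = "year", values_to = "fertility") |>
  mutate(year = as.integer(year)) |>
  arrange(year, country)
tidy
```

Each row of tidy now holds one observation (one country in one year), which is the shape the dplyr functions below expect.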

Manipulating Data Frames

The functions in this subsection come from the dplyr package in the tidyverse.

dplyr functions are aware of variable names (no need to write murders$total).
Most dplyr functions take a data frame as their first argument.

Adding a column

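Use the mutate() function.

library(tidyverse)  # loads dplyr, which provides mutate()
library(dslabs)     # provides the murders dataset
data("murders")
# Murder rate per 100,000 people.
murders <- mutate(murders, rate = total / population * 100000)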

This adds a new variable (column), rate, to the murders data frame.

Subsetting

Use filter() to subset a data frame. (filter works on rows.)

filter(murders, rate <= 0.71)
#>           state abb        region population total  rate
#> 1        Hawaii  HI          West    1360301     7 0.515
#> 2          Iowa  IA North Central    3046355    21 0.689
#> 3 New Hampshire  NH     Northeast    1316470     5 0.380
#> 4  North Dakota  ND North Central     672591     4 0.595
#> 5       Vermont  VT     Northeast     625741     2 0.320

Selecting columns

select() picks the specified variables from a data frame and returns them as a new data frame. (select works on columns.)

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
#>           state        region  rate
#> 1        Hawaii          West 0.515
#> 2          Iowa North Central 0.689
#> 3 New Hampshire     Northeast 0.380
#> 4  North Dakota North Central 0.595
#> 5       Vermont     Northeast 0.320

The Pipe: `|>` or `%>%`

It works like the pipe | in a shell. In R, the pipe |> or %>% sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe.

Note that every dplyr function we have seen so far (mutate, filter, and select) takes the data frame as its first argument, which is exactly what makes them work well with the pipe. Returning to the earlier program:

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
#>           state        region  rate
#> 1        Hawaii          West 0.515
#> 2          Iowa North Central 0.689
#> 3 New Hampshire     Northeast 0.380
#> 4  North Dakota North Central 0.595
#> 5       Vermont     Northeast 0.320

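The data flows as $\rm{original \ data\to select\to newtable}$, and then $\rm{newtable\to filter\to result}$. With the pipe, we can feed the output of select directly into the input of filter, which both removes the intermediate variable and makes the code more readable.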

murders |> select(state, region, rate) |> filter(rate <= 0.71)
#>           state        region  rate
#> 1        Hawaii          West 0.515
#> 2          Iowa North Central 0.689
#> 3 New Hampshire     Northeast 0.380
#> 4  North Dakota North Central 0.595
#> 5       Vermont     Northeast 0.320

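To feed the output of one function into the second (or any other non-first) argument of another, use the placeholder _. The native pipe supports _ from R 4.2 onward, and it must be passed to a named argument. Shell pipelines have a comparable idiom.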

log(8, base = 2)
#> [1] 3
2 |> log(8, base = _)
#> [1] 3
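
For comparison, the magrittr pipe %>% uses a different placeholder. The sketch below assumes the magrittr package is installed (it comes with the tidyverse); the variable names are made up for illustration.

```r
library(magrittr)

# Native pipe: the placeholder `_` must be passed to a named argument.
native <- 2 |> log(8, base = _)

# magrittr pipe: the placeholder is `.`; because `.` appears directly as an
# argument, the left-hand side is not also inserted as the first argument.
with_magrittr <- 2 %>% log(8, base = .)

c(native, with_magrittr)
```

Both calls evaluate log(8, base = 2).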

Summarizing Data

An important part of exploratory data analysis is summarizing data.

The simplest examples of a data summary are the average and the standard deviation.

`summarize`

Given a data frame, summarize() returns a new summary table. That table has:

one row for each group [stay tuned].
one column for each of the summary statistics that you have specified.

data(heights)
s <- heights |>
    filter(sex == "Female") |>
    summarize(average = mean(height), standard_deviation = sd(height))
s
#>   average standard_deviation
#> 1    64.9               3.76

Since no grouping variable is specified here, summarize returns a single row, with one column per requested summary statistic (the mean and the standard deviation).

summarize shows its real power when combined with group_by.

Group then Summarize

Grouped summaries: a common operation in data exploration is to first split data into groups and then compute summaries for each group.

group_by splits a data frame into groups according to a variable and returns a grouped data frame. Each distinct value of that variable defines one group. (For example, sex defines two groups, Male and Female.)

This special grouped data frame is a kind of tibble [stay tuned].

heights |> group_by(sex)
#> # A tibble: 1,050 × 2
#> # Groups:   sex [2]
#>   sex   height
#>   <fct>  <dbl>
#> 1 Male      75
#> 2 Male      70
#> 3 Male      68
#> 4 Male      74
#> 5 Male      61
#> # ℹ 1,045 more rows

summarize then computes the summaries group by group; each group occupies one row of the result.

heights |> 
  group_by(sex) |>
  summarize(average = mean(height), standard_deviation = sd(height))
#> # A tibble: 2 × 3
#>   sex    average standard_deviation
#>   <fct>    <dbl>              <dbl>
#> 1 Female    64.9               3.76
#> 2 Male      69.3               3.61
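
To see what the grouped summary computes, here is a minimal sketch on a tiny made-up data frame (the names toy and by_group and the values are illustrative only), checked against the same averages computed by hand.

```r
library(dplyr)

# A tiny made-up data frame.
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# One row per group: the group size and the group mean.
by_group <- toy |>
  group_by(group) |>
  summarize(n = n(), average = mean(value))
by_group

# The same averages computed by hand, one group at a time.
mean(toy$value[toy$group == "a"])  # 2
mean(toy$value[toy$group == "b"])  # 4
```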

`pull`

pull() extracts a single column from a data frame. Functionally it does the same job as $; "it's mostly useful because it looks a little nicer in pipe expression."

pull(murders, total)
#>  [1]  135   19  232   93 1257   65   97   38   99  669  376    7   12  364  142   21
#> [17]   63  116  351   11  293  118  413   53  120  321   12   32   84    5  246   67
#> [33]  517  286    4  310  111   36  457   16  207    8  219  805   22    2  250   93
#> [49]   27   97    5
murders$total
#>  [1]  135   19  232   93 1257   65   97   38   99  669  376    7   12  364  142   21
#> [17]   63  116  351   11  293  118  413   53  120  321   12   32   84    5  246   67
#> [33]  517  286    4  310  111   36  457   16  207    8  219  805   22    2  250   93
#> [49]   27   97    5

Another example:

us_murder_rate <- murders |>
    summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate
#>   rate
#> 1 3.03
class(us_murder_rate)
#> [1] "data.frame"

us_murder_rate is just a single value, yet it is stored in a data frame, which is awkward. So we add pull to the pipeline (us_murder_rate |> pull(rate) is equivalent to us_murder_rate$rate):

us_murder_rate <- murders |>
    summarize(rate = sum(total) / sum(population) * 100000) |>
    pull(rate)
class(us_murder_rate)
#> [1] "numeric"

Sorting Data Frames

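Earlier we introduced sorting functions such as sort and order, but those work on vectors. Data frames have their own sorting function (also provided by dplyr).

`arrange`

Given a data frame, arrange sorts its rows by a specified variable (column).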

murders |>
  arrange(population) |>
  head()
#>                  state abb        region population total   rate
#> 1              Wyoming  WY          West     563626     5  0.887
#> 2 District of Columbia  DC         South     601723    99 16.453
#> 3              Vermont  VT     Northeast     625741     2  0.320
#> 4         North Dakota  ND North Central     672591     4  0.595
#> 5               Alaska  AK          West     710231    19  2.675
#> 6         South Dakota  SD North Central     814180     8  0.983

arrange sorts in ascending order by default; to sort in descending order, use desc().

murders |>
    arrange(desc(population)) |>
    head()
#>          state abb        region population total
#> 1   California  CA          West   37253956  1257
#> 2        Texas  TX         South   25145561   805
#> 3      Florida  FL         South   19687653   669
#> 4     New York  NY     Northeast   19378102   517
#> 5     Illinois  IL North Central   12830632   364
#> 6 Pennsylvania  PA     Northeast   12702379   457

If the sort variable is a number (numeric or integer), arrange(-population) has the same effect.
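
A small sketch of the two spellings on a made-up data frame (toy and its values are illustrative only). It also shows that desc() works on a character column, where unary minus is not available.

```r
library(dplyr)

toy <- data.frame(name = c("b", "a", "c"), size = c(20, 30, 10))

# Two ways to sort a numeric column in descending order.
by_desc  <- arrange(toy, desc(size))
by_minus <- arrange(toy, -size)
by_desc
by_minus

# desc() also handles non-numeric columns.
by_name <- arrange(toy, desc(name))
by_name
```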

Nested sorting

If we are ordering by a column with ties, we can use a second (or more) column to break the tie.

murders |> 
  arrange(region, rate) |> 
  head()
#>           state abb    region population total  rate
#> 1       Vermont  VT Northeast     625741     2 0.320
#> 2 New Hampshire  NH Northeast    1316470     5 0.380
#> 3         Maine  ME Northeast    1328361    11 0.828
#> 4  Rhode Island  RI Northeast    1052567    16 1.520
#> 5 Massachusetts  MA Northeast    6547629   118 1.802
#> 6      New York  NY Northeast   19378102   517 2.668

In other words, arrange accepts several sort variables, and the $n$-th variable acts as the $n$-th sort key.

The top $n$

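top_n() is defined a little oddly, and its name is easy to misread. Its manual page also states that the function has been superseded and recommends slice_min() and slice_max() instead.

Since the slides mention it, here is a brief explanation: top_n(x, n, wt) uses the wt variable of data frame x as the ordering variable and selects the top n rows. But it does not sort those rows! The n rows it returns keep their relative order from x.

Let's look at an example: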

df <- data.frame(x = c(8, 9, 10))
top_n(df, 1)
#> Selecting by x
#>    x
#> 1 10
top_n(df, 2)
#> Selecting by x
#>    x
#> 1  9
#> 2 10
top_n(df, 3)
#> Selecting by x
#>    x
#> 1  8
#> 2  9
#> 3 10
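
For comparison, here is a minimal sketch of the suggested replacement slice_max() on the same data. It assumes a dplyr version that provides slice_max() (1.0.0 or later); the names from_top_n and from_slice are made up for illustration.

```r
library(dplyr)

df <- data.frame(x = c(8, 9, 10))

# top_n keeps the rows with the 2 largest x, in their original order.
from_top_n <- top_n(df, 2, x)
from_top_n$x   # 9 10

# slice_max selects the same rows but returns them sorted, largest first.
from_slice <- slice_max(df, x, n = 2)
from_slice$x   # 10 9
```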

Tibbles

(Requires the tidyverse.) A tibble (tbl) is a special kind of data frame. We have already met one: group_by() returns a grouped data frame, which is a tibble.

murders |> group_by(region)
#> # A tibble: 51 × 6
#> # Groups:   region [4]
#>   state      abb   region population total  rate
#>   <chr>      <chr> <fct>       <dbl> <dbl> <dbl>
#> 1 Alabama    AL    South     4779736   135  2.82
#> 2 Alaska     AK    West       710231    19  2.68
#> 3 Arizona    AZ    West      6392017   232  3.63
#> 4 Arkansas   AR    South     2915918    93  3.19
#> 5 California CA    West     37253956  1257  3.37
#> # ℹ 46 more rows
murders |> group_by(region) |> class()
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

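As shown, class() returns several unfamiliar names. tbl stands for tibble: group_by() always returns this kind of data frame, and summarize() returns one when its input is grouped. The tbl returned by group_by() is additionally a grouped_df, which also stores the grouping information.

Besides the fact that tibbles can be grouped, tibbles differ from ordinary data frames in several other ways.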

Tibbles display better

The print method for tibbles is more readable than that of a data frame. (Try it in RStudio.)

Use tibble() to create a new tibble (same format as data.frame()).
Use as_tibble() to convert a data frame to a tibble.

murders
#> ...
as_tibble(murders)
#> ...

Subsets of tibbles are tibbles

Subsetting a data frame does not necessarily give a data frame; the result may be a vector or a scalar. Subsetting a tibble, however, still gives a tibble.

class(murders[, 4])  # pull the 4th column
#> [1] "numeric"
class(as_tibble(murders)[, 4])
#> [1] "tbl_df"     "tbl"        "data.frame"
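
The sketch below, on a made-up data frame df, shows how to control this behavior on both sides: drop = FALSE keeps a data frame from being simplified, and [[ ]] extracts the underlying vector from a tibble.

```r
library(tibble)

df <- data.frame(a = 1:3, b = c("x", "y", "z"))

class(df[, 1])                # "integer": a single column is simplified to a vector
class(df[, 1, drop = FALSE])  # "data.frame": drop = FALSE prevents the simplification
class(as_tibble(df)[, 1])     # "tbl_df" "tbl" "data.frame": a tibble stays a tibble
class(as_tibble(df)[[1]])     # "integer": [[ ]] extracts the vector
```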

Tibbles give better error msg

Accessing a column that does not exist in a data frame with $ returns NULL without any warning, which is very error-prone. A tibble issues a warning instead.

murders$Population
#> [1] NULL
as_tibble(murders)$Population
#> Warning: Unknown or uninitialised column: `Population`.
#> NULL

Tibbles can have complex entries

While data frame columns need to be vectors of numbers, strings, or logical values, tibbles can have more complex objects, such as lists or functions.

tibble(id = c(1, 2, 3), func = c(mean, median, sd))
#> # A tibble: 3 × 2
#>      id func  
#>   <dbl> <list>
#> 1     1 <fn>  
#> 2     2 <fn>  
#> 3     3 <fn>
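
Because func is a list column, each entry is an ordinary function that can be called. A minimal sketch, where fns, x, and results are names made up for illustration:

```r
library(tibble)

fns <- tibble(name = c("mean", "median", "sd"), func = c(mean, median, sd))
x <- c(1, 2, 3, 10)

# Apply every stored function to the same vector.
results <- sapply(fns$func, function(f) f(x))
results
```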

Tidyverse Conditionals

We have already introduced the ifelse() conditional; the tidyverse provides other conditionals as well.

`case_when`

Its semantics resemble [] in Standard ML; in essence it is a chain of nested if / else if / ... / else if branches.

x <- c(-2, -1, 0, 1, 2)
case_when(x < 0 ~ "Negative",
          x > 0 ~ "Positive",
          TRUE ~ "Zero")
#> [1] "Negative" "Negative" "Zero"     "Positive" "Positive"

A common use for this function is to define categorical variables based on existing variables.

murders |>
    mutate(group = case_when(
        abb %in% c("ME", "NH", "VT", "MA", "RI", "CT") ~ "New England",
        abb %in% c("WA", "OR", "CA") ~ "West Coast", 
        region == "South" ~ "South", 
        TRUE ~ "Other")) |>
    group_by(group) |>
    summarize(rate = sum(total) / sum(population) * 10^5)
#> # A tibble: 4 × 2
#>   group        rate
#>   <chr>       <dbl>
#> 1 New England  1.72
#> 2 Other        2.71
#> 3 South        3.63
#> 4 West Coast   2.90

`between`

Use between(x, a, b) to check whether a value x lies in the interval [a, b]. The following two expressions are equivalent:

x >= a & x <= b
between(x, a, b)
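
A quick check of that equivalence on made-up values (x, a, and b below are illustrative only); both endpoints are included.

```r
library(dplyr)

x <- c(1, 5, 10, 11)
a <- 5
b <- 10

manual       <- x >= a & x <= b
with_between <- between(x, a, b)

manual        # FALSE TRUE TRUE FALSE
with_between  # FALSE TRUE TRUE FALSE
```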

`data.table`

Besides the tibble provided by the tidyverse, the data.table provided by the data.table package is another alternative to traditional data frames.

Before using data.table objects, load the package: library(data.table). Use setDT() to convert a data frame into a data.table object: murders <- setDT(murders).

Selecting

The following way of selecting works not only for data.table objects but also for ordinary data frames.

murders[, c("state", "region")] |> head()
#>         state region
#> 1:    Alabama  South
#> 2:     Alaska   West
#> 3:    Arizona   West
#> 4:   Arkansas  South
#> 5: California   West

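The special .() notation, however, is only available once the data.table package is loaded. Inside .(), R treats names as column names rather than as other objects in the R environment.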

murders[, .(state, region)] |> head()
#>         state region
#> 1:    Alabama  South
#> 2:     Alaska   West
#> 3:    Arizona   West
#> 4:   Arkansas  South
#> 5: California   West

Manipulating columns

Recall that in dplyr we use mutate to add a new column to a data frame. In data.table we use the := function instead.

s <- murders[, rate := total / population * 100000]
head(s)
#>         state abb region population total rate
#> 1:    Alabama  AL  South    4779736   135 2.82
#> 2:     Alaska  AK   West     710231    19 2.68
#> 3:    Arizona  AZ   West    6392017   232 3.63
#> 4:   Arkansas  AR  South    2915918    93 3.19
#> 5: California  CA   West   37253956  1257 3.37

To define several columns at once, pass several arguments to the := function. (Note the quotes; the syntax is rather odd.)

s <- murders[, ":="(rate = total / population * 100000, rank = rank(population))]
head(s)
#>         state abb region population total     rate rank
#> 1:    Alabama  AL  South    4779736   135 2.824424   29
#> 2:     Alaska  AK   West     710231    19 2.675186    5
#> 3:    Arizona  AZ   West    6392017   232 3.629527   36
#> 4:   Arkansas  AR  South    2915918    93 3.189390   20
#> 5: California  CA   West   37253956  1257 3.374138   51

Also, if the column name given to := already exists in the data.table, then := changes that column instead of adding a new one.

Reference versus Copy

One design goal of data.table is to save memory. So, as in many programming languages, be careful about the difference between a reference and a copy when using data.table.

x <- data.table(a = 1)
y <- x

In the example above, y is merely a reference to x (an alias).

x <- data.table(a = 1)
y <- copy(x)

Use the copy function to create an independent copy.
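
The difference becomes visible once the original is modified with :=. A minimal sketch (the name z and the value 100 are made up for illustration):

```r
library(data.table)

x <- data.table(a = 1)
y <- x         # y refers to the same object as x
z <- copy(x)   # z is an independent copy

x[, a := 100]  # := modifies x in place

x$a  # 100
y$a  # 100: the alias sees the change
z$a  # 1: the copy is untouched
```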

Subsetting

Subsetting a data.table object:

murders[rate <= 0.7]
#>            state abb        region population total  rate rank
#> 1:        Hawaii  HI          West    1360301     7 0.515   12
#> 2:          Iowa  IA North Central    3046355    21 0.689   22
#> 3: New Hampshire  NH     Northeast    1316470     5 0.380   10
#> 4:  North Dakota  ND North Central     672591     4 0.595    4
#> 5:       Vermont  VT     Northeast     625741     2 0.320    3

This is equivalent to using filter from dplyr.

filter(murders, rate <= 0.7)

With the features data.table provides, we can compress filter and select into one succinct command:

murders[rate <= 0.7, .(state, rate)]
#>            state  rate
#> 1:        Hawaii 0.515
#> 2:          Iowa 0.689
#> 3: New Hampshire 0.380
#> 4:  North Dakota 0.595
#> 5:       Vermont 0.320

This command is equivalent to:

murders |> filter(rate <= 0.7) |> select(state, rate)

Importing Data

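In practice, most of the data we work with is imported from external sources; the datasets bundled with packages (such as murders and heights) mostly serve a demonstrative purpose. So it is important to know how to import data.

Dealing with paths

Use system.file to get the path of the directory where an installed package lives. (Rarely needed in practice.)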

system.file(package="dslabs")
#> [1] "C:/Users/XXZ/AppData/Local/R/win-library/4.3/dslabs"
system.file("extdata", package="dslabs")
#> [1] "C:/Users/XXZ/AppData/Local/R/win-library/4.3/dslabs/extdata"
system.file("confidential", package="dslabs")
#> [1] ""

Once you have located the directory that contains a file, use file.path to build the file's path.

dir <- system.file(package="dslabs")
file.path(dir, "extdata")
#> [1] "C:/Users/XXZ/AppData/Local/R/win-library/4.3/dslabs/extdata"

Or use the blunter paste approach, which may feel more natural to programmers:

paste(dir, "extdata", sep='/')
#> [1] "C:/Users/XXZ/AppData/Local/R/win-library/4.3/dslabs/extdata"
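
A small sketch comparing the two approaches on a temporary directory (tempdir() and the file name are stand-ins for illustration; no file is created). file.path joins its arguments with "/" on every platform, so the two results agree.

```r
base_dir <- tempdir()

p1 <- file.path(base_dir, "extdata", "murders.csv")
p2 <- paste(base_dir, "extdata", "murders.csv", sep = "/")

identical(p1, p2)
basename(p1)  # "murders.csv"
```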

Showing files

Use list.files(dir) to list all files in the directory given by dir.

dir <- system.file(package="dslabs")
list.files(dir)
#> [1] "data"        "DESCRIPTION" "extdata"     "help"       
#> [5] "html"        "INDEX"       "MD5"         "Meta"       
#> [9] "NAMESPACE"   "R"           "script"

Use getwd to get the path of the working directory. Combined with list.files, it works like ls:

wd <- getwd()
list.files(wd)
#> [1] "Anno 1800"              "ardc00.ini"            
#> [3] "Arma 3"                 "Assassin's Creed Unity"
#> [5] "Dell"                   "desktop.ini" 
#> ...

Copying files with paths

Use file.copy to copy a file from a given path into the working directory.

fullpath <- file.path(system.file("extdata", package="dslabs"), "murders.csv")
file.copy(fullpath, "local_murders.csv")
list.files(wd)
#> ... local_murders.csv ...
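
file.copy reports success as a logical value and, by default, refuses to overwrite an existing file. A minimal sketch using temporary files (src, dest, and the file contents are made up for illustration):

```r
# Create a small source file to copy.
src <- tempfile(fileext = ".csv")
writeLines(c("state,total", "Vermont,2"), src)

# A destination path that does not exist yet.
dest <- tempfile(fileext = ".csv")

first  <- file.copy(src, dest)                    # TRUE: the copy succeeds
second <- file.copy(src, dest)                    # FALSE: dest already exists
third  <- file.copy(src, dest, overwrite = TRUE)  # TRUE: overwriting is allowed

file.exists(dest)
```

Checking the return value is a cheap way to notice a copy that silently did nothing.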

Reading files

The readr and readxl packages provide functions for reading different types of datasets. "Different types" refers to:

format: spaces, commas, semicolons, tabs...-separated values.
suffix: txt, csv, tsv, xls, xlsx...

Call read_csv from the readr package to read a csv dataset in the working directory directly. (read.csv, with a dot, is the base R counterpart and does not need readr.)

library(readr)
dat <- read_csv("local_murders.csv")

readr functions can also read a remote resource given its url.

url <- "https://raw.githubusercontent.com/.../extdata/murders.csv"
dat <- read_csv(url)

We can also use download.file to download the file first and then read it locally.

download.file(url, "murders.csv")
dat <- read_csv("murders.csv")
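
The names read.csv and read_csv are easy to mix up: the former belongs to base R, the latter to readr. The sketch below reads the same small temporary file with both (the file contents are made up; show_col_types = FALSE assumes readr 2.0 or later and only silences the column-type message).

```r
library(readr)

# Write a tiny csv file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("state,total", "Vermont,2", "Hawaii,7"), path)

base_dat  <- read.csv(path)                          # base R: returns a data.frame
readr_dat <- read_csv(path, show_col_types = FALSE)  # readr: returns a tibble

class(base_dat)
class(readr_dat)
```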

Reference

References in the article are from corresponding course materials if not specified.

Course info:

Code: COMP2501, Lecturer: Dr. H.F. Ting.

Course textbook:

Data Analysis and Prediction Algorithms with R - Rafael A. Irizarry.
