Introduction to DS - Data Wrangling

Posted on 2023-10-05

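I first came across the term data wrangling in MIT's Missing Semester course; at the time I did not really understand it. Now I do: data needs to be "wrangled" because there is a definition of what "tidy" data looks like.

The tidy format was covered in detail in P3: once data is tidy, a wide range of powerful tools (such as the tidyverse) can manipulate it. This section covers how to turn messy data into tidy data.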

This article is a self-administered course note.

It will NOT cover any exam or assignment related content.

Reshaping Data

Before we start, we introduce a dataset, fertility-two-countries-example, which is not tidy. There is another term for untidy datasets of this kind: wide data.

library(tidyverse)

wide_data <- read_csv("fertility-two-countries-example.csv")
head(wide_data)
#> # A tibble: 2 × 57
#>  country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` ...
#>  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> ...
#> 1 Germany   2.41   2.44   2.47   2.49   2.49   2.48   2.44   2.37   2.28 ...
#> 2 South …   6.16   5.99   5.79   5.57   5.36   5.16   4.99   4.85   4.73 ...

`pivot_longer`

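Select several columns as pivots and unfold them (that is, the variables stored in the original column names are "demoted" to values). Because this operation makes the table longer, it is called pivot_longer.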

# wide_data is already supplied by the pipe, so it is not repeated as an argument
tidy_data <- wide_data |>
    pivot_longer("1960":"2015", names_to = "year", values_to = "fertility")
head(tidy_data)
#> # A tibble: 6 × 3
#>   country year  fertility
#>   <chr>   <chr>     <dbl>
#> 1 Germany 1960       2.41
#> 2 Germany 1961       2.44
#> 3 Germany 1962       2.47
#> 4 Germany 1963       2.49
#> 5 Germany 1964       2.49
#> 6 Germany 1965       2.48

The arguments are, in order:

target dataset to be reshaped.
pivot columns.
names_to: new column name containing the current column names.
values_to: new column name containing the current observation values.
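
To make the arguments concrete, here is a minimal sketch on a made-up table (`toy_wide` and its values are hypothetical, not the course dataset). It also shows one detail visible in the output above: the new year column comes out as character, because it was built from column names, so it has to be converted explicitly if numbers are wanted.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data in the same wide shape as the fertility example.
toy_wide <- tibble(
  country = c("A", "B"),
  `2000`  = c(1.5, 3.0),
  `2001`  = c(1.4, 2.9)
)

# Every column except country becomes a (year, fertility) pair.
toy_long <- toy_wide |>
  pivot_longer(-country, names_to = "year", values_to = "fertility")

# Column names always arrive as character; convert them explicitly.
toy_long$year <- as.integer(toy_long$year)
toy_long
```

The result has one row per country and year (four rows here) instead of one row per country.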

`pivot_wider`

Essentially the inverse of pivot_longer.

new_wide_data <- tidy_data |>
    pivot_wider(names_from = year, values_from = fertility)
select(new_wide_data, country, "1960":"1967")
#> # A tibble: 2 × 9
#>   country     `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
#>   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Germany       2.41   2.44   2.47   2.49   2.49   2.48   2.44   2.37
#> 2 South Korea   6.16   5.99   5.79   5.57   5.36   5.16   4.99   4.85

`separate`

Now consider another untidy dataset:

raw_dat <- read_csv(filename)
select(raw_dat, 1:5)
#> # A tibble: 2 × 5
#>   country     `1960_fertility` `1960_life_expectancy` `1961_fertility`
#>   <chr>                  <dbl>                  <dbl>            <dbl>
#> 1 Germany                 2.41                   69.3             2.44
#> 2 South Korea             6.16                   53.0             5.99
#> # ℹ 1 more variable: `1961_life_expectancy` <dbl>

This dataset is even messier than fertility-two-countries-example:

wide data.
encoding extra information (year) in the column names.

First, we tidy it once with pivot_longer.

dat <- raw_dat |> pivot_longer(-country)
head(dat)
#> # A tibble: 6 × 3
#>   country name                 value
#>   <chr>   <chr>                <dbl>
#> 1 Germany 1960_fertility        2.41
#> 2 Germany 1960_life_expectancy 69.3 
#> 3 Germany 1961_fertility        2.44
#> 4 Germany 1961_life_expectancy 69.8 
#> 5 Germany 1962_fertility        2.47
#> 6 Germany 1962_life_expectancy 70.0

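Note the special notation here: pivot_longer(-country) treats every column except country as a pivot.

Next we use the separate function for the second round of tidying: separate splits the specified column into several parts. By default it splits on any run of non-alphanumeric characters, which includes the underscore _ used here.

dat |> separate(name, c("year", "name"), extra = "merge")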
#> # A tibble: 224 × 4
#>   country year  name            value
#>   <chr>   <chr> <chr>           <dbl>
#> 1 Germany 1960  fertility        2.41
#> 2 Germany 1960  life_expectancy 69.3 
#> 3 Germany 1961  fertility        2.44
#> 4 Germany 1961  life_expectancy 69.8 
#> 5 Germany 1962  fertility        2.47
#> # ℹ 219 more rows

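A word on what extra = "merge" does. Take a value such as 1962_life_expectancy:

Without it, separate splits the value at every separator into three parts: 1962, life and expectancy.
But only two new variable names are supplied: c("year", "name").
So the extra part, expectancy, is discarded (with a warning).

This is what extra = "merge" avoids: it splits at most as many times as there are new names, so the remainder life_expectancy stays in one piece.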
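
The difference is easy to see on a two-value toy column (`toy` is a hypothetical stand-in for the name column of dat). A minimal sketch:

```r
library(tidyr)
library(tibble)

toy <- tibble(name = c("1960_fertility", "1960_life_expectancy"))

# Default behaviour: every "_" splits, so the third piece "expectancy"
# is dropped. tidyr warns about this; the warning is silenced here.
dropped <- suppressWarnings(separate(toy, name, c("year", "name")))
dropped$name
#> [1] "fertility" "life"

# extra = "merge": split only once, keep the remainder intact.
merged <- separate(toy, name, c("year", "name"), extra = "merge")
merged$name
#> [1] "fertility"       "life_expectancy"
```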

Now for the final round of tidying: create a separate column for each of the split-out variables fertility and life_expectancy (that is, "promote" the values fertility and life_expectancy to variables).

dat |>
    separate(name, c("year", "name"), extra = "merge") |>
    pivot_wider()
#> # A tibble: 112 × 4
#>   country year  fertility life_expectancy
#>   <chr>   <chr>     <dbl>           <dbl>
#> 1 Germany 1960       2.41            69.3
#> 2 Germany 1961       2.44            69.8
#> 3 Germany 1962       2.47            70.0
#> 4 Germany 1963       2.49            70.1
#> 5 Germany 1964       2.49            70.7
#> # ℹ 107 more rows

Finished! Each observation now corresponds to exactly one row.
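
Two notes on the pipeline above. First, pivot_wider() works without arguments because its defaults are names_from = name and values_from = value, which are exactly the column names produced earlier. Second, as an alternative that is not part of the course material, tidyr can do all three steps in a single pivot_longer call by using the special ".value" name together with names_pattern. The sketch below uses a hypothetical toy table (`toy_raw`) with the same column-name scheme as raw_dat.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data with the same column-name scheme as raw_dat.
toy_raw <- tibble(
  country                = c("A", "B"),
  `1960_fertility`       = c(2.4, 6.2),
  `1960_life_expectancy` = c(69.3, 53.0),
  `1961_fertility`       = c(2.5, 6.0),
  `1961_life_expectancy` = c(69.8, 53.7)
)

# The first regex group becomes the year column; the second group,
# mapped to ".value", becomes the names of the output value columns.
toy_tidy <- toy_raw |>
  pivot_longer(
    -country,
    names_to      = c("year", ".value"),
    names_pattern = "^(\\d{4})_(.*)$"
  )
toy_tidy
```

The result has the columns country, year, fertility and life_expectancy, with one row per country and year.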

`unite`

The inverse of separate. The default joining separator is also the underscore _.

dat |>
    separate(name, c("year", "first", "second"), fill = "right") |>
    unite(name, first, second) |>
    pivot_wider() |>
    rename(fertility = fertility_NA)
#> # A tibble: 112 × 4
#>   country year  fertility life_expectancy
#>   <chr>   <chr>     <dbl>           <dbl>
#> 1 Germany 1960       2.41            69.3
#> 2 Germany 1961       2.44            69.8
#> 3 Germany 1962       2.47            70.0
#> 4 Germany 1963       2.49            70.1
#> 5 Germany 1964       2.49            70.7
#> # ℹ 107 more rows

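As noted above, when there are more pieces than new variable names, the extra pieces are discarded. When there are fewer pieces than new variable names, the missing pieces are filled with NA, either by default (with a warning) or silently with fill = "right".

Taking advantage of this, unite combined with rename can stand in for extra = "merge" in separate.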
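
The reason the rename step is needed becomes clear on a toy column (`toy` is hypothetical): unite pastes a missing piece in as the literal text "NA". The sketch also shows the na.rm argument of unite, which is not used in the course code but avoids the rename altogether.

```r
library(tidyr)
library(tibble)

toy <- tibble(name = c("1960_fertility", "1960_life_expectancy"))

# "1960_fertility" only has two pieces, so `second` is filled with NA.
pieces <- separate(toy, name, c("year", "first", "second"), fill = "right")
pieces$second
#> [1] NA           "expectancy"

# By default unite() pastes the NA in as text, hence "fertility_NA".
glued <- unite(pieces, name, first, second)
glued$name
#> [1] "fertility_NA"    "life_expectancy"

# With na.rm = TRUE the missing piece is dropped before pasting.
clean <- unite(pieces, name, first, second, na.rm = TRUE)
clean$name
#> [1] "fertility"       "life_expectancy"
```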

Joining Tables

Here is a brief look at how to merge two tables. Consider the following example:

tab_1
#>        state population
#> 1    Alabama    4779736
#> 2     Alaska     710231
#> 3    Arizona    6392017
#> 4   Arkansas    2915918
#> 5 California   37253956
#> 6   Colorado    5029196
tab_2
#>         state ev
#> 1  California 55
#> 2     Arizona 11
#> 3     Alabama  9
#> 4 Connecticut  7
#> 5      Alaska  3
#> 6    Delaware  3

The two tables share the variable state, but its values do not correspond one to one. So how do they behave when merged on state (by = "state")?

`left_join`

left_join(tab_1, tab_2, by = "state")
#>        state population ev
#> 1    Alabama    4779736  9
#> 2     Alaska     710231  3
#> 3    Arizona    6392017 11
#> 4   Arkansas    2915918 NA
#> 5 California   37253956 55
#> 6   Colorado    5029196 NA

left_join keeps the state values of the left table: rows that appear only in the right table are dropped, and missing values are filled with NA.

`right_join`

right_join(tab_1, tab_2, by = "state")
#>         state population ev
#> 1     Alabama    4779736  9
#> 2      Alaska     710231  3
#> 3     Arizona    6392017 11
#> 4  California   37253956 55
#> 5 Connecticut         NA  7
#> 6    Delaware         NA  3

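right_join keeps the state values of the right table: rows that appear only in the left table are dropped, and missing values are filled with NA.

There are also:

inner_join: the intersection.
full_join: the union.
semi_join: not a merge. It keeps only the rows of the left table that have a match in the right table, without adding any columns.
anti_join: the opposite of semi_join. It drops the rows of the left table that have a match in the right table, again without adding any columns.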
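
The four remaining joins can be compared side by side on small excerpts of the two tables. The data frames `left` and `right` below are hypothetical three-row cut-downs made for this sketch.

```r
library(dplyr)

# Hypothetical three-row excerpts of the two tables above.
left  <- data.frame(state      = c("Alabama", "Arkansas", "California"),
                    population = c(4779736, 2915918, 37253956))
right <- data.frame(state = c("California", "Alabama", "Connecticut"),
                    ev    = c(55, 9, 7))

both      <- inner_join(left, right, by = "state")  # states found in both tables
either    <- full_join(left, right, by = "state")   # every state from either table
matched   <- semi_join(left, right, by = "state")   # rows of left with a match, no ev column
unmatched <- anti_join(left, right, by = "state")   # rows of left without a match

both$state
#> [1] "Alabama"    "California"
either$state
#> [1] "Alabama"     "Arkansas"    "California"  "Connecticut"
names(matched)
#> [1] "state"      "population"
unmatched$state
#> [1] "Arkansas"
```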

Binding Tables

Unlike the join functions, the binding functions do not try to match by a variable, but instead simply combine datasets. If the datasets don't match by the appropriate dimensions, one obtains an error.

The binding functions are simple and blunt.

`bind_cols`

bind_cols binds several columns together; it throws an error if the columns differ in length.

bind_cols(a = 1:3, b = 4:6)
#> # A tibble: 3 × 2
#>       a     b
#>     <int> <int>
#> 1     1     4
#> 2     2     5
#> 3     3     6

`bind_rows`

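Similarly, bind_rows binds several rows together. Columns are matched by name and a column missing from one table is filled with NA; an error is raised only when same-named columns have incompatible types.
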
tab_1 <- tab[1:2, ]
tab_2 <- tab[3:4, ]
bind_rows(tab_1, tab_2)
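
A minimal sketch of how the two binding functions treat mismatched inputs, using hypothetical data frames (`top`, `bottom`, `three_rows`). As far as I can tell from current dplyr, bind_rows lines columns up by name and pads with NA, while bind_cols refuses inputs whose row counts cannot be reconciled.

```r
library(dplyr)

top    <- data.frame(state = c("Alabama", "Alaska"),
                     population = c(4779736, 710231))
bottom <- data.frame(state = "Arizona", ev = 11)

# bind_rows: columns are matched by name, gaps are filled with NA.
stacked <- bind_rows(top, bottom)
stacked

# bind_cols: 2 rows against 3 rows cannot be combined, so this errors.
three_rows <- data.frame(ev = c(9, 3, 11))
result <- tryCatch(bind_cols(top, three_rows),
                   error = function(e) "error")
result
#> [1] "error"
```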

Web Scraping

A simple web scraping (web harvesting) example.

Our goal is to extract the table of crime rates by US state from the Wikipedia page below and store it in a data frame in R.

url <- paste0("https://en.wikipedia.org/w/index.php?title=",
              "Gun_violence_in_the_United_States_by_state",
              "&direction=prev&oldid=810166167")

Next, we call read_html from the rvest package to fetch the XML document for the page. Calling html_text on the result then returns the page's text content as a single string.

library(rvest)

h <- read_html(url)
class(h)
#> [1] "xml_document" "xml_node"

Next, we use html_nodes to get all table elements (or nodes) in the XML document. We only care about the first one, which is our target table.

tab <- h |> html_nodes("table")
tab[[1]]
#> {html_node}
#> <table class="wikitable sortable">
#> [1] <tbody>\n<tr>\n<th>State\n</th>\n<th>\n<a href="/wiki/List_of_U.S ...

We then call html_table on this html_node: it converts HTML tables into the corresponding data frames.

tab <- tab[[1]] |> html_table()
class(tab)
#> [1] "tbl_df"     "tbl"        "data.frame"

As a last step, we rename the variables of the data frame as needed (the names in the original table are too long).

tab <- tab |> setNames(c("state", "population", "total", "murder_rate"))
head(tab)
#> # A tibble: 6 × 4
#>   state      population total murder_rate
#>   <chr>      <chr>      <chr>       <dbl>
#> 1 Alabama    4,853,875  348           7.2
#> 2 Alaska     737,709    59            8  
#> 3 Arizona    6,817,565  309           4.5
#> 4 Arkansas   2,977,853  181           6.1
#> 5 California 38,993,940 1,861         4.8
#> # ℹ 1 more row

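We have now scraped the table from the web page and stored it in the data frame tab. On closer inspection there is still work to do: for example, removing the , from the numbers and converting them to numeric (they are character by default).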
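
The same steps can be rehearsed offline on a hand-written HTML snippet, which is handy when the page is unavailable. This is a sketch under assumptions: the HTML string is made up, `minimal_html` and `html_elements` are the rvest 1.0+ helpers (`html_elements` is the newer name for `html_nodes`), and the object names are hypothetical.

```r
library(rvest)

# A made-up page containing one small table.
page <- minimal_html('
  <table>
    <tr><th>State</th><th>Population</th></tr>
    <tr><td>Alabama</td><td>4,853,875</td></tr>
    <tr><td>Alaska</td><td>737,709</td></tr>
  </table>')

tables  <- html_elements(page, "table")   # all <table> nodes on the page
toy_tab <- html_table(tables[[1]])        # first table as a data frame

# Shorter, lower-case names, as in the last step above.
toy_tab <- setNames(toy_tab, c("state", "population"))
toy_tab
```

As with the real page, the population column stays character because of the thousands separators.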

String Processing

We will use stringr, a powerful string-processing package. Combining R's emphasis on vectorization with regex makes string processing a real pleasure.

`str_replace()`

Picking up where we left off, there are two ways to remove the , from population and convert the data to numeric:

test_1 <- str_replace_all(tab$population, ",", "")
test_1 <- as.numeric(test_1)

test_2 <- parse_number(tab$population)
identical(test_1, test_2)
#> [1] TRUE

str_replace() replaces the first match of pattern in each string with another string.
str_replace_all() replaces all matches of pattern in each string with another string.
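
The difference matters for numbers with more than one thousands separator. A small sketch on two hypothetical values:

```r
library(stringr)
library(readr)

x <- c("38,993,940", "737,709")

first_only <- str_replace(x, ",", "")      # removes only the first comma
first_only
#> [1] "38993,940" "737709"

all_gone <- str_replace_all(x, ",", "")    # removes every comma
all_gone
#> [1] "38993940" "737709"

# Both routes from the text give the same numbers.
as.numeric(all_gone)
parse_number(x)
```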

`str_detect()` & `str_view()`

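str_detect() returns a logical vector indicating whether each string matches the pattern.
str_view() displays the first match of the pattern in each string.
str_view_all() displays all matches of the pattern in each string. (In stringr 1.5.0 and later, str_view() itself shows all matches and str_view_all() is deprecated.)

These three functions help us quickly get a feel for the data to be processed and look for specific features; they are also handy when debugging.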
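
Because str_detect() returns a logical vector, it combines naturally with sum() and with subsetting. A minimal sketch on a hypothetical vector of self-reported heights:

```r
library(stringr)

heights <- c("5'10", "170", "5 feet 7inches", "6,1")

has_inches <- str_detect(heights, "inches")
has_inches
#> [1] FALSE FALSE  TRUE FALSE

# How many entries mention inches?
sum(has_inches)
#> [1] 1

# Which entries are not plain digits and therefore need cleaning?
needs_work <- heights[!str_detect(heights, "^\\d+$")]
needs_work
#> [1] "5'10"           "5 feet 7inches" "6,1"
```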

`str_subset()`

str_subset() returns all strings that contain the pattern.

str_subset(problems, "inches")
#> [1] "5 feet and 8.11 inches" "Five foot eight inches"
#> [3] "5 feet 7inches"         "5ft 9 inches"          
#> [5] "5 ft 9 inches"          "5 feet 6 inches"

`str_extract()` & `str_match()`

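str_extract() and str_match() return the first match of the pattern in each string.
str_extract_all() and str_match_all() return all matches of the pattern in each string.

The only difference between the extract and match functions is how they treat a regex that contains groups (capture groups).

str_match() returns not only the full match but also the value of every capture group.

pattern_without_groups <- "^[4-7],\\d*$"
pattern_with_groups <- "^([4-7]),(\\d*)$"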

str_match(s, pattern_with_groups)
#>      [,1]   [,2] [,3]
#> [1,] "5,9"  "5"  "9" 
#> [2,] "5,11" "5"  "11"
#> [3,] "6,"   "6"  ""  
#> [4,] "6,1"  "6"  "1" 
#> [5,] NA     NA   NA  

str_extract(s, pattern_with_groups)
#> [1] "5,9"  "5,11" "6,"   "6,1"  NA

In a regex, the \(i\)-th capture group is referenced as \i (in R the \ itself must be escaped, hence \\i). Combining capture groups with the replace functions gives an elegant search-and-replace:

str_subset(problems, pattern_with_groups) |> 
  str_replace(pattern_with_groups, "\\1'\\2") |> head()
#> [1] "5'3"  "5'25" "5'5"  "6'5"  "5'8"  "5'6"
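
Capture groups are useful beyond replacement: the columns returned by str_match() can be fed straight into arithmetic. The sketch below uses a hypothetical input vector `s` and converts feet and inches to a total in inches; strings that do not match the pattern, or that lack an inches part, come out as NA.

```r
library(stringr)

s <- c("5,9", "5,11", "6,", "5'9")
pattern_with_groups <- "^([4-7]),(\\d*)$"

m <- str_match(s, pattern_with_groups)
feet   <- as.numeric(m[, 2])   # first capture group
inches <- as.numeric(m[, 3])   # second capture group ("" becomes NA)
total  <- feet * 12 + inches
total
#> [1] 69 71 NA NA

# Strings that do not match the pattern are returned unchanged.
fixed <- str_replace(s, pattern_with_groups, "\\1'\\2")
fixed
#> [1] "5'9"  "5'11" "6'"   "5'9"
```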

Lookarounds

I won't go into regex in general here, since it is familiar ground. The one thing I had not met before is lookarounds:

Lookarounds are zero-width assertions. They provide a way to ask for one or more conditions to be satisfied without moving the search forward or matching it.

The name lookarounds already captures the idea vividly: With lookarounds, your feet stay planted on the string. You're just looking, not moving!

lookahead: (?=pattern)
lookbehind: (?<=pattern)
negative lookahead: (?!pattern)
negative lookbehind: (?<!pattern)
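
One small example per lookaround type, on a hypothetical vector `x`. In each case the condition inside the lookaround is checked but never becomes part of the extracted match.

```r
library(stringr)

x <- c("price: $42", "cost: 17 USD", "id 99")

# Lookahead: digits that are followed by " USD".
ahead <- str_extract(x, "\\d+(?= USD)")
ahead
#> [1] NA   "17" NA

# Lookbehind: digits that are preceded by "$".
behind <- str_extract(x, "(?<=\\$)\\d+")
behind
#> [1] "42" NA   NA

# Negative lookahead: strings that do not start with "id".
not_id <- str_detect(x, "^(?!id)")
not_id
#> [1]  TRUE  TRUE FALSE

# Negative lookbehind: whole numbers not preceded by "$".
no_dollar <- str_extract(x, "(?<!\\$)\\b\\d+")
no_dollar
#> [1] NA   "17" "99"
```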

Lookarounds can also be chained to express multiple (AND) conditions.

# [a-zA-Z] rather than [a-z|A-Z]: inside a character class, | is a literal character
pattern <- "(?=\\w{8,16})(?=^[a-zA-Z].*)(?=.*\\d+.*).*"
yes <- c("Ihatepasswords1", "password1234")
no <- c("sh0rt", "Ihaterpasswords", "7X%9,N`yrYG92b7")

str_detect(yes, pattern)
#> [1] TRUE TRUE

str_detect(no, pattern)
#> [1] FALSE FALSE FALSE

str_extract(yes, pattern)
#> [1] "Ihatepasswords1" "password1234"

Reference

References in the article are from corresponding course materials if not specified.

Course info:

Code: COMP2501, Lecturer: Dr. H.F. Ting.

Course textbook:

Data Analysis and Prediction Algorithms with R - Rafael A. Irizarry.