Introduction to DS - A First Look at R, Part One

Posted on 2023-09-21

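Prof. Ting, who teaches COMP2501, matches my image of a senior lecturer perfectly: easygoing and kind, but his lectures put me to sleep. On top of that, R feels more like a scripting language with relatively simple semantics (most of the work even happens in the REPL), so I decided to teach myself.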

This article is a self-administered course note.

It will NOT cover any exam or assignment related content.

The Very Basics

Objects. We use the term object to describe stuff that is stored in R. Variables are examples, but objects can also be more complicated entities such as functions.

a <- 1
b <- 1
c <- -1
a
#> [1] 1
print(a)
#> [1] 1
ls()
#> [1] "a" "b" "c"

The ls() function lists all objects in the current environment (workspace).
Use rm(a) to remove the object a from the current environment.

Built-in Functions.

args(log)
#> function (x, base = exp(1))
#> NULL
log(8, base = 2)
#> [1] 3
log(x = 8, base = 2)
#> [1] 3
log(base = 2, x = 8)
#> [1] 3

Use help("log") or ?log to display the manual page for the function log().
= vs. <-: = specifies arguments in a function call, while <- assigns values to variables. When argument names are used, the order of the arguments no longer matters.
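To make the point about argument order concrete, here is a small sketch using only base R's log(). It contrasts positional matching, where the order decides the meaning, with named matching, where it does not.

```r
log(8, 2)              # 3: positional, so x = 8 and base = 2
log(2, 8)              # 0.3333333: swapping the positions changes the meaning
log(base = 2, x = 8)   # 3: named arguments can be given in any order
```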

Data Types

Basic Data Types

numeric. e.g., a <- 1.
integer (subset of numeric). e.g., b <- 1L.
character. e.g., c <- "hello".
logical. e.g., d <- TRUE. (capitalized TRUE & FALSE)
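A quick way to confirm these types is class(). The sketch below uses illustrative variable names that are not part of the course material.

```r
# Illustrative values only; the variable names are made up for this sketch.
n <- 1          # numeric (stored as a double)
i <- 1L         # integer: the L suffix requests integer storage
s <- "hello"    # character
b <- TRUE       # logical

class(n)        # "numeric"
class(i)        # "integer"
class(s)        # "character"
class(b)        # "logical"

# An integer still counts as numeric for is.numeric().
is.numeric(i)   # TRUE
```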

Data Frames

The most common way of storing a dataset in R is in a data frame. Think of it as a table:

rows: observations.
columns: different variables reported for each observation.

library(dslabs)
data(murders)
class(murders)
#> [1] "data.frame"
str(murders)
#> 'data.frame':    51 obs. of  5 variables:
#> $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
#> $ abb : chr "AL" "AK" "AZ" "AR" ...
#> ...

library() plays a role similar to import (Java, Python) or include (C): it loads a package.
data(murders) loads the dataset murders into the current environment.
str() function is useful for finding out more about the structure of an object.

names(murders)
#> [1] "state" "abb" "region" "population" "total"
murders$population
#>  [1]  4779736   710231  6392017  2915918 37253956  5029196  3574097
#>  [8]   897934   601723 19687653  9920000  1360301  1567582 12830632
#> [15]  ...

names() returns the names of all variables (columns) of a data frame.
[data_frame_name]$[variable_name]: the accessor $ extracts the corresponding column.
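For readers without the dslabs package at hand, the same ideas can be tried on a tiny hand-made data frame. The data frame scores and its values are hypothetical and only serve this sketch.

```r
# A tiny hand-made data frame (hypothetical values, not the murders dataset).
scores <- data.frame(
  student = c("Ann", "Bob", "Cho"),
  grade   = c(91, 78, 85)
)

class(scores)        # "data.frame"
names(scores)        # "student" "grade"
nrow(scores)         # 3 observations (rows)
scores$grade         # the grade column, extracted as a numeric vector
mean(scores$grade)   # a column can be passed straight to other functions
```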

Vectors

The object murders$population is not one number but several. We call these types of objects vectors.

pop <- murders$population
length(pop)
#> [1] 51
class(pop)
#> [1] "numeric"
class(murders$state)
#> [1] "character"

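Note a counter-intuitive detail here: the class of the vector pop is numeric, not something like "numeric vector" or numeric[]. That is because in R a single number is technically a vector of length $1$.

So, unlike the other languages we have met before, in R the most basic objects available to store data are vectors. This design has a strong data-wrangling flavor: when processing data, the basic unit is not a single value but a collection of values that (usually) share the same characteristics.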

z <- 3 == 2
print(z)
#> [1] FALSE
length(z)
#> [1] 1

We shall discuss vectors in full detail later.

Factors

Factors are useful for storing categorical data.

When a column of a data frame has only a few distinct elements, we can use a factor to save space. For example, the region column of murders contains only four values: Northeast, South, North Central and West.

class(murders$region)
#> [1] "factor"
levels(murders$region)
#> [1] "Northeast" "South" "North Central" "West"

We use levels() to display all levels of a certain factor object.

In the background, R stores these levels as integers and keeps a map to keep track of the labels. This is more memory-efficient than storing all the characters.
The default in R is for the levels to follow alphabetical order.
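The integer codes behind a factor can be inspected with as.integer(). The sketch below uses a hypothetical vector of sizes. It also shows that passing levels explicitly overrides the alphabetical default.

```r
# Hypothetical categorical data.
size <- factor(c("small", "large", "small", "medium", "large"))

levels(size)       # "large" "medium" "small" (alphabetical by default)
as.integer(size)   # 3 1 3 2 1: the integer codes stored behind the labels

# Supplying levels explicitly overrides the alphabetical default.
size2 <- factor(c("small", "large", "small", "medium", "large"),
                levels = c("small", "medium", "large"))
levels(size2)      # "small" "medium" "large"
as.integer(size2)  # 1 3 1 2 3
```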

The function reorder() lets us change the order of the levels of a factor variable (e.g., region) based on a summary computed on a numeric vector (e.g., value):

region <- murders$region
value <- murders$total
region <- reorder(region, value, FUN = sum)
levels(region)
#> [1] "Northeast" "North Central" "West" "South"

The above code takes the sum of the total murders in each region and reorders the factor by these sums. The new order shows that the Northeast has the fewest murders while the South has the most.

Lists

Data frames are a special case of lists: a data frame is essentially a list of vectors or factors that all have the same length. Lists are useful because they can store any combination of different types, which vectors cannot.

record <- list(name = "John Doe", 
               student_id = 1234, 
               grade = c(95, 82, 91, 97, 93), 
               final_grade = "A")
class(record)
#> [1] "list"
record$student_id
#> [1] 1234
record[["student_id"]]
#> [1] 1234

As with data frames, we use [list_name]$[variable_name] to access a field of a list. [[ ]] with a quoted name does the same job. Sometimes we encounter lists without names:

record2 <- list("John Doe", 1234)
record2[[1]]
#> [1] "John Doe"

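We can think of such lists as vectors that can hold data of different types. Because their fields are anonymous, we cannot access them with $ + name. Instead, we use [[ ]] + index to access the element at a given position.

Note that R is 1-indexed: indexes start from 1, not 0!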
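A short sketch of list access, assuming a hypothetical record rec built just for illustration. It also contrasts single brackets, which return a sub-list, with double brackets, which return the element itself.

```r
# Hypothetical record, mirroring the structure used above.
rec <- list(name = "Jane Roe", id = 5678, grades = c(88, 92))

rec[["grades"]][1]    # 88: the first grade (indexes start at 1)
class(rec["name"])    # "list": single brackets return a sub-list
class(rec[["name"]])  # "character": double brackets return the element itself
length(rec)           # 3 fields
```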

Matrices

Matrices are two-dimensional (like data frames).
All entries in a matrix have to be of the same type (like vectors).

Compared to matrices, data frames are much more useful for storing data. Yet matrices have a major advantage over data frames: we can perform matrix algebra operations on them.

Use matrix() to create a new matrix; use as.data.frame() to conveniently convert a matrix into a data frame.

mat <- matrix(1:12, 4, 3)
mat
#>      [,1] [,2] [,3]
#> [1,]    1    5    9
#> [2,]    2    6   10
#> [3,]    3    7   11
#> [4,]    4    8   12
as.data.frame(mat)
#>   V1 V2 V3
#> 1  1  5  9
#> 2  2  6 10
#> 3  3  7 11
#> 4  4  8 12
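The matrix algebra advantage mentioned earlier can be sketched with a few base R operators. The matrix m and vector v are made up for this example.

```r
m <- matrix(1:4, 2, 2)   # filled column by column: rows are (1, 3) and (2, 4)
v <- c(1, 1)

t(m)          # transpose: rows become (1, 2) and (3, 4)
m %*% v       # matrix-vector product: a 2 x 1 matrix holding 4 and 6
m * m         # by contrast, * multiplies element by element
```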

With [], we can slice a matrix however we like. If the slice is a one-dimensional strip (a single row or column), the result is a vector; if it is a two-dimensional block, the result is a matrix.

mat[2, 3]
#> [1] 10
mat[2, ]
#> [1]  2  6 10
mat[, 3]
#> [1]  9 10 11 12
mat[1:2, 2:3]
#>      [,1] [,2]
#> [1,]    5    9
#> [2,]    6   10

Similar slicing also works on data frames:

data("murders")
murders[25, 1]
#> [1] "Mississippi"
murders[2:3, ]
#>     state abb region population total
#> 2  Alaska  AK   West     710231    19
#> 3 Arizona  AZ   West    6392017   232

Vectors

In R, the most basic objects available to store data are vectors. As we have seen, complex datasets can usually be broken down into components that are vectors.

Creating Vectors

We can create vectors using the function c(), which stands for concatenate.

grades <- c(95, 82, 91, 97, 93)
grades
#> [1] 95 82 91 97 93
class(grades)
#> [1] "numeric"

Unlike lists, all entries of a vector must have the same type. When we try to use c() to create a vector with elements of different types, lower-level types (e.g., numeric) are converted to higher-level types (e.g., character).

grades <- c(95, 82, "91", "97", "93")
grades
#> [1] "95" "82" "91" "97" "93"
class(grades)
#> [1] "character"
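The promotion order can be observed step by step. This is a minimal sketch with made-up values: logical is promoted to integer, integer to double, and anything mixed with a string becomes character.

```r
class(c(TRUE, 2L))        # "integer":   logical is promoted to integer
class(c(2L, 3.5))         # "numeric":   integer is promoted to double
class(c(3.5, "a"))        # "character": numbers become strings
c(TRUE, 2L, 3.5, "a")     # "TRUE" "2" "3.5" "a": every entry ends up a string
```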

Names

As with lists, we can name the entries of a vector and call names() on a named vector.

codes <- c(italy = 380, canada = 124, egypt = 818, 86)
codes
#>  italy canada  egypt 
#>    380    124    818     86
names(codes)
#> [1] "italy" "canada" "egypt" ""
class(codes)
#> [1] "numeric"

When naming the entries of a vector, wrapping the names in quotes "" makes no difference. We can also use names() to assign names, i.e., names can be given after the vector has been defined.

codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")
names(codes) <- country
codes
#>  italy canada  egypt 
#>    380    124    818

Sequences

Use sequences to quickly create a vector whose elements follow a regular pattern. Python has a similar idiom, which is handy.

seq(1, 10)
#> [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 10, 2)
#> [1] 1 3 5 7 9
class(seq(1, 10))
#> [1] "integer"
class(seq(1, 10, 2))
#> [1] "numeric"

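In the calls above, seq() returns an integer vector only when no jump length is specified (i.e., the default jump length of 1 is used); once a numeric jump length such as 2 is supplied, the result is numeric.

seq(a, b) is equivalent to a:b, but a:b can only generate vectors with a jump length of 1.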

1:10
#> [1]  1  2  3  4  5  6  7  8  9 10
class(1:10)
#> [1] "integer"
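Two further base R options may be useful here, shown as a small sketch: the by and length.out arguments of seq(), and seq_len() for sequences that start at 1.

```r
seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
seq(0, 100, length.out = 5)   # 0 25 50 75 100: ask for 5 evenly spaced values
seq_len(4)                    # 1 2 3 4, the same values as 1:4
```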

Subsetting

We use single square brackets [] + name/index to access an entry of a vector. Neatly, you can get more than one entry by using a multi-entry vector of indexes or names.

codes[2]
#> canada
#>    124
codes[c(1, 3)]
#> italy egypt 
#>   380   818
codes[1:2]
#>  italy canada 
#>    380    124
codes[c("egypt", "italy")]
#> egypt italy 
#>   818   380

Rescaling

In R, arithmetic operations on vectors occur element-wise.

a <- c(0.01, 0.02, 0.03, 0.04, 0.05)
a * 100
#> [1] 1 2 3 4 5
a * 100 - 5
#> [1] -4 -3 -2 -1 0

Two Vectors

Two vectors a, b:

Equal length: c <- a + b means $c_i=a_i+b_i$.
Different lengths: recycling occurs; the shorter vector is recycled until both vectors have the same length, which reduces to the first case.

x <- c(1, 2, 3)
y <- c(10, 20, 30)
z <- c(10, 20, 30, 40, 50, 60, 70)
x+y
#> [1] 11 22 33
x+z
#> [1] 11 22 33 41 52 63 71
#> Warning message: 
#> In x + z : longer object length is not a multiple of shorter object length
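When the longer length is an exact multiple of the shorter one, recycling happens silently. A minimal sketch with made-up vectors short and long:

```r
short <- c(1, 2)
long  <- c(10, 20, 30, 40)

short + long   # 11 22 31 42: short is reused as 1, 2, 1, 2 and no warning is raised
long * 2       # 20 40 60 80: the single number 2 is a length-1 vector, recycled four times
```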

Coercion

Coercion is an attempt by R to be flexible with data types. When an entry does not match the expected type, some of the prebuilt R functions try to guess what was meant before throwing an error.

R trades type safety for type flexibility. A few examples of coercion (some of which we have already seen):

When coercion is possible: R coerces lower-level data into higher-level data.
When coercion is impossible: NA (not available) is introduced.

x <- c(1, "canada", 3)
x
#> [1] "1" "canada" "3"
class(x)
#> [1] "character"

The above is an example of possible coercion: to obey the rule that all entries of a vector share one type, R coerced the numeric data to character.

x <- c("1", "b", "3")
as.numeric(x)
#> Warning: NAs introduced by coercion
#> [1]  1 NA  3

When a coercion cannot be carried out (impossible coercion: the character value "b" cannot be converted to numeric), the special value NA is introduced, accompanied by nothing more than a mild warning.
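In practice, is.na() helps locate the entries that failed to convert. The sketch below uses hypothetical input strings and wraps the conversion in suppressWarnings() only because the warning is expected here.

```r
raw <- c("12", "7.5", "n/a", "3")          # hypothetical input strings
num <- suppressWarnings(as.numeric(raw))   # silence the expected coercion warning

is.na(num)               # FALSE FALSE TRUE FALSE
raw[is.na(num)]          # "n/a": the entry that could not be converted
sum(num, na.rm = TRUE)   # 22.5: add up only the values that converted
```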

Sorting

Data wrangling is of course inseparable from sorting; next we look at the convenient sorting features R provides.

`sort`

The function sort() sorts a vector in increasing order; but it does not give us any additional information.

library(dslabs)
data(murders)
sort(murders$total)
#>  [1]    2    4    5    5    7    8   11   12   12   16   19   21   22
#> [14]   27   32   36   38   53   63   65   67   84   93   93   97   97
#> [27]   99  111  116  118  120  135  142  207  219  232  246  250  286
#> [40]  293  310  321  351  364  376  413  457  517  669  805 1257

`order`

The function order() returns the vector of indexes that sorts the input vector.

x <- c(31, 4, 15, 92, 65)
x
#> [1] 31  4 15 92 65
order(x)
#> [1] 2 3 1 5 4
index <- order(x)
x[index]
#> [1] 4 15 31 65 92

Because the index of a vector in R can itself be a multi-entry vector, this is how we reproduce sort(). But order() can do more: if we want to order the states by murders, sort() alone cannot accomplish the task.

ind <- order(murders$total)
murders$abb[ind]
#>  [1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT"
#> [14] "WV" "NE" "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI"
#> [27] "DC" "OK" "KY" "MA" "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC"
#> [40] "MD" "OH" "MO" "LA" "IL" "GA" "MI" "PA" "NY" "FL" "TX" "CA"

First obtain the index that orders the vector according to murder totals.
Then use it to index the vector of state abbreviations.

We can see that California (CA) has the most total murders and Vermont (VT) the fewest.
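The same trick reorders whole rows of a data frame. This is a sketch on a hypothetical data frame df standing in for murders; decreasing = TRUE is an argument of base R's order().

```r
# Hypothetical data frame standing in for murders.
df <- data.frame(
  city  = c("A", "B", "C", "D"),
  count = c(30, 5, 12, 47)
)

df[order(df$count), ]                          # rows from smallest to largest count
df$city[order(df$count, decreasing = TRUE)]    # cities from largest to smallest count: D, A, C, B
```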

`max` & `which.max`

These return the value and the index of the maximum, respectively. min() and which.min() work the same way for the minimum.

max(murders$total)
#> [1] 1257
i_max <- which.max(murders$total)
murders$state[i_max]
#> [1] "California"

`rank`

rank() returns the rank of every entry of the given vector.

x <- c(31, 4, 15, 92, 65)
rank(x)
#> [1] 3 1 2 5 4
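Since sort(), order() and rank() are easy to confuse, the sketch below puts them side by side on the same vector and checks two identities that tie them together. The identities hold here because x has no ties.

```r
x <- c(31, 4, 15, 92, 65)

sort(x)    # 4 15 31 65 92: the values in increasing order
order(x)   # 2 3 1 5 4:     where each sorted value sits in x
rank(x)    # 3 1 2 5 4:     the position each original value takes after sorting

# Two identities that connect the three functions (x has no ties):
identical(x[order(x)], sort(x))    # TRUE
identical(sort(x)[rank(x)], x)     # TRUE
```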

Indexing

R provides a powerful and convenient way of indexing vectors. We can, for example, subset a vector based on properties of another vector. In the expression a[b], b can be:

a vector of indexes. [see the order section]
a logical vector. [stay tuned]

Subsetting with logicals

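This is easy to grasp. Given a vector a and a logical vector b, a[b] means:

Equal length: $a_i$ is kept only when $b_i=\mathtt{TRUE}$.
a is longer than b: b is recycled to the length of a, which reduces to the first case.
a is shorter than b: every TRUE in b beyond the end of a introduces an NA.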
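The three cases above can be checked directly on a small made-up vector:

```r
a <- c(10, 20, 30, 40)

a[c(TRUE, FALSE, TRUE, FALSE)]          # 10 30: equal lengths
a[c(TRUE, FALSE)]                       # 10 30: the index is recycled to TRUE FALSE TRUE FALSE
a[c(FALSE, FALSE, FALSE, TRUE, TRUE)]   # 40 NA: the fifth TRUE points past the end of a
```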

Here is how logical subsetting finds all states with a murder rate below 0.71:

murder_rate <- murders$total / murders$population * 100000
logical_ind <- murder_rate < 0.71
murders$state[logical_ind]
#> [1] "Hawaii"        "Iowa"          "New Hampshire" "North Dakota" 
#> [5] "Vermont"
sum(logical_ind)
#> [1] 5

Logical operators

In R, both & and && are logical AND, not bitwise AND. Use & when working with vectors: it compares element by element, whereas && expects inputs of length 1 and performs a single comparison.

The following code finds all western states with a murder rate of at most 1.

west <- murders$region == "West"
safe <- murder_rate <= 1
ind <- west & safe
murders$state[ind]
#> [1] "Hawaii"  "Idaho"   "Oregon"  "Utah"    "Wyoming"
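A compact sketch of the element-wise operators on two made-up logical vectors, with && used only where both sides are single values:

```r
p <- c(TRUE, TRUE, FALSE)
q <- c(TRUE, FALSE, FALSE)

p & q    # TRUE FALSE FALSE: element-wise AND
p | q    # TRUE TRUE FALSE:  element-wise OR
!p       # FALSE FALSE TRUE: element-wise NOT

# && is meant for single values, e.g. inside an if () condition.
length(p) == 3 && all(q[2:3] == FALSE)   # TRUE
```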

`which`

which() returns the indexes of all entries of a logical vector that are TRUE. Therefore, for a logical vector b, a[b] and a[which(b)] usually give the same result. (logical subsetting vs. index subsetting)

Suppose we want to look up the murder rate of California:

logical_ind <- murders$state == "California"
ind <- which(logical_ind)
murder_rate[logical_ind]
#> [1] 3.374138
murder_rate[ind]
#> [1] 3.374138
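The word "usually" matters when the logical vector contains NA. In the sketch below, with made-up vectors a and b, logical subsetting propagates the NA while which() drops it.

```r
a <- c(5, 6, 7)
b <- c(TRUE, NA, FALSE)

a[b]          # 5 NA: a missing condition produces an NA in the result
a[which(b)]   # 5:    which() keeps only the positions that are definitely TRUE
which(b)      # 1
```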

`match`

Suppose we want the murder rates of several states at once; match() gives us their indexes.

match(c("New York", "Florida", "Texas"), murders$state)
#> [1] 33 10 44

Combining logical vectors with the logical OR | achieves the same goal, though it is more verbose:

which(murders$state == "New York" | murders$state == "Florida" | murders$state == "Texas")
#> [1] 10 33 44

`%in%`

If rather than an index we want a logical that tells us whether or not each element of a first vector is in a second, we can use the function %in%.

c("Boston", "Dakota", "Washington") %in% murders$state
#> [1] FALSE FALSE  TRUE

which() connects match() (index-based) and %in% (logical-based):

match(c("New York", "Florida", "Texas"), murders$state)
#> [1] 33 10 44
which(murders$state %in% c("New York", "Florida", "Texas"))
#> [1] 10 33 44
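The difference in result order and in how missing names are reported is easier to see on a small example. The vectors states and wanted below are hypothetical.

```r
states <- c("Alabama", "Alaska", "Arizona", "Texas")   # hypothetical lookup vector
wanted <- c("Texas", "Boston", "Alaska")

match(wanted, states)        # 4 NA 2: positions in states, in the order of wanted
wanted %in% states           # TRUE FALSE TRUE: is each wanted name present?
which(states %in% wanted)    # 2 4: positions in states, in the order of states
```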

Basic plots

One of R's hallmarks is that it can produce plots quickly with simple syntax. (See the MIT-Data Wrangling section.)

`plot`

Use plot() to draw a scatterplot.

x <- murders$population / 10^6
y <- murders$total
plot(x, y)

The with() function allows us to use the murders column names directly in the plot() call.

with(murders, plot(population, total))

plot(x, y) takes two arguments: a scatterplot shows the relationship between the variables x and y.

`hist`

Use hist() to draw a histogram.

x <- with(murders, total/population * 100000)
hist(x)

hist(x) takes a single argument: a histogram shows the frequency distribution of the variable x.

`boxplot`

Use boxplot() to draw a boxplot.

murders$rate <- with(murders, total/population * 100000)
boxplot(rate~region, data = murders)

A boxplot shows the locality, spread and skewness of groups of numerical data.

Programming Basics

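Everything introduced so far can be reproduced in the REPL, which highlights how interactive R is and how specialized it is for data processing. But R is also a programming language, and it naturally has a mature set of control-flow constructs.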

if-else statement

The familiar if-else statement.

library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population * 100000

ind <- which.min(murder_rate)

if (murder_rate[ind] < 0.5) {
    print(murders$state[ind])
} else {
    print("No state has murder rate that low")
}

#> [1] "Vermont"

R also has a built-in function for if-else: ifelse(a, b, c). Think of it as the ternary operator a?b:c in C++. The function is particularly useful because it works on vectors (reminiscent of higher-order functions).

a <- c(0, 1, 2, -4, 5)
result <- ifelse(a > 0, 1/a, NA)
result
#> [1]  NA 1.0 0.5  NA 0.2
#> PS: 1/0 = Inf, but 0 fails the test a > 0, so NA is returned for it
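The yes and no arguments of ifelse() do not have to be numeric. As a small sketch with hypothetical temperatures, the same call can label each entry of a vector:

```r
temps <- c(-3, 12, 25)    # hypothetical temperatures

# Each entry is tested on its own, so the result has the same length as temps.
labels <- ifelse(temps < 0, "freezing", "ok")
labels                    # "freezing" "ok" "ok"
```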

One more example: we use ifelse() to replace all missing values (NA) in the vector na_example with 0.

data(na_example)
no_nas <- ifelse(is.na(na_example), 0, na_example)
sum(is.na(no_nas))
#> [1] 0

`any` & `all`

Classic functions. Both any() and all() operate on a logical vector.

any(a): the result is $a_1\lor a_2 \lor \cdots \lor a_n$.
all(a): the result is $a_1\land a_2 \land \cdots \land a_n$.
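A minimal sketch on a made-up numeric vector z, where the comparison produces the logical vector that any() and all() consume:

```r
z <- c(3, -1, 8)

any(z < 0)    # TRUE:  at least one entry is negative
all(z < 0)    # FALSE: not every entry is negative
all(z < 10)   # TRUE:  every entry is below 10
```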

Defining functions

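Nothing much to say here. The syntax for defining functions in R is similar to that of most mainstream languages.

By now you may have sensed it: although the basic storage unit in R is called an object, it clearly has little to do with objects in OOP; at heart, R is a functional programming language.

The following function computes either the arithmetic mean or the geometric mean of x, depending on the argument arithmetic.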

avg <- function(x, arithmetic = TRUE) {
    n <- length(x)
    ifelse(arithmetic, sum(x)/n, prod(x)^(1/n))
}

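The function keyword declares that a function is being defined. Note that we use the assignment operator <- to assign the function to a variable named avg.

On reflection, this is a very functional-programming way of writing things: it implies that in R, functions, like any other object, are full first-class citizens that can be stored and passed around.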
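To illustrate functions as first-class values, the sketch below passes a function as an argument. apply_twice and add_one are hypothetical helpers written for this example, not part of R or the course material.

```r
# apply_twice is a hypothetical helper: it calls f on x, then on the result.
apply_twice <- function(f, x) {
  f(f(x))
}

add_one <- function(x) x + 1

apply_twice(add_one, 5)                # 7
apply_twice(function(x) x * 2, 1:3)    # 4 8 12: anonymous functions work too
```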

For-loops

Again, nothing special to stress. The example below uses a loop to compute the sequence $S_n$, where $S_n=\sum_{i=1}^{n} i$.

compute_s_n <- function(n) {
    x <- 1:n
    sum(x)
}

m <- 25
s_n <- vector(length = m) # preallocate a vector of length m
for (i in 1:m) {
    s_n[i] <- compute_s_n(i)
}
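As a sanity check, the loop result can be compared with the closed form $S_n = n(n+1)/2$. The sketch below is self-contained and redefines compute_s_n; numeric(m) and seq_len(m) are used here as one possible way to preallocate and iterate.

```r
compute_s_n <- function(n) sum(1:n)

m <- 25
s_n <- numeric(m)                 # preallocate m zeros
for (i in seq_len(m)) {
  s_n[i] <- compute_s_n(i)
}

n <- 1:m
all(s_n == n * (n + 1) / 2)       # TRUE: the loop matches the closed form
```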

Vectorizations & Functionals

Although for-loops are an important concept to understand, in R we rarely use them. That's because most functions in R are vectorized.

A vectorized function is a function that applies the same operation to each element of a vector.

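This reminds me of what I learned in the Programming Languages course: higher-order functions are one of the hallmarks of functional programming, and what such functions provide is exactly the vectorization discussed here.

Some OOP languages such as Ruby introduce blocks to achieve something similar to vectorization fairly easily; in languages that support vectorization, loop statements are used far less often than in languages that do not.

R provides a general interface of higher-order functions, which we call functionals; sapply() is one of them.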

Functionals are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. In effect, functionals vectorize a function.

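Back to the sequence $S_n$: how do we implement it with vectorization and functionals instead of a loop?

n <- 1:25
s_n <- sapply(n, compute_s_n)

Compared with the for-loop, the vectorized implementation is clearly more concise and elegant. This also explains why an OOP language like Ruby wants similar features.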
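For this particular sequence, base R also offers a built-in vectorized function, cumsum(). The self-contained sketch below compares it with the sapply() version and shows that sapply() keeps the names of its input.

```r
compute_s_n <- function(n) sum(1:n)

via_sapply <- sapply(1:25, compute_s_n)   # apply compute_s_n to each of 1, ..., 25
via_cumsum <- cumsum(1:25)                # built-in vectorized running sum

all(via_sapply == via_cumsum)             # TRUE: both give the same sequence
sapply(c(a = 1, b = 4), sqrt)             # names are preserved: a is 1, b is 2
```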

Reference

References in the article are from corresponding course materials if not specified.

Course info:

Code: COMP2501, Lecturer: Dr. H.F. Ting.

Course textbook:

Data Analysis and Prediction Algorithms with R - Rafael A. Irizarry.
