Introduction to DS - A First Look at R, Part 3

Posted on 2023-10-04

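The next step after data wrangling is data visualization. In this part we introduce the open-source ggplot2 package and learn its very beginner-friendly syntax, which lets us create all kinds of plots and summaries in just a few lines of code.

Before starting, load the package with library(ggplot2). The examples also use the murders and heights datasets from the dslabs package and filter() from dplyr, so load those two packages as well.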

This article is a self-administered course note.

It will NOT cover any exam or assignment related content.

Graph Components

In ggplot2, a plot is made up of the following components:

Data component.
Geometry component. (The plot itself, e.g. scatterplot, barplot, histogram, boxplot...)
Aesthetic mapping. A set of visual cues that decides how the data is presented on the plot.
- map observations to $x$- or $y$-scale visual cues.
- map variables to color visual cue.
- other graph styling......

Layers

In ggplot2 we "assemble" a plot by adding layers; each layer is a function call.

DATA |> ggplot() + <LAYER 1> + <LAYER 2> + ... + <LAYER N>
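As a minimal sketch of the layering idea, the snippet below builds a plot one layer at a time and checks how many geometry layers it holds at each step. It assumes the dslabs package is installed (for the murders dataset); the names p0, p1, and p2 are only for illustration.

```r
library(ggplot2)
library(dslabs)   # assumed source of the murders dataset
data(murders)

p0 <- murders |> ggplot()                               # data only: an empty canvas
p1 <- p0 + geom_point(aes(population / 10^6, total))    # add a geometry layer
p2 <- p1 + ggtitle("Murders vs. population")            # add a title (not a geometry layer)

length(p0$layers)  # 0
length(p1$layers)  # 1
length(p2$layers)  # 1
```

Nothing is drawn until a plot object is printed, so intermediate objects such as p1 can be stored and extended later.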

Geometry layer

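<LAYER 1> is usually a geometry layer of the form geom_X(), where X is the type of plot, for example geom_point, geom_bar, or geom_histogram.

Take geom_point as an example: it needs data and a mapping. Here the data comes from the ggplot() call, and the mapping is passed to geom_point() directly.

p <- murders |> ggplot() + geom_point(aes(x = population/10^6, y = total))

The code above produces a scatterplot with population/10^6 on the $x$-axis and total on the $y$-axis.

Note the aes() function: it builds the aesthetic mapping from the two variables passed in. It is also one of the few functions in ggplot2 that can refer to the variables of the data directly, so we can write total instead of murders$total.

geom_point() accepts other arguments as well, such as size:

p <- murders |> ggplot() + geom_point(aes(x = population/10^6, y = total), size = 2)

Note that size is not a mapping here; it is a styling rule applied to all points. It therefore has to be written outside aes(), as an argument of geom_point() rather than of aes().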

Adding text

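The geom_label and geom_text functions let us add text annotations to the data in a plot. The difference is that geom_label draws a rectangle around the text, while geom_text does not.

We may want to annotate only some of the data, so text is not a general styling rule applied to all points. Instead, we use a mapping to specify which data gets which text. (This means label is an argument of aes(), not of geom_text().)

# label every state with its abbreviation
p + geom_text(aes(population/10^6, total, label = abb))
# place a fixed label at x = 10, y = 100
p + geom_text(aes(10, 100, label = "x=10, y=100"))

Besides data and aes, layer functions usually accept other kinds of arguments, and geom_text is no exception. In the plot produced by the code above, the text often overlaps the data points. To make the plot easier to read, we pass nudge_x to shift the text in the $x$ direction.

p + geom_text(aes(population/10^6, total, label = abb), nudge_x = 1.5)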

Scales

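The scale_x_continuous and scale_y_continuous functions let us control the scales of the plot. Since log10 scales are so common, ggplot2 also provides the shorthand functions scale_x_log10 and scale_y_log10.

p + scale_x_continuous(trans = "log10") + scale_y_continuous(trans = "log10")
# the same transformation, written with the shorthand functions
p + scale_x_log10() + scale_y_log10()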
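To see what a log10 scale does, here is a small sketch on made-up data (the data frame d is purely illustrative). Values that grow by a constant factor end up equally spaced once the axis is log-transformed.

```r
library(ggplot2)

# Made-up data: each value is 10 times the previous one
d <- data.frame(x = c(1, 10, 100, 1000), y = c(2, 20, 200, 2000))

p_linear <- ggplot(d, aes(x, y)) + geom_point()
p_log    <- p_linear + scale_x_log10() + scale_y_log10()

# On a log10 axis, positions are log10 of the values,
# so equal ratios become equal distances
gaps <- diff(log10(d$x))
gaps
```

Printing p_linear and p_log side by side shows the points bunched near the origin on the linear scale and evenly spread on the log scale.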

Labels and titles

Use the xlab and ylab functions to change the labels of the $x$- and $y$-axes, and the ggtitle function to add a title to the plot.

p <- p + scale_x_log10() + 
    scale_y_log10() + 
    xlab("Population in millions (log scale)") + 
    ylab("Total number of murders (log scale)") + 
    ggtitle("US Gun Murders in 2010")

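Categories as colors

We add one more geom_point layer to change the color of the data points.

p + geom_point(aes(population/10^6, total), color = "red")

Note that the aes argument of geom_point cannot be omitted here, because p was built without a global mapping. This holds even though, logically, "change the color of the points" does not seem to need any mapping information. [stay tuned for global aesthetic mapping]

p + geom_point(aes(population/10^6, total, col = region))

The code above colors each data point by the region of its observation; points from the same region get the same color. Here color has become an aesthetic mapping, passed as the col argument of aes() rather than as an argument of geom_point().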

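| aesthetic | aesthetic mapping |
| --- | --- |
| Usually applies to all the data | Can apply to all or only part of the data |
| The value is a constant | The value is a variable in the data |
| Passed as an argument of the layer function | Passed as an argument of `aes()`, which builds the mapping |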
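The contrast between a constant aesthetic and an aesthetic mapping can be sketched as follows. This assumes ggplot2 and dslabs are installed; the names p_set, p_map, and p_oops are hypothetical.

```r
library(ggplot2)
library(dslabs)   # assumed source of the murders dataset
data(murders)

# Constant aesthetic: outside aes(), every point is red, no legend
p_set <- murders |> ggplot() +
  geom_point(aes(population / 10^6, total), color = "red")

# Aesthetic mapping: inside aes(), one color per region, with a legend
p_map <- murders |> ggplot() +
  geom_point(aes(population / 10^6, total, color = region))

# Common pitfall: a constant inside aes() is treated as data
# (a variable with the single value "red"), not as the color red
p_oops <- murders |> ggplot() +
  geom_point(aes(population / 10^6, total, color = "red"))
```

Printing the three plots shows the difference: p_set has no legend, p_map has a legend with one entry per region, and p_oops has a legend with a single entry named "red".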

Global Aesthetic Mappings

Putting the layers above together, we have assembled a good-looking plot. The complete code, without intermediate variables, is:

p <- murders |> ggplot() + 
    geom_point(aes(x = population/10^6, y = total), size = 2) +
    geom_point(aes(population/10^6, total, col = region)) + 
    geom_text(aes(population/10^6, total, label = abb), nudge_x = 0.5) + 
    scale_x_log10() + 
    scale_y_log10() + 
    xlab("Population in millions (log scale)") + 
    ylab("Total number of murders (log scale)") + 
    ggtitle("US Gun Murders in 2010")

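As we can see, the aesthetic mapping aes(x = population/10^6, y = total) appears three times.

ggplot2 lets us define a global aesthetic mapping in the ggplot() function. Every layer added afterwards inherits it, so a layer only needs its own aes() for aesthetics that the global mapping does not cover or that it wants to change.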

p <- murders |> ggplot(aes(x = population/10^6, y = total)) + 
    geom_point(size = 2) + 
    geom_point(aes(col = region)) + 
    # geom_text still needs a label; x and y are inherited
    geom_text(aes(label = abb), nudge_x = 0.5) + 
    ...

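Much cleaner and more concise. A global aesthetic mapping can of course still be overridden: a mapping passed to an individual layer takes precedence within that layer.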
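A minimal sketch of overriding a global mapping, assuming ggplot2 and dslabs are installed. The coordinates (10, 800) and the annotation text are made up for illustration.

```r
library(ggplot2)
library(dslabs)   # assumed source of the murders dataset
data(murders)

p <- murders |> ggplot(aes(x = population / 10^6, y = total, label = abb)) +
  geom_point(size = 2) +          # inherits x and y
  geom_text(nudge_x = 1.5) +      # inherits x, y, and label
  # a local aes() overrides the global x, y, and label for this layer only
  geom_text(aes(x = 10, y = 800, label = "Hello there!"))
```

Because the last layer still inherits the murders data, the fixed label is drawn once per row. For a single annotation, annotate("text", x = 10, y = 800, label = "Hello there!") avoids the overplotting.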

Visualizing Data Distributions

The most basic statistical summary of a list of objects or numbers is its distribution.

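A scatterplot shows the rough trend of how a dependent variable $y$ in the dataset changes with an independent variable $x$, and is mainly used to analyze the relationship between variables. To describe the distribution of the data we have other options, such as the barplot, eCDF, histogram, and boxplot.

First, let us clarify one concept: the types of variables in a dataset.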

Categorical type. (generally) a small number of groups with many data points in each group.
- Ordinal: ordered values. E.g. Spiciness: (mild, medium, hot).
- Non-ordinal: unordered values. E.g. Region: (west, central, northeast, south).
Numerical type. (generally) a large number of distinct values with few data points at each value.
- Discrete numerics.
- Continuous numerics.

Barplot

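For a categorical variable, a barplot shows, for each group, the number of data points that belong to that group.

Scenario: compare the four US regions (northeast, central, south, west) in the murders dataset. With its default settings geom_bar() counts rows, so the code at the end of this section shows how many states fall in each region.

Note that barplots are meant for categorical variables; they are of little use for numerical ones. Consider the distribution of the heights of all students in a class: since two students rarely have exactly the same height, almost every vertical bar in the corresponding barplot would have height $1$. That is not a very useful data summary.

murders |> ggplot(aes(region)) + geom_bar()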
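The numbers behind a barplot can be sketched in base R as well. The snippet assumes dslabs is installed. The first part reproduces the counts that geom_bar() draws; the second part is one possible way to plot each region's share of total murders instead, using geom_col() on a pre-aggregated table (by_region is a hypothetical name).

```r
library(ggplot2)
library(dslabs)   # assumed source of the murders dataset
data(murders)

# Number of states per region: the bar heights drawn by geom_bar()
counts <- table(murders$region)
counts

# Share of total murders per region, drawn with geom_col(),
# which uses the given y values as bar heights
by_region <- aggregate(total ~ region, data = murders, FUN = sum)
by_region$share <- by_region$total / sum(by_region$total)

p_share <- ggplot(by_region, aes(region, share)) + geom_col()
```

geom_bar() counts rows for us, while geom_col() expects the heights to be computed beforehand.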

eCDF & Histogram

Cumulative distribution functions

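The empirical cumulative distribution function (eCDF) describes the cumulative distribution of a numerical variable $x$, that is, $F(a)=\Pr(x\leq a)$.

$\Pr(x\leq a)$ is the probability that $x\leq a$; for a dataset, we can read it as the proportion of data points that satisfy $x\leq a$.

The eCDF describes the distribution of a numerical variable well, but it is not very intuitive to read.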
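As a small sketch, the eCDF value at a point can be computed directly as a proportion and compared with base R's ecdf(). The snippet assumes dslabs is installed (for the heights dataset); the threshold a = 70 is an arbitrary choice for illustration.

```r
library(ggplot2)
library(dslabs)   # assumed source of the heights dataset
data(heights)

x <- heights$height
a <- 70                      # arbitrary threshold, in inches

F_a <- mean(x <= a)          # proportion of data points with x <= a
Fn  <- ecdf(x)               # base R: returns the eCDF as a function
all.equal(Fn(a), F_a)

# ggplot2 can draw the whole curve with stat_ecdf()
p_ecdf <- heights |> ggplot(aes(height)) + stat_ecdf() + ylab("F(a)")
```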

Histogram

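A histogram is built on the idea of bins: we split the span of the data into equal-width, non-overlapping bins, count the data points that fall into each bin, and use that count as the height of the bar for that bin.

In effect, binning manually creates a categorical variable. A histogram is essentially a barplot drawn with the bin as the categorical variable.

Keep in mind that while a histogram shows the distribution of a numerical variable, it also loses some information about the data: we cannot tell apart data points that fall in the same bin.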

heights |> filter(sex == "Female") |>
           ggplot(aes(height)) +
           geom_histogram(binwidth = 1, fill = "blue", col = "black") +
           xlab("Female heights in inches") +
           ggtitle("Histogram")
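The claim that a histogram is a barplot over bins can be checked with a short sketch. The heights below are made-up values, and the break points are an arbitrary choice.

```r
library(ggplot2)

# Made-up heights in inches
x <- c(61.2, 62.5, 62.9, 63.1, 64.0, 64.4, 64.8, 65.3, 66.7, 68.1)

# Equal-width, non-overlapping bins of the form [a, b)
breaks <- seq(61, 69, by = 1)
bin <- cut(x, breaks = breaks, right = FALSE)   # turns x into a categorical variable

counts <- table(bin)   # number of data points per bin
counts

# The same counts as computed by base R's hist()
h <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)

# A barplot of the bin variable looks like the histogram
# (geom_bar drops the empty bin by default)
p_bins <- ggplot(data.frame(bin = bin), aes(bin)) + geom_bar()
```

Once x has been replaced by bin, the individual values within a bin are no longer distinguishable, which is the information loss mentioned above.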

Smoothed density

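Roughly speaking, connecting the tops of the histogram bars with a smooth curve gives what is called a smooth density plot. Its justification rests on estimation: the curve is an estimate of the distribution the data came from. Compared with a histogram, a smooth density plot makes it easier to compare two distributions visually.

A smooth density plot drops the bars and uses a curve to describe the distribution. This is not only more visually appealing, it also lets us fit a normal distribution to the smooth density plot. That is, for a dataset that is approximately normally distributed, we can approximate its smooth density plot with just the mean $\mu$ and the standard deviation $\sigma$.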

heights |> filter(sex == "Female") |>
           ggplot(aes(height)) +
           geom_density(fill = "blue")
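To illustrate the normal approximation, the sketch below overlays a normal curve with the sample mean and standard deviation on the smooth density. It assumes ggplot2, dplyr, and dslabs are installed; the names female, mu, and sigma are only for illustration.

```r
library(ggplot2)
library(dplyr)
library(dslabs)   # assumed source of the heights dataset
data(heights)

female <- heights |> filter(sex == "Female")

mu    <- mean(female$height)   # sample mean
sigma <- sd(female$height)     # sample standard deviation

p_fit <- female |> ggplot(aes(height)) +
  geom_density(fill = "blue", alpha = 0.3) +
  # normal density with the estimated mean and standard deviation
  stat_function(fun = dnorm, args = list(mean = mu, sd = sigma),
                linetype = "dashed")
```

How closely the dashed curve follows the shaded density gives a visual impression of how well the two numbers $\mu$ and $\sigma$ summarize the data.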

Boxplot

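A boxplot is a five-number summary of the data: the lower limit, the upper limit, and the three quartiles (the 25th, 50th, and 75th percentiles).

A boxplot may also show a number of outliers, which lie outside the range covered by the box and its limits.

Interquartile range (IQR): the difference between the 75th percentile $(Q3)$ and the 25th percentile $(Q1)$, $IQR = Q3 - Q1$.
Lower limit: $Q1 - 1.5 \times IQR$; upper limit: $Q3 + 1.5 \times IQR$.
All data points above the upper limit or below the lower limit are outliers.

Do not confuse the upper and lower limits with the maximum and minimum. The limits are boundaries defined by convention in a boxplot, not specific data points.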
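The quantities behind a boxplot can be worked out by hand. The sketch below uses made-up data and R's default quantile definition; other quantile definitions can give slightly different quartiles.

```r
library(ggplot2)

x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 30)   # made-up data; 30 is an extreme value

q   <- quantile(x, c(0.25, 0.5, 0.75))   # Q1, median, Q3
iqr <- q[["75%"]] - q[["25%"]]           # interquartile range: 4.5

lower <- q[["25%"]] - 1.5 * iqr          # lower limit: -2.5
upper <- q[["75%"]] + 1.5 * iqr          # upper limit: 15.5

outliers <- x[x < lower | x > upper]
outliers                                  # 30

# The corresponding plot
p_box <- ggplot(data.frame(x = x), aes(y = x)) + geom_boxplot()
```

Here the lower limit of -2.5 is below the minimum of 2, which shows that the limits need not coincide with any data point. In geom_boxplot() the whiskers are drawn only as far as the most extreme data points that lie within these limits, and points beyond them are plotted individually.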

Reference

References in the article are from corresponding course materials if not specified.

Course info:

Code: COMP2501, Lecturer: Dr. H.F. Ting.

Course textbook:

Data Analysis and Prediction Algorithms with R - Rafael A. Irizarry.
