The Missing Semester of CS Education, MIT - Data Wrangling

An enhanced version of the shell-commands portion of ENGG1340. Through a single running example, the instructor demonstrates the powerful data-processing capabilities of commands like grep, sort, uniq, sed, and awk. Since I had already taught myself a bit of sed and awk, this lecture was quick for me to digest, so these notes will be relatively brief; see here for a more detailed introduction to the sed and awk commands.


  This article is a self-administered course note.

  It will NOT cover any exam or assignment related content.


The entire lecture revolves around processing a sample ssh server log file; with just one line of commands, we can extract valuable information from a massive amount of log data. Combined with a data-processing language such as R, we can even quickly generate all kinds of intuitive statistical plots.

Suppose we only care about users whose connections were disconnected; after piping through grep "Disconnected from", the raw data looks like this:

...
Jan 13 14:59:08 thesquareplanet.com sshd[25047]: Disconnected from invalid user guest 58.26.43.39 port 47900 [preauth]
Jan 13 15:00:14 thesquareplanet.com sshd[25062]: Disconnected from invalid user postgres 190.187.67.67 port 50091 [preauth]
Jan 13 15:00:23 thesquareplanet.com sshd[25065]: Disconnected from authenticating user ftp_test 156.236.72.23 port 54404
...

In the raw data we can observe a large amount of repeated pattern; the sed command can recognize this pattern and substitute away the useless data prefix:

$ ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed 's/.*Disconnected from //'
invalid user guest 58.26.43.39 port 47900 [preauth]
invalid user postgres 190.187.67.67 port 50091 [preauth]
authenticating user ftp_test 156.236.72.23 port 54404

The most basic usage of sed is sed 's/REGEX/SUBSTITUTION/', where REGEX is a regular expression for the target pattern and SUBSTITUTION is the text that replaces whatever the pattern matches. Note that the syntax of sed's s command closely resembles vim's search-and-replace syntax.
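A minimal illustration of the s command on toy input (not from the log):

```shell
# s/REGEX/SUBSTITUTION/ replaces the first match on each line
echo 'hello world, goodbye world' | sed 's/world/shell/'
# → hello shell, goodbye world

# the trailing g flag replaces every match on the line
echo 'hello world, goodbye world' | sed 's/world/shell/g'
# → hello shell, goodbye shell
```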


Basic regex for sed

Regular expressions are usually (though not always) surrounded by /. Most ASCII characters just carry their normal meaning, but some characters have special matching behavior.

  • . means "any single character" except newline.
  • * zero or more of the preceding match.
  • + one or more of the preceding match.
  • ? zero or one of the preceding match.
  • [abc] any one character of a, b, and c.
  • (RX1|RX2) either something that matches RX1 or RX2.
  • ^ the start of the line.
  • $ the end of the line.

sed's regular expressions are somewhat weird, and will require you to put a \ before most of these to give them their special meaning. Or you can pass -E.
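A quick sketch of the difference on toy input:

```shell
# default (BRE): + must be escaped to mean "one or more"
echo 'aaab' | sed 's/a\+/A/'        # → Ab
# with -E (ERE): + is special without the backslash
echo 'aaab' | sed -E 's/a+/A/'      # → Ab
# in BRE an unescaped + just matches a literal "+"
echo 'a+b' | sed 's/a+/X/'          # → Xb
```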

Returning to the earlier example /.*Disconnected from /, we find that when the username itself is Disconnected from, the result is not what we expect:

$ cat malicious_login
Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]
$ cat malicious_login | sed 's/.*Disconnected from //'
46.97.239.16 port 55920 [preauth]

The username we wanted to keep was substituted away to nothing. This is because sed's matching strategy is greedy: it always matches as much text as possible. In some regex implementations, appending a ? suffix to * or + switches matching to non-greedy, for example in perl's command-line mode:

perl -pe 's/.*?Disconnected from //'

Unfortunately, sed does not support this syntax. To work around the problem, and also in preparation for extracting the username later, we make sed's regular expression match the entire line:

| sed -E 's/.*Disconnected from (invalid |authenticating )?user .* [0-9.]+ port [0-9]+( \[preauth\])?$//' 

The remaining question is how to keep the data we actually need, the username, out of the matched text. The solution is the regex "capture group" feature.

Any text matched by a regex surrounded by parentheses () is stored in a numbered capture group. These are available in the substitution as \1, \2, \3, etc.

After "capturing" the username with parentheses (), we get exactly the data we need. Note that the capture group (.*) corresponding to the username is the second capture group in the whole regular expression, so we refer to it as \2.
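A minimal capture-group sketch on made-up input:

```shell
# \1 and \2 refer back to the first and second parenthesized groups
echo 'John Smith' | sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2, \1/'
# → Smith, John
```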

$ ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [0-9.]+ port [0-9]+( \[preauth\])?$/\2/'
...
guest
postgres
ftp_test
...


Further Data Wrangling

After extracting the usernames, we can wrangle the data further. The most common tools for this are the sort, uniq, head, and tail commands. The command below extracts the 10 users with the most logins and outputs each alongside its login count.
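The counting idiom at the heart of that pipeline, shown first on toy input (uniq -c only collapses adjacent duplicates, hence the sort before it; the trailing awk just normalizes uniq's leading spaces):

```shell
printf 'b\na\nb\nb\na\nc\n' \
  | sort | uniq -c \
  | sort -nk1,1 | tail -n2 \
  | awk '{print $1, $2}'
# → 2 a
# → 3 b
```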

$ ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [0-9.]+ port [0-9]+( \[preauth\])?$/\2/'
| sort | uniq -c
| sort -nk1,1 | tail -n10
1115 123
1141 ftpuser
1415 oracle
1444 postgres
1561 ubuntu
1951 user
3326 admin
3345 test
3782 123456
10892 root

The data is now in a fairly clean format: two columns, the first being the login count and the second the corresponding username. For data presented in columns like this, the awk command makes processing easy. For instance:

  • Output all usernames merged into a single line.
  • Output the number of users who logged in only once and whose username starts with c and ends with e.

Note a small awk feature that my earlier self-study didn't cover: regex matching can be done inside an awk script. The ~ operator matches a field against a regular expression enclosed in //; the expression is true if the field matches, and false otherwise.
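A toy sketch of the ~ operator (made-up input):

```shell
# print the second field of every line whose first field starts with "c"
printf 'cat tiger\ndog wolf\ncow bison\n' \
  | awk '$1 ~ /^c/ { print $2 }'
# → tiger
# → bison
```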

This also corrects one of my earlier misconceptions: in awk, BEGIN is a pattern that matches the start of the input (and END matches the end). It is not, as I used to think, a command executed before matching begins or after it ends.

$ ... | awk '{print $2}' | paste -sd,
123,ftpuser,oracle,postgres,ubuntu,user,admin,test,123456,root
$ ... | awk '$1 == 1 && $2 ~ /^c.*e$/ { print $2 }' | wc -l
1302
$ ... | awk 'BEGIN {rows = 0} $1 == 1 && $2 ~ /^c.*e$/ { rows++ } END {print rows}'
1302

The paste command merges files column by column, side by side. The -s flag instead serializes all lines of the input into a single row, joined by the delimiter given to -d.
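A minimal sketch of -s and -d:

```shell
# -s joins all input lines into one row; -d sets the joining delimiter
printf '1\n2\n3\n' | paste -sd+
# → 1+2+3
seq 3 | paste -sd,
# → 1,2,3
```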


Data Analysis

After this initial wrangling, more commands are available for further data analysis:

In the shell, the bc command (an arbitrary-precision calculator language) can perform mathematical computation directly. For example, we can use the paste command to join the numbers on each line with + signs and feed the resulting expression straight into bc:

$ ... | awk '{print $1}' | paste -sd+ | bc -l
100203
$ echo "2*($(data | paste -sd+))" | bc -l
200406

Use the st tool, R's command-line mode, or the gnuplot command to produce statistics and plots from the data:

$ ... | awk '{print $1}'
| R --no-echo -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'

$ ... | sort -nk1,1 | tail -n10
| gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'


Exercises

Find the number of words (in /usr/share/dict/words) that contain at least three as and don't have a 's ending. What are the three most common last two letters of the words? How many of those two-letter combinations are there? sed's y command, or the tr program, may help you with case insensitivity.

# the number of words that contain at least 3 a's and don't end in 's
cat words | tr "[:upper:]" "[:lower:]" | grep -E "^(.*a){3}" | grep -vE "'s$" | wc -l
# the three most common last two letters
cat words | tr "[:upper:]" "[:lower:]"
| sed -E 's/.*([^ ]{2})$/\1/' | sort | uniq -c | sort -nk1,1 | tail -n3
| awk '{print $2}'
# the number of two-letter combinations
cat words | tr "[:upper:]" "[:lower:]"
| sed -E 's/.*([^ ]{2})$/\1/' | sort | uniq | wc -l

To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.

sed s/REGEX/SUBSTITUTION/ input.txt > input.txt will leave a blank file, since the shell truncates input.txt (the target of the output redirection >) before the command on the left-hand side even runs. Because the two file names coincide, sed reads an already-empty input.txt and outputs nothing.

It is generally a bad idea to edit files in place: mistakes are easy to make with regular expressions, and without a backup of the original file they can be very difficult to recover from. This is not particular to sed but applies to all forms of in-place editing. To accomplish in-place substitution, pass the -i flag (and supply a suffix, e.g. sed -i.bak, to keep a backup).
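A small sketch using a throwaway file (GNU sed syntax; BSD/macOS sed wants the backup suffix as a separate argument):

```shell
# throwaway file for illustration
printf 'foo\nfoo bar\n' > /tmp/demo_inplace.txt

# -i edits the file in place; the .bak suffix keeps a backup of the original
sed -i.bak 's/foo/baz/' /tmp/demo_inplace.txt

cat /tmp/demo_inplace.txt       # now contains: baz / baz bar
cat /tmp/demo_inplace.txt.bak   # the original: foo / foo bar
```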


Reference

  This article is a self-administered course note.

  References in the article are from corresponding course materials if not specified.

Course info:

MIT Open Learning. The Missing Semester of Your CS Education.

Course resource:

The Missing Semester of Your CS Education.
