The Missing Semester of CS Education, MIT - Version Control

Posted on 2023-07-07

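I have to say, I think Anish is the best lecturer of the three. Nearly every problem he demonstrated in this lecture is a pitfall I fell into when I first started with Git; I wish I had seen it sooner! That said, taking this section with some Git experience already behind me was rewarding too: it gave me a deeper appreciation of the elegant logic underneath Git.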

Anish: While an ugly interface has to be memorized, a beautiful design can be understood.

This article is a self-administered course note.

It will NOT cover any exam or assignment related content.

Git's data model

Git organizes the contents of a top-level Git directory as a tree-shaped hierarchy.

Files (blobs): the leaf nodes at the bottom.
Directories (trees): map names to blobs or trees (so directories can contain other directories).

Snapshots and Commits

A snapshot is the top-level tree that is being tracked. In the end, a "snapshot" is what I used to think of as a "version", and the Git history is a DAG built from these snapshots.

Commits (here they come!) wrap these snapshots and attach extra metadata to them, such as the author, commit time, and commit message.

The relationships between these pieces of Git's data model can be described in pseudocode:

# a file is a bunch of bytes
type blob = array<byte>

# a directory contains named files and directories
type tree = map<string, tree | blob>

# a commit has parents, metadata, and the top-level tree (snapshot)
type commit = struct {
    parents: array<commit>
    author: string
    message: string
    snapshot: tree
}

Objects and Content-Addressing

In Git's underlying storage, all three types (blob, tree, commit) are kinds of object, and every object is referenced by its own SHA-1 hash.

type object = blob | tree | commit

objects = map<string, object>

def store(object):
    id = sha1(object)
    objects[id] = object
    
def load(id):
    return objects[id]

Blobs, trees, and commits are unified in this way: they are all objects. When they reference other objects, they don't actually contain them in their on-disk representation, but have a reference to them by their hash. In Git's data store, all objects are content-addressed by their SHA-1 hash.

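So structures such as map<string, tree | blob> and array<commit> in the pseudocode above store only the addresses of the corresponding objects, and an address is simply the object's SHA-1 hash. Because the address is derived from the content, it is fixed the moment an object is created; changing the content would yield a different address, that is, a different object. This is what makes objects immutable.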
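To make content addressing concrete, here is a minimal Python sketch of the store/load pseudocode above, restricted to blobs. The in-memory `objects` dict and the function names are illustrative assumptions; the one detail taken from real Git is that a blob's id is the SHA-1 of a short `blob <size>\0` header followed by the content.

```python
import hashlib

# In-memory stand-in for Git's object database (hypothetical, for illustration).
objects = {}  # maps 40-character hex id -> raw content bytes

def git_blob_id(content):
    """Hash a blob the way Git does: SHA-1 over a small header plus the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def store(content):
    """Store a blob under its content hash and return that hash."""
    object_id = git_blob_id(content)
    objects[object_id] = content
    return object_id

def load(object_id):
    """Look an object up by its address."""
    return objects[object_id]

first = store(b"git is wonderful")
second = store(b"git is wonderful")   # same content, so same address
other = store(b"git is wonderful!")   # one extra byte, so a different address

print(first == second)   # True
print(first == other)    # False
print(len(objects))      # 2
```

Storing identical content twice yields a single entry, which is why an address can never end up pointing at different content.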

We can see this with the git cat-file -p command.

$ tree .
<root> (tree)
|
+- foo (tree)
|  |
|  +- bar.txt (blob, contents = "hello world")
|
+- baz.txt (blob, contents = "git is wonderful")
$ git log
commit 42fb7a2... (HEAD -> master)
...
$ git cat-file -p 42fb7a2
tree 68aba62...
...
$ git cat-file -p 68aba62
100644 blob 4448adb...    baz.txt
040000 tree c68d233...    foo
$ git cat-file -p 4448adb
git is wonderful

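Here we can see the strict hierarchy and the hash-addressing process: the commit that master points to wraps a snapshot (a tree), which in turn contains one tree and one blob. Once we reach a leaf blob, the file's contents are finally printed.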
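The same walk can be sketched in Python over a toy object store. The shortened ids are copied from the transcript above, except `aaaa111` for bar.txt, which is made up because the transcript never shows it; the `(kind, body)` tuple layout is also an assumption chosen for brevity, not Git's on-disk format.

```python
# Toy object store keyed by shortened ids from the transcript.
# "aaaa111" is a made-up id for bar.txt (not shown in the transcript).
objects = {
    "42fb7a2": ("commit", {"tree": "68aba62"}),
    "68aba62": ("tree", {"baz.txt": "4448adb", "foo": "c68d233"}),
    "c68d233": ("tree", {"bar.txt": "aaaa111"}),
    "4448adb": ("blob", "git is wonderful"),
    "aaaa111": ("blob", "hello world"),
}

def walk(object_id, path=""):
    """Yield (path, contents) for every blob reachable from object_id."""
    kind, body = objects[object_id]
    if kind == "commit":
        # A commit only stores the hash of its top-level tree.
        yield from walk(body["tree"], path)
    elif kind == "tree":
        # A tree maps names to the hashes of its children.
        for name, child_id in sorted(body.items()):
            yield from walk(child_id, path + "/" + name)
    else:
        # A blob is a leaf, so emit its contents.
        yield path, body

for file_path, contents in walk("42fb7a2"):
    print(file_path, "->", contents)
```

Every step is a lookup by hash; no object contains another object directly.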

References

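SHA-1 hashes give every commit a unique address, but they are hard to read: nobody can remember a string of 40 hexadecimal characters. To solve this, Git introduces references: short, human-readable names that act as pointers to hashes.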

references = map<string, string>

# references are mutable
def update_reference(name, id):
    references[name] = id

def read_reference(name):
    return references[name]

def load_reference(name_or_id):
    if name_or_id in references:
        return load(references[name_or_id])
    else:
        return load(name_or_id)

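Note that references are pointers to commits. Unlike objects, references are mutable: we can change which commit they point to. For example, the master reference usually points to the latest commit on the main branch of development, so the commit it points to keeps changing as new commits are made.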

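HEAD is another commonly used reference; it points to where we currently are in the history. By running git checkout with the hash of a target commit as the argument, we can easily change the commit HEAD points to, and so move around in the Git history.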

Finally, we can define what (roughly) a Git repository is: it is the data objects and references.
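As a rough illustration of "objects plus references", the sketch below models a repository as two Python dicts and collects the history reachable from a reference by following parent pointers. All ids and the `history` helper are made up for this example; real `git log` also orders and formats its output, which is omitted here.

```python
# A toy repository: just objects and references (all ids are made up).
objects = {
    "c1": {"parents": [], "message": "initial commit"},
    "c2": {"parents": ["c1"], "message": "add foo"},
    "c3": {"parents": ["c1"], "message": "add bar"},
    "c4": {"parents": ["c2", "c3"], "message": "merge"},
}
references = {"master": "c4", "HEAD": "c4"}

def history(name_or_id):
    """Return every commit id reachable from a reference name or a raw id."""
    start = references.get(name_or_id, name_or_id)
    seen, order, stack = set(), [], [start]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue  # a DAG can reach the same commit along several paths
        seen.add(commit_id)
        order.append(commit_id)
        stack.extend(objects[commit_id]["parents"])
    return order

print(history("master"))
print(history("c2"))
```

The `seen` set matters because c1 is reachable from c4 through both c2 and c3.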

Basic Commands

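As Anish points out at the start, Git's interface is rather ugly and unfriendly to newcomers; its complicated commands are as hard to understand and remember as magic incantations. But by now we have learned the elegant storage model underneath: a DAG of commits.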

On disk, all Git stores is objects and references: that's all there is to Git's data model. All git commands map to some manipulation of the commit DAG by adding objects and adding/updating references. Note that commands never modify existing objects, since objects are immutable (objects that become unreachable may eventually be garbage-collected).

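Whenever we run a git command, it helps to think about what it does to the commit DAG: which new objects will be created and which references will be updated. Conversely, when we want to change the commit DAG in some way, there is very likely a Git command that does it.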
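For instance, committing can be thought of as "add one object, move one reference". The Python sketch below is a simplified model under stated assumptions: the `commit` function, the dict-based record, and the id computed from `repr` are all illustrative and do not match Git's real commit encoding.

```python
import hashlib

objects = {}       # object id -> commit record
references = {}    # branch name -> commit id

def commit(branch, message, snapshot):
    """Add one commit object, then move the branch reference to it."""
    parent = references.get(branch)
    record = {
        "parents": [parent] if parent else [],
        "message": message,
        "snapshot": snapshot,
    }
    # Toy id: SHA-1 of the record's repr, not Git's real commit encoding.
    commit_id = hashlib.sha1(repr(record).encode()).hexdigest()
    objects[commit_id] = record        # a new object is created
    references[branch] = commit_id     # a reference is updated
    return commit_id

first = commit("master", "initial commit", {"baz.txt": "git is wonderful"})
second = commit("master", "edit baz", {"baz.txt": "git is great"})

print(objects[second]["parents"] == [first])   # True
print(references["master"] == second)          # True
```

Nothing already in `objects` is touched by the second commit; only the `master` entry in `references` changes.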

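git help <command>: get help for a git command.
git init: creates a new git repo, with data stored in the .git directory.
git status: shows which files are new or changed since the last commit, and what is staged.
.gitignore: specify intentionally untracked files to ignore.
git add <filename>: adds files to the staging area.
git commit: creates a new commit (adds a new node to the commit DAG). Writing good commit messages is worth the effort.
- git commit --amend: replaces the most recent commit with a new one, changing its contents or commit message.
git stash: temporarily shelves changes made to the working directory. Use git stash pop to restore them.
git log: a very useful command! Shows a flattened log of history.
- git log --all --graph --decorate: visualizes history as a DAG. (The commit DAG itself!)
git diff <filename>: shows the differences between the working directory version of a file and the staged version.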

Note that ~/.gitconfig is Git's configuration file; it is also a classic example of a highly personalized dotfile.

Branching and Merging

The concept of a branch in Git is less intuitive than it looks. Given an abstract line of commits, we give it a name (such as master or bugFix); that name is the branch reference, and it normally points to the latest commit on the line it stands for. So in essence, a branch is just a pointer to a commit.

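Another important concept is the HEAD reference, which differs subtly from a branch reference: HEAD is a symbolic reference to the branch you are currently on, and therefore to the commit you are working on top of.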

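Normally HEAD points to a branch reference (such as master). When you commit, the branch reference moves to the new commit, and because HEAD points to that branch, HEAD follows along.
When HEAD points directly to a specific commit rather than to a branch reference, it is called a detached HEAD. Committing from a detached HEAD is allowed, but the new commits belong to no branch, so they become hard to find once you check out something else unless you create a branch for them.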

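Make sure you understand the relationship between an abstract line of commits and a branch name/reference: the branch name/reference points to the latest commit on that line. When we say HEAD points to a branch, keep two things in mind: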

In terms of workflow, we have switched to one branch of the project to work on.
Under the hood, HEAD points to the branch reference for that branch.
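The difference between an attached and a detached HEAD can be sketched in a few lines of Python. The dict-based `head`, the `checkout` and `commit` helpers, and the commit ids are assumptions made for illustration. The sketch follows actual Git behavior in one respect worth noting: committing on a detached HEAD is permitted, and the new commit is simply not on any branch.

```python
references = {"master": "c2", "bugFix": "c1"}   # made-up commit ids
head = {"ref": "master"}                        # attached: HEAD names a branch

def checkout(target):
    """Point HEAD at a branch name (attached) or at a raw commit id (detached)."""
    global head
    head = {"ref": target} if target in references else {"commit": target}

def head_commit():
    """Resolve HEAD to a commit id, going through the branch if attached."""
    return references[head["ref"]] if "ref" in head else head["commit"]

def commit(new_id):
    """Attached HEAD: the branch reference moves. Detached HEAD: only HEAD moves."""
    global head
    if "ref" in head:
        references[head["ref"]] = new_id
    else:
        head = {"commit": new_id}

commit("c3")       # on master: the branch reference advances to c3
checkout("c1")     # a raw id, so HEAD is now detached
commit("c4")       # allowed, but no branch points at c4

print(references)      # {'master': 'c3', 'bugFix': 'c1'}
print(head_commit())   # c4
```

After the last step, c4 can only be reached through HEAD, which is why such commits are easy to lose track of.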

With these foundations in place, the commands for branching and merging become much easier to understand.

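git branch: lists branches.
git branch <name>: creates a branch at the current commit (HEAD). Note that this does not change which branch HEAD points to.
git checkout <revision>: a very useful but also dangerous command; it changes what HEAD points to and updates the files in the working directory to match that commit, which will very likely change them.
- git checkout <address>: HEAD points to the commit with that hash (detached HEAD).
- git checkout <name>: HEAD points to the named branch.
- git checkout -b <name>: creates a branch and points HEAD at it. Equivalent to git branch <name> followed by git checkout <name>.
- git checkout -- <file>: (unrelated to branching/merging) discards changes to the file in the working directory.
git merge <revision>: merges the given branch into the current branch (the one HEAD points to). Typically we git checkout master first and then merge other branches into it. If the branch being merged is a descendant of master, the merge is a fast-forward: master simply moves forward and no extra commit is created.
- git merge --abort: aborts the merge and restores the state from before it.
git mergetool: uses an external tool (such as vimdiff) to help resolve merge conflicts.
git rebase: replays the commits of one branch on top of another base. Merging work via rebase produces a more linear commit history, even when the work was developed in parallel. See the rebasing examples in the official Git documentation.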
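The fast-forward case of git merge comes down to an ancestry check on the commit DAG. Below is a minimal Python sketch, assuming a made-up history and hypothetical `is_ancestor` and `merge` helpers; it leaves out real merge commits, conflict handling, and the "already up to date" case.

```python
# Made-up history: c1 <- c2 <- c3, plus c4 branching off c1.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c1"]}
references = {"master": "c2", "feature": "c3", "hotfix": "c4"}

def is_ancestor(ancestor, descendant):
    """True if `ancestor` is reachable from `descendant` by following parents."""
    stack, seen = [descendant], set()
    while stack:
        commit_id = stack.pop()
        if commit_id == ancestor:
            return True
        if commit_id not in seen:
            seen.add(commit_id)
            stack.extend(parents[commit_id])
    return False

def merge(current, other):
    """Fast-forward `current` to `other` if possible; otherwise say a merge commit is needed."""
    if is_ancestor(references[current], references[other]):
        references[current] = references[other]   # just move the pointer
        return "fast-forward"
    return "needs merge commit"

print(merge("master", "feature"))   # fast-forward
print(merge("master", "hotfix"))    # needs merge commit
```

In the first call master (c2) is an ancestor of feature (c3), so only the reference moves. In the second, master is now at c3, which is not an ancestor of hotfix (c4), so the histories have diverged.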

Remotes

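All the commands introduced so far work on the local repo. But Git owes its popularity to two things: powerful version control, and the fact that sharing a remote repo makes parallel development possible. The idea of a remote repo is not complicated: it is just a copy of your repo hosted somewhere else, which you can communicate with over the network to send or retrieve commits.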

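Another important concept is the remote branch. Like a local branch, it is essentially a reference, that is, a pointer to a commit. It reflects the state of the remote repo as of the last time you communicated with it.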

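A remote branch is actually a local concept. When you check one out, HEAD automatically becomes detached, because Git does not let you move a remote branch by committing locally. A remote branch changes only when you communicate with the remote repo (for example with git fetch or git push), so it shows the remote repo as it was at the last communication.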

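git remote: lists the remotes configured for this repo.
git remote add <name> <url>: adds a remote to this repo. In Anish's words, after the add the local repo is "aware of" the existence of a remote copy; once the remote's references are known locally (after a push or fetch), git log also shows the commits they point to.
git push <remote> <local branch>:<remote branch>: sends a local branch to the remote repo and updates the remote branch to match. ~~So that is what the magic incantation git push origin master finally means.~~
git branch --set-upstream-to=<remote>/<remote branch>: sets up a correspondence between a local branch and a remote branch, so that a bare git push is expanded automatically according to it. Use git branch -vv to view these correspondences.
git fetch: downloads the commits missing from the local repo and updates the remote branch references. Note that it does not update any local branch, so your local branches and working directory stay as they were.
git pull: fetches updates (downloads missing commits and updates the remote branches) and then merges into the local branch. Equivalent to git fetch; git merge <remote branch>.
git clone <url>: downloads a repo from a remote.
- git clone --depth=1: shallow clone; with this option, clone does not download the full version history, which speeds up the download.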
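The difference between fetch and pull is easiest to see at the level of references. The Python sketch below models only the reference updates, not the download of objects; `remote_refs`, `local_refs`, and the commit ids are made up, and `pull` assumes the simple case where the local branch has no commits of its own, so the merge is a fast-forward.

```python
# Made-up commit ids; "origin/master" is the local remote-tracking reference.
remote_refs = {"master": "c5"}                         # state on the server
local_refs = {"master": "c3", "origin/master": "c3"}   # state in the local repo

def fetch(remote="origin"):
    """Update only the remote-tracking references; local branches stay put."""
    for branch, commit_id in remote_refs.items():
        local_refs[remote + "/" + branch] = commit_id

def pull(branch="master", remote="origin"):
    """Fetch, then fast-forward the local branch (assumes no local-only commits)."""
    fetch(remote)
    local_refs[branch] = local_refs[remote + "/" + branch]

fetch()
after_fetch = dict(local_refs)
pull()
after_pull = dict(local_refs)

print(after_fetch)   # {'master': 'c3', 'origin/master': 'c5'}
print(after_pull)    # {'master': 'c5', 'origin/master': 'c5'}
```

After `fetch` only `origin/master` has moved; the local `master` catches up only when `pull` performs the merge step.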

For a more detailed introduction and some advanced commands, see the Missing Semester lecture notes on version control (Git).

Reference

References in the article are from corresponding course materials if not specified.

Git Scm: Distributed is the New Centralized.
(checked and HIGHLY RECOMMENDED) Learn Git Branch.

Course info:

MIT Open Learning. The Missing Semester of Your CS Education.

Course resource:

The Missing Semester of Your CS Education.
