计算机视觉小记

Posted on 2025-02-22 繁/简： set

这篇笔记我打算基于 tutorial 做一点 problem-based 的总结。P.S.，Kenneth 课教的好，人长得帅，还有一件 NERV 员工服！计算机视觉又是一个相当「EVA」的学科，已经可以想象他戴着墨镜做出碇司令的经典手势了。

This article is a self-administered course note.

It will NOT cover any exam or assignment related content.

Digital Image Processing

Key point 1: Spatial v.s. Gray-level resolution

Sampling \(\to\) Spatial resolution (# pixels) \(\to\) Insufficient samples: sampling checkerboards
Quantization \(\to\) Gray-level resolution (# intensity values) \(\to\) Insufficient gray levels: false contouring

Key point 2: Adjacency & Distance

4-邻接，8-邻接与 m(ixed)-邻接 [不能形成三角形]。

Euclidean 距离 (几何距)，City-clock 距离 (\(|x_1-x_2|+|y_1-y_2|\))，Chessboard 距离 (\(\max(|x_1-x_2|,|y_1-y_2|)\))，\(D_m\) 距离 (m-path 距离)。

Key point 3: Convolution & Filters

Filtering 本质上是卷积，因此满足交换律与结合律。

Linear spatial filtering:

Smoothing Filters — blurring, noise reduction. 系数之和为 1，plain 区域在 smoothing 后保持原状
Sharpening Filters — highlight fine details. 系数之和为 0，plain 区域后在 sharpening 之后消失

Order-statistics filtering (一般来说 non-linear，因此不满足卷积的交换/结合律):

Median filters. Reduce salt-and-pepper noise.
Max filters. Reduce pepper noise (黑点为 0)
Min filters. Reduce salt noise (白点为 1)
Midpoint filters (\(\frac{1}{2}[\max+\min]\)). Reduce Gaussian noise or uniform noise.

Key point 4: Color models

RGB model，full-color image 的 pixel depth 为 24，一共有 \(2^{24}\) 种不同颜色。

YIQ model，Y 分量是辉度 (luminance) 分量，I 与 Q 是色度 (chrominance) 分量。I 与 Q 分量的系数之和均为 0，这是为了防止其捕捉到任何辉度信息 (\(R=G=B\) 时 pixel 所显示的是辉度)。

Feature Extraction

Keypoint 0:

首先注意一个 convention，图像矩阵以左上角为原点 \((0,0)\)，横轴为 \(x\) 轴，纵轴为 \(y\) 轴。但 numpy 矩阵还是以传统的先行再列的形式访问，因此坐标为 \((x,y)\) 的像素在代码中应该写成 img[y][x]。

Key point 1: Canny edge detection

Canny 边缘检测算法的步骤如下 (2D 边缘检测)：

smooth the image \(I\) by convolving with a 2D Gaussian kernel \(S=G_{\sigma}*I\)
find the gradient of the smoothed image \(\nabla S\)
non-maximal suppression: only selects edgels where \(||\nabla S||\) is greater than local values of \(||\nabla S||\) in the direction of \(\pm \nabla S\) (如果没有这一步，生成的边缘将会很厚。这一步相当于把边缘细化到脊线 ridge 上)
threshold the edgels: only strong edgels with \(||\nabla S||\) above a certain value are retained
hysteresis: some weak edges are revived if they span the gaps between some strong edgels (保证边缘的连续性从而留下一些弱边缘)

重要参数 \(\sigma\) (3.feature extraction pp.19 网格图)：

fine-scale Gaussian kernel (较小 \(\sigma\))：检测到更多，更细微的边缘，对 noise 敏感
large-scale Gaussian kernel (较大 \(\sigma\))：suppress finer details

Aperture problem (孔径问题)：为了分析图像的 motion 特征，我们还需要 corner detection 角点检测！

edge 究竟朝哪个方向移动呢？如果可以看到 endpoint 我们就能确定了！

Key point 2: Harris corner detection

Harris 角点的定义：若局部窗口包含角点，则该窗口沿任意方向移动都有灰度跳变。(平坦区域：任意方向移动均无灰度跳变；边缘：垂直于边缘方向移动有灰度跳变)

以上定义的数学描述是图像窗口在向方向 \((u,v)\) 平移时的平方强度变化 squared change in intensity (这里的 \(w\) 一般是 2D 高斯滤波核，以分配更多注意给中心区域)： \[ E(u,v)=\sum_{x,y}w(x,y)[I(x+u,y+v)-I(x,y)]^2 \] 经泰勒展开和矩阵分析将其近似为： \[ \begin{aligned} E(u,v)&\approx \begin{bmatrix}u&v\end{bmatrix}M\begin{bmatrix} u \\ v\end{bmatrix} \\ \text{where }M&=\sum_{x,y} w(x,y) \begin{bmatrix}I_x^2 & I_xI_y \\I_xI_y & I_y^2\end{bmatrix} \end{aligned} \] 根据基本特征值理论有 \(\lambda_1\leq E(u,v) \leq \lambda_2\)，其中 \(\lambda_1,\lambda_2\) 分别是 \(M\) 的最小与最大特征值，表示窗口向各个方向移动获得的最小与最大灰度变化。据此有：

case 0: \(\lambda_1\approx \lambda_2\approx 0\): 窗口移动灰度基本不变，为 featureless region
case 1: \(\lambda_1\approx 0\), \(\lambda_2\) is large: 很显然这是边缘，\(\lambda_1\) 为平行边缘方向，\(\lambda_2\) 为垂直边缘方向
case 2: \(\lambda_1\), \(\lambda_2\) are both large and distinct: 这是角点！

注意到 \(\lambda_1\lambda_2=\det(A), \lambda_1+\lambda_2=\text{trace}(A)\)，基于此定义角点响应函数 cornerness function \(R\)。Harris 推荐将 \(k\) 设为 \(0.04\sim 0.06\)。 \[ R=\det(M)-k\cdot \text{trace}(M)^2 \] 数学 intuition 这样就差不多了，以下是 Harris 角点检测算法的步骤：

convert color image to grayscale image
compute gradient along \(x\), \(y\) axis at each pixel \(I_x,I_y\)
form images of \(I_x^2,I_y^2\) and \(I_xI_y\), then smooth them with 2D Gaussian kernel
form image of the cornerness function \(R\)
locate local maxima in the image of \(R\) as corners
compute the coordinates of the corners up to sub-pixel accuracy by quadratic approximation (二次曲线近似使得角点的位置精确到次像素程度，同时记得要更新该位置上的 \(R\) 值！痛失 10 分)
threshold the strong corners

代码如下。虽然背后的数学原理有点难，但实现起来并不复杂 [等 assignment 2 due 了以后再放出]：

1	# To be released: after Mar.5

Perspective & Projective Camera

所谓 camera projection (成像投射) 指的是把世界坐标系中的某个 3D 点投射到像素坐标系中的某个 2D 点的过程。

Rigid Body Motion. World coordinates \([X, Y, Z]\) to camera coordinates \([X_c,Y_c,Z_c]\)
Perspective Projection. Camera coordinates \([X_c,Y_c,Z_c]\) to image plane coordinates \([x, y]\)
CCD Imaging. Image plane coordinates \([x, y]\) to pixel coordinate \([u,v]\)

为 represent points at infinity (vanishing points)，采用 homogenous coordinates \([X,Y,Z]\to [sX,sY,sZ, s]\)。\(s=0\) 时，表示该点在 \([X,Y,Z]\) 方向趋于无限。

Given points \(\tilde{\mathbf{x_0}}, \tilde{\mathbf{x_1}}\) in homogenous coord, the line passing through is \(\mathbf{I}=\tilde{\mathbf{x_0}}\times \tilde{\mathbf{x_1}}\)
Given lines \(\mathbf{I_0}, \mathbf{I_1}\), the intersection point is \(\tilde{\mathbf{x}}=\mathbf{I_0}\times \mathbf{I_1}\)

Step 1: Rigid Body Motion. \[ \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}=\begin{bmatrix} \mathbf{R} & \mathbf{T} \\ \mathbf{0} & 1\end{bmatrix}\begin{bmatrix}X \\ Y \\ Z \\ 1\end{bmatrix}=\begin{bmatrix} -\mathbf{v_x^T}- & T_x \\ -\mathbf{v_y^T}- & T_y \\ -\mathbf{v_z^T}- & T_z \\ -0- & 1 \end{bmatrix}\begin{bmatrix}X \\ Y \\ Z \\ 1\end{bmatrix} \] 其中 \(\mathbf{R}\) 是 3X3 rotation matrix，每行分别是 camera coordsys 中 \(x,y,z\) 方向的 unit vector 在 world coordsys 下的表示。而 \(\mathbf{T}\) 是 3X1 translation matrix，它可以看成是从 camera coordsys 中心到 world coordsys 中心的向量。

Step 2: Perspective Projection.

根据相似三角形推出的重要结论：The scaling is \(f/Z_0\): \(f\) is focal length, \(Z_0\) is the distance from the optical center (usually reflected in \(Z\)-axis).

例：给定相机坐标中的一条直线 \(\mathbf{X}_c=\mathbf{a}+\lambda\mathbf{b}=[a_x,a_y,a_z]^T+\lambda[b_x,b_y,b_z]^T\)，求它的 vanishing point。

Line on the image plane: \(\mathbf{x}=f(\frac{a_x+\lambda b_x}{a_z+\lambda b_z}, \frac{a_y+\lambda b_y}{a_z+\lambda b_z})\). As \(\lambda \to \infty\), the point \(X_c\) moves further down the line, and \(\mathbf{x}\) converges to the vanish point \(\mathbf{v}=\lim_{\lambda\to\infty}f(...)=f(\frac{b_x}{b_z}, \frac{b_y}{b_z})\).
As expected, the vanishing point depends only on the line's orientation but not no its position. When \(b_z=0\), the line is parallel to the image plane and the vanishing point is at infinity.

在 camera projection matrix 中，perspective projection 表现为： \[ \begin{bmatrix}sx \\ sy \\ s\end{bmatrix}=\begin{bmatrix}f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix}X_c \\ Y_c \\ Z_c \\ 1\end{bmatrix} \] 本质上就是 \(x=fX_c/Z_c, y=fY_c/Z_c\)。

Step 3. CCD imaging.

将 image plane 上的 2D 点 \(\mathbf{x}=(x,y)\) 转化为像素坐标 \(\mathbf{w}=(u,v)\)。 \[ \begin{aligned} u &= k_ux + u_0 \\ v &= k_vy + v_0 \\ \\ \text{i.e., }\begin{bmatrix}su\\sv\\ s\end{bmatrix}&=\begin{bmatrix}k_u & 0 & u_0 \\ 0 & k_v & v_0 \\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} sx \\ sy \\ s\end{bmatrix} \end{aligned} \]

\((u_0,v_0)\) 是 principal point，它是 optical axis 与 CDD array 的交点 (即 image plane 原点的像素坐标表示)。
\(k_u,k_v\) 是 scale factors，分别是在 \(u\) (i.e., \(x\)) 和 \(v\) (i.e., \(y\)) 方向上单位长度的像素数。

Key point 1. Full perspective camera model (10 DOFs). \[ \begin{aligned} \mathbf{P}_{ps}&=\mathbf{P}_c\mathbf{P}_p\mathbf{P}_r \\ &= \begin{bmatrix}k_u & 0 & u_0 \\ 0 & k_v & v_0 \\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix}\mathbf{R} & \mathbf{T} \\ \mathbf{0}_3^{T} & 1\end{bmatrix} \\ &=\begin{bmatrix}k_u & 0 & u_0 \\ 0 & k_v & v_0 \\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix}\mathbf{R} & \mathbf{T} \\ \mathbf{0}_3^{T} & 1\end{bmatrix} \\ &= \begin{bmatrix}\alpha_u & 0 & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}\mathbf{R} & \mathbf{T}\end{bmatrix} \\ &= \mathbf{K} \begin{bmatrix}\mathbf{R} & \mathbf{T}\end{bmatrix} \end{aligned} \] 这里 \(\alpha_u=fk_u,\alpha_v=fk_v,u_0,v_0\) 描述的是相机的 intrinsic parameters，其余的 (rotation/translation matrix) 描述的是 extrinsic parameters。

Camera calibration matrix \(\mathbf{K}\). 4 DOFs: \(k_u,k_v,\alpha_u,\alpha_v\). Focal length \(f\) and pixel scaling factors \(k_u,k_v\) are not independent in a way that they cannot be uniquely separated based solely on the resulting image.
Rotation matrix \(\mathbf{R}\). 3 DOFs: roll, pitch, yaw. 3x3 elements, but 6 non-linear constraints are imposed (orthogonality \(\mathbf{R^T R}=I\)).
Translation matrix T. 3 DOFs.

Key point 2. Projective camera model (11 DOFs).

Projective camera 是一个 general 3x4 matrix \(\mathbf{P}\)，它没有任何 constraints。由于 \(\mathbf{P}\) 的 overall scale 对 camera projection 的结果无影响，不妨将 \(p_{23}\) 设为 \(1\)，剩余的元素贡献了 11 DOFs。
\[ \begin{bmatrix}su \\ sv \\ s\end{bmatrix}=\begin{bmatrix}p_{00} & p_{01} & p_{02} & p_{03} \\ p_{10} & p_{11} & p_{12} & p_{13} \\ p_{20} & p_{21} & p_{22} & p_{23}\end{bmatrix}\begin{bmatrix}\lambda X \\ \lambda Y \\ \lambda Z \\ \lambda\end{bmatrix} \] Perspective camera 是 Projective camera 的一个特例。

给定 projection matrix \(\mathbf{P}\)，\((X_d,Y_d,Z_d)\) 方向上 vanishing point 的 2D 像素坐标是 \(\tilde{\mathbf{v}}=\mathbf{P}\begin{bmatrix}X_d & Y_d & Z_d & 0\end{bmatrix}^T\) 。

Key point 3. Planar/Linear rigid body transformation. 去除多余维度对应的列即可。

Weak-Perspective & Affine Camera

Key point 1. Weak-Perspectie projection.

经过 perspective projection 后，现实世界中的平行线在图像中通常会相交于 vanishing point。然而在某些图像中，平行性质以及平行线之间的 length ratios 依然得以保持。这是因为场景中物体的深度变化量 \(ΔZ\) 相较于物体到相机的平均距离\(Z_{avg}\) 而言非常微小。

对于具有以上性质的图像，我们可以使用 weak-perspective projection 来 model。回忆 perspective projection 中的 scaling \(f/Z_0\)，每个 3D 世界坐标都有彼此各异的 \(Z_0\)；weak-perspective camera 将其统一用 \(Z_{avg}\) 进行替代。 \[ \begin{bmatrix} sx \\ sy \\ s\end{bmatrix}=\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 0 & Z_{avg}\end{bmatrix}\begin{bmatrix}X_c\\ Y_c\\ Z_c \\ 1\end{bmatrix} \] 其余步骤 (rigid body motion, CCD imaging) 没有变化 \(\mathbf{P}_{\text{wp}}=\mathbf{P}_\text{c}\mathbf{P}_{\text{pll}}\mathbf{P}_\text{r}\)。

Key point 2. Image error introduced by WP approximation. \[ [u,v]-[u',v']=([u,v]-[u_0,v_0])\frac{\Delta Z}{Z_{\text{avg}}} \]

\(\Delta Z<<Z_{\text{avg}}\)
image points are closer to the principal point (i.e. \(u-u_0\approx v-v_0\approx 0\))

Step 3. Affine camera.

与 projective camera 之于 perspective camera 类似，affine camera 是 weak-perspective camera 的 non-linear constraints 被 relax 过后的更一般的模型。 \[ \mathbf{P}_{\text{aff}}=\begin{bmatrix}p_{00} & p_{01} & p_{02} & p_{03} \\ p_{10} & p_{11} & p_{12} & p_{13} \\ 0 & 0 & 0 & p_{23} \end{bmatrix} \] 不妨将 overall scale \(p_{23}\) 设为 1，affine camera 有 8 DOFs。Planar affine camera 去除第三列，有 6 DOFs。

Key point 4. Planar scene invariants (measurement which does not change across different viewpoints of the same object). [chapter 6 pp.18]

Euclidean camera (3 DOFs: rotationX1, translationX2)
similarity camera (4 DOFs: +scalingX1)
affine camera (6 DOFs: +stretchingX1, shearingX1)
projective camera (completely unrestricted 8DOFs: +equation of the vanishing lineX2)

一个重要的 projective invariant: cross-ratio - the ratio of two ratios of lengths.

Camera Calibration

相机标定是一种根据相机拍出的图像确定相机内外参数的过程。常常使用标定板 (calibaration grid) 进行辅助。

相机内参数：camera calibration matrix \(\mathbf{K}\)，包括焦距，缩放参数等
相机外参数：相机的位置 (position)，朝向 (orientation) 等等

考虑一个 projective camera： \[ \begin{bmatrix}s_iu_i \\ s_iv_i \\ s_i\end{bmatrix}=\begin{bmatrix}p_{00} & p_{01} & p_{02} & p_{03} \\ p_{10} & p_{11} & p_{12} & p_{13} \\ p_{20} & p_{21} & p_{22} & p_{23}\end{bmatrix}\begin{bmatrix}X_i\\Y_i\\Z_i\\1\end{bmatrix} \]

方程组共有 \(3\times 4=12\) 个未知数
考虑 2D 坐标点 \((u, v)\) 与其对应的 3D 世界坐标点 \((X, Y, Z)\)，每对这样的点对可以贡献两组方程：
- \(Xp_{00}+Yp_{01}+Zp_{02}+p_{03}-uXp_{20}-uYp_{21}-uZp_{22}-up_{23}=0\)
- \(Xp_{10}+Yp_{11}+Zp_{12}+p_{13}-vXp_{20}-vYp_{21}-vZp_{22}-vp_{23}=0\)
因此，解 projective matrix 至少需要 \(12/2=6\) 对点对。

点对的选取：标定板的特征 (corners, edges 等)。好处 1) 特征点易求，稳定 2) 很容易建立 3D 世界坐标系。

将这一系列方程组写成矩阵形式 \(\mathbf{Ap}=0\) [chap.7 camera calibration pp. 6]，问题转化为求矩阵 \(\mathbf{A}\) 的零空间 \(\mathbf{p}\)。然而，由于相机固有误差的存在，\(\mathbf{p}\) 不一定有解 (这也是为什么是「至少」6 对点对；点对越多，误差越小)。

确定性问题转化为最优化问题 Linear Least Squares：考虑带残差 \(\mathbf{r}\) 的方程组 \(\mathbf{Ap}=\mathbf{r}\)，求 \(\mathbf{p}\) 使得 \(||\mathbf{r}||^2\) 最小。代码如下 (注意，这里我们令 scale \(p_{23}=1\)，方程组仅有 11 个未知数，且不再是齐次的；残差变为 \(\mathbf{Ap}-\mathbf{b}=\mathbf{r}\))：

def calibrate3D(ref3D, ref2D):
    n = ref3D.shape[0]
    A = []
    b = []
    for i in range(n):
        x, y, z = ref3D[i, 0], ref3D[i, 1], ref3D[i, 2]
        u, v = ref2D[i, 0], ref2D[i, 1]

        A.append([x, y, z, 1, 0, 0, 0, 0, -u*x, -u*y, -u*z])
        A.append([0, 0, 0, 0, x, y, z, 1, -v*x, -v*y, -v*z])
        b.extend([u, v])
    
    A, b = np.array(A), np.array(b)

    p = lstsq(A, b, rcond=None)[0]
    P = np.append(p, 1).reshape(3, 4)    # add scaling factor back

    return P

将解得的 \(\mathbf{p}\) reshape 成 \(3\times 4\) 的 projective matrix \(\mathbf{P}\)。下一步，我们将其 decompose 成 \(\mathbf{K}\begin{bmatrix}\mathbf{R} & \mathbf{T}\end{bmatrix}\)。观察到 \(\mathbf{K}\) 是一个 3X3 上三角矩阵，\(\mathbf{R}\) 是一个 3X3 正交矩阵，而剩下的 \(\mathbf{T}\) 是 3X1 矩阵。自然想到 QR decomposition： \[ \begin{aligned} \begin{bmatrix}\mathbf{p}_0 & \mathbf{p}_1 & \mathbf{p}_2\end{bmatrix}^{-1}&=\mathbf{Q}\mathbf{R}' \\ \begin{bmatrix}\mathbf{p}_0 & \mathbf{p}_1 &\mathbf{p}_2\end{bmatrix} &=(\mathbf{Q}\mathbf{R}')^{-1} \\ &= \mathbf{R}'^{-1}\mathbf{Q}^{-1} \\ &= \mathbf{KR} \end{aligned} \] 性质：上三角矩阵的逆还是上三角矩阵 => \(\mathbf{K}=\mathbf{R}'^{-1}\)；正交矩阵的逆是它的转置，且结果同样是正交矩阵 => \(\mathbf{R}=\mathbf{Q}^{T}\). 根据 \(\mathbf{KT}=\mathbf{p}_3\)，有 \(\mathbf{T}=\mathbf{K}^{-1}\mathbf{p}_3\)。最后对 \(\mathbf{K}\) 做 normalization：令 \(k_{23}=1\)，且保证 \(k_{22}\) 与 \(k_{11}\) 均为正。

def decompose_P(P) :
    M = P[:, :3]
    Q_, R_ = qr(inv(M))

    # obtain K as the inverse of R
    K = inv(R_)
    R = Q_.T

    T = R_ @ P[:, 3]
    RT = np.hstack((R, T.reshape(-1, 1)))

    # normalize K
    K /= K[2, 2]
    if K[0, 0] < 0:
        K[:, 0] *= -1
        RT[0, :] *= -1
    if K[1, 1] < 0:
        K[:, 1] *= -1
        RT[1, :] *= -1

    return K, RT

Stereo Vision

Key point 1. 一维与二维还原。

如果相机的各项参数已知，我们可以把线/平面上的像素坐标还原到对应的世界坐标上。

一维 (line) projective matrix \(\mathbf{P}_p\) 是 3X2 的。\(X=p_{01}-up_{21}/up_{20}-p_{00}\) 或 \(X=p_{11}-vp_{21}/vp_{20}-p_{10}\)。

二维 (planar) projective matrix \(\mathbf{P}_p\) 是 3X3 的。求 \(\mathbf{Q}=\mathbf{P}^{-1}_p\) 后整理为 \(X=q_{00}u+q_{01}v+q_{02}/q_{20}u+q_{21}v+q_{22}\), \(Y=q_{10}u+q_{11}v+q_{12}/q_{20}u+q_{21}v+q_{22}\)。

Key point 2. 三维还原。

考虑一般的三维 projective matrix \(\mathbf{P}_p\)，它是 3X4 的。将 camera calibration 中每对点对提供的两个方程改写成关于 \(X,Y,Z\) 的形式： \[ \begin{aligned} a_u X+b_u Y+c_u Z+d_u &= 0 \\ a_v X+b_v Y + c_vZ+d_v &= 0 \end{aligned} \] 以上两个方程都各自在世界坐标系中定义了一个平面，而像素坐标 \((u,v)\) 所对应的世界坐标 \((X_0,Y_0,Z_0)\) 一定位于这两个平面的交线 上。由于方程数量 (2) 小于未知数数量 (3)，想要将 2D 像素坐标还原为 3D 世界坐标，仅仅一台标定好的相机是不够的。

如果我们有两台标定好的相机 (或者一台相机在两个角度)，那么 \((X_0,Y_0,Z_0)\) 就可以被两条交线的交点确定 (数学解释：方程数量 4 大于未知数数量 3)。这被称为双目立体视觉 (stereo vision)，与人眼的视觉原理一致。

3D 世界坐标点：\(\mathbf{X}=(X,Y,Z)\)
3D 相机坐标点：\(\mathbf{X}_c=(X_c,Y_c,Z_c)\)
像素图像坐标点：\((u,v)\)
(3D 相机坐标系下) 图像坐标点 \(\mathbf{x}=(x, y, f)\)，其中 \(x=(u-u_0)/k_u\), \(y=(v-v_0)/k_v\)
射线向量 (ray vector) \(\mathbf{p}=\vec{O_c\mathbf{x}}=\begin{bmatrix}x & y & f\end{bmatrix}^T\)

可以发现 \(\mathbf{p}\) 与 \(\mathbf{X}_c\) 被未知的 \(Z_c\) 相联系： \[ \mathbf{p}=\begin{bmatrix}x \\ y \\ f\end{bmatrix}=\begin{bmatrix}fX_c/Z_c \\ fY_c/Z_c \\ fZ_c/Z_c\end{bmatrix}=\frac{f}{Z_c}\mathbf{X}_c \] 第二台相机有相同的表现：\(\mathbf{p}'=f'\mathbf{X}_c'/Z_c'\)。

假设这两个相机坐标系被已知的旋转矩阵 \(\mathbf{R}\) 与变换矩阵 \(\mathbf{T}\) 相联系，有 \(\mathbf{X}_c'=\mathbf{RX}_c+\mathbf{T}\)。利用 \(\mathbf{X}'_c\) 与 \(\mathbf{p}'\) 的平行关系可以得到 stereo vision 中的核心方程 triangulation equation： \[ \begin{aligned} \mathbf{X}_c'\times\mathbf{p}'&=0 \\ (\mathbf{RX}_c+\mathbf{T})\times \mathbf{p}'&= 0 \\ (\frac{Z_c}{f}\mathbf{Rp}+\mathbf{T})\times\mathbf{p}'&=0 \end{aligned} \]

据此还原出未知深度 \(Z_c\)。如果两部相机并未对应到同一个世界坐标点上，两条射线不会相交，且方程组 inconsistent。

Key point 3. Affine Stereo (Chap.9 pp. 27 - 31).

Essential Matrix

对极几何 (epipolar geometry) 是双目立体视觉中的一种几何关系。基于 perspective camera 的假设，这一系列几何关系导致了各个图像点之间的约束。

对极平面 (epipolar plane)：\(\mathbf{X}\) 与两台相机的光心 (optical center) 组成的平面。
基线 (baseline)：连接两光心的直线。
对极点 (epipole)：基线与成像平面的交点。
对极线 (epipolar line)：对极平面与成像平面的交线。对极点 \(\mathbf{e}\) 与 \(\mathbf{X}\) 的图像坐标点 \(\mathbf{x}\) 都在对极线上。

对极约束 (epipolar constraint) 指的是在双目立体视觉中，已知某点 \(\mathbf{x}\) 在第一张图像上的位置；那么第二张图像上的对应点 \(\mathbf{x}'\) 一定位于对极线上。这一约束的数学表达在本质矩阵 (essential matrix) 的推导中可以体现：

已知两个相机坐标系与它们之间的转化关系 \(\mathbf{R}, \mathbf{T}\)： \[ \begin{aligned} \mathbf{X}'_c&=\mathbf{RX}_c+\mathbf{T} \\ \mathbf{T}\times \mathbf{X}'_c &= \mathbf{T}\times \mathbf{RX}_c+\mathbf{T}\times \mathbf{T} \\ \mathbf{X}'_c\cdot (\mathbf{T}\times \mathbf{X}'_c) &= \mathbf{X}_c'\cdot (\mathbf{T}\times\mathbf{RX}_c) \\ \mathbf{X}_c'[\mathbf{T}]_{\times}\mathbf{R}\mathbf{X}_c&= 0 \\ \mathbf{X}_c'^T\mathbf{E}\mathbf{X}_c&=0 \ \text{ where } \mathbf{E}=[\mathbf{T}]_{\times}\mathbf{R} \end{aligned} \] \(\mathbf{E}\) 是本质矩阵。这里应用了一些向量运算的 tricks：

\(\mathbf{T}\times \mathbf{T}=0\)
\(a\cdot(b\times c)=b\cdot (a\times c)=c\cdot (a\times b)\)
叉乘等价于反对称矩阵 (skew-symmetric matrix) 点乘：\(\mathbf{T}\times \mathbf{R}=[\mathbf{T}]_{\times}\mathbf{R}\)
继承于反对称矩阵 \([\mathbf{T}]_{\times}\)，本质矩阵 \(\mathbf{E}\) 的奇异值是 \([\sigma,\sigma,0]\) 的形式。这意味着它可以通过 SVD 还原为 \(\mathbf{R},\mathbf{T}\)
不妨将 scale \(e_{33}\) 设为 1，本质矩阵 \(\mathbf{E}\) 共有 5 DoFs (3 for rotation, 2 for translation)

Refer 一下上一章中的图感受一下本质矩阵是如何体现对极约束的。

第一个相机坐标系中射线向量 \(\mathbf{p}\) 平行于 \(\mathbf{X}_c\)，第二个中 \(\mathbf{p}'\) 平行于 \(\mathbf{X}_c'\)；由于 \(\mathbf{X}_c'^T\mathbf{E}\mathbf{X}_c=0\)，有 \(\mathbf{p}'^T\mathbf{E}\mathbf{p}=0\)。
\(\mathbf{p}\) 已知，\(\mathbf{Rp}\) 将向量转换到第二个相机坐标系下表示。
\(\mathbf{Ep}=[\mathbf{T}]_{\times}\cdot\mathbf{Rp}\)，\(T\) 是由第一个相机光心指向第二个相机光心的平移向量。叉积的结果垂直于 \(T\) 与 \(\mathbf{Rp}\) 所在的平面 (即对极平面)，因此 \(\mathbf{N}=\mathbf{Ep}\) 是极平面的一个法向量。
\(\mathbf{p}'^T\mathbf{N}=0\) 将 \(\mathbf{p}'\) 限制在对极平面上，对极约束成立。

本质矩阵求对极点 (epipoles) \(\mathbf{p}_e\)。利用 \(\mathbf{O}_c\mathbf{p}_e\) 与转置向量 \(\mathbf{T}\) 平行 (转化到第二个相机坐标系下比较) 的关系列出方程： \[ \begin{aligned} \mathbf{Rp}_e+T&=\lambda \mathbf{T} \\ \mathbf{T}\times \mathbf{Rp}_e+\mathbf{T}\times \mathbf{T} &= \lambda(\mathbf{T}\times \mathbf{T}) \\ [\mathbf{T}]_{\times} \mathbf{Rp}_e &=0 \\ \mathbf{Ep}_e&= 0 \end{aligned} \] 同理有 \(\mathbf{E}^T\mathbf{p}_e'=0\)。如果想求出极点的非平凡解，\(\det(\mathbf{E})=0\) (i.e., 秩为 \(2\))。

Fundamental Matrix

Key point 1. 基础矩阵

本质矩阵使用 \(\mathbf{p}'^T\mathbf{Ep}=0\) 表示对极约束；射线向量是在分别在两个相机坐标系下被表示的，有时这会带来不便。基础矩阵 (fundamental matrix) 则直接从像素坐标系中体现对极约束。

像素坐标 \(\tilde{\mathbf{w}}\)，intrinsic matrix \(\mathbf{K}\) 与射线向量 \(\mathbf{p}\) 之间的关系： \[ \tilde{\mathbf{w}}=\begin{bmatrix}u\\v\\1\end{bmatrix} = \frac{1}{f}\begin{bmatrix}fk_u & 0 & u_0 \\ 0 & fk_v & v_0 \\ 0& 0 & 1 \end{bmatrix}\begin{bmatrix}x \\ y \\f\end{bmatrix}= \frac{1}{f}\mathbf{Kp} \] 将 \(\mathbf{p}=f\mathbf{K}^{-1}\tilde{\mathbf{w}}\) 代入 \(\mathbf{p}'^T\mathbf{Ep}=0\) 得 \(\tilde{\mathbf{w}}'^T\mathbf{K}'^{-T}\mathbf{EK}^{-1}\tilde{\mathbf{w}}=0\)。令基础矩阵 \(\mathbf{F}=\mathbf{K}'^{-T}\mathbf{EK}^{-1}\)，有 \(\tilde{\mathbf{w}}'^T\mathbf{F}\tilde{\mathbf{w}}=0\)。这个方程的几何含义其实就是本质矩阵的降维版：\(\mathbf{n}'=\mathbf{F}\tilde{\mathbf{w}}\) 是第二张相片中对极线的法向量，\(\tilde{\mathbf{w}}'^T\mathbf{n}'=0\) 将对应的坐标点限制在对极线上。关于对极点，同样有 \(\mathbf{F}\tilde{\mathbf{w}}_e=0\) 与 \(\mathbf{F}^T\tilde{\mathbf{w}}_e'=0\)。

设定 scale \(f_{22}=1\) (\(-1\) DoF)，\(\mathbf{F}\) 继承了 \(\mathbf{E}\) 的 \(\det=0\) 约束 (\(-1\) DoF)，所以 \(\mathbf{F}\) 一共有 \(9-2=7\) DoF。

已标定相机的基础矩阵可直接通过 \(\mathbf{F}=\mathbf{K}'^{-T}[\mathbf{T}]_{\times}\mathbf{R}\mathbf{K}^{-1}\) 求出。

未标定相机的基础矩阵也可以通过采样 \(> 7\) 点对来 estimate： \[ \begin{bmatrix}u'&v'&1\end{bmatrix}\begin{bmatrix}f_{00} & f_{01} & f_{02} \\ f_{10} & f_{11} & f_{12} \\ f_{20} & f_{21} & f_{22}\end{bmatrix}\begin{bmatrix}u\\v\\1\end{bmatrix}=0 \] 如果采样的点噪音太多，得到的 \(\mathbf{F}\) 不一定满足 rank-2 的约束。将其 SVD 分解，把三角矩阵中最小的奇异值替换为 \(0\)。这样得到的 \(\mathbf{F}'\) 满足 rank-2 约束且与原来的 \(\mathbf{F}\) 距离最近 (in Frobenius norm)。

Key point 2. The correspondence problem. (chap.9 pp.11 - 17)

NCC (Normalized cross-correlation) 可解对应问题 (衡量特征点所在 patch 之间的相关度)。但由于图像强度 (intensities) 与视点 (viewpoints) 有关，NCC 常常是不可靠的。

可用于简化对应问题的几个约束 (考虑左侧图像上的特征点在右侧图像上的匹配)：

Uniqueness constriant. 不透明物体上的点最多有一个匹配。
Ordering constraint. 不透明物体表面上的一系列点在左侧与右侧图像上的顺序一致。
Disparity gradient constraint. 两点之间的深度差异在左侧与右侧图像中不会超过一定限制。

Key point 3. Structure recovery. (chap.9 pp.18 - 19)

Structure recovery 指的是从 2D 还原到 3D 世界坐标系的过程。

RANSAC

RANSAC 的全称是 RANdom SAmple Consensue (随机采样一致)。

它是一种迭代方法，用于从含有 outliers 的数据中估计模型参数。与传统方法先用尽可能多的数据再剔除 outliers 不同，RANSAC 先从最小数据集 (smallest possible set) 开始，逐步加入与模型一致的数据点。

RANSAC 流程：

从数据集中随机采样一个最小集 \(s_i\)
用 \(s_i\) 评估模型参数 \(\theta_i\)
找出与该模型 \(\theta_i\) 距离在误差阈值 \(t\) 内的所有点 \(S_i\) 形成 \(s_i\) 的一致集 (consensus set)
若 \(|S_i|>\) 评估阈值 \(T\)，使用 \(S_i\) 重新评估模型得到 \(\theta_r\) 并终止。
重复以上步骤 \(N\) 次并记录最佳模型。

for trial in range(N):
    maybeInliers := d
    maybeModel := parameters fitted to maybeInliers
    alsoInliers := empty set
    
    for point in data not in maybeInliers:
        if point fits maybeModel with an error <= t:
            alsoInliners.add(point)
    
    if alsoInliers.size > T:
        betterModel := parameters fitted to (maybeInliers + alsoInliers)
        currentErr := a measure of how well betterModel fits these points
        if currentErr < bestErr:
            bestModel := betterModel
            bestErr := currentErr

令 \(d\) 为随机采样的最小集中的点数，\(w\) 为某点是 inlier 的概率，\(N\) 次采样中至少一次随机采样获得的最小集中不含有 outliers 的概率为 \(p\)。那么有： \[ \begin{aligned} 1-p&=(1-w^d)^N \\ N&=\frac{\log(1-p)}{\log(1-w^d)} \end{aligned} \] 将 \(N\) 设为这个值可以保证至少一次能够选出最大一致集 \(S_{max}\)。评估阈值 \(T\) 通常设为 \(w\cdot n\) (\(n\) 为全数据集大小)，令一致集 \(S_i\) 可以覆盖所有的 inliers。

RANSAC 常常用于评估基础矩阵 \(\mathbf{F}\)。

Reference

This article is a self-administered course note.

References in the article are from corresponding course materials if not specified.

Course info: COMP3317, Taught by Prof. Kenneth K.Y. Wong

Course textbook: Computer Vision – A Modern Approach, Forsyth & Ponce

-----------------------------------そして、次の曲が始まるのです。-----------------------------------