Seeing the Unseen: The Transformation of Computer Vision Through Deep Learning

Neurog
3 min read · Dec 14, 2023


Imagine stepping into the field of computer vision, where machines interpret and understand the visual world. In this realm, a groundbreaking research paper titled "Visual Geometry Grounded Deep Structure From Motion" (VGGSfM) emerges from the collaborative efforts of Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny, researchers from the esteemed Visual Geometry Group at the University of Oxford and Meta AI.

The VGGSfM paper introduces a novel, fully differentiable deep learning pipeline for a concept known as Structure-from-Motion (SfM). Now, you might be wondering, what is SfM? It's a technique that recovers camera motion and three-dimensional structure from two-dimensional image sequences. Here's the twist: the VGGSfM model simplifies this complex process by employing deep 2D point tracking to obtain pixel-level accurate tracks across many images at once. This is a significant departure from the traditional SfM pipeline, which relies on a chain of incremental, largely hand-engineered steps: detecting and matching keypoints, registering images, triangulating 3D points, and running bundle adjustment.
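To make one of those classical steps concrete, here is a minimal NumPy sketch of linear (DLT) triangulation of a single 3D point from two views. It is a toy illustration of what "3D point triangulation" means in the traditional pipeline, not code from the paper; the camera matrices and pixel coordinates are invented for the example.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2 : 3x4 camera projection matrices.
    x1, x2 : (u, v) pixel coordinates of the same point in each image.
    Returns the 3D point in non-homogeneous coordinates.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of A with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Toy setup: two cameras with the same intrinsics observing the point (0, 0, 5).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                   # camera at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])   # shifted along x
X_true = np.array([0.0, 0.0, 5.0, 1.0])
x1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
x2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate_dlt(P1, P2, x1, x2))   # ~ [0. 0. 5.]
```

In a classical pipeline, many such per-point estimates are chained together with matching, registration, and bundle adjustment; VGGSfM's contribution is to fold these stages into a single differentiable pipeline rather than a sequence of separate, hand-tuned steps.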

The paper highlights key results from experiments conducted on the Co3Dv2, IMC Phototourism, and ETH3D datasets. These results demonstrate VGGSfM's superior performance in camera pose estimation and 3D triangulation compared to other state-of-the-art methods. In plain terms, VGGSfM estimates the position and orientation of the camera with remarkable accuracy, for example on the Co3D dataset.

One metric used to measure this accuracy is the Area Under the Curve (AUC) of the pose-error recall, accumulated up to a given error threshold. VGGSfM showed significant improvements in AUC across different thresholds compared to existing incremental SfM methods. For instance, on the AUC@10 metric, VGGSfM achieved a commendable 75.39%, a 4.92 percentage-point improvement over the next best method, SP+SG (SuperPoint + SuperGlue), which scored 70.47%.
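For readers unfamiliar with the metric, AUC@N summarizes how often the estimated camera pose falls within an error threshold, accumulated over all thresholds up to N (typically an angular error in degrees). Below is a minimal NumPy sketch of one common way to compute it from per-image pose errors; the paper's exact evaluation code may differ, and the error values here are invented for illustration.

```python
import numpy as np

def pose_auc(errors_deg, threshold_deg):
    """AUC of the pose-error recall curve up to a threshold (e.g. 10 for AUC@10).

    errors_deg : per-image (or per-pair) pose errors in degrees.
    Returns a value in [0, 1]; multiply by 100 for a percentage.
    """
    errors = np.sort(np.asarray(errors_deg, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    # Keep errors below the threshold and close the curve at the threshold itself.
    keep = errors <= threshold_deg
    e = np.concatenate([[0.0], errors[keep], [threshold_deg]])
    r = np.concatenate([[0.0], recall[keep], [recall[keep][-1] if keep.any() else 0.0]])
    # Trapezoidal integration, normalized so a perfect method scores 1.0.
    return np.sum(np.diff(e) * (r[:-1] + r[1:]) / 2.0) / threshold_deg

errors = [1.5, 2.0, 4.0, 12.0, 35.0]                  # made-up pose errors in degrees
print(f"AUC@10 = {100 * pose_auc(errors, 10):.2f}%")  # -> AUC@10 = 49.00%
```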

The VGGSfM model is a game-changer in the field of computer vision, pushing the boundaries of what machines can perceive and understand about the visual world. It’s a testament to the power of innovation and the endless possibilities that lie within the realm of artificial intelligence.

The paper also compares VGGSfM's tracking approach against pairwise matching, a method that compares two images at a time to find common features. These pairwise matchers were evaluated against VGGSfM's tracker on the IMC dataset.

However, pairwise matching often produced tracks with "holes," because matching images two at a time cannot guarantee that every point is tracked through every frame. Imagine these "holes" as missing pieces in a jigsaw puzzle, disrupting the complete picture. The pipeline handles this by marking the holes as invisible points so that later stages simply ignore them, while still using VGGSfM's own tracks for camera initialization. The comparison revealed that VGGSfM's tracking method slightly outperformed the state-of-the-art matching options, much like a runner edging ahead in a tight race.
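A rough sketch of what "marking holes as invisible" can look like in practice is shown below. It is purely illustrative and does not reproduce the paper's data structures: each track stores one 2D location per frame plus a visibility flag, and frames with no reliable correspondence simply stay flagged as invisible so that downstream stages ignore them.

```python
import numpy as np

num_frames, num_tracks = 4, 3
tracks = np.full((num_frames, num_tracks, 2), np.nan)   # (frame, track, xy)
visible = np.zeros((num_frames, num_tracks), dtype=bool)

# Pairwise matching found track 0 in frames 0, 1 and 3, so frame 2 is a "hole".
for frame, xy in [(0, (100.0, 80.0)), (1, (104.5, 79.2)), (3, (112.0, 77.8))]:
    tracks[frame, 0] = xy
    visible[frame, 0] = True

# Downstream stages (e.g. triangulation) only consume visible observations.
observations = tracks[visible[:, 0], 0]
print(visible[:, 0])    # [ True  True False  True ]
print(observations)     # the three visible 2D points of track 0
```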

The research also tested VGGSfM's camera initializer and triangulator by replacing them with different methods. Picture this as swapping out the engine of a car to see how it affects performance. The result? A clear drop in performance, emphasizing their critical role in the pipeline's success. The ablation study, in which components are removed one at a time to measure their contribution, showed that removing the fine tracker from VGGSfM caused a considerable performance drop on the IMC dataset: the AUC@10 score decreased from 73.92% to 62.30%. This underscores the fine tracker's importance for achieving accurate structure-from-motion correspondences, much like a compass guiding a ship to its destination.
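The name "fine tracker" hints at its role: a coarse stage proposes approximate point locations, and a fine stage refines them toward the pixel-accurate tracks mentioned earlier. The snippet below illustrates that general idea with a naive local feature search around a coarse estimate; it is a conceptual stand-in only, not the paper's learned fine tracker.

```python
import numpy as np

def refine_track_point(feat_map, query_feat, coarse_xy, radius=3):
    """Refine a coarse 2D estimate by searching a small window for the
    position whose feature best matches the query descriptor."""
    H, W, _ = feat_map.shape
    x0, y0 = coarse_xy
    best_xy, best_dist = coarse_xy, np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = x0 + dx, y0 + dy
            if 0 <= x < W and 0 <= y < H:
                d = np.linalg.norm(feat_map[y, x] - query_feat)
                if d < best_dist:
                    best_xy, best_dist = (x, y), d
    return best_xy

# Toy usage with random features; the true best match is planted at (x=21, y=18).
rng = np.random.default_rng(0)
feat_map = rng.normal(size=(32, 32, 8))
query = feat_map[18, 21].copy()
print(refine_track_point(feat_map, query, coarse_xy=(20, 20)))  # -> (21, 18)
```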

The paper asserts that VGGSfM represents a significant leap forward in the field of computer vision. Its differentiable approach not only simplifies the SfM process but also outperforms traditional methods on benchmark datasets. Despite its impressive results, VGGSfM currently lacks the capacity to process thousands of images simultaneously, a capability that traditional SfM frameworks possess. It’s like a powerful sports car that’s not yet ready for a cross-country road trip. Nevertheless, the groundwork laid by this research promises exciting directions for future developments in differentiable SfM, opening up new horizons in the field of computer vision.

Reference:
Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. "Visual Geometry Grounded Deep Structure From Motion" (VGGSfM). arXiv:2312.04563. https://arxiv.org/pdf/2312.04563.pdf

