One implementation of monocular SLAM is PTAM (Parallel Tracking and Mapping), which maps the real world without needing to be initialised with real-world markers such as known natural feature targets (1, 3).
Mapping input from a handheld camera is harder than mapping input from a robot, because a robot usually provides odometry (input from movement sensors used to estimate position) whereas a handheld camera does not. A handheld camera also cannot be moved at arbitrarily slow speeds (1).
PTAM estimates the position of a camera in a 3D environment and maps the positions of points on objects in that space by processing the camera input in real time (1).
PTAM has two main parts: tracking the camera and mapping the points. These run in parallel on separate threads of a multi-core processor (1).
The tracking thread is responsible for estimating the camera pose (position and orientation) and for rendering augmented graphics when PTAM is used for augmented reality. The mapping thread is responsible for mapping the points. The map is not updated after every frame, only on keyframes. The processor therefore has far more time available per keyframe than it would have per frame, and it can use that time to make the map as rich and accurate as possible (1, 2).
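The division of labour between the two threads can be sketched as a producer and a consumer joined by a queue. This is only an illustrative layout, not PTAM's actual code: the function name `run_ptam_sketch`, the fixed "every 20th frame" keyframe rule and the use of a sentinel value are all assumptions made for the example. Python threads also do not execute in parallel on separate cores in the way a native implementation would, so the sketch shows the structure rather than the performance benefit.

```python
import queue
import threading


def run_ptam_sketch(num_frames=100, keyframe_every=20):
    """Toy two-thread layout: the tracker sees every frame, the mapper only keyframes."""
    keyframe_queue = queue.Queue()
    map_keyframes = []            # stands in for the shared map
    map_lock = threading.Lock()
    tracked_frames = []

    def tracking_thread():
        for frame_id in range(num_frames):
            # Per-frame pose estimation (and AR rendering) would happen here.
            tracked_frames.append(frame_id)
            # Hypothetical keyframe rule, used only to drive the example.
            if frame_id % keyframe_every == 0:
                keyframe_queue.put(frame_id)
        keyframe_queue.put(None)  # sentinel: no more frames

    def mapping_thread():
        while True:
            frame_id = keyframe_queue.get()
            if frame_id is None:
                break
            with map_lock:
                # Map expansion and refinement would happen here.
                map_keyframes.append(frame_id)

    tracker = threading.Thread(target=tracking_thread)
    mapper = threading.Thread(target=mapping_thread)
    tracker.start()
    mapper.start()
    tracker.join()
    mapper.join()
    return len(tracked_frames), map_keyframes
```

With the default arguments the tracker processes all 100 frames while the mapper only handles frames 0, 20, 40, 60 and 80, which is the sense in which the mapper has more time per item of work.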
The map consists of point features which are “locally planar textured patch[es] in the world” (1). The map contains several keyframes which are snapshots taken at various points in time. Each keyframe stores a 4-level pyramid of greyscale images. Each point feature is stored with a reference to a keyframe (usually the first keyframe in which it was seen), a pyramid level and a pixel location (1).
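The map structure described above could be represented roughly as follows. The class and field names (`Keyframe`, `MapPoint`, `Map`, `source_keyframe` and so on) are invented for this sketch, and the pyramid is built by averaging 2x2 pixel blocks so that each level has half the resolution of the one below, which is an assumption about the sub-sampling method rather than a detail taken from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

PYRAMID_LEVELS = 4  # from the text: each keyframe stores a 4-level pyramid

Image = List[List[float]]  # greyscale image as rows of pixel intensities


def build_pyramid(image: Image, levels: int = PYRAMID_LEVELS) -> List[Image]:
    """Return [level 0, ..., level n-1], halving resolution at each level."""
    pyramid = [image]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        height, width = len(prev) // 2, len(prev[0]) // 2
        pyramid.append([
            [
                (prev[2 * r][2 * c] + prev[2 * r][2 * c + 1]
                 + prev[2 * r + 1][2 * c] + prev[2 * r + 1][2 * c + 1]) / 4.0
                for c in range(width)
            ]
            for r in range(height)
        ])
    return pyramid


@dataclass
class Keyframe:
    camera_position: Tuple[float, float, float]  # orientation omitted for brevity
    pyramid: List[Image]


@dataclass
class MapPoint:
    source_keyframe: int     # index of the keyframe the patch is taken from
    pyramid_level: int       # 0 = full resolution, 3 = coarsest
    pixel: Tuple[int, int]   # (row, column) within that pyramid level


@dataclass
class Map:
    keyframes: List[Keyframe] = field(default_factory=list)
    points: List[MapPoint] = field(default_factory=list)
```

Storing a reference to the source keyframe, rather than copying the patch pixels into each point, keeps the points small and lets every point in the same keyframe share one set of images.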
To track the camera, the system first uses a motion model to obtain a prior estimate of the camera pose whenever a new frame is acquired. The map points are then projected onto the image according to this estimate. Tracking proceeds in two stages, from coarse to fine. First, around fifty of the coarsest features are searched for in the image and the camera pose is updated from the matches found. Then around 1000 points are re-projected and searched for, and the final pose estimate is computed from those matches (1).
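The first steps of this tracking procedure can be illustrated with a heavily simplified sketch. It makes several assumptions that are not from the source: the motion model is a plain constant-velocity model applied to position only, the camera is an ideal pinhole looking along the +z axis with rotation ignored, and the focal length and image centre are arbitrary values. The function names are hypothetical, and the patch search and pose update are left out entirely.

```python
def predict_position(prev, prev_prev):
    """Constant-velocity guess: assume the camera repeats its last displacement."""
    return tuple(2 * a - b for a, b in zip(prev, prev_prev))


def project(point, cam_pos, focal=500.0, centre=(320.0, 240.0)):
    """Pinhole projection for a camera at cam_pos looking along +z (no rotation)."""
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    if z <= 0:
        return None  # the point is behind the camera
    return (focal * x / z + centre[0], focal * y / z + centre[1])


def split_coarse_fine(points, n_coarse=50, n_fine=1000):
    """Choose the points for each tracking stage.

    points is a list of (xyz, pyramid_level) pairs. The coarse stage takes the
    points from the highest pyramid levels; the fine stage takes a larger set.
    """
    ordered = sorted(points, key=lambda p: p[1], reverse=True)
    return ordered[:n_coarse], ordered[:n_fine]
```

For example, a camera that moved from the origin to (1, 0, 0) between the last two frames is predicted to be at (2, 0, 0) for the next one, and map points are projected from that predicted position before the search begins.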
Building the map has two main stages: the initialisation of the map using stereo techniques, and the continual refinement and expansion of the map by the mapping thread (1).
The initialisation uses a 5-point stereo algorithm by Henrik Stewénius, Christopher Engels and David Nistér, which extends an algorithm created by Nistér in 2004 (1, 5). The user holds the camera above the workspace and presses a key, at which point the first keyframe is made. The user then smoothly moves the camera to an offset position roughly 10 cm away and presses a key again, which creates the second keyframe. The system tracks the position of 2D patches during the smooth movement so that they can be matched between the two keyframes. The algorithm then uses the two keyframes to triangulate the base map, which is subsequently refined (1).
The map is then rotated so that the base plane is flat. This is done by picking points at random and using those to estimate where the plane is and testing the estimated planes against other points to see which estimate is the most likely (1).
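The procedure described here, picking points at random, fitting a candidate plane and scoring it against the remaining points, is a random-sample-consensus (RANSAC) style estimate. A minimal sketch follows. The iteration count, the inlier threshold and the function names are assumptions chosen for the example, and the final step of rotating the map to align with the plane is not shown.

```python
import random


def plane_from_points(p, q, r):
    """Return (unit_normal, d) for the plane n.x = d, or None if collinear."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]          # cross product u x v
    length = sum(c * c for c in n) ** 0.5
    if length < 1e-12:
        return None
    n = [c / length for c in n]
    return n, sum(n[i] * p[i] for i in range(3))


def ransac_plane(points, iterations=200, threshold=0.01, seed=0):
    """Return the candidate plane supported by the most points, and that count."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, -1
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        # A point supports the plane if it lies within `threshold` of it.
        inliers = sum(
            1 for pt in points
            if abs(sum(n[i] * pt[i] for i in range(3)) - d) < threshold
        )
        if inliers > best_inliers:
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

Given a set of points that mostly lie on a table top plus a few that do not, the plane with the most supporting points is taken as the base plane.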
A new keyframe is added any time the following criteria are met (1):
- Tracking quality is good.
- The last keyframe was added at least 20 frames ago.
- The camera is at least a minimum distance away from the nearest existing keyframe.
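The three criteria combine into a single test, sketched below. The 20-frame gap comes from the list above. The default value of `min_distance` and the idea of passing tracking quality as a boolean are placeholders, since the text does not say how either is measured.

```python
def should_add_keyframe(tracking_good,
                        frames_since_last_keyframe,
                        distance_to_nearest_keyframe,
                        min_frame_gap=20,
                        min_distance=0.1):
    """Return True only when all three keyframe criteria hold.

    min_distance is a hypothetical threshold in arbitrary map units.
    """
    return (tracking_good
            and frames_since_last_keyframe >= min_frame_gap
            and distance_to_nearest_keyframe >= min_distance)
```

Requiring a minimum distance prevents a stationary camera from filling the map with near-identical keyframes.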
New map points are created by triangulation: a candidate feature in the new keyframe is matched in the closest existing keyframe, and the two observations are triangulated to determine the depth of the point (1).
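Triangulation from two views can be illustrated with the midpoint method, which finds the point closest to both viewing rays. This is a generic textbook technique substituted here for illustration; the source does not state which triangulation method PTAM uses. Each ray is given by a camera centre and a direction towards the observed feature, and the function name is hypothetical.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Return the 3D point midway between the closest points of two rays.

    o1, o2 are the camera centres; d1, d2 are ray directions (not necessarily
    unit length). Returns None when the rays are parallel.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [a - b for a, b in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # parallel rays carry no depth information
    s = (b * e - c * d) / denom   # parameter along ray 1
    t = (a * e - b * d) / denom   # parameter along ray 2
    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))
```

The parallel-ray case shows why the two keyframes need a reasonable baseline between them: the closer the rays are to parallel, the less reliable the recovered depth.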
When the camera is in a well-explored environment, the system re-measures points on old keyframes, either to correct outliers or to measure any new features that have been added to the environment (1).
The mapping is only performed when there are free resources on the background processing thread. This means that the tracking system can track complex environments with a constant frame rate which is important for augmented reality applications (1).
PTAM has some limitations which mean that it is not yet ready for AR applications. One is that it requires fairly powerful computer hardware, so at present it would not be able to run on most mobile devices (1). Another is that the system cannot deal with self-occlusion, where one part of a tracked object hides another part of the same object from the camera (1, 3).