Oh, this is a topic we could go on and on about for hours. In general, to provide a perfect experience it’s important for the digital content to appear accurately placed in the real world. There are different techniques and technologies that help with that. Without going too much into detail or giving a comprehensive technical overview, these are the three most commonly used techniques:
Visual keypoint matching
This method uses a visual trigger, or marker. This is the “standard” AR that comes to mind for most people when they hear about Augmented Reality. Unique features (corners, edges) of the visual target are extracted and stored in a target database. The system utilising this method is constantly extracting features from the live camera view and comparing them with those stored in the database. Once a match is found (and after a series of other calculations), an algorithm calculates a virtual flat surface based on the position and angle of the visual target, and all of the digital content is placed in space relative to that flat surface. Whether the visual target remains in the camera view is academic at that point; the flat surface remains the anchor of the digital content.
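To make that concrete, here’s a minimal sketch of the matching step in Python using OpenCV’s ORB features (the filenames, feature count, and match threshold are illustrative assumptions, not any specific product’s pipeline):

```python
import cv2
import numpy as np

# Hypothetical inputs: the stored visual target and one live camera frame.
marker = cv2.imread("marker.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("camera_frame.png", cv2.IMREAD_GRAYSCALE)

# Extract unique features (corners, edges) from both images.
orb = cv2.ORB_create(nfeatures=1000)
kp_marker, desc_marker = orb.detectAndCompute(marker, None)
kp_frame, desc_frame = orb.detectAndCompute(frame, None)

# Compare the live frame's features with the ones stored for the target.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(desc_marker, desc_frame), key=lambda m: m.distance)

# With enough matches, estimate the transform of the flat surface that
# will anchor the digital content (its position and angle in the frame).
if len(matches) >= 10:
    src = np.float32([kp_marker[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_frame[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    print("Target found; anchor surface homography:\n", H)
```

A real tracker runs this on every frame and smooths the result over time; that homography is what the virtual flat surface is derived from.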
Spatial mapping
This is where AR (or MR) gets interesting. Usually supported by a depth sensor or similar, the framework is constantly building a virtual representation of the real world (sort of like a 3D scan). This virtual copy of the real world is then used mainly for two things:
Detect flat surfaces in the real world (horizontal and vertical); see the plane-fitting sketch after this list
Use the virtual “mesh” of the real world for occlusion
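As a rough illustration of the surface-detection part, here’s a toy Python sketch that fits a plane to a patch of the scanned point cloud (the points are synthetic stand-ins for real mesh samples):

```python
import numpy as np

# Hypothetical point-cloud samples from the spatial-mapping mesh, lying
# roughly on a horizontal tabletop at height y ≈ 0.7 m, with sensor noise.
rng = np.random.default_rng(0)
points = np.column_stack([
    rng.uniform(-1, 1, 200),          # x
    0.7 + rng.normal(0, 0.005, 200),  # y (up)
    rng.uniform(-1, 1, 200),          # z
])

# Least-squares plane fit: the normal is the right singular vector
# belonging to the smallest singular value of the centred points.
centroid = points.mean(axis=0)
_, _, vt = np.linalg.svd(points - centroid)
normal = vt[-1]

# A normal pointing (almost) straight up means a horizontal surface;
# a normal lying (almost) in the ground plane means a vertical one.
print("horizontal:", abs(normal[1]) > 0.95)
```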
Using spatial mapping we can define virtual anchors in real space and position virtual content in relation to those anchors. Whenever the viewer returns to that space, the virtual content will be in the same place.
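A tiny sketch of what an anchor buys you, assuming poses are plain 4x4 transforms (real frameworks resolve the anchor’s pose against their mesh of the space; the numbers here are made up):

```python
import numpy as np

def translation(x, y, z):
    # 4x4 homogeneous transform for a pure translation.
    t = np.eye(4)
    t[:3, 3] = [x, y, z]
    return t

# Hypothetical anchor: a pose the framework resolved against its scan
# of the room (say, a spot on a detected tabletop).
anchor_pose = translation(1.2, 0.0, -0.8)

# Content is stored relative to the anchor, not the viewer, so it stays
# in the same real-world place whenever the anchor is re-resolved.
content_local = translation(0.0, 0.15, 0.0)  # 15 cm above the anchor

# World pose of the content = anchor pose composed with the local offset.
content_world = anchor_pose @ content_local
print(content_world[:3, 3])  # [ 1.2   0.15 -0.8 ]
```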
The interesting part here is not really the spatial positioning of content, though. Occlusion of the virtual content by real-world objects is just as important, if not more so. To give the perfect illusion of virtual content existing in the real world, occlusion is necessary, and it will be (or already is) a focus of development. Right now occlusion is achievable by using special depth sensors (think HoloLens or Magic Leap) and building a high enough quality copy of the real world.
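At its core, depth-based occlusion is a per-pixel comparison between the depth of the virtual content and the depth of the real world. A toy sketch with hypothetical 2x2 depth buffers (in metres):

```python
import numpy as np

# real_depth would come from the depth sensor / spatial mesh,
# virtual_depth from rendering the virtual object. Values are invented.
real_depth = np.array([[1.0, 1.0],
                       [0.5, 1.0]])
virtual_depth = np.array([[0.8, 0.8],
                          [0.8, 0.8]])

# A virtual pixel is drawn only where it is closer to the viewer than
# the real world; elsewhere a real object occludes it.
visible = virtual_depth < real_depth
print(visible)  # the bottom-left pixel is hidden behind a real object
```

The quality of the illusion therefore depends directly on how accurate that real-world depth is, which is why the depth sensors matter so much.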
Device sensors (accelerometer, gyroscope, magnetometer, GPS)
This technique involves no visual search or mapping of the real world. Instead, the viewer is the anchor of the virtual space, and content is placed in relation to the viewer, usually positioned using the compass of the device and held in place by combining multiple sensors. The easiest way to imagine this is to think of a VR scene you’re watching in Cardboard, for example.
The scene has a table, and a room as a background. Now replace the room with the camera view of the real world, but keep the table there. That’s basically it. A big drawback of this approach is that the sensors are usually not accurate enough to keep the virtual content locked to a certain point, and it’s impossible to detect movement of the viewer with this technique alone. As ARKit and ARCore showed though, when you support this technique with visual tracking, the results can be very convincing. We were able to successfully mix this approach with the visual trigger technique, resulting in convincing AR experiences.
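One common way to hold content in place from these sensors alone is a complementary filter: integrate the fast but drifting gyroscope and correct it with the noisy but drift-free gravity direction from the accelerometer. A simplified sketch (the sample values and blend factor are illustrative assumptions):

```python
import math

def complementary_filter(pitch, gyro_rate, accel, dt, alpha=0.98):
    # pitch:     current estimate of the pitch angle (rad)
    # gyro_rate: angular velocity around the pitch axis (rad/s)
    # accel:     (ax, ay, az) in m/s^2; gravity gives an absolute reference
    # alpha:     how much we trust the gyro; 1 - alpha corrects its drift
    ax, ay, az = accel
    accel_pitch = math.atan2(-ax, math.sqrt(ay * ay + az * az))
    return alpha * (pitch + gyro_rate * dt) + (1 - alpha) * accel_pitch

# Hypothetical 100 Hz samples: the device steadily pitches up at 0.1 rad/s,
# and the accelerometer reads gravity rotated by the current true tilt.
true_pitch, pitch, dt = 0.0, 0.0, 0.01
for _ in range(100):
    true_pitch += 0.1 * dt
    accel = (-9.81 * math.sin(true_pitch), 0.0, 9.81 * math.cos(true_pitch))
    pitch = complementary_filter(pitch, 0.1, accel, dt)
print(round(pitch, 3), round(true_pitch, 3))  # estimate tracks the true tilt
```

The magnetometer plays the same role for yaw, and GPS gives a coarse position; none of this detects the viewer walking around, which is exactly the gap visual tracking fills.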
It’s important to mention that ARKit and ARCore successfully integrated visual tracking with sensor data, providing a result that rivals 3D spatial mapping in accuracy. So it is a very exciting time to watch those two frameworks, and others, to see how they will move forward with occlusion, for example.