WO2017117675A1

WO2017117675A1 - Head mounted device for augmented reality

Info

Publication number: WO2017117675A1
Application number: PCT/CA2017/050009
Authority: WO
Inventors: Dhanushan Balachandreswaran; Jian Yao; Mehdi MAZAHERI TEHRANI; Qadeer BAIG; Mauricio Manuel GALVEZ; Michael DARMITZ; Anil MAHMUD
Original assignee: Sulon Technologies Inc.
Priority date: 2016-01-08
Filing date: 2017-01-06
Publication date: 2017-07-13

Abstract

A head mounted device for augmented reality is provided. The head mounted device has a visor. The visor has at least two forward-facing cameras, and at least one non-forward-facing camera. A handheld device for augmented reality and other applications has a camera-based pose tracking system.

Description

HEAD MOUNTED DEVICE FOR AUGMENTED REALITY

TECHNICAL FIELD

[0001] The following relates to a head mounted device and more specifically to a head mounted device for displaying augmented reality and virtual reality.

BACKGROUND

[0002] Augmented reality (AR) and virtual reality (VR) visualisation applications are increasingly popular. The range of applications for AR and VR visualisation has increased with the advent of wearable technologies and 3-dimensional (3D) rendering techniques. AR and VR exist on a continuum of mixed reality visualization.

SUMMARY

[0003] In one aspect, there is provided a head mounted device for augmented reality. The head mounted device has a visor. The visor has at least two forward-facing cameras, and at least one non-forward-facing camera.

[0004] In another aspect, there is provided a handheld device. The handheld device has a camera-based system for tracking the pose of the device during use. In yet another aspect, the camera-based system can map topographies in the real environment surrounding the handheld device.

[0005] In still a further embodiment, there is provided a processing environment for performing camera-based pose tracking and environment mapping.

[0006] In at least another embodiment, a camera system and a method are provided for pose tracking and mapping in a real environment. The camera system may be for an AR or VR HMD, a handheld controller, or another device. The HMD comprises a visor housing having a display viewable by a user wearing the HMD and electronics for driving the display. The camera system comprises at least two front facing cameras, a memory and a processor set. Each front facing camera is disposed upon or within the HMD to capture visual information from the real physical environment within its respective field of view (FOV) surrounding the HMD. The FOV of at least one of the front facing cameras overlaps at least partially with the FOV of at least one other front facing camera to provide a stereoscopic front facing camera pair to capture a common region of the real physical environment from their respective viewpoints. Each of the front facing cameras is configured to obtain the visual information over a plurality of epochs to generate a real image stream. The memory is for storing the real image stream. The processor set comprises a special purpose processing unit (SPU) and a general purpose processing unit (GPU), and the processor set is configured to: (i) obtain the real image stream from the at least two front facing cameras or the memory; and (ii) process the real image stream to perform simultaneous localisation and mapping (SLAM), the performing SLAM comprising deriving scaled depth information from the real image stream.

[0007] The FOVs of the at least two front facing cameras may be substantially aligned with corresponding FOVs of the user's right and left eyes, respectively. The camera system may further comprise at least two side cameras, namely, a right side camera and a left side camera generally aiming in a direction other than front facing. The right side camera may have a FOV partially overlapping one of the front facing cameras and the left side camera may a FOV partially overlapping one of the front facing cameras.

[0008] The processor set may comprise an accelerated processing unit providing the SPU and the GPU, and one or more FPGAs coupled to the camera system and the memory for: (i) obtaining the real image stream from the cameras; (ii) storing the real image stream to the memory; and (iii) providing a subset of pose tracking operations using the real image stream.

[0009] The processor set may downscale the image streams prior to performing SLAM.

[0010] The camera system may perform SLAM at an initialization phase by: (i) obtaining parameters of the front facing cameras; (i) receiving the image stream corresponding to a first parallel set of image stream frames from each of the front facing cameras; (ii) extracting features from the set of image stream frames; (iii) generating a descriptor for each extracted feature; (iv) matching features between each set of image stream frames based on the descriptors; (v) estimating real physical world coordinates of a point in the real physical environment on the basis of the matched features and the parameters; (vi) assigning an origin to one of the real physical world coordinates; and (vii) generating a map around the origin, the map reflecting real- world scale and dimensions of the real physical environment captured in the image stream frames.

[001 1] The camera system may perform SLAM at a tracking phase subsequent to the initialization phase by: (i) obtaining parameters of the front facing cameras; (ii) receiving the image stream corresponding to the image stream frames from at least one of the front facing cameras at a second epoch subsequent to a first epoch used in the initialization phase; (ii) extracting features from the image stream frames; (iii) generating a descriptor for each extracted feature; (iv) matching features between the image stream frames to those of the first epoch based on the descriptors; and (v) estimating the real physical world coordinates of the point in the real physical environment on the basis of the matched features, the parameters and the origin.

[0012] The camera system may extract features by applying a FAST or FAST-9 feature extraction technique to assign a score and predominant angle to the features. The camera system may perform the SLAM further by culling the extracted features prior to matching the features, where the culling is based on the score. The camera system may generate descriptors by assigning a rotation-invariant ORB descriptor.

[0013] The camera system may further comprise at least one depth camera disposed upon or within the HMD to capture depth information from the real physical environment within a depth facing FOV. The depth camera may be a structured light camera. More specifically, the depth camera may be a time-of-flight camera. The depth camera may be disposed approximately at the centre of the front of the visor housing generally in the middle of the user's FOV.

[0014] The depth camera may further comprise a structured light camera, a time-of-flight camera and an image sensor. The processor set may switch operation of the depth camera between the structured light camera, the time-of-flight camera and the image sensor depending on the desired application for the camera system between pose tracking and range resolution.

[0015] The processor set may be disposed within the visor housing. Alternatively, the processor set may be remotely located and coupled to the HMD by wired or wireless communication link.

[0016] These and other aspects are contemplated and described herein. It will be appreciated that the foregoing summary sets out representative aspects of systems and methods, to assist skilled readers in understanding the following detailed description.

DESCRIPTION OF THE DRAWINGS

[0017] A greater understanding of the embodiments will be had with reference to the Figures, in which:

[0018] Fig. 1 shows a perspective view of an embodiment of a head mounted device for displaying an AR environment;

[0019] Figs. 2a and 2b show perspective views of a visor of the HMD of Fig. 1 ;

[0020] Fig. 3 shows a rear view of the visor shown in Figs. 2a and 2b;

[0021] Figs. 4a and 4b show perspective views of a battery pack of the HMD of Fig. 1 ; [0022] Fig. 5 shows an exploded view of the components of a visor of a head mounted device in accordance with another embodiment;

[0023] Fig. 6 shows a schematic view of a processor set for a head mounted device for displaying an AR environment in accordance with an embodiment;

[0024] Fig. 7 shows a flowchart of a method performed by a processor set to generate an AR image stream for viewing by a user using an HMD in accordance with an embodiment;

[0025] Fig. 8 shows a flowchart of a method for image conversion, feature extraction and descriptor generation in a method for pose tracking of an HMD in accordance with an embodiment;

[0026] Fig. 9 shows a flowchart of a method for initializing pose tracking of an HMD in accordance with an embodiment;

[0027] Fig. 10 shows a flowchart of a method for tracking the pose of an HMD in accordance with an embodiment;

[0028] Fig. 1 1 shows a flowchart for indicating and prompting completeness of mapping using an HMD in accordance with an embodiment;

[0029] Fig. 12 shows a multi-user use scenario of an HMD in accordance with an embodiment;

[0030] Fig. 13 shows an embodiment of a handheld device; and

[0031] Fig. 14 shows a schematic view of a processing environment for a handheld device in accordance with an embodiment.

DETAILED DESCRIPTION

[0032] For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the Figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practised without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Also, the description is not to be considered as limiting the scope of the embodiments described herein.

[0033] Various terms used throughout the present description may be read and understood as follows, unless the context indicates otherwise: "or" as used throughout is inclusive, as though written "and/or"; singular articles and pronouns as used throughout include their plural forms, and vice versa; similarly, gendered pronouns include their counterpart pronouns so that pronouns should not be understood as limiting anything described herein to use, implementation, performance, etc. by a single gender; "exemplary" should be understood as "illustrative" or "exemplifying" and not necessarily as "preferred" over other embodiments. Further definitions for terms may be set out herein; these may apply to prior and subsequent instances of those terms, as will be understood from a reading of the present description.

[0034] Any module, unit, component, server, computer, terminal, engine or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, data libraries, or data storage devices (removable and/or non-removable) such as, for example, magnetic discs, optical discs, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD- ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disc storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto. Further, unless the context clearly indicates otherwise, any processor or controller set out herein may be implemented as a singular processor or as a plurality of processors. The plurality of processors may be arrayed or distributed, and any processing function referred to herein may be carried out by one or by a plurality of processors, even though a single processor may be exemplified. Any method, application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.

[0035] Without limiting the generality of the present disclosure, systems and methods are disclosed herein for presenting augmented reality ("AR") environments to a user wearing a head mounted device ("HMD"). The term "AR" environment as used herein may refer to both virtual reality ("VR") and AR environments. In the present disclosure, an AR environment can include visual and structural attributes of the real, physical environment surrounding the user wearing the HMD combined with virtual elements overlaid or superimposed onto the user's view of those visual and structural attributes. An AR environment can alternatively comprise exclusively virtual elements, with none of the visual or structural attributes of the surrounding real environment incorporated into the AR environment; this is sometimes referred to as a VR environment, and only the user's pose (position and orientation over time) relative to the real environment is incorporated into the AR environment. For the reader's convenience, the following may refer to "AR" but is understood to include all of the foregoing and other variations recognized by the skilled reader.

[0036] Referring now to Figs. 1 to 5, embodiments of an HMD are described. An HMD 100 in accordance with a first embodiment is shown in an upright position as it would be worn atop a user's head during use. Relative positional terms used herein, including, for example, 'right', 'left', 'front', 'rear', 'top' and 'bottom', are relative to the orientation of the user's head when wearing the HMD 100 as intended and in a substantially upright head position. The HMD 100 comprises a visor 102 configured to be worn against the user's face to display an AR environment to the user, a battery pack 104 that is electrically coupled to the visor 102 for powering the HMD 100, and a set of straps coupled to the battery pack 104 at one end and to the visor 102 at another end to retain the visor 102 against the user's face and the battery pack 104 against the rear of the user's head.

[0037] The visor 102 comprises: a camera system 210 to capture depth and visual information from the real environment surrounding the HMD 100; a processor set (not shown) that is communicatively coupled to the camera system 210 and configured to generate an AR image stream using the depth information and the visual information; a display system that is communicatively coupled to the processor set and configured to be positioned in front of the user's eyes to display the AR image stream as an AR environment visible to the user when the HMD 100 is worn by the user; and a visor housing 200 that is configured to house the camera system 210, the processor set and the display system. The visor 102 further comprises controls to receive user inputs for controlling the HMD 100, including a power button 202 and a volume adjustment 204.

[0038] The camera system 210 comprises five cameras: a right side camera 210a, a right front camera 210b, a centre front camera 210c, a left front camera 210d, and a left side camera 210e, each of which is spaced apart from the other cameras along the outer periphery of the visor 102 and directed outwardly therefrom to capture an image stream of the real environment within its respective field of view ("FOV"). In at least one embodiment, all the cameras are visual based and can capture visual information from the real environment. At least two of the front cameras are visual based, spaced apart cameras to provide an AR image stream in 3D. The right front camera 210b and the left front camera 210d are forward-facing, and are preferably positioned to capture a view of the real environment that generally aligns with the user's natural view, i.e., with their FOVs generally aligned with the FOVs of user's right and left eyes, respectively.

[0039] At least the three front cameras 210b, 210c, 210d are positioned relative to each other in a stereo relation; that is, the three front cameras are spaced apart from each other and have partially overlapping FOVs to capture a common region of the real environment from their respective viewpoints. The relative pose of the three front cameras 210b, 210c, 210d is fixed and can be used to derive depth information from the common region captured in the real image stream. The centre front camera 210c can be vertically aligned with the front right camera 210b and front left camera 210d, as shown in Figs. 1 and 2, or preferably vertically offset upwardly or downwardly therefrom, as shown in Fig. 5. If the centre front camera 210c is vertically offset from the right front camera 210b and the front left camera 210d, the centre front camera 210c can be located anywhere on the front face of the visor 102. The vertical offset of the centre front camera 210c from the other cameras enables greater robustness in calculating vertical distances of features in the real image stream. The vertical offset can further resolve ambiguities in depth calculations from the left and right front cameras.

[0040] The right side camera 210a and the left side camera 210e can also be oriented with their respective FOVs overlapping with one or more FOVs of any of the front cameras 210b, 210c, 210d, increasing the stereo range of the camera system 210 beyond the front region of the visor 102.

[0041] The right front camera 210b and the left front camera 210d are preferably oriented with their FOVs substantially aligned with the FOVs of the user's right and left eyes so that the real image streams provided by those two cameras simulate the user's natural view of the real environment. The real image streams captured by the cameras of the camera system 210 serve as a source of visual information and/or depth information during pose tracking, mapping and rendering operations performed to generate the AR image stream.

[0042] The right side camera 210a and the left side camera 21 Oe are non-forward-facing. That is, the side cameras 210a, 210e are configured to aim in a direction other than forward, such as, for example, side-facing or generally perpendicularly aligned in relation to the forward-facing cameras. The side cameras 210a, 210e generally increase the combined FOV of the camera system 210 and, more particularly, the side-to-side view angle of the camera system 210. The side cameras 210a, 210e can provide more robust pose tracking operations by serving as alternative or additional sources for real image streams, for example and as described below in greater detail, whenever the real image streams provided by the front facing cameras 210b, 210c, 210d are too sparse for pose tracking.

[0043] While the embodiments of the camera system 210 illustrated in Figs. 1 to 5 comprise five cameras, the camera system 210 can alternatively comprise fewer than five or more than five cameras. A higher number of spaced apart image cameras can increase the potential combined FOV of the camera system 210, or the degree of redundancy for visual and depth information within the image streams. This can lead to more robust pose tracking or other operations; however, increasing the number of cameras can also increase the weight, power consumption and processing demands of the camera system 210.

[0044] In at least one embodiment, one or more of the cameras is a depth camera, such as, for example, a structured light ("SL") camera or a time-of-f light ("TOF") camera. For example, the centre front camera can be a depth camera while the right front and left front cameras can be image cameras, allowing the processor set to choose between sources of information depending on the application and conditions in the real environment. For example, the processor set can be configured to use depth information from the depth camera whenever ambient lighting is too poor to derive depth information from the right front and left front cameras in stereo; further, as the user walks toward a surface, the surface may become too close to the user's forward-facing image cameras to capture a stereo image of the surface, or, when the user is too far from a surface, all potential features on the surface may lie out of range of the image cameras; further, the surface captured by the visual-based cameras may lack salient features for visual-based pose tracking. In those cases, the processor set can rely on the depth camera. In one embodiment, the depth camera comprises both an SL and a TOF emitter as well as an image sensor to allow the depth camera to switch between SL and TOF modes. For example, the real image stream contributed by a TOF camera typically affords greater near range resolution useful for mapping the real environment, while the real image stream contributed by an SL camera typically provides better long range and low resolution depth information useful for pose tracking. The preferred selection of the cameras is dependent on the desired applications for the camera system 210.

[0045] In at least one embodiment, the camera system 210 extends to the battery pack 104 with one or more cameras disposed thereon to capture further real image streams facing outwardly from the sides and/or rear of the battery pack 104. The camera system 210 can further comprise at least one camera disposed on the HMD 100 facing downwardly to enable enhanced pose tracking, height detection, peripheral rendering, gesture tracking, and mapping of regions of the user's body below the HMD 100 in various applications.

[0046] Each camera is equipped with a lens suitable for the desired application. The cameras can be equipped with fisheye lenses to widen their respective FOVs. In other embodiments, the camera lenses can be wide-angle lenses, telescopic lenses or other lenses, depending on the desired FOV and magnification for the camera system 210. The processor set can be configured to correct any corresponding lens-related distortion.

[0047] In certain implementations, such as, for example, AR implementations, the cameras preferably are capable of capturing images at a sustained rate of approximately 90 or more frames per second or, more preferably, at approximately 180 or more frames per second.

[0048] The HMD 100 can further comprise an inertial measurement unit ("IMU", not shown) to provide orientation information for the HMD 100 in addition to any pose information derivable from the real image stream. The IMU 100 can be used to establish vertical and horizontal directions, as well as to provide intermediate orientation tracking between frames in the real image stream, as described below in greater detail.

[0049] The processor set of the HMD 100 is configured to perform various functions to generate the AR image stream, including without limitation, pose tracking, mapping, rendering, combining of real and virtual features, and other image processing. The processor set is communicatively coupled to the camera system 210, and also can be connect to the IMU if the HMD 100 comprises one. The processor set can be disposed within the visor 102 and mounted to a PCB 242 disposed between the visor front and the display 240 of the display system, as shown in Fig. 5. However, other configurations are contemplated. Alternatively or additionally, the HMD 100 can comprise a wired or wireless connection to a processor set located remotely from the HMD 100 and its user. In such embodiments, processing can be shared between onboard and remote processor sets.

[0050] The display system of the HMD 100 is connected to the processor set to receive the AR image stream and display it to the user as an AR environment. The display system comprises at least one display 240 disposed in the visor 102 to be in front of the user's eyes when the HMD 100 is worn, such as the display shown in Fig. 5. The display system further comprises a lens assembly 220 that is disposed to be between the display 240 and the user's eyes. The lens assembly 220 is configured to adjust the user's view of the AR environment shown on the display system. For example, the lens assembly can magnify or demagnify, or distort or undistort images displayed by the display. The embodiment of the lens assembly 220 shown in Fig. 2b comprises a left lens 220b and a right lens 220a disposed between the display assembly and the user's eyes; the left lens 220b is substantially aligned with the FOV of the user's left eye and the right lens 220a is substantially aligned with the FOV of the user's right eye. In at least one embodiment, the lenses are Fresnel lenses. The lenses can be made of PMMA or other suitable transparent and preferably lightweight material.

[0051] In at least one embodiment, the display system can be mounted to the rear of the PCB bearing the processor set and disposed transversely across the width of the visor 102, as shown in Fig. 5. The display of the display system can be an LED, OLED, 1080p, or other suitable type of display. Alternatively, the display can be a light-field display, such as the near- eye light field display disclosed in Douglas Lanman, David Luebke, "Near-Eye Light Field Displays" in ACM SIGGRAPH 2013 Emerging Technologies, July 2013. The display system 220 is preferably laterally centred and, more preferably, also vertically centred, relative to the right front camera 210b and the left front camera 210d so that the display system 220, the right front camera 210b and the left front camera 210d are all substantially aligned with the user's natural FOV when wearing the HMD 100, as previously described.

[0052] When displaying an AR environment in 3D, the left side of the display can display a left hand perspective of the AR environment, while the right side can display a right hand perspective. The user can thereby perceive depth in the displayed AR environment. However, the user's perception can suffer if the user's eyes see the perspective intended for the other eye. The left half of the display system can be visible only to the user's left eye and the right half of the display system can be visible only to the user's right eye to prevent the user's left and right eyes from viewing images configured for the other eye. In at least one embodiment, as shown in Fig. 5, the display system can comprise a divider 244 to connect the lens array 220 and the display. The divider 244 defines left and right chambers which are closed to each other between the lens array 220 and the display, thereby blocking the view between the left and right sides of the display, allowing the user to gain a stereoscopic view of the AR environment. The divider can also serve to form a sealed chamber between the lens array and the display to prevent ambient air, dust and moisture from entering the chamber.

[0053] The embodiments of the HMD 100 shown in Figs. 1 to 5 further comprise a plurality of dedicated controls for enabling the user to control the HMD 100. The controls comprise a power button 202 to turn the HMD 100 on and off, and a volume adjustment 204 to adjust the volume of any audio played by an audio system (not shown) connected to the HMD 100 by the audio port 203. Other suitable controls are contemplated. Alternatively, the HMD 100 can comprise reconfigurable controls or sensors or cameras that cooperate with the processor set to detect and register user gestures as inputs to the HMD 100.

[0054] The camera system 210, display system and other components of the visor are housed within the visor housing 200. The visor housing 200 is shaped and sized to receive the faces of various users. The visor housing can be made from various suitable materials, such as Polycarbonate ("PC") or Acrylonitrile Butadiene Styrene ("ABS"), or combinations thereof, which are relatively common, light-weight and readily mouldable materials. As shown in Figs. 2, 3 and 5, the visor housing 200 further comprises a visor gasket 224 to provide cushioning between the visor housing 200 and the user's face when wearing the HMD 100. The cushioning can improve user comfort by damping vibration and more evenly distributing contact pressure, and it can increase the range of users whose faces the visor 102 can fit. The visor gasket 224 can also reduce interference within the viewing area from ambient light. The visor gasket 224 can be constructed of foam, rubber or other suitably flexible material, such as a thermopolyurethane ("TPU") or rubber, with a padded portion where the visor gasket 224 meets the user's face. The padded portion can be made of any suitably comfortable material, such as a strip of relatively soft, open cell polyurethane foam sheathed by a micro-fleece strip along the contact annulus formed between the visor gasket 224 and the user's face. Any gaskets or padding materials in the HMD 100 that are likely to frequently contact the user's skin are preferably hypoallergenic.

[0055] In use, the visor 102 and the user generate heat that can be sensed by the user. The resulting heat can cause the use's face to sweat, particularly within the periphery defined by the visor gasket 224 against the face. Excess sweating can cause the display and the lenses to mist. To reduce misting, the visor gasket 224 and/or the visor housing 200 preferably comprises at least one vent 206 to permit ventilation of the enclosed region between the user's face and the visor 102. The visor may comprise one or more fans 238 housed within the visor such, as adjacent a vent 206, to provide active cooling. A light baffle made from open celled foam can be used to cover the vent to prevent intrusion of ambient light while allowing air to pass between the walls of the visor housing 200 or visor gasket 224. The visor gasket 224 can be fixed to the visor housing 200, or it can be releasably attached thereto to enable replacement. For example, the visor gasket 224 can be removably attached to the visor housing 200 by hook and look, friction fit, clips or other releasable fastener, or it can be fixed with glue or by being monolithically formed with the visor housing 200. [0056] The visor 102 draws electrical power from the battery pack 104 of the HMD 100. The battery pack 104 is electrically coupled to the visor 102 by a power cable 110. The juxtaposition of the battery pack 104 against the rear of the user's head with the visor 102 worn against the front of the user's head can provide a more even weight distribution from front to rear than other configurations. The power cable 1 10 extends between a battery in the battery pack 104 and the electronic components of the visor 102, preferably along the set of straps, as shown in Fig. 1. The battery pack 104 comprises at least one battery, such as, for example, a rechargeable or non-rechargeable battery. In at least one embodiment, the battery pack comprises four 2800mAh batteries; however, other configurations are contemplated.

[0057] The battery pack 104 further comprises a battery housing 400 to house the at least one battery against the rear of the user's head. The battery housing 400 is sized and shaped to receive the heads of various types of users. The battery housing 400 has a rear gasket 404 that provides cushioning between the user's head and the battery housing 400 when the HMD 100 is worn. The rear gasket 404 is made of any suitably cushioning material, such as a thermoplastic polyurethane ("TPU") or rubber. The depth, density and hardness of the rear gasket 404 are preferably selected to generally retain the rear of the user's head away from the surface of the battery housing 400. This provides an air gap to insulate the user from heat generated by the at least one battery during use of the HMD 100. The battery housing 400 can also have a rear liner 408 applied thereto within the periphery defined by the rear gasket 404. The rear liner 408 can be made from the same material as the rear gasket 404 to provide additional cushioning for any part of the user's head that intrudes completely into the enclosed region defined by the rear gasket 404.

[0058] The rear gasket 404 and the battery housing 400 can comprise at least one rear gasket vent 406 and at least one battery vent 410, respectively, to provide ventilation. Ventilation can enhance battery performance and/or user comfort. Further, at least some surfaces of the visor housing 200 and battery housing 400 can be perforated to increase airflow and/or reduce the weight of each housing.

[0059] The battery housing 400 can comprise a removable or releasable cover and opening to enable replacement of the at least one battery housed therein. The battery housing further defines an aperture that houses, and provides external access to, a charging port 402 to charge the at least one battery. The charging port 402 is electrically coupled to the at least one battery and provides a releasable electrical coupling between a charge cable (not shown) and the at least one battery. [0060] Although the battery pack 104 illustrated in Figs. 1 to 3 is configured to be worn against the rear of the user's head, the battery pack for the HMD 100 can be configured to be worn elsewhere on the user's body, for example, as a belt clip, a knapsack, or disposed within the visor 102. Alternatively, the power source can be remote from the user, requiring the user to be tethered to the power source. While tethering to a remote power source can reduce the user's mobility, the total weight of the HMD 100 borne by the user can be lower and/or the capacity of the power source can be greater than an onboard power source. However, the previously described weight balance between the battery pack 104 and the visor 102 is sacrificed if the battery pack 104 is situated other than against the rear of the user's head.

[0061] The set of straps that holds the HMD to the user's head comprises a pair of side strap 106 coupled to the battery pack 104 and the visor 102 on each side of the HMD 100 to run along the left and right sides of the user's head, and an overhead strap 108 that is coupled to the visor 102 and the battery pack 104 and that runs along the top of the user's head. As illustrated by the arrow T in Fig. 1 , the product of the weight and depth of the visor is a downward torque tending to pull away from the user's face at the top the visor 102 and pushing into the user's face at the bottom of the visor 102 when worn upright. The user may sense the torque as an uncomfortable drooping. The set of straps is coupled to the visor 102 and the battery pack at anchor points positioned to counteract the torque during use. Further, the overhead strap 108 can be relatively inelastic to prevent drooping. Since the side straps 106 contribute relatively little to counteracting the torque, they can be relatively elastic to hold the visor 102 and battery pack 104 snugly against the user's head. The straps are preferably length adjustable and, more preferably, tension adjustable to accommodate a range of users and applications. The straps can be fabricated from any suitable combination of flexible, rigid, elastic and/or inelastic materials, such as, for example, a cotton and spandex blend, a textile sheathed by silicon, an elastic-type textile or other material that is relatively strong but comfortable when in contact with the user's skin. In another embodiment, the set of straps can have a ratchet type tightening system.

[0062] As an alternative arrangement to the set of straps, the visor 102 and the battery pack 104 can be retained in substantially fixed relationship to each other by a rigid or semi-rigid structure shaped to accommodate a user's head while retaining the visor in front of the user's eyes. To achieve at least partial front-rear weight balance, the structure preferably retains the battery pack 104 towards the rear of the user's head. If the camera system of the HMD further has one or cameras mounted to the HMD away from the visor, then those cameras are preferably retained at a constant pose relative to the visor-mounted cameras of the camera system. A rigid structure is therefore preferable in such embodiments of the HMD 100.

[0063] The visor 102 can further comprise a proximity sensor 208 to detect when the HMD 100 is being worn by a user. The proximity sensor 208 can be disposed on the visor housing 200 or elsewhere adjacent the area where the user's head is to be received to detect a user's absence from, or proximity to, the adjacent area. The proximity sensor 208 can be communicatively coupled to the processor set, which can be configured to register the detected absence or proximity of a user for various purposes, such as reducing power consumption. For example, the processor set can dim or turn off the display system or other powered components of the HMD 100 in response to detecting the user's absence, and vice versa.

[0064] In use, the HMD 100 can enable a user to move throughout a real environment while viewing an AR environment through the visor 102. As previously described, the AR image stream is at least partially based on the depth or views of the real environment surrounding the HMD 100. For example, the AR environment can comprise virtual elements overlaid with real images of the real environment, or the AR environment can comprise exclusively virtual elements that are spatially related to the real environment. The spatial relationship between the real environment and the AR environment enhances the user's sense of immersion within the AR environment. By tracking the pose (i.e., orientation and location) of the HMD 100 (and thus of the user) within the real environment, the HMD can track the user's real movements into the AR environment. If the user takes a step forward or backward, left or right, or tilts his head, then, the view of the AR environment displayed to the user can simulate a corresponding movement within the AR environment.

[0065] During an AR application, the processor set of the HMD acquires a real image stream from the camera system 210 and implements visual-based simultaneous localisation and mapping ("SLAM") techniques. The processor set can derive scaled depth information directly using a camera system that includes a depth camera or at least one pair of cameras arranged in stereo; i.e., spaced apart from each other with at least partially overlapping FOVs. Alternatively, various visual SLAM techniques, such as Oriented FAST and Rotated BRIEF SLAM ("ORB- SLAM"), are configured for use of a monocular camera by simulating "temporal stereo". "Temporal stereo" requires at least two frames to be captured by a monocular camera from two different poses, and the two frames require a region of overlap. If the real-world dimensions of an observable element within the region of overlap are known, the processor set can derive scaled depth from the real image stream. Any inaccuracy in the observed dimension will be tracked into the calculation of all depths. By using a real image stream from at least one pair or stereo cameras, however, the processor set can directly derive depth from a single epoch without requiring the type of initialisation required in monocular SLAM.

[0066] Fig. 6 illustrates an embodiment of a "system on a chip" ("SOC") processor set 600 onboard the HMD 100, though other processing environments are contemplated. According to the present embodiment, the processor set 600 comprises: a general processing unit ("GPU") 602, such as an AMD™ X86 APU, to perform both general and various graphics processing tasks; and one or more special purpose processing units ("SPU") 604 coupled to the cameras of the camera system 210 to address various tasks for handling the real image stream. Each of the SPUs 604 and the GPU 602 is connected to a RAM 606, such as, for example, an SDRAM. The GPU 602 is further connected to an antenna 612 or other wireless communication system of the HMD, an IMU 610, an HDMI, USB, PCIE or other suitable serial port 612 to enable connection to other components, and an audio port 614 via an audio CODEC. An IMU 610 can be connected to one or both of the SPUs 604 in addition to, or instead of, the GPU 602.

[0067] The SPUs may be FPGAs or ASICs, for example, that are preferably configured to perform those processes that are most suited to their respective FPGA, ASIC or other architecture, and the GPU is preferably configured to perform the remaining processes. More particularly, and as indicated by the dashed lines in Figs. 7, 8, and 9, if embodied as FPGAs, the SPUs 604 preferably perform processes that benefit most from parallel processing within the limitations inherent in FPGAs, while the GPU 602 performs other processes. If the SPUs are embodied as FPGAs, the SPU-implemented processes are preferably selected or configured to reduce demands on the FPGA look-up tables ("LUTs"), for example, by reducing the amount of values, particularly non-integer values, to store in LUTs during processing.

[0068] The SPUs can be configured to implement, preferably in parallel, various processes for pose tracking operations, such as image processing, and feature and descriptor calculation. Accordingly, the GPU's resources can be dedicated to other aspects of pose tracking, graphics rendering, general processing and application-specific tasks that are suited to processing on- board the GPU. The division of processing tasks according to the capabilities of the GPU and the SPUs may therefore provide greater combined performance than an implementation entirely on the SPUs or entirely on the GPU.

[0069] Referring now to Figs. 7 to 12, embodiments of a multi-camera or stereo SLAM method are described for generating an AR image stream from a real image stream captured by a stereo or multi-camera system (collectively, "multi-camera system"), such as the camera system 210 shown in Figs. 1 , 2 and 5.

[0070] It will be appreciated that each camera of the multi-camera system captures its own image stream; however, for ease of reading, these are referred to herein collectively as the singular "real image stream". Further, at least two of the cameras in the multi camera system capture overlapping regions of the real environment to serve as at least one stereo camera pair. The method is implemented by a processor set communicatively coupled to the camera system, such as the processor set 600 described with reference to Fig. 6.

[0071] During 3D mapping and pose tracking operations, the processor set uses known intrinsic and extrinsic calibration parameters for the cameras in the multi-camera system 210 derived during a multi-camera calibration process. Intrinsic parameters relate to lens distortion, focal length and other properties of each camera, while extrinsic parameters relate to the relative pose of each camera within the multi-camera system (the "rigid body transformation"). If the camera parameters are fixed, these can be pre-defined (typically in factory) in a memory accessible by the processor set, as illustrated by the camera parameters 906 used to calculate map points from matched features and generate key frames at block 908 during an initialization procedure shown in Fig. 9. The rigid body transformation can be derived by a suitable non-linear optimization applied to the multi-camera system's real image stream captured from a test field.

[0072] As shown in Figs. 7 and 8, the processor set 600 implementing the SLAM method can be configured to acquire the real image stream at block 800 from the camera system 210 to perform various image conversion tasks at blocks 706 and 804 on the real image stream, such as protocol translation, white balancing, down-sampling and other signal transformations in anticipation of subsequent processing during pose tracking. The processor set may be configured to treat each component real image stream as a distinct element during image conversion tasks in order to optimize image properties within each component rather than across the aggregate of every component. By treating the component real image streams separately during image conversion tasks, the processor set can apply more granular corrections than treatment of the image streams as an aggregate.

[0073] The processor set may downscale the real image stream depending on the desired accuracy of the resulting pose estimate and the capabilities of the processor set. Thus, while high-resolution images of the real environment may provide greater realism when displayed to the user on the display 240, the processor set can benefit from the lower processing demands of a lower-resolution real-image stream by down-sampling, all without perceivably impacting pose tracking quality. The processor may also perform other image conversion processes, such as to correct lens-related distortion downstream of the pose tracking operations and upstream of the combining step at block 730 shown in Fig. 7.

[0074] During the SLAM method, the processor set may operate in an initialization state, a tracking state and a re-localization state.

[0075] During the initialization state the processor set receives from the camera system data in the form of an image stream corresponding to a first parallel set of real image stream frames. The processor set extracts features and their corresponding descriptors from the set of frames, as described below in greater detail with reference to Fig. 7, and then, as shown in Fig. 9, obtains the generated features and descriptors at block 900, and at block 902 matches features between each of the frames. Since the stereo cameras have overlapping fields of view, one or more features present in one of the frames is typically present in at least another of the frames. At block 904, the processor set determines whether there are sufficient feature matches available in the present epoch. If not, the processor set performs feature matching using the next epoch, returning to block 900. Once there are enough matching features within an epoch, based on the known calibration parameters of the stereo camera array 906 stored in a memory accessible by the processor set, the processor set can, at block 908 estimate real-world coordinates of points in the real environment assumed to be exhibiting as the matched features in the real image stream. The processor set assigns an origin to any suitable location. For any feature from one frame with a corresponding match in another frame, the processor set assumes the existence of a real feature in the real environment and maps that real feature as a map point relative to the origin, thereby generating map points 910. This initial map reflects the real-world scale and dimensions of the real environment captured in the initial set of frames because the processor set can derive real-world scale from the known parameters of the stereo cameras 906. Once the processor set has generated the initial set of map points and key frames, the system is initialized, at block 912.

[0076] From time to time, the real image stream may convey relatively little or low-quality stereo information. For example, the forward facing cameras arranged in stereo may capture a matte, white wall that manifests as few features in the image stream, leading to difficulties in pose tracking. In that case, the processor set can use other frames, including monocular frames, in the image stream to supplant or enhance pose tracking using stereo framesets. By way of further example, one of the side cameras may happen to capture a more feature rich region of the real environment. The processor set can use that camera's component of the real image stream to perform pose tracking until the other cameras resume capturing sufficiently feature- rich regions of the real environment. The processor set may switch between monocular and stereo based pose tracking depending on the quality of information available in the real image stream. Further, if the features captured by the single camera in monocular mode are already mapped in a keyframe stored in a feature map, as described below in greater detail, the scaled depth information of that monocular frame can be derived by comparison to the features in the reference keyframe and used for pose tracking until the processor set is able to resume stereo based pose tracking.

[0077] Alternatively, the processor set does not switch between cameras, but instead continuously uses the contributions to the real image stream from all cameras. In such an implementation, all features in the image stream captured by the camera system contribute to pose estimation calculations, but each camera's relative contribution to pose estimation at any given epoch would be contingent on the number and/or quality of salient features it captures then relative to the other cameras in the camera system. Applicant has found that the continuous use of information from all cameras may provide more seamless pose tracking and greater accuracy relative to alternating between the cameras. Further, this "joint" treatment of all the component real image streams as a combined real image stream may occur regardless whether the processor set treats the component image stream separately or jointly during image conversion tasks.

[0078] In the tracking state, the processor set builds on the initial map by adding information acquired from subsequently captured frames, as described below.

[0079] As shown in Fig. 7, in operation, the processor set periodically selects parallel framesets from the real image stream by buffering the real image stream at individual epochs (specific moment when the selected images are captured in the image stream) at block 710 of Fig. 7 and at block 802 of Fig. 8 and registering the epoch time of each frameset 704, as shown in Fig. 7. A clock signal can be used to synchronise the processor set and the cameras. The processor set can retrieve the buffered image stream for subsequent mapping and rendering operations.

[0080] In a system-on-chip processor set configuration, such as the processor set 600 described above with reference to Fig. 6, the at least one SPU of the processor set preferably obtains the real image stream in substantially real-time, i.e., as soon as it can receive the data in the image stream from the camera system. Accordingly, the SPU may begin implementing its assigned operations on the image stream even before the camera system delivers all the data corresponding to a frameset. In this implementation, then, the SPU does not wait until an entire image stream at an epoch is buffered before performing image conversion, feature extraction and descriptor generation for that epoch.

[0081] At block 708 of Fig. 7 and at block 806 of Fig. 8, the processor set extracts features from the image stream and generates descriptors for those features. These are returned at block 808 of Fig. 8. Preferably, the processor set is selected and configured to compute the features and descriptors for each epoch before the camera system is ready to begin delivering data corresponding to the next epoch. For example, if the camera system captures the real image stream at 180 FPS, then the processor set preferably extracts all features and descriptors for a given epoch in less than 1/180th of a second, so that the pose does not significantly lag image capture.

[0082] As shown by the dashed lines in Figs. 7, 8 and 9, if the processor set comprises SPUs as in the processor set 600 of Fig. 6, then that component may be particularly suited to feature extraction and descriptor generation since, for each epoch, the processor set may need to compute many features and descriptors. An FPGA may be configured to perform feature matching and descriptor generation according to any suitable method, such as the method in Weberruss et al "ORB Feature Extraction and Matching in Hardware", incorporated herein by reference.

[0083] The processor set calculates features in the image stream according to any suitable feature extraction technique, such as FAST or FAST-9. FAST features assign a score and a predominant angle to features captured in the real image stream. The processor set then assigns a descriptor, such as a rotation-invariant ORB descriptor, to each feature.

[0084] In order to minimize processing demands, the processor set performs an initial cull against the generated features so that it retains only a certain amount of the highest-scoring features for further processing.

[0085] According to one sort method, the processor set segments the image stream into as many discrete regions as the desired total of post-cull features and retains only the single highest scoring feature in each discrete region. Accordingly, the processor set would compare each feature against the last highest scoring feature and replace the last highest scoring feature with the next feature whenever the next feature has a higher score.

[0086] According to another sorting method, however, the processor set segments the image stream into a plurality of discrete regions, where the plurality is larger than the number of desired post-cull features. The processor set then determines the highest scoring feature in each discrete region as in the previous sort method. However, since the total number of resulting features is still higher than the desired number of post-cull features, the processor set sorts the resulting features by score and retains the post-cull number of highest scoring features. Accordingly, the processor set may avoid culling a disproportionate amount of highly optimal features from regions with a disproportionately high number of next-best features. The processor set may implement any suitable sorting technique. Preferably, the processor set is configured to perform a relatively low-complexity sorting technique. For example, the complexity of certain optimized sorting algorithms is nlog(n), as compared to "counting sort", at r + n, where r is the range of possible feature scores and n is the number of elements to sort. Since counting sort performs more efficiently when the numbers to sort are integer-based with a relatively narrow range, the processor set preferably is configured to calculate integer-based scores within a relatively narrow range, such as FAST (which has a range of available scores between 0 and 255), thereby making the scores compatible with counting sort.

[0087] Referring now to Fig. 10, the processor set acquires the framesets, which can comprise one or more parallel sets of frames captured by the various cameras 210 in the camera system. As shown in Fig. 10, for example, the real image stream can comprise frames captured by left and right cameras arranged in stereo in the camera system at periodic epochs t,, t_i+1 and so on, until t_n. At block 1002 the processor set detects and extracts the features within each frameset and calculates suitable descriptors to describe those features. The processor set preferably employs efficient description protocols, such as ORB descriptors.

[0088] At block 1004, when a pose estimate from the previous epoch is available, it can be used as part of an initial estimate for the pose estimation of a subsequent frame. For example, to estimate the pose of the HMD between consecutive epochs ti and ti+1 , the processor set first assumes that the pose posei x Mi at ti equals the pose posei+1 at ti+1 , where Mi is an estimated motion inferred from previous sets of pose estimates, optionally in combination with other data from sensors such as, for example, an IMU. This methodology can work whenever the frame rate of the camera system is sufficiently high relative to the velocity of the HMD; however, if the frame rate is not sufficiently high, this approach may yield an estimate which is too inaccurate.

[0089] At block 1006, during pose estimation for the present epoch, the processor set can match the identified features in the present epoch with stored map points created from previous epochs in order to obtain a more accurate pose estimate. For example, features from each frame in the present epoch can be matched with stored map points that were matched to features from the corresponding frames in the preceding epoch. When attempting to match features in the present epoch with the stored map points matched in the preceding epoch, the processor set can determine where to search for those stored map points in the present epoch. Accordingly, when searching for a stored map point in the present epoch, the processor set may be configured to define a search window within which to search for a match in the present epoch. The search window may be predefined according to the largest movement that a feature is likely to make between subsequent epochs. Alternatively, an initial pose estimate for the present epoch, obtained as described previously at block 1004, can be used to calculate a smaller search window where the stored map point is likely to appear in a particular frame.

[0090] To determine the best matching feature in a particular frame for a particular stored map point, the processor can select only the set of features located within a defined search window of the frame as a set of possible matches. The processor can determine the best matching feature from the identified set of possible matches by comparing the descriptor of the stored map point with the descriptor of each feature in the identified set. The selection of the best matching feature can then be made by determining which feature has the descriptor with the highest number of matching cells when compared to the descriptor of the stored map point. Preferably, the image stream is blurred so that the processor set can match descriptors despite some degree of misalignment inherent in a cell-by-cell comparison.

[0091] Once the processor set has identified matches between features in the present epoch and stored map points from the preceding epoch, the processor set can calculate a more accurate pose for the present epoch, at block 1008. The processor set can calculate the pose using a suitable optimization technique, such as multi-camera bundle adjustment using Gauss- Newton, and using as constraints: the rigid body transformation between each camera associated with a frame in the epoch as determined during camera calibration; the position of all matched stored map points in 3D space; and the location of the corresponding features to which the stored map points were matched within their respective 2D frames in the real image stream. The processor set can use the origin of the stereo camera at the initial capture epoch as the origin of the real environment coordinate system (the "real" coordinate system, which can be scaled in world-coordinates due to the depth information available from the stereo cameras).

[0092] During pose estimation for the present epoch, if there are insufficient matches between stored map points and features in the present epoch, then the processor set can search within previously stored keyframes and their matches with stored map points 1014 to identify which key frame reflects the most similar capture of the real environment as compared to the present epoch. After identifying the keyframe, the processor set can estimate the posei+1 at ti+1 using a suitable algorithm, such as Perspective-Three-Point ("P3P") pose estimation conjugated with random sample consensus ("RANSAC"). At block 1010, the processor set may further derive a refined pose posei+1 at ti+1 by applying multi-camera pose optimization bundle adjustment. The processor set can use the resulting refined pose 1012of the frameset and the associated stored map points to perform odometry, environment mapping, and rendering of the AR image stream, including any virtual elements therein.

[0093] Referring again to Fig. 7, in a map management step at block 720, the processor set analyzes the buffered framesets buffered at block 710, and determines which framesets to include in a feature map 722 as a set of keyframes. The processor set selects keyframes in order to limit the number of framesets which must be processed in various steps. For example, the processor set can select every nth frame as a key frame, or add a frame based on its exceeding a threshold deviation from any previously stored keyframes. If the deviation is significant, the processor set selects the set of frames at that epoch as a further set of keyframes to add to the map. During the map management operation, the processor set can further cull keyframes from the feature map in order to keep the stored amount of keyframes within a suitable limit that can be stored and processed by the processor set. The processor set can map the sets of keyframes, along with their extracted features, to the feature map 722. The feature map comprises the keyframes and the sparse depth information extracted from the keyframes for use in pose refinement at block 712, loop closure at block 716 and dense 3D mapping stages at blocks 724 and 728, described later. Since the rigid body transformation of the camera system is known, the keyframes and salient features in the keyframes can be mapped using the estimated pose.

[0094] During pose tracking and mapping operations, the processor set may implement those operations using multiple threads. The processor set reads vectors or other data structures such as maps. Since one thread may be reading data ("reading thread") while another thread is writing ("writer thread") the same data, the processor set may use locks to prevent the writer thread from altering that data while the reader thread is reading that data. Accordingly, the reader does not read partially modified data. Because locks may slow processing, the processor set preferably assigns redundant copies of such data to the different threads. The processor set sets a flag that each thread can query to indicate whether the data subsequently has been modified relative to the data in the redundant copies. Preferably, the processor set only uses the lock if the data is flagged as modified, allowing the reader thread to access the modified data without the writer thread further modifying the data during the reading. If the flag is a Boolean variable, then it may take a value of true or false, which can be automatically modified without locks. If the flag is set to false by another thread, but the reader thread reads it as true due to some race condition, the reader thread may use its own redundant copy of the data, thus avoiding the reading of partially modified data. While the redundant copy of the data may be slightly obsolete, it may be preferable to reading partially modified data. If the data is infrequently updated, but read often, it may be preferable to provide redundant copies of the data and so that threads can read the data without locks.

[0095] Referring now to Fig. 12, a multi-user environment is illustrated in which each user's HMD 100 can contribute to a global map shared between the multiple users. In a use scenario, a first user 1210 begins exploring the lower storey 1201 of a building 1200, while a second user 1202 begins exploring the upper storey 1202. Each user's HMD 100 is configured to perform the visual-based SLAM method described herein. Initially, each HMD performs mapping and pose tracking using its own coordinate system. However, each HMD is configured to share, preferably in a wireless fashion, mapping information, such as key frames, map points and pose data with the other HMD. Once the first user mounts the stairs 1204 to the second storey along the path P, that user's HMD begins to capture regions of the real environment that are already described by pose and map data previously captured by the HMD of the second user. The first user's HMD can then rely on the second user's contribution to perform subsequent tracking refinements. Each user's HMD can serve as the master to the other's slave, or a central console 1220 or processor set remote from both users can serve as the master, while both users' HMDs serve as slaves to the console. Accordingly, the previously distinct maps contributed by each user's HMD can be reconciled into a global or master map which both HMDs can use for navigation and mapping. Both HMDs may continue to contribute to the global map. The user's may contribute further map points and key frames, or only map points, or only new map points. Further, if either user's HMD temporarily loses tracking, that HMD can re-initialize and perform mapping and pose tracking using a new set of coordinates, map points and key frames. Once that user's HMD recognizes a match to previously captured features, the HMD can relocalize the new set to the global set of coordinates, map points and key frames.

[0096] Referring again to Fig. 7, the processor set can use the keyframes to perform loop closure at block 716, in which the current estimated pose 714 is adjusted to compensate for any deviation from a previously estimated pose of a comparator keyframe. Over time, visual based pose tracking can exhibit cumulative errors. However, whenever the camera system of the HMD returns to a previously captured region of the real environment for which a previously captured reference keyframe is stored and available, the processor set can compare the estimated pose 714 of the HMD with the estimated pose of the stored keyframe 722. The processor set can assume that any difference is due to a cumulative error arising between the time of the current pose estimate and the pose estimate of the reference keyframe. The processor set can then "bundle adjust" (i.e., realign) all interim keyframes in the feature map in order to compensate for the cumulative error.

[0097] As shown in Figs. 6 and 7, the processor set optionally can be connected to an IMU 610 of the HMD to enhance visual based pose tracking through IMU fusion at block 718. Commonly available IMUs provide orientation readings in the 1000 Hz range, whereas commonly available image cameras typically operate in the 60 Hz range. While visual-based pose tracking is generally less susceptible to cumulative errors than IMU-based pose tracking, the higher response rate of IMUs can be leveraged. For any pose changes occurring between visual based pose readings, the processor set can add readings from the IMU until the next available visual- based pose estimate is available. Further optionally, if the HMD is equipped with a depth camera, the processor set can be configured to perform depth-based pose tracking whenever the visual information in the real image stream is too sparse or unreliable.

[0098] In addition to pose tracking, the processor set can use the real image stream to virtualize the real environment; that is, the processor set can extract, at block 724, and render, at block 728, a dense 3D map of the real environment using the depth and even the visual information obtainable from the real image stream. The processor set can assign the origin of the 3D map to any point including, for example, based on assigning a middle point between any two or more cameras forming a stereo configuration in the camera system as the origin. For each frameset, or preferably for each set of keyframes, in the real image stream captured by cameras in stereo configuration, the processor set can extract an epoch-specific dense 3D map, at block 724. The processor set can use the refined pose registered for that frameset, as well as the known rigid body transformation of the camera system, to align multiple dense 3D maps into a unified, dense 3D map, at block 728. The 3D map can be a point cloud or other suitable depth map, and the processor set can register colour, black and white, greyscale or other visual values to each point or voxel in the 3D map using visual information available in the real image stream. For example, any frameset captured by colour cameras can be used to contribute colour information to the 3D map in the captured region.

[0099] At block 720 the processor set can also periodically add new framesets or keyframes to previously mapped regions in the feature map and/or 3D map 722 to account for changes in those regions. The processor set can also compare present framesets to feature maps generated in previous instances to check whether there is already a feature map for the present region of the real environment. If the present region of the real environment is already mapped, the processor set can build on that map during subsequent pose tracking, mapping and rendering steps, thereby reducing initialization and mapping demands and also enhancing pose tracking accuracy.

[00100] In some applications, immediately after an initial epoch-specific 3D map has been positioned in the 3D map, the processor set can begin rendering virtual elements situated in the 3D map, thereby augmenting the real environment. In other applications, however, a greater awareness of the real environment is required before rendering virtual elements. As shown in Fig. 1 1 , at block 1 100, the processor set show the user the extent and quality of mapping for the surrounding physical environment. For example, the processor set can cause an indicator 1 102 to be displayed on the display 240 of the HMD. The indicator 1 102 can shown a first region 1 104 indicating where sufficient data has been acquired to map the real environment surrounding the user. A second region 1 106 in the indicator 1102 shows the user which areas require more data to map. The user is thus prompted to complete one or more rotations within the real environment until the processor set has determined that the threshold amount of spatial awareness has been met. For example, the processor set may require the user to capture a complete 360 360-degree view of a room prior to rendering any virtual elements within the 3D map of the room.

[00101] Referring again to Fig. 7, at block 726 the processor set can render virtual elements using any suitable rendering application, such as existing Unreal™ or Unity™ engines. The processor set renders the virtual elements with reference to the 3D map it has of the real environment. The processor set can therefore scale and position the virtual elements relative to the real environment. The virtual elements can at least partially conform to features within the real environment, or they can be entirely independent of such features. For example, the virtual elements can share structural and positional properties with real elements in the real environment while having different colours or surface textures. In an exemplary gameplay scenario, the processor set can retexture the walls of a room with a virtual stone surface to depict a castle-like AR environment. Similarly, the real ground can be retextured as grass, soil, tile, or any flooring surface. Alternatively, the processor set can render completely virtual elements which can be bounded by the dimensional limitations of the real environment while sharing no structure with any real elements. In another exemplary gameplay scenario, then, the processor set can render an ogre situated in the 3D map regardless of the presence or absence of another real feature at the same location.

[00102] In at least one embodiment, the processor set can be configured to render only those virtual elements which are generally within the region of the 3D map corresponding to the user's current FOV into the real environment. By limiting rendering to that region, processing demands are reduced relative to rendering all virtual elements within the 3D map. The processor set can further render virtual elements which are peripheral to the current FOV in order to reduce latency when the user's FOV undergoes relatively rapid changes.

[00103] The processor set can also be configured to shade the virtual elements based on current lighting conditions in the real environment. When a user views the resulting AR environment, immersion is generally improved when real and virtual elements are seamlessly integrated into the displayed scene. One aspect of integration is for virtual elements to mimic the ambient lighting of the real environment. For example, if there is a point source of light in the real environment, the processor set can be configured to generally mimic that light source when rendering the virtual elements. In one embodiment, the processor set can apply an image-based lighting ("IBL") process, such as, for example, as described in Paul Debevec, "Image-Based Lighting," IEEE Computer Graphics and Applications, vol. 22, no. 2, pp. 26-34, March/April, 2002. However, IBL techniques typically are configured for use of an omnidirectional real image of the real environment as captured by an HDR camera; these may not work well with the camera system and processor set configuration of the HMD. Therefore, the processor set can be configured to implement an approximation method to satisfactory effect.

[00104] In at least one embodiment, the approximation method comprises: the processor set acquiring a frameset from the camera system or buffer, the frameset comprising a plurality of RGB values for the captured region of the real environment; the processor set deriving from each RGB value an HSB, HSL or HSV value; calculating the mean colour and intensity values for the frameset; and using the mean RGB and HSB values in a surface shader to modify the surface appearance of the virtual objects. The processor set can be further configured to reduce computational effort by overcompensating for shading in a subset of the pixels or vertices in the shader, while leaving the remaining pixels or vertices untransformed. Due to the overcompensation of the subset, the perceived shading of the entire image will approximate the perceived shading as though all pixels or vertices had been corrected by the un- overcompensated amount. The processor set is preferably configured to perform rendering and shading in a multi-threaded manner, i.e., in parallel rather than sequentially. [00105] Another aspect of integration of real and virtual elements is for the pose of virtual elements to substantially reflect the pose of corresponding real elements within the AR environment. The processor set can be configured to reflect the user's actual pose in the AR image stream by a combining method, which can also apply occlusion to the real and virtual elements based on their relative positions within the 3D map.

[00106] In embodiments of a combining method at block 730 of Fig. 7, the processor set renders the AR image stream from the point of view of a notional or virtual camera system situated in the 3D map. The notional camera system captures within its FOV a region of the 3D map, including any mapped virtual and real elements represented in the captured region. The processor set preferably defines the notional camera by parameters similar to those of the camera system of the HMD. The processor set can then adjust the pose of the notional camera to reflect the actual calculated pose of the HMD's camera system. The AR image stream is then captured by the notional camera as its pose changes during use. The resulting AR image stream substantially reflects the user's actual pose within the real environment. In further embodiments, the combining method comprises treating the real image stream as a texture applied to a far clipping plane in the FOV of the notional camera system, and occluding any overlapping real and virtual regions by selecting the element that is nearer to the notional camera system for inclusion in the AR image stream.

[00107] Rendering of virtual elements can be relatively demanding and time-consuming. In various scenarios, the time to render virtual elements can be sufficient to result in a perceptible lag between the actual apparent pose of the HMD and the apparent pose of the AR environment at the time of display. To reduce the apparent lag, the processor set can estimate the user's pose when rendering is estimated to be complete. Prior to rendering the virtual elements, the processor set can either over-render the scene to include the virtual elements within the FOVs of notional camera system at the current, interim and estimated post-rendering poses, or it can render the scene to include only those elements at the estimated post-rendering pose. Immediately prior to issuing the AR image stream to the display, the processor set can verify the current pose against the estimate to update the pose of any virtual elements. This last step can be performed relatively quickly, particularly if the initial estimated pose proves to be accurate. The virtual objects can therefore be substantially matched to the nearly real-time pose and real image stream provided by the camera system of the HMD.

[00108] To present a 3D AR image stream, the notional camera system preferably comprises notional cameras with extrinsic and intrinsic parameters simulating those of the left and right front cameras of the camera system on the HMD. Each of the notional cameras captures the 3D map from its own perspective. The processor set can then provide the AR image stream captured by each of the notional cameras to its respective side of the display system. Since the left and right front cameras, the notional cameras and the display system are all generally aligned with the FOVs of the user's eyes, the user can view a 3D augmentation of the real environment before the user.

[00109] Fig. 13 illustrates a handheld controller 1301 capable of camera based pose tracking. In one aspect, the handheld controller may be used to control an HMD, such as, for example, the HMD shown in Figs. 1 to 5. In another aspect, however, the handheld controller may be used to control other computing devices, or independently of any other device. In still another aspect, the handheld controller may be used as a handheld mapping tool to scan and map various objects in an environment.

[001 10] The handheld controller comprises a handle 1303 and a camera system 1305 to capture depth and visual information from the real environment surrounding the handheld controller. The handheld controller further comprises a processor set 1307 in the form of a system on module ("SOM"), as shown in Fig. 14, that is communicatively coupled to the camera system 1305 and configured to perform pose tracking and, optionally, mapping of the real environment using the depth information and the visual information from the camera system 1305. A battery or other power source (not shown) provides power for the systems onboard the handheld controller.

[001 11] The camera system 1305 of the handheld controller comprises four cameras: a right side camera 1305a, an upper front camera 1305b, a lower front camera 1305c, and a left side camera 1305d, each of which is spaced apart from the other cameras about the handheld controller 1301 and directed outwardly therefrom to capture an image stream of the real environment within its respective field of view ("FOV"). In at least one embodiment, all the cameras are visual based and can capture visual information from the real environment.

[001 12] The upper front camera 1305b and the lower front camera 1305c are forward- facing and have partially overlapping fields of view to capture a common region of the real environment from their respective viewpoints. The relative pose of the front cameras is fixed and can be used to derive depth information from the common region captured in the real image stream. The front cameras can be spaced apart at either extreme of the handheld controller, as shown, to maximize the depth sensitivity of the stereo camera configuration. Preferably, the cameras are maintained closely enough together that their respective fields of view begin to overlap within the inner limits of a region on interest in the real environment for which detailed depth information is desirable. It will be appreciated that that distance is a function of the use case and the angles of view of the front cameras.

[001 13] The right side camera and the left side camera also can be oriented with their respective FOVs overlapping with one or more FOVs of any of the front cameras, thereby increasing the stereo range of the camera system beyond the front region of the visor.

[001 14] The right side camera and the left side camera are non-forward-facing. That is, the side cameras are configured to aim in a direction other than forwards; such as, for example, side-facing or generally perpendicularly aligned in relation to the forward-facing cameras. The side cameras generally increase the combined FOV of the camera system and, more particularly, the side-to-side view angle of the camera system. The side cameras can provide more robust pose tracking operations by serving as alternative or additional sources for real image streams, as previously described with reference to the camera system 210 illustrated in Figs. 1 to 5.

[001 15] While the embodiment of the camera system 1305 illustrated in Fig. 13 comprises four cameras, the camera system 1305 alternatively can comprise fewer than four or more than four cameras. A higher number of spaced apart image cameras can increase the potential combined FOV of the camera system, or the degree of redundancy for visual and/or depth information within the image streams. This can lead to more robust pose tracking or other operations; however, increasing the number of cameras can also increase the weight, power consumption and processing demands of the camera system, as previously described with reference to the camera system 210 illustrated in Figs. 1 to 5.

[001 16] In at least one embodiment, one or more of the cameras can be a depth camera, such as, for example, a structured light ("SL") camera or a time-of-flight ("TOF") camera. In at least one embodiment, the handheld device can comprise a further front camera (not shown) that is a depth camera while the lower front camera and the upper front camera can be image cameras. This allows the processor set to choose between sources of information depending on the application and conditions in the real environment. For example, the processor set can be configured to use depth information from the depth camera whenever ambient lighting is too poor to derive depth information from the upper front and lower front cameras in stereo; further, as the user moves the handheld device toward a surface, the surface may become too close to the user's forward-facing image cameras to capture a stereo image of the surface, or, when the handheld device is too far from a surface, all potential features on the surface may lie out of range of the image cameras; further, the surface captured by the visual-based cameras may lack salient features for visual-based pose tracking. In those cases, the processor set can rely on the depth camera. In one embodiment, as in the camera system 210 of the HMD, the depth camera of the handheld device comprises both an SL and a TOF emitter as well as an image sensor to allow the depth camera to switch between SL and TOF modes. For example, the real image stream contributed by a TOF camera typically affords greater near range resolution useful for mapping the real environment, while the real image stream contributed by an SL camera typically provides better long range and low resolution depth information useful for pose tracking. The preferred selection of the cameras is dependent on the desired applications for the camera system 1305.

[001 17] In certain applications for the handheld device where an accurate detection of scale in the real environment is unnecessary or undesirable, there may be no need for a depth camera or a stereo camera configuration. In that case, each camera in the camera system can have an FOV that is substantially independent of the FOVs of the other cameras, or the camera system may comprise only a single camera.

[001 18] Each camera is equipped with a lens suitable for the desired application, as previously described with reference to the camera system 210 shown in Figs. 1 to 5. The cameras of the camera system 1305 of the handheld device can be equipped with fisheye lenses to widen their respective FOVs. In other embodiments, the camera lenses can be wide- angle lenses, telescopic lenses or other lenses, depending on the desired FOV and magnification for the camera system 210. The processor set can be correspondingly configured to correct any lens-related distortion.

[001 19] The handheld device can further comprise an inertial measurement unit ("IMU", not shown) to provide orientation information for the handheld device in addition to any pose information derivable from the real image stream. The IMU can be used to establish vertical and horizontal directions, as well as to provide intermediate orientation tracking between frames in the real image stream, as described below in greater detail.

[00120] The processor set 1307 of the handheld device is configured to perform various functions, such as pose tracking. The processor set also may be configured to map captured topographies in the real environment. The processor set 1307 is communicatively coupled to the camera system 1305 and also to any IMU of the handheld device. The processor set can be mounted to a PCB that is housed on or within the handheld device. Alternatively or additionally, the handheld device can comprise a wired or wireless connection to a processor set located remotely from the handheld device. For example, the processor set can be situated in an HMD worn by the user. Alternatively, a processor set onboard the handheld device can share tasks with another remotely situated processor set via wired or wireless communication between the processor sets.

[00121] Fig. 1 1 illustrates an embodiment of a processing environment onboard the handheld device, though others are contemplated. A SOM comprises: hardware accelerated computer vision logic 1400 embodied as an SPU, such as an FPGA, ASIC or other suitable configuration to perform various pose tracking operations, such as feature extraction and descriptor generation; a GPU embodied as at least one or a pair or more of SLAM algorithm cores 1402 to supplement the hardware accelerated computer vision logic 1400, an IMU core 1404 to generate readings from a connected inertial measurement sensor 1323, a fusion pose core 1406 to reconcile SLAM and IMU readings, a communications port 1408 to manage communications between the SOM and external equipment elsewhere onboard or off board the handheld device; and an integrated high speed data bus 1410 connected to the hardware accelerated computer vision logic 1400, the cores and the communication ports to transfer data within the components of the SOM 1401. The data bus 1410 is connected further to an external memory 1412 accessible to the SOM. The communications port is connected to an antenna or other wireless communication system 1416, an IMU 1418, an HDMI, USB, PCIE or other suitable serial port 1420 to enable connection to other components onboard or off board the handheld device.

[00122] During an application, the SOM acquires a real image stream from the camera system and implements visual-based simultaneous localisation and mapping ("SLAM") techniques, such as, for example, the techniques described above with reference to Figs. 7 to 10.

[00123] The SOM can derive scaled depth information directly by using a camera system that includes a depth camera or at least one pair of cameras arranged in stereo; i.e., spaced apart from each other with at least partially overlapping FOVs. As previously described, the SOM can instead implement temporal stereo to derive scale for the captured region of the real environment. Alternatively, the SOM may not derive scale, and thus perform unsealed, relative pose tracking that is suitable in various implementations.

[00124] As previously described, the SOM optionally can be connected to an IMU 1418 to enhance visual based pose tracking through IMU fusion. Commonly available IMUs provide orientation readings in the 1000 Hz range, whereas commonly available image cameras typically operate in the 60 Hz range. While visual-based pose tracking is generally less susceptible to cumulative errors than IMU-based pose tracking, the higher response rate of IMUs can be leveraged. For any pose changes occurring between visual based pose readings, the SOM can add readings from the IMU until the next available visual-based pose estimate is available. Further optionally, if the handheld device is equipped with a depth camera, the processor set can be configured to perform depth-based pose tracking whenever the visual information in the real image stream is too sparse or unreliable.

[00125] In addition to pose tracking, the SOM can use the real image stream to virtualize the real environment; that is, the SOM can render a dense 3D map of the real environment using the depth and even the visual information obtainable from the real image stream by further comprising a suitable graphics processor configured to perform such operations. The SOM can assign the origin of the 3D map to any point including, for example, based on assigning a middle point between any two or more cameras forming a stereo configuration in the camera system as the origin. For each frameset, or preferably for each set of keyframes, in the real image stream captured by cameras in stereo configuration, the SOM can extract an epoch-specific dense 3D map. The SOM can use the refined pose registered for that frameset, as well as the known rigid body transformation of the camera system, to align multiple dense 3D maps into a unified, dense 3D map. The 3D map can be a point cloud or other suitable depth map, and the processor set can register colour, black and white, greyscale or other visual values to each point or voxel in the 3D map using visual information available in the real image stream. For example, any frameset captured by colour cameras can be used to contribute colour information to the 3D map in the captured region.

[00126] The SOM can also periodically add new framesets or keyframes to previously mapped regions in the feature map and/or 3D map to account for changes in those regions. The SOM can also compare present framesets to feature maps generated in previous instances to check whether there is already a feature map for the present region of the real environment. If the present region of the real environment is already mapped, the processor set can build on that map during subsequent pose tracking, mapping and rendering steps, thereby reducing initialization and mapping demands and also enhancing pose tracking accuracy.

[00127] In at least one embodiment, the SOM 1401 shown in Fig. 14 can be implemented in conjunction with the handheld device shown in Fig. 13. In at least another embodiment, the SOM can be implemented in conjunction with the HMD 100 shown in Figs. 1 to 5. In at least yet another embodiment, the SOM shown in Fig. 14 can be implemented in another system or device for camera-based pose tracking and, optionally, environment mapping.

[00128] In at least still another embodiment, the handheld device shown in Fig. 13 can be implemented alone or in conjunction with an HMD worn by the same user. The HMD and the handheld device can be communicatively coupled to enable sharing between them of map and pose data. In one aspect, the implementation of the handheld device alongside the HMD enables gesture- or motion-based user input in an AR experience. While the illustrated embodiments are suitable for diverse computer-vision implementations, they are particularly suitable for implementations with relatively limited power, weight and size limitations, such as, for example, for tracking the pose of an AR or VR headset or a controller therefor.

[00129] Although the foregoing has been described with reference to certain specific embodiments, various modifications thereto will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the appended claims.

Claims

CLAIMS We claim:

1. A camera system for pose tracking and mapping in a real physical environment for an augmented or virtual reality (AR/VR) head mounted device (HMD), the HMD comprising a visor housing having a display viewable by a user wearing the HMD and electronics for driving the display, the camera system comprising:

at least two front facing cameras, each front facing camera disposed upon or within the HMD to capture visual information from the real physical environment within its respective field of view (FOV) surrounding the HMD, the FOV of at least one of the front facing cameras overlapping at least partially with the FOV of at least one other front facing camera to provide a stereoscopic front facing camera pair to capture a common region of the real physical environment from their respective viewpoints, each of the front facing cameras configured to obtain the visual information over a plurality of epochs to generate a real image stream; a memory for storing the real image stream;

a processor set comprising a special purpose processing unit (SPU) and a general purpose processing unit (GPU), the processor set configured to:

obtain the real image stream from the at least two front facing cameras or the memory; and

process the real image stream to perform simultaneous localisation and mapping (SLAM), the performing SLAM comprising deriving scaled depth information from the real image stream.

2. The camera system of claim 1 , wherein the FOVs of the at least two front facing cameras are substantially aligned with corresponding FOVs of the user's right and left eyes, respectively.

3. The camera system of claim 1 further comprising at least two side cameras being a right side camera and a left side camera generally aiming in a direction other than front facing.

4. The camera system of claim 3, wherein the right side camera has a FOV partially overlapping one of the front facing cameras and the left side camera has a FOV partially overlapping one of the front facing cameras.

The camera system of claim 1 , wherein the processor set comprises an accelerated processing unit providing the SPU and the GPU, and one or more FPGAs coupled to the camera system and the memory for:

obtaining the real image stream from the cameras;

storing the real image stream to the memory; and

providing a subset of pose tracking operations using the real image stream.

The camera system of claim 1 , wherein the processor set downscales the image streams prior to performing SLAM.

The camera system of claim 1 , wherein performing SI_AM at an initialization phase comprises:

obtaining parameters of the front facing cameras;

receiving the image stream corresponding to a first parallel set of image stream frames from each of the front facing cameras;

extracting features from the set of image stream frames;

generating a descriptor for each extracted feature;

matching features between each set of image stream frames based on the descriptors;

estimating real physical world coordinates of a point in the real physical environment on the basis of the matched features and the parameters;

assigning an origin to one of the real physical world coordinates; and

generating a map around the origin, the map reflecting real-world scale and dimensions of the real physical environment captured in the image stream frames.

The camera system of claim 7, wherein performing SLAM at a tracking phase subsequent to the initialization phase comprises:

obtaining parameters of the front facing cameras;

receiving the image stream corresponding to the image stream frames from at least one of the front facing cameras at a second epoch subsequent to a first epoch used in the initialization phase; extracting features from the image stream frames;

generating a descriptor for each extracted feature;

matching features between the image stream frames to those of the first epoch based on the descriptors; and

estimating the real physical world coordinates of the point in the real physical environment on the basis of the matched features, the parameters and the origin.

9. The camera system of claim 7, wherein the extracting features comprises applying a FAST or FAST-9 feature extraction technique to assign a score and predominant angle to the features.

10. The camera system of claim 9, wherein the SLAM further comprises culling the extracted features prior to matching the features, the culling being based on the score.

1 1. The camera system of claim 7, wherein the generating descriptors comprises assigning a rotation-invariant ORB descriptor.

12. The camera system of claim 1 further comprising at least one depth camera disposed upon or within the HMD to capture depth information from the real physical environment within a depth facing FOV.

13. The camera system of claim 12, wherein the depth camera is a structured light camera.

14. The camera system of claim 12, wherein the depth camera is a time-of-f light camera.

15. The camera system of claim 12, wherein the depth camera is disposed approximately at the centre of the front of the visor housing generally in the middle of the user's FOV.

16. The camera system of claim 12, wherein the depth camera comprises a structured light camera, a time-of-flight camera and an image sensor.

17. The camera system of claim 16, wherein the processor set switches operation of the depth camera between the structured light camera, the time-of-flight camera and the image sensor depending on the desired application for the camera system between pose tracking and range resolution.

18. The camera system of claim 1 , wherein the processor set is disposed within the visor housing.

19. The camera system of claim 1 , wherein the processor set is remotely located and coupled to the HMD by wired or wireless communication link.

A camera-based method for pose tracking and mapping in a real physical environment for an augmented or virtual reality (AR/VR) head mounted device (HMD), the HMD comprising:

a visor housing having a display viewable by a user wearing the HMD and electronics for driving the display, the camera-based method comprising:

at least two front facing cameras, each front facing camera disposed upon or within the HMD to capture visual information from the real physical environment within its respective field of view (FOV) surrounding the HMD, the FOV of at least one of the front facing cameras overlapping at least partially with the FOV of at least one other front facing camera to provide a stereoscopic front facing camera pair to capture a common region of the real physical environment from their respective viewpoints;

a memory for storing the real image stream;

a processor set comprising a special purpose processing unit (SPU) and a general purpose processing unit (GPU);

the camera-based method comprising:

configuring the front facing cameras configured to obtain the visual information over a plurality of epochs to generate a real image stream;

configuring the processor set to:

The camera-based method of claim 20, wherein the FOVs of the at least two front facing cameras are substantially aligned with corresponding FOVs of the user's right and left eyes, respectively.

The camera-based method of claim 20, wherein the HMD further comprises at least two side cameras disposed thereon or therein, the at least two side camera being a right side camera and a left side camera generally aiming in a direction other than front facing.

23. The camera-based method of claim 22, wherein the right side camera has a FOV partially overlapping one of the front facing cameras and the left side camera has a FOV partially overlapping one of the front facing cameras.

24. The camera-based method of claim 20, wherein the processor set comprises an accelerated processing unit providing the SPU and the GPU, and one or more FPGAs coupled to the camera system and the memory for:

obtaining the real image stream from the cameras;

storing the real image stream to the memory; and

providing a subset of pose tracking operations using the real image stream.

25. The camera-based method of claim 20, wherein the processor set downscales the image streams prior to performing SLAM.

26. The camera-based method of claim 20, wherein performing SLAM at an initialization phase comprises:

obtaining parameters of the front facing cameras;

extracting features from the set of image stream frames;

generating a descriptor for each extracted feature;

assigning an origin to one of the real physical world coordinates; and generating a map around the origin, the map reflecting real-world scale and dimensions of the real physical environment captured in the image stream frames.

27. The camera-based method of claim 26, wherein performing SLAM at a tracking phase subsequent to the initialization phase comprises:

obtaining parameters of the front facing cameras; receiving the image stream corresponding to the image stream frames from at least one of the front facing cameras at a second epoch subsequent to a first epoch used in the initialization phase;

extracting features from the image stream frames;

generating a descriptor for each extracted feature;

28. estimating the real physical world coordinates of the point in the real physical environment on the basis of the matched features, the parameters and the origin.

29. The camera-based method of claim 26, wherein the extracting features comprises applying a FAST or FAST-9 feature extraction technique to assign a score and predominant angle to the features.

30. The camera-based method of claim 28, wherein the SLAM further comprises culling the extracted features prior to matching the features, the culling being based on the score.

31. The camera-based method of claim 26, wherein the generating descriptors comprises assigning a rotation-invariant ORB descriptor.

32. The camera-based method of claim 20, wherein the HMD further comprises at least one depth camera disposed thereon or therein to capture depth information from the real physical environment within a depth facing FOV.

33. The camera-based method of claim 31 , wherein the depth camera is a structured light camera.

34. The camera-based method of claim 31 , wherein the depth camera is a time-of-flight camera.

35. The camera-based method of claim 31 , wherein the depth camera is disposed approximately at the centre of the front of the visor housing generally in the middle of the user's FOV.

36. The camera-based method of claim 31 , wherein the depth camera comprises a structured light camera, a time-of-flight camera and an image sensor.

37. The camera-based method of claim 35, wherein the processor set switches operation of the depth camera between the structured light camera, the time-of-flight camera and the image sensor depending on the desired application for the camera system between pose tracking and range resolution.

38. The camera-based method of claim 20, wherein the processor set is disposed within the visor housing.

39. The camera-based method of claim 20, wherein the processor set is remotely located and coupled to the HMD by wired or wireless communication link.

40. An augmented or virtual reality (AR/VR) head mounted device (HMD) capable of pose tracking and mapping in a real physical environment, the HMD comprising a visor housing having a display viewable by a user wearing the HMD and electronics for driving the display, and a camera system comprising:

41. The HMD of claim 39, wherein the FOVs of the at least two front facing cameras are substantially aligned with corresponding FOVs of the user's right and left eyes, respectively.

42. The HMD of claim 39 further comprising at least two side cameras being a right side camera and a left side camera generally aiming in a direction other than front facing.

The HMD of claim 41 , wherein the right side camera has a FOV partially overlapping one of the front facing cameras and the left side camera has a FOV partially overlapping one of the front facing cameras.

The HMD of claim 39, wherein the processor set comprises an accelerated processing unit providing the SPU and the GPU, and one or more FPGAs coupled to the camera system and the memory for:

obtaining the real image stream from the cameras;

storing the real image stream to the memory; and

providing a subset of pose tracking operations using the real image stream.

The HMD of claim 39, wherein the processor set downscales the image streams prior to performing SLAM.

The HMD of claim 39, wherein performing SLAM at an initialization phase comprises: obtaining parameters of the front facing cameras;

extracting features from the set of image stream frames;

generating a descriptor for each extracted feature;

assigning an origin to one of the real physical world coordinates; and

The HMD of claim 45, wherein performing SLAM at a tracking phase subsequent to the initialization phase comprises:

extracting features from the image stream frames;

generating a descriptor for each extracted feature;

48. The HMD of claim 45, wherein the extracting features comprises applying a FAST or FAST-9 feature extraction technique to assign a score and predominant angle to the features.

49. The HMD of claim 47, wherein the SLAM further comprises culling the extracted features prior to matching the features, the culling being based on the score.

50. The HMD of claim 45, wherein the generating descriptors comprises assigning a rotation-invariant ORB descriptor.

51. The HMD of claim 40 further comprising at least one depth camera disposed thereon or therein to capture depth information from the real physical environment within a depth facing FOV.

52. The HMD of claim 50, wherein the depth camera is a structured light camera.

53. The HMD of claim 50, wherein the depth camera is a time-of-flight camera.

54. The HMD of claim 50, wherein the depth camera is disposed approximately at the centre of the front of the visor housing generally in the middle of the user's FOV.

55. The HMD of claim 50, wherein the depth camera comprises a structured light camera, a time-of-flight camera and an image sensor.

56. The HMD of claim 44, wherein the processor set switches operation of the depth camera between the structured light camera, the time-of-flight camera and the image sensor depending on the desired application for the camera system between pose tracking and range resolution.

57. The HMD of claim 39, wherein the processor set is disposed within the visor housing.

58. The HMD of claim 39, wherein the processor set is remotely located and coupled to the HMD by wired or wireless communication link.

59. A camera system for pose tracking and mapping in a real physical environment for an augmented or virtual reality (AR/VR) device, the camera system comprising:

at least two front facing cameras, each front facing camera disposed upon or within the AR/VR device to capture visual information from the real physical environment within its respective field of view (FOV) surrounding the AR/VR device, the FOV of at least one of the front facing cameras overlapping at least partially with the FOV of at least one other front facing camera to provide a stereoscopic front facing camera pair to capture a common region of the real physical environment from their respective viewpoints, each of the front facing cameras configured to obtain the visual information over a plurality of epochs to generate a real image stream;

a memory for storing the real image stream;

60. The camera system of claim 58 further comprising at least two side cameras being a right side camera and a left side camera generally aiming in a direction other than front facing.

61. The camera system of claim 59, wherein the right side camera has a FOV partially overlapping one of the front facing cameras and the left side camera has a FOV partially overlapping one of the front facing cameras.

62. The camera system of claim 58, wherein the processor set comprises an accelerated processing unit providing the SPU and the GPU, and one or more FPGAs coupled to the camera system and the memory for:

obtaining the real image stream from the cameras; storing the real image stream to the memory; and

providing a subset of pose tracking operations using the real image stream.

63. The camera system of claim 58, wherein the processor set downscales the image streams prior to performing SLAM.

64. The camera system of claim 58, wherein performing SLAM at an initialization phase comprises:

obtaining parameters of the front facing cameras;

extracting features from the set of image stream frames;

generating a descriptor for each extracted feature;

65. The camera system of claim 63, wherein performing SLAM at a tracking phase subsequent to the initialization phase comprises:

obtaining parameters of the front facing cameras;

receiving the image stream corresponding to the image stream frames from at least one of the front facing cameras at a second epoch subsequent to a first epoch used in the initialization phase;

extracting features from the image stream frames;

generating a descriptor for each extracted feature;

matching features between the image stream frames to those of the first epoch based on the descriptors; and estimating the real physical world coordinates of the point in the real physical environment on the basis of the matched features, the parameters and the origin.

66. The camera system of claim 63, wherein the extracting features comprises applying a FAST or FAST-9 feature extraction technique to assign a score and predominant angle to the features.

67. The camera system of claim 65, wherein the SLAM further comprises culling the extracted features prior to matching the features, the culling being based on the score.

68. The camera system of claim 63, wherein the generating descriptors comprises assigning a rotation-invariant ORB descriptor.

69. The camera system of claim 58 further comprising at least one depth camera disposed upon or within the AR/VR device to capture depth information from the real physical environment within a depth facing FOV.

70. The camera system of claim 68, wherein the depth camera is a structured light camera.

71. The camera system of claim 68, wherein the depth camera is a time-of-f light camera.

72. The camera system of claim 68, wherein the depth camera comprises a structured light camera, a time-of-flight camera and an image sensor.

73. The camera system of claim 71 , wherein the processor set switches operation of the depth camera between the structured light camera, the time-of-flight camera and the image sensor depending on the desired application for the camera system between pose tracking and range resolution.

74. The camera system of claim 58, wherein the processor set is disposed within the AR/VR device.

75. The camera system of claim 58, wherein the processor set is remotely located and coupled to the AR/VR device by wired or wireless communication link.

76. A camera-based method for pose tracking and mapping in a real physical environment for an augmented or virtual reality (AR/VR) device, the AR/VR device comprising:

a memory for storing the real image stream;

the camera-based method comprising:

configuring the processor set to:

77. The camera-based method of claim 75, wherein the AR/VR device further comprises at least two side cameras disposed thereon or therein, the at least two side camera being a right side camera and a left side camera generally aiming in a direction other than front facing.

78. The camera-based method of claim 76, wherein the right side camera has a FOV partially overlapping one of the front facing cameras and the left side camera has a FOV partially overlapping one of the front facing cameras.

79. The camera-based method of claim 75, wherein the processor set comprises an accelerated processing unit providing the SPU and the GPU, and one or more FPGAs coupled to the camera system and the memory for:

obtaining the real image stream from the cameras;

storing the real image stream to the memory; and

providing a subset of pose tracking operations using the real image stream.

80. The camera-based method of claim 75, wherein the processor set downscales the image streams prior to performing SLAM.

81. The camera-based method of claim 75, wherein performing SLAM at an initialization phase comprises:

obtaining parameters of the front facing cameras;

extracting features from the set of image stream frames;

generating a descriptor for each extracted feature;

82. The camera-based method of claim 80, wherein performing SLAM at a tracking phase subsequent to the initialization phase comprises:

obtaining parameters of the front facing cameras;

extracting features from the image stream frames;

generating a descriptor for each extracted feature;

83. The camera-based method of claim 80, wherein the extracting features comprises applying a FAST or FAST-9 feature extraction technique to assign a score and predominant angle to the features.

84. The camera-based method of claim 82, wherein the SI_AM further comprises culling the extracted features prior to matching the features, the culling being based on the score.

85. The camera-based method of claim 80, wherein the generating descriptors comprises assigning a rotation-invariant ORB descriptor.

86. The camera-based method of claim 75, wherein the AR/VR device further comprises at least one depth camera disposed thereon or therein to capture depth information from the real physical environment within a depth facing FOV.

87. The camera-based method of claim 85, wherein the depth camera is a structured light camera.

88. The camera-based method of claim 85, wherein the depth camera is a time-of-flight camera.

89. The camera-based method of claim 85, wherein the depth camera is disposed approximately at the centre of the front of the visor housing generally in the middle of the user's FOV.

90. The camera-based method of claim 85, wherein the depth camera comprises a structured light camera, a time-of-flight camera and an image sensor.

91. The camera-based method of claim 89, wherein the processor set switches operation of the depth camera between the structured light camera, the time-of-flight camera and the image sensor depending on the desired application for the camera system between pose tracking and range resolution.

92. The camera-based method of claim 75, wherein the processor set is disposed within the AR/VR device.

93. The camera-based method of claim 75, wherein the processor set is remotely located and coupled to the AR/VR device by wired or wireless communication link.