WO2012059280A2 - System and method for multiperspective telepresence communication - Google Patents

System and method for multiperspective telepresence communication

Info

Publication number
WO2012059280A2
Authority
WO
WIPO (PCT)
Prior art keywords
users
location
user
module
image
Application number
PCT/EP2011/067091
Other languages
French (fr)
Other versions
WO2012059280A3 (en)
Inventor
Jaume Civit
Ken Zangelin
Mattias Barthel
Oscar Divorra
Pablo Rodriguez Rodriguez
Nuria Oliver Ramirez
Original Assignee
Telefonica, S.A.
Application filed by Telefonica, S.A.
Publication of WO2012059280A2
Publication of WO2012059280A3


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N 7/142 Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • H04N 7/144 Constructional details of the terminal equipment, e.g. arrangements of the camera and the display; camera and display on the same optical axis, e.g. optically multiplexing the camera and display for eye to eye contact
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/30 Image reproducers
    • H04N 13/349 Multi-view displays for displaying three or more geometrical viewpoints without viewer tracking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/194 Transmission of image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/204 Image signal generators using stereoscopic image cameras
    • H04N 13/243 Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/282 Image signal generators for generating image signals corresponding to three or more geometrical viewpoints, e.g. multi-view systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/236 Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N 21/2365 Multiplexing of several video streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 Structure of client; Structure of client peripherals
    • H04N 21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N 21/4223 Cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/431 Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N 21/4312 Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N 21/4316 Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations, for displaying supplemental content in a region of the screen, e.g. an advertisement in a separate window
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/434 Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N 21/4347 Demultiplexing of several video streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/478 Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N 21/4788 Supplemental services, e.g. displaying phone caller identification, shopping application; communicating with other users, e.g. chatting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N 21/65 Transmission of management data between client and server
    • H04N 21/658 Transmission by the client directed to the server
    • H04N 21/6587 Control parameters, e.g. trick play commands, viewpoint selection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems

Definitions

  • The present invention relates to communication and collaboration tools and, more specifically, to telepresence systems for enhancing eye contact in a communication process, mainly videoconferencing.
  • CMC: computer-mediated communication
  • AC: audio conferencing
  • VC: video conferencing
  • Telepresence systems have come a long way. After a number of technologies, such as broadband internet, high-quality HD low-delay video compression, or web applications, became mature enough, several products were able to break into the market, establishing a solid step forward towards practical Telepresence solutions. Among them are large-format videoconferencing systems from major providers such as Cisco Telepresence, HP Halo, Polycom or Teliris, and telecollaboration suites from leading software companies. However, current systems still suffer from fundamental imperfections that are known to be detrimental to the communication process. When communicating, naturalness, feeling of presence, eye contact and gaze cues are essential elements of visual communication, important for signalling attention and managing conversational flow. They are crucial for establishing rapport and trust in a relationship.
  • Eye contact and awareness of gaze direction influence several social psychological measures such as trust, persuasion, group performance, and perceived dominance. Eye contact and gaze awareness serve a multitude of functions in social interactions. For example, people increase their gaze when attempting to be persuasive or deceptive and to exert dominance. People are judged to be more attentive, more competent, and more credible with increased gaze. Furthermore, mutual eye contact facilitates the task of understanding other persons by speeding up the availability of relevant material from semantic memory. Nevertheless, lack of eye contact is a common flaw in current Telepresence and videoconferencing systems.
  • The horizontal eye contact problem arises when multiple video conference participants are located at the same site and watch the same display. In this case, all of the participants receive the same video sequence instead of the appropriate remote-party perspective as it would be seen from their viewing angle (if this were a face-to-face meeting). This, in turn, leads either to a false perception of being looked at by all of the participants at that site or, worse, to no one at that site receiving eye contact.
  • The horizontal eye contact problem in video conferencing appears when multiple participants share the same system at a site. In such common cases, all of these participants watch the same video sequence on the display and get the same feeling of eye contact. This problem is one of the main reasons for the artificial atmosphere of a video conferencing meeting where, for example, the speaker needs to name a remote participant before addressing him, since gaze direction is not preserved.
  • 2D Multi-screen Telepresence systems (Cisco Telepresence, HP Halo Telepresence, Polycom Telepresence, Teliris Telepresence): these are basically systems based on several screens and at least as many 2D cameras as screens, where regular 2D video is sent to each remote screen from its corresponding local camera in use.
  • Teliris 3D Telepresence: this solution delivers 3D video in high definition using goggles. The system is based on Teliris's traditional solution, which corresponds to a regular 2D Multi-screen Telepresence with the ability to switch between 2D and 3D meeting modes with goggles. Some stereoscopic 3D depth is thus available to the user, but this does not solve the main issues of eye contact in videoconferencing: the fact that the participants have to wear glasses in order to perceive the 3D video only creates another barrier in the eye contact line, and all participants see the very same 3D (stereo) image, which is unable to transmit the different perspectives depending on where a user is sitting.
  • Teleportel / Sony 3D Telepresence: the vertical eye contact issue is one of the major problems mentioned when discussing video conferencing applications. This problem stems from the fact that, in a session, the conferee looks at the monitor displaying the remote participants and not at the camera, which is usually located on top of the display. A common approach is to use a slanted semi-reflective mirror ahead of the display, which can capture the participant's image from the front while still allowing the participant to see the video sequence on the mirror. This method, although efficient, is not perceived well by users, since the extra piece of glass in the setup seems awkward. Nevertheless, there are several examples of commercial solutions which use a similar optical approach. Moreover, vertical eye contact is a minor issue compared to the problems generated by the lack of horizontal eye contact, and this solution is unable to provide corrected 3D-aware eye contact for multiple users.
  • This invention presents a full system design for horizontal eye-contact enabled video communications for multiple users.
  • The invention allows capturing from multiple cameras, processing the multi-view information, and transmitting and generating the necessary information to feed multi-perspective split-view screens, allowing simultaneous multi-user gaze direction awareness and better gesture perspective preservation.
  • By means of split-view screens, different users can have different images of the remote location depending on where they are located in the room. The invention therefore refers to a method for processing, transmitting and receiving images of a first set of users in a first location in communication with a second set of users in a second location.
  • Said method comprises the steps of: recording the first set of users with a plurality of cameras distributed according to the positions of the second set of users in the second location, each camera recording an image of the first set of users; separating the part of the image belonging to each user of the first set of users from the parts belonging to other users recorded by the same camera, thereby obtaining, for each user and for each camera, an image of the user; encoding multiple video signals, each containing the images of each user, and sending the encoded video signals to a receiving module in the second location; sending control signals from a control module for managing the communication between the location of the second set of users and the location of the first set of users; decoding the encoded video signals received by the receiving module; composing one multiperspective image of the first set of users containing all the images of each user; and showing the multiperspective image of the first set of users to the second set of users by means of a display.
  • The invention also comprises, in the preferred embodiment, sending audio from the first location, perceptually synchronously with the video signal, to the second location.
  • Additionally, the step of sending control signals comprises transmitting orders from a front-end module to a plurality of processing modules for creating a flow containing video and audio data, with the distribution of the orders managed by a plurality of manager modules connected both to the front-end module and to the plurality of processing modules. In some embodiments, the modules are scattered across a set of connected distributed computing units.
  • Optionally, the invention may comprise analyzing the available bandwidth of the communication channel at runtime in order to adjust the transmission rate between the first and the second location. It may also comprise, in the step of sending the video signal to the receiving module, using forward error correction and encapsulating the video signal in UDP packets to avoid retransmission.
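The following minimal sketch illustrates the flow of the claimed method for one camera and two users. Every function name is illustrative, not taken from the patent, and the encode/decode and network stages are reduced to stand-ins.

```python
import numpy as np

def capture(camera_id: int) -> np.ndarray:
    """Stand-in for a camera grab: one RGB frame."""
    return np.zeros((720, 1280, 3), dtype=np.uint8)

def separate_users(frame: np.ndarray, n_users: int) -> list[np.ndarray]:
    """Cut the frame so each user ends up in an independent image."""
    return list(np.array_split(frame, n_users, axis=1))

def transmit(encoded: bytes) -> bytes:
    """Stand-in for the encode/UDP-send/receive/decode path between sites."""
    return encoded

def compose_multiperspective(views: list[np.ndarray]) -> np.ndarray:
    """Tile all per-user views into one multiperspective image for the display."""
    return np.hstack(views)

frame = capture(camera_id=0)
views = separate_users(frame, n_users=2)
received = [transmit(v.tobytes()) for v in views]
decoded = [np.frombuffer(b, np.uint8).reshape(v.shape) for b, v in zip(received, views)]
display_image = compose_multiperspective(decoded)
```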
  • Another aspect of the invention refers to a system for processing, transmitting and receiving images of a first set of users in a first location in communication with a second set of users in a second location.
  • The system comprises: a plurality of cameras for recording the first set of users, each camera recording an image of the first set of users; a sub-module for separating the images belonging to each user from the images of the other users recorded by the same camera; an encoding module for encoding the signal coming from the sub-module, containing the images of each user, and sending the encoded video signal to a receiving module in the second location; a control module for managing the communication between the first location and the second location by sending control signals; a receiving module in the second location for decoding the encoded signal and composing one multiperspective image of the first set of users containing all the images of each user; and at least one display for showing the multiperspective image of the first set of users to the second set of users.
  • The system may also comprise: a screen-glasses synchronizer, connected to the receiving module, which is a transducer between a signal generated by the receiving module and a signal that active glasses can understand, in order to synchronize the display with the operation of the active glasses; and, for each user, active glasses, in communication with the screen-glasses synchronizer, for separating from the multiperspective image the perspective that best fits the position of the user in the second location.
  • A computer program is also provided, comprising program code means adapted to perform the steps of the method according to any of claims 1 to 7 when said program is run on a general-purpose processor, a digital signal processor, an FPGA, an ASIC, a microprocessor, a microcontroller, or any other form of programmable hardware.
  • This invention specifies the system necessary to capture, process, transmit, and receive the multi-perspective visual information.
  • The invention allows for a distributed scalable architecture where the multi-user, multi-screen, multi-perspective system is powered by distributed software on a small modular multicomputer grid.
  • Figure 1 illustrates a common situation of a multi-perspective experience for 2 users located in different positions in a room.
  • Figure 2 shows a block diagram of the system of the invention in one embodiment.
  • Figure 3 shows a block diagram of the acquisition, sender and transmission of each perspective to the destination machines.
  • Figure 4 shows a block diagram of reception, decoding and composing for remote multiperspective displays.
  • Figure 5 shows a block diagram involving the step of the invention of view splitting.
  • Figure 6 shows the camera capture distribution. Splitting and recomposing of multiperspective images.
  • Figure 7 shows a schema representing the communication of multiple 3D perspectives between two cameras and two displays corresponding to an exemplary communication for 2 to 2 people in 2 different sites.
  • Figure 8 shows a distributed architecture of the control system of the invention.
  • This invention provides a system for processing, transmitting and receiving the visual information within an eye-contact enabled, multi-perspective Telepresence system.
  • Such a multi-perspective system can send a different image to each user depending on the user's location in the room. Therefore, each user receives the image that best fits his position in the room.
  • The Telepresence system defined in this invention aims to deliver a Telepresence experience in which participants, in separate remote sites, feel like being in a "face-to-face" meeting.
  • This document describes the design of the system as a whole, all the contained modules, each of them with one or several features, as well as the intercommunication between modules.
  • The system of the invention has been designed as a multi-process modular system that can be implemented on a capable single machine, as well as on a multicomputer grid, depending on the needs.
  • Figure 2 specifies an exemplary embodiment of the invention: a modular architecture of a distributable multi-perspective Telepresence endpoint for 24 simultaneous perspectives (2 per camera) in a 3-site configuration. It allows capturing from four cameras in order to generate eight different perspectives of the two local users. Two perspectives of each local user are sent to each remote site (four to each remote site). A processing module is in charge of generating two perspectives from each camera. The system includes two receiving units, each in charge of feeding two screens by combining four different perspectives. The system also uses a distributed Graphical User Interface (GUI), able to operate the multi-perspective system from connected devices.
  • GUI: Graphical User Interface
  • The provided multi-perspective Telepresence system is able to synchronously grab frames from, in one embodiment of the invention, Stingray FireWire cameras (Allied Vision Technologies) that support the IEEE 1394b connection standard.
  • The distributed nature of the software architecture design allows flexibility regarding the choice of equipment.
  • The system is designed to achieve distributed and parallel computation capabilities.
  • Each of the modules can be translated, almost directly, into a plugin; in one embodiment, Gstreamer plugins are used and the whole is assembled in a Gstreamer framework, which confers flexibility, scalability and independent usability of the software pieces.
  • There is, however, a key control layer that is not Gstreamer-based and that provides the particular flexibility and the capacity to distribute execution that characterize this invention.
  • This module includes all the routines aimed at adapting the color space of the visual data coming from the cameras to the color space required by the processing and transmission stages. As the cameras produce images in a raw format, a conversion process is required to match the requirements of the other modules. The module also computes the different perspectives that can be generated from a single camera to be sent to their destinations. The video streams are encoded and sent through the network by this module. Symmetrically, the other side receives and decodes the incoming traffic. Video communication includes protection algorithms against network failures, such as Forward Error Correction and Automatic Bandwidth Adaptation. This module has been designed to include all the interfaces required to allow proper configuration and setup (before starting the streaming and even at runtime).
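As a rough analogue of such an outgoing video path, the sketch below builds a comparable chain with stock GStreamer 1.x elements from Python. The patent's implementation predates this API and uses an SVC encoder and FireWire capture instead; the host, port and crop values here are arbitrary examples.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Capture -> color conversion -> crop one perspective out of the frame ->
# low-latency encode -> RTP packetisation -> UDP send. Stock H.264 stands
# in for the SVC encoder described in the text.
pipeline = Gst.parse_launch(
    "v4l2src ! videoconvert ! "
    "videocrop right=640 ! "                    # keep the left half of a 1280px frame
    "x264enc tune=zerolatency bitrate=2000 ! "
    "rtph264pay ! udpsink host=192.0.2.10 port=5004"
)
pipeline.set_state(Gst.State.PLAYING)
```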
  • The audio module basically consists of a network-robust audio encoder and decoder that transmits the audio synchronously with the video.
  • This module is where the Fraunhofer IIS AAC-LD ACE library resides.
  • The mentioned library takes care not just of the sending and receiving of audio flows, but also of all echo cancelling activities.
  • This library performs a very robust encoding/decoding operation along with an echo cancellation procedure.
  • The library has been embedded in a plugin that works with raw audio at a configurable rate. It receives raw audio data from, and sends raw audio data to, the audio card.
  • The View Renderer module includes all the routines aimed at visualizing the stream and making the images match the properties of the monitor (virtual background substitution and other operations required at rendering level). Strongly graphics-oriented programming languages have been exploited to fulfil the real-time and low-latency requirements.
  • The previous modules are assembled together to build functional sub-systems. Basically, two sub-systems play an important role in the apparatus. The Acquisition and Sender module 25 is the first one; it takes care of the reading from the cameras and the encoding/sending activities.
  • The outgoing pipeline is composed of the following modules, represented in Figure 3:
  • Camera capture / color conversion 31: reads from the camera and converts the stream to a manageable color space.
  • Multi-Perspective View Splitter 32: this sub-module splits the image in order to separate the users of one site into at least two perspectives that will be transmitted to a remote terminal.
  • SVC: Scalable Video Coding
  • SVC encoder + BwE + FastUP 33: a Scalable Video Coding based encoder with Bandwidth Estimation (analysis of the available bandwidth at runtime in order to adjust the transmission rate) plus Fast Update (resending intra-frames if required); a sketch of this adaptation loop is given below.
  • FEC + Sender 34: Forward Error Correction, which protects the stream against network errors, and the sender itself, which is UDP-based.
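A minimal sketch of what the bandwidth-estimation and fast-update behaviour could look like; the `encoder` object, its `force_keyframe` method and the report fields are hypothetical placeholders, not the patent's actual interfaces.

```python
def adjust_bitrate(current_kbps: float, estimated_kbps: float,
                   headroom: float = 0.85) -> float:
    """Keep the encoder rate below the estimated channel capacity:
    back off immediately on congestion, probe upwards slowly otherwise."""
    target = estimated_kbps * headroom
    if target < current_kbps:
        return target
    return current_kbps + 0.1 * (target - current_kbps)

def on_back_channel_report(report: dict, encoder) -> None:
    """React to receiver feedback: loss triggers a Fast Update (a new
    intra-frame), and the bandwidth estimate drives the encoder rate."""
    if report.get("picture_lost"):
        encoder.force_keyframe()          # hypothetical encoder API
    encoder.bitrate = adjust_bitrate(encoder.bitrate,
                                     report["estimated_kbps"])
```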
  • The first stage of the acquisition and sending sub-system is the reading from the cameras and the image pre-processing for the subsequent encoding and transmission.
  • The pre-processing step is composed of two phases: color conversion and view splitting; in turn, the view splitting phase is composed of two steps: image separation and geometric adaptation.
  • Figure 5 depicts this block.
  • The user separation step 51 aims to identify and segment each user within the frame, so that each user, and each perspective of each user, can be treated according to its destination.
  • The transformation that makes each image fit the requirements of its destination is the geometric adaptation 52 (or geometry correction) stage added at the end of the acquisition module.
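A compact sketch of these two steps under simplifying assumptions: user separation is reduced to cutting the frame into vertical strips, and geometric adaptation to a perspective warp with a toy homography (OpenCV and NumPy assumed available; the corner offsets are arbitrary).

```python
import numpy as np
import cv2

def separate_users(frame: np.ndarray, n: int = 2) -> list[np.ndarray]:
    """User separation: cut the captured frame into n per-user strips."""
    return list(np.array_split(frame, n, axis=1))

def adapt_geometry(view: np.ndarray, dst_size: tuple[int, int]) -> np.ndarray:
    """Geometric adaptation: warp one strip to the geometry expected by the
    destination screen."""
    h, w = view.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = np.float32([[0, 0], [dst_size[0], 20],
                      [dst_size[0], dst_size[1] - 20], [0, dst_size[1]]])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(view, H, dst_size)
```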
  • The receiver 22 with switchable scene composition is the second sub-system; it is placed in the destination machine.
  • This sub-system (in Gstreamer terms, the incoming pipeline) is composed of the following modules, represented in Figure 4: a receiver and decoder, i.e. a UDP-based receiver plus a decoder that matches the encoder used at the other end.
  • ViCo is the acronym used to designate the distributed process control framework. It is mainly designed for Video Conference applications with added functionalities, but it is useful for a wide variety of applications.
  • Its components are named "vicoGateway" (93) (for the front-end module), "vicoMgr" (90) (for the manager module) and "vico" (for the processing module).
  • A front-end module (vicoGateway) establishes the communication between the first location and the second location.
  • The vicoGateway deals with the communication between sites, embeds the SIP communication engine, and distributes the information and necessary commands to operate each one of the individual parallel modules of the system (the vicoMgrs). It also deals with the communication with the GUI (94) of the system in a manner that allows the GUI to run on a completely separate computer or device, communicating through a communication network.
  • vicoMgr: a manager module for distributing the signals and data coming from the front-end module and controlling a plurality of processing modules.
  • Each vicoMgr deals with the signals sent to and received from the processing pipelines.
  • For each parallel-running module in the system (controlled by a "vico"), it mediates and communicates with the main vicoGateway (the main module concentrating communication with other sites).
  • vico: a processing module, controlled by a manager module, that creates a flow containing video/audio data.
  • A vico, following the orders of a vicoMgr, launches each processing pipeline (processing sub-system) with the proper setup.
  • This set of control modules is inter-related through a hierarchical structure that connects them from the front-end control module (vicoGateway) to each processing module unit (vico), passing through each vicoMgr, which controls several vicos.
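The hierarchy can be pictured with the toy model below, which keeps the patent's module names but invents the message-passing details; the multicast/unicast distinction mirrors the runtime-control discussion later in this document.

```python
from dataclasses import dataclass, field

@dataclass
class Vico:
    """One processing pipeline."""
    name: str
    def handle(self, msg: str) -> None:
        print(f"{self.name} <- {msg}")

@dataclass
class VicoMgr:
    """Controls several vicos on one machine."""
    name: str
    vicos: list = field(default_factory=list)
    def dispatch(self, msg: str) -> None:
        for v in self.vicos:
            v.handle(msg)

@dataclass
class VicoGateway:
    """Single front end: talks to the GUI and to remote sites."""
    managers: list = field(default_factory=list)
    def broadcast(self, msg: str) -> None:
        for mgr in self.managers:          # "multi-cast" down the hierarchy
            mgr.dispatch(msg)
    def unicast(self, mgr_name: str, msg: str) -> None:
        next(m for m in self.managers if m.name == mgr_name).dispatch(msg)
```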
  • SIP (Session Initiation Protocol) control handles the signalling activities between sites.
  • SIP is a well-known standard protocol for session initiation.
  • The configuration data travels through the network enclosed in SIP messages.
  • The use of this protocol ensures interoperability with other commercial videoconference/communication systems.
  • Figure 7 shows how the streams are composed and sent to the remote site.
  • Each of the captured images is halved and sent through the network.
  • Each image-half corresponds to one of the visual perspectives to be sent.
  • The fragments are composed in order to fit the screen requirements and to produce the multi-perspective effect.
  • Every split-view screen combines different perspectives from different cameras in order to supply the users with a different image perspective depending on where they are sitting.
  • Multiple perspectives are transmitted to the screens as a single picture in which different tiles correspond to different perspectives.
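A minimal sketch of such tile packing, assuming all decoded perspectives share the same resolution (NumPy assumed; the grid shape is an arbitrary example):

```python
import numpy as np

def compose_tiles(perspectives: list[np.ndarray], cols: int) -> np.ndarray:
    """Pack decoded perspectives into one picture, row by row, so the
    split-view screen can address each viewer's zone as a separate tile."""
    rows = [np.hstack(perspectives[i:i + cols])
            for i in range(0, len(perspectives), cols)]
    return np.vstack(rows)

# e.g. four half-frame perspectives from two remote cameras -> one 2x2 picture
tiles = [np.zeros((360, 640, 3), np.uint8) for _ in range(4)]
picture = compose_tiles(tiles, cols=2)   # shape (720, 1280, 3)
```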
  • The coded data streams are packaged into UDP packets in order to avoid retransmission (and thus delay) in real-time video communication.
  • Data can be sent to the network with, with partial, or without QoS management.
  • There is a back channel which carries information about the network status and messages like Video Fast Updates.
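The sketch below shows one plausible shape for this transport: plain UDP datagrams with no retransmission, plus a single XOR parity packet per group as a stand-in for the Forward Error Correction mentioned above. The group size, packet layout and address are assumptions, not the patent's actual scheme.

```python
import socket
from functools import reduce

def xor_parity(packets: list[bytes]) -> bytes:
    """Parity over a group of equal-length packets: any single lost packet
    can be rebuilt by XOR-ing this parity with the surviving packets."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

def send_group_with_fec(payloads: list[bytes], addr: tuple) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for p in payloads:
        sock.sendto(p, addr)                  # media packets, never retransmitted
    sock.sendto(xor_parity(payloads), addr)   # one repair packet per group
    sock.close()

send_group_with_fec([b"\x00" * 1200] * 4, ("192.0.2.10", 5004))
```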
  • The invention is built according to a distributed parallel architecture.
  • A distributed system consists of multiple autonomous computers or processors that communicate through a private computer network or system.
  • This invention allows the Telepresence system to be distributed, since it has been designed to have several branches and to run, if needed, over independent machines.
  • This kind of architecture allows for a high degree of scalability, and since only one machine has the knowledge of the current architecture, it is possible to update the hardware and just replace the configuration of the vicoGateway.
  • Each of the machines that compose the system holds one of the processing branches.
  • PC-* are the names given to the different machines, and a vicoMgr 82 is executed on each of them as a service. The figure also shows the multiperspective acquisition and sender modules 83 and the multiperspective receiver modules 84.
  • This design allows a balanced distribution of the load, which means that it is possible to dedicate more advanced equipment to the more demanding parts of the algorithm (e.g. PC-A to PC-D, which hold the capture, coding and transmission), and lighter hardware (PC-E and PC-F) to just deal with the decoding and visualization routines, which require less computing power.
  • Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.
  • Parallelism can be exploited at several levels: the bit level, the instruction level, the data level and the task level; task parallelism is the form exploited by the processing framework.
  • The control system included in the invention is able to distribute, recover and process messages that are tightly involved with the functionalities of the whole invention. Functions such as starting/stopping the call or changing configurations at runtime are enabled by the communication protocol.
  • A user-initiated signal ends in, and starts from, a GUI.
  • A GUI can send several events, such as: call start, call stop, a change in the configuration, etc. These events have to be mapped to the site's working configuration (which can be different for each site); therefore, the only thing that needs to be configured in the GUI is the address/port of the appropriate vicoGateway in order to get connected to the rest of the system.
  • The communication chain usually starts at the GUI, and the messages are delivered to the correct device/equipment.
  • The vicoGateway 81 works as a server listening on a specific port, set by design, but it works as a client with respect to its interaction with the vicoMgrs 82 running in the distributed system.
  • VicoMgrs 82 run in the background on the machines that compose the distributed system. They have access to the final processing chain and control its performance. VicoMgrs listen permanently on one specific port, where they receive the messages from the vicoGateway.
  • The GUI acts as a client of the vicoGateway and writes to the specified socket a set of messages corresponding to the orders permitted to the users.
  • The server part of the vicoGateway evaluates what kind of message has arrived (the nature of the message, whether it is destined for one specific vicoMgr, whether it is just a configuration message, or whether it comes from another vicoGateway and what must be done with it).
  • Start call / End call: these messages contain information about call initiation and call ending.
  • The system responds at two levels: the first, controlled by an internal protocol, takes care of the communication flow between the vicoGateway and the corresponding vicoMgrs; at the second level, the vicoGateway connects with the vicoGateway of a remote site through the SIP protocol.
  • The messages are identified as vicoStart/vicoEnd, simply as an agreed nomenclature.
  • These messages contain a label that activates the function and a path to the source of the proper pipeline to be launched, since the pipelines are preconfigured to send the stream to a specific destination.
  • The vicoGateway sends a Hello message to all the vicoMgrs that are in the site's configuration file and expects a response from them, in order to confirm that they are up and available.
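A hedged sketch of this liveness probe over plain TCP; the message text, transport and reply format are illustrative, as the patent does not specify the wire protocol.

```python
import socket

def hello_check(managers: list, timeout: float = 1.0) -> dict:
    """Gateway-side probe: send 'Hello' to every configured vicoMgr
    (host, port) pair and record which ones answer."""
    alive = {}
    for host, port in managers:
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(b"Hello")
                alive[(host, port)] = bool(s.recv(16))
        except OSError:
            alive[(host, port)] = False
    return alive

print(hello_check([("10.0.0.2", 7000), ("10.0.0.3", 7000)]))
```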
  • The internal workflow of the vicoGateway depends on the type of message it is receiving or producing. It searches the list of sites or the list of flows (vicoMgrs) when it gets a configuration or start/end call message.
  • Configuration: the messages of this kind are defined top-down, from the GUI to the vicoGateway and later to the vicoMgr.
  • The first bytes of the structure contain the data type and the size of the information that follows.
  • The source is one vicoGateway, which transmits the proper data to its vicoMgrs and, in some embodiments of the invention, sends a configuration message to a remote vicoGateway, which in turn performs the proper operations at its site.
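The type-plus-size framing described above can be sketched as follows; the one-byte type tag and four-byte length are an assumed layout, since the patent only states that the first bytes carry the data type and the size.

```python
import struct

HEADER = "!BI"  # network byte order: 1-byte type tag, 4-byte payload length

def pack_message(msg_type: int, payload: bytes) -> bytes:
    return struct.pack(HEADER, msg_type, len(payload)) + payload

def unpack_message(data: bytes) -> tuple:
    msg_type, size = struct.unpack_from(HEADER, data)
    offset = struct.calcsize(HEADER)
    return msg_type, data[offset:offset + size]

wire = pack_message(1, b"bitrate=2000")       # e.g. a configuration message
assert unpack_message(wire) == (1, b"bitrate=2000")
```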
  • Runtime automatic control and communication control: these messages concern non-user-operated control signals between sites. They travel from vicoGateway to vicoGateway by means of SIP messages. VicoGateways can originate the signalling themselves, although for signalling related to the video communication modules the message path extends down to the vicoMgr and vico (processing) levels as well. Depending on the message type, a message can be distributed from a vicoGateway in a "multi-cast" manner down to each vicoMgr (for further distribution down to the vicos and processing pipelines), or in a selective unicast manner for specific changes in each vicoMgr.
  • The final result of this protocol is a success or error message (vicoOK/vicoError) travelling back to the GUI application, with information about which flow produced the error and the nature of the error (flow not available, nonexistent, etc.).
  • A vico is created as a process by each vicoMgr, and each vicoMgr can build one or more vicos.
  • A vicoMgr processes an incoming message and builds the proper vico taking into account the parameters received along with the message structure. Therefore, if a message cannot be handled by the vicoMgr, an error response is sent back to the vicoGateway, and to the GUI as well.
  • A flow is each one of the processing chains that generate a video/audio data stream.
  • The data stream is built and produced by a Gstreamer pipeline, but it could be produced by any properly assembled set of libraries.
  • This Gstreamer pipeline is created inside each vico, and there are configuration messages that are delivered directly to the pipeline. If any error occurs in the delivery or at runtime, a vicoError message is generated with the proper information and sent back.
  • The messages are passed to the pipeline through a dedicated interface, specially designed for this purpose.
  • The system can be easily extended to capture additional perspectives per camera (by using an appropriate camera with enough resolution and an appropriate lens). For instance, in an embodiment of this invention, one can split each captured view into 3 segments and send them to 3 perspective screens for rendering, allowing additional viewers per site.
  • Each sending module knows the destination of the flow it generates.
  • The destination of the flows is set by the control subsystem (embedded in the user interface module), which builds the table of routes, since it has the knowledge of the distribution and addresses of the sites.
  • Two receiving and composing modules concentrate the traffic coming from the remote sites and decode the images. These modules also deal with the stream composition and with the generation of the signal used to synchronize the images with the goggles (special glasses). Synchronization is necessary if time multiplexing of perspectives or views is performed; if light polarization is used for multiplexing the two views, it is not necessary.
  • The streams that are sent to the displays are composed, sequentially, of one left view and one right view (understanding left and right as different perspectives). This way, the display presents the different perspectives at different times and the receiver is able to separate them.
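Frame-sequential composition of two perspectives can be sketched as a simple interleaving; real systems key this to the display's refresh and the glasses' shutter timing, which is elided here.

```python
from itertools import chain

def time_multiplex(left_frames: list, right_frames: list) -> list:
    """Interleave two perspectives frame by frame: the display shows them
    alternately, and the synchronised active glasses pass each viewer only
    the frames that belong to their seat."""
    return list(chain.from_iterable(zip(left_frames, right_frames)))

# [L0, R0, L1, R1, ...] ready to be pushed to the display queue
sequence = time_multiplex(["L0", "L1"], ["R0", "R1"])
```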
  • This module participates in the negotiation of the flow. In other words, if the characteristics of one site do not match the default configuration (a simpler display, etc.), this module can set up a different configuration.
  • The audio module takes care of sending, receiving, encoding and decoding the audio data streams.
  • This audio module can be supported by any set of libraries that allows performing this kind of operation, as well as echo cancelling.
  • The audio equipment is chosen taking into account the different configurations of the rooms, since different room sizes and wall materials may be encountered. Specifically, the audio channels have to be sent multiplexed with the video packets in order to keep audio-video synchronization, and therefore have to be demultiplexed at the destination.
  • The user interface provides an entry point for the orders coming from the users, such as initiating and ending a call or configuring some parameters of the communication.
  • The control sub-system conveys the orders and signals that manage the behaviour of the pipeline.
  • The control sub-system is the core of this module and its most important part. It takes care of distributing and interpreting the signals throughout the system. Basically, this module is composed of a network of small client/server applications that communicate with each other.
  • The screen-goggles synchronizer is a hardware device that works as a transducer between the electrical signal generated by the receiving module and a signal that the glasses can understand, in order to synchronize the display with the operation of the active glasses. This ensures that each viewer receives only the correct view at each instant under temporal multiplexing of views.
  • The protocol behind this device can be rewritten in order to fit the technology of the goggles. This synchronizer is not needed if plain polarization multiplexing is used.
  • This embodiment is implemented on the Gstreamer Multimedia Framework, which allows great flexibility in terms of reusability and speed of development. All the previously mentioned modules can be translated, almost directly, into Gstreamer plugins.
  • The system can be extended to extra users, but the configuration of the capture and visualization parts must be updated in order to keep the viewing points coherent with the added cameras and screens.
  • The cameras require a full view of the conferees, which means that the focal length, directionality and zoom of the cameras must allow both users (and more) to fit clearly within the field of view of each camera.
  • The screens have to be arranged so as to offer a correct view of the remote users (their position must correspond to the perspective that is being shown).

Abstract

The invention refers to a method and system for processing, transmitting and receiving images of a first set of users in a first location in communication with a second set of users in a second location. The method comprises the steps of: - recording the first set of users with a plurality of cameras distributed according to the positions of the second set of users in the second location, each camera recording an image of the first set of users; - separating the part of the image belonging to each user of the first set of users from the parts belonging to other users recorded by the same camera, thereby obtaining, for each user and for each camera, an image of the user; - encoding multiple video signals, each containing the images of each user, and sending the encoded video signals to a receiving module in the second location; - sending control signals from a control module for managing the communication between the location of the second set of users and the location of the first set of users; - decoding the encoded video signals received by the receiving module; - composing one multiperspective image of the first set of users containing all the images of each user; - showing the multiperspective image of the first set of users to the second set of users by means of a display.

Description

SYSTEM AND METHOD FOR MULTIPERSPECTIVE TELEPRESENCE COMMUNICATION
Technical Field of the Invention
The present invention relates to communication and collaboration tools and, more specifically, to telepresence systems for enhancing eye contact in a communication process, mainly videoconferencing.
Background of the Invention
Nowadays, in virtual collaboration and task sharing within global businesses, teams are geographically dispersed, yet their tasks require collaboration and they share responsibility in working towards a common outcome. These teams are thus critically dependent on the efficacy of mediated communication and collaboration tools. There are already technologies that support virtual collaboration, including computer-mediated communication (CMC), audio conferencing (AC), and video conferencing (VC). State of the art Telepresence systems are typically at the high end of this spectrum, providing audio-visual capture, transmission, and display solutions that attempt to support an illusion that the remote participants are actually face-to-face, sharing the same physical space.
Telepresence systems have come a long way. After a number of technologies, such as broadband internet, high-quality HD low-delay video compression, or web applications, became mature enough, several products were able to break into the market, establishing a solid step forward towards practical Telepresence solutions. Among them are large-format videoconferencing systems from major providers such as Cisco Telepresence, HP Halo, Polycom or Teliris, and telecollaboration suites from leading software companies. However, current systems still suffer from fundamental imperfections that are known to be detrimental to the communication process. When communicating, naturalness, feeling of presence, eye contact and gaze cues are essential elements of visual communication, important for signalling attention and managing conversational flow. They are crucial for establishing rapport and trust in a relationship. Eye contact and awareness of gaze direction influence several social psychological measures such as trust, persuasion, group performance, and perceived dominance. Eye contact and gaze awareness serve a multitude of functions in social interactions. For example, people increase their gaze when attempting to be persuasive or deceptive and to exert dominance. People are judged to be more attentive, more competent, and more credible with increased gaze. Furthermore, mutual eye contact facilitates the task of understanding other persons by speeding up the availability of relevant material from semantic memory. Nevertheless, lack of eye contact is a common flaw in current Telepresence and videoconferencing systems.
The horizontal eye contact problem arises when multiple video conference participants are located at the same site and watch the same display. In this case, all of the participants receive the same video sequence instead of the appropriate remote-party perspective as it would be seen from their viewing angle (if this were a face-to-face meeting). This, in turn, leads either to a false perception of being looked at by all of the participants at that site or, worse, to no one at that site receiving eye contact.
The horizontal eye contact problem in video conferencing appears when multiple participants share the same system at a site. In such common cases, all of these participants watch the same video sequence on the display and get the same feeling of eye contact. This problem is one of the main reasons for the artificial atmosphere in a video conferencing meeting where, for example, the speaker needs to name the remote participant before addressing him since gaze direction is not preserved.
Current large-format videoconferencing systems are basically multi-screen, high-definition videoconference systems that allow for natural body-sized, multiple-person videoconferencing sessions. These systems are welcome in the market due to their high quality and the increase in videoconferencing realism. They have also been the very first generation of usable and high-quality room-based videoconferencing systems. Some examples are 2D Multi-screen Telepresence systems (Cisco Telepresence, HP Halo Telepresence, Polycom Telepresence, Teliris Telepresence): these are basically systems based on several screens and at least as many 2D cameras as screens, where regular 2D video is sent to each remote screen from its corresponding local camera in use. This allows multi-screen panoramic videoconferencing for several users, as well as a real-sized image of the end-point room, giving an increased feeling of presence compared to regular TV-set and/or desktop versions. However, these solutions have the general problem of bad gesture direction conservation as well as the issue of horizontal eye contact. When multiple participants share the same system at a site, all of these participants watch the same video sequence on the display and get the same feeling of eye contact. This problem is one of the main reasons for the artificial atmosphere of a video conferencing meeting where, for example, the speaker needs to name a remote participant before addressing him, since gaze direction is not preserved.
Teliris 3D Telepresence: this solution delivers 3D video in high definition using goggles. The system is based on Teliris's traditional solution, which corresponds to a regular 2D Multi-screen Telepresence with the ability to switch between 2D and 3D meeting modes with goggles. Hence, besides the feeling of real-sized presence, some stereoscopic 3D depth is available to the user. However, it does not solve the main issues of eye contact in videoconferencing, and the fact that the participants have to wear glasses in order to perceive the 3D video only creates another barrier in the eye contact line. Furthermore, all participants see the very same 3D (stereo) image, which is unable to transmit the different perspectives depending on where a user is sitting.
Musion-like systems: in November 2007, Cisco demonstrated a sort of holographic-like video conferencing session, which was dubbed "The Cisco Telepresence On-Stage Experience". It was the result of cooperation between Cisco and Musion Systems, a provider of 3D holographic-like display technologies. The system technology is based on the projection of images on thin transparent foils, which gives a feeling of immersive 3D content on a stage even though the projection is in 2D. In order to achieve the 3D feeling, multi-view capture and 360-degree images are often used. In spite of everything, this system is not commercially available due to the high costs and dedicated setup involved in using it. Besides, the system is not able to supply multiple perspectives to different users located at different positions in a room, precluding proper eye contact.
Teleportel / Sony 3D Telepresence: the vertical eye contact issue is one of the major problems mentioned when discussing video conferencing applications. This problem stems from the fact that, in a session, the conferee looks at the monitor displaying the remote participants and not at the camera, which is usually located on top of the display. A common approach is to use a slanted semi-reflective mirror ahead of the display, which can capture the participant's image from the front while still allowing the participant to see the video sequence on the mirror. This method, although efficient, is not perceived well by users, since the extra piece of glass in the setup seems awkward. Nevertheless, there are several examples of commercial solutions which use a similar optical approach. Moreover, vertical eye contact is a minor issue compared to the problems generated by the lack of horizontal eye contact, and this solution is unable to provide corrected 3D-aware eye contact for multiple users.
Other solutions for 2-speaker to 2-speaker communication with eye contact enabled have been proposed, based on the use of a single retro-projected lenticular screen, in order to provide 2 spatially multiplexed perspectives to the users at each site. These approaches allow a very simplified multi-perspective system for 2+2 users, where the input of a camera goes directly as the feed of one of the perspectives of the remote screens. There are also some improvements over these systems that add the use of multiple cameras to capture the different perspectives at a site to transmit for a given remote user. This allows a refined framing of the view for each one of the perspectives in exchange for adding many more cameras. One screen is used to represent each viewing position, allowing an increase in the number of users and sites. The main problem is that these systems are impractical for scaling up the number of sites, as they just connect camera and screen perspectives point to point; and, in order to have a full body-size representation of the conferees, one requires screens that are too big in order to accommodate 2 individuals in them. Furthermore, the use of a single retro-projected lenticular screen to provide 2 spatially multiplexed perspectives to the users at each site requires a significant amount of space. Still, the system proposed does not have a proper architecture for scaling up to connect more sites. Adding the use of multiple cameras to capture the different perspectives at a site to transmit for a given remote user significantly increases the hardware complexity; it adds cost without appreciable benefit.
In general, current solutions have trouble delivering the spatiality feeling of a real 3D scene. Basically, current systems have important issues as regards multiuser eye contact, transmission of body language and proper conservation of directional gestures.
In order to solve the problems of state-of-the-art systems, it is necessary to have multi-perspective systems with split-view screens able to handle multiple perspectives and multiple cameras in an efficient manner, so as to transmit the proper visual experience.
Description of the Invention
This invention presents a full system design for horizontal eye-contact enabled video communications for multiple users. The invention allows capturing from multiple cameras, processing the multi-view information, and transmitting and generating the necessary information to feed multi-perspective split-view screens, allowing simultaneous multi-user gaze direction awareness and better gesture perspective preservation. By means of split-view screens, different users can have different images of the remote location depending on where they are located in the room. The invention therefore refers to a method for processing, transmitting and receiving images of a first set of users in a first location in communication with a second set of users in a second location. Said method comprises the steps of: recording the first set of users with a plurality of cameras distributed according to the positions of the second set of users in the second location, each camera recording an image of the first set of users; separating the part of the image belonging to each user of the first set of users from the parts belonging to other users recorded by the same camera, thereby obtaining, for each user and for each camera, an image of the user; encoding multiple video signals, each containing the images of each user, and sending the encoded video signals to a receiving module in the second location; sending control signals from a control module for managing the communication between the location of the second set of users and the location of the first set of users; decoding the encoded video signals received by the receiving module; composing one multiperspective image of the first set of users containing all the images of each user; and showing the multiperspective image of the first set of users to the second set of users by means of a display.
The invention also comprises, in the preferred embodiment, sending audio from the first location, perceptually synchronously with the video signal, to the second location.
Additionally, the step of sending control signals comprises transmitting orders from a front-end module to a plurality of processing modules for creating a flow containing video and audio data, with the distribution of the orders managed by a plurality of manager modules connected both to the front-end module and to the plurality of processing modules. In some embodiments, the modules are scattered across a set of connected distributed computing units. Optionally, the invention may comprise analyzing the available bandwidth of the communication channel at runtime in order to adjust the transmission rate between the first and the second location. It may also comprise, in the step of sending the video signal to the receiving module, using forward error correction and encapsulating the video signal in UDP packets to avoid retransmission.
Another aspect of the invention refers to a system for processing, transmitting and receiving images of a first set of users in a first location in communication with a second set of users in a second location. The system comprises: a plurality of cameras, for recording the first set of users, recording each camera an image of the first set of users; a sub-module for separating the images belonging to each user from the images of the other users recorded by the same camera; an encoding module for encoding the signal coming from the sub-module, containing the images of each user, and sending the encoded video signal to a receiving module in the second location; a control module for managing a communication between the first location and the second location, sending control signals; a receiving module in the second location for decoding the encoded signal and composing one multiperspective image of the first set of users containing all the images of each user; at least one display for showing the multiperspective image of the first set of users to the second set of users.
Additionally, the system may also comprise: a screen glasses synchronizer, connected to the receiving module, which transduces a signal generated by the receiving module into a signal that active glasses can interpret in order to synchronize the display with the operation of the active glasses; and active glasses for each user, in communication with the screen glasses synchronizer, for separating from the multiperspective image the perspective that best fits the position of that user in the second location.
Also, a computer program is provided comprising program code means adapted to perform the steps of the method according to any of claims 1 to 6 when said program is run on a general-purpose processor, a digital signal processor, an FPGA, an ASIC, a microprocessor, a microcontroller, or any other form of programmable hardware. This invention specifies the system necessary to capture, process, transmit and receive the multi-perspective visual information. Furthermore, the invention allows for a distributed, scalable architecture where the multi-user, multi-screen, multi-perspective system is powered by distributed software on a small modular multicomputer grid.
Description of the Drawings
To complement the description being made, and to aid a better understanding of the features of the invention according to a preferred practical embodiment thereof, a set of drawings is attached as an integral part of this description, in which the following has been depicted with an illustrative and non-limiting character:
Figure 1 illustrates a common situation of a multi-perspective experience for 2 users located in different positions in a room.
Figure 2 shows a block diagram of the system of the invention in one embodiment.
Figure 3 shows a block diagram of the acquisition, sender and transmission of each perspective to the destination machines.
Figure 4 shows a block diagram of reception, decoding and composing for remote multiperspective displays.
Figure 5 shows a block diagram involving the step of the invention of view splitting.
Figure 6 shows the camera capture distribution, and the splitting and recomposing of multiperspective images.
Figure 7 shows a schema representing the communication of multiple 3D perspectives between two cameras and two displays, corresponding to an exemplary 2-to-2-people communication between 2 different sites.
Figure 8 shows a distributed architecture of the control system of the invention.
Detailed Description of the Invention
This invention provides a system for processing, transmitting and receiving the visual information within an eye-contact-enabled, multi-perspective Telepresence system. Such a multi-perspective system can send a different image to each user depending on the user's location in the room; therefore, each user receives the image that best fits his or her position in the room.
The Telepresence system defined in this invention aims to deliver a Telepresence experience in which participants in separate remote sites feel as if they were in a "face-to-face" meeting. This document describes the design of the system as a whole and all the modules it contains, each of them providing one or several features, as well as the intercommunication between the modules.
Figure 1 shows the effect that users in two different positions will perceive of one remote participant, thanks to the multi-perspective screen 1.
The system of the invention has been designed as a multi-process modular system that can be implemented on a single capable machine as well as on a multicomputer grid, depending on the needs.
Figure 2 shows an exemplary embodiment of the invention: a modular architecture of a distributable multi-perspective Telepresence system end point for 24 simultaneous perspectives (2 per camera) in a 3-site configuration. It allows capturing from four cameras in order to generate eight different perspectives of the two local users; two perspectives of each local user are sent to each remote site (four to each remote site). A processing module is in charge of generating two perspectives from each camera. The system includes two receiving units, each in charge of feeding two screens by combining four different perspectives. The system also uses a distributed Graphical User Interface (GUI), able to operate the multi-perspective system from connected devices.
The provided multi-perspective Telepresence system is able to:
• Synchronously grab frames from, in one embodiment of the invention, Stingray FireWire cameras (Allied Vision Technologies) that support the IEEE 1394b connection standard.
• Convert the image formats between the camera's outgoing flow and the one required for the processing pipeline.
• Process the incoming data from cameras to generate the multiple perspectives for the remote users.
• Encode/decode and transmit/receive the multi-perspective video.
• Deal with the network fluctuations by changing properties in the encoder.
• Adapt the received flows to the visualization requirements.
• Capture, send, receive and reproduce audio that is synchronized with the video streams, without any echoes.
• Provide the conferees with control over the above functionalities, to be able to: start/stop a call, enable/disable features, among others.
The distributed nature of the design of the software architecture allows flexibility regarding the equipment selection. The system is designed to achieve distributed and parallel computation capabilities.
Several functional modules carrying the functionalities described above can be identified in Figure 2. Nevertheless, each module has a specific design in terms of the technology involved in its creation and the interfaces that make it fit with the other functional boxes. The whole architecture could be split into fragments, and each of these fragments would still have its meaning, as they are independent functional entities. One contribution of this invention lies in its design, using operationally parallel modules, which allows for building a highly distributed architecture and for maximizing the operation of the modules in parallel mode.
Each of the modules can be translated, almost directly, to a plugin; in one embodiment a Gstreamer plugin is used and the whole is assembled in the Gstreamer framework, which confers flexibility, scalability and independent usability of the software pieces. However, there is a key control layer that is not Gstreamer-based and that allows for the particular flexibility and capacity to distribute execution of this invention. Each of the functional parts of the system involved in the streaming task is explained below:
• Multi-perspective Acquisition/Processing Module and Encoding/Decoding and Media FrameWork
This module includes all the routines aimed at adapting the color space of the visual data coming from the cameras to the color space required by the processing and transmission. As the cameras produce images in a raw format, a conversion process is required to match the requirements of the other modules. The module also computes the different perspectives that can be generated from a single camera to be sent to their destinations. The video streams are encoded and sent through the network by this module; symmetrically, the other side receives and decodes the incoming traffic. Video communication includes protection algorithms against network failures, such as Forward Error Correction and Automatic Bandwidth Adaptation. This module has been designed to include all the required interfaces to allow the proper configuration and setup (before starting the streaming and even at runtime).
• Audio
The audio module basically consists of a network-robust audio encoder and decoder that transmits the audio synchronously with the video. This module is where the IIS Fraunhofer Institute's AAC-LD ACE library resides. The mentioned library takes care not just of the sending and receiving of audio flows, but also of all echo cancelling activities: it performs a very robust encoding/decoding operation along with an echo cancellation procedure. This library has been embedded in a plugin that works with raw audio at a configurable rate; it receives raw audio data from, and sends raw audio data to, the audio card.
• View Renderer
The View Renderer module includes all the routines aimed at visualizing the stream and at making the images match the properties of the monitor (virtual background substitution and other operations required at rendering level). Strongly graphics-related programming languages have been exploited to fulfil the real-time and low-latency requirements. The previous modules are assembled together to build functional sub-systems. Basically, there are two sub-systems that play an important role in the apparatus. The Acquisition and Sender Module 25 is the first one; it takes care of reading from the cameras and of the sending/encoding activities. In an embodiment proposed for this sub-system (in Gstreamer terms: the outgoing pipeline), it is composed of the following modules, represented in Figure 3 (an illustrative code sketch follows the list):
• Camera capture / color conversion 31: reads from the camera and converts the stream to a manageable color space.
• Multi-Perspective View Splitter 32: this sub-module splits the image in order to separate the users of one site into at least two perspectives that will be transmitted to a remote terminal.
• SVC (Scalable Video Coding) Encoder with BwE + FastUp 33: a Scalable Video Coding based encoder with Bandwidth Estimation (analysis of the available bandwidth at runtime in order to adjust the transmission rate) plus Fast Update (resending intra-frames if required).
• FEC + Sender 34: Forward Error Correction, which protects the stream against network errors, plus the sender itself, which is UDP based.
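By way of illustration only, the following is a minimal sketch of how such an outgoing pipeline could be assembled with the GStreamer Python bindings. The concrete elements are assumptions: x264enc stands in for the SVC-based encoder, rtpulpfecenc for the FEC stage, and the host address and bitrates are hypothetical; the patent's actual plugins are not published here.

```python
# Minimal sketch of an outgoing pipeline with the GStreamer Python
# bindings. Element choices are illustrative stand-ins, not the
# modules named in the patent.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# capture -> color conversion -> encode -> RTP payload -> FEC -> UDP sender
pipeline = Gst.parse_launch(
    "v4l2src ! videoconvert ! "
    "x264enc bitrate=2000 tune=zerolatency name=enc ! "
    "rtph264pay ! rtpulpfecenc percentage=10 ! "
    "udpsink host=203.0.113.10 port=5000"
)
pipeline.set_state(Gst.State.PLAYING)

# Runtime rate adaptation: a Bandwidth Estimation message from the
# remote site could lower the encoder bitrate without stopping the
# stream, simply by changing a property on the encoder element.
enc = pipeline.get_by_name("enc")
enc.set_property("bitrate", 1200)  # kbit/s
```

This also illustrates why a plugin-based framework suits the design: swapping the encoder or the FEC scheme is a change in the pipeline description, not in the surrounding control code.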
The first stage of the acquisition and sending sub-system is the reading from the cameras and the image pre-processing for the subsequent encoding and transmission. The pre-processing step is composed of two phases: color conversion and view splitting; in turn, the view splitting phase is composed of two steps: image separation and geometric adaptation. Figure 5 depicts this block.
The user separation step 51 aims to identify and segment each user from the frame, so that each user, and each perspective of each user, can be treated according to its destination. The transformation that makes each image fit the requirements of the destination is the geometric adaptation 52 (or geometry correction) part, added at the end of the acquisition module.
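As a hedged illustration of these two steps (not the patent's actual algorithm), the sketch below crops one talking position from a captured frame and then warps it with a homography so it matches the geometry expected at the destination; the bounding box and corner coordinates are hypothetical.

```python
# Illustrative sketch: user separation as a fixed crop per talking
# position (assumed known from the room layout), followed by geometric
# adaptation via a perspective warp.
import cv2
import numpy as np

def split_and_adapt(frame, user_box, dst_size=(960, 1080)):
    x, y, w, h = user_box
    user_img = frame[y:y + h, x:x + w]          # user separation (crop)

    # Geometric adaptation: warp the crop so it matches the geometry
    # expected by the destination screen (corner offsets are invented).
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = np.float32([[0, 40], [dst_size[0], 0],
                      [dst_size[0], dst_size[1]], [0, dst_size[1] - 40]])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(user_img, H, dst_size)
```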
The receiver 22 with switchable scene composition is the second sub-system; it is placed in the destination machine. This sub-system (in Gstreamer terms: the incoming pipeline) is composed of the following modules, represented in Figure 4:
• Receiver and decoder: a UDP-based receiver plus a decoder that matches the encoder used at the other end.
• Color space adaptation: as in the previous case, a module for color space conversion is required; the video flow is returned in YUV after the decoder.
• Frame composer / background substitution plus rendering visualization: this block composes the final video pictures in order to fit the multiple-view requirements. In one embodiment of the invention, it also allows substituting the scene background or adding a visual component to the picture. The present embodiment relies on the fact that the entire scene is fully captured at once by each camera and that each participant position is segmented and treated separately. Figure 6 displays this method: cameras 1 to 4 capture the same scene but from different perspectives; therefore, once the users are separated, the remote conferees will receive the proper point of view in their respective monitors.
Communication and control modules
Since the system is highly distributed, it requires robust signalling operations between each system branch, in terms of data transmission and flow control. A core application has been designed and built to fulfil these requirements. In an embodiment of this invention, this application takes advantage of the Gstreamer interface to take control of the processing pipelines.
Three sub-modules, also called ViCo framework modules (ViCo names a distributed process control framework; it is mainly designed for video conference applications with added functionalities, but it is useful for a wide variety of applications), are found inside this communication sub-system: a Front-End module, a Manager module and a Processing module. From now on, these modules will be referred to as "vicoGateway" (93) (the front-end module), "vicoMgr" (90) (the manager module) and "vico" (the processing module). These three modules maintain control over the different processing modules of this invention.
• A front-end module (vicoGateway) for establishing the communication between the first location and the second location. The vicoGateway deals with the communication between sites, embedding the SIP communication engine, and distributes the information and the necessary commands to operate each one of the individual parallel modules of the system (through the vicoMgrs). It also deals with the communication with the GUI (94) of the system in a manner that allows the GUI to run on a completely separate computer or device, communicating through a communication network.
• A manager module (vicoMgr) for distributing the signals and data coming from the front-end module and controlling a plurality of processing modules. Each vicoMgr deals with the signals sent to and received from the processing pipelines. For each parallel running module in the system (controlled by a vico), it intermediates and communicates with the main vicoGateway (the main module concentrating communication with other sites).
• A processing module (vico), controlled by a manager module, that creates a flow containing video/audio data. A vico, following the orders of a vicoMgr, launches each processing pipeline (processing sub-system) with the proper setup.
This set of control modules is inter-related by a hierarchical structure that connects them from the front-end control module (vicoGateway) to each processing module unit (vico), passing through each vicoMgr, which controls several vicos.
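A minimal sketch of this fan-out hierarchy, assuming plain TCP transport and a JSON message body (the actual ViCo wire protocol is not disclosed in this document, and the hostnames and ports are hypothetical), could look as follows:

```python
# Hedged sketch of the vicoGateway -> vicoMgr fan-out as plain TCP
# message relays; each vicoMgr would in turn spawn vico processes.
import json
import socket

class VicoGateway:
    """Front-end: receives GUI orders and fans them out to every vicoMgr."""
    def __init__(self, managers):
        self.managers = managers            # [(host, port), ...]

    def dispatch(self, order):
        payload = json.dumps(order).encode()
        for host, port in self.managers:
            with socket.create_connection((host, port)) as s:
                s.sendall(payload)          # one vicoMgr per machine

gw = VicoGateway([("pc-a.local", 7000), ("pc-b.local", 7000)])
gw.dispatch({"type": "vicoStart", "pipeline": "/etc/vico/outgoing.cfg"})
```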
Within the main vicoGateway, the SIP (Session Initiation Protocol) control handles the signalling activities between sites. SIP is a well-known, standard protocol for session initiation. The configuration data travels through the network enclosed in a SIP message. The use of this protocol ensures interoperability with other commercial videoconference/communication systems.
In Figure 2 "Cloud Desktop" graphical user interface (GUI), is a web-based application giving visual and tactile control over the Telepresence system for the above mentioned operations. It allows to configure, setup Telepresence parameters, as well as to establish calls. This interface also allows a collaborative interaction between the conferees placed in remote sites. The Desktop allows sharing content and some objects in a touch-based manner while holding content, data and communication through a server in the "cloud".
Regarding the inter-site communications at data level, Figure 7 shows how the streams are composed and sent to the remote site. Each of the captured images is halved and sent through the network. In an embodiment of this invention, each image half (a horizontal half) corresponds to one of the visual perspectives to be sent. At the other end, the fragments are composed in order to fit the screen requirements and to produce the multi-perspective effect. Every split-view screen combines different perspectives from different cameras in order to supply the users with a different image perspective depending on where they are sitting. In an embodiment of multi-perspective Telepresence, multiple perspectives are transmitted to screens as a single picture where different tiles correspond to different perspectives.
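The following sketch, with illustrative array shapes, shows the halving of a captured frame into two perspective tiles and the recomposition of received tiles into the single picture fed to a split-view screen:

```python
# Sketch of the halving/recomposition described above. Shapes and the
# vertical split direction are illustrative assumptions.
import numpy as np

def halve(frame):
    """Split a captured frame into its two perspective halves."""
    h = frame.shape[0] // 2
    return frame[:h], frame[h:]

def compose(tile_top, tile_bottom):
    """Recombine two received perspective tiles into one screen picture."""
    return np.vstack([tile_top, tile_bottom])

cam = np.zeros((1080, 1920, 3), dtype=np.uint8)   # one camera frame
top, bottom = halve(cam)
screen = compose(top, bottom)                      # 1080x1920 again
```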
In an embodiment of the invention, coded data streams are packaged into UDP packets in order to avoid retransmission (and thus delay in real-time video communication). Data can be sent to the network with, with partial, or without QoS management. There is a back channel which carries information about the network status and messages such as Video Fast Updates.
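As a toy illustration of why forward error correction lets the receiver recover losses without retransmission, the sketch below sends a group of UDP packets followed by one XOR parity packet; a production system would use a standardized FEC scheme rather than this simple parity, and the destination address is hypothetical.

```python
# Toy XOR-parity FEC over UDP: one parity packet per packet group.
import socket

def send_with_fec(packets, addr):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    size = max(len(p) for p in packets)
    parity = bytearray(size)
    for p in packets:
        padded = p.ljust(size, b"\x00")     # pad to a common length
        for i, b in enumerate(padded):
            parity[i] ^= b                  # accumulate XOR parity
        sock.sendto(p, addr)
    sock.sendto(bytes(parity), addr)        # the redundancy packet
    # If exactly one data packet of the group is lost, the receiver can
    # rebuild it by XOR-ing the parity packet with the surviving ones.

send_with_fec([b"frame-slice-1", b"frame-slice-2"], ("203.0.113.10", 5000))
```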
The invention is built according to a distributed parallel architecture. A distributed system consists of multiple autonomous computers or processors that communicate through a private computer network or system. Hence, this invention allows the Telepresence system to be distributed, since it has been designed to have several branches and to run, if needed, over independent machines. This kind of architecture allows for a high degree of scalability and, since only one machine has the knowledge of the current architecture, it is possible to update the hardware and just replace the configuration of the vicoGateway.
In an embodiment of this invention, each of the machines that compose the system holds one of the processing branches. As shown in Figure 8, PC-* are the names given to the different machines, and a vicoMgr 82 is executed on each of them as a service. The multiperspective acquisition and sender modules 83 and the multiperspective receiver modules 84 are also shown. This design allows a balanced distribution of the load, which means that it is possible to have more advanced equipment for the more demanding parts of the algorithm (e.g. PC-A to PC-D, which hold the capture, coding and transmission), and lighter hardware (PC-E and PC-F) to just deal with the decoding and visualization routines, which require less computing power.
Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. In an embodiment of this invention, one can talk about task parallelism, since Gstreamer (the processing framework) is designed to be multithreaded and most of the software implementations are multi-process.
The invention includes a control system which is able to distribute, recover and process messages that are tightly involved with the functionalities of the whole invention. Functions such as starting/stopping the call or changing configurations at runtime are enabled by the communication protocol.
Basically, a user-initiated signal ends in, and starts from, a GUI. A GUI can send several events, such as: call start, call stop, a change in the configuration, etc. These events have to be mapped to the site's working configuration (which can be different for each site); therefore, the only thing that needs to be configured in the GUI is the address/port of the appropriate vicoGateway in order to get connected to the rest of the system.
The communication chain starts, usually, at the GUI, and the messages are delivered to the correct device/equipment.
All the messaging operations between the three components designed as part of the control sub-system are done through TCP sockets. The vicoGateway, as the core of the communication chain, deals with the messages coming from the GUI and with the requests coming from the processing chain.
The messages get into the vicoGateway 81. The vicoGateway 81 works as a server listening at a specific port, set by design, but, on the other hand, it works as a client with respect to its interaction with the vicoMgrs 82 running in the distributed system. The vicoMgrs 82 run in the background on the machines that compose the distributed system. They have access to the final processing chain and control its performance. The vicoMgrs listen permanently on one specific port where they receive the messages from the vicoGateway. The GUI acts as a client of the vicoGateway and writes to the specified socket a set of messages corresponding to the orders permitted to the users. The server part of the vicoGateway evaluates what kind of message has arrived (the nature of the message, whether it is destined to one specific vicoMgr, whether it is just a configuration message, or whether the message comes from another vicoGateway and what has to be done with it).
There are three types of messages, easily identifiable and classifiable. The basis for classifying these messages is the route they follow to achieve their goal and the functionality that each message activates:
1. Start call / End call: these messages contain information about the call initiation and call ending. The system responds at two levels: the first one, controlled by an internal protocol, takes care of the communication flow between the vicoGateway and the corresponding vicoMgrs; the second level is the one where the vicoGateway connects with the vicoGateway of a remote site through the SIP protocol. In the first case, the messages are identified as vicoStart / vicoEnd, following an agreed naming scheme.
The main difference with respect to the other types of messages is their content. Basically, these messages contain a label that activates the function and a path to the source of the proper pipeline to be launched, since the pipelines are preconfigured to send the stream to a specific destination.
Another important point to consider is that the first information that goes through the control system is the identification message. The vicoGateway sends a Hello message to all the vicoMgrs that appear in the site's configuration file and expects a response from them in order to confirm that they are up and available.
The internal workflow of the vicoGateway depends on the type of message it is receiving or producing: it searches the list of sites or the list of flows (vicoMgrs) when it gets a configuration or start/end call message.
2. Configuration: messages of this kind are defined, top-down, from the GUI to the vicoGateway and later to the vicoMgr. In this case, several data structures are defined for the different types of messages (e.g. vicoMultiPerspective=on/off, etc.). Each data structure is parsed in a different way depending on the size and value of the header: the first bytes of the structure contain the information on the data type and the size of the following information (a sketch of such a header is given after the description of the message types). These messages are treated differently depending on their purpose: e.g. data that affects the Gstreamer pipeline to be launched (any property of the Gstreamer plugins defined in this embodiment), or data that affects the vicoMgr itself, changing where it has to read the pipeline from. The source is one vicoGateway, which transmits the proper data to its vicoMgrs and, in some embodiments of the invention, sends a configuration message to another remote vicoGateway, which then performs the proper operations in its site.
3. Runtime automatic control and communication control: these messages concern non-user-operated control signals between sites. These messages travel from vicoGateway to vicoGateway by means of SIP messages. VicoGateways can themselves be the origin of the signalling, although for signalling related to the video communication modules the message path is extended down to the vicoMgr and the vico (processing level) as well. Depending on the message type, a message can be distributed from a vicoGateway in a "multi-cast" manner down to each vicoMgr (for further distribution down to the vicos and the processing pipelines from there), or in a selective unicast manner for specific changes in each vicoMgr.
These messages might also have their origin in the vico application itself since, for example, a process could need to provoke a configuration change to correct some aspect of the communication. Therefore, such a message will have to travel through the whole communication chain (from vico to vico through the vicoGateways). A good example of this kind of information is the Bandwidth Estimation message. This structure is built inside the vico that holds the reception module, and this data has to be delivered to the remote site in order to adjust the bitrates of the outgoing streams. The first thing is to evaluate what kind of message has been received. The vicoGateway checks whether the message has a match in the message list. If there is a match, it evaluates the destination of that message (site or vicoMgr). Basically, it has a list of the available vicoMgrs and the available sites in the system (in one embodiment of the invention, this list of possible systems is read from a configuration file in the form of XML). If the flow (vicoMgr) exists in the list, it finally confirms that there is connectivity with the destination.
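A hedged sketch of how the receiving side could build such a Bandwidth Estimation message is given below; the field names and the 90% back-off factor are invented for illustration:

```python
# Sketch of a Bandwidth Estimation back channel: the receiving vico
# measures the incoming rate, and the resulting message travels back
# so the remote encoder bitrate can be adjusted.
import json
import time

class BandwidthEstimator:
    def __init__(self):
        self.bytes_seen = 0
        self.t0 = time.monotonic()

    def on_packet(self, payload: bytes):
        self.bytes_seen += len(payload)

    def estimate_message(self):
        elapsed = time.monotonic() - self.t0
        kbps = (self.bytes_seen * 8 / 1000) / max(elapsed, 1e-6)
        # Suggest ~90% of the measured throughput as the new target rate.
        return json.dumps({"type": "vicoBwE", "target_kbps": int(kbps * 0.9)})
```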
The final result of this protocol is a message of success or error (vicoOK / vicoError) travelling back to the GUI application, with information about which flow has produced the error and the nature of the error (flow not available, non-existent, etc.).
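The type/size header described in the Configuration messages above can be sketched as follows, assuming a fixed layout of a 4-byte type identifier followed by a 4-byte payload size (the actual ViCo structure layout is not published in this document):

```python
# Assumed type/size framing: 4-byte type id + 4-byte payload length,
# both big-endian, followed by the payload itself.
import struct

HEADER = struct.Struct("!II")   # message type id, payload size

def pack_message(msg_type: int, payload: bytes) -> bytes:
    return HEADER.pack(msg_type, len(payload)) + payload

def unpack_message(data: bytes):
    msg_type, size = HEADER.unpack_from(data)
    return msg_type, data[HEADER.size:HEADER.size + size]

raw = pack_message(1, b"vicoMultiPerspective=on")
print(unpack_message(raw))      # (1, b'vicoMultiPerspective=on')
```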
Once the message is identified and delivered to the vicoMgr, the latter has to create a process in order to deal with the hardware and produce the desired result. This operation is performed by a vico. Specifically, each vico is created as a process by a vicoMgr, and each vicoMgr can build one or more vicos.
A vicoMgr treats an incoming message and builds the proper vico, taking into account the parameters received along with the message structure. Therefore, if a message cannot be treated by the vicoMgr, an error response is sent back to the vicoGateway and to the GUI as well.
The final step is the creation of the flow itself. A flow is each one of the processing chains that generates a video/audio data stream. In one embodiment of the invention, the data stream is built and produced by a Gstreamer pipeline, but it could be produced by any set of libraries assembled properly. This Gstreamer pipeline is created inside each vico, and there are configuration messages that are delivered directly to the pipeline. If any error is produced in the delivery or at runtime, a vicoError message is generated with the proper information and sent back. The messages are passed to the pipeline through a dedicated interface, especially designed for this purpose.
The system can be easily extended to capture additional perspectives per camera (by using an appropriate camera with enough resolution and an appropriate lens). For example, in an embodiment of this invention, one can split each captured view into 3 segments and send it to 3 perspective screens for rendering, allowing for additional viewers per site.
Embodiment of the invention including the use of glasses
This embodiment is composed of the following functional modules:
• Four acquisition and sending modules, which read the images from the cameras (4 per site, for 3 sites and a total of 6 talking positions, 2 per site) and convert this information to a proper format that fits the requirements of the multi-perspective arrangement of views and screens and that can be understood by the encoding and sending routines.
Each sending module knows the destination of the flow it generates. The destination of the flows is set by the control subsystem (embedded in the user interface module), which builds the table of routes since it has the knowledge of the distribution and addresses of the sites.
• Two receiving and composing modules, which concentrate the traffic coming from the remote sites and decode the images. These modules also deal with the stream composition and the generation of the signal used to synchronize the images with the goggles (special glasses). Synchronization is necessary if time multiplexing of perspectives or views is performed; if light polarization is used for multiplexing 2 views, it is not necessary. The streams that are sent to the displays are composed, sequentially, of one left view and one right view (understanding left and right as different perspectives). This way the display will show the different perspectives at different times and the receiver will be able to separate them (an illustrative sketch of this interleaving follows the module list below).
This module participates in the negotiation of the flow. In other words, if the characteristics of one site do not match the default configuration (a simpler display, etc.), this module can set up a different configuration.
• The audio module takes care of sending/receiving/encoding/decoding the audio data streams. This audio module can be supported by any set of libraries that allows performing this kind of operations, including echo cancelling. The audio equipment is chosen taking into account the different configurations of the rooms, since walls of different sizes and materials may be faced. Specifically, the audio channels have to be sent multiplexed with the video packets in order to keep the audio-video synchronization and, therefore, have to be demultiplexed at the destination.
• All the data streams, sent and received, are conveyed through a network gateway that distributes the traffic going to and coming from the public network. This module takes care of the basic network operations in order to distribute the data traffic properly through the private and public networks. This entity reads the routes from a previously built file that contains the information of the internal network.
• The user interface gives an input point to the orders coming from the users, such as initiating and finalizing a call or configuring some parameters of the communication. The control sub-system, which conveys the orders and signals that manage the behaviour of the pipeline, is the core of this module and its most important part. It takes care of distributing and interpreting the signals all along the system. Basically, this module is composed of a net of small client/server applications that communicate with each other.
• Finally, the screen goggles synchronizer is a hardware device which works as a transducer between the electrical signal generated by the receiving module and a signal that the glasses can interpret, in order to synchronize the display with the operation of the active glasses. This ensures that each viewer will receive only the correct view at each instant under temporal multiplexing of views. The protocol behind this device can be rewritten in order to fit the technology of the goggles. This synchronizer is not needed if plain polarization multiplexing is used.
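The sequential left/right composition described for the receiving and composing modules above can be pictured with the following sketch, where a generator simply alternates the two perspectives that the synchronized glasses later separate (frame contents are placeholders):

```python
# Sketch of temporal multiplexing: the stream sent to the display
# alternates left and right perspectives; shuttered glasses pass only
# the view matching each tag to each viewer.
def interleave(left_frames, right_frames):
    """Yield frames as L, R, L, R ... for a time-multiplexed display."""
    for left, right in zip(left_frames, right_frames):
        yield ("L", left)
        yield ("R", right)

lefts = [f"L-perspective-{i}" for i in range(2)]
rights = [f"R-perspective-{i}" for i in range(2)]
for tag, frame in interleave(lefts, rights):
    print(tag, frame)   # L L-perspective-0, R R-perspective-0, ...
```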
This embodiment is implemented based on the Gstreamer Multimedia Framework, which allows great flexibility in terms of reusability and speed of development. All the previously mentioned modules can be translated, almost directly, into Gstreamer plugins.
A way to maximize the number of viewers while reducing the drop in luminance and picture flickering is to combine temporal multiplexing with polarization multiplexing. Both systems together allow reducing the temporal interleaving by half when more than 2 viewing positions are used.
Obviously, the system can be extended for extra users, but the configuration of the capture and visualization parts must be updated in order to keep the coherence of the viewing points with the additional cameras and screens. The cameras require a full view of the conferees, which means that the focal length, directionality and zoom of the cameras must allow both users (and more) to fit clearly within the field of view of each camera. In the same way, the screens have to be arranged so as to offer a correct sight of the remote users (the position must correspond to the perspective that is being shown).

Claims

1.- A method for processing, transmitting and receiving images of a first set of users in a first location in communication with a second set of users in a second location, the method comprising the steps of:
- recording the first set of users with a plurality of cameras distributed according to the positions of the second set of users in the second location, each camera recording an image of the first set of users;
- separating the part of the image belonging to each user of the first set of users from the parts belonging to the other users recorded by the same camera, thereby obtaining, for each user and for each camera, an image of that user;
- encoding multiple video signals, each containing the images of each user, and sending the encoded video signals to a receiving module in the second location;
- sending control signals from a control module for managing a communication between the location of the second set of users and the location of the first set of users;
- decoding the encoded video signal received by the receiving module;
- composing one multiperspective image of the first set of users containing all the images of each user;
- showing the multiperspective image of the first set of users to the second set of users by means of a display.
2.- The method according to claim 1 further comprising sending audio from the first location, perceptually synchronously with the video signal, to the second location.
3.- The method according to any of the previous claims, wherein sending control signals comprises transmitting orders from a front-end module to a plurality of processing modules for creating a flow containing video and audio data, the distribution of the orders being managed by a plurality of manager modules connected both to the front-end module and to the plurality of processing modules.
4.- The method according to claim 3, wherein the modules are scattered across a diversity of connected distributed computing units.
5.- The method according to any of the previous claims, further comprising analyzing the available bandwidth in the communication channel at runtime for adjusting a transmission rate between the first and the second location.
6.- The method according to any of the previous claims, wherein sending the video signal to the receiving module further comprises using forward error correction and encoding the video signal in UDP packets to avoid retransmission.
7.- A system for processing, transmitting and receiving images of a first set of users in a first location in communication with a second set of users in a second location, the system comprising:
- a plurality of cameras for recording the first set of users, each camera recording an image of the first set of users;
- a sub-module for separating the images belonging to each user from the images of the other users recorded by the same camera;
- an encoding module for encoding the signal coming from the sub-module, containing the images of each user, and sending the encoded video signal to a receiving module in the second location;
- a control module for managing a communication between the first location and the second location, sending control signals;
- a receiving module in the second location for decoding the encoded signal and composing one multiperspective image of the first set of users containing all the images of each user;
- at least one display for showing the multiperspective image of the first set of users to the second set of users.
8.- The system according to claim 7 further comprising:
- a screen glasses synchronizer, connected to the receiving module, which transduces a signal generated by the receiving module into a signal that active glasses can interpret in order to synchronize the display with the operation of the active glasses;
- active glasses for each user, in communication with the screen glasses synchronizer, for separating from the multiperspective image the perspective which best fits the position of the user in the second location.
9.- A computer program comprising program code means adapted to perform the steps of the method according to any of claims 1 to 6 when said program is run on a general-purpose processor, a digital signal processor, an FPGA, an ASIC, a microprocessor, a microcontroller, or any other form of programmable hardware.
PCT/EP2011/067091 2010-11-05 2011-09-30 System and method for multiperspective telepresence communication WO2012059280A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US41053510P 2010-11-05 2010-11-05
US61/410,535 2010-11-05

Publications (2)

Publication Number Publication Date
WO2012059280A2 true WO2012059280A2 (en) 2012-05-10
WO2012059280A3 WO2012059280A3 (en) 2012-08-16

Family

ID=44720896

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2011/067091 WO2012059280A2 (en) 2010-11-05 2011-09-30 System and method for multiperspective telepresence communication

Country Status (1)

Country Link
WO (1) WO2012059280A2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2890121A1 (en) * 2012-08-24 2015-07-01 ZTE Corporation Video conference display method and device
EP2852157A4 (en) * 2012-07-23 2015-07-22 Zte Corp Video processing method, apparatus, and system
US9565314B2 (en) 2012-09-27 2017-02-07 Dolby Laboratories Licensing Corporation Spatial multiplexing in a soundfield teleconferencing system
US10044945B2 (en) 2013-10-30 2018-08-07 At&T Intellectual Property I, L.P. Methods, systems, and products for telepresence visualizations
US10075656B2 (en) 2013-10-30 2018-09-11 At&T Intellectual Property I, L.P. Methods, systems, and products for telepresence visualizations
GB2582251A (en) * 2019-01-31 2020-09-23 Wacey Adam Volumetric communication system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6377230B1 (en) * 1995-10-05 2002-04-23 Semiconductor Energy Laboratory Co., Ltd. Three dimensional display unit and display method
US6937266B2 (en) * 2001-06-14 2005-08-30 Microsoft Corporation Automated online broadcasting system and method using an omni-directional camera system for viewing meetings over a computer network
US7515174B1 (en) * 2004-12-06 2009-04-07 Dreamworks Animation L.L.C. Multi-user video conferencing with perspective correct eye-to-eye contact
US8797377B2 (en) * 2008-02-14 2014-08-05 Cisco Technology, Inc. Method and system for videoconference configuration
NO331839B1 (en) * 2008-05-30 2012-04-16 Cisco Systems Int Sarl Procedure for displaying an image on a display


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2852157A4 (en) * 2012-07-23 2015-07-22 Zte Corp Video processing method, apparatus, and system
US9497390B2 (en) 2012-07-23 2016-11-15 Zte Corporation Video processing method, apparatus, and system
EP2890121A1 (en) * 2012-08-24 2015-07-01 ZTE Corporation Video conference display method and device
EP2890121A4 (en) * 2012-08-24 2015-08-12 Zte Corp Video conference display method and device
US9661273B2 (en) 2012-08-24 2017-05-23 Zte Corporation Video conference display method and device
US9565314B2 (en) 2012-09-27 2017-02-07 Dolby Laboratories Licensing Corporation Spatial multiplexing in a soundfield teleconferencing system
US10044945B2 (en) 2013-10-30 2018-08-07 At&T Intellectual Property I, L.P. Methods, systems, and products for telepresence visualizations
US10075656B2 (en) 2013-10-30 2018-09-11 At&T Intellectual Property I, L.P. Methods, systems, and products for telepresence visualizations
US10257441B2 (en) 2013-10-30 2019-04-09 At&T Intellectual Property I, L.P. Methods, systems, and products for telepresence visualizations
US10447945B2 (en) 2013-10-30 2019-10-15 At&T Intellectual Property I, L.P. Methods, systems, and products for telepresence visualizations
GB2582251A (en) * 2019-01-31 2020-09-23 Wacey Adam Volumetric communication system
GB2582251B (en) * 2019-01-31 2023-04-19 Wacey Adam Volumetric communication system

Also Published As

Publication number Publication date
WO2012059280A3 (en) 2012-08-16

Similar Documents

Publication Publication Date Title
AU2011258272B2 (en) Systems and methods for scalable video communication using multiple cameras and multiple monitors
JP5508450B2 (en) Automatic video layout for multi-stream and multi-site telepresence conferencing system
US9055312B2 (en) System and method for interactive synchronized video watching
US8289367B2 (en) Conferencing and stage display of distributed conference participants
WO2012059280A2 (en) System and method for multiperspective telepresence communication
EP4046389A1 (en) Immersive viewport dependent multiparty video communication
WO2011140812A1 (en) Multi-picture synthesis method and system, and media processing device
WO2018214746A1 (en) Video conference realization method, device and system, and computer storage medium
WO2012059279A1 (en) System and method for multiperspective 3d telepresence communication
JP7412781B2 (en) Space-aware multimedia router system and method
KR20110042447A (en) Terminal, node device and method for processing stream in video conference system
Ekmekcioglu et al. Adaptive multiview video delivery using hybrid networking
Tang et al. Audio and video mixing method to enhance WebRTC
WO2013001276A1 (en) Apparatuses, methods and computing software products for operating real-time multiparty multimedia communications
CN102685443B (en) System and method for a multipoint video conference
EP4209003A1 (en) Orchestrating a multidevice video session
Hutanu et al. Uncompressed HD video for collaborative teaching—An experiment
WO2013060295A1 (en) Method and system for video processing
US20100225733A1 (en) Systems and Methods for Managing Virtual Collaboration Systems
Ceglie et al. 3DStreaming: an open-source flexible framework for real-time 3D streaming services
de Lind van Wijngaarden et al. Multi‐stream video conferencing over a peer‐to‐peer network
Margolis et al. Tri-continental premiere of 4K feature movie via network streaming at FILE 2009
AU2014305576B2 (en) Multi-content media communication method, device and system
WO2015172126A1 (en) Full duplex high quality audio/video communication over internet
Petrovic et al. Toward 3D-IPTV: design and implementation of a stereoscopic and multiple-perspective video streaming system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11763935

Country of ref document: EP

Kind code of ref document: A2

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11763935

Country of ref document: EP

Kind code of ref document: A2