WO2017097410A1

WO2017097410A1 - System for control and interactive visualization of multimedia content

Info

Publication number: WO2017097410A1
Application number: PCT/EP2016/002045
Authority: WO
Inventors: Antonio Giovanni TESTA; Marco SOTGIU
Original assignee: At Media S.R.L.
Priority date: 2015-12-10
Filing date: 2016-12-06
Publication date: 2017-06-15
Also published as: ITUB20156909A1

Abstract

A system for control and interactive visualization of multimedia content, including: a server adapted to host a website and an electronic computer provided with means for the visualization of webpages. The particularity of the invention resides in that it allows the user to control, by means of eye motion recognition or gesture commands sent to a "server" process by means of images, the delivery of a service.

Description

SYSTEM FOR CONTROL AND INTERACTIVE VISUALIZATION OF MULTIMEDIA CONTENT

The present invention relates to a system for control and interactive visualization of multimedia content, particularly of multimedia content made available by means of a server process. The invention also relates to computer programs and to a method that is consistent with the system.

Systems are known which make the navigation of websites on the part of users more pleasant by solutions that aim to render their content interactive with the aim of improving the so-called user experience. Those known systems, which include, for example, EyeGaze v. 2.4.4, available at https://xlabsgaze.com, or Wan's Vision, available at http://www.marcuswan.com, have appreciable aspects but are not free from drawbacks.

A first drawback is that, in some of those prior art systems, the user is required to install adapted plug-ins and/or software, for example by integrating them in his own browser; those plug-ins and/or software allow to utilize a service offered by a server. However, the updating of the plug-in and/or software on the user side is not always possible due to problems of compatibility with the various available browsers.

Another drawback is that some of the prior art systems require the use of hardware and software resources (for example Adobe Flash) which make them difficult to use with devices having relatively limited resources such as smartphones.

Another drawback is that some of the prior art systems provide for the use of software distributions that are closely tied to a given browser, for example distributions developed with software development kits or SDKs tied to the Chrome browser. This restricts the usability of the system and also forces the user to install additional software.

US2012306734 discloses a gesture recognition technique wherein a static geometry model is generated, from one or more images of a physical environment captured using a camera, using one or more static objects to model corresponding one or more objects in the physical environment. Interaction of a dynamic object with at least one of the static objects is identified by analyzing at least one image and a gesture is recognized from the identified interaction of the dynamic object with the at least one of the static objects to initiate an operation of the computing device.

US2011173574 discloses a gesture-based system wherein gestures may control aspects of a computing environment or application, where the gestures may be derived from a user's position or movement in a physical space. A gesture-based system may have a plurality of modes, each mode a hardware configuration, a software configuration, or a combination thereof. Techniques for transitioning a user's control, via the user's gestures, between different modes enables a system to coordinate controls between multiple modes. The system may transition the user's control from a control of the first mode to a control of a second mode. The transition may be between hardware, software, or a combination thereof.

US2012166974 discloses a method, apparatus and system that enable indirect remote interaction with a web browser. In one embodiment, remote user gestures may be captured and processed to determine an action to be taken by the web browser.

The prior art shows a lack of a system that is capable of allowing to ensure simplicity and compatibility and at the same time ensure an improvement of user experience.

The aim of the present invention is to overcome the limitations of the prior art described above, proposing a new system capable of allowing a user to control, by means of gesture commands and tracing of the movements of the head and eyes, the visualization of multimedia content and of allowing an improved, innovative and interactive navigation experience.

Within the scope of this aim, an object of the present invention is to allow to utilize services that use the functionalities that are normally present in browsers.

A further object of the invention is to render the tracking and recognition of the movements of a user and the control of the delivery of a multimedia stream more easy and computationally more efficient.

Another object of the present invention is to make the navigation of websites more pleasant, natural and enjoyable, and render those websites more creative on the part of users.

Another object of the present invention is to make the use of the hardware resources of a server more efficient and to reduce the band required to deliver the service to a client, such as the browser of a user, by means of a containment of the generated traffic, be it local to the network of the user or usable via the Internet.

Another object of the present invention is to have a structure that is simple, relatively easy to provide in practice, safe in use, effective in operation, and relatively modest in cost.

This aim and these and other objects that will become better apparent hereinafter are achieved by a system, a computer program and by a method, as claimed in the appended claims.

Advantageously, the system is compatible with common browsers, operating systems and devices normally used by an average user.

Preferably, the system is scalable and allows to determine the minimum bandwidth requirements needed for the delivery of a service on the part of a server to a client, adapting the quality of the images and/or video according to the bandwidth that is actually used.

Further characteristics and advantages of the invention will become better apparent from the detailed description that follows, given by way of nonlimiting example and accompanied by the corresponding figures, wherein:

Figure 1 is a block diagram of the system according to the present invention;

Figure 2 is a flowchart of an algorithm of the system of Figure 1 ;

Figure 3 is a view of an aspect of the algorithm of Figure 2;

Figure 4 is a flowchart that shows a second algorithm of the system of Figure 1 ;

Figures 5 and 6 show an aspect of the algorithm of Figure 4;

Figure 7 is a flowchart that shows an example of transaction among components of the system according to the present invention.

An exemplifying architecture of the system according to the present invention is summarized in the block diagram of Figure 1.

The system of Figure 1 , generally designated by the reference numeral 1 , includes a server 3 and an electronic computer 2, which are connected by means of a telecommunications network 4. The electronic computer 2 includes a processing means adapted to run code of the software type. Typically, the electronic computer 2 includes a general-purpose computer, a tablet, a smartphone and in general any equivalent device capable of running software code. Hereinafter, reference shall be made to a single electronic computer 2, but it is to be understood that their number can vary and they are capable of sending, in manners that will become better apparent hereinafter, requests to the server 3. In particular, the electronic computer 2 communicates with the server 3 by using the client-server paradigm. The computer 2 in fact includes a client of the web browser (or more simply browser) type, which allows access to a service offered by the server 3, typically a website. In particular, the web browser of the computer 2 is capable of managing HTML elements and in particular the element termed "iframe", an abbreviated form of "inline frame", i.e., a particular frame that has the particularity of being anchored to a webpage and is not external or independent of it. Typically, the iframe includes a frame of a secondary page that is connected to a main webpage. The electronic computer 2 is capable of generating requests to access webpages of the server 3 and of utilizing, by means of the webpages, services offered by the server 3. The computer 2 also includes a plurality of means, implemented preferably by virtue of software code which preferably has a modular structure. In particular, the electronic computer 2 includes a means for establishing a communication channel with the server 3 and for the reception of a multimedia stream on the part of the server 3. Also, the electronic computer 2 includes means 2a, for example a webcam that is integrated or can be integrated in the computer 2, in order to acquire images of a user of the computer 2, during the navigation of a website or the use of a service offered by the server 3. During this navigation, the user typically faces the screen of the computer 2 and the webcam 2a. The computer 2 also includes a means for sending the acquired images to the server 3. Preferably, this sending occurs substantially simultaneously with the acquisition of the images and preferably for the entire duration of the use of the service offered by the server 3 on the part of the user.

The server 3 can be provided by means of hardware devices of the known type and includes a computer that has a computing power that is usually greater than that of ordinary general purpose computers and storage means. In one embodiment, the hardware resources of the server 3 are those of the computer 2: in this embodiment, the server 3 therefore includes a plurality of processes run by the computer 2. Hereinafter reference shall be made, merely for the sake of simplicity in description, to a server 3 provided as a machine that is physically separate from the computer 2, but it should be understood that the same remarks also apply if the functionalities of the server 3 are provided by means of processes of the server type that are activated on the computer 2. For example, the server 3 is a machine that implements an operating system of the UNIX-like type. The server 3 is capable of running a software application, commonly known as Web server, which is configured to respond to requests of the HTTP type sent by a user agent (such as the web browser of the electronic computer 2) via the telecommunications network 4.

In particular, the server 3 replies to these requests by sending a plurality of webpages and by delivering a multimedia stream. This multimedia stream is displayed by means of an iframe contained in one of these webpages.

In particular, the server 3 is capable of analyzing the images sent by the computer; and of analyzing them to establish whether the user has performed a gesture, for example of his hand or head. As a consequence of this analysis, the server 3 modifies the multimedia stream as a function of the recognition of the movement.

The multimedia stream might consist of the three-dimensional modeling of an environment - for example the interior of a museum - which allows to "immerse" the user in a virtual reality. The iframe shows the view of a virtual camera that simulates the point of view of the user who visits this virtual environment ("first-person perspective").

A gesture of the user, for example the movement of the head or hand upward/downward/to the right/to the left, is interpreted by the server 3 as an indication of the user that he wishes to view specific virtual areas (for example a specific room of the museum) of the displayed virtual environment.

The server 3 is capable of receiving this indication by means of the analysis of the images received from the webcam 2a. In particular, the server 3 includes a means for tracking the position of the eyes of the user, i.e., eye tracking, and means for tracking the gestures of the user, i.e., gesture tracking.

Hereinafter it will be assumed that the indications given by the client are up/down/right/left (hereinafter termed respectively UP, DOWN, RIGHT, LEFT), which correspond to a will of the user to make the associated virtual camera assume one of these four directions. Another command, hereinafter termed "STAY", corresponds to the situation in which the server 3 is unable to recognize a movement of the user, for example because the user has not moved appreciably with respect to the position shown in a previously received image.

It should be understood that other commands are possible and that the encoding thereof, encapsulated in the communication between the server and client processes, can vary.

For example, a gesture of the head to the right can be interpreted as the will to move the virtual camera to the right (RIGHT command) or an upward gesture of the head (UP) can be interpreted as the will to go to a hypothetical upper level of the displayed virtual environment.

Figure 2 is a view of the flowchart of an algorithm that is adapted to perform tracking of the position of the eyes of the user. Hereinafter, for reasons of mere constructive simplicity, the server 3 shall be referenced as being provided on a machine that is distinct from the computer 2, but it is to be understood that the same remarks apply if reference were made to the communication between the client and the server processes activated in the computer 2. This algorithm, implemented by means of software code, is run by the server 3 and includes the following steps:

in step 10, a string of binary data obtained by extracting the body of an HTTP message sent by the browser (client) of the computer 2 is received in input. The received data are decoded preferably in base64 and subsequently interpreted as a JPEG image; in step 11 , the outcome of the decoding operation is checked.

If the two decoding operations fail, step 21 , the data received is considered invalid and an error message (STAY) is delivered to the client. If the decoding operations are correct, one moves on to step 12;

in step 12, if the obtained image is intact, a face detection device (face detector), preferably provided by resorting to Haar-like features, is run with the aim of identifying the position of the face of the user;

in step 13, one checks whether the face decoding operation has been successful. If not, the user has not been recognized within the scene acquired by the webcam and a "STAY" command is returned to the client, step 23;

in step 14, once the face of the user has been detected, the original image is cropped so as to maintain only the portion that contains the face. Then an algorithm for recognizing specific areas of the face (landmark detection), such as the comers of the eyes, the nose, the mouth is applied to the new image. The regions of the eyes are determined by estimating two rectangles that enclose the ends of the two detected eyes; in step 15, one checks whether the estimates are correct and if not a STAY message is sent to the computer 2, step 26;

in step 16, the two image portions enclosed within the two rectangles as defined in step 15 are then subjected to a gradient analysis algorithm in order to estimate the position of the two pupils of the user.

Once the two estimates have been obtained, a triangle is defined in which the two pupils (A, B) and the nose (C) are the vertices. An example of triangulation of facial landmarks can be observed in Figure 3. The triangle ABC as defined in step 15 is therefore used to determine the commands sent by the user;

in step 17, the angular coefficient S of the straight line that passes through AB, the base of the triangle, is assessed: if S is in the neighborhood of 0, it is estimated that the user is has not sent horizontal movement commands (right or left). In this case, the height, step 19, is assessed with respect to the base of the triangle, i.e., the segment DC. Two threshold values, t1 and t2, are used to assess how the values assumed by the height DC, with respect to AB, vary in accordance with upward or downward movements of the head of the user. These values are calculated as follows: t1 = 0.8 multiplied by the distance between A and B (t1 = 0.8*distance(A,B)); t2 = 0.5 multiplied by the distance between A and B, (t2 = 0.5*distance(A, B)). On the basis of these values, if the segment DC is longer than t1 , step 20, the interpreted command is DOWN, step 21 ; if the segment DC is shorter than t2, the interpreted command is UP, step 23. Finally, if none of the two preceding cases has occurred, it is estimated that the user has not moved his head, returning a STAY command, step 24.

In step 18, if, differently from what has been expressed in the preceding step, the coefficient S has values that are sufficiently far from 0, one can assume that the user has inclined his head, sending a horizontal command. In this case, the sign of S is assessed in order to determine whether the command to be interpreted is LEFT, step 25, or RIGHT, step 26.

Figure 4 is a view of the flowchart of an algorithm adapted to perform gesture recognition and tracking, particularly of gestures performed with one's hand. This algorithm, implemented by means of software code, is run by the server 3 and includes the following steps:

in step 30, once the first frame (user image) has been acquired, it is used to initialize a model of the static background of the scene acquired by the webcam 2a; in step 31 , the frames are decoded and validated and in case of error a signal is emitted, step 38;

in step 32, once their integrity has been checked, the frames are passed in input to a background extractor, step 33, which by using the model defined on the basis of the first received image marks each pixel of the image as belonging to the foreground (object in motion) or to the background (static object). Subsequently, the same received image is utilized to perform a nondeterministic update of the model used.

In a negative case, step 39, the algorithm moves on to step 31.

In step 34, the foreground mask obtained in the preceding step is analyzed with the aim of defining a series of blobs, i.e., of components composed of groups of pixels, classified as in motion, and adjacent. In particular, these blobs are obtained by applying morphological operators and edge searches to the foreground mask extracted in the preceding steps. Figure 6 is a view of the extracted foreground mask.

In step 35, the obtained blobs are filtered by size: a threshold "t" has been estimated experimentally so as to reject all blobs that have a size smaller than t. "N" is understood as a minimum number of points required to define a trajectory T. "t" is understood as a minimum dimension for considering an object in motion. For example, if the image measures 320 x 240 pixels, t is equal for example to 70 pixels.

In step 36, the selected blobs are then filtered on the basis of their shape: in particular, the elongation and the ratio between the area and the perimeter of each one of them is assessed. Among the ones that comply with the estimated parameters, the biggest is selected and likely will consist of the portion of image that is occupied by the moving hand of the user.

In step 37, the centroid of the selected blob is then calculated and stored.

The steps described above are repeated for each incoming image, so as to determine a series of centroids which, ordered chronologically, constitute an estimate of the trajectory followed by the hand of the user.

Figure 5 shows an example of gesture recognition, wherein the circle indicates the tracked object, i.e., an example of identification of the centroid of the moving hand is shown.

Once a new image has been analyzed and a new centroid in the trajectory of the hand has been stored, if a sufficient number ("N1", for example, five) of points has been collected beforehand it is possible to estimate the movement that the user is performing.

In particular: a) the dimensionality of the trajectory is initially reduced by applying a linear regression algorithm based on least squares; b) the resulting straight line is analyzed to determine the movement of the hand of the user; c) the orientation of the straight line allows to classify the movement as horizontal or vertical; d) the statistical analysis of the partial derivatives of the points belonging to the trajectory allows to determine the direction of the motion (UP/DOWN or LEFT/RIGHT); e) once a first estimate of the gesture has been obtained, it is stored but not immediately returned to the user; f) the history of the estimates made on the last images (for example the last five images) is used to render the gesture determination more stable; this in fact is calculated finally by voting selection.

For example, the gesture that in the last five frames was most frequent is selected. In this manner, oscillations and sudden variations due to noise or recognition errors are reduced drastically, at the price of a very slight loss of sensitivity.

It is also possible to arrange that each gesture is accepted as such valid if, for example, it has been observed in two or three frames/images.

Figure 7 is a flowchart showing an example of initialization and execution of a session between the server 3 and a client that is present in the computer 2, for example the web browser.

In item 40, the client requests a new session to the server, specifying in the URL the value of a parameter "id" (for example set equal to -1) and the name of a service, exemplified exemplified by the "dimmer" string.

This request is interpreted by the server as the will to generate a new session; this, as a function of the resources available to the server 3, is created and initialized by using the image contained in the body of the message.

The reply message, item 41 , therefore has a "session" attribute, which will provide the identification code "id" that is adapted to identify the session. The client indicates this identifier in subsequent requests, exemplified in item 42 (which include the sending of images of the user himself encoded in the body of the HTTP message), which relate to the requested service.

The "value" attribute can provide a value that can be associated with a graphical object that can be displayed on the iframe client side. The server likewise responds to the request, item 43. Of course, messages are also possible for signaling error situations, such as lack of resources on the server for creating a new session or related to the fact that the received request does not refer to any active session (i.e., the id indicated by the client in the request is incorrect).

Advantageously, this communication scheme and the data encapsulation modes allow to reduce the generated traffic with respect to known systems.

Advantageously, the average traffic generated in upload by a UNIX-like server machine that implements the server 3 according to the present invention is estimated (estimate carried out by means of the vnstat software tool) to be substantially equal to approximately 260 kB.

Preferably, all the provided iframes share a modular structure, the main components of which can be: a) js library for access, by means of the browser, to the webcam of the user; b) player/3D engine for providing and reproducing the interactive graphical components; c) Ajax technology for sending asynchronous requests to the server (with images encoded preferably in base64 in the payload of the request; d) timing/synchronizing mechanism for the periodic acquisition of frames from the stream of the webcam and the temporally consistent reception of commands on the part of the server.

The functionalities of the server 3 can be provided by means of a computer program which preferably has a modular structure.

One module, for example, can be assigned to the management of gesture interaction, which allows to control simple animations by means of hand motions; another module can be assigned to navigation of immersive virtual environments, which can be controlled by means of gaze and head motions; another module can be assigned to tracking the face of the user and can allow to control various graphical objects, including objects that correspond to advertising banners.

In one embodiment, the functionalities of the modules are provided as distinct applications which can be run by the server processes 3. Likewise, the functionalities of the computer 2 can be provided with a computer program which preferably has a modular structure.

These functionalities are implemented on an ordinary browser. The user, once he has reached a webpage hosted by the server 3, is asked whether he wishes to allow browser access to the webcam of his own device, by means of which his movements are tracked and interpreted: the program on the server is capable of communicating with the iframes integrated in the website and visualized by the user.

In practice it has been found that the system, the program and the method described achieve the intended aim and objects.

In particular, it has been found that the system thus conceived allows to overcome the qualitative limitations of the prior art, making it possible to provide new Internet services, which augment the functionalities of websites, achieving an innovative and interactive navigation experience. The system has a client/server structure, in which a series of frames provided for this purpose are encapsulated in ordinary webpages by means of which the user can control, by means of gesture commands and head and eye tracking, various services, including advertising services, which are highly interactive.

The system is based on the use of iframes (in-line frames) on the client side and, on the server side, of information technology algorithms for recognition of eye motion and/or gesture recognition.

By virtue of the system, the method and the program according to the present invention, the objects of the website become alive with gestures, immersive environments become navigable with head motion, navigation is facilitated by using the gaze alone.

Clearly, numerous modifications are evident and can be performed promptly by the person skilled in the art without abandoning the scope of the present invention. For example, it is possible to enhance the navigation controls with the implementation of zoom functions, driven by means of the approach/spacing of the head with respect to the screen and of diagonal movements. Moreover, it is possible to evaluate the acceleration of the movements performed by the user.

This application claims the priority of Italian Patent Application No. ITUB20156909 (corresponding to 102015000081867), filed on December 10, 2015, the subject matter of which is incorporated herein by reference.

Claims

1. A system (1) for control and interactive visualization of multimedia content, comprising a server (3) adapted to host a website comprising a plurality of webpages and; a user electronic computer (2) provided with means for the visualization of said webpages and configured to use one or more services offered by said server (3); the electronic computer (2) comprising, or being connected to, an image acquisition device (2a) adapted to acquire one or more images of said user; said system (1) being characterized in that:

- said electronic computer (2) further comprises: a means for establishing a communication channel with said server (3); a means for reception, on the part of said server (3), of a multimedia stream, and; a means for sending said acquired images to said server (3);

- said server (3) further comprising: a means for analyzing said images sent by said electronic computer (2); a means for recognizing a movement of said user's body on the basis of the analysis of said images, and; a means for sending a data item adapted to modify said multimedia stream as a function of said recognized movement.

2. The system (1) according to claim 1, characterized in that said recognition means comprises a means for tracking the position of said user's eyes, i.e., eye tracking, and a means for tracking said user's gestures, i.e., gesture tracking.

3. The system (1) according to claim 1 or 2, characterized in that said movement can be associated with a request to modify the delivery of said multimedia stream.

4. The system (1) according to one or more of the preceding claims, characterized in that said communication channel comprises a session of the HTTP type; said electronic computer (2) comprising a means for sending to said server (3) a message of the HTTP type, comprising said acquired image, preferably encoded according to base64 encoding, and the identifier of a requested service that can be associated with the website; said server comprising a means for generating, in response to said request, a message of the HTTP type that comprises said identifier, a command and a time stamp; said command being adapted to modify the delivery of said multimedia stream.

5. The system (1 ) according to one or more of the preceding claims, characterized in that said multimedia stream comprises the reproduction of a virtual reality, for example a three-dimensional reconstruction of an environment; said server (3) further comprising: a means for obtaining, on the basis of the analysis of said images, a command sent by said user and associated with the movement of a part of the body of said user, for example the face, and; a means for modifying said reproduction on the basis of said command.

6. The system (1) according to one or more of the preceding claims, characterized in that said server further comprises:

a means (16; 34) for calculating a data item, termed motion data item, on the basis of the analysis of one or more anatomical reference points obtained from an image of said user;

a means (17, 18, 19, 20, 22; 35) for comparing said motion data item with a preset data item and;

a means (21 , 23, 24, 25, 26; 37) for establishing, on the basis of said comparison, whether said user has performed a movement.

7. The system according to one or more of the preceding claims, characterized in that said server (3) further comprises a means for:

- extrapolating (10, 12, 14) from a received image, preferably by means of technologies of the Haar-like type, a portion that corresponds to the face of said user;

- identifying, in said image portion, a triangle (ABC) that is delimited by points that correspond to the pupils and nose of said user; the base (AB) of said triangle corresponding to a segment that passes through said points that correspond to said pupils;

- calculating an angular coefficient (S) of a straight line that passes through the base of said triangle and comparing it with a predefined value;

- calculating (19) the height (DC) of said triangle and comparing it with at least two preset values.

8. The system according to one or more of the preceding claims, characterized in that said server (3) is configured to identify the position of a part of the body of the user, said server (3) further comprising a means for:

identifying (31) in said image a first portion, termed foreground, related to parts of the body of said user in motion; the remaining portion being termed background;

identifying (34) a plurality of regions or blobs of said image which comprise pixels that belong to said first portion; and selecting part of said blobs, said selection being performed on the basis of a criterion chosen among: blob shape, comparison of one dimension of the blob with a predefined value, or both; identifying, among the selected blobs, the blob, termed reference blob, that has the largest size; identifying the centroid of said reference blob with respect to a reference system;

storing said position information in storage means of said server;

comparing (35) said position information with at least one previously stored position information;

establishing (36), on the basis of said comparison, whether the user has performed a movement of a hand.

9. The system (1) according to one or more of the preceding claims, characterizing that said reference blob corresponds to an area that substantially comprises a hand or the face of said user.

10. The system (1) according to one or more of the preceding claims, characterized in that said electronic computer (2) comprises a means for visualizing said multimedia stream by means of an iFrame associated with a webpage visualized by said electronic computer (2), said electronic computer (2) further comprising a means for: accessing said image acquisition device (2a), said image acquisition device comprising a webcam adapted to acquire images of said user in real time;

visualizing said multimedia stream and interactive graphic components;

sending asynchronous requests to said server (3);

sending synchronization requests to said server (3);

periodically acquiring images from a stream generated by said webcam;

receiving commands on the part of the server (3).

11. A computer program stored in an information technology medium, comprising instructions of the software type adapted to implement the means of a server (3) according to one or more of the preceding claims 1 to 10.

12. The computer program stored in an information technology medium, comprising instructions of the software type adapted to implement the means of an electronic computer (2) according to one or more of the preceding claims 1 to 10.

13. A method for the control and interactive visualization of multimedia content, particularly of multimedia content stored on a server (3), comprising:

establishing a bidirectional communication channel;

sending on said channel a multimedia stream and a plurality of images that can be associated with a user;

analyzing said images;

detecting a movement of the body of a user on the basis of the analysis of said images;

modifying the delivery of the multimedia stream that can be visualized by said user as a function of the recognition of said movement.