WO2012008827A1 - System and method for detecting a person's direction of interest, such as a person's gaze direction - Google Patents


Info

Publication number
WO2012008827A1
Authority
WO
WIPO (PCT)
Prior art keywords
interest
processor
real time
person
determined
Prior art date
Application number
PCT/NL2011/050423
Other languages
French (fr)
Inventor
Roberto Valenti
Vladimir Nedovic
Original Assignee
Universiteit Van Amsterdam
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universiteit Van Amsterdam filed Critical Universiteit Van Amsterdam
Publication of WO2012008827A1 publication Critical patent/WO2012008827A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements


Abstract

A method for detecting a person's interest direction, wherein a processor performs the steps of: determine in real time an interest vector of a person from video data captured by a video camera; determine in real time a salient peak closest to the determined interest vector; determine in real time a saliency-corrected interest vector from the eyes of said person to said closest salient peak; determine in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector; determine in real time further interest vectors of said person from said video data; and calculate in real time recalibrated interest vectors by using a calibration error value calculated from said determined deviation.

Description

SYSTEM AND METHOD FOR DETECTING A PERSON'S DIRECTION OF INTEREST, SUCH AS A PERSON'S GAZE DIRECTION
The invention relates to a system for detecting a person's direction of interest, such as a person's gaze direction. The system can also be used to detect other visual
directions of interest of a person, such as the direction of a person's head, the person's eye, the person's arm and/or finger pointing, or the person's whole body, or a
combination thereof.
Visual gaze estimation is the process which determines the 3D line of sight of a person in order to analyze the
location of interest. The estimation of the direction or the location of interest of a user is key for many applications, spanning from gaze-based human-computer interaction,
advertisement [see: Smith, K., Ba, S.O., Odobez, J.M.,
Gatica-Perez, D.: Tracking the visual focus of attention for a varying number of wandering people. PAMI 30 (2008)], human cognitive state analysis and attentive interfaces (e.g. a gaze-controlled mouse) to human behavior analysis. Gaze direction can also provide high-level semantic cues such as who is speaking to whom, information on non-verbal communication (e.g. interest, pointing with the head/with the eyes) and the mental state/attention of a user (e.g. a driver).
Overall, visual gaze estimation is important to understand someone's attention, motivation and intentions.
Typically, the pipeline of estimating visual gaze mainly consists of two steps (see Figure 2): (1) analyze and transform pixel-based image features obtained by sensory information (devices) into a higher-level representation (e.g. the position of the head or the location of the eyes) and (2) map these features to estimate the visual gaze vector (line of sight), hence finding the area of interest in the scene. There is an abundance of research in the literature
concerning the first component of the pipeline, which principally covers methods to estimate the head position and the eye location, as they are both contributing factors to the final estimation of the visual gaze [see: Langton, S.R., Honeyman, H., Tessler, E.: The influence of head contour and nose angle on the perception of eye-gaze direction.
Perception & psychophysics 66 (2004)].
Nowadays, commercial eye gaze trackers are one of the most successful visual gaze devices. However, to achieve good detection accuracy, they have the drawback of using
intrusive or expensive sensors (pointed infrared cameras) which cannot be used in daylight and often limit the
possible movement of the head, or require the user to wear the device [see: Bates, R., Istance, H., Oosthuizen, L.,
Majaranta, P.: Survey of de-facto standards in eye tracking. In: COGAIN Conf. on Comm. By Gaze Inter. (2005)]. Therefore, recently, eye center locators based solely on appearance have been proposed [see: Cristinacce, D., Cootes, T., Scott, I.: A multi-stage approach to facial feature detection. In: BMVC. (2004) 277-286; Kroon, B., Boughorbel, S., Hanjalic, A.: Accurate eye localization in webcam content. In: FG. (2008); and Valenti, R., Gevers, T.: Accurate eye center location and tracking using isophote curvature. In: CVPR. (2008)] which are reaching reasonable accuracy in order to roughly estimate the area of attention on a screen in the second step of the pipeline. A recent survey [Hansen, D.W., Ji, Q.: In the eye of the beholder: A survey of models for eyes and gaze. PAMI 32 (2010)] discusses the different methodologies to obtain the eye location information through video-based devices. Some of the methods can also be used to estimate the face
location and the head pose in geometric head pose estimation methods. Other methods in this category track the appearance between video frames, or treat the problem as an image classification one, often interpolating the results between known poses. The survey in [Murphy-Chutorian, E., Trivedi, M.: Head pose estimation in computer vision: A survey. PAMI 31 (2009)] gives a good overview of appearance-based head pose estimation methods. Once the correct features are determined using one of the methods and devices discussed above, the second step in gaze estimation (see Figure 2) is to map the obtained information to the 3D scene in front of the user. In eye gaze trackers, this is often achieved by direct mapping of the eye center position to the screen location. This requires the system to be calibrated and often limits the possible position of the user (e.g. using chinrests). In the case of 3D visual gaze estimation, this often requires the intrinsic camera
parameters to be known. Failure to correctly calibrate or comply with the restrictions of the gaze estimation device may result in erroneous gaze estimates.
The invention aims at a more accurate, user-friendly and/or cheaper system for detecting a person's direction of interest.
To that end, the system comprises: a processor; at least one video camera connected to said processor for capturing video data; electronic memory connected to said processor; wherein said processor is arranged to determine in real time an interest vector of a person from said video data; wherein said processor is arranged to determine in real time a salient peak closest to the determined interest vector;
wherein said processor is arranged to determine in real time a saliency-corrected interest vector between said person and said closest salient peak; wherein said processor is
arranged to determine in real time the deviation between the determined interest vector and the determined saliency- corrected interest vector; wherein said processor is
arranged to determine in real time further interest vectors of said person from said video data; and wherein said processor is arranged to calculate in real time recalibrated interest vectors by using a calibration error value
calculated from said determined deviation.
Preferably said processor is arranged to determine in real time the salient peaks closest to a multitude of determined interest vectors; said processor is arranged to determine in real time a multitude of saliency-corrected interest vectors between the person and said closest salient peaks; wherein said processor is arranged to determine in real time the deviations between the multitude of determined interest vectors and the multitude of saliency-corrected interest vectors; wherein said calibration error value is calculated from said multitude of determined deviations.
Preferably said processor is arranged to iterate in real time said process of calculating said calibration error value by replacing previous determined interest vectors with interest vectors which are corrected using a previous calibration error value, for calculating a current
calibration error value.
Preferably said processor is arranged to calculate in real time said calibration error value by minimizing the
difference between the multitude of determined deviations and said calibration error value, for instance by using a weighted least-squares error minimization method.
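The following Python fragment is a minimal, illustrative sketch of such a calibration-error estimate under the simplest possible model, namely a constant offset between the determined and the saliency-corrected interest points; the function names, the choice of a constant model and the example weights are assumptions made for illustration, not the reference implementation of the invention.

import numpy as np

def calibration_offset(raw_points, corrected_points, weights):
    """Estimate a constant calibration offset as the weighted least-squares
    fit between the observed deviations and a single 2D error value."""
    deviations = np.asarray(corrected_points, float) - np.asarray(raw_points, float)
    w = np.asarray(weights, float).reshape(-1, 1)
    # For a constant model, weighted least squares reduces to a weighted mean.
    return (w * deviations).sum(axis=0) / w.sum()

# Hypothetical usage: recalibrate further interest points with the offset.
raw = [(100.0, 120.0), (300.0, 310.0), (505.0, 220.0)]
adjusted = [(112.0, 131.0), (309.0, 322.0), (517.0, 229.0)]
offset = calibration_offset(raw, adjusted, weights=[1.0, 0.8, 0.6])
recalibrated = np.asarray(raw) + offset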
Preferably said salient peaks in the region around the determined interest vector are determined using saliency data about the area which the person is expected to look at, such as video data, screen capture data or manually input data, such as annotated saliency data.
In one preferred embodiment said processor is arranged to determine in real time salient peaks in the region around the determined interest vector from video data before determining said salient peak closest to the determined interest vector.
In a further preferred embodiment said system comprises at least two video cameras connected to said processor, one camera for capturing video data of a person's face and/or body, and one camera for capturing said video data.
In an alternative preferred embodiment the processor, electronic memory and said at least two video cameras are combined in one device. The device may for instance be a smartphone, having a video camera in the back aimed at an area of interest, and a webcam in the front, aimed at the user's face and eyes. A smartphone with gaze detection capabilities is described in US 2010/0079508, wherein gaze detection is used to determine if a person is looking at the screen of the smartphone. By using the teaching of the current invention, the smartphone can be used to detect which objects behind the smartphone the person is looking at.
The invention furthermore relates to a method for detecting a person's interest direction, wherein a processor performs the steps of: determine in real time an interest vector of a person from video data captured by a video camera; determine in real time a salient peak closest to the determined interest vector; determine in real time a saliency-corrected interest vector between said person and said closest salient peak; determine in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector; determine in real time further interest vectors of said person from said video data; and calculate in real time recalibrated interest vectors by using a calibration error value calculated from said
determined deviation.
The invention also relates to a computer software program arranged to run on a processor to perform the steps of the method of the invention, a computer readable data carrier comprising a computer software program arranged to run on a processor to perform the steps of the method of the
invention, and a computer comprising a processor and
electronic memory connected thereto loaded with a computer software program arranged to perform the steps of the method of the invention.
A preferred embodiment of the invention is described in more detail below with reference to the drawings in which: Figure 1 is a perspective view of a system in accordance with the invention; and Figure 2 is a flow chart of the system in accordance with the invention.
According to figure 1, a system for detecting a person's gaze direction comprises a computer 1 with, amongst others, a processor unit, system memory and a hard drive, and a video camera 2 aimed at the face of a person 6, connected to for instance a USB port of the computer 1. A second camera behind the person, aimed at the area the person 6 is looking at, is also connected to a USB port of the computer 1. A software program is loaded from the hard drive into the system memory of the computer 1 in order to perform the steps of the gaze detection method.
An image 4 having several (salient) objects (in this example a car and its components) that may be of interest to the person is present in front of the person 6. Alternatively, the system may be used to determine the gaze direction in a physical environment where (salient) real-world objects of interest are present.
According to figure 2, a visual gaze vector can be resolved from a combination of body/head pose and eye location obtained from imaging device 2 in component I (box 10). As this is a rough estimation, the obtained gaze line 13 in component II (box 11) is then followed until an uncertain location in the gazed area. The area of interest, in this example obtained from imaging device 5, is analyzed in component III (box 12). In the proposed framework, the gaze vector 13 will be steered (arrow 14) to the most probable (salient) object which is close to the previously estimated point of interest. It has been shown that salient objects attract eye fixations [see: Spain, M., Perona, P.: Some objects are more equal than others: Measuring and predicting importance. In: ECCV. (2009); and Einhauser, W., Spain, M., Perona, P.: Objects predict fixations better than early saliency. J. Vis. 8 (2008) 1-26], and this property is extensively used in the literature to create saliency maps (probability maps which represent the likelihood of receiving an eye fixation) to automate the generation of fixation maps [see: Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: ICCV (2009); and Peters, R.J., Iyer, A., Koch, C., Itti, L.: Components of bottom-up gaze
allocation in natural scenes. J. Vis. 5 (2005) 692-692].
In the prior art, saliency is thus used to predict where the interesting parts of a scene are, and thereby where a person would look. However, now that accurate saliency algorithms are available [see:
Valenti, R., Sebe, N., Gevers, T.: Image saliency by
isocentric curvedness and color. In ICCV. (2009); Itti, L., Koch, C, Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. PAMI 20 (1998) 1254- 1259; and Ma, Y.F., Zhang, H.J.: Contrast-based image attention analysis by using fuzzy growing. In ACM MM.
(2003)], the invention proposes to reverse the problem by using saliency maps to aid the uncertain fixations. In the system according to the invention, the gaze vector 13 obtained by an existing visual gaze estimation system is used to estimate the possible interest area on the scene. The size of this area will depend on device capabilities and on the scenario. This area is evaluated for salient regions, and filtered so that salient regions which are far away from the centre of interest will be less relevant for the final estimation. The obtained probability landscape is then explored to find the best candidate for the location of the adjusted fixation. This process is repeated for every estimated fixation in the image. After all the fixations and respective adjustments are obtained, the least-squares error between them is minimized in order to find the best
transformation from the estimated sets of fixations to the adjusted ones.
This transformation is then applied to the original
fixations and future ones, in order to compensate for the found error. When a sequence of estimations is available, the obtained improvement is used to correct the previously erroneous estimates. The found error is used to adjust and recalibrate the gaze estimation devices at runtime, in order to improve future estimations. The method may be used to fix the shortcomings of low-quality monocular head and eye trackers, improving their overall accuracy.
Visual gaze estimators have inherent errors which may occur in each of the components of the visual gaze pipeline. From these errors the size of the area where interesting
locations may be found can be derived. To this end, three errors which should be taken into account when estimating visual gaze (one for each of the components of the pipeline) can be identified: the device error, the calibration error and the foveating error. Depending on the scenario, the actual size of the area of interest will be computed by cumulating these three errors and mapping them to the distance of the gazed scene. The device error:
This error is attributed to the first component of the visual gaze estimation pipeline. As imaging devices are limited in resolution, there is a discrete number of states in which image features can be detected and recognized. The variables defining this error are often the maximum level of detail which the device can achieve while interpreting pixels as the location of the eye or the position of the head. Therefore, this error mainly depends on the scenario (e.g. the distance of the subject from the imaging device) and on the device that is being used.
The calibration error:
This error is attributed to the resolution of the visual gaze starting from the features extracted in the first component. Eye gaze trackers often use a mapping between the position of the eye and the corresponding locations on the screen. Therefore, the tracking system needs to be
calibrated. In case the subject moves from his original location, this mapping will be inconsistent and the system may erroneously estimate the visual gaze. Chinrests are often required in these situations to limit the movements of the users to a minimum. Muscular distress, the length of the session, the tiredness of the subject, all may influence the calibration error. As the calibration error cannot be known a priori, it cannot be modeled. Therefore, the aim is to estimate it, so that it can be compensated. The foveating error:
As this error is associated with the new component proposed in the pipeline, it is required to analyze the properties of the fovea to define it. The fovea is the part of the retina responsible for accurate central vision in the direction in which it is pointed. It is necessary for performing any
activity which requires a high level of visual detail. The human fovea has a diameter of about 1.0 mm with a high concentration of cone photoreceptors which account for the high visual acuity capability. Through saccades (more than 10,000 per hour [see: Geisler, W.S., Banks, M.S.: Handbook of Optics, 2nd Ed. Volume I: Fundamentals, Techniques and Design. Volume 1. McGraw-Hill, Inc., New York, NY, USA
(1995)]), the fovea is moved to the regions of interest, generating eye fixations. In fact, if the gazed object is large, the eyes constantly shift their gaze to subsequently bring images into the fovea. For this reason, fixations obtained by analyzing the location of the center of the cornea are widely used in the literature as an indication of the gaze and interest of the user.
However, it is generally assumed that the fixation obtained by analyzing the center of the cornea corresponds to the exact location of interest. While this is a valid assumption in most scenarios, the size of the fovea actually permits seeing the central two degrees of the visual field. For
instance, when reading a text, humans do not fixate on each of the letters; one fixation permits reading and seeing multiple words at once.
Another important aspect to be taken into account is the decrease in visual resolution as we move away from the center of the fovea. The fovea is surrounded by the
parafovea belt which extends up to 1.25 mm away from the center, followed by the perifovea (2.75 mm away), which in turn is surrounded by a larger area that delivers low-resolution information. Starting at the outskirts of the fovea, the density of receptors progressively decreases, so the visual resolution decreases rapidly away from the foveal center [see: Rossi, E.A., Roorda, A.: The relationship between visual resolution and cone spacing in the human fovea. Nature Neuroscience 13 (2009)]. This is modeled by using a Gaussian kernel centered on the area of interest, with a standard deviation of a quarter of the estimated area of interest. In this way, areas which are close to the border of the area of interest are of lesser importance. In our model, we consider this region as the possible location for the interest point. As the area of interest is increased by the projection of the total error, the tail of the Gaussian of the area of interest will help to balance the importance of a fixation point against the distance from the original fixation point. As the point of interest could be anywhere in this limited area, the next step is to use saliency to extract potential fixation candidates. The saliency is evaluated on the interest area by using a customized version of the saliency framework proposed in
[Valenti, R., Sebe, N., Gevers, T.: Image saliency by isocentric curvedness and color. In: ICCV. (2009)]. The framework uses isophote curvature to extract the
displacement vectors, which indicate the center of the osculating circle at each point of the image. In Cartesian coordinates, the isophote curvature is defined as:
$$\kappa = -\frac{L_y^2 L_{xx} - 2\,L_x L_{xy} L_y + L_x^2 L_{yy}}{\left(L_x^2 + L_y^2\right)^{3/2}}$$
Where Lx represents the first-order derivative of the luminance function in the x direction, Lxx the second-order derivative in the x direction, and so on. The isophote curvature is used to estimate points which are closer to the center of the structure they belong to; therefore the isophote curvature is inverted and multiplied by the gradient. The displacement coordinates D(x, y) to the estimated centers are then obtained by:
$$D(x, y) = -\frac{\{L_x, L_y\}\left(L_x^2 + L_y^2\right)}{L_y^2 L_{xx} - 2\,L_x L_{xy} L_y + L_x^2 L_{yy}}$$
In this way, every pixel in the image gives an estimate of the potential structure it belongs to. To collect and reinforce this information and to deduce the location of the objects, the D(x, y)'s are mapped into an accumulator, weighted according to their local importance, defined as the amount of image curvature and color edges. The accumulator is then convolved with a Gaussian kernel so that each cluster of votes will form a single estimate. This clustering of votes in the accumulator gives an indication of where the centers of interesting or structured objects are in the image.
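As an illustration of this isophote-based voting scheme, the following Python sketch computes displacement vectors from Gaussian image derivatives and accumulates them into a smoothed isocenter map. It assumes a grayscale image, uses gradient energy as the vote weight and omits the curvedness and color-edge weighting of the cited framework, so it should be read as a simplified approximation rather than the framework itself.

import numpy as np
from scipy.ndimage import gaussian_filter

def isocenter_map(img, sigma=2.0):
    """Accumulate isophote-based displacement votes into a smoothed map whose
    peaks indicate the centers of structured (salient) objects."""
    img = np.asarray(img, dtype=float)
    # Gaussian derivatives of the luminance function (axis 0 = y, axis 1 = x).
    Lx  = gaussian_filter(img, sigma, order=(0, 1))
    Ly  = gaussian_filter(img, sigma, order=(1, 0))
    Lxx = gaussian_filter(img, sigma, order=(0, 2))
    Lyy = gaussian_filter(img, sigma, order=(2, 0))
    Lxy = gaussian_filter(img, sigma, order=(1, 1))

    grad2 = Lx ** 2 + Ly ** 2
    denom = Ly ** 2 * Lxx - 2 * Lx * Lxy * Ly + Lx ** 2 * Lyy
    denom[np.abs(denom) < 1e-8] = np.nan  # ignore flat regions

    # Displacement vectors D(x, y) pointing towards the isophote centers.
    dx = -Lx * grad2 / denom
    dy = -Ly * grad2 / denom

    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    finite = np.isfinite(dx) & np.isfinite(dy)
    cx = np.where(finite, np.rint(xs + dx), -1).astype(int)
    cy = np.where(finite, np.rint(ys + dy), -1).astype(int)
    valid = finite & (cx >= 0) & (cx < w) & (cy >= 0) & (cy < h)

    # Vote into the accumulator, here weighted simply by gradient energy.
    acc = np.zeros_like(img)
    np.add.at(acc, (cy[valid], cx[valid]), grad2[valid])

    # Convolve with a Gaussian so that each cluster of votes forms one estimate.
    return gaussian_filter(acc, sigma)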
In [Valenti, R., Sebe, N., Gevers, T.: Image saliency by isocentric curvedness and color. In: ICCV. (2009)], multiple scales are used. Here, since the scale is directly related to the size of the area of interest, the optimal scale can be determined once and then linked to the area of interest itself. Furthermore, in the abovementioned document, the color and curvature information is added to the final saliency map, while here this information is discarded. The reasoning behind this choice is that this information is mainly useful to enhance objects on their edges, while the isocentric saliency is suited to locate the adjusted fixations closer to the center of the objects, rather than on their edges. While removing this information from the saliency map might reduce the overall response of salient objects in the scene, it brings the ability to use the saliency maps as smooth probability density functions. Once the saliency of the area of interest is obtained, it is masked by the area of interest model as defined before.
Hence, the Gaussian kernel in the middle of the area of interest will aid in suppressing saliency peaks in its outskirts. However, there may still be uncertainties about multiple optimal fixation candidates.
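A small sketch of this masking step, assuming the area of interest is described by a radius in pixels around the estimated fixation and that, as described above, the Gaussian standard deviation is taken as a quarter of that size; the function name and the radius convention are illustrative.

import numpy as np

def mask_area_of_interest(saliency, center, radius):
    """Weigh a saliency map by a Gaussian centered on the estimated fixation,
    with a standard deviation of a quarter of the area-of-interest size."""
    ys, xs = np.mgrid[0:saliency.shape[0], 0:saliency.shape[1]]
    sigma = radius / 4.0
    dist2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return saliency * np.exp(-dist2 / (2.0 * sigma ** 2))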
Therefore, a meanshift window with a size corresponding to the standard deviation of the Gaussian kernel is initialized on the location of the estimated fixation point
(corresponding to the center of the area of interest). The meanshift algorithm will then iterate from that point towards the point of highest energy. After convergence, the saliency peak on the area of interest which is closest to the centre of the converged meanshift window is selected as the new (adjusted) fixation point. This process is repeated for all fixation points on an image, obtaining a set of
corrections. An analysis of a number of these corrections holds information about the overall calibration error. This allows for estimation of the current calibration error of the gaze estimation system, which can thereafter be used to compensate for it. The highest peaks in the saliency maps are used to align fixation points with the salient points discovered in the area of interest.
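The mean-shift step could look as follows in Python; the window size, stopping criteria and the simple weighted-centroid update are assumptions for illustration, the essential behaviour being that the window climbs the masked saliency map from the estimated fixation towards the nearest region of high energy.

import numpy as np

def meanshift_peak(saliency, start, window, max_iter=50, eps=0.5):
    """Iterate a mean-shift window from the estimated fixation towards the
    point of highest energy on the (masked) saliency map."""
    p = np.asarray(start, dtype=float)  # (x, y)
    h, w = saliency.shape
    for _ in range(max_iter):
        x0, x1 = int(max(p[0] - window, 0)), int(min(p[0] + window + 1, w))
        y0, y1 = int(max(p[1] - window, 0)), int(min(p[1] + window + 1, h))
        patch = saliency[y0:y1, x0:x1]
        if patch.size == 0 or patch.sum() <= 0:
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        # Weighted centroid of the window contents is the mean-shift update.
        new_p = np.array([(xs * patch).sum(), (ys * patch).sum()]) / patch.sum()
        if np.linalg.norm(new_p - p) < eps:  # converged
            p = new_p
            break
        p = new_p
    # The saliency peak closest to p becomes the adjusted fixation point.
    return p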
A weighted least-squares error minimization between the estimated gaze locations and the corrected ones is
performed. In this way, the affine transformation matrix T is derived. The weight is retrieved as the confidence of the adjustment, which considers both the distance from the original fixation and the saliency value sampled on the same location. The obtained transformation matrix T is thereafter applied to the original fixations to obtain the final fixation estimates. These new fixations should have
minimized the calibration error.
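A possible formulation of this weighted least-squares step in Python, using homogeneous coordinates to estimate a 2x3 affine matrix T; this is a standard formulation chosen for illustration and is not claimed to be the exact solver used by the inventors.

import numpy as np

def fit_affine(original, adjusted, weights):
    """Derive a 2x3 affine matrix T minimizing the weighted squared error
    between T applied to the original fixations and the adjusted ones."""
    P = np.hstack([np.asarray(original, float), np.ones((len(original), 1))])  # (N, 3)
    Q = np.asarray(adjusted, float)                                            # (N, 2)
    W = np.sqrt(np.asarray(weights, float)).reshape(-1, 1)
    X, *_ = np.linalg.lstsq(W * P, W * Q, rcond=None)
    return X.T  # new_point = T @ [x, y, 1]

def apply_affine(T, points):
    """Apply the affine correction T to a set of fixation points."""
    P = np.hstack([np.asarray(points, float), np.ones((len(points), 1))])
    return P @ T.T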
The pseudo code of the proposed system is as follows:
Initialize scenario parameters
- Calculate the total error = foveating error + device error + calibration error
- Calculate the size of the area of interest by projecting the total error at distance d, i.e. d * tan(total error)
for (each new fixation point p) do
  - Retrieve the estimated gaze point from the device
  - Extract the area of interest around the fixation p
  - Inspect the area of interest for salient objects
  - Filter the result by the Gaussian kernel
  - Initialize a meanshift window on the center of the area of interest
  while (maximum iterations not reached and Δp > threshold) do
    - climb the distribution to the point of maximum energy
  end while
  - Select the saliency peak closest to the center of the converged meanshift window as the adjusted fixation
  - Store the original fixation and the adjusted fixation, with weight w found on the same location on the saliency map
  - Calculate the weighted least-squares solution between all the stored points to derive the transformation matrix T
  - Transform all original fixations with the obtained transformation matrix
  - Use the transformation matrix T to compensate the calibration error in the device
end for
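For illustration, the pseudocode above could be composed from the sketches given earlier (isocenter_map, mask_area_of_interest, meanshift_peak, fit_affine, apply_affine) roughly as follows; all names, parameters and unit conventions (angular errors in radians, distances and coordinates in scene-image pixels) are assumptions, not the patent's reference implementation.

import numpy as np

def run_calibration(fixations, scene_img, d, device_err, calib_err, fovea_err):
    total_err = fovea_err + device_err + calib_err        # cumulated angular error
    radius = d * np.tan(total_err)                        # projected area of interest
    saliency = isocenter_map(scene_img)                   # salient-structure map

    originals, adjusted, weights = [], [], []
    for p in fixations:                                   # estimated gaze points
        masked = mask_area_of_interest(saliency, p, radius)
        peak = meanshift_peak(masked, p, window=radius / 4.0)
        originals.append(p)
        adjusted.append(peak)
        # Confidence of the adjustment, sampled from the masked saliency map.
        weights.append(masked[int(peak[1]), int(peak[0])])

    T = fit_affine(originals, adjusted, weights)          # weighted least squares
    return apply_affine(T, originals), T                  # recalibrated fixations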
It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. It is understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the appended claims.

Claims

1. A system for detecting a person's direction of
interest, such as a person's gaze, eye, head, body or finger point direction, comprising:
a processor;
at least one video camera connected to said processor for capturing video data;
electronic memory connected to said processor;
wherein said processor is arranged to determine in real time an interest vector of a person from said video data; wherein said processor is arranged to determine in real time a salient peak closest to the determined interest vector;
wherein said processor is arranged to determine in real time a saliency-corrected interest vector between said person and said closest salient peak;
wherein said processor is arranged to determine in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector;
wherein said processor is arranged to determine in real time further interest vectors of said person from said video data; and
wherein said processor is arranged to calculate in real time recalibrated interest vectors by using a calibration error value calculated from said determined deviation.
2. The system of claim 1, wherein said processor is arranged to determine in real time the salient peaks closest to a multitude of determined interest vectors;
said processor is arranged to determine in real time a multitude of saliency-corrected interest vectors between the person and said closest salient peaks; wherein said processor is arranged to determine in real time the deviations between the multitude of determined interest vectors and the multitude of saliency-corrected interest vectors;
wherein said calibration error value is calculated from said multitude of determined deviations.
3. The system of claim 1 or 2, wherein said processor is arranged to iterate in real time said process of calculating said calibration error value by replacing previous
determined interest vectors with interest vectors which are corrected using a previous calibration error value, for calculating a current calibration error value.
4. The system of claim 2 or 3, wherein said processor is arranged to calculate in real time said calibration error value by minimizing the difference between the multitude of determined deviations and said calibration error value, for instance by using a weighted least-squares method.
5. The system of any of the claims 1 - 4, wherein said salient peaks are determined using saliency data about the area which the person is expected to look at, such as video data, screen capture data or manually input data.
6. The system of any of the claims 1 - 5, wherein said processor is arranged to determine in real time salient peaks in the region around the determined interest vector from video data before determining said salient peak closest to the determined interest vector.
7. The system of claim 6, wherein said system comprises at least two video cameras connected to said processor, one camera for capturing video data of a person's face and/or body, and one camera for capturing said video data.
8. The system of claim 7, wherein said processor, said electronic memory and said at least two video cameras are combined in one device, for instance a smartphone.
9. The system of any of the previous claims 1 - 8, wherein said direction of interest is a gaze direction.
10. A method for detecting a person's interest direction, wherein a processor performs the steps of:
determine in real time an interest vector of a person from video data captured by a video camera; determine in real time a salient peak closest to the determined interest vector;
determine in real time a saliency-corrected interest vector between said person and said closest salient peak; determine in real time the deviation between the determined interest vector and the determined saliency- corrected interest vector;
determine in real time further interest vectors of said person from said video data; and
calculate in real time recalibrated interest vectors by using a calibration error value calculated from said
determined deviation.
11. A computer software program arranged to run on a processor to perform the steps of the method according claim 10.
12. A computer readable data carrier comprising a computer software program arranged to run on a processor to perform the steps of the method according to claim 10.
13. A computer comprising a processor and electronic memory connected thereto loaded with a computer software program arranged to perform the steps of the method according to claim 10.
PCT/NL2011/050423 2010-06-11 2011-06-10 System and method for detecting a person's direction of interest, such as a person's gaze direction WO2012008827A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NL2004878 2010-06-11
NL2004878A NL2004878C2 (en) 2010-06-11 2010-06-11 System and method for detecting a person's direction of interest, such as a person's gaze direction.

Publications (1)

Publication Number Publication Date
WO2012008827A1 true WO2012008827A1 (en) 2012-01-19

Family

ID=43589565

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NL2011/050423 WO2012008827A1 (en) 2010-06-11 2011-06-10 System and method for detecting a person's direction of interest, such as a person's gaze direction

Country Status (2)

Country Link
NL (1) NL2004878C2 (en)
WO (1) WO2012008827A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015170142A1 (en) * 2014-05-08 2015-11-12 Sony Corporation Portable electronic equipment and method of controlling a portable electronic equipment
US10248281B2 (en) 2015-08-18 2019-04-02 International Business Machines Corporation Controlling input to a plurality of computer windows

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100079508A1 (en) 2008-09-30 2010-04-01 Andrew Hodge Electronic devices with gaze detection capabilities

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100079508A1 (en) 2008-09-30 2010-04-01 Andrew Hodge Electronic devices with gaze detection capabilities

Non-Patent Citations (21)

* Cited by examiner, † Cited by third party
Title
BATES, R., ISTANCE, H., OOSTHUIZEN, L., MAJARANTA, P.: "Survey of de-facto standards in eye tracking", COGAIN CONF. ON COMM. BY GAZE INTER., 2005
CRISTINACCE, D., COOTES, T., SCOTT, I.: "A multi-stage approach to facial feature detection", BMVC, 2004, pages 277 - 286
EINHAUSER, W., SPAIN, M., PERONA, P.: "Objects predict fixations better than early saliency", J. VIS., vol. 8, 2008, pages 1 - 26
GEISLER, W.S., BANKS, M.S.: "Handbook of Optics, 2nd Ed. Volume I: Fundamentals, Techniques and Design", vol. 1, 1995, MCGRAW-HILL, INC.
HANSEN, D.W., JI, Q.: "In the eye of the beholder: A survey of models for eyes and gaze", PAMI, vol. 32, 2010, XP011280658, DOI: doi:10.1109/TPAMI.2009.30
HILLAIRE, S., BRETON, G., OUARTI, N., COZOT, R. AND LÉCUYER, A.: "Using a Visual Attention Model to Improve Gaze Tracking Systems in Interactive 3D Applications", COMPUTER GRAPHICS FORUM, vol. 29, no. 6, 22 March 2010 (2010-03-22), pages 1830, XP002624749, DOI: 10.1111/j.1467-8659.2010.01651.x *
ITTI L ET AL: "A MODEL OF SALIENCY-BASED VISUAL ATTENTION FOR RAPID SCENE ANALYSIS", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 20, no. 11, 1 November 1998 (1998-11-01), pages 1254 - 1259, XP001203933, ISSN: 0162-8828, DOI: DOI:10.1109/34.730558 *
ITTI, L., KOCH, C., NIEBUR, E.: "A model of saliency-based visual attention for rapid scene analysis", PAMI, vol. 20, 1998, pages 1254 - 1259, XP001203933, DOI: doi:10.1109/34.730558
JUDD, T., EHINGER, K., DURAND, F., TORRALBA, A.: "Learning to predict where humans look", ICCV, 2009
KROON, B., BOUGHORBEL, S., HANJALIC, A.: "Accurate eye localization in webcam content", FG, 2008
LANGTON, S.R., HONEYMAN, H., TESSLER, E.: "The influence of head contour and nose angle on the perception of eye-gaze direction", PERCEPTION & PSYCHOPHYSICS, vol. 66, 2004
MA, Y.F., ZHANG, H.J.: "Contrast-based image attention analysis by using fuzzy growing", ACM MM, 2003
MURPHY-CHUTORIAN, E., TRIVEDI, M.: "Head pose estimation in computer vision: A survey", PAMI, vol. 31, 2009, XP011266518, DOI: doi:10.1109/TPAMI.2008.106
PETERS, R.J., IYER, A., KOCH, C., ITTI, L.: "Components of bottom-up gaze allocation in natural scenes", J. VIS., vol. 5, 2005, pages 692 - 692
ROBERTO VALENTI ET AL: "Image saliency by isocentric curvedness and color", COMPUTER VISION, 2009 IEEE 12TH INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 29 September 2009 (2009-09-29), pages 2185 - 2192, XP031672570, ISBN: 978-1-4244-4420-5 *
ROSSI, E.A., ROORDA, A.: "The relationship between visual resolution and cone spacing in the human fovea", NATURE NEUROSCIENCE, vol. 13, 2009
SMITH, K., BA, S.O., ODOBEZ, J.M., GATICA-PEREZ, D.: "Tracking the visual focus of attention for a varying number of wandering people", PAMI, vol. 30, 2008, XP011224160, DOI: doi:10.1109/TPAMI.2007.70773
SPAIN, M., PERONA, P.: "Some objects are more equal than others: Measuring and predicting importance", ECCV, 2009
VALENTI R ET AL: "Accurate eye center location and tracking using isophote curvature", COMPUTER VISION AND PATTERN RECOGNITION, 2008. CVPR 2008. IEEE CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 23 June 2008 (2008-06-23), pages 1 - 8, XP031297087, ISBN: 978-1-4244-2242-5 *
VALENTI, R., GEVERS, T.: "Accurate eye center location and tracking using isophote curvature", CVPR, 2008
VALENTI, R., SEBE, N., GEVERS, T.: "Image saliency by isocentric curvedness and color", ICCV, 2009

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015170142A1 (en) * 2014-05-08 2015-11-12 Sony Corporation Portable electronic equipment and method of controlling a portable electronic equipment
US10216267B2 (en) 2014-05-08 2019-02-26 Sony Corporation Portable electronic equipment and method of controlling a portable electronic equipment
US10248281B2 (en) 2015-08-18 2019-04-02 International Business Machines Corporation Controlling input to a plurality of computer windows
US10248280B2 (en) 2015-08-18 2019-04-02 International Business Machines Corporation Controlling input to a plurality of computer windows

Also Published As

Publication number Publication date
NL2004878C2 (en) 2011-12-13

Similar Documents

Publication Publication Date Title
US9330307B2 (en) Learning based estimation of hand and finger pose
US20180211104A1 (en) Method and device for target tracking
US9262671B2 (en) Systems, methods, and software for detecting an object in an image
US10048749B2 (en) Gaze detection offset for gaze tracking models
US10109056B2 (en) Method for calibration free gaze tracking using low cost camera
EP2499963A1 (en) Method and apparatus for gaze point mapping
CN110807427B (en) Sight tracking method and device, computer equipment and storage medium
JP2017506379A5 (en)
Valenti et al. What are you looking at? Improving visual gaze estimation by saliency
US20210319585A1 (en) Method and system for gaze estimation
JPH11175246A (en) Sight line detector and method therefor
JP6822482B2 (en) Line-of-sight estimation device, line-of-sight estimation method, and program recording medium
JP5001930B2 (en) Motion recognition apparatus and method
TWI682326B (en) Tracking system and method thereof
Valenti et al. Webcam-based visual gaze estimation
Pires et al. Unwrapping the eye for visible-spectrum gaze tracking on wearable devices
Perra et al. Adaptive eye-camera calibration for head-worn devices
NL2004878C2 (en) System and method for detecting a person's direction of interest, such as a person's gaze direction.
Wu et al. NIR-based gaze tracking with fast pupil ellipse fitting for real-time wearable eye trackers
Strupczewski Commodity camera eye gaze tracking
Kim et al. Gaze estimation using a webcam for region of interest detection
US10902628B1 (en) Method for estimating user eye orientation using a system-independent learned mapping
Park Representation learning for webcam-based gaze estimation
CN112114659A (en) Method and system for determining a fine point of regard for a user
Huang et al. Robust feature extraction for non-contact gaze tracking with eyeglasses

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11749570

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11749570

Country of ref document: EP

Kind code of ref document: A1