WO2012008827A1 - System and method for detecting a person's direction of interest, such as a person's gaze direction - Google Patents


Info

Publication number
WO2012008827A1
Authority
WO
WIPO (PCT)
Prior art keywords
interest
processor
real time
person
determined
Prior art date
Application number
PCT/NL2011/050423
Other languages
French (fr)
Inventor
Roberto Valenti
Vladimir Nedovic
Original Assignee
Universiteit Van Amsterdam
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universiteit Van Amsterdam filed Critical Universiteit Van Amsterdam
Publication of WO2012008827A1 publication Critical patent/WO2012008827A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements


Abstract

A method for detecting a person's interest direction, wherein a processor performs the steps of: determine in real time an interest vector of a person from video data captured by a video camera; determine in real time a salient peak closest to the determined interest vector; determine in real time a saliency-corrected interest vector from the eyes of said person to said closest salient peak; determine in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector; determine in real time further interest vectors of said person from said video data; and calculate in real time recalibrated interest vectors by using a calibration error value calculated from said determined deviation.

Description

SYSTEM AND METHOD FOR DETECTING A PERSON'S DIRECTION OF INTEREST, SUCH AS A PERSON'S GAZE DIRECTION
The invention relates to a system for detecting a person's direction of interest, such as a person's gaze direction. The system can also be used to detect other visual
directions of interest of a person, such as the direction of a person's head, the person's eye, the person's arm and/or finger pointing, or the person's whole body, or a
combination thereof.
Visual gaze estimation is the process which determines the 3D line of sight of a person in order to analyze the
location of interest. The estimation of the direction or the location of interest of a user is key for many applications, spanning from gaze-based human-computer interaction,
advertisement [see: Smith, K., Ba, S.O., Odobez, J.M.,
Gatica-Perez, D.: Tracking the visual focus of attention for a varying number of wandering people. PAMI 30 (2008)], human cognitive state analysis and attentive interfaces (e.g. a gaze-controlled mouse) to human behavior analysis. Gaze direction can also provide high-level semantic cues such as who is speaking to whom, information on non-verbal communication (e.g. interest, pointing with the head/with the eyes) and the mental state/attention of a user (e.g. a driver).
Overall, visual gaze estimation is important to understand someone's attention, motivation and intentions.
Typically, the pipeline of estimating visual gaze mainly consists of two steps (see Figure 2): (1) analyze and transform pixel-based image features obtained by sensory information (devices) into a higher-level representation (e.g. the position of the head or the location of the eyes) and (2) map these features to estimate the visual gaze vector (line of sight), hence finding the area of interest in the scene. There is an abundance of research in the literature
concerning the first component of the pipeline, which principally covers methods to estimate the head position and the eye location, as they are both contributing factors to the final estimation of the visual gaze [see: Langton, S.R., Honeyman, H., Tessler, E.: The influence of head contour and nose angle on the perception of eye-gaze direction.
Perception & psychophysics 66 (2004)].
Nowadays, commercial eye gaze trackers are one of the most successful visual gaze devices. However, to achieve good detection accuracy, they have the drawback of using
intrusive or expensive sensors (pointed infrared cameras) which cannot be used in daylight and often limit the
possible movement of the head, or require the user to wear the device [see: Bates, R., Istance, H., Oosthuizen, L.,
Majaranta, P.: Survey of de-facto standards in eye tracking. In: COGAIN Conf. on Comm. By Gaze Inter. (2005)]. Therefore, recently, eye center locators based solely on appearance have been proposed [see: Cristinacce, D., Cootes, T., Scott, I.: A multi-stage approach to facial feature detection. In: BMVC. (2004) 277-286; Kroon, B., Boughorbel, S., Hanjalic, A.: Accurate eye localization in webcam content. In: FG. (2008); and Valenti, R., Gevers, T.: Accurate eye center location and tracking using isophote curvature. In: CVPR. (2008)] which are reaching reasonable accuracy in order to roughly estimate the area of attention on a screen in the second step of the pipeline. A recent survey [Hansen, D.W., Ji, Q.: In the eye of the beholder: A survey of models for eyes and gaze. PAMI 32 (2010)] discusses the different methodologies to obtain the eye location information through video-based devices. Some of the methods can also be used to estimate the face
location and the head pose in geometric head pose estimation methods. Other methods in this category track the appearance between video frames, or treat the problem as an image classification one, often interpolating the results between known poses. The survey in [Murphy-Chutorian, E., Trivedi, M.: Head pose estimation in computer vision: A survey. PAMI 31 (2009)] gives a good overview of appearance-based head pose estimation methods. Once the correct features are determined using one of the methods and devices discussed above, the second step in gaze estimation (see Figure 2) is to map the obtained information to the 3D scene in front of the user. In eye gaze trackers, this is often achieved by direct mapping of the eye center position to the screen location. This requires the system to be calibrated and often limits the possible position of the user (e.g. using chinrests). In the case of 3D visual gaze estimation, this often requires the intrinsic camera
parameters to be known. Failure to correctly calibrate or comply with the restrictions of the gaze estimation device may result in erroneous gaze estimates.
The invention aims at a more accurate, user-friendly and/or cheaper system for detecting a person's direction of interest.
To that end, the system comprises: a processor; at least one video camera connected to said processor for capturing video data; electronic memory connected to said processor; wherein said processor is arranged to determine in real time an interest vector of a person from said video data; wherein said processor is arranged to determine in real time a salient peak closest to the determined interest vector;
wherein said processor is arranged to determine in real time a saliency-corrected interest vector between said person and said closest salient peak; wherein said processor is
arranged to determine in real time the deviation between the determined interest vector and the determined saliency- corrected interest vector; wherein said processor is
arranged to determine in real time further interest vectors of said person from said video data; and wherein said processor is arranged to calculate in real time recalibrated interest vectors by using a calibration error value
calculated from said determined deviation.
Preferably said processor is arranged to determine in real time the salient peaks closest to a multitude of determined interest vectors; said processor is arranged to determine in real time a multitude of saliency-corrected interest vectors between the person and said closest salient peaks; wherein said processor is arranged to determine in real time the deviations between the multitude of determined interest vectors and the multitude of saliency-corrected interest vectors; wherein said calibration error value is calculated from said multitude of determined deviations.
Preferably said processor is arranged to iterate in real time said process of calculating said calibration error value by replacing previous determined interest vectors with interest vectors which are corrected using a previous calibration error value, for calculating a current
calibration error value.
Preferably said processor is arranged to calculate in real time said calibration error value by minimizing the
difference between the multitude of determined deviations and said calibration error value, for instance by using a weighted least-squares error minimization method.
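The following Python fragment is a minimal, illustrative sketch of such a calibration-error estimate under the simplest possible model, namely a constant offset between the determined and the saliency-corrected interest points; the function names, the choice of a constant model and the example weights are assumptions made for illustration, not the reference implementation of the invention.

import numpy as np

def calibration_offset(raw_points, corrected_points, weights):
    """Estimate a constant calibration offset as the weighted least-squares
    fit between the observed deviations and a single 2D error value."""
    deviations = np.asarray(corrected_points, float) - np.asarray(raw_points, float)
    w = np.asarray(weights, float).reshape(-1, 1)
    # For a constant model, weighted least squares reduces to a weighted mean.
    return (w * deviations).sum(axis=0) / w.sum()

# Hypothetical usage: recalibrate further interest points with the offset.
raw = [(100.0, 120.0), (300.0, 310.0), (505.0, 220.0)]
adjusted = [(112.0, 131.0), (309.0, 322.0), (517.0, 229.0)]
offset = calibration_offset(raw, adjusted, weights=[1.0, 0.8, 0.6])
recalibrated = np.asarray(raw) + offset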
Preferably said salient peaks in the region around the determined interest vector are determined using saliency data about the area which the person is expected to look at, such as video data, screen capture data or manually input data, such as annotated saliency data.
In one preferred embodiment said processor is arranged to determine in real time salient peaks in the region around the determined interest vector from video data before determining said salient peak closest to the determined interest vector.
In a further preferred embodiment said system comprises at least two video cameras connected to said processor, one camera for capturing video data of a person's face and/or body, and one camera for capturing said video data.
In an alternative preferred embodiment the processor, electronic memory and said at least two video cameras are combined in one device. The device may for instance be a smartphone, having a video camera in the back aimed at an area of interest, and a webcam in the front, aimed at the user's face and eyes. A smartphone with gaze detection capabilities is described in US 2010/0079508, wherein gaze detection is used to determine if a person is looking at the screen of the smartphone. By using the teaching of the current invention, the smartphone can be used to detect which objects behind the smartphone the person is looking at.
The invention furthermore relates to a method for detecting a person's interest direction, wherein a processor performs the steps of: determine in real time an interest vector of a person from video data captured by a video camera; determine in real time a salient peak closest to the determined interest vector; determine in real time a saliency-corrected interest vector between said person and said closest salient peak; determine in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector; determine in real time further interest vectors of said person from said video data; and calculate in real time recalibrated interest vectors by using a calibration error value calculated from said
determined deviation.
The invention also relates to a computer software program arranged to run on a processor to perform the steps of the method of the invention, a computer readable data carrier comprising a computer software program arranged to run on a processor to perform the steps of the method of the
invention, and a computer comprising a processor and
electronic memory connected thereto loaded with a computer software program arranged to perform the steps of the method of the invention.
A preferred embodiment of the invention is described in more detail below with reference to the drawings in which: Figure 1 is a perspective view of a system in accordance with the invention; and Figure 2 is a flow chart of the system in accordance with the invention.
According to figure 1, a system for detecting a person's gaze direction comprises a computer 1 with, amongst others, a processor unit, system memory and a hard drive, and a video camera 2 aimed at the face of a person 6, connected to for instance a USB port of the computer 1. A second camera behind the person, aimed at the area the person 6 is looking at, is also connected to a USB port of the computer 1. A software program is loaded from the hard drive into the system memory of the computer 1 in order to perform the steps of the gaze detection method.
An image 4 having several (salient) objects (in this example a car and its components) that may be of interest to the person is present in front of the person 6. Alternatively, the system may be used to determine the gaze direction in a physical environment where (salient) real-world objects of interest are present.
According to figure 2, a visual gaze vector can be resolved from a combination of body/head pose and eye location obtained from imaging device 2 in component I (box 10). As this is a rough estimation, the obtained gaze line 13 in component II (box 11) is then followed until an uncertain location in the gazed area. The area of interest, in this example obtained from imaging device 5, is analyzed in component III (box 12). In the proposed framework, the gaze vector 13 will be steered (arrow 14) to the most probable (salient) object which is close to the previously estimated point of interest. It has been shown that salient objects attract eye fixations [see: Spain, M., Perona, P.: Some objects are more equal than others: Measuring and predicting importance. In: ECCV. (2009); and Einhauser, W., Spain, M., Perona, P.: Objects predict fixations better than early saliency. J. Vis. 8 (2008) 1-26], and this property is extensively used in the literature to create saliency maps (probability maps which represent the likelihood of receiving an eye fixation) to automate the generation of fixation maps [see: Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: ICCV (2009); and Peters, R.J., Iyer, A., Koch, C., Itti, L.: Components of bottom-up gaze
allocation in natural scenes. J. Vis. 5 (2005) 692-692].
In the prior art, saliency is thus used to predict where the interesting parts of a scene are, and thereby where a person would look. However, now that accurate saliency algorithms are available [see:
Valenti, R., Sebe, N., Gevers, T.: Image saliency by
isocentric curvedness and color. In ICCV. (2009); Itti, L., Koch, C, Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. PAMI 20 (1998) 1254- 1259; and Ma, Y.F., Zhang, H.J.: Contrast-based image attention analysis by using fuzzy growing. In ACM MM.
(2003)], the invention proposes to reverse the problem by using saliency maps to aid the uncertain fixations. In the system according to the invention, the gaze vector 13 obtained by an existing visual gaze estimation system is used to estimate the possible interest area on the scene. The size of this area will depend on device capabilities and on the scenario. This area is evaluated for salient regions, and filtered so that salient regions which are far away from the centre of interest will be less relevant for the final estimation. The obtained probability landscape is then explored to find the best candidate for the location of the adjusted fixation. This process is repeated for every estimated fixation in the image. After all the fixations and respective adjustments are obtained, the least-squares error between them is minimized in order to find the best
transformation from the estimated sets of fixations to the adjusted ones.
This transformation is then applied to the original
fixations and future ones, in order to compensate for the found error. When a sequence of estimations is available, the obtained improvement is used to correct the previously erroneous estimates. The found error is used to adjust and recalibrate the gaze estimation devices at runtime, in order to improve future estimations. The method may be used to fix the shortcomings of low-quality monocular head and eye trackers, improving their overall accuracy.
Visual gaze estimators have inherent errors which may occur in each of the components of the visual gaze pipeline. From these errors the size of the area where interesting
locations may be found can be derived. To this end, three errors which should be taken into account when estimating visual gaze (one for each of the components of the pipeline) can be identified: the device error, the calibration error and the foveating error. Depending on the scenario, the actual size of the area of interest will be computed by cumulating these three errors and mapping them to the distance of the gazed scene. The device error:
This error is attributed to the first component of the visual gaze estimation pipeline. As imaging devices are limited in resolution, there is a discrete number of states in which image features can be detected and recognized. The variables defining this error are often the maximum level of detail which the device can achieve while interpreting pixels as the location of the eye or the position of the head. Therefore, this error mainly depends on the scenario (e.g. the distance of the subject from the imaging device) and on the device that is being used.
The calibration error:
This error is attributed to the resolution of the visual gaze starting from the features extracted in the first component. Eye gaze trackers often use a mapping between the position of the eye and the corresponding locations on the screen. Therefore, the tracking system needs to be
calibrated. In case the subject moves from his original location, this mapping will be inconsistent and the system may erroneously estimate the visual gaze. Chinrests are often required in these situations to limit the movements of the users to a minimum. Muscular distress, the length of the session, the tiredness of the subject, all may influence the calibration error. As the calibration error cannot be known a priori, it cannot be modeled. Therefore, the aim is to estimate it, so that it can be compensated. The foveating error:
As this error is associated with the new component proposed in the pipeline, it is required to analyze the properties of the fovea to define it. The fovea is the part of the retina responsible for accurate central vision in the direction in which it is pointed. It is necessary for performing any
activity which requires a high level of visual detail. The human fovea has a diameter of about 1.0 mm with a high concentration of cone photoreceptors which account for the high visual acuity capability. Through saccades (more than 10,000 per hour [see: Geisler, W.S., Banks, M.S.: Handbook of Optics, 2nd Ed. Volume I: Fundamentals, Techniques and Design. Volume 1. McGraw-Hill, Inc., New York, NY, USA
(1995)]), the fovea is moved to the regions of interest, generating eye fixations. In fact, if the gazed object is large, the eyes constantly shift their gaze to subsequently bring images into the fovea. For this reason, fixations obtained by analyzing the location of the center of the cornea are widely used in the literature as an indication of the gaze and interest of the user.
However, it is generally assumed that the fixation obtained by analyzing the center of the cornea corresponds to the exact location of interest. While this is a valid assumption in most scenarios, the size of the fovea actually permits seeing the central two degrees of the visual field. For
instance, when reading a text, humans do not fixate on each of the letters; one fixation permits reading and seeing multiple words at once.
Another important aspect to be taken into account is the decrease in visual resolution as we move away from the center of the fovea. The fovea is surrounded by the
parafovea belt which extends up to 1.25 mm away from the center, followed by the perifovea (2.75 mm away), which in turn is surrounded by a larger area that delivers low-resolution information. Starting at the outskirts of the fovea, the density of receptors progressively decreases, so the visual resolution decreases rapidly away from the foveal center [see: Rossi, E.A., Roorda, A.: The relationship between visual resolution and cone spacing in the human fovea. Nature Neuroscience 13 (2009)]. This is modeled by using a Gaussian kernel centered on the area of interest, with a standard deviation of a quarter of the estimated area of interest. In this way, areas which are close to the border of the area of interest are of lesser importance. In our model, we consider this region as the possible location for the interest point. As the area of interest is increased by the projection of the total error, the tail of the Gaussian of the area of interest will help to balance the importance of a fixation point against the distance from the original fixation point. As the point of interest could be anywhere in this limited area, the next step is to use saliency to extract potential fixation candidates. The saliency is evaluated on the interest area by using a customized version of the saliency framework proposed in
[Valenti, R., Sebe, N., Gevers, T.: Image saliency by isocentric curvedness and color. In: ICCV. (2009)]. The framework uses isophote curvature to extract the
displacement vectors, which indicate the center of the osculating circle at each point of the image. In Cartesian coordinates, the isophote curvature is defined as:
$$\kappa = -\frac{L_y^2 L_{xx} - 2\,L_x L_{xy} L_y + L_x^2 L_{yy}}{\left(L_x^2 + L_y^2\right)^{3/2}}$$
Where Lx represents the first-order derivative of the luminance function in the x direction, Lxx the second-order derivative in the x direction, and so on. The isophote curvature is used to estimate points which are closer to the center of the structure they belong to; therefore the isophote curvature is inverted and multiplied by the gradient. The displacement coordinates D(x, y) to the estimated centers are then obtained by:
$$D(x, y) = -\frac{\{L_x, L_y\}\left(L_x^2 + L_y^2\right)}{L_y^2 L_{xx} - 2\,L_x L_{xy} L_y + L_x^2 L_{yy}}$$
In this way, every pixel in the image gives an estimate of the potential structure it belongs to. To collect and reinforce this information and to deduce the location of the objects, the D(x, y)'s are mapped into an accumulator, weighted according to their local importance, defined as the amount of image curvature and color edges. The accumulator is then convolved with a Gaussian kernel so that each cluster of votes will form a single estimate. This clustering of votes in the accumulator gives an indication of where the centers of interesting or structured objects are in the image.
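As an illustration of this isophote-based voting scheme, the following Python sketch computes displacement vectors from Gaussian image derivatives and accumulates them into a smoothed isocenter map. It assumes a grayscale image, uses gradient energy as the vote weight and omits the curvedness and color-edge weighting of the cited framework, so it should be read as a simplified approximation rather than the framework itself.

import numpy as np
from scipy.ndimage import gaussian_filter

def isocenter_map(img, sigma=2.0):
    """Accumulate isophote-based displacement votes into a smoothed map whose
    peaks indicate the centers of structured (salient) objects."""
    img = np.asarray(img, dtype=float)
    # Gaussian derivatives of the luminance function (axis 0 = y, axis 1 = x).
    Lx  = gaussian_filter(img, sigma, order=(0, 1))
    Ly  = gaussian_filter(img, sigma, order=(1, 0))
    Lxx = gaussian_filter(img, sigma, order=(0, 2))
    Lyy = gaussian_filter(img, sigma, order=(2, 0))
    Lxy = gaussian_filter(img, sigma, order=(1, 1))

    grad2 = Lx ** 2 + Ly ** 2
    denom = Ly ** 2 * Lxx - 2 * Lx * Lxy * Ly + Lx ** 2 * Lyy
    denom[np.abs(denom) < 1e-8] = np.nan  # ignore flat regions

    # Displacement vectors D(x, y) pointing towards the isophote centers.
    dx = -Lx * grad2 / denom
    dy = -Ly * grad2 / denom

    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    finite = np.isfinite(dx) & np.isfinite(dy)
    cx = np.where(finite, np.rint(xs + dx), -1).astype(int)
    cy = np.where(finite, np.rint(ys + dy), -1).astype(int)
    valid = finite & (cx >= 0) & (cx < w) & (cy >= 0) & (cy < h)

    # Vote into the accumulator, here weighted simply by gradient energy.
    acc = np.zeros_like(img)
    np.add.at(acc, (cy[valid], cx[valid]), grad2[valid])

    # Convolve with a Gaussian so that each cluster of votes forms one estimate.
    return gaussian_filter(acc, sigma)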
In [Valenti, R., Sebe, N., Gevers, T.: Image saliency by isocentric curvedness and color. In: ICCV. (2009)], multiple scales are used. Here, since the scale is directly related to the size of the area of interest, the optimal scale can be determined once and then linked to the area of interest itself. Furthermore, in the abovementioned document, the color and curvature information is added to the final saliency map, while here this information is discarded. The reasoning behind this choice is that this information is mainly useful to enhance objects on their edges, while the isocentric saliency is suited to locate the adjusted fixations closer to the center of the objects, rather than on their edges. While removing this information from the saliency map might reduce the overall response of salient objects in the scene, it brings the ability to use the saliency maps as smooth probability density functions. Once the saliency of the area of interest is obtained, it is masked by the area of interest model as defined before.
Hence, the Gaussian kernel in the middle of the area of interest will aid in suppressing saliency peaks in its outskirts. However, there may still be uncertainties about multiple optimal fixation candidates.
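A small sketch of this masking step, assuming the area of interest is described by a radius in pixels around the estimated fixation and that, as described above, the Gaussian standard deviation is taken as a quarter of that size; the function name and the radius convention are illustrative.

import numpy as np

def mask_area_of_interest(saliency, center, radius):
    """Weigh a saliency map by a Gaussian centered on the estimated fixation,
    with a standard deviation of a quarter of the area-of-interest size."""
    ys, xs = np.mgrid[0:saliency.shape[0], 0:saliency.shape[1]]
    sigma = radius / 4.0
    dist2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return saliency * np.exp(-dist2 / (2.0 * sigma ** 2))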
Therefore, a meanshift window with a size corresponding to the standard deviation of the Gaussian kernel is initialized on the location of the estimated fixation point
(corresponding to the center of the area of interest). The meanshift algorithm will then iterate from that point towards the point of highest energy. After convergence, the saliency peak on the area of interest which is closest to the centre of the converged meanshift window is selected as the new (adjusted) fixation point. This process is repeated for all fixation points on an image, obtaining a set of
corrections. An analysis of a number of these corrections holds information about the overall calibration error. This allows for estimation of the current calibration error of the gaze estimation system, which can thereafter be used to compensate for it. The highest peaks in the saliency maps are used to align fixation points with the salient points discovered in the area of interest.
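The mean-shift step could look as follows in Python; the window size, stopping criteria and the simple weighted-centroid update are assumptions for illustration, the essential behaviour being that the window climbs the masked saliency map from the estimated fixation towards the nearest region of high energy.

import numpy as np

def meanshift_peak(saliency, start, window, max_iter=50, eps=0.5):
    """Iterate a mean-shift window from the estimated fixation towards the
    point of highest energy on the (masked) saliency map."""
    p = np.asarray(start, dtype=float)  # (x, y)
    h, w = saliency.shape
    for _ in range(max_iter):
        x0, x1 = int(max(p[0] - window, 0)), int(min(p[0] + window + 1, w))
        y0, y1 = int(max(p[1] - window, 0)), int(min(p[1] + window + 1, h))
        patch = saliency[y0:y1, x0:x1]
        if patch.size == 0 or patch.sum() <= 0:
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        # Weighted centroid of the window contents is the mean-shift update.
        new_p = np.array([(xs * patch).sum(), (ys * patch).sum()]) / patch.sum()
        if np.linalg.norm(new_p - p) < eps:  # converged
            p = new_p
            break
        p = new_p
    # The saliency peak closest to p becomes the adjusted fixation point.
    return p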
A weighted least-squares error minimization between the estimated gaze locations and the corrected ones is
performed. In this way, the affine transformation matrix T is derived. The weight is retrieved as the confidence of the adjustment, which considers both the distance from the original fixation and the saliency value sampled on the same location. The obtained transformation matrix T is thereafter applied to the original fixations to obtain the final fixation estimates. These new fixations should have
minimized the calibration error.
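A possible formulation of this weighted least-squares step in Python, using homogeneous coordinates to estimate a 2x3 affine matrix T; this is a standard formulation chosen for illustration and is not claimed to be the exact solver used by the inventors.

import numpy as np

def fit_affine(original, adjusted, weights):
    """Derive a 2x3 affine matrix T minimizing the weighted squared error
    between T applied to the original fixations and the adjusted ones."""
    P = np.hstack([np.asarray(original, float), np.ones((len(original), 1))])  # (N, 3)
    Q = np.asarray(adjusted, float)                                            # (N, 2)
    W = np.sqrt(np.asarray(weights, float)).reshape(-1, 1)
    X, *_ = np.linalg.lstsq(W * P, W * Q, rcond=None)
    return X.T  # new_point = T @ [x, y, 1]

def apply_affine(T, points):
    """Apply the affine correction T to a set of fixation points."""
    P = np.hstack([np.asarray(points, float), np.ones((len(points), 1))])
    return P @ T.T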
The pseudo code of the proposed system is as follows:
Initialize scenario parameters
- Calculate the total error = foveating error + device error + calibration error
- Calculate the size of the area of interest by projecting the total error at distance d, i.e. d * tan(total error)
for (each new fixation point p) do
  - Retrieve the estimated gaze point from the device
  - Extract the area of interest around the fixation p
  - Inspect the area of interest for salient objects
  - Filter the result by the Gaussian kernel
  - Initialize a meanshift window on the center of the area of interest
  while (maximum iterations not reached and Δp > threshold) do
    - climb the distribution to the point of maximum energy
  end while
  - Select the saliency peak closest to the center of the converged meanshift window as the adjusted fixation
  - Store the original fixation and the adjusted fixation, with weight w found on the same location on the saliency map
  - Calculate the weighted least-squares solution between all the stored points to derive the transformation matrix T
  - Transform all original fixations with the obtained transformation matrix
  - Use the transformation matrix T to compensate the calibration error in the device
end for
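For illustration, the pseudocode above could be composed from the sketches given earlier (isocenter_map, mask_area_of_interest, meanshift_peak, fit_affine, apply_affine) roughly as follows; all names, parameters and unit conventions (angular errors in radians, distances and coordinates in scene-image pixels) are assumptions, not the patent's reference implementation.

import numpy as np

def run_calibration(fixations, scene_img, d, device_err, calib_err, fovea_err):
    total_err = fovea_err + device_err + calib_err        # cumulated angular error
    radius = d * np.tan(total_err)                        # projected area of interest
    saliency = isocenter_map(scene_img)                   # salient-structure map

    originals, adjusted, weights = [], [], []
    for p in fixations:                                   # estimated gaze points
        masked = mask_area_of_interest(saliency, p, radius)
        peak = meanshift_peak(masked, p, window=radius / 4.0)
        originals.append(p)
        adjusted.append(peak)
        # Confidence of the adjustment, sampled from the masked saliency map.
        weights.append(masked[int(peak[1]), int(peak[0])])

    T = fit_affine(originals, adjusted, weights)          # weighted least squares
    return apply_affine(T, originals), T                  # recalibrated fixations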
It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. It is understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the appended claims.

Claims

1. A system for detecting a person's direction of
interest, such as a person's gaze, eye, head, body or finger point direction, comprising:
a processor;
at least one video camera connected to said processor for capturing video data;
electronic memory connected to said processor;
wherein said processor is arranged to determine in real time an interest vector of a person from said video data; wherein said processor is arranged to determine in real time a salient peak closest to the determined interest vector;
wherein said processor is arranged to determine in real time a saliency-corrected interest vector between said person and said closest salient peak;
wherein said processor is arranged to determine in real time the deviation between the determined interest vector and the determined saliency-corrected interest vector;
wherein said processor is arranged to determine in real time further interest vectors of said person from said video data; and
wherein said processor is arranged to calculate in real time recalibrated interest vectors by using a calibration error value calculated from said determined deviation.
2. The system of claim 1, wherein said processor is arranged to determine in real time the salient peaks closest to a multitude of determined interest vectors;
said processor is arranged to determine in real time a multitude of saliency-corrected interest vectors between the person and said closest salient peaks; wherein said processor is arranged to determine in real time the deviations between the multitude of determined interest vectors and the multitude of saliency-corrected interest vectors;
wherein said calibration error value is calculated from said multitude of determined deviations.
3. The system of claim 1 or 2, wherein said processor is arranged to iterate in real time said process of calculating said calibration error value by replacing previous
determined interest vectors with interest vectors which are corrected using a previous calibration error value, for calculating a current calibration error value.
4. The system of claim 2 or 3, wherein said processor is arranged to calculate in real time said calibration error value by minimizing the difference between the multitude of determined deviations and said calibration error value, for instance by using a weighted least-squares method.
5. The system of any of the claims 1 - 4, wherein said salient peaks are determined using saliency data about the area which the person is expected to look at, such as video data, screen capture data or manually input data.
6. The system of any of the claims 1 - 5, wherein said processor is arranged to determine in real time salient peaks in the region around the determined interest vector from video data before determining said salient peak closest to the determined interest vector.
7. The system of claim 6, wherein said system comprises at least two video cameras connected to said processor, one camera for capturing video data of a person's face and/or body, and one camera for capturing said video data.
8. The system of claim 7, wherein said processor, said electronic memory and said at least two video cameras are combined in one device, for instance a smartphone.
9. The system of any of the previous claims 1 - 8, wherein said direction of interest is a gaze direction.
10. A method for detecting a person's interest direction, wherein a processor performs the steps of:
determine in real time an interest vector of a person from video data captured by a video camera; determine in real time a salient peak closest to the determined interest vector;
determine in real time a saliency-corrected interest vector between said person and said closest salient peak; determine in real time the deviation between the determined interest vector and the determined saliency- corrected interest vector;
determine in real time further interest vectors of said person from said video data; and
calculate in real time recalibrated interest vectors by using a calibration error value calculated from said
determined deviation.
11. A computer software program arranged to run on a processor to perform the steps of the method according claim 10.
12. A computer readable data carrier comprising a computer software program arranged to run on a processor to perform the steps of the method according to claim 10.
13. A computer comprising a processor and electronic memory connected thereto loaded with a computer software program arranged to perform the steps of the method according to claim 10.
PCT/NL2011/050423 2010-06-11 2011-06-10 System and method for detecting a person's direction of interest, such as a person's gaze direction WO2012008827A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NL2004878 2010-06-11
NL2004878A NL2004878C2 (en) 2010-06-11 2010-06-11 System and method for detecting a person's direction of interest, such as a person's gaze direction.

Publications (1)

Publication Number Publication Date
WO2012008827A1 true WO2012008827A1 (en) 2012-01-19

Family

ID=43589565

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NL2011/050423 WO2012008827A1 (en) 2010-06-11 2011-06-10 System and method for detecting a person's direction of interest, such as a person's gaze direction

Country Status (2)

Country Link
NL (1) NL2004878C2 (en)
WO (1) WO2012008827A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015170142A1 (en) * 2014-05-08 2015-11-12 Sony Corporation Portable electronic equipment and method of controlling a portable electronic equipment
US10248281B2 (en) 2015-08-18 2019-04-02 International Business Machines Corporation Controlling input to a plurality of computer windows

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100079508A1 (en) 2008-09-30 2010-04-01 Andrew Hodge Electronic devices with gaze detection capabilities

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100079508A1 (en) 2008-09-30 2010-04-01 Andrew Hodge Electronic devices with gaze detection capabilities

Non-Patent Citations (21)

* Cited by examiner, † Cited by third party
Title
BATES, R., ISTANCE, H., OOSTHUIZEN, L., MAJARANTA, P.: "Survey of de-facto standards in eye tracking", COGAIN CONF. ON COMM. BY GAZE INTER., 2005
CRISTINACCE, D., COOTES, T., SCOTT, I.: "A multi-stage approach to facial feature detection", BMVC, 2004, pages 277 - 286
EINHAUSER, W., SPAIN, M., PERONA, P.: "Objects predict fixations better than early saliency", J. VIS., vol. 8, 2008, pages 1 - 26
GEISLER, W.S., BANKS, M.S.: "Handbook of Optics, 2nd Ed. Volume I: Fundamentals, Techniques and Design", vol. 1, 1995, MCGRAW-HILL, INC.
HANSEN, D.W., JI, Q.: "In the eye of the beholder: A survey of models for eyes and gaze", PAMI, vol. 32, 2010, XP011280658, DOI: doi:10.1109/TPAMI.2009.30
HILLAIRE, S., BRETON, G., OUARTI, N., COZOT, R. AND LÉCUYER, A.: "Using a Visual Attention Model to Improve Gaze Tracking Systems in Interactive 3D Applications", COMPUTER GRAPHICS FORUM, vol. 29, no. 6, 22 March 2010 (2010-03-22), pages 1830, XP002624749, DOI: 10.1111/j.1467-8659.2010.01651.x *
ITTI L ET AL: "A MODEL OF SALIENCY-BASED VISUAL ATTENTION FOR RAPID SCENE ANALYSIS", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 20, no. 11, 1 November 1998 (1998-11-01), pages 1254 - 1259, XP001203933, ISSN: 0162-8828, DOI: DOI:10.1109/34.730558 *
ITTI, L., KOCH, C., NIEBUR, E.: "A model of saliency-based visual attention for rapid scene analysis", PAMI, vol. 20, 1998, pages 1254 - 1259, XP001203933, DOI: doi:10.1109/34.730558
JUDD, T., EHINGER, K., DURAND, F., TORRALBA, A.: "Learning to predict where humans look", ICCV, 2009
KROON, B., BOUGHORBEL, S., HANJALIC, A.: "Accurate eye localization in webcam content", FG, 2008
LANGTON, S.R., HONEYMAN, H., TESSLER, E.: "The influence of head contour and nose angle on the perception of eye-gaze direction", PERCEPTION & PSYCHOPHYSICS, vol. 66, 2004
MA, Y.F., ZHANG, H.J.: "Contrast-based image attention analysis by using fuzzy growing", ACM MM, 2003
MURPHY-CHUTORIAN, E., TRIVEDI, M.: "Head pose estimation in computer vision: A survey", PAMI, vol. 31, 2009, XP011266518, DOI: doi:10.1109/TPAMI.2008.106
PETERS, R.J., IYER, A., KOCH, C., ITTI, L.: "Components of bottom-up gaze allocation in natural scenes", J. VIS., vol. 5, 2005, pages 692 - 692
ROBERTO VALENTI ET AL: "Image saliency by isocentric curvedness and color", COMPUTER VISION, 2009 IEEE 12TH INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 29 September 2009 (2009-09-29), pages 2185 - 2192, XP031672570, ISBN: 978-1-4244-4420-5 *
ROSSI, E.A., ROORDA, A.: "The relationship between visual resolution and cone spacing in the human fovea", NATURE NEUROSCIENCE, vol. 13, 2009
SMITH, K., BA, S.O., ODOBEZ, J.M., GATICA-PEREZ, D.: "Tracking the visual focus of attention for a varying number of wandering people", PAMI, vol. 30, 2008, XP011224160, DOI: doi:10.1109/TPAMI.2007.70773
SPAIN, M., PERONA, P.: "Some objects are more equal than others: Measuring and predicting importance", ECCV, 2009
VALENTI R ET AL: "Accurate eye center location and tracking using isophote curvature", COMPUTER VISION AND PATTERN RECOGNITION, 2008. CVPR 2008. IEEE CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 23 June 2008 (2008-06-23), pages 1 - 8, XP031297087, ISBN: 978-1-4244-2242-5 *
VALENTI, R., GEVERS, T.: "Accurate eye center location and tracking using isophote curvature", CVPR, 2008
VALENTI, R., SEBE, N., GEVERS, T.: "Image saliency by isocentric curvedness and color", ICCV, 2009

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015170142A1 (en) * 2014-05-08 2015-11-12 Sony Corporation Portable electronic equipment and method of controlling a portable electronic equipment
US10216267B2 (en) 2014-05-08 2019-02-26 Sony Corporation Portable electronic equipment and method of controlling a portable electronic equipment
US10248281B2 (en) 2015-08-18 2019-04-02 International Business Machines Corporation Controlling input to a plurality of computer windows
US10248280B2 (en) 2015-08-18 2019-04-02 International Business Machines Corporation Controlling input to a plurality of computer windows

Also Published As

Publication number Publication date
NL2004878C2 (en) 2011-12-13

Similar Documents

Publication Publication Date Title
US9330307B2 (en) Learning based estimation of hand and finger pose
US20180211104A1 (en) Method and device for target tracking
US9262671B2 (en) Systems, methods, and software for detecting an object in an image
US10048749B2 (en) Gaze detection offset for gaze tracking models
US10109056B2 (en) Method for calibration free gaze tracking using low cost camera
EP2499963A1 (en) Method and apparatus for gaze point mapping
CN110807427B (en) Sight tracking method and device, computer equipment and storage medium
JP2017506379A5 (en)
Valenti et al. What are you looking at? Improving visual gaze estimation by saliency
US20210319585A1 (en) Method and system for gaze estimation
JPH11175246A (en) Sight line detector and method therefor
JP6822482B2 (en) Line-of-sight estimation device, line-of-sight estimation method, and program recording medium
JP5001930B2 (en) Motion recognition apparatus and method
TWI682326B (en) Tracking system and method thereof
Valenti et al. Webcam-based visual gaze estimation
Pires et al. Unwrapping the eye for visible-spectrum gaze tracking on wearable devices
Perra et al. Adaptive eye-camera calibration for head-worn devices
NL2004878C2 (en) System and method for detecting a person's direction of interest, such as a person's gaze direction.
Wu et al. NIR-based gaze tracking with fast pupil ellipse fitting for real-time wearable eye trackers
Strupczewski Commodity camera eye gaze tracking
Kim et al. Gaze estimation using a webcam for region of interest detection
US10902628B1 (en) Method for estimating user eye orientation using a system-independent learned mapping
Park Representation learning for webcam-based gaze estimation
CN112114659A (en) Method and system for determining a fine point of regard for a user
Huang et al. Robust feature extraction for non-contact gaze tracking with eyeglasses

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11749570

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11749570

Country of ref document: EP

Kind code of ref document: A1