WO2008069519A1 - Gesture/speech integrated recognition system and method - Google Patents

Gesture/speech integrated recognition system and method

Info

Publication number
WO2008069519A1
Authority
WO
WIPO (PCT)
Prior art keywords
gesture
speech
feature information
integrated
module
Prior art date
Application number
PCT/KR2007/006189
Other languages
French (fr)
Inventor
Young Giu Jung
Mun Sung Han
Jae Seon Lee
Jun Seok Park
Original Assignee
Electronics And Telecommunications Research Institute
Priority date
Filing date
Publication date
Priority claimed from KR1020070086575A (KR100948600B1)
Application filed by Electronics And Telecommunications Research Institute
Priority to JP2009540141A (JP2010511958A)
Publication of WO2008069519A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/033 Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F3/038 Control and interface arrangements therefor, e.g. drivers or device-embedded control circuitry
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00 Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/038 Indexing scheme relating to G06F3/038
    • G06F2203/0381 Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer


Abstract

Provided is a gesture/speech integrated recognition system and method including: a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section from a gesture of taken images by using information on the detected start point and the end point; and an integrated recognition unit outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance. Accordingly, it is possible to increase the performance of order recognition by integrating the speech and the gesture in a noise environment.

Description

GESTURE/SPEECH INTEGRATED RECOGNITION SYSTEM
AND METHOD
Technical Field
[1] The present invention relates to an integrated recognition technology, and more particularly, to a gesture/speech integrated recognition system and method capable of recognizing an order of a user by extracting feature information on a gesture by using an end point detection (EPD) value of a speech and integrating feature information on the speech with the feature information on the gesture, thereby recognizing the order of the user at a high recognition rate in a noise environment.
[2] This work was supported by the IT R&D program of MIC/IITA [2006-S-031-01, Five Senses Information Processing Technology Development for Network Based Reality Service].
Background Art
[3] Recently, as multimedia and interface technologies have developed, research on multimodal recognition for easily implementing a simple interface between humans and machines by using facial expressions or directions, lip movements, eye-gaze tracking, hand movements, speech, and the like has been actively carried out.
[4] Particularly, among man-machine interface technologies, the speech recognition technology and the gesture recognition technology are used as the most convenient interface technologies. The speech recognition technology and the gesture recognition technology have a high recognition rate in a limited environment. However, their performance is not good in a noise environment. This is because environmental noise affects the performance of speech recognition, and a change in lighting and the type of gesture affect the performance of a camera-based gesture recognition technology. Therefore, the speech recognition technology needs the development of a technology of recognizing speech by using an algorithm robust against noise, and the gesture recognition technology needs the development of a technology of extracting a particular section of a gesture including recognition information. In addition, when general gestures are used, a particular section of the gesture cannot be easily identified, so that recognition is difficult.
[5] In addition, when a speech and a gesture are integrated to be recognized, there is a problem in that since the speech frame rate is about 10 ms/frame and the image frame rate is about 66.7 ms/frame, the rates for processing the speech frame and the image frame are different. In addition, a section of a gesture requires more time than a section of a speech, so that the lengths of the speech section and the gesture section are different, and this causes a problem in synchronizing the speech and the gesture.
Disclosure of Invention
Technical Problem
[6] An aspect of the present invention provides a means for extracting feature information by detecting an order section from a speech of a user by using an algorithm robust against environmental noise, and detecting a feature section of a gesture by using information on an order start point of the speech, thereby easily recognizing the order in the gesture that is not easily identified.
[7] An aspect of the present invention also provides a means for synchronizing a gesture and a speech by applying optimal frames set in advance to an order section of the gesture detected by a speech end point detection (EPD) value, thereby solving a problem of a synchronization difference between the speech and the gesture while integrated recognition is performed.
Technical Solution
[8] According to an aspect of the present invention, there is provided a gesture/speech integrated recognition system including: a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section from a gesture of taken images by using information on the detected start point and the end point; and an integrated recognition unit outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance.
[9] In the above aspect of the present invention, the system may further include a synchronization module including: a gesture start point detection module detecting a start point of the gesture from the taken images by using the detected start point; and an optimal frame applying module calculating and extracting optimal image frames by applying the number of optimal frames set in advance from the start point of the gesture. Here, the gesture start point detection module may detect the start point of the gesture by checking a start point that is an end point detection (EPD) flag of the detected speech from the taken images.
[10] In addition, the speech feature extraction unit may include: an EPD module detecting a start point and an end point in the input speech; and a hearing model-based speech feature extraction module extracting speech feature information included in the detected order by using a hearing model-based algorithm. In addition, the speech feature extraction unit may remove noise from the extracted speech feature information.
[11] In addition, the gesture feature extraction unit may include: a hand tracking module tracking movements of a hand in images taken by a camera and transmitting the movements to the synchronization module; and a gesture feature extraction module extracting gesture feature information by using the optimal image frames extracted by the synchronization module.
[12] In addition, the integrated recognition unit may include: an integrated learning DB
(database) control module generating the learning parameter on the basis of an integrated learning model and an integrated learning DB set in advance; an integrated feature control module controlling the extracted speech feature information and the gesture feature information by using the generated learning parameter; and an integrated recognition module generating a result controlled by the integrated feature control module as a recognition result. Here, the integrated feature control module may control feature vectors of the extracted speech feature information and the gesture feature information through the extension and the reduction of the node numbers of input vectors.
[13] According to another aspect of the present invention, there is provided a gesture/ speech integrated recognition method including: a first step of extracting speech feature information by detecting a start point that is an EPD value and an end point of an order in an input speech; a second step of extracting gesture feature information by detecting an order section from a gesture in images input through a camera by using the detected start point of the order; and a third step of outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
[14] In the above aspect of the present invention, in the first step, the speech feature information may be extracted from the order section obtained by the start point and the end point of the order on the basis of a hearing model.
[15] In addition, the second step may include: a B step of detecting an order section from the gesture of the movements of the hand by using the transmitted EPD value; a C step of determining optimal frames from the order section of the gesture by applying optimal frames set in advance; and a D step of extracting gesture feature information from the determined optimal frames.
Advantageous Effects
[16] The gesture/speech integrated recognition system and method according to the present invention increases a recognition rate of a gesture that is not easily identified by detecting an order section of the gesture by using an end point detection (EPD) value that is a start point of an order section of a speech. In addition, optimal frames are applied to the order section of the gesture to synchronize the speech and the gesture, so that it is possible to implement integrated recognition between the speech and the gesture.
Brief Description of the Drawings
[17] FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
[18] FIG. 2 is a view illustrating a configuration of a gesture/speech integrated recognition system according to an embodiment of the present invention.
[19] FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention.
Best Mode for Carrying Out the Invention
[20] Exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[21] In the description, the detailed descriptions of well-known functions and structures may be omitted so as not to hinder the understanding of the present invention.
[22] FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
[24] Referring to FIG. 1, in a gesture/speech integrated recognition technology, orders given by a speech or a gesture of a person are integrated to be recognized, and a device representing the five senses is controlled by using a control instruction generated from a result of the recognition.
[25] Specifically, a person 100 orders through a speech 110 and a gesture 120. Here, as an example of the order of the person, in order for the person to order goods through cyberspace, the person may indicate a corn bread with a finger while saying "select a corn bread" as an order for selecting the corn bread from among the displayed goods.
[26] After the person 100 orders by the speech 110 and the gesture 120, feature information on the speech order of the person is recognized through speech recognition 111, and feature information on the gesture of the person is recognized through gesture recognition 121. The recognized feature information on the gesture and the speech is recognized through integrated recognition 130 as a single user order in order to increase a recognition rate of the speech affected by environmental noise and the gesture that cannot be easily identified.
[27] The present invention provides a technology of integrated recognition for a speech and a gesture of a person. The recognized order is transmitted by a controller to a speaker 170, a display apparatus 171, a diffuser 172, a touching device 173, and a tasting device 174 which are output devices for the five senses, and the controller controls each device. In addition, the recognition result is transmitted to a network, so that data of the five senses as the result is transmitted to control each of the output devices. The present invention relates to integrated recognition, and a configuration after the recognition operation can be modified in various manners, so that a detailed description thereof is omitted.
[28] FIG. 2 is a view illustrating a configuration of the gesture/speech integrated recognition system according to an embodiment of the present invention.
[29] Referring to FIG. 2, the gesture/speech integrated recognition system includes a speech feature extraction unit 210 for extracting speech feature information by detecting a start point and an end point of an order from a speech input through a microphone 211, a gesture feature extraction unit 220 for extracting gesture feature information by detecting an order section from a gesture in images taken by a camera by using information on the start point and the end point detected by the speech feature extraction unit 210, a synchronization module 230 for detecting a start point of the gesture from the taken images by using the start point detected by the speech feature extraction unit 210 and calculating optimal image frames by applying the number of optimal frames set in advance from the detected start point of the gesture, and an integrated recognition unit 240 outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance. Now, each component is described in detail.
[30] The speech feature extraction unit 210 includes the microphone 211 receiving a speech of a user, an end point detection (EPD) module 212 detecting a start point and an end point of an order section from the speech of the user, and a hearing model-based speech feature extraction module 213 extracting speech feature information from the order section of the speech detected by the EPD module 212 on the basis of the hearing model. In addition, a channel noise reduction module (not shown) for removing noise from the extracted speech feature information may further be included.
[31] The EPD module 212 detects the start and the end of the order by analyzing the speech input through the wired or wireless microphone.
[32] Specifically, the EPD module 212 acquires a speech signal, calculates an energy value needed to detect the end point of the speech signal, and detects the start and the end of the order by identifying a section to be calculated as the order from the input speech signal.
[33] The EPD module 212 first acquires the speech signal through the microphone and converts the acquired speech into a form suitable for frame calculation. Meanwhile, when the speech is input wirelessly, data loss or signal deterioration due to signal interference may occur, so that an operation to deal with the problem during the signal acquisition is needed.
[34] The energy value needed to detect the end point of the speech signal by the EPD module 212 is calculated as follows. A frame size used to analyze the speech signal is 160 samples, and frame energy is calculated by the following equation.
[35] FrameEnergy = log10( Σ_{n=1}^{N} S(n)^2 )
[36] Here, S(n) denotes a vocal cord signal sample, and N denotes the number of samples per frame.
[37] The calculated frame energy is used as a parameter for detecting the end point.
[38] The EPD module 212 determines a section to be calculated as the order after calculating the frame energy value. For example, the operation of calculating the start point and the end point of the speech signal is determined by four energy thresholds using the frame energy and 10 conditions. Here, the four energy thresholds and the 10 conditions can be set in various manners and may have values obtained in experiments in order to obtain the order section. The four thresholds are used by an EPD algorithm to detect the start and the end in every frame.
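As a hedged illustration only, a much simplified end point detector along these lines could compute the frame log energy defined above and keep the frames whose energy exceeds a threshold; the threshold value and the single-threshold decision rule below are assumptions made for brevity, whereas the module described here uses four experimentally determined energy thresholds and 10 conditions.

    import numpy as np

    FRAME_SIZE = 160  # samples per analysis frame, as stated above

    def frame_energy(frame):
        # log10 of the sum of squared samples; the small constant avoids log10(0)
        return np.log10(np.sum(frame.astype(np.float64) ** 2) + 1e-10)

    def detect_end_points(samples, energy_threshold=2.0):
        # samples: 1-D numpy array of speech samples (e.g. 16 kHz PCM)
        n_frames = len(samples) // FRAME_SIZE
        energies = [frame_energy(samples[i * FRAME_SIZE:(i + 1) * FRAME_SIZE])
                    for i in range(n_frames)]
        active = [i for i, e in enumerate(energies) if e > energy_threshold]
        if not active:
            return None  # no order section found
        # start point (the "EPD value") and end point, as frame indices
        return active[0], active[-1]

In this sketch the returned start index plays the role of the EPD value that is forwarded to the gesture start point detection module 231.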
[39] The EPD module 212 transmits information on the detected start point of the order (hereinafter referred to as an "EPD value") to a gesture start point detection module 231 of the synchronization module 230.
[40] In addition, the EPD module 212 transmits information on the order section in the input speech to the hearing model-based speech feature extraction module 213 to extract speech feature information.
[41] The hearing model-based speech feature extraction module 213 that receives the information on the order section in the speech extracts the feature information on the basis of the hearing model from the order section detected by the EPD module 212. Examples of an algorithm used to extract the speech feature information on the basis of the hearing model include the ensemble interval histogram (EIH) and zero-crossings with peak amplitudes (ZCPA).
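EIH and ZCPA are only named here, not specified; purely as a hedged sketch of the ZCPA idea (intervals between zero crossings weighted by the peak amplitude found between them), one could write the following, which omits the bank of auditory filters that a full ZCPA front end applies before collecting its histogram.

    import numpy as np

    def zcpa_like_features(frame):
        # frame: 1-D numpy array of speech samples for one analysis frame
        negative = np.signbit(frame)
        # indices where the signal crosses from negative to non-negative
        crossings = np.where(negative[:-1] & ~negative[1:])[0] + 1
        features = []
        for a, b in zip(crossings[:-1], crossings[1:]):
            interval = b - a                      # samples between successive crossings
            peak = np.max(np.abs(frame[a:b]))     # peak amplitude within the interval
            # ZCPA builds a frequency histogram weighted by the log peak amplitude;
            # here each interval simply yields an (inverse-interval, weight) pair.
            features.append((1.0 / interval, float(np.log1p(peak))))
        return features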
[42] A channel noise reduction module (not shown) removes noise from the speech feature information extracted by the hearing model-based speech feature extraction module 213, and the speech feature information from which the noise is removed is transmitted to the integrated recognition unit 240.
[43] The gesture feature extraction unit 220 includes a face and hand detection module
222 for detecting a face and a hand from the images taken by a camera 221, and a gesture feature extraction module 224 for tracking and transmitting movements of the detected hand to the synchronization module 230 and extracting feature information on the gesture by using optimal frames calculated by the synchronization module 230.
[44] The face and hand detection module 222 detects the face and the hand that gives a gesture, and a hand tracking module 223 continuously tracks the hand movements in the images. In the description, the hand tracking module 223 tracks the hand. However, it will be understood by those of ordinary skill in the art that the hand tracking module 223 may track various body portions having movements recognized as a gesture.
[45] The hand tracking module 223 continuously stores the hand movements as time elapses, and a part recognized as a gesture order in the hand movement is detected by the synchronization module 230 by using the EPD value transmitted from the speech feature extraction unit 210. Now, the synchronization module 230 which detects a section recognized as the gesture order in the hand movements by using the EPD value and applies the optimal frames for synchronization between speeches and gestures will be described.
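A minimal sketch of this buffering, assuming the tracker yields one (x, y) hand position per image frame, might keep a time-stamped ring buffer so that the synchronization module 230 can later look back to the instant indicated by the speech EPD value; the buffer length and the data layout are assumptions.

    import collections
    import time

    class HandTrackBuffer:
        """Ring buffer of recent hand positions with timestamps."""

        def __init__(self, max_frames=150):  # roughly 10 s of video at 15 fps
            self.buffer = collections.deque(maxlen=max_frames)

        def push(self, x, y, timestamp=None):
            # store one tracked hand position per image frame
            t = timestamp if timestamp is not None else time.time()
            self.buffer.append((t, x, y))

        def frames_since(self, epd_time):
            # positions recorded at or after the speech EPD time,
            # i.e. the candidate order section of the gesture
            return [(t, x, y) for (t, x, y) in self.buffer if t >= epd_time]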
[46] The synchronization module 230 includes the gesture start point detection module
231 for detecting the start point of the gesture by using the EPD value and the images showing the hand movements, and an optimal frame applying module 232 for calculating optimal image frames needed for integrated recognition by using a start frame of the gesture calculated by using the detected start point of the gesture.
[47] When an EPD value of a speech is detected by the EPD module 212 while a speech signal and an image signal are input in real time, the gesture start point detection module 231 of the synchronization module 230 checks a speech EPD flag in the image signal. In this manner, the gesture start point detection module 231 calculates the start frame of the gesture. In addition, the optimal frame applying module 232 calculates the optimal image frames needed for integrated recognition by using the calculated start frame of the gesture and transmits the optimal image frames to the gesture feature extraction module 224. In order to calculate the optimal image frames needed for integrated recognition applied by the optimal frame applying module 232, the number of frames which is determined to yield a high recognition rate of gestures is set in advance, and when the start frame of the gesture is calculated by the gesture start point detection module 231, the optimal image frames are determined.
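As a hedged sketch of this step, the speech EPD value can be mapped to an image frame index using the frame periods mentioned in the background (about 10 ms per speech frame versus about 66.7 ms per image frame at 15 fps), after which a pre-set number of frames is taken as the optimal window; the window size of 15 frames below is an assumption, since the actual number is the one found experimentally to give a high recognition rate.

    SPEECH_FRAME_MS = 10.0           # approximate speech frame period
    IMAGE_FRAME_MS = 1000.0 / 15.0   # about 66.7 ms per image frame at 15 fps

    def gesture_start_frame(speech_epd_frame):
        # map the speech EPD frame index to the image frame index at the same instant
        return int(speech_epd_frame * SPEECH_FRAME_MS / IMAGE_FRAME_MS)

    def optimal_frames(image_frames, speech_epd_frame, n_optimal=15):
        # take the pre-set number of image frames starting at the gesture start frame
        start = gesture_start_frame(speech_epd_frame)
        return image_frames[start:start + n_optimal]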
[48] The integrated recognition unit 240 includes an integrated model generation module
242 for generating an integrated model for effectively integrating the speech feature information and the gesture feature information on the basis of a learning model, an integrated learning database (DB) 244 implemented to be proper for integrated recognition algorithm development based on a statistical model, an integrated learning DB control module 243 for controlling learning by the integrated model generation module 242 and the integrated learning DB 244 and a learning parameter, an integrated feature control module 241 for controlling the learning parameter and feature vectors of the input speech feature information and the gesture feature information, and an integrated recognition module 245 for providing various functions by generating recognition results.
[49] The integrated model generation module 242 generates a high-performance integrated model in order to effectively integrate the speech feature information with the gesture feature information. In order to determine the high-performance integrated model, various learning algorithms such as the hidden Markov model (HMM), neural network (NN), and dynamic time warping (DTW) are implemented and experiments are performed. Particularly, according to the embodiment of the present invention, a method of optimizing NN parameters by determining an integrated model on the basis of the NN to obtain high-performance integrated recognition may be used. However, the generation of the high-performance integrated model has a problem of synchronizing two modalities that have different frame numbers in the learning model.
[50] The problem of the synchronization in the learning model is referred to as a learning model optimization problem. According to the embodiment of the present invention, an integration layer is provided, and a method of connecting the speech and the gesture is optimized in the layer. An overlapping length between the speech and the gesture is calculated with respect to a time axis for the optimization, and the synchronization is performed on the basis of the overlapping length. The overlapping length is used to search for a connection method having the highest recognition rate through a recognition rate experiment.
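A hedged sketch of such an integration layer, assuming fixed-length speech and gesture feature vectors and a single fully connected layer with a softmax over the order set, is given below; the layer size, the plain concatenation of the two modalities, and the overlap computation are illustrative choices rather than the optimized connection method found through the recognition-rate experiments.

    import numpy as np

    def overlap_length(speech_span, gesture_span):
        # overlap of the speech and gesture order sections on a common time axis (seconds)
        (s0, s1), (g0, g1) = speech_span, gesture_span
        return max(0.0, min(s1, g1) - max(s0, g0))

    class IntegrationLayer:
        """Concatenates speech and gesture feature vectors and scores each order."""

        def __init__(self, speech_dim, gesture_dim, n_orders, seed=0):
            rng = np.random.default_rng(seed)
            self.w = rng.normal(0.0, 0.1, (speech_dim + gesture_dim, n_orders))
            self.b = np.zeros(n_orders)

        def recognize(self, speech_vec, gesture_vec):
            x = np.concatenate([speech_vec, gesture_vec])  # integrate the two modalities
            scores = x @ self.w + self.b
            e = np.exp(scores - scores.max())              # softmax over the order set
            return e / e.sum()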
[51] The integrated learning DB 244 implements an integrated recognition DB to be proper for the integrated recognition algorithm development based on the statistical model.
[52] For example, data of ten words in various age groups is synchronized and collected by a stereo camera and a wireless microphone. Table 1 shows an order set defined to integrate gestures and speeches. The defined order set is obtained on the basis of natural gestures that people can understand without learning.
[53] Table 1
[Table 1: the order set defined to integrate gestures and speeches; the table is reproduced only as an image in the original document.]
[54] Here, the speech is recorded as a single-channel pulse code modulation (PCM) waveform at a sampling rate of 16 kHz with 16 bits per sample. In addition, 24-bit BITMAP images with a size of 320x240 are recorded at 15 frames per second by using an STH-DCSG-C stereo camera, with a blue screen background under a lighting condition of four fluorescent boxes. Since the stereo camera does not have a speech interface, a speech collection module and an image collection module are independently provided, and a synchronization program is written to collect data, in which the speech recording program controls the image collection process through inter-process communication (IPC). The image collection module is configured by using the open source computer vision library (OpenCV) and the small vision system (SVS).
[55] Stereo camera images have to be adjusted to the real recording environment through an additional calibration process, and in order to acquire optimal images, the associated gain, exposure, brightness, red, and blue parameter values are modified to control color, exposure, and white balance (WB) values. Calibration information and parameter information are stored in an additional ini file so that the image storage module can call and use the information.
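The recording parameters above can be summarized in a small capture sketch; using a plain OpenCV VideoCapture with camera index 0 is an assumption for illustration, since the described setup actually uses an STH-DCSG-C stereo camera driven through SVS, with the speech recording program controlling image collection over IPC.

    import cv2

    # parameters taken from the text
    IMAGE_WIDTH, IMAGE_HEIGHT, FPS = 320, 240, 15
    AUDIO_RATE_HZ, AUDIO_BITS, AUDIO_CHANNELS = 16000, 16, 1

    def open_image_capture(camera_index=0):
        # camera_index 0 and a plain webcam capture are assumptions, not the stereo setup
        cap = cv2.VideoCapture(camera_index)
        cap.set(cv2.CAP_PROP_FRAME_WIDTH, IMAGE_WIDTH)
        cap.set(cv2.CAP_PROP_FRAME_HEIGHT, IMAGE_HEIGHT)
        cap.set(cv2.CAP_PROP_FPS, FPS)
        return cap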
[56] The integrated learning DB control module 243 generates the learning parameter on the basis of the integrated learning DB 244 and the integrated model generated by the integrated model generation module 242, which are generated and stored in advance.
[57] The integrated feature control module 241 controls the learning parameter generated by the integrated learning DB control module 243 and the feature vectors of the feature information on the speech and the gesture extracted by the speech feature extraction unit 210 and the gesture feature extraction unit 220. The control operation is associated with the extension and the reduction of the node number of input vectors. The integrated feature control module 241 has the integration layer, and the integration layer is developed to effectively integrate lengths of the speech and the gesture that have different values and propose a single recognition rate.
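A hedged sketch of this node-number control could simply pad or truncate each feature vector to the fixed number of input nodes expected by the integration layer; zero padding and tail truncation are assumptions about how the extension and reduction are carried out.

    import numpy as np

    def fit_to_nodes(feature_vec, n_nodes):
        # extend (zero-pad) or reduce (truncate) the feature vector to n_nodes inputs
        v = np.asarray(feature_vec, dtype=np.float64)
        if v.size >= n_nodes:
            return v[:n_nodes]
        return np.pad(v, (0, n_nodes - v.size))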
[58] The integrated recognition module 245 generates a recognition result by using a control result obtained by the integrated feature control module 241. In addition, the integrated recognition module 245 provides various functions for interactions with a network or an integration representing unit.
[59] FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention.
[60] Referring to FIG. 3, the gesture/speech integrated recognition method is operated by three threads. The three threads include a speech feature extraction thread 10 for extracting speech features, a gesture feature extraction thread 20 for extracting gesture features, and an integrated recognition thread 30 for performing integrated recognition on the speech and the gesture. The three threads 10, 20, and 30 are generated at the time point when the learning parameter is loaded and are cooperatively operated by using a thread flag. Now, the gesture/speech integrated recognition method in which the three threads 10, 20, and 30 are cooperatively operated is described.
[61] When the user orders by a speech and a gesture, the speech feature extraction thread
10 continuously receives the speech through a wired or wireless microphone (operation S311). Next, the gesture feature extraction thread 20 continuously receives images including the gesture through a camera (operation S320). Speech frames of the speech continuously input through the microphone are calculated (operation S312), and the EPD module 212 detects a start point and an end point (speech EPD value) of an order included in the speech (operation S313). When the speech EPD value is detected, the speech EPD value is transmitted to a synchronization operation 40 of the gesture feature extraction thread 20. In addition, when an order section of the speech is determined by using the start point and the end point of the order included in the speech, the hearing model-based speech feature extraction module 213 extracts a speech feature from the order section on the basis of the hearing model (operation S314) and transmits the extracted speech feature to the integrated recognition thread 30.
[62] The gesture feature extraction thread 20 detects a hand and a face from the images continuously input through the camera (operation S321). When the hand and face are detected, a gesture of the user is tracked (operation S322). Since the gesture of the user continuously changes, a gesture having a predetermined length is stored in a buffer (operation S323).
[63] When the speech EPD value is detected and transmitted while the gesture is stored in the buffer, a speech EPD flag in the gesture images stored in the buffer is checked (operation S324). The start point and the end point of the gesture including the feature information on the images are searched for by using the speech EPD flag (operation S325), and the found gesture feature is stored (operation S326). The stored gesture feature is not in synchronization with the speech, so that optimal frames are calculated from a start frame of the gesture by applying the optimal frames set in advance. In addition, the calculated optimal frames are used by the gesture feature extraction module 224 to extract gesture feature information, and the extracted gesture feature information is transmitted to the integrated recognition thread 30.
[64] When the speech feature extraction thread 10 and the gesture feature extraction thread 20 successfully extract feature information on the speech and the gesture, the speech/gesture feature extraction threads 10 and 20 are in a sleep state while the integrated recognition thread checks a recognition result (operations S328 and S315).
[65] The integrated recognition thread 30 generates a high-performance integrated model in advance by using the integrated model generation module 242 before receiving the speech feature information and the gesture feature information and controls the generated integrated model and the integrated learning DB 244 so that the integrated learning DB control module 243 generates and loads the learning parameter (operation S331). When the learning parameter is loaded, the integrated recognition thread 30 is maintained in the sleep state until the speech/gesture feature information is received (operation S332).
[66] When the integrated recognition thread 30 in the sleep state receives a signal indicating that the extraction of the feature information on the speech and the gesture is complete (operation S333), it loads each feature into memory (operation S334). When the feature information on the speech and the gesture is loaded, the recognition result is calculated by using the optimized integrated learning model and the learning parameter set in advance (operation S335).
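Continuing the same sketch, the recognition step loads the stored learning parameter, concatenates the two feature vectors, and produces a class label. The feature values below are placeholders.

```python
import pickle
import numpy as np

with open("learning_parameter.pkl", "rb") as f:      # written by the offline sketch above
    model = pickle.load(f)

speech_feature = np.random.default_rng(1).normal(size=8)     # placeholder speech features
gesture_feature = np.random.default_rng(2).normal(size=40)   # placeholder gesture features
combined = np.concatenate([speech_feature, gesture_feature]).reshape(1, -1)
print("recognized order class:", int(model.predict(combined)[0]))
```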
[67] When the recognition result has been calculated by the integrated recognition unit 240, the speech feature extraction thread 10 and the gesture feature extraction thread 20 that were in the sleep state resume extracting feature information from newly input speech and images.
[68] While the present invention has been shown and described in connection with the exemplary embodiments, it will be apparent to those skilled in the art that modifications and variations can be made without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

[1] A gesture/speech integrated recognition system comprising:
a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech;
a gesture feature extraction unit extracting gesture feature information by detecting an order section from a gesture in taken images by using information on the detected start point and end point; and
an integrated recognition unit outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance.
[2] The system of claim 1, further comprising a synchronization module comprising:
a gesture start point detection module detecting a start point of the gesture from the taken images by using the detected start point; and
an optimal frame applying module calculating and extracting optimal image frames by applying the number of optimal frames set in advance from the start point of the gesture.
[3] The system of claim 2, wherein the gesture start point detection module detects the start point of the gesture from the taken images by checking the EPD (end point detection) flag of the detected speech.
[4] The system of claim 1, wherein the speech feature extraction unit comprises:
an EPD module detecting a start point and an end point in the input speech; and
a hearing model-based speech feature extraction module extracting speech feature information included in the detected order by using a hearing model-based algorithm.
[5] The system of claim 4, wherein the speech feature extraction unit removes noise from the extracted speech feature information.
[6] The system of claim 3, wherein the gesture feature extraction unit comprises:
a hand tracking module tracking movements of a hand in images taken by a camera and transmitting the movements to the synchronization module; and
a gesture feature extraction module extracting gesture feature information by using the optimal image frames extracted by the synchronization module.
[7] The system of claim 1, wherein the integrated recognition unit comprises:
an integrated learning DB (database) control module generating the learning parameter on the basis of an integrated learning model and an integrated learning DB set in advance;
an integrated feature control module controlling the extracted speech feature information and the gesture feature information by using the generated learning parameter; and
an integrated recognition module generating a result controlled by the integrated feature control module as a recognition result.
[8] The system of claim 7, wherein the integrated learning model is generated on the basis of a neural network learning algorithm.
[9] The system of claim 7, wherein the integrated learning DB is implemented to suit an integrated recognition algorithm based on a statistical model by integrating feature information on speeches and gestures of various age groups collected by using a stereo camera and a wireless microphone.
[10] The system of claim 7, wherein the integrated recognition module includes an integration layer for integrating the extracted speech feature information with the gesture feature information.
[11] The system of claim 7, wherein the integrated feature control module controls feature vectors of the extracted speech feature information and the gesture feature information through the extension and the reduction of the number of nodes of the input vectors.
[12] A gesture/speech integrated recognition method comprising:
a first step of extracting speech feature information by detecting a start point that is an EPD value and an end point of an order in an input speech;
a second step of extracting gesture feature information by detecting an order section from a gesture in images input through a camera by using the detected start point of the order; and
a third step of outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
[13] The method of claim 12, wherein in the first step, the speech feature information is extracted from the order section obtained from the start point and the end point of the order on the basis of a hearing model.
[14] The method of claim 12, wherein the second step comprises:
an A step of tracking a gesture of movements of a hand from the images input through the camera;
a B step of detecting an order section from the gesture of the movements of the hand by using the transmitted EPD value;
a C step of determining optimal frames from the order section of the gesture by applying the number of optimal frames set in advance; and
a D step of extracting gesture feature information from the determined optimal frames.
[15] The method of claim 12, wherein the first step further comprises removing noise from the extracted speech feature information.
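The extension and reduction of the number of input nodes in claim 11 can be illustrated under the assumption that it amounts to padding or truncating feature vectors to a fixed length; the claims do not spell out the exact rule, so this is only one plausible reading.

```python
import numpy as np

def fit_to_input_nodes(vector, n_nodes):
    """Extend (zero-pad) or reduce (truncate) a feature vector to the fixed
    number of input nodes of the integrated model. Illustrative only."""
    vector = np.asarray(vector, dtype=float)
    if len(vector) >= n_nodes:
        return vector[:n_nodes]
    return np.pad(vector, (0, n_nodes - len(vector)))

print(fit_to_input_nodes([1.0, 2.0, 3.0], 5))        # extended -> [1. 2. 3. 0. 0.]
print(fit_to_input_nodes([1.0, 2.0, 3.0], 2))        # reduced  -> [1. 2.]
```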
PCT/KR2007/006189 2006-12-04 2007-12-03 Gesture/speech integrated recognition system and method WO2008069519A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2009540141A JP2010511958A (en) 2006-12-04 2007-12-03 Gesture / voice integrated recognition system and method

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2006-0121836 2006-12-04
KR20060121836 2006-12-04
KR10-2007-0086575 2007-08-28
KR1020070086575A KR100948600B1 (en) 2006-12-04 2007-08-28 System and method for integrating gesture and voice

Publications (1)

Publication Number Publication Date
WO2008069519A1 (en)

Family

ID=39492339

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2007/006189 WO2008069519A1 (en) 2006-12-04 2007-12-03 Gesture/speech integrated recognition system and method

Country Status (1)

Country Link
WO (1) WO2008069519A1 (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011081541A (en) * 2009-10-06 2011-04-21 Canon Inc Input device and control method thereof
GB2476711A (en) * 2009-12-31 2011-07-06 Intel Corp Using multi-modal input to control multiple objects on a display
EP2347810A3 (en) * 2009-12-30 2011-09-14 Crytek GmbH Mobile input and sensor device for a computer-controlled video entertainment system
WO2011130083A2 (en) * 2010-04-14 2011-10-20 T-Mobile Usa, Inc. Camera-assisted noise cancellation and speech recognition
CN102298442A (en) * 2010-06-24 2011-12-28 索尼公司 Gesture recognition apparatus, gesture recognition method and program
JP2012525625A (en) * 2009-04-30 2012-10-22 サムスン エレクトロニクス カンパニー リミテッド User intention inference apparatus and method using multimodal information
CN103064530A (en) * 2012-12-31 2013-04-24 华为技术有限公司 Input processing method and device
CN103376891A (en) * 2012-04-23 2013-10-30 凹凸电子(武汉)有限公司 Multimedia system, control method for display device and controller
WO2014010879A1 (en) * 2012-07-09 2014-01-16 엘지전자 주식회사 Speech recognition apparatus and method
WO2014078480A1 (en) * 2012-11-16 2014-05-22 Aether Things, Inc. Unified framework for device configuration, interaction and control, and associated methods, devices and systems
CN104317392A (en) * 2014-09-25 2015-01-28 联想(北京)有限公司 Information control method and electronic equipment
US9002714B2 (en) 2011-08-05 2015-04-07 Samsung Electronics Co., Ltd. Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same
US9244984B2 (en) 2011-03-31 2016-01-26 Microsoft Technology Licensing, Llc Location based conversational understanding
US9298287B2 (en) 2011-03-31 2016-03-29 Microsoft Technology Licensing, Llc Combined activation for natural user interface systems
CN105792005A (en) * 2014-12-22 2016-07-20 深圳Tcl数字技术有限公司 Recording control method and device
US9454962B2 (en) 2011-05-12 2016-09-27 Microsoft Technology Licensing, Llc Sentence simplification for spoken language understanding
US9760566B2 (en) 2011-03-31 2017-09-12 Microsoft Technology Licensing, Llc Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof
US9842168B2 (en) 2011-03-31 2017-12-12 Microsoft Technology Licensing, Llc Task driven user intents
US9858343B2 (en) 2011-03-31 2018-01-02 Microsoft Technology Licensing Llc Personalization of queries, conversations, and searches
US20180122379A1 (en) * 2016-11-03 2018-05-03 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
US10061843B2 (en) 2011-05-12 2018-08-28 Microsoft Technology Licensing, Llc Translating natural language utterances to keyword search queries
CN110121696A (en) * 2016-11-03 2019-08-13 三星电子株式会社 Electronic equipment and its control method
US10642934B2 (en) 2011-03-31 2020-05-05 Microsoft Technology Licensing, Llc Augmented conversational understanding architecture
US10739864B2 (en) 2018-12-31 2020-08-11 International Business Machines Corporation Air writing to speech system using gesture and wrist angle orientation for synthesized speech modulation
WO2021135281A1 (en) * 2019-12-30 2021-07-08 浪潮(北京)电子信息产业有限公司 Multi-layer feature fusion-based endpoint detection method, apparatus, device, and medium
US11521038B2 (en) 2018-07-19 2022-12-06 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
CN115881118A (en) * 2022-11-04 2023-03-31 荣耀终端有限公司 Voice interaction method and related electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0594129A2 (en) * 1992-10-20 1994-04-27 Hitachi, Ltd. Display system capable of accepting user commands by use of voice and gesture inputs
JPH07306772A (en) * 1994-05-16 1995-11-21 Canon Inc Method and device for information processing
KR20010075838A (en) * 2000-01-20 2001-08-11 오길록 Apparatus and method for processing multimodal interface
US20030001908A1 (en) * 2001-06-29 2003-01-02 Koninklijke Philips Electronics N.V. Picture-in-picture repositioning and/or resizing based on speech and gesture control

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012525625A (en) * 2009-04-30 2012-10-22 サムスン エレクトロニクス カンパニー リミテッド User intention inference apparatus and method using multimodal information
JP2011081541A (en) * 2009-10-06 2011-04-21 Canon Inc Input device and control method thereof
EP2347810A3 (en) * 2009-12-30 2011-09-14 Crytek GmbH Mobile input and sensor device for a computer-controlled video entertainment system
US9344753B2 (en) 2009-12-30 2016-05-17 Crytek Gmbh Mobile input and sensor device for a computer-controlled video entertainment system
US8977972B2 (en) 2009-12-31 2015-03-10 Intel Corporation Using multi-modal input to control multiple objects on a display
GB2476711B (en) * 2009-12-31 2012-09-05 Intel Corp Using multi-modal input to control multiple objects on a display
GB2476711A (en) * 2009-12-31 2011-07-06 Intel Corp Using multi-modal input to control multiple objects on a display
WO2011130083A3 (en) * 2010-04-14 2012-02-02 T-Mobile Usa, Inc. Camera-assisted noise cancellation and speech recognition
WO2011130083A2 (en) * 2010-04-14 2011-10-20 T-Mobile Usa, Inc. Camera-assisted noise cancellation and speech recognition
US8635066B2 (en) 2010-04-14 2014-01-21 T-Mobile Usa, Inc. Camera-assisted noise cancellation and speech recognition
CN102298442A (en) * 2010-06-24 2011-12-28 索尼公司 Gesture recognition apparatus, gesture recognition method and program
EP2400371A3 (en) * 2010-06-24 2015-04-08 Sony Corporation Gesture recognition apparatus, gesture recognition method and program
US9842168B2 (en) 2011-03-31 2017-12-12 Microsoft Technology Licensing, Llc Task driven user intents
US9858343B2 (en) 2011-03-31 2018-01-02 Microsoft Technology Licensing Llc Personalization of queries, conversations, and searches
US10585957B2 (en) 2011-03-31 2020-03-10 Microsoft Technology Licensing, Llc Task driven user intents
US10642934B2 (en) 2011-03-31 2020-05-05 Microsoft Technology Licensing, Llc Augmented conversational understanding architecture
US10296587B2 (en) 2011-03-31 2019-05-21 Microsoft Technology Licensing, Llc Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof
US9298287B2 (en) 2011-03-31 2016-03-29 Microsoft Technology Licensing, Llc Combined activation for natural user interface systems
US9244984B2 (en) 2011-03-31 2016-01-26 Microsoft Technology Licensing, Llc Location based conversational understanding
US10049667B2 (en) 2011-03-31 2018-08-14 Microsoft Technology Licensing, Llc Location-based conversational understanding
US9760566B2 (en) 2011-03-31 2017-09-12 Microsoft Technology Licensing, Llc Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof
US10061843B2 (en) 2011-05-12 2018-08-28 Microsoft Technology Licensing, Llc Translating natural language utterances to keyword search queries
US9454962B2 (en) 2011-05-12 2016-09-27 Microsoft Technology Licensing, Llc Sentence simplification for spoken language understanding
US9002714B2 (en) 2011-08-05 2015-04-07 Samsung Electronics Co., Ltd. Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same
US9733895B2 (en) 2011-08-05 2017-08-15 Samsung Electronics Co., Ltd. Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same
CN103376891A (en) * 2012-04-23 2013-10-30 凹凸电子(武汉)有限公司 Multimedia system, control method for display device and controller
WO2014010879A1 (en) * 2012-07-09 2014-01-16 엘지전자 주식회사 Speech recognition apparatus and method
WO2014078480A1 (en) * 2012-11-16 2014-05-22 Aether Things, Inc. Unified framework for device configuration, interaction and control, and associated methods, devices and systems
AU2013270485C1 (en) * 2012-12-31 2016-01-21 Huawei Technologies Co. , Ltd. Input processing method and apparatus
AU2013270485B2 (en) * 2012-12-31 2015-09-10 Huawei Technologies Co. , Ltd. Input processing method and apparatus
EP2765473A4 (en) * 2012-12-31 2014-12-10 Huawei Tech Co Ltd Input processing method and apparatus
EP2765473A1 (en) * 2012-12-31 2014-08-13 Huawei Technologies Co., Ltd. Input processing method and apparatus
CN103064530A (en) * 2012-12-31 2013-04-24 华为技术有限公司 Input processing method and device
CN104317392A (en) * 2014-09-25 2015-01-28 联想(北京)有限公司 Information control method and electronic equipment
CN105792005A (en) * 2014-12-22 2016-07-20 深圳Tcl数字技术有限公司 Recording control method and device
CN105792005B (en) * 2014-12-22 2019-05-14 深圳Tcl数字技术有限公司 The method and device of video recording control
EP4220630A1 (en) * 2016-11-03 2023-08-02 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
CN110121696A (en) * 2016-11-03 2019-08-13 三星电子株式会社 Electronic equipment and its control method
EP3523709A4 (en) * 2016-11-03 2019-11-06 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
WO2018084576A1 (en) 2016-11-03 2018-05-11 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
US20180122379A1 (en) * 2016-11-03 2018-05-03 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
US10679618B2 (en) 2016-11-03 2020-06-09 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
US11908465B2 (en) 2016-11-03 2024-02-20 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
US11521038B2 (en) 2018-07-19 2022-12-06 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
US10739864B2 (en) 2018-12-31 2020-08-11 International Business Machines Corporation Air writing to speech system using gesture and wrist angle orientation for synthesized speech modulation
WO2021135281A1 (en) * 2019-12-30 2021-07-08 浪潮(北京)电子信息产业有限公司 Multi-layer feature fusion-based endpoint detection method, apparatus, device, and medium
CN115881118A (en) * 2022-11-04 2023-03-31 荣耀终端有限公司 Voice interaction method and related electronic equipment
CN115881118B (en) * 2022-11-04 2023-12-22 荣耀终端有限公司 Voice interaction method and related electronic equipment

Similar Documents

Publication Publication Date Title
WO2008069519A1 (en) Gesture/speech integrated recognition system and method
KR100948600B1 (en) System and method for integrating gesture and voice
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
US11762474B2 (en) Systems, methods and devices for gesture recognition
US8793134B2 (en) System and method for integrating gesture and sound for controlling device
US11854550B2 (en) Determining input for speech processing engine
US20190188903A1 (en) Method and apparatus for providing virtual companion to a user
CN110322760B (en) Voice data generation method, device, terminal and storage medium
CN110310623A (en) Sample generating method, model training method, device, medium and electronic equipment
CN106157956A (en) The method and device of speech recognition
JP2012014394A (en) User instruction acquisition device, user instruction acquisition program and television receiver
KR20100062207A (en) Method and apparatus for providing animation effect on video telephony call
CN112016367A (en) Emotion recognition system and method and electronic equipment
CN114779922A (en) Control method for teaching apparatus, control apparatus, teaching system, and storage medium
CN111326152A (en) Voice control method and device
CN107452381B (en) Multimedia voice recognition device and method
CN113129867A (en) Training method of voice recognition model, voice recognition method, device and equipment
CN110728993A (en) Voice change identification method and electronic equipment
KR102291740B1 (en) Image processing system
CN113497912A (en) Automatic framing through voice and video positioning
CN108628454B (en) Visual interaction method and system based on virtual human
KR20130054131A (en) Display apparatus and control method thereof
WO2021223765A1 (en) Voice recognition method, voice recognition system and electrical device
Freitas et al. Multimodal silent speech interface based on video, depth, surface electromyography and ultrasonic doppler: Data collection and first recognition results
KR102265874B1 (en) Method and Apparatus for Distinguishing User based on Multimodal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07851181

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2009540141

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07851181

Country of ref document: EP

Kind code of ref document: A1