WO2008069519A1 - Gesture/speech integrated recognition system and method - Google Patents
- Publication number
- WO2008069519A1 (PCT/KR2007/006189)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gesture
- speech
- feature information
- integrated
- module
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/03—Arrangements for converting the position or the displacement of a member into a coded form
- G06F3/033—Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
- G06F3/038—Control and interface arrangements therefor, e.g. drivers or device-embedded control circuitry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/038—Indexing scheme relating to G06F3/038
- G06F2203/0381—Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer
Definitions
- the present invention relates to an integrated recognition technology, and more particularly, to a gesture/speech integrated recognition system and method capable of recognizing an order of a user by extracting feature information on a gesture by using an end point detection (EPD) value of a speech and integrating feature information on the speech with the feature information on the gesture, thereby recognizing the order of the user at a high recognition rate in a noise environment.
- a speech recognition technology and a gesture recognition technology are used as the most convenient interface technology.
- the speech recognition technology and the gesture recognition technology have a high recognition rate in a limited environment.
- performances of the technologies are not good in a noise environment. This is because environmental noise affects the performance of the speech recognition, and a change in lighting and a type of gesture affect the performance of a camera-based gesture recognition technology. Therefore, the speech recognition technology needs the development of a technology of recognizing speech by using an algorithm robust against noise, and the gesture recognition technology needs the development of a technology of extracting a particular section of a gesture including recognition information.
- a particular section of the gesture cannot be easily identified, so that recognition is difficult.
- An aspect of the present invention provides a means for extracting feature information by detecting an order section from a speech of a user by using an algorithm robust against environmental noise, and detecting a feature section of a gesture by using information on an order start point of the speech, thereby easily recognizing the order in the gesture that is not easily identified.
- An aspect of the present invention also provides a means for synchronizing a gesture and a speech by applying optimal frames set in advance to an order section of the gesture detected by a speech end point detection (EPD) value, thereby solving a problem of a synchronization difference between the speech and the gesture while integrated recognition is performed.
- a gesture/speech integrated recognition system including: a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section from a gesture of taken images by using information on the detected start point and the end point; and an integrated recognition unit outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance.
- the system may further include a synchronization module including: a gesture start point detection module detecting a start point of the gesture from the taken images by using the detected start point; and an optimal frame applying module calculating and extracting optimal image frames by applying the number of optimal frames set in advance from the start point of the gesture.
- the gesture start point detection module may detect the start point of the gesture by checking, in the taken images, the point at which the end point detection (EPD) flag of the detected speech is set.
- the speech feature extraction unit may include: an EPD module detecting a start point and an end point in the input speech; and a hearing model-based speech feature extraction module extracting speech feature information included in the detected order by using a hearing model-based algorithm.
- the speech feature extraction unit may remove noise from the extracted speech feature information.
- the gesture feature extraction unit may include: a hand tracking module tracking movements of a hand in images taken by a camera and transmitting the movements to the synchronization module; and a gesture feature extraction module extracting gesture feature information by using the optimal image frames extracted by the synchronization module.
- the integrated recognition unit may include an integrated learning DB, an integrated feature control module, and an integrated recognition module, as described in detail with reference to FIG. 2 below.
- the integrated feature control module may control feature vectors of the extracted speech feature information and the gesture feature information through the extension and the reduction of the node numbers of input vectors.
- a gesture/speech integrated recognition method including: a first step of extracting speech feature information by detecting a start point, which is an EPD value, and an end point of an order in an input speech; a second step of extracting gesture feature information by detecting an order section from a gesture in images input through a camera by using the detected start point of the order; and a third step of outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
- the speech feature information may be extracted from the order section obtained by the start point and the end point of the order on the basis of a hearing model.
- the second step may include: a B step of detecting an order section from the gesture of the movements of the hand by using the transmitted EPD value; a C step of determining optimal frames from the order section of the gesture by applying optimal frames set in advance; and a D step of extracting gesture feature information from the determined optimal frames.
- the gesture/speech integrated recognition system and method according to the present invention increases a recognition rate of a gesture that is not easily identified by detecting an order section of the gesture by using an end point detection (EPD) value that is a start point of an order section of a speech.
- optimal frames are applied to the order section of the gesture to synchronize the speech and the gesture, so that it is possible to implement integrated recognition between the speech and the gesture.
- FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
- FIG. 2 is a view illustrating a configuration of a gesture/speech integrated recognition system according to an embodiment of the present invention.
- FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention.
Best Mode for Carrying Out the Invention
- FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
- a person 100 orders through a speech 110 and a gesture 120.
- the person may point at a corn bread with a finger while saying "select the corn bread" as an order for selecting the corn bread from among the displayed goods.
- feature information on the speech order of the person is recognized through speech recognition 111, and feature information on the gesture of the person is recognized through gesture recognition 121.
- the recognized feature information on the gesture and the speech is recognized through integrated recognition 130 as a single user order in order to increase a recognition rate of the speech affected by environmental noise and the gesture that cannot be easily identified.
- the present invention provides a technology of integrated recognition for a speech and a gesture of a person.
- the recognized order is transmitted by a controller to a speaker 170, a display apparatus 171, a diffuser 172, a touching device 173, and a tasting device 174 which are output devices for the five senses, and the controller controls each device.
- the recognition result is transmitted to a network, so that data of the five senses as the result is transmitted to control each of the output devices.
- the present invention relates to integrated recognition, and a configuration after the recognition operation can be modified in various manners, so that a detailed description thereof is omitted.
- FIG. 2 is a view illustrating a configuration of the gesture/speech integrated recognition system according to an embodiment of the present invention.
- the gesture/speech integrated recognition system includes: a speech feature extraction unit 210 for extracting speech feature information by detecting a start point and an end point of an order from a speech input through a microphone 211; a gesture feature extraction unit 220 for extracting gesture feature information by detecting an order section from a gesture in images taken by a camera, by using information on the start point and the end point detected by the speech feature extraction unit 210; a synchronization module 230 for detecting a start point of the gesture from the taken images by using the start point detected by the speech feature extraction unit 210, and calculating optimal image frames by applying the number of optimal frames set in advance from the detected start point of the gesture; and an integrated recognition unit 240 for outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance.
- the speech feature extraction unit 210 includes the microphone 211 receiving a speech of a user, an end point detection (EPD) module 212 detecting a start point and an end point of an order section from the speech of the user, and a hearing model-based speech feature extraction module 213 extracting speech feature information from the order section of the speech detected by the EPD module 212 on the basis of the hearing model.
- a channel noise reduction module (not shown) for removing noise from the extracted speech feature information may further be included.
- the EPD module 212 detects the start and the end of the order by analyzing the speech input through the wired or wireless microphone.
- the EPD module 212 acquires a speech signal, calculates an energy value needed to detect the end point of the speech signal, and detects the start and the end of the order by identifying a section to be calculated as the order from the input speech signal.
- the EPD module 212 firstly acquires the speech signal through the microphone and converts the acquired speech into a form suitable for frame-based calculation. Meanwhile, when the speech is input wirelessly, data loss or signal deterioration due to signal interference may occur, so an operation to deal with this problem during signal acquisition is needed.
- the energy value needed to detect the end point of the speech signal by the EPD module 212 is calculated as follows.
- a frame size used to analyze the speech signal is 160 samples, and frame energy is calculated by the following equation.
- S(n) denotes a vocal cord signal sample
- N denotes the number of samples per frame.
- the calculated frame energy is used as a parameter for detecting the end point.
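The energy equation itself does not survive in this extraction. Frame energy is conventionally computed as the sum of squared samples over the frame, E = Σ S(n)², which matches the symbols S(n) and N defined above; the sketch below rests on that assumption, and the use of NumPy is illustrative:

```python
import numpy as np

FRAME_SIZE = 160  # samples per frame, as stated above (10 ms at 16 kHz)

def frame_energy(samples: np.ndarray) -> np.ndarray:
    """Per-frame energy E = sum over n of S(n)^2, over 160-sample frames.

    `samples` is a 1-D array of speech samples; trailing samples that do
    not fill a whole frame are discarded.
    """
    n_frames = len(samples) // FRAME_SIZE
    frames = samples[: n_frames * FRAME_SIZE].reshape(n_frames, FRAME_SIZE)
    return (frames.astype(np.float64) ** 2).sum(axis=1)
```

The resulting per-frame energies are the parameter the EPD module thresholds to find the order section.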
- the EPD module 212 determines a section to be calculated as the order after calculating the frame energy value. For example, the operation of calculating the start point and the end point of the speech signal is determined by four energy thresholds using the frame energy and 10 conditions.
- the four energy thresholds and the 10 conditions can be set in various manners and may have values obtained in experiments in order to obtain the order section.
- the four thresholds are applied to every frame by the EPD algorithm to detect the start and the end of the order.
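Since the four energy thresholds and ten conditions are tuned experimentally and not enumerated here, the following sketch substitutes a deliberately simplified scheme — two thresholds plus consecutive-frame conditions — to show the shape of such a detector; every parameter name and value is illustrative, not the patent's:

```python
def detect_endpoints(energies, t_start, t_end, min_voiced=3, hangover=5):
    """Minimal endpoint detector over per-frame energies.

    Simplified stand-in for the four-threshold / ten-condition scheme:
    an order starts after `min_voiced` consecutive frames above `t_start`,
    and ends after `hangover` consecutive frames below `t_end`.
    Returns (start_frame, end_frame) or None if no order is found.
    """
    start = None
    voiced = silent = 0
    for i, e in enumerate(energies):
        if start is None:
            voiced = voiced + 1 if e > t_start else 0
            if voiced >= min_voiced:
                start = i - min_voiced + 1
        else:
            silent = silent + 1 if e < t_end else 0
            if silent >= hangover:
                return start, i - hangover
    return (start, len(energies) - 1) if start is not None else None
```

A real detector would add conditions for minimum order length, energy dips inside words, and noise-floor adaptation, which is roughly what the patent's extra conditions cover.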
- the EPD module 212 transmits information on the detected start point of the order (hereinafter referred to as an "EPD value") to a gesture start point detection module 231 of the synchronization module 230.
- the EPD module 212 transmits information on the order section in the input speech to the hearing model-based speech feature extraction module 213 to extract speech feature information.
- the hearing model-based speech feature extraction module 213 that receives the information on the order section in the speech extracts the feature information on the basis of the hearing model from the order section detected by the EPD module 212.
- Examples of an algorithm used to extract the speech feature information on the basis of the hearing model include an ensemble interval histogram (EIH) and a zero-crossings with peak amplitudes (ZCPA).
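As a rough illustration of the ZCPA idea named above — the interval between successive upward zero crossings gives an inverse-interval frequency estimate, and the log-compressed peak amplitude within the interval is accumulated into a frequency histogram — here is a toy single-band sketch. A real ZCPA front end first runs a bank of cochlear band-pass filters and sums the per-band histograms; all constants below are assumptions:

```python
import numpy as np

def zcpa_histogram(frame, fs=16000, n_bins=16, f_lo=100.0, f_hi=4000.0):
    """Toy single-band ZCPA feature (illustrative only)."""
    x = np.asarray(frame, dtype=np.float64)
    # indices i where the signal crosses zero going upward
    up = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]
    edges = np.geomspace(f_lo, f_hi, n_bins + 1)  # log-spaced frequency bins
    hist = np.zeros(n_bins)
    for a, b in zip(up[:-1], up[1:]):
        freq = fs / (b - a)               # inverse of the crossing interval
        peak = x[a:b].max()               # peak amplitude in the interval
        k = np.searchsorted(edges, freq) - 1
        if 0 <= k < n_bins:
            hist[k] += np.log1p(peak)     # log compression of peak amplitude
    return hist
```

Because zero-crossing intervals are insensitive to additive level shifts, this family of hearing-model features is the "algorithm robust against noise" motivation given earlier.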
- a channel noise reduction module removes noise from the speech feature information extracted by the hearing model-based speech feature extraction module 213, and the speech feature information from which the noise is removed is transmitted to the integrated recognition unit 240.
- the gesture feature extraction unit 220 includes a face and hand detection module 222 for detecting the face and the hand in the taken images, a hand tracking module 223 for tracking movements of the detected hand and transmitting them to the synchronization module 230, and a gesture feature extraction module 224 for extracting feature information on the gesture by using the optimal frames calculated by the synchronization module 230.
- the face and hand detection module 222 detects the face and the hand that gives a gesture, and a hand tracking module 223 continuously tracks the hand movements in the images.
- the hand tracking module 223 tracks the hand.
- the hand tracking module 223 may track various body portions having movements recognized as a gesture.
- the hand tracking module 223 continuously stores the hand movements as time elapses, and a part recognized as a gesture order in the hand movement is detected by the synchronization module 230 by using the EPD value transmitted from the speech feature extraction unit 210.
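The buffer-then-cut behavior just described — continuously store hand movements over time, then use the speech EPD value to pick out the order section — can be sketched as follows; the class and parameter names are hypothetical:

```python
from collections import deque

class GestureBuffer:
    """Rolling buffer of timestamped hand positions (illustrative).

    `cut_from(epd_time)` returns the tracked movement from the speech
    order's start point onward, mirroring how the synchronization module
    uses the EPD value to select the gesture section from the track.
    """
    def __init__(self, max_frames=150):        # ~10 s of history at 15 fps
        self.frames = deque(maxlen=max_frames)

    def push(self, timestamp, hand_xy):
        self.frames.append((timestamp, hand_xy))

    def cut_from(self, epd_time):
        return [(t, xy) for t, xy in self.frames if t >= epd_time]
```

The bounded `deque` stands in for "continuously stores the hand movements as time elapses" without growing without limit.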
- the synchronization module 230 which detects a section recognized as the gesture order in the hand movements by using the EPD value and applies the optimal frames for synchronization between speeches and gestures will be described.
- the synchronization module 230 includes the gesture start point detection module 231 for detecting the start point of the gesture by using the EPD value, and an optimal frame applying module 232 for calculating the optimal image frames needed for integrated recognition by using the start frame of the gesture obtained from the detected start point.
- the gesture start point detection module 231 of the synchronization module 230 checks the speech EPD flag against the image signal, and in this way calculates the start frame of the gesture. In addition, the optimal frame applying module 232 calculates optimal image frames needed for integrated recognition by using the calculated start frame of the gesture and transmits the optimal image frames to the gesture feature extraction module 224.
- in order to calculate the optimal image frames needed for integrated recognition applied by the optimal frame applying module 232, the number of frames determined to give a high recognition rate of gestures is set in advance, and when the start frame of the gesture is calculated by the gesture start point detection module 231, the optimal image frames are determined.
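Applying a preset number of optimal frames from the gesture start frame can be sketched as below; the frame count of 15 is an illustrative placeholder (one second at the 15 fps stated later), not a value from the patent:

```python
N_OPTIMAL = 15  # number of frames set in advance as giving a high
                # recognition rate; the value here is illustrative

def apply_optimal_frames(frames, start_frame):
    """Take a fixed-length window of image frames beginning at the
    gesture start frame, padding by repeating the last frame if the
    sequence runs short, so every gesture reaches the recognizer with
    the same length."""
    window = frames[start_frame : start_frame + N_OPTIMAL]
    while len(window) < N_OPTIMAL and window:
        window.append(window[-1])
    return window
```

Fixing the window length is what makes the gesture side commensurable with the speech side during integrated recognition.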
- the integrated recognition unit 240 includes an integrated model generation module 242 for generating an integrated model, an integrated learning database (DB) 244 implemented to be suitable for integrated recognition algorithm development based on a statistical model, an integrated learning DB control module 243 for controlling learning by the integrated model generation module 242 and the integrated learning DB 244 and for generating a learning parameter, an integrated feature control module 241 for controlling the learning parameter and the feature vectors of the input speech feature information and gesture feature information, and an integrated recognition module 245 for providing various functions by generating recognition results.
- the integrated model generation module 242 generates a high-performance in- tegrated model in order to effectively integrate the speech feature information with the gesture feature information.
- various learning algorithms such as the hidden Markov model (HMM), neural network (NN), and dynamic time warping (DTW) are implemented and experiments are performed.
- a method of optimizing NN parameters by determining an integrated model on the basis of the NN to obtain high-performance integrated recognition may be used.
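The patent states only that an NN-based integrated model with optimized parameters is used, so the following is a hedged sketch of one plausible shape: concatenated speech and gesture feature vectors feed a small feed-forward network with a softmax over the defined order set. All dimensions, the initialization, and the absence of training code are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_fusion_net(speech_dim, gesture_dim, hidden, n_orders):
    """Tiny NN-based integration model (untrained placeholder weights)."""
    d = speech_dim + gesture_dim
    return {
        "W1": rng.normal(0, 0.1, (d, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0, 0.1, (hidden, n_orders)), "b2": np.zeros(n_orders),
    }

def recognize(net, speech_feat, gesture_feat):
    """Forward pass: concatenate both modalities, one hidden layer,
    softmax over the order set."""
    x = np.concatenate([speech_feat, gesture_feat])
    h = np.tanh(x @ net["W1"] + net["b1"])
    logits = h @ net["W2"] + net["b2"]
    p = np.exp(logits - logits.max())   # numerically stable softmax
    return p / p.sum()
```

The concatenation point is where the synchronization problem discussed next arises: both vectors must have agreed-upon lengths.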
- the generation of the high-performance integrated model has a problem of synchronizing two modalities that have different frame numbers in the learning model.
- the problem of the synchronization in the learning model is referred to as a learning model optimization problem.
- an integration layer is provided, and a method of connecting the speech and the gesture is optimized in the layer.
- An overlapping length between the speech and the gesture is calculated with respect to a time axis for the optimization, and the synchronization is performed on the basis of the overlapping length.
- the overlapping length is used to search for a connection method having a highest recognition rate through a recognition rate experiment.
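The overlap computation described above reduces to clipping two intervals on a common time axis; a minimal sketch:

```python
def overlap_length(speech_iv, gesture_iv):
    """Overlapping length (in seconds) of the speech and gesture order
    sections on a common time axis; candidate connections are then
    scored by recognition rate, as described above."""
    (s0, s1), (g0, g1) = speech_iv, gesture_iv
    return max(0.0, min(s1, g1) - max(s0, g0))
```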
- the integrated learning DB 244 implements an integrated recognition DB to be proper for the integrated recognition algorithm development based on the statistical model.
- Table 1 shows an order set defined to integrate gestures and speeches.
- the defined order set is obtained on the basis of natural gestures that people can understand without learning.
- the speech is recorded as a single-channel pulse code modulation (PCM) waveform at a sampling rate of 16 kHz with 16 bits per sample.
- 24-bit BITMAP images with a size of 320x240 pixels are recorded at 15 frames per second with a blue-screen background under lighting from four fluorescent boxes, by using an STH-DCSG-C stereo camera. Since the stereo camera does not have a speech interface, a speech collection module and an image collection module are provided independently, and a synchronization program is written to collect data, in which the speech recording program controls the image collecting process through inter-process communication (IPC).
- the image collection module is configured by using an open source computer vision library (OpenCV) and a small vision system (SVS).
- Stereo camera images have to be adjusted to the real recording environment through an additional calibration process, and in order to acquire optimal images, the associated gain, exposure, brightness, red, and blue parameter values are modified to control color, exposure, and white balance (WB) values.
- Calibration information and parameter information are stored in an additional ini file so that the image storage module can call and use the information.
- the integrated learning DB control module 243 generates the learning parameter on the basis of the integrated learning DB 244, which is generated and stored in advance in association with the integrated model generation module 242.
- the integrated feature control module 241 controls the learning parameter generated by the integrated learning DB control module 243 and the feature vectors of the feature information on the speech and the gesture extracted by the speech feature extraction unit 210 and the gesture feature extraction unit 220.
- the control operation is associated with the extension and the reduction of the node number of input vectors.
- the integrated feature control module 241 has the integration layer, and the integration layer is developed to effectively integrate speech and gesture features that have different lengths and to produce a single recognition result.
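One conventional realization of the "extension and reduction of the node numbers of input vectors" mentioned above is linear interpolation of each feature vector to a fixed node count; the patent does not state the exact method, so this is an assumption:

```python
import numpy as np

def resize_nodes(vec, n_nodes):
    """Extend or reduce a feature vector to a fixed number of input
    nodes by linear interpolation, so speech and gesture features of
    differing lengths can share one integration layer."""
    vec = np.asarray(vec, dtype=np.float64)
    old = np.linspace(0.0, 1.0, len(vec))   # original node positions
    new = np.linspace(0.0, 1.0, n_nodes)    # target node positions
    return np.interp(new, old, vec)
```

Both extension (n_nodes greater than the input length) and reduction (n_nodes smaller) fall out of the same interpolation.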
- the integrated recognition module 245 generates a recognition result by using a control result obtained by the integrated feature control module 241.
- the integrated recognition module 245 provides various functions for interactions with a network or an integration representing unit.
- FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention.
- the gesture/speech integrated recognition method operates with three threads.
- the three threads include a speech feature extraction thread 10 for extracting speech features, a gesture feature extraction thread 20 for extracting gesture features, and an integrated recognition thread 30 for performing integrated recognition on the speech and the gesture.
- the three threads 10, 20, and 30 are generated at the time point when the learning parameter is loaded and are cooperatively operated by using a thread flag. Now, the gesture/speech integrated recognition method in which the three threads 10, 20, and 30 cooperate is described.
- the gesture feature extraction thread 20 continuously receives images including the gesture through a camera (operation S320). Frames of the speech continuously input through the microphone are calculated by the speech feature extraction thread 10 (operation S312), and the EPD module 212 detects a start point and an end point (speech EPD value) of an order included in the speech (operation S313). When the speech EPD value is detected, it is transmitted to a synchronization operation 40 of the gesture feature extraction thread 20.
- the hearing model-based speech feature extraction module 213 extracts a speech feature from the order section on the basis of the hearing model (operation S314) and transmits the extracted speech feature to the integrated recognition thread 30.
- the gesture feature extraction thread 20 detects a hand and a face from the images continuously input through the camera (operation S321). When the hand and face are detected, a gesture of the user is tracked (operation S322). Since the gesture of the user continuously changes, a gesture having a predetermined length is stored in a buffer (operation S323).
- a speech EPD flag in the gesture images stored in the buffer is checked (operation S324).
- the start point and the end point of the gesture including the feature information on the images are searched for by using the speech EPD flag (operation S325), and the found gesture feature is stored (operation S326).
- the stored gesture feature is not in synchronization with the speech, so optimal frames are calculated from the start frame of the gesture by applying the number of optimal frames set in advance.
- the calculated optimal frames are used to extract gesture feature information by the gesture feature extraction module 224, and the extracted gesture feature information is transmitted to the integrated recognition thread 30.
- the speech/gesture feature extraction threads 10 and 20 are in a sleep state while the integrated recognition thread checks a recognition result (operations S328 and S315).
- the integrated recognition thread 30 generates a high-performance integrated model in advance by using the integrated model generation module 242 before receiving the speech feature information and the gesture feature information, and controls the generated integrated model and the integrated learning DB 244 so that the integrated learning DB control module 243 generates and loads the learning parameter (operation S331).
- the integrated recognition thread 30 is maintained in the sleep state until the speech/gesture feature information is received (operation S332).
- when the integrated recognition thread 30 that is in the sleep state receives a signal associated with the feature information after the extraction of the feature information on the speech and the gesture is completed (operation S333), the integrated recognition thread 30 loads each feature in a memory (operation S334).
- the recognition result is calculated by using the optimized integrated learning model and the learning parameter that are set in advance (operation S335).
- the speech feature extraction thread 10 and the gesture feature extraction thread 20 that are in the sleep state perform operations of extracting feature information from a speech and an image that are input.
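The three-thread structure walked through above can be sketched with blocking queues standing in for the thread flags and sleep states; everything here is a skeleton with placeholder payloads, not the patent's implementation:

```python
import queue
import threading

# Feature extraction threads post their results and finish; the
# recognition thread blocks (sleeps) until both features have arrived,
# then produces the integrated result.
speech_q, gesture_q = queue.Queue(), queue.Queue()

def speech_thread():
    speech_q.put("speech-features")     # stand-in for EPD + feature extraction

def gesture_thread():
    gesture_q.put("gesture-features")   # stand-in for tracking + optimal frames

def recognition_thread(out):
    s = speech_q.get()                  # blocks until speech features arrive
    g = gesture_q.get()                 # blocks until gesture features arrive
    out.append((s, g))                  # stand-in for integrated recognition

result = []
threads = [threading.Thread(target=f) for f in
           (speech_thread, gesture_thread, lambda: recognition_thread(result))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Blocking `Queue.get` gives the same effect as the sleep-until-signaled behavior of operations S332 and S333.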
Abstract
Provided is a gesture/speech integrated recognition system and method including: a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section from a gesture of taken images by using information on the detected start point and the end point; and an integrated recognition unit outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance. Accordingly, it is possible to increase a performance of order recognition by integrating the speech and the gesture in a noise environment.
Description
Description
GESTURE/SPEECH INTEGRATED RECOGNITION SYSTEM
AND METHOD
Technical Field
[2] This work was supported by the IT R&D program of MIC/IITA [2006-S-031-01, Five Senses Information Processing Technology Development for Network Based Reality Service].
Background Art
[3] Recently, as a multimedia technology and an interface technology have been developed, researches on multimodal recognition for easily implementing a simple interface between the human and machines by using facial expressions or directions, lip movements, eye-gaze tracking, hand movements, speech, and the like have been actively carried out.
[5] In addition, when a speech and a gesture are integrated to be recognized, there is a problem in that since the speech frame rate is about 10 ms/frame and the image frame rate is about 66.7 ms/frame, the rates for processing the speech frames and the image frames are different. In addition, a section of a gesture requires more time as compared with a section of a speech, so that the lengths of the speech section and the gesture section are different, and this causes a problem of synchronizing the speech and the gesture.
Disclosure of Invention
Technical Problem
[6] An aspect of the present invention provides a means for extracting feature information by detecting an order section from a speech of a user by using an algorithm robust against environmental noise, and detecting a feature section of a gesture by using information on an order start point of the speech, thereby easily recognizing the order in the gesture that is not easily identified.
[7] An aspect of the present invention also provides a means for synchronizing a gesture and a speech by applying optimal frames set in advance to an order section of the gesture detected by a speech end point detection (EPD) value, thereby solving a problem of a synchronization difference between the speech and the gesture while integrated recognition is performed. Technical Solution
[8] According to an aspect of the present invention, there is provided a gesture/speech integrated recognition system including: a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section from a gesture of taken images by using information on the detected start point and the end point; and an integrated recognition unit outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance.
[9] In the above aspect of the present invention, the system may further include a synchronization module including: a gesture start point detection module detecting a start point of the gesture from the taken images by using the detected start point; and an optimal frame applying module calculating and extracting optimal image frames by applying the number of optimal frames set in advance from the start point of the gesture. Here, the gesture start point detection module may detect the start point of the gesture by checking, in the taken images, a start point that is an end point detection (EPD) flag of the detected speech.
[10] In addition, the speech feature extraction unit may include: an EPD module detecting a start point and an end point in the input speech; and a hearing model-based speech feature extraction module extracting speech feature information included in the detected order by using a hearing model-based algorithm. In addition, the speech feature extraction unit may remove noise from the extracted speech feature information.
[11] In addition, the gesture feature extraction unit may include: a hand tracking module tracking movements of a hand in images taken by a camera and transmitting the movements to the synchronization module; and a gesture feature extraction module extracting gesture feature information by using the optimal image frames extracted by the synchronization module.
[12] In addition, the integrated recognition unit may include: an integrated learning DB
(database) control module generating the learning parameter on the basis of an integrated learning model and an integrated learning DB set in advance; an integrated feature control module controlling the extracted speech feature information and the gesture feature information by using the generated learning parameter; and an integrated recognition module generating a result controlled by the integrated feature control module as a recognition result. Here, the integrated feature control module may control feature vectors of the extracted speech feature information and the gesture feature information through the extension and the reduction of the node numbers of input vectors.
[13] According to another aspect of the present invention, there is provided a gesture/ speech integrated recognition method including: a first step of extracting speech feature information by detecting a start point that is an EPD value and an end point of an order in an input speech; a second step of extracting gesture feature information by detecting an order section from a gesture in images input through a camera by using the detected start point of the order; and a third step of outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
[14] In the above aspect of the present invention, in the first step, the speech feature information may be extracted from the order section obtained by the start point and the end point of the order on the basis of a hearing model.
[15] In addition, the second step may include: an A step of tracking a gesture of movements of a hand from the images input through the camera; a B step of detecting an order section from the gesture of the movements of the hand by using the transmitted EPD value; a C step of determining optimal frames from the order section from the gesture by applying optimal frames set in advance; and a D step of extracting gesture feature information from the determined optimal frames.
Advantageous Effects
[16] The gesture/speech integrated recognition system and method according to the present invention increases a recognition rate of a gesture that is not easily identified by detecting an order section of the gesture by using an end point detection (EPD) value that is a start point of an order section of a speech. In addition, optimal frames are applied to the order section of the gesture to synchronize the speech and the
gesture, so that it is possible to implement integrated recognition between the speech and the gesture.
Brief Description of the Drawings
[17] FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
[18] FIG. 2 is a view illustrating a configuration of a gesture/speech integrated recognition system according to an embodiment of the present invention.
[19] FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention. Best Mode for Carrying Out the Invention
[20] Exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[21] In the description, the detailed descriptions of well-known functions and structures may be omitted so as not to hinder the understanding of the present invention.
[22] FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
[24] Referring to FIG. 1, in a gesture/speech integrated recognition technology, orders by a speech or a gesture of a person are integrated to be recognized, and a device representing the five senses is controlled by using a control instruction generated from a result of the recognition.
[25] Specifically, a person 100 orders through a speech 110 and a gesture 120. As an example, in order to order goods in a cyberspace, the person may point at a corn bread with a finger while saying "select a corn bread" as an order for selecting the corn bread from among the displayed goods.
[26] After the person 100 orders by the speech 110 and the gesture 120, feature information on the speech order of the person is extracted through speech recognition 111, and feature information on the gesture of the person is extracted through gesture recognition 121. The extracted feature information on the gesture and the speech is then recognized through integrated recognition 130 as a single user order, in order to increase the recognition rate of the speech, which is affected by environmental noise, and of the gesture, which cannot be easily identified.
[27] The present invention provides a technology of integrated recognition for a speech and a gesture of a person. The recognized order is transmitted by a controller to a speaker 170, a display apparatus 171, a diffuser 172, a touching device 173, and a tasting device 174 which are output devices for the five senses, and the controller controls each device. In addition, the recognition result is transmitted to a network, so
that data of the five senses as the result is transmitted to control each of the output devices. The present invention relates to integrated recognition, and a configuration after the recognition operation can be modified in various manners, so that a detailed description thereof is omitted.
[28] FIG. 2 is a view illustrating a configuration of the gesture/speech integrated recognition system according to an embodiment of the present invention.
[29] Referring to FIG. 2, the gesture/speech integrated recognition system includes a speech feature extraction unit 210 for extracting speech feature information by detecting a start point and an end point of an order from a speech input through a microphone 211, a gesture feature extraction unit 220 for extracting gesture feature information by detecting an order section from a gesture in images taken by a camera by using information on the start point and the end point detected by the speech feature extraction unit 210, a synchronization module 230 for detecting a start point of the gesture from the taken images by using the start point detected by the speech feature extraction unit 210, and calculating optimal image frames by applying the number of optimal frames set in advance from the detected start point of the gesture, and an integrated recognition unit 240 outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance. Now, each component is described in detail.
[30] The speech feature extraction unit 210 includes the microphone 211 receiving a speech of a user, an end point detection (EPD) module 212 detecting a start point and an end point of an order section from the speech of the user, and a hearing model- based speech feature extraction module 213 extracting speech feature information from the order section of the speech detected by the EPD module 212 on the basis of the hearing model. In addition, a channel noise reduction module (not shown) for removing noise from the extracted speech feature information may further be included.
[31] The EPD module 212 detects the start and the end of the order by analyzing the speech input through the wired or wireless microphone.
[32] Specifically, the EPD module 212 acquires a speech signal, calculates an energy value needed to detect the end point of the speech signal, and detects the start and the end of the order by identifying a section to be calculated as the order from the input speech signal.
[33] The EPD module 212 first acquires the speech signal through the microphone and converts the acquired speech into a form suitable for frame-based calculation. Meanwhile, when the speech is input wirelessly, data loss or signal deterioration due to signal interference may occur, so that an operation to deal with the problem during the signal acquisition is needed.
[34] The energy value needed to detect the end point of the speech signal by the EPD
module 212 is calculated as follows. A frame size used to analyze the speech signal is 160 samples, and frame energy is calculated by the following equation.
[35] FrameEnergy = log10( Σ_{n=1}^{N} S(n)² )
[36] Here, S(n) denotes a vocal cord signal sample, and N denotes the number of samples per frame.
[37] The calculated frame energy is used as a parameter for detecting the end point.
[38] The EPD module 212 determines a section to be recognized as the order after calculating the frame energy value. For example, the operation of calculating the start point and the end point of the speech signal is determined by four energy thresholds using the frame energy and 10 conditions. Here, the four energy thresholds and the 10 conditions can be set in various manners and may have values obtained in experiments in order to obtain the order section. For every frame, the EPD algorithm uses the four thresholds to determine whether the start or the end of the order has been reached.
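A minimal sketch of this energy-based endpoint detection follows. It assumes a simplified two-threshold variant, since the patent's four thresholds and ten conditions are not specified; the threshold values and the `min_quiet` hangover are illustrative.

```python
import math

FRAME_SIZE = 160  # samples per frame, as stated above (10 ms at 16 kHz)

def frame_energy(samples):
    """Log frame energy, per the equation above: log10 of the sum of squares."""
    return math.log10(sum(s * s for s in samples) + 1e-12)  # epsilon guards log10(0)

def detect_order_section(frames, t_start, t_end, min_quiet=3):
    """Return (start, end) frame indices of the first order section: energy
    rises above t_start, then stays below t_end for min_quiet frames.
    A stand-in for the patent's four thresholds and ten conditions."""
    start = None
    quiet = 0
    for i, f in enumerate(frames):
        e = frame_energy(f)
        if start is None:
            if e > t_start:
                start = i
        else:
            quiet = quiet + 1 if e < t_end else 0
            if quiet >= min_quiet:
                return start, i - min_quiet
    return (start, len(frames) - 1) if start is not None else None
```

The hangover counter keeps short pauses inside a word from ending the order section prematurely, which is the usual reason EPD algorithms need more than a single threshold.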
[39] The EPD module 212 transmits information on the detected start point (hereinafter, referred to as an "EPD value") of the order to a gesture start point detection module 231 of the synchronization module 230.
[40] In addition, the EPD module 212 transmits information on the order section in the input speech to the hearing model-based speech feature extraction module 213 to extract speech feature information.
[41] The hearing model-based speech feature extraction module 213 that receives the information on the order section in the speech extracts the feature information on the basis of the hearing model from the order section detected by the EPD module 212. Examples of an algorithm used to extract the speech feature information on the basis of the hearing model include an ensemble interval histogram (EIH) and a zero-crossings with peak amplitudes (ZCPA).
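As a rough illustration of the ZCPA idea only (not the patent's implementation): upward zero-crossing intervals give a local frequency estimate, and each interval is weighted by the log of its peak amplitude before being accumulated into a frequency histogram. The bin count and sampling rate below are illustrative.

```python
import math

def zcpa_histogram(signal, fs=16000, n_bins=16):
    """Toy ZCPA sketch: the inverse of each upward zero-crossing interval
    estimates a frequency, weighted by log peak amplitude in the interval."""
    # indices of upward zero-crossings
    ups = [i for i in range(1, len(signal)) if signal[i - 1] < 0 <= signal[i]]
    hist = [0.0] * n_bins
    for a, b in zip(ups, ups[1:]):
        freq = fs / (b - a)                       # inverse interval -> frequency
        peak = max(abs(s) for s in signal[a:b])   # peak amplitude in the interval
        bin_idx = min(int(n_bins * freq / (fs / 2)), n_bins - 1)
        hist[bin_idx] += math.log(1.0 + peak)     # log compression, as in ZCPA
    return hist
```

For a 1 kHz tone sampled at 16 kHz, every interval spans 16 samples, so all the histogram mass lands in the bin covering 1 kHz; the log-amplitude weighting is what gives ZCPA its robustness against additive noise relative to plain zero-crossing counts.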
[42] A channel noise reduction module (not shown) removes noise from the speech feature information extracted by the hearing model-based speech feature extraction module 213, and the speech feature information from which the noise is removed is transmitted to the integrated recognition unit 240.
[43] The gesture feature extraction unit 220 includes a face and hand detection module
222 for detecting a face and a hand from the images taken by a camera 221, and a gesture feature extraction module 224 for tracking and transmitting movements of the detected hand to the synchronization module 230 and extracting feature information on the gesture by using optimal frames calculated by the synchronization module 230.
[44] The face and hand detection module 222 detects the face and the hand that gives a gesture, and a hand tracking module 223 continuously tracks the hand movements in the images. In the description, the hand tracking module 223 tracks the hand. However, it will be understood by those of ordinary skill in the art that the hand tracking module
223 may track various body portions having movements recognized as a gesture.
[45] The hand tracking module 223 continuously stores the hand movements as time elapses, and a part recognized as a gesture order in the hand movement is detected by the synchronization module 230 by using the EPD value transmitted from the speech feature extraction unit 210. Now, the synchronization module 230 which detects a section recognized as the gesture order in the hand movements by using the EPD value and applies the optimal frames for synchronization between speeches and gestures will be described.
[46] The synchronization module 230 includes the gesture start point detection module
231 for detecting the start point of the gesture by using the EPD value and the images showing the hand movements, and an optimal frame applying module 232 for calculating optimal image frames needed for integrated recognition by using a start frame of the gesture calculated by using the detected start point of the gesture.
[47] When an EPD value of a speech is detected by the EPD module 212 while a speech signal and an image signal are input in real-time, the gesture start point detection module 231 of the synchronization module 230 checks a speech EPD flag in the image signal. In this way, the gesture start point detection module 231 calculates the start frame of the gesture. In addition, the optimal frame applying module 232 calculates optimal image frames needed for integrated recognition by using the calculated start frame of the gesture and transmits the optimal image frames to the gesture feature extraction module 224. In order to calculate the optimal image frames needed for integrated recognition applied by the optimal frame applying module 232, the number of frames which are determined to have a high recognition rate of gestures is set in advance, and when the start frame of the gesture is calculated by the gesture start point detection module 231, the optimal image frames are determined.
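Assuming the rates described earlier (speech at 10 ms/frame, camera at 15 fps), the optimal frame applying module might be sketched as follows. The window length `N_OPTIMAL_FRAMES` is hypothetical, since the patent only says the number is set in advance from experiments.

```python
IMAGE_FPS = 15
SPEECH_FRAME_MS = 10.0
N_OPTIMAL_FRAMES = 20  # hypothetical; the patent only says it is tuned by experiment

def optimal_frame_window(speech_epd_frame: int) -> range:
    """Image-frame indices used for gesture features: start at the image
    frame containing the speech EPD flag, take a fixed number of frames."""
    start = int(speech_epd_frame * SPEECH_FRAME_MS * IMAGE_FPS / 1000.0)
    return range(start, start + N_OPTIMAL_FRAMES)
```

Fixing the window length is what makes speech/gesture synchronization tractable: however long the gesture actually lasts, the recognizer always sees the same number of image frames.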
[48] The integrated recognition unit 240 includes an integrated model generation module 242 for generating an integrated model for effectively integrating the speech feature information and the gesture feature information on the basis of a learning model, an integrated learning database (DB) 244 implemented to be proper for integrated recognition algorithm development based on a statistical model, an integrated learning DB control module 243 for controlling the integrated model generation module 242, the integrated learning DB 244, and a learning parameter, an integrated feature control module 241 for controlling the learning parameter and feature vectors of the input speech feature information and the gesture feature information, and an integrated recognition module 245 for generating recognition results and providing various functions.
[49] The integrated model generation module 242 generates a high-performance integrated model in order to effectively integrate the speech feature information with the gesture feature information. In order to determine the high-performance integrated model, various learning algorithms such as the hidden Markov model (HMM), the neural network (NN), and dynamic time warping (DTW) are implemented and experiments are performed. Particularly, according to the embodiment of the present invention, a method of optimizing NN parameters by determining an integrated model on the basis of the NN to obtain high-performance integrated recognition may be used. However, the generation of the high-performance integrated model has a problem of synchronizing two modalities that have different frame numbers in the learning model.
[50] The problem of the synchronization in the learning model is referred to as a learning model optimization problem. According to the embodiment of the present invention, an integration layer is provided, and a method of connecting the speech and the gesture is optimized in the layer. For the optimization, an overlapping length between the speech and the gesture is calculated with respect to a time axis, and the synchronization is performed on the basis of the overlapping length. The overlapping length is then used to search for the connection method having the highest recognition rate through recognition rate experiments.
[51] The integrated learning DB 244 implements an integrated recognition DB to be proper for the integrated recognition algorithm development based on the statistical model.
[52] For example, data of ten words in various age groups is synchronized and collected by a stereo camera and a wireless microphone. Table 1 shows an order set defined to integrate gestures and speeches. The defined order set is obtained on the basis of natural gestures that people can understand without learning.
[53] Table 1
[54] Here, the speech is recorded as a single-channel pulse code modulation (PCM) waveform at a sampling rate of 16 kHz with 16 bits per sample. In addition, 24-bit BITMAP images having a size of 320x240 are recorded at 15 frames per second with a blue-screen background in a lighting condition having four fluorescent boxes by using an STH-DCSG-C stereo camera. Since the stereo camera does not have a speech interface, a speech collection module and an image collection module are independently provided, and a synchronization program for the image and the speech is written to collect data, in which the speech recording program controls the image collecting process through inter-process communication (IPC). The image collection module is configured by using the open source computer vision library (OpenCV) and a small vision system (SVS).
[55] Stereo camera images have to be adjusted to the real recording environment through an additional calibration process, and in order to acquire optimal images, the associated gain, exposure, brightness, red, and blue parameter values are modified to control color, exposure, and white balance (WB) values. Calibration information and parameter information are stored in an additional ini file so that the image storage module can call and use the information.
[56] The integrated learning DB control module 243 generates the learning parameter on the basis of the integrated learning DB 244 that is related to the integrated model generation module 242 to be generated and stored in advance.
[57] The integrated feature control module 241 controls the learning parameter generated by the integrated learning DB control module 243 and the feature vectors of the feature information on the speech and the gesture extracted by the speech feature extraction unit 210 and the gesture feature extraction unit 220. The control operation is associated with the extension and the reduction of the node number of input vectors. The integrated feature control module 241 has the integration layer, and the integration layer is developed to effectively integrate the speech and gesture features, whose lengths differ, and to propose a single recognition rate.
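One plausible way to realize the extension and reduction of node numbers is linear interpolation of each feature vector to a fixed input-layer size. The patent does not specify the method, so the sketch below is an assumption.

```python
def resize_feature_vector(vec, n_nodes):
    """Extend or reduce a feature vector to n_nodes values by linear
    interpolation, so speech and gesture features of different lengths
    can share one fixed-size input layer."""
    if n_nodes == 1:
        return [vec[0]]
    if len(vec) == 1:
        return [vec[0]] * n_nodes
    out = []
    for i in range(n_nodes):
        pos = i * (len(vec) - 1) / (n_nodes - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1 - frac) + vec[hi] * frac)
    return out
```

With both modalities resized to the same node count, the integration layer can concatenate them and present a single input to the NN-based integrated model.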
[58] The integrated recognition module 245 generates a recognition result by using a control result obtained by the integrated feature control module 241. In addition, the integrated recognition module 245 provides various functions for interactions with a network or an integration representing unit.
[59] FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention.
[60] Referring to FIG. 3, the gesture/speech integrated recognition method operates with three threads: a speech feature extraction thread 10 for extracting speech features, a gesture feature extraction thread 20 for extracting gesture features, and an integrated recognition thread 30 for performing integrated recognition on the speech and the gesture. The three threads 10, 20, and 30 are generated at the time point when the learning parameter is loaded and are cooperatively operated by using a thread flag. Now, the gesture/speech integrated recognition method in which the three threads 10, 20, and 30 are cooperatively operated is described.
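The cooperative operation of the three threads via a thread flag might look like the following toy sketch. The queue and event names are invented, and the string payloads stand in for real feature extraction; the point is only the sleep-until-signalled coordination.

```python
import queue
import threading

speech_q, gesture_q, result_q = queue.Queue(), queue.Queue(), queue.Queue()
features_ready = threading.Event()  # plays the role of the "thread flag"

def speech_thread():
    speech_q.put("speech-features")    # stand-in for EPD + feature extraction
    features_ready.set()               # wake the recognizer

def gesture_thread():
    gesture_q.put("gesture-features")  # stand-in for tracking + optimal frames

def recognizer_thread():
    features_ready.wait()                    # sleep until signalled
    s, g = speech_q.get(), gesture_q.get()   # blocks until both features exist
    result_q.put((s, g, "recognized"))

threads = [threading.Thread(target=t)
           for t in (speech_thread, gesture_thread, recognizer_thread)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the patent's flow the extractor threads then sleep in turn while the recognizer produces its result, mirroring operations S315, S328, and S332.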
[61] When the user orders by a speech and a gesture, the speech feature extraction thread
10 continuously receives the speech through a wired or wireless microphone (operation S311). Next, the gesture feature extraction thread 20 continuously receives images including the gesture through a camera (operation S320). Speech frames of the speech continuously input through the microphone are calculated (operation S312), and the EPD module 212 detects a start point and an end point (speech EPD value) of an order included in the speech (operation S313). When the speech EPD value is detected, the speech EPD value is transmitted to a synchronization operation 40 of the gesture feature extraction thread 20. In addition, when an order section of the speech is determined by using the start point and the end point of the order included in the speech, the hearing model-based speech feature extraction module 213 extracts a speech feature from the order section on the basis of the hearing model (operation S314) and transmits the extracted speech feature to the integrated recognition thread 30.
[62] The gesture feature extraction thread 20 detects a hand and a face from the images continuously input through the camera (operation S321). When the hand and face are detected, a gesture of the user is tracked (operation S322). Since the gesture of the user continuously changes, a gesture having a predetermined length is stored in a buffer (operation S323).
[63] When the speech EPD value is detected and transmitted while the gesture is stored in the buffer, a speech EPD flag in the gesture images stored in the buffer is checked (operation S324). The start point and the end point of the gesture including the feature information on the images are searched for by using the speech EPD flag (operation S325), and the found gesture feature is stored (operation S326). The stored gesture feature is not in synchronization with the speech, so that optimal frames are calculated from a start frame of the gesture by applying the number of optimal frames set in advance. In addition, the calculated optimal frames are used by the gesture feature extraction module 224 to extract gesture feature information, and the extracted gesture feature information is transmitted to the integrated recognition thread 30.
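Operations S323 through S326 — buffering the gesture continuously and cutting a window when the speech EPD flag arrives — can be sketched as below. The buffer length and window size are illustrative, and `frames_since_epd` is a hypothetical parameter for how far back in the buffer the flag points.

```python
from collections import deque

BUFFER_LEN = 60        # image frames kept (~4 s at 15 fps) -- illustrative
N_OPTIMAL_FRAMES = 20  # preset window applied in operation S325 -- illustrative

buffer = deque(maxlen=BUFFER_LEN)  # continuously stored hand positions (S323)

def on_speech_epd(frames_since_epd: int):
    """When the speech EPD flag arrives, take the buffered frames that start
    at the gesture start point and span the preset optimal window."""
    frames = list(buffer)
    start = max(0, len(frames) - frames_since_epd)
    return frames[start:start + N_OPTIMAL_FRAMES]
```

Because `deque(maxlen=...)` discards the oldest frames automatically, the thread never blocks on buffer management while the camera keeps delivering images.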
[64] When the speech feature extraction thread 10 and the gesture feature extraction thread 20 successfully extract feature information on the speech and the gesture, the speech/gesture feature extraction threads 10 and 20 are in a sleep state while the integrated recognition thread checks a recognition result (operations S328 and S315).
[65] The integrated recognition thread 30 generates a high-performance integrated model in advance by using the integrated model generation module 242 before receiving the speech feature information and the gesture feature information and controls the generated integrated model and the integrated learning DB 244 so that the integrated learning DB control module 243 generates and loads the learning parameter (operation S331). When the learning parameter is loaded, the integrated recognition thread 30 is maintained in the sleep state until the speech/gesture feature information is received (operation S332).
[66] When the integrated recognition thread 30 that is in the sleep state receives a signal associated with the feature information after the extraction of the feature information on the speech and the gesture is completed (operation S333), the integrated recognition thread 30 loads each feature in a memory (operation S334). When the feature information on the speech and the gesture is loaded, the recognition result is calculated by using the optimized integrated learning model and the learning parameter that are set in advance (operation S335).
[67] When the recognition result is calculated by the integrated recognition unit 240, the speech feature extraction thread 10 and the gesture feature extraction thread 20 that are in the sleep state perform operations of extracting feature information from a speech and an image that are input.
[68] While the present invention has been shown and described in connection with the exemplary embodiments, it will be apparent to those skilled in the art that modifications and variations can be made without departing from the spirit and scope of the invention as defined by the appended claims.
Claims
[1] A gesture/speech integrated recognition system comprising: a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section from a gesture of taken images by using information on the detected start point and the end point; and an integrated recognition unit outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance.
[2] The system of claim 1, further comprising a synchronization module comprising: a gesture start point detection module detecting a start point of the gesture from the taken images by using the detected start point; and an optimal frame applying module calculating and extracting optimal image frames by applying the number of optimal frames set in advance from the start point of the gesture.
[3] The system of claim 2, wherein the gesture start point detection module detects the start point of the gesture by checking a start point that is an EPD (end point detection) flag of the detected speech from the taken images.
[4] The system of claim 1, wherein the speech feature extraction unit comprises: an EPD module detecting a start point and an end point in the input speech; and a hearing model-based speech feature extraction module extracting speech feature information included in the detected order by using a hearing model- based algorithm.
[5] The system of claim 4, wherein the speech feature extraction unit removes noise from the extracted speech feature information.
[6] The system of claim 3, wherein the gesture feature extraction unit comprises:
a hand tracking module tracking movements of a hand in images taken by a camera and transmitting the movements to the synchronization module; and a gesture feature extraction module extracting gesture feature information by using the optimal image frames extracted by the synchronization module.
[7] The system of claim 1, wherein the integrated recognition unit comprises: an integrated learning DB (database) control module generating the learning parameter on the basis of an integrated learning model and an integrated learning DB set in advance; an integrated feature control module controlling the extracted speech feature information and the gesture feature information by using the generated learning
parameter; and an integrated recognition module generating a result controlled by the integrated feature control module as a recognition result.
[8] The system of claim 7, wherein the integrated learning model is generated on the basis of a neural network learning algorithm.
[9] The system of claim 7, wherein the integrated learning DB is implemented to be proper to an integrated recognition algorithm based on a statistical model by integrating feature information on speeches and gestures in various age groups by using a stereo camera and a wireless microphone.
[10] The system of claim 7, wherein the integrated recognition module includes an integration layer for integrating the extracted speech feature information with the gesture feature information.
[11] The system of claim 7, wherein the integrated feature control module controls feature vectors of the extracted speech feature information and the gesture feature information through the extension and the reduction of the node numbers of input vectors.
[12] A gesture/speech integrated recognition method comprising: a first step of extracting speech feature information by detecting a start point that is an EPD value and an end point of an order in an input speech; a second step of extracting gesture feature information by detecting an order section from a gesture in images input through a camera by using the detected start point of the order; and a third step of outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
[13] The method of claim 12, wherein in the first step, the speech feature information is extracted from the order section obtained by the start point and the end point of the order on the basis of a hearing model.
[14] The method of claim 12, wherein the second step comprises: an A step of tracking a gesture of movements of a hand from the image input through the camera; a B step of detecting an order section from the gesture of the movements of the hand by using the transmitted EPD value; a C step of determining optimal frames from the order section from the gesture by applying optimal frames set in advance; and a D step of extracting gesture feature information from the determined optimal frames.
[15] The method of claim 12, wherein the first step further comprises removing noise from the extracted speech feature information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009540141A JP2010511958A (en) | 2006-12-04 | 2007-12-03 | Gesture / voice integrated recognition system and method |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2006-0121836 | 2006-12-04 | ||
KR20060121836 | 2006-12-04 | ||
KR10-2007-0086575 | 2007-08-28 | ||
KR1020070086575A KR100948600B1 (en) | 2006-12-04 | 2007-08-28 | System and method for integrating gesture and voice |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008069519A1 true WO2008069519A1 (en) | 2008-06-12 |
Family
ID=39492339
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2007/006189 WO2008069519A1 (en) | 2006-12-04 | 2007-12-03 | Gesture/speech integrated recognition system and method |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2008069519A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0594129A2 (en) * | 1992-10-20 | 1994-04-27 | Hitachi, Ltd. | Display system capable of accepting user commands by use of voice and gesture inputs |
JPH07306772A (en) * | 1994-05-16 | 1995-11-21 | Canon Inc | Method and device for information processing |
KR20010075838A (en) * | 2000-01-20 | 2001-08-11 | 오길록 | Apparatus and method for processing multimodal interface |
US20030001908A1 (en) * | 2001-06-29 | 2003-01-02 | Koninklijke Philips Electronics N.V. | Picture-in-picture repositioning and/or resizing based on speech and gesture control |
2007-12-03: PCT application PCT/KR2007/006189 (WO2008069519A1) filed (active Application Filing).
Cited By (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012525625A (en) * | 2009-04-30 | 2012-10-22 | サムスン エレクトロニクス カンパニー リミテッド | User intention inference apparatus and method using multimodal information |
JP2011081541A (en) * | 2009-10-06 | 2011-04-21 | Canon Inc | Input device and control method thereof |
EP2347810A3 (en) * | 2009-12-30 | 2011-09-14 | Crytek GmbH | Mobile input and sensor device for a computer-controlled video entertainment system |
US9344753B2 (en) | 2009-12-30 | 2016-05-17 | Crytek Gmbh | Mobile input and sensor device for a computer-controlled video entertainment system |
US8977972B2 (en) | 2009-12-31 | 2015-03-10 | Intel Corporation | Using multi-modal input to control multiple objects on a display |
GB2476711B (en) * | 2009-12-31 | 2012-09-05 | Intel Corp | Using multi-modal input to control multiple objects on a display |
GB2476711A (en) * | 2009-12-31 | 2011-07-06 | Intel Corp | Using multi-modal input to control multiple objects on a display |
WO2011130083A3 (en) * | 2010-04-14 | 2012-02-02 | T-Mobile Usa, Inc. | Camera-assisted noise cancellation and speech recognition |
WO2011130083A2 (en) * | 2010-04-14 | 2011-10-20 | T-Mobile Usa, Inc. | Camera-assisted noise cancellation and speech recognition |
US8635066B2 (en) | 2010-04-14 | 2014-01-21 | T-Mobile Usa, Inc. | Camera-assisted noise cancellation and speech recognition |
CN102298442A (en) * | 2010-06-24 | 2011-12-28 | 索尼公司 | Gesture recognition apparatus, gesture recognition method and program |
EP2400371A3 (en) * | 2010-06-24 | 2015-04-08 | Sony Corporation | Gesture recognition apparatus, gesture recognition method and program |
US9842168B2 (en) | 2011-03-31 | 2017-12-12 | Microsoft Technology Licensing, Llc | Task driven user intents |
US9858343B2 (en) | 2011-03-31 | 2018-01-02 | Microsoft Technology Licensing Llc | Personalization of queries, conversations, and searches |
US10585957B2 (en) | 2011-03-31 | 2020-03-10 | Microsoft Technology Licensing, Llc | Task driven user intents |
US10642934B2 (en) | 2011-03-31 | 2020-05-05 | Microsoft Technology Licensing, Llc | Augmented conversational understanding architecture |
US10296587B2 (en) | 2011-03-31 | 2019-05-21 | Microsoft Technology Licensing, Llc | Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof |
US9298287B2 (en) | 2011-03-31 | 2016-03-29 | Microsoft Technology Licensing, Llc | Combined activation for natural user interface systems |
US9244984B2 (en) | 2011-03-31 | 2016-01-26 | Microsoft Technology Licensing, Llc | Location based conversational understanding |
US10049667B2 (en) | 2011-03-31 | 2018-08-14 | Microsoft Technology Licensing, Llc | Location-based conversational understanding |
US9760566B2 (en) | 2011-03-31 | 2017-09-12 | Microsoft Technology Licensing, Llc | Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof |
US10061843B2 (en) | 2011-05-12 | 2018-08-28 | Microsoft Technology Licensing, Llc | Translating natural language utterances to keyword search queries |
US9454962B2 (en) | 2011-05-12 | 2016-09-27 | Microsoft Technology Licensing, Llc | Sentence simplification for spoken language understanding |
US9002714B2 (en) | 2011-08-05 | 2015-04-07 | Samsung Electronics Co., Ltd. | Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same |
US9733895B2 (en) | 2011-08-05 | 2017-08-15 | Samsung Electronics Co., Ltd. | Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same |
CN103376891A (en) * | 2012-04-23 | 2013-10-30 | 凹凸电子(武汉)有限公司 | Multimedia system, control method for display device and controller |
WO2014010879A1 (en) * | 2012-07-09 | 2014-01-16 | 엘지전자 주식회사 | Speech recognition apparatus and method |
WO2014078480A1 (en) * | 2012-11-16 | 2014-05-22 | Aether Things, Inc. | Unified framework for device configuration, interaction and control, and associated methods, devices and systems |
AU2013270485C1 (en) * | 2012-12-31 | 2016-01-21 | Huawei Technologies Co. , Ltd. | Input processing method and apparatus |
AU2013270485B2 (en) * | 2012-12-31 | 2015-09-10 | Huawei Technologies Co. , Ltd. | Input processing method and apparatus |
EP2765473A4 (en) * | 2012-12-31 | 2014-12-10 | Huawei Tech Co Ltd | Input processing method and apparatus |
EP2765473A1 (en) * | 2012-12-31 | 2014-08-13 | Huawei Technologies Co., Ltd. | Input processing method and apparatus |
CN103064530A (en) * | 2012-12-31 | 2013-04-24 | 华为技术有限公司 | Input processing method and device |
CN104317392A (en) * | 2014-09-25 | 2015-01-28 | 联想(北京)有限公司 | Information control method and electronic equipment |
CN105792005A (en) * | 2014-12-22 | 2016-07-20 | 深圳Tcl数字技术有限公司 | Recording control method and device |
CN105792005B (en) * | 2014-12-22 | 2019-05-14 | 深圳Tcl数字技术有限公司 | The method and device of video recording control |
EP4220630A1 (en) * | 2016-11-03 | 2023-08-02 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
CN110121696A (en) * | 2016-11-03 | 2019-08-13 | 三星电子株式会社 | Electronic equipment and its control method |
EP3523709A4 (en) * | 2016-11-03 | 2019-11-06 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
WO2018084576A1 (en) | 2016-11-03 | 2018-05-11 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
US20180122379A1 (en) * | 2016-11-03 | 2018-05-03 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
US10679618B2 (en) | 2016-11-03 | 2020-06-09 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
US11908465B2 (en) | 2016-11-03 | 2024-02-20 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
US11521038B2 (en) | 2018-07-19 | 2022-12-06 | Samsung Electronics Co., Ltd. | Electronic apparatus and control method thereof |
US10739864B2 (en) | 2018-12-31 | 2020-08-11 | International Business Machines Corporation | Air writing to speech system using gesture and wrist angle orientation for synthesized speech modulation |
WO2021135281A1 (en) * | 2019-12-30 | 2021-07-08 | 浪潮(北京)电子信息产业有限公司 | Multi-layer feature fusion-based endpoint detection method, apparatus, device, and medium |
CN115881118A (en) * | 2022-11-04 | 2023-03-31 | 荣耀终端有限公司 | Voice interaction method and related electronic equipment |
CN115881118B (en) * | 2022-11-04 | 2023-12-22 | 荣耀终端有限公司 | Voice interaction method and related electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2008069519A1 (en) | Gesture/speech integrated recognition system and method | |
KR100948600B1 (en) | System and method for integrating gesture and voice | |
CN107799126B (en) | Voice endpoint detection method and device based on supervised machine learning | |
US11762474B2 (en) | Systems, methods and devices for gesture recognition | |
US8793134B2 (en) | System and method for integrating gesture and sound for controlling device | |
US11854550B2 (en) | Determining input for speech processing engine | |
US20190188903A1 (en) | Method and apparatus for providing virtual companion to a user | |
CN110322760B (en) | Voice data generation method, device, terminal and storage medium | |
CN110310623A (en) | Sample generating method, model training method, device, medium and electronic equipment | |
CN106157956A (en) | The method and device of speech recognition | |
JP2012014394A (en) | User instruction acquisition device, user instruction acquisition program and television receiver | |
KR20100062207A (en) | Method and apparatus for providing animation effect on video telephony call | |
CN112016367A (en) | Emotion recognition system and method and electronic equipment | |
CN114779922A (en) | Control method for teaching apparatus, control apparatus, teaching system, and storage medium | |
CN111326152A (en) | Voice control method and device | |
CN107452381B (en) | Multimedia voice recognition device and method | |
CN113129867A (en) | Training method of voice recognition model, voice recognition method, device and equipment | |
CN110728993A (en) | Voice change identification method and electronic equipment | |
KR102291740B1 (en) | Image processing system | |
CN113497912A (en) | Automatic framing through voice and video positioning | |
CN108628454B (en) | Visual interaction method and system based on virtual human | |
KR20130054131A (en) | Display apparatus and control method thereof | |
WO2021223765A1 (en) | Voice recognition method, voice recognition system and electrical device | |
Freitas et al. | Multimodal silent speech interface based on video, depth, surface electromyography and ultrasonic doppler: Data collection and first recognition results | |
KR102265874B1 (en) | Method and Apparatus for Distinguishing User based on Multimodal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 07851181; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2009540141; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 07851181; Country of ref document: EP; Kind code of ref document: A1 |