WO2016094418A1 - Dynamic local ASR vocabulary - Google Patents
Dynamic local ASR vocabulary
- Publication number
- WO2016094418A1 (application PCT/US2015/064523)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- asr
- speech
- cloud
- vocabulary
- mobile device
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/228—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context
Definitions
- the present application relates generally to speech processing and, more specifically, to automatic speech recognition (ASR).
- Performance of ASR on a mobile device can be limited due to limitations of a mobile device's computing resources, which may, for example, lead to a shortage of a vocabulary for ASR.
- An example method allows defining a user actionable screen content associated with a mobile device.
- the method includes labeling at least a portion of the user actionable screen content.
- the method includes creating, based on the labeling, a first vocabulary.
- the first vocabulary is associated with a first ASR engine.
- the user actionable screen content is based partially on the user interaction with the mobile device.
- the first ASR engine is associated with the mobile device.
- the first vocabulary includes words associated with at least one function of the mobile device. In certain embodiments, a size of the first vocabulary is limited by resources of the mobile device.
- the method further includes detecting at least one key phrase in speech, the speech including at least one captured sound.
- the method allows determining whether the key phrase is a local key phrase or a cloud-based key phrase. If the key phrase is a local key phrase, ASR on the speech is performed with the first ASR engine. If the key phrase is a cloud-based key phrase, then the speech and/or the key phrase are forwarded to at least one cloud-based computing resource (a cloud).
- ASR is performed on the speech with a second ASR engine. The second ASR engine is associated with a second vocabulary and the cloud.
- the method allows performing at least noise suppression and/or noise reduction on the speech before performing the ASR on the speech by the first ASR engine to improve robustness of the ASR.
- the first vocabulary is smaller than the second vocabulary.
- the first vocabulary includes from 1 to 100 words, and the second vocabulary includes more than 100 words.
- the determination as to whether the at least one key phrase is a local key phrase or a cloud-based key phrase is based, at least partially, on a profile.
- the profile may be associated with the mobile device and/or the user.
- the profile includes commands that can be executed locally on the mobile device, commands that can be executed remotely in the cloud, and commands that can be executed both locally on the mobile device and remotely in the cloud.
- the profile includes at least one rule. The rule may include forwarding the speech to the cloud to perform the ASR on the speech by the second ASR engine if a score of performing the ASR on the speech by the first ASR engine is less than a pre-determined value.
- the steps of the method for providing dynamic local ASR vocabulary are stored as instructions on a non-transitory machine-readable medium.
- FIG. 1 is a block diagram illustrating a system in which methods and systems for providing a dynamic local ASR vocabulary can be practiced, according to various example embodiments.
- FIG. 2 is a block diagram of an example mobile device, in which a method for providing a dynamic local ASR vocabulary can be practiced.
- FIG. 3 is a block diagram showing a system for providing a dynamic local ASR vocabulary and hierarchical assignment of recognition tasks, according to various example embodiments.
- FIG. 4 is a flow chart illustrating steps of a method for providing a dynamic local ASR vocabulary.
- FIG. 5 is a flow chart illustrating steps of a method for hierarchical assignment of recognition tasks, according to various example embodiments.
- FIG. 6 is a flow chart illustrating steps of a method for selecting performance of speech recognition based on a profile, according to various example embodiments.
- FIG. 7 is an example computer system that may be used to implement embodiments of the disclosed technology.
- the present disclosure is directed to systems and methods for providing a dynamic local automatic speech recognition (ASR) vocabulary.
- Various embodiments of the present technology can be practiced with mobile devices configured to capture audio signals and may provide for improvement of automatic speech recognition in the captured audio.
- the mobile devices may include: radio frequency (RF) receivers, transmitters, and transceivers; wired and/or wireless telecommunications and/or networking devices; amplifiers; audio and/or video players; encoders; and decoders.
- Mobile devices can include input devices such as buttons, switches, keys, keyboards, trackballs, sliders, touch screens, one or more microphones, gyroscopes, accelerometers, global positioning system (GPS) receivers, and the like.
- Mobile devices can include outputs, such as LED indicators, video displays, touchscreens, speakers, and the like.
- mobile devices are hand-held devices, such as notebook computers, tablet computers, phablets, smart phones, personal digital assistants, media players, mobile telephones, video cameras, and the like.
- the mobile devices are used in stationary and portable environments.
- the stationary environments include residential and commercial buildings or structures, and the like.
- the stationary environments can include living rooms, bedrooms, home theaters, conference rooms, auditoriums, business premises, and the like.
- the portable environments can include moving vehicles, moving persons, other transportation means, and the like.
- a method for providing a dynamic local ASR vocabulary includes defining a user actionable screen content associated with a mobile device.
- the user actionable screen content may be based on the user interaction with the mobile device.
- the method can include labeling at least a portion of the user actionable screen content.
- the method may also include creating, based on the labeling, a local vocabulary.
- the local vocabulary can correspond to a local ASR engine associated with the mobile device.
- Various embodiments of the method can include performing noise suppression and noise reduction on speech prior to performing the ASR on the speech by the first ASR engine to improve robustness of the ASR.
- the speech may include at least one captured sound.
- the system 100 can include a mobile device 110 and one or more cloud-based computing resources 130, also referred to herein as a computing cloud(s) 130 or a cloud 130.
- the cloud-based computing resource(s) 130 can include computing resources (hardware and software) available at a remote location and accessible over a network (for example, the Internet).
- the cloud-based computing resources 130 are shared by multiple users and can be dynamically re-allocated based on demand.
- the cloud-based computing resources 130 include one or more server farms/clusters, including a collection of computer servers which can be co-located with network switches and/or routers.
- the mobile device 110 can be connected to the computing cloud 130 via one or more wired or wireless communications networks 140.
- the mobile device 110 includes microphone(s) (e.g., transducers) 120 configured to receive voice input/acoustic sound from a user 150.
- the voice input/acoustic sound can be contaminated by a noise 160.
- Noise sources can include street noise, ambient noise, speech from entities other than an intended speaker(s), and the like.
- FIG. 2 is a block diagram illustrating components of the mobile device 110, according to various example embodiments.
- the mobile device 110 includes one or more microphones 120, a processor 210, audio processing system 220, a memory storage 230, one or more communication devices 240, and a graphic display system 250.
- the mobile device 110 also includes additional or other components needed for operations of mobile device 110.
- the mobile device 110 includes fewer components that perform similar or equivalent functions to those described with reference to FIG. 2.
- a beam-forming technique can be used to simulate forward-facing and backward-facing directional microphone responses.
- a level difference is obtained using the simulated forward-facing and the backward-facing directional microphone.
- the level difference can be used to discriminate speech and noise in, for example, the time-frequency domain, which can be further used in noise and/or echo reduction.
- some microphones 120 are used mainly to detect speech, and other microphones 120 are used mainly to detect noise.
- some microphones 120 are used to detect both noise and speech.
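The level-difference idea described above can be sketched as follows. This is a minimal illustration, not the method of the referenced applications: it assumes the per-bin energies of the simulated forward-facing and backward-facing responses are already available, and the function names and the 6 dB threshold are hypothetical choices made for the example.

```python
import math

def level_difference_db(front_energy, back_energy, eps=1e-12):
    """Per-bin level difference in dB between the simulated forward-facing
    and backward-facing directional responses (eps avoids log of zero)."""
    return [
        10.0 * math.log10((f + eps) / (b + eps))
        for f, b in zip(front_energy, back_energy)
    ]

def speech_mask(front_energy, back_energy, threshold_db=6.0):
    """Flag a time-frequency bin as speech-dominated when the forward response
    exceeds the backward response by at least threshold_db (assumed value)."""
    diffs = level_difference_db(front_energy, back_energy)
    return [d >= threshold_db for d in diffs]

# Hypothetical energies for four time-frequency bins.
front = [4.0, 1.0, 0.5, 8.0]
back = [0.5, 1.0, 2.0, 1.0]
mask = speech_mask(front, back)
print(mask)
```

Bins flagged as speech-dominated could be kept, and the remaining bins attenuated, as one possible input to the noise and/or echo reduction mentioned above.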
- the acoustic signals once received, for example, captured by microphone(s) 120, are converted into electric signals, which, in turn, are converted, by the audio processing system 220, into digital signals for processing in accordance with some embodiments.
- the processed signals are transmitted for further processing to the processor 210.
- Audio processing system 220 can be operable to process an audio signal.
- the acoustic signal is captured by the microphone 120.
- acoustic signals detected by the microphone(s) 120 are used by audio processing system 220 to separate desired speech (for example, keywords) from the noise, thereby providing more robust ASR.
- Noise reduction may include noise cancellation and/or noise suppression.
- noise reduction methods are described in U.S. Patent Application No. 12/215,980, entitled "System and Method for Providing Noise Suppression Utilizing Null Processing Noise Subtraction," filed June 30, 2008, and in U.S. Patent Application No. 11/699,732, entitled "System and Method for Utilizing Omni-Directional Microphones for Speech Enhancement," filed January 29, 2007, which are incorporated herein by reference in their entireties.
- the processor 210 may include hardware and/or software operable to execute computer programs stored in the memory storage 230.
- the processor 210 can use floating point operations, complex operations, and other operations, including providing a dynamic local ASR vocabulary, keyword detection, and hierarchical assignment of recognition tasks.
- the processor 210 of the mobile device 110 includes, for example, at least one of a digital signal processor, image processor, audio processor, general-purpose processor, and the like.
- the example mobile device 110 is operable, in various embodiments, to communicate over one or more wired or wireless communications networks 140 (as shown in FIG. 1), for example, via communication devices 240.
- the mobile device 110 sends at least an audio signal (speech) over a wired or wireless communications network 140.
- the mobile device 110 encapsulates and/or encodes the at least one digital signal for transmission over a wireless network (e.g., a cellular network).
- the digital signal can be encapsulated over Internet Protocol Suite (TCP/IP) and/or User Datagram Protocol (UDP).
- the wired and/or wireless communications networks 140 can be circuit switched and/or packet switched.
- the wireless communications network(s) 140 can include any number of wireless access points, base stations, repeaters, and the like.
- the wired and/or wireless communications networks 140 may conform to an industry standard(s), be proprietary, or combinations thereof. Various other suitable wired and/or wireless communications networks 140, other protocols, and combinations thereof can be used.
- the graphic display system 250 can be configured at least to provide a graphic user interface.
- a touch screen associated with the graphic display system 250 is utilized to receive input from a user. Options can be provided to a user via an icon or text buttons once the user touches the screen.
- the graphic display system 250 can be used for providing a user actionable content and generating a dynamic local ASR vocabulary.
- FIG. 3 is a block diagram showing a system 300 for providing a dynamic local ASR vocabulary and hierarchical assignment of recognition tasks, according to an example embodiment.
- the example system 300 may include a key phrase detector 310, a local ASR module 320, and a cloud-based ASR module 330.
- the modules 310-330 can be implemented as executable instructions stored either locally in memory of the mobile device 110 or in computing cloud 130.
- the key phrase detector 310 may recognize the presence of one or more keywords in an acoustic audio signal, the acoustic audio signal representing at least one sound captured, for example, by microphones 120 of the mobile device 110.
- the term key phrase as used herein may comprise one or more key words.
- the key phrase detector 310 can determine whether the one or more keywords represent one or more commands that can be performed locally on a mobile device, one or more commands that can be performed in the computing cloud, or one or more commands that can be performed locally and in the computing cloud. In various embodiments, the determination is based on a profile 350.
- the profile 350 can include user specific settings and/or mobile device specific settings and rules for processing acoustic audio signal(s). Based on the determination, the acoustic audio signal can be sent to local ASR 320 or cloud-based ASR 330.
- the local ASR module 320 can be associated with a dynamic local ASR vocabulary.
- the cloud-based ASR 330 is based on the cloud-based vocabulary 360. In some embodiments, the cloud-based vocabulary 360 includes more entries than the dynamic local ASR vocabulary 340.
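The routing decision made by the key phrase detector 310 from the profile 350 can be sketched as below. This is only an illustration under stated assumptions: the `Profile` class, the `route` function, the sample key phrases assigned to each category, and the choice to send unknown phrases to the cloud are all hypothetical and are not specified by the source.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Hypothetical stand-in for profile 350: key phrases grouped by where
    the associated command can be performed."""
    local: set = field(default_factory=set)
    cloud: set = field(default_factory=set)
    both: set = field(default_factory=set)

def route(key_phrase, profile):
    """Return 'local', 'cloud', or 'both' for a detected key phrase."""
    phrase = key_phrase.strip().lower()
    if phrase in profile.both:
        return "both"
    if phrase in profile.local:
        return "local"
    # Assumption: cloud key phrases and anything unrecognized go to the cloud,
    # where the larger vocabulary is available.
    return "cloud"

profile = Profile(
    local={"call", "text", "play", "open"},
    cloud={"find the address"},
    both={"schedule an appointment"},  # hypothetical assignment
)
print(route("Call", profile))
print(route("find the address", profile))
```

A real detector would operate on the acoustic audio signal rather than on text; text is used here only to keep the sketch self-contained.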
- when speech received from user 150 includes a recognized local command or key phrase (the key phrase including one or more keywords), the command can be performed locally (e.g., on the mobile device 110).
- for example, for the command "Call Eugene," the key phrase detector 310 determines that "Call" is a local key phrase and then uses the local ASR engine 320 (also referred to herein as local recognizer) to recognize the rest of the command ("Eugene" in this example).
- a record (e.g., information for a "contact" including a telephone number or other identifier) associated with the name spoken after the "Call" command is retrieved locally on the mobile device 110 (not in the cloud-based computing resource(s) 130), and a call operation is initiated locally using the record.
- Other local commands can operate on content stored locally (e.g., on the mobile device 110), such as commands associated with contact information (e.g., Call, Text, Email), audio or video content (e.g., Play), applications or bookmarked webpages (e.g., Open), and locations (e.g., Navigate).
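The local handling of a command such as "Call Eugene" can be sketched as follows. The sketch works on already-recognized text for simplicity, whereas the source describes detection and recognition on audio; the `CONTACTS` store, the phone number, and the function names are hypothetical.

```python
# Key phrases taken from the local command examples in the text.
LOCAL_KEY_PHRASES = ("call", "text", "email", "play", "open", "navigate")

# Hypothetical locally stored contact records (name -> telephone number).
CONTACTS = {"eugene": "+1-555-0100"}

def split_command(text):
    """Split a full command into (key phrase, rest), or return None when the
    first word is not a known local key phrase."""
    words = text.strip().split()
    if not words:
        return None
    key = words[0].lower()
    if key not in LOCAL_KEY_PHRASES:
        return None
    return key, " ".join(words[1:])

def handle_call(text):
    """Resolve the spoken name against the local records and describe the
    call operation that would be initiated; None if it cannot be handled."""
    parsed = split_command(text)
    if parsed is None or parsed[0] != "call":
        return None
    record = CONTACTS.get(parsed[1].lower())
    if record is None:
        return None
    return f"dialing {record}"

print(split_command("Call Eugene"))
print(handle_call("Call Eugene"))
```

Returning `None` for anything outside the local key phrases leaves room for the caller to forward the speech to the cloud instead.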
- Some embodiments include deciding (for example, by the key phrase detector 310) that commands are to be performed using a cloud-based computing resource(s) 130, instead of locally (e.g., on the mobile device 110), based on the command key phrase, or based upon the recognition of a likelihood of a match of models and observed extracted audio parameters. For example, when the speech received from a user corresponds to a voice command identified as a command for execution using the cloud-based computing resource(s) 130 (e.g., since it cannot be handled locally on the mobile device), a decision can be made to have the speech and/or recognized text forwarded to the cloud-based computing resources 130 for the ASR. Furthermore, for speech received from a user that includes a command recognized by the ASR as a command for execution by the cloud-based computing resource(s) 130, the command can be selected or designated for execution by the cloud-based computing resource(s) 130.
- the key phrase "find the address" of the voice command is identified locally by the ASR. The voice command (e.g., audio and/or recognized text) is then forwarded to the cloud-based computing resource 130 for the ASR and for execution of the recognized voice command by the cloud-based computing resource 130.
- some commands can use processor resources, for example, context awareness obtained from a sensor hub or a geographic locator, such as a GPS, beacon, Bluetooth Low Energy (“BLE”), or WiFi, and store information more efficiently when delivered via cloud-based computing resources 130 than when performed locally.
- Some embodiments can allow initiating of execution of and/or performing commands using both or different combinations of local resources (e.g., processor resources provided by and information stored on a mobile device) and cloud-based computing resource(s) 130 (e.g., processor resources provided by and information stored in the cloud-based computing resource(s) 130), depending upon the command.
- execution of some commands e.g., "call”
- the terms "execution" and "executing," as referred to herein, refer to executing all or part of the steps required to fully perform certain operations.
- Some embodiments can allow determining at least one or more commands that can be performed locally, one or more commands that can be performed by a cloud-based computing resource(s), and one or more commands that can be performed using a combination of local resources and a cloud-based computing resource(s).
- the determination is based, for example, at least on specifications and/or characteristics of the mobile device 110.
- the determination is based, for example, in part on the characteristics or preferences of a user 150 of the mobile device 110.
- Some embodiments include a profile 350, which may be associated with a certain mobile device 110 (e.g., a make and model) and/or the user 150.
- the profile 350 can indicate, for example, at least one of one or more commands that may be performed locally, one or more commands that can be performed by cloud-based computing resources 130, and one or more commands that may be performed using a combination of local resources and a cloud-based computing resource(s) 130.
- Various embodiments include a plurality of profiles, each profile being associated with a different mobile device (e.g., a make and model) and/or a different user.
- Some embodiments can include a default profile, which may be used when information concerning the mobile device and/or user is not available. The default profile can specify, for example, that all commands are performed using cloud-based computing resources 130, or that commands known to be efficiently delivered locally (for example, via minimal usage of local processing and information storage resources) are performed locally.
- FIG. 4 is a flow chart illustrating a method 400 for providing a dynamic local ASR vocabulary, according to an example embodiment.
- a user actionable screen content can be defined.
- the user actionable screen content can be at least partially based on user interactions.
- the user actionable screen content is associated with a mobile device.
- at least a portion of the user actionable screen content can be labeled.
- a local vocabulary can be generated based on the labeling.
- the local vocabulary can be associated with a local ASR engine.
- the local ASR engine is associated with the mobile device.
- the local vocabulary includes words associated with certain functions of the mobile device.
- the local vocabulary can be limited by resources of the mobile device (such as memory and processor speed).
- the local ASR engine and the local vocabulary are used to recognize one or more key phrases in a speech, for example, in audio signal captured by one or more microphones of the mobile device.
- noise suppression or noise reduction is performed on the speech prior to performing the local ASR.
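The steps of method 400 (define user actionable screen content, label it, and generate a size-limited local vocabulary) can be sketched as below. The `ScreenElement` type, the word-splitting rule, and the 100-word cap are assumptions made for illustration; the source only states that the vocabulary is based on the labeling and limited by device resources.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScreenElement:
    """Hypothetical labeled piece of screen content."""
    label: str
    actionable: bool

def build_local_vocabulary(elements, max_words=100):
    """Collect unique lowercase words from the labels of actionable elements,
    in screen order, stopping once max_words entries have been gathered."""
    vocabulary = []
    seen = set()
    for element in elements:
        if not element.actionable:
            continue  # only user actionable content contributes words
        for word in element.label.lower().split():
            if word in seen:
                continue
            if len(vocabulary) >= max_words:
                return vocabulary
            seen.add(word)
            vocabulary.append(word)
    return vocabulary

screen = [
    ScreenElement("Call", True),
    ScreenElement("Open Maps", True),
    ScreenElement("Battery 80%", False),
    ScreenElement("call back", True),
]
print(build_local_vocabulary(screen))
```

Rebuilding the vocabulary whenever the screen content changes is what would make it dynamic; the cap stands in for the memory and processor limits of the device.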
- FIG. 5 is a flow chart illustrating a method 500 for hierarchical assignment of recognition tasks, according to various embodiments.
- speech (audio) is received by the mobile device; the mobile device may sense/detect the speech through at least one transducer such as a microphone.
- the device can detect whether the speech (audio) includes a voice command. In various embodiments, this detection is performed using a module that includes a key phrase detector (e.g., a local recognizer/engine).
- the "full" command refers to a key phrase comprising a command, plus additional speech (for example, "call Eugene", where the key phrase is "call" and the full command is "call Eugene").
- the module both recognizes the "full" command and makes the determination as to whether the full command can be executed locally.
- the module can be operable to determine whether the received speech, and/or recognized text, includes at least one of a local key phrase or trigger (for example, a key phrase associated with a voice command that can be executed locally) and/or a cloud key phrase or trigger (for example, a keyword, text, or key phrase that may not be executed locally and that may be associated with a voice command for which execution on a cloud-based computing resource(s) is required).
- audio and/or recognized text is forwarded to the cloud.
- Various embodiments can allow conserving system resources (for example, offer low power consumption, low processor overhead, low memory usage, and the like) by detecting the key phrase and determining whether local or cloud-based resources can handle the (full) voice command.
- the mobile device performs the ASR on the speech, for example, using a local ASR engine to determine what the voice command is.
- the local ASR engine uses a "small" vocabulary or dictionary (for example, a dynamic local ASR vocabulary).
- the small vocabulary includes, for example, 1-100 words.
- the number of words in this small "local" vocabulary can be more or less than in this example and less than the number available in a cloud-based resource having more memory storage.
- the words in the small vocabulary include various commands used to interact with the mobile device's basic local functionality (e.g., unlock, dial, call, open application, schedule an appointment, and the like).
- the voice command determined by the local ASR engine can be performed.
- the cloud information can be used to provide instructions to the local engine.
- the cloud can contain a calendar that is inaccessible by the local system, and, therefore, the local system is unable to determine a conflict in a schedule.
- a determination can be made to "select" use of various combinations of local and cloud-based resources for different commands.
- the cloud-based computing resource(s) can perform the ASR, for example, to determine or identify one or more voice commands.
- the cloud-based ASR uses a "large" vocabulary.
- the large vocabulary includes over 100 words.
- the words in the large vocabulary can be used to process or decode complex sentences, which may approach natural language (for example, "tomorrow after work I would like to go to an Italian restaurant").
- the cloud-based ASR uses greater system resources than are practical and/or available on the mobile device (such as power consumption, processing power, memory, storage, and the like).
- the one or more voice commands determined by the cloud-based ASR may be performed by the cloud-based computing resource(s).
- FIG. 6 is a flow chart illustrating a method 600 for selecting performance of speech recognition based on a profile, according to some embodiments.
- speech (audio) is received by the mobile device; the mobile device can sense/detect the speech through at least one transducer such as a microphone.
- the mobile device may "wake up." For example, the mobile device can perform a transition from a lower-power consumption state of operation to a higher-power consumption state of operation, the transition optionally including one or more intermediate power consumption states of operation.
- the mobile device determines that the speech includes at least a voice command (for example, using a key phrase detector).
- the mobile device can send the received speech and, optionally, a signature.
- a signature includes an identifier associated with the mobile device and/or the user.
- the signature can be associated with a certain make and model of a mobile device.
- the signature can be associated with a certain user.
- the speech and, optionally, the signature are sent through wired and/or wireless communication network(s) to cloud-based computing resources.
- a profile can be determined.
- the profile is determined based, optionally, upon a signature.
- the profile can indicate at least one of one or more commands that may be performed locally, one or more commands that may be performed by cloud-based computing resources, and one or more commands that may be performed using a combination of local resources and cloud-based computing resource(s).
- the profile, for example, includes characteristics of the mobile device, such as capabilities of transducers (e.g., microphones), capabilities for processing noise and/or echo, and the like.
- the profile, for example, includes information specific to the user for performing the ASR.
- a default profile is determined/used when, for example, a signature is not received or a profile is not otherwise available.
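Determining a profile from an optional signature, with a default profile as the fallback, can be sketched as follows. The `Signature` tuple, the `PROFILES` table, its sample entries, and the lookup order (device and user, then device only, then user only) are assumptions for illustration; the source does not define how a signature maps to a profile.

```python
from typing import NamedTuple, Optional

class Signature(NamedTuple):
    """Hypothetical signature: identifiers for the device and/or the user."""
    device: Optional[str] = None
    user: Optional[str] = None

# Assumption: the default profile lists no local commands, so every command
# would be performed using cloud-based computing resources.
DEFAULT_PROFILE = {"name": "default", "local_commands": frozenset()}

# Hypothetical profile table keyed by (device make/model, user).
PROFILES = {
    ("phone-x1", "alice"): {
        "name": "alice-on-x1",
        "local_commands": frozenset({"call", "play"}),
    },
    ("phone-x1", None): {
        "name": "x1",
        "local_commands": frozenset({"call"}),
    },
}

def resolve_profile(signature=None):
    """Pick the most specific matching profile, else the default profile."""
    if signature is None:
        return DEFAULT_PROFILE
    candidates = (
        (signature.device, signature.user),
        (signature.device, None),
        (None, signature.user),
    )
    for key in candidates:
        if key in PROFILES:
            return PROFILES[key]
    return DEFAULT_PROFILE

print(resolve_profile(Signature("phone-x1", "alice"))["name"])
print(resolve_profile()["name"])
```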
- the ASR is performed on the speech to determine a voice command. In some embodiments, optionally, the ASR is performed based on the determined profile. In some embodiments, the speech is processed (e.g., noise
- the ASR is performed by a cloud-based computing resource(s).
- the determined voice command can be performed locally, by a cloud-based computing resource(s), or by a combination of the two, based at least on the received profile.
- the command can be performed solely or more efficiently locally, by the cloud-based computing resource(s), or by a combination of the two, and a determination as to where to perform the command can be made based on these or like criteria.
- a decision can be made to perform certain commands always locally even if such commands may be performed by the cloud-based computing resource(s) or by a combination of the two.
- a determination can be made to always first perform certain commands locally and, if a local ASR score is low (e.g., a mismatch between speech and the local vocabulary), perform the commands remotely using the cloud-based computing resource(s).
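The local-first rule with a score-based fallback to the cloud can be sketched as below. Both recognizers are toy stand-ins that operate on text instead of audio, and the scoring rule (fraction of words found in the local vocabulary) and the 0.6 threshold are assumptions; the source only says that a low local ASR score triggers use of the cloud-based computing resource(s).

```python
# Hypothetical small local vocabulary.
LOCAL_VOCABULARY = {"call", "eugene", "open", "play"}

def local_asr(speech):
    """Toy local recognizer: returns (recognized text, score), where the
    score is the fraction of words covered by the local vocabulary."""
    words = speech.lower().split()
    if not words:
        return "", 0.0
    known = [w for w in words if w in LOCAL_VOCABULARY]
    return " ".join(known), len(known) / len(words)

def cloud_asr(speech):
    """Toy cloud recognizer with a large vocabulary: accepts everything."""
    return speech.lower()

def recognize(speech, min_score=0.6):
    """Try the local engine first; fall back to the cloud when the local
    score is below the pre-determined value."""
    text, score = local_asr(speech)
    if score >= min_score:
        return text, "local"
    return cloud_asr(speech), "cloud"

print(recognize("Call Eugene"))
print(recognize("find the address of an Italian restaurant"))
```

Keeping the threshold as a parameter reflects that the pre-determined value could come from the profile rather than being fixed in code.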
- FIGS. 4-6 illustrate the functionality/operations of various implementations of systems, methods, and computer program products according to embodiments of the present technology. It should be noted that, in some alternative embodiments, the functions noted in the blocks may occur out of the order noted in FIGS. 4-6, or may be omitted altogether. For example, two blocks shown in succession may, in fact, be performed substantially concurrently, or the blocks may sometimes be performed in the reverse order.
- FIG. 7 illustrates an exemplary computer system 700 that may be used to implement some embodiments of the present invention.
- the computer system 700 of FIG. 7 may be implemented in the contexts of the likes of computing systems, networks, servers, or combinations thereof.
- the computer system 700 of FIG. 7 includes one or more processor units 710 and main memory 720.
- Main memory 720 stores, in part, instructions and data for execution by processor units 710.
- Main memory 720 stores the executable code when in operation, in this example.
- the computer system 700 of FIG. 7 further includes a mass data storage 730, portable storage device 740, output devices 750, user input devices 760, a graphics display system 770, and peripheral devices 780.
- The components shown in FIG. 7 are depicted as being connected via a single bus 790.
- the components may be connected through one or more data transport means.
- Processor unit 710 and main memory 720 are connected via a local microprocessor bus, and the mass data storage 730, peripheral device(s) 780, portable storage device 740, and graphics display system 770 are connected via one or more input/output (I/O) buses.
- Mass data storage 730, which can be implemented with a magnetic disk drive, solid state drive, or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by processor unit 710. Mass data storage 730 stores the system software for implementing embodiments of the present disclosure for purposes of loading that software into main memory 720.
- Portable storage device 740 operates in conjunction with a portable non-volatile storage medium, such as a flash drive, floppy disk, compact disk, digital video disc, or Universal Serial Bus (USB) storage device, to input and output data and code to and from the computer system 700 of FIG. 7.
- User input devices 760 can provide a portion of a user interface.
- User input devices 760 may include one or more microphones, an alphanumeric keypad, such as a keyboard, for inputting alphanumeric and other information, or a pointing device, such as a mouse, a trackball, stylus, or cursor direction keys.
- User input devices 760 can also include a touchscreen.
- the computer system 700 as shown in FIG. 7 includes output devices 750. Suitable output devices 750 include speakers, printers, network interfaces, and monitors.
- Graphics display system 770 includes a liquid crystal display (LCD) or other suitable display device. Graphics display system 770 is configurable to receive textual and graphical information and process the information for output to the display device.
- Peripheral devices 780 may include any type of computer support device to add additional functionality to the computer system.
- the components provided in the computer system 700 of FIG. 7 are those typically found in computer systems that may be suitable for use with embodiments of the present disclosure and are intended to represent a broad category of such computer components that are well known in the art.
- the computer system 700 of FIG. 7 can be a personal computer (PC), handheld computer system, telephone, mobile computer system, workstation, tablet, phablet, mobile phone, server, minicomputer, mainframe computer, wearable, or any other computer system.
- the computer may also include different bus configurations, networked platforms, multi-processor platforms, and the like.
- Various operating systems may be used including UNIX, LINUX, WINDOWS, MAC OS, PALM OS, QNX, ANDROID, IOS, CHROME, and TIZEN.
- the processing for various embodiments may be implemented in software that is cloud-based.
- the computer system 700 is implemented as a cloud-based computing environment, such as a virtual machine operating within a computing cloud.
- the computer system 700 may itself include a cloud-based computing environment, where the functionalities of the computer system 700 are executed in a distributed fashion.
- the computer system 700, when configured as a computing cloud, may include pluralities of computing devices in various forms, as will be described in greater detail below.
- a cloud-based computing environment is a resource that typically combines the computational power of a large grouping of processors (such as within web servers) and/or that combines the storage capacity of a large grouping of computer memories or storage devices.
- Systems that provide cloud-based resources may be utilized exclusively by their owners, or such systems may be accessible to outside users who deploy applications within the computing infrastructure to obtain the benefit of large computational or storage resources.
- the cloud may be formed, for example, by a network of web servers that comprise a plurality of computing devices, such as the computer system 700, with each server (or at least a plurality thereof) providing processor and/or storage resources.
- These servers may manage workloads provided by multiple users (e.g., cloud resource customers or other users).
- each user places workload demands upon the cloud that vary in real-time, sometimes dramatically. The nature and extent of these variations typically depend on the type of business associated with the user.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Systems and methods for a dynamic local automatic speech recognition (ASR) vocabulary are provided. An example method includes defining user-actionable screen content based on user interactions. At least a portion of the user-actionable screen content is labeled. A local vocabulary, associated with a local ASR engine, is created based, in part, on the labeling. The local vocabulary includes words associated with functions of a mobile device and is limited by resources of the mobile device. The method includes determining whether speech includes a local key phrase or a cloud key phrase. Based on the determination, the method includes performing ASR on the speech using the local ASR engine, or forwarding the speech to a cloud-based computing resource and performing ASR there based on the larger vocabulary of the cloud-based computing resource.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201462089716P | 2014-12-09 | 2014-12-09 | |
US62/089,716 | 2014-12-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016094418A1 (fr) | 2016-06-16 |
Family
ID=56108065
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2015/064523 WO2016094418A1 (fr) | Dynamic local ASR vocabulary | 2014-12-09 | 2015-12-08 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2016094418A1 (fr) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9820042B1 (en) | 2016-05-02 | 2017-11-14 | Knowles Electronics, Llc | Stereo separation and directional suppression with omni-directional microphones |
US10249323B2 (en) | 2017-05-31 | 2019-04-02 | Bose Corporation | Voice activity detection for communication headset |
US10311889B2 (en) | 2017-03-20 | 2019-06-04 | Bose Corporation | Audio signal processing for noise reduction |
US10366708B2 (en) | 2017-03-20 | 2019-07-30 | Bose Corporation | Systems and methods of detecting speech activity of headphone user |
US10424315B1 (en) | 2017-03-20 | 2019-09-24 | Bose Corporation | Audio signal processing for noise reduction |
US10438605B1 (en) | 2018-03-19 | 2019-10-08 | Bose Corporation | Echo control in binaural adaptive noise cancellation systems in headsets |
US10499139B2 (en) | 2017-03-20 | 2019-12-03 | Bose Corporation | Audio signal processing for noise reduction |
EP3676827A4 (fr) * | 2017-08-28 | 2021-04-14 | Roku, Inc. | Local and cloud speech recognition |
US11062702B2 (en) | 2017-08-28 | 2021-07-13 | Roku, Inc. | Media system with multiple digital assistants |
US11126389B2 (en) | 2017-07-11 | 2021-09-21 | Roku, Inc. | Controlling visual indicators in an audio responsive electronic device, and capturing and providing audio using an API, by native and non-native computing devices and services |
US11145298B2 (en) | 2018-02-13 | 2021-10-12 | Roku, Inc. | Trigger word detection with multiple digital assistants |
US11984125B2 (en) | 2021-04-23 | 2024-05-14 | Cisco Technology, Inc. | Speech recognition using on-the-fly-constrained language model per utterance |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1081685A2 (fr) * | 1999-09-01 | 2001-03-07 | TRW Inc. | Method for reducing noise in a speech signal using a single microphone |
WO2012094422A2 (fr) * | 2011-01-05 | 2012-07-12 | Health Fidelity, Inc. | Voice-based system and method for data entry |
US20130289988A1 (en) * | 2012-04-30 | 2013-10-31 | Qnx Software Systems Limited | Post processing of natural language asr |
US20130289996A1 (en) * | 2012-04-30 | 2013-10-31 | Qnx Software Systems Limited | Multipass asr controlling multiple applications |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9820042B1 (en) | 2016-05-02 | 2017-11-14 | Knowles Electronics, Llc | Stereo separation and directional suppression with omni-directional microphones |
US10311889B2 (en) | 2017-03-20 | 2019-06-04 | Bose Corporation | Audio signal processing for noise reduction |
US10366708B2 (en) | 2017-03-20 | 2019-07-30 | Bose Corporation | Systems and methods of detecting speech activity of headphone user |
US10424315B1 (en) | 2017-03-20 | 2019-09-24 | Bose Corporation | Audio signal processing for noise reduction |
US10499139B2 (en) | 2017-03-20 | 2019-12-03 | Bose Corporation | Audio signal processing for noise reduction |
US10762915B2 (en) | 2017-03-20 | 2020-09-01 | Bose Corporation | Systems and methods of detecting speech activity of headphone user |
US10249323B2 (en) | 2017-05-31 | 2019-04-02 | Bose Corporation | Voice activity detection for communication headset |
US11126389B2 (en) | 2017-07-11 | 2021-09-21 | Roku, Inc. | Controlling visual indicators in an audio responsive electronic device, and capturing and providing audio using an API, by native and non-native computing devices and services |
EP3676827A4 (fr) * | 2017-08-28 | 2021-04-14 | Roku, Inc. | Local and cloud speech recognition |
US11062710B2 (en) | 2017-08-28 | 2021-07-13 | Roku, Inc. | Local and cloud speech recognition |
US11062702B2 (en) | 2017-08-28 | 2021-07-13 | Roku, Inc. | Media system with multiple digital assistants |
US11646025B2 (en) | 2017-08-28 | 2023-05-09 | Roku, Inc. | Media system with multiple digital assistants |
US11804227B2 (en) | 2017-08-28 | 2023-10-31 | Roku, Inc. | Local and cloud speech recognition |
US11961521B2 (en) | 2017-08-28 | 2024-04-16 | Roku, Inc. | Media system with multiple digital assistants |
US11145298B2 (en) | 2018-02-13 | 2021-10-12 | Roku, Inc. | Trigger word detection with multiple digital assistants |
US11664026B2 (en) | 2018-02-13 | 2023-05-30 | Roku, Inc. | Trigger word detection with multiple digital assistants |
US11935537B2 (en) | 2018-02-13 | 2024-03-19 | Roku, Inc. | Trigger word detection with multiple digital assistants |
US10438605B1 (en) | 2018-03-19 | 2019-10-08 | Bose Corporation | Echo control in binaural adaptive noise cancellation systems in headsets |
US11984125B2 (en) | 2021-04-23 | 2024-05-14 | Cisco Technology, Inc. | Speech recognition using on-the-fly-constrained language model per utterance |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20160162469A1 (en) | Dynamic Local ASR Vocabulary | |
WO2016094418A1 (fr) | Dynamic local ASR vocabulary | |
US10045140B2 (en) | Utilizing digital microphones for low power keyword detection and noise suppression | |
US9978388B2 (en) | Systems and methods for restoration of speech components | |
TWI585744B (zh) | Method, system, and computer-readable storage medium for operating a virtual assistant | |
US9668048B2 (en) | Contextual switching of microphones | |
US20140244273A1 (en) | Voice-controlled communication connections | |
US9799330B2 (en) | Multi-sourced noise suppression | |
CN111192591B (zh) | Wake-up method and apparatus for a smart device, smart speaker, and storage medium | |
US9953634B1 (en) | Passive training for automatic speech recognition | |
US20190013025A1 (en) | Providing an ambient assist mode for computing devices | |
US10353495B2 (en) | Personalized operation of a mobile device using sensor signatures | |
KR20160091725A (ko) | Speech recognition method and apparatus | |
US9437188B1 (en) | Buffered reprocessing for multi-microphone automatic speech recognition assist | |
US20190130911A1 (en) | Communications with trigger phrases | |
US11721338B2 (en) | Context-based dynamic tolerance of virtual assistant | |
US20140316783A1 (en) | Vocal keyword training from text | |
JP6619488B2 (ja) | Continuous conversation function in an artificial intelligence device | |
US9772815B1 (en) | Personalized operation of a mobile device using acoustic and non-acoustic information | |
US9508345B1 (en) | Continuous voice sensing | |
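KR20200019522A (ko) | GUI voice control apparatus and method | |
KR102629796B1 (ko) | Electronic device supporting improved speech recognition | |
KR20140116642A (ko) | Method and apparatus for controlling functions based on speech recognition | |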
US20170206898A1 (en) | Systems and methods for assisting automatic speech recognition | |
JP2019175453A (ja) | System including processing of user voice input, operating method thereof, and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15868411; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 15868411; Country of ref document: EP; Kind code of ref document: A1 |