WO2016094418A1 - Dynamic local ASR vocabulary - Google Patents

Dynamic local ASR vocabulary

Info

Publication number
WO2016094418A1
Authority
WO
WIPO (PCT)
Prior art keywords
asr
speech
cloud
vocabulary
mobile device
Prior art date
Application number
PCT/US2015/064523
Other languages
English (en)
Inventor
Peter Santos
Original Assignee
Knowles Electronics, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-12-09
Filing date
2015-12-08
Publication date
2016-06-16
Application filed by Knowles Electronics, Llc filed Critical Knowles Electronics, Llc
Publication of WO2016094418A1


Classifications

    • G - PHYSICS
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 - Speech recognition
            • G10L15/08 - Speech classification or search
              • G10L15/18 - Speech classification or search using natural language modelling
                • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
            • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
              • G10L2015/226 - Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
                • G10L2015/228 - Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of application context
            • G10L15/28 - Constructional details of speech recognition systems
              • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
              • G10L15/32 - Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • the present application relates generally to speech processing and, more specifically, to automatic speech recognition (ASR).
  • Performance of ASR on a mobile device can be limited by the mobile device's computing resources, which may, for example, restrict the size of the vocabulary available for ASR.
  • An example method allows defining a user actionable screen content associated with a mobile device.
  • the method includes labeling at least a portion of the user actionable screen content.
  • the method includes creating, based on the labeling, a first vocabulary.
  • the first vocabulary is associated with a first ASR engine.
  • the user actionable screen content is based partially on the user interaction with the mobile device.
  • the first ASR engine is associated with the mobile device.
  • the first vocabulary includes words associated with at least one function of the mobile device. In certain embodiments, a size of the first vocabulary is limited by resources of the mobile device.
  • the method further includes detecting at least one key phrase in speech, the speech including at least one captured sound.
  • the method allows determining whether the key phrase is a local key phrase or a cloud-based key phrase. If the key phrase is a local key phrase, ASR on the speech is performed with the first ASR engine. If the key phrase is a cloud-based key phrase, then the speech and/or the key phrase are forwarded to at least one cloud-based computing resource (a cloud).
  • ASR is performed on the speech with a second ASR engine. The second ASR engine is associated with a second vocabulary and the cloud.
  • the method allows performing at least noise suppression and/or noise reduction on the speech before performing the ASR on the speech by the first ASR engine to improve robustness of the ASR.
  • the first vocabulary is smaller than the second vocabulary.
  • the first vocabulary includes from 1 to 100 words, and the second vocabulary includes more than 100 words.
  • the determination as to whether the at least one key phrase is a local key phrase or a cloud-based key phrase is based, at least partially, on a profile.
  • the profile may be associated with the mobile device and/or the user.
  • the profile includes commands that can be executed locally on the mobile device, commands that can be executed remotely in the cloud, and commands that can be executed both locally on the mobile device and remotely in the cloud.
  • the profile includes at least one rule. The rule may include forwarding the speech to the cloud to perform the ASR on the speech by the second ASR engine if a score of performing the ASR on the speech by the first ASR engine is less than a pre-determined value.
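  • The following is a minimal Python sketch of such a profile and its routing rule (illustrative only, not the patent's implementation; the `Profile` fields, the example command sets, and the 0.6 threshold are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Hypothetical per-device/per-user profile described above."""
    local_commands: set = field(default_factory=lambda: {"call", "dial", "unlock", "open"})
    cloud_commands: set = field(default_factory=lambda: {"find the address", "navigate"})
    both_commands: set = field(default_factory=lambda: {"schedule an appointment"})
    min_local_score: float = 0.6  # the "pre-determined value" in the rule above

def route_key_phrase(profile: Profile, key_phrase: str, local_asr_score: float) -> str:
    """Return 'local' or 'cloud' for a detected key phrase.

    Implements the example rule: forward the speech to the cloud-based (second)
    ASR engine when the local (first) engine's score falls below the threshold.
    """
    phrase = key_phrase.lower()
    if phrase in profile.cloud_commands:
        return "cloud"
    if phrase in profile.local_commands or phrase in profile.both_commands:
        return "local" if local_asr_score >= profile.min_local_score else "cloud"
    return "cloud"  # unknown phrases go to the larger cloud vocabulary

profile = Profile()
print(route_key_phrase(profile, "Call", 0.9))  # -> local
print(route_key_phrase(profile, "Call", 0.3))  # -> cloud (score below threshold)
```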
  • According to another example embodiment, the steps of the method for providing a dynamic local ASR vocabulary are stored on a non-transitory machine-readable medium comprising instructions, which, when executed by one or more processors, perform the recited steps.
  • FIG. 1 is a block diagram illustrating a system in which methods and systems for providing a dynamic local ASR vocabulary can be practiced, according to various example embodiments.
  • FIG. 2 is a block diagram of an example mobile device, in which a method for providing a dynamic local ASR vocabulary can be practiced.
  • FIG. 3 is a block diagram showing a system for providing a dynamic local ASR vocabulary and hierarchical assignment of recognition tasks, according to various example embodiments.
  • FIG. 4 is a flow chart illustrating steps of a method for providing a dynamic local ASR vocabulary.
  • FIG. 5 is a flow chart illustrating steps of a method for hierarchical assignment of recognition tasks, according to various example embodiments.
  • FIG. 6 is a flow chart illustrating steps of a method for selecting performance of speech recognition based on a profile, according to various example embodiments.
  • FIG. 7 is an example computer system that may be used to implement embodiments of the disclosed technology.
  • the present disclosure is directed to systems and methods for providing a dynamic local automatic speech recognition (ASR) vocabulary.
  • Various embodiments of the present technology can be practiced with mobile devices configured to capture audio signals and may provide for improvement of automatic speech recognition in the captured audio.
  • the mobile devices may include: radio frequency (RF) receivers, transmitters, and transceivers; wired and/or wireless telecommunications and/or networking devices; amplifiers; audio and/or video players; encoders; decoders; and the like.
  • Mobile devices can include input devices such as buttons, switches, keys, keyboards, trackballs, sliders, touch screens, one or more microphones, gyroscopes, accelerometers, global positioning system (GPS) receivers, and the like.
  • Mobile devices can include outputs, such as LED indicators, video displays, touchscreens, speakers, and the like.
  • mobile devices are hand-held devices, such as notebook computers, tablet computers, phablets, smart phones, personal digital assistants, media players, mobile telephones, video cameras, and the like.
  • the mobile devices are used in stationary and portable environments.
  • the stationary environments include residential and commercial buildings or structures, and the like.
  • the stationary environments can include living rooms, bedrooms, home theaters, conference rooms, auditoriums, business premises, and the like.
  • the portable environments can include moving vehicles, moving persons, other transportation means, and the like.
  • a method for providing a dynamic local ASR vocabulary includes defining a user actionable screen content associated with a mobile device.
  • the user actionable screen content may be based on the user interaction with the mobile device.
  • the method can include labeling at least a portion of the user actionable screen content.
  • the method may also include creating, based on the labeling, a local vocabulary.
  • the local vocabulary can correspond to a local ASR engine associated with the mobile device.
  • Various embodiments of the method can include performing noise suppression and noise reduction on speech prior to performing the ASR on the speech by the first ASR engine to improve robustness of the ASR.
  • the speech may include at least one captured sound.
  • the system 100 can include a mobile device 110 and one or more cloud-based computing resources 130, also referred to herein as a computing cloud(s) 130 or a cloud 130.
  • the cloud-based computing resource(s) 130 can include computing resources (hardware and software) available at a remote location and accessible over a network (for example, the Internet).
  • the cloud-based computing resources 130 are shared by multiple users and can be dynamically re-allocated based on demand.
  • the cloud-based computing resources 130 include one or more server farms/clusters, including a collection of computer servers which can be co-located with network switches and/or routers.
  • the mobile device 110 can be connected to the computing cloud 130 via one or more wired or wireless communications networks 140.
  • the mobile device 110 includes microphone(s) (e.g., transducers) 120 configured to receive voice input/acoustic sound from a user 150.
  • the voice input/acoustic sound can be contaminated by a noise 160.
  • Noise sources can include street noise, ambient noise, speech from entities other than an intended speaker(s), and the like.
  • FIG. 2 is a block diagram illustrating components of the mobile device 110, according to various example embodiments.
  • the mobile device 110 includes one or more microphones 120, a processor 210, audio processing system 220, a memory storage 230, one or more communication devices 240, and a graphic display system 250.
  • the mobile device 110 also includes additional or other components needed for operations of mobile device 110.
  • the mobile device 110 includes fewer components that perform similar or equivalent functions to those described with reference to FIG. 2.
  • a beam-forming technique can be used to simulate forward-facing and backward-facing directional microphone responses.
  • a level difference is obtained using the simulated forward-facing and the backward-facing directional microphone.
  • the level difference can be used to discriminate speech and noise in, for example, the time-frequency domain, which can be further used in noise and/or echo reduction.
  • some microphones 120 are used mainly to detect speech, and other microphones 120 are used mainly to detect noise.
  • some microphones 120 are used to detect both noise and speech.
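  • As a rough illustration of the level-difference idea (a sketch only, not the noise-suppression methods referenced below; the microphone spacing, integer-sample delay, 3 dB threshold, and use of `scipy.signal.stft` are assumptions), two omnidirectional microphone signals can be combined into simulated forward- and backward-facing beams whose per-bin level difference yields a crude speech/noise mask:

```python
import numpy as np
from scipy.signal import stft

def level_difference_mask(mic1, mic2, fs, mic_spacing_m=0.01, threshold_db=3.0):
    """Crude sketch: delay-and-subtract beams from two omni mics, then a
    time-frequency speech/noise mask from their level difference."""
    c = 343.0                                           # speed of sound (m/s)
    delay = max(1, int(round(mic_spacing_m / c * fs)))  # delay in whole samples
    forward = mic1 - np.roll(mic2, delay)               # forward-facing beam
    backward = mic2 - np.roll(mic1, delay)              # backward-facing beam
    _, _, F = stft(forward, fs=fs, nperseg=256)
    _, _, B = stft(backward, fs=fs, nperseg=256)
    eps = 1e-12
    level_diff_db = 20.0 * np.log10((np.abs(F) + eps) / (np.abs(B) + eps))
    return level_diff_db > threshold_db                 # True where speech dominates

# Example with synthetic signals: a tone arriving from the front plus diffuse noise.
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t)
mic1 = speech + 0.1 * np.random.randn(fs)
mic2 = np.roll(speech, 1) + 0.1 * np.random.randn(fs)   # front arrival reaches mic1 first
mask = level_difference_mask(mic1, mic2, fs)
print(mask.shape, float(mask.mean()))
```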
  • the acoustic signals once received, for example, captured by microphone(s) 120, are converted into electric signals, which, in turn, are converted, by the audio processing system 220, into digital signals for processing in accordance with some embodiments.
  • the processed signals are transmitted for further processing to the processor 210.
  • Audio processing system 220 can be operable to process an audio signal.
  • the acoustic signal is captured by the microphone 120.
  • acoustic signals detected by the microphone(s) 120 are used by audio processing system 220 to separate desired speech (for example, keywords) from the noise, thereby providing more robust ASR.
  • Noise reduction may include noise cancellation and/or noise suppression.
  • noise reduction methods are described in U.S. Patent Application No. 12/215,980, entitled "System and Method for Providing Noise Suppression Utilizing Null Processing Noise Subtraction," filed June 30, 2008, and in U.S. Patent Application No. 11/699,732, entitled "System and Method for Utilizing Omni-Directional Microphones for Speech Enhancement," filed January 29, 2007, which are incorporated herein by reference in their entireties.
  • the processor 210 may include hardware and/or software operable to execute computer programs stored in the memory storage 230.
  • the processor 210 can use floating point operations, complex operations, and other operations, including providing a dynamic local ASR vocabulary, keyword detection, and hierarchical assignment of recognition tasks.
  • the processor 210 of the mobile device 110 includes, for example, at least one of a digital signal processor, image processor, audio processor, general-purpose processor, and the like.
  • the example mobile device 110 is operable, in various embodiments, to communicate over one or more wired or wireless communications networks 140 (as shown in FIG. 1), for example, via communication devices 240.
  • the mobile device 110 sends at least an audio signal (speech) over a wired or wireless communications network 140.
  • the mobile device 110 encapsulates and/or encodes the at least one digital signal for transmission over a wireless network (e.g., a cellular network).
  • the digital signal can be encapsulated over Internet Protocol Suite (TCP/IP) and/or User Datagram Protocol (UDP).
  • the wired and/or wireless communications networks 140 can be circuit switched and/or packet switched. In various embodiments, the wired communications network(s) 140 provide wired connectivity between the mobile device 110 and other devices or networks.
  • the wireless communications network(s) 140 can include any number of wireless access points, base stations, repeaters, and the like.
  • the wired and/or wireless communications networks 140 may conform to an industry standard(s), be proprietary, or combinations thereof. Various other suitable wired and/or wireless communications networks 140, other protocols, and combinations thereof can be used.
  • the graphic display system 250 can be configured at least to provide a graphic user interface.
  • a touch screen associated with the graphic display system 250 is utilized to receive input from a user. Options can be provided to a user via an icon or text buttons once the user touches the screen.
  • the graphic display system 250 can be used for providing a user actionable content and generating a dynamic local ASR vocabulary.
  • FIG. 3 is a block diagram showing a system 300 for providing a dynamic local ASR vocabulary and hierarchical assignment of recognition tasks, according to an example embodiment.
  • the example system 300 may include a key phrase detector 310, a local ASR module 320, and a cloud-based ASR module 330.
  • the modules 310-330 can be implemented as executable instructions stored either locally in memory of the mobile device 110 or in computing cloud 130.
  • the key phrase detector 310 may recognize the presence of one or more keywords in an acoustic audio signal, the acoustic audio signal representing at least one sound captured, for example, by microphones 120 of the mobile device 110.
  • the term key phrase as used herein may comprise one or more key words.
  • the key phrase detector 310 can determine whether the one or more keywords represent one or more commands that can be performed locally on a mobile device, one or more commands that can be performed in the computing cloud, or one or more commands that can be performed locally and in the computing cloud. In various embodiments, the determination is based on a profile 350.
  • the profile 350 can include user specific settings and/or mobile device specific settings and rules for processing acoustic audio signal(s). Based on the determination, the acoustic audio signal can be sent to local ASR 320 or cloud-based ASR 330.
  • the local ASR module 320 can be associated with a dynamic local ASR vocabulary.
  • the cloud-based ASR 330 is based on the cloud-based vocabulary 360. In some embodiments, the cloud-based vocabulary 360 includes more entries than the dynamic local ASR vocabulary 340.
  • when speech received from user 150 includes a recognized local command or key phrase (the key phrase including one or more keywords), the command can be performed locally (e.g., on the mobile device 110).
  • a key phrase detector 310 determines that "Call” is a local key phrase and then uses the local ASR engine 320 (also referred to herein as local recognizer) to recognize the rest of the command ("Eugene” in this example).
  • a record (e.g., information for a "contact" including a telephone number) or other identifier associated with a name spoken after the "Call" command is retrieved locally on the mobile device 110 (not in the cloud-based computing resource(s) 130), and a call operation is initiated locally using the record.
  • other content stored locally (e.g., on the mobile device 110) can similarly be acted on by local commands, for example: commands associated with contact information (e.g., Call, Text, Email), audio or video content (e.g., Play), applications or bookmarked webpages (e.g., Open), and locations (e.g., Navigate).
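  • A minimal sketch of this local handling (hypothetical; the `contacts`, `media`, and `bookmarks` stores and the handler functions stand in for the mobile device's own locally stored content and functions) might look like:

```python
# Hypothetical local stores on the mobile device (no cloud access needed).
contacts = {"eugene": "+1-555-0100"}
media = {"jazz playlist": "file:///music/jazz.m3u"}
bookmarks = {"news": "https://example.com/news"}

def initiate_call(number): print(f"dialing {number}")
def play_media(uri): print(f"playing {uri}")
def open_bookmark(url): print(f"opening {url}")

# Key phrase -> (local lookup table, local action), mirroring the examples above:
# contact commands (Call/Text/Email), media (Play), apps/webpages (Open), locations (Navigate).
LOCAL_COMMANDS = {
    "call": (contacts, initiate_call),
    "play": (media, play_media),
    "open": (bookmarks, open_bookmark),
}

def execute_locally(key_phrase, rest_of_command):
    """E.g., for "Call Eugene": key_phrase="Call", rest_of_command="Eugene"."""
    table, action = LOCAL_COMMANDS[key_phrase.lower()]
    record = table.get(rest_of_command.lower())
    if record is None:
        return False   # nothing stored locally; a caller might escalate to the cloud
    action(record)     # e.g., initiate the call locally using the retrieved record
    return True

execute_locally("Call", "Eugene")  # local contact lookup, then the call is placed locally
```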
  • Some embodiments include deciding (for example, by the key phrase detector 310) that commands are to be performed using a cloud-based computing resource(s) 130, instead of locally (e.g., on the mobile device 110), based on the command key phrase, or based upon the recognition of a likelihood of a match of models and observed extracted audio parameters. For example, when the speech received from a user corresponds to a voice command identified as a command for execution using the cloud-based computing resource(s) 130 (e.g., since it cannot be handled locally on the mobile device), a decision can be made to have the speech and/or recognized text forwarded to the cloud-based computing resources 130 for the ASR. Furthermore, for speech received from a user that includes a command recognized by the ASR as a command for execution by the cloud-based computing resource(s) 130, the command can be selected or designated for execution by the cloud-based computing resource(s) 130.
  • for example, the key phrase "find the address" of the voice command is identified locally by the ASR, and the voice command (e.g., audio and/or recognized text) is forwarded to the cloud-based computing resource 130 for the ASR and for execution of the recognized voice command by the cloud-based computing resource 130.
  • some commands can use processor resources (for example, context awareness obtained from a sensor hub or a geographic locator, such as a GPS, beacon, Bluetooth Low Energy ("BLE"), or WiFi) and store information more efficiently when delivered via cloud-based computing resources 130 than when performed locally.
  • Some embodiments can allow initiating of execution of and/or performing commands using both or different combinations of local resources (e.g., processor resources provided by and information stored on a mobile device) and cloud-based computing resource(s) 130 (e.g., processor resources provided by and information stored in the cloud-based computing resource(s) 130), depending upon the command.
  • execution of some commands (e.g., "call") can be initiated and completed using such combinations of resources. Execution or executing, as referred to herein, refer to executing all or parts of the steps required to fully perform certain operations.
  • Some embodiments can allow determining at least one or more commands that can be performed locally, one or more commands that can be performed by a cloud-based computing resource(s), and one or more commands that can be performed using a combination of local resources and a cloud-based computing resource(s).
  • the determination is based, for example, at least on specifications and/or characteristics of the mobile device 110.
  • the determination is based, for example, in part on the characteristics or preferences of a user 150 of the mobile device 110.
  • Some embodiments include a profile 350, which may be associated with a certain mobile device 110 (e.g., a make and model) and/or the user 150.
  • the profile 350 can indicate, for example, at least one of one or more commands that may be performed locally, one or more commands that can be performed by cloud-based computing resources 130, and one or more commands that may be performed using a combination of local resources and a cloud-based computing resource(s) 130.
  • Various embodiments include a plurality of profiles, each profile being associated with a different (e.g., a make and model) mobile device and/or a different user.
  • Some embodiments can include a default profile, which may be used when information concerning the mobile device and/or user is not available. The default profile can be used to select, for example, performance of all commands using cloud-based computing resources 130, or performance of only those commands known to be efficiently delivered locally (for example, with minimal usage of local processing and information storage resources).
  • FIG. 4 is a flow chart illustrating a method 400 for providing a dynamic local ASR vocabulary, according to an example embodiment.
  • a user actionable screen content can be defined.
  • the user actionable screen content can be at least partially based on user interactions.
  • the user actionable screen content is associated with a mobile device.
  • at least a portion of the user actionable screen content can be labeled.
  • a local vocabulary can be generated based on the labeling.
  • the local vocabulary can be associated with a local ASR engine.
  • the local ASR engine is associated with the mobile device.
  • the local vocabulary includes words associated with certain functions of the mobile device.
  • the local vocabulary can be limited by resources of the mobile device (such as memory and processor speed).
  • the local ASR engine and the local vocabulary are used to recognize one or more key phrases in a speech, for example, in audio signal captured by one or more microphones of the mobile device.
  • noise suppression or noise reduction is performed on the speech prior to performing the local ASR.
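  • As a rough sketch of how the labeling in method 400 could feed a local vocabulary (illustrative only; the screen-element structure and the 100-word cap are assumptions drawn from the description above), consider:

```python
def build_local_vocabulary(screen_elements, max_words=100):
    """Create a dynamic local ASR vocabulary from labeled, user-actionable
    screen content, capped to respect the mobile device's resources.

    screen_elements: iterable of dicts such as
        {"label": "Open Calendar", "actionable": True}
    """
    vocabulary = []
    seen = set()
    for element in screen_elements:
        if not element.get("actionable"):
            continue                      # only user-actionable content is labeled
        for word in element["label"].lower().split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
            if len(vocabulary) >= max_words:
                return vocabulary         # size limited by device resources
    return vocabulary

# Example: a home screen after some user interaction.
screen = [
    {"label": "Call Eugene", "actionable": True},
    {"label": "Open Calendar", "actionable": True},
    {"label": "Battery 80%", "actionable": False},
]
print(build_local_vocabulary(screen))  # ['call', 'eugene', 'open', 'calendar']
```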
  • FIG. 5 is a flow chart illustrating a method 500 for hierarchical assignment of recognition tasks, according to various embodiments.
  • speech (audio) can be received by the mobile device. The mobile device may sense/detect the speech through at least one transducer such as a microphone.
  • the device can detect whether the speech (audio) includes a voice command. In various embodiments, this detection is performed using a module that includes a key phrase detector (e.g., a local recognizer/engine).
  • the "full” command refers to a key phrase comprising a command, plus additional speech (for example, “call Eugene", where the key phrase is “call” and the full command is “call Eugene”).
  • the module both recognizes the "full” command and makes the determination as to whether the full command can be executed locally.
  • the module can be operable to determine whether the received speech, and/or recognized text, includes at least one of a local key phrase or trigger (for example, recognize a key phrase which is associated with a voice command that can be executed locally), and/or a cloud key phrase or trigger (for example, recognize a keyword, text, or key phrase which may not be executed locally), and which may be (associated with) a voice command for which execution on a cloud-based computing resource(s) is required.
  • audio and/or recognized text is forwarded to the cloud.
  • Various embodiments can allow conserving system resources (for example, offer low power consumption, low processor overhead, low memory usage, and the like) by detecting the key phrase and determining whether local or cloud-based resources can handle the (full) voice command.
  • the mobile device performs the ASR on the speech, for example, using a local ASR engine to determine what the voice command is.
  • the local ASR engine uses a "small" vocabulary or dictionary (for example, a dynamic local ASR vocabulary).
  • the small vocabulary includes, for example, 1-100 words.
  • the number of words in this small "local" vocabulary can be more or less than in this example and less than the number available in a cloud-based resource having more memory storage.
  • the words in the small vocabulary include various commands used to interact with the mobile device's basic local functionality (e.g., unlock, dial, call, open application, schedule an appointment, and the like).
  • the voice command determined by the local ASR engine can be performed.
  • the cloud information can be used to provide instructions to the local engine.
  • the cloud can contain a calendar that is inaccessible by the local system, and, therefore, the local system is unable to determine a conflict in a schedule.
  • a determination can be made to "select" use of various combinations of local and cloud-based resources for different commands.
  • the cloud-based computing resource(s) can perform the ASR, for example, to determine or identify one or more voice commands.
  • the cloud-based ASR uses a "large" vocabulary.
  • the large vocabulary includes over 100 words.
  • the words in the large vocabulary can be used to process or decode complex sentences, which may approach natural language (for example, "tomorrow after work I would like to go to an Italian restaurant").
  • the cloud-based ASR uses greater system resources than are practical and/or available on the mobile device (such as power consumption, processing power, memory, storage, and the like).
  • the one or more voice commands determined by the cloud-based ASR may be performed by the cloud-based computing resource(s).
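  • Putting the steps of method 500 together, a simplified control flow might look like the following sketch (not the patent's implementation; `KeyPhrase`, `handle_speech`, and the stub detector and engines are hypothetical placeholders):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class KeyPhrase:
    text: str
    is_local: bool

def handle_speech(
    audio: bytes,
    detect_key_phrase: Callable[[bytes], Optional[KeyPhrase]],
    local_asr: Callable[[bytes], str],
    cloud_asr: Callable[[bytes], str],
) -> Optional[str]:
    """Hierarchical assignment of recognition tasks, roughly following FIG. 5."""
    key_phrase = detect_key_phrase(audio)       # low-power key phrase detector
    if key_phrase is None:
        return None                             # no voice command in the audio
    if key_phrase.is_local:
        # A small (e.g., 1-100 word) dynamic local vocabulary covers basic
        # device functionality: unlock, dial, call, open application, ...
        return f"executed locally: {local_asr(audio)}"
    # Cloud key phrase: forward the audio (and/or recognized text) to the cloud,
    # where a large (>100 word) vocabulary handles near-natural language.
    return f"executed in cloud: {cloud_asr(audio)}"

# Tiny stand-ins for the detector and the two engines.
print(handle_speech(
    b"...",
    detect_key_phrase=lambda a: KeyPhrase("call", is_local=True),
    local_asr=lambda a: "call Eugene",
    cloud_asr=lambda a: "find an Italian restaurant for tomorrow",
))
```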
  • FIG. 6 is a flow chart illustrating a method 600 for selecting performance of speech recognition based on a profile, according to some embodiments.
  • speech (audio) can be received by the mobile device. The mobile device can sense/detect the speech through at least one transducer such as a microphone.
  • the mobile device may "wake up.” For example, the mobile device can perform a transition from a lower-power consumption state of operation to a higher-power consumption state of operation, the transition optionally including one or more intermediate power consumption states of operation.
  • the mobile device determines that the speech includes at least a voice command (for example, using a key phrase detector).
  • the mobile device can send the received speech and, optionally, a signature.
  • a signature includes an identifier associated with the mobile device and/or the user.
  • the signature can be associated with a certain make and model of a mobile device.
  • the signature can be associated with a certain user.
  • the speech and, optionally, the signature are sent through wired and/or wireless communication network(s) to cloud-based computing resources.
  • a profile can be determined.
  • the profile is determined based, optionally, upon a signature.
  • the profile can indicate at least one of one or more commands that may be performed locally, one or more commands that may be performed by cloud-based computing resources, and one or more commands that may be performed using a combination of local resources and cloud-based computing resource(s).
  • the profile for example, includes characteristics of the mobile device, such as capabilities of transducers (e.g., microphones), capabilities for processing noise and/or echo, and the like.
  • the profile for example, includes information specific to the user for performing the ASR.
  • a default profile is determined/used when, for example, a signature is not received or a profile is not otherwise available.
  • the ASR is performed on the speech to determine a voice command. In some embodiments, optionally, the ASR is performed based on the determined profile. In some embodiments, the speech is processed (e.g., noise suppression and/or noise reduction is performed) prior to performing the ASR.
  • the ASR is performed by a cloud-based computing resource(s).
  • the determined voice command can be performed locally, by a cloud-based computing resource(s), or combination of the two, based at least on the received profile.
  • the command can be performed solely or more efficiently locally, by the cloud-based computing resource(s), or by a combination of the two, and a determination as to where to perform the command can be made based on these or like criteria.
  • a decision can be made to perform certain commands always locally even if such commands may be performed by the cloud- based computing resource(s) or by a combination of the two.
  • a determination can be made to always first perform certain commands locally and, if a local ASR score is low (e.g., a mismatch between speech and the local vocabulary), perform the commands remotely using the cloud-based computing resource(s).
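  • As an illustration of this profile-driven selection (hypothetical structure; the signature format, `DEFAULT_PROFILE`, and the 0.6 score threshold are assumptions), a receiving side might resolve a profile from the signature, fall back to a default, and then decide where each recognized command runs:

```python
DEFAULT_PROFILE = {
    "local": set(),                  # default: no commands forced to run locally
    "cloud": {"navigate", "search"},
    "either": {"call", "dial"},
    "always_local": {"unlock"},      # performed locally even if the cloud could handle them
    "min_local_score": 0.6,
}

PROFILES = {
    # Keyed by a signature identifying the device make/model and/or the user.
    ("acme-phone-x", "user-150"): {**DEFAULT_PROFILE, "local": {"call", "dial", "open"}},
}

def resolve_profile(signature):
    """Use the device/user profile matching the signature, else the default profile."""
    return PROFILES.get(signature, DEFAULT_PROFILE)

def choose_executor(profile, command, local_score):
    if command in profile["always_local"]:
        return "local"
    if command in profile["local"] or command in profile["either"]:
        # Try locally first; a low local ASR score (vocabulary mismatch)
        # sends the command to the cloud-based computing resources instead.
        return "local" if local_score >= profile["min_local_score"] else "cloud"
    return "cloud"

profile = resolve_profile(("acme-phone-x", "user-150"))
print(choose_executor(profile, "call", 0.9))      # -> local
print(choose_executor(profile, "navigate", 0.9))  # -> cloud
```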
  • FIGS. 4-6 illustrate the functionality/operations of various implementations of systems, methods, and computer program products according to embodiments of the present technology. It should be noted that, in some alternative embodiments, the functions noted in the blocks may occur out of the order noted in FIGS. 4-6, or omitted altogether. For example, two blocks shown in succession may, in fact, be performed substantially concurrently, or the blocks may sometimes be performed in the reverse order.
  • FIG. 7 illustrates an exemplary computer system 700 that may be used to implement some embodiments of the present invention.
  • the computer system 700 of FIG. 7 may be implemented in the contexts of the likes of computing systems, networks, servers, or combinations thereof.
  • the computer system 700 of FIG. 7 includes one or more processor units 710 and main memory 720.
  • Main memory 720 stores, in part, instructions and data for execution by processor units 710.
  • Main memory 720 stores the executable code when in operation, in this example.
  • the computer system 700 of FIG. 7 further includes a mass data storage 730, portable storage device 740, output devices 750, user input devices 760, a graphics display system 770, and peripheral devices 780.
  • The components shown in FIG. 7 are depicted as being connected via a single bus 790.
  • the components may be connected through one or more data transport means.
  • Processor unit 710 and main memory 720 are connected via a local microprocessor bus, and the mass data storage 730, peripheral device(s) 780, portable storage device 740, and graphics display system 770 are connected via one or more input/output (I/O) buses.
  • Mass data storage 730 which can be implemented with a magnetic disk drive, solid state drive, or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by processor unit 710. Mass data storage 730 stores the system software for implementing embodiments of the present disclosure for purposes of loading that software into main memory 720.
  • Portable storage device 740 operates in conjunction with a portable non-volatile storage medium, such as a flash drive, floppy disk, compact disk, digital video disc, or Universal Serial Bus (USB) storage device, to input and output data and code to and from the computer system 700 of FIG. 7.
  • User input devices 760 can provide a portion of a user interface.
  • User input devices 760 may include one or more microphones, an alphanumeric keypad, such as a keyboard, for inputting alphanumeric and other information, or a pointing device, such as a mouse, a trackball, stylus, or cursor direction keys.
  • User input devices 760 can also include a touchscreen.
  • the computer system 700 as shown in FIG. 7 includes output devices 750. Suitable output devices 750 include speakers, printers, network interfaces, and monitors.
  • Graphics display system 770 includes a liquid crystal display (LCD) or other suitable display device. Graphics display system 770 is configurable to receive textual and graphical information and processes the information for output to the display device.
  • Peripheral devices 780 may include any type of computer support device to add additional functionality to the computer system.
  • the components provided in the computer system 700 of FIG. 7 are those typically found in computer systems that may be suitable for use with embodiments of the present disclosure and are intended to represent a broad category of such computer components that are well known in the art.
  • the computer system 700 of FIG. 7 can be a personal computer (PC), handheld computer system, telephone, mobile computer system, workstation, tablet, phablet, mobile phone, server, minicomputer, mainframe computer, wearable, or any other computer system.
  • the computer may also include different bus configurations, networked platforms, multi-processor platforms, and the like.
  • Various operating systems may be used, including UNIX, LINUX, WINDOWS, MAC OS, PALM OS, QNX, ANDROID, IOS, CHROME, TIZEN, and other suitable operating systems.
  • the processing for various embodiments may be implemented in software that is cloud-based.
  • the computer system 700 is implemented as a cloud-based computing environment, such as a virtual machine operating within a computing cloud.
  • the computer system 700 may itself include a cloud-based computing environment, where the functionalities of the computer system 700 are executed in a distributed fashion.
  • the computer system 700 when configured as a computing cloud, may include pluralities of computing devices in various forms, as will be described in greater detail below.
  • a cloud-based computing environment is a resource that typically combines the computational power of a large grouping of processors (such as within web servers) and/or that combines the storage capacity of a large grouping of computer memories or storage devices.
  • Systems that provide cloud-based resources may be utilized exclusively by their owners, or such systems may be accessible to outside users who deploy applications within the computing infrastructure to obtain the benefit of large computational or storage resources.
  • the cloud may be formed, for example, by a network of web servers that comprise a plurality of computing devices, such as the computer system 700, with each server (or at least a plurality thereof) providing processor and/or storage resources.
  • These servers may manage workloads provided by multiple users (e.g., cloud resource customers or other users).
  • each user places workload demands upon the cloud that vary in real-time, sometimes dramatically. The nature and extent of these variations typically depends on the type of business associated with the user.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Systems and methods for a dynamic local automatic speech recognition (ASR) vocabulary are provided. An example method includes defining user actionable screen content based on user interactions. At least a portion of the user actionable screen content is labeled. A local vocabulary, associated with a local ASR engine, is created based, in part, on the labeling. The local vocabulary includes words associated with functions of a mobile device and is limited by resources of the mobile device. The method includes determining whether speech includes a local key phrase or a cloud-based key phrase. Based on the determination, the method includes performing ASR on the speech using the local ASR engine, or sending the speech to a cloud-based computing engine and performing ASR therein based on the larger vocabulary of the cloud-based computing engine.
PCT/US2015/064523 2014-12-09 2015-12-08 Dynamic local ASR vocabulary WO2016094418A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201462089716P 2014-12-09 2014-12-09
US62/089,716 2014-12-09

Publications (1)

Publication Number Publication Date
WO2016094418A1 (fr) 2016-06-16

Family

ID=56108065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/064523 WO2016094418A1 (fr) 2014-12-09 2015-12-08 Dynamic local ASR vocabulary

Country Status (1)

Country Link
WO (1) WO2016094418A1 (fr)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9820042B1 (en) 2016-05-02 2017-11-14 Knowles Electronics, Llc Stereo separation and directional suppression with omni-directional microphones
US10249323B2 (en) 2017-05-31 2019-04-02 Bose Corporation Voice activity detection for communication headset
US10311889B2 (en) 2017-03-20 2019-06-04 Bose Corporation Audio signal processing for noise reduction
US10366708B2 (en) 2017-03-20 2019-07-30 Bose Corporation Systems and methods of detecting speech activity of headphone user
US10424315B1 (en) 2017-03-20 2019-09-24 Bose Corporation Audio signal processing for noise reduction
US10438605B1 (en) 2018-03-19 2019-10-08 Bose Corporation Echo control in binaural adaptive noise cancellation systems in headsets
US10499139B2 (en) 2017-03-20 2019-12-03 Bose Corporation Audio signal processing for noise reduction
EP3676827A4 (fr) * 2017-08-28 2021-04-14 Roku, Inc. Local and cloud speech recognition
US11062702B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Media system with multiple digital assistants
US11126389B2 (en) 2017-07-11 2021-09-21 Roku, Inc. Controlling visual indicators in an audio responsive electronic device, and capturing and providing audio using an API, by native and non-native computing devices and services
US11145298B2 (en) 2018-02-13 2021-10-12 Roku, Inc. Trigger word detection with multiple digital assistants
US11984125B2 (en) 2021-04-23 2024-05-14 Cisco Technology, Inc. Speech recognition using on-the-fly-constrained language model per utterance


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1081685A2 (fr) * 1999-09-01 2001-03-07 TRW Inc. Procédé de réduction de bruit dans un signal de parole utilisant un microphone unique
WO2012094422A2 (fr) * 2011-01-05 2012-07-12 Health Fidelity, Inc. Système et procédé vocaux pour saisie de données
US20130289988A1 (en) * 2012-04-30 2013-10-31 Qnx Software Systems Limited Post processing of natural language asr
US20130289996A1 (en) * 2012-04-30 2013-10-31 Qnx Software Systems Limited Multipass asr controlling multiple applications

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9820042B1 (en) 2016-05-02 2017-11-14 Knowles Electronics, Llc Stereo separation and directional suppression with omni-directional microphones
US10311889B2 (en) 2017-03-20 2019-06-04 Bose Corporation Audio signal processing for noise reduction
US10366708B2 (en) 2017-03-20 2019-07-30 Bose Corporation Systems and methods of detecting speech activity of headphone user
US10424315B1 (en) 2017-03-20 2019-09-24 Bose Corporation Audio signal processing for noise reduction
US10499139B2 (en) 2017-03-20 2019-12-03 Bose Corporation Audio signal processing for noise reduction
US10762915B2 (en) 2017-03-20 2020-09-01 Bose Corporation Systems and methods of detecting speech activity of headphone user
US10249323B2 (en) 2017-05-31 2019-04-02 Bose Corporation Voice activity detection for communication headset
US11126389B2 (en) 2017-07-11 2021-09-21 Roku, Inc. Controlling visual indicators in an audio responsive electronic device, and capturing and providing audio using an API, by native and non-native computing devices and services
EP3676827A4 (fr) * 2017-08-28 2021-04-14 Roku, Inc. Local and cloud speech recognition
US11062710B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Local and cloud speech recognition
US11062702B2 (en) 2017-08-28 2021-07-13 Roku, Inc. Media system with multiple digital assistants
US11646025B2 (en) 2017-08-28 2023-05-09 Roku, Inc. Media system with multiple digital assistants
US11804227B2 (en) 2017-08-28 2023-10-31 Roku, Inc. Local and cloud speech recognition
US11961521B2 (en) 2017-08-28 2024-04-16 Roku, Inc. Media system with multiple digital assistants
US11145298B2 (en) 2018-02-13 2021-10-12 Roku, Inc. Trigger word detection with multiple digital assistants
US11664026B2 (en) 2018-02-13 2023-05-30 Roku, Inc. Trigger word detection with multiple digital assistants
US11935537B2 (en) 2018-02-13 2024-03-19 Roku, Inc. Trigger word detection with multiple digital assistants
US10438605B1 (en) 2018-03-19 2019-10-08 Bose Corporation Echo control in binaural adaptive noise cancellation systems in headsets
US11984125B2 (en) 2021-04-23 2024-05-14 Cisco Technology, Inc. Speech recognition using on-the-fly-constrained language model per utterance

Similar Documents

Publication Publication Date Title
US20160162469A1 (en) Dynamic Local ASR Vocabulary
WO2016094418A1 (fr) Dynamic local ASR vocabulary
US10045140B2 (en) Utilizing digital microphones for low power keyword detection and noise suppression
US9978388B2 (en) Systems and methods for restoration of speech components
TWI585744B (zh) Method, system, and computer-readable storage medium for operating a virtual assistant
US9668048B2 (en) Contextual switching of microphones
US20140244273A1 (en) Voice-controlled communication connections
US9799330B2 (en) Multi-sourced noise suppression
CN111192591B (zh) Wake-up method and apparatus for a smart device, smart speaker, and storage medium
US9953634B1 (en) Passive training for automatic speech recognition
US20190013025A1 (en) Providing an ambient assist mode for computing devices
US10353495B2 (en) Personalized operation of a mobile device using sensor signatures
KR20160091725A (ko) Speech recognition method and apparatus
US9437188B1 (en) Buffered reprocessing for multi-microphone automatic speech recognition assist
US20190130911A1 (en) Communications with trigger phrases
US11721338B2 (en) Context-based dynamic tolerance of virtual assistant
US20140316783A1 (en) Vocal keyword training from text
JP6619488B2 (ja) Continuous conversation function in an artificial intelligence device
US9772815B1 (en) Personalized operation of a mobile device using acoustic and non-acoustic information
US9508345B1 (en) Continuous voice sensing
KR20200019522A (ko) GUI voice control apparatus and method
KR102629796B1 (ko) Electronic device supporting improved speech recognition
KR20140116642A (ko) Method and apparatus for controlling functions based on speech recognition
US20170206898A1 (en) Systems and methods for assisting automatic speech recognition
JP2019175453A (ja) System including processing of user speech input, method of operating same, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15868411

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15868411

Country of ref document: EP

Kind code of ref document: A1