CN113628613A - Two-stage user customizable wake word detection


Info

Publication number
CN113628613A
Authority
CN
China
Prior art keywords
model
training
detected utterance
determining
utterance
Prior art date
Legal status
Pending
Application number
CN202110467675.7A
Other languages
Chinese (zh)
Inventor
R. Zopf
A. Pandey
Current Assignee
Cypress Semiconductor Corp
Original Assignee
Cypress Semiconductor Corp
Priority date
Filing date
Publication date
Priority claimed from US 17/032,653 (US11783818B2)
Application filed by Cypress Semiconductor Corp
Publication of CN113628613A

Classifications

    • G10L15/063: Training of speech recognition systems (creation of reference templates, e.g. adaptation to the characteristics of the speaker's voice)
    • G06F3/16: Sound input; sound output
    • G06F3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06F9/451: Execution arrangements for user interfaces
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/144: Training of HMMs (speech classification or search using statistical models, e.g. Hidden Markov Models)
    • G10L2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

Described herein are devices, methods, and systems for detecting phrases from spoken speech. The processing device may determine a first model for phrase recognition based on likelihood ratios using a set of training utterances. The set of utterances may be analyzed by the first model to determine a second model that includes a training state sequence for each of the set of training utterances, and wherein each training state sequence indicates a possible state for each time interval of the corresponding training utterance. The determination as to whether the detected utterance corresponds to a phrase may be based on a concatenation of the first model and the second model.

Description

Two-stage user customizable wake word detection
RELATED APPLICATIONS
This application claims priority to U.S. provisional patent application No. 63/020,984, filed on 6/5/2020, the entire disclosure of which is hereby incorporated herein by reference.
Technical Field
The present disclosure relates generally to speech recognition systems and more particularly to wake word detection.
Background
More and more modern computing devices feature speech recognition capabilities, allowing users to perform a wide variety of computing tasks via voice commands and natural speech. Devices such as mobile phones or smart speakers provide integrated virtual assistants that can respond to user commands or natural language requests by communicating over local and/or wide area networks to retrieve requested information or control other devices, such as lights, heating and air conditioning controls, audio or video equipment, and the like. Devices with speech recognition capabilities typically remain in a low power consumption mode until a particular word or phrase (i.e., a wake word or phrase) is spoken, allowing a user to control the device using voice commands after the device is thus activated.
To initiate a voice-based user interface, Wake Word Detection (WWD) is typically deployed: a keyword or key phrase is continuously monitored for and, when detected, further voice-based interaction is enabled. Early WWD systems used a Gaussian mixture model-hidden Markov model (GMM-HMM) for acoustic modeling. More recently, deep learning or deep neural networks (NNs) have become an attractive option due to their higher accuracy than traditional approaches.
Drawings
The present embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
Fig. 1 is a block diagram illustrating a system according to some embodiments of the present disclosure.
Fig. 2 is a block diagram illustrating an audio processing device according to some embodiments of the present disclosure.
Fig. 3A-3C illustrate a wake word recognition model derivation process in accordance with some embodiments of the present disclosure.
Fig. 3D illustrates a conventional wake word recognition process, according to some embodiments of the present disclosure.
Fig. 4A illustrates a 2-phase model training and 2-phase wake word recognition process, according to some embodiments of the present disclosure.
Fig. 4B illustrates a wake word recognition model according to some embodiments of the present disclosure.
Fig. 5 illustrates a diagram of a state sequence of various utterances according to some embodiments of the present disclosure.
Fig. 6 illustrates a flow diagram of a method for identifying a wake word in accordance with some embodiments of the present disclosure.
Fig. 7 illustrates a flow diagram of a method for identifying a wake word in accordance with some embodiments of the present disclosure.
Fig. 8 illustrates a flow diagram of a method for identifying a wake word in accordance with some embodiments of the present disclosure.
Fig. 9 shows an embodiment of a core architecture of a programmable system-on-chip (PSoC®) processing device.
Detailed Description
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present embodiments. It will be apparent, however, to one skilled in the art that the present embodiments may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail, but rather in block diagrams in order to avoid unnecessarily obscuring the understanding of this description.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment.
As discussed above, to initiate a device using a wake word, Wake Word Detection (WWD) is typically utilized. Most approaches employ a pre-selected wake-up word (e.g., "hello") that cannot be modified by the user, and require tens of thousands of back-end training utterances. These pre-selected wake-up words are trained offline and work for all speakers and, therefore, are speaker independent. The detection of the wake word may be performed locally on the device and may then be verified by a more complex algorithm in the cloud. Speaker recognition is an additional function that can provide some degree of security or customization (e.g., user-specific playlists). Speaker recognition, however, is a complex task that is typically performed in the cloud and requires a cumbersome enrollment phase (reading text).
Many personal devices (e.g., headsets, audible devices, wearable devices, cameras, etc.) now feature voice interfaces. These devices are typically used by a few users or even a single user. Because they are battery powered, connectivity to the cloud is limited to conserve power. Therefore, it is desirable that the functionality remain local. Nonetheless, features such as wake word detection and speaker recognition are desirable, since hands-free operation is an important advantage for many of these products. One method for achieving these goals is to enable a user-personalized wake word. By having the user train their own wake word (or alternatively having several speakers share the same training), the wake word becomes speaker dependent and thus optimized for use by a particular speaker or a small number of speakers. Speaker independence is not necessarily a requirement, as these devices are used by a small number of users or even a single user. In addition, the customization of the wake word inherently identifies the user, and the privacy of a user-customizable wake word provides a level of security without requiring explicit and costly speaker recognition. However, it is challenging to implement such a system with as few training utterances as possible.
Deep learning or deep neural networks (NNs) have become an attractive option compared to conventional approaches due to their improved accuracy. However, these systems are trained offline for a fixed or given wake word (e.g., "hello"). They require thousands of speakers to repeat tens of thousands of utterances. Some solutions do provide an option to adapt to the user's voice at a later time (an enrollment phase, or usage-based adaptation), but generally do not offer the ability to train an arbitrary, user-customizable wake word with only a small number of training utterances. Other available solutions (e.g., isolated word training and detection) suffer from sensitivity to "spoof" phrases that share important phonemes (or building-block sounds) with the wake word. Indeed, such systems are susceptible to a relatively high rate of false detections for similar-sounding pronunciations.
Embodiments described herein are directed to devices, methods, and systems for detecting a wake word from spoken speech. The processing device may determine a first model for wake word recognition based on likelihood ratios using a training set of utterances. The set of utterances may be analyzed by the first model to determine a second model that includes a training state sequence for each of the set of training utterances, and wherein each training state sequence indicates a possible state for each time interval of the corresponding training utterance. Whether the detected utterance corresponds to a wake word may be determined based on a concatenation of the first model and the second model. More specifically, the processing device may measure a distance between each training state sequence and the detected state sequence of the utterance to generate a set of distances, and may determine a minimum distance among the set of distances. The processing device may determine whether the detected utterance corresponds to a wake word based at least in part on a likelihood ratio of the detected utterance and a minimum distance among a set of distances.
Fig. 1 is a block diagram of a system 100 showing an audio processing device 102 communicatively coupled to other devices over one or more networks 114, in accordance with various embodiments. The audio processing device 102 is used to facilitate audio pattern recognition and may control devices or applications, such as device 103, based on the recognized audio patterns. Audio processing device 102 is shown receiving sound waves 105 from audio pattern source 104 and sound waves 107 from audio interference source 106. The audio processing device 102 itself may emit audio interference (not shown) (e.g., through a speaker).
The audio processing device 102 is also shown interacting with a network 114 through a communication link. To facilitate pattern recognition, the audio processing device 102 provides noise cancellation to remove some or all of the audio interference, using corresponding audio data received over the network 114 from the audio interference source 106 or generated internally. In embodiments, noise cancellation may be implemented using Independent Component Analysis (ICA), in which the incoming signals (e.g., from microphones) are separated by source (e.g., signals from audio pattern sources and audio interference sources), and the audio data is then compared to the separated signals to determine which signals should be removed to leave an estimated audio pattern. Noise cancellation in other embodiments may utilize adaptive filters, neural networks, or any technique known in the art that may be used to attenuate non-target components of a signal. In some embodiments, the audio processing device 102 may be integrated with the controlled device 103, which may be controlled based on the identified audio pattern.
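As an illustration of the adaptive-filter option mentioned above, the sketch below removes a known interference component from a microphone signal with a normalized least-mean-squares (NLMS) filter, assuming a reference copy of the interference is available (e.g., audio data received over the network from the audio interference source 106). The function name, filter length, and step size are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def nlms_noise_cancel(mic, reference, taps=64, mu=0.1, eps=1e-8):
    """Attenuate a known interference component in a microphone signal.

    mic       -- samples captured by the microphone (target pattern + interference)
    reference -- samples of the interference alone (e.g., received over the network)
    Returns the residual signal, i.e., the estimated audio pattern with the
    interference largely removed.
    """
    w = np.zeros(taps)          # adaptive filter coefficients
    buf = np.zeros(taps)        # most recent reference samples (newest first)
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        y = np.dot(w, buf)      # filter's estimate of the interference at the mic
        e = mic[n] - y          # residual = estimated target pattern
        w += (mu / (eps + np.dot(buf, buf))) * e * buf   # NLMS coefficient update
        out[n] = e
    return out
```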
The audio pattern source 104 is used to provide sound waves 105 corresponding to recognizable audio patterns (e.g., wake words). In some embodiments, audio pattern source 104 may interact with network 114 through a communication link. In some embodiments, the audio pattern is a predetermined audio pattern and/or an audio pattern recognizable by pattern recognition software or firmware associated with the audio processing device 102. The audio pattern source 104 may be a living object (e.g., a person) or an inanimate object (e.g., a machine).
Audio interference source 106 may be a source of sound waves 107 that interfere with the identification of the audio pattern corresponding to sound waves 105. Audio interference source 106 is shown interacting with network 114 through a communication link. The audio interference source 106 may provide audio data corresponding to the audio interference to the audio processing device 102 over the network 114. The audio interference sources may include speakers, televisions, video games, industrial noise sources, or any other noise source, the sound output of which is digitized or may be digitized and provided to the audio processing device 102 via the network 114.
The controlled second device 108 is shown coupled to the network 114 via a link. The controlled devices 108 and 103 may include any device having functionality that can be initiated in response to audio pattern recognition facilitated by the audio processing device 102. Exemplary controlled devices include white goods, home automation controllers, thermostat controllers, lighting, automatic blinds, automatic door locks, automotive controls, windows, industrial controls, and actuators. As used herein, a controlled device may also include any logic, firmware, or software application executed by the controlled device 110.
Network 114 may include one or more types of wired and/or wireless networks to communicatively couple the network nodes of fig. 1 to one another. For example, but not limited to, the network may include a Wireless Local Area Network (WLAN) (e.g., Wi-Fi, 802.11 compliant), a PAN (e.g., bluetooth SIG standard or Zigbee, IEEE 802.15.4 compliant), and the internet. In an embodiment, the audio processing device 102 is communicatively coupled to the pattern recognition application 112 through Wi-Fi and the Internet. The audio processing device 102 may be communicatively coupled to the audio interference source 106 and the controlled device 108 through bluetooth and/or Wi-Fi.
Fig. 2 is a block diagram illustrating components of an audio processing device 202 according to an embodiment. The audio processing device 202 may be a microcontroller of the Cypress PSoC family developed by Cypress Semiconductor Corporation of San Jose, California. The audio processing device 202 is shown to include functional blocks including a microphone array 220, an audio interface 221, a threshold calculation module 222, a voice onset detector (SOD) 223, an audio interface control 224, a buffer 225, a combiner 226, and a wake-up phrase detector (WUPD) 228. Each functional block may be coupled to the bus system 227 (e.g., I2C, I2S) and may be implemented using hardware (e.g., circuitry), instructions (e.g., software and/or firmware), or a combination of hardware and instructions. In one embodiment, some or all of the audio processing device 202 is implemented by an integrated circuit device (i.e., on a single integrated circuit substrate) or circuitry in a single device package. In alternative embodiments, the components of the audio processing device 202 are distributed among multiple integrated circuit devices, device packages, or other circuits.
Microphone array 220 is used to receive sound waves such as sound waves 105 and 107 of fig. 1. Each microphone in the microphone array 220 includes a transducer or other mechanism (e.g., including a diaphragm) to convert the energy of the sound waves into an electronic or digital signal (e.g., audio data). The microphone array 220 may include one or more microphones and is sometimes referred to herein as a microphone 220. When sound waves 105 and 107 are received during a common period, the audio data includes components corresponding to both sound waves 105 and 107. In some embodiments, one or more microphones of array 220 may be digital microphones. The microphone array 220 may be part of the audio interface 221 or a separate peripheral device external to the audio processing device 202 but coupled to the bus system 227. In some embodiments, the microphone array may include threshold/hysteresis settings and/or processing logic for activity detection and measurement to determine whether sound waves received by the microphone array 220 meet or exceed an activation threshold and whether corresponding audio data should be passed to the SOD 223 for processing. In various embodiments, the threshold level of activity may be an energy level, amplitude, frequency, or any other property of the acoustic wave. The microphone array 220 may be coupled to a memory (not shown) that stores activation thresholds, which may be dynamically reprogrammable (e.g., by the threshold calculation module 222).
The audio interface 221 includes circuitry for processing and analyzing audio data received from the microphone array 220. In an embodiment, audio interface 221 digitizes the electronic audio signal. Once digitized, the audio interface 221 may provide signal processing (e.g., demodulation, mixing, filtering) to analyze or manipulate properties (e.g., phase, wavelength, frequency) of the audio data. Audio interface 221 may also perform beamforming and/or other noise suppression or signal conditioning methods to improve performance in the presence of noise, reverberation, and the like.
In one embodiment, the audio interface 221 includes a Pulse Density Modulator (PDM) front end connected to the microphone array 220. In the PDM front end, the PDM generates a pulse-density-modulated bitstream based on the electronic signals from the microphone array 220. The PDM provides a clock signal to the microphone 220 that determines the initial sampling rate, and then receives a data signal from the microphone 220 representing the audio captured from the environment. From the data signal, the PDM generates a PDM bitstream and may provide the bitstream to a decimation filter, which generates the audio data provided to the bus system 227 either as high-quality audio data or, by reducing the sampling rate of the PDM bitstream, as low-quality audio data. In an alternative embodiment, the audio data source is an auxiliary analog-to-digital converter (AUX ADC) front end. In the AUX ADC front end, an analog-to-digital converter converts the analog signal from the microphone 220 to a digital audio signal. The digital audio signal may be provided to a decimation filter, which generates the audio data provided to the bus system 227 either as high-quality audio data or, by reducing the sampling rate of the digital audio signal from the ADC, as low-quality audio data.
The audio interface control 224 is used to control the timing and the sampling rate of the sampling performed by the audio interface 221 or the microphone array 220. For example, the audio interface control 224 may control the audio quality (e.g., sample rate) of audio data provided to the SOD 223 and the buffer 225, and may also control when such audio data should be provided to the bus system 227, whether periodically or continuously. Although shown as a separate functional block, the functionality of the audio interface control 224 may be performed by the SOD 223 and/or the buffer 225 or any other functional block.
The SOD 223 is used to determine whether audio data received from the audio interface 221 represents a voice onset. The SOD 223 may use any voice onset detection algorithm or technique known to those of ordinary skill in the art. In an embodiment, audio data with a reduced sampling rate (e.g., 2-4 kHz) is sufficient for detecting voice onset (or other sound onset events) while allowing the SOD 223 to be clocked at a lower frequency, thus reducing the power consumption and complexity of the SOD 223. Upon detection of a voice onset event, the SOD 223 asserts a status signal on the bus 227 to wake the wake-up phrase detector (WUPD) 228 from a low power consumption state (e.g., a sleep state) to a high power consumption state (e.g., an active state) to perform phrase detection, as will be discussed further below. Gating the WUPD 228 block in this manner reduces the average system processing load and reduces the False Acceptance Rate (FAR) by minimizing the background noise and spurious tones considered by the WUPD 228.
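A minimal sketch of this gating behavior is shown below, assuming a simple frame-energy criterion for the sound-onset stage (the disclosure does not specify which onset-detection algorithm the SOD 223 uses); the function names and the callable passed in for the wake-phrase stage are illustrative.

```python
import numpy as np

def frame_energy(frame):
    """Mean-square energy of one audio frame."""
    return float(np.mean(np.square(frame)))

def gated_wake_word_detection(frames, activation_threshold, wupd_detect):
    """Keep the higher-power wake-phrase detector asleep until a voice onset.

    frames               -- iterable of audio frames (numpy arrays)
    activation_threshold -- energy level treated as a voice-like onset
    wupd_detect          -- callable run on frames only after onset is detected
    """
    wupd_awake = False
    for frame in frames:
        if not wupd_awake:
            # Low-complexity check standing in for the sound-onset detector (SOD).
            if frame_energy(frame) >= activation_threshold:
                wupd_awake = True        # "assert status signal": wake the WUPD
        else:
            # Higher-complexity wake-phrase detection runs only while awake.
            if wupd_detect(frame):
                return True              # wake word / phrase detected
    return False
```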
The threshold calculation module 222 monitors the ambient noise to dynamically calculate and potentially readjust the activation threshold of the audio that should trigger voice onset detection to avoid unnecessary processing by the SOD 223. In an embodiment, audio interface control 224 causes audio interface 221 to periodically provide audio data (e.g., ambient noise) to threshold calculation module 222 at intervals. In an embodiment, the threshold calculation module 222 may reset the activation threshold level from below the current ambient noise level to above the current ambient noise level.
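The following sketch illustrates one way such a dynamic threshold could be recomputed from periodically sampled ambient noise, placing the activation level a fixed margin above the measured noise energy; the margin value and function name are assumptions for illustration only.

```python
import numpy as np

def update_activation_threshold(ambient_frames, margin_db=6.0, floor=1e-6):
    """Recompute the activation threshold from periodically sampled ambient noise.

    The threshold is set a fixed margin above the measured ambient energy so that
    steady background noise does not repeatedly trigger the sound-onset detector.
    """
    ambient_energy = float(np.mean([np.mean(np.square(f)) for f in ambient_frames]))
    ambient_energy = max(ambient_energy, floor)       # guard against pure silence
    return ambient_energy * (10.0 ** (margin_db / 10.0))
```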
Buffer 225 is used to store periodically sampled, predominantly noisy audio data. In an embodiment, the buffer 225 is sized to store slightly more than 250 milliseconds of audio data (e.g., 253 milliseconds) to accommodate the combination as discussed below. Alternatively, or in addition, after the SOD 223 has detected a voice onset, the buffer 225 may act as a channel to pass continuously sampled audio data including a wake phrase and a command or query. In an embodiment, audio interface control 224 causes audio interface 221 to periodically provide the dominant noise to buffer 225 at intervals. Once the SOD 223 has detected a voice-like sound, the audio interface control 224 may cause the audio interface 221 to continuously provide the remaining audio data to the buffer.
The combiner 226 is configured to generate continuous audio data using the periodically captured, predominantly noise audio data and the continuously captured remaining audio data. In an embodiment, the combiner 226 stitches a portion of the end of the last periodically captured audio data to a portion of the beginning of the continuously captured audio data. For example, the combiner 226 may use an overlap-add operation to overlap 3 milliseconds of the predominantly noise audio data with the continuously captured audio data. The combiner 226 may output the continuous audio data to the WUPD 228 via the bus system 227.
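A minimal sketch of such stitching is shown below, crossfading the two segments over the overlap region; the 16 kHz sample rate and the linear fade are assumptions, and only the 3-millisecond overlap comes from the description above.

```python
import numpy as np

def stitch_with_overlap_add(buffered_audio, continuous_audio,
                            sample_rate=16000, overlap_ms=3):
    """Join the end of the buffered (mostly noise) audio with the start of the
    continuously captured audio using an overlap-add crossfade."""
    n = int(sample_rate * overlap_ms / 1000)      # 48 samples at 16 kHz for 3 ms
    fade_out = np.linspace(1.0, 0.0, n)           # applied to the tail of the buffer
    fade_in = 1.0 - fade_out                      # applied to the head of the new audio
    overlap = buffered_audio[-n:] * fade_out + continuous_audio[:n] * fade_in
    return np.concatenate([buffered_audio[:-n], overlap, continuous_audio[n:]])
```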
The WUPD 228 may determine whether the continuous audio data output by the combiner 226 includes a wake word or a wake phrase. When WUPD 228 is activated, it may perform higher complexity and higher power calculations (e.g., relative to SOD 223) to determine whether a wake word or phrase has been spoken, as discussed in further detail with respect to fig. 3A-8. The WUPD 228 may make this determination based on the audio data recorded in the buffer 225 (corresponding to the time before the start of speech) and the high quality audio data received after the start of speech is detected.
FIG. 3A illustrates a diagram of a conventional Gaussian mixture model-hidden Markov model (GMM-HMM) wake word recognition method 300, in which a whole-word model of the wake word is trained using any suitable algorithm, such as the maximum likelihood forward-backward algorithm. The Gaussian observation model may use either a diagonal or a full covariance structure, with the diagonal structure being the dominant approach. The word model may include a "left-to-right" linear sequence of states, with each phoneme having, for example, about three states. The observation vector O is obtained by front-end spectral analysis, where Mel-frequency cepstral coefficients (MFCCs) and their derivatives are the most common features. In the standard training method, tens to hundreds of utterances of the wake word are used in offline training to generate the final word model, as depicted in fig. 3A. During decoding, one may wish to compute the probability of an observation sequence O = O_1 O_2 … O_T given a model λ, i.e., P(O | λ). This calculation may be performed using any suitable algorithm, such as the Viterbi algorithm.
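For illustration, a log-domain Viterbi recursion that scores an observation sequence against an N-state left-to-right HMM might be sketched as follows. This is only a generic sketch under the usual HMM definitions (log observation likelihoods, log transition and initial probabilities); none of the names or shapes are mandated by the disclosure.

```python
import numpy as np

def log_viterbi(log_obs, log_trans, log_init):
    """Best-path (Viterbi) approximation to log P(O | lambda).

    log_obs   -- T x N matrix of log observation likelihoods, log b_j(O_t)
    log_trans -- N x N matrix of log transition probabilities, log a_ij
    log_init  -- length-N vector of log initial-state probabilities
    Returns (best_log_score, best_state_sequence).
    """
    T, N = log_obs.shape
    delta = log_init + log_obs[0]               # best score ending in each state at t = 0
    back = np.zeros((T, N), dtype=int)          # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # scores[i, j]: best path into j via i
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + log_obs[t]
    best_last = int(np.argmax(delta))
    path = [best_last]
    for t in range(T - 1, 0, -1):               # trace the backpointers
        path.append(int(back[t, path[-1]]))
    return float(np.max(delta)), list(reversed(path))
```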
Fig. 3B illustrates a phoneme-based training method 310 in which individual phonemes are trained offline to create a phoneme database. It has been found that a set of approximately 50 of the most common phonemes is sufficient, where each phoneme is modeled with 3 HMM states and 2 Gaussian mixtures per state. During wake word training, the user may utter the customized wake word several (e.g., 1-3) times. These utterances, and optionally an augmented set, are used in the subsequent training process shown in fig. 3C.
As shown in FIG. 3C, the phoneme recognition block 315 is used to determine the most likely phoneme decomposition of the uttered wake word "abracadabra". The models of the detected phonemes are then used to construct an initial word-based model λ_1 for the wake word from a concatenated phoneme-based transcription (one transcription for each instance of the uttered phrase).
The training utterances may then be used again to adapt the initial word model to the speaker, capture inter-phoneme dependencies, and reduce the dimension of the concatenated model. This training abstracts the speaker-independent concatenated phoneme-based model into a speaker-dependent word-based model. Pruning and state combining also occur during adaptation to reduce the size of the model, as also shown in fig. 3C. State combining involves the merging of two or more similar or identical states that have a very high probability of transition between them. In the example of FIG. 3C, the 3rd and 4th states of the initial HMM word model, corresponding to "R" and "R", may be combined into a single "R" state, and states 11-13 of the initial HMM word model, corresponding to "E", "R", and "R", may be combined into a single "R" state.
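The simplest part of this step, collapsing runs of identical adjacent states in the concatenated transcription, might be sketched as below; merging of merely similar (rather than identical) states, as in the "E"/"R"/"R" example, would additionally require a state-similarity measure that is not shown here. The function name is illustrative.

```python
def merge_identical_adjacent_states(phoneme_states):
    """Collapse runs of identical adjacent states in a concatenated phoneme model.

    Example: ['A', 'B', 'R', 'R', 'A'] -> ['A', 'B', 'R', 'A'], mirroring the
    merging of the adjacent "R" states described above.
    """
    merged = []
    for state in phoneme_states:
        if not merged or merged[-1] != state:   # keep a state only if it differs
            merged.append(state)                # from the previously kept state
    return merged
```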
In the standard approach, once a model is obtained, it is used to evaluate the speech observations O (also referred to herein as utterances), and the wake word is detected if the probability that the observation sequence was produced by the model λ_1 exceeds a threshold TH, normalized by the probability given a garbage or background model λ_g. This process is depicted in fig. 3D.
The likelihood ratio LR is defined as:
Figure BDA0003043865360000091
In other words, the LR measures how well the observation O matches the model λ_1 relative to how well it matches the garbage model λ_g. However, apart from this final probability, this method does not capture any information about how the model λ_1 models O. In many cases, a wrongly accepted word may exhibit a partial phoneme match, or contain a completely matched subset of phonemes. Such an utterance typically has a very high LR because all of the phonemes match and the vowel is preserved. In such cases, relying on the LR alone would cause a false acceptance. Another common false-acceptance scenario is a word matching only a portion of the wake word. Embodiments of the present disclosure overcome the above problems by incorporating how the model is internally excited over the duration of the input observation sequence into the model training and decision process. In the case of the HMM-GMM model, one such excitation is the most likely state.
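For reference, the stage-1 decision based on this ratio reduces to a simple threshold test, shown below in the log domain (the scores would come from a scoring routine such as the Viterbi sketch above; the threshold value is a tuning parameter, not specified by the disclosure):

```python
def stage1_lr_decision(log_p_wake, log_p_garbage, log_threshold):
    """Accept the utterance as the wake word when
    log P(O | lambda_1) - log P(O | lambda_g) >= log TH,
    which is equivalent to LR(O, lambda_1, lambda_g) >= TH."""
    log_lr = log_p_wake - log_p_garbage
    return log_lr >= log_threshold
```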
FIG. 4A illustrates a model generation training method in accordance with some embodiments of the present disclosure. After obtaining the model λ_1 (as discussed above with respect to figs. 3B-3D), the audio processing device 202 may perform a second stage of training by processing the training utterances with the model λ_1 to obtain, for each training utterance, a state sequence indicating the most likely state (e.g., a particular phoneme model) at each time interval of the training utterance; these state sequences form the basis of the stage-2 model λ_2. For example, each training utterance may last 420 milliseconds and may be captured in 10-millisecond frames (e.g., 10-millisecond time intervals). The state sequence may indicate the most likely state (e.g., a particular phoneme model) the training utterance is in at each particular frame of the utterance (e.g., as the model is traversed). As shown in fig. 4B, the model λ_1 may be an 11-state model captured over 420 ms at 10-millisecond intervals. Fig. 4B also shows the varying time duration of each state. As shown in the example of fig. 4B, each state may last 20 ms, except for the 4th state ("A"), which is shown lasting 30 ms. The state sequence may indicate the likelihood of remaining in the current state or transitioning to a new state at each time interval.
Referring back to fig. 4A, the audio processing device 202 may solve for the individually most likely state q_t of the N-state model λ_1 at time t as follows:
q_t = argmax_{1 ≤ i ≤ N} [ γ_t(i) ],  1 ≤ t ≤ T,  where γ_t(i) = P(q_t = S_i | O, λ_1)
If u_k is the k-th of the K total training utterances, and
O^{u_k} = O_1^{u_k} O_2^{u_k} … O_T^{u_k}
is the corresponding observation sequence, the audio processing device 202 may determine during training the most probable state sequence of each training utterance u_k as follows:
Q^{u_k} = q_1^{u_k} q_2^{u_k} … q_T^{u_k}
The audio processing device 202 may create the second-stage model λ_2 by collecting the state sequence of each of the K training utterances:
λ_2 = { Q^{u_1}, Q^{u_2}, …, Q^{u_K} }
The audio processing device 202 may then generate the final model λ by concatenating stage 1 and stage 2 (λ_1 and λ_2):
λ = { λ_1, λ_2 }
Although the model λ_2 is obtained as described above, any suitable method may be used; for example, the audio processing device 202 may utilize statistical methods to obtain a distribution of each state over time. Fig. 4A also illustrates a 2-phase identification process according to some embodiments of the present disclosure. The audio processing device 202 may use the model λ_1 (e.g., based on the LR) in combination with the model λ_2 to perform recognition of the detected utterance. For example, after using the model λ_1 to make a recognition decision, the audio processing device 202 may use the model λ_2 to verify the recognition result from the model λ_1. For verification, the audio processing device 202 may determine the distance between each of the K training state sequences and the state sequence Q of the detected utterance (i.e., the utterance being recognized), thereby generating K distances, and determine the minimum of the K distances, calculated as:
Dqmin_u = min_{1 ≤ k ≤ K} D(Q^{u_k}, Q)
The audio processing device 202 may incorporate this minimum distance measure into the final decision:
Final Decision = f(Dqmin_u, LR(O, λ_1, λ_g))
The audio processing device 202 may compare Dqmin_u to a threshold and, if Dqmin_u is above the threshold, determine that the detected utterance is not the wake word. In some embodiments, the audio processing device 202 may assign a weight to Dqmin_u so that it has more or less impact on the final decision based on, for example, user preferences. It should be noted that although the above example uses both Dqmin_u and LR(O, λ_1, λ_g) to determine whether the detected utterance corresponds to the wake word, in some embodiments the audio processing device 202 may utilize only the model λ_2 (e.g., only Dqmin_u) when making this determination.
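A minimal sketch of this stage-2 verification is shown below. It assumes the training state sequences (the λ_2 model) and the detected utterance's state sequence have already been obtained from the stage-1 model λ_1, e.g., as the per-frame most likely states described above. The frame-wise mismatch rate used as the distance D and the simple thresholded combination used for f(·) are illustrative choices, since the disclosure does not fix either.

```python
def state_sequence_distance(q_ref, q_det):
    """Frame-wise mismatch rate between two state sequences (one possible D)."""
    length = min(len(q_ref), len(q_det))
    mismatches = sum(1 for t in range(length) if q_ref[t] != q_det[t])
    return mismatches / float(length)

def stage2_state_decision(lr_score, q_detected, training_state_sequences,
                          lr_threshold, dqmin_threshold):
    """Combine the stage-1 likelihood ratio with Dqmin_u, the minimum distance
    between the detected state sequence and the K training state sequences."""
    dqmin_u = min(state_sequence_distance(q_k, q_detected)
                  for q_k in training_state_sequences)
    # Reject when even the best-matching training state sequence is too far
    # away, even if the likelihood ratio alone would have accepted the utterance.
    return (lr_score >= lr_threshold) and (dqmin_u <= dqmin_threshold)
```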
Fig. 5 shows the state sequences of various utterances. The state sequence Q^{u_k} of one of the training utterances of "abracadabra" is marked as "reference". The "reference" sequence may correspond to the training utterance whose state sequence has the minimum distance (Dqmin_u) from the state sequence of the utterance being recognized. A recognized wake word identical to the training utterance (e.g., "abracadabra") is labeled "WW" and, as shown, follows the "reference" state sequence well. Another wake word that is identical to the training utterance but affected by noise is labeled "WW-noise". As can be seen, the "WW-noise" state sequence exhibits some minor mismatches against the "reference" state sequence, but still follows it well overall. In a noisy environment, the LR is typically lower than in a quiet environment, while the state sequence exhibits only a slight increase in distance score, thus enhancing detectability in a noisy environment.
In many cases, an incorrectly accepted word may exhibit a partial phoneme match, or contain a completely matched subset of phonemes. For example, the utterance "abraaaaaaaaaaaaabara" is shown in fig. 5 and labeled "abraaaaaaaaaaaaabara". In this case, the state sequence of "abraaaaaaaaaaaaabara" matches well at the beginning, but then remains around the state representing "a". The state sequence matches well again near the end of the utterance. Such an utterance typically has a very high LR because all of the phonemes match and the vowel is preserved. In this example, relying on the LR alone may cause a false acceptance. However, using Dqmin_u shows that there is a substantial distance between the state sequence of "abraaaaaaaaaaaaabara" at frames 30-63 and the state sequence of the "reference" at frames 30-63. Thus, the audio processing device 202 may reject the utterance "abraaaaaaaaaaaaabara". Another common false-acceptance scenario is an utterance that matches only a portion of the wake word. This is illustrated by the state sequence labeled "updowncadabra" in fig. 5. The non-matching part of the "updowncadabra" state sequence exhibits a poor match with the "reference" state sequence at the corresponding frames, resulting in a high distance score and rejection. Finally, a completely unmatched utterance, labeled "california", is shown. Such an utterance will likely exhibit a very low LR, in addition to following the "reference" state sequence poorly throughout.
In some embodiments, the model λ_2 used to determine the distance between each of the K training utterances and the utterance being recognized may be based on other parameters. One such parameter is how the likelihood ratio LR evolves over time with each observation O_t. After obtaining the model λ_1 (as discussed above with respect to figs. 3B-3D), the audio processing device 202 may process the training utterances u_k with the model λ_1 to determine how the likelihood ratio LR of each training utterance evolves over time with each observation O_t. Given the observation sequence O_t = O_1, …, O_t and the model λ_1, the probability of being in state S_i at time t is given by:
γ_t(i, O_t, λ_1) = P(q_t = S_i | O_t, λ_1)
The maximum probability over all states of the model λ_1 at time t is then given by:
γ_t^max(O_t, λ_1) = max_{1 ≤ i ≤ N} γ_t(i, O_t, λ_1)
Processing each training utterance u_k in this manner gives, for the wake word model and the garbage model respectively,
γ_t^max(O_t^{u_k}, λ_1)
γ_t^max(O_t^{u_k}, λ_g)
and thus the audio processing device 202 may determine the likelihood ratio (LR) of the training utterance over time as:
LR_t^{u_k} = γ_t^max(O_t^{u_k}, λ_1) / γ_t^max(O_t^{u_k}, λ_g)
The audio processing device 202 may then generate the second-stage model λ_2 as follows:
λ_2 = { LR_t^{u_1}, LR_t^{u_2}, …, LR_t^{u_K} }
During recognition of a received utterance, the audio processing device 202 may calculate a distance measure D between each LR_t^{u_k} and LR_t (the likelihood ratio of the received utterance over time) to generate K distance measures. The audio processing device 202 may then use the minimum distance Dmin_u among all K distance measures (e.g., among all training utterances), together with the likelihood ratio LR(O, λ_1, λ_g), to compute the final decision:
Dmin_u = min_{1 ≤ k ≤ K} D(LR_t^{u_k}, LR_t)
Final Decision = f(Dmin_u, LR(O, λ_1, λ_g))
The audio processing device 202 may compare Dmin_u to a threshold and, if Dmin_u is above the threshold, determine that the detected utterance is not the wake word. In some embodiments, the audio processing device 202 may assign a weight to Dmin_u so that it has more or less impact on the final decision based on, for example, user preferences. In some embodiments, the audio processing device 202 may incorporate both of the stage-2 recognition parameters discussed above into the final decision:
Final Decision = f(Dqmin_u, Dmin_u, LR(O, λ_1, λ_g))
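The sketch below illustrates this combined decision, using a mean-absolute-difference distance between LR trajectories and a simple thresholded combination for f(·); both are illustrative assumptions, as the disclosure leaves the particular distance measure and combination function open.

```python
import numpy as np

def lr_trajectory_distance(lr_ref, lr_det):
    """Mean absolute difference between two likelihood-ratio trajectories
    (one LR value per time interval)."""
    length = min(len(lr_ref), len(lr_det))
    diff = np.asarray(lr_ref[:length]) - np.asarray(lr_det[:length])
    return float(np.mean(np.abs(diff)))

def dmin_over_training(lr_detected, training_lr_trajectories):
    """Dmin_u: minimum trajectory distance over all K training utterances."""
    return min(lr_trajectory_distance(lr_k, lr_detected)
               for lr_k in training_lr_trajectories)

def combined_final_decision(lr_score, dqmin_u, dmin_u,
                            lr_threshold, dqmin_threshold, dmin_threshold):
    """Final Decision = f(Dqmin_u, Dmin_u, LR): accept only when the stage-1
    likelihood ratio is high enough and both stage-2 distances are small enough.
    The thresholds (and any per-term weighting) are tuning parameters."""
    return (lr_score >= lr_threshold
            and dqmin_u <= dqmin_threshold
            and dmin_u <= dmin_threshold)
```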
Other parameters may also be used instead of, or in addition to, the likelihood ratio over time and the state sequence. Examples of such parameters include a voicing measure, pitch, and frame energy, among others. In addition, the models λ_1 and λ_2 are not limited to HMM-GMM models, but may also include a neural network or any other suitable model type.
Fig. 6 is a flow diagram of a method 600 of detecting a wake word according to some embodiments. The method 600 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a Central Processing Unit (CPU), a system on a chip (SoC), etc.), software (e.g., instructions run/executed on a processing device), firmware (e.g., microcode), or a combination thereof. For example, the method 600 may be performed by the audio processing device 202 executing wake word detection firmware.
Referring also to fig. 2, at block 605, the audio processing device 202 may determine a first model configured to recognize a wake word based on a set of training utterances. At block 610, the audio processing device 202 may analyze the set of training utterances using the first model to determine a second model that includes a training state sequence for each of the set of training utterances, and wherein each training state sequence indicates a possible state for each time interval of the corresponding training utterance. At block 615, the audio processing device 202 may determine a sequence of states of the detected utterance that indicates possible states for each time interval of the detected utterance. At block 620, the audio processing device 202 may determine a distance between each training state sequence and the detected state sequence of the utterance to generate a set of distances. At block 625, it is determined whether the detected utterance corresponds to a wake word based at least in part on the likelihood ratio of the detected utterance and the minimum distance among the set of distances.
Fig. 7 is a flow diagram of a method 700 for detecting a wake word according to some embodiments. Method 700 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a Central Processing Unit (CPU), a system on a chip (SoC), etc.), software (e.g., instructions run/executed on a processing device), firmware (e.g., microcode), or a combination thereof. For example, the method 700 may be performed by the audio processing device 202 executing wake word detection firmware.
Referring also to fig. 2, at block 705, the audio processing device 202 may determine a first model configured to recognize a wake word based on a set of training utterances. At block 710, the audio processing device 202 may analyze the set of training utterances using the first model to determine a second model that includes an indication of a likelihood ratio for each training utterance over time. At block 715, the audio processing device 202 may determine likelihood ratios over time for the detected utterances, and at block 720, a distance between the likelihood ratio over time for each training utterance and the likelihood ratio over time for the detected utterance may be determined to generate a set of distances. At block 725, the audio processing device 202 may determine whether the detected utterance corresponds to a wake word based at least in part on a likelihood ratio of the detected utterance and a minimum distance among the set of distances.
Fig. 8 is a flow diagram of a method 800 for detecting a wake word according to some embodiments. Referring also to fig. 2, at block 805, the audio processing device 202 may determine a first model configured to recognize a wake word based on a set of training utterances. At block 810, the audio processing device 202 may analyze the set of training utterances using the first model to determine a second model that includes a training state sequence for each of the set of training utterances, where each training state sequence indicates a possible state for each time interval of the corresponding training utterance. The second model may further include an indication of a likelihood ratio over time for each training utterance. At block 815, the audio processing device 202 may determine a state sequence of the detected utterance that indicates possible states for each time interval of the detected utterance. The audio processing device 202 may further determine a likelihood ratio over time of the detected utterance. At block 820, the audio processing device 202 may determine a distance between each training state sequence and the state sequence of the detected utterance to generate a first set of distances. The audio processing device 202 may further determine a distance between the likelihood ratio over time of each training utterance and the likelihood ratio over time of the detected utterance to generate a second set of distances. At block 825, the audio processing device 202 may determine whether the detected utterance corresponds to the wake word based at least in part on the likelihood ratio of the detected utterance, the minimum distance (Dqmin_u) among the first set of distances, and the minimum distance (Dmin_u) among the second set of distances.
Fig. 9 shows an embodiment of a core architecture 900 of a PSoC® processing device, such as that used in the PSoC® family of products provided by Cypress Semiconductor Corporation (San Jose, California). In one embodiment, core architecture 900 includes a microcontroller 1102. The microcontroller 1102 includes a CPU (central processing unit) core 1104 (which may correspond to the processing device 130 of fig. 1), a flash program storage 1106, a DOC (on-chip debug) 1108, a prefetch buffer 1110, a dedicated SRAM (static random access memory) 1112, and special function registers 1114. In an embodiment, the DOC 1108, prefetch buffer 1110, dedicated SRAM 1112, and special function registers 1114 are coupled to the CPU core 1104 (e.g., CPU core 1006), while the flash program storage 1106 is coupled to the prefetch buffer 1110.
The core architecture 1100 may also include a CHub (core hub) 1116 including a bridge 1118 and a DMA controller 1120 coupled to the microcontroller 1102 via a bus 1122. The CHub 1116 may provide the primary data and control interface between: microcontroller 1102 and its peripherals (e.g., peripherals) and memory; and a programmable core 1124. DMA controller 1120 may be programmed to transfer data between system elements without burdening CPU core 1104. In various embodiments, each of these subcomponents of the microcontroller 1102 and the CHub 1116 may differ from each choice or type of CPU core 1104. The CHub 1116 may also be coupled to a shared SRAM 1126 and an SPC (system performance controller) 1128. The dedicated SRAM 1112 is independent of the shared SRAM 1126 accessed by the microcontroller 1102 through the bridge 1118. The CPU core 1104 accesses the dedicated SRAM 1112 without passing through the bridge 1118, thus allowing local register and RAM accesses to occur concurrently with DMA accesses to the shared SRAM 1126. Although labeled herein as SRAM, in various other embodiments, these memory modules may be any suitable type of various (volatile or non-volatile) memories or data storage modules.
In various embodiments, programmable core 1124 may include various combinations of subcomponents (not shown), including, but not limited to, digital logic arrays, digital peripherals, analog processing channels, globally routed analog peripherals, DMA controllers, SRAM and other suitable types of data storage, IO ports, and other suitable types of subcomponents. In one embodiment, programmable core 1124 includes GPIO (general purpose IO) and EMIF (extended memory interface) blocks 1130 (a mechanism for providing external off-chip access to extended microcontroller 1102), programmable digital blocks 1132, programmable analog blocks 1134, and special function blocks 1136, all configured to implement one or more of the subcomponent functions. In various embodiments, special function blocks 1136 may include dedicated (non-programmable) function blocks and/or include one or more interfaces to dedicated function blocks, such as USB, crystal oscillator driver, JTAG, and the like.
The programmable digital block 1132 may include a digital logic array that includes an array of digital logic blocks and associated routing. In one embodiment, the digital block architecture consists of UDBs (universal digital blocks). For example, each UDB may include an ALU along with CPLD functionality.
In various embodiments, one or more UDBs of the programmable digital block 1132 may be configured to perform various digital functions, including, but not limited to, one or more of the following functions: a basic I2C slave device; I2C master; an SPI master or slave; a multi-wire (e.g., 3-wire) SPI master or slave (e.g., MISO/MOSI multiplexed on a single pin); timers and counters (e.g., a pair of 8-bit timers or counters, a 16-bit timer or counter, an 8-bit capture timer, etc.); PWM (e.g., a pair of 8-bit PWM, one 16-bit PWM, one 8-bit dead-zone PWM, etc.); a level sensitive I/O interrupt generator; quadrature encoder, UART (e.g., half duplex); a delay line; and any other suitable type or combination of digital functions that may be implemented in multiple UDBs.
In other embodiments, a set of two or more UDBs may be used to implement additional functionality. For purposes of illustration only and not limitation, multiple UDBs may be used to implement the following functions: an I2C slave device that supports hardware address detection and the ability to process complete transactions without CPU core (e.g., CPU core 1104) intervention and helps prevent forced clock stretching on any bit in the data stream; I2C multi-master device, which may include slave options in a single block; PRS or CRC of arbitrary length (up to 32 bits); SDIO; SGPIO; a digital correlator (e.g., with up to 32 bits, with 4x oversampling and supporting a configurable threshold); an LINbus interface; a delta-sigma modulator (e.g., for a class D audio DAC with a differential output pair); I2S (stereo); LCD drive control (e.g., UDB may be used to implement timing control of LCD drive blocks and provide display RAM addressing); full duplex UARTs (e.g., 7, 8, or 9 bits with 1 or 2 stop bits and parity and support RTS/CTS); IRDA (transmit or receive); capture timers (e.g., 16-bit or the like); dead-band PWM (e.g., 16-bit or the like); SMbus (including formatting SMbus packets with CRC in software); a brushless motor driver (e.g., to support 6/12 steps of commutation); automatic BAUD rate detection and generation (e.g., automatically determining a BAUD rate from a standard rate of 1200 to 115200BAUD, and generating the required clock to generate the BAUD rate after detection); and any other suitable type or combination of digital functions that may be implemented in multiple UDBs.
Programmable analog block 1134 may include analog resources including, but not limited to, comparators, mixers, PGA (programmable gain amplifier), TIA (transimpedance amplifier), ADC (analog-to-digital converter), DAC (digital-to-analog converter), voltage references, current sources, sample and hold circuits, and any other suitable type of analog resource. Programmable analog block 1134 may support a variety of analog functions including, but not limited to, analog routing, LCD drive IO support, capacitance sensing, voltage measurement, motor control, current to voltage conversion, voltage to frequency conversion, differential amplification, light measurement, inductive position monitoring, filtering, voice coil drive, magnetic card reading, acoustic doppler measurement, echo ranging, modem transmit and receive encoding, or any other suitable type of analog function.
The embodiments described herein may be used in various designs of mutual capacitance sensing systems, self capacitance sensing systems, or a combination of both. In one embodiment, a capacitive sensing system detects multiple sensing elements activated in an array, and can analyze signal patterns on adjacent sensing elements to separate noise from the actual signal. As will be appreciated by one of ordinary skill in the art having the benefit of this disclosure, the embodiments described herein are not tied to a particular capacitive sensing solution, and may also be used with other sensing solutions, including optical sensing solutions.
In the above description, numerous details are set forth. However, it will be apparent to one having ordinary skill in the art having had the benefit of the present disclosure that the embodiments of the present disclosure may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the description.
Some portions of the detailed descriptions are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as "determining," "detecting," "comparing," "resetting," "adding," "computing," or the like, refer to the action and processes of a computing system, or similar electronic computing device, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the computing system's registers and memories into other data similarly represented as physical quantities within the computing system memories or registers or other such information storage, transmission or display devices.
The words "example" or "exemplary" are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "exemplary" or "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word "example" or "exemplary" is intended to present concepts in a concrete fashion. As used in this application, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X comprises a or B" is intended to mean any of the natural inclusive permutations. That is, if X comprises A; x comprises B; or X includes both a and B, then "X includes a or B" is satisfied under any of the above examples. In addition, the articles "a" and "an" as used in this application and the appended claims should generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form. Furthermore, the use of the terms "an embodiment" or "one embodiment" or "an implementation" or "one implementation" throughout is not intended to refer to the same embodiment or implementation (unless so described).
Embodiments described herein may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory computer-readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, flash memory, or any type of media suitable for storing electronic instructions. The term "computer-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "computer-readable medium" shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methodologies of the present embodiments. The term "computer-readable storage medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, magnetic media, and any medium that is capable of storing a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methodologies of the present embodiments.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description above. In addition, the present embodiments are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the embodiments as described herein.
The above description sets forth numerous specific details, such as examples of specific systems, components, methods, etc., in order to provide a better understanding of several embodiments of the present disclosure. It will be apparent, however, to one skilled in the art that at least some embodiments of the present disclosure may be practiced without these specific details. In other instances, well-known components or methods are not described in detail, or are presented in simple block diagram format, in order to avoid unnecessarily obscuring the present embodiments. Accordingly, the specific details set forth above are merely exemplary. Particular implementations may vary from these exemplary details and still be considered within the scope of the present embodiments.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the embodiments should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (23)

1. A method, comprising:
determining a first model configured to recognize a phrase based on a set of training utterances;
analyzing the set of training utterances using the first model to determine a second model, the second model including parameters for each of the set of training utterances; and
determining whether a detected utterance corresponds to the phrase based on a concatenation of the first model and the second model.
2. The method of claim 1, wherein the parameters comprise a training state sequence such that the second model comprises a training state sequence for each of the set of training utterances, and wherein each training state sequence indicates a possible state for each time interval of a corresponding training utterance.
3. The method of claim 2, wherein determining whether the detected utterance corresponds to the phrase comprises:
determining a sequence of states of the detected utterance, the sequence of states indicating possible states for each time interval of the detected utterance; and
determining a distance between each training state sequence and the state sequence of the detected utterance to generate a set of distances.
4. The method of claim 3, wherein determining whether the detected utterance corresponds to the phrase further comprises:
determining a likelihood ratio of the detected utterance using the first model; and
determining whether the detected utterance corresponds to the phrase based at least in part on the likelihood ratio of the detected utterance and a minimum distance among the set of distances.
5. The method of claim 4, wherein determining that the minimum distance among the set of distances is below a threshold indicates that the detected utterance corresponds to the phrase.
6. The method of claim 1, wherein the parameters comprise likelihood ratios over time, such that the second model comprises an indication of likelihood ratios over time for each of the set of training utterances.
7. The method of claim 6, wherein determining whether the detected utterance corresponds to the phrase comprises:
determining a likelihood ratio over time of the detected utterance; and
determining a distance between a likelihood ratio over time of each training utterance and a likelihood ratio over time of the detected utterance to generate a set of distances.
8. The method of claim 7, wherein determining whether the detected utterance corresponds to the phrase further comprises:
determining a likelihood ratio of the detected utterance using the first model; and
determining whether the detected utterance corresponds to the phrase based at least in part on the likelihood ratio of the detected utterance and a minimum distance among the set of distances.
9. A system, comprising:
a memory; and
a processing device operatively coupled to the memory, the processing device configured to:
determining a first model configured to recognize a phrase based on a set of training utterances;
analyzing the set of training utterances using the first model to determine a second model, the second model comprising a training state sequence for each of the set of training utterances, and wherein each training state sequence indicates a possible state for each time interval of a corresponding training utterance; and
determining whether a detected utterance corresponds to the phrase based on a concatenation of the first model and the second model.
10. The system of claim 9, wherein to determine whether the detected utterance corresponds to the phrase, the processing device is configured to:
determining a sequence of states of the detected utterance, the sequence of states indicating possible states for each time interval of the detected utterance; and
determining a distance between each training state sequence and the state sequence of the detected utterance to generate a set of distances.
11. The system of claim 10, wherein to determine whether the detected utterance corresponds to the phrase, the processing device is further configured to:
determining a likelihood ratio of the detected utterance using the first model; and
determining whether the detected utterance corresponds to the phrase based at least in part on the likelihood ratio of the detected utterance and a minimum distance among the set of distances.
12. The system of claim 11, wherein the processing device is further configured to:
comparing the determined minimum distance among the set of distances to a threshold; and
indicating that the detected utterance corresponds to the phrase in response to determining that the minimum distance among the set of distances is below the threshold.
13. The system of claim 9, wherein the processing device is further configured to:
for each training utterance, determining a likelihood ratio of the training utterance over time, wherein the second model further comprises an indication of the likelihood ratio of each training utterance over time.
14. The system of claim 13, wherein to determine whether the detected utterance corresponds to the phrase, the processing device is configured to:
determining a likelihood ratio over time of the detected utterance; and
determining a distance between a likelihood ratio over time of each training utterance and a likelihood ratio over time of the detected utterance to generate a set of distances.
15. The system of claim 14, wherein to determine whether the detected utterance corresponds to the phrase, the processing device is further configured to:
determining a likelihood ratio of the detected utterance using the first model; and
determining whether the detected utterance corresponds to the phrase based on the likelihood ratio of the detected utterance and a minimum distance among the set of distances.
16. A non-transitory computer readable medium having stored thereon instructions that, when executed by a processing device, cause the processing device to:
determining a first model configured to recognize a phrase based on a set of training utterances;
analyzing the set of training utterances using the first model to determine a second model, the second model including parameters for each of the set of training utterances; and
determining whether a detected utterance corresponds to the phrase based on a concatenation of the first model and the second model.
17. The non-transitory computer-readable medium of claim 16, wherein the parameters comprise a training state sequence such that the second model comprises a training state sequence for each of the set of training utterances, and wherein each training state sequence indicates a possible state for each time interval of a corresponding training utterance.
18. The non-transitory computer-readable medium of claim 17, wherein to determine whether the detected utterance corresponds to the phrase, the processing device is to:
determining a sequence of states of the detected utterance, the sequence of states indicating possible states for each time interval of the detected utterance; and
determining a distance between each training state sequence and the state sequence of the detected utterance to generate a set of distances.
19. The non-transitory computer-readable medium of claim 18, wherein to determine whether the detected utterance corresponds to the phrase, the processing device is further to:
determining a likelihood ratio of the detected utterance using the first model; and
determining whether the detected utterance corresponds to the phrase based at least in part on the likelihood ratio of the detected utterance and a minimum distance among the set of distances.
20. The non-transitory computer-readable medium of claim 19, wherein the processing device determines that the detected utterance corresponds to the phrase in response to determining that the minimum distance among the set of distances is below a threshold.
21. The non-transitory computer-readable medium of claim 20, wherein the parameters include likelihood ratios over time, such that the second model includes an indication of likelihood ratios over time for each of the set of training utterances.
22. The non-transitory computer-readable medium of claim 21, wherein to determine whether the detected utterance corresponds to the phrase, the processing device is to:
determining a likelihood ratio over time of the detected utterance; and
determining a distance between a likelihood ratio over time of each training utterance and a likelihood ratio over time of the detected utterance to generate a set of distances.
23. The non-transitory computer-readable medium of claim 22, wherein to determine whether the detected utterance corresponds to the phrase, the processing device is further to:
determining a likelihood ratio of the detected utterance using the first model; and
determining whether the detected utterance corresponds to the phrase based at least in part on the likelihood ratio of the detected utterance and a minimum distance among the set of distances.
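
To give a concrete picture of the two-stage flow recited in claims 1-8, the following Python sketch shows one possible realization; it is illustrative only and is not asserted to be the claimed implementation. The first model is abstracted behind an assumed interface (the names log_emissions, log_trans, and likelihood_ratio are hypothetical and do not appear in this disclosure), the Viterbi decode is only one way of determining a state sequence for each time interval, the dynamic-time-warping (DTW) distance is only one way of determining a distance between state sequences, and the threshold values are placeholders.

import numpy as np

# Hypothetical interface assumed for the first (phrase) model; the names below
# stand in for whatever model (e.g., an HMM or neural keyword model) is used:
#   first_model.log_emissions(features) -> (T, S) per-frame log-likelihoods
#   first_model.log_trans               -> (S, S) log transition matrix
#   first_model.likelihood_ratio(features) -> scalar phrase-vs-background score

def viterbi_state_sequence(log_emissions, log_trans):
    """Most likely state index for each time interval (frame), assuming a
    uniform initial-state prior; one way to obtain a state sequence."""
    T, S = log_emissions.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_emissions[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # (prev state, state)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(S)] + log_emissions[t]
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states

def dtw_distance(seq_a, seq_b):
    """Length-normalized dynamic-time-warping distance between two state
    index sequences, using a simple 0/1 per-frame mismatch cost."""
    n, m = len(seq_a), len(seq_b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m] / (n + m)

def build_second_model(training_features, first_model):
    """Second stage: analyze each training utterance with the first model and
    keep its training state sequence as that utterance's parameters."""
    return [viterbi_state_sequence(first_model.log_emissions(x),
                                   first_model.log_trans)
            for x in training_features]

def detect(detected_features, first_model, second_model,
           lr_threshold=0.0, dist_threshold=0.35):
    """Concatenation of the two models: accept only if the detected utterance
    passes the first model's likelihood-ratio test and its state sequence is
    within a threshold distance of at least one training state sequence.
    The threshold values here are arbitrary placeholders."""
    log_em = first_model.log_emissions(detected_features)
    states = viterbi_state_sequence(log_em, first_model.log_trans)
    lr = first_model.likelihood_ratio(detected_features)
    min_dist = min(dtw_distance(states, ref) for ref in second_model)
    return lr > lr_threshold and min_dist < dist_threshold

Claims 6-8 (and their system and computer-readable-medium counterparts) describe the variant in which the per-utterance parameters of the second model are likelihood-ratio trajectories over time rather than state sequences; the same minimum-distance test applies, with the 0/1 state-mismatch cost in the sketch replaced by, for example, the absolute difference between likelihood-ratio values at aligned frames.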

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063020984P 2020-05-06 2020-05-06
US63/020,984 2020-05-06
US17/032,653 US11783818B2 (en) 2020-05-06 2020-09-25 Two stage user customizable wake word detection
US17/032,653 2020-09-25

Publications (1)

Publication Number Publication Date
CN113628613A (en) 2021-11-09

Family

ID=78232057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110467675.7A Pending CN113628613A (en) 2020-05-06 2021-04-28 Two-stage user customizable wake word detection

Country Status (2)

Country Link
CN (1) CN113628613A (en)
DE (1) DE102021111594A1 (en)

Also Published As

Publication number Publication date
DE102021111594A1 (en) 2021-11-11
DE102021111594A9 (en) 2022-01-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination