WO2017111634A1 - Automatic tuning of speech recognition parameters - Google Patents

Automatic tuning of speech recognition parameters

Info

Publication number
WO2017111634A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio segment
parameters
clean
dirty
optimization
Prior art date
Application number
PCT/PL2015/050074
Other languages
French (fr)
Inventor
Piotr CHLEBEK
Lukasz Kurylo
Michal BORWANSKI
Przemyslaw MAZIEWSKI
Roksana KOSTYK
Tomasz K. BURNY
Karol J. DUZINKIEWICZ
Sylwia BURACZEWSKA
Original Assignee
Intel Corporation
Priority date
Filing date
Publication date
Application filed by Intel Corporation
Priority to PCT/PL2015/050074
Publication of WO2017111634A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility

Definitions

  • the tuner 120 is arranged to score a portion of the first result with a corresponding portion of the second result using clean-diff.
  • Clean-diff compares the energy in the two signals. The more similar the energy output of each result, the better the selected pre-processing parameters performed.
  • Clean-diff accounts for the way in which the ASR system will use the audio segments. The following description provides additional details on the techniques involved in computing the clean-diff metric.
  • clean-diff includes time-windowing (e.g., framing) both the clean audio segment and the dirty audio segment.
  • the time-windows match a corresponding parameter for the ASR system being tuned.
  • Hann windows may be used.
  • the windows are sixteen milliseconds in length.
  • the windows overlap, but not completely, with neighboring windows.
  • the windows have an offset (e.g., ten milliseconds) between the starting points of neighboring windows.
  • clean-diff includes computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment. That is, for a given window, a power spectrum for the clean audio segment is computed and a second power spectrum for the dirty audio segment is computed.
  • clean-diff includes dividing the time-window by a frequency filter into a plurality of portions.
  • the frequency filter is a Mel filter. An illustration for such a division is provided in FIG. 4.
  • clean-diff includes summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals, one for each.
  • summing the energy includes weighting higher frequencies more than lower frequencies.
  • the weighting is a factor of twenty one decibels divided by eight kilohertz.
  • clean-diff includes converting the two energy totals to the decibel scale.
  • clean-diff includes measuring a difference between the two energy totals.
  • the result of clean-diff is how different the two pre-processed audio segments are. The more similar the results, the smaller the difference, and the more successful the set of pre-processing parameters selected for this iteration. A sketch of this computation is given below.
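The steps above map directly onto a short program. The following is a minimal Python/NumPy sketch of the per-frame energy componentization, not code from the patent: the 16 ms frames, 10 ms step, Hann window, Mel filter bank, decibel conversion, and 21 dB over 8 kHz weighting come from this description, while the filter count, the triangular filter shape, and the function name `mel_energies_db` are illustrative assumptions.

```python
import numpy as np

def mel_energies_db(x, sr=16000, frame_len=256, step=160, n_mels=23,
                    hf_comp_db=21.0, hf_ref_hz=8000.0):
    """Per-frame, per-Mel-filter energies in dB for one signal.
    frame_len=256 and step=160 samples give the 16 ms frames and
    10 ms offset mentioned above at 16 kHz; n_mels and the
    triangular filter shape are illustrative assumptions."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Triangular Mel filter bank over the one-sided power spectrum.
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((frame_len + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # Weight higher frequencies more: ~21 dB of gain at 8 kHz,
    # ramping linearly with filter center frequency (an assumption).
    centers = pts[1:-1]
    comp_db = hf_comp_db * centers / hf_ref_hz
    win = np.hanning(frame_len)
    rows = []
    for start in range(0, len(x) - frame_len + 1, step):
        spec = np.abs(np.fft.rfft(x[start:start + frame_len] * win)) ** 2
        e = fbank @ spec                       # energy per Mel filter
        rows.append(10.0 * np.log10(np.maximum(e, 1e-12)) + comp_db)
    return np.array(rows)                      # shape: (n_frames, n_mels)
```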
  • the tuner 120 is arranged to provide the set of parameters, for this iteration, as output when the optimization threshold (discussed above) is reached.
  • the output may be directly applied to the conditioning stack for the device 105. Such output may be useful, for example, to an end-user further adjusting pre-processing parameters.
  • the output is provided to a database for similar devices to the device 105 to use. In this example, the output may be used, for example, in the manufacturing or configuration process of an original equipment manufacturer to produce a higher quality product in less time than was previously achievable.
  • FIG. 2 illustrates an example of a flow 200 for automatic tuning of speech recognition parameters, according to an embodiment.
  • the actions of the flow 200 are performed on computer hardware such as that described above with respect to FIG. 1 or below with respect to FIG. 6 (e.g., circuit sets), and the data elements are stored in machine readable media such as described below with respect to FIG. 6.
  • the flow 200 is searching for optimal tuning pre-processing parameters per device type/model.
  • this flow 200 may be performed at least once per device type/model.
  • the flow 200 may be repeated depending on device orientation (e.g., setup 205), which may be detected during device usage when different pre-processing parameters may be applied.
  • a number of scenarios may be considered.
  • pre-processing parameters are selected for tuning.
  • parameters may be grouped, and tuning performed on the groups.
  • the first group of parameters may include the pre-processing parameters judged most important. Experimental results suggest that the number of parameters per group should be in a range of three to seven. However, it is possible to tune more parameters (e.g., nine or higher).
  • clean speech is recorded on the targeted device (e.g., action 210).
  • the clean speech may be recreated via an artificial mouth or a high-quality loudspeaker. Good results have been achieved with six minutes of total recording. In these results, twenty different speakers were used, with 0.9-second intervals (e.g., no speaking) between utterances.
  • noisy speech is also recorded (action 215).
  • the speaking portion is exactly the same as that present in the clean recording.
  • many different noises may be recreated with high variability of level, directivity, type, environment, etc. All these noises should, however, be in a reasonable range of levels. A few seconds of the noisy recording may be clean (without any recreated noise).
  • the system may use a linear one-second chirp from 500 Hz to 3.5 kHz.
  • the clean recording is pre-processed with a pre-processing algorithm with default parameters (action 225). The initial pre-processing parameters are chosen (action 230).
  • the noisy recording is pre-processed (action 235) and compared to the pre-processed clean recording using clean-diff (action 240).
  • the parameters are changed (action 245) and the noisy recording is pre-processed and compared again to the clean recording (loop to action 235) using the changed parameters.
  • This process may be repeated several times for optimization (e.g., producing a clean-diff metric of the lowest possible value within given constraints).
  • Optimization techniques that have been used successfully include amoeba simplex; however, other many-dimensional nonlinear optimization methods may be used, e.g., Monte Carlo.
  • actions 235, 240, and 245 form an inner loop of the flow 200
  • an outer loop may be used to change the initial parameters (action 250) applied to the clean recording (action 230). In some examples, starting from different beginning pre-processing parameters may significantly improve the flow 200.
  • WER-based evaluation can be executed over the few best pre-processing parameter sets found (finish 255). Such analysis may provide a more detailed test for the few parameter sets left, possibly increasing accuracy in the final result. A sketch of the overall flow is given below.
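Under these assumptions, flow 200 might be organized as follows. This is a sketch, not the patent's code: `pre_process` stands in for the device's conditioning chain, `clean_diff` for the metric over the two pre-processed signals (sketched with FIG. 5 below), and the use of SciPy's Nelder-Mead solver as the "amoeba" inner loop is an assumption consistent with the optimization methods named above.

```python
import numpy as np
from scipy.optimize import minimize

def tune(clean, noisy, initial_param_sets, default_params,
         pre_process, clean_diff, max_iter=200):
    """Sketch of flow 200: outer loop over starting points (action 250),
    inner amoeba-simplex search over parameters (actions 235-245)."""
    clean_ref = pre_process(clean, default_params)         # action 225

    def score(params):                                     # actions 235-240
        return clean_diff(clean_ref, pre_process(noisy, params))

    candidates = []
    for p0 in initial_param_sets:                          # outer loop (250)
        res = minimize(score, np.asarray(p0, dtype=float),
                       method='Nelder-Mead',               # inner loop
                       options={'maxiter': max_iter})
        candidates.append((res.fun, res.x))
    candidates.sort(key=lambda t: t[0])
    return candidates    # the best few proceed to WER evaluation (255)
```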
  • FIG. 3 illustrates an example of a method 300 for automatic tuning of speech recognition parameters, according to an embodiment.
  • the operations of the method 300 are performed on computer hardware such as that described above with respect to FIG. 1 or below with respect to FIG. 6 (e.g., circuit sets).
  • a clean audio segment is obtained (e.g., retrieved, received, etc.).
  • the clean audio segment is noiseless.
  • a dirty audio segment is obtained.
  • the dirty audio segment is the clean audio segment with added noise.
  • pre-processing parameters are iteratively optimized.
  • the individual operations involved with each iteration include operations 320-340.
  • a set of parameters are selected.
  • selecting the set of parameters for a given iteration follows an optimization definition of a many-dimensional nonlinear optimization.
  • the optimization definition is an amoeba simplex optimization.
  • the optimization definition is a Monte Carlo optimization.
  • the clean audio segment is pre-processed with the set of parameters to produce a first result.
  • the dirty audio segment is pre-processed with the set of parameters to produce a second result.
  • pre-processing the clean audio segment (operation 325) and pre-processing the dirty audio segment include synchronizing the clean audio segment and the dirty audio segment.
  • synchronizing the clean audio segment and the dirty audio segment includes using chirp detection to narrow a region of interest.
  • the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz.
  • synchronizing the clean audio segment and the dirty audio segment includes using a phase reversal synchronization.
  • a portion of the first result is scored with a corresponding portion of the second result using clean-diff.
  • using clean-diff includes time-windowing both the clean audio segment and the dirty audio segment.
  • Using clean-diff also includes computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment.
  • Using clean-diff also includes dividing the time-window by a frequency filter into a plurality of portions.
  • the frequency filter is a Mel filter.
  • the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
  • Clean-diff also includes summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals.
  • summing the energy includes weighting higher frequencies more.
  • the weighting is a factor of twenty one decibels divided by eight kilohertz. Clean-diff may then convert the two energy totals to the decibel scale and measure a difference between the two energy totals.
  • FIGS. 4 and 5 illustrate the clean-diff technique.
  • FIG. 4 illustrates an example of a sound wave and energy componentization 400 for clean-diff, according to an embodiment.
  • FIG. 5 illustrates an example of a method 500 for performing clean-diff, according to an embodiment. The operations of the method 500 are performed on computer hardware, such as that described above with respect to FIG. 1 or below with respect to FIG. 6 (e.g., circuit sets).
  • Clean-diff provides a difference between a clean and a noisy audio segment. As illustrated in FIG. 4, two mono 16 kHz speech signals, clean and noisy, are compared. Both signals have the same length and are synchronized with an accuracy of up to a few samples (generally, the higher the accuracy, the better). For higher sampling rate signals, the synchronization should be done before down-sampling. In an example, a cross-correlation based technique may be used for time synchronization, as in the sketch below.
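A minimal cross-correlation alignment sketch (the patent names the technique but gives no implementation; the function name and the trimming policy are assumptions):

```python
import numpy as np
from scipy.signal import correlate

def align(clean, noisy):
    """Estimate the lag of `noisy` relative to `clean` by
    cross-correlation, then trim both to their common aligned span."""
    lag = int(np.argmax(correlate(noisy, clean, mode='full'))) - (len(clean) - 1)
    if lag >= 0:
        n = min(len(clean), len(noisy) - lag)
        return clean[:n], noisy[lag:lag + n]
    n = min(len(noisy), len(clean) + lag)
    return clean[-lag:-lag + n], noisy[:n]
```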
  • both signals are pre-processed once.
  • the clean signal is pre-processed with default pre-processing parameters. These default preprocessing parameters should provide reasonable WER improvement over noisy data and shouldn't increase WER over clean data. During tuning, several different pre-processing parameters are selected for the noisy signal.
  • Both signals, clean and noisy, are divided into short overlapping frames (e.g., 16 ms), with a small offset (step) between the frames (e.g., 10 ms). These are illustrated as vertical lines in FIG. 4.
  • the energy in decibels (dB) is computed according to the following (illustrated in FIG. 4):
  • a frame is windowed with a Hann window (other types of windowing can also be used) (vertical lines in FIG. 4).
  • a Mel filter bank is applied (horizontal lines in FIG. 4), although other filter distributions (e.g., Bark) may also be used.
  • the energy is summed in each filter.
  • High-frequency energies are adjusted (e.g., compensated for using a factor). For example, a factor of 21 dB over 8 kHz (or about 8 kHz) may be used. This compensation factor results in no compensation for the first Mel filter (the filter with the lowest frequencies) and 21 dB of compensation for the last Mel filter (the filter with the highest frequencies). In an example, the compensation is done according to pseudo-code; a reconstruction from this description follows.
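The referenced pseudo-code does not survive in this text. A hedged reconstruction from the prose alone, assuming the 21 dB ramp runs linearly over the filter center frequency (ramping over filter index would be an equally consistent reading):

```python
# Reconstruction from the description, not the original pseudo-code.
COMP_DB = 21.0   # compensation factor given in the text, in dB
REF_HZ = 8000.0  # reference frequency given in the text

def compensate(energies_db, center_freqs_hz):
    """Add ~0 dB to the lowest-frequency Mel filter and ~21 dB to the
    highest, ramping linearly with filter center frequency."""
    return [e + COMP_DB * f / REF_HZ
            for e, f in zip(energies_db, center_freqs_hz)]
```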
  • computed energies in dB are normalized to align the highest values between the clean and noisy pre-processed signals. This may be done based on a fragment (e.g., audio segment) where both signals are clean (e.g., the noisy signal does not have added noise). Generally, to facilitate better results, a few seconds of the noisy signal should be clean.
  • The top energy, in decibels, is computed. This value may be the maximum filter energy over all filters and all frames. For each frame, a frame difference between the clean and noisy signals is computed. Then, the metric value is calculated as an average of all frame differences.
  • the method 500 proceeds as follows: the resultant value is initialized to zero (operation 505). If there are more frames to process (decision 510), the clean (A) and noisy (B) energy values are initialized to zero (operation 515) for a frame. If there are more filters to process (decision 520), the measured energy in the component, defined by the frame and filter, is added to the respective energy values: clean energy being added to A (operation 525) and noisy energy to B (operation 530).
  • the frame difference is determined by subtracting the clean energy value from the noisy energy value (operation 535), or B - A. The magnitude of this difference is then added to the global total for the resultant value (e.g., clean_diff below) (operation 540).
  • the method 500 then proceeds to further filters in the given frame (decision 520) and, if there are no further filters, to further frames (decision 510). When the frames have all been processed, the method averages the resultant value across the components (e.g., filter-frame combinations) and returns the result (operation 545).
  • In the pseudo-code illustrating these procedures, the value ENERGY_RANGE may be 45, the usable energy range in decibels. A sketch follows.
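The pseudo-code itself is not reproduced in this text. The sketch below implements the loop of FIG. 5 as described, with two stated assumptions: energies arrive as (frames x filters) arrays in dB, already normalized as described above, and the "usable energy range" is read as clipping components more than ENERGY_RANGE below the top energy. The text is also ambiguous on whether the difference is taken per filter or per frame of summed filters; per component is used here.

```python
import numpy as np

ENERGY_RANGE = 45.0  # usable energy range in dB, per the text

def clean_diff(clean_db, noisy_db):
    """clean_db, noisy_db: (n_frames, n_filters) energies in dB,
    normalized so the top energies of both signals align."""
    top = max(clean_db.max(), noisy_db.max())  # max over all filters/frames
    floor = top - ENERGY_RANGE                 # clip below the usable range
    total, count = 0.0, 0
    n_frames, n_filters = clean_db.shape
    for frame in range(n_frames):              # decision 510
        for filt in range(n_filters):          # decision 520
            a = max(clean_db[frame, filt], floor)  # clean energy (A)
            b = max(noisy_db[frame, filt], floor)  # noisy energy (B)
            total += abs(b - a)                # operations 535-540: |B - A|
            count += 1
    return total / count                       # operation 545: average
```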
  • FIG. 6 illustrates a block diagram of an example machine 600 upon which any one or more of the techniques (e.g., methodologies) discussed herein may perform.
  • the machine 600 may operate as a standalone device or may be connected (e.g., networked) to other machines.
  • the machine 600 may operate in the capacity of a server machine, a client machine, or both in server-client network environments.
  • the machine 600 may act as a peer machine in peer-to-peer (P2P) (or other distributed) network environment.
  • the machine 600 may be a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a mobile telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein, such as cloud computing, software as a service (SaaS), or other computer cluster configurations.
  • Examples, as described herein, may include, or may operate by, logic or a number of components, or mechanisms.
  • Circuit sets are a collection of circuits implemented in tangible entities that include hardware (e.g., simple circuits, gates, logic, etc.). Circuit set membership may be flexible over time and underlying hardware variability. Circuit sets include members that may, alone or in combination, perform specified operations when operating. In an example, hardware of the circuit set may be immutably designed to carry out a specific operation (e.g., hardwired).
  • the hardware of the circuit set may include variably connected physical components (e.g., execution units, transistors, simple circuits, etc.) including a computer readable medium physically modified (e.g., magnetically, electrically, moveable placement of invariant massed particles, etc.) to encode instructions of the specific operation.
  • the instructions enable embedded hardware (e.g., the execution units or a loading mechanism) to create members of the circuit set in hardware via the variable connections to carry out portions of the specific operation when in operation.
  • the computer readable medium is communicatively coupled to the other components of the circuit set member when the device is operating.
  • any of the physical components may be used in more than one member of more than one circuit set.
  • execution units may be used in a first circuit of a first circuit set at one point in time and reused by a second circuit in the first circuit set, or by a third circuit in a second circuit set at a different time.
  • Machine 600 may include a hardware processor 602 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a hardware processor core, or any combination thereof), a main memory 604 and a static memory 606, some or all of which may communicate with each other via an interlink (e.g., bus) 608.
  • the machine 600 may further include a display unit 610, an alphanumeric input device 612 (e.g., a keyboard), and a user interface (UI) navigation device 614 (e.g., a mouse).
  • the display unit 610, input device 612 and UI navigation device 614 may be a touch screen display.
  • the machine 600 may additionally include a storage device (e.g., drive unit) 616, a signal generation device 618 (e.g., a speaker), a network interface device 620, and one or more sensors 621, such as a global positioning system (GPS) sensor, compass, accelerometer, or other sensor.
  • the machine 600 may include an output controller 628, such as a serial (e.g., universal serial bus (USB)), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC), etc.) connection to communicate with or control one or more peripheral devices (e.g., a printer, card reader, etc.).
  • the storage device 616 may include a machine readable medium 622 on which is stored one or more sets of data structures or instructions 624 (e.g., software) embodying or utilized by any one or more of the techniques or functions described herein.
  • the instructions 624 may also reside, completely or at least partially, within the main memory 604, within static memory 606, or within the hardware processor 602 during execution thereof by the machine 600.
  • one or any combination of the hardware processor 602, the main memory 604, the static memory 606, or the storage device 616 may constitute machine readable media.
  • While the machine readable medium 622 is illustrated as a single medium, the term "machine readable medium" may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) configured to store the one or more instructions 624.
  • The term "machine readable medium" may include any medium that is capable of storing, encoding, or carrying instructions for execution by the machine 600 and that cause the machine 600 to perform any one or more of the techniques of the present disclosure, or that is capable of storing, encoding, or carrying data structures used by or associated with such instructions.
  • Non-limiting machine readable medium examples may include solid-state memories, and optical and magnetic media.
  • a massed machine readable medium comprises a machine readable medium with a plurality of particles having invariant (e.g., rest) mass. Accordingly, massed machine-readable media are not transitory propagating signals.
  • Specific examples of massed machine readable media may include: non-volatile memory, such as semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the instructions 624 may further be transmitted or received over a communications network 626 using a transmission medium via the network interface device 620 utilizing any one of a number of transfer protocols (e.g., frame relay, internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), hypertext transfer protocol (HTTP), etc.).
  • Example communication networks may include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone (POTS) networks, and wireless data networks (e.g., Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards known as Wi-Fi®, IEEE 802.16 family of standards known as WiMax®), IEEE 802.15.4 family of standards, peer-to-peer (P2P) networks, among others.
  • the network interface device 620 may include one or more physical jacks (e.g., Ethernet, coaxial, or phone jacks) or one or more antennas to connect to the communications network 626.
  • the network interface device 620 may include a plurality of antennas to wirelessly communicate using at least one of single-input multiple-output (SIMO), multiple-input multiple-output (MIMO), or multiple-input single-output (MISO) techniques.
  • The term "transmission medium" shall be taken to include any intangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine 600, and includes digital or analog communications signals or other intangible media to facilitate communication of such software.
  • Example 1 is a device for automatic tuning of speech recognition parameters, the device comprising: a storage device to store: a clean audio segment, the clean audio segment being noiseless; and a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and a tuner to optimize preprocessing parameters, the tuner to iteratively: select a set of parameters; preprocess the clean audio segment with the set of parameters to produce a first result; preprocess the dirty audio segment with the set of parameters to produce a second result; score a portion of the first result with a corresponding portion of the second result using clean-diff; and provide the set of parameters when an optimization threshold is reached.
  • Example 2 the subject matter of Example 1 optionally includes wherein to preprocess the clean audio segment and to preprocess the dirty audio segment includes the tuner to synchronize the clean audio segment and the dirty audio segment.
  • Example 3 the subject matter of Example 2 optionally includes wherein to synchronize the clean audio segment and the dirty audio segment includes the tuner to use chirp detection to narrow a region of interest.
  • Example 4 the subject matter of Example 3 optionally includes wherein the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz.
  • Example 5 the subject matter of any one or more of Examples 2-4 optionally include wherein to synchronize the clean audio segment and the dirty audio segment includes the tuner to use a phase reversal synchronization.
  • Example 6 the subject matter of any one or more of Examples 1-5 optionally include wherein the set of parameters and the optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
  • Example 7 the subject matter of Example 6 optionally includes wherein the optimization definition is an amoeba simplex optimization.
  • Example 8 the subject matter of any one or more of Examples 6-7 optionally includes wherein the optimization definition is a Monte Carlo optimization.
  • Example 9 the subject matter of any one or more of Examples 1-8 optionally include wherein to use clean-diff includes the tuner to: time-window both the clean audio segment and the dirty audio segment; compute a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment; divide the time-window by a frequency filter into a plurality of portions; sum the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals; convert the two energy totals to the decibel scale; and measure a difference between the two energy totals.
  • Example 10 the subject matter of Example 9 optionally includes wherein the frequency filter is a Mel filter.
  • Example 11 the subject matter of any one or more of Examples 9-10 optionally include wherein to sum the energy includes the tuner to weight higher frequencies more.
  • Example 12 the subject matter of Example 11 optionally includes wherein the weight is a factor of twenty one decibels divided by eight kilohertz.
  • Example 13 the subject matter of any one or more of Examples 9-12 optionally include wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
  • Example 14 is a system for automatic tuning of speech recognition parameters, the system comprising: means for obtaining a clean audio segment, the clean audio segment being noiseless; means for obtaining a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and means for optimizing preprocessing parameters by iteratively: selecting a set of parameters; preprocessing the clean audio segment with the set of parameters to produce a first result; preprocessing the dirty audio segment with the set of parameters to produce a second result; scoring a portion of the first result with a corresponding portion of the second result using clean-diff; and providing the set of parameters when an optimization threshold is reached.
  • Example 15 the subject matter of Example 14 optionally includes wherein preprocessing the clean audio segment and preprocessing the dirty audio segment includes means for synchronizing the clean audio segment and the dirty audio segment.
  • Example 16 the subject matter of Example 15 optionally includes wherein synchronizing the clean audio segment and the dirty audio segment includes means for using chirp detection to narrow a region of interest.
  • Example 17 the subject matter of Example 16 optionally includes wherein the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz.
  • Example 18 the subject matter of any one or more of Examples 15-17 optionally include wherein synchronizing the clean audio segment and the dirty audio segment includes means for using a phase reversal synchronization.
  • Example 19 the subject matter of any one or more of Examples 14-18 optionally include wherein the set of parameters and optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
  • Example 20 the subject matter of Example 19 optionally includes wherein the optimization definition is an amoeba simplex optimization.
  • Example 21 the subject matter of any one or more of Examples 19-20 optionally include wherein the optimization definition is a Monte Carlo optimization.
  • Example 22 the subject matter of any one or more of Examples 14-21 optionally include wherein using clean-diff includes: means for time-windowing both the clean audio segment and the dirty audio segment; means for computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment; means for dividing the time-window by a frequency filter into a plurality of portions; means for summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals; means for converting the two energy totals to the decibel scale; and means for measuring a difference between the two energy totals.
  • Example 23 the subject matter of Example 22 optionally includes wherein the frequency filter is a Mel filter.
  • Example 24 the subject matter of any one or more of Examples 22-23 optionally include wherein summing the energy includes means for weighting higher frequencies more.
  • Example 25 the subject matter of Example 24 optionally includes wherein the weighting is a factor of twenty one decibels divided by eight kilohertz.
  • Example 26 the subject matter of any one or more of Examples 22-25 optionally include wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
  • Example 27 is a method for automatic tuning of speech recognition parameters, the method comprising: obtaining a clean audio segment, the clean audio segment being noiseless; obtaining a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and optimizing preprocessing parameters by iteratively: selecting a set of parameters; preprocessing the clean audio segment with the set of parameters to produce a first result; preprocessing the dirty audio segment with the set of parameters to produce a second result; scoring a portion of the first result with a corresponding portion of the second result using clean-diff; and providing the set of parameters when an optimization threshold is reached.
  • Example 28 the subject matter of Example 27 optionally includes wherein preprocessing the clean audio segment and preprocessing the dirty audio segment includes synchronizing the clean audio segment and the dirty audio segment.
  • Example 29 the subject matter of Example 28 optionally includes wherein synchronizing the clean audio segment and the dirty audio segment includes using chirp detection to narrow a region of interest.
  • Example 30 the subject matter of Example 29 optionally includes wherein the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz.
  • Example 31 the subject matter of any one or more of Examples 28-30 optionally include wherein synchronizing the clean audio segment and the dirty audio segment includes using a phase reversal synchronization.
  • Example 32 the subject matter of any one or more of Examples 27-31 optionally include wherein the set of parameters and optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
  • Example 33 the subject matter of Example 32 optionally includes wherein the optimization definition is an amoeba simplex optimization.
  • Example 34 the subject matter of any one or more of Examples 32-33 optionally include wherein the optimization definition is a Monte Carlo optimization.
  • Example 35 the subject matter of any one or more of Examples 27-34 optionally include wherein using clean-diff includes: time-windowing both the clean audio segment and the dirty audio segment; computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment; dividing the time-window by a frequency filter into a plurality of portions; summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals; converting the two energy totals to the decibel scale; and measuring a difference between the two energy totals.
  • Example 36 the subject matter of Example 35 optionally includes wherein the frequency filter is a Mel filter.
  • Example 37 the subject matter of any one or more of Examples 35-36 optionally include wherein summing the energy includes weighting higher frequencies more.
  • Example 38 the subject matter of Example 37 optionally includes wherein the weighting is a factor of twenty one decibels divided by eight kilohertz.
  • Example 39 the subject matter of any one or more of Examples 35-38 optionally include wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
  • Example 40 is at least one machine readable medium including instructions that, when executed by a machine, cause the machine to perform any of the methods of Examples 27-39.
  • Example 41 is a system including means to perform any of the methods of Examples 27-39.
  • Example 42 is at least one machine readable medium including instructions for automatic tuning of speech recognition parameters, the instructions, when executed by a machine, causing the machine to perform operations comprising: obtaining a clean audio segment, the clean audio segment being noiseless; obtaining a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and optimizing preprocessing parameters by iteratively: selecting a set of parameters; preprocessing the clean audio segment with the set of parameters to produce a first result; preprocessing the dirty audio segment with the set of parameters to produce a second result; scoring a portion of the first result with a corresponding portion of the second result using clean-diff; and providing the set of parameters when an optimization threshold is reached.
  • Example 43 the subject matter of Example 42 optionally includes wherein preprocessing the clean audio segment and preprocessing the dirty audio segment includes synchronizing the clean audio segment and the dirty audio segment.
  • Example 44 the subject matter of Example 43 optionally includes wherein synchronizing the clean audio segment and the dirty audio segment includes using chirp detection to narrow a region of interest.
  • Example 45 the subject matter of Example 44 optionally includes wherein the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz.
  • Example 46 the subject matter of any one or more of Examples 43-45 optionally include wherein synchronizing the clean audio segment and the dirty audio segment includes using a phase reversal synchronization.
  • Example 47 the subject matter of any one or more of Examples 42-46 optionally include wherein the set of parameters and optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
  • Example 48 the subject matter of Example 47 optionally includes wherein the optimization definition is an amoeba simplex optimization.
  • Example 49 the subject matter of any one or more of Examples 47-48 optionally include wherein the optimization definition is a Monte Carlo optimization.
  • Example 50 the subject matter of any one or more of Examples 42-49 optionally include wherein using clean-diff includes: time-windowing both the clean audio segment and the dirty audio segment; computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment; dividing the time-window by a frequency filter into a plurality of portions; summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals; converting the two energy totals to the decibel scale; and measuring a difference between the two energy totals.
  • Example 51 the subject matter of Example 50 optionally includes wherein the frequency filter is a Mel filter.
  • Example 52 the subject matter of any one or more of Examples 50-51 optionally include wherein summing the energy includes weighting higher frequencies more.
  • Example 53 the subject matter of Example 52 optionally includes wherein the weighting is a factor of twenty one decibels divided by eight kilohertz.
  • Example 54 the subject matter of any one or more of Examples 50-53 optionally include wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

System and techniques for automatic tuning of speech recognition parameters are described herein. A clean audio segment and a dirty audio segment may be obtained. In an iterative fashion, optimized preprocessing parameters may be obtained by, at an iteration, selecting a set of parameters, preprocessing the clean audio segment with the set of parameters to produce a first result, preprocessing the dirty audio segment with the set of parameters to produce a second result, and scoring a portion of the first result with a corresponding portion of the second result using clean-diff. When an optimization threshold is reached, the iterative process exits and the set of parameters from the last iteration is provided.

Description

AUTOMATIC TUNING OF SPEECH RECOGNITION PARAMETERS
TECHNICAL FIELD
[0001] Embodiments described herein generally relate to automatic speech recognition (ASR) and more specifically to automatic tuning of speech recognition parameters.
BACKGROUND
[0002] Automatic speech recognition (ASR) systems generally employ a sensing stack to receive sound from a subject and an interpretation stack to extract words from that sound. The sensing stack includes a microphone and one or more conditioning elements to sample or modify the output of the microphone to facilitate the interpretation stack operations. Often, different devices, such as laptop computers, smart phones, etc., have different components (e.g., microphones, body materials, microphone placements, etc.) that vary the received sound. Also, sound stacks often have different parameters that may be used to condition the received sound. The parameters not only address the sound sensing variations from the hardware, but often are intended to increase the signal-to-noise ratio of the spoken words over background noise to increase ASR success.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. Like numerals having different letter suffixes may represent different instances of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.
[0004] FIG. 1 is a block diagram of an example of an environment including a device for automatic tuning of speech recognition parameters, according to an embodiment.
[0005] FIG. 2 illustrates an example of a flow for automatic tuning of speech recognition parameters, according to an embodiment.
[0006] FIG. 3 illustrates an example of a method for automatic tuning of speech recognition parameters, according to an embodiment.
[0007] FIG. 4 illustrates an example of a sound wave and energy componentization for clean-diff, according to an embodiment.
[0008] FIG. 5 illustrates an example of a method for performing Clean-Diff, according to an embodiment.
[0009] FIG. 6 is a block diagram illustrating an example of a machine upon which one or more embodiments may be implemented.
DETAILED DESCRIPTION
[0010] Selecting conditioning parameters that facilitate word extraction may be a daunting task given the variety of hardware configurations available in devices and the number of conditioning parameters. Further, selecting these parameters is often necessary for each new device that comes out, possibly lengthening device or ASR system release times. Traditionally, the parameter selection is performed by a human, in a lab, trying various parameters and then testing the output, for example, by measuring word error rates (WERs) between the actual speech and a transcript produced by the ASR system. Further, these tests often include a clean sound and a dirty, or noisy, sound, both including the words to be transcribed. For example, the clean sound may be a monologue performed in a quiet room whereas the dirty sound is the same monologue performed on a busy city street. The clean sound may include some noise, such as noise from the device (e.g., fans) or room
reverberation. In contrast, the dirty sound includes significant sources of noise, such as air conditioning, a kitchen mixer running, a washing machine running, a car engine running, etc. [0011] To address the latency problems and the expense of having a human set the conditioning parameters, and also to provide a higher quality selection of those parameters, an automatic tuning of speech recognition parameters is herein described. The system obtains (e.g., receives, retrieves, etc.) both a clean and a dirty copy of a sample speech. It then iteratively selects a set of parameters, applies them to both the clean and the dirty audio, tests how close the results are, and then repeats until an optimization threshold is reached. At that point, the last selected conditioning parameters are output to be used in the ASR system. This approach thus employs automatic tuning of the pre-processing (e.g., conditioning) parameters using a searching algorithm for local minimum discovery. The searching technique may be in a multidimensional space representing a function of pre-processing parameters. The outcome of the function reflects the ASR scores (e.g., WERs). WER is a popular metric used for evaluation of ASR systems. The WER value corresponds to the percentage of words that have been misrecognized; a sketch of its computation is given below.
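WER itself is not defined algorithmically in this description. A minimal sketch, assuming the usual definition as word-level edit distance divided by the number of reference words (the function name and the example phrases are hypothetical):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level Levenshtein distance
    between reference and hypothesis, over the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g., wer("turn the lights on", "turn lights off") -> 50.0
```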
[0012] One issue with iteratively processing sets of parameters in the manner described is the fitness function used, that is, the test to determine whether the present parameters are better than the last parameters. Such a fitness function may be a bottleneck in the testing, leading to longer optimization times.
[0013] A new fitness function, herein called clean-diff, may be used to efficiently test the results of using one parameter over another. Clean-diff sums the energy in both the clean and the dirty audio results to produce a single value that may be compared against other configurations. Clean-diff, however, takes into account the ASR system parameters when summing this energy, thus avoiding amplifying sound components that are mitigated or ignored by the ASR system, or attenuating sound components that are amplified by the ASR system. Thus, the clean-diff metric approximates the ASR scores and can be computed much faster than previous techniques.
[0014] Experimental results demonstrate the effectiveness of the present system and technique, when compared to human efforts to select optimal pre- processing parameters. In the test, human-best (e.g., the best that a human could do) parameters were manually selected to achieve the best WER improvement over various conditions. This human assisted tuning process took a few weeks and was performed with different speech samples and under different noise conditions. While optimizing human-best parameters, the WER metric was used.
[0015] The presently described system, using clean-diff, was also used to automatically produce pre-processing parameters for the sample speech and noise conditions discussed above. Then, both human-best and auto tuned parameters were tested and evaluated with the WER metric using speech sample and noise conditions not previously used for the pre-processing parameter tuning. The automatic tuning achieved significant WER improvement, as illustrated in the following results:
[The header of the first results table was lost to an inline image; its columns presumably match those of the second table, with values in WER %.]
Human-best        10.4   14.3   13.9   14.0   16.5   13.8
Auto tuned        10.7   14.2   12.6   13.9   13.3   12.9
Improvement [%]   -2.5    0.1    9.4    0.7   19.1    6.2

Notebook #3 (ASR B)
                  Clean   Cafe   Side Speaker   Outdoor   Side Music   Average
Human-best         10.3   15.8       20.1         14.0       19.1        15.9
Auto tuned         10.7   15.1       11.8         13.9       17.1        13.7
Improvement [%]    -3.9    4.4       41.3          0.7       10.5        13.5

Note: raw results (without pre-processing) were much worse than those presented above.
[0016] As illustrated above, the results of the present system and technique, by automatically selecting objectively good pre-processing parameters for a specific device and use case, improve quality and simplify and shorten the process of tuning pre-processing parameters for ASR. Thus, the present system improves the ASR experience for users of a wide range of devices, such as ultra-books, laptops, tablets, in-vehicle devices, phones, and wearables.
[0017] FIG. 1 is a block diagram of an example of an environment 100 including a device 105 for automatic tuning of speech recognition parameters, according to an embodiment. The device 105 includes a storage device 110 for clean audio, a storage device 115 for dirty audio, a tuner 120, and a storage device for preprocessing parameter options 125. In an example, one or more of the storage devices 110, 115, and 125 may be co-located on a single piece of hardware (e.g., a hard drive) or collection of hardware (e.g., a drive cluster). The environment 100 illustrates an example of a test lab in which ambient sound is mitigated. The illustrated speakers provide the clean and dirty audio signals that are received by the device 105. In an example, a user environment, such as at a home, may be used. In such an environment, the clean audio signal may be stored on the device 105 and the dirty audio signal may be played through a device speaker and received by a device microphone. Such an arrangement may provide later tuning given a unique environment of the user. [0018] In an example, some or all of the components 110, 115, 120, 125, and 130 may be located in another device 107. The device 107 may receive raw audio recordings from the device 105 and perform the processing described herein. For example, after recording unprocessed clean and dirty audio on the device 105 under tuning, the device 105 may be removed from the audio chamber (e.g., environment 100) and the audio signals may be transferred to a dedicated computer (e.g., device 107). In an example, raw (e.g., unprocessed) recordings from the device 105 are used in an offline way (e.g., on device 107) to find the optimal parameters.
[0019] The storage device 110 is arranged to store the clean audio segment. As noted above, the clean audio segment may be noiseless. As understood by one of ordinary skill in the art, being noiseless does not entail the complete lack of noise; it does, however, entail minimal noise. The storage device 115 is arranged to store the dirty audio segment. As also noted above, the dirty audio segment is the clean audio segment with the addition of noise. Example noise may include simulated or real machine sounds, background conversations, traffic sounds, etc.
[0020] The tuner 120 is arranged to optimize pre-processing parameters. This optimization is performed in an iterative fashion. In each iteration, the tuner 120 selects a set of parameters, pre-processes the clean and dirty audio segments with the selected set of parameters, scores the results, and outputs the parameters if an optimization threshold is reached; otherwise the iterative elements are repeated with a different set of parameters. Details of each element are described below.
[0021] The tuner 120 is arranged to select the set of parameters. The set of parameters is selected from available pre-processing parameter options 125.
Available parameters may include one or more of frame size, frame step, number of frequency filters (e.g., Mel filters), frequency filter distribution (e.g., Mel filter distribution), frequency compensation factors (e.g., high-frequency compensation factor), or usable energy range. Any other ASR parameter may be used, however. Further, other filters or scales, such as the Bark scale, may be used.

[0022] Parameter selection from one iteration to the next may be performed in a number of ways. For example, the gradient descent method may be used to move a linear parameter in the direction that produces a better result and away from the direction that produces a worse result. To address being stuck in a local minimum, a random selection of a parameter value may be used to escape the local minimum. Often, the selection criterion is related to the optimization threshold. Thus, if a parameter value is selected that is better than its neighbors in both directions, the optimization threshold is reached. However, in an example, the optimization threshold may be time or iteration based. Thus, after ten minutes or ten iterations, for example, the optimization threshold is reached. In an example, the tuner 120 is arranged to select the set of parameters and optimization threshold based on an optimization definition of a many-dimensional nonlinear optimization. In an example, the optimization definition is an amoeba simplex optimization. In an example, the optimization definition is a Monte Carlo optimization.
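As an illustration only, the following Python sketch implements one such search: a greedy per-parameter descent with random jumps to escape local minima and a wall-clock optimization threshold. It is not the patented implementation; score_fn (the clean-diff evaluation of a candidate parameter set), the parameter names, and the bounds are hypothetical stand-ins.

import random
import time

def tune(start, bounds, score_fn, max_seconds=600):
    # Greedy per-parameter descent; the optimization threshold here is
    # simply elapsed wall-clock time. Lower scores are better, since the
    # score is assumed to be the clean-diff metric.
    current, current_score = dict(start), score_fn(start)
    best, best_score = dict(current), current_score
    deadline = time.monotonic() + max_seconds
    while time.monotonic() < deadline:
        improved = False
        for name, (lo, hi, step) in bounds.items():
            for cand in (current[name] - step, current[name] + step):
                if lo <= cand <= hi:
                    trial = dict(current, **{name: cand})
                    s = score_fn(trial)
                    if s < current_score:
                        current, current_score, improved = trial, s, True
        if current_score < best_score:
            best, best_score = dict(current), current_score
        if not improved:
            # Local minimum: jump one parameter to a random value.
            name = random.choice(list(bounds))
            lo, hi, _ = bounds[name]
            current = dict(current, **{name: random.uniform(lo, hi)})
            current_score = score_fn(current)
    return best, best_score

For example, tune({'frame_ms': 16}, {'frame_ms': (8, 32, 2)}, score_fn) would step the frame size in 2 ms increments within the time budget.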
[0023] The tuner 120 is arranged to pre-process the clean audio segment with the set of parameters to produce a first result and to pre-process the dirty audio segment with the set of parameters to produce a second result. These are the two results that will be compared to determine the value of the selected pre-processing parameters for this iteration. In an example, pre-processing the clean audio segment and pre-processing the dirty audio segment includes the tuner 120 to synchronize the clean audio segment and the dirty audio segment. Synchronization is used to ensure that the same speech signal in each of the clean and dirty audio segments is being compared, thus isolating the effect of the selected parameters on the noise in the dirty audio segment. In an example, the synchronization includes using chirp detection to narrow a region of interest in the audio segments. In an example, the chirp detection employs a linear one second chirp from 500 hertz to 3.5 kilohertz. In an example, the synchronization employs a phase reversal synchronization.
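The chirp-based narrowing can be viewed as matched filtering: generate the described reference chirp and locate it in each recording by cross-correlation. A minimal sketch, assuming 16 kHz mono signals (the exact detector used in the system is not specified):

import numpy as np
from scipy.signal import chirp

def find_chirp(recording, sr=16000):
    # Reference: linear 1-second chirp from 500 Hz to 3.5 kHz.
    t = np.arange(0, 1.0, 1.0 / sr)
    ref = chirp(t, f0=500.0, t1=1.0, f1=3500.0, method='linear')
    # Matched filter via cross-correlation; the peak marks the chirp start.
    corr = np.correlate(recording, ref, mode='valid')
    return int(np.argmax(np.abs(corr)))

Running find_chirp on both the clean and dirty recordings gives two start indices from which a common region of interest can be cut before finer, sample-level alignment.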
[0024] The tuner 120 is arranged to score a portion of the first result with a corresponding portion of the second result using clean-diff. As noted above, clean-diff compares the energy in the two signals. The more similar the energy output of each result, the better the selected pre-processing parameters performed. As noted above, when determining the energy of the two results, clean-diff accounts for the way in which the ASR system will use audio segments. The following description provides additional details on techniques involved in computing the clean-diff metric.
[0025] In an example, clean-diff includes time-windowing (e.g., framing) both the clean audio segment and the dirty audio segment. In an example, the time-windows match a corresponding parameter for the ASR system being tuned. In an example, Hann windows may be used. In an example, the windows (e.g., frames) are sixteen milliseconds in length. In an example, the windows overlap, but not completely, with neighboring windows. In an example, the windows have an offset (e.g., ten milliseconds) between starting points of a preceding neighbor window.
[0026] In an example, clean-diff includes computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment. That is, for a given window, a power spectrum for the clean audio segment is computed and a second power spectrum for the dirty audio segment is computed.
[0027] In an example, clean-diff includes dividing the time-window by a frequency filter into a plurality of portions. In an example, the frequency filter is a Mel filter. An illustration for such a division is provided in FIG. 4.
[0028] In an example, clean-diff includes summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals, one for each. In an example, summing the energy includes weighting higher frequencies more than lower frequencies. In an example, the weighting is a factor of twenty one decibels divided by eight kilohertz. In an example, clean-diff includes converting the two energy totals to the decibel scale.
[0029] In an example, clean-diff includes measuring a difference between the two energy totals. Thus, the result of clean-diff is how different the two pre-processed audio segments are. The more similar the results, the smaller the difference, and the more successful the set of pre-processing parameters selected for this iteration.

[0030] The tuner 120 is arranged to provide the set of parameters, for this iteration, as output when the optimization threshold (discussed above) is reached. In an example, the output may be directly applied to the conditioning stack for the device 105. Such output may be useful, for example, to an end-user further adjusting pre-processing parameters. In an example, the output is provided to a database for devices similar to the device 105 to use. In this example, the output may be used, for example, in the manufacturing or configuration process of an original equipment manufacturer to produce a higher quality product in less time than was previously achievable.
[0031] FIG. 2 illustrates an example of a flow 200 for automatic tuning of speech recognition parameters, according to an embodiment. The actions of the flow 200 are performed on computer hardware such as that described above with respect to FIG. 1 or below with respect to FIG. 6 (e.g., circuit sets), and the data elements are stored in machine readable media such as described below with respect to FIG. 6. Generally, the flow 200 searches for optimal pre-processing tuning parameters per device type/model. Thus, the flow 200 may be performed at least once per device type/model.
[0032] The flow 200 may be repeated depending on the device orientation (e.g., setup 205), which may be detected during device usage, when different pre-processing parameters may be applied. A number of scenarios may be considered, including:
• Placement of the user (distance, angles).
• Room type. Several experiments have been conducted inside a standardized audio lab (ETSI EG 202 396) and inside quiet office rooms, etc.
• Placement of the device: on a table, in hands, on a stand, on a wall, etc.
• Noise recreation methods.
Also, as part of the setup 205, pre-processing parameters are selected for tuning.
Generally, the default minimal and maximal values of these parameters are known. If pre-processing has too many parameters, the parameters may be grouped, and tuning performed on the groups. The first group of parameters may include the pre-processing parameters judged the most important. Experimental results suggest that the number of parameters per group should be in a range of 3-7. However, it is possible to tune more parameters (e.g., nine or more) at once.
[0033] Once the setup 205 is complete, clean speech is recorded on the targeted device (e.g., action 210). The clean speech may be recreated via an artificial mouth or a high quality loudspeaker. Good results have been achieved with six minutes of total recording. In these results, twenty different speakers were used, with 0.9-second intervals (e.g., no speaking) between utterances.
[0034] Noisy speech is also recorded (action 215). The speaking portion is exactly the same as that present in the clean recording. During speech, many different noises may be recreated with high variability of level, directivity, type, environment, etc. All these noises should, however, be in a reasonable range of levels. A few seconds of the noisy recording may be clean (without any recreated noise).
[0035] Both recordings are trimmed and synchronized (action 220). For synchronization, at least these two methods are available:
• Chirp detection for narrowing the region of interest. The system may use a linear 1-second chirp from 500 Hz to 3.5 kHz.
• Modem/fax-like synchronization based on phase reversal (/ANS) for sample-level precision.
[0036] The clean recording is pre-processed with a pre-processing algorithm with default parameters (action 225). The initial pre-processing parameters are chosen (action 230). The noisy recording is pre-processed (action 235) and compared to the pre-processed clean recording using clean-diff (action 240).
[0037] After the clean-diff is calculated between the two pre-processed recordings, the parameters are changed (action 245) and the noisy recording is pre-processed and compared again to the clean recording (loop to action 235) using the changed parameters. This process may be repeated several times for optimization (e.g., producing a clean-diff metric of the lowest possible value within given constraints). Optimization techniques that were used successfully include amoeba simplex; however, other many-dimensional nonlinear optimization methods may be used, e.g., Monte Carlo.

[0038] While actions 235, 240, and 245 form an inner loop of the flow 200, an outer loop may be used to change the initial parameters (action 250) applied at action 230. In some examples, starting from different beginning pre-processing parameters may significantly improve the flow 200.
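These two loops can be sketched with an off-the-shelf Nelder-Mead (amoeba simplex) optimizer. In this illustration, preprocess and clean_diff are hypothetical stand-ins for the pre-processing chain and the metric described below, and the parameter vector encoding is assumed:

import numpy as np
from scipy.optimize import minimize

def objective(param_vector, clean_pre, noisy_raw):
    # Inner loop body (actions 235-240): pre-process the noisy recording
    # with the trial parameters and compare against the fixed
    # pre-processed clean recording.
    noisy_pre = preprocess(noisy_raw, param_vector)  # hypothetical helper
    return clean_diff(clean_pre, noisy_pre)          # hypothetical helper

def tune(clean_pre, noisy_raw, starting_points):
    # Outer loop (action 250): restart the simplex from several different
    # initial parameter vectors and keep the best result found.
    results = []
    for x0 in starting_points:
        res = minimize(objective, np.asarray(x0, dtype=float),
                       args=(clean_pre, noisy_raw), method='Nelder-Mead')
        results.append((res.fun, res.x))
    return min(results, key=lambda r: r[0])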
[0039] Finally, a WER-based evaluation can be executed over the few best pre-processing parameter sets found (finish 255). Such analysis may provide a more detailed test for the few parameter sets left, possibly increasing accuracy in the final result.
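For reference, the WER used in this final evaluation is the standard word-level edit distance; a minimal sketch:

def wer(reference, hypothesis):
    # Word error rate in %: Levenshtein distance over words divided by
    # the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits turning the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)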
[0040] FIG. 3 illustrates an example of a method 300 for automatic tuning of speech recognition parameters, according to an embodiment. The operations of the method 300 are performed on computer hardware such as that described above with respect to FIG. 1 or below with respect to FIG. 6 (e.g., circuit sets).
[0041] At operation 305, a clean audio segment is obtained (e.g., retrieved, received, etc.). In an example, the clean audio segment is noiseless.
[0042] At operation 310, a dirty audio segment is obtained. The dirty audio segment is the clean audio segment with added noise.
[0043] At operation 315, pre-processing parameters are iteratively optimized. The individual operations involved with each iteration include operations 320-340.
[0044] At operation 320, a set of parameters is selected. In an example, selecting the set of parameters for a given iteration follows an optimization definition of a many-dimensional nonlinear optimization. In an example, the optimization definition is an amoeba simplex optimization. In an example, the optimization definition is a Monte Carlo optimization.
[0045] At operation 325, the clean audio segment is pre-processed with the set of parameters to produce a first result.
[0046] At operation 330, the dirty audio segment is pre-processed with the set of parameters to produce a second result. In an example, pre-processing the clean audio segment (operation 325) and pre-processing the dirty audio segment include synchronizing the clean audio segment and the dirty audio segment. In an example, synchronizing the clean audio segment and the dirty audio segment includes using chirp detection to narrow a region of interest. In an example, the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz. In an example, synchronizing the clean audio segment and the dirty audio segment includes using a phase reversal synchronization.
[0047] At operation 335, a portion of the first result is scored with a corresponding portion of the second result using clean-diff. In an example, using clean-diff includes time-windowing both the clean audio segment and the dirty audio segment. Using clean-diff also includes computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment. Using clean-diff also includes dividing the time-window by a frequency filter into a plurality of portions. In an example, the frequency filter is a Mel filter. In an example, the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
[0048] Clean-diff also includes summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals. In an example, summing the energy includes weighting higher frequencies more. In an example, the weighting is a factor of twenty one decibels divided by eight kilohertz. Clean-diff may then convert the two energy totals to the decibel scale and measure a difference between the two energy totals.
[0049] At decision 340, it is determined whether the optimization threshold is reached. If so, the set of parameters is provided. If not, the method returns to operation 320 in a next iteration.
[0050] FIGS. 4 and 5 illustrate the clean-diff technique. FIG. 4 illustrates an example of a sound wave and energy componentization 400 for clean-diff, according to an embodiment, and FIG. 5 illustrates an example of a method 500 for performing Clean-Diff, according to an embodiment. The operations of the method
500 are performed on computer hardware such as that described above with respect to FIG. 1 or below with respect to FIG. 6 (e.g., circuit sets).
[0051] Clean-diff provides a difference between a clean and a noisy audio segment. As illustrated in FIG. 4, two mono 16 kHz speech signals, clean and noisy, are compared. Both signals have the same length and are synchronized with an accuracy of up to a few samples (generally, the higher the accuracy the better). For higher sampling rate signals, the synchronization should be done before down-sampling. In an example, a cross-correlation based technique may be used for time synchronization.
[0052] Before comparison, both signals are pre-processed once. The clean signal is pre-processed with default pre-processing parameters. These default pre-processing parameters should provide reasonable WER improvement over noisy data and should not increase WER over clean data. During tuning, several different pre-processing parameters are selected for the noisy signal.
[0053] Both signals, clean and noisy, are divided into short overlapping frames (e.g., 16 ms), with a small offset (step) between the frames (e.g., 10 ms). These are illustrated as vertical lines in FIG. 4. For the frames and filters of both signals, the energy in decibels (dB) is computed according to the following (illustrated in FIG. 4):
1. A frame is windowed with a Hann window (other types of windowing can also be used). (Vertical lines in FIG. 4.)
2. The power spectrum is computed.
3. A Mel filter bank is applied (horizontal lines in FIG. 4), although other filter distributions (e.g., Bark) may also be used. The energy is summed in each filter.
4. All filter bank energies are converted to the dB scale.
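Steps 1-4 can be sketched in Python as follows. This is a minimal illustration assuming 16 kHz mono input, 16 ms Hann frames with a 10 ms step, and a textbook triangular Mel bank; the exact filter shapes used in the experiments are not specified, so treat those details as assumptions.

import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel formula; one of several common variants.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=23, n_fft=256, sr=16000):
    # Triangular filters spaced evenly on the mel scale up to Nyquist.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb

def filterbank_energies_db(signal, sr=16000, frame_ms=16, step_ms=10,
                           n_filters=23):
    frame_len = int(sr * frame_ms / 1000)  # 256 samples at 16 kHz
    step = int(sr * step_ms / 1000)        # 160 samples at 16 kHz
    window = np.hanning(frame_len)         # step 1: Hann window
    fb = mel_filterbank(n_filters, frame_len, sr)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2  # step 2: power spectrum
        energies = fb @ power                    # step 3: mel filter energies
        rows.append(10.0 * np.log10(np.maximum(energies, 1e-12)))  # step 4: dB
    return np.array(rows)  # shape: [frames, filters]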
[0054] High frequency energies are adjusted (e.g., compensated for using a factor). For example, a factor of 21 dB over 8 kHz (or about 8 kHz) may be used. This compensation factor results in no compensation for the first Mel filter (filter with lowest frequencies) and 21 dB compensation for the last Mel filter (filter with highest frequencies). In an example, the compensation is done according to the pseudo-code:
foreach frame {
    for filter = 1 to FILTERS-1 {
        energy_dB[frame, filter] += (COMPENSATION_FACTOR * filter / (FILTERS-1));
    }
}
In an example, pseudo-code values are:
• COMPENSATION_FACTOR = 21
o The high frequency compensation factor in dB
• FILTERS = 23
o The number of Mel filters.
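The same compensation can be expressed as one vectorized operation over a [frames, filters] matrix (a sketch; energies_db is the hypothetical output of the filter-bank computation above):

import numpy as np

COMPENSATION_FACTOR = 21.0  # dB at the highest mel filter
FILTERS = 23
# Linear ramp: 0 dB for the first filter up to 21 dB for the last.
energies_db += COMPENSATION_FACTOR * np.arange(FILTERS) / (FILTERS - 1)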
[0055] After high frequency compensation, computed energies in dB are normalized to align highest values between both clean and noisy pre-processed signals. This may be done based on a fragment (e.g., audio segment) where both signals are clean (e.g., the noisy signal does not have added noise). Generally, to facilitate better results, a few seconds of the noisy signal should be clean.
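A minimal sketch of this normalization, assuming clean_db and noisy_db are [frames, filters] dB matrices and that the first n_clean frames of the noisy recording are known to be noise-free (both names are illustrative):

# Align peak energies using the fragment where both signals are clean.
offset = clean_db[:n_clean].max() - noisy_db[:n_clean].max()
noisy_db += offset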
[0056] For the clean signal, the top energy, in decibels, is computed. This value may be the maximum filter energy over all filters and all frames. For each frame, a frame difference between the clean and noisy signals is computed. Then, the metric value is calculated as an average of all frame differences. Thus, the method 500 proceeds as follows: the resultant value is initialized to zero (operation 505). If there are more frames to process (decision 510), the clean (A) and noisy (B) energy values are initialized to zero (operation 515) for a frame. If there are more filters to process (decision 520), the measured energy in the component, defined by the frame and filter, is added to the respective energy values: clean energy being added to A (operation 525) and noisy energy to B (operation 530). After these values are updated, the frame difference is determined by subtracting the clean energy value from the noisy energy value (operation 535), or B-A. The magnitude of this difference is then added to the global total for the resultant value (e.g., clean_diff below) (operation 540). The method 500 then proceeds against further filters in the given frame (decision 520), and, if there are no further filters, against further frames (decision 510). When the frames have all been processed, the method averages the resultant value across the components (e.g., filter-frame combinations) and returns the result (operation 545). The following pseudo-code illustrates these procedures:
clean_diff = 0.0;
foreach frame {
    foreach filter {
        a = clean[frame, filter];   // energy in dB
        b = noisy[frame, filter];   // energy in dB
        if (a < top_energy - ENERGY_RANGE) and (a > b) then
            frame_difference = 0.0;
        else
            frame_difference = (b - a);
        clean_diff += Abs(frame_difference);
    }
}
clean_diff = 100.0 * clean_diff / (FILTERS * frames);
In an example, the value ENERGY_RANGE may be 45 (the usable energy range in decibels).
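A direct Python translation of the pseudo-code (a sketch; clean_db and noisy_db are [frames, filters] energy matrices in dB after compensation and normalization, as in the earlier sketches):

import numpy as np

def clean_diff(clean_db, noisy_db, top_energy=None, energy_range=45.0):
    # Average per-component energy difference between the pre-processed
    # clean and noisy signals; lower is better.
    frames, filters = clean_db.shape
    if top_energy is None:
        top_energy = clean_db.max()  # max over all frames and filters
    total = 0.0
    for frame in range(frames):
        for filt in range(filters):
            a = clean_db[frame, filt]  # clean energy in dB
            b = noisy_db[frame, filt]  # noisy energy in dB
            if a < top_energy - energy_range and a > b:
                diff = 0.0  # clean energy below usable range; ignore dips
            else:
                diff = b - a
            total += abs(diff)
    return 100.0 * total / (filters * frames)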
[0057] FIG. 6 illustrates a block diagram of an example machine 600 upon which any one or more of the techniques (e.g., methodologies) discussed herein may perform. In alternative embodiments, the machine 600 may operate as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine 600 may operate in the capacity of a server machine, a client machine, or both in server-client network environments. In an example, the machine 600 may act as a peer machine in a peer-to-peer (P2P) (or other distributed) network environment. The machine 600 may be a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a mobile telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein, such as cloud computing, software as a service (SaaS), or other computer cluster configurations.
[0058] Examples, as described herein, may include, or may operate by, logic or a number of components, or mechanisms. Circuit sets are a collection of circuits implemented in tangible entities that include hardware (e.g., simple circuits, gates, logic, etc.). Circuit set membership may be flexible over time and underlying hardware variability. Circuit sets include members that may, alone or in combination, perform specified operations when operating. In an example, hardware of the circuit set may be immutably designed to carry out a specific operation (e.g., hardwired). In an example, the hardware of the circuit set may include variably connected physical components (e.g., execution units, transistors, simple circuits, etc.) including a computer readable medium physically modified (e.g., magnetically, electrically, moveable placement of invariant massed particles, etc.) to encode instructions of the specific operation. In connecting the physical components, the underlying electrical properties of a hardware constituent are changed, for example, from an insulator to a conductor or vice versa. The instructions enable embedded hardware (e.g., the execution units or a loading mechanism) to create members of the circuit set in hardware via the variable connections to carry out portions of the specific operation when in operation.
Accordingly, the computer readable medium is communicatively coupled to the other components of the circuit set member when the device is operating. In an example, any of the physical components may be used in more than one member of more than one circuit set. For example, under operation, execution units may be used in a first circuit of a first circuit set at one point in time and reused by a second circuit in the first circuit set, or by a third circuit in a second circuit set at a different time.
[0059] Machine (e.g., computer system) 600 may include a hardware processor 602 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a hardware processor core, or any combination thereof), a main memory 604 and a static memory 606, some or all of which may communicate with each other via an interlink (e.g., bus) 608. The machine 600 may further include a display unit 610, an alphanumeric input device 612 (e.g., a keyboard), and a user interface (UI) navigation device 614 (e.g., a mouse). In an example, the display unit 610, input device 612 and UI navigation device 614 may be a touch screen display. The machine 600 may additionally include a storage device (e.g., drive unit) 616, a signal generation device 618 (e.g., a speaker), a network interface device 620, and one or more sensors 621, such as a global positioning system (GPS) sensor, compass, accelerometer, or other sensor. The machine 600 may include an output controller 628, such as a serial (e.g., universal serial bus (USB)), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC), etc.) connection to communicate or control one or more peripheral devices (e.g., a printer, card reader, etc.).
[0060] The storage device 616 may include a machine readable medium 622 on which is stored one or more sets of data structures or instructions 624 (e.g., software) embodying or utilized by any one or more of the techniques or functions described herein. The instructions 624 may also reside, completely or at least partially, within the main memory 604, within static memory 606, or within the hardware processor 602 during execution thereof by the machine 600. In an example, one or any combination of the hardware processor 602, the main memory 604, the static memory 606, or the storage device 616 may constitute machine readable media.
[0061] While the machine readable medium 622 is illustrated as a single medium, the term "machine readable medium" may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) configured to store the one or more instructions 624.
[0062] The term "machine readable medium" may include any medium that is capable of storing, encoding, or carrying instructions for execution by the machine 600 and that cause the machine 600 to perform any one or more of the techniques of the present disclosure, or that is capable of storing, encoding or carrying data structures used by or associated with such instructions. Non-limiting machine readable medium examples may include solid-state memories, and optical and magnetic media. In an example, a massed machine readable medium comprises a machine readable medium with a plurality of particles having invariant (e.g., rest) mass. Accordingly, massed machine-readable media are not transitory propagating signals. Specific examples of massed machine readable media may include: non-volatile memory, such as semiconductor memory devices (e.g., Electrically
Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
[0063] The instructions 624 may further be transmitted or received over a communications network 626 using a transmission medium via the network interface device 620 utilizing any one of a number of transfer protocols (e.g., frame relay, internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), hypertext transfer protocol (HTTP), etc.). Example communication networks may include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone (POTS) networks, and wireless data networks (e.g., Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards known as Wi-Fi®, IEEE 802.16 family of standards known as WiMax®), IEEE 802.15.4 family of standards, peer-to-peer (P2P) networks, among others. In an example, the network interface device 620 may include one or more physical jacks (e.g., Ethernet, coaxial, or phone jacks) or one or more antennas to connect to the communications network 626. In an example, the network interface device 620 may include a plurality of antennas to wirelessly communicate using at least one of single-input multiple-output (SIMO), multiple-input multiple-output (MIMO), or multiple-input single-output (MISO) techniques. The term "transmission medium" shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine 600, and includes digital or analog communications signals or other intangible medium to facilitate
communication of such software.

Additional Notes & Examples
[0064] Example 1 is a device for automatic tuning of speech recognition parameters, the device comprising: a storage device to store: a clean audio segment, the clean audio segment being noiseless; and a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and a tuner to optimize preprocessing parameters, the tuner to iteratively: select a set of parameters; preprocess the clean audio segment with the set of parameters to produce a first result; preprocess the dirty audio segment with the set of parameters to produce a second result; score a portion of the first result with a corresponding portion of the second result using clean-diff; and provide the set of parameters when an optimization threshold is reached.
[0065] In Example 2, the subject matter of Example 1 optionally includes wherein to preprocess the clean audio segment and to preprocess the dirty audio segment includes the tuner to synchronize the clean audio segment and the dirty audio segment.
[0066] In Example 3, the subject matter of Example 2 optionally includes wherein to synchronize the clean audio segment and the dirty audio segment includes the tuner to use chirp detection to narrow a region of interest.
[0067] In Example 4, the subject matter of Example 3 optionally includes wherein the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz.
[0068] In Example 5, the subject matter of any one or more of Examples 2-4 optionally include wherein to synchronize the clean audio segment and the dirty audio segment includes the tuner to use a phase reversal synchronization.
[0069] In Example 6, the subject matter of any one or more of Examples 1-5 optionally include wherein the set of parameters and the optimization threshold are defined by an optimization definition of a many-dimensional nonlinear
optimization.
[0070] In Example 7, the subject matter of Example 6 optionally includes wherein the optimization definition is an amoeba simplex optimization. [0071] In Example 8, the subject matter of any one or more of Examples 6-7 optionally include wherein the optimization definition is a Monte Carlo
optimization.
[0072] In Example 9, the subject matter of any one or more of Examples 1-8 optionally include wherein to use clean-diff includes the tuner to: time-window both the clean audio segment and the dirty audio segment; compute a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment; divide the time-window by a frequency filter into a plurality of portions; sum the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals; convert the two energy totals to the decibel scale; and measure a difference between the two energy totals.
[0073] In Example 10, the subject matter of Example 9 optionally includes wherein the frequency filter is a Mel filter.
[0074] In Example 11, the subject matter of any one or more of Examples 9-
10 optionally include wherein to sum the energy includes the tuner to weight higher frequencies more.
[0075] In Example 12, the subject matter of Example 11 optionally includes wherein the weight is a factor of twenty one decibels divided by eight kilohertz.
[0076] In Example 13, the subject matter of any one or more of Examples 9-
12 optionally include wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
[0077] Example 14 is a system for automatic tuning of speech recognition parameters, the system comprising: means for obtaining a clean audio segment, the clean audio segment being noiseless; means for obtaining a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and means for optimizing preprocessing parameters by iteratively: selecting a set of parameters; preprocessing the clean audio segment with the set of parameters to produce a first result; preprocessing the dirty audio segment with the set of parameters to produce a second result; scoring a portion of the first result with a corresponding portion of the second result using clean-diff; and providing the set of parameters when an optimization threshold is reached.
[0078] In Example 15, the subject matter of Example 14 optionally includes wherein preprocessing the clean audio segment and preprocessing the dirty audio segment includes means for synchronizing the clean audio segment and the dirty audio segment.
[0079] In Example 16, the subject matter of Example 15 optionally includes wherein synchronizing the clean audio segment and the dirty audio segment includes means for using chirp detection to narrow a region of interest.
[0080] In Example 17, the subject matter of Example 16 optionally includes wherein the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz.
[0081] In Example 18, the subject matter of any one or more of Examples
15-17 optionally include wherein synchronizing the clean audio segment and the dirty audio segment includes means for using a phase reversal synchronization.
[0082] In Example 19, the subject matter of any one or more of Examples
14-18 optionally include wherein the set of parameters and optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
[0083] In Example 20, the subject matter of Example 19 optionally includes wherein the optimization definition is an amoeba simplex optimization.
[0084] In Example 21, the subject matter of any one or more of Examples
19-20 optionally include wherein the optimization definition is a Monte Carlo optimization.
[0085] In Example 22, the subject matter of any one or more of Examples 14-21 optionally include wherein using clean-diff includes: means for time-windowing both the clean audio segment and the dirty audio segment; means for computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment; means for dividing the time-window by a frequency filter into a plurality of portions; means for summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals; means for converting the two energy totals to the decibel scale; and means for measuring a difference between the two energy totals.
[0086] In Example 23, the subject matter of Example 22 optionally includes wherein the frequency filter is a Mel filter.
[0087] In Example 24, the subject matter of any one or more of Examples
22-23 optionally include wherein summing the energy includes means for weighting higher frequencies more.
[0088] In Example 25, the subject matter of Example 24 optionally includes wherein the weighting is a factor of twenty one decibels divided by eight kilohertz.
[0089] In Example 26, the subject matter of any one or more of Examples
22-25 optionally include wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
[0090] Example 27 is a method for automatic tuning of speech recognition parameters, the method comprising: obtaining a clean audio segment, the clean audio segment being noiseless; obtaining a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and optimizing preprocessing parameters by iteratively: selecting a set of parameters; preprocessing the clean audio segment with the set of parameters to produce a first result; preprocessing the dirty audio segment with the set of parameters to produce a second result; scoring a portion of the first result with a corresponding portion of the second result using clean-diff; and providing the set of parameters when an optimization threshold is reached.
[0091] In Example 28, the subject matter of Example 27 optionally includes wherein preprocessing the clean audio segment and preprocessing the dirty audio segment includes synchronizing the clean audio segment and the dirty audio segment.
[0092] In Example 29, the subject matter of Example 28 optionally includes wherein synchronizing the clean audio segment and the dirty audio segment includes using chirp detection to narrow a region of interest. [0093] In Example 30, the subject matter of Example 29 optionally includes wherein the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz.
[0094] In Example 31, the subject matter of any one or more of Examples 28-30 optionally include wherein synchronizing the clean audio segment and the dirty audio segment includes using a phase reversal synchronization.
[0095] In Example 32, the subject matter of any one or more of Examples
27-31 optionally include wherein the set of parameters and optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
[0096] In Example 33, the subject matter of Example 32 optionally includes wherein the optimization definition is an amoeba simplex optimization.
[0097] In Example 34, the subject matter of any one or more of Examples
32-33 optionally include wherein the optimization definition is a Monte Carlo optimization.
[0098] In Example 35, the subject matter of any one or more of Examples
27-34 optionally include wherein using clean-diff includes: time-windowing both the clean audio segment and the dirty audio segment; computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment; dividing the time-window by a frequency filter into a plurality of portions; summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals; converting the two energy totals to the decibel scale; and measuring a difference between the two energy totals.
[0099] In Example 36, the subject matter of Example 35 optionally includes wherein the frequency filter is a Mel filter.
[0100] In Example 37, the subject matter of any one or more of Examples
35-36 optionally include wherein summing the energy includes weighting higher frequencies more.
[0101] In Example 38, the subject matter of Example 37 optionally includes wherein the weighting is a factor of twenty one decibels divided by eight kilohertz. [0102] In Example 39, the subject matter of any one or more of Examples
35-38 optionally include wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
[0103] Example 40 is at least one machine readable medium including instructions that, when executed by a machine, cause the machine to perform any of methods 27-39.
[0104] Example 41 is a system including means to perform any of the methods of Examples 27-39.
[0105] Example 42 is at least one machine readable medium including instructions for automatic tuning of speech recognition parameters, the instructions, when executed by a machine, cause the machine to perform operations comprising: obtaining a clean audio segment, the clean audio segment being noiseless; obtaining a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and optimizing preprocessing parameters by iteratively: selecting a set of parameters; preprocessing the clean audio segment with the set of parameters to produce a first result; preprocessing the dirty audio segment with the set of parameters to produce a second result; scoring a portion of the first result with a corresponding portion of the second result using clean-diff; and providing the set of parameters when an optimization threshold is reached.
[0106] In Example 43, the subject matter of Example 42 optionally includes wherein preprocessing the clean audio segment and preprocessing the dirty audio segment includes synchronizing the clean audio segment and the dirty audio segment.
[0107] In Example 44, the subject matter of Example 43 optionally includes wherein synchronizing the clean audio segment and the dirty audio segment includes using chirp detection to narrow a region of interest.
[0108] In Example 45, the subject matter of Example 44 optionally includes wherein the chirp detection uses a linear one second chirp from 500 hertz to 3.5 kilohertz. [0109] In Example 46, the subject matter of any one or more of Examples
43-45 optionally include wherein synchronizing the clean audio segment and the dirty audio segment includes using a phase reversal synchronization.
[0110] In Example 47, the subject matter of any one or more of Examples 42-46 optionally include wherein the set of parameters and optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
[0111] In Example 48, the subject matter of Example 47 optionally includes wherein the optimization definition is an amoeba simplex optimization.
[0112] In Example 49, the subject matter of any one or more of Examples
47-48 optionally include wherein the optimization definition is a Monte Carlo optimization.
[0113] In Example 50, the subject matter of any one or more of Examples
42-49 optionally include wherein using clean-diff includes: time-windowing both the clean audio segment and the dirty audio segment; computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment; dividing the time-window by a frequency filter into a plurality of portions; summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals; converting the two energy totals to the decibel scale; and measuring a difference between the two energy totals.
[0114] In Example 51, the subject matter of Example 50 optionally includes wherein the frequency filter is a Mel filter.
[0115] In Example 52, the subject matter of any one or more of Examples 50-51 optionally include wherein summing the energy includes weighting higher frequencies more.
[0116] In Example 53, the subject matter of Example 52 optionally includes wherein the weighting is a factor of twenty one decibels divided by eight kilohertz.
[0117] In Example 54, the subject matter of any one or more of Examples 50-53 optionally include wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition machine readable medium being tuned.
[0118] The above detailed description includes references to the accompanying drawings, which form a part of the detailed description. The drawings show, by way of illustration, specific embodiments that may be practiced. These embodiments are also referred to herein as "examples." Such examples may include elements in addition to those shown or described. However, the present inventors also contemplate examples in which only those elements shown or described are provided. Moreover, the present inventors also contemplate examples using any combination or permutation of those elements shown or described (or one or more aspects thereof), either with respect to a particular example (or one or more aspects thereof), or with respect to other examples (or one or more aspects thereof) shown or described herein.
[0119] All publications, patents, and patent documents referred to in this document are incorporated by reference herein in their entirety, as though individually incorporated by reference. In the event of inconsistent usages between this document and those documents so incorporated by reference, the usage in the incorporated reference(s) should be considered supplementary to that of this document; for irreconcilable inconsistencies, the usage in this document controls.
[0120] In this document, the terms "a" or "an" are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of "at least one" or "one or more." In this document, the term "or" is used to refer to a nonexclusive or, such that "A or B" includes "A but not B," "B but not A," and "A and B," unless otherwise indicated. In the appended claims, the terms "including" and "in which" are used as the plain-English equivalents of the respective terms "comprising" and "wherein." Also, in the following claims, the terms "including" and "comprising" are open-ended, that is, a system, device, article, or process that includes elements in addition to those listed after such a term in a claim are still deemed to fall within the scope of that claim. Moreover, in the following claims, the terms "first," "second," and "third," etc. are used merely as labels, and are not intended to impose numerical requirements on their objects. [0121] The above description is intended to be illustrative, and not restrictive. For example, the above- described examples (or one or more aspects thereof) may be used in combination with each other. Other embodiments may be used, such as by one of ordinary skill in the art upon reviewing the above description. The Abstract is to allow the reader to quickly ascertain the nature of the technical disclosure and is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. Also, in the above Detailed Description, various features may be grouped together to streamline the disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment. The scope of the embodiments should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims

WHAT IS CLAIMED IS:
1. A device for automatic tuning of speech recognition parameters, the device comprising:
a storage device to store:
a clean audio segment, the clean audio segment being noiseless; and a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and
a tuner to optimize preprocessing parameters, the tuner to iteratively:
select a set of parameters;
preprocess the clean audio segment with the set of parameters to produce a first result;
preprocess the dirty audio segment with the set of parameters to produce a second result;
score a portion of the first result with a corresponding portion of the second result using clean-diff; and
provide the set of parameters when an optimization threshold is reached.
2. The device of claim 1, wherein the set of parameters and the optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
3. The device of claim 2, wherein the optimization definition is an amoeba simplex optimization.
4. The device of claim 2, wherein the optimization definition is a Monte Carlo optimization.
5. The device of claim 1, wherein to use clean-diff includes the tuner to: time-window both the clean audio segment and the dirty audio segment; compute a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment;
divide the time-window by a frequency filter into a plurality of portions; sum the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals;
convert the two energy totals to the decibel scale; and
measure a difference between the two energy totals.
6. The device of claim 5, wherein the frequency filter is a Mel filter.
7. The device of claim 5, wherein to sum the energy includes the tuner to weight higher frequencies more.
8. The device of claim 5, wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
9. A method for automatic tuning of speech recognition parameters, the method comprising:
obtaining a clean audio segment, the clean audio segment being noiseless; obtaining a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and
optimizing preprocessing parameters by iteratively:
selecting a set of parameters;
preprocessing the clean audio segment with the set of parameters to produce a first result;
preprocessing the dirty audio segment with the set of parameters to produce a second result;
scoring a portion of the first result with a corresponding portion of the second result using clean-diff; and providing the set of parameters when an optimization threshold is reached.
10. The method of claim 9, wherein the set of parameters and optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
11. The method of claim 10, wherein the optimization definition is an amoeba simplex optimization.
12. The method of claim 10, wherein the optimization definition is a Monte Carlo optimization.
13. The method of claim 9, wherein using clean-diff includes:
time-windowing both the clean audio segment and the dirty audio segment; computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment;
dividing the time-window by a frequency filter into a plurality of portions; summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals;
converting the two energy totals to the decibel scale; and
measuring a difference between the two energy totals.
14. The method of claim 13, wherein the frequency filter is a Mel filter.
15. The method of claim 13, wherein summing the energy includes weighting higher frequencies more.
16. The method of claim 13, wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition system being tuned.
17. At least one machine readable medium including instructions for automatic tuning of speech recognition parameters, the instructions, when executed by a machine, cause the machine to perform operations comprising:
obtaining a clean audio segment, the clean audio segment being noiseless; obtaining a dirty audio segment, the dirty audio segment being the clean audio segment with noise; and
optimizing preprocessing parameters by iteratively:
selecting a set of parameters;
preprocessing the clean audio segment with the set of parameters to produce a first result;
preprocessing the dirty audio segment with the set of parameters to produce a second result;
scoring a portion of the first result with a corresponding portion of the second result using clean-diff; and
providing the set of parameters when an optimization threshold is reached.
18. The machine readable medium of claim 17, wherein the set of parameters and optimization threshold are defined by an optimization definition of a many-dimensional nonlinear optimization.
19. The machine readable medium of claim 18, wherein the optimization definition is an amoeba simplex optimization.
20. The machine readable medium of claim 18, wherein the optimization definition is a Monte Carlo optimization.
21. The machine readable medium of claim 17, wherein using clean-diff includes:
time-windowing both the clean audio segment and the dirty audio segment; computing a respective power spectrum in a time-window corresponding to both the clean audio segment and the dirty audio segment;
dividing the time-window by a frequency filter into a plurality of portions; summing the energy in the respective plurality of portions for the clean audio segment and the dirty audio segment to create two energy totals;
converting the two energy totals to the decibel scale; and
measuring a difference between the two energy totals.
22. The machine readable medium of claim 21, wherein the frequency filter is a Mel filter.
23. The machine readable medium of claim 21, wherein summing the energy includes weighting higher frequencies more.
24. The machine readable medium of claim 21, wherein the time-windows and the frequency filter match corresponding parameters for an automatic speech recognition machine readable medium being tuned.
PCT/PL2015/050074 2015-12-22 2015-12-22 Automatic tuning of speech recognition parameters WO2017111634A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/PL2015/050074 WO2017111634A1 (en) 2015-12-22 2015-12-22 Automatic tuning of speech recognition parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/PL2015/050074 WO2017111634A1 (en) 2015-12-22 2015-12-22 Automatic tuning of speech recognition parameters

Publications (1)

Publication Number Publication Date
WO2017111634A1 true WO2017111634A1 (en) 2017-06-29

Family

ID=55237884

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/PL2015/050074 WO2017111634A1 (en) 2015-12-22 2015-12-22 Automatic tuning of speech recognition parameters

Country Status (1)

Country Link
WO (1) WO2017111634A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5924065A (en) * 1997-06-16 1999-07-13 Digital Equipment Corporation Environmently compensated speech processing
CN101533077A (en) * 2009-04-17 2009-09-16 中国科学院电工研究所 Optimal design method of superconducting magnet used for magnetic resonance imaging (MRI) device
US20100153104A1 (en) * 2008-12-16 2010-06-17 Microsoft Corporation Noise Suppressor for Robust Speech Recognition


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15828782

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15828782

Country of ref document: EP

Kind code of ref document: A1