CN112037771B

CN112037771B - Method and device for adjusting volume, electronic equipment and storage medium

Info

Publication number: CN112037771B
Application number: CN202010886561.1A
Authority: CN
Inventors: 单彦会; 荣玉军; 张俊杰; 蔡旭浦; 罗红
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Priority date: 2020-08-28
Filing date: 2020-08-28
Publication date: 2024-03-12
Anticipated expiration: 2040-08-28
Also published as: CN112037771A

Abstract

The embodiment of the invention relates to the field of voice recognition and discloses a method and a device for adjusting volume, electronic equipment and a storage medium. The method for adjusting the volume comprises the following steps: acquiring each audio sample in a training set for training a speech recognition model; wherein the speech recognition model is used for speech recognition; determining a volume value for each audio sample in the training set; determining a volume reference value of the training set according to the volume value of each audio sample; according to the volume reference value, adjusting the volume value of each audio sample; the difference value between the adjusted sound volume value of each audio sample and the sound volume reference value is within a preset difference value range. The volume adjusting method provided by the embodiment of the invention can be used for adjusting the volume of each piece of audio data based on the whole training set, and the volume value of the audio sample in the training set can be properly adjusted, so that the recognition effect of the voice recognition model is improved.

Description

Method and device for adjusting volume, electronic equipment and storage medium

Technical Field

The embodiment of the invention relates to the field of voice recognition, in particular to a method and device for adjusting volume, electronic equipment and a storage medium.

Background

With the development of computer technology, the voice recognition technology is applied to more and more fields, such as smart home, industrial control, voice interaction system of terminal equipment, and the like. The voice recognition technology is utilized to enable the information to be processed and acquired more conveniently, so that the working efficiency is improved. The speech recognition model is obtained by learning and reasoning through a deep neural network and performing iterative training on the basis of a large amount of audio data. The quality of the audio data used for training can greatly influence the effect of the speech recognition model.

The inventors found that the prior art has at least the following problems. The recognition effect of a speech recognition model depends heavily on the quality of the audio data used for training. The prior art first applies high-pass filtering to the training samples, which can filter out part of the effective data in the audio data, and then applies automatic gain control to the filtered samples. However, automatic gain control performs poorly when the speech is very quiet or very loud, so the volume cannot be adjusted well and the recognition effect of the resulting speech recognition model is poor.

Disclosure of Invention

The embodiment of the invention aims to provide a method, a device, electronic equipment and a storage medium for volume adjustment, which can be used for volume adjustment of each piece of audio data based on the whole training set, and can be used for properly adjusting the volume value of an audio sample in the training set, so that the recognition effect of a voice recognition model is improved.

In order to solve the above technical problems, an embodiment of the present invention provides a method for adjusting volume, including the following steps: acquiring each audio sample in a training set for training a speech recognition model; wherein the speech recognition model is used for speech recognition; determining a volume value for each audio sample in the training set; determining a volume reference value of the training set according to the volume value of each audio sample; according to the volume reference value, adjusting the volume value of each audio sample; the difference value between the adjusted sound volume value of each audio sample and the sound volume reference value is within a preset difference value range.

The embodiment of the invention also provides a device for adjusting the volume, which comprises: the acquisition module is used for acquiring each audio sample in a training set for training the voice recognition model; wherein the speech recognition model is used for speech recognition; a computing module for determining a volume value for each audio sample in the training set; the statistics module is used for determining a volume reference value of the training set according to the volume values of the audio samples; the adjusting module is used for adjusting the volume value of each audio sample according to the volume reference value; the difference value between the adjusted sound volume value of each audio sample and the sound volume reference value is within a preset difference value range.

The embodiment of the invention also provides electronic equipment, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of volume adjustment described above.

The embodiment of the invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the method of volume adjustment described above.

In contrast to the prior art, embodiments of the present invention acquire each audio sample in a training set for training a speech recognition model, wherein the speech recognition model is used for speech recognition, and determine a volume value for each audio sample in the training set. The prior art applies high-pass filtering to the acquired audio samples, but a high-pass filter can filter out part of the effective data in the audio samples, which is unfavorable for training a speech recognition model; the embodiment of the invention processes the acquired audio samples directly, so that the integrity of the samples is maintained to the greatest extent. Further, a volume reference value of the training set is determined according to the volume value of each audio sample, and the volume value of each audio sample is adjusted according to the volume reference value, so that the difference between the adjusted volume value of each audio sample and the volume reference value is within a preset difference range. The volume values of the audio samples in the training set differ, and some may be too large or too small. When adjusting volume values, the prior art performs automatic gain control (AGC) on the high-pass-filtered audio samples, but AGC performs poorly when the volume value of a sample is too large or too small. By adjusting each audio sample against a volume reference value determined from the whole training set, the embodiment of the invention can adjust the volume values of the audio samples appropriately and thereby improve the recognition effect of the speech recognition model.

In addition, the determining the volume reference value of the training set according to the volume value of each audio sample includes: according to the volume value of each audio sample, selecting N audio samples in the training set; the volume values of the N audio samples are in a preset volume value range, and N is a natural number larger than 1; determining a volume average value of the N audio samples; and taking the average volume value as a volume reference value of the training set. In order to reduce adverse effects of an audio sample with an excessively large or excessively small volume value on determining a volume reference value of a training set, the embodiment of the invention can select the audio sample with the volume value within a preset volume value range in the training set, improve the working efficiency to a certain extent, calculate the volume average value of a plurality of selected audio samples as the volume reference value, and enable the determined volume reference value to meet the training requirement.

In addition, selecting N audio samples in the training set according to the volume values of the audio samples includes: sorting the audio samples in the training set according to the volume values of the audio samples; determining the median of the volume values of the sorted audio samples, and taking the audio sample corresponding to the median as a target audio sample; and selecting N audio samples from the audio samples sorted on both sides of the target audio sample, taking the sorting position of the target audio sample as the selection starting point. Selecting audio samples based on the median of the volume values better weakens the adverse effect of samples with too large or too small a volume on the whole training set, so that the volume values of the selected audio samples better meet the training requirement.

In addition, determining a volume average of the selected audio samples includes: determining weight coefficients of the N audio samples; determining a weighted average of the volume values of the N audio samples according to the weight coefficient; and taking the weighted average value as the volume average value. And determining a weight coefficient to perform weighted average on the selected audio samples, so that a volume average value which is more in line with the training requirement of the voice recognition model can be obtained.

In addition, determining a volume value for each audio sample in the training set includes: determining a volume value for each frequency in each of the audio samples in the training set; determining an average value of the volume values of the frequencies according to the volume values of the frequencies in the audio sample; the average value is used as the volume value of the audio sample, and the volume value of each frequency in one audio sample can be comprehensively considered, so that the volume value of the obtained audio sample is more accurate.

In addition, adjusting the volume value of each audio sample according to the volume reference value comprises: and adjusting the volume value of each frequency in each audio sample according to the volume reference value. And adjusting the volume value of each frequency in one audio sample, so that the volume value of the whole audio sample is closer to the volume reference value, and the recognition effect of the voice recognition model is further improved.

Drawings

One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.

Fig. 1 is a flowchart of a method of volume adjustment according to a first embodiment of the present invention;

Fig. 2 is a flowchart of the sub-steps of determining a volume reference value of a training set based on the volume values of the respective audio samples according to the first embodiment of the present invention;

Fig. 3 is a flowchart of a method of volume adjustment according to a second embodiment of the present invention;

Fig. 4 is a flowchart of the sub-steps of determining a volume average value of the selected audio samples according to the second embodiment of the present invention;

Fig. 5 is a flowchart of a method of volume adjustment according to a third embodiment of the present invention;

Fig. 6 is a block diagram of a volume adjusting device according to a fourth embodiment of the present invention;

Fig. 7 is a schematic structural view of an electronic device according to a fifth embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the following detailed description of the embodiments of the present invention will be given with reference to the accompanying drawings. However, those of ordinary skill in the art will understand that in various embodiments of the present invention, numerous technical details have been set forth in order to provide a better understanding of the present application. However, the technical solutions claimed in the present application can be implemented without these technical details and with various changes and modifications based on the following embodiments. The following embodiments are divided for convenience of description, and should not be construed as limiting the specific implementation of the present invention, and the embodiments can be mutually combined and referred to without contradiction.

The first embodiment of the invention relates to a volume adjusting method which is applied to electronic equipment; the electronic device may be a terminal or a server, and the present embodiment and the following embodiments will describe the electronic device by taking the server as an example. Implementation details of the method for adjusting volume according to the present embodiment are specifically described below, and the following is merely provided for convenience of understanding, and is not necessary to implement the present embodiment.

The specific flow of the method for adjusting volume according to this embodiment may be as shown in fig. 1, including:

Step 101, obtaining each audio sample in a training set for training a speech recognition model;

Specifically, the speech recognition model is used for speech recognition. When the server trains the speech recognition model, it can train the model with a machine learning method on a large number of audio samples. The audio samples used for training are collected in advance; that is, the training set of the speech recognition model comes from the actual production and daily life of human society, so its data is rich, real and reliable.

In a specific implementation, the training set for training the speech recognition model may be obtained from the internet, where the content of each audio sample includes at least natural language information, and after the server obtains the training set for training the speech recognition model, preprocessing may be performed on each audio sample, where the preprocessing operation includes, but is not limited to: echo cancellation processing is carried out on each audio sample; noise suppression processing or the like is performed on each audio sample. The obtained audio samples can be purer by performing operations such as echo cancellation and noise suppression, and the recognition effect of the voice recognition model can be improved by training with the audio samples.

In one example, the server obtains a training set for training a speech recognition model applied to the smart home field, and the speech content of the audio samples in the training set may include: "turn off the television", "switch the air conditioner to cooling mode", "set the bathroom water heater temperature to 62 degrees", etc. The server performs echo cancellation on each audio sample using adaptive filtering, and inputs the echo-cancelled audio samples into a recurrent neural network (RNN) for noise suppression. It should be noted that this example takes the smart home field only as an illustration and the application is not limited to it; the audio samples in the training set may be selected in a targeted manner according to the field in which the speech recognition model is applied, so as to improve the recognition effect of the speech recognition model obtained after training.

Step 102, determining the volume value of each audio sample in the training set;

Specifically, after preprocessing each audio sample in the training set, the server may obtain the volume value of each audio sample. When speaking, a person typically varies pitch and loudness, and different pitches and loudness levels may carry different information. The loudness of speech can be understood as the volume value of the acquired audio sample; by obtaining the volume value of each audio sample in the training set and taking human pronunciation habits into account, the information contained in each audio sample can be judged more accurately. Meanwhile, because the volume value is determined for each audio sample as acquired, the integrity of the samples can be maintained to the greatest extent.

In a specific implementation, because the human ear perceives loudness approximately logarithmically, sound level is usually described in decibels (dB) in acoustics. A sound pressure of 20 micropascals (µPa), the minimum audible level for humans, is generally used as the reference value and corresponds to 0dB. The decibel value of a sound can be calculated with the following formula:

$$L_p = 20 \log_{10}\left(\frac{p_{rms}}{p_{ref}}\right)\ \text{dB}$$

where $p_{ref}$ is the reference sound pressure, typically $2 \times 10^{-5}$ Pa; $p_{rms}$ is the root mean square value of the sound pressure of the target audio sample; and $L_p$ is the sound pressure level, i.e. the volume value of the audio sample.
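The standard sound pressure level definition, L_p = 20·log10(p_rms / p_ref), can be sketched in Python as follows. This is a minimal illustration only; the function names `rms` and `sound_pressure_level` are hypothetical and do not come from the patent.

```python
import math

# Reference sound pressure in pascals (20 micropascals), as stated in the text.
P_REF = 2e-5


def rms(pressures):
    """Root mean square of a sequence of sound pressure values (Pa)."""
    if not pressures:
        raise ValueError("at least one pressure value is required")
    return math.sqrt(sum(p * p for p in pressures) / len(pressures))


def sound_pressure_level(p_rms, p_ref=P_REF):
    """Sound pressure level in dB: 20 * log10(p_rms / p_ref)."""
    if p_rms <= 0 or p_ref <= 0:
        raise ValueError("sound pressures must be positive")
    return 20.0 * math.log10(p_rms / p_ref)
```

With this definition, a sound whose RMS pressure equals the reference pressure sits at 0dB, and every tenfold increase in pressure adds 20dB.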

In one example, the server may determine the volume value of an audio sample from its volume values at successive moments. For example, the server acquires a training set for a speech recognition model applied to the smart home field, and the training set includes audio samples of relatively long duration, such as a 4-second audio sample: "set the bathroom water heater temperature to 62 degrees". The server uses the volume detection (volumedetect) filter of FFmpeg to obtain the volume value of the audio sample in each second, namely 55dB, 60dB, 62dB and 59dB, calculates the average of these values, 59dB, and takes it as the volume value of the audio sample.
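The averaging in this example can be reproduced with a short sketch. It assumes the per-second volume values have already been measured; the function name `sample_volume_db` is hypothetical.

```python
from statistics import mean


def sample_volume_db(per_moment_db):
    """Volume value of one audio sample: the plain average of its
    volume values (in dB) measured at successive moments."""
    if not per_moment_db:
        raise ValueError("at least one volume measurement is required")
    return mean(per_moment_db)


# Per-second values from the 4-second example in the text.
volume = sample_volume_db([55, 60, 62, 59])
```

Here `volume` equals 59, matching the 59dB stated in the example.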

Step 103, determining a volume reference value of the training set according to the volume value of each audio sample;

Specifically, after determining the volume value of each audio sample in the training set, the server determines the volume reference value of the training set by considering the volume values of all audio samples in the whole training set. Because different application scenarios call for different volume values in the audio samples used for training, a volume reference value determined from the whole training set is better suited to training the target speech recognition model, which improves the recognition effect of the speech recognition model.

In one example, the server obtains a training set for training a speech recognition model applied to the field of industrial control. Because a factory environment is filled with loud, noisy machinery, workers must raise their voices to hear one another, and a higher volume is likewise needed for human-computer interaction through a speech recognition system. The training set includes an audio sample, "turn off engine No. 3", with a volume value of 81dB. Since 81dB is a volume value suited to the noisy factory environment, the server can take the volume value of this audio sample, 81dB, as the volume reference value of the training set.

In another example, determining the volume reference value of the training set based on the volume values of the audio samples may be accomplished by the sub-steps as shown in fig. 2:

sub-step 1031, selecting N audio samples in the training set according to the volume value of each audio sample;

The volume values of the N audio samples are within a preset volume value range, and N is a natural number larger than 1. The number N of selected audio samples can be set by a developer, and selecting a certain number of audio samples in the training set can improve working efficiency to a certain extent. The preset volume value range can be set by a developer according to the application scenario.

Specifically, the training set acquired by the server contains a large number of audio samples with uneven volume values. If the server determined the volume reference value of the training set from audio samples whose volume values are too large or too small, the volume reference value could deviate considerably. Selecting audio samples within the preset volume value range allows the training requirements of the target speech recognition model to be met well.

In one example, a server obtains a training set for training a speech recognition model, the speech recognition model being applied to the field of industrial control, the training set comprising 2000 audio samples, a preset volume value range being set to 70dB to 110dB based on a noisy environment of a factory, 900 audio samples in the training set having volume values of 70dB to 110dB, and the server selecting 600 audio samples therein.

In another example, the server obtains a training set for training a speech recognition model, the speech recognition model is applied to the intelligent medical field, the training set comprises 1500 audio samples, a preset volume value range is set to be 30dB to 70dB based on a quiet environment of a hospital ward, 1400 audio samples with volume values of 30dB to 70dB in the training set are used, and the server selects 1300 audio samples.

A sub-step 1032 of determining a volume average value of the N audio samples;

Specifically, after selecting the required audio samples, the server calculates the volume average value from the volume values of the selected audio samples.

In one example, the server selects a total of 300 audio samples with a volume value of 70dB to 110dB in the training set, and calculates a volume average value of 88dB from the volume values of the 300 audio samples.

Sub-step 1033 takes the volume average value as the volume reference value of the training set.

Specifically, after calculating the volume average value of the selected audio samples, the server uses it as the volume reference value of the training set. This weakens the deviation that audio samples with too large or too small a volume value would otherwise introduce, and yields a volume reference value suitable for training the speech recognition model.

In one example, the server selects 1350 total audio samples with a sound volume value between 30dB and 70dB in a training set for training a speech recognition model, the speech recognition model is applied to the intelligent medical field, calculates the average value of the sound volumes of the 1350 total audio samples as 52dB, and uses the 52dB as a sound volume reference value of the training set, wherein the 52dB is the sound volume value suitable for training the speech recognition model applied to the intelligent medical field.
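Sub-steps 1031 to 1033 can be sketched as follows. The text does not say which N of the in-range samples are chosen, so this sketch simply takes the first N in input order; that choice, and the function name `volume_reference`, are assumptions made for illustration.

```python
from statistics import mean


def volume_reference(volumes_db, low_db, high_db, n):
    """Volume reference value of a training set.

    Keeps only the volume values inside the preset range [low_db, high_db],
    selects N of them (assumption: the first N in input order), and
    returns their average.
    """
    if n <= 1:
        raise ValueError("N must be a natural number larger than 1")
    in_range = [v for v in volumes_db if low_db <= v <= high_db]
    if len(in_range) < n:
        raise ValueError("fewer than N samples fall inside the preset range")
    selected = in_range[:n]
    return mean(selected)


# Toy training set with a 30dB to 70dB preset range, as in the ward example.
reference = volume_reference([20, 50, 54, 52, 95, 48], 30, 70, 3)
```

In this toy input, the 20dB and 95dB samples fall outside the range, and the three selected samples (50, 54 and 52dB) average to 52dB.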

And step 104, adjusting the volume value of each audio sample according to the volume reference value.

The difference value between the adjusted sound volume value of each audio sample and the sound volume reference value is in a preset difference range, and the preset difference range can be set by a developer according to actual needs. In consideration of the fact that a deviation may occur when the volume value of each audio sample is actually adjusted, it is necessary to set an acceptable error range for the difference range.

Specifically, after determining the volume reference value of the training set, the server adjusts the volume value of each audio sample according to the volume reference value, which can be understood as the optimal value of the volume in the training set. The volume value of the audio sample higher than the volume reference value is reduced, and the volume value of the audio sample lower than the volume reference value is increased, so that the volume value of each audio sample in the training set is close to the optimal value, the volume value of each audio sample in the training set is properly adjusted, and the accuracy of the voice recognition model is improved.

In one example, the server may directly modify the data in the audio sample file to adjust the volume value of the audio sample. For example, the server uses FFmpeg to parse the relevant data in the audio sample file and applies an adjustment instruction, which may be a code segment, to the data representing the volume value; this directly modifies the volume in decibels of the audio file and adjusts its volume value to, or close to, the volume reference value.

In another example, the server may use audio visualization software to adjust the volume value of the audio sample. For example, the server loads the audio file into audio/video editing software such as Premiere and adjusts the volume value of the audio file by changing the audio "amplitude" through the visual interface that the software provides. The volume reference value can be shown as a horizontal line in the visual interface; through operations such as peak clipping and valley filling, volume values above the line are lowered and volume values below the line are raised, so that the volume value of the audio file is adjusted to, or close to, the volume reference value.
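One way to picture the adjustment in step 104 is to scale the waveform by a linear gain derived from the decibel difference to the reference. The sketch below assumes the audio is available as a list of float amplitudes and measures volume as an RMS level in dB relative to an amplitude of 1.0, which is a stand-in for the volume value in the text. The names `rms_db`, `adjust_to_reference` and `within_preset_range` are hypothetical.

```python
import math


def rms_db(samples):
    """RMS level of a waveform in dB relative to an amplitude of 1.0."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        raise ValueError("a silent waveform has no finite level")
    return 20.0 * math.log10(rms)


def adjust_to_reference(samples, reference_db):
    """Scale the waveform so that its RMS level moves to reference_db.

    A level change of g dB corresponds to multiplying every amplitude
    by 10 ** (g / 20).
    """
    gain_db = reference_db - rms_db(samples)
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]


def within_preset_range(volume_db, reference_db, max_diff_db):
    """Check that the adjusted volume lies within the preset difference range."""
    return abs(volume_db - reference_db) <= max_diff_db


quiet = [0.1, -0.1, 0.1, -0.1]  # RMS level of about -20 dB
adjusted = adjust_to_reference(quiet, -14.0)
```

Samples louder than the reference receive a negative gain and are attenuated, while quieter samples are amplified, which mirrors the lowering and raising described above.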

In contrast to the prior art, the first embodiment of the present invention obtains each audio sample in a training set for training a speech recognition model, wherein the speech recognition model is used for speech recognition, and determines a volume value for each audio sample in the training set. The prior art applies high-pass filtering to the acquired audio samples, but a high-pass filter can filter out part of the effective data in the audio samples, which is unfavorable for training a speech recognition model; the embodiment of the invention processes the acquired audio samples directly, so that the integrity of the samples is maintained to the greatest extent. Further, a volume reference value of the training set is determined according to the volume value of each audio sample, and the volume value of each audio sample is adjusted according to the volume reference value, so that the difference between the adjusted volume value of each audio sample and the volume reference value is within a preset difference range. The volume values of the audio samples in the training set differ, and some may be too large or too small. When adjusting volume values, the prior art performs AGC processing on the high-pass-filtered audio samples, but AGC performs poorly when the volume of a sample is too large or too small. By adjusting each audio sample against a volume reference value determined from the whole training set, the embodiment of the invention can adjust the volume values of the audio samples appropriately and thereby improve the recognition effect of the speech recognition model.

A second embodiment of the invention relates to a method of volume adjustment. The following details of implementation of the method for adjusting volume according to the present embodiment are merely provided for the convenience of understanding, and are not necessary for implementing the present embodiment, and fig. 3 is a schematic diagram of the method for adjusting volume according to the second embodiment, including:

Step 201, obtaining each audio sample in a training set for training a speech recognition model;

Step 202, determining the volume value of each audio sample in the training set;

Steps 201 to 202 have already been described in the first embodiment and are not repeated here.

Step 203, sorting the audio samples in the training set according to the volume values of the audio samples;

Specifically, after determining the volume value of each audio sample in the training set, the server sorts the audio samples by volume value, either in descending or in ascending order, and determines the position of each audio sample in the sorted sequence.

In one example, the server sorts the 1000 audio samples in the acquired training set by volume value in ascending order and stores the position of each sorted audio sample in the sequence. For example, the audio sample "turn off the bedside lamp" has a volume of 55dB and ranks 478th among the 1000 audio samples.

Step 204, determining the median of the sound volume values of the sequenced audio samples, and taking the audio sample corresponding to the median as a target audio sample;

Specifically, the acquired training set is used to train a speech recognition model, and during speech recognition people generally interact at the volume of normal conversation, i.e., about 60dB. The training set obtained by the server contains a large number of audio samples collected mostly from actual production and daily life; when the training set contains thousands of audio samples, the median of its volume values basically falls within the volume range of normal conversation. The server therefore takes the audio sample corresponding to the median as the target audio sample.

Step 205, selecting N audio samples from each audio sample ordered on both sides of the target audio sample by taking the ordering position of the target audio sample as the selection starting point;

The number N of selected audio samples can be set by a developer according to actual needs. Selecting N audio samples from both sides of the target audio sample, with the sorting position of the target audio sample as the starting point, can improve working efficiency to a certain extent and keeps the volume of the selected audio samples close to the normal speaking range. That is, audio samples are selected outward on both sides from the sorting position of the audio sample corresponding to the median, and the selection stops when the number of selected audio samples reaches a preset proportion of the training set. The preset proportion is the ratio of N to the total number of samples in the training set.

In one example, the ordered training set contains 1000 audio samples, the server determines that the median of the sound volume values of the training set is 63dB, and the server selects a total of 800 audio samples from both sides of the median starting from the audio sample corresponding to 63 dB.
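Steps 203 to 205 can be sketched as a sort followed by an outward walk from the median position. The text leaves several details open, so the following choices are assumptions: the target audio sample itself counts toward N, samples are taken alternately from the left and right, and for an even-sized set the lower of the two middle positions is used. The function name `select_around_median` is hypothetical.

```python
def select_around_median(volumes_db, n):
    """Select N volume values centred on the median of the sorted values."""
    ordered = sorted(volumes_db)  # step 203: ascending sort
    if not 1 < n <= len(ordered):
        raise ValueError("N must be larger than 1 and at most the set size")

    mid = (len(ordered) - 1) // 2  # step 204: position of the target sample
    selected = [ordered[mid]]
    left, right = mid - 1, mid + 1

    # Step 205: walk outward on both sides until N samples are selected.
    while len(selected) < n:
        if left >= 0:
            selected.append(ordered[left])
            left -= 1
        if len(selected) < n and right < len(ordered):
            selected.append(ordered[right])
            right += 1
    return sorted(selected)


chosen = select_around_median([90, 40, 63, 61, 65, 20, 70], 3)
```

For this toy input the median is 63dB and the three selected values are 61, 63 and 65dB, so the outliers at 20dB and 90dB never enter the selection.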

Step 206, determining the average volume value of the selected N audio samples;

In one example, determining the volume average value of the selected N audio samples is implemented by the sub-steps shown in fig. 4:

sub-step 2061, determining the weighting coefficients of the selected N audio samples;

In a specific implementation, the server may determine the weight coefficients of the selected N audio samples according to their positions in the sorted sequence and actual needs. By setting different weight coefficients, the finally obtained volume reference value can be made to meet the actual needs of training.

In one example, the server may determine the weight coefficients of the N audio samples based on the sorting positions of the N audio samples relative to the target audio sample; the weight coefficients of the N audio samples decrease toward both sides, with the weight coefficient of the target audio sample as the initial value. For example: the server determines that the median of the volume values of the training set is 63 dB, takes the audio sample corresponding to 63 dB as the target audio sample, and selects a total of 100 audio samples on both sides. The server sets the weight coefficient of the target audio sample, i.e., the audio sample corresponding to the median of the volume values, to 0.3, decreases the weight coefficients from the target audio sample toward both sides, and ensures that the sum of all the weight coefficients equals 1.

Sub-step 2062, determining a weighted average of the volume values of the N audio samples according to the weight coefficients, taking the weighted average as the volume average;

Specifically, after determining the weight coefficient of each of the selected N audio samples, the server computes a weighted average of the volume values of the N audio samples using those weight coefficients and takes the result as the volume average value. Using a weighted average allows the resulting volume reference value to better meet actual training requirements.
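Sub-steps 2061 and 2062 can be sketched as follows. The geometric decay, the `decay` parameter, and the function names are hypothetical choices for illustration; the text only requires that the weights decrease toward both sides of the target audio sample and sum to 1. The sketch also assumes the target audio sample sits at the middle index of the rank-ordered selection.

```python
def centered_weights(n, decay=0.8):
    """Weights for n rank-ordered samples, largest at the middle index.

    Assumption: weights fall off geometrically with rank distance from
    the target sample and are normalised so that they sum to 1.
    """
    center = n // 2
    raw = [decay ** abs(i - center) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_volume_average(volumes, weights):
    """Weighted average of volume values; weights are assumed to sum to 1."""
    if len(volumes) != len(weights):
        raise ValueError("volumes and weights must have the same length")
    return sum(v * w for v, w in zip(volumes, weights))


weights = centered_weights(5)
reference = weighted_volume_average([58, 60, 63, 65, 67], weights)
```

A fixed centre weight such as the 0.3 in the example above could be obtained by tuning `decay` for the chosen N; that tuning is not shown here.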

Step 207, taking the average value of the volume as a volume reference value of the training set;

step 208, adjusting the volume value of each audio sample according to the volume reference value;

Steps 207 to 208 have been described in the first embodiment and are not repeated here.

Compared with the prior art, in this embodiment, selecting audio samples in the training set according to the volume value of each audio sample includes: sorting the audio samples in the training set according to their volume values; determining the median of the volume values of the sorted audio samples, and taking the audio sample corresponding to the median as a target audio sample; and selecting N audio samples from the audio samples sorted on both sides of the target audio sample, taking the sorting position of the target audio sample as the selection starting point; wherein the number of selected audio samples is a preset proportion of the total number of audio samples in the training set. Selecting audio samples based on the median of the volume values better weakens the adverse effect that samples with excessively high or excessively low volume have on the training set as a whole, so that the volume values of the selected audio samples better match the training requirement. Determining the volume average value of the selected audio samples includes: determining weight coefficients of the N audio samples; determining a weighted average of the volume values of the N audio samples according to the weight coefficients; and taking the weighted average as the volume average value. Weighting the selected audio samples in this way yields a volume average value that better matches the training requirement of the speech recognition model.

A third embodiment of the invention relates to a method of volume adjustment. The implementation details of the method of this embodiment are described below; they are provided only to aid understanding and are not required for implementing this embodiment. Fig. 5 is a schematic diagram of the method of volume adjustment according to the third embodiment, which includes:

step 301, obtaining each audio sample in a training set for training a speech recognition model;

step 301 is described in the first embodiment, and will not be described herein.

Step 302, determining the volume value of each frequency in each audio sample in the training set;

Specifically, the frequency of the human voice ranges from about 85 Hz to 1100 Hz, so a person's speech may contain different frequencies. Correspondingly, a single audio sample may also contain different frequencies, and sounds at different frequencies may in turn correspond to different volume values. The server can therefore determine the volume value of each frequency in each audio sample, further ensuring the integrity of the sample.

In one example, the server uses pulse code modulation (PCM) to determine the volume value of each frequency in the audio sample. For example: the server converts the audio sample into PCM data between -1 and 1, performs a fast Fourier transform (FFT) on the PCM data to obtain a spectrogram of the audio sample, and, from the ordinate of the spectrogram, i.e., the energy of the sound wave, calculates the volume value of each frequency of the audio sample using the formula $10\log_{10}(a^{2}+b^{2})$, where a represents the real part and b represents the imaginary part of the transform result at that frequency.

In another example, the server samples the audio sample and determines the volume value of each frequency in the audio sample by means of an equal-ratio mapping. For example: the server samples the audio sample, records the energy value of each sampling point, and maps the energy value of each sampling point proportionally to a value between 1 and 100. In general, the human voice is distributed in a lower energy range, with quantized values falling approximately in the range of 1 to 20, so the quantized value is amplified 5 times. For amplified values smaller than 100, the volume value of each frequency of the audio sample is calculated using the formula 10 log(10 × amplified quantized value). For values greater than 100, the volume value is directly assigned 100 dB.
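The FFT-based calculation can be sketched as follows. To stay dependency-free, this sketch substitutes a direct discrete Fourier transform for the FFT (same coefficients, slower). The function name `per_frequency_volume_db` and the `floor` guard against taking the logarithm of zero are assumptions for the example, and the resulting values are relative decibels rather than calibrated sound pressure levels.

```python
import cmath
import math


def per_frequency_volume_db(pcm, floor=1e-12):
    """Volume in dB for each non-negative frequency bin of one PCM frame.

    pcm: sequence of floats, assumed already normalised to [-1, 1].
    Uses a direct DFT instead of an FFT; each bin k yields a coefficient
    a + bj, and the volume is 10 * log10(a^2 + b^2).
    """
    n = len(pcm)
    volumes = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(pcm)
        )
        a, b = coeff.real, coeff.imag
        # floor avoids log10(0) for bins that carry no energy
        volumes.append(10 * math.log10(max(a * a + b * b, floor)))
    return volumes


# A pure tone that completes 4 cycles in a 32-point frame peaks at bin 4.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
volumes_db = per_frequency_volume_db(frame)
```

For real recordings, a library FFT over short windowed frames would normally replace the direct DFT loop.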

Step 303, determining an average value of the sound volume values of the frequencies according to the sound volume values of the frequencies in the audio sample, and taking the average value as the sound volume value of the audio sample;

specifically, after determining the volume value of each frequency of the audio sample, the server may calculate an average value, and use the average value as the volume value of the audio sample. The sound volume value of each frequency in one audio sample is comprehensively considered, so that the determined sound volume value of the audio sample is more accurate.

Step 304, determining a volume reference value of the training set according to the volume value of each audio sample;

step 304 is described in the first embodiment, and will not be described herein.

Step 305, adjusting the volume value of each frequency in each audio sample according to the volume reference value.

Specifically, when the server adjusts the volume value of each audio sample according to the volume reference value, the volume value of each frequency of the audio sample may be adjusted. The volume value of the whole audio sample can be close to the volume reference value, and the recognition effect of the voice recognition model is further improved.
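One way to picture step 305 is a uniform shift of all per-frequency volume values. This is only a sketch under stated assumptions: the text does not specify how each frequency is adjusted, so the uniform decibel offset, the function name `adjust_to_reference`, and the `max_diff_db` parameter standing in for the preset difference range are all hypothetical.

```python
def adjust_to_reference(freq_volumes_db, reference_db, max_diff_db=1.0):
    """Shift every per-frequency volume so the sample mean meets the reference.

    Assumption: the sample's volume value is the mean of its per-frequency
    values, and samples already within max_diff_db are left unchanged.
    """
    sample_volume = sum(freq_volumes_db) / len(freq_volumes_db)
    diff = sample_volume - reference_db
    if abs(diff) <= max_diff_db:
        return list(freq_volumes_db)
    # Apply the same dB offset to every frequency of the sample.
    return [v - diff for v in freq_volumes_db]


quiet = adjust_to_reference([50.0, 54.0, 58.0], reference_db=60.0)
close = adjust_to_reference([59.5, 60.5], reference_db=60.0)
```

A uniform offset keeps the relative balance between frequencies intact, which is why it is used here; other per-frequency strategies are equally compatible with the text.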

Compared with the prior art, in this embodiment, determining the volume value of each audio sample in the training set includes: determining a volume value for each frequency in each of the audio samples in the training set; determining an average value of the volume values of the frequencies according to the volume values of the frequencies in the audio sample; the average value is used as the volume value of the audio sample, and the volume value of each frequency in one audio sample can be comprehensively considered, so that the volume value of the obtained audio sample is more accurate. Adjusting the volume value of each audio sample according to the volume reference value, including: and adjusting the volume value of each frequency in each audio sample according to the volume reference value. And adjusting the volume value of each frequency in one audio sample, so that the volume value of the whole audio sample is closer to the volume reference value, and the recognition effect of the voice recognition model is further improved.

The division of the above methods into steps is only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, such variations fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without altering the core design of the algorithm and flow, also falls within the protection scope of this patent.

A fourth embodiment of the present invention relates to an apparatus for volume adjustment. The details of the apparatus of this embodiment are described below; they are provided only to aid understanding and are not required for implementing this embodiment. Fig. 6 is a schematic diagram of the apparatus for volume adjustment according to the fourth embodiment, which includes:

an acquisition module 401, configured to acquire each audio sample in a training set for training a speech recognition model; wherein the speech recognition model is used for speech recognition;

a calculation module 402 for determining a volume value for each audio sample in the training set;

a statistics module 403, configured to determine a volume reference value of the training set according to the volume value of each audio sample;

the adjusting module 404 is configured to adjust the volume value of each audio sample according to the volume reference value; the difference value between the volume value of each adjusted audio sample and the volume reference value is within a preset difference value range.
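The cooperation of modules 401 to 404 can be sketched as a small pipeline. The class name `VolumeAdjustmentApparatus`, the callable-based wiring, and the toy data (each sample represented as a list of per-frequency dB values, with the median used as the reference) are assumptions made purely for illustration.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, List

Sample = List[float]  # hypothetical: one sample as per-frequency dB values


@dataclass
class VolumeAdjustmentApparatus:
    acquire: Callable[[], List[Sample]]                # acquisition module 401
    compute_volume: Callable[[Sample], float]          # calculation module 402
    compute_reference: Callable[[List[float]], float]  # statistics module 403
    adjust: Callable[[Sample, float, float], Sample]   # adjusting module 404

    def run(self) -> List[Sample]:
        samples = self.acquire()
        volumes = [self.compute_volume(s) for s in samples]
        reference = self.compute_reference(volumes)
        return [self.adjust(s, v, reference) for s, v in zip(samples, volumes)]


def shift(sample: Sample, volume: float, reference: float) -> Sample:
    """Toy adjuster: move the sample's mean volume onto the reference."""
    return [v + (reference - volume) for v in sample]


apparatus = VolumeAdjustmentApparatus(
    acquire=lambda: [[40.0, 44.0], [60.0, 62.0], [80.0, 90.0]],
    compute_volume=statistics.fmean,
    compute_reference=statistics.median,
    adjust=shift,
)
adjusted = apparatus.run()
```

Passing the modules in as callables mirrors the statement that each module is a logic module that may map onto one or several physical units.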

It should be noted that this embodiment is an apparatus embodiment corresponding to the first to third embodiments and can be implemented in cooperation with them. The related technical details and technical effects mentioned in the first to third embodiments remain valid in this embodiment and are not repeated here. Accordingly, the related technical details mentioned in this embodiment can also be applied to the first to third embodiments.

It should be noted that each module in this embodiment is a logic module. In practical applications, one logic unit may be one physical unit, a part of one physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, units less closely related to solving the technical problem addressed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.

A fifth embodiment of the present invention relates to an electronic device, as shown in fig. 7, including: at least one processor 501; and a memory 502 communicatively coupled to the at least one processor 501; wherein the memory 502 stores instructions executable by the at least one processor 501 to enable the at least one processor 501 to perform the method of volume adjustment in the above embodiments.

Where the memory and the processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over the wireless medium via the antenna, which further receives the data and transmits the data to the processor.

The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory may be used to store data used by the processor in performing operations.

A sixth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.

That is, those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be implemented by a program stored in a storage medium, where the program includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments described herein. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A method of volume adjustment, comprising:

acquiring each audio sample in a training set for training a speech recognition model; wherein the speech recognition model is used for speech recognition;

determining a volume value for each audio sample in the training set;

determining a volume reference value of the training set according to the volume value of each audio sample;

according to the volume reference value, adjusting the volume value of each audio sample; the difference value between the adjusted sound volume value of each audio sample and the sound volume reference value is within a preset difference value range.

2. The method of volume adjustment according to claim 1, wherein the determining the volume reference value of the training set according to the volume value of each audio sample comprises:

according to the volume value of each audio sample, selecting N audio samples in the training set; the volume values of the N audio samples are in a preset volume value range, and N is a natural number larger than 1;

determining a volume average value of the N audio samples;

and taking the average volume value as a volume reference value of the training set.

3. The method for adjusting the volume according to claim 2, wherein selecting N audio samples in the training set according to the volume value of each audio sample comprises:

sorting the audio samples in the training set according to the sound volume values of the audio samples;

determining the median of the sound volume values of the sequenced audio samples, and taking the audio sample corresponding to the median as a target audio sample;

and selecting N audio samples from the audio samples sequenced on two sides of the target audio sample by taking the sequencing position of the target audio sample as a selection starting point.

4. A method of volume adjustment according to claim 3, characterized in that the determining the volume average of the N audio samples comprises:

determining weight coefficients of the N audio samples;

determining a weighted average of the volume values of the N audio samples according to the weight coefficient;

and taking the weighted average value as the volume average value.

5. The method of volume adjustment according to claim 4, wherein the determining the weighting coefficients of the N audio samples comprises:

determining weight coefficients of the N audio samples according to the sequencing positions of the N audio samples relative to the target audio sample; the weight coefficients of the N audio samples are decreased to two sides by taking the weight coefficient of the target audio sample as a decreasing initial value.

6. The method of volume adjustment according to claim 1, wherein said determining the volume value of each audio sample in the training set comprises:

determining a volume value for each frequency in each of the audio samples in the training set;

determining an average value of the volume values of the frequencies according to the volume values of the frequencies in the audio sample;

the average value is taken as a volume value of the audio sample.

7. The method of volume adjustment according to claim 6, wherein adjusting the volume value of each audio sample according to the volume reference value comprises:

and adjusting the volume value of each frequency in each audio sample according to the volume reference value.

8. A volume adjustment implementation apparatus, comprising:

the acquisition module is used for acquiring each audio sample in a training set for training the voice recognition model; wherein the speech recognition model is used for speech recognition;

a computing module for determining a volume value for each audio sample in the training set;

the statistics module is used for determining a volume reference value of the training set according to the volume values of the audio samples;

the adjusting module is used for adjusting the volume value of each audio sample according to the volume reference value; the difference value between the adjusted sound volume value of each audio sample and the sound volume reference value is within a preset difference value range.

9. An electronic device, comprising:

at least one processor; and,

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of volume adjustment according to any one of claims 1 to 7.

10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of volume adjustment according to any one of claims 1 to 7.