WO2022100692A1 - Human voice audio recording method and apparatus - Google Patents

Human voice audio recording method and apparatus

Info

Publication number
WO2022100692A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
human voice
vocal
duration
sub
Prior art date
Application number
PCT/CN2021/130305
Other languages
French (fr)
Chinese (zh)
Other versions
WO2022100692A9 (en)
Inventor
贾杨
夏龙
吴凡
高强
郭常圳
Original Assignee
北京猿力未来科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京猿力未来科技有限公司
Publication of WO2022100692A1
Publication of WO2022100692A9

Classifications

    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/84 Detection of presence or absence of voice signals, for discriminating voice from noise
    • G10L25/87 Detection of discrete points within a voice signal
    • G09B5/04 Electrically-operated educational appliances with audible presentation of the material to be studied
    • G11B20/10527 Digital recording or reproducing: audio or video recording; data buffering arrangements
    • G11B2020/10564 Audio or video recording specifically adapted for audio data, wherein the frequency, the amplitude, or other characteristics of the audio signal is taken into account: frequency

Definitions

  • The present application relates to the technical field of data processing, and in particular to a human voice audio recording method and apparatus.
  • The present application provides a human voice audio recording method, comprising: obtaining the current original audio; obtaining the audio clips identified as human voice in the current original audio; splicing the human voice audio clips in chronological order to obtain spliced audio; and storing or outputting the spliced audio.
  • The above method further comprises: obtaining a first mean of the durations of the non-human-voice audio clips in the current original audio; using that mean, obtaining a first variance of those durations; taking the difference between the first mean and the first variance as a first threshold; and, if a non-human-voice audio clip whose duration is less than the first threshold lies between human voice audio clips, splicing that non-human-voice clip together with the human voice audio clips.
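  • For illustration only (the patent provides no code, and the function and variable names below are hypothetical), a minimal Python sketch of this first-threshold merge might look as follows:

```python
# For illustration only (the patent gives no code; names are hypothetical):
# merge human voice segments, given as (start, end) pairs in seconds sorted by
# time, across non-voice gaps shorter than T1 = mean(gaps) - variance(gaps).
import statistics

def merge_with_first_threshold(voice_segments):
    gaps = [b[0] - a[1] for a, b in zip(voice_segments, voice_segments[1:])]
    if not gaps:
        return list(voice_segments)
    t1 = statistics.mean(gaps) - statistics.pvariance(gaps)  # first threshold
    merged = [list(voice_segments[0])]
    for start, end in voice_segments[1:]:
        if start - merged[-1][1] < t1:   # short gap: keep it, splice the clips
            merged[-1][1] = end
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]

# Gaps of 0.2 s, 0.3 s and 0.4 s give T1 of about 0.293 s, so only the first
# gap is merged. Note T1 can be non-positive when the gap variance is large,
# in which case nothing is merged.
print(merge_with_first_threshold([(0.0, 1.0), (1.2, 2.0), (2.3, 3.0), (3.4, 4.0)]))
```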
  • In parallel, the above method further comprises: according to the user identifier corresponding to the original audio, obtaining the total duration of the non-human-voice audio clips of at least one historical original audio of that user, together with the variance sum of those clips; using the durations of the current original audio's non-human-voice clips and the total duration of the historical non-human-voice clips, obtaining a second mean of the non-human-voice clip durations; using the variance of the current original audio's non-human-voice clips and the variance sum of the historical non-human-voice clips, obtaining a second variance of the non-human-voice clips; taking the difference between the second mean and the second variance as a second threshold; and, if in the current original audio a non-human-voice audio clip whose duration is less than the second threshold lies between human voice audio clips, splicing that clip together with the human voice audio clips.
  • The method further comprises: obtaining and saving the sum of the current original audio's non-human-voice clip durations and the total non-human-voice clip duration of the historical original audio; and obtaining and saving the sum of the variance of the current original audio's non-human-voice clips and the variance sum of the historical original audio's non-human-voice clips.
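  • For illustration, the saved running sums can drive the second threshold roughly as sketched below; the patent's exact update formulas are given as figures that are not reproduced here, so this Python sketch is one plausible reading, not the authoritative computation:

```python
# For illustration only: the pooling below (sums of durations and of squared
# deviations, divided by the pooled clip count) is an assumed reading of the
# claim, since the patent's formulas are given only as figures.
def second_threshold(saved, current_gaps):
    """saved: per-user dict with S_u (duration sum), S_sigma_u (variance sum),
    and N_u (clip count) over the user's historical original audio."""
    assert current_gaps, "expects at least one non-voice clip in the new audio"
    n = len(current_gaps)
    s_new = saved["S_u"] + sum(current_gaps)          # S_u^new
    n_new = saved["N_u"] + n                          # N_u^new
    m_new = s_new / n_new                             # second mean m_u^new
    mean_cur = sum(current_gaps) / n
    var_cur = sum((l - mean_cur) ** 2 for l in current_gaps)
    s_sigma_new = saved["S_sigma_u"] + var_cur        # S_sigma_u^new
    sigma_new = s_sigma_new / n_new                   # second variance
    t2 = m_new - sigma_new                            # T2 = mean - variance
    saved.update(S_u=s_new, S_sigma_u=s_sigma_new, N_u=n_new)  # persist sums
    return t2

print(second_threshold({"S_u": 10.0, "S_sigma_u": 2.0, "N_u": 20}, [0.3, 0.5, 1.1]))
```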
  • In parallel, the above method further comprises: obtaining a first mean of the durations of the non-human-voice audio clips in the current original audio; using that mean, obtaining a first variance of those durations; taking the difference between the first mean and the first variance as a first threshold; taking the user identifier corresponding to the original audio; obtaining, as a third mean, the mean duration of the non-human-voice clips in at least one historical original audio of that user; obtaining, as a third variance, the variance of the non-human-voice clips in the at least one historical original audio; taking the difference between the third mean and the third variance as a third threshold; adjusting the first threshold with a preset weight of the third threshold to obtain a fourth threshold; and, if in the current original audio a non-human-voice audio clip whose duration is less than the fourth threshold lies between human voice audio clips, splicing that clip together with the human voice audio clips.
  • The method further comprises: obtaining and saving the mean duration of the non-human-voice audio clips of the current original audio and of the historical original audio; and obtaining and saving the variance of the non-human-voice audio clips of the current original audio and of the historical original audio.
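  • The claim does not state how the preset weight combines the two thresholds; one plausible reading, shown below as a hypothetical Python helper, is a weighted blend of T_3 into T_1:

```python
# The claim says only that the first threshold is "adjusted with the preset
# weight of the third threshold"; the convex blend below is an assumed,
# hypothetical reading, not the patent's stated formula.
def fourth_threshold(t1: float, t3: float, w: float = 0.5) -> float:
    return w * t3 + (1.0 - w) * t1   # T4 blends user history (T3) into T1

print(fourth_threshold(t1=0.4, t3=0.6, w=0.3))   # 0.46
```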
  • In parallel, the above method further comprises: if, in the current original audio, a non-human-voice audio clip whose duration is less than a fifth threshold lies between adjacent human voice audio clips, splicing that clip together with the human voice audio clips.
  • In the above embodiments, obtaining the audio clips identified as human voice in the original audio specifically comprises: segmenting the original audio according to a preset method to obtain multiple sub-audios; computing the Mel-frequency cepstral coefficient (MFCC) feature sequence of each sub-audio; using a neural network to obtain, from the MFCC feature sequence, the probability that each sub-audio is human voice; obtaining the sub-audios whose human voice probability is greater than a decision threshold; obtaining the adjacent sub-audios in the original audio whose human voice probability is greater than the decision threshold; and obtaining the audio clip formed by the determined time points of those adjacent sub-audios.
  • In the above embodiments, segmenting the original audio according to a preset method to obtain multiple sub-audios comprises: obtaining the original audio; adding null data of a first duration before the head of the original audio and null data of a second duration after its tail to obtain expanded audio; and, taking a third duration equal to the sum of the first and second durations as the segmentation window, sliding the window from the head of the expanded audio with a first step size to obtain multiple sub-audios in sequence.
  • In the above embodiments, after the human voice probability of each sub-audio is obtained, the method further comprises: obtaining the array of human voice probabilities of all sub-audios of the original audio; and filtering the probability values in the array with a window of a first size to obtain filtered human voice probabilities.
  • In the above embodiments, before the sub-audios whose human voice probability is greater than the decision threshold are obtained, the method further comprises: obtaining the audio energy value at the determined time point of each sub-audio in the original audio, and setting a human voice probability adjustment factor according to that energy value, as follows: if the energy value is greater than an upper energy limit, the sub-audio's adjustment factor is set to 1; if the energy value is less than a lower energy limit, the factor is set to 0; and if the energy value is neither greater than the upper limit nor less than the lower limit, the factor is normalized to between 0 and 1 according to the energy value. Multiplying each sub-audio's adjustment factor by its human voice probability gives the corrected sub-audio human voice probability.
  • The application also provides an audio recognition apparatus, characterized in that it comprises: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
  • This application uses human voice recognition to extract the speech portions of the original audio, stores only the human voice audio clips, and deletes the non-human-voice clips; this not only removes noise but, because the non-human-voice clips are deleted, also saves storage space.
  • Further, based on the characteristics of human speech, such as its natural continuity and the brief pauses and breaths that children in particular often produce while answering questions aloud, the application provides several algorithms implementing tolerant merging, which retain the short non-voice stretches between human voice clips and so preserve the continuity of the voice in the recording.
  • FIG. 1 is a schematic flowchart of a human voice audio recording method according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of original audio segmentation preprocessing according to an embodiment of the present application.
  • FIG. 3 is the human voice probability distribution of the audio before the moving average according to an embodiment of the present application.
  • FIG. 4 is the human voice probability distribution of the audio after the moving average according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the tolerant merging process.
  • Although the terms "first", "second", "third", etc. may be used in this application to describe various information, such information should not be limited by these terms; the terms only distinguish information of the same type from one another.
  • For example, without departing from the scope of the present application, first information may also be referred to as second information and, similarly, second information may also be referred to as first information.
  • Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
  • In this application, "plurality" means two or more, unless otherwise expressly and specifically defined.
  • The present application provides a human voice audio recording method.
  • The original audio is identified and the human voice audio is extracted from it for storage or output; since large non-human-voice portions are removed from the stored audio file, storage resources are saved compared with the prior art.
  • A specific embodiment of the present application will be described with reference to FIG. 1.
  • Step 11: obtain the original audio.
  • The original audio may contain both the desired human voice and non-human sounds such as background sound and noise.
  • Step 12: obtain the audio clips identified as human voice in the current original audio.
  • The following provides one method for identifying the human voice in the current original audio and then extracting the human voice audio clips; other implementations that achieve the same function are not excluded.
  • Step 121: identify the human voice portions of the original audio.
  • In one embodiment, the original audio is finely segmented into smaller sub-audios: a span of empty audio is added at each end of the original audio to obtain the expanded audio, and the expanded audio is segmented into sub-audios using a segmentation window, the empty-audio duration being kept at a 1:2 ratio to the window length.
  • As shown in FIG. 2, in order to obtain accurate statistics of speech-onset time points, the sub-audio needs a fine segmentation granularity.
  • In the figure, a is the original audio array; null data of equal duration, i.e. 480 milliseconds (ms) of zeros, is added at the head and at the tail of the original audio a to obtain the expanded audio b.
  • The number of zeros in the 480 ms is determined by the audio sampling frequency; that is, the padded data has the same sampling rate as the audio.
  • The 480 ms duration of the null data added before the head and after the tail of the original audio is only exemplary; other durations are not excluded.
  • Step 122: taking twice the above duration as the segmentation window, obtain multiple sub-audios in sequence from the head of the expanded audio with a first step size.
  • In this embodiment the segmentation window is 960 ms, i.e. twice the 480 ms.
  • The step size is 10 ms, so the minimum segmentation granularity of the sub-audio is 10 ms; other granularities are not excluded.
  • With this segmentation, adjacent sub-audios are offset by 10 ms and each sub-audio lasts 960 ms.
  • Suppose a given sub-audio starts at time t_i and ends at time t_i + 0.96 s of the original audio. The human voice probability computed for that sub-audio in the subsequent steps is then taken as the human voice probability of the audio at time t_i + 0.48 s. Thus the probability computed from the first sub-audio serves as the human voice probability at the very start of the original audio, and that computed from the last sub-audio as the probability at its end.
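  • As a concrete illustration of this padding-and-windowing scheme (the patent gives no code; the numpy usage and the 16 kHz sample rate are assumptions), a minimal sketch follows:

```python
# A minimal numpy sketch of the padding-and-windowing step with the example
# numbers above: 480 ms of zeros at each end, a 960 ms window, a 10 ms step.
# The 16 kHz sample rate is an assumption for illustration only.
import numpy as np

def split_sub_audios(a, sr=16000, pad_s=0.48, win_s=0.96, hop_s=0.01):
    pad = np.zeros(int(pad_s * sr), dtype=a.dtype)
    b = np.concatenate([pad, a, pad])            # expanded audio b
    win, hop = int(win_s * sr), int(hop_s * sr)
    # the probability computed for the k-th window is attributed to time
    # k * hop_s in the original audio (the window's centre time)
    return [b[i:i + win] for i in range(0, len(b) - win + 1, hop)]

subs = split_sub_audios(np.zeros(16000, dtype=np.float32))   # 1 s of silence
print(len(subs))   # 101 windows: one centre point every 10 ms from 0 s to 1 s
```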
  • Step 123: compute the temporal feature sequence of each sub-audio.
  • This embodiment uses Mel-frequency cepstral coefficients (MFCCs), spectral coefficients obtained by a linear transform of the log energy spectrum on the nonlinear mel scale of audio frequency, which characterize the frequency-domain properties of the sound.
  • For each sub-audio, a preset window length and step size are used to compute its short-time Fourier transform and from it the MFCC feature sequence; for example, a window length of 25 ms and a step size of 10 ms.
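  • For instance, the MFCC sequence of a sub-audio could be computed with librosa as sketched below; the patent names no toolkit, and the 13-coefficient count and 16 kHz rate are assumptions:

```python
# Hedged sketch of the MFCC feature step using librosa (the patent names no
# toolkit; the 13-coefficient choice and sample rate are assumptions, while
# the 25 ms window and 10 ms hop are the values given above).
import librosa
import numpy as np

def mfcc_sequence(sub_audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    return librosa.feature.mfcc(
        y=sub_audio,
        sr=sr,
        n_mfcc=13,                      # assumed coefficient count
        n_fft=int(0.025 * sr),          # 25 ms analysis window
        hop_length=int(0.010 * sr),     # 10 ms step
    )                                   # shape: (13, n_frames)
```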
  • Step 124: the neural network obtains, from the temporal feature sequence, the probability that each sub-audio belongs to a given class.
  • The MFCC feature sequences are input, in chronological order, to a trained neural network model, which predicts the probability corresponding to each audio clip; the probability ranges from 0 to 1.
  • For example, the trained neural network model uses 3x3 convolution kernels and pooling layers to keep the parameter count small.
  • Training of the neural network comprises two stages: pre-training and fine-tuning.
  • The left-hand figure shows the 500-class model: a 500-class audio classification model first trained on a sound dataset.
  • The right-hand figure shows the binary model: it reuses the lower network structure and parameters of the 500-class model and is converged with the back-propagation algorithm. This binary model identifies whether a human voice is present in an audio clip, outputting the probability that the current clip contains human voice.
  • By introducing pre-training and fine-tuning, the trained network focuses on the human-voice versus non-human-voice classification scenario, improving model performance.
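  • A schematic of this pre-train/fine-tune arrangement in Keras is sketched below; the patent specifies only 3x3 kernels, pooling layers, a 500-class pre-training stage, and a reused base with a binary head, so all layer counts and sizes here are assumptions:

```python
# Schematic sketch only: layer counts and sizes are assumptions, not the
# patent's architecture. It shows a shared convolutional base (3x3 kernels,
# pooling) with a 500-class pre-training head and a binary fine-tuning head.
import tensorflow as tf

def make_base(input_shape=(13, 96, 1)):      # MFCC "image": coeffs x frames
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", padding="same",
                               input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.GlobalAveragePooling2D(),
    ])

base = make_base()
pretrain = tf.keras.Sequential([base, tf.keras.layers.Dense(500, activation="softmax")])
# ... pre-train `pretrain` on a 500-class sound dataset ...
binary = tf.keras.Sequential([base, tf.keras.layers.Dense(1, activation="sigmoid")])
binary.compile(optimizer="adam", loss="binary_crossentropy")  # fine-tune stage
# binary.predict(batch) then yields the human voice probability per sub-audio.
```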
  • Step 125: compare each probability with a decision threshold to decide whether the sub-audio is human voice, thereby obtaining the sub-audios in the original audio whose human voice probability exceeds the decision threshold.
  • The decision threshold is the basis for judging human voice: if the probability is greater than the threshold, the sub-audio is judged human voice; if it is less, non-human voice.
  • In this way the original audio a is divided into human voice and non-human-voice segments.
  • The total duration of the human voice in the original audio can be obtained by accumulating the durations of all human voice clips.
  • Step 126: for adjacent sub-audios in the original audio whose human voice probability exceeds the decision threshold, obtain the audio clip formed by their central time points, i.e. the human voice audio clip.
  • Recall that null data of equal duration, for example 480 ms at each end, was added before the head and after the tail of the original audio, and the expanded audio was divided to obtain the multiple sub-audios.
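  • For illustration, collecting the above-threshold centre points into human voice clips could look like the following sketch (the 0.5 threshold and function name are assumptions):

```python
# Sketch of turning per-sub-audio probabilities into human voice clips: each
# sub-audio contributes its centre time point, and consecutive above-threshold
# points form a clip. The 10 ms step follows the example above; the 0.5
# decision threshold is an assumption.
def voice_segments(probs, hop_s=0.01, threshold=0.5):
    segments, start = [], None
    for k, p in enumerate(probs):
        t = k * hop_s                       # centre time of the k-th sub-audio
        if p > threshold and start is None:
            start = t
        elif p <= threshold and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(probs) * hop_s))
    return segments

print(voice_segments([0.1, 0.8, 0.9, 0.2, 0.7, 0.95, 0.9]))
# -> [(0.01, 0.03), (0.04, 0.07)]
```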
  • Step 13: splice the human voice audio clips in chronological order to obtain the spliced audio.
  • The above human voice clips are spliced in chronological order into a single spliced human voice audio file, which is the human voice portion of the original audio.
  • Step 14: store or output the spliced human voice audio. The resulting file occupies little storage space and, because it contains no non-human-voice portions, is shorter than the original audio; a user replaying the speech content therefore wastes no time.
  • The following preprocessing steps may also be performed before the threshold decision in order to optimize the probability values.
  • The human voice probability array obtained as described above contains noisy points. This can be seen in the 200-millisecond probability distribution of FIG. 3, where the ordinate is the probability that a point is human voice, the abscissa is time, and each point represents 10 ms: the 0-1 probability values along the time axis show many abrupt changes, i.e. glitches. A sliding-average preprocessing step is therefore applied to smooth the probability distribution, yielding the 200-millisecond distribution shown in FIG. 4.
  • The smoothing preprocessing uses a sliding median filter.
  • Let P = {p_1, p_2, p_3, ..., p_i, ..., p_n}, where n is the total number of sub-audios obtained by segmenting the original audio and p_i is the probability that the i-th sub-audio is human voice; let w_smooth be the selected window size. The probability that the i-th sub-audio is human voice after median filtering is then the median of the w_smooth values of P centred on p_i.
  • In this embodiment the window is 31, i.e. 31 values of the sub-audio human voice probability array.
  • That is, the filter takes the probability values of the 31 adjacent points and uses their median as the value of the central point; the value of every point is recalculated in this way with a step size of 1.
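  • A minimal sketch of this smoothing step with scipy follows, assuming the filter is indeed a running median as stated above:

```python
# Minimal sketch of the sliding median smoothing with the window of 31
# described above, using scipy (medfilt requires an odd kernel size).
import numpy as np
from scipy.signal import medfilt

def smooth_probs(probs, w_smooth=31):
    return medfilt(np.asarray(probs, dtype=float), kernel_size=w_smooth)
```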
  • The median filter above is one implementation; other filtering methods are not excluded.
  • During filtering, the probability values of small stretches of non-human voice are corrected by the surrounding points.
  • The embodiment further exploits the fact that noise or silence has weaker energy than human voice, using the energy of the original audio to further correct the human voice probability and improve accuracy.
  • Let P denote the sliding-averaged human voice probability array. Because the original audio was segmented with a 10 ms step to obtain the sub-audios, the human voice probabilities are spaced 10 ms apart; the energy array of the original audio is therefore also computed with a 10 ms step, so that the time points of the energy array correspond to those of the probability array.
  • Let w_i be the adjustment factor at the i-th point. The values of the Power array are normalized to between 0 and 1 and an upper energy limit P_up and a lower energy limit P_down are determined; w_i is then set to 1 where the energy exceeds P_up, to 0 where it falls below P_down, and otherwise normalized to a value between 0 and 1 according to the energy value.
  • The resulting adjustment factor lies between 0 and 1; it adjusts the human voice probability at the corresponding time point, finally yielding the energy-corrected human voice probability array P_T.
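  • For illustration, the energy correction could be sketched as follows; the linear ramp between P_down and P_up, the limit values, and the mean-square frame energy are assumptions, since the patent's formula is given only as a figure:

```python
# For illustration only: frame energies at a 10 ms step scale the smoothed
# probabilities. The linear ramp between P_down and P_up, the limit values,
# and the mean-square frame energy are assumptions.
import numpy as np

def energy_corrected(probs, audio, sr=16000, hop_s=0.01, p_up=0.6, p_down=0.1):
    hop = int(hop_s * sr)
    power = np.array([np.mean(audio[i:i + hop] ** 2)
                      for i in range(0, len(audio) - hop + 1, hop)])
    power = power / (power.max() or 1.0)        # normalize the Power array
    w = np.clip((power - p_down) / (p_up - p_down), 0.0, 1.0)  # factor w_i
    n = min(len(w), len(probs))                 # align the two 10 ms grids
    return np.asarray(probs[:n]) * w[:n]        # energy-corrected array P_T
```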
  • In the embodiments above, the probabilities are first smoothed by the sliding average and then energy-corrected, after which a decision algorithm discriminates human voice from non-human voice, determines the speech segments, and counts the user's speaking duration; the currently obtained probabilities may instead be energy-corrected first and smoothed second.
  • Either one of the two preprocessing steps may also be adopted on its own to improve the accuracy of human voice recognition.
  • In the embodiments above, the sub-audios are obtained by expanding the original audio with null data of equal duration before its head and after its tail and then segmenting the expanded audio.
  • The null data added before the head and after the tail need not be of equal length: null data of a first duration is added before the head of the original audio and null data of a second duration after its tail, and a segmentation window equal to the third duration, the sum of the first and second durations, is used to segment the audio into sub-audios.
  • For example, with a first duration of 240 milliseconds and a second duration of 720 milliseconds, the segmentation window is their sum, 960 milliseconds; the sub-audios obtained in this way have the same duration as in the embodiment above, still 960 ms.
  • In this case the computed human voice probability of a sub-audio is taken as the probability at its 1/4 time point, i.e. as the human voice probability at time t_i + 0.24 s of a sub-audio starting at t_i.
  • The human voice clips of the original audio are then the clips formed by consecutive 1/4 time points of sub-audios judged to be human voice. Since the sub-audios are obtained by stepping through the original audio with the first step size, adjacent 1/4 time points are spaced by that step, e.g. the 10 ms used in the embodiment above.
  • With the above, the non-human-voice portions of the original audio can be removed and the human voice audio obtained for saving or output.
  • In addition, based on the characteristics of human speech, a certain tolerance is adopted to preserve the continuity of the human voice clips.
  • This yields higher-quality human voice audio, providing teachers and students with a better corpus and making the voice recordings easier to use.
  • The first method embodiment is as follows.
  • Denote the duration of a given non-human-voice clip by l_i, and assume there are n non-human-voice audio clips in the original audio in total. From these durations a first mean and a first variance are computed.
  • The difference between the first mean and the first variance is taken as the first threshold T_1.
  • Human voice clips are spliced together with non-human-voice clips whose duration is less than the first threshold T_1: that is, if the non-human-voice clip between two human voice clips in the original audio is shorter than T_1, that clip is retained in the saved or output audio.
  • The first method sets a different splice threshold for each audio, adjusts the merging effect dynamically, and is simple to compute.
  • In the second method embodiment, the statistics of each user's non-human-voice clips are saved. For user u, the mean of the non-human-voice clip durations over the existing data is recorded as m_u, the variance as σ_u, the number of non-human-voice clips in the existing original audio as N_u, the sum of their durations as S_u, and their variance sum as S_σu.
  • For a new user, these statistics of the non-human-voice clips are calculated from the first original audio and saved.
  • When new audio is recorded, the following parameters are calculated from the user's saved non-human-voice statistics and the non-human-voice statistics of the newly recorded audio.
  • The description below takes as an example the non-human-voice statistics over all of the user's saved historical original audio.
  • The superscript "old" denotes the non-human-voice clip statistics over all historical original audio generated before the current original audio; the superscript "new" denotes the statistics after the current original audio is included.
  • S_u^old is the sum of the durations of all original-audio non-human-voice clips before the current original audio is obtained; S_u^new is the total duration after it is obtained.
  • S_σu^old is the sum of the duration variances of all original-audio non-human-voice clips before the current original audio is obtained; S_σu^new is the sum after it is obtained.
  • N_u^old is the number of all original-audio non-human-voice clips before the current original audio is obtained; N_u^new is the number after it is obtained.
  • From these quantities, the second mean m_u^new and the second variance σ_u^new of the durations of all original-audio non-human-voice clips, now including the current original audio, are obtained.
  • Here the current original audio contains n non-human-voice audio clips in total, the i-th of duration l_i.
  • The difference of the second mean and the second variance gives the second threshold, T_2 = m_u^new − σ_u^new.
  • If the non-human-voice audio clip between two human voice clips is shorter than the second threshold T_2, that clip is retained in the saved or output audio and is spliced with the human voice clips before and after it.
  • The third method embodiment likewise saves the statistics of each user's non-human-voice clips: for user u, the mean is recorded as m_u, the variance as σ_u, the number of non-human-voice clips in the existing original audio as N_u, the sum of their durations as S_u, and their variance sum as S_σu.
  • For a new user, these statistics are calculated from the first original audio as follows.
  • S_u^old is the sum of the durations of all original-audio non-human-voice clips before the current original audio is obtained; S_σu^old is the sum of their duration variances; and N_u^old is their number.
  • The third threshold is T_3 = m_u^old − σ_u^old.
  • The first threshold T_1 is obtained from the current original audio as described above.
  • The first threshold T_1 is adjusted using the preset weight together with the third threshold to obtain the fourth threshold T_4.
  • If the non-human-voice audio clip between two human voice clips is shorter than the fourth threshold T_4, that clip is retained in the saved or output audio and is spliced with the human voice clips before and after it.
  • In this way, the splice threshold for non-human-voice clips is adjusted dynamically from the statistics of the specific audio while the user's overall statistics are taken into account.
  • In the fourth method embodiment, the original audio a is divided into multiple human voice and non-human-voice segments; if the interval between two sub-audios judged to be human voice is less than a fixed threshold (the fifth threshold), the audio clip judged non-human voice between the adjacent human voice clips is additionally acquired and, in the current original audio, spliced together with those human voice clips.
  • The above describes four methods of deciding whether to splice audio. Referring to FIG. 5, suppose the original audio contains two human voice segments a_i and a_{i+1} whose start and end time points are s_i, e_i and s_{i+1}, e_{i+1} respectively, and suppose the threshold obtained by any of the methods above is 500 ms. If s_{i+1} − e_i is less than the threshold, the two segments are merged into one. The human voice audio obtained in this way preserves the continuity of the speech segments before and after.
  • The first method embodiment suits a lightweight recording and feedback system, since it needs no user information to be recorded.
  • The second method embodiment suits adult users: adults record more stably, so the user's overall statistics describe most of their audio well.
  • The third method embodiment suits young children: their emotional fluctuations are relatively irregular, so audio of the same text spoken at nearby times may differ greatly, and the statistics of the specific audio must be consulted, and reasonable adjustments made, alongside the user's overall statistics.
  • Corresponding to the above method, the present application further provides a human voice audio recording apparatus.
  • The apparatus includes: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
  • The specific manner in which each module performs its operations has been described in detail in the method embodiments and is not repeated here.
  • Each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • The functions noted in a block may occur out of the order noted in the figures; for example, two successive blocks may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A human voice audio recording method and apparatus. The method comprises: obtaining the current original audio (11); obtaining the audio clips identified as human voice in the current original audio (12); splicing the human voice audio clips in chronological order to obtain spliced audio (13); and storing or outputting the spliced audio (14).

Description

Human voice audio recording method and apparatus
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application filed with the China Patent Office on November 12, 2020, with application number 202011258272.3 and invention title "A human voice audio recording method and apparatus", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present application relates to the technical field of data processing, and in particular to a human voice audio recording method and apparatus.
BACKGROUND
With the development of Internet technology, online education and similar industries have flourished, and the number of online learners has increased dramatically.
Students' spoken answers during online learning need to be recorded and output. This need arises especially in language learning such as English, and also in subjects such as logical reasoning that can be learned through spoken interaction. On the one hand, teachers use the recordings to understand how students are learning, for example their English pronunciation, so as to provide guidance; on the other hand, students can use the recordings for review. A voice extraction and recording method that meets the needs of these scenarios is therefore required.
SUMMARY OF THE INVENTION
The present application provides a human voice audio recording method, comprising: obtaining the current original audio; obtaining the audio clips identified as human voice in the current original audio; splicing the human voice audio clips in chronological order to obtain spliced audio; and storing or outputting the spliced audio.
The above method further comprises: obtaining a first mean of the durations of the non-human-voice audio clips in the current original audio; using that mean, obtaining a first variance of those durations; taking the difference between the first mean and the first variance as a first threshold; and, if a non-human-voice audio clip whose duration is less than the first threshold lies between human voice audio clips, splicing that non-human-voice clip together with the human voice audio clips.
In parallel, the above method further comprises: according to the user identifier corresponding to the original audio, obtaining the total duration of the non-human-voice audio clips of at least one historical original audio of that user, together with the variance sum of those clips; using the durations of the current original audio's non-human-voice clips and the total duration of the historical non-human-voice clips, obtaining a second mean of the non-human-voice clip durations; using the variance of the current original audio's non-human-voice clips and the variance sum of the historical non-human-voice clips, obtaining a second variance of the non-human-voice clips; taking the difference between the second mean and the second variance as a second threshold; and, if in the current original audio a non-human-voice audio clip whose duration is less than the second threshold lies between human voice audio clips, splicing that clip together with the human voice audio clips.
The method further comprises: obtaining and saving the sum of the current original audio's non-human-voice clip durations and the total non-human-voice clip duration of the historical original audio; and obtaining and saving the sum of the variance of the current original audio's non-human-voice clips and the variance sum of the historical original audio's non-human-voice clips.
In parallel, the above method further comprises: obtaining a first mean of the durations of the non-human-voice audio clips in the current original audio; using that mean, obtaining a first variance of those durations; taking the difference between the first mean and the first variance as a first threshold; taking the user identifier corresponding to the original audio; obtaining, as a third mean, the mean duration of the non-human-voice clips in at least one historical original audio of that user; obtaining, as a third variance, the variance of the non-human-voice clips in the at least one historical original audio; taking the difference between the third mean and the third variance as a third threshold; adjusting the first threshold with a preset weight of the third threshold to obtain a fourth threshold; and, if in the current original audio a non-human-voice audio clip whose duration is less than the fourth threshold lies between human voice audio clips, splicing that clip together with the human voice audio clips.
The method further comprises: obtaining and saving the mean duration of the non-human-voice clips of the current original audio and of the historical original audio; and obtaining and saving the variance of those clips.
In parallel, the above method further comprises: if, in the current original audio, a non-human-voice audio clip whose duration is less than a fifth threshold lies between adjacent human voice audio clips, splicing that clip together with the human voice audio clips.
In the above embodiments, obtaining the audio clips identified as human voice in the original audio specifically comprises: segmenting the original audio according to a preset method to obtain multiple sub-audios; computing the Mel-frequency cepstral coefficient (MFCC) feature sequence of each sub-audio; using a neural network to obtain, from the MFCC feature sequence, the probability that each sub-audio is human voice; obtaining the sub-audios whose human voice probability is greater than a decision threshold; obtaining the adjacent sub-audios in the original audio whose human voice probability is greater than the decision threshold; and obtaining the audio clip formed by the determined time points of those adjacent sub-audios.
In the above embodiments, segmenting the original audio according to a preset method to obtain multiple sub-audios comprises: obtaining the original audio; adding null data of a first duration before the head of the original audio and null data of a second duration after its tail to obtain expanded audio; and, taking a third duration equal to the sum of the first and second durations as the segmentation window, sliding the window from the head of the expanded audio with a first step size to obtain multiple sub-audios in sequence.
In the above embodiments, after the human voice probability of each sub-audio is obtained, the method further comprises: obtaining the array of human voice probabilities of all sub-audios of the original audio; and filtering the probability values in the array with a window of a first size to obtain filtered human voice probabilities.
In the above embodiments, before the sub-audios whose human voice probability is greater than the decision threshold are obtained, the method further comprises: obtaining the audio energy value at the determined time point of each sub-audio in the original audio, and setting a human voice probability adjustment factor according to that energy value, including: if the energy value is greater than an upper energy limit, setting the factor to 1; if it is less than a lower energy limit, setting the factor to 0; and otherwise normalizing the factor to between 0 and 1 according to the energy value. Multiplying each sub-audio's adjustment factor by its human voice probability gives the corrected sub-audio human voice probability.
The application also provides an audio recognition apparatus, characterized in that it comprises:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
This application uses human voice recognition to extract the speech portions of the original audio, stores only the human voice audio clips, and deletes the non-human-voice clips; this not only removes noise but, because the non-human-voice clips are deleted, also saves storage space.
Further, based on the characteristics of human speech, such as its natural continuity and the brief pauses and breaths that children in particular often produce while answering questions aloud, the application provides several algorithms implementing tolerant merging, which retain the short non-voice stretches between human voice clips and so preserve the continuity of the voice in the recording.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present application.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects, features and advantages of the present application will become more apparent from the following more detailed description of its exemplary embodiments in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
FIG. 1 is a schematic flowchart of a human voice audio recording method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of original audio segmentation preprocessing according to an embodiment of the present application;
FIG. 3 is the human voice probability distribution of the audio before the moving average according to an embodiment of the present application;
FIG. 4 is the human voice probability distribution of the audio after the moving average according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the tolerant merging process.
DETAILED DESCRIPTION
Preferred embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the application, it should be understood that the application may be embodied in various forms and should not be limited by the embodiments set forth here; rather, these embodiments are provided so that this application will be thorough and complete and will fully convey its scope to those skilled in the art.
The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms "a", "the" and "said" used in this application and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first", "second", "third", etc. may be used in this application to describe various information, such information should not be limited by these terms; the terms only distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information and, similarly, second information may also be referred to as first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "plurality" means two or more, unless otherwise expressly and specifically defined.
The present application provides a human voice audio recording method. The original audio is identified and the human voice audio is extracted from it for storage or output; since large non-human-voice portions are removed from the stored audio file, storage resources are saved compared with the prior art.
A specific embodiment of the present application will be described with reference to FIG. 1.
Step 11: obtain the original audio.
An original audio file is obtained; for example, when a student studies online and answers aloud as prompted by the learning software, a smart device captures the original audio of the spoken answer through its microphone. The original audio may contain both the desired human voice and non-human sounds such as background sound and noise.
Step 12: obtain the audio clips identified as human voice in the current original audio.
The following provides one method for identifying the human voice in the current original audio and then extracting the human voice audio clips; other implementations that achieve the same function are not excluded.
Step 121: identify the human voice portions of the original audio.
Null data is added before the head and after the tail of the original audio, respectively, to obtain the expanded audio.
In one embodiment, the original audio is finely segmented into smaller sub-audios: a span of empty audio is added at each end of the original audio to obtain the expanded audio, and the expanded audio is segmented into sub-audios using a segmentation window, the empty-audio duration being kept at a 1:2 ratio to the window length.
As shown in FIG. 2, in this embodiment, in order to obtain accurate statistics of speech-onset time points, the sub-audio needs a fine segmentation granularity. In the figure, a is the original audio array; null data of equal duration, i.e. 480 milliseconds (ms) of zeros, is added at the head and at the tail of the original audio a to obtain the expanded audio b. The number of zeros in the 480 ms is determined by the audio sampling frequency; that is, the padded data has the same sampling rate as the audio.
The 480 ms duration of the null data added before the head and after the tail of the original audio is only exemplary; other durations are not excluded.
Step 122: taking twice the above duration as the segmentation window, obtain multiple sub-audios in sequence from the head of the expanded audio with a first step size.
As shown in FIG. 2, in this embodiment the segmentation window is 960 ms, i.e. twice the 480 ms, and the step size is 10 ms, so the minimum segmentation granularity of the sub-audio is 10 ms; other granularities are not excluded.
With this segmentation, several sub-audios are obtained; adjacent sub-audios are offset by 10 ms and each sub-audio lasts 960 ms.
Suppose a given sub-audio starts at time t_i and ends at time t_i + 0.96 s of the original audio. In this embodiment, the human voice probability computed for that sub-audio in the subsequent steps is taken as the human voice probability of the audio at time t_i + 0.48 s. Thus the probability computed from the first sub-audio serves as the human voice probability at the very start of the original audio, and that computed from the last sub-audio as the probability at its end.
Through this segmentation of the original audio, the human voice probability at a given time point is approximated, so speech-onset segments can be detected fairly accurately.
Step 123: compute the temporal feature sequence of each sub-audio.

This embodiment uses Mel-frequency cepstral coefficients (MFCCs): spectral coefficients obtained by a linear transform of the logarithmic energy spectrum on the nonlinear Mel scale of audio frequency, characterizing the frequency-domain properties of the sound.

For each sub-audio obtained by segmentation, the short-time Fourier transform is computed with a preset window length and step size to obtain the MFCC feature sequence; for example, a window length of 25 ms and a step size of 10 ms may be used.
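For illustration only, step 123 could be realized with the open-source librosa library as sketched below; the coefficient count n_mfcc=13 is an assumption, the text fixing only the 25 ms window and the 10 ms step:

import numpy as np
import librosa

def subaudio_mfcc(sub, sr, win_ms=25, hop_ms=10, n_mfcc=13):
    # MFCC feature sequence of one sub-audio via the short-time Fourier
    # transform with a 25 ms window and a 10 ms step (step 123, sketched).
    n_fft = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    mfcc = librosa.feature.mfcc(y=np.asarray(sub, dtype=float), sr=sr,
                                n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop)
    return mfcc.T   # one row of coefficients per 10 ms frame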
Step 124: a neural network derives, from the temporal feature sequence, the probability that the sub-audio belongs to a particular class.

The MFCC feature sequences are fed into a trained neural network model, which outputs a probability for each audio segment; in this embodiment the segments are input to the trained model in chronological order, and the model predicts a probability between 0 and 1 for each segment.

For example, the trained neural network model uses 3x3 convolution kernels and pooling layers to keep the model parameters compact. Training comprises two stages, pre-training and fine-tuning. The left diagram shows a 500-class model: an audio classification model with 500 classes is first trained on a sound data set. The right diagram shows a binary classification model that reuses the underlying network structure and parameters of the 500-class model and is brought to convergence by the back-propagation algorithm. This binary model decides whether a human voice is present in an audio segment and outputs the probability that the current segment contains a human voice. Introducing the two stages of pre-training and fine-tuning makes the trained network focus on the vocal/non-vocal classification scenario and improves model performance.
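The pre-train/fine-tune arrangement might be sketched in PyTorch as below; the layer sizes and names are illustrative assumptions, not the architecture actually disclosed:

import torch.nn as nn

class AudioCNN(nn.Module):
    # A small 3x3-conv/pooling network over MFCC feature maps (sizes assumed).
    def __init__(self, n_classes):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.head(self.backbone(x))

# Stage 1: pre-train a 500-class audio classifier (training loop omitted).
pretrained = AudioCNN(n_classes=500)

# Stage 2: reuse the pre-trained backbone, attach a vocal/non-vocal head,
# and fine-tune with back-propagation on vocal/non-vocal labels.
vocal_model = AudioCNN(n_classes=2)
vocal_model.backbone.load_state_dict(pretrained.backbone.state_dict())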
Step 125: compare each probability with a decision threshold to decide whether the sub-audio belongs to the vocal class, thereby obtaining the sub-audios of the original audio whose vocal probability exceeds the decision threshold.

The decision threshold serves as the criterion for judging whether a sub-audio is a human voice: if the probability is greater than the threshold, the sub-audio is judged vocal; if it is smaller, the sub-audio is judged non-vocal.

After the above steps, the original audio a has been divided into successive vocal and non-vocal segments, and the vocal duration of the original audio can be obtained by summing the durations of all vocal segments.
Step 126: for the adjacent sub-audios of the original audio whose vocal probability exceeds the decision threshold, obtain the audio segment formed by the center time points of those adjacent sub-audios; this segment is a vocal audio segment.

In the above method of obtaining vocal audio segments, empty data of equal duration, for example 480 ms each, is added before the head and after the tail of the original audio, and a window of twice 480 ms, i.e., 960 ms, is used to split the original audio into multiple sub-audios.
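A sketch of steps 125 and 126 under the 10 ms step used above; the 0.5 decision threshold is an assumed value, as the text does not fix it:

def vocal_segments(probs, threshold=0.5, hop_ms=10):
    # probs[i] is the vocal probability at the center time t = i * hop_ms;
    # runs of above-threshold points become (start_ms, end_ms) vocal segments.
    segments, start = [], None
    for i, p in enumerate(probs):
        if p > threshold and start is None:
            start = i * hop_ms
        elif p <= threshold and start is not None:
            segments.append((start, i * hop_ms))
            start = None
    if start is not None:   # the audio ends while still vocal
        segments.append((start, len(probs) * hop_ms))
    return segments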
Step 13: splice the vocal audio segments in chronological order to obtain the spliced audio.

The vocal audio segments are spliced in chronological order to obtain a spliced vocal audio file, which is the vocal portion of the original audio.

Step 14: store or output the vocal portion of the original audio obtained by the splicing. The resulting file occupies little storage space; moreover, since the audio contains no non-vocal portions, it is shorter than the original audio and takes less time to play, so a user who listens to the speech content repeatedly wastes no time.
On the basis of the above embodiment, after the neural network yields the probability that a sub-audio is vocal and before the threshold decision is made, the following preprocessing steps may also be performed to refine the probability values.

1) Moving-average preprocessing of the currently obtained probabilities.

Owing to the segmentation granularity and to noise, the vocal probability array of the original audio obtained by the method described above contains noisy points. This is visible in the 200 ms vocal probability plot of FIG. 3, in which the ordinate is the probability that the audio point is vocal, the abscissa is time, and each point represents 10 ms; the 0-1 probability values along the time axis show many abrupt jumps, i.e., glitches. The currently obtained probabilities are therefore smoothed by moving-average preprocessing, yielding the smoother 200 ms vocal probability plot of FIG. 4.
The moving-average preprocessing uses a sliding median-style filter; after filtering, the probability that the i-th sub-audio is vocal is (formula reconstructed from the definitions below, the original image being unavailable):

p′_i = (p_Lo + p_(Lo+1) + … + p_Hi) / (Hi − Lo + 1)

where the vocal probability array over all sub-audios of the original audio is

P = {p_1, p_2, p_3, …, p_i, …, p_n},

n being the total number of sub-audios obtained by segmenting the original audio and p_i the probability that the i-th sub-audio is vocal.

w_smooth is the selected window size. In this embodiment the window is chosen as 31, i.e., the window spans 31 values of the sub-audio vocal probability array.

For p_i, the upper and lower index bounds of the moving average are determined as follows.

The lower index is Lo = max(0, i−15), so that the window never extends before the first probability value of the array.

The upper index is Hi = min(n, i+15), so that the window never extends past the last probability value of the array.
In this embodiment, the filtering averages the probability values of 31 adjacent points and assigns the result to the middle point; following this method, the probability value of every point is recomputed with a step size of 1.

Comparing FIG. 3 and FIG. 4, it can be seen that the glitches in the sub-audio vocal probability plot are effectively corrected by the moving average, which improves the precision of opening-segment detection to a certain extent.
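A minimal sketch of this smoothing, with the window clipped at the array boundaries exactly as Lo and Hi above:

def smooth_probs(probs, w_smooth=31):
    # Centered moving average over w_smooth points; near the edges the
    # window is clipped to Lo = max(0, i-15) and Hi = min(n, i+15).
    n, half = len(probs), w_smooth // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(probs[lo:hi]) / (hi - lo))
    return out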
The above filtering is one implementation of the present invention; the present invention does not exclude the use of other filtering methods.

2) Energy-correction preprocessing.
After the moving-average preprocessing, because this embodiment uses fine-grained sub-audio segmentation and the sub-audios overlap heavily, the probabilities of a small number of non-vocal audio points are pulled toward the vocal side by their neighbors during filtering; that is, their vocal probability increases although they are in fact non-vocal.

To solve this problem, this embodiment of the present invention exploits the fact that noise or silence carries less energy than the human voice, and uses the energy of the original audio to further correct the vocal probabilities and improve precision.
The moving-averaged audio vocal probability array is

P_f = {p′_1, p′_2, p′_3, …, p′_i, …, p′_n}.

With 10 ms as the window size and 10 ms as the step size, the energy array of the original audio is computed as

Power = {w_1, w_2, w_3, …, w_i, …, w_n}.
Because the embodiment described above slices the original audio with a 10 ms step to obtain the sub-audios, and hence vocal probabilities at 10 ms intervals, the energy array of the original audio is likewise computed with a 10 ms step, so that the time points of the energy array correspond to those of the vocal probability array of the original audio.
The values of the Power array are normalized to between 0 and 1. With an energy upper limit P_up and an energy lower limit P_down determined, w_i can be normalized as follows (piecewise form reconstructed from the description):

w′_i = 1, if w_i ≥ P_up;
w′_i = (w_i − P_down) / (P_up − P_down), if P_down < w_i < P_up;
w′_i = 0, if w_i ≤ P_down.

As the formula shows, when the audio energy at a time point is greater than the energy upper limit P_up, w′_i takes the value 1; when the audio energy at a time point is less than the energy lower limit P_down, w′_i takes the value 0. This yields the normalized factor array

W = {w′_1, w′_2, w′_3, …, w′_i, …, w′_n}.
A pointwise product of the corresponding values of the array P_f and the array W gives the energy-corrected audio vocal probability array P_T:

P_T = {p′_1·w′_1, p′_2·w′_2, …, p′_n·w′_n}.

Through this operation, when the audio energy at a time point is greater than the energy upper limit P_up, the vocal probability at that point is unchanged; when the audio energy at a time point is less than the energy lower limit P_down, the vocal probability at that point becomes 0.

In this embodiment, if the audio energy lies between the energy lower limit and the energy upper limit (limits included), the resulting probability adjustment factor lies between 0 and 1; this factor scales the vocal probability at the corresponding time point, finally yielding the energy-corrected audio vocal probability array P_T.
As can be seen, by using the energy array of the original audio, a time point whose energy is below the energy lower limit is considered non-vocal and its vocal probability is set to zero; in this way further non-vocal portions of the audio are removed.
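A sketch of the energy-correction preprocessing; the limits p_up and p_down are tunable inputs whose concrete values the text does not specify:

def energy_correct(smoothed_probs, energies, p_up, p_down):
    # Scale each smoothed vocal probability by a factor derived from the
    # frame energy: 1 above p_up, 0 below p_down, linear in between.
    corrected = []
    for p, w in zip(smoothed_probs, energies):
        if w >= p_up:
            factor = 1.0
        elif w <= p_down:
            factor = 0.0
        else:
            factor = (w - p_down) / (p_up - p_down)
        corrected.append(p * factor)
    return corrected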
In the above embodiment, the obtained probabilities first undergo moving-average preprocessing and then energy-correction preprocessing, after which the decision algorithm separates vocal from non-vocal points, determines the vocal opening segments, and counts the user's opening duration. The two preprocessing operations on the currently obtained probabilities have no required order: the energy-correction preprocessing may equally be performed before the moving-average preprocessing.

The present invention may also employ only one of the two preprocessing methods to improve the accuracy of vocal recognition.
In the above embodiments, the original audio is expanded with empty data of equal duration before its head and after its tail and is then segmented into sub-audios. The durations of the empty data added before the head and after the tail may, however, differ: empty data of a first duration is added before the head of the original audio and empty data of a second duration after its tail, and a third duration equal to the sum of the first and second durations is used as the segmentation window for splitting the original audio into sub-audios.

For example, with a first duration of 240 ms and a second duration of 720 ms, the segmentation window is their sum, i.e., 960 ms. The sub-audio duration obtained in this way is the same as in the embodiment above, still 960 ms.
With this segmentation, the computed sub-audio vocal probability is taken approximately as the vocal probability at the 1/4 point of the sub-audio. Assuming the start and end times of a given sub-audio correspond to t_i and t_i+0.96s in the original audio, the sub-audio vocal probability is taken as the vocal probability at time t_i+0.24s. Likewise, the audio segment formed by the 1/4 time points of consecutive sub-audios judged vocal yields a vocal segment of the original audio. Since the original audio is segmented with the first step size, the 1/4 points of adjacent sub-audios are one step apart, for example the 10 ms used in the embodiment above.

Furthermore, when applying the audio-energy correction preprocessing to the resulting sub-audio vocal probability array, it is preferable to compute the energy at the 1/4 point of the sub-audio. For example, assuming the start and end times of a given sub-audio correspond to t_i and t_i+0.96s in the original audio, the energy at time t_i+0.24s is computed, and the probability correction factor of the sub-audio (t_i, t_i+0.96s) is derived from that energy.
According to the above embodiments, the non-vocal portions of the original audio can be removed and the resulting vocal audio stored or output.

Considering the natural continuity of human speech, especially in online learning scenarios for children and teenagers, the words of a sentence expressing a complete meaning are often separated by brief pauses, typically for breathing or to convey an emotion.

In this embodiment, a certain latitude is therefore applied to preserve the continuity of the vocal audio segments. This yields higher-quality vocal audio, provides teachers and students with a corpus of higher content quality, and facilitates their use of the vocal recordings.

Four methods of preserving the continuity of the vocal audio segments are described below.
First method embodiment.

Since each audio differs in content and the user's mood and state while speaking differ, the latitude is adjusted dynamically based on the statistical properties of the non-vocal portions of the original audio. The specific method is as follows.
For an original audio, the duration of a given non-vocal segment is denoted l_i, and the original audio is assumed to contain n non-vocal audio segments in total.

First, the first mean m_l of the durations of the audio segments identified as non-vocal in the current original audio is obtained (formulas reconstructed from the definitions, the original images being unavailable):

m_l = (1/n) · Σ_{i=1..n} l_i

Using this mean, the first variance δ of the durations of the audio segments identified as non-vocal in the current original audio is obtained:

δ = (1/n) · Σ_{i=1..n} (l_i − m_l)²

The difference between the first mean and the first variance is taken as the first threshold T:

T_1 = m_l − δ

The vocal audio segments are then spliced together with the non-vocal audio segments whose duration is less than the first threshold T. That is, if, in the original audio, the non-vocal audio segment between two vocal audio segments is shorter than the first threshold T_1, that non-vocal segment is retained in the stored or output audio.
The first method sets a different segment-splicing threshold for each audio, dynamically adjusting the splicing behavior, and is simple to compute.
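The first method embodiment may be sketched as follows, assuming the segments are (start, end) pairs in a common unit such as milliseconds and that the audio contains at least one segment of each kind:

def merge_with_tolerance(vocal_segs, nonvocal_segs):
    # First threshold T1 = mean - variance of the non-vocal durations of
    # this audio; gaps shorter than T1 are kept inside the spliced audio.
    durs = [e - s for s, e in nonvocal_segs]
    m_l = sum(durs) / len(durs)
    var = sum((d - m_l) ** 2 for d in durs) / len(durs)
    t1 = m_l - var
    merged = [list(vocal_segs[0])]
    for s, e in vocal_segs[1:]:
        if s - merged[-1][1] < t1:      # short gap: bridge it
            merged[-1][1] = e
        else:
            merged.append([s, e])
    return [tuple(seg) for seg in merged]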
Second method embodiment.

For each user, the statistical properties of the user's non-vocal audio segments are saved. For a user u, for example, the mean of the non-vocal segment durations computed from the existing data is denoted m_u, the variance δ_u, the number of non-vocal segments in the existing original audios N_u, the total duration of all non-vocal segments in the existing original audios S_u, and the variance sum of the non-vocal segments in the existing original audios S_δu.

Thus, for a new user u, the properties of the non-vocal audio segments are computed from the user's first original audio as follows, and saved.
Mean of the durations of the audio segments identified as non-vocal in the original audio:

m_u = (1/n) · Σ_{i=1..n} l_i

Total duration of the non-vocal audio segments in the existing original audio:

S_u = Σ_{i=1..n} l_i

Variance of the durations of the audio segments identified as non-vocal in the original audio:

δ_u = (1/n) · Σ_{i=1..n} (l_i − m_u)²

Variance sum of the durations of the audio segments identified as non-vocal in the existing original audio:

S_δu = Σ_{i=1..n} (l_i − m_u)²

Number of non-vocal audio segments in the existing original audio: N_u = n, where n denotes that the audio contains n non-vocal segments in total, each of duration l_i.
If the user has recorded at least one original audio, then when a new original audio is recorded, whether the non-vocal audio portions of the new original audio may be output or stored is computed from the statistical properties of the non-vocal segments of the user's existing original audios, as follows.

The user identifier corresponding to the original audio is acquired, and with it the saved properties of the user's non-vocal audio segments described above.

The following parameters are computed from the saved statistics of the user's non-vocal audio segments together with the statistics of the non-vocal segments of the newly recorded audio. As a preferred embodiment, the description below uses the saved non-vocal statistics of all of the user's historical original audios.

In what follows, the subscript old denotes the non-vocal segment statistics over all historical original audios generated for the user before the current original audio, and the subscript new denotes the statistics after the current original audio is obtained.
S_u_old is the total duration of the non-vocal segments of all original audios before the user's current original audio; S_u_new is the total after the current original audio is included (update formulas reconstructed from the definitions):

S_u_new = S_u_old + Σ_{i=1..n} l_i

S_δu_old is the sum of the duration variances of the non-vocal segments of all original audios before the current original audio; S_δu_new is the sum after the current original audio is included:

S_δu_new = S_δu_old + Σ_{i=1..n} (l_i − m_l)², where m_l is the duration mean of the current audio's non-vocal segments

N_u_old is the number of non-vocal segments of all original audios before the current original audio; N_u_new is the number after the current original audio is included:

N_u_new = N_u_old + n

From these parameters, the second mean m_u_new and the second variance δ_u_new of the non-vocal segment durations over all original audios including the current one are obtained:

m_u_new = S_u_new / N_u_new

δ_u_new = S_δu_new / N_u_new

where n is the number of non-vocal audio segments in the current original audio, each of duration l_i.
The difference between the second mean and the second variance then gives the second threshold:

T_2 = m_u_new − δ_u_new

In the current original audio, if the non-vocal audio segment between two vocal audio segments is shorter than the second threshold T_2, that non-vocal segment is retained in the stored or output audio and is spliced together with the vocal segments before and after it.
The second method embodiment accounts for the differing speaking habits of different users by dynamically adjusting the splicing threshold at user granularity; on this basis the means and variances are computed in a streaming fashion, saving computation and storage resources.
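The streaming bookkeeping of the second method embodiment may be sketched as follows; the dictionary keys are hypothetical names for the saved quantities S_u, S_δu and N_u:

def update_user_stats(stats, nonvocal_durs):
    # Streaming update: only running sums are stored, never raw history.
    n = len(nonvocal_durs)
    m_l = sum(nonvocal_durs) / n                   # current audio's mean
    stats["S_u"] += sum(nonvocal_durs)             # running duration sum
    stats["S_du"] += sum((d - m_l) ** 2 for d in nonvocal_durs)
    stats["N_u"] += n                              # running segment count
    return stats["S_u"] / stats["N_u"] - stats["S_du"] / stats["N_u"]  # T2

# Usage: stats = {"S_u": 0.0, "S_du": 0.0, "N_u": 0} for a new user, then
# t2 = update_user_stats(stats, durations_of_current_audio).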
Third method embodiment.

For each user, the statistical properties of the user's non-vocal audio segments are saved: for a user u, the mean m_u, the variance δ_u, the number of non-vocal segments in the existing original audios N_u, the total duration of all non-vocal segments in the existing original audios S_u, and the variance sum of the non-vocal segments in the existing original audios S_δu.

Thus, for a new user u, the properties of the non-vocal audio segments computed from the user's first original audio are as follows.
Mean of the durations of the audio segments identified as non-vocal in the original audio:

m_u = (1/n) · Σ_{i=1..n} l_i

Total duration of the non-vocal audio segments in the existing original audio:

S_u = Σ_{i=1..n} l_i

Variance of the durations of the audio segments identified as non-vocal in the original audio:

δ_u = (1/n) · Σ_{i=1..n} (l_i − m_u)²

Variance sum of the durations of the audio segments identified as non-vocal in the existing original audio:

S_δu = Σ_{i=1..n} (l_i − m_u)²

Number of non-vocal audio segments in the existing original audio: N_u = n, where n denotes that the audio contains n non-vocal segments in total, each of duration l_i.
If the user has recorded at least one original audio, then when a new original audio is recorded, whether the non-vocal audio portion of the new original audio may be output or stored is computed from the properties of the non-vocal segments of the user's existing original audios, as follows.

The user identifier corresponding to the original audio is acquired, and with it the saved properties of the non-vocal audio segments of the user's historical original audios described above. Specifically, S_u_old is the total duration of the non-vocal segments of all original audios before the user's current original audio; S_δu_old is the sum of the duration variances of those segments; and N_u_old is their total number.
From these parameters, and following the computation method above, the third mean m_u_old and the third variance δ_u_old of the non-vocal segment durations over all of the user's historical original audios saved before the current one are obtained, and from them the third threshold:

T_3 = m_u_old − δ_u_old

With reference to the method described in the first method embodiment above, the first threshold T_1 is obtained from the current original audio.

A weight α is defined, with 0 < α ≤ 1.
The first threshold T_1 is adjusted using this weight together with the third threshold to obtain the fourth threshold T_4, i.e., as a weighted combination in which, per claim 5, the weight attaches to the third threshold (the original formula image being unavailable):

T_4 = α·T_3 + (1 − α)·T_1
In the original audio, if the non-vocal audio segment between two vocal audio segments is shorter than the fourth threshold T_4, that non-vocal segment is retained in the stored or output audio and is spliced together with the vocal segments before and after it.
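Under the weighted-combination reading above, computing the fourth threshold reduces to a one-line blend; the default alpha=0.5 is an assumed value:

def fourth_threshold(t1, t3, alpha=0.5):
    # Blend the per-audio threshold T1 with the per-user threshold T3;
    # the weight alpha (0 < alpha <= 1) attaches to the user-level T3.
    return alpha * t3 + (1 - alpha) * t1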
The third method embodiment dynamically adjusts the splicing threshold of the non-vocal audio segments by combining the user's overall statistics with the statistics of the specific audio.

Fourth method embodiment.

The original audio a has been divided into multiple vocal and non-vocal segments. If the interval between two sub-audios judged vocal is smaller than a threshold value (a fifth threshold), the audio segment judged non-vocal that lies between the adjacent segments identified as vocal is further acquired, and in the current original audio the vocal audio segments are spliced together with the segments lying between those adjacent vocal segments.
The above describes four methods of deciding whether to splice audio segments. Referring to FIG. 5, suppose the original audio contains two vocal segments a_i and a_i+1, whose start and end time nodes are denoted (s_i, e_i) and (s_(i+1), e_(i+1)) respectively (notation introduced here, the original formula images being unavailable), and suppose the threshold obtained by any of the methods described above is 500 ms. If

s_(i+1) − e_i < 500 ms,

the two segments are merged into one. As can be seen, the vocal audio obtained by this processing preserves the continuity of the speech segments.
None of the above tolerance-merging methods is inherently superior; each suits different scenarios and user groups. The first method embodiment suits lightweight recording and feedback systems, since it requires no user information to be recorded. The second method embodiment suits adult users: adults are emotionally stable while recording, so user-level statistics describe most of their audio characteristics well. The third method embodiment suits young children: their emotions fluctuate widely and irregularly, and recordings of the same text spoken at nearby times may differ greatly, so the statistics of the audio itself must be consulted alongside the user's statistical properties for a reasonable adjustment.
Corresponding to the foregoing method embodiments, the present application further provides a human voice audio recording apparatus. The apparatus includes:

a processor; and

a memory storing executable code which, when executed by the processor, causes the processor to perform the method described above. As for the apparatus of the above embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
Those skilled in the art will also appreciate that the various exemplary logic blocks, modules, circuits, and algorithm steps described in connection with the present application may be implemented as electronic hardware, computer software, or a combination of both.

The flowcharts and block diagrams in the figures illustrate possible architectures, functions, and operations of systems and methods according to multiple embodiments of the present application. Each block in a flowchart or block diagram may represent a module, program segment, or portion of code containing one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations the functions marked in the blocks may occur in an order different from that marked in the figures; for example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should likewise be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

  1. A human voice audio recording method, comprising:
    obtaining a current original audio;
    acquiring the audio segments identified as human voice in the current original audio;
    splicing the vocal audio segments in chronological order to obtain a spliced audio; and
    storing or outputting the spliced audio.
  2. The method according to claim 1, further comprising:
    obtaining a first mean of the durations of the non-vocal audio segments in the current original audio;
    using the mean, obtaining a first variance of the durations of the non-vocal audio segments in the current original audio;
    taking the difference between the first mean and the first variance as a first threshold; and
    if a non-vocal audio segment shorter than the first threshold lies between vocal audio segments, splicing that non-vocal audio segment together with the vocal audio segments.
  3. The method according to claim 1, further comprising:
    according to the user identifier corresponding to the original audio, obtaining the total duration of the non-vocal audio segments of at least one historical original audio of the user, and the variance sum of the non-vocal audio segments of the historical original audio;
    using the non-vocal segment durations of the current original audio and the total non-vocal segment duration of the historical original audio, obtaining a second mean of the non-vocal segment durations;
    using the non-vocal segment variance of the current original audio and the variance sum of the non-vocal segments of the historical original audio, obtaining a second variance of the non-vocal segments;
    taking the difference between the second mean and the second variance as a second threshold; and
    if, in the current original audio, a non-vocal audio segment shorter than the second threshold lies between vocal audio segments, splicing that non-vocal audio segment together with the vocal audio segments.
  4. The method according to claim 3, further comprising:
    obtaining and saving the sum of the non-vocal segment durations of the current original audio and the total non-vocal segment duration of the historical original audio; and
    obtaining and saving the sum of the non-vocal segment variance of the current original audio and the variance sum of the non-vocal segments of the historical original audio.
  5. The method according to claim 1, further comprising:
    obtaining a first mean of the durations of the non-vocal audio segments in the current original audio;
    using the mean, obtaining a first variance of the durations of the non-vocal audio segments in the current original audio, the difference between the first mean and the first variance being taken as a first threshold;
    acquiring the user identifier corresponding to the original audio;
    acquiring, as a third mean, the mean duration of the non-vocal audio segments in at least one historical original audio of the user;
    obtaining, as a third variance, the variance of the non-vocal audio segments in the at least one historical original audio, the difference between the third mean and the third variance being taken as a third threshold;
    adjusting the first threshold with a preset weight of the third threshold to obtain a fourth threshold; and
    if, in the current original audio, a non-vocal audio segment shorter than the fourth threshold lies between vocal audio segments, splicing that non-vocal audio segment together with the vocal audio segments.
  6. The method according to claim 5, further comprising:
    obtaining and saving the mean duration of the non-vocal audio segments of the current original audio and of the historical original audio; and
    obtaining and saving the variance of the non-vocal audio segments of the current original audio and of the historical original audio.
  7. The method according to claim 1, further comprising:
    if, in the current original audio, a non-vocal audio segment shorter than a fifth threshold lies between adjacent vocal audio segments, splicing that non-vocal audio segment together with the vocal audio segments.
  8. The method according to any one of claims 1 to 7, wherein acquiring the audio segments identified as human voice in the original audio comprises:
    segmenting the original audio by a preset method to obtain a plurality of sub-audios, and computing a Mel-frequency cepstral coefficient feature sequence for each sub-audio;
    obtaining, by a neural network and from the Mel-frequency cepstral coefficient feature sequences, the probability that each sub-audio is a human voice;
    acquiring the sub-audios whose vocal probability exceeds a decision threshold;
    acquiring the adjacent sub-audios of the original audio whose vocal probability exceeds the decision threshold; and
    acquiring the audio segment formed by determined time points of the adjacent sub-audios.
  9. The method according to claim 8, wherein segmenting the original audio by the preset method to obtain a plurality of sub-audios comprises:
    obtaining the original audio, and adding empty data of a first duration before the head of the original audio and empty data of a second duration after the tail of the original audio, to obtain an expanded audio; and
    using a third duration equal to the sum of the first duration and the second duration as the segmentation window, and starting from the head of the expanded audio with a first step size, windowing sequentially to obtain the plurality of sub-audios.
  10. The method according to claim 9, wherein after obtaining the probability that a sub-audio is a human voice, the method further comprises:
    obtaining the array of vocal probabilities of all sub-audios of the original audio; and
    filtering the probability values in the array with a window of a first number of values, to obtain filtered vocal probabilities.
  11. The method according to claim 9 or 10, wherein before acquiring the sub-audios whose vocal probability exceeds the decision threshold, the method further comprises:
    acquiring the audio energy value at a determined time point of each sub-audio of the original audio, and setting a vocal probability adjustment factor according to the audio energy value, including:
    if the audio energy value is greater than an energy upper limit, setting the vocal probability adjustment factor of the sub-audio to 1;
    if the audio energy value is less than an energy lower limit, setting the vocal probability adjustment factor of the sub-audio to 0;
    if the audio energy value is neither greater than the energy upper limit nor less than the energy lower limit, normalizing the vocal probability adjustment factor to between 0 and 1 according to the audio energy value; and
    multiplying the vocal probability of the sub-audio by its vocal probability adjustment factor to obtain a corrected sub-audio vocal probability.
  12. A human voice audio recording apparatus, comprising:
    a processor; and
    a memory storing executable code which, when executed by the processor, causes the processor to perform the method according to any one of claims 1 to 11.
PCT/CN2021/130305, priority date 2020-11-12, filed 2021-11-12: Human voice audio recording method and apparatus, WO2022100692A1 (en)

Applications Claiming Priority (2)

CN202011258272.3, priority date 2020-11-12
CN202011258272.3A (CN112382310B), priority date 2020-11-12, filed 2020-11-12: Human voice audio recording method and device

Publications (2)

WO2022100692A1 (en), published 2022-05-19
WO2022100692A9 (en), published 2022-11-24

Family ID: 74582989

Family Applications (1)

PCT/CN2021/130305, priority date 2020-11-12, filed 2021-11-12: Human voice audio recording method and apparatus

Country Status (2)

CN: CN112382310B (en)
WO: WO2022100692A1 (en)

Also Published As

CN112382310A (en), published 2021-02-19
CN112382310B (en), published 2022-09-27
WO2022100692A9 (en), published 2022-11-24


Legal Events

121: EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 21891212; Country of ref document: EP; Kind code of ref document: A1.

NENP: Non-entry into the national phase. Ref country code: DE.

122: EP: PCT application non-entry in European phase. Ref document number: 21891212; Country of ref document: EP; Kind code of ref document: A1.