WO2022100691A1 - Audio recognition method and device - Google Patents

Audio recognition method and device

Info

Publication number
WO2022100691A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
sub
probability
duration
human voice
Prior art date
Application number
PCT/CN2021/130304
Other languages
English (en)
French (fr)
Inventor
贾杨
夏龙
吴凡
郭常圳
Original Assignee
北京猿力未来科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京猿力未来科技有限公司
Publication of WO2022100691A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — characterised by the type of extracted parameters
    • G10L25/24 — the extracted parameters being the cepstrum
    • G10L25/27 — characterised by the analysis technique
    • G10L25/30 — using neural networks
    • G10L25/48 — specially adapted for particular use
    • G10L25/51 — for comparison or discrimination

Definitions

  • the present application relates to the technical field of data processing, and in particular, to an audio recognition method and device.
  • Sound is produced by the vibration of a sound source and is a form of energy. The energy of the audio can therefore be analyzed, and the user's effective speaking duration is obtained after the "silent" portions are removed.
  • This method works well in a relatively quiet environment, but its performance degrades when strong noise or reverberation is present, because noise itself also carries energy.
  • A speech-recognition-based statistic of speaking duration can recover the text contained in the audio together with the corresponding time points, and the user's speaking duration can be counted from these. Ideally this solution should give the best results, but it often has the following shortcomings:
  • 1) Speech recognition models generally have high computational complexity and are not suitable for use in high-concurrency online environments.
  • 2) The quality of the duration statistics depends on the accuracy of the speech recognition model. Different recognition models must be trained for different languages (including mixed languages, dialects, etc.), and improving model performance requires a large amount of labeled data, so the generalization of this solution is poor and the up-front labeling cost is high.
  • VGGish is a technique proposed by Google for audio classification.
  • It uses the VGG network structure from the field of image recognition, proposed by Oxford's Visual Geometry Group at ILSVRC 2014, and is trained on the Youtube-100M video dataset.
  • In the VGGish scheme, audio is classified according to the titles, subtitles, comments, etc. of online videos, into categories such as songs, music, sports and speeches. Under this scheme the classification quality depends on manual review, otherwise the category labels contain many errors; on the other hand, if the classes related to the human voice are grouped into a "human voice" category and the rest into a "non-human-voice" category, the resulting model performs poorly.
  • The present application provides an audio recognition method, including: obtaining original audio, adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio, to obtain expanded audio;
  • taking a third duration, equal to the sum of the first and second durations, as the segmentation window and, starting from the head of the expanded audio, sliding the window with a first step length to obtain multiple sub-audios; respectively calculating the time-frequency feature sequence of each sub-audio;
  • obtaining, by a neural network and according to the time-frequency feature sequence, the probability that the sub-audio belongs to a specific category; and comparing the probability with a decision threshold to decide whether the sub-audio belongs to the specific category.
  • the time-frequency feature sequence of the sub-audio is a Mel-frequency cepstral coefficient (MFCC) feature sequence;
  • the neural network obtains the human-voice probability of the sub-audio according to the MFCC feature sequence, and the human-voice probability is compared with a decision threshold to decide whether the sub-audio is a human voice.
  • the method further includes: obtaining an array of the human-voice probabilities of all sub-audios of the original audio, and filtering the probability values in the array with a first number of values as the window to obtain filtered human-voice probabilities.
  • median filtering is used to filter the human-voice probability array.
  • the method further includes: obtaining the audio energy value at a determined time point of the sub-audio in the original audio, and setting a human-voice probability adjustment factor according to the energy value, including: if the audio energy value is greater than an energy upper limit, the adjustment factor of the sub-audio is set to 1; if the audio energy value is less than an energy lower limit, the adjustment factor is set to 0; if the audio energy value is neither greater than the upper limit nor less than the lower limit, the adjustment factor is normalized to a value between 0 and 1 according to the energy value;
  • the adjustment factor of the sub-audio is multiplied by the human-voice probability of the sub-audio to obtain a corrected human-voice probability, and the corrected probability is compared with the decision threshold to decide whether the sub-audio is a human voice.
  • the method further includes: acquiring the sub-audios of the original audio that are consecutively judged to be human voice; acquiring the audio segment composed of the determined time points of those sub-audios; and outputting the audio segment.
  • the first duration is equal to the second duration; and the audio energy value at the determined time point of the sub-audio is specifically the audio energy value at the center time point of the sub-audio;
  • and the audio segment composed of the determined time points of the sub-audios consecutively judged to be human voice is specifically the audio segment composed of the center time points of those sub-audios.
  • the method further includes: if the time interval between adjacent audio segments is less than a third threshold, also acquiring the audio between the adjacent segments.
  • the above method further includes: counting the duration of the output audio.
  • the application provides an audio recognition device, including: a processor; and
  • a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
  • By adding null data before and after the original audio and sliding a window whose length is twice the duration of the null data over the original audio with a first step length, the present application obtains multiple sub-audios. Based on this fine-grained segmentation of the original audio, the MFCC is calculated separately for each sub-audio to obtain a time-frequency two-dimensional map of the sound signal.
  • The method can also obtain recognition probability values at the very beginning and end of the original audio; the probability of a sub-audio is approximated as the probability at its center time point, so a probability array at the granularity of original-audio time points is obtained, and more accurate detection of speaking segments can be achieved.
  • The present application distinguishes human voice from non-human voice through a deep learning algorithm. Further, the probability values produced by the neural network are filtered, which effectively removes noise from the recognition result; this noise arises from the fine-grained segmentation of the previous step, the model performance, or/and noise in the original audio. This reasonable sliding-window and smoothing mechanism therefore makes the audio recognition result smoother.
  • The fine-grained sub-audio segmentation and the large overlap between sub-audios cause a small portion of non-human-voice probabilities to be pulled toward human voice by the surrounding points. Therefore, based on the fact that noise or silence carries less energy than the human voice, the present application further corrects the probability values of the network by calculating the energy of the original audio, which further improves recognition accuracy.
  • A certain tolerance is allowed when segmenting and counting speaking segments according to the final human-voice probability array, in order to maintain the continuity of speech segments. This segmentation provides both better accuracy and corpus material of higher content quality.
  • On the premise of ensuring user interactivity and interaction efficiency, the method of the present application saves statistics time, improves feedback efficiency, adds the function of counting speaking time points and improves its accuracy, increases generalization, and makes the statistics of the user's speaking duration more accurate.
  • FIG. 1 is a schematic flowchart of an audio recognition method shown in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of original audio segmentation preprocessing shown in an embodiment of the present application.
  • Figure 3a is a schematic diagram of an existing neural network structure
  • FIG. 3b is a schematic diagram of the structure of the neural network according to the embodiment of the present application.
  • Fig. 4 is the audio frequency human voice probability distribution diagram before moving average shown in the embodiment of the present application.
  • Fig. 5 is the audio frequency human voice probability distribution diagram after moving average shown in the embodiment of the present application.
  • FIG. 6 is a schematic diagram of the tolerant merging process.
  • Although the terms first, second, third, etc. may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another.
  • For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information.
  • A feature defined as "first" or "second" may expressly or implicitly include one or more of that feature.
  • "Plurality" means two or more, unless otherwise expressly and specifically defined.
  • the present application provides an audio recognition method for recognizing audio belonging to a certain audio category from a piece of original audio. For example, the part of the human voice can be accurately identified from a piece of original audio, so that the duration of the human voice can be counted, or/and the human voice audio in the original audio can be further output.
  • the following embodiments illustrate the application of human voice recognition in an online education scenario as an example.
  • The idea of the present invention is to obtain sub-audios by fine-grained windowing of the original audio, approximate the probability of each sub-audio as the probability at its center time point, and then obtain a time-frequency two-dimensional representation of the sound signal by calculating the MFCC of each sub-audio.
  • A specific embodiment of the present invention will be described with reference to FIG. 1.
  • Step 11 Get the original audio.
  • the original audio may contain both the desired human voice and other non-human voices such as background sounds, noise, etc.
  • Step 12: add null data before the head and after the tail of the original audio, respectively, to obtain the expanded audio.
  • In one embodiment, the original audio is subdivided into smaller sub-audios: a segment of empty audio is added at each end of the original audio to obtain the expanded audio,
  • and the expanded audio is segmented into sub-audios based on the segmentation window value, with the empty-audio duration and the segmentation window value kept at a ratio of 1:2.
  • FIG. 2 shows a preferred implementation of audio expansion provided by an embodiment of the present invention.
  • a is the original audio array.
  • Empty data of equal duration, i.e. 480 milliseconds (ms) of zeros, is added to the beginning and the end of the original audio a to obtain the expanded audio b.
  • The number of zeros in the 480 ms is determined by the sampling frequency of the audio, i.e. the data rate within the 480 ms is the same as the sampling frequency.
  • The 480 ms duration of the null data added before the head and after the tail of the original audio is only exemplary; the present invention does not exclude other values for this duration.
  • Step 13: take twice the above duration as the segmentation window and obtain multiple sub-audios in sequence from the head of the expanded audio with the first step length.
  • The segmentation window is 960 ms, i.e. twice the 480 ms.
  • The segmentation step length is 10 ms, so the minimum segmentation granularity of the sub-audios is 10 ms.
  • Adjacent sub-audios are offset by 10 ms,
  • and the duration of each sub-audio is 960 ms.
  • The human-voice probability of the sub-audio feature map calculated in the subsequent steps is taken as the human-voice probability of the audio at time point t i +0.48 s. Therefore, the probability calculated from the first sub-audio serves as the human-voice probability at the very start of the original audio, and the probability calculated from the last sub-audio serves as the human-voice probability at the very end of the original audio.
  • Since empty data is added before the head and after the tail of the original audio and is combined with half a segmentation window of real data, the present invention approximates the human-voice probability at a given time point in this way, so more accurate detection of speaking segments can be achieved.
  • In this embodiment the first step length, i.e. the segmentation granularity, is 10 ms; the present invention does not restrict the choice of other segmentation granularities.
  • Step 14: respectively calculate the time-frequency feature sequence of each sub-audio.
  • The vibration of the human vocal cords and the opening and closing of the oral cavity follow universal laws, which are directly reflected in the frequency-domain characteristics of the sound.
  • The Mel-frequency cepstral coefficient (MFCC) is used as the feature.
  • For each sub-audio obtained by segmentation, a preset window length and step length are used to compute the result of its short-time Fourier transform, from which a Mel-frequency cepstral coefficient feature sequence is obtained.
  • A window length of 25 ms and a step length of 10 ms are used to compute the short-time Fourier transform and obtain the MFCC features.
  • Step 15: the neural network obtains, from the time-frequency feature sequence, the probability that the sub-audio belongs to a specific category.
  • The probability corresponding to each audio segment is predicted by the trained neural network model; the value of this probability ranges from 0 to 1.
  • The trained neural network model uses 3x3 convolution kernels and pooling layers to simplify the model parameters.
  • Training of the neural network includes two stages: pre-training and fine-tuning.
  • The figure on the left shows the 500-class classification model;
  • the 500-class audio classification model was first trained using a sound dataset.
  • The figure on the right shows the two-class model.
  • This network reuses the underlying network structure and parameters of the 500-class model and is made to converge using the back-propagation algorithm. The binary classification model is used to identify whether a human voice is present in an audio segment, and the model outputs the probability that the current audio segment contains a human voice.
  • By introducing the pre-training and fine-tuning stages, the network trained by the present invention is more focused on the human-voice / non-human-voice classification scenario, and the performance of the model is improved.
  • The trained convolutional neural network (CNN) in the embodiment of the present invention consists of an input layer, an output layer and multiple hidden layers, where the hidden layers are mainly composed of a series of convolutional layers, pooling layers and fully connected layers.
  • A convolutional layer generally defines a convolution kernel.
  • The size of the convolution kernel represents the receptive field of that layer. Different convolution kernels are slid over the input feature map and a dot product is computed with the feature map, projecting the information within the receptive field onto an element of the next layer and thereby enriching the information. In general, the convolution kernel is much smaller than the input feature map, and it acts on the input feature map with overlap or in parallel.
  • The pooling layer is in fact a non-linear form of downsampling; there are many different non-linear pooling functions, such as max pooling and mean pooling.
  • Pooling layers are periodically inserted between the convolutional layers in the CNN structure.
  • The fully connected layer fuses the high-level feature information abstracted by the convolutional and pooling layers and finally performs the classification.
  • Step 16: compare the probabilities with the decision threshold to decide whether each sub-audio belongs to the specific category.
  • The decision threshold is set as the basis for deciding whether the audio is a human voice: if the probability is greater than the threshold, the sub-audio is judged to be human voice; if the probability is less than the threshold, it is judged to be non-human voice.
  • After these steps, the original audio a has been divided into vocal and non-vocal segments.
  • The duration of the human voice in the original audio, i.e. the user's speaking-duration information, is obtained by accumulating the durations of all vocal segments.
  • The original audio can also be segmented according to the information of each speaking segment, which facilitates the subsequent output of vocal audio, for example for evaluating learning status.
  • After step 15, the obtained probability values can also be preprocessed by the following methods in order to optimize them.
  • the vocal probability array of the original audio obtained by the method described above contains noise points. It is reflected in the 200-millisecond vocal probability distribution diagram shown in Figure 4, where the ordinate represents the probability that the audio point is a human voice, the abscissa represents time, and each point represents 10ms. There are many sudden changes in probability values, that is, glitches, on the probability value distribution of 0-1 corresponding to the time axis of the horizontal axis. Therefore, it is necessary to perform sliding average preprocessing on the currently obtained probability to make the probability distribution smoother, and obtain the 200-millisecond vocal probability distribution map as shown in Figure 5.
  • the sliding average preprocessing adopts the median sliding filtering method.
  • the probability that the i-th sub-audio after median filtering is a human voice is:
  • P = {p 1 , p 2 , p 3 , ..., p i , ..., p n }, where n is the total number of sub-audios obtained by segmenting the original audio, and p i represents the probability that the i-th sub-audio is a human voice.
  • w_smooth is the selected window size.
  • the window is selected as 31, that is, the window is 31 values in the vocal probability array of the sub-audio.
  • median filtering is to average the probability values of adjacent 31 points as the probability value of the intermediate point; according to this method, the probability value of each point is recalculated with the step size of 1.
  • the above median filtering is an implementation manner of the present invention, and the present invention does not limit the adoption of other filtering methods.
  • After the filtering preprocessing, a decision algorithm is used to distinguish human voice from non-human voice, determine the speaking segments, and count the user's speaking duration.
  • Because of the fine-grained segmentation and the large overlap between sub-audios, the probabilities of a small portion of non-human-voice audio are pulled toward human voice by the surrounding points during filtering.
  • The embodiment of the present invention therefore exploits the fact that the energy of noise or silence is weaker than that of the human voice, and uses the energy of the original audio to further correct the human-voice probability and improve accuracy.
  • The moving-averaged audio human-voice probability array is denoted P f.
  • Since the original audio was sliced with a step length of 10 ms to obtain the sub-audios, giving human-voice probabilities at 10 ms intervals, the energy array of the original audio is also computed with a step length of 10 ms, so that the time points of the energy array correspond to the time points of the human-voice probability array of the original audio.
  • The values of the Power array are normalized to lie between 0 and 1: an energy upper limit P up and an energy lower limit P down are determined, and each w i is normalized as follows:
  • if the audio energy at a time point is greater than the upper limit P up, w i takes the value 1; if it is less than the lower limit P down, w i takes the value 0.
  • The resulting probability adjustment factor lies between 0 and 1; it adjusts the human-voice probability value at the corresponding time point, and finally the energy-corrected audio human-voice probability array P T is obtained.
  • In the above embodiment the probabilities are first smoothed by the moving average, then corrected by the energy preprocessing, and finally a decision algorithm distinguishes human voice from non-human voice, determines the speaking segments, and counts the user's speaking duration; the two preprocessing steps applied to the currently obtained probabilities have no fixed order:
  • the energy-correction preprocessing may also be performed first, followed by the moving-average preprocessing.
  • The present invention may also adopt only one of the two preprocessing methods to improve the accuracy of human-voice recognition.
  • The human-voice audio obtained by the above method may also be subjected to tolerant merging.
  • The tolerant merging method is as follows.
  • The decision result for each time point of the original audio is obtained: if the probability at time point i is not less than the decision threshold, the audio at time point i of the original audio is a human voice; otherwise it is non-human voice.
  • The original audio a is thus divided into vocal and non-vocal segments. If, between two sub-audios judged to be human voice, the number of intervening sub-audios judged to be non-human voice corresponds to an interval smaller than a third threshold, the audio between the center time points of the two human-voice sub-audios is further acquired.
  • The value of the third threshold is 500 milliseconds, which is only exemplary and is not limited by the present invention.
  • After tolerant merging, the user's speaking-duration information obtained by accumulating the durations of all segments is more reasonable than that obtained without the tolerant merging, and the resulting vocal audio preserves the continuity of the speech segments.
  • In the embodiment described above, step 12 adds null data of equal duration before the head and after the tail of the original audio, for example 480 milliseconds at each end, and step 13 uses a window of twice 480 milliseconds, i.e. 960 milliseconds, to segment the original audio into multiple sub-audios.
  • In other embodiments, the durations of the null data added before the head and after the tail of the original audio may be unequal. That is, null data of a first duration is added before the head of the original audio and null data of a second duration after the tail, and the third duration, equal to the sum of the first and second durations, is used
  • as the segmentation window to segment the original audio into sub-audios.
  • For example, the first duration is 240 milliseconds
  • and the second duration is 720 milliseconds,
  • so that the segmentation window, the sum of the first and second durations, is 960 milliseconds. The duration of each sub-audio obtained in this way is the same as in the embodiment above, still 960 ms.
  • The calculated human-voice probability of the sub-audio is then approximated as the human-voice probability at the 1/4 time point of the sub-audio.
  • That is, the human-voice probability of the sub-audio is approximated as the human-voice probability at time t i +0.24 s within the sub-audio.
  • When outputting vocal audio, the audio segment composed of the 1/4 time points of the sub-audios consecutively judged to be human voice is acquired. Since the sub-audios are obtained by segmenting the original audio with the first step length, adjacent 1/4 time points are separated by the first step length, for example the 10 ms used in the above embodiment.
  • the resulting array of vocal probabilities for the sub-audio may be filtered using the same method as described above.
  • the present application further provides an audio recognition device.
  • the device includes: a processor; and
  • a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
  • the specific manner in which each module performs the operation has been described in detail in the embodiment of the method, and will not be described in detail here.
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An audio recognition method and device. The audio recognition method includes: obtaining original audio (11); adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio, to obtain expanded audio; taking a third duration, equal to the sum of the first duration and the second duration, as a segmentation window and, starting from the head of the expanded audio, sliding the window with a first step length to obtain multiple sub-audios; respectively calculating the time-frequency feature sequence of each sub-audio (14); obtaining, by a neural network and according to the time-frequency feature sequence, the probability that the sub-audio belongs to a specific category; and comparing the probability with a decision threshold to decide whether the sub-audio belongs to the specific category (16).

Description

Audio recognition method and device
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to Chinese Patent Application No. 202011260050.5, entitled "An audio recognition method and device" and filed with the Chinese Patent Office on November 12, 2020, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present application relates to the technical field of data processing, and in particular to an audio recognition method and device.
BACKGROUND
With the development of Internet technology, online education and similar industries are booming and the number of online learners has surged. Teachers evaluate students' class participation by subjectively perceiving and counting the students' speaking (mouth-opening) duration, and give feedback to improve teaching effectiveness.
At present, the prior art offers several solutions for counting a user's speaking duration.
Switch-based speaking-duration statistics: a recording button that can be turned on and off is provided in the user client, and the user must press the button before speaking. The starting point of this solution is simple and direct duration statistics, but when the audience is a group of young children, characterized by poor compliance and unruly behavior, pressing a button and then speaking is inefficient, and keeping the microphone on throughout is better suited to this scenario. Moreover, the user must operate the control manually before any information can be conveyed, which reduces interactivity to some extent, and the validity of the duration statistics depends entirely on the user's self-discipline.
Speaking-duration statistics based on audio energy analysis: sound is produced by the vibration of a sound source and is a form of energy, so the energy of the audio can be analyzed and the user's effective speaking duration is what remains after the "silence" is cut away. This method works well in a relatively quiet environment, but its performance degrades when strong noise and reverberation are present, because noise itself also carries energy.
Speaking-duration statistics based on speech recognition: speech recognition algorithms based on Gaussian mixture models and hidden Markov models, neural networks and hidden Markov models, Connectionist Temporal Classification (CTC), and the like can obtain the text contained in the audio together with the corresponding time points, from which the user's speaking duration can be counted. Ideally this solution should give the best results, but it often has the following drawbacks: 1) speech recognition models generally have high computational complexity and are not suitable for use in high-concurrency online environments; 2) the quality of the duration statistics depends on the accuracy of the speech recognition model: different recognition models must be trained for different languages (including mixed languages, dialects, etc.), and improving model performance requires large amounts of labeled data, so the solution generalizes poorly and the up-front labeling cost is high.
VGGish is a technique proposed by Google for audio classification. It uses the VGG network structure from the field of image recognition, proposed by Oxford's Visual Geometry Group at ILSVRC 2014, and is trained on the Youtube-100M video dataset. In the VGGish scheme, audio is classified according to the titles, subtitles, comments, etc. of online videos, into categories such as songs, music, sports and speeches. Under this scheme the classification quality depends on manual review, otherwise the category labels contain many errors; on the other hand, if the classes related to the human voice are grouped into a "human voice" category and the rest into a "non-human-voice" category, the resulting model performs poorly.
SUMMARY
The present application provides an audio recognition method, including: obtaining original audio, adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio, to obtain expanded audio; taking a third duration, equal to the sum of the first duration and the second duration, as the segmentation window and, starting from the head of the expanded audio, sliding the window with a first step length to obtain multiple sub-audios; respectively calculating the time-frequency feature sequence of each sub-audio; obtaining, by a neural network and according to the time-frequency feature sequence, the probability that the sub-audio belongs to a specific category; and comparing the probability with a decision threshold to decide whether the sub-audio belongs to the specific category.
The time-frequency feature sequence of the sub-audio is a Mel-frequency cepstral coefficient feature sequence;
and the neural network obtains the human-voice probability of the sub-audio according to the Mel-frequency cepstral coefficient feature sequence; and the human-voice probability is compared with the decision threshold to decide whether the sub-audio is a human voice.
In the above method, after the human-voice probabilities of the sub-audios are obtained, the method further includes: obtaining an array of the human-voice probabilities of all sub-audios of the original audio; and filtering the probability values in the array with a first number of values as the window, to obtain filtered human-voice probabilities.
Median filtering is used to filter the human-voice probability array.
In the above method, the audio energy value at a determined time point of the sub-audio in the original audio is obtained, and a human-voice probability adjustment factor is set according to the energy value, including: if the audio energy value is greater than an energy upper limit, the adjustment factor of the sub-audio is set to 1; if the audio energy value is less than an energy lower limit, the adjustment factor of the sub-audio is set to 0; if the audio energy value is neither greater than the upper limit nor less than the lower limit, the adjustment factor is normalized to a value between 0 and 1 according to the energy value; the adjustment factor of the sub-audio is multiplied by the human-voice probability of the sub-audio to obtain a corrected human-voice probability; and the corrected human-voice probability is compared with the decision threshold to decide whether the sub-audio is a human voice.
The method further includes: acquiring the sub-audios of the original audio that are consecutively judged to be human voice; acquiring the audio segment composed of the determined time points of the sub-audios consecutively judged to be human voice; and outputting the audio segment.
In the above method, the first duration is equal to the second duration; the audio energy value at the determined time point of the sub-audio is specifically the audio energy value at the center time point of the sub-audio; and acquiring the audio segment composed of the determined time points of the sub-audios consecutively judged to be human voice is specifically acquiring the audio segment composed of the center time points of the sub-audios consecutively judged to be human voice.
Before the audio is output, the method further includes: if the time interval between adjacent audio segments is less than a third threshold, acquiring the audio between the adjacent audio segments.
The above method further includes: counting the duration of the output audio.
The present application provides an audio recognition device, including:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
In the present application, null data is added before and after the original audio, and the original audio is windowed with a first step length using a window equal to twice the duration of the null data, yielding multiple sub-audios. Based on this fine-grained segmentation of the original audio, the MFCC is computed separately for each sub-audio, giving a time-frequency two-dimensional map of the sound signal. With this method, probability values of whether the audio is recognized as a given type can also be obtained at the beginning and end of the original audio; the probability of a sub-audio is approximated as the probability at its center time point, so a probability array at the granularity of original-audio time points is obtained, and more accurate detection of speaking segments can be achieved.
On the other hand, the present application distinguishes human voice from non-human voice through a deep learning algorithm. Further, the probability values obtained by the neural network are filtered, which effectively removes noise from the audio recognition result; this noise arises from the fine-grained segmentation of the previous step, the model performance, or/and noise in the original audio. This reasonable sliding-window and smoothing mechanism therefore makes the audio recognition result smoother.
The fine-grained sub-audio segmentation and the large overlap between sub-audios in the present application cause a small portion of non-human-voice probabilities to be corrected toward human voice by the surrounding points. Therefore, based on the fact that the energy of noise or silence is weaker than that of the human voice, the present application further corrects the probability values of the network by calculating the energy of the original audio, further improving recognition accuracy.
Further, when segmenting and counting speaking segments according to the final audio human-voice probability array, the present application allows a certain tolerance in order to maintain the continuity of the speech segments. Such segmentation provides both good accuracy and corpus material of higher content quality.
In summary, on the premise of ensuring user interactivity and interaction efficiency, the method of the present application saves statistics time, improves feedback efficiency, adds the function of counting speaking time points and improves its accuracy, increases generalization, and counts the user's speaking duration more accurately.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present application.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects, features and advantages of the present application will become more apparent from the following more detailed description of exemplary embodiments of the present application in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
FIG. 1 is a schematic flowchart of an audio recognition method shown in an embodiment of the present application;
FIG. 2 is a schematic diagram of original-audio segmentation preprocessing shown in an embodiment of the present application;
FIG. 3a is a schematic diagram of an existing neural network structure;
FIG. 3b is a schematic diagram of the neural network structure of an embodiment of the present application;
FIG. 4 is the audio human-voice probability distribution before moving averaging shown in an embodiment of the present application;
FIG. 5 is the audio human-voice probability distribution after moving averaging shown in an embodiment of the present application;
FIG. 6 is a schematic diagram of the tolerant merging process.
DETAILED DESCRIPTION
Preferred embodiments of the present application will be described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present application, it should be understood that the present application may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present application will be more thorough and complete, and so that the scope of the present application can be fully conveyed to those skilled in the art.
The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms "a", "said" and "the" used in this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first", "second", "third", etc. may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "plurality" means two or more, unless otherwise expressly and specifically defined.
The present application provides an audio recognition method for recognizing, from a piece of original audio, audio belonging to a certain audio category. For example, the human-voice portions can be accurately identified from a piece of original audio, so that the duration of the human voice can be counted, or/and the human-voice audio in the original audio can be further output.
The following embodiments take the application of human-voice recognition in an online education scenario as an example.
The field of education uses the concept of speaking-duration statistics. For example, in offline daily education, the teacher, acting as the executor and evaluator of the statistics, subjectively perceives and counts the students' speaking duration, evaluates class participation, gives feedback and improves teaching effectiveness. With the development of online education, the number of online learners has increased dramatically, and speaking-duration statistics remain one of the indicators used in online education to evaluate student participation; a technical solution is therefore needed to count students' speaking duration accurately and effectively.
The idea of the present invention is to obtain sub-audios by fine-grained windowing of the original audio, approximate the probability of each sub-audio as the probability at its center time point, and then compute the MFCC of each sub-audio to obtain a time-frequency two-dimensional map of the sound signal; a deep learning algorithm is used to distinguish human voice from non-human voice, and the effective speaking duration is counted with a reasonable sliding-window and smoothing mechanism.
A specific embodiment of the present invention will be described with reference to FIG. 1.
Step 11: obtain the original audio.
An original audio file is obtained. For example, when a student studies online and answers by voice following the prompts of the learning software, the smart device captures the original audio of the student's spoken answer through the microphone. The original audio may contain both the desired human voice and other non-human-voice audio such as background sounds and noise.
Step 12: add null data before the head and after the tail of the original audio, respectively, to obtain the expanded audio.
In one embodiment, the original audio is subjected to fine-grained segmentation and divided into smaller sub-audios: a segment of empty audio is added at each end of the original audio to obtain the expanded audio, the expanded audio is segmented into sub-audios based on the segmentation window value, and the empty-audio duration and the segmentation window value are kept at a ratio of 1:2.
FIG. 2 shows a preferred implementation of audio expansion provided by an embodiment of the present invention.
In this embodiment, in order to count speaking time points accurately, the sub-audios need a smaller segmentation granularity. As shown in the figure, a is the original audio array. First, empty data of equal duration, i.e. 480 milliseconds (ms) of zeros, is added to the beginning and the end of the original audio a to obtain the expanded audio b. The number of zeros in the 480 ms is determined by the sampling frequency of the audio, i.e. the data rate within the 480 ms is the same as the sampling frequency.
The 480 ms duration of the null data added before the head and after the tail of the original audio in this embodiment is only exemplary; the present invention does not exclude other values for this duration.
Step 13: take twice the above duration as the segmentation window and obtain multiple sub-audios in sequence from the head of the expanded audio with the first step length.
As shown in FIG. 2, in this embodiment the segmentation window used to cut the original audio into sub-audios is 960 ms, i.e. twice the 480 ms. The segmentation step length is 10 ms, so the minimum segmentation granularity of the sub-audios is 10 ms.
Following this segmentation method, a number of sub-audios are obtained; adjacent sub-audios are offset by 10 ms, and the duration of each sub-audio is 960 ms.
Assuming that the start time and end time of a certain sub-audio are denoted t i and t i +0.96 s in the original audio, the human-voice probability of this sub-audio's feature map, calculated in the subsequent steps, is taken in the embodiment of the present invention as the human-voice probability of the audio at time point t i +0.48 s. Therefore, the human-voice probability calculated from the first sub-audio serves as the human-voice probability at the start of the original audio, and the human-voice probability calculated from the last sub-audio serves as the human-voice probability at the end of the original audio.
It can be seen that, because empty data is added before the head and after the tail of the original audio and is combined with half a segmentation window of real data, i.e. the present invention approximates the human-voice probability of a given time point in this way, more accurate detection of speaking segments can be achieved.
In the embodiment of the present invention the first step length, i.e. the segmentation granularity, is 10 ms; the present invention does not restrict the choice of other segmentation granularities.
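For illustration only, the following minimal Python sketch pads the waveform and slices it into overlapping sub-audios as described above. The 480 ms padding, 960 ms window, 10 ms step and the 16 kHz sampling rate are values taken from (or, in the case of the sampling rate, assumed for) this embodiment, not requirements of the method.

```python
import numpy as np

def expand_and_segment(audio, sr=16000, pad_s=0.48, win_s=0.96, hop_s=0.01):
    """Pad the original audio with zeros at both ends and cut it into
    overlapping sub-audios (window = 2 * padding, sliding by hop_s)."""
    pad = int(round(pad_s * sr))          # 480 ms of zeros at each end
    win = int(round(win_s * sr))          # 960 ms segmentation window
    hop = int(round(hop_s * sr))          # 10 ms step length
    expanded = np.concatenate([np.zeros(pad), np.asarray(audio, float), np.zeros(pad)])
    sub_audios = [expanded[s:s + win]
                  for s in range(0, len(expanded) - win + 1, hop)]
    return np.stack(sub_audios)           # shape: (n_sub_audios, win)

# Row i approximates the original audio around the time point i * 10 ms.
```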
Step 14: respectively calculate the time-frequency feature sequence of each sub-audio.
For human-voice recognition, the vibration of the human vocal cords and the opening and closing of the oral cavity follow universal laws, and these laws are directly reflected in the frequency-domain characteristics of the sound. The embodiment of the present invention uses the Mel-frequency cepstral coefficient (MFCC), a spectral coefficient obtained by a linear transform of the log-energy spectrum on the non-linear mel scale of sound frequency; it matches the human ear's perception of sound frequency and can characterize the frequency-domain properties of the sound.
For each sub-audio obtained by segmentation, a preset window length and step length are used to compute the result of its short-time Fourier transform, and a Mel-frequency cepstral coefficient feature sequence is obtained. In the embodiment of the present invention a window length of 25 ms and a step length of 10 ms are used to compute the short-time Fourier transform and obtain the MFCC features.
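A minimal sketch of this feature extraction using librosa is shown below; the 25 ms window and 10 ms step follow the embodiment, while the 13 cepstral coefficients and the 16 kHz sampling rate are assumptions added for illustration.

```python
import numpy as np
import librosa

def sub_audio_mfcc(sub_audio, sr=16000, n_mfcc=13):
    """Compute the MFCC feature sequence of one 960 ms sub-audio using a
    25 ms analysis window and a 10 ms hop, as in the embodiment above."""
    n_fft = int(0.025 * sr)        # 25 ms STFT window
    hop_length = int(0.010 * sr)   # 10 ms step length
    mfcc = librosa.feature.mfcc(y=np.asarray(sub_audio, dtype=np.float32), sr=sr,
                                n_mfcc=n_mfcc, n_fft=n_fft,
                                hop_length=hop_length)
    return mfcc.T                  # shape: (n_frames, n_mfcc), a time-frequency map
```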
Step 15: the neural network obtains, from the time-frequency feature sequence, the probability that the sub-audio belongs to a specific category.
The Mel-frequency cepstral coefficient feature sequence is input into the trained neural network model, and the probability corresponding to each audio segment output by the model is obtained. In this embodiment, the audio segments are fed into the trained neural network model in chronological order, and the trained model predicts the probability corresponding to each audio segment. The value of this probability can range from 0 to 1.
For example, as shown in FIG. 3, the trained neural network model uses 3x3 convolution kernels and pooling layers to simplify the model parameters. Training of the neural network includes two stages: pre-training and fine-tuning. The figure on the left shows the 500-class classification model: a 500-class audio classification model is first trained on a sound dataset. The figure on the right shows the two-class model: this network reuses the underlying network structure and parameters of the 500-class model and is made to converge with the back-propagation algorithm. The binary classification model is used to identify whether a human voice is present in an audio segment, and the model outputs the probability that the current audio segment contains a human voice. By introducing the two stages of pre-training and fine-tuning, the network trained by the present invention is more focused on the human-voice / non-human-voice classification scenario, and the performance of the model is improved.
As shown in FIG. 3, when performing human-voice recognition on audio, the trained convolutional neural network (CNN) of the embodiment of the present invention consists of an input layer, an output layer and multiple hidden layers, where the hidden layers are mainly composed of a series of convolutional layers, pooling layers and fully connected layers.
A convolutional layer generally defines a convolution kernel. The size of the convolution kernel represents the receptive field of the layer: different convolution kernels are slid over the input feature map and a dot product is computed with the feature map, projecting the information within the receptive field onto an element of the next layer and thereby enriching the information. In general, the convolution kernel is much smaller than the input feature map and acts on the input feature map with overlap or in parallel.
The pooling layer is in fact a non-linear form of downsampling; there are many different non-linear pooling functions, such as max pooling and mean pooling. Generally, pooling layers are periodically inserted between the convolutional layers in a CNN structure.
The fully connected layer fuses the high-level feature information abstracted by the convolutional and pooling layers and finally performs the classification.
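The PyTorch sketch below shows one possible shape of such a binary voice classifier; it mirrors the 3x3-convolution-plus-pooling structure described above, but the number of layers, the channel widths and the sigmoid output head are illustrative assumptions, not the patented network. In the two-stage scheme, the convolutional trunk would first be trained on the 500-class task and then fine-tuned with this binary head.

```python
import torch
import torch.nn as nn

class VoiceCNN(nn.Module):
    """Sketch of a voice / non-voice classifier: a stack of 3x3 convolutions
    and pooling layers over the MFCC time-frequency map, followed by fully
    connected layers. Channel sizes are illustrative only."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),   # probability of human voice
        )

    def forward(self, x):          # x: (batch, 1, n_frames, n_mfcc)
        return self.classifier(self.features(x)).squeeze(-1)
```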
Step 16: compare the probabilities with the decision threshold to decide whether each sub-audio belongs to the specific category.
The decision threshold is set as the basis for deciding whether the audio is a human voice: if the probability is greater than the threshold, the sub-audio is judged to be human voice; if the probability is less than the threshold, it is judged to be non-human voice.
After the above steps, the original audio a has been divided into vocal and non-vocal segments. By accumulating the durations of all vocal segments, the duration of the human voice in the original audio, i.e. the user's speaking-duration information, is obtained. Furthermore, the original audio can be segmented according to the information of each speaking segment, which facilitates the subsequent output of the vocal audio, for example for evaluating learning status.
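A minimal sketch of this decision and duration-accumulation step is given below; the 0.5 threshold and the 10 ms step are assumed values, since the patent leaves the threshold unspecified.

```python
import numpy as np

def speaking_duration(probs, threshold=0.5, step_s=0.01):
    """Decide voice / non-voice per 10 ms time point and accumulate the
    total speaking duration in seconds."""
    is_voice = np.asarray(probs) > threshold
    return float(is_voice.sum()) * step_s
```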
In a further embodiment of the present invention, after the neural network obtains in step 15 the probability that a sub-audio belongs to a specific category according to the time-frequency feature sequence, the obtained probability values may also be preprocessed with the following methods in order to optimize them.
1) Moving-average preprocessing of the currently obtained probabilities.
Because of the segmentation granularity and of noise, the human-voice probability array of the original audio obtained by the method described above contains noisy points. This is visible in the 200-millisecond human-voice probability distribution shown in FIG. 4, where the ordinate represents the probability that the audio point is a human voice, the abscissa represents time, and each point represents 10 ms. There are many abrupt changes in probability, i.e. glitches, in the 0-1 probability values along the time axis. It is therefore necessary to perform moving-average preprocessing on the currently obtained probabilities so that the probability distribution becomes smoother, obtaining the 200-millisecond human-voice probability distribution shown in FIG. 5.
The moving-average preprocessing uses median sliding filtering; the probability that the i-th sub-audio is a human voice after the filtering is:
p′ i = (p Lo + p Lo+1 + ... + p Hi ) / (Hi - Lo + 1)
where the human-voice probability array of all sub-audios of the original audio is
P = {p 1 , p 2 , p 3 , ..., p i , ..., p n }, where n is the total number of sub-audios obtained by segmenting the original audio and p i represents the probability that the i-th sub-audio is a human voice.
w_smooth is the selected window size. For example, in this embodiment the window is chosen as 31, i.e. the window covers 31 values of the sub-audio human-voice probability array.
For p i , the upper and lower limit indices of the moving average are determined.
The lower-limit index is Lo = max(0, i-15), i.e. it never precedes the first probability value in the array;
the upper-limit index is Hi = min(n, i+15), i.e. it never exceeds the last probability value in the array.
In this embodiment, the median filtering takes the probability values of the 31 adjacent points, averages them, and uses the result as the probability value of the middle point; following this method, with a step length of 1, the probability value of every point is recalculated.
Comparing FIG. 4 and FIG. 5, it can be seen that after the moving average the glitches in the sub-audio human-voice probability curve are effectively corrected, which improves the accuracy of speaking-segment segmentation to a certain extent.
The above median filtering is one implementation of the present invention; the present invention does not preclude the use of other filtering methods.
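As an illustration, the sketch below applies such a sliding filter to the probability array. The 31-point window follows the embodiment; whether the window is averaged (as the text describes) or a true median is taken (as the method's name suggests) is selectable, since both are simple variants of the same windowing.

```python
import numpy as np

def smooth_probs(p, w_smooth=31, use_median=False):
    """Sliding filter over the per-time-point voice probabilities.
    For index i the window spans Lo = max(0, i-15) .. Hi = min(n, i+15)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    half = w_smooth // 2
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        window = p[lo:hi]
        out[i] = np.median(window) if use_median else window.mean()
    return out
```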
After the filtering preprocessing, a decision algorithm is used to distinguish human voice from non-human voice, determine the human-voice speaking segments, and count the user's speaking duration.
2) Energy-correction preprocessing.
After the moving-average preprocessing, because the embodiment of the present invention uses fine-grained sub-audio segmentation and because the sub-audios overlap substantially, the probabilities of a small portion of non-human-voice audio are corrected by the surrounding points during filtering so that they lean more toward human voice, i.e. their human-voice probability increases although they are in essence non-human voice.
To solve this problem, the embodiment of the present invention exploits the fact that the energy of noise or silence is relatively weaker than that of the human voice, and uses the energy of the original audio to further correct the human-voice probability and improve accuracy.
The moving-averaged audio human-voice probability array is:
P f = {p′ 1 , p′ 2 , p′ 3 , ..., p′ i , ..., p′ n }
With a window size of 10 ms and a step length of 10 ms, the energy array of the original audio is calculated:
P ower = {w 1 , w 2 , w 3 , ..., w i , ..., w n }
Since, in the embodiment described above, the original audio is sliced with a step length of 10 ms to obtain the sub-audios, giving human-voice probabilities at 10 ms intervals, the energy array of the original audio is also computed here with a step length of 10 ms, so that the time points of the energy array of the original audio correspond to the time points of the human-voice probability array of the original audio.
The values of the P ower array are normalized to between 0 and 1; an energy upper limit P up and an energy lower limit P down are determined, and then w i can be normalized as follows:
w′ i = 1, if w i > P up ; w′ i = (w i - P down ) / (P up - P down ), if P down ≤ w i ≤ P up ; w′ i = 0, if w i < P down .
It can be seen from the above formula that when the audio energy at a time point is greater than the energy upper limit P up, w i takes the value 1, and when the audio energy at a time point is less than the energy lower limit P down, w i takes the value 0, giving
P′ ower = {w′ 1 , w′ 2 , w′ 3 , ..., w′ i , ..., w′ n }
The corresponding values of the array P f and the array P′ ower are multiplied element-wise (a dot product of corresponding entries) to obtain the energy-corrected audio human-voice probability array P T. Through this operation, when the audio energy at a time point is greater than the energy upper limit P up, the human-voice probability at that time point is unchanged; when the audio energy at a time point is less than the energy lower limit P down, the human-voice probability at that time point takes the value 0.
In the embodiment, if the audio energy lies between the energy lower limit and the energy upper limit (inclusive of both limits), the obtained probability adjustment factor lies between 0 and 1; this adjustment factor adjusts the human-voice probability value at the corresponding time point, and the energy-corrected audio human-voice probability array P T is finally obtained.
It can be seen from the above that, by using the energy matrix of the original audio, if the audio energy at a time point is below the energy lower limit, the audio at that time point is considered non-human voice and its human-voice probability is set to zero; in this way part of the non-human-voice audio is further removed.
In the above embodiment, the obtained probabilities are first subjected to moving-average preprocessing and then to energy-correction preprocessing, and finally a decision algorithm is used to distinguish human voice from non-human voice, determine the human-voice speaking segments and count the user's speaking duration; the two preprocessing steps, energy correction and moving averaging, have no fixed order, and the energy-correction preprocessing may also be performed first, followed by the moving-average preprocessing.
The present invention may also adopt only one of the above two preprocessing methods to achieve the purpose of improving the accuracy of human-voice recognition.
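A compact sketch of this energy correction is given below. The frame energy is computed as the mean squared amplitude over 10 ms frames, the arrays are normalized to 0-1, and the factor is scaled linearly between the limits; the concrete limit values P_up and P_down are left as parameters and, like the linear scaling, are assumptions consistent with but not mandated by the description above.

```python
import numpy as np

def energy_correct(probs, audio, sr=16000, p_up=0.6, p_down=0.1, step_s=0.01):
    """Scale each 10 ms voice probability by an energy-based factor in [0, 1]:
    1 above the upper limit, 0 below the lower limit, linear in between."""
    hop = int(step_s * sr)
    audio = np.asarray(audio, dtype=float)
    n = len(probs)
    frames = [audio[i * hop:(i + 1) * hop] for i in range(n)]
    energy = np.array([np.mean(f ** 2) if len(f) else 0.0 for f in frames])
    energy = energy / (energy.max() + 1e-12)          # normalize to 0..1
    factor = np.clip((energy - p_down) / (p_up - p_down), 0.0, 1.0)
    return np.asarray(probs, dtype=float) * factor    # energy-corrected P_T
```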
As a further optimization of the above embodiments, before counting the human-voice duration or outputting the human-voice audio, the human-voice audio obtained by the above method may also be subjected to tolerant merging.
Specifically, considering the continuity of human speech, especially in online learning scenarios for children and teenagers, there are often short pauses between the words of a sentence expressing a complete meaning, usually for breathing or for expressing some emotion. In this embodiment, when segmenting and counting speaking segments according to the final audio human-voice probability array, the recognition result obtained by the steps described above is not followed strictly; instead a certain tolerance is allowed in order to maintain the continuity of the speech segments. Such segmentation not only provides good accuracy, but also provides teachers with evaluation corpus of higher content quality, making it easier for teachers to assess students' learning outcomes.
The tolerant merging method is as follows.
A decision threshold is set as the basis for deciding whether the audio is a human voice.
On the basis of the above embodiments, the final probability array P T is combined with the decision threshold to obtain, for each time point of the original audio, a decision as to whether it is a human voice: if the value P T i at time point i is not less than the decision threshold, the audio at time point i of the original audio is a human voice; otherwise, it corresponds to non-human voice.
Through the above steps, the original audio a is divided into vocal and non-vocal segments. If, between two sub-audios judged to be human voice, the number of intervening sub-audios judged to be non-human voice is less than a third threshold, the audio between the center time points of the two sub-audios judged to be human voice is further acquired.
Specifically, as shown in FIG. 6, if the original audio contains two human-voice segments a i and a i+1 whose start and end time points are (t start i , t end i ) and (t start i+1 , t end i+1 ) respectively, and the gap t start i+1 - t end i is less than the third threshold, the two segments are merged into one. In this embodiment the third threshold takes the value of 500 milliseconds, which is only exemplary and is not limited by the present invention.
After the tolerant merging, the user's speaking-duration information obtained by accumulating the durations of all segments is more reasonable than the speaking-duration information obtained without the tolerant merging, and the resulting human-voice audio preserves the continuity of the speech segments.
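The sketch below illustrates this merging rule on a list of (start, end) voice segments given in seconds; the 0.5 s gap corresponds to the exemplary third threshold.

```python
def tolerant_merge(segments, max_gap_s=0.5):
    """Merge consecutive voice segments whose gap is smaller than max_gap_s.
    `segments` is a list of (start, end) tuples sorted by start time."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < max_gap_s:
            merged[-1] = (merged[-1][0], end)   # absorb the short pause
        else:
            merged.append((start, end))
    return merged

# Example: [(0.0, 1.2), (1.5, 2.0), (3.5, 4.0)] -> [(0.0, 2.0), (3.5, 4.0)]
```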
In the embodiment described above, step 12 specifically adds null data of equal duration before the head and after the tail of the original audio, for example 480 milliseconds at each end, and step 13 uses a window of twice 480 milliseconds, i.e. 960 milliseconds, to segment the original audio into multiple sub-audios.
In other embodiments of the present invention, the durations of the null data added before the head and after the tail of the original audio may be unequal. That is, null data of a first duration is added before the head of the original audio and null data of a second duration is added after the tail of the original audio, and the third duration, equal to the sum of the first duration and the second duration, is used as the segmentation window to segment the original audio into sub-audios.
For example, the first duration is 240 milliseconds and the second duration is 720 milliseconds, so the segmentation window, the sum of the first and second durations, is 960 milliseconds. It can be seen that the duration of the sub-audios obtained in this way is the same as in the above embodiment, still 960 ms.
With this segmentation, the calculated human-voice probability of a sub-audio is approximated as the human-voice probability at the 1/4 time point of the sub-audio. Assuming that the start time and end time of a certain sub-audio are denoted t i and t i +0.96 s in the original audio, the human-voice probability of the sub-audio is approximated as the human-voice probability at time t i +0.24 s within the sub-audio. Further, when outputting the human-voice audio, the audio segment composed of the 1/4 time points of the sub-audios consecutively judged to be human voice is obtained. Since the sub-audios are obtained by segmenting the original audio with the first step length, the 1/4 time points of adjacent sub-audios are separated by the first step length, for example the 10 ms used in the above embodiment.
The human-voice probability array of the resulting sub-audios may be filtered using the same method as described above.
When performing the audio-energy-correction preprocessing on the human-voice probability array of the obtained sub-audios, a preferred way is to calculate the energy value at the 1/4 time point of the sub-audio. For example, assuming that the start time and end time of a certain sub-audio are denoted t i and t i +0.96 s in the original audio, the energy value at time t i +0.24 s is calculated, and the probability correction factor of the sub-audio (t i , t i +0.96 s) is obtained from this energy value.
Corresponding to the foregoing embodiments of the application-function implementation method, the present application further provides an audio recognition device. The device includes:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above. With regard to the device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
Those skilled in the art will also appreciate that the various exemplary logical blocks, modules, circuits and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions and operations of systems and methods according to multiple embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment or portion of code, which contains one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present application have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

  1. An audio recognition method, characterized by comprising:
    obtaining original audio, adding null data of a first duration before the head of the original audio, and adding null data of a second duration after the tail of the original audio, to obtain expanded audio;
    taking a third duration, equal to the sum of the first duration and the second duration, as a segmentation window and, starting from the head of the expanded audio, sliding the window with a first step length to obtain multiple sub-audios;
    respectively calculating a time-frequency feature sequence of each sub-audio;
    obtaining, by a neural network and according to the time-frequency feature sequence, a probability that the sub-audio belongs to a specific category;
    comparing the probability with a decision threshold to decide whether the sub-audio belongs to the specific category.
  2. The method according to claim 1, characterized in that
    the time-frequency feature sequence of the sub-audio is a Mel-frequency cepstral coefficient feature sequence;
    and the neural network obtains a human-voice probability of the sub-audio according to the Mel-frequency cepstral coefficient feature sequence;
    and the human-voice probability is compared with the decision threshold to decide whether the sub-audio is a human voice.
  3. The method according to claim 2, characterized in that, after the human-voice probability of the sub-audios is obtained, the method further comprises:
    obtaining an array of the probabilities that all sub-audios of the original audio are human voice;
    filtering the probability values in the array with a first number of values as a window, to obtain filtered probabilities.
  4. The method according to claim 3, characterized in that median filtering is used to filter the array of human-voice probabilities.
  5. The method according to claim 2 or 3, characterized in that deciding, according to a preset rule, whether a sub-audio is a human voice according to the human-voice probability comprises:
    obtaining an audio energy value at a determined time point of the sub-audio in the original audio; and setting a human-voice probability adjustment factor according to the energy value, including:
    if the energy value is greater than an energy upper limit, setting the human-voice probability adjustment factor of the sub-audio to 1;
    if the energy value is less than an energy lower limit, setting the human-voice probability adjustment factor of the sub-audio to 0;
    if the energy value is neither greater than the energy upper limit nor less than the energy lower limit, normalizing the human-voice probability adjustment factor to a value between 0 and 1 according to the energy value;
    multiplying the human-voice probability adjustment factor of the sub-audio by the human-voice probability of the sub-audio to obtain a corrected human-voice probability of the sub-audio;
    and comparing the corrected human-voice probability of the sub-audio with the decision threshold to decide whether the sub-audio is a human voice.
  6. The method according to claim 5, characterized in that the method further comprises:
    acquiring sub-audios of the original audio that are consecutively judged to be human voice;
    acquiring an audio segment composed of the determined time points of the sub-audios consecutively judged to be human voice;
    outputting the audio segment.
  7. The method according to claim 6, characterized in that:
    the first duration is equal to the second duration;
    and the audio energy value at the determined time point of the sub-audio is specifically the audio energy value at the center time point of the sub-audio;
    and the acquiring of the audio segment composed of the determined time points of the sub-audios consecutively judged to be human voice is specifically acquiring the audio segment composed of the center time points of the sub-audios consecutively judged to be human voice.
  8. The method according to claim 6 or 7, characterized in that, before outputting the audio, the method further comprises:
    if the time interval between adjacent audio segments is less than a third threshold, acquiring the audio segment between the adjacent audio segments.
  9. The method according to claim 8, characterized by further comprising:
    counting the duration of the output audio.
  10. An audio recognition device, characterized by comprising:
    a processor; and
    a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method according to any one of claims 1-9.
PCT/CN2021/130304 2020-11-12 2021-11-12 Audio recognition method and device WO2022100691A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011260050.5 2020-11-12
CN202011260050.5A CN112270933B (zh) 2020-11-12 2020-11-12 一种音频识别方法和装置

Publications (1)

Publication Number Publication Date
WO2022100691A1 true WO2022100691A1 (zh) 2022-05-19

Family

ID=74339924

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/130304 WO2022100691A1 (zh) 2020-11-12 2021-11-12 音频识别方法和装置

Country Status (2)

Country Link
CN (1) CN112270933B (zh)
WO (1) WO2022100691A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579022A (zh) * 2022-12-09 2023-01-06 南方电网数字电网研究院有限公司 叠音检测方法、装置、计算机设备和存储介质
CN115840877A (zh) * 2022-12-06 2023-03-24 中国科学院空间应用工程与技术中心 Mfcc提取的分布式流处理方法、***、存储介质及计算机

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270933B (zh) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 一种音频识别方法和装置
CN113593603A (zh) * 2021-07-27 2021-11-02 浙江大华技术股份有限公司 音频类别的确定方法、装置、存储介质及电子装置

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030088411A1 (en) * 2001-11-05 2003-05-08 Changxue Ma Speech recognition by dynamical noise model adaptation
CN1666252A (zh) * 2002-07-08 2005-09-07 里昂中央理工学院 为声音信号分配声级的方法和装置
CN108288465A (zh) * 2018-01-29 2018-07-17 中译语通科技股份有限公司 智能语音切轴的方法、信息数据处理终端、计算机程序
CN109712641A (zh) * 2018-12-24 2019-05-03 重庆第二师范学院 一种基于支持向量机的音频分类和分段的处理方法
CN110085251A (zh) * 2019-04-26 2019-08-02 腾讯音乐娱乐科技(深圳)有限公司 人声提取方法、人声提取装置及相关产品
CN110349597A (zh) * 2019-07-03 2019-10-18 山东师范大学 一种语音检测方法及装置
CN110782920A (zh) * 2019-11-05 2020-02-11 广州虎牙科技有限公司 音频识别方法、装置及数据处理设备
US20200074997A1 (en) * 2018-08-31 2020-03-05 CloudMinds Technology, Inc. Method and system for detecting voice activity in noisy conditions
CN111145763A (zh) * 2019-12-17 2020-05-12 厦门快商通科技股份有限公司 一种基于gru的音频中的人声识别方法及***
CN111613213A (zh) * 2020-04-29 2020-09-01 广州三人行壹佰教育科技有限公司 音频分类的方法、装置、设备以及存储介质
CN111883182A (zh) * 2020-07-24 2020-11-03 平安科技(深圳)有限公司 人声检测方法、装置、设备及存储介质
CN112270933A (zh) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 一种音频识别方法和装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610707B (zh) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 一种声纹识别方法及装置
CN107527626A (zh) * 2017-08-30 2017-12-29 北京嘉楠捷思信息技术有限公司 一种音频识别***
CN109859771B (zh) * 2019-01-15 2021-03-30 华南理工大学 一种联合优化深层变换特征与聚类过程的声场景聚类方法
CN110782872A (zh) * 2019-11-11 2020-02-11 复旦大学 基于深度卷积循环神经网络的语种识别方法及装置
CN110827798B (zh) * 2019-11-12 2020-09-11 广州欢聊网络科技有限公司 一种音频信号处理的方法及装置


Also Published As

Publication number Publication date
CN112270933A (zh) 2021-01-26
CN112270933B (zh) 2024-03-12


Legal Events

Date Code Title Description
121 — Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21891211; Country of ref document: EP; Kind code of ref document: A1)
NENP — Non-entry into the national phase (Ref country code: DE)
122 — Ep: pct application non-entry in european phase (Ref document number: 21891211; Country of ref document: EP; Kind code of ref document: A1)