CN113593523B - Speech detection method and device based on artificial intelligence and electronic equipment - Google Patents

Speech detection method and device based on artificial intelligence and electronic equipment

Info

Publication number
CN113593523B
CN113593523B (granted publication); CN202110074985A (application)
Authority
CN
China
Prior art keywords
classification
voice
pronunciation
network
full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110074985.2A
Other languages
Chinese (zh)
Other versions
CN113593523A (en)
Inventor
林炳怀
王丽园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110074985.2A priority Critical patent/CN113593523B/en
Publication of CN113593523A publication Critical patent/CN113593523A/en
Application granted granted Critical
Publication of CN113593523B publication Critical patent/CN113593523B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L 15/005: Speech recognition; Language recognition
    • G10L 15/02: Speech recognition; Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Speech recognition; Segmentation; Word boundary detection
    • G10L 15/08: Speech recognition; Speech classification or search
    • G10L 15/16: Speech recognition; Speech classification or search using artificial neural networks
    • G10L 25/24: Speech or voice analysis techniques characterised by the type of extracted parameters; the extracted parameters being the cepstrum
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique; using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application provides an artificial intelligence-based voice detection method and apparatus, an electronic device and a computer-readable storage medium. The method comprises the following steps: dividing an audio signal into a plurality of pronunciation segments and acquiring the audio feature of each pronunciation segment; performing voice classification processing on each pronunciation segment based on its audio feature to obtain a voice classification result of each pronunciation segment; performing language classification processing on each pronunciation segment based on its audio feature to obtain a language classification result of each pronunciation segment; and determining the voice classification result of the audio signal based on the voice classification results of the pronunciation segments, and determining the language classification result of the audio signal based on the language classification results of the pronunciation segments. The method and apparatus improve the real-time performance and accuracy of voice recognition.

Description

Speech detection method and device based on artificial intelligence and electronic equipment
Technical Field
The present application relates to an artificial intelligence technology, and in particular, to a voice detection method, apparatus, electronic device and computer readable storage medium based on artificial intelligence.
Background
Artificial intelligence (AI, Artificial Intelligence) refers to the theories, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
More and more artificial intelligence products provide a voice interaction function, which can be applied to various voice scoring systems, such as encyclopedia question-answering systems, language testing systems of language education applications, spoken language examination systems, intelligent assistant control systems, and voice input and voice control systems embedded in clients. Abnormal voice of various kinds easily occurs while the voice interaction function is used, which affects the real-time performance and accuracy of the voice interaction.
Disclosure of Invention
The embodiments of the present application provide an artificial intelligence-based voice detection method, a voice detection apparatus, an electronic device and a computer-readable storage medium, which can improve the real-time performance and accuracy of voice recognition.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a voice detection method based on artificial intelligence, which comprises the following steps:
Dividing an audio signal into a plurality of pronunciation fragments, and acquiring the audio characteristics of each pronunciation fragment;
Based on the audio characteristics of each pronunciation fragment, carrying out voice classification processing on each pronunciation fragment to obtain a voice classification result of each pronunciation fragment;
based on the audio characteristics of each pronunciation segment, carrying out language classification processing on each pronunciation segment to obtain a language classification result of each pronunciation segment;
And determining a voice classification result of the audio signal based on the voice classification result of each pronunciation segment, and determining a language classification result of the audio signal based on the language classification result of each pronunciation segment.
The embodiment of the application provides a voice detection device based on artificial intelligence, which comprises:
The acquisition module is used for dividing the audio signal into a plurality of pronunciation fragments and acquiring the audio characteristics of each pronunciation fragment;
The voice module is used for carrying out voice classification processing on each pronunciation fragment based on the audio characteristics of each pronunciation fragment to obtain a voice classification result of each pronunciation fragment;
the language module is used for carrying out language classification processing on each pronunciation fragment based on the audio characteristics of each pronunciation fragment to obtain a language classification result of each pronunciation fragment;
And the result module is used for determining the voice classification result of the audio signal based on the voice classification result of each pronunciation segment and determining the language classification result of the audio signal based on the language classification result of each pronunciation segment.
In the above solution, the obtaining module is further configured to: determining a speech energy for each audio frame in the audio signal; and combining a plurality of continuous audio frames with voice energy larger than background noise energy in the audio signal into a pronunciation fragment.
In the above solution, the obtaining module is further configured to: carrying out framing treatment on the audio signal to obtain a plurality of audio frames corresponding to the audio signal; performing feature extraction processing on each audio frame through an audio frame classification network to obtain audio frame classification features corresponding to each audio frame; wherein the audio frame classification feature comprises at least one of: log frame energy characteristics; zero crossing rate characteristics; normalizing the autocorrelation characteristics; performing classification processing on each audio frame based on the audio frame classification characteristics through the audio frame classification network, and combining a plurality of continuous audio frames with classification results of pronunciation data into pronunciation fragments; the training samples of the audio frame classification network comprise audio frame samples, and the labeling data of the training samples comprise pre-labeled classification results of the audio frame samples.
In the above scheme, the voice classification processing and the language classification processing are realized through a multi-classification task model, wherein the multi-classification task model comprises a voice classification network and a language classification network; the voice module is further used for: transmitting the audio characteristics of each pronunciation fragment in the voice classification network in a forward direction to obtain a voice classification result of each pronunciation fragment; the language module is further configured to: and carrying out forward transmission on the audio characteristics of each pronunciation segment in the language classification network to obtain the language classification result of each pronunciation segment.
In the above scheme, the voice module is further configured to: carrying out first full-connection processing on each pronunciation segment through a shared full-connection layer of the voice classification network and the language classification network to obtain a first full-connection processing result corresponding to each pronunciation segment; performing second full-connection processing on the first full-connection processing result of each pronunciation segment through a voice full-connection layer of the voice classification network to obtain a second full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the second full-connection processing result of each pronunciation fragment to obtain the probability of corresponding each voice classification label; determining the voice classification label with the highest probability as a voice classification result of each pronunciation fragment; the language module is further configured to: carrying out third full-connection processing on each pronunciation segment through a shared full-connection layer of the language classification network and the language classification network to obtain a third full-connection processing result corresponding to each pronunciation segment; performing fourth full-connection processing on the third full-connection processing result of each pronunciation segment through the language full-connection layer of the language classification network to obtain a fourth full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the fourth full-connection processing result of each pronunciation fragment to obtain the probability of classifying labels corresponding to each language; and determining the language classification label with the highest probability as the language classification result of each pronunciation segment.
In the above scheme, the audio feature of each pronunciation section is obtained through a shared feature network in the multi-classification task model; the acquisition module is further configured to: transforming the type of each pronunciation segment from a time domain signal to a frequency domain signal, and performing Mel calculation on each pronunciation segment transformed to the frequency domain signal to obtain the spectrum of Mel scale of each pronunciation segment; and forward transmitting the frequency spectrum of the Mel scale of each pronunciation fragment in the shared feature network to obtain the audio feature of each pronunciation fragment.
In the above scheme, the shared feature network includes N cascaded feature extraction networks, where N is an integer greater than or equal to 2; the acquisition module is further configured to: perform feature extraction processing on the input of the nth feature extraction network through the nth feature extraction network among the N cascaded feature extraction networks; and transmit the nth feature extraction result output by the nth feature extraction network to the (n+1)th feature extraction network to continue the feature extraction processing; wherein n is an integer whose value increases from 1 and satisfies 1 ≤ n ≤ N-1; when n is 1, the input of the nth feature extraction network is the Mel-scale spectrum of each pronunciation segment, and when 2 ≤ n ≤ N-1, the input of the nth feature extraction network is the feature extraction result of the (n-1)th feature extraction network.
In the above scheme, the nth feature extraction network includes a convolution layer, a normalization layer, a linear rectification layer, and an average pooling layer; the acquisition module is further configured to: carrying out convolution processing on the input of the nth characteristic extraction network and the convolution layer parameters of the convolution layer of the nth characteristic extraction network to obtain an nth convolution layer processing result; normalizing the processing result of the nth convolution layer through a normalization layer of the nth feature extraction network to obtain an nth normalization processing result; performing linear rectification processing on the nth normalization processing result through a linear rectification layer of the nth characteristic extraction network to obtain an nth linear rectification processing result; and carrying out average pooling treatment on the nth linear rectification treatment result through an average pooling layer of the nth characteristic extraction network to obtain an nth characteristic extraction result.
In the above scheme, the voice module is further configured to: based on the application scene of the audio signal, performing adaptation of a plurality of candidate classification processes; when the voice classification processing is adapted to the voice classification processing in the candidate classification processing, performing voice classification processing on each pronunciation segment to obtain a voice classification result of each pronunciation segment; the language module is further configured to: based on the application scene of the audio signal, performing adaptation of a plurality of candidate classification processes; and when the language classification processing is adapted to the plurality of candidate classification processing, carrying out language classification processing on each pronunciation segment to obtain a language classification result of each pronunciation segment.
In the above scheme, the voice module is further configured to: acquire a limiting condition of the application scene, so as to determine, among the plurality of candidate classification processes, the candidate classification process corresponding to the limiting condition as the classification process matched with the application scene; wherein the limiting condition includes at least one of: age; species; language; gender.
In the above scheme, the voice classification processing and the language classification processing are realized through a multi-classification task model, wherein the multi-classification task model comprises a shared feature network, a voice classification network and a language classification network; the apparatus further comprises: training module for: carrying out forward propagation and backward propagation on corpus samples in a training sample set in the shared feature network, the shared full-connection layer of the voice classification network and the language classification network and the full-connection layer corresponding to the shared feature network so as to update parameters of the shared feature network and the shared full-connection layer; and carrying out forward propagation and backward propagation on corpus samples in the training sample set in the updated shared feature network, the updated shared full-connection layer, the full-connection layer of the voice classification network and the full-connection layer of the language classification network so as to update parameters of the multi-classification task model.
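The patent gives no code; the following is a minimal PyTorch sketch of the two-stage training scheme described in the preceding paragraph. It assumes the shared feature network, the shared full-connection layer, a generic pre-training head (such as the 527-way head of fig. 6B) and the voice and language full-connection layers are ordinary torch.nn modules; the names, the Adam optimizer, the cross-entropy losses and the summed multi-task loss are illustrative assumptions rather than the patent's own implementation.

```python
import torch
import torch.nn as nn

def train_two_stage(shared_net, shared_fc, pretrain_head, voice_head, lang_head,
                    pretrain_loader, multitask_loader, lr=1e-4):
    ce = nn.CrossEntropyLoss()

    # Stage 1: propagate corpus samples forward and backward through the shared
    # feature network, the shared FC layer and a generic classification head,
    # updating the parameters of the shared feature network and shared FC layer.
    opt1 = torch.optim.Adam(list(shared_net.parameters())
                            + list(shared_fc.parameters())
                            + list(pretrain_head.parameters()), lr=lr)
    for mel, label in pretrain_loader:
        loss = ce(pretrain_head(shared_fc(shared_net(mel))), label)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: propagate corpus samples forward and backward through the updated
    # shared layers plus the voice and language FC layers, updating all
    # parameters of the multi-classification task model (summed task losses).
    opt2 = torch.optim.Adam(list(shared_net.parameters())
                            + list(shared_fc.parameters())
                            + list(voice_head.parameters())
                            + list(lang_head.parameters()), lr=lr)
    for mel, voice_label, lang_label in multitask_loader:
        h = shared_fc(shared_net(mel))
        loss = ce(voice_head(h), voice_label) + ce(lang_head(h), lang_label)
        opt2.zero_grad(); loss.backward(); opt2.step()
```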
In the above solution, the result module is further configured to: acquiring a first number of pronunciation fragments of which the voice classification result is non-voice and a second number of pronunciation fragments of which the voice classification result is voice; determining a voice classification result corresponding to a greater number of the first number and the second number as a voice classification result of the audio signal; acquiring the language classification result as the number of pronunciation fragments of each language; and determining the languages corresponding to the maximum number as language classification results of the audio signals.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the voice detection method based on artificial intelligence when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores executable instructions for realizing the artificial intelligence-based voice detection method provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
by extracting the characteristics of each pronunciation segment in the audio signal and carrying out the voice classification processing and the language classification processing respectively aiming at the extracted audio characteristics, various anomalies of the audio signal are accurately detected, and the voice recognition is more accurately realized.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence based speech detection system according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIGS. 3A-3D are schematic flow diagrams of an artificial intelligence based speech detection method according to an embodiment of the present application;
FIGS. 4A-4B are schematic interface diagrams of an artificial intelligence based speech detection method provided by an embodiment of the present application;
FIG. 5 is a flow chart of an artificial intelligence based speech detection method according to an embodiment of the present application;
FIG. 6A is a schematic diagram of a multi-classification task model of an artificial intelligence-based speech detection method according to an embodiment of the present application;
FIG. 6B is a schematic diagram of a basic classification model of an artificial intelligence-based speech detection method according to an embodiment of the present application;
fig. 7 is a schematic diagram of a data structure of an artificial intelligence-based voice detection method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms involved in the embodiments of the present application are explained; the following interpretations apply to these terms.
1) Speech recognition technology: automatic speech recognition (ASR, Automatic Speech Recognition) aims to convert the lexical content of human speech into computer-readable input, such as key presses, binary codes or character sequences.
2) Mel-frequency cepstral coefficient (MFCC, Mel-Frequency Cepstrum Coefficient): a cepstral parameter extracted in the Mel-scale frequency domain; the Mel scale describes the nonlinear frequency perception of the human ear, and a Mel spectrum is a spectrum whose frequency axis has been converted to the Mel scale.
3) Identity vector (I-Vector): a low-dimensional vector extracted from speech features that characterizes the variability of speaker information.
4) Voice endpoint detection (VAD, Voice Activity Detection): detection of the voiced segments and silent segments of an audio signal.
5) Fully connected layer (FC, Full Connection): a layer that integrates the locally class-discriminative information of a convolutional layer or a pooling layer.
One typical application of the voice interaction function in the related art is the spoken language evaluation scenario. Spoken language evaluation is the process of evaluating a speaker's speech: speech recognition is performed first, and the evaluation is made based on features such as the pronunciation confidence extracted during recognition. To improve the accuracy of the evaluation, the language used for speech recognition must be consistent with the language to be evaluated; for example, for Chinese spoken language evaluation, the speech recognition engine is a Chinese recognition engine. However, in the embodiments of the present application it is found that the situations encountered in spoken language evaluation are varied: the speaker may not speak the language being evaluated (for example, Chinese is evaluated but the speaker speaks English), or non-human sounds such as animal sounds, table-knocking sounds or keyboard sounds may be recorded at random and submitted for evaluation. Such abnormal situations reduce the robustness of spoken language evaluation, so anomaly detection needs to be performed on the audio signal before evaluation, in order to reduce the influence of abnormal audio signals on the evaluation accuracy.
In the related art, the process of distinguishing languages and the process of distinguishing non-human voice are independent of each other. When implementing the embodiments of the present application, the applicant found that, in scenarios using the voice interaction function such as the spoken language evaluation scenario, both an audio signal whose language does not conform to the requirement and an audio signal that is non-human voice are abnormal situations, and both affect the accuracy and real-time performance of the voice interaction. Language discrimination is applied to the human-voice category, which is one of the categories of the human/non-human voice discrimination, so the two tasks are related; performing only non-human-voice discrimination or only language discrimination cannot effectively detect the full range of abnormal situations.
The embodiments of the present application provide an artificial intelligence-based voice detection method and apparatus, an electronic device and a computer-readable storage medium, which combine the language classification task and the human/non-human voice classification task: audio features effective for both tasks are extracted, the two tasks are optimized simultaneously based on multi-task learning, and the language classification result and the voice classification result are output at the same time, thereby improving the accuracy and real-time performance of the voice interaction. An exemplary application in which the device is implemented as a terminal is described below.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an artificial intelligence-based voice detection system according to an embodiment of the present application. To support a spoken language evaluation application, a terminal 400 is connected to a server 200 through a network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two. The server 200 receives the audio signal of a user answering a question transmitted from the terminal 400 and performs the voice classification processing and the language classification processing on the audio signal simultaneously; when at least one of the voice classification result and the language classification result is abnormal, the server 200 returns the abnormal classification result to the terminal 400 for display.
In some embodiments, the server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In some embodiments, in the spoken language evaluation scenario, the audio signal to be classified is the audio signal of a user answering a question. In response to a voice collection operation of the user, the terminal 400 receives the audio signal of the user for a follow-up reading question type, for which the language set by the user is English. The terminal 400 sends the audio signal (i.e., the user's answer to the question) to the server 200, and the server 200 performs the voice classification processing and the language classification processing on the audio signal. When the voice classification result is non-human voice, or the language classification result is not English, the classification result representing that the audio signal is abnormal (non-human voice, or human voice but not English) is returned to the terminal 400 to prompt the user to answer again.
In some embodiments, in the intelligent voice assistant scenario, the audio signal to be classified is the audio signal of the user waking up the intelligent voice assistant. In response to the voice collection operation of the user, the terminal 400 receives the audio signal of the user waking up the intelligent voice assistant and sends it to the server 200, and the server 200 performs the voice classification processing and the language classification processing on the audio signal. When the voice classification result is human voice and the language classification result is English, the virtual image of the intelligent voice assistant corresponding to the classification result is returned and presented on the terminal 400, and the intelligent voice assistant is controlled to interact with the user in English speech.
In some embodiments, in the voice input scenario, the audio signal to be classified is an audio signal input by the user, and the voice input language set on the terminal 400 is Chinese. In response to the voice collection operation of the user, the terminal 400 receives the audio signal input by the user and sends it to the server 200, and the server 200 performs the voice classification processing and the language classification processing on the audio signal. When the voice classification result is non-human voice or the language classification result is not Chinese, the classification result representing that the audio signal is abnormal (non-human voice, or human voice but not Chinese) is returned to the terminal 400 to prompt the user to perform voice input again, thereby completing the voice input process.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Taking the electronic device being a server 200 as an example, the server 200 shown in fig. 2 includes: at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230. The various components in the server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable connection and communication between these components. In addition to the data bus, the bus system 240 includes a power bus, a control bus, and a status signal bus; however, for clarity of illustration, the various buses are labeled as the bus system 240 in fig. 2.
The processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual displays, that enable presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 250 optionally includes one or more storage devices physically located remote from processor 210.
Memory 250 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 250 described in the embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 251 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
A network communication module 252 for reaching other computing devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include: Bluetooth, wireless fidelity (WiFi), universal serial bus (USB, Universal Serial Bus), and the like;
a presentation module 253 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;
In some embodiments, the artificial intelligence-based voice detection apparatus provided in the embodiments of the present application may be implemented in software. Fig. 2 shows the artificial intelligence-based voice detection apparatus 255 stored in the memory 250, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: the acquisition module 2551, the voice module 2552, the language module 2553, the result module 2554, and the training module 2555. These modules are logical, so they may be arbitrarily combined or further split according to the functions implemented; the functions of the respective modules are described below.
The artificial intelligence-based voice detection method provided by the embodiment of the present application will be described in connection with exemplary applications and implementations of the server 200 provided by the embodiment of the present application.
Referring to fig. 6A, fig. 6A is a schematic structural diagram of a multi-classification task model of the artificial intelligence-based voice detection method according to an embodiment of the present application. The multi-classification task model includes a shared feature network, a voice classification network and a language classification network. The shared feature network is used for feature extraction: its input is the Mel spectrum obtained from the audio signal, and its output is the audio feature of the audio signal. Full-connection processing is first performed on the audio feature through the full-connection layer shared by the voice classification network and the language classification network, and then through the full-connection layer corresponding to each of the voice classification network and the language classification network, so as to obtain the voice classification result and the language classification result, respectively. The voice classification network thus includes the shared full-connection layer and the voice full-connection layer corresponding to the voice classification network.
Referring to fig. 6B, fig. 6B is a schematic structural diagram of a basic classification model of the artificial intelligence-based voice detection method according to an embodiment of the present application. The basic classification model includes a plurality of feature extraction networks, a shared full-connection layer (FC 2048 with a linear rectification function), and a full-connection layer corresponding to 527 categories (FC 527 with a sigmoid activation function). Each feature extraction network includes a convolution layer (for example, a 3x3@64 convolution layer), a normalization layer, a linear rectification layer and an average pooling layer. The shared full-connection layer is the full-connection layer shared by the voice classification network and the language classification network, and the plurality of feature extraction networks are combined into the shared feature network. The full-connection layer corresponding to the 527 categories can directly output 527 classification results and is used to train the basic classification model.
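As a rough illustration of the structure in figs. 6A and 6B, a minimal PyTorch sketch of the multi-classification task model is given below. The class name, the 256-dimensional output of the shared feature network, the three-language head and the treatment of the 527-way head as a pre-training-only head are assumptions; only the 2048-node shared FC layer and the two task-specific heads follow the figures directly.

```python
import torch
import torch.nn as nn

class MultiTaskSpeechModel(nn.Module):
    """Shared feature network + shared FC layer + task-specific FC heads (fig. 6A)."""
    def __init__(self, shared_net, shared_dim=256, feat_dim=2048, n_langs=3):
        super().__init__()
        self.shared_net = shared_net                  # N cascaded feature extraction networks
        # Shared fully connected layer: FC 2048 followed by linear rectification (fig. 6B).
        self.shared_fc = nn.Sequential(nn.Linear(shared_dim, feat_dim), nn.ReLU())
        self.voice_fc = nn.Linear(feat_dim, 2)        # human voice / non-human voice
        self.lang_fc = nn.Linear(feat_dim, n_langs)   # e.g. English / Chinese / Japanese
        self.pretrain_fc = nn.Linear(feat_dim, 527)   # 527-way head of the basic model

    def forward(self, mel):                           # mel: (batch, 1, mel_bins, frames)
        h = self.shared_fc(self.shared_net(mel))
        return self.voice_fc(h), self.lang_fc(h)      # logits for the two tasks
```

A sketch of the shared feature network itself is given after the description of the feature extraction networks below.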
Referring to fig. 3A, fig. 3A is a schematic flow chart of an artificial intelligence-based voice detection method according to an embodiment of the present application, and will be described with reference to steps 101 to 104 shown in fig. 3A.
In step 101, the audio signal is divided into a plurality of sound fragments, and the audio feature of each sound fragment is acquired.
As an example, in the spoken language evaluation scenario the audio signal comes from collecting the audio content of the user answering questions; in the intelligent assistant scenario it comes from collecting audio content that carries user instructions; and in the voice input scenario it comes from collecting audio content that carries text input by the user.
In some embodiments, the audio signal is divided into a plurality of pronunciation fragments in step 101, which may be implemented by the following technical scheme: determining a speech energy for each audio frame in the audio signal; a plurality of consecutive audio frames of the audio signal having speech energy greater than background noise energy are combined into a voicing segment.
As an example, the strength of the audio signal is detected based on an energy criterion, where an audio frame is determined to be speech-present when speech energy of the audio frame is greater than background noise energy in the audio signal, and where the audio frame is determined to be speech-absent when speech energy of the audio frame is not greater than background noise energy in the audio signal, e.g., the audio frame is background noise.
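A minimal NumPy sketch of this energy criterion is shown below; the frame length, hop length and the assumption that the leading frames contain only background noise are illustrative choices, not taken from the patent.

```python
import numpy as np

def split_voicing_segments(signal, sr, frame_ms=25, hop_ms=10, noise_frames=10):
    """Merge consecutive frames whose energy exceeds the background-noise
    energy into pronunciation segments (illustrative energy-criterion sketch)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
    energy = np.array([float(np.sum(f.astype(np.float64) ** 2)) for f in frames])
    noise_energy = energy[:noise_frames].mean()       # assume leading frames are noise
    is_speech = energy > noise_energy
    segments, start = [], None
    for i, speech in enumerate(is_speech):
        if speech and start is None:
            start = i
        elif not speech and start is not None:
            segments.append(signal[start * hop : i * hop + frame])
            start = None
    if start is not None:
        segments.append(signal[start * hop:])
    return segments
```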
In some embodiments, referring to fig. 3B, fig. 3B is a schematic flow chart of a voice detection method based on artificial intelligence according to an embodiment of the present application, and in step 101, the audio signal is divided into a plurality of pronunciation fragments, which can be implemented in steps 1011-1013.
In step 1011, the audio signal is subjected to framing processing, so as to obtain a plurality of audio frames corresponding to the audio signal.
In step 1012, feature extraction processing is performed on each audio frame through the audio frame classification network, so as to obtain an audio frame classification feature corresponding to each audio frame.
In step 1013, performing a classification process based on the classification characteristics of the audio frames on each audio frame through the audio frame classification network, and combining a plurality of consecutive audio frames whose classification result is pronunciation data into a pronunciation fragment;
as an example, the audio frame classification features include at least one of: log frame energy characteristics; zero crossing rate characteristics; the autocorrelation characteristics are normalized. The training samples of the audio frame classification network comprise audio frame samples, and the labeling data of the training samples comprise pre-labeled classification results of the audio frame samples.
As an example, the audio signal is subjected to framing processing, and audio frame classification features are extracted from the data of each audio frame. The audio frame classification network is trained on a set of audio frames with known speech regions and silence regions, and unknown audio frames are then classified by the trained network to determine whether they belong to a speech signal or a silence signal, so that the network divides the audio signal into voiced segments and unvoiced segments. The audio signal is first passed through a high-pass filter to remove the direct-current offset component and low-frequency noise. Before feature extraction, the audio signal is divided into frames of 20 to 40 milliseconds (ms) in length with an overlap of 10 ms between adjacent frames. After framing, at least one of the following three features is extracted for each audio frame: the log frame energy feature, the zero-crossing rate feature, and the normalized autocorrelation feature. Combining multiple features effectively reduces the probability of misclassifying an audio frame, which further improves the accuracy of voice recognition.
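The three frame-level features can be computed, for example, as follows; the use of the lag-1 normalized autocorrelation is one common choice and is an assumption here rather than something the patent specifies.

```python
import numpy as np

def frame_classification_features(frame, eps=1e-10):
    """Per-frame features for the audio frame classification network:
    log frame energy, zero-crossing rate and normalized autocorrelation."""
    frame = frame.astype(np.float64)
    log_energy = np.log(np.sum(frame ** 2) + eps)
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    # Normalized autocorrelation at lag 1 (assumed lag).
    num = np.sum(frame[1:] * frame[:-1])
    den = np.sqrt(np.sum(frame[1:] ** 2) * np.sum(frame[:-1] ** 2)) + eps
    return np.array([log_energy, zcr, num / den])
```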
In some embodiments, referring to fig. 3C, fig. 3C is a schematic flow chart of an artificial intelligence-based voice detection method according to an embodiment of the present application, and the step 101 of obtaining the audio feature of each pronunciation segment may be implemented by steps 1014-1015.
In step 1014, the type of each voicing segment is transformed from the time domain signal to the frequency domain signal, and the mel calculation is performed on each voicing segment transformed to the frequency domain signal, so as to obtain the mel scale spectrum of each voicing segment.
In step 1015, the spectrum of the mel scale of each pronunciation segment is transmitted forward in the shared feature network to obtain the audio feature corresponding to each pronunciation segment.
As an example, the audio feature of each pronunciation segment is obtained through the shared feature network in the multi-classification task model. The original audio signal is a waveform that changes over time and cannot be directly decomposed into basic signals, so it is transformed from the time domain to the frequency domain through a Fourier transform to obtain a spectrogram, whose horizontal axis is time and whose vertical axis is frequency. Humans do not perceive frequency linearly: they are better at perceiving differences between low frequencies than between high frequencies. To account for this, a Mel calculation is applied to the frequencies, i.e., each pronunciation segment transformed to the frequency domain is mapped to the Mel scale, so that the original audio signal is finally converted into a Mel-scale spectrum whose horizontal axis is time and whose vertical axis is Mel-scale frequency. The Mel-scale spectrum is used as the input of the multi-classification task model.
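A typical way to obtain such a Mel-scale (log-Mel) spectrum with librosa is sketched below; the sampling rate, FFT size, hop length and number of Mel bins are illustrative values, not values from the patent.

```python
import librosa
import numpy as np

def mel_spectrogram(segment, sr=16000, n_fft=400, hop_length=160, n_mels=64):
    """Short-time Fourier transform followed by a Mel filter bank, giving a
    Mel-scale spectrum with time on one axis and Mel frequency on the other."""
    mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel).astype(np.float32)   # shape (n_mels, frames)
```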
In some embodiments, the shared feature network comprises N cascaded feature extraction networks, N being an integer greater than or equal to 2. In step 1015, the forward transmission of the Mel-scale spectrum of each pronunciation segment through the shared feature network to obtain the audio feature of each pronunciation segment is implemented by the following technical scheme: performing feature extraction processing on the input of the nth feature extraction network through the nth feature extraction network among the N cascaded feature extraction networks, and transmitting the nth feature extraction result output by the nth feature extraction network to the (n+1)th feature extraction network to continue the feature extraction processing; wherein n is an integer whose value increases from 1 and satisfies 1 ≤ n ≤ N-1; when n is 1, the input of the nth feature extraction network is the Mel-scale spectrum of each pronunciation segment, and when 2 ≤ n ≤ N-1, the input of the nth feature extraction network is the feature extraction result of the (n-1)th feature extraction network.
As an example, referring to fig. 6B, the basic classification model includes a plurality of feature extraction networks, which constitute a shared feature network, a shared full-connection layer (FC 2048 and linear rectification function) which is a full-connection layer shared between the voice classification network and the language classification network, and full-connection layers (FC 527 and sigmoid activation function) corresponding to 527 categories. The input of the shared feature network is the frequency spectrum of the mel scale of each pronunciation section, and the output of the shared feature network is the audio feature.
In some embodiments, the nth feature extraction network includes a convolution layer, a normalization layer, a linear rectification layer, and an average pooling layer. The feature extraction is implemented by the following technical scheme: performing convolution processing on the input of the nth feature extraction network with the convolution layer parameters of the convolution layer of the nth feature extraction network to obtain an nth convolution layer processing result; normalizing the nth convolution layer processing result through the normalization layer of the nth feature extraction network to obtain an nth normalization processing result; performing linear rectification processing on the nth normalization processing result through the linear rectification layer of the nth feature extraction network to obtain an nth linear rectification processing result; and performing average pooling processing on the nth linear rectification processing result through the average pooling layer of the nth feature extraction network to obtain the nth feature extraction result.
As an example, each feature extraction network includes a convolution layer, a normalization layer, a linear rectification layer, and an average pooling layer; and carrying out convolution processing, normalization processing, linear rectification processing and average pooling processing on the input of the feature extraction network through the feature extraction network to obtain a feature extraction result output by the feature extraction network, and outputting the audio feature of the pronunciation fragments by the last feature extraction network.
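A PyTorch sketch of one feature extraction network and of a cascade of N such networks is given below; the channel widths, batch normalization as the normalization layer, and the final adaptive pooling that flattens the output for the shared FC layer are assumptions.

```python
import torch
import torch.nn as nn

class FeatureExtractionBlock(nn.Module):
    """One feature extraction network: convolution layer, normalization layer,
    linear rectification layer and average pooling layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)   # e.g. 3x3@64
        self.norm = nn.BatchNorm2d(out_ch)
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):
        return self.pool(torch.relu(self.norm(self.conv(x))))

def build_shared_feature_network(channels=(1, 64, 128, 256)):
    """N cascaded feature extraction networks; the output of block n feeds block n+1."""
    blocks = [FeatureExtractionBlock(channels[i], channels[i + 1])
              for i in range(len(channels) - 1)]
    return nn.Sequential(*blocks, nn.AdaptiveAvgPool2d(1), nn.Flatten())
```

Under the earlier assumptions, the result of `build_shared_feature_network()` can be passed to `MultiTaskSpeechModel` as its shared feature network.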
In step 102, a voice classification process is performed on each pronunciation section based on the audio characteristics of each pronunciation section, so as to obtain a voice classification result of each pronunciation section.
In some embodiments, referring to fig. 3D, fig. 3D is a schematic flow chart of a voice detection method based on artificial intelligence according to an embodiment of the present application, in step 102, based on the audio feature of each pronunciation segment, a voice classification process is performed on each pronunciation segment to obtain a voice classification result of each pronunciation segment, which may be implemented in steps 1021-1022.
In step 1021, a plurality of candidate classification processes are adapted based on the application scenario of the audio signal.
In step 1022, when the voice classification process is adapted to the plurality of candidate classification processes, the voice classification process is performed on each pronunciation section, and a voice classification result of each pronunciation section is obtained.
As an example, adaptation of a plurality of candidate classification processes is performed based on the application scene of the audio signal. For instance, in the spoken language evaluation scene the audio signal is required to be human voice, and in the intelligent assistant scene the audio signal is likewise required to be human voice; therefore, when the voice classification process is among the candidate classification processes adapted to the scene, voice classification processing is performed on each pronunciation segment to obtain the voice classification result of each pronunciation segment.
In some embodiments, in step 102, the voice classification processing is performed on each pronunciation segment based on the audio feature of each pronunciation segment to obtain a voice classification result of each pronunciation segment, which may be implemented by the following technical scheme, where the audio feature of each pronunciation segment is transmitted forward in the voice classification network to obtain a voice classification result of each pronunciation segment.
In some embodiments, the foregoing forward transmission of the audio feature of each pronunciation segment in the voice classification network to obtain the voice classification result of each pronunciation segment may be implemented by the following technical scheme, where the first full connection processing is performed on each pronunciation segment through a shared full connection layer of the voice classification network and the language classification network to obtain the first full connection processing result corresponding to each pronunciation segment; performing second full-connection processing on the first full-connection processing result of each pronunciation segment through a voice full-connection layer of the voice classification network to obtain a second full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the second full-connection processing result of each pronunciation fragment to obtain the probability of corresponding each voice classification label; and determining the voice classification label with the highest probability as a voice classification result of each pronunciation fragment.
By way of example, the voice classification processing and the language classification processing are implemented by a multi-classification task model that includes a voice classification network and a language classification network. Referring to fig. 6A, the multi-classification task model includes a shared feature network, the voice classification network and the language classification network. The shared feature network is used for feature extraction: its input is the Mel spectrum obtained from the audio signal, and its output is the audio feature of each pronunciation segment. The first full-connection processing is performed on the audio feature through the full-connection layer shared by the voice classification network and the language classification network, which also applies a linear rectification function. The second full-connection processing and the maximum likelihood processing are then performed through the voice full-connection layer corresponding to the voice classification network; the maximum likelihood processing yields the probability of each of the two voice classification labels (human voice and non-human voice), and the label with the highest probability is determined as the voice classification result of each pronunciation segment. For example, if the probability of non-human voice is 0.9 and the probability of human voice is 0.1, the voice classification result is non-human voice.
In step 103, based on the audio characteristics of each pronunciation segment, the language classification processing is performed on each pronunciation segment, so as to obtain the language classification result of each pronunciation segment.
In some embodiments, in step 103, based on the audio feature of each pronunciation segment, the language classification processing is performed on each pronunciation segment to obtain the language classification result of each pronunciation segment, which may be implemented by the following technical scheme: based on the application scene of the audio signal, performing adaptation of a plurality of candidate classification processes; when the method is adapted to the language classification processing in the candidate classification processing, the language classification processing is carried out on each pronunciation segment, and the language classification result of each pronunciation segment is obtained.
As an example, adaptation of a plurality of candidate classification processes is performed based on the application scene of the audio signal. For instance, when the application scene is the spoken language evaluation scene, the candidate classification processes include language classification processing, age classification processing, gender classification processing, and the like. In the spoken language evaluation scene the language of the audio signal is required to be English, and in the intelligent assistant scene it is required to be Chinese; therefore, when the language classification process is among the candidate classification processes adapted to the scene, language classification processing is performed on each pronunciation segment to obtain the language classification result of each pronunciation segment.
In some embodiments, the adaptation of the plurality of candidate classification processes based on the application scene of the audio signal may be implemented by acquiring a limiting condition of the application scene, so as to determine, among the plurality of candidate classification processes, the candidate classification process corresponding to the limiting condition as the classification process adapted to the application scene; wherein the limiting condition includes at least one of: age; species; language; gender.
As an example, different application scenes have different limiting conditions. For example, the spoken language evaluation application scene may constrain the age of the user, e.g., require that the user participating in spoken language evaluation is a child; in that case, among candidate classification processes such as language classification processing, age classification processing and gender classification processing, the age classification process corresponding to the age constraint is regarded as a classification process adapted to the application scene.
In some embodiments, in step 103, the language classification processing is performed on each pronunciation segment based on the audio feature of each pronunciation segment to obtain the language classification result of each pronunciation segment, which may be implemented by the following technical scheme, where the audio feature of each pronunciation segment is transmitted forward in the language classification network to obtain the language classification result of each pronunciation segment.
In some embodiments, the foregoing forward transmission of the audio feature of each pronunciation segment in the language classification network may be implemented by the following technical scheme: carrying out third full-connection processing on each pronunciation segment through the shared full-connection layer of the voice classification network and the language classification network to obtain a third full-connection processing result corresponding to each pronunciation segment; performing fourth full-connection processing on the third full-connection processing result of each pronunciation segment through the language full-connection layer of the language classification network to obtain a fourth full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the fourth full-connection processing result of each pronunciation segment to obtain the probability of each language classification label; and determining the language classification label with the highest probability as the language classification result of each pronunciation segment.
By way of example, referring to fig. 6A again, the third full-connection processing is performed on the audio feature through the shared full-connection layer of the voice classification network and the language classification network (the full-connection layer with 2048 nodes in fig. 6B), in which processing based on the linear rectification activation function is also performed. The language full-connection layer of the language classification network then performs the fourth full-connection processing and the maximum likelihood processing; the maximum likelihood processing yields the probability of each language classification tag. There are a plurality of language classification tags (for example, English, Chinese and Japanese), and the tag with the maximum probability is determined as the language classification result of each pronunciation segment; for example, if the probability of English is 0.8, the probability of Chinese is 0.1 and the probability of Japanese is 0.1, the language classification result is English.
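A minimal PyTorch-style sketch of the two classification branches described above is given below. It assumes the shared feature network has already produced a 2048-dimensional audio feature per pronunciation segment, reads the "maximum likelihood processing" as a softmax over the labels, and uses illustrative label counts; it is an assumption-laden illustration rather than the exact network of fig. 6A/6B:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHeads(nn.Module):
    """Shared 2048-unit FC with ReLU, then one FC per task; softmax gives per-label probabilities."""
    def __init__(self, feat_dim=2048, n_voice=2, n_lang=3):
        super().__init__()
        self.shared_fc = nn.Linear(feat_dim, 2048)   # shared full-connection layer (FC 2048)
        self.voice_fc = nn.Linear(2048, n_voice)     # voice full-connection layer: human / non-human
        self.lang_fc = nn.Linear(2048, n_lang)       # language full-connection layer: e.g. English/Chinese/Japanese

    def forward(self, seg_feat):                     # seg_feat: (batch, feat_dim), one row per pronunciation segment
        h = F.relu(self.shared_fc(seg_feat))         # first/third full-connection processing + linear rectification
        voice_prob = F.softmax(self.voice_fc(h), dim=-1)   # second full-connection + "maximum likelihood" (softmax)
        lang_prob = F.softmax(self.lang_fc(h), dim=-1)     # fourth full-connection + "maximum likelihood" (softmax)
        return voice_prob, lang_prob

# Per-segment results are the labels with the largest probability:
# voice_label = voice_prob.argmax(dim=-1); lang_label = lang_prob.argmax(dim=-1)
```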
In step 104, a human voice classification result of the audio signal is determined based on the human voice classification result of each pronunciation section, and a language classification result of the audio signal is determined based on the language classification result of each pronunciation section.
In some embodiments, determining the human voice classification result of the audio signal based on the human voice classification result of each pronunciation segment in step 104 may be implemented by the following technical scheme: obtaining a first number of pronunciation segments whose voice classification result is non-human voice and a second number of pronunciation segments whose voice classification result is human voice; and determining the voice classification result corresponding to the larger of the first number and the second number as the voice classification result of the audio signal. In step 104, determining the language classification result of the audio signal based on the language classification result of each pronunciation segment may be implemented by the following technical scheme: obtaining the number of pronunciation segments whose language classification result is each language; and determining the language corresponding to the largest number as the language classification result of the audio signal.
As an example, an audio signal is divided into 10 pronunciation segments; if 8 pronunciation segments are classified as human voice and 2 pronunciation segments are classified as non-human voice, the human voice classification result of the audio signal is human voice. Likewise, if the audio signal is divided into 10 pronunciation segments, 8 of which are classified as English and 2 as Chinese, the language classification result of the audio signal is English.
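The segment-to-signal aggregation described above amounts to a majority vote over the per-segment labels; a small illustrative helper (the function and label names are assumptions) is shown below:

```python
from collections import Counter

def majority_vote(segment_labels):
    # segment_labels: per-segment classification results, e.g. ["human", "human", "non-human", ...]
    # Returns the label held by the larger number of pronunciation segments.
    return Counter(segment_labels).most_common(1)[0][0]

# Example from the paragraph above: 8 segments classified as human voice, 2 as non-human voice.
print(majority_vote(["human"] * 8 + ["non-human"] * 2))   # -> "human"
print(majority_vote(["english"] * 8 + ["chinese"] * 2))   # -> "english"
```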
As an example, when the human voice classification result is non-human voice, a prompt message is displayed to prompt that the audio signal belongs to an abnormal signal; when the human voice classification result is human voice, the following processing is performed: when the client receiving the audio signal belongs to an intelligent voice control scene, an intelligent voice assistant corresponding to the language classification result is displayed; when the client receiving the audio signal belongs to a voice test scene and the language classification result does not accord with the set language, a prompt message is displayed to prompt that the audio signal belongs to an abnormal signal.
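The handling of the two classification results in the different client scenarios can be sketched as the following decision logic; the scenario identifiers, label strings and return messages are assumptions used only for illustration:

```python
def handle_detection(voice_result, lang_result, scenario, expected_lang="english"):
    # Non-human voice is always treated as an abnormal signal.
    if voice_result == "non_human_voice":
        return "prompt: audio signal is abnormal"
    # Human voice: route according to the client scenario.
    if scenario == "smart_voice_control":
        return f"show intelligent voice assistant for language: {lang_result}"
    if scenario == "spoken_test" and lang_result != expected_lang:
        return "prompt: audio signal is abnormal (language does not match the test)"
    return "proceed with spoken language evaluation"
```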
In some embodiments, the voice classification process and the language classification process are implemented by a multi-classification task model that includes a shared feature network, a voice classification network, and a language classification network; corpus samples in the training sample set are propagated forward and backward in the shared feature network, the shared full-connection layer of the voice classification network and the language classification network, and the full-connection layer corresponding to the shared feature network, so as to update parameters of the shared feature network and the shared full-connection layer; and corpus samples in the training sample set are then propagated forward and backward in the updated shared feature network, the updated shared full-connection layer, the full-connection layer of the voice classification network and the full-connection layer of the language classification network, so as to update the parameters of the multi-classification task model.
As an example, referring to fig. 6B, a basic classification model is first trained. The basic classification model includes a plurality of feature extraction networks (the shared feature network) and two full-connection layers: the first full-connection layer is a shared full-connection layer having 2048 nodes (the shared full-connection layer of the voice classification network and the language classification network), in which linear rectification processing is implemented; the second full-connection layer implements 527-way audio type classification (the full-connection layer corresponding to the shared feature network), in which maximum likelihood processing is implemented and through which the basic classification model is updated during training. After training of the basic classification model is completed, the shared feature network and the shared full-connection layer of the basic classification model are reserved, and the voice full-connection layer of the voice classification network and the language full-connection layer of the language classification network are added on the basis of the reserved network, so as to obtain the multi-classification task model, on which training is continued.
As an example, the following process is performed during each iteration of the multi-classification task model: each corpus sample is propagated forward in the shared feature network and the voice classification network of the multi-classification task model to obtain the predicted voice classification category of the corpus sample under the voice classification processing; each corpus sample is propagated forward in the shared feature network and the language classification network of the multi-classification task model to obtain the predicted language classification category of the corpus sample under the language classification processing; a voice error between the predicted voice classification category and the pre-labeled real voice category, and a language error between the predicted language classification category and the pre-labeled real language category, are determined; the language error and the voice error are aggregated according to the loss function to obtain an aggregate error, and the aggregate error is back-propagated in the multi-classification task model to determine the parameter change value of the multi-classification task model when the loss function takes its minimum value, and the parameters of the multi-classification task model are updated based on the parameter change value.
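A condensed sketch of one such training iteration is shown below; it assumes a PyTorch-style model returning logits for both branches and a data loader yielding (mel, voice_label, language_label) batches, so every name here is illustrative rather than the exact training code of the embodiment:

```python
import torch
import torch.nn.functional as F

w1, w2 = 1.0, 1.0                      # preset weights balancing the two task losses (assumed values)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # `model`: the multi-classification task model (assumed)

for mel, voice_y, lang_y in loader:    # `loader` yields pre-labelled corpus samples (assumed)
    voice_logits, lang_logits = model(mel)                # forward propagation through both branches
    loss = w1 * F.cross_entropy(voice_logits, voice_y) \
         + w2 * F.cross_entropy(lang_logits, lang_y)      # aggregate error from the two task errors
    optimizer.zero_grad()
    loss.backward()                                       # back propagation of the aggregate error
    optimizer.step()                                      # parameter update from the resulting gradients
```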
In the following, an exemplary application of the artificial intelligence-based voice detection method provided by the embodiment of the application is described by taking a spoken test scenario as the application scenario.
For the language classification processing, the following two schemes may be adopted: 1. based on a plurality of speech recognition engines, the language corresponding to the speech recognition engine with the largest output probability is selected as the recognized language; 2. effective pronunciation features are extracted to construct a language classifier for distinguishing languages. In the second scheme, the effective pronunciation features may be extracted based on professional knowledge, or the effective features of the audio may be extracted based on a neural network; for example, features such as mel frequency cepstrum coefficients and identity vectors (i-vectors) are extracted for classifying the languages; or the original audio waveform signal is input into a deep neural network that outputs the language classification result; or the original spectrogram corresponding to the speech is extracted and input into a deep neural network that outputs the language classification result. For the voice classification processing, a classifier of various sounds can be constructed, and the various sounds are classified based on the extracted speech spectrogram.
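As an illustration of the second scheme (handcrafted acoustic features feeding a language classifier), the sketch below extracts an utterance-level MFCC statistic with librosa and trains a generic scikit-learn classifier; the file names, label encoding and classifier choice are assumptions, and any feature/classifier combination could be substituted:

```python
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def mfcc_feature(path, sr=16000, n_mfcc=13):
    """Utterance-level feature: mean MFCC vector over all frames."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Hypothetical labelled clips: 0 = English, 1 = non-English.
# X = np.stack([mfcc_feature(p) for p in training_paths])
# clf = LogisticRegression(max_iter=1000).fit(X, training_labels)
# predicted = clf.predict(mfcc_feature("clip.wav").reshape(1, -1))
```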
In the field of oral examination, the voice interaction function is mainly applied to the following-reading questions or the open-expression questions of oral examination.
For example, referring to fig. 4A, fig. 4A is an interface schematic diagram of an artificial intelligence-based voice detection method according to an embodiment of the present application. The human-computer interaction interface 501A presents the follow-reading text "i know true, do you know? (I know the face, do you know)". The audio signal of the user reading the text aloud is received in response to a trigger operation, for example a click operation, on the start-reading button 502A in the human-computer interaction interface 501A, and reception of the audio signal is stopped in response to a trigger operation, for example a click operation, on the end-reading button 503A in the human-computer interaction interface 501A. Referring to fig. 4B, fig. 4B is an interface schematic diagram of an artificial intelligence-based voice detection method according to an embodiment of the present application; an anomaly detection result of the audio signal, for example a non-English anomaly detection result, is presented on the human-computer interaction interface 501B.
Referring to fig. 5, fig. 5 is a schematic flow chart of a speech detection method based on artificial intelligence. In response to initialization of the client, the follow-reading text is displayed in the man-machine interface of the client. In response to a start-recording operation on the start-reading button in the client, the audio signal of the user reciting the text is collected, and the client sends the collected audio signal to the server. The server sends the audio signal to the anomaly detection module, and the anomaly detection module outputs a human voice classification result and a language classification result and returns them to the server. When the human voice classification result is non-human voice, or the language classification result is unrelated to the current spoken language evaluation, the server returns an anomaly detection result to the client to remind the user; otherwise, the server returns the spoken language evaluation result to the client.
In some embodiments, for the language classification processing, the audio signal may contain more than one language; segmenting the audio signal into a plurality of pronunciation segments effectively handles this situation. A voice endpoint detection technique can be used to detect the audio, for example, to determine whether each frame of audio in the audio signal is silent, that is, whether any given frame belongs to a speech signal or to a silence signal.
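A simple energy-based sketch of such frame-level endpoint detection is shown below; the thresholding rule (comparing frame energy with an estimated background-noise level) and all parameter values are assumptions standing in for whatever voice endpoint detection technique is actually deployed:

```python
import numpy as np

def segment_by_energy(signal, sr, frame_ms=25, hop_ms=10, noise_factor=3.0):
    """Split a mono signal into voiced segments of consecutive high-energy frames."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    noise = np.percentile(energy, 10)            # rough background-noise energy estimate
    voiced = energy > noise_factor * noise       # frame is speech if energy exceeds the noise level
    segments, start = [], None
    for i, v in enumerate(voiced):               # merge consecutive voiced frames into segments
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(voiced)))
    return segments                              # list of (start_frame, end_frame) pairs
```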
In some embodiments, since the original audio signal is a time-varying waveform that is difficult to analyze directly, the signal is transformed from the time domain to the frequency domain by Fourier transformation, yielding a spectrogram whose horizontal axis is time and whose vertical axis is frequency. Because humans do not perceive frequency on a linear scale (differences at low frequencies are perceived more strongly than equally large differences at high frequencies), the frequency axis is converted to the mel scale by a mel calculation. The original signal is thus finally converted into a mel spectrogram, whose horizontal axis is time and whose vertical axis is frequency on the mel scale, and the mel spectrogram is taken as the input of the multi-classification task model.
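The time-domain-to-mel-spectrogram conversion can be reproduced with standard tooling; the sketch below uses librosa with illustrative frame parameters, and the file path is a placeholder:

```python
import librosa
import numpy as np

y, sr = librosa.load("recording.wav", sr=16000)             # placeholder path, 16 kHz mono
mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=400,             # 25 ms analysis window at 16 kHz
                                     hop_length=160,        # 10 ms hop
                                     n_mels=64)             # number of mel filter banks (assumed)
log_mel = librosa.power_to_db(mel)                          # log-compressed mel spectrogram
# log_mel has shape (n_mels, frames): vertical axis = mel-scale frequency, horizontal axis = time
```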
In some embodiments, the non-human voice detection and the language detection are performed separately through a pre-trained basic classification model. The pre-trained basic classification model is an audio classification network, which may be a convolutional neural network trained on audio data and capable of classifying 527 audio types; its network structure is shown in fig. 6B. Each unit of the basic classification model consists of a convolutional neural network, batch normalization (BN), a linear rectification function (ReLU) and average pooling, and the 527 audio types are finally classified through global average pooling (Global Pooling) and two full-connection transformations (FC 2048 and FC 527).
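One unit of such a convolutional audio backbone (convolution, batch normalization, linear rectification and average pooling) can be sketched as follows in PyTorch; the channel sizes, kernel size and pooling size are illustrative, not the exact configuration of fig. 6B:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Module):
    """Conv -> BatchNorm -> ReLU -> average pooling: one building block of the shared feature network."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):                 # x: (batch, channels, mel_bins, frames)
        return self.pool(F.relu(self.bn(self.conv(x))))

# Stacking several such units and finishing with global average pooling and the FC layers
# yields the segment-level audio feature described above.
```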
In some embodiments, referring to fig. 6A, the multi-classification task model is obtained by performing transfer learning based on the trained basic classification model, so that the multi-classification task model can perform both the voice classification processing and the language classification processing. Specifically, the last full-connection layer in the basic classification model (including FC527 and the sigmoid activation function) is replaced by two independent full-connection layers (each including an FC layer and a maximum likelihood function), and finally two classification results are output: the human voice classification result (whether the audio is human voice) and the language classification result.
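A compact sketch of this transfer-learning step is given below; `pretrained_backbone` stands in for the reserved shared feature network plus shared 2048-unit FC of the basic classification model, `ClassificationHeads` is the two-branch module sketched earlier after the fig. 6A discussion, and all names are assumptions:

```python
import torch.nn as nn

class MultiClassificationTaskModel(nn.Module):
    """Reserved backbone of the basic classification model plus the two new task heads."""
    def __init__(self, pretrained_backbone, feat_dim=2048):
        super().__init__()
        self.backbone = pretrained_backbone          # layers kept from fig. 6B (FC527/sigmoid head dropped)
        self.heads = ClassificationHeads(feat_dim)   # two new independent full-connection heads

    def forward(self, mel):
        feat = self.backbone(mel)                    # segment-level 2048-dim audio feature
        return self.heads(feat)                      # (human-voice probabilities, language probabilities)
```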
In some embodiments, the loss function of the multi-classification task model is divided into two parts: a loss for the human voice classification and a loss for the language classification. The total loss function is obtained by superimposing the two losses, see equation (1):
L_total = w1 * L_voice + w2 * L_language (1);
Wherein L_total is the loss function of the multi-classification task model, w1 is a preset weight for the human voice classification loss, w2 is a preset weight for the language classification loss, w1 and w2 are used to balance the two loss terms, L_voice is the loss for the human voice classification, and L_language is the loss for the language classification.
The loss of the human voice classification processing and the loss of the language classification processing both follow formula (2), wherein L is the loss for human voice classification or the loss for language classification, y is the real human voice label or the real language label of the audio signal, and p is the prediction probability of the human voice classification result or the language classification result output by the multi-classification task model:
L=-y*log(p) (2);
Referring to fig. 7, fig. 7 is a schematic diagram of a data structure of an artificial intelligence-based voice detection method according to an embodiment of the present application. The input of the multi-classification task model is a mel spectrogram. The multi-classification task model includes a pretrained audio neural network (PANN, Pretrained Audio Neural Networks), the voice full-connection layer of the voice classification network (FC and maximum likelihood function), and the language full-connection layer of the language classification network (FC and maximum likelihood function), and outputs two abnormality detection results: the human voice classification result (0 represents human voice with probability 0.9, 1 represents non-human voice with probability 0.1) and the language classification result (0 represents English with probability 0.2, 1 represents non-English with probability 0.8).
The data test of the artificial intelligence-based voice detection method provided by the embodiment of the application mainly targets the human voice classification processing and the language classification processing: the language classification result is reported as English or non-English, and the human voice classification result is reported as human voice or non-human voice. The test covers word, sentence and paragraph scenes; the test data come from certain spoken language examination data, and each type of data comprises 1000 pieces, including 1000 words (500 English, 500 non-English), 1000 sentences (500 English, 500 non-English), 1000 paragraphs (500 English, 500 non-English), 1000 non-human voice pieces and 1000 human voice pieces. The classification accuracy is shown in table 1 below.
                     Language detection result    Non-human voice detection result
Words                89%                          99%
Sentences            99%                          99%
Paragraphs           92%                          99%
Non-human voice      --                           99%

Table 1 Test accuracy
In some embodiments, the language classification network and the voice classification network can be implemented based on various neural network structures, and further related abnormality detection tasks, such as a child voice versus adult voice classification task, can be added, so that multi-dimensional abnormality discrimination is achieved with a single model.
Continuing with the description below of an exemplary architecture for implementing the artificial intelligence based speech detection apparatus 255 as a software module provided by embodiments of the present application, in some embodiments, as shown in fig. 2, the software modules stored in the artificial intelligence based speech detection apparatus 255 of the memory 250 may include: an obtaining module 2551, configured to divide the audio signal into a plurality of pronunciation fragments, and obtain an audio feature of each pronunciation fragment; the voice module 2552 is configured to perform voice classification processing on each pronunciation segment based on the audio feature of each pronunciation segment, so as to obtain a voice classification result of each pronunciation segment; the language module 2553 is configured to perform a language classification process on each pronunciation segment based on the audio feature of each pronunciation segment, so as to obtain a language classification result of each pronunciation segment; the result module 2554 is configured to determine a human voice classification result of the audio signal based on the human voice classification result of each pronunciation segment, and determine a language classification result of the audio signal based on the language classification result of each pronunciation segment.
In some embodiments, the obtaining module 2551 is further configured to: determining a speech energy for each audio frame in the audio signal; a plurality of consecutive audio frames of the audio signal having speech energy greater than background noise energy are combined into a voicing segment.
In some embodiments, the obtaining module 2551 is further configured to: carrying out framing treatment on the audio signal to obtain a plurality of audio frames corresponding to the audio signal; performing feature extraction processing on each audio frame through an audio frame classification network to obtain audio frame classification features corresponding to each audio frame; wherein the audio frame classification feature comprises at least one of: log frame energy characteristics; zero crossing rate characteristics; normalizing the autocorrelation characteristics; performing classification processing based on the classification characteristics of the audio frames on each audio frame through an audio frame classification network, and combining a plurality of continuous audio frames with classification results of pronunciation data into pronunciation fragments; the training samples of the audio frame classification network comprise audio frame samples, and the labeling data of the training samples comprise pre-labeled classification results of the audio frame samples.
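For reference, the three frame-level features named above can be computed as follows for a single audio frame; this is a plain NumPy sketch under the assumption of a mono float signal, not the audio frame classification network itself:

```python
import numpy as np

def frame_classification_features(frame, eps=1e-10):
    """Log frame energy, zero-crossing rate and first-lag normalized autocorrelation of one frame."""
    log_energy = np.log(np.sum(frame ** 2) + eps)                       # log frame energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0                # zero-crossing rate
    num = np.sum(frame[1:] * frame[:-1])                                # lag-1 autocorrelation ...
    den = np.sqrt(np.sum(frame[1:] ** 2) * np.sum(frame[:-1] ** 2)) + eps
    norm_autocorr = num / den                                           # ... normalized to [-1, 1]
    return np.array([log_energy, zcr, norm_autocorr])
```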
In some embodiments, the voice classification process and the language classification process are implemented by a multi-classification task model that includes a voice classification network and a language classification network; the voice module 2552 is further configured to: transmitting the audio characteristics of each pronunciation fragment in the forward direction in a voice classification network to obtain a voice classification result of each pronunciation fragment; language module 2553, further configured to: and transmitting the audio characteristics of each pronunciation segment forward in the language classification network to obtain the language classification result of each pronunciation segment.
In some embodiments, the voice module 2552 is further configured to: carrying out first full-connection processing on each pronunciation segment through a shared full-connection layer of the voice classification network and the language classification network to obtain a first full-connection processing result corresponding to each pronunciation segment; performing second full-connection processing on the first full-connection processing result of each pronunciation segment through a voice full-connection layer of the voice classification network to obtain a second full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the second full-connection processing result of each pronunciation segment to obtain the probability of each voice classification label; determining the voice classification label with the highest probability as the voice classification result of each pronunciation segment; language module 2553, further configured to: carrying out third full-connection processing on each pronunciation segment through a shared full-connection layer of the voice classification network and the language classification network to obtain a third full-connection processing result corresponding to each pronunciation segment; performing fourth full-connection processing on the third full-connection processing result of each pronunciation segment through a language full-connection layer of the language classification network to obtain a fourth full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the fourth full-connection processing result of each pronunciation segment to obtain the probability of each language classification label; and determining the language classification label with the highest probability as the language classification result of each pronunciation segment.
In some embodiments, the audio features of each pronunciation segment are obtained through the shared feature network in the multi-classification task model; the acquisition module 2551 is further configured to: transforming each pronunciation segment from a time domain signal to a frequency domain signal, and performing mel calculation on each pronunciation segment transformed to the frequency domain signal to obtain the mel-scale spectrum of each pronunciation segment; and forward transmitting the mel-scale spectrum of each pronunciation segment in the shared feature network to obtain the audio feature corresponding to each pronunciation segment.
In some embodiments, the shared feature network comprises N cascaded feature extraction networks, N being an integer greater than or equal to 2; the acquisition module 2551 is further configured to: performing feature extraction processing on the input of an nth feature extraction network through the nth feature extraction network in the N cascaded feature extraction networks; transmitting an nth feature extraction result output by the nth feature extraction network to the (n+1)th feature extraction network to continue feature extraction processing; wherein n is an integer whose value increases from 1, and the value range of n satisfies 1 ≤ n ≤ N-1; when the value of n is 1, the input of the nth feature extraction network is the mel-scale spectrum of each pronunciation segment, and when 2 ≤ n ≤ N-1, the input of the nth feature extraction network is the feature extraction result of the (n-1)th feature extraction network.
In some embodiments, the nth feature extraction network includes a convolution layer, a normalization layer, a linear rectification layer, and an average pooling layer; the acquisition module 2551 is further configured to: carrying out convolution processing on the input of the nth characteristic extraction network and the convolution layer parameters of the convolution layer of the nth characteristic extraction network to obtain an nth convolution layer processing result; normalizing the processing result of the nth convolution layer through a normalization layer of the nth feature extraction network to obtain an nth normalization processing result; carrying out linear rectification treatment on the nth normalization treatment result through a linear rectification layer of the nth characteristic extraction network to obtain an nth linear rectification treatment result; and carrying out average pooling treatment on the nth linear rectification treatment result through an average pooling layer of the nth characteristic extraction network to obtain the nth characteristic extraction result.
In some embodiments, the voice module 2552 is further configured to: performing adaptation of a plurality of candidate classification processes based on the application scene of the audio signal; and when the voice classification processing among the plurality of candidate classification processes is adapted, performing voice classification processing on each pronunciation segment to obtain the voice classification result of each pronunciation segment; the language module 2553 is further configured to: performing adaptation of a plurality of candidate classification processes based on the application scene of the audio signal; and when the language classification processing among the plurality of candidate classification processes is adapted, performing language classification processing on each pronunciation segment to obtain the language classification result of each pronunciation segment.
In some embodiments, the voice module 2552 is further configured to: acquiring the limiting conditions of the application scene to determine the candidate classification processing corresponding to the limiting conditions among the plurality of candidate classification processing as the classification processing adapted to the application scene; wherein the limiting conditions include at least one of: age; species; language; sex.
In some embodiments, the voice classification process and the language classification process are implemented by a multi-classification task model that includes a shared feature network, a voice classification network, and a language classification network; the apparatus further comprises: training module 2555, for: carrying out forward propagation and backward propagation on corpus samples in the training sample set in a shared full-connection layer of a shared feature network, a voice classification network and a language classification network and a full-connection layer corresponding to the shared feature network so as to update parameters of the shared feature network and the shared full-connection layer; and carrying out forward propagation and backward propagation on corpus samples in the training sample set in the updated shared feature network, the updated shared full-connection layer, the full-connection layer of the voice classification network and the full-connection layer of the language classification network so as to update the parameters of the multi-classification task model.
In some embodiments, results module 2554 is further configured to: acquiring a first number of pronunciation fragments of which the voice classification result is non-voice and a second number of pronunciation fragments of which the voice classification result is voice; determining a voice classification result corresponding to a larger number of the first number and the second number as a voice classification result of the audio signal; obtaining language classification results as the number of pronunciation fragments of each language; and determining the languages corresponding to the maximum number as the language classification result of the audio signal.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the artificial intelligence-based voice detection method according to the embodiment of the application.
Embodiments of the present application provide a computer readable storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform the artificial intelligence based speech detection method provided by embodiments of the present application, for example, as shown in fig. 3A-3D.
In some embodiments, the computer readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be a variety of devices including one of, or any combination of, the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiment of the application, feature extraction is performed on each pronunciation segment in the audio signal, and voice classification processing and language classification processing are performed on the extracted audio features, so that the abnormality of the audio signal is accurately detected, and thus voice recognition is more accurately realized.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (14)

1. A method for detecting speech based on artificial intelligence, the method comprising:
Dividing an audio signal into a plurality of pronunciation fragments, and acquiring the audio characteristics of each pronunciation fragment;
Carrying out first full-connection processing on each pronunciation segment through a shared full-connection layer of a voice classification network and a language classification network to obtain a first full-connection processing result corresponding to each pronunciation segment;
performing second full-connection processing on the first full-connection processing result of each pronunciation segment through a voice full-connection layer of the voice classification network to obtain a second full-connection processing result of each pronunciation segment;
Carrying out maximum likelihood processing on the second full-connection processing result of each pronunciation fragment to obtain the probability of each voice classification label, and determining the voice classification label with the maximum probability as the voice classification result of each pronunciation fragment;
Carrying out third full-connection processing on each pronunciation segment through a shared full-connection layer of the voice classification network and the language classification network to obtain a third full-connection processing result corresponding to each pronunciation segment;
performing fourth full-connection processing on the third full-connection processing result of each pronunciation segment through the language full-connection layer of the language classification network to obtain a fourth full-connection processing result of each pronunciation segment;
Carrying out maximum likelihood processing on the fourth full-connection processing result of each pronunciation segment to obtain the probability of the corresponding language classification label, and determining the language classification label with the maximum probability as the language classification result of each pronunciation segment;
And determining a voice classification result of the audio signal based on the voice classification result of each pronunciation segment, and determining a language classification result of the audio signal based on the language classification result of each pronunciation segment.
2. The method of claim 1, wherein the dividing the audio signal into a plurality of voicing segments comprises:
determining a speech energy for each audio frame in the audio signal;
and combining a plurality of continuous audio frames with voice energy larger than background noise energy in the audio signal into a pronunciation fragment.
3. The method of claim 1, wherein the dividing the audio signal into a plurality of voicing segments comprises:
carrying out framing treatment on the audio signal to obtain a plurality of audio frames corresponding to the audio signal;
Performing feature extraction processing on each audio frame through an audio frame classification network to obtain audio frame classification features corresponding to each audio frame;
Wherein the audio frame classification feature comprises at least one of: log frame energy characteristics; zero crossing rate characteristics; normalizing the autocorrelation characteristics;
Performing classification processing on each audio frame based on the audio frame classification characteristics through the audio frame classification network, and combining a plurality of continuous audio frames with classification results of pronunciation data into pronunciation fragments;
the training samples of the audio frame classification network comprise audio frame samples, and the labeling data of the training samples comprise pre-labeled classification results of the audio frame samples.
4. The method of claim 1, wherein,
The audio characteristics of each pronunciation fragment are obtained through a shared characteristic network in the multi-classification task model;
The acquiring the audio feature of each pronunciation section comprises the following steps:
Transforming the type of each pronunciation segment from a time domain signal to a frequency domain signal, and performing Mel calculation on each pronunciation segment transformed to the frequency domain signal to obtain the spectrum of Mel scale of each pronunciation segment;
And forward transmitting the frequency spectrum of the Mel scale of each pronunciation fragment in the shared feature network to obtain the audio feature of each pronunciation fragment.
5. The method of claim 4, wherein,
The shared feature network comprises N cascaded feature extraction networks, wherein N is an integer greater than or equal to 2;
transmitting the spectrum of the mel scale of each pronunciation segment forward in the shared feature network to obtain the audio feature of each pronunciation segment, including:
Performing feature extraction processing on the input of an nth feature extraction network through the nth feature extraction network in the N cascaded feature extraction networks;
Transmitting an nth feature extraction result output by the nth feature extraction network to an (n+1) th feature extraction network to continue feature extraction processing;
Wherein n is an integer whose value increases from 1, and the value range of n satisfies 1 ≤ n ≤ N-1; when n is 1, the input of the nth feature extraction network is the spectrum of the mel scale of each pronunciation segment, and when 2 ≤ n ≤ N-1, the input of the nth feature extraction network is the feature extraction result of the (n-1)th feature extraction network.
6. The method of claim 5, wherein,
The nth feature extraction network comprises a convolution layer, a normalization layer, a linear rectification layer and an average pooling layer;
The feature extraction processing of the input of the nth feature extraction network through the nth feature extraction network in the N cascaded feature extraction networks comprises the following steps:
Carrying out convolution processing on the input of the nth characteristic extraction network and the convolution layer parameters of the convolution layer of the nth characteristic extraction network to obtain an nth convolution layer processing result;
Normalizing the processing result of the nth convolution layer through a normalization layer of the nth feature extraction network to obtain an nth normalization processing result;
Performing linear rectification processing on the nth normalization processing result through a linear rectification layer of the nth characteristic extraction network to obtain an nth linear rectification processing result;
And carrying out average pooling treatment on the nth linear rectification treatment result through an average pooling layer of the nth characteristic extraction network to obtain an nth characteristic extraction result.
7. The method of claim 1, wherein,
The method further comprises the steps of:
Performing adaptation of a plurality of candidate classification processes based on an application scenario of the audio signal;
invoking the voice classification network when adapting to the voice classification process among the plurality of candidate classification processes;
The language classification network is invoked when adapting to the language classification process among the plurality of candidate classification processes.
8. The method of claim 7, wherein the adapting of the plurality of candidate classification processes based on the application scenario of the audio signal comprises:
Acquiring a limiting condition of the application scene to determine candidate classification processing corresponding to the limiting condition in the plurality of candidate classification processing as classification processing matched with the application scene;
wherein the defined condition includes at least one of: age; species; language; sex.
9. The method of claim 1, wherein the speech detection method is implemented by invoking a multi-classification task model comprising a shared feature network, a voice classification network, and a language classification network;
The method further comprises the steps of:
Carrying out forward propagation and backward propagation on corpus samples in a training sample set in the shared feature network, the shared full-connection layer of the voice classification network and the language classification network and the full-connection layer corresponding to the shared feature network so as to update parameters of the shared feature network and the shared full-connection layer;
and carrying out forward propagation and backward propagation on corpus samples in the training sample set in the updated shared feature network, the updated shared full-connection layer, the full-connection layer of the voice classification network and the full-connection layer of the language classification network so as to update parameters of the multi-classification task model.
10. The method of claim 1, wherein,
The determining the voice classification result of the audio signal based on the voice classification result of each pronunciation section comprises the following steps:
acquiring a first number of pronunciation fragments of which the voice classification result is non-voice and a second number of pronunciation fragments of which the voice classification result is voice;
Determining a voice classification result corresponding to a greater number of the first number and the second number as a voice classification result of the audio signal;
the determining the language classification result of the audio signal based on the language classification result of each pronunciation segment comprises:
Acquiring the language classification result as the number of pronunciation fragments of each language;
And determining the languages corresponding to the maximum number as language classification results of the audio signals.
11. A speech detection device based on artificial intelligence, comprising:
The acquisition module is used for dividing the audio signal into a plurality of pronunciation fragments and acquiring the audio characteristics of each pronunciation fragment;
The voice module is used for carrying out first full-connection processing on each pronunciation segment through a shared full-connection layer of a voice classification network and a language classification network to obtain a first full-connection processing result corresponding to each pronunciation segment; performing second full-connection processing on the first full-connection processing result of each pronunciation segment through a voice full-connection layer of the voice classification network to obtain a second full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the second full-connection processing result of each pronunciation fragment to obtain the probability of each voice classification label, and determining the voice classification label with the maximum probability as the voice classification result of each pronunciation fragment;
The language module is used for carrying out third full-connection processing on each pronunciation segment through the shared full-connection layer of the voice classification network and the language classification network to obtain a third full-connection processing result corresponding to each pronunciation segment; performing fourth full-connection processing on the third full-connection processing result of each pronunciation segment through the language full-connection layer of the language classification network to obtain a fourth full-connection processing result of each pronunciation segment; carrying out maximum likelihood processing on the fourth full-connection processing result of each pronunciation segment to obtain the probability of the corresponding language classification label, and determining the language classification label with the maximum probability as the language classification result of each pronunciation segment;
And the result module is used for determining the voice classification result of the audio signal based on the voice classification result of each pronunciation segment and determining the language classification result of the audio signal based on the language classification result of each pronunciation segment.
12. An electronic device, comprising:
a memory for storing executable instructions;
A processor for implementing the artificial intelligence based speech detection method of any one of claims 1 to 10 when executing executable instructions stored in said memory.
13. A computer readable storage medium storing executable instructions for implementing the artificial intelligence based speech detection method of any one of claims 1 to 10 when executed by a processor.
14. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the artificial intelligence based speech detection method of any one of claims 1 to 10.
CN202110074985.2A 2021-01-20 2021-01-20 Speech detection method and device based on artificial intelligence and electronic equipment Active CN113593523B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110074985.2A CN113593523B (en) 2021-01-20 2021-01-20 Speech detection method and device based on artificial intelligence and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110074985.2A CN113593523B (en) 2021-01-20 2021-01-20 Speech detection method and device based on artificial intelligence and electronic equipment

Publications (2)

Publication Number Publication Date
CN113593523A CN113593523A (en) 2021-11-02
CN113593523B true CN113593523B (en) 2024-06-21

Family

ID=78238105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110074985.2A Active CN113593523B (en) 2021-01-20 2021-01-20 Speech detection method and device based on artificial intelligence and electronic equipment

Country Status (1)

Country Link
CN (1) CN113593523B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115129923B (en) * 2022-05-17 2023-10-20 荣耀终端有限公司 Voice searching method, device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185357A (en) * 2020-12-02 2021-01-05 成都启英泰伦科技有限公司 Device and method for simultaneously recognizing human voice and non-human voice

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1246164A1 (en) * 2001-03-30 2002-10-02 Sony France S.A. Sound characterisation and/or identification based on prosodic listening
CN108197109B (en) * 2017-12-29 2021-04-23 北京百分点科技集团股份有限公司 Multi-language analysis method and device based on natural language processing
KR102346026B1 (en) * 2019-02-11 2021-12-31 삼성전자주식회사 Electronic device and Method for controlling the electronic device thereof
CN110033756B (en) * 2019-04-15 2021-03-16 北京达佳互联信息技术有限公司 Language identification method and device, electronic equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185357A (en) * 2020-12-02 2021-01-05 成都启英泰伦科技有限公司 Device and method for simultaneously recognizing human voice and non-human voice

Also Published As

Publication number Publication date
CN113593523A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
Tahon et al. Towards a small set of robust acoustic features for emotion recognition: challenges
CN109313892B (en) Robust speech recognition method and system
CN108428446A (en) Audio recognition method and device
McKechnie et al. Automated speech analysis tools for children’s speech production: A systematic literature review
Sharma et al. Acoustic model adaptation using in-domain background models for dysarthric speech recognition
KR20170034227A (en) Apparatus and method for speech recognition, apparatus and method for learning transformation parameter
US10311865B2 (en) System and method for automated speech recognition
RU2720359C1 (en) Method and equipment for recognizing emotions in speech
Levitan et al. Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection.
Geoffrey et al. Statistical models in forensic voice comparison
Villarreal et al. From categories to gradience: Auto-coding sociophonetic variation with random forests
KR20210155401A (en) Speech synthesis apparatus for evaluating the quality of synthesized speech using artificial intelligence and method of operation thereof
CN111710337A (en) Voice data processing method and device, computer readable medium and electronic equipment
Schatz et al. Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception
CN110853669B (en) Audio identification method, device and equipment
KR20220128976A (en) Device, method and program for speech impairment evaluation
CN114360504A (en) Audio processing method, device, equipment, program product and storage medium
Salekin et al. Distant emotion recognition
US20190103110A1 (en) Information processing device, information processing method, and program
JP2004094257A (en) Method and apparatus for generating question of decision tree for speech processing
CN113593523B (en) Speech detection method and device based on artificial intelligence and electronic equipment
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
Alonso et al. Continuous tracking of the emotion temperature
Wang Detecting pronunciation errors in spoken English tests based on multifeature fusion algorithm
CN113781998B (en) Speech recognition method, device, equipment and medium based on dialect correction model

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40056143

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant