CN114283845A - Model construction method for audio recognition

Info

Publication number
CN114283845A
Authority
CN
China
Prior art keywords
audio data
audio
classification model
prompt message
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010996980.0A
Other languages
Chinese (zh)
Inventor
陈建芳
吴易万
许桓瑞
李建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Askey Technology Jiangsu Ltd
Askey Computer Corp
Original Assignee
Askey Technology Jiangsu Ltd
Askey Computer Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Askey Technology Jiangsu Ltd, Askey Computer Corp filed Critical Askey Technology Jiangsu Ltd
Priority to CN202010996980.0A
Publication of CN114283845A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a model construction method for audio recognition. In this method, audio data is obtained. A prediction result of the audio data is determined using a classification model, where the classification model is trained based on a machine learning algorithm and the prediction result includes a label defined by the classification model. A prompt message is provided according to the degree of loss of the prediction result, where the degree of loss is related to the difference between the prediction result and the corresponding actual result, and the prompt message is used to ask about the relevance of the audio data to the label. The classification model is modified according to a confirmation response to the prompt message, and the confirmation response is related to confirming the relevance of the audio data to the label. Labeling efficiency and prediction accuracy can thereby be improved.

Description

Model construction method for audio recognition
Technical Field
The present invention relates to machine learning technology, and more particularly, to a model construction method for audio recognition.
Background
Machine learning algorithms can analyze large amounts of data to infer regularities in the data and thereby predict unknown data. In recent years, machine learning has been widely applied in fields such as image recognition, natural language processing, medical diagnosis, and speech recognition.
It is worth noting that, for speech or other audio recognition techniques, an operator labels the type of sound content (e.g., female voice, baby, alarm, etc.) during model training so that each piece of input data in the training data is paired with the correct output result. When labeling an image, an operator can recognize the object and provide a corresponding label within a short time. For an audio label, however, the operator may need to listen to a long sound file before labeling can even begin, and the file may be corrupted by noise that makes its content difficult to identify. Labeling for training is therefore inefficient for the operator.
Disclosure of Invention
An embodiment of the invention is directed to a model construction method for audio recognition, which provides simple query prompts to facilitate labeling by operators.
According to an embodiment of the present invention, a model construction method for audio recognition includes (but is not limited to) the following steps. Audio data is obtained. A prediction result of the audio data is determined using a classification model, where the classification model is trained based on a machine learning algorithm and the prediction result includes a label defined by the classification model. A prompt message is provided according to a degree of loss of the prediction result, where the degree of loss is related to the difference between the prediction result and a corresponding actual result, and the prompt message is used to ask about the relevance of the audio data to the label. The classification model is modified according to a confirmation response to the prompt message, and the confirmation response is related to confirming the relevance of the audio data to the label.
Based on the above, the model construction method for audio recognition in the embodiment of the present invention determines the difference between the prediction result obtained by the trained classification model and the actual result, and provides a simple prompt message to the operator accordingly. The operator completes labeling simply by responding to the prompt message, the classification model is then revised, and both the recognition accuracy of the classification model and the labeling efficiency of the operator are improved.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow diagram of a model construction method for audio recognition according to an embodiment of the present invention;
FIG. 2 is a flow diagram of audio processing according to an embodiment of the present invention;
FIG. 3 is a flow diagram of noise cancellation according to an embodiment of the invention;
FIG. 4A is a waveform diagram illustrating an example of original audio data;
FIG. 4B is a waveform diagram illustrating an exemplary IMF (Intrinsic Mode Function);
FIG. 4C is a waveform diagram illustrating exemplary noise-canceled audio data;
FIG. 5 is a flow diagram of audio segmentation according to an embodiment of the present invention;
FIG. 6 is a flow diagram of model training according to an embodiment of the invention;
FIG. 7 is a schematic diagram of a neural network according to an embodiment of the present invention;
FIG. 8 is a flow diagram of updating a model according to an embodiment of the invention;
FIG. 9 is a schematic flow diagram of a smart doorbell application in accordance with an embodiment of the present invention;
FIG. 10 is a block diagram of components of a server in accordance with one embodiment of the present invention.
Description of the reference numerals
S110 to S170, S210 to S230, S310 to S350, S510 to S530, S610 to S630, S810 to S870, and S910 to S980: steps;
710, input layer;
730, hidden layer;
750, output layer;
10, cloud server;
30, training server;
31, communication interface;
33, memory;
35, processor;
50, smart doorbell;
51, microphone;
53, memory.
Detailed Description
Reference will now be made in detail to exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts.
FIG. 1 is a flowchart of a model construction method for audio recognition according to an embodiment of the present invention. Referring to FIG. 1, the server obtains audio data (step S110). Specifically, the audio data refers to an audio signal generated either by picking up sound waves (e.g., produced by a sound source such as a human voice, an environmental sound, or machine operation noise) and converting them into analog or digital form, or by a processor (e.g., a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), or the like) setting the vibration, frequency, tone, rhythm, and/or melody of the sound. In other words, the audio data may be produced by microphone recording or by computer editing; for example, a baby's cry is recorded with a smartphone, or a user edits a track with music software on a computer. In one embodiment, the audio data may be downloaded via a network, transmitted wirelessly or by wire (e.g., over Bluetooth Low Energy (BLE), Wi-Fi, or a fiber-optic network), transmitted as packets or a stream in real time or non-real time, or accessed from an external or built-in storage medium (e.g., a USB drive, a compact disc, an external hard disk, or a memory) for use in subsequent model construction. For example, the audio data is stored on a cloud server, and the training server downloads the audio data via FTP.
In one embodiment, the audio data is obtained by performing audio processing on raw audio data (whose form and sources may be the same as described above for the audio data). FIG. 2 is a flow diagram of audio processing according to an embodiment of the present invention. Referring to FIG. 2, the server may cancel a noise component of the raw audio data (step S210) and segment the audio (step S230). In other words, the raw audio data is subjected to noise cancellation and/or audio segmentation to obtain the audio data. In some embodiments, the order of noise cancellation and audio segmentation may be changed according to actual requirements.
There are many noise cancellation methods for audio. In one embodiment, the server may analyze characteristics of the raw audio data to determine a noise component (i.e., interference in the signal) of the raw audio data. An audio-related characteristic is, for example, a change in amplitude, frequency, energy, or another physical quantity, and a noise component typically exhibits particular characteristics.
For example, FIG. 3 is a flow chart of noise cancellation according to an embodiment of the invention. Referring to FIG. 3, the characteristics include a plurality of intrinsic mode functions (IMFs). A signal satisfying the following conditions may be called an intrinsic mode function: first, the total number of local maxima and local minima is equal to, or differs by at most one from, the number of zero crossings; second, at any point in time, the mean of the upper envelope defined by the local maxima and the lower envelope defined by the local minima approaches zero. The server may decompose the raw audio data (i.e., modal decomposition) (step S310) to generate several modal components of the raw audio data as fundamental signals, where each modal component corresponds to an intrinsic mode function.
In one embodiment, the raw audio data may be processed by empirical mode decomposition (EMD), or by another signal decomposition based on time-scale features, to obtain the components corresponding to the intrinsic mode functions (i.e., the modal components). The modal components contain the local feature signals of the raw audio data at different time scales of the waveform in the time domain.
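As an illustration of this decomposition step, the following is a minimal sketch assuming the PyEMD package (distributed on PyPI as EMD-signal) and a mono signal already loaded as a one-dimensional NumPy array; the function name is illustrative and not part of this disclosure.

```python
# Minimal sketch: decompose raw audio into intrinsic mode functions (IMFs)
# using empirical mode decomposition. Assumes the PyEMD package
# (pip install EMD-signal) and a 1-D NumPy array holding the raw audio.
import numpy as np
from PyEMD import EMD

def decompose_to_imfs(raw_audio: np.ndarray) -> np.ndarray:
    """Return an array of shape (n_components, n_samples); each row is a
    modal component, and the last row typically holds the residual trend."""
    emd = EMD()
    return emd(raw_audio)
```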
For example, FIG. 4A is a waveform diagram illustrating raw audio data, and FIG. 4B is a waveform diagram illustrating intrinsic mode functions (IMFs). Referring to FIGS. 4A and 4B, empirical mode decomposition of the waveform of FIG. 4A yields seven different intrinsic mode functions and a residual component, as shown in FIG. 4B.
It should be noted that, in some embodiments, each intrinsic mode function may further be processed by the Hilbert-Huang transform (HHT) to obtain its instantaneous frequency and/or amplitude.
The server may further determine the autocorrelation of each modal component (step S330). For example, detrended fluctuation analysis (DFA) can be used to determine the statistical self-similarity (i.e., autocorrelation) of a signal, with the slope of each modal component obtained through a least-squares linear fit. As another example, an autocorrelation operation may be performed on each modal component.
The server may select one or more of the modal components as the noise component of the raw audio data according to their autocorrelation. Taking the slope obtained from detrended fluctuation analysis as an example, if the slope of a first modal component is smaller than a slope threshold (e.g., 0.5 or another value), the first modal component is anti-correlated and is treated as a noise component; if the slope of a second modal component is not smaller than the slope threshold, the second modal component is correlated and is not treated as a noise component.
In other embodiments, for other types of autocorrelation analysis, a third modal component whose autocorrelation is the smallest, the second smallest, or otherwise small may likewise be treated as a noise component.
After determining the noise component, the server may cancel the noise component in the raw audio data to generate the audio data. Referring to FIG. 3, the server may discard the modal components identified as noise according to their autocorrelation and generate de-noised audio data from the remaining non-noise modal components (step S350). In other words, the server reconstructs a signal from the components of the raw audio data other than the noise component, thereby generating the de-noised audio data; the noise component is removed or deleted.
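A sketch of this selection-and-reconstruction step is shown below. The 0.5 slope threshold follows the description above, while the DFA routine itself is a simplified illustration (fixed window sizes, non-overlapping segments) rather than a validated implementation; all names are assumptions.

```python
# Sketch: keep only the modal components whose DFA slope marks them as
# correlated (non-noise) and reconstruct a de-noised signal from them.
import numpy as np

def dfa_slope(x: np.ndarray, scales=(16, 32, 64, 128, 256)) -> float:
    """Detrended fluctuation analysis: slope of log F(n) versus log n."""
    y = np.cumsum(x - np.mean(x))                    # integrated profile
    fluctuations = []
    for n in scales:
        n_windows = len(y) // n
        segments = y[:n_windows * n].reshape(n_windows, n)
        t = np.arange(n)
        residuals = []
        for seg in segments:
            trend = np.polyval(np.polyfit(t, seg, 1), t)   # least-squares line
            residuals.append(np.mean((seg - trend) ** 2))
        fluctuations.append(np.sqrt(np.mean(residuals)))
    slope, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return slope

def reconstruct_without_noise(imfs: np.ndarray, slope_threshold: float = 0.5) -> np.ndarray:
    """Sum the IMFs whose slope is at or above the threshold (non-noise)."""
    kept = [imf for imf in imfs if dfa_slope(imf) >= slope_threshold]
    return np.sum(kept, axis=0) if kept else np.zeros_like(imfs[0])
```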
FIG. 4C is a waveform diagram illustrating an example of noise-cancelled audio data. Referring to FIGS. 4A and 4C, the noise component present in FIG. 4A has been cancelled in the waveform of FIG. 4C.
It should be noted that noise cancellation for audio is not limited to the modal decomposition and autocorrelation analysis described above; other noise cancellation techniques, such as filters configured with fixed or variable thresholds or spectral subtraction, may be applied in other embodiments.
On the other hand, there are many audio segmentation methods. FIG. 5 is a flow diagram of audio segmentation according to an embodiment of the present invention. Referring to FIG. 5, in an embodiment, the server may extract sound features from the audio data (e.g., the raw audio data or the de-noised audio data) (step S510). Specifically, a sound feature may be amplitude, frequency, timbre, energy, or a change in at least one of these. For example, the sound features are short-time energy and/or zero-crossing rate. Short-time energy assumes that the signal changes slowly, or not at all, within a short time window and uses the energy within that window as a feature of the audio signal; different energy intervals correspond to different types of sound and can even be used to distinguish voiced segments from unvoiced segments. The zero-crossing rate counts how often the amplitude of the audio signal changes from positive to negative and/or from negative to positive, and this count corresponds to the frequency of the audio signal. In some embodiments, the sound features may also be obtained through spectral flux, linear predictive coefficient (LPC), or band periodicity analysis.
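A minimal sketch of these two sound features is given below, computed frame by frame over the audio data; the frame length and hop size are illustrative values, not parameters specified by this disclosure.

```python
# Sketch: frame-wise short-time energy and zero-crossing rate.
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 400, hop: int = 200) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (rows)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames: np.ndarray) -> np.ndarray:
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames: np.ndarray) -> np.ndarray:
    """Fraction of adjacent samples whose signs differ, per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)
```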
After obtaining the sound features, the server may determine the target segments and the non-target segments in the audio data according to the sound features (step S530). Specifically, a target segment is a sound segment of one or more specified sound types, and a non-target segment is a sound segment of any other type. The sound type is, for example, music, ambient sound, speech, or silence, and the value of a sound feature can be mapped to a particular sound type. Taking the zero-crossing rate as an example, the zero-crossing rate of speech is approximately 0.15, that of music is approximately 0.05, and that of ambient sound fluctuates drastically. Taking short-time energy as an example, the energy of speech is approximately 0.15 to 0.3, that of music is approximately 0 to 0.15, and that of silence is 0. It should be noted that the values and intervals used to evaluate sound types may differ for different kinds of sound features; the foregoing values are only examples.
In one embodiment, it is assumed that the target segment is speech content (i.e., its sound type is speech) and the non-target segments are not speech content (e.g., ambient sound or music). The server can determine the two endpoints of the target segment in the audio data according to the short-time energy and the zero-crossing rate of the audio data. For example, a portion of the audio data whose zero-crossing rate is below a zero-crossing threshold is considered speech, and a portion whose energy exceeds an energy threshold is considered speech; a sound segment whose zero-crossing rate is below the zero-crossing threshold or whose energy exceeds the energy threshold is therefore the target segment. The start point and end point of a target segment in the time domain form its boundaries, and the sound segments outside these boundaries may be non-target segments. For example, the extent of a voiced portion is roughly determined by short-time energy detection, and the true beginning and end of the speech segment are then refined using the zero-crossing rate.
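The following sketch illustrates this two-stage endpoint detection. It takes the per-frame energy and zero-crossing-rate arrays (for example, from the feature sketch above) as inputs; the threshold values mirror the approximate figures mentioned earlier and are illustrative only.

```python
# Sketch: dual-threshold endpoint detection for a speech (target) segment.
import numpy as np

def find_speech_endpoints(energy: np.ndarray, zcr: np.ndarray,
                          energy_thr: float = 0.15, zcr_thr: float = 0.15):
    """Return (start_frame, end_frame) of the speech segment, or None."""
    # Coarse pass: frames whose short-time energy marks them as voiced speech.
    voiced = np.where(energy > energy_thr)[0]
    if voiced.size == 0:
        return None
    start, end = int(voiced[0]), int(voiced[-1])
    # Fine pass: extend the boundaries while the zero-crossing rate still looks
    # speech-like (below the zero-crossing threshold, per the description above).
    while start > 0 and zcr[start - 1] < zcr_thr:
        start -= 1
    while end < len(zcr) - 1 and zcr[end + 1] < zcr_thr:
        end += 1
    return start, end
```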
In one embodiment, the server may keep the target segments of the raw audio data or the de-noised audio data and remove the non-target segments to obtain the final sound data. In other words, a piece of sound data includes one or more target segments and no non-target segments. Taking speech content as the target segment as an example, if the audio data produced by audio segmentation is played back, only human speech is heard.
It should be noted that in other embodiments, either or both of steps S210 and S230 in fig. 2 may be omitted.
Referring to FIG. 1, the server may determine a prediction result of the audio data by using the classification model (step S130). Specifically, the classification model is trained based on a machine learning algorithm, for example a neural network (NN), a recurrent neural network (RNN), a long short-term memory (LSTM) model, or another algorithm related to audio recognition. The server may train the classification model in advance or directly obtain a preliminarily trained classification model.
FIG. 6 is a flow diagram of model training according to an embodiment of the present invention. Referring to FIG. 6, for the preliminary training, the server may provide an initial prompt message according to a target segment (step S610). The initial prompt message asks the operator to assign a label to the target segment. In one embodiment, the server may play the target segment through a speaker and provide visual or audible message content through a display or the speaker, for example asking whether the sound is crying. The operator then provides an initial confirmation response (i.e., a label) to the initial prompt message. For example, the operator selects one of "yes" or "no" via a keyboard, a mouse, or a touch panel. As another example, the server provides options (i.e., labels) such as crying, laughter, and screaming, and the operator selects one of them.
After all target segments are labeled, the server may train the classification model according to the initial confirmation responses to the initial prompt messages (step S630). Each initial confirmation response includes the label corresponding to a target segment. That is, the target segment serves as the input data in the training data, and the corresponding label serves as the output/prediction result in the training data.
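A minimal sketch of this preliminary labeling stage is shown below; playback through a speaker is device-specific and is left as a placeholder, and the label options are illustrative rather than fixed by this disclosure.

```python
# Sketch: collect initial confirmation responses (labels) for target segments
# before the first round of training (steps S610-S630).
def collect_initial_labels(target_segments, options=("crying", "laughter", "screaming")):
    labeled = []
    for idx, segment in enumerate(target_segments):
        # play_segment(segment)  # placeholder: play the target segment to the operator
        print(f"Segment {idx}: which label applies? Options: {', '.join(options)}")
        answer = input("> ").strip()
        if answer in options:
            labeled.append((segment, answer))   # (input data, label) training pair
    return labeled
```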
The server may use a preset or user-selected machine learning algorithm. For example, FIG. 7 is a schematic diagram of a neural network according to an embodiment of the invention. Referring to FIG. 7, the structure of the neural network mainly includes three parts: an input layer 710, a hidden layer 730, and an output layer 750. In the input layer 710, a number of neurons receive a large number of nonlinear input messages. In the hidden layer 730, neurons and connections may form one or more layers, and each layer includes linear combinations and nonlinear activation functions. In some embodiments, for example in a recurrent neural network, the output of a layer of the hidden layer 730 is fed back as part of its input. The messages are transmitted, analyzed, and/or weighted along the neuron connections to form the prediction result at the output layer 750. The classification model is trained by finding the parameters (e.g., weights and bias values) and connections in the hidden layer 730.
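For concreteness, the sketch below builds a small LSTM classifier of the kind named above using tf.keras; the input shape (frames × features) and the number of labels are assumptions made only for illustration.

```python
# Sketch: a compact LSTM-based classification model for audio features.
import tensorflow as tf

NUM_LABELS = 4  # e.g., baby crying, female voice, alarm, other (illustrative)

def build_classifier(n_frames: int = 100, n_features: int = 40) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_frames, n_features)),     # input layer
        tf.keras.layers.LSTM(64),                                 # hidden layer
        tf.keras.layers.Dense(NUM_LABELS, activation="softmax"),  # output layer
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```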
After the classification model is trained, inputting audio data into the classification model yields a prediction result. The prediction result includes one or more labels defined by the classification model. A label is, for example, a female voice, a male voice, a baby voice, crying, laughter, the voice of a specific person, or an alarm bell, and the labels can be changed according to user requirements. In some embodiments, the prediction result may further include a prediction probability for each label.
Referring to FIG. 1, the server may provide a prompt message according to the degree of loss of the prediction result (step S150). Specifically, the degree of loss is related to the difference between the prediction result and the corresponding actual result. For example, the degree of loss can be computed as the mean squared error (MSE), the mean absolute error (MAE), or the cross entropy. If the degree of loss does not exceed a loss threshold, the classification model may remain unchanged and need not be retrained. If the degree of loss exceeds the loss threshold, the classification model may need to be retrained or revised.
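As a simple illustration, the sketch below computes one of the loss measures named above (cross entropy) and gates the prompt on a loss threshold; the threshold value is an assumption for demonstration only.

```python
# Sketch: compute the degree of loss for one prediction and decide whether a
# prompt message should be raised.
import numpy as np

def degree_of_loss(predicted_probs: np.ndarray, actual_one_hot: np.ndarray) -> float:
    """Cross entropy between the predicted label probabilities and the actual result."""
    eps = 1e-12  # avoid log(0)
    return float(-np.sum(actual_one_hot * np.log(predicted_probs + eps)))

def needs_prompt(predicted_probs: np.ndarray, actual_one_hot: np.ndarray,
                 loss_threshold: float = 0.5) -> bool:
    return degree_of_loss(predicted_probs, actual_one_hot) > loss_threshold
```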
In the embodiment of the invention, the server can further provide the prompt message to the operator. The prompt message asks about the relevance of the audio data to the label. In one embodiment, the prompt message includes the audio data and question content, and the question content asks whether the audio data belongs to (or is associated with) the label. The server may play the audio data through a speaker and present the question content through a speaker or a display. For example, the display presents the question of whether the sound is a baby crying, and the operator simply selects one of the "yes" and "no" options. In addition, if the audio data has already undergone the audio processing described with reference to FIG. 2, the operator only needs to listen to the target segment or the de-noised sound, which further improves labeling efficiency.
It should be noted that in some embodiments, the prompt message may also provide options that query several labels, for example question content such as "Is this a baby crying or an adult crying?".
The server may modify the classification model according to the confirmation response to the prompt message (step S170). Specifically, the confirmation response is related to confirming the relevance of the audio data to the label; the relevance is, for example, belonging, not belonging, or a relevance metric value. In one embodiment, the server may receive an input operation (e.g., pressing or clicking) from the operator via an input device (e.g., a mouse, a keyboard, a touch panel, or buttons). The input operation corresponds to an option of the question content, and the option is either that the audio data belongs to the label or that the audio data does not belong to the label. For example, a prompt message presented on the display provides the two options "yes" and "no"; after the operator has listened to the target segment, the "yes" option can be selected with the button corresponding to "yes".
In other embodiments, the server may also generate the confirmation response through other speech recognition means, such as recognition of preset keywords or comparison against preset acoustic features.
If the relevance indicates that the audio data belongs to the queried label, or its relevance metric is greater than a degree threshold, the prediction result can be confirmed as correct (i.e., the prediction result matches the actual result). Conversely, if the relevance indicates that the audio data does not belong to the queried label, or its relevance metric is less than the degree threshold, the prediction result can be confirmed as incorrect (i.e., the prediction result differs from the actual result).
FIG. 8 is a flow diagram of updating a model according to an embodiment of the invention. Referring to FIG. 8, the server determines whether the prediction result is correct (step S810). If the prediction result is correct, the prediction capability of the current classification model meets expectations, and the classification model is not updated or revised (step S820). On the other hand, if the prediction result is incorrect (i.e., the confirmation response indicates that the label corresponding to the prediction result is wrong), the server corrects the erroneous data (step S830); for example, a "yes" option is changed to a "no" option. Next, the server may use the corrected data as training data and retrain the classification model (step S850). In some embodiments, if the confirmation response specifies a particular label, the server may use the label corresponding to the confirmation response together with the audio data as training data for the classification model and retrain the classification model accordingly. After retraining, the server may update the classification model (step S870); for example, the retrained classification model replaces the previously stored classification model.
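The sketch below mirrors this update flow for a single sample: it keeps the model when the operator confirms the prediction, and otherwise corrects the label and retrains. It assumes a compiled tf.keras classifier such as the one sketched earlier; the label encoding is illustrative.

```python
# Sketch: correct mislabeled data from the confirmation response and retrain
# the classification model (steps S810-S870).
import numpy as np
import tensorflow as tf

def update_model(model: tf.keras.Model, features: np.ndarray,
                 predicted_label: int, confirmed_label: int,
                 num_labels: int) -> tf.keras.Model:
    if confirmed_label == predicted_label:
        return model                              # prediction correct: keep model (S820)
    # Correct the erroneous data (S830) and use it as training data (S850).
    y = tf.keras.utils.to_categorical([confirmed_label], num_classes=num_labels)
    model.fit(features[np.newaxis, ...], y, epochs=1, verbose=0)
    return model                                  # retrained model replaces the old one (S870)
```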
Therefore, the embodiment of the invention evaluates, through the two stages of the degree of loss and the confirmation response, whether the prediction capability of the classification model meets expectations or needs to be revised, thereby improving both training efficiency and prediction accuracy.
In addition, the server can provide the classification model to other devices. For example, FIG. 9 is a flowchart illustrating an application of a smart doorbell 50 according to an embodiment of the present invention. Referring to FIG. 9, the training server 30 downloads audio data from the cloud server 10 (step S910). The training server 30 may train the classification model (step S920) and store the trained classification model (step S930). The training server 30 may serve as a data providing platform (e.g., a File Transfer Protocol (FTP) server or a web server) and may provide the classification model for transmission to other devices via a network. Taking the smart doorbell 50 as an example, the smart doorbell 50 may download the classification model via FTP (step S940) and store it in its own memory 53 for subsequent use (step S950). The smart doorbell 50 may also pick up sound from outside and receive a voice input through the microphone 51 (step S960); the voice input is, for example, a person speaking, calling, or crying. Alternatively, the smart doorbell 50 may collect voice information from other remote devices via Internet of Things (IoT) wireless technologies (e.g., BLE, Zigbee, or Z-Wave), and such information may be streamed in real time and sent to the smart doorbell 50 over a wireless transmission; upon receipt, the smart doorbell 50 parses the voice information and provides it as the voice input. The smart doorbell 50 may load the classification model obtained over the network from its memory 53 to recognize the received voice input and determine the prediction/recognition result accordingly (step S970). The smart doorbell 50 may further provide an event notification according to the recognition result of the voice input (step S980). For example, if the recognition result is a call from the male owner, the smart doorbell 50 sounds an auditory event notification such as a musical chime. As another example, if the recognition result is a call from an outsider or other non-family member, the smart doorbell 50 presents a visual event notification such as the image in front of the door.
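A sketch of the device-side recognition step is given below: the doorbell loads the previously downloaded classification model, recognizes a voice input, and maps the result to an event notification. The label names, feature shape, and notification actions are placeholders; only the tf.keras loading and prediction calls come from the library itself.

```python
# Sketch: load the downloaded classification model, recognize a voice input,
# and issue an event notification (steps S950-S980).
import numpy as np
import tensorflow as tf

LABELS = ["family_member_call", "stranger_call", "baby_crying", "other"]  # illustrative

def recognize_and_notify(model_path: str, voice_features: np.ndarray) -> str:
    model = tf.keras.models.load_model(model_path)        # model obtained earlier over the network
    probs = model.predict(voice_features[np.newaxis, ...], verbose=0)[0]
    label = LABELS[int(np.argmax(probs))]
    if label == "family_member_call":
        print("Event: play musical chime")                # auditory event notification
    elif label == "stranger_call":
        print("Event: show the camera image in front of the door")  # visual notification
    return label
```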
FIG. 10 is a block diagram of the components of the training server 30 according to one embodiment of the present invention. Referring to FIG. 10, the training server 30 may be the server executing the embodiments described in FIGS. 1, 2, 3, 5, 6, and 8, and may be a workstation, a personal computer, a smartphone, a tablet computer, or another computing device. The training server 30 includes, but is not limited to, a communication interface 31, a memory 33, and a processor 35.
The communication interface 31 may support wired networks such as a fiber-optic network, Ethernet, or cable, and may also support wireless networks such as Wi-Fi, a mobile network, Bluetooth (e.g., BLE, the fifth generation, or later), Zigbee, or Z-Wave. In one embodiment, the communication interface 31 is used to transmit or receive data, for example, to receive audio data or to transmit the classification model.
The memory 33 may be any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, or the like, and is used to store program code, software modules, audio data, the classification model and its related parameters, and other data or files.
The processor 35 is coupled to the communication interface 31 and the memory 33. The processor 35 may be a central processing unit (CPU) or another programmable general-purpose or special-purpose microprocessor, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), or other similar component, or a combination thereof. In the embodiment of the present invention, the processor 35 is configured to execute all or part of the operations of the training server 30, for example, training the classification model, performing audio processing, or correcting data.
In summary, in the model construction method for audio recognition according to the embodiment of the present invention, a prompt message is provided according to the degree of loss, i.e., the difference between the prediction result obtained by the classification model and the actual result, and the classification model is revised according to the corresponding confirmation response. The operator can complete labeling simply by responding to the prompt message. In addition, the raw audio data may be processed by noise cancellation and audio segmentation so that it is easier for the operator to listen to. The recognition accuracy of the classification model and the labeling efficiency of the operator can therefore be improved.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A model construction method for audio recognition, comprising:
obtaining audio data;
determining a prediction result of the audio data by using a classification model, wherein the classification model is trained based on a machine learning algorithm, and the prediction result comprises a label defined by the classification model;
providing a prompt message according to a degree of loss of the prediction result, wherein the degree of loss is related to a difference between the prediction result and a corresponding actual result, and the prompt message is used for inquiring about the relevance of the audio data to the label; and
modifying the classification model according to a confirmation response to the prompt message, wherein the confirmation response is related to confirming the relevance of the audio data to the label.
2. The model construction method for audio recognition as recited in claim 1, wherein the prompt message comprises the audio data and question content, the question content asks whether the audio data belongs to the label, and the step of providing the prompt message comprises:
playing the audio data and providing the question content.
3. The model construction method for audio recognition as recited in claim 2, wherein the step of modifying the classification model according to the confirmation response to the prompt message comprises:
receiving an input operation, wherein the input operation corresponds to an option of the question content, and the option is that the audio data belongs to the label or that the audio data does not belong to the label; and
determining the confirmation response according to the input operation.
4. The model construction method for audio recognition as recited in claim 1, wherein the step of modifying the classification model according to the confirmation response to the prompt message comprises:
using the label corresponding to the confirmation response and the audio data as training data for the classification model, and retraining the classification model according to the training data.
5. The model construction method for audio recognition as recited in claim 1, wherein the step of obtaining the audio data comprises:
analyzing characteristics of raw audio data to determine a noise component of the raw audio data; and
canceling the noise component in the raw audio data to generate the audio data.
6. The model construction method for audio recognition as recited in claim 5, wherein the characteristics comprise a plurality of intrinsic mode functions, and the step of determining the noise component of the raw audio data comprises:
decomposing the raw audio data to generate a plurality of modal components of the raw audio data, wherein each of the modal components corresponds to one of the intrinsic mode functions;
determining an autocorrelation of each of the modal components; and
selecting one modal component as the noise component according to the autocorrelation of the plurality of modal components.
7. The model construction method for audio recognition as recited in claim 1 or 5, wherein the step of obtaining the audio data comprises:
extracting sound features from the audio data;
determining a target segment and a non-target segment in the audio data according to the sound features; and
retaining the target segment and removing the non-target segment.
8. The model construction method for audio recognition as recited in claim 7, wherein the target segment is speech content, the non-target segment is not the speech content, the sound features comprise short-time energy and a zero-crossing rate, and the step of extracting the sound features from the audio data comprises:
determining two endpoints of the target segment in the audio data according to the short-time energy and the zero crossing rate of the audio data, wherein the two endpoints are related to the boundary of the target segment in the time domain.
9. The model construction method for audio recognition as recited in claim 7, further comprising:
providing a second prompt message according to the target segment, wherein the second prompt message is used to request that the label be assigned to the target segment; and
training the classification model according to a second confirmation response to the second prompt message, wherein the second confirmation response comprises the label corresponding to the target segment.
10. The model construction method for audio recognition as recited in claim 1, further comprising:
providing the classification model for transmission over a network;
loading the classification model obtained over the network to recognize a voice input; and
providing an event notification according to a recognition result of the voice input.
CN202010996980.0A 2020-09-21 2020-09-21 Model construction method for audio recognition Pending CN114283845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010996980.0A CN114283845A (en) 2020-09-21 2020-09-21 Model construction method for audio recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010996980.0A CN114283845A (en) 2020-09-21 2020-09-21 Model construction method for audio recognition

Publications (1)

Publication Number Publication Date
CN114283845A true CN114283845A (en) 2022-04-05

Family

ID=80867452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010996980.0A Pending CN114283845A (en) 2020-09-21 2020-09-21 Model construction method for audio recognition

Country Status (1)

Country Link
CN (1) CN114283845A (en)

Similar Documents

Publication Publication Date Title
CN111161752B (en) Echo cancellation method and device
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
JP5998603B2 (en) Sound detection device, sound detection method, sound feature amount detection device, sound feature amount detection method, sound interval detection device, sound interval detection method, and program
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
WO2020237769A1 (en) Accompaniment purity evaluation method and related device
CN113330511B (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
US20210358497A1 (en) Wakeword and acoustic event detection
CN111145763A (en) GRU-based voice recognition method and system in audio
JP2020064253A (en) Learning device, detection device, learning method, learning program, detection method, and detection program
CN108053826B (en) Method and device for man-machine interaction, electronic equipment and storage medium
CN117409761B (en) Method, device, equipment and storage medium for synthesizing voice based on frequency modulation
US20220093089A1 (en) Model constructing method for audio recognition
JP2018005122A (en) Detection device, detection method, and detection program
US20240071408A1 (en) Acoustic event detection
CN111859008A (en) Music recommending method and terminal
CN113838462A (en) Voice wake-up method and device, electronic equipment and computer readable storage medium
CN116884405A (en) Speech instruction recognition method, device and readable storage medium
CN112002349A (en) Voice endpoint detection method and device
CN114283845A (en) Model construction method for audio recognition
De Souza et al. Real-time music tracking based on a weightless neural network
US20220137917A1 (en) Method and system for assigning unique voice for electronic device
Kertész et al. Common sounds in bedrooms (CSIBE) corpora for sound event recognition of domestic robots
JP2024504435A (en) Audio signal generation system and method
WO2020059465A1 (en) Information processing device for playing data
KR20220039018A (en) Electronic apparatus and method for controlling thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination