CN116597817A - Audio recognition method, device and storage medium

Audio recognition method, device and storage medium

Info

Publication number
CN116597817A
CN116597817A (application CN202310456605.0A)
Authority
CN
China
Prior art keywords
state
chunk
current chunk
recognition model
audio recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310456605.0A
Other languages
Chinese (zh)
Inventor
王运侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202310456605.0A
Publication of CN116597817A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses an audio recognition method, device, and storage medium, relating to the technical field of audio recognition. The method comprises: inputting target audio data into an audio recognition model; dividing the target audio data into n chunks through the audio recognition model, wherein n is an integer greater than 1; for each of the n chunks, acquiring the history state of the current chunk through the audio recognition model and calculating the current chunk according to the history state and the current chunk, wherein the history state is the state calculated and saved before the attention operation when the previous chunk of the current chunk was calculated; and outputting the recognized target audio data according to the calculation result of each chunk through the audio recognition model. This solves the problem of low audio recognition efficiency in the prior art: by recording the history state, the current chunk can be calculated from the history state and the current chunk alone, without computing over the full data of the previous chunk, which improves audio recognition efficiency.

Description

Audio recognition method, device and storage medium
Technical Field
The application relates to an audio recognition method, an audio recognition device, and a storage medium, and belongs to the technical field of audio recognition.
Background
A time series model depends on the order in which events occur: feeding the same values into the model in a different order produces different results. The most common sequence network models in deep learning include RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory) networks.
In model streaming inference application scenarios, for example during an audio call, chunk attention has the smallest performance loss and is the most widely used approach. Chunk attention divides the input into multiple chunks according to a fixed chunk size, and each chunk depends on itself and the previous chunk. Because calculating the next chunk requires information from the previous chunk, the previous chunk and the current chunk must actually be fed in together for each calculation; the amount of computation is therefore large, and audio recognition efficiency in the existing scheme is low.
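By way of illustration only, the following minimal NumPy sketch shows the computation pattern of this existing scheme; the chunk size, feature dimension, and toy attention are assumptions for demonstration, not the implementation of any particular model.

```python
import numpy as np

C, D = 4, 8                                   # chunk size and feature dim, assumed

def attention(x):
    """Toy self-attention over all frames of x."""
    scores = x @ x.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

features = np.random.randn(3 * C, D)          # three chunks' worth of input
chunks = np.split(features, 3)

# Existing scheme: computing chunk i re-feeds the whole previous chunk,
# so every step processes 2*C frames instead of C.
outputs = []
for i, chunk in enumerate(chunks):
    context = chunk if i == 0 else np.concatenate([chunks[i - 1], chunk])
    outputs.append(attention(context)[-C:])   # keep only the current-chunk rows
```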
Disclosure of Invention
The application aims to provide an audio recognition method, an audio recognition device, and a storage medium to solve the above problems in the prior art.
In order to achieve the above purpose, the present application provides the following technical solutions:
According to a first aspect, an embodiment of the present application provides an audio recognition method, including:
inputting the target audio data into an audio recognition model;
dividing the target audio data into n chunks through the audio recognition model, wherein n is an integer greater than 1;
for each of the n chunks, acquiring the history state of the current chunk through the audio recognition model, and calculating the current chunk according to the history state and the current chunk, wherein the history state is the state calculated and saved before the attention operation when the previous chunk of the current chunk was calculated;
and outputting the recognized target audio data according to the calculation result of each chunk through the audio recognition model.
Optionally, the acquiring, through the audio recognition model, the history state of the current chunk and calculating the current chunk according to the history state and the current chunk includes:
acquiring the input state of the target audio data;
and acquiring the history state of the current chunk through the audio recognition model, and calculating the current chunk according to the input state, the history state, and the current chunk.
Optionally, the acquiring the input state of the target audio data includes:
identifying the endpoint state of the target audio data through a speech detection module;
and marking the input state through a state model according to the identified endpoint state, wherein the input state comprises the start time, an intermediate time, and the end time of the target audio data.
Optionally, the acquiring, through the audio recognition model, the history state of the current chunk and calculating the current chunk according to the input state, the history state, and the current chunk includes:
if the input state is the start time, initializing the history state, and calculating the current chunk according to the current chunk alone through the audio recognition model;
and updating the history state with the calculation result obtained before the attention operation.
Optionally, the acquiring, through the audio recognition model, the history state of the current chunk and calculating the current chunk according to the input state, the history state, and the current chunk includes:
if the input state is an intermediate time, calculating the current chunk according to the history state and the current chunk through the audio recognition model;
and updating the history state with the calculation result obtained before the attention operation.
Optionally, the acquiring, through the audio recognition model, the history state of the current chunk and calculating the current chunk according to the input state, the history state, and the current chunk includes:
if the input state is the end time, calculating the current chunk according to the history state and the current chunk through the audio recognition model, without updating the history state.
Optionally, the marking the input state through the state model according to the identified endpoint state includes:
if the endpoint state is the start state, the input state is: start=1, end=0;
if the endpoint state is an intermediate state, the input state is: start=0, end=0;
if the endpoint state is the end state, the input state is: start=0, end=1.
In a second aspect, there is provided an audio recognition apparatus, the apparatus comprising:
the input module is used for inputting the target audio data into the audio recognition model;
the segmentation module is used for segmenting the target audio data into n chunks through the audio recognition model, wherein n is an integer greater than 1;
the computing module is used for acquiring, for each of the n chunks, the history state of the current chunk through the audio recognition model, and calculating the current chunk according to the history state and the current chunk, wherein the history state is the state calculated and saved before the attention operation when the previous chunk of the current chunk was calculated;
and the output module is used for outputting the recognized target audio data according to the calculation result of each chunk through the audio recognition model.
In a third aspect, there is provided an audio recognition apparatus comprising a memory having stored therein at least one program instruction and a processor to implement the method of the first aspect by loading and executing the at least one program instruction.
In a fourth aspect, there is provided a computer storage medium having stored therein at least one program instruction that is loaded and executed by a processor to implement the method according to the first aspect.
By inputting target audio data into an audio recognition model; dividing the target audio data into n chunks through the audio recognition model, wherein n is an integer greater than 1; for each of the n chunks, acquiring the history state of the current chunk through the audio recognition model and calculating the current chunk according to the history state and the current chunk, wherein the history state is the state calculated and saved before the attention operation when the previous chunk of the current chunk was calculated; and outputting the recognized target audio data according to the calculation result of each chunk through the audio recognition model, the problem of low audio recognition efficiency in the prior art is solved: by recording the history state, the current chunk can be calculated from the history state and the current chunk alone, without computing over the full data of the previous chunk, which improves audio recognition efficiency.
Meanwhile, the application marks the input state of the audio data through the start and end flags, replacing the real-time state transmission between the client and the server in the existing scheme, which reduces data transmission between the client and the server and reduces inference delay.
The foregoing is only an overview of the present application. To provide a better understanding, the application is described below with reference to its preferred embodiments and the accompanying drawings.
Drawings
FIG. 1 is a flow chart of a method for audio recognition according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a possible input-output sequence of an audio recognition model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of one possible network model of an audio recognition model according to one embodiment of the present application;
FIG. 4 is a schematic diagram of one possible flow of characterizing input states by start and end according to one embodiment of the present application;
FIG. 5 is a flowchart of one possible calculation of a chunk according to the history state, provided by one embodiment of the present application;
FIG. 6 is a flowchart of one possible calculation of a chunk based on the history state and the input state, according to one embodiment of the present application.
Detailed Description
The embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the application are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
In the description of the present application, it should be noted that directions or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the directions or positional relationships shown in the drawings; they are merely for convenience and simplicity of description and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present application, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted", "connected", and "coupled" are to be construed broadly: a connection may, for example, be fixed, detachable, or integral; mechanical or electrical; direct, indirect through an intermediate medium, or internal communication between two elements. The specific meanings of the above terms in the present application will be understood by those of ordinary skill in the art according to the specific circumstances.
In addition, the technical features of the different embodiments of the present application described below may be combined with each other as long as they do not collide with each other.
Referring to FIG. 1, which shows a flowchart of an audio recognition method according to an embodiment of the present application, the method includes:
Step 101, inputting target audio data into an audio recognition model;
there are various timing models in the field of speech recognition, and it is currently more common to use an end-to-end (end-to-end) model based on a transducer structure, where the end-to-end model includes an encoder and a decoder, the encoder is responsible for encoding, the decoder is responsible for decoding, the input is one sequence and the output is another sequence, and the input sequence and the output sequence both include a timing sequence. The above end-to-end model is the audio recognition model of the application.
FIG. 2 shows an application example of an end-to-end model and describes the workflow of an end-to-end translation model. An end-to-end speech recognition model likewise contains two modules, an encoder and a decoder. In the translation model, the input is the English sequence "it is a cat" and the output is the corresponding Chinese sequence meaning "this is a cat". The speech recognition model differs in that the input to the encoder becomes the fbank features extracted from the audio (for example, a recording of "it is a cat"); the encoder outputs the encoded vector of the audio, and the decoder then decodes this vector into the text recognition result corresponding to the audio.
The above only exemplifies the case where the encoder and the decoder together form the audio recognition model. In actual implementation, the encoder and the decoder may be two independent models located on one server and sharing a GPU (graphics processing unit) card. Of course, depending on the actual architecture deployment, the two models may also be deployed on different servers; this is not limited here.
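By way of illustration only, the following minimal NumPy sketch shows the interface of the encoder/decoder pipeline described above; the shapes and the stand-in single-matrix "networks" are assumptions for demonstration and do not correspond to any real model.

```python
import numpy as np

T, F, D, V = 100, 80, 256, 5000        # frames, fbank dims, model dim, vocab (assumed)
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(F, D))        # stand-in for the encoder network
W_dec = rng.normal(size=(D, V))        # stand-in for the decoder network

fbank = rng.normal(size=(T, F))        # fbank features extracted from the audio
encoded = np.tanh(fbank @ W_enc)       # encoder: features -> encoded vectors
logits = encoded @ W_dec               # decoder stand-in: vectors -> token scores
tokens = logits.argmax(axis=-1)        # toy recognition result (token ids)
```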
Additionally, FIG. 3 shows a schematic diagram of one possible network model of the audio recognition model. As can be seen from FIG. 3, the multi-head attention operation in the audio recognition model of the present application relies on context information, while the other network structures do not.
Step 102, dividing the target audio data into n chunks through the audio recognition model, wherein n is an integer greater than 1;
the chunk attitution divides the input into multiple chunks according to a fixed chunk size, containing for each chunk the input [ t+1, t+2, …, t+c ], where C is the size of the chunk size. Each chunk depends on itself and the previous chunk, so the delay in the decode model depends on the size of the chunk size.
In actual implementation, the number n of chunks obtained by segmentation may be determined according to the chunk size and the length of the target audio data; the chunk size may be a default size or a custom size, which is not limited here.
Step 103, for each of the n chunks, acquiring the history state of the current chunk through the audio recognition model, and calculating the current chunk according to the history state and the current chunk, wherein the history state is the state calculated and saved before the attention operation when the previous chunk of the current chunk was calculated;
Optionally, this step includes:
First, acquiring the input state of the target audio data;
(1) Identifying the endpoint state of the target audio data through the speech detection module;
Optionally, during audio recognition the client may send start, feed, or stop to the server to indicate the beginning, continuation, and end of a piece of audio. These signals are obtained by a VAD module (voice activity detection, also called speech activity detection or speech detection) that detects the endpoint state of the speech signal before the audio recognition model. The endpoint state includes: a start state, an intermediate state, and an end state.
Specifically, when the VAD module detects that the client has sent start, the target audio data can be identified as being in the start state; when it detects that the client has sent feed, the target audio data can be identified as being in an intermediate state; and when it detects that the client has sent stop, the target audio data can be identified as being in the end state.
(2) Marking the input state through the state model according to the identified endpoint state, wherein the input state comprises the start time, an intermediate time, and the end time of the target audio data.
The state model may be a stateful model, and the input state of the audio data may be marked by the start and end flags. Specifically:
if the endpoint state is the start state, the input state is: start=1, end=0;
if the endpoint state is an intermediate state, the input state is: start=0, end=0;
if the endpoint state is the end state, the input state is: start=0, end=1.
For example, FIG. 4 shows one possible schematic diagram of representing the input state through the state model. As shown in FIG. 4, at the start input time start=1 and end=0; at an intermediate time start=0 and end=0; and at the end time start=0 and end=1.
The application marks the input state of the audio data through the start and end flags to replace the real-time state transmission between the client and the server in the existing scheme, thereby reducing the data transmission time between the client and the server and the data copying time between the CPU and the GPU, that is, reducing the delay of the client's requests to the audio recognition service.
The application is illustrated here with the input state marked by the state model; in actual implementation, the input state may also be obtained through real-time interaction with the client, which is not limited in this embodiment.
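By way of illustration, the following minimal sketch maps the detected endpoint state to the start and end flags described above; the literal state names are assumptions for demonstration.

```python
def mark_input_state(endpoint_state):
    """Return the (start, end) flags for a chunk given the VAD endpoint state."""
    if endpoint_state == "start":       # client sent `start`
        return 1, 0
    if endpoint_state == "middle":      # client sent `feed`
        return 0, 0
    if endpoint_state == "end":         # client sent `stop`
        return 0, 1
    raise ValueError(f"unknown endpoint state: {endpoint_state}")

assert mark_input_state("start") == (1, 0)
assert mark_input_state("middle") == (0, 0)
assert mark_input_state("end") == (0, 1)
```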
Second, acquiring the history state of the current chunk through the audio recognition model, and calculating the current chunk according to the input state, the history state, and the current chunk.
Optionally, this step may include the following:
If the input state is the start time, the history state is initialized, and the current chunk is calculated by the audio recognition model according to the current chunk alone; the calculation result obtained before the attention operation is then written into the history state.
When the input state is the start time, the current chunk is the first chunk; the history state can therefore be initialized to empty and the calculation performed directly on the current chunk. After the calculation, the result obtained before the attention operation is saved as the history state so that it can be used for the time-series prediction of the subsequent audio. Unless otherwise specified, the attention operation in this application refers to all operations in the audio recognition model whose calculation relies on the previous chunk; in actual implementations this operation may be named differently depending on the network model, which is not limited here.
If the input state is an intermediate time, the current chunk is calculated by the audio recognition model according to the history state and the current chunk, and the calculation result obtained before the attention operation is then written into the history state.
If the input state is the end time, the current chunk is calculated by the audio recognition model according to the history state and the current chunk, and the history state does not need to be updated.
When the input state is the end time, the current chunk is the last chunk, and speech recognition ends after the calculation is completed, so the history state does not need to be updated after the current chunk is calculated.
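Taken together, the three branches above can be illustrated with the following minimal streaming-loop sketch; the cache size, dimensions, placeholder arithmetic, and the hypothetical chunk source are assumptions for demonstration only.

```python
import numpy as np

D = 16                 # feature dimension, assumed
HISTORY_FRAMES = 10    # frames of pre-attention state cached per chunk, assumed

def stream_of_chunks(n=4, chunk_size=20):
    """Hypothetical chunk source: yields (chunk, start, end) triples."""
    for i in range(n):
        yield np.random.randn(chunk_size, D), int(i == 0), int(i == n - 1)

def compute_chunk(chunk, history):
    """Toy stand-in for the pre-attention computation plus attention."""
    context = chunk if history is None else np.concatenate([history, chunk])
    output = context.mean(axis=0) + chunk            # placeholder arithmetic
    return output, context[-HISTORY_FRAMES:]         # result, state to cache

history = None
for chunk, start, end in stream_of_chunks():
    if start == 1:
        history = None            # start time: initialize the history state
    output, new_history = compute_chunk(chunk, history)
    if end == 0:
        history = new_history     # start/intermediate time: update the cache
    # end == 1: last chunk, recognition ends, no update is needed
```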
In one possible embodiment, the history state is represented by state. Referring to FIG. 5, which shows one possible schematic diagram of the present application, when each chunk is calculated, each output is computed in turn from the current input chunk and the history state.
In another possible embodiment, the respective input states are additionally represented by the state model on the basis of FIG. 5. Referring to FIG. 6, start=1 and end=0 at input1, i.e., at the start time; start=0 and end=0 at input2 and input3, i.e., at intermediate times; and start=0 and end=1 at the last input, i.e., at the end time. For each chunk, an output is calculated from the state of the current chunk and the history state.
In one possible embodiment, the encoder performs piecewise inference in streaming recognition. For example, when the chunk size is 80, the input length of the audio recognition model is 80 frames of audio data; since speech recognition extracts features every 10 ms, 80 frames correspond to 800 ms, and thus the audio fed to the audio recognition model is calculated once every 800 ms. After the 80 frames of audio features are input to the encoder, they pass through a VGG network, and the two max-pooling layers in the VGG reduce the feature length to 20 frames. The output of the VGG then enters the attention operation in the encoder, where attention is computed over the current chunk together with the last 10 frames of the previous chunk's data; therefore each calculation must retain its last 10 frames of results for the next chunk's calculation. Because the calculation of the previous 80 frames has already finished by the time the next 80 frames are received, those retained data must be sent to the encoder as history information together with the second input, so the encoder model needs to record the previous state and update it after each calculation completes. In the present application, the encoder only relies on the last 10 frames of the previous chunk's results in its calculation; the network can also be made to see the preceding information by re-feeding data. In the prior art, frames 1-80 are fed the first time and frames 40-160 may be fed the second time (so that, besides the current frames 80-160, the second half of the previous chunk on which the computation depends is input again, letting the model see the history before the current input); but this causes the parts of the model other than the attention to compute over 120 frames instead of 80 frames each time, increasing the amount of computation.
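By way of illustration, a minimal sketch of this encoder-side caching follows. The 80/20/10 frame counts come from the example above; the hidden dimension, the stand-in VGG subsampling, and the toy attention are assumptions for demonstration.

```python
import numpy as np

RAW_FRAMES = 80     # raw frames per chunk (the chunk size in the example)
DOWN_FRAMES = 20    # frames remaining after the two max-pooling layers
CACHE_FRAMES = 10   # frames of encoder state kept for the next chunk
D = 512             # hidden dimension, assumed

def vgg_front_end(raw_chunk):
    """Stand-in for the VGG subsampling: 80 raw frames -> 20 frames."""
    ratio = RAW_FRAMES // DOWN_FRAMES
    return raw_chunk.reshape(DOWN_FRAMES, ratio, D).mean(axis=1)

def encode_chunk(raw_chunk, cache):
    """Attend over the current chunk plus the cached 10 frames, then return
    the current-chunk output and the new 10-frame cache for the next chunk."""
    x = vgg_front_end(raw_chunk)                          # (20, D)
    context = x if cache is None else np.concatenate([cache, x])
    scores = context @ context.T / np.sqrt(D)             # toy attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = (weights @ context)[-DOWN_FRAMES:]              # current-chunk rows only
    return out, context[-CACHE_FRAMES:]

cache = None
for _ in range(3):                                        # three 800 ms chunks
    out, cache = encode_chunk(np.random.randn(RAW_FRAMES, D), cache)
```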
In addition, in the present application, the decoder decodes the encoder output of each chunk of audio. Besides the encoding result, the encoder also outputs a CTC head, and from the CTC output the number of words in the audio segment can be predicted. Because the decoding network contains LSTM structures, decoding must be performed multiple times, one word per pass; when obtaining a new word, the current sequence is needed to predict the next one, and each decoding pass must obtain the state information of the previous word. The decoder model is therefore also a stateful network model.
Also, with reference to FIG. 3, the decoder side of the present application adopts two LSTM layers and one attention layer (other structures are possible, for example using multiple attention layers instead of the LSTM). The difference with a multi-layer attention structure is that the state of each layer is recorded separately and then spliced into one large state. For example, if one attention layer needs to record 10 frames of state with a feature dimension of 512, that layer records 10 x 512 = 5120 values; with 10 such layers, 5120 x 10 = 51200 values are recorded as the hidden-layer state. When used, 5120 values are allocated to each layer, and the results of all layers are gathered and returned after the calculation.
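A minimal sketch of this per-layer state bookkeeping follows, using the 10-frame, 512-dimension, 10-layer figures from the example above; the flat memory layout itself is an assumption for demonstration.

```python
import numpy as np

LAYERS, FRAMES, DIM = 10, 10, 512
PER_LAYER = FRAMES * DIM                     # 10 * 512 = 5120 values per layer

def pack_states(per_layer_states):
    """Splice the per-layer (10, 512) states into one flat hidden state."""
    return np.concatenate([s.ravel() for s in per_layer_states])  # (51200,)

def unpack_states(flat):
    """Hand each layer back its own 5120-value slice, reshaped to (10, 512)."""
    return [flat[i * PER_LAYER:(i + 1) * PER_LAYER].reshape(FRAMES, DIM)
            for i in range(LAYERS)]

states = [np.zeros((FRAMES, DIM)) for _ in range(LAYERS)]
flat = pack_states(states)                   # one large state, 51200 values
assert all((a == b).all() for a, b in zip(unpack_states(flat), states))
```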
The above only takes an audio recognition model based on the Transformer structure as an example. Optionally, the audio recognition model may also be based on models such as LSTM (Long Short-Term Memory) networks, GRU, or DFSMN in scenarios of streaming computation served through a request-service mode; the application does not limit its specific application scenarios.
Step 104, outputting the recognized target audio data according to the calculation result of each chunk through the audio recognition model.
In summary, target audio data is input into the audio recognition model; the target audio data is divided into n chunks through the audio recognition model, wherein n is an integer greater than 1; for each of the n chunks, the history state of the current chunk is acquired through the audio recognition model and the current chunk is calculated according to the history state and the current chunk, wherein the history state is the state calculated and saved before the attention operation when the previous chunk of the current chunk was calculated; and the recognized target audio data is output according to the calculation result of each chunk through the audio recognition model. This solves the problem of low audio recognition efficiency in the prior art: by recording the history state, the current chunk can be calculated from the history state and the current chunk alone, without computing over the full data of the previous chunk, thereby improving audio recognition efficiency.
Meanwhile, the application marks the input state of the audio data through the start and end flags, replacing the real-time state transmission between the client and the server in the existing scheme, which reduces data transmission between the client and the server and reduces inference delay.
The application also provides an audio recognition device, which comprises:
the input module is used for inputting the target audio data into the audio recognition model;
the segmentation module is used for segmenting the target audio data into n chunks through the audio recognition model, wherein n is an integer greater than 1;
the computing module is used for acquiring, for each of the n chunks, the history state of the current chunk through the audio recognition model, and calculating the current chunk according to the history state and the current chunk, wherein the history state is the state calculated and saved before the attention operation when the previous chunk of the current chunk was calculated;
and the output module is used for outputting the recognized target audio data according to the calculation result of each chunk through the audio recognition model.
The application also provides an audio recognition device comprising a memory having stored therein at least one program instruction and a processor for implementing the method as described above by loading and executing the at least one program instruction.
The present application also provides a computer storage medium having stored therein at least one program instruction that is loaded and executed by a processor to implement a method as described above.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above examples present only a few embodiments of the application; their description is detailed but is not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, and all of these fall within the protection scope of the application. Accordingly, the scope of protection of the present application is determined by the appended claims.

Claims (10)

1. A method of audio recognition, the method comprising:
inputting the target audio data into an audio recognition model;
dividing the target audio data into n chunks through the audio recognition model, wherein n is an integer greater than 1;
for each of the n chunks, acquiring the history state of the current chunk through the audio recognition model, and calculating the current chunk according to the history state and the current chunk, wherein the history state is the state calculated and saved before the attention operation when the previous chunk of the current chunk was calculated;
and outputting the recognized target audio data according to the calculation result of each chunk through the audio recognition model.
2. The method according to claim 1, wherein the acquiring, through the audio recognition model, the history state of the current chunk and calculating the current chunk according to the history state and the current chunk comprises:
acquiring the input state of the target audio data;
and acquiring the history state of the current chunk through the audio recognition model, and calculating the current chunk according to the input state, the history state, and the current chunk.
3. The method according to claim 2, wherein the acquiring the input state of the target audio data comprises:
identifying the endpoint state of the target audio data through a speech detection module;
and marking the input state through a state model according to the identified endpoint state, wherein the input state comprises the start time, an intermediate time, and the end time of the target audio data.
4. The method according to claim 3, wherein the acquiring, through the audio recognition model, the history state of the current chunk and calculating the current chunk according to the input state, the history state, and the current chunk comprises:
if the input state is the start time, initializing the history state, and calculating the current chunk according to the current chunk alone through the audio recognition model;
and updating the history state with the calculation result obtained before the attention operation.
5. The method according to claim 3, wherein the acquiring, through the audio recognition model, the history state of the current chunk and calculating the current chunk according to the input state, the history state, and the current chunk comprises:
if the input state is an intermediate time, calculating the current chunk according to the history state and the current chunk through the audio recognition model;
and updating the history state with the calculation result obtained before the attention operation.
6. The method according to claim 3, wherein the acquiring, through the audio recognition model, the history state of the current chunk and calculating the current chunk according to the input state, the history state, and the current chunk comprises:
if the input state is the end time, calculating the current chunk according to the history state and the current chunk through the audio recognition model, without updating the history state.
7. The method according to claim 3, wherein the marking the input state through the state model according to the identified endpoint state comprises:
if the endpoint state is the start state, the input state is: start=1, end=0;
if the endpoint state is an intermediate state, the input state is: start=0, end=0;
if the endpoint state is the end state, the input state is: start=0, end=1.
8. An audio recognition device, the device comprising:
the input module is used for inputting the target audio data into the audio recognition model;
the segmentation module is used for segmenting the target audio data into n chunks through the audio recognition model, wherein n is an integer greater than 1;
the computing module is used for acquiring, for each of the n chunks, the history state of the current chunk through the audio recognition model, and calculating the current chunk according to the history state and the current chunk, wherein the history state is the state calculated and saved before the attention operation when the previous chunk of the current chunk was calculated;
and the output module is used for outputting the recognized target audio data according to the calculation result of each chunk through the audio recognition model.
9. An audio recognition device, characterized in that the device comprises a memory in which at least one program instruction is stored and a processor which, by loading and executing the at least one program instruction, implements the method according to any of claims 1 to 7.
10. A computer storage medium having stored therein at least one program instruction that is loaded and executed by a processor to implement the method of any of claims 1 to 7.
CN202310456605.0A · Priority date 2023-04-25 · Filing date 2023-04-25 · Audio recognition method, device and storage medium · Pending · CN116597817A (en)

Priority Applications (1)

Application Number: CN202310456605.0A (CN) · Priority Date: 2023-04-25 · Filing Date: 2023-04-25 · Title: Audio recognition method, device and storage medium

Applications Claiming Priority (1)

Application Number: CN202310456605.0A (CN) · Priority Date: 2023-04-25 · Filing Date: 2023-04-25 · Title: Audio recognition method, device and storage medium

Publications (1)

Publication Number: CN116597817A · Publication Date: 2023-08-15

Family

ID=87610684

Family Applications (1)

Application Number: CN202310456605.0A · Title: Audio recognition method, device and storage medium · Status: Pending · Publication: CN116597817A (en)

Country Status (1)

CN: CN116597817A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination