CN111402906B - Speech decoding method, device, engine and storage medium - Google Patents


Info

Publication number
CN111402906B
CN111402906B
Authority
CN
China
Prior art keywords
decoding
thread
level
channel
voice
Prior art date
Legal status
Active
Application number
CN202010155132.7A
Other languages
Chinese (zh)
Other versions
CN111402906A (en)
Inventor
赵伟伟
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202010155132.7A
Publication of CN111402906A
Application granted
Publication of CN111402906B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a speech decoding method, device, engine and storage medium. The method is applied to a speech decoding engine: when a plurality of speech decoding requests are received, a plurality of thread-level decoding channels are applied for, the requests being in one-to-one correspondence with the channels; the thread-level decoding channels each invoke a general model to decode the speech stream data in the requests in parallel, obtaining decoding results, and the requests are answered based on those results. Since multiple speech decoding requests are processed in parallel by multiple thread-level decoding channels, and all the channels share one common model, thread-level parallelism of speech decoding is achieved, hardware cost is reduced, and the concurrency and decoding efficiency of speech decoding are improved.

Description

Speech decoding method, device, engine and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech decoding method, device, engine and storage medium.
Background
With the development of computer technology, more and more technologies (big data, distributed computing, blockchain, artificial intelligence, etc.) are being applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech); however, because of the security and real-time requirements of the financial industry, higher demands are also placed on these technologies.
Speech decoding is an important component of speech recognition. Currently, the text corresponding to voice stream data is generally obtained by decoding the voice stream data with a general model. If parallel processing is required, it can currently only be achieved by deploying more copies of the general model at the process level; because the general model is large, this greatly increases hardware cost.
Disclosure of Invention
The invention provides a voice decoding method, device, engine and storage medium, which aim to let a plurality of decoding channels share one common model, realize thread-level parallel processing, reduce hardware cost, and improve the concurrency and decoding efficiency of voice decoding.
To achieve the above object, the present invention provides a speech decoding method, the method comprising:
When a plurality of voice decoding requests are received, a plurality of thread-level decoding channels are applied, and the plurality of voice decoding requests are in one-to-one correspondence with the plurality of thread-level decoding channels;
and respectively calling a general model by using the thread-level decoding channels, performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results.
Preferably, the thread-level decoding channel comprises a channel decoding unit, a data cache area and a callback interface unit;
The step of using the plurality of thread-level decoding channels to respectively call a general model, performing parallel decoding processing on the voice stream data in the plurality of voice decoding requests to obtain decoding results, and responding to the plurality of voice decoding requests based on the decoding results comprises the steps of:
respectively caching voice stream data in the voice decoding requests by utilizing the data buffer areas of the thread-level decoding channels;
respectively calling a general model by utilizing channel decoding units of the thread-level decoding channels, and performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results;
and respectively responding to the voice decoding requests based on the decoding results by utilizing callback interface units of the thread-level decoding channels.
Preferably, the step of respectively buffering the voice stream data in the plurality of voice decoding requests by using the data buffer areas of the plurality of thread-level decoding channels includes:
for any specific thread level decoding channel in the plurality of thread level decoding channels, checking the data state of a data buffer area of the specific thread level decoding channel;
If the data state of the data buffer area of the specific thread level decoding channel is waiting data, directly temporarily storing the voice stream data corresponding to the specific thread level decoding channel in the data buffer area of the specific thread level decoding channel;
And if the data state of the data buffer area of the specific thread-level decoding channel indicates that data already exists, temporarily storing the voice stream data corresponding to the specific thread-level decoding channel at the tail end of the data buffer area of the specific thread-level decoding channel.
Preferably, the step of respectively invoking a common model by using the channel decoding units of the plurality of thread-level decoding channels and performing parallel decoding processing on the voice stream data in the plurality of voice decoding requests to obtain decoding results includes:
Respectively calling a general model by using channel decoding units of the plurality of thread-level decoding channels;
and based on the general model, converting the voice stream data into a feature vector set in each channel decoding unit in parallel, and converting the feature vector set into a decoding result.
Preferably, the thread-level decoding channel further comprises a state control unit,
The method further comprises the steps of:
updating the running state of the thread-level decoding channel, the data state of the data buffer area and the registration state of the callback interface unit in real time through a state control unit of the thread-level decoding channel so as to execute corresponding steps according to the running state, the data state and the registration state; and/or
And receiving an external control signal by a state control unit of the thread-level decoding channel, and adjusting the running state of the thread-level decoding channel according to the external control signal.
Preferably, the thread-level decoding channel comprises a reclamation unit;
The step of respectively calling a general model by using the plurality of thread-level decoding channels, performing parallel decoding processing on the voice stream data in the plurality of voice decoding requests, and responding to the plurality of voice decoding requests based on the decoding results further comprises:
clearing the voice stream data in the data buffer area of the thread-level decoding channel through the reclamation unit of the thread-level decoding channel, so that the data buffer area can be used to store voice stream data again; and/or
clearing the state information recorded by the state control unit of the thread-level decoding channel through the reclamation unit of the thread-level decoding channel, so that the state control unit can record the state of the thread-level decoding channel again.
Preferably, the registration state of the callback interface unit is registered or unregistered;
Before the step of respectively responding to the plurality of voice decoding requests based on the decoding results by using the callback interface units of the plurality of thread-level decoding channels, the method further comprises:
Checking the registration state of a callback interface of the thread-level decoding channel recorded by a state control unit of each thread-level decoding channel;
If the registration state of the callback interface unit of the thread-level decoding channel is registered, executing the steps: respectively responding to the voice decoding requests based on the decoding results by utilizing callback interface units of the thread-level decoding channels;
and if the registration state of the callback interface unit of the thread-level decoding channel is unregistered, clearing the voice stream data in the data buffer area through the reclamation unit of the thread-level decoding channel, and clearing the state information updated by the state control unit.
In addition, in order to achieve the above object, the present invention also provides a speech decoding apparatus including:
The application module is used for applying a plurality of thread-level decoding channels when a plurality of voice decoding requests are received, wherein the plurality of voice decoding requests are in one-to-one correspondence with the plurality of thread-level decoding channels;
and the decoding module is used for respectively calling a general model by utilizing the plurality of thread-level decoding channels, carrying out parallel decoding processing on voice stream data in the plurality of voice decoding requests, obtaining a decoding result, and responding to the plurality of voice decoding requests based on the decoding result.
In addition, in order to achieve the above object, the present invention also provides a speech decoding engine including a processor, a memory, and a speech decoding program stored in the memory, which when executed by the processor, implements the steps of the speech decoding method as described above.
In addition, in order to achieve the above object, the present invention also provides a computer storage medium having stored thereon a speech decoding program which, when executed by a processor, implements the steps of the speech decoding method as described above.
Compared with the prior art, the invention provides a speech decoding method, device, engine and storage medium. The method is applied to a speech decoding engine: when a plurality of speech decoding requests are received, a plurality of thread-level decoding channels are applied for, the requests being in one-to-one correspondence with the channels; the thread-level decoding channels each invoke a general model to decode the speech stream data in the requests in parallel, obtaining decoding results, and the requests are answered based on those results. Since multiple speech decoding requests are processed in parallel by multiple thread-level decoding channels, and all the channels share one common model, thread-level parallelism of speech decoding is achieved, hardware cost is reduced, and the concurrency and decoding efficiency of speech decoding are improved.
Drawings
FIG. 1 is a schematic diagram of the hardware architecture of a speech decoding engine according to various embodiments of the present invention;
FIG. 2 is a schematic diagram of the components of a speech decoding engine of the present invention;
FIG. 3 is a schematic diagram of the components of the speech decoding channel of the present invention;
FIG. 4 is a flow chart of a first embodiment of the speech decoding method of the present invention;
FIG. 5 is a schematic diagram of a speech stream data processing flow according to an embodiment of the speech decoding method of the present invention;
fig. 6 is a schematic functional block diagram of a first embodiment of the speech decoding apparatus according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The voice decoding engine according to the embodiments of the present invention refers to an engine capable of establishing network connections; the voice decoding engine may be a server, a cloud platform, or the like.
Referring to fig. 1, fig. 1 is a schematic diagram of a hardware configuration of a speech decoding engine according to various embodiments of the present invention. In an embodiment of the present invention, the speech decoding engine may include a processor 1001 (e.g. a Central Processing Unit, CPU), a communication bus 1002, an input port 1003, an output port 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components; the input port 1003 is used for data input; the output port 1004 is used for data output. The memory 1005 may be a high-speed RAM memory or a stable (non-volatile) memory such as disk storage, and may optionally be a storage device independent of the processor 1001. Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 is not limiting of the invention and may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
With continued reference to fig. 1, the memory 1005 of fig. 1, which is a readable storage medium, may include an operating system, a network communication module, an application program module, and a speech decoding program. In fig. 1, the network communication module is mainly used for connecting with a server and performing data communication with the server; and the processor 1001 may call a voice decoding program stored in the memory 1005 and execute the voice decoding method provided by the embodiment of the present invention.
Further, referring to fig. 2, fig. 2 is a schematic diagram of the components of the speech decoding engine of the present invention. The decoding engine is an important component of speech recognition technology: it interfaces with the network communication protocol, manages the speech stream data and decoding algorithms, returns decoding results, and responds to various control signals. By application device, decoding engines can be classified into terminal decoding engines (e.g. handheld devices, embedded devices), server decoding engines (e.g. cloud services), etc. The decoder must be designed for its scenario: a terminal decoding engine only needs to decode a single channel while minimizing power consumption, whereas a cloud service needs as many parallel channels as possible to reduce deployment cost.
The speech decoding engine includes a generic model and a plurality of thread-level decoding channels. The generic model is a speech recognition model comprising an acoustic model, a language model, and the like. It is a speech decoding model trained on a huge amount of data; it can also serve as a training base for vertical-domain recognition, being combined with domain data for transfer-learning training to obtain a high-precision model. It will be appreciated that the generic model includes a decoding algorithm, which may be a beam search algorithm, the Viterbi algorithm, or the like. Audio stream data is converted into a set of speech feature vectors by methods such as Mel frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) or linear prediction coefficients (LPC); the acoustic model provides the probability that a speech feature vector converts into a given semantic unit, the language model provides the probabilities of transitions between semantics, and the decoding result is finally obtained. In this embodiment, the training process of the generic model does not differ from that of existing speech decoding models and is not described again here.
The generic model provides support for the plurality of decoding channels to decode speech streams. With continued reference to FIG. 2, the decoding engine includes a plurality of thread-level decoding channels: thread-level decoding channel 1, thread-level decoding channel 2, ..., thread-level decoding channel n. The individual thread-level decoding channels can process speech streams in parallel, and each channel corresponds to one speech decoding request, providing a timely and fast response for every request. Because the plurality of thread-level decoding channels is built on top of a single generic model, the model's footprint is not multiplied; compared with process-level parallelism, this greatly reduces hardware cost. Each request uses one thread-level decoding channel; the channels are isolated from one another but use the same generic model, so one process can comprise n thread-level decoding channels, i.e. n-way parallel decoding capability. Since the decoding channels operate at the thread level, a process can contain arbitrarily many parallel decoding channels; compared with the one-decoding-channel-per-process approach, processing efficiency can be multiplied and the utilization of memory and GPU memory greatly improved.
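As an illustrative sketch of this sharing scheme (Python is chosen for brevity; the patent does not prescribe a language, and `GenericModel` is a hypothetical stand-in for the real acoustic and language models), each request-handling thread decodes through the same model instance instead of a per-process copy:

```python
import threading

class GenericModel:
    """Hypothetical stand-in for the shared acoustic/language model,
    loaded once per process."""
    def decode(self, speech_stream):
        # A real model would run feature extraction and graph search here.
        return "decoded:" + speech_stream

def run_channels(model, requests):
    """Give each speech decoding request its own thread-level decoding
    channel; every channel calls the same shared model."""
    results = {}
    lock = threading.Lock()

    def channel(req_id, stream):
        text = model.decode(stream)      # shared model, no per-channel copy
        with lock:                       # only the result map is synchronized
            results[req_id] = text

    threads = [threading.Thread(target=channel, args=(i, s))
               for i, s in enumerate(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because only one copy of the model exists per process, memory cost grows with the number of threads rather than the number of model copies.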
Further, referring to fig. 3, fig. 3 is a schematic diagram of the composition of the speech decoding channel of the present invention. The thread-level decoding channel comprises a data buffer area, a channel decoding unit, a callback interface unit and a state control unit.
The data buffer area is used for storing voice stream data for the channel decoding unit to read; the buffer capacity of the data buffer area can be specifically set according to actual needs.
And the channel decoding unit extracts and reads the voice stream data from the data cache region, and calls the universal model to decode the voice stream data after obtaining the voice stream data, so as to obtain a decoding result. Typically, the decoding result is an optimal text corresponding to the voice stream data;
the callback interface unit is used for returning the decoding result to the client;
The state control unit is used for updating the states of the data buffer area, the channel decoding unit and the callback interface unit in real time; the state control unit is also used for receiving external control signals and adjusting the running state of the channel according to the external control signals.
Further, the thread-level decoding channel further comprises a reclamation unit, wherein the reclamation unit is used for emptying the data buffer area, clearing state information updated by the state control unit and emptying data format information of the voice stream data recorded by the state control unit. Thus, the thread-level decoding channel can be reused after completing a task.
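The channel components described above can be sketched as a single class; this is a minimal illustration under assumed names (`StubModel`, `ThreadLevelChannel`, the state strings), not the patent's actual implementation:

```python
from collections import deque

class StubModel:
    """Hypothetical stand-in for the shared generic model."""
    def decode(self, chunk):
        return chunk.upper()

class ThreadLevelChannel:
    """One decoding channel: data buffer area, channel decoding unit,
    callback interface unit, state control and reclamation."""
    def __init__(self, model, callback=None):
        self.model = model                  # shared generic model
        self.buffer = deque()               # data buffer area
        self.callback = callback            # callback interface unit
        self.state = {"running": "idle", "data": "waiting",
                      "callback": "registered" if callback else "unregistered"}

    def feed(self, chunk):
        """Store incoming speech stream data at the tail of the buffer."""
        self.buffer.append(chunk)
        self.state["data"] = "has_data"

    def decode_all(self):
        """Channel decoding unit: drain the buffer through the shared model."""
        self.state["running"] = "decoding"
        result = " ".join(self.model.decode(c) for c in self._drain())
        if self.state["callback"] == "registered":
            self.callback(result)           # respond via the callback interface
        self.state["running"] = "idle"
        return result

    def _drain(self):
        while self.buffer:
            yield self.buffer.popleft()

    def reclaim(self):
        """Reclamation unit: clear buffer and recorded state for reuse."""
        self.buffer.clear()
        self.state = {"running": "idle", "data": "waiting",
                      "callback": "unregistered"}
```

After `reclaim()`, the same channel object can serve a new speech decoding request, matching the reuse behavior the description attributes to the reclamation unit.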
The embodiment of the invention provides a voice decoding method.
Referring to fig. 4, fig. 4 is a flowchart illustrating a first embodiment of the speech decoding method according to the present invention.
Step S101: when a plurality of voice decoding requests are received, a plurality of thread-level decoding channels are applied, and the plurality of voice decoding requests are in one-to-one correspondence with the plurality of thread-level decoding channels;
After the voice decoding engine establishes network connection with the client, the voice decoding request sent by the client can be received. The speech decoding engine comprises a plurality of thread-level decoding channels for decoding, so that network connection can be established with a plurality of clients simultaneously, and a corresponding plurality of speech decoding requests are received.
And the voice decoding engine establishes network connection with a plurality of clients, and applies for a corresponding thread-level decoding channel for each client after the connection is successful. The plurality of speech decoding requests are in one-to-one correspondence with the plurality of thread-level decoding channels.
Step S102: and respectively calling a general model by using the thread-level decoding channels, performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results.
The thread-level decoding channel comprises a channel decoding unit, a data buffer area and a callback interface unit, wherein the channel decoding unit, the data buffer area and the callback interface unit cooperatively operate to store voice stream data, decode the voice stream data to obtain a decoding result, and respond the decoding result to a voice decoding request.
Specifically, the step S102: the step of respectively calling a general model by utilizing the plurality of thread-level decoding channels, performing parallel decoding processing on voice stream data in the plurality of voice decoding requests to obtain decoding results, and responding to the plurality of voice decoding requests based on the decoding results comprises the following steps:
Step S102a: respectively caching the voice stream data in the plurality of voice decoding requests by utilizing the data buffer areas of the plurality of thread-level decoding channels;
The data buffer area is used for buffering the voice stream data received with the voice decoding request; the data buffer areas of the plurality of thread-level decoding channels respectively store the voice stream data of different voice decoding requests. The voice stream data corresponding to each voice decoding request is temporarily stored in its own data buffer area, and the different voice streams are isolated from one another, which ensures the integrity of the voice stream data: it is stored in order and is not easily lost.
Specifically, the step S102a: the step of buffering the voice stream data in the plurality of voice decoding requests by using the data buffers of the plurality of thread-level decoding channels respectively includes:
Step S102a1: for any specific thread level decoding channel in the plurality of thread level decoding channels, checking the data state of a data buffer area of the specific thread level decoding channel;
Each voice decoding request corresponds to one thread-level decoding channel; the thread-level decoding channel corresponding to the voice decoding request under consideration is denoted the specific thread-level decoding channel.
And storing the voice stream data into the data buffer area of the specific thread level decoding channel, determining the data state of the data buffer area of the specific thread level decoding channel, and selecting a corresponding voice stream data storage mode according to the data state.
Step S102a2: if the data state of the data buffer area of the specific thread level decoding channel is waiting data, directly temporarily storing the voice stream data corresponding to the specific thread level decoding channel in the data buffer area of the specific thread level decoding channel;
In this embodiment, if the data state of the data buffer area of the specific thread level decoding channel is waiting data, the voice stream data corresponding to the specific thread level decoding channel is directly buffered in the data buffer area of the specific thread level decoding channel.
Step S102a3: and if the data state of the data buffer area of the specific thread-level decoding channel indicates that data already exists, temporarily storing the voice stream data corresponding to the specific thread-level decoding channel at the tail end of the data buffer area of the specific thread-level decoding channel.
Therefore, the channel decoding unit can extract the voice stream data from the data buffer area in order, which prevents voice stream data from being lost and ensures that no undecoded voice stream data is missed during decoding.
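A minimal sketch of steps S102a1 to S102a3, with illustrative names and a Python `deque` standing in for the data buffer area:

```python
from collections import deque

def store_chunk(buffer, state, chunk):
    """Temporarily store one chunk of speech stream data according to
    the buffer's data state (function and state names are illustrative,
    not taken from the patent)."""
    if state["data"] == "waiting":
        # Step S102a2: empty buffer, store the chunk directly.
        buffer.append(chunk)
        state["data"] = "has_data"
    else:
        # Step S102a3: buffer already holds data, append at the tail
        # so the channel decoding unit reads chunks in arrival order.
        buffer.append(chunk)
```

The state check keeps the buffer's recorded data state accurate for the state control unit, while tail insertion preserves the arrival order that the decoder relies on.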
Step S102b: respectively calling a general model by utilizing channel decoding units of the thread-level decoding channels, and performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results;
In this embodiment, the generic model provides support for the multiple thread-level decoding channels, and the multiple thread-level decoding channels decode the respective voice stream data by using the generic model to obtain a decoding result.
Specifically, the step S102b includes:
Step S102b1: respectively calling a general model by using channel decoding units of the plurality of thread-level decoding channels;
and after the channel decoding unit extracts the voice stream data from the data cache region, a general model is called, and the general model decodes the voice stream data.
Step S102b2: and based on the general model, converting the voice stream data into a feature vector set in each channel decoding unit in parallel, and converting the feature vector set into a decoding result.
The generic model is a speech recognition model comprising an acoustic model, a language model, and the like. It is a speech decoding model trained on a huge amount of data, can serve as a training base for vertical-domain recognition, and can be combined with domain data for transfer-learning training to obtain a high-precision model. The generic model includes a decoding algorithm, which may be a beam search algorithm, the Viterbi algorithm, or the like. In this embodiment, after the generic model converts the audio stream data into a set of speech feature vectors by methods such as Mel frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) or linear prediction coefficients (LPC), the acoustic model provides the probability that a speech feature vector converts into a given semantic unit; the language model provides the probabilities of transitions between semantics, and the decoding result is finally obtained. The decoding result is text data.
In this embodiment, decoding may be performed based on the Viterbi algorithm. The Viterbi algorithm is a general decoding algorithm: a dynamic-programming method for finding the shortest path through a sequence. It finds the hidden state sequence (the Viterbi path) most likely to have produced the observed sequence of events, and is used especially in the context of Markov information sources and hidden Markov models. In this embodiment, the most probable hidden state sequence for the voice stream data is obtained with the Viterbi algorithm and recorded as the decoding result; typically, the decoding result is the text corresponding to the voice stream data. Alternatively, the voice stream data may be decoded with a beam search algorithm. Beam search is a heuristic graph search algorithm, generally used when the solution space of the graph is large: to reduce the space and time taken by the search, at each step of depth expansion the lower-quality nodes are pruned and only the higher-quality nodes are kept. This reduces space consumption and improves time efficiency.
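For concreteness, a textbook Viterbi implementation is sketched below; the toy states and observations used to exercise it are illustrative and unrelated to the patent's actual acoustic or language models:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Dynamic-programming search for the most likely hidden-state path."""
    # V[t][s] = (best probability of any path ending in state s at time t,
    #            predecessor state on that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = (prob, prev)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, V[t][path[0]][1])
    return path
```

In speech decoding the hidden states would be semantic units and the observations speech feature vectors; here any small hidden Markov model with known start, transition and emission probabilities suffices to exercise the routine.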
Step S102c: and respectively responding to the voice decoding requests based on the decoding results by utilizing callback interface units of the thread-level decoding channels.
After a connection is successfully established with the client corresponding to the voice decoding request and the corresponding decoding channel is obtained, the callback interface unit is registered so that the callback interface unit can return the decoding result to the corresponding client.
After the corresponding channel decoding unit and callback interface unit are obtained, the corresponding data buffer area, channel decoding unit and callback interface unit are initialized, ready to receive the voice stream corresponding to the voice decoding request; the channel decoding unit then decodes the voice stream, and the callback interface unit responds to the voice decoding request with the decoding result.
In this embodiment, network connections may be established with multiple clients simultaneously or sequentially. After establishing a network connection, the voice stream data uploaded by the client is received in real time over a communication protocol determined by the client; the communication protocol may be TCP (Transmission Control Protocol), HTTP (HyperText Transfer Protocol), WebSocket (a full-duplex communication protocol over TCP), MRCP (Media Resource Control Protocol), or the like. Clients include web pages, microphones, mobile terminals, and the like.
Each client corresponds to one decoding channel; the decoding channels are isolated from one another, each carries out its own voice decoding flow, and the decoding flows of different channels do not interfere with one another. For example, three voice decoding requests from three clients, namely client A, client B and client C, apply for three decoding channels: decoding channel A, decoding channel B and decoding channel C. It can be understood that if the number of voice decoding requests, that is, the number of corresponding clients, exceeds the preset maximum number of decoding channels, the excess voice decoding requests are added to a queuing sequence according to a queuing rule and marked as queued voice decoding requests; after one or more decoding channels complete their current voice decoding tasks, a corresponding number of queued voice decoding requests are admitted. After the decoding channel is determined, the channel decoding unit in the decoding channel is initialized and a callback interface unit is registered so that the corresponding decoding result can be returned through the callback interface unit.
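The channel-allocation and queuing rule above can be sketched as follows. This is a minimal Python illustration; the class name, channel names and FIFO queuing policy are assumptions, not taken from the patent.

```python
import queue

class ChannelPool:
    """Toy pool of decoding channels: each request acquires its own channel;
    requests beyond the preset maximum wait in a FIFO queue until a channel
    finishes its current decoding task and is released."""

    def __init__(self, max_channels):
        self._free = queue.Queue()
        for i in range(max_channels):
            self._free.put(f"channel-{i}")

    def acquire(self):
        # Blocks: excess requests queue here, mirroring the queuing sequence.
        return self._free.get()

    def release(self, channel):
        # A finished channel becomes available to the next queued request.
        self._free.put(channel)

pool = ChannelPool(2)
a = pool.acquire()
b = pool.acquire()
pool.release(a)          # channel done -> next queued request would get it
print(pool.acquire())    # channel-0
```

In the engine each "channel" would be a full thread-level decoding channel object rather than a string, but the admission logic is the same.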
Further, the thread-level decoding channel further comprises a state control unit, wherein the state control unit is used for updating the states of the data buffer area, the channel decoding unit and the callback interface unit in real time; the state control unit is also used for receiving external control signals and adjusting the running state of the channel according to the external control signals.
The method further comprises the steps of:
step S200: recording the running state of the thread-level decoding channel, the data state of the data buffer area and the registration state of the callback interface unit in real time through a state control unit of the thread-level decoding channel, so as to execute corresponding steps according to the running state, the data state and the registration state;
It will be appreciated that as the speech decoding engine processes the voice stream data, the running state of the decoding channels in the speech decoding engine, the data state of the data buffers, and the registration state of the callback interface units all change. In order to better monitor the speech decoding flow, this embodiment updates the running state, the data state, and the registration state in real time through the state control unit, so that the corresponding steps can be executed according to the running state, the data state, and the registration state. The running state includes running and stopped; the data state includes 'waiting data', 'data exists' and 'ending transmitting data'; the registration state includes registered and unregistered.
Step S300: and receiving an external control signal by a state control unit of the thread-level decoding channel, and adjusting the running state of the thread-level decoding channel according to the external control signal.
Further, the state control unit of the thread-level decoding channel may also receive an external control signal sent by the client that sent the voice decoding request. The external control signal includes a normal data stream signal and a network connection error signal, where the normal data stream signal includes start, pause, end, and the like. If the external control signal is start, the running state of the thread-level decoding channel is updated to running; if the external control signal is end, the running state of the thread-level decoding channel is updated to stopped.
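A minimal sketch of how a state control unit might track the three states and react to the start/end signals described above. The class, attribute and signal names are illustrative assumptions; the patent does not prescribe them.

```python
from enum import Enum

class RunState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"

class StateControl:
    """Tracks the three states of a thread-level decoding channel."""
    def __init__(self):
        self.run_state = RunState.STOPPED
        self.data_state = "waiting data"      # / "data exists" / "ending transmitting data"
        self.reg_state = "unregistered"       # / "registered"

    def on_signal(self, signal):
        # Normal data-stream signals per the text above: start / pause / end.
        if signal == "start":
            self.run_state = RunState.RUNNING
        elif signal == "end":
            self.run_state = RunState.STOPPED

ctrl = StateControl()
ctrl.on_signal("start")
print(ctrl.run_state)   # RunState.RUNNING
```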
Further, the thread-level decode channel includes a reclamation unit; the reclaiming unit is used for clearing the voice stream data in the data buffer area of the thread-level decoding channel so as to store the voice stream data again by utilizing the data buffer area. The reclaiming unit is also used for clearing the state information recorded by the state control unit of the thread-level decoding channel so that the state control unit can update the state of the thread-level decoding channel again.
Further, after the steps of respectively calling the general model by using the plurality of thread-level decoding channels, performing parallel decoding processing on the voice stream data in the plurality of voice decoding requests, and responding to the plurality of voice decoding requests based on the decoding results, the method further comprises:
Step 2011a, emptying the voice stream data in the data buffer area of the thread level decoding channel through the recovery unit of the thread level decoding channel so as to store the voice stream data again by utilizing the data buffer area;
In the process of decoding the voice stream data, if the network connection is interrupted, an external stop instruction is received, or the like, part or all of the voice stream data stored in the data buffer may not yet have been extracted and may thus still remain in the data buffer. At this time, the recovery unit needs to empty the voice stream data in the data buffer of the thread-level decoding channel, so that when the thread-level decoding channel needs to execute a voice decoding flow again, the data buffer can be used to store voice stream data anew.
And step 2011b, clearing the state information recorded by the state control unit of the thread-level decoding channel through the recovery unit of the thread-level decoding channel so that the state control unit can record the state of the thread-level decoding channel again.
During the voice decoding flow of the thread-level decoding channel, the state control unit records the state information of the thread-level decoding channel, the data buffer area and the callback interface unit. It will be appreciated that when a voice decoding flow is completed, the state information recorded by the state control unit of the thread-level decoding channel needs to be cleared by the recovery unit, so that the state control unit can record the state of the thread-level decoding channel anew.
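Sketched in Python (a hypothetical structure; the patent does not give field names), the reclaiming step simply empties the data buffer and the recorded state so the channel can be reused:

```python
class DecodingChannel:
    """Toy stand-in for a thread-level decoding channel."""
    def __init__(self):
        self.buffer = bytearray()   # data buffer area
        self.state_info = {}        # state recorded by the state control unit

def reclaim(channel):
    # Empty un-extracted voice stream data so the buffer can store new data,
    # and clear the recorded state so it can be recorded afresh.
    channel.buffer.clear()
    channel.state_info.clear()

ch = DecodingChannel()
ch.buffer.extend(b"\x00\x01")       # leftover, un-decoded audio
ch.state_info["run"] = "running"
reclaim(ch)
print(len(ch.buffer), ch.state_info)  # 0 {}
```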
Further, when the thread-level decoding channel receives the voice stream data, the state control unit records and updates the data format information of the voice stream data, where the data format information includes the data stream format, the data stream sampling rate, the data stream encoding, and the like. When the corresponding voice decoding flow ends, the recovery unit also empties the data format information of the voice stream data in the state control unit. In this way, storage space of the speech decoding engine can be saved.
Further, the registration state of the callback interface unit is registered or unregistered; if the client corresponding to the voice decoding request registers a callback interface, the state control unit updates the state of the corresponding callback interface to registered; otherwise, the registration state is unregistered.
Before step S102c, that is, before the step of respectively responding to the plurality of voice decoding requests based on the decoding results by using the callback interface units of the plurality of thread-level decoding channels, the method further includes:
step S102c0: checking the registration state of a callback interface of the thread-level decoding channel recorded by a state control unit of each thread-level decoding channel;
If the registration state of the callback interface unit of the thread-level decoding channel is registered, executing the steps: respectively responding to the voice decoding requests based on the decoding results by utilizing callback interface units of the thread-level decoding channels;
and if the registration state of the callback interface unit of the thread-level decoding channel is unregistered, clearing the voice stream data of the data cache region through the recovery unit of the thread-level decoding channel, and clearing the state information updated by the state control unit.
It may be appreciated that, if the registration state is registered, the decoding result may be directly returned to the multiple clients corresponding to the multiple voice decoding requests through the multiple registered callback interface units. If the registration state is unregistered, it indicates that there is no valid callback interface unit, and the recovery unit of the thread-level decoding channel needs to empty the voice stream data in the data buffer area and clear the state information updated by the state control unit.
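The registration check above amounts to a guard before invoking the callback. A sketch follows; the attribute names (`callback`, `buffer`, `state_info`) are assumptions made for illustration:

```python
from types import SimpleNamespace

def respond(channel, result):
    """Return the result via the registered callback, or reclaim the
    channel when no valid callback interface unit is registered."""
    if channel.callback is not None:        # registration state: registered
        channel.callback(result)
        return True
    channel.buffer.clear()                  # unregistered: empty the buffer
    channel.state_info.clear()              # ...and the recorded state
    return False

replies = []
ch = SimpleNamespace(callback=replies.append, buffer=[], state_info={})
respond(ch, "hello world")
print(replies)  # ['hello world']
```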
Referring to fig. 5, fig. 5 is a schematic diagram of a voice stream data processing flow according to an embodiment of the present invention, which takes a single thread-level decoding channel as an example to schematically illustrate the overall flow of the voice decoding method of the present invention.
As shown in fig. 5, first, a voice decoding request is received, a thread-level decoding channel is applied for, the decoding unit is initialized, and a callback interface unit is registered; through the state control unit, the running state of the thread-level decoding channel is updated to running, the state of the callback interface unit is updated to registered, and the state of the data buffer area is updated to 'waiting data'. In addition, the data format information of the voice stream data may be updated by the state control unit, where the data format information includes the data stream format, the data stream sampling rate, the data stream encoding, and the like.
Then, the voice stream data is sent to the thread-level decoding channel through a communication protocol, and the thread-level decoding channel temporarily stores the voice stream data in the data buffer area; if the data buffer area still holds un-decoded data, the new data is appended at the end of the buffer area, and the state of the data buffer area is updated to 'data exists' through the state control unit.
Further, extracting voice stream data stored in the data buffer area through the channel decoding unit, and emptying the data buffer area; updating the data state to "waiting data" by the state control unit; and decoding the voice stream data to generate a decoding result.
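The buffer behaviour in the last two paragraphs (new audio appended at the end while un-decoded data remains, then everything drained at once by the channel decoding unit) can be sketched as follows; the class and method names are invented for illustration:

```python
class VoiceBuffer:
    """Minimal sketch of the data buffer area (state labels follow the text)."""
    def __init__(self):
        self.data = bytearray()
        self.state = "waiting data"

    def put(self, chunk):
        # New audio is appended at the end if un-decoded data remains.
        self.data += chunk
        self.state = "data exists"

    def drain(self):
        # The channel decoding unit extracts everything and empties the buffer.
        chunk = bytes(self.data)
        self.data.clear()
        self.state = "waiting data"
        return chunk

buf = VoiceBuffer()
buf.put(b"\x01\x02")
buf.put(b"\x03")                 # appended after the un-decoded bytes
print(buf.drain(), buf.state)    # b'\x01\x02\x03' waiting data
```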
After obtaining the decoding result, checking the registration state of the callback interface unit: and if the registration state is unregistered, clearing the state information recorded by the channel state control unit of the thread-level decoding channel through a recovery unit.
Otherwise, if the registration state is registered, returning a decoding result through a callback interface;
Further, it is judged whether an end-of-data-stream signal exists; the end-of-data-stream signal may be an external control signal. If no end-of-data-stream signal exists, the following steps are executed: the voice stream data is sent to the thread-level decoding channel through a communication protocol, and the thread-level decoding channel temporarily stores the voice stream data in the data buffer area; if the data buffer area still holds un-decoded data, the new data is appended at the end of the buffer area, and the state of the data buffer area is updated to 'data exists' through the state control unit.
If the end-of-data-stream signal exists, the data state is updated to 'ending transmitting data' through the state control unit, and the decoding unit reads all the data in the data buffer area to obtain a decoding result;
Then the registration state of the callback interface is checked: if the registration state is registered, the decoding result is returned through the callback interface unit, and the running state of the thread-level decoding channel is updated to stopped through the state control unit; if the registration state is unregistered, the state information recorded by the state control unit of the thread-level decoding channel is cleared through the recovery unit.
According to the scheme, when a plurality of voice decoding requests are received, a plurality of thread-level decoding channels are applied for, and the plurality of voice decoding requests are in one-to-one correspondence with the plurality of thread-level decoding channels; the thread-level decoding channels each call a general model, perform parallel decoding processing on the voice stream data in the plurality of voice decoding requests to obtain decoding results, and respond to the plurality of voice decoding requests based on the decoding results. In this way, the plurality of voice decoding requests are processed in parallel through the plurality of thread-level decoding channels, and the plurality of thread-level decoding channels share a common model, so that thread-level parallel processing of voice decoding is realized, the hardware cost is reduced, and the concurrency capacity and decoding efficiency of voice decoding are improved.
In addition, the embodiment also provides a voice decoding device. Referring to fig. 6, fig. 6 is a schematic functional block diagram of a speech decoding apparatus according to a first embodiment of the present invention.
In this embodiment, the speech decoding apparatus is a virtual apparatus, and is stored in the memory 1005 of the speech decoding engine shown in fig. 1, so as to implement all the functions of the speech decoding program: when receiving a plurality of voice decoding requests, applying for a plurality of thread-level decoding channels, wherein the plurality of voice decoding requests are in one-to-one correspondence with the plurality of thread-level decoding channels; and the method is used for respectively calling a general model by utilizing the thread-level decoding channels, carrying out parallel decoding processing on the voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results.
Specifically, the speech decoding apparatus includes:
The application module 10 is configured to apply for a plurality of thread-level decoding channels when receiving a plurality of voice decoding requests, where the plurality of voice decoding requests are in one-to-one correspondence with the plurality of thread-level decoding channels;
And the decoding module 20 is configured to invoke a common model by using the multiple thread-level decoding channels, perform parallel decoding processing on the voice stream data in the multiple voice decoding requests, obtain a decoding result, and respond to the multiple voice decoding requests based on the decoding result.
Further, the decoding module includes:
A buffer unit, configured to buffer voice stream data in the plurality of voice decoding requests by using data buffer units of the plurality of thread-level decoding channels, respectively;
The calling unit is used for calling the universal model respectively by utilizing the channel decoding units of the plurality of thread-level decoding channels and carrying out parallel decoding processing on the voice stream data in the plurality of voice decoding requests to obtain decoding results;
And the response unit is used for respectively responding to the voice decoding requests based on the decoding results by utilizing callback interface units of the thread-level decoding channels.
Further, the buffer unit includes:
A checking subunit, configured to check, for any specific thread level decoding channel of the plurality of thread level decoding channels, a data state of a data buffer area of the specific thread level decoding channel;
The first temporary storage subunit is used for directly temporarily storing the voice stream data corresponding to the specific thread level decoding channel in the data buffer area of the specific thread level decoding channel if the data state of the data buffer area of the specific thread level decoding channel is waiting data;
And the second temporary storage subunit is used for temporarily storing the voice stream data corresponding to the specific thread level decoding channel at the end of the data buffer area of the specific thread level decoding channel if the data state of the data buffer area of the specific thread level decoding channel is 'data exists'.
Further, the calling unit includes:
A calling subunit, configured to respectively call the generic models by using the channel decoding units of the multiple thread-level decoding channels;
and the decoding subunit is used for converting the voice stream data into a feature vector set in each channel decoding unit in parallel based on the general model, and converting the feature vector set into a decoding result.
Further, the speech decoding apparatus further includes:
The updating module is used for updating the running state of the thread-level decoding channel, the data state of the data cache area and the registration state of the callback interface unit in real time through the state control unit of the thread-level decoding channel so as to execute corresponding steps according to the running state, the data state and the registration state; and/or
The control module is used for receiving an external control signal through the state control unit of the thread-level decoding channel and adjusting the running state of the thread-level decoding channel according to the external control signal.
Further, the decoding module further includes:
The first emptying unit is used for emptying the voice stream data in the data cache area of the thread-level decoding channel through the recovery unit of the thread-level decoding channel so as to store the voice stream data again by utilizing the data cache area; and/or
And the second clearing unit is used for clearing the state information recorded by the state control unit of the thread-level decoding channel through the recovery unit of the thread-level decoding channel so that the state control unit can record the state of the thread-level decoding channel again.
Further, the response unit includes:
a checking subunit, configured to check the registration states of callback interfaces of the thread-level decoding channels recorded by the state control unit of each thread-level decoding channel;
And the execution subunit is configured to execute the steps if the registration state of the callback interface unit of the thread-level decoding channel is registered: respectively responding to the voice decoding requests based on the decoding results by utilizing callback interface units of the thread-level decoding channels;
and the clearing subunit is used for clearing the voice stream data of the data cache area through the recovery unit of the thread-level decoding channel and clearing the state information updated by the state control unit if the registration state of the callback interface unit of the thread-level decoding channel is unregistered.
In addition, the embodiment of the present invention further provides a computer storage medium, where a speech decoding program is stored, where the speech decoding program is executed by a processor to implement the steps of the speech decoding method described above, which is not described herein again.
Compared with the prior art, the present invention provides a voice decoding method, device, engine and storage medium, wherein the method is applied to a voice decoding engine: when a plurality of voice decoding requests are received, a plurality of thread-level decoding channels are applied for, and the plurality of voice decoding requests are in one-to-one correspondence with the plurality of thread-level decoding channels; the thread-level decoding channels each call a general model, perform parallel decoding processing on the voice stream data in the plurality of voice decoding requests to obtain decoding results, and respond to the plurality of voice decoding requests based on the decoding results. In this way, the plurality of voice decoding requests are processed in parallel through the plurality of thread-level decoding channels, and the plurality of thread-level decoding channels share a common model, so that thread-level parallel processing of voice decoding is realized, the hardware cost is reduced, and the concurrency capacity and decoding efficiency of voice decoding are improved.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment methods may be implemented by means of software plus the necessary general hardware platform, and of course may also be implemented by means of hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention may be embodied, essentially or in the part contributing to the prior art, in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising several instructions for causing a terminal device to perform the methods according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or modifications in the structures or processes described in the specification and drawings, or the direct or indirect application of the present invention to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A method of speech decoding, the method being applied to a speech decoding engine, comprising:
When a plurality of voice decoding requests are received, a plurality of thread-level decoding channels are applied, wherein the voice decoding requests are in one-to-one correspondence with the thread-level decoding channels, and the thread-level decoding channels comprise a channel decoding unit, a data cache area and a callback interface unit;
respectively caching voice stream data in the voice decoding requests by utilizing data cache areas of the thread-level decoding channels;
the channel decoding units of the thread-level decoding channels are utilized to respectively call the same general model, and the voice stream data in the voice decoding requests are subjected to parallel decoding processing to obtain decoding results, wherein the general model is used in parallel by the plurality of thread-level decoding channels;
and respectively responding to the voice decoding requests based on the decoding results by utilizing callback interface units of the thread-level decoding channels.
2. The method of claim 1, wherein buffering the voice stream data in the plurality of voice decoding requests with the data buffers of the plurality of thread-level decoding channels, respectively, comprises:
for any specific thread level decoding channel in the plurality of thread level decoding channels, checking the data state of a data buffer area of the specific thread level decoding channel;
If the data state of the data buffer area of the specific thread level decoding channel is waiting data, directly temporarily storing the voice stream data corresponding to the specific thread level decoding channel in the data buffer area of the specific thread level decoding channel;
And if the data state of the data buffer area of the specific thread level decoding channel is 'data exists', temporarily storing the voice stream data corresponding to the specific thread level decoding channel at the end of the data buffer area of the specific thread level decoding channel.
3. The method according to claim 1, wherein the step of using the channel decoding units of the plurality of thread-level decoding channels to respectively call a common model to perform parallel decoding processing on the voice stream data in the plurality of voice decoding requests, and obtaining a decoding result includes:
Respectively calling a general model by using channel decoding units of the plurality of thread-level decoding channels;
and based on the general model, converting the voice stream data into a feature vector set in each channel decoding unit in parallel, and converting the feature vector set into a decoding result.
4. The method of claim 1, wherein the thread-level decode channel further comprises a state control unit,
The method further comprises the steps of:
updating the running state of the thread-level decoding channel, the data state of the data buffer area and the registration state of the callback interface unit in real time through a state control unit of the thread-level decoding channel so as to execute corresponding steps according to the running state, the data state and the registration state; and/or
And receiving an external control signal by a state control unit of the thread-level decoding channel, and adjusting the running state of the thread-level decoding channel according to the external control signal.
5. The method of claim 1, wherein the thread-level decode channel comprises a reclamation unit;
The step of using the decoding channels of the multiple thread levels to respectively call a general model, performing parallel decoding processing on the voice stream data in the multiple voice decoding requests, and responding to the multiple voice decoding requests based on decoding results further comprises:
clearing the voice stream data in the data buffer area of the thread level decoding channel through the recovery unit of the thread level decoding channel so as to store the voice stream data again by utilizing the data buffer area; and/or
And clearing the state information updated by the state control unit of the thread-level decoding channel through the recovery unit of the thread-level decoding channel so as to enable the state control unit to update the state of the thread-level decoding channel again.
6. The method of claim 1, wherein the registration status of the callback interface element is registered or unregistered;
Before the step of respectively responding to the plurality of voice decoding requests based on the decoding results by using the callback interface units of the plurality of thread-level decoding channels, the method further comprises:
Checking the registration state of a callback interface of the thread-level decoding channel recorded by a state control unit of each thread-level decoding channel;
If the registration state of the callback interface unit of the thread-level decoding channel is registered, executing the steps: respectively responding to the voice decoding requests based on the decoding results by utilizing callback interface units of the thread-level decoding channels;
and if the registration state of the callback interface unit of the thread-level decoding channel is unregistered, clearing the voice stream data of the data cache region through the recovery unit of the thread-level decoding channel, and clearing the state information updated by the state control unit.
7. A speech decoding apparatus, characterized in that the speech decoding apparatus comprises:
The application module is used for applying a plurality of thread-level decoding channels when a plurality of voice decoding requests are received, wherein the voice decoding requests are in one-to-one correspondence with the thread-level decoding channels, and the thread-level decoding channels comprise a channel decoding unit, a data cache area and a callback interface unit;
The decoding module is used for respectively caching the voice stream data in the voice decoding requests by utilizing the data cache areas of the thread-level decoding channels; the channel decoding units of the thread-level decoding channels are utilized to respectively call the same general model, and the voice stream data in the voice decoding requests are subjected to parallel decoding processing to obtain decoding results, wherein the general model is used in parallel by the plurality of thread-level decoding channels; and the callback interface units of the thread-level decoding channels are utilized to respectively respond to the voice decoding requests based on the decoding results.
8. A speech decoding engine, characterized in that it comprises a processor, a memory and a speech decoding program stored in said memory, which, when being executed by said processor, implements the steps of the speech decoding method according to any of claims 1-7.
9. A computer storage medium, characterized in that the computer storage medium has stored thereon a speech decoding program which, when executed by a processor, implements the steps of the speech decoding method according to any of claims 1-7.
CN202010155132.7A 2020-03-06 2020-03-06 Speech decoding method, device, engine and storage medium Active CN111402906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010155132.7A CN111402906B (en) 2020-03-06 2020-03-06 Speech decoding method, device, engine and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010155132.7A CN111402906B (en) 2020-03-06 2020-03-06 Speech decoding method, device, engine and storage medium

Publications (2)

Publication Number Publication Date
CN111402906A CN111402906A (en) 2020-07-10
CN111402906B true CN111402906B (en) 2024-05-14

Family

ID=71430610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010155132.7A Active CN111402906B (en) 2020-03-06 2020-03-06 Speech decoding method, device, engine and storage medium

Country Status (1)

Country Link
CN (1) CN111402906B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112822183B (en) * 2020-12-30 2023-08-22 北京捷通华声科技股份有限公司 Speech processing method, device, computer readable storage medium and processor

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000026902A1 (en) * 1998-11-04 2000-05-11 Syvox Corporation Apparatus and method for improved memory and resource management in a single-user or multi-user speech recognition system
CN1486044A (en) * 2002-09-28 2004-03-31 ��Ϊ�������޹�˾ Method for scheduling multi-channel coding-decoding task in VOIP network
CN101281749A (en) * 2008-05-22 2008-10-08 上海交通大学 Apparatus for encoding and decoding hierarchical voice and musical sound together
CN102177542A (en) * 2008-10-10 2011-09-07 艾利森电话股份有限公司 Energy conservative multi-channel audio coding
CN104683860A (en) * 2015-02-02 2015-06-03 北京神州天脉网络计算机有限公司 Multipath audio and video concurrent decoding acceleration card and decoding acceleration method for same
CN107710323A (en) * 2016-01-22 2018-02-16 弗劳恩霍夫应用研究促进协会 Resampled using spectrum domain to encode or decode the device and method of audio multichannel signal
CN110570838A (en) * 2019-08-02 2019-12-13 北京葡萄智学科技有限公司 Voice stream processing method and device
CN110689876A (en) * 2019-10-14 2020-01-14 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8515052B2 (en) * 2007-12-17 2013-08-20 Wai Wu Parallel signal processing system and method
CN101739242B (en) * 2009-11-27 2013-07-31 深圳中微电科技有限公司 Stream data processing method and stream processor
US11443227B2 (en) * 2018-03-30 2022-09-13 International Business Machines Corporation System and method for cognitive multilingual speech training and recognition

Also Published As

Publication number Publication date
CN111402906A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
EP2932501B1 (en) Speech model retrieval in distributed speech recognition systems
US20200312329A1 (en) Performing speech recognition using a local language context including a set of words with descriptions in terms of components smaller than the words
US8868425B2 (en) System and method for providing network coordinated conversational services
CA2345660C (en) System and method for providing network coordinated conversational services
US8099284B2 (en) System and method for speech recognition system
KR101649771B1 (en) Markup language-based selection and utilization of recognizers for utterance processing
US11443169B2 (en) Adaptation of model for recognition processing
CN110310657B (en) Audio data processing method and device
US11562735B1 (en) Multi-modal spoken language understanding systems
CN103514882A (en) Voice identification method and system
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
CN110956955A (en) Voice interaction method and device
JP2023162265A (en) Text echo cancellation
CN111402906B (en) Speech decoding method, device, engine and storage medium
CN110942764B (en) Stream type voice recognition method
US20170140751A1 (en) Method and device of speech recognition
CN116075888A (en) System and method for reducing latency in cloud services
Lojka et al. Multi-thread parallel speech recognition for mobile applications
CN108986792B (en) Training and scheduling method and system for voice recognition model of voice conversation platform
CN111414748A (en) Traffic data processing method and device
US10923122B1 (en) Pausing automatic speech recognition
CN112466283B (en) Cooperative software voice recognition system
JP4224305B2 (en) Dialog information processing system
KR102365611B1 (en) Meeting management system using automatic speech recognition(ASR)
KR102364935B1 (en) A method and apparatus for data transmission for improving 5G-based speech recognition response speed

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant