CN111079446A - Voice data reconstruction method and device and electronic equipment - Google Patents

Voice data reconstruction method and device and electronic equipment

Info

Publication number
CN111079446A
Authority
CN
China
Prior art keywords
data
missing
speaker
missing data
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911398821.4A
Other languages
Chinese (zh)
Inventor
黄启辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Infobird Software Co Ltd
Original Assignee
Beijing Infobird Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Infobird Software Co Ltd filed Critical Beijing Infobird Software Co Ltd
Priority to CN201911398821.4A
Publication of CN111079446A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/005 - Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice data reconstruction method and apparatus and an electronic device. Data reconstruction is performed from both the semantic and the acoustic aspects, restoring the speaker's vocal state as faithfully as possible while satisfying the semantic logical relationship as far as possible, so that the finally reconstructed data restores the missing information more accurately and truly and plays back more smoothly and naturally.

Description

Voice data reconstruction method and device and electronic equipment
Technical Field
The invention relates to a voice data reconstruction method, a voice data reconstruction device and electronic equipment, and belongs to the technical field of voice processing.
Background
Real-time voice communication is widely used in fields such as instant messaging and call centers. Network problems such as congestion, packet loss, and jitter are common and unavoidable; they degrade voice call quality and can even interrupt communication.
In IP-based voice transmission, when a lost packet cannot be recovered by retransmission, the conventional approach is to insert white noise directly in place of the missing data, or to splice together the data preceding and following the gap. Such methods cannot restore the real sound data and suffer from problems such as stuttering and information loss.
Disclosure of Invention
A primary object of the present invention is to provide a voice data reconstruction method.
Another object of the present invention is to provide a voice data reconstruction apparatus.
A further object of the present invention is to provide an electronic device for implementing voice data reconstruction.
To achieve the above objects, the present invention adopts the following technical solutions:
according to a first aspect of the embodiments of the present invention, there is provided a speech data reconstruction method, including the steps of:
determining semantic information of the missing data according to the context of the missing data, wherein the missing data is a missing part in voice data of a speaker;
and performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
Preferably, the determining semantic information of the missing data according to the context of the missing data includes the following steps:
acquiring preceding data and following data of the missing data;
and performing voice recognition calculation based on the preceding data and the following data, and determining the phoneme with the highest probability corresponding to the missing data.
Preferably, the method further comprises: making a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data; and, when the relationship between the two satisfies a set condition, performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker.
Preferably, the method further comprises: when the relationship between the two does not satisfy the set condition, replacing the missing data with white noise, or extending and then splicing the preceding data and the following data.
Preferably, the set condition satisfied by the relationship between the two is:
m×w+n×q>k;
wherein w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a set threshold.
Preferably, the method further comprises: collecting phoneme information of the speaker in real time based on the voice data of the speaker and training an acoustic model of the speaker in real time.
According to a second aspect of the embodiments of the present invention, there is provided a voice data reconstruction apparatus including:
the semantic analysis module is used for determining semantic information of the missing data according to the context of the missing data, wherein the missing data is a missing part in the voice data of the speaker;
and the first reconstruction module is used for performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
Preferably, the semantic analysis module comprises: a data acquisition submodule, used for acquiring the preceding data and the following data of the missing data; and a speech recognition submodule, used for performing speech recognition calculation based on the preceding data and the following data and determining the phoneme with the highest probability corresponding to the missing data.
Preferably, the device further comprises: a judging module, used for making a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data, triggering the first reconstruction module when the relationship between the two satisfies a set condition, and triggering the second reconstruction module when it does not; and a second reconstruction module, used for extending and then splicing the preceding data and the following data, or replacing the missing data with white noise.
Preferably, the set condition satisfied by the relationship between the two is:
m×w+n×q>k;
wherein w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a threshold.
Preferably, the device further comprises: and the model training module is used for collecting phoneme information of the speaker in real time based on the voice data of the speaker and training the acoustic model of the speaker in real time.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device for performing voice data reconstruction, the electronic device including:
a memory for storing computer instructions;
a processor for retrieving and executing the computer instructions from the memory, thereby implementing the voice data reconstruction method provided in the foregoing first aspect or any preferred implementation thereof.
Compared with the prior art, the method obtains semantic information through context analysis, so lost packet data can be reconstructed at the semantic level while satisfying the semantic logical relationship as far as possible; text-to-speech conversion is then performed on the semantic information using the speaker's acoustic model to restore the audio data. By combining semantic reconstruction with voice restoration, the final reconstructed data restores the missing information accurately, carries more characteristic information, and plays back more smoothly and naturally.
Drawings
Fig. 1 is a schematic flowchart of a voice data reconstruction method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a semantic analysis method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another voice data reconstruction method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a voice data reconstruction apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic architecture diagram of a voice data reconstruction apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical contents of the invention are described in detail below with reference to the accompanying drawings and specific embodiments.
In IP-based voice transmission, the prior art cannot satisfactorily restore voice data when packets are lost, leading to problems such as stuttering and information loss.
To solve the above problems, the present invention provides a voice data reconstruction method and apparatus and an electronic device, which fully consider the importance of both semantic logic and sound characteristics in voice data and voice playback.
First, the nouns/terms involved in the various embodiments of the present invention are briefly explained:
ASR: Automatic Speech Recognition.
TTS: Text To Speech, text-to-speech conversion.
ARM: Audio Reconstruction Model.
FEC: Forward Error Correction.
NACK: Negative Acknowledgement, a packet-loss retransmission request.
HMM: Hidden Markov Model.
GMM: Gaussian Mixture Model.
AM: Acoustic Model.
LM: Language Model.
Fig. 1 is a schematic flow chart of a speech data reconstruction method according to an embodiment of the present invention, and referring to fig. 1, the method includes:
100: and determining semantic information of the missing data according to the context of the missing data. The missing data is a missing portion of the speaker's voice data.
For example, among consecutive data packets, the k-th packet in the middle is lost; this k-th packet is the missing data. The information contained in the one or more data packets before the k-th packet and the one or more data packets after it constitutes the context of the missing data. Note that, in theory, the more context information is used, the more accurate the semantic analysis result becomes, but the longer the semantic analysis takes. Those skilled in the art can select the length of the context (i.e., the number of packets) according to their requirements for real-time performance and accuracy; the embodiment of the present invention does not specifically limit this.
102: and performing text voice conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
In the embodiment of the invention, the speaker's acoustic model can be obtained through pre-training, and can also be further refined by real-time training during real-time speech processing.
By adopting the method provided by the embodiment of the invention, on the one hand, semantic information of the missing data is obtained through context analysis; compared with the traditional technology, this supplements the missing information in a way that conforms to semantic logic. On the other hand, the supplemented semantic information is converted to speech through the speaker's acoustic model, which further restores the audio and matches the speaker's pronunciation characteristics. Combining the two, data reconstruction is performed simultaneously from the semantic and acoustic aspects, so missing information can be accurately restored, more characteristic information is carried, and playback is smoother and more natural.
Optionally, in an implementation of the embodiment of the present invention, phoneme information of the speaker is collected in real time based on the speaker's voice data, and the speaker's acoustic model is trained in real time. In this way, text-to-speech conversion can take the speaker's latest state into account, and the resulting audio better matches the speaker's current state. To achieve this, the acoustic model can weaken the speaker's historical acoustic characteristics and strengthen the recent ones, as sketched below.
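A minimal sketch of such recency weighting in Python: the patent does not specify an update rule, so the per-phoneme running means and the exponential-decay factor `alpha` below are illustrative assumptions.

```python
import numpy as np

class SpeakerAcousticStats:
    """Running per-phoneme feature statistics with recency weighting.

    The patent only states that historical acoustic characteristics
    should be weakened and recent ones strengthened; the exponential
    decay and the parameter `alpha` here are illustrative assumptions.
    """

    def __init__(self, alpha: float = 0.8):
        self.alpha = alpha   # weight kept by history; lower = more recency
        self.means = {}      # phoneme label -> running mean feature vector

    def update(self, phoneme: str, features: np.ndarray) -> None:
        # Blend newly observed features in, decaying the history.
        old = self.means.get(phoneme)
        if old is None:
            self.means[phoneme] = features.astype(float)
        else:
            self.means[phoneme] = self.alpha * old + (1.0 - self.alpha) * features
```

Lowering `alpha` makes the model forget the speaker's history faster, which is one way to realize the "strengthen the recent characteristics" behavior described above.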
Alternatively, in one implementation of the embodiment of the invention, referring to fig. 2, the process 100 may be implemented as follows:
1002: the preceding data and the following data of the missing data are acquired.
For example, for data to be played in the voice buffer, missing data is detected; if the missing data packet is r, the j data packets preceding packet r are selected as the preceding data and the d data packets following it as the following data, as in the sketch below. Here j and d are positive integers whose values can be set by those skilled in the art after weighing the requirements of real-time performance and accuracy.
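The sketch below illustrates this context selection in Python; the buffer layout (a list of packets in playback order with lost packets stored as None) is an assumption, not something the patent specifies.

```python
def get_context(buffer, r, j, d):
    """Return the j packets before and the d packets after missing packet r.

    `buffer` is assumed to be a list of audio packets in playback order,
    with lost packets stored as None (both the layout and the None marker
    are assumptions). Slices are clamped to the buffer bounds, so less
    context is returned near the edges.
    """
    preceding = [p for p in buffer[max(0, r - j):r] if p is not None]
    following = [p for p in buffer[r + 1:r + 1 + d] if p is not None]
    return preceding, following
```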
1004: and performing voice recognition calculation based on the preceding data and the following data, and determining the phoneme with the highest probability corresponding to the missing data.
For example, the j preceding and d following data packets are input into the speech recognition module, specifically into its GMM and HMM sub-modules, which compute the correspondence between data packet r and candidate phonemes together with the corresponding probabilities; the phoneme with the highest probability is then selected as the semantic information corresponding to the missing data (see the sketch below). This process can be carried out with conventional GMMs and HMMs, gives high semantic accuracy, and respects the context logic.
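As a sketch of the selection step only, treating the GMM/HMM scoring as a black box; `recognizer.score_phonemes` is a hypothetical interface, not an API named by the patent.

```python
def best_phoneme(recognizer, preceding, following):
    """Select the highest-probability phoneme for the missing packet.

    `recognizer.score_phonemes` is a hypothetical interface assumed to
    return a dict mapping candidate phonemes to probabilities computed
    by GMM/HMM sub-modules from the surrounding packets.
    """
    scores = recognizer.score_phonemes(preceding, following)
    phoneme = max(scores, key=scores.get)
    return phoneme, scores[phoneme]   # semantic info and its probability w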
Optionally, in this implementation, as shown by a dashed box, the method further includes:
1006: the determination is made based on the probability of the phoneme and the confidence of the text corresponding to the missing data, and the processing 102 is triggered if a set condition is satisfied.
Step 1006 accounts for whether text-to-speech conversion is actually necessary, avoiding unneeded data processing and saving system resources. For example, if step 1004 yields semantic information of poor quality, text-to-speech conversion cannot achieve the desired effect and only wastes resources.
Alternatively, at 1006, it is determined whether m × w + n × q > k holds, where w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a threshold. If the relation holds, the semantic information is of high quality and text-to-speech conversion can be performed; otherwise, no text-to-speech conversion is performed.
Illustratively, m = 1 and n = 1. In practice, those skilled in the art can flexibly choose the values of m and n, and even the way w and q are combined, according to the voice scenario, the speech recognition method, the speech reconstruction method, and so on; following the idea of the present application, deep learning (for example, a supervised learning method) can be applied to values of w, q, and k to obtain a suitable value of k. This embodiment does not specifically limit the values of these parameters. A minimal sketch of the decision rule follows.
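The sketch below encodes the condition in Python with the illustrative weights m = n = 1; the threshold value k = 1.2 is a placeholder assumption, to be tuned or learned as described above.

```python
def should_reconstruct(w: float, q: float,
                       m: float = 1.0, n: float = 1.0,
                       k: float = 1.2) -> bool:
    """Decide whether TTS-based reconstruction is worthwhile.

    w: probability of the best phoneme for the missing data
    q: confidence of the recognized sentence text
    m, n: weights (illustrative values m = n = 1, as in the text)
    k: set threshold; 1.2 is a placeholder to be tuned or learned
    """
    return m * w + n * q > k
```

For example, with w = 0.8 and q = 0.6 the weighted sum is 1.4, which exceeds the placeholder threshold, so reconstruction would proceed.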
Fig. 3 is a flowchart illustrating a voice data reconstruction method according to an embodiment of the present invention. Referring to fig. 3, the method includes:
300: and receiving and buffering voice data. Specifically, voice data is received and stored in a voice buffer, and conventional FEC, NACK, and the like are performed.
301: and (4) ASR recognition and acoustic model real-time training. Specifically, a speech recognition module is called to perform continuous ASR recognition at a speech receiving end, phoneme information of a speech speaker is collected, and a specific acoustic model is trained in real time. To improve the effect, the acoustic model AM may initially be pre-trained. The acoustic model is built up and modified as the speech communication progresses, in preparation for subsequent TTS.
302: and checking and processing missing data. Specifically, when data of the voice buffer is about to be played, it is checked whether there is missing data in the corresponding area. And if the missing packet is r, inputting the first j data and the last d data of the missing data into GMM and HMM sub-modules of the speech recognition module, and calculating probability information w of one or more phonemes corresponding to the missing data.
303: the confidence q of the corresponding sentence text (i.e. the semantics of the first j data + the last d data + the data packet r) is obtained from the speech recognition module.
304: a confidence level of the reconstructed data is calculated. Specifically, the parameters obtained at 302 and 303 are input to the speech reconstruction module ARM, and the confidence e of the reconstructed data is calculated according to the functional relationship e ═ f (w, q). For a detailed description of the functional relationship, please refer to the detailed description in the embodiment shown in fig. 2, which is not repeated herein.
305: and judging whether e is larger than k. If yes, the confidence level is higher, then 306 is executed; otherwise, the confidence is lower and execution is 307.
306: and converting text into voice. Specifically, a replacement packet for the missing packet is generated by TTS using the acoustic model of step 301 and inserted into the buffer.
307: white noise processing or stitching processing. The white noise processing refers to the insertion of white noise at the missing data packet. The splicing treatment means: and splicing the previous data and the next data of the data packet r after extension, and performing smoothing processing at an interface. The extension includes: the last packet of preceding data and the first packet of following data are extended by a factor of 1.5.
By adopting the method provided by the embodiment of the invention, on the one hand, weighing the necessity of data reconstruction avoids the computation and time wasted on invalid reconstruction; on the other hand, reconstructing data from both the semantic and the acoustic aspects restores the real data as far as possible and improves the quality of the voice data.
Furthermore, the invention also provides a voice data reconstruction device. As shown in fig. 4, the apparatus includes a semantic analysis module 40 and a first reconstruction module 42. The semantic analysis module 40 is configured to determine semantic information of missing data according to a context of the missing data, where the missing data is a missing part in voice data of a speaker. The first reconstruction module 42 is configured to perform text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker, so as to obtain reconstructed data of the missing data.
Optionally, in an implementation manner of the embodiment of the present invention, as shown by a dashed box in fig. 4, the semantic analysis module 40 includes a data obtaining sub-module 400, configured to obtain preceding data and following data of the missing data; a speech recognition submodule 402, configured to perform speech recognition calculation based on the preceding data and the following data, and determine a phoneme with a highest probability corresponding to the missing data.
Fig. 5 is a block diagram of a speech data reconstruction apparatus according to an embodiment of the present invention. Referring to fig. 5, the speech data reconstruction apparatus includes, in addition to the semantic analysis module 40 and the first reconstruction module 42, a judgment module 44, configured to make a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data, and trigger the first reconstruction module 42 if a relationship between the two satisfies a set condition.
Optionally, as shown by the dashed boxes in the figure, the voice data reconstruction apparatus may further include a second reconstruction module 46 for extending and then splicing the preceding data and the following data, or replacing the missing data with white noise. In this case, the judging module 44 is further configured to trigger the second reconstruction module 46 when the relationship between the phoneme probability and the text confidence does not satisfy the set condition.
Illustratively, the setting condition described above is m × w + n × q > k; wherein w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a threshold.
Optionally, as shown by the dashed box in the figure, the speech data reconstruction apparatus may further include a model training module 48 for collecting phoneme information of the speaker in real time based on the speech data of the speaker and training an acoustic model of the speaker in real time.
In the above embodiments of the speech data reconstruction apparatus, for descriptions of related nouns/terms, specific logic processing procedures, parameter values or ranges, technical effects, and the like, please refer to corresponding descriptions in the method embodiments, which are not described herein again.
Furthermore, the invention also provides an electronic device for reconstructing voice data. As shown in fig. 6, the electronic device includes at least a processor and a memory, and may further include a communication component, a sensor component, a power component, a multimedia component, and an input/output interface according to actual needs. The memory, communication component, sensor component, power component, multimedia component, and input/output interface are all connected to the processor. The memory may be a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read Only Memory (EEPROM), an Erasable Programmable Read Only Memory (EPROM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), a magnetic memory, a flash memory, etc., and the processor may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), etc. The communication component, sensor component, power component, multimedia component, and the like may be implemented with common components and are not specifically described herein.
In one embodiment of the invention, the processor retrieves and executes computer instructions from the memory to: a) determine semantic information of the missing data according to the context of the missing data, the missing data being a missing part of the speaker's voice data; b) perform text-to-speech conversion on the semantic information of the missing data based on the speaker's acoustic model to obtain reconstructed data of the missing data.
Wherein operation a may be implemented by the following logic: acquiring preceding data and following data of the missing data; and performing voice recognition calculation based on the preceding data and the following data, and determining the phoneme with the highest probability corresponding to the missing data.
In addition, the processor can also make a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data: when the relationship between the two satisfies a set condition, text-to-speech conversion is performed on the semantic information of the missing data based on the speaker's acoustic model; when it does not, the missing data is replaced with white noise, or the preceding and following data are extended and then spliced.
In addition, the processor may collect phoneme information of the speaker in real time based on voice data of the speaker and train an acoustic model of the speaker in real time.
For a specific description of the operations of the processor in the electronic device, refer to the corresponding descriptions in the method embodiments; they are not repeated here.
Compared with the prior art, the invention reconstructs data from both the semantic and the acoustic aspects, restoring the speaker's vocal state as faithfully as possible while satisfying the semantic logical relationship as far as possible, so that the final reconstructed data restores the missing information more accurately and truly and plays back more smoothly and naturally.
The voice data reconstruction method and apparatus and the electronic device provided by the invention have been explained in detail above. Any obvious modification to the invention that would occur to those skilled in the art without departing from its true spirit would infringe the patent rights of the invention and incur corresponding legal liability.

Claims (10)

1. A method for reconstructing speech data, comprising the steps of:
determining semantic information of the missing data according to the context of the missing data, wherein the missing data is a missing part in voice data of a speaker;
and performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
2. The method of reconstructing speech data according to claim 1, wherein said determining semantic information of the missing data based on the context of the missing data comprises the steps of:
acquiring preceding data and following data of the missing data;
and performing voice recognition calculation based on the preceding data and the following data, and determining the phoneme with the highest probability corresponding to the missing data.
3. The method of speech data reconstruction according to claim 2, wherein the method further comprises:
making a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data;
when the relationship between the two satisfies a set condition, performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker;
and when the relationship between the two does not satisfy the set condition, replacing the missing data with white noise, or extending and then splicing the preceding data and the following data.
4. The speech data reconstruction method according to claim 3, wherein the set condition satisfied by the relationship between the two is:
m×w+n×q>k;
wherein w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a threshold.
5. The method of speech data reconstruction according to claim 1, wherein the method further comprises:
collecting phoneme information of the speaker in real time based on the voice data of the speaker and training an acoustic model of the speaker in real time.
6. A speech data reconstruction apparatus characterized by comprising:
the semantic analysis module is used for determining semantic information of the missing data according to the context of the missing data, wherein the missing data is a missing part in the voice data of the speaker;
and the first reconstruction module is used for performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
7. The speech data reconstruction device of claim 6 wherein the semantic analysis module comprises:
a data acquisition submodule, used for acquiring the preceding data and the following data of the missing data;
and the voice recognition submodule is used for carrying out voice recognition calculation on the basis of the preceding data and the following data and determining the phoneme with the highest probability corresponding to the missing data.
8. The speech data reconstruction apparatus according to claim 7, further comprising:
a judging module, used for making a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data, triggering the first reconstruction module when the relationship between the two satisfies a set condition, and triggering the second reconstruction module when it does not;
and a second reconstruction module, used for extending and then splicing the preceding data and the following data, or replacing the missing data with white noise.
9. The speech data reconstruction apparatus according to claim 6, further comprising:
and the model training module is used for collecting phoneme information of the speaker in real time based on the voice data of the speaker and training the acoustic model of the speaker in real time.
10. An electronic device for performing voice data reconstruction, comprising:
a memory for storing computer instructions;
a processor for retrieving and executing said computer instructions from said memory to implement a voice data reconstruction method as claimed in any one of claims 1 to 5.
CN201911398821.4A 2019-12-30 2019-12-30 Voice data reconstruction method and device and electronic equipment Pending CN111079446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911398821.4A CN111079446A (en) 2019-12-30 2019-12-30 Voice data reconstruction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911398821.4A CN111079446A (en) 2019-12-30 2019-12-30 Voice data reconstruction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111079446A (en) 2020-04-28

Family

ID=70319919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911398821.4A Pending CN111079446A (en) 2019-12-30 2019-12-30 Voice data reconstruction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111079446A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810996A (en) * 2014-02-21 2014-05-21 北京凌声芯语音科技有限公司 Processing method, device and system for voice to be tested
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
CN109389990A (en) * 2017-08-09 2019-02-26 2236008安大略有限公司 Reinforce method, system, vehicle and the medium of voice
CN108831440A (en) * 2018-04-24 2018-11-16 中国地质大学(武汉) A kind of vocal print noise-reduction method and system based on machine learning and deep learning
CN109545197A (en) * 2019-01-02 2019-03-29 珠海格力电器股份有限公司 Recognition methods, device and the intelligent terminal of phonetic order
CN109616128A (en) * 2019-01-30 2019-04-12 努比亚技术有限公司 Voice transmitting method, device and computer readable storage medium
CN110556093A (en) * 2019-09-17 2019-12-10 浙江核新同花顺网络信息股份有限公司 Voice marking method and system

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022143364A1 (en) * 2020-12-28 2022-07-07 阿里巴巴(中国)有限公司 Audio packet loss compensation processing method and apparatus, and electronic device

Similar Documents

Publication Publication Date Title
CN110415687B (en) Voice processing method, device, medium and electronic equipment
US8532994B2 (en) Speech recognition using a personal vocabulary and language model
US7490042B2 (en) Methods and apparatus for adapting output speech in accordance with context of communication
US7885817B2 (en) Easy generation and automatic training of spoken dialog systems using text-to-speech
CA2486125C (en) A system and method of using meta-data in speech-processing
US7269561B2 (en) Bandwidth efficient digital voice communication system and method
CN111508498A (en) Conversational speech recognition method, system, electronic device and storage medium
CN110287303B (en) Man-machine conversation processing method, device, electronic equipment and storage medium
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN111816210B (en) Voice scoring method and device
WO2022227935A1 (en) Speech recognition method and apparatus, and device, storage medium and program product
WO2023116660A2 (en) Model training and tone conversion method and apparatus, device, and medium
WO2023030235A1 (en) Target audio output method and system, readable storage medium, and electronic apparatus
US6377921B1 (en) Identifying mismatches between assumed and actual pronunciations of words
CN111667834B (en) Hearing-aid equipment and hearing-aid method
WO2016172871A1 (en) Speech synthesis method based on recurrent neural networks
US8355484B2 (en) Methods and apparatus for masking latency in text-to-speech systems
CN111079446A (en) Voice data reconstruction method and device and electronic equipment
Pradhan et al. Estimating semantic confidence for spoken dialogue systems
JP6448950B2 (en) Spoken dialogue apparatus and electronic device
CN109389999A (en) A kind of high performance audio-video is made pauses in reading unpunctuated ancient writings method and system automatically
CN112669821B (en) Voice intention recognition method, device, equipment and storage medium
CN117253485B (en) Data processing method, device, equipment and storage medium
JPH09198077A (en) Speech recognition device
CN115881085A (en) Speech synthesis method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination