CN111079446A - Voice data reconstruction method and device and electronic equipment - Google Patents

Voice data reconstruction method and device and electronic equipment

Info

Publication number
CN111079446A
Authority
CN
China
Prior art keywords
data
missing
speaker
missing data
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911398821.4A
Other languages
Chinese (zh)
Inventor
黄启辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Infobird Software Co Ltd
Original Assignee
Beijing Infobird Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Infobird Software Co Ltd filed Critical Beijing Infobird Software Co Ltd
Priority to CN201911398821.4A
Publication of CN111079446A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/005 - Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice data reconstruction method and apparatus and an electronic device. Data reconstruction is performed from both the semantic and the acoustic aspects, restoring the speaker's vocal state as faithfully as possible while satisfying the semantic logical relationship as far as possible, so that the finally reconstructed data restores the missing information more accurately and truly and plays back more smoothly and naturally.

Description

Voice data reconstruction method and device and electronic equipment
Technical Field
The invention relates to a voice data reconstruction method, a voice data reconstruction device and electronic equipment, and belongs to the technical field of voice processing.
Background
Real-time voice communication is widely used in fields such as instant messaging and call centers. Network problems such as congestion, packet loss, and jitter are common and unavoidable; they degrade voice call quality and can even interrupt communication.
In IP-based voice transmission, when a lost packet cannot be recovered by retransmission, the conventional approach is to insert white noise directly in place of the missing data, or to splice together the data preceding and following the gap. Such methods cannot restore the real sound data and suffer from problems such as stuttering and information loss.
Disclosure of Invention
A primary object of the present invention is to provide a voice data reconstruction method.
Another object of the present invention is to provide a voice data reconstruction apparatus.
A further object of the present invention is to provide an electronic device for implementing voice data reconstruction.
To achieve the above objects, the present invention adopts the following technical solutions:
according to a first aspect of the embodiments of the present invention, there is provided a speech data reconstruction method, including the steps of:
determining semantic information of the missing data according to the context of the missing data, wherein the missing data is a missing part in voice data of a speaker;
and performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
Preferably, the determining semantic information of the missing data according to the context of the missing data includes the following steps:
acquiring preceding data and following data of the missing data;
and performing voice recognition calculation based on the preceding data and the following data, and determining the phoneme with the highest probability corresponding to the missing data.
Preferably, the method further comprises: making a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data; and, when the relationship between the two satisfies a set condition, performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker.
Preferably, the method further comprises: when the relationship between the two does not satisfy the set condition, replacing the missing data with white noise, or extending and then splicing the preceding data and the following data.
Preferably, the set condition satisfied by the relationship between the two is:
m×w+n×q>k;
wherein w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a set threshold.
Preferably, the method further comprises: collecting phoneme information of the speaker in real time based on the voice data of the speaker and training an acoustic model of the speaker in real time.
According to a second aspect of the embodiments of the present invention, there is provided a voice data reconstruction apparatus including:
the semantic analysis module is used for determining semantic information of the missing data according to the context of the missing data, wherein the missing data is a missing part in the voice data of the speaker;
and the first reconstruction module is used for performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
Preferably, the semantic analysis module comprises: a data acquisition submodule, used for acquiring the preceding data and the following data of the missing data; and a speech recognition submodule, used for performing speech recognition calculation based on the preceding data and the following data and determining the phoneme with the highest probability corresponding to the missing data.
Preferably, the device further comprises: a judging module, used for making a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data, triggering the first reconstruction module when the relationship between the two satisfies a set condition, and triggering the second reconstruction module when it does not; and a second reconstruction module, used for extending and then splicing the preceding data and the following data, or replacing the missing data with white noise.
Preferably, the set condition satisfied by the relationship between the two is:
m×w+n×q>k;
wherein w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a threshold.
Preferably, the device further comprises: and the model training module is used for collecting phoneme information of the speaker in real time based on the voice data of the speaker and training the acoustic model of the speaker in real time.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device for performing voice data reconstruction, the electronic device including:
a memory for storing computer instructions;
a processor for retrieving and executing the computer instructions from the memory, thereby implementing the voice data reconstruction method provided in the foregoing first aspect or any preferred implementation thereof.
Compared with the prior art, the method obtains semantic information through context analysis, so lost packet data can be reconstructed at the semantic level while satisfying the semantic logical relationship as far as possible; text-to-speech conversion is then performed on the semantic information using the speaker's acoustic model to restore the audio data. By combining semantic reconstruction with voice restoration, the final reconstructed data restores the missing information accurately, carries more characteristic information, and plays back more smoothly and naturally.
Drawings
Fig. 1 is a schematic flowchart of a voice data reconstruction method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a semantic analysis method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another voice data reconstruction method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a voice data reconstruction apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic architecture diagram of a voice data reconstruction apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical contents of the invention are described in detail below with reference to the accompanying drawings and specific embodiments.
In IP-based voice transmission, the prior art cannot satisfactorily restore voice data when packets are lost, leading to problems such as stuttering and information loss.
To solve the above problems, the present invention provides a voice data reconstruction method and apparatus and an electronic device, which fully consider the importance of both semantic logic and sound characteristics in voice data and voice playback.
First, the nouns/terms involved in the various embodiments of the present invention are briefly explained:
ASR: Automatic Speech Recognition.
TTS: Text To Speech, text-to-speech conversion.
ARM: Audio Reconstruction Model.
FEC: Forward Error Correction.
NACK: Negative Acknowledgement, a packet-loss retransmission request.
HMM: Hidden Markov Model.
GMM: Gaussian Mixture Model.
AM: Acoustic Model.
LM: Language Model.
Fig. 1 is a schematic flow chart of a speech data reconstruction method according to an embodiment of the present invention, and referring to fig. 1, the method includes:
100: and determining semantic information of the missing data according to the context of the missing data. The missing data is a missing portion of the speaker's voice data.
For example, among consecutive data packets, the k-th packet in the middle is lost; this k-th packet is the missing data. The information contained in the one or more data packets before the k-th packet and the one or more data packets after it constitutes the context of the missing data. Note that, in theory, the more context information is used, the more accurate the semantic analysis result becomes, but the longer the semantic analysis takes. Those skilled in the art can select the length of the context (i.e., the number of packets) according to their requirements for real-time performance and accuracy; the embodiment of the present invention does not specifically limit this.
102: and performing text voice conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
In the embodiment of the invention, the speaker's acoustic model can be obtained through pre-training, and can also be further refined by real-time training during real-time speech processing.
By adopting the method provided by the embodiment of the invention, on the one hand, semantic information of the missing data is obtained through context analysis; compared with the traditional technology, this supplements the missing information in a way that conforms to semantic logic. On the other hand, the supplemented semantic information is converted to speech through the speaker's acoustic model, which further restores the audio and matches the speaker's pronunciation characteristics. Combining the two, data reconstruction is performed simultaneously from the semantic and acoustic aspects, so missing information can be accurately restored, more characteristic information is carried, and playback is smoother and more natural.
Optionally, in an implementation of the embodiment of the present invention, phoneme information of the speaker is collected in real time based on the speaker's voice data, and the speaker's acoustic model is trained in real time. In this way, text-to-speech conversion can take the speaker's latest state into account, and the resulting audio better matches the speaker's current state. To achieve this, the acoustic model can weaken the speaker's historical acoustic characteristics and strengthen the recent ones, as sketched below.
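A minimal sketch of such recency weighting in Python: the patent does not specify an update rule, so the per-phoneme running means and the exponential-decay factor `alpha` below are illustrative assumptions.

```python
import numpy as np

class SpeakerAcousticStats:
    """Running per-phoneme feature statistics with recency weighting.

    The patent only states that historical acoustic characteristics
    should be weakened and recent ones strengthened; the exponential
    decay and the parameter `alpha` here are illustrative assumptions.
    """

    def __init__(self, alpha: float = 0.8):
        self.alpha = alpha   # weight kept by history; lower = more recency
        self.means = {}      # phoneme label -> running mean feature vector

    def update(self, phoneme: str, features: np.ndarray) -> None:
        # Blend newly observed features in, decaying the history.
        old = self.means.get(phoneme)
        if old is None:
            self.means[phoneme] = features.astype(float)
        else:
            self.means[phoneme] = self.alpha * old + (1.0 - self.alpha) * features
```

Lowering `alpha` makes the model forget the speaker's history faster, which is one way to realize the "strengthen the recent characteristics" behavior described above.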
Alternatively, in one implementation of the embodiment of the invention, referring to fig. 2, the process 100 may be implemented as follows:
1002: the preceding data and the following data of the missing data are acquired.
For example, for data to be played in the voice buffer, missing data is detected; if the missing data packet is r, the j data packets preceding packet r are selected as the preceding data and the d data packets following it as the following data, as in the sketch below. Here j and d are positive integers whose values can be set by those skilled in the art after weighing the requirements of real-time performance and accuracy.
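The sketch below illustrates this context selection in Python; the buffer layout (a list of packets in playback order with lost packets stored as None) is an assumption, not something the patent specifies.

```python
def get_context(buffer, r, j, d):
    """Return the j packets before and the d packets after missing packet r.

    `buffer` is assumed to be a list of audio packets in playback order,
    with lost packets stored as None (both the layout and the None marker
    are assumptions). Slices are clamped to the buffer bounds, so less
    context is returned near the edges.
    """
    preceding = [p for p in buffer[max(0, r - j):r] if p is not None]
    following = [p for p in buffer[r + 1:r + 1 + d] if p is not None]
    return preceding, following
```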
1004: and performing voice recognition calculation based on the preceding data and the following data, and determining the phoneme with the highest probability corresponding to the missing data.
For example, the j preceding and d following data packets are input into the speech recognition module, specifically into its GMM and HMM sub-modules, which compute the correspondence between data packet r and candidate phonemes together with the corresponding probabilities; the phoneme with the highest probability is then selected as the semantic information corresponding to the missing data (see the sketch below). This process can be carried out with conventional GMMs and HMMs, gives high semantic accuracy, and respects the context logic.
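As a sketch of the selection step only, treating the GMM/HMM scoring as a black box; `recognizer.score_phonemes` is a hypothetical interface, not an API named by the patent.

```python
def best_phoneme(recognizer, preceding, following):
    """Select the highest-probability phoneme for the missing packet.

    `recognizer.score_phonemes` is a hypothetical interface assumed to
    return a dict mapping candidate phonemes to probabilities computed
    by GMM/HMM sub-modules from the surrounding packets.
    """
    scores = recognizer.score_phonemes(preceding, following)
    phoneme = max(scores, key=scores.get)
    return phoneme, scores[phoneme]   # semantic info and its probability w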
Optionally, in this implementation, as shown by a dashed box, the method further includes:
1006: the determination is made based on the probability of the phoneme and the confidence of the text corresponding to the missing data, and the processing 102 is triggered if a set condition is satisfied.
Step 1006 accounts for whether text-to-speech conversion is actually necessary, avoiding unneeded data processing and saving system resources. For example, if step 1004 yields semantic information of poor quality, text-to-speech conversion cannot achieve the desired effect and only wastes resources.
Alternatively, at 1006, it is determined whether m × w + n × q > k holds, where w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a threshold. If the relation holds, the semantic information is of high quality and text-to-speech conversion can be performed; otherwise, no text-to-speech conversion is performed.
Illustratively, m = 1 and n = 1. In practice, those skilled in the art can flexibly choose the values of m and n, and even the way w and q are combined, according to the voice scenario, the speech recognition method, the speech reconstruction method, and so on; following the idea of the present application, deep learning (for example, a supervised learning method) can be applied to values of w, q, and k to obtain a suitable value of k. This embodiment does not specifically limit the values of these parameters. A minimal sketch of the decision rule follows.
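The sketch below encodes the condition in Python with the illustrative weights m = n = 1; the threshold value k = 1.2 is a placeholder assumption, to be tuned or learned as described above.

```python
def should_reconstruct(w: float, q: float,
                       m: float = 1.0, n: float = 1.0,
                       k: float = 1.2) -> bool:
    """Decide whether TTS-based reconstruction is worthwhile.

    w: probability of the best phoneme for the missing data
    q: confidence of the recognized sentence text
    m, n: weights (illustrative values m = n = 1, as in the text)
    k: set threshold; 1.2 is a placeholder to be tuned or learned
    """
    return m * w + n * q > k
```

For example, with w = 0.8 and q = 0.6 the weighted sum is 1.4, which exceeds the placeholder threshold, so reconstruction would proceed.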
Fig. 3 is a flowchart illustrating a voice data reconstruction method according to an embodiment of the present invention. Referring to fig. 3, the method includes:
300: and receiving and buffering voice data. Specifically, voice data is received and stored in a voice buffer, and conventional FEC, NACK, and the like are performed.
301: and (4) ASR recognition and acoustic model real-time training. Specifically, a speech recognition module is called to perform continuous ASR recognition at a speech receiving end, phoneme information of a speech speaker is collected, and a specific acoustic model is trained in real time. To improve the effect, the acoustic model AM may initially be pre-trained. The acoustic model is built up and modified as the speech communication progresses, in preparation for subsequent TTS.
302: and checking and processing missing data. Specifically, when data of the voice buffer is about to be played, it is checked whether there is missing data in the corresponding area. And if the missing packet is r, inputting the first j data and the last d data of the missing data into GMM and HMM sub-modules of the speech recognition module, and calculating probability information w of one or more phonemes corresponding to the missing data.
303: the confidence q of the corresponding sentence text (i.e. the semantics of the first j data + the last d data + the data packet r) is obtained from the speech recognition module.
304: a confidence level of the reconstructed data is calculated. Specifically, the parameters obtained at 302 and 303 are input to the speech reconstruction module ARM, and the confidence e of the reconstructed data is calculated according to the functional relationship e ═ f (w, q). For a detailed description of the functional relationship, please refer to the detailed description in the embodiment shown in fig. 2, which is not repeated herein.
305: and judging whether e is larger than k. If yes, the confidence level is higher, then 306 is executed; otherwise, the confidence is lower and execution is 307.
306: and converting text into voice. Specifically, a replacement packet for the missing packet is generated by TTS using the acoustic model of step 301 and inserted into the buffer.
307: white noise processing or stitching processing. The white noise processing refers to the insertion of white noise at the missing data packet. The splicing treatment means: and splicing the previous data and the next data of the data packet r after extension, and performing smoothing processing at an interface. The extension includes: the last packet of preceding data and the first packet of following data are extended by a factor of 1.5.
By adopting the method provided by the embodiment of the invention, on the one hand, weighing the necessity of data reconstruction avoids the computation and time wasted on invalid reconstruction; on the other hand, reconstructing data from both the semantic and the acoustic aspects restores the real data as far as possible and improves the quality of the voice data.
Furthermore, the invention also provides a voice data reconstruction device. As shown in fig. 4, the apparatus includes a semantic analysis module 40 and a first reconstruction module 42. The semantic analysis module 40 is configured to determine semantic information of missing data according to a context of the missing data, where the missing data is a missing part in voice data of a speaker. The first reconstruction module 42 is configured to perform text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker, so as to obtain reconstructed data of the missing data.
Optionally, in an implementation manner of the embodiment of the present invention, as shown by a dashed box in fig. 4, the semantic analysis module 40 includes a data obtaining sub-module 400, configured to obtain preceding data and following data of the missing data; a speech recognition submodule 402, configured to perform speech recognition calculation based on the preceding data and the following data, and determine a phoneme with a highest probability corresponding to the missing data.
Fig. 5 is a block diagram of a speech data reconstruction apparatus according to an embodiment of the present invention. Referring to fig. 5, the speech data reconstruction apparatus includes, in addition to the semantic analysis module 40 and the first reconstruction module 42, a judgment module 44, configured to make a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data, and trigger the first reconstruction module 42 if a relationship between the two satisfies a set condition.
Optionally, as shown by the dashed boxes in the figure, the voice data reconstruction apparatus may further include a second reconstruction module 46 for extending and then splicing the preceding data and the following data, or replacing the missing data with white noise. In this case, the judging module 44 is further configured to trigger the second reconstruction module 46 when the relationship between the phoneme probability and the text confidence does not satisfy the set condition.
Illustratively, the setting condition described above is m × w + n × q > k; wherein w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a threshold.
Optionally, as shown by the dashed box in the figure, the speech data reconstruction apparatus may further include a model training module 48 for collecting phoneme information of the speaker in real time based on the speech data of the speaker and training an acoustic model of the speaker in real time.
In the above embodiments of the speech data reconstruction apparatus, for descriptions of related nouns/terms, specific logic processing procedures, parameter values or ranges, technical effects, and the like, please refer to corresponding descriptions in the method embodiments, which are not described herein again.
Furthermore, the invention also provides an electronic device for reconstructing voice data. As shown in fig. 6, the electronic device includes at least a processor and a memory, and may further include a communication component, a sensor component, a power component, a multimedia component, and an input/output interface according to actual needs. The memory, communication component, sensor component, power component, multimedia component, and input/output interface are all connected to the processor. The memory may be a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read Only Memory (EEPROM), an Erasable Programmable Read Only Memory (EPROM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), a magnetic memory, a flash memory, etc., and the processor may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), etc. The communication component, sensor component, power component, multimedia component, and the like may be implemented with common components and are not specifically described herein.
In one embodiment of the invention, the processor retrieves and executes computer instructions from the memory to: a) determine semantic information of the missing data according to the context of the missing data, the missing data being a missing part of the speaker's voice data; b) perform text-to-speech conversion on the semantic information of the missing data based on the speaker's acoustic model to obtain reconstructed data of the missing data.
Wherein operation a may be implemented by the following logic: acquiring preceding data and following data of the missing data; and performing voice recognition calculation based on the preceding data and the following data, and determining the phoneme with the highest probability corresponding to the missing data.
In addition, the processor can also make a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data: when the relationship between the two satisfies a set condition, text-to-speech conversion is performed on the semantic information of the missing data based on the speaker's acoustic model; when it does not, the missing data is replaced with white noise, or the preceding and following data are extended and then spliced.
In addition, the processor may collect phoneme information of the speaker in real time based on voice data of the speaker and train an acoustic model of the speaker in real time.
For a specific description of the operations of the processor in the electronic device, refer to the corresponding descriptions in the method embodiments; they are not repeated here.
Compared with the prior art, the invention reconstructs data from both the semantic and the acoustic aspects, restoring the speaker's vocal state as faithfully as possible while satisfying the semantic logical relationship as far as possible, so that the final reconstructed data restores the missing information more accurately and truly and plays back more smoothly and naturally.
The voice data reconstruction method and apparatus and the electronic device provided by the invention have been explained in detail above. Any obvious modification to the invention that would occur to those skilled in the art without departing from its true spirit would infringe the patent rights of the invention and incur corresponding legal liability.

Claims (10)

1. A method for reconstructing speech data, comprising the steps of:
determining semantic information of the missing data according to the context of the missing data, wherein the missing data is a missing part in voice data of a speaker;
and performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
2. The method of reconstructing speech data according to claim 1, wherein said determining semantic information of the missing data based on the context of the missing data comprises the steps of:
acquiring preceding data and following data of the missing data;
and performing voice recognition calculation based on the preceding data and the following data, and determining the phoneme with the highest probability corresponding to the missing data.
3. The method of speech data reconstruction according to claim 2, wherein the method further comprises:
making a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data;
when the relationship between the two satisfies a set condition, performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker;
and when the relationship between the two does not satisfy the set condition, replacing the missing data with white noise, or extending and then splicing the preceding data and the following data.
4. The speech data reconstruction method according to claim 3, wherein the set condition satisfied by the relationship between the two is:
m×w+n×q>k;
wherein w represents the probability of the phoneme, q represents the confidence of the text corresponding to the missing data, m represents the weight of w, n represents the weight of q, and k is a threshold.
5. The method of speech data reconstruction according to claim 1, wherein the method further comprises:
collecting phoneme information of the speaker in real time based on the voice data of the speaker and training an acoustic model of the speaker in real time.
6. A speech data reconstruction apparatus characterized by comprising:
the semantic analysis module is used for determining semantic information of the missing data according to the context of the missing data, wherein the missing data is a missing part in the voice data of the speaker;
and the first reconstruction module is used for performing text-to-speech conversion on the semantic information of the missing data based on the acoustic model of the speaker to obtain reconstructed data of the missing data.
7. The speech data reconstruction device of claim 6 wherein the semantic analysis module comprises:
a data acquisition submodule, used for acquiring the preceding data and the following data of the missing data;
and the voice recognition submodule is used for carrying out voice recognition calculation on the basis of the preceding data and the following data and determining the phoneme with the highest probability corresponding to the missing data.
8. The speech data reconstruction apparatus according to claim 7, further comprising:
a judging module, used for making a judgment based on the probability of the phoneme and the confidence of the text corresponding to the missing data, triggering the first reconstruction module when the relationship between the two satisfies a set condition, and triggering the second reconstruction module when it does not;
and a second reconstruction module, used for extending and then splicing the preceding data and the following data, or replacing the missing data with white noise.
9. The speech data reconstruction apparatus according to claim 6, further comprising:
and the model training module is used for collecting phoneme information of the speaker in real time based on the voice data of the speaker and training the acoustic model of the speaker in real time.
10. An electronic device for performing voice data reconstruction, comprising:
a memory for storing computer instructions;
a processor for retrieving and executing said computer instructions from said memory to implement a voice data reconstruction method as claimed in any one of claims 1 to 5.
CN201911398821.4A 2019-12-30 2019-12-30 Voice data reconstruction method and device and electronic equipment Pending CN111079446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911398821.4A CN111079446A (en) 2019-12-30 2019-12-30 Voice data reconstruction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911398821.4A CN111079446A (en) 2019-12-30 2019-12-30 Voice data reconstruction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111079446A (en) 2020-04-28

Family

ID=70319919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911398821.4A Pending CN111079446A (en) 2019-12-30 2019-12-30 Voice data reconstruction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111079446A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810996A (en) * 2014-02-21 2014-05-21 北京凌声芯语音科技有限公司 Processing method, device and system for voice to be tested
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
CN109389990A (en) * 2017-08-09 2019-02-26 2236008安大略有限公司 Reinforce method, system, vehicle and the medium of voice
CN108831440A (en) * 2018-04-24 2018-11-16 中国地质大学(武汉) A kind of vocal print noise-reduction method and system based on machine learning and deep learning
CN109545197A (en) * 2019-01-02 2019-03-29 珠海格力电器股份有限公司 Recognition methods, device and the intelligent terminal of phonetic order
CN109616128A (en) * 2019-01-30 2019-04-12 努比亚技术有限公司 Voice transmitting method, device and computer readable storage medium
CN110556093A (en) * 2019-09-17 2019-12-10 浙江核新同花顺网络信息股份有限公司 Voice marking method and system

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022143364A1 (en) * 2020-12-28 2022-07-07 阿里巴巴(中国)有限公司 Audio packet loss compensation processing method and apparatus, and electronic device

Similar Documents

Publication Publication Date Title
CN110415687B (en) Voice processing method, device, medium and electronic equipment
US8532994B2 (en) Speech recognition using a personal vocabulary and language model
US7490042B2 (en) Methods and apparatus for adapting output speech in accordance with context of communication
US7885817B2 (en) Easy generation and automatic training of spoken dialog systems using text-to-speech
CA2486125C (en) A system and method of using meta-data in speech-processing
US7269561B2 (en) Bandwidth efficient digital voice communication system and method
CN111508498A (en) Conversational speech recognition method, system, electronic device and storage medium
CN110287303B (en) Man-machine conversation processing method, device, electronic equipment and storage medium
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN111816210B (en) Voice scoring method and device
WO2022227935A1 (en) Speech recognition method and apparatus, and device, storage medium and program product
WO2023116660A2 (en) Model training and tone conversion method and apparatus, device, and medium
WO2023030235A1 (en) Target audio output method and system, readable storage medium, and electronic apparatus
US6377921B1 (en) Identifying mismatches between assumed and actual pronunciations of words
CN111667834B (en) Hearing-aid equipment and hearing-aid method
WO2016172871A1 (en) Speech synthesis method based on recurrent neural networks
US8355484B2 (en) Methods and apparatus for masking latency in text-to-speech systems
CN111079446A (en) Voice data reconstruction method and device and electronic equipment
Pradhan et al. Estimating semantic confidence for spoken dialogue systems
JP6448950B2 (en) Spoken dialogue apparatus and electronic device
CN109389999A (en) A kind of high performance audio-video is made pauses in reading unpunctuated ancient writings method and system automatically
CN112669821B (en) Voice intention recognition method, device, equipment and storage medium
CN117253485B (en) Data processing method, device, equipment and storage medium
JPH09198077A (en) Speech recognition device
CN115881085A (en) Speech synthesis method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination