CN111627457A - Voice separation method, system and computer readable storage medium - Google Patents

Voice separation method, system and computer readable storage medium

Info

Publication number
CN111627457A
Authority
CN
China
Prior art keywords
voice
speech
data
phoneme
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010405182.6A
Other languages
Chinese (zh)
Inventor
郑琳琳 (Zheng Linlin)
龙洪锋 (Long Hongfeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Speakin Intelligent Technology Co ltd
Original Assignee
Guangzhou Speakin Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Speakin Intelligent Technology Co ltd filed Critical Guangzhou Speakin Intelligent Technology Co ltd
Priority to CN202010405182.6A priority Critical patent/CN111627457A/en
Publication of CN111627457A publication Critical patent/CN111627457A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice separation method, a system and a computer-readable storage medium. The voice separation method comprises the following steps: acquiring preprocessed voice data; performing feature extraction on the voice data to acquire phoneme feature data corresponding to the voice data; and separating the voice data based on the phoneme feature data. Because separation is driven by phoneme features, the accuracy of voice separation is improved.

Description

Voice separation method, system and computer readable storage medium
Technical Field
The present invention relates to the field of voice separation, and in particular, to a method, a system, and a computer-readable storage medium for voice separation.
Background
Current voice separation work focuses on separating human voice from noise, but in practice several different human voices often occur at the same time. How to separate voices in an acoustic environment with multiple mixed voices has therefore long been an important research direction in speech signal processing. Because the speech characteristics of different speakers are very similar, the technical difficulty of speech separation is significantly greater than that of speech noise reduction, and separating one voice from another remains an unsolved problem.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a voice separation method, a voice separation system and a computer-readable storage medium, aiming to solve the technical problem that existing multi-speaker voice separation has low accuracy.
In order to achieve the above object, the present invention provides a speech separation method, including:
acquiring preprocessed voice data;
performing feature extraction on the voice data to acquire phoneme feature data corresponding to the voice data;
separating the speech data based on the phoneme feature data.
Preferably, the phoneme feature data is input to a language identification model to obtain a language pre-judgment result corresponding to the voice data;
and separating the voice data with different languages according to the language pre-judgment result to obtain a plurality of voice data sets with the same language.
Preferably, target phoneme recognition is respectively performed on the voice data sets with the same language, and a plurality of voice frames containing the target phonemes are acquired;
and acquiring target phoneme posterior probabilities corresponding to the plurality of speech frames one by one, and separating the plurality of speech frames based on the target phoneme posterior probabilities.
Preferably, the target phoneme posterior probabilities corresponding to the plurality of speech frames one by one are encoded based on an encoder to obtain the encoding layer characteristics corresponding to each speech frame;
decoding the coding layer characteristics to obtain frequency spectrum characteristics corresponding to the coding layer characteristics;
and separating a plurality of voice frames according to the spectral characteristics.
Preferably, the target phoneme posterior probabilities corresponding one-to-one to the plurality of speech frames are sequentially input into the convolutional neural network for feature mapping, so as to respectively obtain the mapping features corresponding to each speech frame;
and the mapping features are input into the bidirectional long short-term memory (BiLSTM) neural network to obtain the coding layer features corresponding to each speech frame.
Preferably, the spectral characteristics are input into an overlap judgment model, and a prejudgment result of whether overlap exists among a plurality of speech frames is output;
and separating a plurality of voice frames according to the pre-judging result.
Preferably, initial voice data collected by an audio device is received;
and pre-filtering the initial voice data to obtain pre-processed voice data.
In addition, to achieve the above object, the present invention also provides a speech separation system, including:
the acquisition module is used for acquiring the preprocessed voice data;
the feature extraction module is used for extracting features of the voice data to obtain phoneme feature data corresponding to the voice data;
and the separation module is used for separating the voice data based on the phoneme feature data.
Preferably, the separation module further comprises:
a language identification unit, configured to input the phoneme feature data into a language identification model, so as to obtain a language pre-judgment result corresponding to the speech data;
and the separation unit is used for separating the voice data with different languages according to the language pre-judgment result so as to obtain a plurality of voice data sets with the same language.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a voice separation program which, when executed by a processor, realizes the steps of the voice separation method described in any one of the above.
According to the voice separation method provided by the invention, preprocessed voice data are obtained; feature extraction is then performed on the voice data to obtain the corresponding phoneme feature data; and finally the voice data are separated based on the phoneme feature data. Because each person's phoneme features differ, separating voices by their phoneme features improves the accuracy of voice separation.
Drawings
Fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a voice separation method according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the terminal may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory), and may alternatively be a storage device separate from the processor 1001.
Optionally, the terminal may further include a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a WiFi module, and the like. Among the sensors, a light sensor may include an ambient light sensor, which adjusts the brightness of the display screen according to ambient light, and a proximity sensor, which turns off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes) and, when the terminal is stationary, the magnitude and direction of gravity; it can be used for applications that recognize the attitude of the mobile terminal (such as switching between horizontal and vertical screens, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer and tapping). Of course, the mobile terminal may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described herein again.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a voice separation program.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and processor 1001 may be used to invoke a voice separation program stored in memory 1005.
In this embodiment, the voice separating apparatus includes: a memory 1005, a processor 1001 and a voice separation program stored in the memory 1005 and operable on the processor 1001, wherein when the processor 1001 calls the voice separation program stored in the memory 1005, the following operations are performed:
acquiring preprocessed voice data;
performing feature extraction on the voice data to acquire phoneme feature data corresponding to the voice data;
separating the speech data based on the phoneme feature data.
Further, processor 1001 may call a voice separation program stored in memory 1005, and also perform the following operations:
inputting the phoneme feature data into a language identification model to obtain a language pre-judgment result corresponding to the voice data;
and separating the voice data with different languages according to the language pre-judgment result to obtain a plurality of voice data sets with the same language.
Further, processor 1001 may call a voice separation program stored in memory 1005, and also perform the following operations:
respectively carrying out target phoneme recognition on the voice data sets with the same language, and acquiring a plurality of voice frames containing the target phonemes;
and acquiring target phoneme posterior probabilities corresponding to the plurality of speech frames one by one, and separating the plurality of speech frames based on the target phoneme posterior probabilities.
Further, processor 1001 may call a voice separation program stored in memory 1005, and also perform the following operations:
encoding, based on an encoder, the target phoneme posterior probabilities corresponding one-to-one to the plurality of voice frames to obtain the coding layer characteristics corresponding to each voice frame;
decoding the coding layer characteristics to obtain frequency spectrum characteristics corresponding to the coding layer characteristics;
and separating a plurality of voice frames according to the spectral characteristics.
Further, processor 1001 may call a voice separation program stored in memory 1005, and also perform the following operations:
sequentially inputting the target phoneme posterior probabilities corresponding one-to-one to the plurality of voice frames into the convolutional neural network for feature mapping, so as to respectively obtain the mapping features corresponding to each voice frame;
and inputting the mapping features into the bidirectional long short-term memory neural network to obtain the coding layer features corresponding to each voice frame.
Further, processor 1001 may call a voice separation program stored in memory 1005, and also perform the following operations:
inputting the spectrum characteristics into an overlapping judgment model, and outputting a prejudgment result of whether overlapping exists among a plurality of voice frames;
and separating a plurality of voice frames according to the pre-judging result.
Further, processor 1001 may call a voice separation program stored in memory 1005, and also perform the following operations:
receiving initial voice data collected by audio equipment;
and pre-filtering the initial voice data to obtain pre-processed voice data.
The invention also provides a voice separation method. Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of the voice separation method of the invention.
Step S10, acquiring preprocessed voice data;
step S20, extracting the characteristics of the voice data to obtain the phoneme characteristic data corresponding to the voice data;
in the embodiment of the invention, the preprocessed voice data is obtained, then the voice data is subjected to feature extraction, specifically, the voice data composed of a plurality of voice frames is collected by the collecting equipment, and transmits the voice data composed of the plurality of voice frames to a preset feature extraction model, specifically, a plurality of first speech features corresponding to the plurality of speech frames one to one can be extracted from the plurality of speech frames, optionally, a plurality of key speech features can be determined from the plurality of first speech features, wherein the probability that each key speech feature corresponds to a phoneme in the set of phonemes is greater than or equal to a target probability threshold, then determining a set of speech features corresponding to each key speech feature, each voice feature set comprises a corresponding key voice feature and one or more voice features adjacent to the corresponding key voice feature in the plurality of first voice features; and finally, respectively carrying out feature fusion on the voice features in each voice feature set to obtain a plurality of fused voice features, wherein each voice feature set corresponds to one fused voice feature.
The phoneme feature data corresponding to each fused speech feature are recognized within the phoneme set. Feature fusion can be performed in multiple ways, for example by weighted summation of the speech features in the current speech feature set, where the weight of each speech feature can be set by the user. If different weights are assigned according to the distance between each speech feature in the current set and the current key speech feature, then the closer a feature lies to the key feature, the greater its weight. Further, the speech features in each set may be input into a target self-attention layer that performs the weighted summation to produce the fused speech feature for each set, or feature fusion may be performed through the self-attention layer to extract features at the unit-length level. For the current fused speech feature among the plurality of fused speech features, the probability that it corresponds to each phoneme in the phoneme set can be obtained, and the phoneme corresponding to each fused speech feature can then be determined from these probabilities.
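As a minimal illustration of the distance-weighted fusion described above, the Python sketch below fuses one key speech feature with its neighbouring frame features. The window size, the inverse-distance weighting, and the feature dimensions are assumptions for illustration rather than values fixed by the invention.

    import numpy as np

    def fuse_features(features, key_idx, context=2):
        # features: (T, D) array of per-frame first speech features
        # key_idx:  index of the key speech feature frame
        # context:  neighbouring frames on each side to include in the set
        lo = max(0, key_idx - context)
        hi = min(len(features), key_idx + context + 1)
        window = features[lo:hi]
        # Closer frames receive larger weights, as described above.
        dist = np.abs(np.arange(lo, hi) - key_idx)
        weights = 1.0 / (1.0 + dist)
        weights /= weights.sum()
        return (weights[:, None] * window).sum(axis=0)  # one fused feature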
Further, step S10 includes,
step S101, receiving initial voice data collected by audio equipment;
step S102, pre-filtering the initial voice data to obtain pre-processed voice data.
In this step, it can be understood that, in order to eliminate the influence of noise on voice separation, the voice data need to be denoised. Specifically, initial voice data acquired by the audio device are received, and the initial voice data are then pre-filtered and denoised to obtain the preprocessed voice data.
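The invention does not prescribe a particular pre-filtering method. As one hedged example, a simple high-pass filter can suppress low-frequency noise before separation; the cutoff frequency and filter order below are illustrative assumptions.

    import numpy as np
    from scipy.signal import butter, sosfilt

    def pre_filter(signal, sample_rate=16000, cutoff_hz=80):
        # 4th-order Butterworth high-pass to suppress hum and rumble.
        sos = butter(4, cutoff_hz, btype="highpass", fs=sample_rate, output="sos")
        return sosfilt(sos, signal)

    # Usage: filter one second of synthetic noisy audio.
    noisy = np.random.randn(16000).astype(np.float32)
    preprocessed = pre_filter(noisy)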
Step S30, separating the speech data based on the phoneme feature data.
In this step, after the phoneme feature data are obtained, the speech data are separated according to them. Understandably, a phoneme is the minimum speech unit divided according to the natural attributes of speech. Acoustically, it is the minimum speech unit divided from the perspective of sound quality; physiologically, one pronunciation action forms one phoneme. Sounds produced by the same pronunciation action belong to the same phoneme, while sounds produced by different pronunciation actions are different phonemes. Phonemes are generally divided into two categories, vowels and consonants, and different languages divide pronunciation phonemes differently. Taking Mandarin Chinese as an example, Chinese comprises 22 consonants and 10 vowels, while the English international phonetic alphabet has 48 phonemes, including 20 vowel phonemes and 28 consonant phonemes. Specifically, step S30 includes the following steps.
step S301, inputting the phoneme feature data into a language identification model to obtain a language pre-judgment result corresponding to the voice data;
in the step, the language to which the speech data belongs can be judged through the phoneme characteristics which represent the pronunciation phoneme information in the speech characteristics, and correspondingly, the embodiment of the invention can realize the result of prejudgment on the language to which the speech data belongs by extracting the phoneme characteristics which represent the pronunciation phoneme information in the speech data and inputting the language identification model which is obtained in advance based on multi-language corpus training.
Step S302, according to the language pre-judging result, separating the voice data with different languages to obtain a plurality of voice data sets with the same language.
In this step, after the language pre-judgment result is obtained, the speech data of different languages are separated to obtain a plurality of speech data sets of the same language. For example, if the speech data include both Chinese and English, the Chinese speech data are separated from the English speech data.
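A toy sketch of this classify-and-group step follows. The label set, feature dimension, and network shape are hypothetical; a real language identification model would be trained on a multi-language corpus as described above.

    import torch
    import torch.nn as nn

    LANGS = ["zh", "en"]  # hypothetical label set

    class LanguageIdModel(nn.Module):
        # Small classifier over utterance-level phoneme feature vectors.
        def __init__(self, feat_dim=128, n_langs=len(LANGS)):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, n_langs))

        def forward(self, x):  # x: (batch, feat_dim)
            return self.net(x)

    def split_by_language(model, phoneme_feats):
        # Group utterances into same-language voice data sets.
        with torch.no_grad():
            pred = model(phoneme_feats).argmax(dim=-1)
        return {lang: phoneme_feats[pred == i] for i, lang in enumerate(LANGS)}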
Further, after step S302, the method further includes,
step S303, respectively carrying out target phoneme recognition on the voice data sets with the same language, and acquiring a plurality of voice frames containing the target phonemes;
step S304, obtaining the one-to-one corresponding target phoneme posterior probabilities of a plurality of voice frames, and separating the plurality of voice frames based on the target phoneme posterior probabilities.
In this step, target phoneme recognition is performed separately on the voice data sets of the same language, and a plurality of voice frames containing the target phonemes are obtained; the target phonemes can be set in a user-defined manner according to the current voice data information. Target phoneme posterior probabilities corresponding one-to-one to the voice frames are then obtained. Optionally, speech generally consists of timbre features and text features. When the source voice of a first person needs to be converted into the target voice of a second person, that is, when the first person's voice is converted into the second person's without changing the content, the target phoneme posterior probabilities corresponding to the text features are extracted along with the voice frames and matched with the timbre features of the second person, so that the source voice of the first person is converted into the target voice of the second person.
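As an illustration of selecting the voice frames that contain a target phoneme, the sketch below assumes that an acoustic model has already produced per-frame softmax posteriors over the phoneme set; the 0.5 threshold is an assumed value.

    import torch

    def frames_with_target_phoneme(posteriors, target_idx, threshold=0.5):
        # posteriors: (T, n_phonemes) per-frame softmax outputs
        # Returns the indices of frames containing the target phoneme and
        # the posterior probability of that phoneme in each such frame.
        target_post = posteriors[:, target_idx]
        frame_idx = torch.nonzero(target_post >= threshold).squeeze(-1)
        return frame_idx, target_post[frame_idx]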
Specifically, step S304 includes the following steps.
step S305, based on the encoder, the posterior probability of the target phoneme corresponding to a plurality of voice frames one by one is encoded to obtain the coding layer characteristics corresponding to each voice frame;
after the target phoneme posterior probabilities corresponding to the plurality of voice frames one by one are obtained, the target phoneme posterior probabilities corresponding to the plurality of voice frames one by one are coded based on a coder to obtain coding layer characteristics corresponding to each voice frame, wherein the coder comprises a cascaded convolutional neural network and a bidirectional long-time and short-time memory neural network. The convolutional neural network is a feed-forward neural network which comprises convolutional calculation and has a deep structure, and the convolutional neural network has a characteristic learning capability. Optionally, the convolutional neural network includes a feature mapping layer, where the feature mapping layer is configured to perform feature mapping on the posterior probability of the target phoneme, and map a low-dimensional feature to a high-dimensional feature, where a dimension after mapping may be preset or determined according to a dimension before mapping. The bidirectional long-time and short-time memory neural network is used for determining the relation between the current target phoneme posterior probability, the previous target phoneme posterior probability and the next target phoneme posterior probability in the n sections of sequentially arranged target phoneme posterior probabilities. Alternatively, the long-term memory neural network is a time-recursive neural network that can solve the problem of time series between the preceding and following features. Optionally, the encoder further comprises an average pooling layer, and the average pooling layer is used for pooling the phoneme posterior probabilities.
Further, step S305 includes the following steps.
step S3051, sequentially inputting the target phoneme posterior probabilities corresponding one-to-one to the plurality of speech frames into the convolutional neural network for feature mapping, so as to respectively obtain the mapping features corresponding to each speech frame;
step S3052, inputting the mapping features into the bidirectional long short-term memory (BiLSTM) neural network to obtain the coding layer features corresponding to each speech frame.
In this step, the target phoneme posterior probabilities corresponding one-to-one to the plurality of speech frames are sequentially input into the convolutional neural network in the encoder for feature mapping, so as to respectively obtain the mapping features corresponding to each speech frame; the mapping features are then input into the BiLSTM in the encoder to obtain the coding layer features corresponding to each speech frame.
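A minimal PyTorch sketch of such an encoder, with illustrative dimensions assumed for the phoneme set, the mapping layer, and the hidden state, could look as follows.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        # Cascaded CNN (feature mapping) and bidirectional LSTM.
        def __init__(self, n_phonemes=60, map_dim=128, hidden=128):
            super().__init__()
            # A 1-D convolution maps the low-dimensional posterior sequence
            # to a higher-dimensional mapping feature sequence.
            self.conv = nn.Conv1d(n_phonemes, map_dim, kernel_size=3, padding=1)
            self.blstm = nn.LSTM(map_dim, hidden, batch_first=True,
                                 bidirectional=True)

        def forward(self, post):  # post: (batch, T, n_phonemes)
            x = self.conv(post.transpose(1, 2)).transpose(1, 2)
            enc, _ = self.blstm(x)  # (batch, T, 2*hidden) coding layer features
            return enc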
Step S306, decoding the coding layer characteristics to obtain the frequency spectrum characteristics corresponding to the coding layer characteristics;
step S307, separating a plurality of voice frames according to the frequency spectrum characteristics.
In this step, the method of the invention decodes the coding layer features with a decoder comprising a cascaded autoregressive long short-term memory neural network and a feature mapping network. Specifically, the autoregressive LSTM establishes the time-domain relation between the current phoneme posterior probability and the preceding and following ones, and the feature mapping network maps the coding layer features. Optionally, the decoder further includes a residual connection layer that adjusts the spectral features output by the feature mapping network. The encoder and the decoder in the method of the invention are trained in advance, specifically with sample speech of the second person's voice. Optionally, during training the sample speech of the second voice is input into the encoder and the decoder to obtain spectral features, which are compared with the actual spectral features of the sample speech; the parameters of the encoder and the decoder are adjusted according to the comparison result, thereby training each neural network layer within them.
In the embodiment of the invention, the coding layer features are decoded by the autoregressive long short-term memory neural network and the feature mapping network to obtain the spectral feature vectors corresponding to the coding layer features, and the plurality of speech frames are then separated according to the differences between the spectral feature vectors corresponding one-to-one to the speech frames.
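A hedged sketch of such a decoder follows, with the autoregressive LSTM consuming the previous spectral frame together with the current coding layer feature; the dimensions and the 80-bin spectral target are assumptions for illustration.

    import torch
    import torch.nn as nn

    class Decoder(nn.Module):
        # Autoregressive LSTM followed by a feature-mapping layer that
        # emits one spectral feature vector per speech frame.
        def __init__(self, enc_dim=256, hidden=128, n_mels=80):
            super().__init__()
            self.hidden, self.n_mels = hidden, n_mels
            self.lstm = nn.LSTMCell(enc_dim + n_mels, hidden)
            self.project = nn.Linear(hidden, n_mels)  # feature-mapping network

        def forward(self, enc):  # enc: (batch, T, enc_dim)
            B, T, _ = enc.shape
            h = enc.new_zeros(B, self.hidden)
            c = enc.new_zeros(B, self.hidden)
            prev = enc.new_zeros(B, self.n_mels)  # previous spectral frame
            frames = []
            for t in range(T):
                h, c = self.lstm(torch.cat([enc[:, t], prev], dim=-1), (h, c))
                prev = self.project(h)
                frames.append(prev)
            return torch.stack(frames, dim=1)  # (batch, T, n_mels)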
Further, step S307 includes,
step S3071, inputting the frequency spectrum characteristics to an overlapping judgment model, and outputting a prejudgment result of whether overlapping exists among a plurality of voice frames;
and step S3072, separating a plurality of voice frames according to the pre-judging result.
In this step, after the spectral features corresponding one-to-one to the plurality of speech frames are obtained, they are input into an overlap judgment model. Specifically, single-channel spectral features and multi-channel azimuth features corresponding to the speech frames are obtained and supplied as inputs to the overlap judgment model, which outputs a pre-judgment of whether overlap exists between the speech frames. The overlap judgment model used by the method of the invention is trained in advance on training data and sample data. After the model outputs its pre-judgment, the speech frames are separated accordingly: understandably, if the spectral features corresponding one-to-one to several speech frames do not overlap, the speech frames corresponding to the non-overlapped spectral features were not uttered by the same person and may be separated.
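One possible reading of the overlap judgment model is a pairwise binary classifier over concatenated single-channel spectral and multi-channel azimuth features. The sketch below follows that interpretation with assumed feature sizes; the actual model structure is not specified by the invention.

    import torch
    import torch.nn as nn

    class OverlapJudge(nn.Module):
        # Binary pre-judgment: do two speech frames' spectra overlap?
        def __init__(self, spec_dim=80, azimuth_dim=8):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * (spec_dim + azimuth_dim), 64), nn.ReLU(),
                nn.Linear(64, 1))

        def forward(self, feats_a, feats_b):
            # Each input: (batch, spec_dim + azimuth_dim) for one frame.
            logits = self.net(torch.cat([feats_a, feats_b], dim=-1))
            return torch.sigmoid(logits)  # probability that the frames overlap

    # Frames judged non-overlapping are attributed to different speakers
    # and routed to separate output streams.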
The voice separation method provided by the embodiment of the invention obtains the preprocessed voice data, performs feature extraction on them to obtain the corresponding phoneme feature data, and finally separates the voice data based on the phoneme feature data. Because each person's phoneme features differ, separating voices by their phoneme features improves the accuracy of voice separation.
The present invention also provides a voice separation system, comprising:
the acquisition module is used for acquiring the preprocessed voice data;
the feature extraction module is used for extracting features of the voice data to obtain phoneme feature data corresponding to the voice data;
and the separation module is used for separating the voice data based on the phoneme feature data.
Further, the separation module further comprises:
a language identification unit, configured to input the phoneme feature data into a language identification model, so as to obtain a language pre-judgment result corresponding to the speech data;
and the separation unit is used for separating the voice data with different languages according to the language pre-judgment result so as to obtain a plurality of voice data sets with the same language.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a voice separation program is stored, and the voice separation program, when executed by a processor, implements the steps of the above-mentioned voice separation method in each embodiment.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements is not limited to those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A speech separation method, characterized in that it comprises the steps of:
acquiring preprocessed voice data;
performing feature extraction on the voice data to acquire phoneme feature data corresponding to the voice data;
separating the speech data based on the phoneme feature data.
2. The speech separation method of claim 1, wherein the step of separating the speech data based on the phoneme feature data comprises:
inputting the phoneme feature data into a language identification model to obtain a language pre-judgment result corresponding to the voice data;
and separating the voice data with different languages according to the language pre-judgment result to obtain a plurality of voice data sets with the same language.
3. The method according to claim 2, wherein after the step of separating the speech data with different languages according to the language pre-judgment result to obtain a plurality of speech data sets with the same language, the method further comprises:
respectively carrying out target phoneme recognition on the voice data sets with the same language, and acquiring a plurality of voice frames containing the target phonemes;
and acquiring target phoneme posterior probabilities corresponding to the plurality of speech frames one by one, and separating the plurality of speech frames based on the target phoneme posterior probabilities.
4. The speech separation method of claim 3 wherein the step of separating the plurality of speech frames based on the target phoneme posterior probability comprises:
coding the posterior probability of the target phoneme corresponding to the plurality of voice frames one by one based on a coder to obtain the coding layer characteristics corresponding to each voice frame;
decoding the coding layer characteristics to obtain frequency spectrum characteristics corresponding to the coding layer characteristics;
and separating a plurality of voice frames according to the spectral characteristics.
5. The speech separation method according to claim 4, wherein the encoder comprises a convolutional neural network and a bidirectional long short-term memory neural network, and the step of encoding, by the encoder, the target phoneme posterior probabilities corresponding one-to-one to the plurality of speech frames to obtain the coding layer characteristics corresponding to each speech frame comprises:
sequentially inputting the posterior probabilities of the target phonemes corresponding to the plurality of voice frames one by one to the convolutional neural network for feature mapping so as to respectively obtain the mapping feature corresponding to each voice frame;
and inputting the mapping characteristics into the bidirectional long short-term memory neural network to obtain the coding layer characteristics corresponding to each voice frame.
6. The speech separation method of claim 4 wherein the step of separating the plurality of speech frames based on spectral characteristics comprises:
inputting the spectrum characteristics into an overlapping judgment model, and outputting a prejudgment result of whether overlapping exists among a plurality of voice frames;
and separating a plurality of voice frames according to the pre-judging result.
7. The speech separation method of any one of claims 1 to 6 wherein the step of obtaining pre-processed speech data comprises:
receiving initial voice data collected by audio equipment;
and pre-filtering the initial voice data to obtain pre-processed voice data.
8. A speech separation system, comprising:
the acquisition module is used for acquiring the preprocessed voice data;
the feature extraction module is used for extracting features of the voice data to obtain phoneme feature data corresponding to the voice data;
and the separation module is used for separating the voice data based on the phoneme feature data.
9. The speech separation system of claim 8 wherein the separation module further comprises:
a language identification unit, configured to input the phoneme feature data into a language identification model, so as to obtain a language pre-judgment result corresponding to the speech data;
and the separation unit is used for separating the voice data with different languages according to the language pre-judgment result so as to obtain a plurality of voice data sets with the same language.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a speech separation program which, when executed by a processor, implements the steps of the speech separation method according to any one of claims 1 to 7.
CN202010405182.6A 2020-05-13 2020-05-13 Voice separation method, system and computer readable storage medium Pending CN111627457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010405182.6A CN111627457A (en) 2020-05-13 2020-05-13 Voice separation method, system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010405182.6A CN111627457A (en) 2020-05-13 2020-05-13 Voice separation method, system and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111627457A true CN111627457A (en) 2020-09-04

Family

ID=72271898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010405182.6A Pending CN111627457A (en) 2020-05-13 2020-05-13 Voice separation method, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111627457A (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109817213A (en) * 2019-03-11 2019-05-28 腾讯科技(深圳)有限公司 The method, device and equipment of speech recognition is carried out for adaptive languages
CN110223705A (en) * 2019-06-12 2019-09-10 腾讯科技(深圳)有限公司 Phonetics transfer method, device, equipment and readable storage medium storing program for executing
CN110534092A (en) * 2019-06-28 2019-12-03 腾讯科技(深圳)有限公司 Phoneme of speech sound recognition methods and device, storage medium and electronic device
CN110335592A (en) * 2019-06-28 2019-10-15 腾讯科技(深圳)有限公司 Phoneme of speech sound recognition methods and device, storage medium and electronic device
CN110364142A (en) * 2019-06-28 2019-10-22 腾讯科技(深圳)有限公司 Phoneme of speech sound recognition methods and device, storage medium and electronic device
CN110428809A (en) * 2019-06-28 2019-11-08 腾讯科技(深圳)有限公司 Phoneme of speech sound recognition methods and device, storage medium and electronic device
CN110473518A (en) * 2019-06-28 2019-11-19 腾讯科技(深圳)有限公司 Phoneme of speech sound recognition methods and device, storage medium and electronic device
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium
CN110675855A (en) * 2019-10-09 2020-01-10 出门问问信息科技有限公司 Voice recognition method, electronic equipment and computer readable storage medium
CN110808034A (en) * 2019-10-31 2020-02-18 北京大米科技有限公司 Voice conversion method, device, storage medium and electronic equipment
CN110827849A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Human voice separation method and device for database building, terminal and readable storage medium
CN110956959A (en) * 2019-11-25 2020-04-03 科大讯飞股份有限公司 Speech recognition error correction method, related device and readable storage medium
CN111128197A (en) * 2019-12-25 2020-05-08 北京邮电大学 Multi-speaker voice separation method based on voiceprint features and generation confrontation learning

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899758A (en) * 2020-09-07 2020-11-06 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium
CN111899758B (en) * 2020-09-07 2024-01-30 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium
WO2022057759A1 (en) * 2020-09-21 2022-03-24 华为技术有限公司 Voice conversion method and related device
CN112652324A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Speech enhancement optimization method, speech enhancement optimization system and readable storage medium
CN113012710A (en) * 2021-01-28 2021-06-22 广州朗国电子科技有限公司 Audio noise reduction method and storage medium
CN114464206A (en) * 2022-04-11 2022-05-10 中国人民解放军空军预警学院 Single-channel blind source separation method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200904