WO2019196196A1 - Ear speech (whispered speech) recovery method, apparatus, device, and readable storage medium - Google Patents

Ear speech (whispered speech) recovery method, apparatus, device, and readable storage medium

Info

Publication number
WO2019196196A1
Authority
WO
WIPO (PCT)
Prior art keywords
ear
recognition result
speech
acoustic
model
Prior art date
Application number
PCT/CN2018/091460
Other languages
English (en)
French (fr)
Inventor
潘嘉
刘聪
王海坤
王智国
胡国平
Original Assignee
科大讯飞股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 科大讯飞股份有限公司
Priority to JP2019519686A (JP6903129B2)
Priority to US16/647,284 (US11508366B2)
Publication of WO2019196196A1

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/32Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0635Training updating or merging of old and new templates; Mean values; Weighting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/227Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology

Definitions

  • Speech recognition is a part of artificial intelligence that allows machines to automatically convert speech into corresponding text through machine learning methods, thus giving the machine a human-like hearing function.
  • speech recognition is an important part of human-computer interaction and is widely used in various intelligent terminals. More and more users are accustomed to using voice input.
  • the voice includes normal sound and ear voice, wherein the ear voice refers to the voice generated when the user whispers, and the normal sound is the voice when the user normally speaks.
  • the pronunciation of normal sound and ear voice is different.
  • during normal pronunciation, the human vocal cords vibrate in a regular, periodic way, and the frequency of this vibration is called the fundamental frequency.
  • when whispering, the vocal-cord vibration is weak and somewhat irregular, that is, there is no fundamental frequency, so even if the volume of the ear voice is amplified, it will not sound the same as normal pronunciation.
  • the present application provides an ear voice recovery method, apparatus, device, and readable storage medium to achieve high-accuracy recovery of ear voice data.
  • An ear speech recovery method includes:
  • the ear speech recovery model is trained in advance by taking the recognition results of the ear speech training data and the ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
  • the method further comprises:
  • a final recognition result of the ear speech data is determined.
  • the preliminary identification result corresponding to the ear voice data is obtained, including:
  • the ear speech recognition model is obtained by taking a normal sound recognition model as an initial model and training that initial model with ear speech training acoustic features labeled with recognition results of the ear speech training data.
  • the method further comprises:
  • the obtaining of the preliminary recognition result corresponding to the ear voice data further comprises:
  • inputting the lip image data into a preset lip shape recognition model to obtain an output lip shape recognition result, the lip shape recognition model being pre-trained using lip image training data labeled with lip shape recognition results;
  • the ear speech recognition result and the lip shape recognition result are merged, and the merged recognition result is obtained as a preliminary recognition result corresponding to the ear speech data.
  • the method further comprises:
  • the lip region is extracted from the corresponding frame image, and image regularization processing is performed to obtain regular lip image data as an input of the lip recognition model.
  • the acquiring the ear speech acoustic features corresponding to the ear voice data comprises:
  • the spectral features of each frame of processed ear speech data are extracted separately; the spectral features include one or more of a Mel filter-bank log-energy feature, a Mel-frequency cepstral coefficient feature, and a perceptual linear prediction coefficient feature.
  • the inputting of the ear speech acoustic features and the preliminary recognition result into a preset ear speech recovery model to obtain output normal sound acoustic features comprises:
  • inputting the ear speech acoustic features and the preliminary recognition result into an ear speech recovery model of a recurrent neural network type to obtain the normal sound acoustic features output by the model.
  • the inputting of the ear speech acoustic features and the preliminary recognition result into a preset ear speech recovery model to obtain output normal sound acoustic features comprises:
  • inputting the ear speech acoustic features and the preliminary recognition result into an ear speech recovery model of a codec (encoder-decoder) type based on an attention mechanism; encoding the ear speech acoustic features and the preliminary recognition result separately through the encoding layer of the model to obtain encoded ear speech acoustic features and an encoded preliminary recognition result; performing linear coefficient weighting on the encoded ear speech acoustic features through the attention layer to obtain weighted ear speech acoustic features at the current time;
  • taking, through the decoding layer, the encoded preliminary recognition result, the weighted ear speech acoustic features at the current time, and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, the output of the decoding layer at the current time being the normal sound acoustic features.
  • the determining of the final recognition result of the ear voice data by using the normal sound acoustic features comprises:
  • inputting the normal sound acoustic features into a preset normal sound recognition model to obtain an output normal sound recognition result, and using the normal sound recognition result as the final recognition result of the ear voice data.
  • the determining of the final recognition result of the ear voice data by using the normal sound acoustic features comprises:
  • inputting the normal sound acoustic features into a preset normal sound recognition model to obtain an output normal sound recognition result; determining whether a set iteration termination condition is reached; if so, using the normal sound recognition result as the final recognition result of the ear voice data; if not, using the normal sound recognition result as the preliminary recognition result and returning to the process of inputting the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model.
  • An ear voice recovery device includes:
  • An ear speech acoustic feature acquiring unit configured to acquire an ear speech acoustic feature corresponding to the ear voice data
  • a preliminary identification result obtaining unit configured to acquire a preliminary recognition result corresponding to the ear voice data
  • An ear speech recovery processing unit configured to input the ear speech acoustic features and the preliminary recognition result into a preset ear speech recovery model to obtain output normal sound acoustic features;
  • the ear speech recovery model is trained in advance by taking the recognition results of the ear speech training data and the ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
  • the method further comprises:
  • a final recognition result determining unit configured to determine a final recognition result of the ear voice data by using the normal sound acoustic feature.
  • the preliminary identification result obtaining unit comprises:
  • a first preliminary recognition result acquisition subunit configured to input the ear speech acoustic feature into a preset ear speech recognition model, to obtain an output ear speech recognition result, as a preliminary recognition result corresponding to the ear speech data;
  • the ear speech recognition model is obtained by taking a normal sound recognition model as an initial model and training that initial model with ear speech training acoustic features labeled with recognition results of the ear speech training data.
  • the method further comprises:
  • a lip image data acquiring unit configured to acquire lip image data matched by the ear voice data
  • the preliminary identification result obtaining unit further includes:
  • a second preliminary recognition result acquisition subunit configured to input the lip image data into a preset lip shape recognition model to obtain an output lip shape recognition result, the lip shape recognition model being pre-trained using lip image training data labeled with lip shape recognition results;
  • the third preliminary recognition result acquisition subunit merges the ear speech recognition result and the lip shape recognition result, and obtains the merged recognition result as a preliminary recognition result corresponding to the ear speech data.
  • the method further comprises:
  • a lip detecting unit configured to perform lip detection on each frame of lip image data to obtain a lip region
  • the image processing unit is configured to extract the lip region from the corresponding frame image, and perform image regularization processing to obtain regular lip image data as an input of the lip recognition model.
  • the ear speech acoustic feature acquiring unit comprises:
  • a framing processing unit configured to framing the ear voice data to obtain a plurality of frame ear voice data
  • a pre-emphasis processing unit configured to perform pre-emphasis processing on each frame of ear voice data to obtain processed back-ear voice data
  • a spectral feature extraction unit configured to separately extract spectral features of the processed back ear speech data for each frame;
  • the spectral features include any one or more of a Mel filter-bank log-energy feature, a Mel-frequency cepstral coefficient feature, and a perceptual linear prediction coefficient feature.
  • the ear voice recovery processing unit comprises:
  • a recursive processing unit configured to input the ear speech acoustic feature and the preliminary recognition result into an ear speech recovery model of a recurrent neural network type to obtain a normal acoustic acoustic feature output by the model.
  • the ear speech recovery processing unit comprises: a codec processing unit, the codec processing unit comprising:
  • a first codec processing subunit configured to input the ear speech acoustic feature and the preliminary recognition result into an ear speech recovery model based on a codec type of attention mechanism
  • a second codec processing subunit configured to respectively encode the ear speech acoustic feature and the preliminary recognition result by using an encoding layer of the ear speech recovery model, to obtain an encoded back ear speech acoustic feature and a preliminary recognition result after encoding;
  • a third codec processing subunit configured to perform linear weighting on the encoded back ear speech acoustic feature through the attention layer of the ear speech recovery model, to obtain a weighted back ear speech acoustic feature at the current time;
  • a fourth codec processing subunit configured to take, through the decoding layer of the ear speech recovery model, the encoded preliminary recognition result, the weighted ear speech acoustic features at the current time, and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, the output of the decoding layer at the current time being the normal sound acoustic features.
  • the final recognition result determining unit comprises:
  • a normal sound recognition unit configured to input the normal sound acoustic feature into a preset normal sound recognition model to obtain an output normal sound recognition result
  • the first result determining unit is configured to use the normal sound recognition result as a final recognition result of the ear voice data.
  • the final recognition result determining unit comprises:
  • a normal sound recognition unit configured to input the normal sound acoustic feature into a preset normal sound recognition model to obtain an output normal sound recognition result
  • An iterative determining unit configured to determine whether a set iteration termination condition is reached
  • a second result determining unit configured to use the normal sound recognition result as a final recognition result of the ear voice data when the determination result of the iterative determination unit is YES;
  • a third result determining unit configured to, when the determination result of the iteration determining unit is negative, use the normal sound recognition result as the preliminary recognition result and return to the process of inputting the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model.
  • An ear voice recovery device including a memory and a processor
  • the memory is configured to store a program
  • the processor is configured to execute the program to implement various steps of the ear speech recovery method as disclosed above.
  • a readable storage medium having stored thereon a computer program that, when executed by a processor, implements the various steps of the ear speech recovery method as disclosed above.
  • the ear speech recovery method provided in the embodiments of the present application is implemented based on an ear speech recovery model that is trained in advance by taking the recognition results of the ear speech training data and the ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
  • the present application acquires the ear speech acoustic features corresponding to the ear voice data and the preliminary recognition result corresponding to the ear voice data, and then inputs the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model to obtain output normal sound acoustic features, from which the ear speech can be recovered, so that a user can accurately understand what the other party expresses in an ear voice (whisper) dialogue scene.
  • FIG. 1 is a flowchart of an ear voice recovery method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a method for acquiring acoustic characteristics of an ear voice disclosed in an embodiment of the present application
  • Figure 3 illustrates a schematic structural view of a lip recognition model
  • FIG. 4 illustrates a schematic structural diagram of an ear speech recovery model of a recurrent neural network type
  • FIG. 5 illustrates a schematic structural diagram of an ear speech recovery model based on a codec type of attention mechanism
  • FIG. 6 is a flowchart of another ear voice recovery method according to an embodiment of the present application.
  • FIG. 7 is a flowchart of still another method for recovering an ear voice according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an ear voice recovery apparatus according to an embodiment of the present application.
  • FIG. 9 is a block diagram showing the hardware structure of an ear voice recovery device according to an embodiment of the present application.
  • the ear speech recovery method of the present application is introduced in conjunction with FIG. 1, as shown in FIG. 1, the method includes:
  • Step S100 Acquire an ear speech acoustic feature corresponding to the ear voice data, and obtain a preliminary recognition result corresponding to the ear voice data;
  • the ear speech acoustic feature corresponding to the externally input ear speech data may be directly obtained, or the corresponding ear speech acoustic feature may be determined according to the ear speech data.
  • the preliminary recognition result corresponding to the ear voice data may be input from the outside, or may be determined according to the ear voice data according to the present application.
  • the preliminary recognition result corresponding to the ear voice data may not be accurate and cannot be directly used as the final recognition result.
  • the ear voice data can be collected by the terminal device, and the terminal device can be a mobile phone, a personal computer, a tablet computer, and the like. Ear voice data can be collected through a microphone on the terminal device.
  • Step S110 input the ear speech acoustic feature and the preliminary recognition result into a preset ear speech recovery model to obtain an output normal acoustic acoustic feature.
  • the ear speech recovery model is trained in advance by taking the recognition results of the ear speech training data and the ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to the normal speech data parallel to the ear speech training data as sample labels.
  • in other words, the training samples of the ear speech recovery model may include the ear speech training acoustic features corresponding to the ear speech training data and the recognition results of the ear speech training data; the sample labels include the normal sound acoustic features of the normal speech data parallel to the ear speech training data.
  • normal speech data parallel to the ear speech training data means that the ear speech training data and the normal speech data are spoken by the same speaker, under the same device, environment, speech rate, mood and other conditions, once in a whisper and once in a normal voice.
  • the recognition results of the ear speech training data may be manually labeled, or, similarly to step S100, an externally provided preliminary recognition result corresponding to the ear speech training data may be acquired and used as the recognition result of the ear speech training data.
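A small illustrative sketch (an assumption for illustration, not code from the application) of how the training pairs described above could be assembled: the sample is the ear speech training acoustic features plus the recognition result of the same utterance, and the label is the normal sound acoustic features extracted from the parallel normal-speech recording. The helper names `extract_feats` and `get_recognition` are hypothetical placeholders.

```python
def build_training_pairs(parallel_utterances, extract_feats, get_recognition):
    """parallel_utterances: iterable of (ear_wav, normal_wav) parallel recordings
    of the same speaker, sentence, device, environment, speech rate and mood."""
    samples, labels = [], []
    for ear_wav, normal_wav in parallel_utterances:
        ear_feats = extract_feats(ear_wav)        # ear speech training acoustic features
        recog = get_recognition(ear_wav)          # manual label or preliminary recognition result
        samples.append((ear_feats, recog))        # model input (the sample)
        labels.append(extract_feats(normal_wav))  # sample label: normal sound acoustic features
    return samples, labels
```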
  • using the ear speech acoustic features and the preliminary recognition result, the ear speech recovery model can predict the normal sound acoustic features corresponding to the ear speech data, and the ear speech can be recovered on that basis, which makes it convenient for the user to accurately understand the content expressed by the other party in an ear voice dialogue scene.
  • the process of acquiring the ear speech acoustic features corresponding to the ear voice data in the above step S100 is introduced.
  • the process can include:
  • Step S200 Perform framing on the ear voice data to obtain a plurality of frame ear voice data.
  • Step S210 Perform pre-emphasis processing on each frame of ear voice data to obtain processed back-ear voice data;
  • Step S220 Extract spectral features of the processed back-ear speech data for each frame.
  • the spectral features may include any one or more of: log filter-bank energy features, Mel-frequency cepstral coefficient (MFCC) features, and perceptual linear prediction (PLP) features.
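A minimal sketch of the feature pipeline described above: frame the ear voice waveform, pre-emphasize each frame (the application frames first and then pre-emphasizes per frame), and take a windowed log power spectrum as the basis of the spectral features. The frame length, frame shift, and the 0.97 pre-emphasis coefficient are common defaults assumed here, not values specified by the application.

```python
import numpy as np

def frame_signal(wav, sr, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform into overlapping frames (Step S200)."""
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(wav) - frame_len) // shift)
    return np.stack([wav[i * shift:i * shift + frame_len] for i in range(n_frames)])

def pre_emphasize(frames, alpha=0.97):
    """Per-frame pre-emphasis y[n] = x[n] - alpha * x[n-1] (Step S210)."""
    out = frames.copy()
    out[:, 1:] -= alpha * frames[:, :-1]
    return out

def log_power_spectrum(frames, n_fft=512):
    """Windowed log power spectrum per frame (basis of Step S220).
    Applying a Mel filter bank plus log gives log filter-bank energies; a further
    DCT gives MFCCs. PLP features would use their own analysis in place of this."""
    windowed = frames * np.hamming(frames.shape[1])
    spec = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    return np.log(spec + 1e-10)

# usage (16 kHz mono waveform assumed):
# feats = log_power_spectrum(pre_emphasize(frame_signal(wav, 16000)))
```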
  • the first way is based on the ear speech recognition model.
  • the ear speech recognition model may be pre-trained.
  • the ear speech recognition model takes a normal sound recognition model as the initial model and is obtained by training that initial model with ear speech training acoustic features labeled with recognition results of the ear speech training data.
  • the normal sound recognition model is obtained by training with normal sound training acoustic features labeled with recognition results of the normal sound training data.
  • since collecting ear speech data is relatively costly, the amount of collected ear speech data is usually small, so the ear speech recognition model designed by this application is obtained by adapting a normal speech recognition model, specifically:
  • first, a large amount of normal speech data with manually labeled recognition results and a small amount of ear speech data with manually labeled recognition results are collected, and the normal sound acoustic features of the normal speech data and the ear speech acoustic features of the ear speech data are extracted;
  • the normal sound recognition model is then trained using the normal sound acoustic features and the manually labeled recognition results of the normal speech data;
  • finally, the trained normal sound recognition model is taken as the initial model, and this initial model is trained using the ear speech acoustic features and the manually labeled recognition results of the ear speech data; the ear speech recognition model is obtained after this training.
  • based on the trained ear speech recognition model, the ear speech acoustic features corresponding to the acquired ear speech data can be input into the ear speech recognition model, and the output ear speech recognition result is obtained as the preliminary recognition result corresponding to the ear speech data.
  • it will be understood that, in this embodiment, the ear speech recognition model may also be trained based only on the ear speech data and its corresponding recognition results.
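A hedged PyTorch sketch of the adaptation scheme described above: train an acoustic model on the plentiful normal-speech data, then fine-tune that trained model on the much smaller labelled ear speech set. The network size, learning rates, placeholder tensors, and frame-level phoneme-state targets are illustrative assumptions, not details given by the application.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_states=120):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_states)

    def forward(self, x):            # x: (batch, frames, feat_dim)
        h, _ = self.rnn(x)
        return self.out(h)           # (batch, frames, n_states)

def train(model, feats, targets, lr, epochs=5):
    """feats: (N, T, D) float tensor; targets: (N, T) long tensor of state ids."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(feats)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        opt.step()
    return model

# placeholder data standing in for the labelled corpora
normal_feats, normal_labels = torch.randn(8, 100, 40), torch.randint(0, 120, (8, 100))
ear_feats, ear_labels = torch.randn(2, 100, 40), torch.randint(0, 120, (2, 100))

# 1) normal-speech recognition model as the initial model
normal_model = train(AcousticModel(), normal_feats, normal_labels, lr=1e-3)
# 2) adapt it on the labelled ear speech data (smaller learning rate)
ear_model = train(normal_model, ear_feats, ear_labels, lr=1e-4)
```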
  • the second way is based on the ear speech recognition model and the lip recognition model.
  • the lip shape recognition process is further combined in the embodiment to comprehensively determine the preliminary recognition result corresponding to the ear voice data. specifically:
  • the lip image data matched by the ear voice data can be further obtained in this embodiment.
  • the lip image data is an image of the lip that is captured when the user speaks the ear voice data.
  • the present application pre-trains the lip recognition model, which is obtained by pre-training the lip image training data marked with the lip recognition result.
  • the lip shape recognition result of the model output is obtained by inputting the lip shape image data matching the ear voice data into the lip shape recognition model.
  • the embodiment may further perform a preprocessing operation on the lip image data.
  • the pre-processed lip image data is used as an input to the lip recognition model.
  • the process of preprocessing the lip image may include:
  • the lip image data of each frame is subjected to lip detection to obtain a lip region
  • the lip detection can employ an object detection technique such as a Faster RCNN model.
  • the lip region is extracted from the corresponding frame image, and image regularization processing is performed to obtain regular lip image data as an input of the lip recognition model.
  • the image is normalized to scale the image to a predetermined size, such as 32*32 pixels, or other specifications.
  • the regular processing method can adopt various existing image scaling techniques, such as linear interpolation.
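A small sketch of the lip-image preprocessing described above. The lip bounding box is assumed to come from an external detector (the text mentions a Faster RCNN model); here it is simply passed in. Resizing to 32*32 with linear interpolation follows the example size and scaling technique named in the text.

```python
import cv2
import numpy as np

def normalize_lip_region(frame_bgr, box, size=(32, 32)):
    """Crop the detected lip region from one video frame and rescale it."""
    x1, y1, x2, y2 = box                      # detector output, pixel coordinates
    lip = frame_bgr[y1:y2, x1:x2]
    lip = cv2.resize(lip, size, interpolation=cv2.INTER_LINEAR)
    return lip.astype(np.float32) / 255.0     # normalized input for the lip model

# usage: lip_seq = np.stack([normalize_lip_region(f, detect_lips(f)) for f in frames])
# where detect_lips() is the assumed external lip detector.
```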
  • the preprocessed lip image sequence is used as input to the model.
  • the convolutional neural network CNN is used to obtain the feature expression of each lip image.
  • the structure of the convolutional neural network is not limited, and may be a VGG structure or a residual structure which is often used in existing image recognition.
  • the recurrent neural network RNN forms a feature expression of the lip image sequence, and then passes through the feedforward neural network FFNN to connect the output layer, and the output layer is a phoneme sequence or a phoneme state sequence corresponding to the input lip image sequence.
  • the phoneme sequence outputted by the output layer of the example in Fig. 3 is "zh, ong, g, uo".
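A hedged PyTorch sketch of the lip shape recognition model outlined above (FIG. 3): a per-frame CNN feature extractor, an RNN over the image sequence, and a feed-forward layer connected to an output layer over phonemes or phoneme states. Channel sizes, the GRU choice, and the phoneme inventory are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LipRecognitionModel(nn.Module):
    def __init__(self, n_phonemes=60, rnn_hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame feature expression
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 16
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
            nn.Flatten(),                              # 32 * 8 * 8 = 2048
        )
        self.rnn = nn.GRU(2048, rnn_hidden, batch_first=True)  # sequence expression
        self.ffnn = nn.Sequential(nn.Linear(rnn_hidden, 128), nn.ReLU())
        self.out = nn.Linear(128, n_phonemes)          # phoneme / phoneme-state output layer

    def forward(self, lip_seq):                        # (batch, T, 1, 32, 32)
        b, t = lip_seq.shape[:2]
        frame_feats = self.cnn(lip_seq.reshape(b * t, 1, 32, 32)).reshape(b, t, -1)
        seq_feats, _ = self.rnn(frame_feats)
        return self.out(self.ffnn(seq_feats))          # per-frame phoneme scores

# usage: scores = LipRecognitionModel()(torch.randn(2, 75, 1, 32, 32))
```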
  • the lip recognition result is merged with the ear speech recognition result output by the ear speech recognition model, and the merged recognition result is obtained as the preliminary recognition result corresponding to the ear speech data.
  • the process of merging the lip recognition result with the ear speech recognition result output by the ear speech recognition model may adopt an existing model fusion method, such as ROVER (Recognizer output voting error reduction; recognition result voting error reduction method), or other fusion method.
  • by combining the lip shape recognition result with the ear speech recognition result, the ear speech recognition accuracy is improved, and the determined preliminary recognition result corresponding to the ear speech data is more accurate.
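The text names ROVER for fusing the two recognition results. A full ROVER implementation first aligns the hypotheses into a word transition network; the sketch below is only a simplified stand-in that assumes the two phoneme sequences are already time-aligned and votes per position using confidences.

```python
def fuse_aligned_hypotheses(ear_hyp, lip_hyp):
    """ear_hyp / lip_hyp: lists of (phoneme, confidence) pairs of the same length."""
    fused = []
    for (p_ear, c_ear), (p_lip, c_lip) in zip(ear_hyp, lip_hyp):
        fused.append(p_ear if c_ear >= c_lip else p_lip)   # keep the more confident unit
    return fused

# usage:
# ear_hyp = [("zh", 0.8), ("ong", 0.6), ("g", 0.4), ("uo", 0.7)]
# lip_hyp = [("zh", 0.5), ("ong", 0.7), ("g", 0.9), ("uo", 0.6)]
# fuse_aligned_hypotheses(ear_hyp, lip_hyp)  ->  ['zh', 'ong', 'g', 'uo']
```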
  • step S110 is introduced, and the ear speech acoustic feature and the preliminary recognition result are input into a preset ear speech recovery model to obtain an implementation process of the output normal acoustic acoustic feature.
  • Two ear speech recovery models are provided in this embodiment, as follows:
  • the ear speech recovery model is a recurrent neural network type. As shown in FIG. 4, a schematic diagram of a structure of an ear speech recovery model of a recurrent neural network type is illustrated.
  • the input layer includes two types of data, which are the acoustic characteristics of the ear speech of each frame and the preliminary recognition results of each frame.
  • the preliminary recognition result is described by taking the phoneme sequence "zh, ong, g, uo" as an example.
  • the output layer is the normal acoustical characteristics of each frame.
  • the embodiment can input the ear speech acoustic feature and the preliminary recognition result into an ear speech recovery model of a recurrent neural network type to obtain a normal acoustic acoustic feature output by the model.
  • the preliminary recognition result of the input model may be a preliminary recognition result after vectorization.
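A hedged sketch of the recurrent-network-type recovery model of FIG. 4: each frame's ear speech acoustic feature is concatenated with an embedding of the (vectorized) preliminary recognition result for that frame, passed through a recurrent network, and mapped to the normal sound acoustic feature of that frame. Dimensions, the bidirectional LSTM, and the embedding layer are assumptions.

```python
import torch
import torch.nn as nn

class RNNRecoveryModel(nn.Module):
    def __init__(self, feat_dim=40, n_phonemes=60, phone_dim=32, hidden=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, phone_dim)    # vectorized recognition result
        self.rnn = nn.LSTM(feat_dim + phone_dim, hidden,
                           num_layers=2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, feat_dim)               # normal sound acoustic features

    def forward(self, ear_feats, frame_phonemes):
        # ear_feats: (batch, T, feat_dim); frame_phonemes: (batch, T) phoneme ids per frame
        x = torch.cat([ear_feats, self.phone_emb(frame_phonemes)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)

# training target: normal sound acoustic features from the parallel recordings, e.g.
# loss = nn.functional.mse_loss(model(ear_feats, phones), normal_feats)
```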
  • the ear speech recovery model is a codec type based on the attention mechanism. As shown in FIG. 5, a schematic diagram of a structure of an ear speech recovery model based on a codec type of attention mechanism is illustrated.
  • the input layer includes two types of data: the ear speech acoustic features x_1 to x_s of each frame, and the preliminary recognition result of each frame.
  • in FIG. 5, the preliminary recognition result is illustrated with the phoneme sequence "zh, ong, g, uo" as an example.
  • the encoding layer encodes the ear speech acoustic features of each frame to obtain encoded ear speech acoustic features h_i^x, where i ∈ [1, s].
  • the attention layer uses the encoded ear speech acoustic features h_i^x and the hidden-layer variable s_t of the decoding layer at the current time t to jointly determine the coefficient vector a_t of the per-frame ear speech acoustic features at time t; multiplying the coefficient vector a_t with the vector formed by the encoded per-frame ear speech acoustic features gives the weighted ear speech acoustic feature c_t at the current time.
  • the encoded preliminary recognition result, the weighted ear speech acoustic feature c_t at the current time, and the output y_{t-1} of the decoding layer at the previous time t-1 are taken as the input of the decoding layer at the current time t, and the output y_t of the decoding layer at the current time t is the normal sound acoustic feature.
  • based on the above ear speech recovery model, this embodiment can determine the normal sound acoustic features using the model through the following steps:
  • 1) the ear speech acoustic features and the preliminary recognition result (which may be the vectorized preliminary recognition result) are input into the ear speech recovery model of the codec type based on the attention mechanism;
  • 2) the encoding layer of the ear speech recovery model encodes the ear speech acoustic features and the preliminary recognition result separately, giving the encoded ear speech acoustic features and the encoded preliminary recognition result;
  • 3) the attention layer of the ear speech recovery model performs linear coefficient weighting on the encoded ear speech acoustic features, giving the weighted ear speech acoustic features at the current time;
  • 4) the decoding layer of the ear speech recovery model takes the encoded preliminary recognition result, the weighted ear speech acoustic features at the current time, and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, and the output of the decoding layer at the current time is the normal sound acoustic feature.
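A hedged PyTorch sketch of the attention-based encoder-decoder recovery model of FIG. 5. The encoding layer encodes the per-frame ear speech features and the (vectorized) preliminary recognition result; the attention layer weights the encoded frames with coefficients a_t computed from the current decoder state; the decoding layer consumes the encoded result, the weighted feature c_t and the previous output y_{t-1}. All layer dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionRecoveryModel(nn.Module):
    def __init__(self, feat_dim=40, n_phonemes=60, enc=128, dec=256):
        super().__init__()
        self.feat_enc = nn.LSTM(feat_dim, enc, batch_first=True, bidirectional=True)
        self.phone_emb = nn.Embedding(n_phonemes, 32)
        self.phone_enc = nn.LSTM(32, enc, batch_first=True)
        self.attn = nn.Linear(2 * enc + dec, 1)            # scores for coefficients a_t
        self.dec_cell = nn.LSTMCell(2 * enc + enc + feat_dim, dec)
        self.out = nn.Linear(dec, feat_dim)                # y_t: normal sound acoustic feature

    def forward(self, ear_feats, phonemes, n_steps):
        b = ear_feats.size(0)
        h_x, _ = self.feat_enc(ear_feats)                  # encoded frames h_i^x: (b, S, 2*enc)
        _, (h_p, _) = self.phone_enc(self.phone_emb(phonemes))
        enc_result = h_p[-1]                               # encoded preliminary result: (b, enc)
        s = torch.zeros(b, self.dec_cell.hidden_size)      # decoder hidden state s_t
        c_state = torch.zeros_like(s)
        y = torch.zeros(b, self.out.out_features)          # y_0
        outputs = []
        for _ in range(n_steps):
            # attention coefficients a_t over the S encoded frames
            score = self.attn(torch.cat(
                [h_x, s.unsqueeze(1).expand(-1, h_x.size(1), -1)], dim=-1))
            a_t = torch.softmax(score, dim=1)              # (b, S, 1)
            c_t = (a_t * h_x).sum(dim=1)                   # weighted ear speech feature c_t
            s, c_state = self.dec_cell(
                torch.cat([c_t, enc_result, y], dim=-1), (s, c_state))
            y = self.out(s)
            outputs.append(y)
        return torch.stack(outputs, dim=1)                 # (b, n_steps, feat_dim)

# usage: AttentionRecoveryModel()(torch.randn(2, 200, 40),
#                                 torch.randint(0, 60, (2, 4)), n_steps=200)
```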
  • another ear speech recovery method is introduced. As shown in FIG. 6, the method may include:
  • Step S300 acquiring an ear speech acoustic feature corresponding to the ear voice data, and a preliminary recognition result corresponding to the ear voice data;
  • Step S310 inputting the ear speech acoustic feature and the preliminary recognition result into a preset ear speech recovery model to obtain an output normal acoustic acoustic feature;
  • the ear speech recovery model is trained in advance by taking the recognition results labeled for the ear speech training data and the ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
  • steps S300-S310 correspond one-to-one to steps S100-S110 in the foregoing embodiment; refer to the foregoing description for details, which are not repeated here.
  • Step S320 Determine, by using the normal acoustic acoustic feature, a final recognition result of the ear voice data.
  • in this embodiment, after the normal sound acoustic features are obtained, they are further used to determine the final recognition result of the ear voice data; the final recognition result may be in text form.
  • in addition, the present application may also use the normal sound acoustic features to synthesize normal-voice speech for output, or adopt other optional manners, selected according to the needs of the application.
  • compared with the foregoing embodiment, this embodiment adds the process of determining the final recognition result of the ear voice data by using the normal sound acoustic features, and the final recognition result can be used for storage, recording, and other purposes.
  • optionally, after the final recognition result is determined in step S320 using the normal sound acoustic features, it may be fused with the lip shape recognition result output by the lip shape recognition model introduced in the foregoing embodiment, and the fusion result may be taken as the updated final recognition result, further improving the accuracy of the final recognition result.
  • the following describes two optional implementations of step S320, in which the final recognition result of the ear voice data is determined using the normal sound acoustic features.
  • in the first implementation, the normal sound acoustic features are input into a preset normal sound recognition model to obtain an output normal sound recognition result, and this normal sound recognition result is directly used as the final recognition result of the ear voice data; for the normal sound recognition model, refer to the foregoing description, which is not repeated here.
  • in the second implementation, for ease of understanding, step S320 is described below in conjunction with a complete ear voice recovery flow.
  • FIG. 7 is a flowchart of still another method for recovering an ear voice disclosed in the embodiment of the present application. As shown in FIG. 7, the method includes:
  • Step S400 Acquire an ear speech acoustic feature corresponding to the ear voice data, and a preliminary recognition result corresponding to the ear voice data;
  • Step S410 input the ear speech acoustic feature and the preliminary recognition result into a preset ear speech recovery model to obtain an output normal acoustic acoustic feature;
  • steps S400-S410 in this embodiment correspond one-to-one to steps S100-S110 in the foregoing embodiment; refer to the foregoing description for details, which are not repeated here.
  • Step S420 input the normal sound acoustic feature into a preset normal sound recognition model to obtain an output normal sound recognition result;
  • Step S430 it is determined whether the set iteration termination condition is reached; if yes, step S440 is performed, and if not, step S450 is performed;
  • Step S440 using the normal sound recognition result as a final recognition result of the ear voice data
  • Step S450 using the normal sound recognition result as the preliminary recognition result, and returning to step S410.
  • compared with the first implementation, this implementation adds an iterative process through the ear speech recovery model: the normal sound recognition result output by the normal sound recognition model is fed back as the preliminary recognition result and input into the ear speech recovery model again, and this is repeated until it is determined that the set iteration termination condition is reached.
  • there may be multiple iteration termination conditions, for example the number of iterations of the ear speech recovery model reaching a count threshold, the iteration time reaching a time threshold, or the confidence of the normal sound recognition result converging to a set convergence condition.
  • the count threshold and the time threshold may be determined according to the actual task's requirements on system response time and computing resources; in general, more iterations give a more accurate final recognition result but consume more time and computing resources.
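A compact sketch of the iterative variant (Steps S400-S450): the normal sound recognition result is fed back as the preliminary recognition result until a termination condition is met. `recover()` and `recognize()` stand in for the ear speech recovery model and the normal sound recognition model; the confidence threshold and iteration cap are assumed example values.

```python
def iterative_recovery(ear_feats, preliminary, recover, recognize,
                       max_iters=3, conf_threshold=0.9):
    result, confidence = preliminary, 0.0
    for _ in range(max_iters):                        # iteration-count termination condition
        normal_feats = recover(ear_feats, result)     # Step S410: ear speech recovery model
        result, confidence = recognize(normal_feats)  # Step S420: normal sound recognition
        if confidence >= conf_threshold:              # Step S430: convergence condition
            break
    return result                                     # Step S440: final recognition result
```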
  • the ear speech recovery device provided by the embodiment of the present application is described below, and the ear speech recovery device described below and the above-described ear speech recovery method can refer to each other.
  • FIG. 8 is a schematic structural diagram of an ear voice recovery apparatus according to an embodiment of the present application. As shown in FIG. 8, the apparatus may include:
  • the ear speech acoustic feature acquiring unit 11 is configured to acquire an ear speech acoustic feature corresponding to the ear voice data
  • the preliminary recognition result obtaining unit 12 is configured to obtain a preliminary recognition result corresponding to the ear voice data
  • the ear speech recovery processing unit 13 is configured to input the ear speech acoustic feature and the preliminary recognition result into a preset ear speech recovery model to obtain an output normal acoustic acoustic feature;
  • the ear speech recovery model is trained in advance by taking the recognition results of the ear speech training data and the ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
  • the foregoing preliminary identification result obtaining unit may include:
  • a first preliminary recognition result acquisition subunit configured to input the ear speech acoustic feature into a preset ear speech recognition model, to obtain an output ear speech recognition result, as a preliminary recognition result corresponding to the ear speech data;
  • the ear speech recognition model is obtained by taking a normal sound recognition model as an initial model and training that initial model with ear speech training acoustic features labeled with recognition results of the ear speech training data.
  • the apparatus of the present application may further include:
  • a lip image data acquiring unit configured to acquire lip image data matched by the ear voice data
  • the preliminary identification result obtaining unit may further include:
  • a second preliminary recognition result acquisition subunit configured to input the lip image data into a preset lip shape recognition model to obtain an output lip shape recognition result, the lip shape recognition model being pre-trained using lip image training data labeled with lip shape recognition results;
  • the third preliminary recognition result acquisition subunit merges the ear speech recognition result and the lip shape recognition result, and obtains the merged recognition result as a preliminary recognition result corresponding to the ear speech data.
  • the apparatus of the present application may further include:
  • a lip detecting unit configured to perform lip detection on each frame of lip image data to obtain a lip region
  • the image processing unit is configured to extract the lip region from the corresponding frame image, and perform image regularization processing to obtain regular lip image data as an input of the lip recognition model.
  • the ear voice acoustic feature acquiring unit may include:
  • a framing processing unit configured to framing the ear voice data to obtain a plurality of frame ear voice data
  • a pre-emphasis processing unit configured to perform pre-emphasis processing on each frame of ear voice data to obtain processed back-ear voice data
  • a spectral feature extraction unit configured to separately extract spectral features of the processed back ear speech data for each frame;
  • the spectral features include any one or more of a Mel filter-bank log-energy feature, a Mel-frequency cepstral coefficient feature, and a perceptual linear prediction coefficient feature.
  • this embodiment discloses two optional structures of the ear voice recovery processing unit.
  • the ear voice recovery processing unit may include:
  • a recursive processing unit configured to input the ear speech acoustic feature and the preliminary recognition result into an ear speech recovery model of a recurrent neural network type to obtain a normal acoustic acoustic feature output by the model.
  • the ear speech recovery processing unit may include: a codec processing unit, where the codec processing unit includes:
  • a first codec processing subunit configured to input the ear speech acoustic feature and the preliminary recognition result into an ear speech recovery model based on a codec type of attention mechanism
  • a second codec processing subunit configured to respectively encode the ear speech acoustic feature and the preliminary recognition result by using an encoding layer of the ear speech recovery model, to obtain an encoded back ear speech acoustic feature and a preliminary recognition result after encoding;
  • a third codec processing subunit configured to perform linear weighting on the encoded back ear speech acoustic feature through the attention layer of the ear speech recovery model to obtain a weighted back ear speech acoustic feature at the current time;
  • a fourth codec processing subunit configured to take, through the decoding layer of the ear speech recovery model, the encoded preliminary recognition result, the weighted ear speech acoustic features at the current time, and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, the output of the decoding layer at the current time being the normal sound acoustic features.
  • the apparatus of the present application may further include:
  • a final recognition result determining unit configured to determine a final recognition result of the ear voice data by using the normal sound acoustic feature.
  • the embodiment discloses two optional structures of the final recognition result determining unit,
  • the final recognition result determining unit may include:
  • a normal sound recognition unit configured to input the normal sound acoustic feature into a preset normal sound recognition model to obtain an output normal sound recognition result
  • the first result determining unit is configured to use the normal sound recognition result as a final recognition result of the ear voice data.
  • the final recognition result determining unit may include:
  • a normal sound recognition unit configured to input the normal sound acoustic feature into a preset normal sound recognition model to obtain an output normal sound recognition result
  • An iterative determining unit configured to determine whether a set iteration termination condition is reached
  • a second result determining unit configured to use the normal sound recognition result as a final recognition result of the ear voice data when the determination result of the iterative determination unit is YES;
  • a third result determining unit configured to, when the determination result of the iteration determining unit is negative, use the normal sound recognition result as the preliminary recognition result and return to the process of inputting the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model.
  • FIG. 9 is a block diagram showing the hardware structure of the ear voice recovery device.
  • the hardware structure of the ear voice recovery device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least a communication bus 4;
  • the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 complete communication with each other through the communication bus 4.
  • the processor 1 may be a central processing unit CPU, or an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention
  • the memory 3 may include a high speed RAM memory, and may also include a non-volatile memory or the like, such as at least one disk memory;
  • the memory stores a program
  • the processor can call the program stored in the memory, and the program is used to:
  • acquire the ear speech acoustic features corresponding to the ear voice data and the preliminary recognition result corresponding to the ear voice data; input the ear speech acoustic features and the preliminary recognition result into a preset ear speech recovery model to obtain output normal sound acoustic features; the ear speech recovery model is trained in advance by taking the recognition results of the ear speech training data and the ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
  • for the refined functions and extended functions of the program, refer to the foregoing description.
  • the embodiment of the present application further provides a storage medium, where the storage medium can store a program suitable for execution by a processor, the program is used to:
  • acquire the ear speech acoustic features corresponding to the ear voice data and the preliminary recognition result corresponding to the ear voice data; input the ear speech acoustic features and the preliminary recognition result into a preset ear speech recovery model to obtain output normal sound acoustic features; the ear speech recovery model is trained in advance by taking the recognition results of the ear speech training data and the ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
  • for the refined functions and extended functions of the program, refer to the foregoing description.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephone Function (AREA)

Abstract

The present application discloses an ear speech (whispered speech) recovery method, apparatus, device, and readable storage medium, implemented on the basis of an ear speech recovery model. The ear speech recovery model is trained in advance by taking the recognition results of ear speech training data and ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels. The present application acquires the ear speech acoustic features corresponding to ear voice data and the preliminary recognition result corresponding to the ear voice data, and then inputs the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model to obtain output normal sound acoustic features, from which the ear speech can be recovered.

Description

Ear speech recovery method, apparatus, device, and readable storage medium
Technical Field
This application claims priority to the Chinese application filed with the China Patent Office on April 12, 2018, with application number 201810325696.3 and the title "Ear speech recovery method, apparatus, device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Background
Speech recognition enables machines, through machine learning methods, to automatically convert speech into the corresponding text, thereby giving machines a hearing capability similar to that of humans; it is an important part of artificial intelligence. With continuous breakthroughs in artificial intelligence technology and the increasing popularity of all kinds of intelligent terminal devices, speech recognition, as an important link in human-computer interaction, is widely used on various intelligent terminals, and more and more users are accustomed to voice input.
Speech includes normal speech and ear speech (whispered speech). Ear speech refers to the speech produced when a user whispers, while normal speech is the speech produced when a user speaks normally. The pronunciation of normal speech and ear speech differs: during normal pronunciation the human vocal cords vibrate in a regular, periodic way, and the frequency of this vibration is called the fundamental frequency; when whispering, the vocal-cord vibration is weak and somewhat random and irregular, that is, there is no fundamental frequency, so even if the volume of the ear speech is amplified it will not sound the same as normal pronunciation.
However, in meetings or situations involving privacy, normal voice input can be inconvenient, and many users choose to whisper; the resulting problem is that the machine cannot accurately recognize what the user says. There are also many aphonia patients whose pronunciation is close to whispered speech. For these reasons, the prior art urgently needs a solution capable of recovering ear speech into normal speech.
Summary
In view of this, the present application provides an ear speech recovery method, apparatus, device, and readable storage medium, so as to recover ear voice data with high accuracy.
To achieve the above objective, the proposed solutions are as follows:
An ear speech recovery method, comprising:
acquiring ear speech acoustic features corresponding to ear voice data, and a preliminary recognition result corresponding to the ear voice data;
inputting the ear speech acoustic features and the preliminary recognition result into a preset ear speech recovery model to obtain output normal sound acoustic features;
wherein the ear speech recovery model is trained in advance by taking recognition results of ear speech training data and ear speech training acoustic features as samples, and taking normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
Preferably, the method further comprises:
determining a final recognition result of the ear voice data by using the normal sound acoustic features.
Preferably, acquiring the preliminary recognition result corresponding to the ear voice data comprises:
inputting the ear speech acoustic features into a preset ear speech recognition model to obtain an output ear speech recognition result as the preliminary recognition result corresponding to the ear voice data;
wherein the ear speech recognition model is obtained by taking a normal sound recognition model as an initial model and training the initial model with ear speech training acoustic features labeled with recognition results of the ear speech training data.
Preferably, the method further comprises:
acquiring lip image data matching the ear voice data;
and acquiring the preliminary recognition result corresponding to the ear voice data further comprises:
inputting the lip image data into a preset lip shape recognition model to obtain an output lip shape recognition result, wherein the lip shape recognition model is pre-trained with lip image training data labeled with lip shape recognition results;
merging the ear speech recognition result and the lip shape recognition result, and taking the merged recognition result as the preliminary recognition result corresponding to the ear voice data.
Preferably, the method further comprises:
performing lip detection on each frame of lip image data to obtain a lip region;
extracting the lip region from the corresponding frame image and performing image normalization to obtain normalized lip image data as the input of the lip shape recognition model.
Preferably, acquiring the ear speech acoustic features corresponding to the ear voice data comprises:
framing the ear voice data to obtain several frames of ear voice data;
performing pre-emphasis on each frame of ear voice data to obtain processed ear voice data;
extracting spectral features of each frame of processed ear voice data separately, wherein the spectral features include any one or more of a Mel filter-bank log-energy feature, a Mel-frequency cepstral coefficient feature, and a perceptual linear prediction coefficient feature.
Preferably, inputting the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model to obtain the output normal sound acoustic features comprises:
inputting the ear speech acoustic features and the preliminary recognition result into an ear speech recovery model of a recurrent neural network type to obtain normal sound acoustic features output by the model.
Preferably, inputting the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model to obtain the output normal sound acoustic features comprises:
inputting the ear speech acoustic features and the preliminary recognition result into an ear speech recovery model of a codec (encoder-decoder) type based on an attention mechanism;
encoding the ear speech acoustic features and the preliminary recognition result separately through an encoding layer of the ear speech recovery model to obtain encoded ear speech acoustic features and an encoded preliminary recognition result;
performing linear coefficient weighting on the encoded ear speech acoustic features through an attention layer of the ear speech recovery model to obtain weighted ear speech acoustic features at the current time;
through a decoding layer of the ear speech recovery model, taking the encoded preliminary recognition result, the weighted ear speech acoustic features at the current time, and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, with the output of the decoding layer at the current time being the normal sound acoustic features.
Preferably, determining the final recognition result of the ear voice data by using the normal sound acoustic features comprises:
inputting the normal sound acoustic features into a preset normal sound recognition model to obtain an output normal sound recognition result;
taking the normal sound recognition result as the final recognition result of the ear voice data.
Preferably, determining the final recognition result of the ear voice data by using the normal sound acoustic features comprises:
inputting the normal sound acoustic features into a preset normal sound recognition model to obtain an output normal sound recognition result;
determining whether a set iteration termination condition is reached;
if so, taking the normal sound recognition result as the final recognition result of the ear voice data;
if not, taking the normal sound recognition result as the preliminary recognition result, and returning to the process of inputting the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model.
An ear speech recovery apparatus, comprising:
an ear speech acoustic feature acquiring unit configured to acquire ear speech acoustic features corresponding to ear voice data;
a preliminary recognition result acquiring unit configured to acquire a preliminary recognition result corresponding to the ear voice data;
an ear speech recovery processing unit configured to input the ear speech acoustic features and the preliminary recognition result into a preset ear speech recovery model to obtain output normal sound acoustic features;
wherein the ear speech recovery model is trained in advance by taking recognition results of ear speech training data and ear speech training acoustic features as samples, and taking normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
Preferably, the apparatus further comprises:
a final recognition result determining unit configured to determine a final recognition result of the ear voice data by using the normal sound acoustic features.
Preferably, the preliminary recognition result acquiring unit comprises:
a first preliminary recognition result acquiring subunit configured to input the ear speech acoustic features into a preset ear speech recognition model to obtain an output ear speech recognition result as the preliminary recognition result corresponding to the ear voice data;
wherein the ear speech recognition model is obtained by taking a normal sound recognition model as an initial model and training the initial model with ear speech training acoustic features labeled with recognition results of the ear speech training data.
Preferably, the apparatus further comprises:
a lip image data acquiring unit configured to acquire lip image data matching the ear voice data;
and the preliminary recognition result acquiring unit further comprises:
a second preliminary recognition result acquiring subunit configured to input the lip image data into a preset lip shape recognition model to obtain an output lip shape recognition result, wherein the lip shape recognition model is pre-trained with lip image training data labeled with lip shape recognition results;
a third preliminary recognition result acquiring subunit configured to merge the ear speech recognition result and the lip shape recognition result, and take the merged recognition result as the preliminary recognition result corresponding to the ear voice data.
Preferably, the apparatus further comprises:
a lip detection unit configured to perform lip detection on each frame of lip image data to obtain a lip region;
an image processing unit configured to extract the lip region from the corresponding frame image and perform image normalization to obtain normalized lip image data as the input of the lip shape recognition model.
Preferably, the ear speech acoustic feature acquiring unit comprises:
a framing processing unit configured to frame the ear voice data to obtain several frames of ear voice data;
a pre-emphasis processing unit configured to perform pre-emphasis on each frame of ear voice data to obtain processed ear voice data;
a spectral feature extraction unit configured to extract spectral features of each frame of processed ear voice data separately, wherein the spectral features include any one or more of a Mel filter-bank log-energy feature, a Mel-frequency cepstral coefficient feature, and a perceptual linear prediction coefficient feature.
Preferably, the ear speech recovery processing unit comprises:
a recursive processing unit configured to input the ear speech acoustic features and the preliminary recognition result into an ear speech recovery model of a recurrent neural network type to obtain normal sound acoustic features output by the model.
Preferably, the ear speech recovery processing unit comprises a codec processing unit, and the codec processing unit comprises:
a first codec processing subunit configured to input the ear speech acoustic features and the preliminary recognition result into an ear speech recovery model of a codec type based on an attention mechanism;
a second codec processing subunit configured to encode the ear speech acoustic features and the preliminary recognition result separately through an encoding layer of the ear speech recovery model to obtain encoded ear speech acoustic features and an encoded preliminary recognition result;
a third codec processing subunit configured to perform linear coefficient weighting on the encoded ear speech acoustic features through an attention layer of the ear speech recovery model to obtain weighted ear speech acoustic features at the current time;
a fourth codec processing subunit configured to, through a decoding layer of the ear speech recovery model, take the encoded preliminary recognition result, the weighted ear speech acoustic features at the current time, and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, with the output of the decoding layer at the current time being the normal sound acoustic features.
Preferably, the final recognition result determining unit comprises:
a normal sound recognition unit configured to input the normal sound acoustic features into a preset normal sound recognition model to obtain an output normal sound recognition result;
a first result determining unit configured to take the normal sound recognition result as the final recognition result of the ear voice data.
Preferably, the final recognition result determining unit comprises:
a normal sound recognition unit configured to input the normal sound acoustic features into a preset normal sound recognition model to obtain an output normal sound recognition result;
an iteration determining unit configured to determine whether a set iteration termination condition is reached;
a second result determining unit configured to take the normal sound recognition result as the final recognition result of the ear voice data when the determination result of the iteration determining unit is yes;
a third result determining unit configured to, when the determination result of the iteration determining unit is no, take the normal sound recognition result as the preliminary recognition result and return to the process of inputting the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model.
An ear speech recovery device, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the ear speech recovery method disclosed above.
A readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the ear speech recovery method disclosed above.
It can be seen from the above technical solutions that the ear speech recovery method provided by the embodiments of the present application is implemented on the basis of an ear speech recovery model, which is trained in advance by taking the recognition results of ear speech training data and ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels. The present application acquires the ear speech acoustic features corresponding to ear voice data and the preliminary recognition result corresponding to the ear voice data, and then inputs them into the preset ear speech recovery model to obtain output normal sound acoustic features, from which the ear speech can be recovered, so that a user can accurately understand what the other party expresses in an ear voice dialogue scene.
Brief Description of the Drawings
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a flowchart of an ear speech recovery method disclosed in an embodiment of the present application;
FIG. 2 is a flowchart of a method for acquiring ear speech acoustic features disclosed in an embodiment of the present application;
FIG. 3 illustrates a schematic structural diagram of a lip shape recognition model;
FIG. 4 illustrates a schematic structural diagram of an ear speech recovery model of a recurrent neural network type;
FIG. 5 illustrates a schematic structural diagram of an ear speech recovery model of a codec type based on an attention mechanism;
FIG. 6 is a flowchart of another ear speech recovery method disclosed in an embodiment of the present application;
FIG. 7 is a flowchart of still another ear speech recovery method disclosed in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an ear speech recovery apparatus disclosed in an embodiment of the present application;
FIG. 9 is a block diagram of the hardware structure of an ear speech recovery device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Next, the ear speech recovery method of the present application is introduced with reference to FIG. 1. As shown in FIG. 1, the method includes:
Step S100: acquire ear speech acoustic features corresponding to ear voice data, and a preliminary recognition result corresponding to the ear voice data;
Specifically, in this step the ear speech acoustic features corresponding to externally input ear voice data may be obtained directly, or the corresponding ear speech acoustic features may be determined from the ear voice data.
Further, the preliminary recognition result corresponding to the ear voice data may be input from the outside, or may be determined by the present application from the ear voice data.
The accuracy of the preliminary recognition result corresponding to the ear voice data may not be high, and it cannot be directly used as the final recognition result.
The ear voice data can be collected by a terminal device, which may be a mobile phone, a personal computer, a tablet computer, or the like; the ear voice data can be collected through a microphone on the terminal device.
Step S110: input the ear speech acoustic features and the preliminary recognition result into a preset ear speech recovery model to obtain output normal sound acoustic features.
The ear speech recovery model is trained in advance by taking the recognition results of ear speech training data and ear speech training acoustic features as samples, and taking the normal sound acoustic features corresponding to normal speech data parallel to the ear speech training data as sample labels.
In other words, the training samples of the ear speech recovery model may include the ear speech training acoustic features corresponding to the ear speech training data and the recognition results of the ear speech training data; the sample labels include the normal sound acoustic features corresponding to the normal speech data parallel to the ear speech training data.
Normal speech data parallel to the ear speech training data means that the ear speech training data and the normal speech data are what the same speaker says, once in a whisper and once in a normal voice, with the device, environment, speech rate, mood and other conditions all kept the same.
The recognition results of the ear speech training data may be manually labeled, or, similarly to step S100, an externally imported preliminary recognition result corresponding to the ear speech training data may be obtained and used as the recognition result of the ear speech training data.
In this embodiment, using the ear speech acoustic features and the preliminary recognition result, the ear speech recovery model can predict the normal sound acoustic features corresponding to the ear voice data, on the basis of which the ear speech can be recovered, making it convenient for users to accurately understand the content expressed by the other party in an ear voice dialogue scene.
In an embodiment of the present application, the process of acquiring the ear speech acoustic features corresponding to the ear voice data in step S100 above is introduced. Referring to FIG. 2, the process may include:
Step S200: frame the ear voice data to obtain several frames of ear voice data;
Step S210: perform pre-emphasis on each frame of ear voice data to obtain processed ear voice data;
Step S220: extract the spectral features of each frame of processed ear voice data separately.
The spectral features may include any one or more of: log filter-bank energy features, Mel-frequency cepstral coefficient (MFCC) features, and perceptual linear prediction (PLP) features.
Further, the process of acquiring the preliminary recognition result corresponding to the ear voice data in step S100 above is introduced. This embodiment discloses two ways of acquiring it, as follows:
The first way is implemented on the basis of an ear speech recognition model.
In this embodiment an ear speech recognition model may be trained in advance. The ear speech recognition model is obtained by taking a normal sound recognition model as an initial model and training the initial model with ear speech training acoustic features labeled with recognition results of the ear speech training data.
The normal sound recognition model is obtained by training with normal sound training acoustic features labeled with recognition results of the normal sound training data.
In this embodiment, considering that the cost of collecting ear speech data is relatively high, the amount of collected ear speech data is usually small, so it is difficult to achieve good coverage of speakers, environments and other factors, and the recognition rate drops significantly whenever the ear speech training data fails to cover a case. On this basis, the ear speech recognition model designed in this application is obtained by adapting a normal speech recognition model, specifically:
First, a large amount of normally spoken normal speech data with manually labeled recognition results and a small amount of ear speech data with manually labeled recognition results are collected;
Second, the normal sound acoustic features of the normal speech data and the ear speech acoustic features of the ear speech data are extracted;
Third, the normal sound recognition model is trained using the normal sound acoustic features and the manually labeled recognition results of the normal speech data;
Finally, taking the trained normal sound recognition model as the initial model, this initial model is trained using the ear speech acoustic features and the manually labeled recognition results of the ear speech data; the ear speech recognition model is obtained after training.
Based on the trained ear speech recognition model, in this embodiment the ear speech acoustic features corresponding to the acquired ear voice data can be input into the ear speech recognition model, and the output ear speech recognition result is obtained as the preliminary recognition result corresponding to the ear voice data.
It can be understood that, in this embodiment, the ear speech recognition model may also be trained based only on ear speech data and its corresponding recognition results.
The second way is implemented on the basis of the ear speech recognition model and a lip shape recognition model.
On the basis of the first implementation, this embodiment further combines a lip shape recognition process to comprehensively determine the preliminary recognition result corresponding to the ear voice data. Specifically:
In this embodiment, lip image data matching the ear voice data may further be acquired. The lip image data consists of captured images containing the user's lip shapes while the user speaks the ear voice data.
On this basis, the present application pre-trains a lip shape recognition model, which is obtained by pre-training with lip image training data labeled with lip shape recognition results.
By inputting the lip image data matching the ear voice data into the lip shape recognition model, the lip shape recognition result output by the model is obtained.
Further optionally, after the lip image data matching the ear voice data is acquired, this embodiment may perform a preprocessing operation on the lip image data, and the preprocessed lip image data is used as the input of the lip shape recognition model.
The process of preprocessing the lip images may include:
First, lip detection is performed on each frame of lip image data to obtain a lip region;
Specifically, the lip detection may use an object detection technique, such as a Faster RCNN model.
Further, the lip region is extracted from the corresponding frame image and image normalization is performed to obtain normalized lip image data as the input of the lip shape recognition model.
In the image normalization process, the image can be scaled to a predetermined size, such as 32*32 pixels, or another specification. The normalization can use various existing image scaling techniques, such as linear interpolation.
Referring to FIG. 3, which illustrates a schematic structural diagram of a lip shape recognition model.
The preprocessed lip image sequence is used as the input of the model. First, a convolutional neural network CNN is used to obtain the feature expression of each frame of lip image; the structure of the convolutional neural network is not limited and may be a VGG structure or a residual structure often used in existing image recognition. Further, a recurrent neural network RNN forms the feature expression of the lip image sequence, which then passes through a feedforward neural network FFNN connected to the output layer; the output layer is the phoneme sequence or phoneme-state sequence corresponding to the input lip image sequence.
The phoneme sequence output by the output layer in the example of FIG. 3 is "zh, ong, g, uo".
On the basis of the lip shape recognition result obtained as introduced above, the lip shape recognition result is merged with the ear speech recognition result output by the ear speech recognition model, and the merged recognition result is taken as the preliminary recognition result corresponding to the ear voice data.
The process of merging the lip shape recognition result with the ear speech recognition result output by the ear speech recognition model may use an existing model fusion method, such as ROVER (Recognizer Output Voting Error Reduction), or other fusion methods.
By combining the lip shape recognition result with the ear speech recognition result, the ear speech recognition accuracy is improved, and the determined preliminary recognition result corresponding to the ear voice data is more accurate.
In another embodiment of the present application, the implementation process of step S110 above, inputting the ear speech acoustic features and the preliminary recognition result into the preset ear speech recovery model to obtain output normal sound acoustic features, is introduced.
Two ear speech recovery models are provided in this embodiment, as follows:
The first:
The ear speech recovery model is of a recurrent neural network type. FIG. 4 illustrates a schematic structural diagram of an ear speech recovery model of a recurrent neural network type.
The input layer includes two types of data: the ear speech acoustic features of each frame and the preliminary recognition result of each frame. In FIG. 4, the preliminary recognition result is illustrated with the phoneme sequence "zh, ong, g, uo" as an example.
The output layer is the normal sound acoustic features of each frame.
Based on the above ear speech recovery model, in this embodiment the ear speech acoustic features and the preliminary recognition result can be input into the ear speech recovery model of the recurrent neural network type to obtain the normal sound acoustic features output by the model.
The preliminary recognition result input to the model may be the vectorized preliminary recognition result.
The second:
The ear speech recovery model is of a codec (encoder-decoder) type based on an attention mechanism. FIG. 5 illustrates a schematic structural diagram of an ear speech recovery model of a codec type based on an attention mechanism.
The input layer includes two types of data: the ear speech acoustic features x_1 to x_s of each frame and the preliminary recognition result of each frame. In FIG. 5, the preliminary recognition result is illustrated with the phoneme sequence "zh, ong, g, uo" as an example.
The encoding layer encodes the ear speech acoustic features of each frame to obtain encoded ear speech acoustic features h_i^x, where i ∈ [1, s]. The attention layer uses the encoded ear speech acoustic features h_i^x and the hidden-layer variable s_t of the decoding layer at the current time t to jointly determine the coefficient vector a_t of the per-frame ear speech acoustic features at time t. Multiplying the coefficient vector a_t with the vector formed by the encoded per-frame ear speech acoustic features h_i^x gives the weighted ear speech acoustic feature c_t at the current time. The encoded preliminary recognition result, the weighted ear speech acoustic feature c_t at the current time, and the output y_{t-1} of the decoding layer at the previous time t-1 are taken as the input of the decoding layer at the current time t, and the output y_t of the decoding layer at the current time t is the normal sound acoustic feature.
Based on the above ear speech recovery model, this embodiment can determine the normal sound acoustic features using the model through the following steps:
1) Input the ear speech acoustic features and the preliminary recognition result into the ear speech recovery model of the codec type based on the attention mechanism;
The preliminary recognition result input to the model may be the vectorized preliminary recognition result.
2) Encode the ear speech acoustic features and the preliminary recognition result separately through the encoding layer of the ear speech recovery model to obtain the encoded ear speech acoustic features and the encoded preliminary recognition result;
3) Perform linear coefficient weighting on the encoded ear speech acoustic features through the attention layer of the ear speech recovery model to obtain the weighted ear speech acoustic features at the current time;
4) Through the decoding layer of the ear speech recovery model, take the encoded preliminary recognition result, the weighted ear speech acoustic features at the current time, and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, with the output of the decoding layer at the current time being the normal sound acoustic features.
在本申请的又一个实施例中,介绍了另一种耳语音恢复方法,结合图6所示,该方法可以包括:
步骤S300、获取耳语音数据对应的耳语音声学特征,及所述耳语音数据对应的初步识别结果;
步骤S310、将所述耳语音声学特征及所述初步识别结果输入预置的耳语音恢复模型,得到输出的正常音声学特征;
其中,所述耳语音恢复模型为,预先以耳语音训练数据标注的识别结果及耳语音训练声学特征为样本,以与所述耳语音训练数据平行的正常语音数据对应的正常音声学特征为样本标签进行训练得到。
需要说明的是,本实施例中步骤S300-S310与前述实施例中步骤S100-S110一一对应,详细参照前述介绍,此处不再赘述。
步骤S320、利用所述正常音声学特征,确定所述耳语音数据的最终识别结果。
本实施例中,在得到正常音声学特征之后,进一步利用该正常音声学特征,确定耳语音数据的最终识别结果,该最终识别结果可以是文本形式。
可以理解的是,除此之外,本申请还可以利用正常音声学特征,合成正常声语音进行输出,或者其它可选方式,具体按照应用需要而选择。
相比于前述实施例,本实施例中增加了利用正常音声学特征,确定耳语音数据的最终识别结果的过程,该最终识别结果可以进行存储、记录等用途。
可选的,本实施例中在步骤S320利用正常音声学特征,确定最终识别结果之后,可以将该最终识别结果与前述实施例介绍的唇形识别模型输出的唇形识别结果进行融合,将融合结果作为更新后的最终识别结果,进一步提高最终识别结果的准确度。
在本申请的又一个实施例中,介绍了上述步骤S320,利用所述正常音声学特征,确定所述耳语音数据的最终识别结果的两种可选实施方式。
第一种:
1)将所述正常音声学特征输入预置的正常音识别模型,得到输出的正常音识别结果;
2)将所述正常音识别结果作为所述耳语音数据的最终识别结果。
其中,正常音识别模型可以参照前文介绍,此处不再赘述。在该种实现方式中,将正常音识别模型输出的正常音识别结果直接作为最终的识别结果。
The second:
For ease of understanding, in this embodiment the process of the above step S320 is described in combination with a complete whispered speech recovery flow.
Referring to FIG. 7, FIG. 7 is a flowchart of yet another whispered speech recovery method disclosed in an embodiment of the present application. As shown in FIG. 7, the method includes:
Step S400: acquiring whispered speech acoustic features corresponding to whispered speech data, and a preliminary recognition result corresponding to the whispered speech data;
Step S410: inputting the whispered speech acoustic features and the preliminary recognition result into a preset whispered speech recovery model to obtain output normal speech acoustic features;
It should be noted that steps S400-S410 in this embodiment correspond one-to-one to steps S100-S110 in the foregoing embodiment; for details, refer to the foregoing description, which is not repeated here.
Step S420: inputting the normal speech acoustic features into a preset normal speech recognition model to obtain an output normal speech recognition result;
Step S430: judging whether a set iteration termination condition is met; if yes, performing step S440; if no, performing step S450;
Step S440: taking the normal speech recognition result as the final recognition result of the whispered speech data;
Step S450: taking the normal speech recognition result as the preliminary recognition result, and returning to step S410.
Compared with the first implementation, this implementation adds an iterative process through the whispered speech recovery model: the normal speech recognition result output by the normal speech recognition model is further taken as the preliminary recognition result and input into the whispered speech recovery model for another iteration, until the set iteration termination condition is determined to be met.
It can be understood that various iteration termination conditions may be set, for example, the number of iterations of the whispered speech recovery model reaching a count threshold, the iteration time reaching a time threshold, or the convergence of the confidence of the normal speech recognition result satisfying a set convergence condition.
Specifically, the count threshold and the time threshold may be determined according to the actual task's requirements on system response time and the available computing resources.
It can be understood that the more iterations are performed, the higher the accuracy of the resulting final recognition result, although more time and computing resources are naturally consumed.
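A minimal sketch of the iterative flow of steps S400-S450, assuming recovery_model, normal_recognizer and the returned confidence value are hypothetical callables and quantities standing in for the models described above:

def recover_and_recognize(whisper_feats, preliminary_result,
                          recovery_model, normal_recognizer,
                          max_iters=3, conf_delta=1e-3):
    """Iterate recovery and recognition until a termination condition is met (S400-S450)."""
    prev_conf = None
    for _ in range(max_iters):                                            # count-threshold condition
        normal_feats = recovery_model(whisper_feats, preliminary_result)  # S410
        result, conf = normal_recognizer(normal_feats)                    # S420
        if prev_conf is not None and abs(conf - prev_conf) < conf_delta:  # confidence converged
            return result                                                 # S440
        prev_conf = conf
        preliminary_result = result                                       # S450: feed the result back
    return result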
The whispered speech recovery apparatus provided in the embodiments of the present application is described below; the whispered speech recovery apparatus described below and the whispered speech recovery method described above may be referred to in correspondence with each other.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of a whispered speech recovery apparatus disclosed in an embodiment of the present application. As shown in FIG. 8, the apparatus may include:
a whispered speech acoustic feature acquisition unit 11, configured to acquire whispered speech acoustic features corresponding to whispered speech data;
a preliminary recognition result acquisition unit 12, configured to acquire a preliminary recognition result corresponding to the whispered speech data;
a whispered speech recovery processing unit 13, configured to input the whispered speech acoustic features and the preliminary recognition result into a preset whispered speech recovery model to obtain output normal speech acoustic features;
wherein the whispered speech recovery model is obtained by training in advance with recognition results of whispered speech training data and whispered speech training acoustic features as samples, and with normal speech acoustic features corresponding to normal speech data parallel to the whispered speech training data as sample labels.
Optionally, the preliminary recognition result acquisition unit may include:
a first preliminary recognition result acquisition subunit, configured to input the whispered speech acoustic features into a preset whispered speech recognition model and take the output whispered speech recognition result as the preliminary recognition result corresponding to the whispered speech data;
wherein the whispered speech recognition model is obtained by taking a normal speech recognition model as an initial model and training the initial model using whispered speech training acoustic features annotated with recognition results of whispered speech training data.
Optionally, the apparatus of the present application may further include:
a lip image data acquisition unit, configured to acquire lip image data matching the whispered speech data;
in which case the preliminary recognition result acquisition unit may further include:
a second preliminary recognition result acquisition subunit, configured to input the lip image data into a preset lip-shape recognition model to obtain an output lip-shape recognition result, the lip-shape recognition model being obtained by pre-training on lip image training data annotated with lip-shape recognition results;
a third preliminary recognition result acquisition subunit, configured to fuse the whispered speech recognition result and the lip-shape recognition result and take the fused recognition result as the preliminary recognition result corresponding to the whispered speech data.
Optionally, the apparatus of the present application may further include:
a lip detection unit, configured to perform lip detection on each frame of lip image data to obtain a lip region;
an image processing unit, configured to extract the lip region from the corresponding image frame and perform image normalization, taking the normalized lip image data as the input of the lip-shape recognition model.
Optionally, the whispered speech acoustic feature acquisition unit may include the following units (a feature extraction sketch is given after this list):
a framing processing unit, configured to divide the whispered speech data into frames to obtain several frames of whispered speech data;
a pre-emphasis processing unit, configured to perform pre-emphasis processing on each frame of whispered speech data to obtain processed whispered speech data;
a spectral feature extraction unit, configured to extract spectral features of each frame of processed whispered speech data respectively, the spectral features including any one or more of Mel filter-bank log-energy features, Mel-frequency cepstral coefficient features and perceptual linear prediction coefficient features.
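A minimal sketch of this front end, assuming numpy and illustrative parameter values (25 ms frames, 10 ms shift, 0.97 pre-emphasis); mel_filterbank is a hypothetical helper supplying the Mel filter matrix, which is not constructed here:

import numpy as np

def extract_logmel(signal, sr=16000, frame_len=0.025, frame_shift=0.010,
                   preemph=0.97, n_mels=40, mel_filterbank=None):
    """Framing -> pre-emphasis -> power spectrum -> Mel filter-bank log energies."""
    win, hop = int(sr * frame_len), int(sr * frame_shift)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):                   # framing
        frame = signal[start:start + win].astype(np.float64)
        frame = np.append(frame[0], frame[1:] - preemph * frame[:-1])    # pre-emphasis
        frames.append(frame * np.hamming(win))                           # windowing
    frames = np.array(frames)
    power = np.abs(np.fft.rfft(frames, n=512)) ** 2 / 512                # power spectrum per frame
    fbank = mel_filterbank if mel_filterbank is not None else np.ones((n_mels, power.shape[1]))
    return np.log(np.maximum(power @ fbank.T, 1e-10))                    # (num_frames, n_mels) log energies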
Optionally, this embodiment discloses two optional structures of the whispered speech recovery processing unit:
First, the whispered speech recovery processing unit may include:
a recurrent processing unit, configured to input the whispered speech acoustic features and the preliminary recognition result into a whispered speech recovery model of the recurrent neural network type to obtain normal speech acoustic features output by the model.
Second, the whispered speech recovery processing unit may include an encoding-decoding processing unit, which includes:
a first encoding-decoding processing subunit, configured to input the whispered speech acoustic features and the preliminary recognition result into a whispered speech recovery model of the attention-based encoder-decoder type;
a second encoding-decoding processing subunit, configured to encode the whispered speech acoustic features and the preliminary recognition result respectively through the encoding layer of the whispered speech recovery model to obtain encoded whispered speech acoustic features and an encoded preliminary recognition result;
a third encoding-decoding processing subunit, configured to apply coefficient-based linear weighting to the encoded whispered speech acoustic features through the attention layer of the whispered speech recovery model to obtain the weighted whispered speech acoustic feature at the current time;
a fourth encoding-decoding processing subunit, configured to, through the decoding layer of the whispered speech recovery model, take the encoded preliminary recognition result, the weighted whispered speech acoustic feature at the current time and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, with the output of the decoding layer at the current time taken as the normal speech acoustic features.
Optionally, the apparatus of the present application may further include:
a final recognition result determination unit, configured to determine a final recognition result of the whispered speech data using the normal speech acoustic features.
Optionally, this embodiment discloses two optional structures of the final recognition result determination unit:
First, the final recognition result determination unit may include:
a normal speech recognition unit, configured to input the normal speech acoustic features into a preset normal speech recognition model to obtain an output normal speech recognition result;
a first result determination unit, configured to take the normal speech recognition result as the final recognition result of the whispered speech data.
Second, the final recognition result determination unit may include:
a normal speech recognition unit, configured to input the normal speech acoustic features into a preset normal speech recognition model to obtain an output normal speech recognition result;
an iteration judgment unit, configured to judge whether a set iteration termination condition is met;
a second result determination unit, configured to, when the judgment result of the iteration judgment unit is yes, take the normal speech recognition result as the final recognition result of the whispered speech data;
a third result determination unit, configured to, when the judgment result of the iteration judgment unit is no, take the normal speech recognition result as the preliminary recognition result and return to the process of inputting the whispered speech acoustic features and the preliminary recognition result into the preset whispered speech recovery model.
The whispered speech recovery apparatus provided in the embodiments of the present application may be applied to a whispered speech recovery device, such as a PC terminal, a cloud platform, a server or a server cluster. Optionally, FIG. 9 shows a block diagram of the hardware structure of the whispered speech recovery device. Referring to FIG. 9, the hardware structure of the whispered speech recovery device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In the embodiments of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory;
The memory stores a program, and the processor may invoke the program stored in the memory, the program being used to:
acquire whispered speech acoustic features corresponding to whispered speech data, and a preliminary recognition result corresponding to the whispered speech data;
input the whispered speech acoustic features and the preliminary recognition result into a preset whispered speech recovery model to obtain output normal speech acoustic features;
wherein the whispered speech recovery model is obtained by training in advance with recognition results of whispered speech training data and whispered speech training acoustic features as samples, and with normal speech acoustic features corresponding to normal speech data parallel to the whispered speech training data as sample labels.
Optionally, for the refined and extended functions of the program, refer to the description above.
An embodiment of the present application further provides a storage medium, which may store a program suitable for execution by a processor, the program being used to:
acquire whispered speech acoustic features corresponding to whispered speech data, and a preliminary recognition result corresponding to the whispered speech data;
input the whispered speech acoustic features and the preliminary recognition result into a preset whispered speech recovery model to obtain output normal speech acoustic features;
wherein the whispered speech recovery model is obtained by training in advance with recognition results of whispered speech training data and whispered speech training acoustic features as samples, and with normal speech acoustic features corresponding to normal speech data parallel to the whispered speech training data as sample labels.
Optionally, for the refined and extended functions of the program, refer to the description above.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device comprising that element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts of the embodiments, reference may be made to one another.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (22)

  1. A whispered speech recovery method, characterized by comprising:
    acquiring whispered speech acoustic features corresponding to whispered speech data, and a preliminary recognition result corresponding to the whispered speech data;
    inputting the whispered speech acoustic features and the preliminary recognition result into a preset whispered speech recovery model to obtain output normal speech acoustic features;
    wherein the whispered speech recovery model is obtained by training in advance with recognition results of whispered speech training data and whispered speech training acoustic features as samples, and with normal speech acoustic features corresponding to normal speech data parallel to the whispered speech training data as sample labels.
  2. The method according to claim 1, characterized by further comprising:
    determining a final recognition result of the whispered speech data using the normal speech acoustic features.
  3. The method according to claim 1, characterized in that acquiring the preliminary recognition result corresponding to the whispered speech data comprises:
    inputting the whispered speech acoustic features into a preset whispered speech recognition model, and taking the output whispered speech recognition result as the preliminary recognition result corresponding to the whispered speech data;
    wherein the whispered speech recognition model is obtained by taking a normal speech recognition model as an initial model and training the initial model using whispered speech training acoustic features annotated with recognition results of whispered speech training data.
  4. The method according to claim 3, characterized by further comprising:
    acquiring lip image data matching the whispered speech data;
    wherein acquiring the preliminary recognition result corresponding to the whispered speech data further comprises:
    inputting the lip image data into a preset lip-shape recognition model to obtain an output lip-shape recognition result, wherein the lip-shape recognition model is obtained by pre-training on lip image training data annotated with lip-shape recognition results;
    fusing the whispered speech recognition result and the lip-shape recognition result, and taking the fused recognition result as the preliminary recognition result corresponding to the whispered speech data.
  5. The method according to claim 4, characterized by further comprising:
    performing lip detection on each frame of lip image data to obtain a lip region;
    extracting the lip region from the corresponding image frame and performing image normalization to obtain normalized lip image data as the input of the lip-shape recognition model.
  6. The method according to claim 1, characterized in that acquiring the whispered speech acoustic features corresponding to the whispered speech data comprises:
    dividing the whispered speech data into frames to obtain several frames of whispered speech data;
    performing pre-emphasis processing on each frame of whispered speech data to obtain processed whispered speech data;
    extracting spectral features of each frame of processed whispered speech data respectively, wherein the spectral features comprise any one or more of Mel filter-bank log-energy features, Mel-frequency cepstral coefficient features and perceptual linear prediction coefficient features.
  7. The method according to claim 1, characterized in that inputting the whispered speech acoustic features and the preliminary recognition result into the preset whispered speech recovery model to obtain the output normal speech acoustic features comprises:
    inputting the whispered speech acoustic features and the preliminary recognition result into a whispered speech recovery model of the recurrent neural network type to obtain normal speech acoustic features output by the model.
  8. The method according to claim 1, characterized in that inputting the whispered speech acoustic features and the preliminary recognition result into the preset whispered speech recovery model to obtain the output normal speech acoustic features comprises:
    inputting the whispered speech acoustic features and the preliminary recognition result into a whispered speech recovery model of the attention-based encoder-decoder type;
    encoding the whispered speech acoustic features and the preliminary recognition result respectively through an encoding layer of the whispered speech recovery model to obtain encoded whispered speech acoustic features and an encoded preliminary recognition result;
    applying coefficient-based linear weighting to the encoded whispered speech acoustic features through an attention layer of the whispered speech recovery model to obtain a weighted whispered speech acoustic feature at the current time;
    through a decoding layer of the whispered speech recovery model, taking the encoded preliminary recognition result, the weighted whispered speech acoustic feature at the current time and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, with the output of the decoding layer at the current time taken as the normal speech acoustic features.
  9. The method according to claim 2, characterized in that determining the final recognition result of the whispered speech data using the normal speech acoustic features comprises:
    inputting the normal speech acoustic features into a preset normal speech recognition model to obtain an output normal speech recognition result;
    taking the normal speech recognition result as the final recognition result of the whispered speech data.
  10. The method according to claim 2, characterized in that determining the final recognition result of the whispered speech data using the normal speech acoustic features comprises:
    inputting the normal speech acoustic features into a preset normal speech recognition model to obtain an output normal speech recognition result;
    judging whether a set iteration termination condition is met;
    if yes, taking the normal speech recognition result as the final recognition result of the whispered speech data;
    if no, taking the normal speech recognition result as the preliminary recognition result and returning to the process of inputting the whispered speech acoustic features and the preliminary recognition result into the preset whispered speech recovery model.
  11. A whispered speech recovery apparatus, characterized by comprising:
    a whispered speech acoustic feature acquisition unit, configured to acquire whispered speech acoustic features corresponding to whispered speech data;
    a preliminary recognition result acquisition unit, configured to acquire a preliminary recognition result corresponding to the whispered speech data;
    a whispered speech recovery processing unit, configured to input the whispered speech acoustic features and the preliminary recognition result into a preset whispered speech recovery model to obtain output normal speech acoustic features;
    wherein the whispered speech recovery model is obtained by training in advance with recognition results of whispered speech training data and whispered speech training acoustic features as samples, and with normal speech acoustic features corresponding to normal speech data parallel to the whispered speech training data as sample labels.
  12. The apparatus according to claim 11, characterized by further comprising:
    a final recognition result determination unit, configured to determine a final recognition result of the whispered speech data using the normal speech acoustic features.
  13. The apparatus according to claim 11, characterized in that the preliminary recognition result acquisition unit comprises:
    a first preliminary recognition result acquisition subunit, configured to input the whispered speech acoustic features into a preset whispered speech recognition model and take the output whispered speech recognition result as the preliminary recognition result corresponding to the whispered speech data;
    wherein the whispered speech recognition model is obtained by taking a normal speech recognition model as an initial model and training the initial model using whispered speech training acoustic features annotated with recognition results of whispered speech training data.
  14. The apparatus according to claim 13, characterized by further comprising:
    a lip image data acquisition unit, configured to acquire lip image data matching the whispered speech data;
    wherein the preliminary recognition result acquisition unit further comprises:
    a second preliminary recognition result acquisition subunit, configured to input the lip image data into a preset lip-shape recognition model to obtain an output lip-shape recognition result, wherein the lip-shape recognition model is obtained by pre-training on lip image training data annotated with lip-shape recognition results;
    a third preliminary recognition result acquisition subunit, configured to fuse the whispered speech recognition result and the lip-shape recognition result and take the fused recognition result as the preliminary recognition result corresponding to the whispered speech data.
  15. The apparatus according to claim 14, characterized by further comprising:
    a lip detection unit, configured to perform lip detection on each frame of lip image data to obtain a lip region;
    an image processing unit, configured to extract the lip region from the corresponding image frame and perform image normalization, taking the normalized lip image data as the input of the lip-shape recognition model.
  16. The apparatus according to claim 11, characterized in that the whispered speech acoustic feature acquisition unit comprises:
    a framing processing unit, configured to divide the whispered speech data into frames to obtain several frames of whispered speech data;
    a pre-emphasis processing unit, configured to perform pre-emphasis processing on each frame of whispered speech data to obtain processed whispered speech data;
    a spectral feature extraction unit, configured to extract spectral features of each frame of processed whispered speech data respectively, wherein the spectral features comprise any one or more of Mel filter-bank log-energy features, Mel-frequency cepstral coefficient features and perceptual linear prediction coefficient features.
  17. The apparatus according to claim 11, characterized in that the whispered speech recovery processing unit comprises:
    a recurrent processing unit, configured to input the whispered speech acoustic features and the preliminary recognition result into a whispered speech recovery model of the recurrent neural network type to obtain normal speech acoustic features output by the model.
  18. The apparatus according to claim 11, characterized in that the whispered speech recovery processing unit comprises an encoding-decoding processing unit, the encoding-decoding processing unit comprising:
    a first encoding-decoding processing subunit, configured to input the whispered speech acoustic features and the preliminary recognition result into a whispered speech recovery model of the attention-based encoder-decoder type;
    a second encoding-decoding processing subunit, configured to encode the whispered speech acoustic features and the preliminary recognition result respectively through an encoding layer of the whispered speech recovery model to obtain encoded whispered speech acoustic features and an encoded preliminary recognition result;
    a third encoding-decoding processing subunit, configured to apply coefficient-based linear weighting to the encoded whispered speech acoustic features through an attention layer of the whispered speech recovery model to obtain a weighted whispered speech acoustic feature at the current time;
    a fourth encoding-decoding processing subunit, configured to, through a decoding layer of the whispered speech recovery model, take the encoded preliminary recognition result, the weighted whispered speech acoustic feature at the current time and the output of the decoding layer at the previous time as the input of the decoding layer at the current time, with the output of the decoding layer at the current time taken as the normal speech acoustic features.
  19. The apparatus according to claim 12, characterized in that the final recognition result determination unit comprises:
    a normal speech recognition unit, configured to input the normal speech acoustic features into a preset normal speech recognition model to obtain an output normal speech recognition result;
    a first result determination unit, configured to take the normal speech recognition result as the final recognition result of the whispered speech data.
  20. The apparatus according to claim 12, characterized in that the final recognition result determination unit comprises:
    a normal speech recognition unit, configured to input the normal speech acoustic features into a preset normal speech recognition model to obtain an output normal speech recognition result;
    an iteration judgment unit, configured to judge whether a set iteration termination condition is met;
    a second result determination unit, configured to, when the judgment result of the iteration judgment unit is yes, take the normal speech recognition result as the final recognition result of the whispered speech data;
    a third result determination unit, configured to, when the judgment result of the iteration judgment unit is no, take the normal speech recognition result as the preliminary recognition result and return to the process of inputting the whispered speech acoustic features and the preliminary recognition result into the preset whispered speech recovery model.
  21. A whispered speech recovery device, characterized by comprising a memory and a processor;
    the memory being configured to store a program;
    the processor being configured to execute the program to implement the steps of the whispered speech recovery method according to any one of claims 1-10.
  22. A readable storage medium, on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the whispered speech recovery method according to any one of claims 1-10 are implemented.
PCT/CN2018/091460 2018-04-12 2018-06-15 一种耳语音恢复方法、装置、设备及可读存储介质 WO2019196196A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2019519686A JP6903129B2 (ja) 2018-04-12 2018-06-15 ささやき声変換方法、装置、デバイス及び可読記憶媒体
US16/647,284 US11508366B2 (en) 2018-04-12 2018-06-15 Whispering voice recovery method, apparatus and device, and readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810325696.3 2018-04-12
CN201810325696.3A CN108520741B (zh) 2018-04-12 2018-04-12 一种耳语音恢复方法、装置、设备及可读存储介质

Publications (1)

Publication Number Publication Date
WO2019196196A1 true WO2019196196A1 (zh) 2019-10-17

Family

ID=63432257

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/091460 WO2019196196A1 (zh) 2018-04-12 2018-06-15 一种耳语音恢复方法、装置、设备及可读存储介质

Country Status (4)

Country Link
US (1) US11508366B2 (zh)
JP (1) JP6903129B2 (zh)
CN (1) CN108520741B (zh)
WO (1) WO2019196196A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562686A (zh) * 2020-12-10 2021-03-26 青海民族大学 一种使用神经网络的零样本语音转换语料预处理方法

Families Citing this family (67)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10509626B2 (en) 2016-02-22 2019-12-17 Sonos, Inc Handling of loss of pairing between networked devices
US9820039B2 (en) 2016-02-22 2017-11-14 Sonos, Inc. Default playback devices
US9965247B2 (en) 2016-02-22 2018-05-08 Sonos, Inc. Voice controlled media playback system based on user profile
US9947316B2 (en) 2016-02-22 2018-04-17 Sonos, Inc. Voice control of a media playback system
US10095470B2 (en) 2016-02-22 2018-10-09 Sonos, Inc. Audio response playback
US10264030B2 (en) 2016-02-22 2019-04-16 Sonos, Inc. Networked microphone device control
US9978390B2 (en) 2016-06-09 2018-05-22 Sonos, Inc. Dynamic player selection for audio signal processing
US10134399B2 (en) 2016-07-15 2018-11-20 Sonos, Inc. Contextualization of voice inputs
US10115400B2 (en) 2016-08-05 2018-10-30 Sonos, Inc. Multiple voice services
US9942678B1 (en) 2016-09-27 2018-04-10 Sonos, Inc. Audio playback settings for voice interaction
US10181323B2 (en) 2016-10-19 2019-01-15 Sonos, Inc. Arbitration-based voice recognition
US10475449B2 (en) 2017-08-07 2019-11-12 Sonos, Inc. Wake-word detection suppression
US10048930B1 (en) 2017-09-08 2018-08-14 Sonos, Inc. Dynamic computation of system response volume
US10531157B1 (en) * 2017-09-21 2020-01-07 Amazon Technologies, Inc. Presentation and management of audio and visual content across devices
US10446165B2 (en) 2017-09-27 2019-10-15 Sonos, Inc. Robust short-time fourier transform acoustic echo cancellation during audio playback
US10482868B2 (en) 2017-09-28 2019-11-19 Sonos, Inc. Multi-channel acoustic echo cancellation
US10621981B2 (en) 2017-09-28 2020-04-14 Sonos, Inc. Tone interference cancellation
US10466962B2 (en) 2017-09-29 2019-11-05 Sonos, Inc. Media playback system with voice assistance
US11343614B2 (en) 2018-01-31 2022-05-24 Sonos, Inc. Device designation of playback and network microphone device arrangements
US10600408B1 (en) * 2018-03-23 2020-03-24 Amazon Technologies, Inc. Content output management based on speech quality
US11175880B2 (en) 2018-05-10 2021-11-16 Sonos, Inc. Systems and methods for voice-assisted media content selection
US10959029B2 (en) 2018-05-25 2021-03-23 Sonos, Inc. Determining and adapting to changes in microphone performance of playback devices
US10681460B2 (en) 2018-06-28 2020-06-09 Sonos, Inc. Systems and methods for associating playback devices with voice assistant services
US11076035B2 (en) 2018-08-28 2021-07-27 Sonos, Inc. Do not disturb feature for audio notifications
US10461710B1 (en) 2018-08-28 2019-10-29 Sonos, Inc. Media playback system with maximum volume setting
US10587430B1 (en) 2018-09-14 2020-03-10 Sonos, Inc. Networked devices, systems, and methods for associating playback devices based on sound codes
WO2020060311A1 (en) * 2018-09-20 2020-03-26 Samsung Electronics Co., Ltd. Electronic device and method for providing or obtaining data for training thereof
US11024331B2 (en) 2018-09-21 2021-06-01 Sonos, Inc. Voice detection optimization using sound metadata
US11100923B2 (en) 2018-09-28 2021-08-24 Sonos, Inc. Systems and methods for selective wake word detection using neural network models
US11899519B2 (en) 2018-10-23 2024-02-13 Sonos, Inc. Multiple stage network microphone device with reduced power consumption and processing load
EP3654249A1 (en) 2018-11-15 2020-05-20 Snips Dilated convolutions and gating for efficient keyword spotting
US11183183B2 (en) 2018-12-07 2021-11-23 Sonos, Inc. Systems and methods of operating media playback systems having multiple voice assistant services
US11132989B2 (en) 2018-12-13 2021-09-28 Sonos, Inc. Networked microphone devices, systems, and methods of localized arbitration
US10602268B1 (en) 2018-12-20 2020-03-24 Sonos, Inc. Optimization of network microphone devices using noise classification
US10867604B2 (en) 2019-02-08 2020-12-15 Sonos, Inc. Devices, systems, and methods for distributed voice processing
US11315556B2 (en) 2019-02-08 2022-04-26 Sonos, Inc. Devices, systems, and methods for distributed voice processing by transmitting sound data associated with a wake word to an appropriate device for identification
EP3709194A1 (en) 2019-03-15 2020-09-16 Spotify AB Ensemble-based data comparison
US11120794B2 (en) 2019-05-03 2021-09-14 Sonos, Inc. Voice assistant persistence across multiple network microphone devices
CN110211568A (zh) * 2019-06-03 2019-09-06 北京大牛儿科技发展有限公司 一种语音识别方法及装置
US10586540B1 (en) 2019-06-12 2020-03-10 Sonos, Inc. Network microphone device with command keyword conditioning
US11200894B2 (en) 2019-06-12 2021-12-14 Sonos, Inc. Network microphone device with command keyword eventing
US11361756B2 (en) 2019-06-12 2022-06-14 Sonos, Inc. Conditional wake word eventing based on environment
CN110444053B (zh) * 2019-07-04 2021-11-30 卓尔智联(武汉)研究院有限公司 语言学习方法、计算机装置及可读存储介质
US10871943B1 (en) 2019-07-31 2020-12-22 Sonos, Inc. Noise classification for event detection
US11138969B2 (en) 2019-07-31 2021-10-05 Sonos, Inc. Locally distributed keyword detection
US11138975B2 (en) 2019-07-31 2021-10-05 Sonos, Inc. Locally distributed keyword detection
US11227579B2 (en) * 2019-08-08 2022-01-18 International Business Machines Corporation Data augmentation by frame insertion for speech data
US11094319B2 (en) 2019-08-30 2021-08-17 Spotify Ab Systems and methods for generating a cleaned version of ambient sound
US11189286B2 (en) 2019-10-22 2021-11-30 Sonos, Inc. VAS toggle based on device orientation
JP7495220B2 (ja) 2019-11-15 2024-06-04 エヌ・ティ・ティ・コミュニケーションズ株式会社 音声認識装置、音声認識方法、および、音声認識プログラム
US11200900B2 (en) 2019-12-20 2021-12-14 Sonos, Inc. Offline voice control
US11562740B2 (en) 2020-01-07 2023-01-24 Sonos, Inc. Voice verification for media playback
US11308958B2 (en) 2020-02-07 2022-04-19 Sonos, Inc. Localized wakeword verification
US11328722B2 (en) * 2020-02-11 2022-05-10 Spotify Ab Systems and methods for generating a singular voice audio stream
US11308959B2 (en) 2020-02-11 2022-04-19 Spotify Ab Dynamic adjustment of wake word acceptance tolerance thresholds in voice-controlled devices
CN111462733B (zh) * 2020-03-31 2024-04-16 科大讯飞股份有限公司 多模态语音识别模型训练方法、装置、设备及存储介质
US11308962B2 (en) * 2020-05-20 2022-04-19 Sonos, Inc. Input detection windowing
US11482224B2 (en) 2020-05-20 2022-10-25 Sonos, Inc. Command keywords with input detection windowing
US11727919B2 (en) 2020-05-20 2023-08-15 Sonos, Inc. Memory allocation for keyword spotting engines
CN111916095B (zh) * 2020-08-04 2022-05-17 北京字节跳动网络技术有限公司 语音增强方法、装置、存储介质及电子设备
US11698771B2 (en) 2020-08-25 2023-07-11 Sonos, Inc. Vocal guidance engines for playback devices
CN112365884A (zh) * 2020-11-10 2021-02-12 珠海格力电器股份有限公司 耳语的识别方法和装置、存储介质、电子装置
US11984123B2 (en) 2020-11-12 2024-05-14 Sonos, Inc. Network device interaction by range
CN113066485B (zh) * 2021-03-25 2024-05-17 支付宝(杭州)信息技术有限公司 一种语音数据处理方法、装置及设备
CN112927682B (zh) * 2021-04-16 2024-04-16 西安交通大学 一种基于深度神经网络声学模型的语音识别方法及***
WO2023210149A1 (ja) * 2022-04-26 2023-11-02 ソニーグループ株式会社 情報処理装置及び情報処理方法、並びにコンピュータプログラム
CN115294970B (zh) * 2022-10-09 2023-03-24 苏州大学 针对病理嗓音的语音转换方法、装置和存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101527141A (zh) * 2009-03-10 2009-09-09 苏州大学 基于径向基神经网络的耳语音转换为正常语音的方法
CN102023703A (zh) * 2009-09-22 2011-04-20 现代自动车株式会社 组合唇读与语音识别的多模式界面***
CN107452381A (zh) * 2016-05-30 2017-12-08 ***通信有限公司研究院 一种多媒体语音识别装置及方法

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6317716B1 (en) * 1997-09-19 2001-11-13 Massachusetts Institute Of Technology Automatic cueing of speech
CN1095580C (zh) * 1998-04-18 2002-12-04 茹家佑 聋哑人语音学习、对话方法中使用的语音同步反馈装置
US6594632B1 (en) * 1998-11-02 2003-07-15 Ncr Corporation Methods and apparatus for hands-free operation of a voice recognition system
JP2006119647A (ja) * 2005-09-16 2006-05-11 Yasuto Takeuchi ささやき声を通常の有声音声に擬似的に変換する装置
CN101154385A (zh) * 2006-09-28 2008-04-02 北京远大超人机器人科技有限公司 机器人语音动作的控制方法及其所采用的控制***
JP4264841B2 (ja) * 2006-12-01 2009-05-20 ソニー株式会社 音声認識装置および音声認識方法、並びに、プログラム
US20080261576A1 (en) * 2007-04-20 2008-10-23 Alcatel Lucent Communication system for oil and gas platforms
US8386252B2 (en) * 2010-05-17 2013-02-26 Avaya Inc. Estimating a listener's ability to understand a speaker, based on comparisons of their styles of speech
KR20160009344A (ko) * 2014-07-16 2016-01-26 삼성전자주식회사 귓속말 인식 방법 및 장치
CN104484656A (zh) * 2014-12-26 2015-04-01 安徽寰智信息科技股份有限公司 基于深度学习的唇语识别唇形模型库构建方法
CN104537358A (zh) * 2014-12-26 2015-04-22 安徽寰智信息科技股份有限公司 基于深度学习的唇语识别唇形训练数据库的生成方法
JP2016186516A (ja) * 2015-03-27 2016-10-27 日本電信電話株式会社 疑似音声信号生成装置、音響モデル適応装置、疑似音声信号生成方法、およびプログラム
JP6305955B2 (ja) * 2015-03-27 2018-04-04 日本電信電話株式会社 音響特徴量変換装置、音響モデル適応装置、音響特徴量変換方法、およびプログラム
US9867012B2 (en) * 2015-06-03 2018-01-09 Dsp Group Ltd. Whispered speech detection
CN106571135B (zh) * 2016-10-27 2020-06-09 苏州大学 一种耳语音特征提取方法及***
US10665243B1 (en) * 2016-11-11 2020-05-26 Facebook Technologies, Llc Subvocalized speech recognition
CN106847271A (zh) * 2016-12-12 2017-06-13 北京光年无限科技有限公司 一种用于对话交互***的数据处理方法及装置
CN106782504B (zh) * 2016-12-29 2019-01-22 百度在线网络技术(北京)有限公司 语音识别方法和装置
CN107665705B (zh) * 2017-09-20 2020-04-21 平安科技(深圳)有限公司 语音关键词识别方法、装置、设备及计算机可读存储介质
CN107680597B (zh) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 语音识别方法、装置、设备以及计算机可读存储介质
US10529355B2 (en) * 2017-12-19 2020-01-07 International Business Machines Corporation Production of speech based on whispered speech and silent speech

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101527141A (zh) * 2009-03-10 2009-09-09 苏州大学 基于径向基神经网络的耳语音转换为正常语音的方法
CN102023703A (zh) * 2009-09-22 2011-04-20 现代自动车株式会社 组合唇读与语音识别的多模式界面***
CN107452381A (zh) * 2016-05-30 2017-12-08 ***通信有限公司研究院 一种多媒体语音识别装置及方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562686A (zh) * 2020-12-10 2021-03-26 青海民族大学 一种使用神经网络的零样本语音转换语料预处理方法
CN112562686B (zh) * 2020-12-10 2022-07-15 青海民族大学 一种使用神经网络的零样本语音转换语料预处理方法

Also Published As

Publication number Publication date
US20200211550A1 (en) 2020-07-02
JP6903129B2 (ja) 2021-07-14
JP2020515877A (ja) 2020-05-28
US11508366B2 (en) 2022-11-22
CN108520741B (zh) 2021-05-04
CN108520741A (zh) 2018-09-11

Similar Documents

Publication Publication Date Title
WO2019196196A1 (zh) 一种耳语音恢复方法、装置、设备及可读存储介质
CN109785824B (zh) 一种语音翻译模型的训练方法及装置
US9202462B2 (en) Key phrase detection
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN112435684B (zh) 语音分离方法、装置、计算机设备和存储介质
CN111916061B (zh) 语音端点检测方法、装置、可读存储介质及电子设备
US8160875B2 (en) System and method for improving robustness of speech recognition using vocal tract length normalization codebooks
WO2020238045A1 (zh) 智能语音识别方法、装置及计算机可读存储介质
CN111261162B (zh) 语音识别方法、语音识别装置及存储介质
CN113488024B (zh) 一种基于语义识别的电话打断识别方法和***
US20240135933A1 (en) Method, apparatus, device, and storage medium for speaker change point detection
CN113362813B (zh) 一种语音识别方法、装置和电子设备
EP3980991B1 (en) System and method for recognizing user's speech
CN111199160A (zh) 即时通话语音的翻译方法、装置以及终端
CN112017633B (zh) 语音识别方法、装置、存储介质及电子设备
CN114333865A (zh) 一种模型训练以及音色转换方法、装置、设备及介质
CN116090474A (zh) 对话情绪分析方法、装置和计算机可读存储介质
CN116978359A (zh) 音素识别方法、装置、电子设备及存储介质
CN112669821B (zh) 一种语音意图识别方法、装置、设备及存储介质
CN115700871A (zh) 模型训练和语音合成方法、装置、设备及介质
CN111951807A (zh) 语音内容检测方法及其装置、介质和***
CN113707130B (zh) 一种语音识别方法、装置和用于语音识别的装置
WO2024082928A1 (zh) 语音处理方法、装置、设备和介质
JP2024520985A (ja) 適応型視覚音声認識
CN116092485A (zh) 语音识别模型的训练方法及装置、语音识别方法及装置

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019519686

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18914003

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18914003

Country of ref document: EP

Kind code of ref document: A1