CN113113044B - Audio processing method and device, terminal and storage medium - Google Patents

Audio processing method and device, terminal and storage medium

Info

Publication number
CN113113044B
Authority
CN
China
Prior art keywords
audio
voiceprint feature
voice
target object
voiceprint
Prior art date
Legal status
Active
Application number
CN202110309769.1A
Other languages
Chinese (zh)
Other versions
CN113113044A (en)
Inventor
徐娜
王林章
贾永涛
Current Assignee
Duke Kunshan University
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Duke Kunshan University
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Duke Kunshan University, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Duke Kunshan University
Priority to CN202110309769.1A priority Critical patent/CN113113044B/en
Publication of CN113113044A publication Critical patent/CN113113044A/en
Application granted granted Critical
Publication of CN113113044B publication Critical patent/CN113113044B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 - Voice signal separating
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to an audio processing method and device, a terminal and a storage medium. The method comprises the following steps: determining a first voiceprint feature of the target object; pre-separating the mixed audio to obtain multiple paths of voice signals; and determining target audio matched with the target object in the mixed audio according to the first voiceprint feature and the multipath voice signals. By the method, the accuracy of voice separation can be improved.

Description

Audio processing method and device, terminal and storage medium
Technical Field
The disclosure relates to the field of electronic technology, and in particular, to an audio processing method and device, a terminal and a storage medium.
Background
The goal of speech separation is to separate the speech signal of each target speaker from a mixture of multiple speakers. Traditional voice separation methods mainly use blind separation techniques based on independent component analysis. In recent years, voice separation based on deep learning has gradually become the main trend: during training, a certain voice feature is used as the network input so that the model acquires the ability to distinguish different speakers. However, it is difficult to obtain a good voice separation effect with the above schemes.
Disclosure of Invention
The disclosure provides an audio processing method and device, a terminal and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided an audio processing method, including:
determining a first voiceprint feature of the target object;
pre-separating the mixed audio to obtain multiple paths of voice signals;
and determining target audio matched with the target object in the mixed audio according to the first voiceprint characteristics and the multipath voice signals.
In some embodiments, the determining the target audio matching the target object in the mixed audio according to the first voiceprint feature and the multiple voice signals includes:
determining a second voiceprint feature of each voice signal in the multipath voice signals;
splicing the second voiceprint feature of each voice signal and the first voiceprint feature to obtain a third voiceprint feature;
and inputting the third voiceprint feature into a preset voice separation network model, and determining target audio matched with the target object in the mixed audio.
In some embodiments, the inputting the third voiceprint feature into a predetermined voice separation network model, determining a target audio in the mixed audio that matches the target object, includes:
inputting the third voiceprint feature into each sub-module of the predetermined voice separation network model to obtain an output result of each sub-module;
and determining target audio matched with the target object in the mixed audio according to the total output result of the series connection of the output results of the sub-modules.
In some embodiments, the sub-module comprises: a multi-layer long short term memory network LSTM and a full connection layer.
In some embodiments, the determining the first voiceprint feature of the target object includes:
acquiring an audio signal of the target object;
and extracting a first voiceprint feature of the target object according to the frequency spectrum of the audio signal.
In some embodiments, the extracting the first voiceprint feature of the target object from the spectrum of the audio signal includes:
and inputting the frequency spectrum of the audio signal into a preset voiceprint extraction network model, and acquiring a first voiceprint characteristic of the target object.
In some embodiments, the voiceprint extraction network model includes:
residual network RESNET;
at least one pooling layer connected with the RESNET;
and the full-connection layer is connected with the pooling layer.
In some embodiments, the pre-separating the mixed audio to obtain multiple voice signals includes:
and pre-separating the mixed audio by adopting an independent vector analysis IVA mode to obtain the multipath voice signals.
In some embodiments, the mixed audio is collected during a voice call;
the method further comprises the steps of:
and carrying out noise reduction processing on the target audio after the voice separation, and outputting the enhanced target audio.
According to a second aspect of embodiments of the present disclosure, there is provided an audio processing apparatus comprising:
a determination module configured to determine a first voiceprint feature of a target object;
the pre-separation module is configured to perform pre-separation processing on the mixed audio to obtain multiple paths of voice signals;
and the extraction module is configured to determine target audio matched with the target object in the mixed audio according to the first voiceprint feature and the multipath voice signals.
In some embodiments, the extraction module is further configured to determine a second voiceprint feature of each of the plurality of voice signals; splice the second voiceprint feature of each voice signal and the first voiceprint feature to obtain a third voiceprint feature; and input the third voiceprint feature into a preset voice separation network model, and determine target audio matched with the target object in the mixed audio.
In some embodiments, the extracting module is further configured to input the third voiceprint feature into each sub-module of the predetermined voice separation network model, and obtain an output result of each sub-module; and determining target audio matched with the target object in the mixed audio according to the total output result of the series connection of the output results of the sub-modules.
In some embodiments, the sub-module comprises: a multi-layer long short term memory network LSTM and a full connection layer.
In some embodiments, the determining module is further configured to obtain an audio signal of the target object; and extracting a first voiceprint feature of the target object according to the frequency spectrum of the audio signal.
In some embodiments, the determining module is further configured to input a spectrum of the audio signal into a predetermined voiceprint extraction network model, obtaining a first voiceprint feature of the target object.
In some embodiments, the voiceprint extraction network model includes:
residual network RESNET;
at least one pooling layer connected with the RESNET;
and the full-connection layer is connected with the pooling layer.
In some embodiments, the pre-separation module is further configured to perform pre-separation processing on the mixed audio in an independent vector analysis IVA manner, so as to obtain the multi-path speech signal.
In some embodiments, the mixed audio is collected during a voice call;
the apparatus further comprises:
and the enhancement module is configured to perform noise reduction processing on the target audio after the voice separation and output the enhanced target audio.
According to a third aspect of embodiments of the present disclosure, there is provided a terminal comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the audio processing method as described in the first aspect above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium comprising:
the instructions in the storage medium, when executed by a processor of the terminal, enable the terminal to perform the audio processing method as described in the first aspect above.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
in the embodiment of the disclosure, the mixed audio is subjected to pre-separation processing, and the target audio of the target object in the mixed audio is further determined by combining the determined first voiceprint feature of the target object and the pre-separated multipath voice signals.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart of an audio processing method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a block diagram illustrating voiceprint feature extraction in an exemplary embodiment of the present disclosure.
Fig. 3 is a functional block diagram of an audio processing method according to an exemplary embodiment of the present disclosure.
Fig. 4 is a diagram of an audio processing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram of a terminal shown in an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Fig. 1 is a flowchart of an audio processing method according to an exemplary embodiment of the present disclosure, and as shown in fig. 1, the audio processing method applied to a terminal includes the steps of:
s11, determining a first voiceprint feature of a target object;
s12, performing pre-separation processing on the mixed audio to obtain multiple paths of voice signals;
s13, determining target audio matched with the target object in the mixed audio according to the first voiceprint features and the multipath voice signals.
In an embodiment of the present disclosure, a terminal device includes a mobile device or a stationary device. The mobile device includes a cell phone, a tablet computer, a wearable device, and the like; the stationary device includes, but is not limited to, a personal computer (PC), a smart speaker, a smart television, a smart home device, and the like.
The terminal device comprises an audio acquisition component and an audio output component. Taking the mobile phone as an example, the audio acquisition component may be a microphone and the audio output component may be a loudspeaker. The terminal device may include a plurality of audio acquisition components, which support multiple audio acquisition channels for acquiring audio signals.
In step S11, the terminal determines a first voiceprint feature of the target object. The voiceprint features include tone, timbre, intensity, sound-wave wavelength, frequency, rhythm of change, and the like, which can reflect the speaking characteristics of different people. Because different people have different vocal organs, such as the oral cavity and vocal cords, and different speaking habits, each person has distinct voiceprint features.
The target object may be a user who performs voiceprint registration, or may be another object designated by the user. The first voiceprint feature of the target object may be obtained by sampling the target object, for example, the user reads specified text content according to the instruction, realizes voice input, and the terminal performs voice sampling according to the input content of the user, and obtains the first voiceprint feature according to the sampled content.
The first voiceprint feature may be obtained in advance. For example, the user is instructed to perform audio input when registering with the terminal so as to obtain the voiceprint feature, and the terminal may store this voiceprint feature of the user, that is, the first voiceprint feature of the target object. The user here is of course not limited to the user of the terminal but may be any authorized user. In scenarios requiring voice recognition, the terminal can use the first voiceprint feature as a verification parameter to recognize and authenticate the user.
In addition, the first voiceprint feature can also be obtained during a voice call, voice input, or the like. For example, when the user makes a voice call through the terminal, the user is closest to the terminal and therefore typically the loudest sound source in the call scene. The terminal can then take the user making the voice call as the target user, acquire that user's voiceprint feature, and recognize the voice in the current call in real time based on that feature, thereby separating the target audio from the environmental noise and achieving noise reduction during the call.
In step S12, the terminal performs pre-separation processing on the mixed audio to obtain multiple voice signals. The mixed audio may include target audio generated by speaking the target object and audio emitted by speaking other people, or include target audio and other environmental noise.
When the mixed audio is subjected to the pre-separation process, neither the information of the target object is introduced in advance nor the mixing mode of the mixed audio is known, so the pre-separation corresponds to a blind source separation mode. One possible case for the pre-separated voice signals is: the target audio of the target object and the non-target audio of non-target objects are distributed in different signal paths, but because the information of the target object has not been added in advance, it cannot be distinguished which path corresponds to the target audio. Another possible case is: the audio of the non-target objects and the target object is not well separated across the paths, and each separated voice signal may still contain target audio of the target object.
In embodiments of the present disclosure, the multi-path speech signal may be obtained based on conventional means such as independent component analysis (Independent Component Analysis, ICA) or based on a deep learning model when the pre-separation process is performed, which is not limiting of the present disclosure.
In step S13, the terminal combines the first voiceprint feature and the multipath voice signal to further separate the target audio of the target object from the mixed audio.
It can be understood that the present disclosure pre-separates the mixed audio and then determines the target audio of the target object in the mixed audio by combining the first voiceprint feature with the pre-separated multipath voice signals. Because the first voiceprint feature comes from the target object, introducing it provides more information about the target object, so combining it with the pre-separation result makes the extraction of the target audio more accurate.
In some embodiments, the determining the target audio matching the target object in the mixed audio according to the first voiceprint feature and the multiple voice signals includes:
determining a second voiceprint feature of each voice signal in the multipath voice signals;
splicing the second voiceprint feature of each voice signal and the first voiceprint feature to obtain a third voiceprint feature;
and inputting the third voiceprint feature into a preset voice separation network model, and determining target audio matched with the target object in the mixed audio.
In this embodiment, for the multipath voice signals, the terminal device extracts the second voiceprint feature of each voice signal, and splices each second voiceprint feature with the first voiceprint feature to obtain a third voiceprint feature. In some embodiments, each second voiceprint feature of the pre-separated multipath voice signals and the first voiceprint feature of the target object may be directly spliced, so that the dimension of the obtained third voiceprint feature is the sum of the dimensions of the second voiceprint features and the first voiceprint feature.
Illustratively, if there are n speech signals, then the second voiceprint feature has n paths. Assuming that the dimensions of the first voiceprint feature and the second voiceprint feature are both 1, after the second voiceprint feature and the first voiceprint feature of each voice signal are spliced, the obtained third voiceprint feature has n+1 dimensions.
In other embodiments, each second voiceprint feature and each first voiceprint feature may be input into a feature stitching model, where the feature stitching model performs analysis processing on each input second voiceprint feature and first voiceprint feature, and extracts a main feature of each second voiceprint feature and first voiceprint feature as a third voiceprint feature, so as to reduce redundancy features and realize dimension reduction.
It should be noted that, the second voiceprint feature may be extracted in the same manner as the first voiceprint feature, or may be extracted in a different manner, which is not limited to the embodiment of the present disclosure. If the extraction modes of the second voiceprint feature and the first voiceprint feature are different, in the feature splicing process, the first voiceprint feature and the second voiceprint feature can be normalized to obtain a third voiceprint feature, so that feature quantities of the first voiceprint feature and the second voiceprint feature can represent the characteristics of sound in the same numerical range.
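As a concrete illustration of the splicing described above, the following is a minimal Python sketch, assuming each voiceprint feature is a fixed-length vector; the function name, the optional normalization step, and the 128-dimensional features in the example are illustrative assumptions, not values prescribed by this disclosure.

```python
import numpy as np

def splice_voiceprint_features(second_features, first_feature, normalize=True):
    """Concatenate the per-path (second) voiceprint features with the
    target object's (first) voiceprint feature into a third voiceprint feature.

    second_features: list of 1-D arrays, one per pre-separated voice signal.
    first_feature:   1-D array extracted from the target object's registration audio.
    """
    feats = list(second_features) + [first_feature]
    if normalize:
        # Bring features extracted by different front-ends into the same numeric range.
        feats = [(f - f.mean()) / (f.std() + 1e-8) for f in feats]
    # The spliced dimension is the sum of the individual feature dimensions.
    return np.concatenate(feats, axis=0)

# Example: n = 2 pre-separated paths with 128-dimensional voiceprints.
second = [np.random.randn(128) for _ in range(2)]
first = np.random.randn(128)
third = splice_voiceprint_features(second, first)
print(third.shape)  # (384,) = 2 * 128 + 128
```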
After the third voiceprint feature is obtained, the third voiceprint feature can be input into a predetermined voice separation network model, so that target audio matched with the target object in the mixed audio is determined.
It can be understood that in this embodiment, the first voiceprint feature of the target object and the second voiceprint feature of each pre-separated voice signal are combined, that is, the first voiceprint feature of the target object is added based on the pre-separation result, and the first voiceprint feature of the target object is used as a reference, so that when the third voiceprint feature is input into the predetermined voice separation network model, the accuracy of extracting the target audio can be improved.
In some embodiments, the inputting the third voiceprint feature into a predetermined voice separation network model, determining a target audio in the mixed audio that matches the target object, includes:
inputting the third voiceprint feature into each sub-module of the predetermined voice separation network model to obtain an output result of each sub-module;
and determining target audio matched with the target object in the mixed audio according to the total output result of the series connection of the output results of the sub-modules.
In this embodiment, the voice separation network model may include a plurality of sub-modules, and the total output result obtained by connecting the output results of the sub-modules in series is the separation result for the whole mixed audio.
In some embodiments, the sub-module comprises: a multi-layer long short term memory network LSTM and a full connection layer.
In the embodiment of the disclosure, the network structure of each sub-module of the speech separation network model is a multi-layer Long Short-Term Memory (LSTM) network connected with a full connection layer, and the loss function used in training may be cross entropy.
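A minimal PyTorch sketch of such a network is given below, following the structure described here and in the module description later (several identical sub-modules connected in series, each consisting of stacked LSTM layers and a full connection layer, with the target voiceprint feature concatenated to the input of every sub-module). The hidden size, number of layers and sub-modules, and the per-frame mask-style output are illustrative assumptions rather than values stated in this disclosure.

```python
import torch
import torch.nn as nn

class SeparationSubModule(nn.Module):
    """One sub-module: stacked LSTM layers followed by a full connection layer."""
    def __init__(self, input_dim, hidden_dim=256, num_layers=2, output_dim=257):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):            # x: (batch, time, input_dim)
        out, _ = self.lstm(x)
        return self.fc(out)          # e.g. a per-frame mask or spectrum estimate

class TargetSpeakerSeparator(nn.Module):
    """Several identical sub-modules in series; the target voiceprint feature is
    concatenated to the input of every sub-module."""
    def __init__(self, feat_dim, vp_dim, num_blocks=3, output_dim=257):
        super().__init__()
        self.blocks = nn.ModuleList(
            [SeparationSubModule(feat_dim + vp_dim, output_dim=output_dim)] +
            [SeparationSubModule(output_dim + vp_dim, output_dim=output_dim)
             for _ in range(num_blocks - 1)])

    def forward(self, x, voiceprint):      # x: (batch, time, feat_dim), voiceprint: (batch, vp_dim)
        vp = voiceprint.unsqueeze(1)       # add a time axis for broadcasting over frames
        for block in self.blocks:
            vp_t = vp.expand(-1, x.size(1), -1)
            x = block(torch.cat([x, vp_t], dim=-1))
        return x
```

In practice such a network would be trained with the cross-entropy loss mentioned above.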
According to the technical scheme, based on the deep learning network, the mixed audio is separated by combining the first voiceprint feature of the target object and the second voiceprint feature of each pre-separated voice signal, so that the separation accuracy of the target audio can be effectively improved.
Of course, the present disclosure is not limited to LSTM networks when training the speech separation network model; recurrent neural networks (RNNs) and other architectures may also be used.
In some embodiments, the determining the first voiceprint feature of the target object includes:
acquiring an audio signal of the target object;
and extracting a first voiceprint feature of the target object according to the frequency spectrum of the audio signal.
In this embodiment, the first voiceprint feature may be acquired and stored in advance, and the stored first voiceprint feature is used for separation when performing speech separation. The process of acquiring the first voiceprint features is achieved by acquiring an audio signal of a target object and performing feature extraction using the frequency spectrum of the audio signal.
The spectrum of the audio signal may be obtained, for example, by a short-time Fourier transform (STFT) of the audio signal.
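For illustration, a minimal SciPy sketch of computing such a spectrum follows; the sampling rate, frame length, and hop size are illustrative assumptions rather than values specified by this disclosure.

```python
import numpy as np
from scipy.signal import stft

def magnitude_spectrum(audio, sample_rate=16000, frame_len=512, hop=256):
    """Compute the STFT magnitude spectrum of an audio signal."""
    _, _, spec = stft(audio, fs=sample_rate, nperseg=frame_len,
                      noverlap=frame_len - hop)
    return np.abs(spec)              # shape: (freq_bins, frames)
```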
In some embodiments, the extracting the first voiceprint feature of the target object from the spectrum of the audio signal includes:
and inputting the frequency spectrum of the audio signal into a preset voiceprint extraction network model, and acquiring a first voiceprint characteristic of the target object.
In this embodiment, the spectrum of the audio signal of the target object is input to the neural network model of voiceprint extraction, and voiceprint features are output.
In some embodiments, the voiceprint extraction network model includes:
residual network RESNET;
at least one pooling layer connected with the RESNET;
and the full-connection layer is connected with the pooling layer.
In this embodiment, the voiceprint extraction network model may be composed of a residual network (ResNet), one or more pooling layers, a full connection layer, and the like. The pooling part may comprise multiple layers, for example two. The loss function employed in model training may be cross entropy.
As previously described, the extraction of the second voiceprint feature in each of the pre-separated voice signals in the present disclosure may also take the same way as the first voiceprint extraction. However, it should be noted that, in the embodiments of the present disclosure, the extraction of the first voiceprint feature and/or the second voiceprint feature is not limited to the method described above, and the voiceprint feature may be extracted from the spectrum of the audio signal by using other neural network models, or may be extracted from the audio signal by using other manners based on the time domain characteristics.
Fig. 2 is a block diagram of voiceprint feature extraction according to an exemplary embodiment of the present disclosure. As shown in fig. 2, voiceprint extraction is performed on the voice signal of the target object by a voiceprint extraction module, so as to obtain the voiceprint feature (first voiceprint feature) of the target object. The voiceprint extraction module can be trained using deep learning; its input is the STFT magnitude spectrum of the target object's registration corpus, and its output can be a 128-dimensional voiceprint feature.
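The following is a minimal PyTorch sketch of a voiceprint extractor with the ResNet, pooling layer, and full connection layer structure described above, mapping an STFT magnitude spectrogram to a 128-dimensional voiceprint. The ResNet-18 backbone, the single adaptive pooling layer, and the single-channel input stem are illustrative assumptions; this disclosure does not fix the depth of the residual network or the number of pooling layers.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VoiceprintExtractor(nn.Module):
    """ResNet backbone -> pooling layer -> full connection layer, producing a
    fixed-length voiceprint from an STFT magnitude spectrogram."""
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = resnet18()
        # Spectrograms are single-channel, unlike the 3-channel RGB images ResNet expects.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Keep everything except the original average pooling and classification head.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)    # pooling layer connected to the ResNet
        self.fc = nn.Linear(512, embed_dim)    # full connection layer -> voiceprint

    def forward(self, spec):                   # spec: (batch, 1, freq_bins, frames)
        h = self.pool(self.backbone(spec)).flatten(1)
        return self.fc(h)                      # (batch, embed_dim), e.g. 128-dimensional
```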
In some embodiments, the pre-separating the mixed audio to obtain multiple voice signals includes:
and pre-separating the mixed audio by adopting an independent vector analysis IVA mode to obtain the multipath voice signals.
In this embodiment, the terminal may pre-separate the mixed audio using conventional independent vector analysis (Independent Vector Analysis, IVA) to obtain multiple voice signals.
It can be understood that the present disclosure performs pre-separation in the traditional IVA manner and then separates again with a voice separation network model trained by a deep learning method, that is, combines deep learning with traditional IVA. On the one hand, this avoids the channel selection problem faced by the IVA technique and mitigates its non-ideal separation effect; on the other hand, compared with separating with a deep learning method alone, combining the two approaches allows the separation performance of the whole system to gain the benefits of both methods, so the separation performance is better.
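A minimal sketch of the IVA pre-separation step is shown below, assuming the pyroomacoustics package and its auxiliary-function IVA routine (bss.auxiva) as the IVA implementation; this disclosure does not prescribe a particular library, and the FFT size and iteration count are illustrative.

```python
import pyroomacoustics as pra
from scipy.signal import stft, istft

def iva_preseparate(mics, fs=16000, nfft=1024):
    """Pre-separate mixed multi-microphone audio into multiple voice signals with IVA.
    mics: array of shape (n_channels, n_samples); returns an array of the same shape."""
    hop = nfft // 2
    _, _, spec = stft(mics, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    X = spec.transpose(2, 1, 0)                  # (frames, freq_bins, channels)
    Y = pra.bss.auxiva(X, n_iter=20)             # frequency-domain IVA separation
    _, separated = istft(Y.transpose(2, 1, 0), fs=fs, nperseg=nfft, noverlap=nfft - hop)
    return separated                             # n pre-separated voice signals
```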
In some embodiments, the mixed audio is collected during a voice call;
the method further comprises the steps of:
and carrying out noise reduction processing on the target audio after the voice separation, and outputting the enhanced target audio.
In the embodiment of the disclosure, the mixed audio is collected during the voice call, and for the target audio after voice separation, noise reduction processing can be performed on the target audio because the target audio may still contain a part of noise, so as to obtain the target audio with better quality. For example, wiener filtering techniques may be employed to reduce noise for the target audio.
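As an illustration of this post-processing step, here is a minimal spectral Wiener-filter sketch; estimating the noise power from the first few frames (assumed to contain no target speech) is an illustrative assumption, since this disclosure only states that a Wiener filtering technique may be employed.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_denoise(audio, fs=16000, nfft=512, noise_frames=10):
    """Reduce residual noise in the separated target audio with a Wiener gain."""
    _, _, spec = stft(audio, fs=fs, nperseg=nfft)
    power = np.abs(spec) ** 2
    # Noise power estimated from the leading frames (assumed speech-free).
    noise_power = power[:, :noise_frames].mean(axis=1, keepdims=True)
    snr = np.maximum(power / (noise_power + 1e-12) - 1.0, 0.0)   # estimated signal-to-noise ratio per bin
    gain = snr / (snr + 1.0)                                     # Wiener gain
    _, enhanced = istft(gain * spec, fs=fs, nperseg=nfft)
    return enhanced
```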
According to the above scheme, real-time voice separation can be performed during a voice call: the target audio is separated from the mixed audio, optionally further post-processed (for example, noise reduction or amplification), and transmitted to the opposite terminal, so that most of the noise is filtered out of the audio signal received by the opposite terminal and the call quality is improved.
A specific application scenario of this scheme is a mobile phone call, realizing voice enhancement and noise reduction for a specific speaker. For example, speaker A registers before use; after registration is completed, A's voice can be transmitted to the other party during a call. If voices of other speakers such as B and C appear at that moment, they are not transmitted; if A, B, and C speak simultaneously, only A's voice passes through and is transmitted to the other party.
Fig. 3 is a functional block diagram of an audio processing method according to an exemplary embodiment of the present disclosure. As shown in fig. 3, the audio processing method obtains the target audio through the following modules:
the input of the IVA module (pre-separation processing) is the mixed audio of Mic1 and Mic2 … Micn, and the output of the IVA module is the n paths of pre-separated voice signals.
The input of the feature extraction module is the n pre-separated voice signals; after passing through the network, it outputs the voiceprint features of the n voice signals. Through this module, the second voiceprint feature of each voice signal in the multipath voice signals is obtained.
The feature splicing module takes as input the voiceprint features of the n pre-separated voice signals and the voiceprint feature of the target speaker (the first voiceprint feature), and outputs the spliced feature, namely the third voiceprint feature, obtained by splicing the first voiceprint feature with the second voiceprint features.
The input of the target speaker separation module (the voice separation network model) is the spliced feature, and the voice of the target speaker is separated from the pre-separated multi-channel voice signals. Specifically, the third voiceprint feature is input into the predetermined voice separation network model to determine the target audio in the mixed audio. The network structure of the voice separation network model can take several LSTM layers connected in series with a full connection layer as one sub-module, with the voiceprint feature concatenated to the input feature of each sub-module; the complete target speaker separation network is formed by connecting several identical sub-modules in series.
The post-processing module can adopt a Wiener filtering technique to reduce noise in the voice of the target speaker and further enhance it, thereby obtaining the enhanced target audio.
By the technical scheme of the embodiment of the disclosure, the traditional IVA mode and deep learning are combined, and benefits of the two methods can be obtained, so that the voice separation performance is better.
Fig. 4 is a diagram of an audio processing apparatus according to an exemplary embodiment of the present disclosure. Referring to fig. 4, the apparatus includes:
a determining module 101 configured to determine a first voiceprint feature of the target object;
the pre-separation module 102 is configured to perform pre-separation processing on the mixed audio to obtain multiple paths of voice signals;
an extraction module 103 is configured to determine a target audio matching the target object in the mixed audio according to the first voiceprint feature and the multiple voice signals.
In some embodiments, the extracting module 103 is further configured to determine a second voiceprint feature of each of the plurality of voice signals; splicing the second voiceprint feature of each voice signal and the first voiceprint feature to obtain a third voiceprint feature; and inputting the third voiceprint feature into a preset voice separation network model, and determining target audio matched with the target object in the mixed audio.
In some embodiments, the extracting module 103 is further configured to input the third voiceprint feature into each sub-module of the predetermined voice separation network model, to obtain an output result of each sub-module; and determining target audio matched with the target object in the mixed audio according to the total output result of the series connection of the output results of the sub-modules.
In some embodiments, the sub-module comprises: a multi-layer long short term memory network LSTM and a full connection layer.
In some embodiments, the determining module 101 is further configured to obtain an audio signal of the target object; and extracting a first voiceprint feature of the target object according to the frequency spectrum of the audio signal.
In some embodiments, the determining module 101 is further configured to input the spectrum of the audio signal into a predetermined voiceprint extraction network model, and obtain a first voiceprint feature of the target object.
In some embodiments, the voiceprint extraction network model includes:
residual network RESNET;
at least one pooling layer connected with the RESNET;
and the full-connection layer is connected with the pooling layer.
In some embodiments, the pre-separation module 102 is further configured to perform pre-separation processing on the mixed audio in an independent vector analysis IVA manner, so as to obtain the multiple voice signals.
In some embodiments, the mixed audio is collected during a voice call;
the apparatus further comprises:
and the enhancement module 104 is configured to perform noise reduction processing on the target audio after the voice separation and output the enhanced target audio.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be described in detail herein.
Fig. 5 is a block diagram of a terminal device 800 according to an exemplary embodiment of the present disclosure. For example, the device 800 may be a cell phone, a computer, etc.
Referring to fig. 5, apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen between the device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 800 is in an operational mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the apparatus 800. For example, the sensor assembly 814 may detect an on/off state of the device 800, a relative positioning of the components, such as a display and keypad of the device 800, the sensor assembly 814 may also detect a change in position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, an orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the apparatus 800 and other devices, either in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as Wi-Fi,2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including instructions executable by processor 820 of apparatus 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer-readable storage medium, the instructions in which, when executed by a processor of a terminal, enable the terminal to perform an audio processing method, the method comprising:
determining a first voiceprint feature of the target object;
pre-separating the mixed audio to obtain multiple paths of voice signals;
and determining target audio matched with the target object in the mixed audio according to the first voiceprint characteristics and the multipath voice signals.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (18)

1. An audio processing method, comprising:
determining a first voiceprint feature of the target object;
pre-separating the mixed audio to obtain multiple paths of voice signals;
determining target audio matched with the target object in the mixed audio according to the first voiceprint features and the multipath voice signals;
wherein the determining, according to the first voiceprint feature and the multiple voice signals, the target audio matching the target object in the mixed audio includes:
determining a second voiceprint feature of each voice signal in the multipath voice signals;
splicing the second voiceprint feature of each voice signal and the first voiceprint feature to obtain a third voiceprint feature;
and inputting the third voiceprint feature into a preset voice separation network model, and determining target audio matched with the target object in the mixed audio.
2. The method of claim 1, wherein said inputting the third voiceprint feature into a predetermined voice separation network model determines a target audio in the mixed audio that matches the target object, comprising:
inputting the third voiceprint feature into each sub-module of the predetermined voice separation network model to obtain an output result of each sub-module;
and determining target audio matched with the target object in the mixed audio according to the total output result of the series connection of the output results of the sub-modules.
3. The method of claim 2, wherein the sub-module comprises: a multi-layer long short term memory network LSTM and a full connection layer.
4. The method of claim 1, wherein determining the first voiceprint feature of the target object comprises:
acquiring an audio signal of the target object;
and extracting a first voiceprint feature of the target object according to the frequency spectrum of the audio signal.
5. The method of claim 4, wherein extracting the first voiceprint feature of the target object from the frequency spectrum of the audio signal comprises:
and inputting the frequency spectrum of the audio signal into a preset voiceprint extraction network model, and acquiring a first voiceprint characteristic of the target object.
6. The method of claim 5, wherein the voiceprint extraction network model comprises:
residual network RESNET;
at least one pooling layer connected with the RESNET;
and the full-connection layer is connected with the pooling layer.
7. The method of claim 1, wherein the pre-separating the mixed audio to obtain multiple voice signals comprises:
and pre-separating the mixed audio by adopting an independent vector analysis IVA mode to obtain the multipath voice signals.
8. The method of any one of claims 1 to 7, wherein the mixed audio is collected during a voice call;
the method further comprises the steps of:
and carrying out noise reduction processing on the target audio after the voice separation, and outputting the enhanced target audio.
9. An audio processing apparatus, comprising:
a determination module configured to determine a first voiceprint feature of a target object;
the pre-separation module is configured to perform pre-separation processing on the mixed audio to obtain multiple paths of voice signals;
the extraction module is configured to determine second voice characteristics of each voice signal in the multipath voice signals; splicing the second voiceprint feature of each voice signal and the first voiceprint feature to obtain a third voiceprint feature; and inputting the third voiceprint feature into a preset voice separation network model, and determining target audio matched with the target object in the mixed audio.
10. The apparatus of claim 9, wherein
the extraction module is further configured to input the third voiceprint feature into each sub-module of the predetermined voice separation network model to obtain an output result of each sub-module; and determining target audio matched with the target object in the mixed audio according to the total output result of the series connection of the output results of the sub-modules.
11. The apparatus of claim 10, wherein the sub-module comprises: a multi-layer long short term memory network LSTM and a full connection layer.
12. The apparatus of claim 9, wherein
the determining module is further configured to acquire an audio signal of the target object; and extracting a first voiceprint feature of the target object according to the frequency spectrum of the audio signal.
13. The apparatus of claim 12, wherein
the determining module is further configured to input a frequency spectrum of the audio signal into a predetermined voiceprint extraction network model, and obtain a first voiceprint feature of the target object.
14. The apparatus of claim 13, wherein the voiceprint extraction network model comprises:
residual network RESNET;
at least one pooling layer connected with the RESNET;
and the full-connection layer is connected with the pooling layer.
15. The apparatus of claim 9, wherein
the pre-separation module is further configured to perform pre-separation processing on the mixed audio by adopting an independent vector analysis IVA mode to obtain the multipath voice signals.
16. The apparatus according to any one of claims 9 to 15, wherein the mixed audio is collected during a voice call;
the apparatus further comprises:
and the enhancement module is configured to perform noise reduction processing on the target audio after the voice separation and output the enhanced target audio.
17. A terminal, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the audio processing method of any of claims 1 to 8.
18. A non-transitory computer readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of a terminal, enable the terminal to perform the audio processing method of any one of claims 1 to 8.
CN202110309769.1A 2021-03-23 2021-03-23 Audio processing method and device, terminal and storage medium Active CN113113044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110309769.1A CN113113044B (en) 2021-03-23 2021-03-23 Audio processing method and device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110309769.1A CN113113044B (en) 2021-03-23 2021-03-23 Audio processing method and device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN113113044A CN113113044A (en) 2021-07-13
CN113113044B true CN113113044B (en) 2023-05-09

Family

ID=76710501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110309769.1A Active CN113113044B (en) 2021-03-23 2021-03-23 Audio processing method and device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113113044B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113674755B (en) * 2021-08-19 2024-04-02 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and medium
CN116153326A (en) * 2021-11-22 2023-05-23 北京字跳网络技术有限公司 Voice separation method, device, electronic equipment and readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869645B (en) * 2016-03-25 2019-04-12 腾讯科技(深圳)有限公司 Voice data processing method and device
CN108305615B (en) * 2017-10-23 2020-06-16 腾讯科技(深圳)有限公司 Object identification method and device, storage medium and terminal thereof
CN111081257A (en) * 2018-10-19 2020-04-28 珠海格力电器股份有限公司 Voice acquisition method, device, equipment and storage medium
CN111128197B (en) * 2019-12-25 2022-05-13 北京邮电大学 Multi-speaker voice separation method based on voiceprint features and generation confrontation learning
CN111370031B (en) * 2020-02-20 2023-05-05 厦门快商通科技股份有限公司 Voice separation method, system, mobile terminal and storage medium
CN111863020B (en) * 2020-07-30 2022-09-20 腾讯科技(深圳)有限公司 Voice signal processing method, device, equipment and storage medium
CN112435684B (en) * 2020-11-03 2021-12-03 中电金信软件有限公司 Voice separation method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113113044A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN106024009B (en) Audio processing method and device
CN104991754B (en) The way of recording and device
CN107945806B (en) User identification method and device based on sound characteristics
CN113113044B (en) Audio processing method and device, terminal and storage medium
CN109360549B (en) Data processing method, wearable device and device for data processing
TW201807565A (en) Voice-based information sharing method, device, and mobile terminal
CN116129931B (en) Audio-visual combined voice separation model building method and voice separation method
CN110931028B (en) Voice processing method and device and electronic equipment
CN112820300B (en) Audio processing method and device, terminal and storage medium
CN108710791A (en) The method and device of voice control
CN104851423B (en) Sound information processing method and device
CN109036404A (en) Voice interactive method and device
CN110970015B (en) Voice processing method and device and electronic equipment
CN114446318A (en) Audio data separation method and device, electronic equipment and storage medium
CN110767229B (en) Voiceprint-based audio output method, device and equipment and readable storage medium
CN111696566B (en) Voice processing method, device and medium
CN111694539B (en) Method, device and medium for switching between earphone and loudspeaker
CN113113036B (en) Audio signal processing method and device, terminal and storage medium
CN112532912A (en) Video processing method and device and electronic equipment
CN111816174A (en) Speech recognition method, device and computer readable storage medium
CN112116916B (en) Method, device, medium and equipment for determining performance parameters of voice enhancement algorithm
CN111696565B (en) Voice processing method, device and medium
CN111696564B (en) Voice processing method, device and medium
CN113345451B (en) Sound changing method and device and electronic equipment
CN111063365B (en) Voice processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20210712

Address after: 100085 No.004, 3rd floor, building 6, courtyard 33, Xierqi Middle Road, Haidian District, Beijing

Applicant after: Beijing Xiaomi pinecone Electronic Co.,Ltd.

Applicant after: DUKE KUNSHAN University

Address before: No.018, 8th floor, building 6, No.33 yard, middle Xierqi Road, Haidian District, Beijing 100085

Applicant before: BEIJING XIAOMI MOBILE SOFTWARE Co.,Ltd.

Applicant before: DUKE KUNSHAN University

SE01 Entry into force of request for substantive examination
GR01 Patent grant