CN116110373B - Voice data acquisition method and related device of intelligent conference system - Google Patents


Info

Publication number
CN116110373B
CN116110373B (application CN202310384553.0A)
Authority
CN
China
Prior art keywords
target
voice
phoneme
frequency spectrum
accent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310384553.0A
Other languages
Chinese (zh)
Other versions
CN116110373A (en)
Inventor
李庆余
黄智�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shengfeite Technology Co ltd
Original Assignee
Shenzhen Shengfeite Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shengfeite Technology Co ltd filed Critical Shenzhen Shengfeite Technology Co ltd
Priority to CN202310384553.0A
Publication of CN116110373A
Application granted
Publication of CN116110373B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities, audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the field of voice processing and discloses a voice data acquisition method and related device for an intelligent conference system, used to improve the voice acquisition quality of the intelligent conference system and the user experience. The method comprises the following steps: performing voice decomposition on a target voice waveform to obtain a discrete voice signal; performing speaker voice feature recognition on the discrete voice signal to obtain target accent features, and performing phoneme extraction on the discrete voice signal to obtain initial phoneme data; performing accent phoneme sequence recognition and processing on the initial phoneme data according to the target accent features to obtain target phoneme data; inputting the target phoneme data into a Mel frequency spectrum generation model for Mel frequency spectrum generation to obtain a target Mel frequency spectrum; and analyzing the content restoration degree of the target Mel frequency spectrum to obtain a target speech content restoration degree, then performing voice recombination and voice enhancement on the target Mel frequency spectrum according to that restoration degree to output a target voice signal.

Description

Voice data acquisition method and related device of intelligent conference system
Technical Field
The invention relates to the field of voice processing, in particular to a voice data acquisition method and a related device of an intelligent conference system.
Background
With the development and popularization of artificial intelligence and communication technology, more and more enterprises and users adopt a voice intelligent conference system to conduct local and multiparty conference communication. The intelligent conference system can greatly reduce the communication cost and time of users and improve the production and working efficiency of enterprises and users.
Existing schemes optimize the acquisition equipment for sound quality alone, with the algorithm fixed into the hardware system; this reduces the accuracy of subsequent voice post-processing and yields low voice acquisition quality.
Disclosure of Invention
The invention provides a voice data acquisition method and a related device of an intelligent conference system, which are used for improving the voice acquisition quality of the intelligent conference system and improving the user experience.
The first aspect of the present invention provides a voice data acquisition method for an intelligent conference system, where the voice data acquisition method for the intelligent conference system includes:
determining multiple groups of far-field multi-microphone devices of a target conference scene and a near-speaking microphone device for a speaker based on an intelligent conference system, and collecting an initial voice signal of the speaker through a preset target voice channel;
performing waveform conversion on the initial voice signal to obtain a target voice waveform, and performing voice decomposition on the target voice waveform to obtain a discrete voice signal;
performing speaker voice feature recognition on the discrete voice signal to obtain target accent features of the speaker, and performing phoneme extraction on the discrete voice signal to obtain initial phoneme data;
according to the target accent characteristics, accent phoneme sequence recognition and processing are carried out on the initial phoneme data, so that target phoneme data are obtained;
inputting the target phoneme data into a preset Mel frequency spectrum generation model for Mel frequency spectrum generation to obtain a target Mel frequency spectrum;
and analyzing the content restoration degree of the target Mel frequency spectrum to obtain a target speech content restoration degree, and performing voice recombination and voice enhancement on the target Mel frequency spectrum according to the target speech content restoration degree to output a target voice signal.
With reference to the first aspect, in a first implementation manner of the first aspect of the present invention, the determining, based on the intelligent conference system, a plurality of far-field multi-microphone devices of a target conference scene and a near-speaking microphone device for a speaker, and collecting an initial voice signal of the speaker through a preset target voice channel includes:
determining multiple groups of far-field multi-microphone devices of a target conference scene and a near-speaking microphone device for a speaker based on the intelligent conference system;
collecting original voice signals of the speaker through the plurality of far-field multi-microphone devices and the near-speaking microphone device;
and carrying out signal enhancement on the original voice signal through a preset target voice channel to obtain an initial voice signal of the speaker.
With reference to the first aspect, in a second implementation manner of the first aspect of the present invention, the performing waveform conversion on the initial voice signal to obtain a target voice waveform and performing voice decomposition on the target voice waveform to obtain a discrete voice signal includes:
performing waveform conversion on the initial voice signal to obtain a target voice waveform;
according to a preset sequence length value, performing decomposition level calculation on the target voice waveform to obtain a target decomposition level;
decomposing the target voice waveform according to the target decomposition level to obtain initial discrete voice;
and calling a preset denoising function, and compressing the initial discrete voice to obtain a discrete voice signal.
With reference to the first aspect, in a third implementation manner of the first aspect of the present invention, the performing speaker voice feature recognition on the discrete voice signal to obtain target accent features of the speaker, and performing phoneme extraction on the discrete voice signal to obtain initial phoneme data, includes:
extracting acoustic features of the discrete voice signals to obtain a plurality of voice features;
inputting the voice features into a preset speaker voice feature recognition model to perform accent feature recognition to obtain target accent features of the speaker;
extracting phonemes from the discrete voice signals to obtain a phoneme state sequence;
and generating initial phoneme data corresponding to the discrete voice signals according to the phoneme state sequence.
With reference to the first aspect, in a fourth implementation manner of the first aspect of the present invention, the performing, according to the target accent feature, accent phoneme sequence recognition and processing on the initial phoneme data to obtain target phoneme data includes:
carrying out accent phoneme sequence coding processing on the initial phoneme data to obtain a phoneme sequence vector;
according to the target accent characteristics, carrying out accent sequence detection on the phoneme sequence vector to obtain at least one accent sequence;
and carrying out replacement processing on at least one accent sequence in the phoneme sequence vector to generate target phoneme data.
With reference to the first aspect, in a fifth implementation manner of the first aspect of the present invention, the inputting the target phoneme data into a preset Mel frequency spectrum generation model for Mel frequency spectrum generation to obtain a target Mel frequency spectrum includes:
inputting the target phoneme data into a preset Mel frequency spectrum generation model, wherein the Mel frequency spectrum generation model comprises: a bidirectional long short-term memory network, a two-layer gated recurrent network, and a Mel frequency spectrum generation network;
extracting features of the target phoneme data through the bidirectional long short-term memory network to obtain a feature vector;
inputting the feature vector into the two-layer gated recurrent network for vector feature conversion to obtain a target conversion vector;
and inputting the target conversion vector into the Mel frequency spectrum generation network for Mel frequency spectrum generation to obtain the target Mel frequency spectrum.
With reference to the first aspect, in a sixth implementation manner of the first aspect of the present invention, the performing content restoration degree analysis on the target Mel frequency spectrum to obtain a target speech content restoration degree, and performing voice recombination and voice enhancement on the target Mel frequency spectrum according to the target speech content restoration degree to output a target voice signal, includes:
inputting the target Mel frequency spectrum into a preset voice generation network for voice generation to obtain voice data corresponding to the target Mel frequency spectrum;
performing speech content restoration degree calculation on the voice data corresponding to the target Mel frequency spectrum to obtain a target speech content restoration degree;
and if the target speech content restoration degree exceeds a preset target value, performing voice recombination and voice enhancement on the voice data corresponding to the target Mel frequency spectrum, and outputting a target voice signal.
The second aspect of the present invention provides a voice data acquisition device of an intelligent conference system, the voice data acquisition device of the intelligent conference system comprising:
the acquisition module is used for determining multiple groups of far-field multi-microphone devices of a target conference scene and a near-speaking microphone device for a speaker based on the intelligent conference system, and collecting an initial voice signal of the speaker through a preset target voice channel;
the decomposition module is used for performing waveform conversion on the initial voice signal to obtain a target voice waveform, and performing voice decomposition on the target voice waveform to obtain a discrete voice signal;
the recognition module is used for performing speaker voice feature recognition on the discrete voice signal to obtain target accent features of the speaker, and performing phoneme extraction on the discrete voice signal to obtain initial phoneme data;
the processing module is used for carrying out accent phoneme sequence recognition and processing on the initial phoneme data according to the target accent characteristics to obtain target phoneme data;
the conversion module is used for inputting the target phoneme data into a preset Mel frequency spectrum generation model for Mel frequency spectrum generation to obtain a target Mel frequency spectrum;
and the output module is used for analyzing the content restoration degree of the target Mel frequency spectrum to obtain a target speech content restoration degree, and performing voice recombination and voice enhancement on the target Mel frequency spectrum according to the target speech content restoration degree to output a target voice signal.
A third aspect of the present invention provides a voice data acquisition device of an intelligent conference system, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the voice data collection device of the intelligent conference system to execute the voice data collection method of the intelligent conference system described above.
A fourth aspect of the present invention provides a computer readable storage medium having instructions stored therein that, when run on a computer, cause the computer to perform the method of voice data collection of an intelligent conference system described above.
In the technical scheme provided by the invention, the target voice waveform is subjected to voice decomposition to obtain a discrete voice signal; speaker voice feature recognition is performed on the discrete voice signal to obtain target accent features, and phoneme extraction is performed on the discrete voice signal to obtain initial phoneme data; accent phoneme sequence recognition and processing are carried out on the initial phoneme data according to the target accent features to obtain target phoneme data; the target phoneme data are input into a Mel frequency spectrum generation model to obtain a target Mel frequency spectrum; and the target Mel frequency spectrum undergoes voice recombination and voice enhancement according to the target speech content restoration degree to output the target voice signal. By processing the voice signal and the speaker's accent sequences during the conference, the invention improves the voice collection quality of the intelligent conference system, further improves the sound quality of the target voice signal, and improves the experience of conference listeners.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a voice data collection method of an intelligent conference system according to the present invention;
FIG. 2 is a flow chart of waveform conversion and speech decomposition in an embodiment of the present invention;
FIG. 3 is a flow chart of speaker accent feature recognition and phoneme extraction in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of accent phoneme sequence recognition and processing in an embodiment of the invention;
FIG. 5 is a schematic diagram of an embodiment of a voice data acquisition device of an intelligent conference system according to the present invention;
fig. 6 is a schematic diagram of another embodiment of a voice data acquisition device of the intelligent conference system according to the present invention.
Detailed Description
The embodiment of the invention provides a voice data acquisition method and a related device of an intelligent conference system, which are used for improving the voice acquisition quality of the intelligent conference system and improving the user experience. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For easy understanding, referring to fig. 1, an embodiment of a voice data collection method of an intelligent conference system according to the embodiment of the present invention includes:
S101, determining multiple groups of far-field multi-microphone devices of a target conference scene and a near-speaking microphone device for a speaker based on an intelligent conference system, and collecting an initial voice signal of the speaker through a preset target voice channel;
It can be understood that the execution subject of the present invention may be a voice data acquisition device of an intelligent conference system, or may be a terminal or a server, which is not limited herein. The embodiments of the present invention are described taking a server as the execution subject by way of example.
Specifically, the server analyzes the conference scene of the intelligent conference system to determine its conference scene configuration information, then derives the microphone device types and device counts from that configuration information, determining the multiple groups of far-field multi-microphone devices of the target conference scene and the near-speaking microphone device for the speaker, and finally collects the speaker's initial voice signal through the target voice channel.
S102, performing waveform conversion on an initial voice signal to obtain a target voice waveform, and performing voice decomposition on the target voice waveform to obtain a discrete voice signal;
Specifically, the server performs waveform conversion on the initial voice signal. The server first performs voice signal point-waveform analysis on the initial voice signal to obtain multiple voice signal point waveforms; the frequency of occurrence of each point waveform is then converted, by a method convenient for hardware implementation, into an audio decibel value, which can be adjusted according to the decibel level set by the user and converted into audio data. Finally, the server performs waveform conversion on the initial voice signal according to the audio data and the multiple voice signal point waveforms to obtain the target voice waveform, and performs voice decomposition on the target voice waveform to obtain the discrete voice signal.
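The point-waveform-to-decibel conversion described above can be illustrated with a minimal sketch. The formula, the reference amplitude, and the floor value below are standard signal-processing conventions and assumed parameters, not details taken from the patent's hardware method.

```python
import math

def amplitude_to_db(amplitude, reference=1.0, floor_db=-100.0):
    """Convert a linear waveform amplitude to a decibel value.

    `reference` and `floor_db` are assumed parameters; the patent does
    not specify the reference level used by the hardware method.
    """
    if amplitude <= 0:
        return floor_db
    return 20.0 * math.log10(amplitude / reference)

# Convert a short run of sample amplitudes to decibel values.
samples = [1.0, 0.5, 0.1, 0.0]
dbs = [amplitude_to_db(s) for s in samples]
```

A full-scale sample maps to 0 dB, smaller amplitudes to negative values, and silence is clamped to the assumed floor.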
S103, performing speaker voice feature recognition on the discrete voice signal to obtain target accent features of the speaker, and performing phoneme extraction on the discrete voice signal to obtain initial phoneme data;
the method includes the steps of obtaining a discrete voice signal, extracting acoustic features corresponding to the discrete voice signal, and carrying out speaking population voice feature recognition through the acoustic features corresponding to the discrete voice signal, wherein a server inputs a plurality of voice features corresponding to the acoustic features into a speaking population voice feature recognition model to carry out accent feature recognition to obtain target accent features of a speaker, and further the server carries out phoneme extraction on the discrete voice signal to obtain initial phoneme data.
S104, carrying out accent phoneme sequence recognition and processing on the initial phoneme data according to the target accent characteristics to obtain target phoneme data;
Specifically, the server first encodes the initial phoneme data to determine a phoneme sequence vector. The server then performs accent sequence detection on the phoneme sequence vector according to the target accent features: during detection, it computes the feature similarity between the phoneme sequence vector and the accent features to determine the corresponding similarity result. Based on that result, the server performs accent phoneme sequence recognition and processing on the initial phoneme data to obtain the target phoneme data.
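The feature-similarity calculation between the phoneme sequence vector and the accent features can be pictured as a standard vector similarity test. The sketch below uses cosine similarity with an assumed detection threshold; the patent does not specify which metric or threshold is used.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

# A phoneme-sequence segment is flagged as an accent sequence when its
# similarity to the accent feature vector exceeds a chosen threshold.
accent_feature = [0.9, 0.1, 0.4]   # hypothetical accent feature vector
segment = [0.8, 0.2, 0.5]          # hypothetical sequence segment
is_accent = cosine_similarity(segment, accent_feature) > 0.9
```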
S105, inputting the target phoneme data into a preset Mel frequency spectrum generation model for Mel frequency spectrum generation to obtain a target Mel frequency spectrum;
Specifically, the server inputs the target phoneme data into the Mel frequency spectrum generation model for Mel frequency spectrum generation. Feature extraction is first performed on the target phoneme data through the model to obtain a feature vector; vector feature conversion is then applied to the feature vector to obtain a target conversion vector; finally, Mel frequency spectrum generation is performed according to the target conversion vector to obtain the target Mel frequency spectrum.
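Independent of the patent's generation model, a target Mel frequency spectrum rests on the standard mel frequency scale. A self-contained sketch of the Hz-to-mel mapping and evenly mel-spaced band centers follows; the band count and frequency range are illustrative assumptions.

```python
import math

def hz_to_mel(hz):
    """Standard (O'Shaughnessy) mel-scale conversion."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_bands, f_min, f_max):
    """Center frequencies (Hz) of `n_bands` bands spaced evenly on the
    mel scale between f_min and f_max."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

centers = mel_band_centers(10, 0.0, 8000.0)
```

Bands that are evenly spaced in mel are denser at low frequencies, which is what makes the Mel frequency spectrum perceptually motivated.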
S106, analyzing the content restoration degree of the target Mel frequency spectrum to obtain a target speech content restoration degree, and performing voice recombination and voice enhancement on the target Mel frequency spectrum according to the target speech content restoration degree to output a target voice signal.
Specifically, the server performs voice generation on the target Mel frequency spectrum to determine the corresponding voice data. It then performs voice continuity analysis on that voice data to determine a continuity analysis result, and uses this result to analyze the speech content restoration degree of the voice data, determining the target speech content restoration degree. Finally, according to the target speech content restoration degree, the server performs voice recombination and voice enhancement on the target Mel frequency spectrum and outputs the target voice signal.
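One plausible realization of the speech content restoration degree is a similarity ratio between the original transcript and the text recovered from the regenerated speech, compared against the preset target value. The sketch below uses Python's `difflib` ratio as the metric; both the metric and the 0.85 target value are assumptions, not details from the patent.

```python
from difflib import SequenceMatcher

def restoration_degree(original_text, regenerated_text):
    """Similarity ratio in [0, 1] between original and regenerated
    speech content (a stand-in for the patent's restoration degree)."""
    return SequenceMatcher(None, original_text, regenerated_text).ratio()

def should_output(original_text, regenerated_text, target=0.85):
    # Only recombine/enhance and output when the restoration degree
    # exceeds the preset target value.
    return restoration_degree(original_text, regenerated_text) > target

ok = should_output("please mute your microphone",
                   "please mute your microphone")
```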
In the embodiment of the invention, the target voice waveform is subjected to voice decomposition to obtain a discrete voice signal; speaker voice feature recognition is performed on the discrete voice signal to obtain target accent features, and phoneme extraction is performed on the discrete voice signal to obtain initial phoneme data; accent phoneme sequence recognition and processing are carried out on the initial phoneme data according to the target accent features to obtain target phoneme data; and the target phoneme data are input into a Mel frequency spectrum generation model to obtain a target Mel frequency spectrum. By processing the voice signal and the speaker's accent sequences during the conference, the invention improves the voice collection quality of the intelligent conference system, further improves the sound quality of the target voice signal, and improves the experience of conference listeners.
In a specific embodiment, the process of executing step S101 may specifically include the following steps:
(1) Determining multiple groups of far-field multi-microphone devices of a target conference scene and a near-speaking microphone device for a speaker based on the intelligent conference system;
(2) Collecting original voice signals of a speaker through a plurality of groups of far-field multi-microphone devices and near-speaking microphone devices;
(3) And carrying out signal enhancement on the original voice signal through a preset target voice channel to obtain an initial voice signal of a speaker.
Specifically, the server determines the multiple groups of far-field multi-microphone devices of the target conference scene and the near-speaking microphone device for the speaker based on the intelligent conference system. The server analyzes the conference scene of the intelligent conference system to determine its configuration information, then derives the microphone device types and device counts from that configuration, identifying the far-field multi-microphone devices of the target conference scene and the near-speaking microphone device for the speaker. The original voice signal of the speaker is collected through these far-field multi-microphone devices and the near-speaking microphone device, and signal enhancement is applied to it through the preset target voice channel to obtain the speaker's initial voice signal. During collection, the server takes the sound signals captured by the microphone devices and performs feature extraction on them to obtain a first feature vector; this is matched against a second feature vector, which corresponds to the noise signal produced when a microphone device is mistakenly inserted into the wrong microphone jack. From the first and second feature vectors the server determines the speaker's original voice signal, and then performs signal enhancement on it.
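The patent does not detail how the far-field array channels are combined. A common baseline for multi-microphone enhancement is delay-and-sum beamforming, sketched here with assumed integer sample delays (in practice these would come from, e.g., a time-difference-of-arrival estimate).

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its integer sample delay and
    average them (a minimal delay-and-sum beamformer sketch)."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    out = []
    for n in range(length):
        out.append(
            sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        )
    return out

# Two channels carrying the same signal, the second delayed by one sample.
sig = [0.0, 1.0, 0.0, -1.0, 0.0]
mic1 = sig
mic2 = [0.0] + sig[:-1]          # same signal, delayed by one sample
enhanced = delay_and_sum([mic1, mic2], delays=[0, 1])
```

After alignment the coherent speech adds constructively while uncorrelated noise averages down, which is the point of combining the array channels.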
In a specific embodiment, as shown in fig. 2, the process of executing step S102 may specifically include the following steps:
S201, performing waveform conversion on an initial voice signal to obtain a target voice waveform;
S202, performing decomposition level calculation on the target voice waveform according to a preset sequence length value to obtain a target decomposition level;
S203, decomposing the target voice waveform according to the target decomposition level to obtain initial discrete voice;
S204, calling a preset denoising function, and compressing the initial discrete voice to obtain a discrete voice signal.
Specifically, the server performs waveform conversion on the initial voice signal to obtain the target voice waveform. The server first performs voice signal point-waveform analysis on the initial voice signal to obtain multiple voice signal point waveforms; the frequency of occurrence of each point waveform is then converted, by a method convenient for hardware implementation, into an audio decibel value, which can be adjusted according to the decibel level set by the user and converted into audio data. The server then performs waveform conversion on the initial voice signal according to the audio data and the multiple voice signal point waveforms to obtain the target voice waveform. Next, the server performs decomposition-level calculation on the target voice waveform according to the preset sequence length value: it determines a level standard length from the preset sequence length value and computes the target decomposition level from that standard length. The server then decomposes the target voice waveform according to the target decomposition level to obtain the initial discrete voice, and finally calls the preset denoising function to compress the initial discrete voice into the discrete voice signal.
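Steps S202 to S204 read like a wavelet-style scheme: a decomposition level derived from a sequence length value, level-by-level decomposition, then a denoising/compression pass. Under that assumption (the patent does not name the transform or the denoising function), a minimal Haar-style sketch:

```python
import math

def decomposition_level(n, seq_len=2):
    """Number of halving levels available for a signal of length n,
    relative to a preset minimum sequence length (assumed semantics)."""
    return max(0, int(math.log2(n / seq_len)))

def haar_step(signal):
    """One Haar level: pairwise averages (approximation) and
    differences (detail). Assumes an even-length signal."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def soft_threshold(coeffs, t):
    """Shrink small coefficients toward zero (denoising/compression)."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

x = [4.0, 2.0, 5.0, 5.0, 1.0, 1.0, 2.0, 0.0]
level = decomposition_level(len(x))      # levels available for this length
approx, detail = haar_step(x)            # one decomposition level
detail = soft_threshold(detail, 0.5)     # suppress small detail coefficients
```

Repeating `haar_step` on `approx` for the computed number of levels, thresholding each detail band, yields the compressed discrete representation.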
In a specific embodiment, as shown in fig. 3, the process of executing step S103 may specifically include the following steps:
S301, extracting acoustic features of the discrete voice signal to obtain multiple voice features;
S302, inputting the multiple voice features into a preset speaker voice feature recognition model for accent feature recognition to obtain target accent features of the speaker;
S303, extracting phonemes of the discrete voice signal to obtain a phoneme state sequence;
S304, generating initial phoneme data corresponding to the discrete voice signal according to the phoneme state sequence.
Specifically, the server extracts acoustic features of the discrete voice signal to obtain a plurality of voice features, and inputs those features into a preset speaker voice feature recognition model for accent feature recognition, obtaining the target accent features of the speaker. That is, the server acquires the discrete voice signal, extracts the corresponding acoustic features, and feeds the resulting voice features into the speaker voice feature recognition model, which recognizes the speaker's target accent features.
The server also extracts phonemes from the discrete voice signal to obtain a phoneme state sequence. Here, the server acquires a priori phoneme set obtained by manually labeling the discrete voice signal, trains a preset voice phoneme extraction model on this priori set to obtain a trained model, and uses the trained model to extract phonemes from the discrete voice signal, yielding the phoneme state sequence. Finally, the server generates the initial phoneme data corresponding to the discrete voice signal from the phoneme state sequence.
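The acoustic features in step S301 are not enumerated in the source. The following minimal sketch uses frame-level log energy and zero-crossing rate as stand-in features (both assumptions; a real system would likely use richer features such as MFCCs) to illustrate the frame-by-frame shape of the step's output:

```python
import math

def frame_features(signal, frame_len=256, hop=128):
    """Split a signal into overlapping frames and compute two toy acoustic
    features per frame: log energy and zero-crossing rate. This only
    illustrates the shape of step S301's output (one feature vector per
    frame); the patent's actual feature set is unspecified."""
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame)
        log_energy = math.log(energy + 1e-10)
        # fraction of adjacent sample pairs whose signs differ
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        ) / (frame_len - 1)
        features.append((log_energy, zcr))
    return features

# 1 kHz tone sampled at 8 kHz, one second long
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(8000)]
feats = frame_features(tone)
```

The zero-crossing rate of the 1 kHz tone comes out near 0.25, since the waveform crosses zero twice per 8-sample period.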
In a specific embodiment, as shown in fig. 4, the process of executing step S104 may specifically include the following steps:
S401, carrying out accent phoneme sequence coding processing on the initial phoneme data to obtain a phoneme sequence vector;
S402, carrying out accent sequence detection on the phoneme sequence vector according to the target accent characteristics to obtain at least one accent sequence;
S403, replacing at least one accent sequence in the phoneme sequence vector to generate target phoneme data.
Specifically, the server performs accent phoneme sequence encoding on the initial phoneme data to obtain a phoneme sequence vector. It then performs accent sequence detection on this vector according to the target accent features: during detection, the server computes the feature similarity between the phoneme sequence vector and the accent features, and uses the resulting similarity scores to identify at least one accent sequence. Finally, the server replaces the detected accent sequence(s) within the phoneme sequence vector to generate the target phoneme data.
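The similarity-based detection and replacement in steps S401–S403 can be sketched as follows, assuming cosine similarity against the target accent feature and a fixed threshold. The threshold value and the standard-pronunciation replacement vector are illustrative assumptions; the patent specifies neither the similarity measure nor the replacement rule:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def replace_accented(phoneme_vectors, accent_feature, standard_vector,
                     threshold=0.9):
    """Detect accent sequences (vectors similar to the target accent
    feature) and replace them with a standard pronunciation vector.
    Threshold and replacement strategy are illustrative assumptions."""
    result = []
    for vec in phoneme_vectors:
        if cosine_similarity(vec, accent_feature) >= threshold:
            result.append(list(standard_vector))  # accented: replace
        else:
            result.append(list(vec))              # keep as-is
    return result

accent = [1.0, 0.0]     # hypothetical target accent feature
standard = [0.0, 1.0]   # hypothetical standard pronunciation vector
seq = [[0.98, 0.1], [0.1, 0.99], [2.0, 0.05]]
cleaned = replace_accented(seq, accent, standard)
```

Only the first and third vectors align closely with the accent feature, so only they are replaced; the second passes through unchanged.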
In a specific embodiment, the process of executing step S105 may specifically include the following steps:
(1) Inputting the target phoneme data into a preset mel spectrum generation model, wherein the mel spectrum generation model comprises: a bidirectional long short-term memory (BiLSTM) network, a double-layer gated recurrent network, and a mel spectrum generation network;
(2) Extracting features of the target phoneme data through the bidirectional long short-term memory network to obtain feature vectors;
(3) Inputting the feature vectors into the double-layer gated recurrent network to perform vector feature conversion processing, obtaining a target conversion vector;
(4) Inputting the target conversion vector into the mel spectrum generation network to generate a mel spectrum, obtaining the target mel spectrum.
Specifically, the server inputs the target phoneme data into a preset mel spectrum generation model, which comprises a bidirectional long short-term memory (BiLSTM) network, a double-layer gated recurrent network, and a mel spectrum generation network. The server performs feature extraction on the target phoneme data through the BiLSTM network to obtain feature vectors; during training, a deep learning network is pre-trained with target phoneme data to obtain the mel spectrum generation model, and the hidden features of its last n layers serve as intermediate variables from which phoneme-duration feature vectors are extracted. The server then inputs the feature vectors into the double-layer gated recurrent network for vector feature conversion to obtain a target conversion vector, and finally inputs the target conversion vector into the mel spectrum generation network to generate the target mel spectrum.
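The mel spectrum that the generation network targets is defined on the standard mel frequency scale. The trained network weights cannot be reproduced here, but the underlying hertz-to-mel mapping (the common HTK-style formula, assumed since the patent does not name a variant) is a fixed conversion:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style hertz-to-mel conversion."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping, mel back to hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# the scale is constructed so that 1 kHz lands near 1000 mel
m = hz_to_mel(1000.0)
```

The mel axis compresses high frequencies relative to low ones, which is why spectrograms on this scale track perceived pitch better than linear-frequency spectrograms.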
In a specific embodiment, the process of executing step S106 may specifically include the following steps:
(1) Inputting the target Mel frequency spectrum into a preset voice generation network for voice generation to obtain voice data corresponding to the target Mel frequency spectrum;
(2) Performing speaking content restoration degree calculation on voice data corresponding to the target Mel frequency spectrum to obtain target speaking content restoration degree;
(3) If the target speaking content restoration degree exceeds the preset target value, carrying out voice recombination and voice enhancement on voice data corresponding to the target Mel frequency spectrum, and outputting a target voice signal.
Specifically, the server inputs the target Mel frequency spectrum into a preset voice generation network to obtain the corresponding voice data, then calculates the speaking content restoration degree of that voice data. To do so, the server performs voice continuity analysis on the generated voice data to determine a voice continuity analysis result, and from that result derives the target speaking content restoration degree. Finally, the server evaluates this value: if the target speaking content restoration degree exceeds the preset target value, the server performs voice recombination and voice enhancement on the voice data corresponding to the target Mel frequency spectrum and outputs the target voice signal.
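The restoration-degree check in steps (2)–(3) amounts to a threshold gate on the generated voice data. A minimal sketch of that gating logic, where the restoration score itself is assumed to arrive from an upstream continuity analysis (the 0.8 default target and the pass-through "enhancement" are illustrative stand-ins, not the patent's actual values):

```python
def gate_by_restoration(voice_data, restoration_degree, target=0.8):
    """Pass voice data on to recombination/enhancement only when the
    speaking-content restoration degree exceeds the preset target value.
    The target default and the enhancement step are illustrative."""
    if restoration_degree > target:
        # placeholder 'recombination and enhancement': here just a copy;
        # the patent does not specify the actual enhancement operations
        return list(voice_data)
    return None  # restoration too low: no target voice signal is output

out = gate_by_restoration([0.1, 0.2, 0.3], restoration_degree=0.95)
```

Only data clearing the threshold is emitted as a target voice signal; below-threshold data is discarded rather than enhanced.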
The method for collecting voice data of the intelligent conference system in the embodiment of the present invention is described above. The following describes a voice data collection device of the intelligent conference system in the embodiment of the present invention; referring to fig. 5, one embodiment of the voice data collection device of the intelligent conference system includes:
the acquisition module 501 is used for determining, based on the intelligent conference system, a plurality of groups of far-field multi-microphone devices of a target conference scene and a near-speaking microphone device for the speaker, and collecting initial voice signals of the speaker through a preset target voice channel;
the decomposition module 502 is configured to perform waveform conversion on the initial speech signal to obtain a target speech waveform, and perform speech decomposition on the target speech waveform to obtain a discrete speech signal;
the recognition module 503 is configured to perform speaker population voice feature recognition on the discrete speech signal to obtain the target accent features of the speaker, and to perform phoneme extraction on the discrete speech signal to obtain initial phoneme data;
a processing module 504, configured to perform accent phoneme sequence recognition and processing on the initial phoneme data according to the target accent feature, so as to obtain target phoneme data;
the conversion module 505 is configured to input the target phoneme data into a preset mel frequency spectrum generation model to generate a mel frequency spectrum, so as to obtain a target mel frequency spectrum;
and the output module 506 is configured to perform a content restoration degree analysis on the target mel frequency spectrum to obtain a content restoration degree of target speech, perform speech recombination and speech enhancement on the target mel frequency spectrum according to the content restoration degree of target speech, and output a target speech signal.
Through the cooperative operation of the above components, the target voice waveform is decomposed to obtain a discrete voice signal; speaker population voice feature recognition is performed on the discrete voice signal to obtain target accent features, and phoneme extraction yields initial phoneme data; accent phoneme sequence recognition and processing are performed on the initial phoneme data according to the target accent features to obtain target phoneme data; and the target phoneme data is input into the mel spectrum generation model to obtain a target mel spectrum. By processing the voice signal and the speaker's accent sequences during the conference, the invention improves the voice collection quality of the intelligent conference system, improves the sound quality of the target voice signal, and improves the experience of conference listeners.
The voice data collection device of the intelligent conference system in the embodiment of the present invention is described in detail above from the perspective of modularized functional entities with reference to fig. 5; the following describes the voice data collection device of the intelligent conference system from the perspective of hardware processing.
Fig. 6 is a schematic structural diagram of a voice data collection device of an intelligent conference system according to an embodiment of the present invention. The voice data collection device 600 of the intelligent conference system may vary considerably in configuration and performance, and may include one or more processors (central processing units, CPU) 610 (e.g., one or more processors), a memory 620, and one or more storage media 630 (e.g., one or more mass storage devices) storing application programs 633 or data 632. The memory 620 and the storage medium 630 may be transitory or persistent storage. The program stored on the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the voice data acquisition device 600. Further, the processor 610 may be configured to communicate with the storage medium 630 to execute the series of instruction operations in the storage medium 630 on the voice data acquisition device 600 of the intelligent conference system.
The voice data collection device 600 of the intelligent conference system may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input/output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the device structure shown in fig. 6 does not constitute a limitation of the voice data collection device of the intelligent conference system, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
The invention also provides voice data acquisition equipment of the intelligent conference system, which comprises a memory and a processor, wherein the memory stores computer readable instructions, and when the computer readable instructions are executed by the processor, the processor executes the steps of the voice data acquisition method of the intelligent conference system in the above embodiments.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and may also be a volatile computer readable storage medium, where instructions are stored in the computer readable storage medium, where the instructions, when executed on a computer, cause the computer to perform the steps of the voice data collection method of the intelligent conference system.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice data acquisition method of an intelligent conference system, characterized by comprising the following steps:
determining multiple groups of far-field multi-microphone devices of a target conference scene and near-speaking microphone devices aiming at a speaker based on an intelligent conference system, and collecting initial voice signals of the speaker through a preset target voice channel;
performing waveform conversion on the initial voice signal to obtain a target voice waveform, and performing voice decomposition on the target voice waveform to obtain a discrete voice signal;
performing speaking population voice characteristic recognition on the discrete voice signals to obtain target accent characteristics of the speaking person, and performing phoneme extraction on the discrete voice signals to obtain initial phoneme data;
according to the target accent characteristics, accent phoneme sequence recognition and processing are carried out on the initial phoneme data, so that target phoneme data are obtained;
inputting the target phoneme data into a preset Mel frequency spectrum generation model to generate Mel frequency spectrum, so as to obtain a target Mel frequency spectrum;
and analyzing the content restoration degree of the target Mel frequency spectrum to obtain the content restoration degree of the target speaking, and carrying out voice recombination and voice enhancement on the target Mel frequency spectrum according to the content restoration degree of the target speaking to output a target voice signal.
2. The voice data collection method of the intelligent conference system according to claim 1, wherein the determining, based on the intelligent conference system, a plurality of far-field multi-microphone devices of a target conference scene and a near-speaking microphone device for a speaker, and collecting an initial voice signal of the speaker through a preset target voice channel, includes:
determining multiple far-field multiple microphone devices of a target conference scene based on the intelligent conference system and near-speaking microphone devices for a speaker;
collecting original voice signals of the speaker through the plurality of far-field multi-microphone devices and the near-speaking microphone device;
and carrying out signal enhancement on the original voice signal through a preset target voice channel to obtain an initial voice signal of the speaker.
3. The method for collecting voice data of intelligent conference system according to claim 1, wherein said performing waveform conversion on said initial voice signal to obtain a target voice waveform, and performing voice decomposition on said target voice waveform to obtain a discrete voice signal comprises:
performing waveform conversion on the initial voice signal to obtain a target voice waveform;
according to a preset sequence length value, performing decomposition level calculation on the target voice waveform to obtain a target decomposition level;
decomposing the target voice waveform according to the target decomposition level to obtain initial discrete voice;
and calling a preset denoising function, and compressing the initial discrete voice to obtain a discrete voice signal.
4. The method for collecting voice data of intelligent conference system according to claim 1, wherein said performing speaker demographic recognition on said discrete voice signal to obtain target accent features of said speaker, and performing phoneme extraction on said discrete voice signal to obtain initial phoneme data comprises:
extracting acoustic features of the discrete voice signals to obtain a plurality of voice features;
inputting the voice features into a preset speaker voice feature recognition model to perform accent feature recognition to obtain target accent features of the speaker;
extracting phonemes from the discrete voice signals to obtain a phoneme state sequence;
and generating initial phoneme data corresponding to the discrete voice signals according to the phoneme state sequence.
5. The method for collecting voice data of intelligent conference system according to claim 1, wherein said performing accent phoneme sequence recognition and processing on said initial phoneme data according to said target accent features to obtain target phoneme data comprises:
carrying out accent phoneme sequence coding processing on the initial phoneme data to obtain a phoneme sequence vector;
according to the target accent characteristics, carrying out accent sequence detection on the phoneme sequence vector to obtain at least one accent sequence;
and carrying out replacement processing on at least one accent sequence in the phoneme sequence vector to generate target phoneme data.
6. The method for collecting voice data of an intelligent conference system according to claim 1, wherein inputting the target phoneme data into a preset mel spectrum generation model to generate a mel spectrum, and obtaining a target mel spectrum comprises:
inputting the target phoneme data into a preset mel spectrum generation model, wherein the mel spectrum generation model comprises: a bidirectional long short-term memory (BiLSTM) network, a double-layer gated recurrent network, and a mel spectrum generation network;
extracting the features of the target phoneme data through the bidirectional long short-term memory network to obtain a feature vector;
inputting the feature vector into the double-layer gated recurrent network to perform vector feature conversion processing to obtain a target conversion vector;
and inputting the target conversion vector into the mel spectrum generation network to generate a mel spectrum, so as to obtain a target mel spectrum.
7. The method for collecting voice data of intelligent conference system according to claim 1, wherein said analyzing the content of speaking restoration degree of the target mel frequency spectrum to obtain the content of speaking restoration degree of the target, and according to the content of speaking restoration degree of the target, performing voice recombination and voice enhancement on the target mel frequency spectrum to output a target voice signal, comprises:
inputting the target Mel frequency spectrum into a preset voice generation network for voice generation to obtain voice data corresponding to the target Mel frequency spectrum;
performing speaking content restoration degree calculation on the voice data corresponding to the target Mel frequency spectrum to obtain target speaking content restoration degree;
and if the target speaking content restoration degree exceeds a preset target value, performing voice recombination and voice enhancement on voice data corresponding to the target Mel frequency spectrum, and outputting a target voice signal.
8. A voice data collection device of an intelligent conference system, characterized in that the voice data collection device of the intelligent conference system comprises:
the acquisition module is used for determining a plurality of groups of far-field multi-microphone devices of a target conference scene and near-speaking microphone devices aiming at a speaker based on the intelligent conference system, and acquiring initial voice signals of the speaker through a preset target voice channel;
the decomposition module is used for performing waveform conversion on the initial voice signal to obtain a target voice waveform, and performing voice decomposition on the target voice waveform to obtain a discrete voice signal;
the recognition module is used for carrying out speaking population voice characteristic recognition on the discrete voice signals to obtain target accent characteristics of the speaking person, and carrying out phoneme extraction on the discrete voice signals to obtain initial phoneme data;
the processing module is used for carrying out accent phoneme sequence recognition and processing on the initial phoneme data according to the target accent characteristics to obtain target phoneme data;
the conversion module is used for inputting the target phoneme data into a preset Mel frequency spectrum generation model to generate Mel frequency spectrum, so as to obtain a target Mel frequency spectrum;
and the output module is used for analyzing the content restoration degree of the target Mel frequency spectrum to obtain the content restoration degree of the target speech, and carrying out voice recombination and voice enhancement on the target Mel frequency spectrum according to the content restoration degree of the target speech to output a target voice signal.
9. The voice data acquisition device of the intelligent conference system is characterized by comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invoking the instructions in the memory to cause a voice data collection device of the intelligent conference system to perform the voice data collection method of the intelligent conference system of any of claims 1-7.
10. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the method of voice data collection of an intelligent conference system according to any of claims 1-7.
CN202310384553.0A 2023-04-12 2023-04-12 Voice data acquisition method and related device of intelligent conference system Active CN116110373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310384553.0A CN116110373B (en) 2023-04-12 2023-04-12 Voice data acquisition method and related device of intelligent conference system


Publications (2)

Publication Number Publication Date
CN116110373A CN116110373A (en) 2023-05-12
CN116110373B true CN116110373B (en) 2023-06-09

Family

ID=86256507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310384553.0A Active CN116110373B (en) 2023-04-12 2023-04-12 Voice data acquisition method and related device of intelligent conference system

Country Status (1)

Country Link
CN (1) CN116110373B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015001492A1 (en) * 2013-07-02 2015-01-08 Family Systems, Limited Systems and methods for improving audio conferencing services
CN110300001A (en) * 2019-05-21 2019-10-01 深圳壹账通智能科技有限公司 Conference audio control method, system, equipment and computer readable storage medium
CN110914898A (en) * 2018-05-28 2020-03-24 北京嘀嘀无限科技发展有限公司 System and method for speech recognition
WO2020250016A1 (en) * 2019-06-14 2020-12-17 Cedat 85 S.R.L. Apparatus for processing an audio signal for the generation of a multimedia file with speech transcription
WO2021024869A1 (en) * 2019-08-02 2021-02-11 日本電気株式会社 Speech processing device, speech processing method, and recording medium
CN113345450A (en) * 2021-06-25 2021-09-03 平安科技(深圳)有限公司 Voice conversion method, device, equipment and storage medium
CN114141237A (en) * 2021-11-06 2022-03-04 招联消费金融有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN114203180A (en) * 2021-11-16 2022-03-18 广西中科曙光云计算有限公司 Conference summary generation method and device, electronic equipment and storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8849666B2 (en) * 2012-02-23 2014-09-30 International Business Machines Corporation Conference call service with speech processing for heavily accented speakers

Non-Patent Citations (2)

Title
Phone set construction based on context-sensitive articulatory attributes for code-switching speech recognition; Chung-Hsien Wu; 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); full text *
Remote image acquisition *** based on speech and interaction; Wang Yongsheng; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN116110373A (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
CN110223705B (en) Voice conversion method, device, equipment and readable storage medium
CN106486130B (en) Noise elimination and voice recognition method and device
CN106847305B (en) Method and device for processing recording data of customer service telephone
CN105118522B (en) Noise detection method and device
CN105118501A (en) Speech recognition method and system
KR20050115857A (en) System and method for speech processing using independent component analysis under stability constraints
CN107316635B (en) Voice recognition method and device, storage medium and electronic equipment
CN104766608A (en) Voice control method and voice control device
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
CN112242149B (en) Audio data processing method and device, earphone and computer readable storage medium
CN112614510B (en) Audio quality assessment method and device
CN108399913B (en) High-robustness audio fingerprint identification method and system
CN111710332B (en) Voice processing method, device, electronic equipment and storage medium
CN115050372A (en) Audio segment clustering method and device, electronic equipment and medium
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
JP4703648B2 (en) Vector codebook generation method, data compression method and apparatus, and distributed speech recognition system
CN113782044A (en) Voice enhancement method and device
CN116110373B (en) Voice data acquisition method and related device of intelligent conference system
CN112466287A (en) Voice segmentation method and device and computer readable storage medium
CN112652309A (en) Dialect voice conversion method, device, equipment and storage medium
CN111108553A (en) Voiceprint detection method, device and equipment for sound collection object
CN110767238B (en) Blacklist identification method, device, equipment and storage medium based on address information
CN111489745A (en) Chinese speech recognition system applied to artificial intelligence
CN112927680B (en) Voiceprint effective voice recognition method and device based on telephone channel

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant