CN118098266A - Voice data processing method and device based on multi-model selection


Info

Publication number
CN118098266A
Authority
CN
China
Prior art keywords: voice, voice data, data, similarity, text
Legal status
Pending
Application number
CN202410151187.9A
Other languages
Chinese (zh)
Inventor
汪刚
Current Assignee
Zhongchuang Technology Guangzhou Co ltd
Original Assignee
Zhongchuang Technology Guangzhou Co ltd
Application filed by Zhongchuang Technology Guangzhou Co ltd



Abstract

The invention discloses a voice data processing method and device based on multi-model selection. The method comprises: acquiring a plurality of voice data of a target user; screening, according to a preset voice screening algorithm, target voice data having an image generation purpose from the plurality of voice data; determining a corresponding image generation algorithm model from a plurality of candidate algorithm models according to data parameters of the target voice data; and inputting the target voice data into the image generation algorithm model to obtain image data corresponding to the target voice data. The method and device can thus increase the degree of automation and intelligence of generating images from voice, reduce the user's operation cost, and improve algorithm efficiency and effect.

Description

Voice data processing method and device based on multi-model selection
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech data processing method and apparatus based on multi-model selection.
Background
With the development of algorithm technology, user demand for generating images from keywords keeps growing, and the application scenario of automatically generating images from a user's voice can effectively improve user experience, so it is favored by some users. However, when realizing this application scenario, the prior art generally only uses a trained algorithm to directly generate or match an image from the voice, without fully exploiting the data characteristics to improve the accuracy of the voice data and the rationality of model selection, so the prediction efficiency and effect are not ideal. It can be seen that the prior art has defects that need to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a voice data processing method and device based on multi-model selection, which can increase the degree of automation and intelligence of generating images from voice, reduce the user's operation cost, and improve algorithm efficiency and effect.
To solve the above technical problem, a first aspect of the present invention discloses a voice data processing method based on multi-model selection, the method comprising:
acquiring a plurality of voice data of a target user;
screening, according to a preset voice screening algorithm, target voice data having an image generation purpose from the plurality of voice data;
determining a corresponding image generation algorithm model from a plurality of candidate algorithm models according to data parameters of the target voice data;
and inputting the target voice data into the image generation algorithm model to obtain image data corresponding to the target voice data.
In the first aspect of the present invention, the screening, according to a preset voice screening algorithm, of target voice data having an image generation purpose from the plurality of voice data comprises:
determining a voice text and a voice mood corresponding to each piece of voice data according to a neural network algorithm;
calculating, based on a matching algorithm and a similarity algorithm, a text purpose matching degree and a mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood;
and screening target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree.
In the first aspect of the present invention, the determining, according to a neural network algorithm, of a voice text and a voice mood corresponding to each piece of voice data comprises:
for each piece of voice data, performing noise reduction processing on the voice data to obtain corresponding noise-reduced data;
inputting the noise-reduced data into a trained text recognition neural network model to obtain the voice text corresponding to the voice data, the text recognition neural network being trained on a training data set comprising a plurality of training voice data and corresponding text labels;
and inputting the noise-reduced data into a trained mood recognition neural network model to obtain the voice mood corresponding to the voice data, the mood recognition neural network being trained on a training data set comprising a plurality of training voice data and corresponding mood labels; the voice mood or mood label is one of, or a combination of, a purpose mood and an emotional mood; the purpose mood is a command mood, a statement mood or a question mood; the emotional mood is a neutral mood, an angry mood, a happy mood or a low mood.
In the first aspect of the present invention, the calculating, based on a matching algorithm and a similarity algorithm, of the text purpose matching degree and the mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood comprises:
for each piece of voice data, calculating the average of the text similarities between the voice text of the voice data and a plurality of preset standard image generation command texts to obtain a first similarity parameter corresponding to the voice data;
calculating the number of matching keywords in the voice text of the voice data according to a preset text keyword matching rule;
calculating the product of the first similarity parameter, the number of matching keywords and a duration weight to obtain the text purpose matching degree corresponding to the voice data, the duration weight being proportional to the voice duration of the voice data;
calculating the average of the mood similarities between the voice mood of the voice data and the voice moods of a plurality of preset historical voice data samples to obtain a second similarity parameter corresponding to the voice data;
calculating the average of the mood similarities between the voice mood of the voice data and the voice moods of the two pieces of voice data adjacent in acquisition time to obtain a third similarity parameter corresponding to the voice data;
and calculating the product of the second similarity parameter, the third similarity parameter and the duration weight to obtain the mood type matching degree corresponding to the voice data.
In the first aspect of the present invention, the screening of target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree comprises:
calculating a weighted average of the text purpose matching degree and the mood type matching degree corresponding to each piece of voice data to obtain a matching parameter of each piece of voice data;
sorting all the voice data in descending order of the matching parameter to obtain a first voice sequence;
screening out the candidate voice data that rank within a preset first number at the head of the first voice sequence and whose matching parameters are greater than a first parameter threshold;
calculating the average of the acquisition-time differences between each candidate voice data and all the other candidate voice data to obtain a time parameter corresponding to each candidate voice data;
sorting all the candidate voice data in descending order of the time parameter to obtain a second voice sequence;
and screening out the candidate voice data that rank within a preset second number at the head of the second voice sequence and whose time parameters are greater than a second parameter threshold, to obtain the target voice data.
In the first aspect of the present invention, the determining, according to the data parameters of the target voice data, of a corresponding image generation algorithm model from a plurality of candidate algorithm models comprises:
determining the number of voices, the voice mood set, the voice text set and the average voice duration corresponding to the target voice data;
calculating, according to the number of voices, the voice mood set, the voice text set and the average voice duration, the data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set, the training data set corresponding to each candidate algorithm model comprising a plurality of training voice data and corresponding image labels;
and determining the corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity.
In an optional implementation manner of the first aspect of the present invention, the calculating, according to the number of voices, the voice mood set, the voice text set and the average voice duration, of the data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set comprises:
for each candidate algorithm model, acquiring a plurality of training voice data sets in the training data set corresponding to the candidate algorithm model;
calculating the average of the differences between the number of voices and the number of voices in each training voice data set to obtain a number similarity;
calculating the average of the overlap ratios between the voice mood set and the mood label set corresponding to each training voice data set to obtain a mood similarity;
calculating the average of the text similarities between the voice text set and the text label set corresponding to each training voice data set to obtain a text similarity;
calculating the average of the duration differences between the average voice duration and the average voice duration corresponding to each training voice data set to obtain a duration similarity;
and calculating the product of the number similarity, the mood similarity, the text similarity and the duration similarity to obtain the data similarity corresponding to the candidate algorithm model.
In the first aspect of the present invention, the determining of the corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity comprises:
sorting all the candidate algorithm models in descending order of the data similarity to obtain a model sequence;
screening out the candidate algorithm models that rank within a preset third number at the head of the model sequence and whose data similarity is greater than a third parameter threshold, to obtain the image generation algorithm model;
and the inputting of the target voice data into the image generation algorithm model to obtain image data corresponding to the target voice data comprises:
when the number of image generation algorithm models is greater than 1, inputting the target voice data into each image generation algorithm model respectively to obtain a plurality of output predicted image data;
calculating the average of the image similarities between each predicted image data and a plurality of images confirmed by the target user in a historical time period to obtain a similarity parameter of each predicted image data;
and determining the predicted image data with the largest similarity parameter as the image data corresponding to the target voice data.
A second aspect of the present invention discloses a voice data processing device based on multi-model selection, the device comprising:
an acquisition module, configured to acquire a plurality of voice data of a target user;
a screening module, configured to screen, according to a preset voice screening algorithm, target voice data having an image generation purpose from the plurality of voice data;
a determining module, configured to determine a corresponding image generation algorithm model from a plurality of candidate algorithm models according to data parameters of the target voice data;
and a prediction module, configured to input the target voice data into the image generation algorithm model to obtain image data corresponding to the target voice data.
In the second aspect of the present invention, the specific manner in which the screening module screens, according to a preset voice screening algorithm, target voice data having an image generation purpose from the plurality of voice data includes:
determining a voice text and a voice mood corresponding to each piece of voice data according to a neural network algorithm;
calculating, based on a matching algorithm and a similarity algorithm, a text purpose matching degree and a mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood;
and screening target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree.
In the second aspect of the present invention, the specific manner in which the screening module determines, according to a neural network algorithm, the voice text and the voice mood corresponding to each piece of voice data includes:
for each piece of voice data, performing noise reduction processing on the voice data to obtain corresponding noise-reduced data;
inputting the noise-reduced data into a trained text recognition neural network model to obtain the voice text corresponding to the voice data, the text recognition neural network being trained on a training data set comprising a plurality of training voice data and corresponding text labels;
and inputting the noise-reduced data into a trained mood recognition neural network model to obtain the voice mood corresponding to the voice data, the mood recognition neural network being trained on a training data set comprising a plurality of training voice data and corresponding mood labels; the voice mood or mood label is one of, or a combination of, a purpose mood and an emotional mood; the purpose mood is a command mood, a statement mood or a question mood; the emotional mood is a neutral mood, an angry mood, a happy mood or a low mood.
In the second aspect of the present invention, the specific manner in which the screening module calculates, based on a matching algorithm and a similarity algorithm, the text purpose matching degree and the mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood includes:
for each piece of voice data, calculating the average of the text similarities between the voice text of the voice data and a plurality of preset standard image generation command texts to obtain a first similarity parameter corresponding to the voice data;
calculating the number of matching keywords in the voice text of the voice data according to a preset text keyword matching rule;
calculating the product of the first similarity parameter, the number of matching keywords and a duration weight to obtain the text purpose matching degree corresponding to the voice data, the duration weight being proportional to the voice duration of the voice data;
calculating the average of the mood similarities between the voice mood of the voice data and the voice moods of a plurality of preset historical voice data samples to obtain a second similarity parameter corresponding to the voice data;
calculating the average of the mood similarities between the voice mood of the voice data and the voice moods of the two pieces of voice data adjacent in acquisition time to obtain a third similarity parameter corresponding to the voice data;
and calculating the product of the second similarity parameter, the third similarity parameter and the duration weight to obtain the mood type matching degree corresponding to the voice data.
In the second aspect of the present invention, the specific manner in which the screening module screens target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree includes:
calculating a weighted average of the text purpose matching degree and the mood type matching degree corresponding to each piece of voice data to obtain a matching parameter of each piece of voice data;
sorting all the voice data in descending order of the matching parameter to obtain a first voice sequence;
screening out the candidate voice data that rank within a preset first number at the head of the first voice sequence and whose matching parameters are greater than a first parameter threshold;
calculating the average of the acquisition-time differences between each candidate voice data and all the other candidate voice data to obtain a time parameter corresponding to each candidate voice data;
sorting all the candidate voice data in descending order of the time parameter to obtain a second voice sequence;
and screening out the candidate voice data that rank within a preset second number at the head of the second voice sequence and whose time parameters are greater than a second parameter threshold, to obtain the target voice data.
In the second aspect of the present invention, the specific manner in which the determining module determines a corresponding image generation algorithm model from a plurality of candidate algorithm models according to the data parameters of the target voice data includes:
determining the number of voices, the voice mood set, the voice text set and the average voice duration corresponding to the target voice data;
calculating, according to the number of voices, the voice mood set, the voice text set and the average voice duration, the data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set, the training data set corresponding to each candidate algorithm model comprising a plurality of training voice data and corresponding image labels;
and determining the corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity.
In the second aspect of the present invention, the specific manner in which the determining module calculates, according to the number of voices, the voice mood set, the voice text set and the average voice duration, the data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set includes:
for each candidate algorithm model, acquiring a plurality of training voice data sets in the training data set corresponding to the candidate algorithm model;
calculating the average of the differences between the number of voices and the number of voices in each training voice data set to obtain a number similarity;
calculating the average of the overlap ratios between the voice mood set and the mood label set corresponding to each training voice data set to obtain a mood similarity;
calculating the average of the text similarities between the voice text set and the text label set corresponding to each training voice data set to obtain a text similarity;
calculating the average of the duration differences between the average voice duration and the average voice duration corresponding to each training voice data set to obtain a duration similarity;
and calculating the product of the number similarity, the mood similarity, the text similarity and the duration similarity to obtain the data similarity corresponding to the candidate algorithm model.
In an optional implementation manner of the second aspect of the present invention, the specific manner in which the determining module determines the corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity includes:
sorting all the candidate algorithm models in descending order of the data similarity to obtain a model sequence;
screening out the candidate algorithm models that rank within a preset third number at the head of the model sequence and whose data similarity is greater than a third parameter threshold, to obtain the image generation algorithm model;
and the specific manner in which the prediction module inputs the target voice data into the image generation algorithm model to obtain the image data corresponding to the target voice data includes:
when the number of image generation algorithm models is greater than 1, inputting the target voice data into each image generation algorithm model respectively to obtain a plurality of output predicted image data;
calculating the average of the image similarities between each predicted image data and a plurality of images confirmed by the target user in a historical time period to obtain a similarity parameter of each predicted image data;
and determining the predicted image data with the largest similarity parameter as the image data corresponding to the target voice data.
A third aspect of the present invention discloses another voice data processing apparatus based on multi-model selection, the apparatus comprising:
a memory storing executable program code;
a processor coupled to the memory;
wherein the processor invokes the executable program code stored in the memory to perform some or all of the steps of the voice data processing method based on multi-model selection disclosed in the first aspect of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
The embodiments of the present invention can screen out voice data having an image generation purpose from a plurality of voice data according to a screening algorithm, and select the most suitable image generation algorithm model from a plurality of candidate algorithm models according to the data parameters to generate an image, thereby increasing the degree of automation and intelligence of generating images from voice, reducing the user's operation cost, and improving algorithm efficiency and effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a voice data processing method based on multi-model selection according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a voice data processing device based on multi-model selection according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of another voice data processing device based on multi-model selection according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
The terms "second," "second," and the like in the description and in the claims and in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or elements but may, in the alternative, include other steps or elements not expressly listed or inherent to such process, method, article, or device.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The invention discloses a voice data processing method and device based on multi-model selection, which can screen out voice data having an image generation purpose from a plurality of voice data according to a screening algorithm, and select the most suitable image generation algorithm from a plurality of candidate algorithm models according to data parameters to generate an image, thereby increasing the degree of automation and intelligence of generating images from voice, reducing the user's operation cost, and improving algorithm efficiency and effect. This is described in detail below.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of a voice data processing method based on multi-model selection according to an embodiment of the invention. The voice data processing method based on multi-model selection described in fig. 1 is applied to a data processing chip, a processing terminal or a processing server (wherein the processing server may be a local server or a cloud server). As shown in fig. 1, the voice data processing method based on multi-model selection may include the following operations:
101. A plurality of voice data of a target user is acquired.
Optionally, a voice acquisition device may collect a plurality of voice data uttered by the target user within a short time period. The voice data may be separated as follows: if no signal above a preset volume threshold is received within a preset time interval, the current voice data is considered finished, and subsequently collected audio is treated as new voice data.
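As a minimal Python sketch of this silence-based separation (the function name, NumPy representation and threshold values are illustrative assumptions, not part of the disclosure), the following splits a mono waveform into utterances wherever the signal stays below a volume threshold for a preset interval:

import numpy as np

def split_utterances(samples, sample_rate, volume_threshold=0.02,
                     silence_interval=0.5):
    # An utterance is considered finished once no sample exceeds
    # volume_threshold for silence_interval seconds; later samples
    # start a new utterance. All parameter values are illustrative.
    gap = int(silence_interval * sample_rate)
    loud = np.flatnonzero(np.abs(samples) > volume_threshold)
    if loud.size == 0:
        return []
    # Start a new segment wherever consecutive above-threshold samples
    # are separated by more than the allowed silent gap.
    breaks = np.flatnonzero(np.diff(loud) > gap)
    starts = np.r_[loud[0], loud[breaks + 1]]
    ends = np.r_[loud[breaks], loud[-1]]
    return [samples[s:e + 1] for s, e in zip(starts, ends)]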
102. Screening target voice data having an image generation purpose from the plurality of voice data according to a preset voice screening algorithm.
The purpose of this arrangement is that a plurality of voice data of the user can be acquired and screened automatically: once the target voice data having an image generation purpose is screened out, image generation can proceed directly, without the user entering excessive instructions or performing excessive operations, thereby improving the degree of automation.
103. Determining a corresponding image generation algorithm model from a plurality of candidate algorithm models according to the data parameters of the target voice data.
Specifically, each candidate algorithm model is an algorithm model that generates an image from voice, and may be a single neural network model or a combination of multiple algorithm models. Optionally, a neural network model or prediction model in the present invention may be a model with a CNN structure or an RNN structure, trained on a corresponding training data set with a gradient descent algorithm and a loss function until convergence, which is not limited herein.
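The patent does not fix a concrete architecture or training procedure. As one hedged sketch (PyTorch is an assumed framework, and the model, data loader, MSE loss and convergence criterion are placeholders), such a candidate model could be trained with an ordinary gradient descent loop until the loss stops improving:

import torch
from torch import nn

def train_until_convergence(model, loader, lr=1e-3, tol=1e-4, max_epochs=100):
    # model maps a batch of speech features to image tensors; loader yields
    # (speech_features, image_label) pairs. Convergence is declared when the
    # epoch loss stops improving by more than tol (an assumed criterion).
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # illustrative loss; the patent does not fix one
    previous = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for speech, image in loader:
            optimizer.zero_grad()
            loss = criterion(model(speech), image)
            loss.backward()
            optimizer.step()
            total += loss.item()
        if abs(previous - total) < tol:
            break
        previous = total
    return model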
104. Inputting the target voice data into the image generation algorithm model to obtain image data corresponding to the target voice data.
Therefore, according to the embodiment of the invention, voice data having an image generation purpose can be screened out from the plurality of voice data according to the screening algorithm, and the most suitable image generation algorithm can be selected from the candidate algorithm models according to the data parameters to generate the image, so that the degree of automation and intelligence of generating images from voice can be increased, the user's operation cost reduced, and algorithm efficiency and effect improved.
As an optional embodiment, in the above step, the screening, according to a preset voice screening algorithm, of target voice data having an image generation purpose from the plurality of voice data includes:
determining a voice text and a voice mood corresponding to each piece of voice data according to a neural network algorithm;
calculating, based on a matching algorithm and a similarity algorithm, a text purpose matching degree and a mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood;
and screening target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree.
Through this embodiment, the voice text and voice mood corresponding to each voice data can be determined according to the neural networks, the text purpose matching degree and the mood type matching degree can then be calculated, and the target voice data having an image generation purpose can be screened out from the voice data, which can improve the accuracy of voice screening and thus of the subsequent image generation from voice, reduce the user's operation cost, and improve algorithm efficiency and effect.
As an optional embodiment, in the above step, the determining, according to a neural network algorithm, of the voice text and the voice mood corresponding to each piece of voice data includes:
for each piece of voice data, performing noise reduction processing on the voice data to obtain corresponding noise-reduced data;
inputting the noise-reduced data into a trained text recognition neural network model to obtain the voice text corresponding to the voice data, the text recognition neural network being trained on a training data set comprising a plurality of training voice data and corresponding text labels;
and inputting the noise-reduced data into a trained mood recognition neural network model to obtain the voice mood corresponding to the voice data.
Specifically, the mood recognition neural network is trained on a training data set comprising a plurality of training voice data and corresponding mood labels; optionally, the voice mood or mood label is one of, or a combination of, a purpose mood and an emotional mood.
Optionally, the purpose mood is a command mood, a statement mood or a question mood.
Optionally, the emotional mood is a neutral mood, an angry mood, a happy mood or a low mood.
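A minimal sketch of this recognition pipeline follows (the denoiser, text_model and mood_model callables are assumed stand-ins for the trained models; the patent does not specify their interfaces):

from dataclasses import dataclass

@dataclass
class SpeechAnalysis:
    text: str          # output of the text recognition network
    purpose_mood: str  # e.g. "command", "statement", "question"
    emotion_mood: str  # e.g. "neutral", "angry", "happy", "low"

def analyse_voice(raw_audio, denoiser, text_model, mood_model):
    # Denoise first, then run the two trained recognition networks on the
    # noise-reduced data, as described in this embodiment.
    clean = denoiser(raw_audio)
    text = text_model(clean)
    purpose, emotion = mood_model(clean)
    return SpeechAnalysis(text=text, purpose_mood=purpose, emotion_mood=emotion)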
Through this embodiment, the voice text and voice mood corresponding to the voice data can be determined from the noise-reduced voice data according to the neural networks, which can improve the accuracy of subsequent voice screening as well as of the subsequent image generation from voice, reduce the user's operation cost, and improve algorithm efficiency and effect.
As an optional embodiment, in the above step, the calculating, based on a matching algorithm and a similarity algorithm, of the text purpose matching degree and the mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood includes:
for each piece of voice data, calculating the average of the text similarities between the voice text of the voice data and a plurality of preset standard image generation command texts to obtain a first similarity parameter corresponding to the voice data;
calculating the number of matching keywords in the voice text of the voice data according to a preset text keyword matching rule;
calculating the product of the first similarity parameter, the number of matching keywords and a duration weight to obtain the text purpose matching degree corresponding to the voice data, the duration weight being proportional to the voice duration of the voice data;
calculating the average of the mood similarities between the voice mood of the voice data and the voice moods of a plurality of preset historical voice data samples to obtain a second similarity parameter corresponding to the voice data;
calculating the average of the mood similarities between the voice mood of the voice data and the voice moods of the two pieces of voice data adjacent in acquisition time to obtain a third similarity parameter corresponding to the voice data;
and calculating the product of the second similarity parameter, the third similarity parameter and the duration weight to obtain the mood type matching degree corresponding to the voice data.
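The following Python sketch illustrates these two products for a single piece of voice data (the SequenceMatcher text metric, the mood_similarity callable, the containment keyword rule and the linear duration weight are assumptions; the patent only requires the weight to be proportional to the duration, and the reference lists are assumed non-empty):

import difflib

def text_similarity(a, b):
    # Any text similarity metric can be substituted here.
    return difflib.SequenceMatcher(None, a, b).ratio()

def matching_degrees(text, mood, duration, neighbour_moods, standard_commands,
                     history_moods, keywords, mood_similarity,
                     duration_scale=1.0):
    # First similarity parameter: mean similarity to the preset standard
    # image generation command texts.
    first_sim = sum(text_similarity(text, c)
                    for c in standard_commands) / len(standard_commands)
    # Matching keyword count under a simple containment rule (assumed).
    n_keywords = sum(1 for k in keywords if k in text)
    duration_weight = duration_scale * duration
    text_purpose_degree = first_sim * n_keywords * duration_weight

    # Second similarity parameter: mean mood similarity to historical samples.
    second_sim = sum(mood_similarity(mood, h)
                     for h in history_moods) / len(history_moods)
    # Third similarity parameter: mean mood similarity to the two voice data
    # adjacent in acquisition time.
    third_sim = sum(mood_similarity(mood, n)
                    for n in neighbour_moods) / len(neighbour_moods)
    mood_type_degree = second_sim * third_sim * duration_weight
    return text_purpose_degree, mood_type_degree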
Through this embodiment, the text purpose matching degree and the mood type matching degree corresponding to each voice data can be calculated based on preset reference texts and reference voice data, using a matching algorithm and a similarity algorithm, which can improve the accuracy of subsequent voice screening as well as of the subsequent image generation from voice, reduce the user's operation cost, and improve algorithm efficiency and effect.
As an optional embodiment, in the above step, the screening of target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree includes:
calculating a weighted average of the text purpose matching degree and the mood type matching degree corresponding to each piece of voice data to obtain a matching parameter of each piece of voice data;
sorting all the voice data in descending order of the matching parameter to obtain a first voice sequence;
screening out the candidate voice data that rank within a preset first number at the head of the first voice sequence and whose matching parameters are greater than a first parameter threshold;
calculating the average of the acquisition-time differences between each candidate voice data and all the other candidate voice data to obtain a time parameter corresponding to each candidate voice data;
sorting all the candidate voice data in descending order of the time parameter to obtain a second voice sequence;
and screening out the candidate voice data that rank within a preset second number at the head of the second voice sequence and whose time parameters are greater than a second parameter threshold, to obtain the target voice data.
Through this embodiment, secondary screening of the plurality of voice data can be realized based on the calculation of the matching parameters and the time parameters, so that the voice data most likely to embody the user's image generation purpose can be screened out according to their matching degree and their distribution in time, improving the accuracy of the subsequent image generation from voice, reducing the user's operation cost, and improving algorithm efficiency and effect.
As an optional embodiment, in the above step, the determining, according to the data parameters of the target voice data, of a corresponding image generation algorithm model from a plurality of candidate algorithm models includes:
determining the number of voices, the voice mood set, the voice text set and the average voice duration corresponding to the target voice data;
calculating, according to the number of voices, the voice mood set, the voice text set and the average voice duration, the data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set, the training data set corresponding to each candidate algorithm model comprising a plurality of training voice data and corresponding image labels;
and determining the corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity.
Through this embodiment, the corresponding image generation algorithm model can be determined from the candidate algorithm models based on the number of voices, the voice mood set, the voice text set and the average voice duration corresponding to the target voice data and a similarity algorithm, so that the algorithm model with the best expected prediction effect can be screened out, improving the accuracy of generating images from voice with that model, reducing the user's operation cost, and improving algorithm efficiency and effect.
As an optional embodiment, in the above step, the calculating, according to the number of voices, the voice mood set, the voice text set and the average voice duration, of the data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set includes:
for each candidate algorithm model, acquiring a plurality of training voice data sets in the training data set corresponding to the candidate algorithm model;
calculating the average of the differences between the number of voices and the number of voices in each training voice data set to obtain a number similarity;
calculating the average of the overlap ratios between the voice mood set and the mood label set corresponding to each training voice data set to obtain a mood similarity;
calculating the average of the text similarities between the voice text set and the text label set corresponding to each training voice data set to obtain a text similarity;
calculating the average of the duration differences between the average voice duration and the average voice duration corresponding to each training voice data set to obtain a duration similarity;
and calculating the product of the number similarity, the mood similarity, the text similarity and the duration similarity to obtain the data similarity corresponding to the candidate algorithm model.
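A sketch of this similarity product follows. The claim derives each similarity from an average difference or overlap but does not fix the mapping; the sketch assumes a Jaccard overlap for the mood sets, the monotone-decreasing transform 1/(1 + difference) for the count and duration terms, and a supplied callable for the set-level text similarity:

def data_similarity(num_voices, mood_set, text_set, avg_duration,
                    training_sets, text_set_similarity):
    # training_sets: list of dicts with assumed keys 'count', 'moods' (set),
    # 'texts' and 'avg_duration' describing each training voice data set.
    n = len(training_sets)
    count_diff = sum(abs(num_voices - t["count"]) for t in training_sets) / n
    count_sim = 1.0 / (1.0 + count_diff)

    # Mood similarity: mean overlap ratio between mood sets (Jaccard assumed).
    mood_sim = sum(len(mood_set & t["moods"]) / max(len(mood_set | t["moods"]), 1)
                   for t in training_sets) / n

    # Text similarity: mean set-level text similarity (metric assumed).
    text_sim = sum(text_set_similarity(text_set, t["texts"])
                   for t in training_sets) / n

    dur_diff = sum(abs(avg_duration - t["avg_duration"])
                   for t in training_sets) / n
    dur_sim = 1.0 / (1.0 + dur_diff)

    return count_sim * mood_sim * text_sim * dur_sim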
Through this embodiment, the data similarity corresponding to each candidate algorithm model can be calculated from the number similarity, the mood similarity, the text similarity and the duration similarity, so that the algorithm model with the best expected prediction effect can subsequently be screened out, improving the accuracy of generating images from voice with that model, reducing the user's operation cost, and improving algorithm efficiency and effect.
As an optional embodiment, in the above step, the determining of the corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity includes:
sorting all the candidate algorithm models in descending order of the data similarity to obtain a model sequence;
and screening out the candidate algorithm models that rank within a preset third number at the head of the model sequence and whose data similarity is greater than a third parameter threshold, to obtain the image generation algorithm model.
Through this embodiment, the algorithm model with the best expected prediction effect can be screened out based on the data similarity and a preset screening rule, improving the accuracy of the subsequent image generation from voice with that model, reducing the user's operation cost, and improving algorithm efficiency and effect.
As an optional embodiment, in the above step, the inputting of the target voice data into the image generation algorithm model to obtain image data corresponding to the target voice data includes:
when the number of image generation algorithm models is greater than 1, inputting the target voice data into each image generation algorithm model respectively to obtain a plurality of output predicted image data;
calculating the average of the image similarities between each predicted image data and a plurality of images confirmed by the target user in a historical time period to obtain a similarity parameter of each predicted image data;
and determining the predicted image data with the largest similarity parameter as the image data corresponding to the target voice data.
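The selection among multiple screened models can be sketched as follows (image_similarity is an assumed callable returning a similarity score, and each model is treated as a callable from voice data to an image; neither interface is fixed by the patent):

def generate_image(target_voices, selected_models, user_history_images,
                   image_similarity):
    # With a single screened model, use it directly.
    if len(selected_models) == 1:
        return selected_models[0](target_voices)

    # Otherwise run every model and keep the prediction whose average
    # similarity to the user's historically confirmed images is largest.
    predictions = [model(target_voices) for model in selected_models]

    def history_score(image):
        return (sum(image_similarity(image, h) for h in user_history_images)
                / len(user_history_images))

    return max(predictions, key=history_score)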
Through this embodiment, when there are multiple image generation algorithm models, the most accurate image data can be determined from the multiple prediction results based on the similarity between each model's prediction and the results the user confirmed historically, improving the accuracy of the generated image, reducing the user's operation cost, improving algorithm efficiency, and providing a better experience for the user.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a voice data processing device based on multi-model selection according to an embodiment of the present invention. The voice data processing device based on multi-model selection described in fig. 2 is applied to a data processing chip, a processing terminal or a processing server (wherein the processing server may be a local server or a cloud server). As shown in fig. 2, the voice data processing apparatus based on the multi-model selection may include:
an acquisition module 201, configured to acquire a plurality of voice data of a target user;
the screening module 202 is configured to screen target voice data with an image generating purpose from a plurality of voice data according to a preset voice screening algorithm;
the determining module 203 is configured to determine a corresponding image generation algorithm model from a plurality of candidate algorithm models according to a data parameter of the target voice data;
the prediction module 204 is configured to input the target voice data into the image generation algorithm model, so as to obtain image data corresponding to the target voice data.
As an optional embodiment, the specific manner in which the screening module 202 screens, according to a preset voice screening algorithm, target voice data having an image generation purpose from the plurality of voice data includes:
determining a voice text and a voice mood corresponding to each piece of voice data according to a neural network algorithm;
calculating, based on a matching algorithm and a similarity algorithm, a text purpose matching degree and a mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood;
and screening target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree.
As an optional embodiment, the specific manner in which the screening module 202 determines, according to a neural network algorithm, the voice text and the voice mood corresponding to each piece of voice data includes:
for each piece of voice data, performing noise reduction processing on the voice data to obtain corresponding noise-reduced data;
inputting the noise-reduced data into a trained text recognition neural network model to obtain the voice text corresponding to the voice data, the text recognition neural network being trained on a training data set comprising a plurality of training voice data and corresponding text labels;
and inputting the noise-reduced data into a trained mood recognition neural network model to obtain the voice mood corresponding to the voice data, the mood recognition neural network being trained on a training data set comprising a plurality of training voice data and corresponding mood labels; the voice mood or mood label is one of, or a combination of, a purpose mood and an emotional mood; the purpose mood is a command mood, a statement mood or a question mood; the emotional mood is a neutral mood, an angry mood, a happy mood or a low mood.
As an optional embodiment, the specific manner in which the screening module 202 calculates, based on a matching algorithm and a similarity algorithm, the text purpose matching degree and the mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood includes:
for each piece of voice data, calculating the average of the text similarities between the voice text of the voice data and a plurality of preset standard image generation command texts to obtain a first similarity parameter corresponding to the voice data;
calculating the number of matching keywords in the voice text of the voice data according to a preset text keyword matching rule;
calculating the product of the first similarity parameter, the number of matching keywords and a duration weight to obtain the text purpose matching degree corresponding to the voice data, the duration weight being proportional to the voice duration of the voice data;
calculating the average of the mood similarities between the voice mood of the voice data and the voice moods of a plurality of preset historical voice data samples to obtain a second similarity parameter corresponding to the voice data;
calculating the average of the mood similarities between the voice mood of the voice data and the voice moods of the two pieces of voice data adjacent in acquisition time to obtain a third similarity parameter corresponding to the voice data;
and calculating the product of the second similarity parameter, the third similarity parameter and the duration weight to obtain the mood type matching degree corresponding to the voice data.
As an optional embodiment, the specific manner in which the screening module 202 screens target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree includes:
calculating a weighted average of the text purpose matching degree and the mood type matching degree corresponding to each piece of voice data to obtain a matching parameter of each piece of voice data;
sorting all the voice data in descending order of the matching parameter to obtain a first voice sequence;
screening out the candidate voice data that rank within a preset first number at the head of the first voice sequence and whose matching parameters are greater than a first parameter threshold;
calculating the average of the acquisition-time differences between each candidate voice data and all the other candidate voice data to obtain a time parameter corresponding to each candidate voice data;
sorting all the candidate voice data in descending order of the time parameter to obtain a second voice sequence;
and screening out the candidate voice data that rank within a preset second number at the head of the second voice sequence and whose time parameters are greater than a second parameter threshold, to obtain the target voice data.
As an optional embodiment, the specific manner in which the determining module 203 determines a corresponding image generation algorithm model from a plurality of candidate algorithm models according to the data parameters of the target voice data includes:
determining the number of voices, the voice mood set, the voice text set and the average voice duration corresponding to the target voice data;
calculating, according to the number of voices, the voice mood set, the voice text set and the average voice duration, the data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set, the training data set corresponding to each candidate algorithm model comprising a plurality of training voice data and corresponding image labels;
and determining the corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity.
As an optional embodiment, the specific manner in which the determining module 203 calculates, according to the number of voices, the voice mood set, the voice text set and the average voice duration, the data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set includes:
for each candidate algorithm model, acquiring a plurality of training voice data sets in the training data set corresponding to the candidate algorithm model;
calculating the average of the differences between the number of voices and the number of voices in each training voice data set to obtain a number similarity;
calculating the average of the overlap ratios between the voice mood set and the mood label set corresponding to each training voice data set to obtain a mood similarity;
calculating the average of the text similarities between the voice text set and the text label set corresponding to each training voice data set to obtain a text similarity;
calculating the average of the duration differences between the average voice duration and the average voice duration corresponding to each training voice data set to obtain a duration similarity;
and calculating the product of the number similarity, the mood similarity, the text similarity and the duration similarity to obtain the data similarity corresponding to the candidate algorithm model.
As an optional embodiment, the specific manner in which the determining module 203 determines the corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity includes:
sorting all the candidate algorithm models in descending order of the data similarity to obtain a model sequence;
screening out the candidate algorithm models that rank within a preset third number at the head of the model sequence and whose data similarity is greater than a third parameter threshold, to obtain the image generation algorithm model;
and the specific manner in which the prediction module 204 inputs the target voice data into the image generation algorithm model to obtain the image data corresponding to the target voice data includes:
when the number of image generation algorithm models is greater than 1, inputting the target voice data into each image generation algorithm model respectively to obtain a plurality of output predicted image data;
calculating the average of the image similarities between each predicted image data and a plurality of images confirmed by the target user in a historical time period to obtain a similarity parameter of each predicted image data;
and determining the predicted image data with the largest similarity parameter as the image data corresponding to the target voice data.
Specific technical details and technical effects of the modules and steps in this embodiment may refer to the corresponding descriptions in Embodiment One, and are not repeated herein.
Example III
Referring to fig. 3, fig. 3 is a schematic structural diagram of another voice data processing apparatus based on multi-model selection according to an embodiment of the present invention. The voice data processing apparatus based on multi-model selection described in fig. 3 is applied to a data processing chip, a processing terminal or a processing server (wherein the processing server may be a local server or a cloud server). As shown in fig. 3, the voice data processing apparatus based on multi-model selection may include:
a memory 301 storing executable program code;
a processor 302 coupled to the memory 301;
wherein the processor 302 invokes the executable program code stored in the memory 301 to perform the steps of the voice data processing method based on multi-model selection described in Embodiment One.
Example IV
An embodiment of the present invention discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the steps of the voice data processing method based on multi-model selection described in Embodiment One.
Example five
An embodiment of the present invention discloses a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the steps of the voice data processing method based on multi-model selection described in Embodiment One.
The foregoing describes certain embodiments of the present disclosure, other embodiments being within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings do not necessarily have to be in the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus, devices, non-transitory computer readable storage medium embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to portions of the description of method embodiments being relevant.
The apparatus, the device, the nonvolatile computer readable storage medium and the method provided in the embodiments of the present disclosure correspond to each other, and therefore, the apparatus, the device, and the nonvolatile computer storage medium also have similar advantageous technical effects as those of the corresponding method, and since the advantageous technical effects of the method have been described in detail above, the advantageous technical effects of the corresponding apparatus, device, and nonvolatile computer storage medium are not described herein again.
In the 1990s, an improvement to a technology could clearly be distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must likewise be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can easily be obtained merely by lightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, besides implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included within it for performing various functions may also be regarded as structures within the hardware component. Indeed, means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module, or unit set forth in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. One typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function. Of course, when this specification is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, the embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
This specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This specification may also be practiced in distributed computing environments, where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
The embodiments in this specification are described in a progressive manner; identical and similar parts of the embodiments are cross-referenced, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly, since they are substantially similar to the method embodiments; for the relevant details, refer to the description of the method embodiments.
Finally, it should be noted that the voice data processing method and device based on multi-model selection disclosed in the embodiments of the present invention are disclosed only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A method of processing voice data based on multi-model selection, the method comprising:
acquiring a plurality of voice data of a target user;
screening target voice data having an image generation purpose from the plurality of voice data according to a preset voice screening algorithm;
determining a corresponding image generation algorithm model from a plurality of candidate algorithm models according to the data parameters of the target voice data;
and inputting the target voice data into the image generation algorithm model to obtain image data corresponding to the target voice data.
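
Read as a data flow, claim 1 describes a screen-select-generate pipeline. The following is a minimal, self-contained Python sketch of that flow; the screening predicate, the data layout, and the model interface are illustrative assumptions, not part of the disclosure (claims 2-8 refine each stage):

    def has_image_generation_purpose(item):
        # Stand-in screen; claim 2 refines this into text-purpose and
        # mood-type matching degrees.
        return "draw" in item["text"] or "picture" in item["text"]

    def process_voice_data(voice_items, candidate_models):
        # Stages 1-2: keep only utterances screened as image-generation requests.
        targets = [v for v in voice_items if has_image_generation_purpose(v)]
        # Stage 3: pick the candidate model whose training data best matches
        # the screened voice data (claims 6-8 refine this selection).
        model = max(candidate_models, key=lambda m: m["data_similarity"])
        # Stage 4: feed the target voice data to the selected model.
        return [model["generate"](v) for v in targets]

    # Toy usage with stand-in data:
    items = [{"text": "please draw a cat"}, {"text": "what time is it"}]
    models = [{"data_similarity": 0.9, "generate": lambda v: f"image<{v['text']}>"}]
    print(process_voice_data(items, models))
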
2. The method for processing voice data based on multi-model selection according to claim 1, wherein the screening target voice data having an image generation purpose from the plurality of voice data according to a preset voice screening algorithm comprises:
determining a voice text and a voice mood corresponding to each piece of voice data according to a neural network algorithm;
calculating, based on a matching algorithm and a similarity algorithm, a text purpose matching degree and a mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood;
and screening target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree.
3. The method for processing voice data based on multi-model selection according to claim 2, wherein the determining a voice text and a voice mood corresponding to each piece of voice data according to a neural network algorithm comprises:
for each piece of voice data, performing noise reduction on the voice data to obtain corresponding noise-reduced data;
inputting the noise-reduced data into a trained text recognition neural network model to obtain a voice text corresponding to the voice data, wherein the text recognition neural network is obtained by training on a training data set comprising a plurality of pieces of training voice data and corresponding text labels;
and inputting the noise-reduced data into a trained mood recognition neural network model to obtain a voice mood corresponding to the voice data, wherein the mood recognition neural network is obtained by training on a training data set comprising a plurality of pieces of training voice data and corresponding mood labels; the voice mood or the mood label is one of, or a combination of two of, a target mood and an emotional mood; the target mood is a command mood, a statement mood, or a query mood; and the emotional mood is a neutral mood, an angry mood, a happy mood, or a low mood.
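
Claim 3 routes each denoised utterance through two separately trained classifiers. A self-contained Python sketch with stand-in components (the disclosure fixes neither a denoiser nor a network architecture, so everything here is an assumption):

    TARGET_MOODS = ("command", "statement", "query")
    EMOTIONAL_MOODS = ("neutral", "angry", "happy", "low")

    def recognise(voice_data, denoise, text_model, mood_model):
        cleaned = denoise(voice_data)        # per-utterance noise reduction
        voice_text = text_model(cleaned)     # trained text recognition network
        voice_mood = mood_model(cleaned)     # trained mood recognition network
        return voice_text, voice_mood

    # Toy stand-ins so the sketch runs end to end:
    text, mood = recognise(
        "raw-audio",
        denoise=lambda a: a,
        text_model=lambda a: "draw a cat",
        mood_model=lambda a: ("command", "happy"),  # one target + one emotional mood
    )
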
4. The method for processing voice data based on multi-model selection according to claim 2, wherein the calculating, based on a matching algorithm and a similarity algorithm, a text purpose matching degree and a mood type matching degree corresponding to each piece of voice data according to the voice text and the voice mood comprises:
for each piece of voice data, calculating an average of the text similarities between the voice text of the voice data and a plurality of preset standard image generation command texts to obtain a first similarity parameter corresponding to the voice data;
calculating the number of matching keywords in the voice text of the voice data according to a preset text keyword matching rule;
calculating the product of the first similarity parameter, the number of matching keywords, and a duration weight to obtain the text purpose matching degree corresponding to the voice data, wherein the duration weight is proportional to the voice duration of the voice data;
calculating an average of the mood similarities between the voice mood of the voice data and the voice moods of a plurality of preset historical voice data samples to obtain a second similarity parameter corresponding to the voice data;
calculating an average of the mood similarities between the voice mood of the voice data and the voice moods of the two pieces of voice data adjacent in acquisition time to obtain a third similarity parameter corresponding to the voice data;
and calculating the product of the second similarity parameter, the third similarity parameter, and the duration weight to obtain the mood type matching degree corresponding to the voice data.
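
The two matching degrees in claim 4 are plain products of averaged similarities and a duration weight. A runnable sketch of that arithmetic follows; the similarity measures and data layout are assumptions, since the claim does not fix them:

    from difflib import SequenceMatcher

    def text_sim(a, b):
        # Stand-in for the unspecified text similarity measure.
        return SequenceMatcher(None, a, b).ratio()

    def text_purpose_degree(voice_text, standard_commands, keywords, duration_weight):
        # First similarity parameter: mean similarity to the preset
        # standard image generation command texts.
        first = sum(text_sim(voice_text, c) for c in standard_commands) / len(standard_commands)
        # Number of matching keywords under the keyword matching rule.
        hits = sum(1 for k in keywords if k in voice_text)
        return first * hits * duration_weight      # product of the three factors

    def mood_type_degree(mood, history_moods, neighbour_moods, duration_weight):
        same = lambda a, b: 1.0 if a == b else 0.0  # stand-in mood similarity
        second = sum(same(mood, m) for m in history_moods) / len(history_moods)
        third = sum(same(mood, m) for m in neighbour_moods) / len(neighbour_moods)
        return second * third * duration_weight

    print(text_purpose_degree("draw a cat", ["draw a", "generate an image of"],
                              ["draw", "cat"], duration_weight=1.2))

Note that a zero keyword count zeroes the whole text purpose score, which matches the literal product in the claim.
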
5. The method for processing voice data based on multi-model selection according to claim 2, wherein the screening target voice data having an image generation purpose from the plurality of voice data according to the text purpose matching degree and the mood type matching degree comprises:
calculating a weighted average of the text purpose matching degree and the mood type matching degree corresponding to each piece of voice data to obtain a matching parameter of each piece of voice data;
sorting all the voice data by the matching parameter in descending order to obtain a first voice sequence;
screening out a plurality of candidate voice data that rank within the top first number of the first voice sequence and whose matching parameters are greater than a first parameter threshold;
calculating, for each candidate voice data, an average of the acquisition time differences between the candidate voice data and all other candidate voice data to obtain a time parameter corresponding to the candidate voice data;
sorting all the candidate voice data by the time parameter in descending order to obtain a second voice sequence;
and screening out the candidate voice data that rank within the top second number of the second voice sequence and whose time parameters are greater than a second parameter threshold, to obtain the target voice data.
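
Claim 5 is a two-pass filter: rank by score and threshold, then rank the survivors by how spread out they are in acquisition time and threshold again. A self-contained sketch; field names, weights, and thresholds are illustrative:

    def screen_targets(items, w_text=0.5, w_mood=0.5,
                       first_n=5, t1=0.3, second_n=3, t2=1.0):
        # Matching parameter: weighted average of the two matching degrees.
        for it in items:
            it["match"] = (w_text * it["text_deg"] + w_mood * it["mood_deg"]) / (w_text + w_mood)
        first_seq = sorted(items, key=lambda it: it["match"], reverse=True)
        cands = [it for it in first_seq[:first_n] if it["match"] > t1]
        # Time parameter: mean acquisition-time gap to every other candidate.
        for it in cands:
            others = [o for o in cands if o is not it]
            it["time_param"] = (sum(abs(it["t"] - o["t"]) for o in others) / len(others)
                                if others else 0.0)
        second_seq = sorted(cands, key=lambda it: it["time_param"], reverse=True)
        return [it for it in second_seq[:second_n] if it["time_param"] > t2]

    targets = screen_targets([
        {"text_deg": 0.9, "mood_deg": 0.8, "t": 0.0},
        {"text_deg": 0.7, "mood_deg": 0.9, "t": 10.0},
        {"text_deg": 0.1, "mood_deg": 0.2, "t": 5.0},
    ])
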
6. The method for processing voice data based on multi-model selection according to claim 2, wherein the determining a corresponding image generation algorithm model from a plurality of candidate algorithm models according to the data parameters of the target voice data comprises:
determining the number of voices, a voice mood set, a voice text set, and an average voice duration corresponding to the target voice data;
calculating, according to the number of voices, the voice mood set, the voice text set, and the average voice duration, a data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set, wherein the training data set corresponding to a candidate algorithm model comprises a plurality of pieces of training voice data and corresponding image labels;
and determining a corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity.
7. The method for processing voice data based on multi-model selection according to claim 6, wherein the calculating the data similarity between the training data set corresponding to each candidate algorithm model and the target voice data set according to the number of voices, the voice mood set, the voice text set, and the average voice duration comprises:
for each candidate algorithm model, acquiring a plurality of training voice data sets in the training data set corresponding to the candidate algorithm model;
calculating the average of the count differences between the number of voices and the number of voices in each training voice data set to obtain a count similarity;
calculating the average of the overlap ratios between the voice mood set and the mood label set corresponding to each training voice data set to obtain a mood similarity;
calculating the average of the text similarities between the voice text set and the text label set corresponding to each training voice data set to obtain a text similarity;
calculating the average of the duration differences between the average voice duration and the average voice duration corresponding to each training voice data set to obtain a duration similarity;
and calculating the product of the count similarity, the mood similarity, the text similarity, and the duration similarity to obtain the data similarity corresponding to the candidate algorithm model.
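
Claim 7 multiplies four averaged per-statistic similarities into one score. The count and duration clauses literally average *differences*; one plausible reading, used in the sketch below, maps a smaller average difference to a larger similarity. The set-overlap measure (Jaccard) and the data layout are likewise assumptions:

    def overlap(a, b):
        # Assumed overlap ratio of two sets (Jaccard index).
        return len(a & b) / len(a | b) if (a | b) else 1.0

    def data_similarity(target, training_sets, text_set_sim):
        n = len(training_sets)
        count_diff = sum(abs(target["count"] - s["count"]) for s in training_sets) / n
        dur_diff = sum(abs(target["avg_dur"] - s["avg_dur"]) for s in training_sets) / n
        count_sim = 1.0 / (1.0 + count_diff)   # smaller difference -> larger similarity
        dur_sim = 1.0 / (1.0 + dur_diff)
        mood_sim = sum(overlap(target["moods"], s["moods"]) for s in training_sets) / n
        txt_sim = sum(text_set_sim(target["texts"], s["texts"]) for s in training_sets) / n
        # Claim 7: the data similarity is the product of the four terms.
        return count_sim * mood_sim * txt_sim * dur_sim

    sim = data_similarity(
        {"count": 3, "avg_dur": 2.0, "moods": {"command"}, "texts": {"draw a cat"}},
        [{"count": 4, "avg_dur": 2.5, "moods": {"command", "happy"}, "texts": {"draw a dog"}}],
        text_set_sim=lambda a, b: overlap(a, b),  # stand-in text-set similarity
    )
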
8. The method for processing voice data based on multi-model selection according to claim 7, wherein the determining a corresponding image generation algorithm model from the plurality of candidate algorithm models according to the data similarity comprises:
sorting all the candidate algorithm models by the data similarity in descending order to obtain a model sequence;
and screening out the candidate algorithm models that rank within the top third number of the model sequence and whose data similarity is greater than a third parameter threshold, to obtain the image generation algorithm model;
and wherein the inputting the target voice data into the image generation algorithm model to obtain image data corresponding to the target voice data comprises:
when the number of image generation algorithm models is greater than 1, inputting the target voice data into each image generation algorithm model respectively to obtain a plurality of pieces of output predicted image data;
calculating an average of the image similarities between each piece of predicted image data and a plurality of images confirmed by the target user within a historical time period to obtain a similarity parameter of each piece of predicted image data;
and determining the predicted image data with the largest similarity parameter as the image data corresponding to the target voice data.
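
Claim 8 keeps every sufficiently similar top-ranked model and, when more than one survives, arbitrates between their outputs by similarity to images the user previously confirmed. A self-contained sketch under the same illustrative assumptions as the earlier blocks:

    def generate_image(models, target_voice, confirmed_images, image_sim,
                       top_n=2, t3=0.5):
        # Model sequence: rank by data similarity, then threshold.
        ranked = sorted(models, key=lambda m: m["data_sim"], reverse=True)
        chosen = [m for m in ranked[:top_n] if m["data_sim"] > t3]
        best, best_score = None, float("-inf")
        for m in chosen:
            predicted = m["generate"](target_voice)
            # Similarity parameter: mean similarity to the user's
            # historically confirmed images.
            score = sum(image_sim(predicted, h) for h in confirmed_images) / len(confirmed_images)
            if score > best_score:
                best, best_score = predicted, score
        return best

    img = generate_image(
        [{"data_sim": 0.9, "generate": lambda v: "img-A"},
         {"data_sim": 0.8, "generate": lambda v: "img-B"}],
        target_voice="draw a cat",
        confirmed_images=["img-A"],
        image_sim=lambda a, b: 1.0 if a == b else 0.0,
    )
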
9. A voice data processing apparatus based on multi-model selection, the apparatus comprising:
an acquisition module, configured to acquire a plurality of voice data of a target user;
a screening module, configured to screen target voice data having an image generation purpose from the plurality of voice data according to a preset voice screening algorithm;
a determining module, configured to determine a corresponding image generation algorithm model from a plurality of candidate algorithm models according to data parameters of the target voice data;
and a prediction module, configured to input the target voice data into the image generation algorithm model to obtain image data corresponding to the target voice data.
10. A voice data processing apparatus based on multi-model selection, the apparatus comprising:
a memory storing executable program code;
and a processor coupled to the memory, wherein the processor invokes the executable program code stored in the memory to perform the voice data processing method based on multi-model selection according to any one of claims 1 to 8.
CN202410151187.9A 2024-02-02 2024-02-02 Voice data processing method and device based on multi-model selection Pending CN118098266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410151187.9A CN118098266A (en) 2024-02-02 2024-02-02 Voice data processing method and device based on multi-model selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410151187.9A CN118098266A (en) 2024-02-02 2024-02-02 Voice data processing method and device based on multi-model selection

Publications (1)

Publication Number Publication Date
CN118098266A 2024-05-28

Family

ID=91143580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410151187.9A Pending CN118098266A (en) 2024-02-02 2024-02-02 Voice data processing method and device based on multi-model selection

Country Status (1)

Country Link
CN (1) CN118098266A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015230455A (en) * 2014-06-06 2015-12-21 日本電信電話株式会社 Voice classification device, voice classification method, and program
CN109308178A (en) * 2018-08-31 2019-02-05 维沃移动通信有限公司 A kind of voice drafting method and its terminal device
CN116363250A (en) * 2023-03-31 2023-06-30 阿维塔科技(重庆)有限公司 Image generation method and system
CN116741154A (en) * 2023-05-10 2023-09-12 深圳市声扬科技有限公司 Data selection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109543190B (en) Intention recognition method, device, equipment and storage medium
CN116227474B (en) Method and device for generating countermeasure text, storage medium and electronic equipment
CN115952272B (en) Method, device and equipment for generating dialogue information and readable storage medium
CN112735407B (en) Dialogue processing method and device
CN116343314B (en) Expression recognition method and device, storage medium and electronic equipment
CN112417093B (en) Model training method and device
CN116312480A (en) Voice recognition method, device, equipment and readable storage medium
CN116127305A (en) Model training method and device, storage medium and electronic equipment
CN117409466B (en) Three-dimensional dynamic expression generation method and device based on multi-label control
CN116127328B (en) Training method, training device, training medium and training equipment for dialogue state recognition model
CN112908315A (en) Question-answer intention judgment method based on voice characteristics and voice recognition
CN115620706B (en) Model training method, device, equipment and storage medium
CN116186231A (en) Method and device for generating reply text, storage medium and electronic equipment
CN115456114A (en) Method, device, medium and equipment for model training and business execution
CN118098266A (en) Voice data processing method and device based on multi-model selection
CN115019781A (en) Conversation service execution method, device, storage medium and electronic equipment
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN116501852B (en) Controllable dialogue model training method and device, storage medium and electronic equipment
CN116434787B (en) Voice emotion recognition method and device, storage medium and electronic equipment
CN115952271B (en) Method and device for generating dialogue information, storage medium and electronic equipment
CN114792256B (en) Crowd expansion method and device based on model selection
CN117079646B (en) Training method, device, equipment and storage medium of voice recognition model
CN115862675B (en) Emotion recognition method, device, equipment and storage medium
CN117316189A (en) Business execution method and device based on voice emotion recognition
CN115599896A (en) Method, device, equipment and medium for generating chatting answer based on dynamic Prompt

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination