CN113256751B - Voice-based image generation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113256751B
CN113256751B · Application CN202110607680.3A
Authority
CN
China
Prior art keywords: voice, image, target, preset, information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110607680.3A
Other languages
Chinese (zh)
Other versions
CN113256751A (en)
Inventor
张旭龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110607680.3A
Publication of CN113256751A
Application granted
Publication of CN113256751B
Status: Active
Anticipated expiration

Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06F 16/683: Retrieval of audio data using metadata automatically derived from the content
    • G06T 7/13: Edge detection
    • G10L 15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/26: Speech to text systems
    • G10L 25/03: Speech or voice analysis characterised by the type of extracted parameters
    • G10L 2015/0631: Creating reference templates; Clustering
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to the field of artificial intelligence and discloses a voice-based image generation method, device, equipment and storage medium for improving the accuracy of images synthesized from speech. The method comprises the following steps: acquiring a target voice to be processed and preprocessing it to obtain a standard voice; extracting features from the standard voice to obtain a voice feature vector; calculating the voice similarity of the voice feature vector and matching the target voice against voice templates according to the similarity to obtain a target voice template; querying the text and semantics of the target voice according to the target voice template to obtain text information and semantic information; inputting the text information and semantic information into an image generator to generate an initial image, and performing image detection on the initial image with a discriminator to obtain a detection result; and if the detection result is that the edges are correct, taking the initial image as the output of the image generation model to obtain the target image.

Description

Voice-based image generation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for generating an image based on voice.
Background
With the rapid development of artificial intelligence technology, computer-generated images have become practical. Automatically generating images by computer has important applications in artwork creation, data augmentation and other areas. At present, computers generate images mainly from textual descriptions, a technology with important applications in human-computer interaction and computer-aided design. A computer can generate an image semantically consistent with a creator's text instruction, which speeds up the creative workflow. The method has potential application value in daily life and in professional fields such as criminal investigation, the sharing of design ideas, and the recording of recollections.
Most image synthesis methods synthesize images from global sentence vectors, which may lose important fine-grained word-level information and thereby introduce errors into the generated images; consequently, the accuracy of images generated by conventional schemes is low.
Disclosure of Invention
The invention provides a voice-based image generation method, a device, equipment and a storage medium, which are used for improving the accuracy of a voice synthesized image.
The first aspect of the present invention provides a voice-based image generation method, which includes: acquiring target voice to be processed, and preprocessing the target voice to obtain standard voice; extracting the characteristics of the standard voice to obtain a voice characteristic vector; performing similarity calculation on the voice feature vector to obtain voice similarity, and performing voice template matching on the target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query is carried out on the target voice to obtain text information and semantic information corresponding to the target voice; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through the image generation model.
Optionally, in a first implementation manner of the first aspect of the present invention, the extracting features of the standard speech to obtain a speech feature vector includes: performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice; filtering the voice spectrum by adopting a preset filter to obtain a target Mel spectrum corresponding to the standard voice; and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
Optionally, in a second implementation manner of the first aspect of the present invention, the calculating the similarity of the speech feature vector to obtain a speech similarity, and performing speech template matching on the target speech according to the speech similarity, to obtain a target speech template includes: extracting vector elements of the voice feature vector to obtain a plurality of vector elements; invoking a preset voice template vector, and performing similarity calculation on the plurality of vector elements to obtain voice similarity; and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
Optionally, in a third implementation manner of the first aspect of the present invention, the performing text and semantic query on the target voice according to the target voice template to obtain text information and semantic information corresponding to the target voice includes: performing voice matching on the target voice based on the target voice template to obtain voice data corresponding to the target voice; and respectively inquiring the text and the semantic corresponding to the voice data to obtain the text information and the semantic information corresponding to the target voice.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the inputting the text information and the semantic information into an image generator in a preset image generation model to perform image generation, and obtaining an initial image includes: inputting the text information and the semantic information into an image generator in a preset image generation model, extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information; invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information; and performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the performing, by using a discriminator in a preset image generation model, image detection on the initial image, to obtain a detection result includes: inputting the initial image into a discriminator in a preset image generation model, and acquiring a boundary box of the initial image through the discriminator; and setting a binary cross entropy loss function of the boundary box, and carrying out image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain detection results, wherein the detection results comprise edge correctness and edge errors.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the performing image detection on the initial image by using a discriminator in a preset image generation model, the method further includes: and if the detection result is an edge error, feeding back the detection result to the image generator, and performing image optimization on the initial image through the image generator.
A second aspect of the present invention provides a voice-based image generation apparatus including: the acquisition module is used for acquiring target voice to be processed and preprocessing the target voice to obtain standard voice; the feature extraction module is used for extracting features of the standard voice to obtain a voice feature vector; the matching module is used for carrying out similarity calculation on the voice feature vectors to obtain voice similarity, and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template; the query module is used for carrying out text and semantic query on the target voice according to the target voice template to obtain text information and semantic information corresponding to the target voice; the generation module is used for inputting the text information and the semantic information into an image generator in a preset image generation model to generate images so as to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model so as to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and the output module is used for outputting the initial image as a target image through the image generation model if the detection result is that the edge is correct.
Optionally, in a first implementation manner of the second aspect of the present invention, the feature extraction module is specifically configured to: performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice; filtering the voice spectrum by adopting a preset filter to obtain a target Mel spectrum corresponding to the standard voice; and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
Optionally, in a second implementation manner of the second aspect of the present invention, the matching module is specifically configured to: extracting vector elements of the voice feature vector to obtain a plurality of vector elements; invoking a preset voice template vector, and performing similarity calculation on the plurality of vector elements to obtain voice similarity; and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
Optionally, in a third implementation manner of the second aspect of the present invention, the query module is specifically configured to: performing voice matching on the target voice based on the target voice template to obtain voice data corresponding to the target voice; and respectively inquiring the text and the semantic corresponding to the voice data to obtain the text information and the semantic information corresponding to the target voice.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the generating module further includes: the image generation unit is used for inputting the text information and the semantic information into an image generator in a preset image generation model, extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information; invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information; and performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the generating module further includes: the image distinguishing unit is used for inputting the initial image into a distinguishing device in a preset image generation model and acquiring a boundary box of the initial image through the distinguishing device; and setting a binary cross entropy loss function of the boundary box, and carrying out image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain detection results, wherein the detection results comprise edge correctness and edge errors.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the voice-based image generating apparatus further includes: and the image optimization module is used for feeding back the detection result to the image generator if the detection result is edge error, and carrying out image optimization on the initial image through the image generator.
A third aspect of the present invention provides a voice-based image generation apparatus, comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the speech-based image generation device to perform the speech-based image generation method described above.
A fourth aspect of the present invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the above-described speech-based image generation method.
In the technical scheme provided by the invention, target voice to be processed is obtained, and the target voice is preprocessed to obtain standard voice; extracting the characteristics of the standard voice to obtain a voice characteristic vector; performing similarity calculation on the voice feature vector to obtain voice similarity, and performing voice template matching on the target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query is carried out on the target voice to obtain text information and semantic information corresponding to the target voice; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through the image generation model. The invention detects the initial image generated by the image generator and optimizes the initial image with the detection result of edge error, thereby improving the accuracy of the speech synthesis image.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a speech-based image generation method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a second embodiment of a speech-based image generation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a first embodiment of a speech-based image generating apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a second embodiment of a speech-based image generating apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of a voice-based image generating apparatus in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a voice-based image generation method, device and equipment and a storage medium, which are used for improving the accuracy of a voice synthesized image. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to fig. 1, where a first embodiment of a speech-based image generation method according to an embodiment of the present invention includes:
101. acquiring target voice to be processed, and preprocessing the target voice to obtain standard voice;
It should be understood that the execution subject of the present invention may be a voice-based image generation apparatus or a server, which is not limited herein. The embodiment of the present invention is described taking a server as the execution subject.
The server acquires a target voice to be processed. The target voice may be speech from various professional fields or from daily life, for example: voice from a criminal investigation, voice sharing a design idea, voice recording a recollection, and the like. Preprocessing the target voice means removing additive noise from the target voice to obtain the standard voice.
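The preprocessing step above (removing additive noise to obtain the standard voice) is not tied to a specific algorithm in the text; a minimal sketch using spectral subtraction, one common choice for additive-noise removal, might look like this (the sample rate, window length, and noise-frame count are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

def denoise_spectral_subtraction(x, fs=16000, nperseg=256, noise_frames=10):
    """Remove additive noise via spectral subtraction (one common choice;
    the patent does not name a specific preprocessing algorithm)."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    # Estimate the noise floor from the first few (assumed speech-free) frames.
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise, 0.0)
    # Resynthesize the "standard voice" from the cleaned magnitude spectrum.
    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return y
```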
102. Extracting features of standard voice to obtain voice feature vectors;
The server performs a short-time Fourier transform on the standard voice to obtain the voice spectrum corresponding to the standard voice, then filters the voice spectrum with a preset filter to obtain the target Mel spectrum corresponding to the standard voice, and performs vector coding on the target Mel spectrum using a preset coding rule to obtain the voice feature vector. The target Mel spectrum comprises time-domain data, frequency-domain data and channel data; the number of channels may be 1, with the channel data set to an initial value. The analysis spectrum may be encoded by a preset coding method so that the variable-length analysis spectrum is represented by a fixed-length feature vector, yielding the voice feature vector.
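The pipeline described above (STFT, Mel filtering, then encoding into a fixed-length vector) can be sketched as follows. The triangular filterbank construction is a standard one, and the mean-over-time pooling used to obtain a fixed-length vector is an assumption, since the patent does not specify its coding rule:

```python
import numpy as np
from scipy.signal import stft

def mel_filterbank(n_mels, nfft, fs):
    """Triangular Mel filters (standard construction; parameters are
    illustrative, not taken from the patent)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):           # rising slope of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):           # falling slope of the triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def voice_feature_vector(x, fs=16000, nperseg=512, n_mels=40):
    """STFT -> Mel spectrum -> fixed-length vector (mean over time), so a
    variable-length signal yields a fixed-length feature vector."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    power = np.abs(Z) ** 2                           # (freq, time) power spectrum
    mel = mel_filterbank(n_mels, nperseg, fs) @ power
    logmel = np.log(mel + 1e-10)
    return logmel.mean(axis=1)                       # fixed length: n_mels
```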
103. Performing similarity calculation on the voice feature vectors to obtain voice similarity, and performing voice template matching on target voice according to the voice similarity to obtain a target voice template;
The server performs the similarity calculation between the voice feature vector and the template vector corresponding to a preset voice template, using the mean variance as the similarity measure, for example: when the voice feature vector is [2,3,5,0] and the template vector is [2,3,4,0], the voice similarity calculated by the mean-variance function is 0.33. The smaller the voice similarity, i.e. the smaller the mean variance, the more similar the voice feature vector and the template vector. The preset voice templates are compared for similarity with the voice feature vector of the input target voice, the templates matching the target voice are computed, and the voice template whose template vector is closest to the voice feature vector is taken as the target voice template.
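A sketch of the similarity calculation and template matching follows. The exact "mean variance" formula is not given in the text; dividing the sum of squared differences by n - 1 reproduces the worked example (0.33 for [2,3,5,0] vs [2,3,4,0]), so that reading is assumed here:

```python
import numpy as np

def voice_similarity(feat, tmpl):
    """'Mean variance' between a feature vector and a template vector.
    Interpreted as sum of squared differences over (n - 1), which matches
    the patent's worked example; this reading is an assumption."""
    feat, tmpl = np.asarray(feat, float), np.asarray(tmpl, float)
    return float(np.sum((feat - tmpl) ** 2) / (len(feat) - 1))

def match_template(feat, templates):
    """Pick the template with the smallest similarity value, since a
    smaller mean variance means a closer match per the embodiment."""
    return min(templates, key=lambda name: voice_similarity(feat, templates[name]))
```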
104. According to the target voice template, text and semantic query are carried out on the target voice, and text information and semantic information corresponding to the target voice are obtained;
The server queries the text and semantics of the target voice according to the target voice template. Text information and semantic information can be obtained from the target voice template by table lookup, since the target voice template contains the characters corresponding to each voice, for example: the syllable "mao" in the first tone corresponds to the word "cat". The text information and semantic information corresponding to the target voice are obtained from the stored voice most similar to the input target voice.
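The table lookup described above can be sketched as a plain mapping from matched voice units to characters and semantics. The table entries, keys, and semantic labels below are purely hypothetical illustrations, not data from the patent:

```python
# Hypothetical lookup table: the patent says the target voice template
# contains the character for each voice (e.g. "mao" in the first tone ->
# "cat"); the entries and the semantic labels here are illustrative only.
VOICE_TEMPLATE = {
    "mao1": {"text": "cat", "semantics": "animal:cat"},
    "gou3": {"text": "dog", "semantics": "animal:dog"},
}

def query_text_and_semantics(voice_units):
    """Table lookup from matched voice units to text and semantic info."""
    text = [VOICE_TEMPLATE[u]["text"] for u in voice_units]
    semantics = [VOICE_TEMPLATE[u]["semantics"] for u in voice_units]
    return text, semantics
```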
105. Inputting text information and semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges;
The server inputs the text information and semantic information into the image generator in the preset image generation model to generate an initial image. The image generation model is an ObjGAN model, in which the image generator is responsible for generating the initial image and the discriminator detects whether the initial image generated by the generator meets the preset requirements, yielding a detection result that is either edge-correct or edge-error.
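The discriminator's check can be sketched at a very high level. The real ObjGAN discriminator scores generated objects against their bounding boxes under a binary cross-entropy loss; the threshold-based verdict below is a toy stand-in for that decision, not the actual model:

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy loss, as used by the bounding-box discriminator
    described in the embodiment (implementation details assumed)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def detect_edges(edge_scores, threshold=0.5):
    """Toy stand-in for the discriminator's verdict: the real model scores
    each bounding box; here we simply threshold the mean predicted score."""
    return "edge-correct" if np.mean(edge_scores) >= threshold else "edge-error"
```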
106. And if the detection result is that the edge is correct, outputting the initial image as a target image through an image generation model.
If the detection result is edge-correct, the server outputs the initial image with correct edges as the target image. The target image is synthesized from the most relevant target objects in the text information and the pre-generated semantic layout, so a high-quality target image can be generated quickly from the text information.
In the embodiment of the invention, the target voice to be processed is obtained, and the target voice is preprocessed to obtain the standard voice; extracting features of standard voice to obtain voice feature vectors; performing similarity calculation on the voice feature vectors to obtain voice similarity, and performing voice template matching on target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query are carried out on the target voice, and text information and semantic information corresponding to the target voice are obtained; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through an image generation model. The invention detects the initial image generated by the image generator and optimizes the initial image with the detection result of edge error, thereby improving the accuracy of the speech synthesis image.
Referring to fig. 2, a second embodiment of a voice-based image generation method according to an embodiment of the present invention includes:
201. acquiring target voice to be processed, and preprocessing the target voice to obtain standard voice;
The server acquires a target voice to be processed. The target voice may be speech from various professional fields or from daily life, for example: voice from a criminal investigation, voice sharing a design idea, voice recording a recollection, and the like. Preprocessing the target voice means removing additive noise from the target voice to obtain the standard voice.
202. Extracting the characteristics of the standard voice to obtain a voice characteristic vector;
The server performs a short-time Fourier transform on the standard voice to obtain the voice spectrum corresponding to the standard voice, then filters the voice spectrum with a preset filter to obtain the target Mel spectrum corresponding to the standard voice, and performs vector coding on the target Mel spectrum using a preset coding rule to obtain the voice feature vector. The target Mel spectrum comprises time-domain data, frequency-domain data and channel data; the number of channels may be 1, with the channel data set to an initial value. The analysis spectrum may be encoded by a preset coding method so that the variable-length analysis spectrum is represented by a fixed-length feature vector, yielding the voice feature vector.
Optionally, step 202 includes: performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice; filtering the voice frequency spectrum by adopting a preset filter to obtain a target Mel frequency spectrum corresponding to the standard voice; and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
The server performs a short-time Fourier transform (STFT) on the standard voice. The STFT is a Fourier-related transform used to determine the frequency and phase of the sinusoidal components in a local region of a time-varying signal, and is a commonly used time-frequency analysis method: the signal within a time window represents the signal characteristics of the voice at a given moment. In the STFT, the window length determines the time resolution and frequency resolution of the spectrogram: the longer the window, the longer the intercepted signal, the higher the frequency resolution after the Fourier transform, and the worse the time resolution; conversely, the shorter the window, the shorter the intercepted signal, the worse the frequency resolution and the better the time resolution. In addition, the preset filter is a Mel filter; after the voice spectrum is filtered, the noise signal is removed and the target Mel spectrum of the target voice is obtained. The target Mel spectrum is encoded using the preset coding rule to obtain the voice feature vector.
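The window-length tradeoff described above can be demonstrated directly: a long window yields more frequency bins but fewer time frames, and a short window the reverse (the sample rate, tone frequency, and window lengths below are illustrative):

```python
import numpy as np
from scipy.signal import stft

# 1 kHz tone, 1 second at 16 kHz, purely for illustration.
x = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000.0)

# Long window: many frequency bins (fine frequency resolution), few frames.
f_long, t_long, _ = stft(x, fs=16000, nperseg=2048)
# Short window: few frequency bins, many frames (fine time resolution).
f_short, t_short, _ = stft(x, fs=16000, nperseg=128)

assert len(f_long) > len(f_short)   # better frequency resolution
assert len(t_short) > len(t_long)   # better time resolution
```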
203. Performing similarity calculation on the voice feature vectors to obtain voice similarity, and performing voice template matching on target voice according to the voice similarity to obtain a target voice template;
the server performs a similarity calculation between the voice feature vector and the template vector corresponding to a preset voice template, using a mean-variance calculation. For example: when the voice feature vector is [2,3,5,0] and the template vector is [2,3,4,0], the voice similarity calculated by the mean-variance function is 0.33; that is, the voice similarity between the voice feature vector and the preset template vector is 0.33. The smaller the voice similarity, the smaller the mean variance, and the closer the voice feature vector is to the template vector. The preset voice templates are compared for similarity with the voice feature vector of the input target voice, the templates matching the target voice are computed, and the voice template whose template vector is closest to the voice feature vector is taken as the target voice template.
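The patent does not define "mean variance" precisely; one reading that reproduces the 0.33 in the example above is the sum of squared element differences divided by (n-1), i.e. the sample variance of a zero-mean difference. A minimal sketch under that assumption:

```python
def mean_variance_similarity(feature, template):
    """Sum of squared element differences over (n-1) -- an assumed reading
    of the patent's 'mean variance'. Smaller values mean closer vectors."""
    assert len(feature) == len(template)
    n = len(feature)
    return sum((a - b) ** 2 for a, b in zip(feature, template)) / (n - 1)

sim = mean_variance_similarity([2, 3, 5, 0], [2, 3, 4, 0])
print(round(sim, 2))  # 0.33
```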
Optionally, step 203 includes: extracting vector elements of the voice feature vector to obtain a plurality of vector elements; invoking a preset voice template vector, and performing similarity calculation on a plurality of vector elements to obtain voice similarity; and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
The server extracts the vector elements of the voice feature vector to obtain a plurality of vector elements, invokes a preset voice template vector, and performs a similarity calculation on the plurality of vector elements to obtain the voice similarity. For example: when the voice feature vector is [1,2,3,4], its extracted vector elements are 1, 2, 3 and 4; a preset voice template vector is obtained, and its extracted vector elements are 1, 3 and 5; the mean-variance calculation between the template vector's elements and the voice feature vector's elements gives a result of 0.5, so the voice similarity is 0.5. The smaller the voice similarity, the smaller the mean variance, and the closer the voice feature vector is to the template vector. The voice template closest to the target voice is matched according to the voice similarity, giving the target voice template.
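The matching step then reduces to an arg-min over the preset templates. A sketch, reusing the assumed mean-variance score; the template names and vector values are hypothetical:

```python
def mean_variance(a, b):
    """Assumed similarity score: sum of squared element differences over
    (n-1); smaller means closer."""
    n = len(a)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / (n - 1)

def match_template(feature_vector, templates):
    """Return the name of the preset voice template whose vector minimizes
    the similarity score, i.e. the closest match to the target voice."""
    return min(templates, key=lambda name: mean_variance(feature_vector,
                                                         templates[name]))

# Hypothetical template library, for illustration only.
templates = {"cat": [1, 2, 3, 5], "dog": [4, 4, 4, 4]}
best = match_template([1, 2, 3, 4], templates)
print(best)  # cat
```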
204. According to the target voice template, text and semantic query are carried out on the target voice, and text information and semantic information corresponding to the target voice are obtained;
the server performs a text and semantic query on the target voice according to the target voice template; the text information and the semantic information can be obtained by table lookup in the target voice template, which contains the text corresponding to each voice. For example: the voice "mao" in the first tone corresponds to the word "cat", so the text information and semantic information corresponding to the target voice can be obtained from the stored voice most similar to the input target voice.
Optionally, step 204 includes: performing voice matching on target voice based on a target voice template to obtain voice data corresponding to the target voice; and respectively inquiring the text and the semantic corresponding to the voice data to obtain text information and semantic information corresponding to the target voice.
The server performs voice matching on the target voice based on the target voice template to obtain the voice data corresponding to the target voice, then performs the text and semantic queries on the voice data. The text information and semantic information can be obtained by table lookup in the target voice template, which contains the text corresponding to each voice. For example: the voice "sofa" corresponds to the word "sofa", so the text information and semantic information corresponding to the target voice can be obtained from the stored voice most similar to the input target voice.
205. Inputting text information and semantic information into an image generator in a preset image generation model, extracting image objects from the text information by adopting a preset attention mechanism to obtain target objects, and extracting semantic layout information from the semantic information to obtain target layout information;
the server inputs the text information and the semantic information into the image generator in a preset image generation model; the image generator processes the text information and the semantic information with an attention mechanism. The server extracts the image objects from the text information with the preset attention mechanism to obtain the target objects, for example: when the text information is "In the afternoon, an orange cat lies on a warm sofa for an afternoon nap", only the two target objects "orange cat" and "sofa" remain after the attention mechanism is applied. The server also extracts the semantic layout information from the semantic information to obtain the target layout information.
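The patent does not detail its attention mechanism; as a toy stand-in, one can score each token against an "object" query vector with dot-product attention and keep the top-weighted tokens. The 2-d embeddings, query vector and token set below are entirely hypothetical:

```python
import numpy as np

def extract_objects(tokens, embeddings, object_query, keep=2):
    """Score tokens against an object query via dot-product attention and
    keep the top-scoring ones -- a toy sketch of attention-based object
    extraction, not the patent's actual mechanism."""
    scores = np.array([embeddings[t] @ object_query for t in tokens])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax attention weights
    top = np.argsort(weights)[::-1][:keep]
    return [tokens[i] for i in sorted(top)]      # preserve sentence order

# Hypothetical embeddings: object-like words point along axis 0.
emb = {"afternoon": np.array([0.1, 0.9]),
       "orange-cat": np.array([0.9, 0.1]),
       "lies-on":    np.array([0.2, 0.8]),
       "sofa":       np.array([0.8, 0.2])}
objs = extract_objects(list(emb), emb, object_query=np.array([1.0, 0.0]))
print(objs)  # ['orange-cat', 'sofa']
```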
206. Calling a preset text image library to search an image of a target object to obtain an image corresponding to text information, and calling a preset text relation graph to search target layout information to obtain an image relation corresponding to semantic information;
the server obtains the image corresponding to the text information and the relations between images by searching a text-image library and a text relation graph, for example: "In the afternoon, an orange cat lies on a warm sofa for an afternoon nap" is the text information given to the generator; the two target objects "orange cat" and "sofa" in the text information are obtained through the attention mechanism; the two images to be synthesized, an orange-cat image and a sofa image, are obtained by searching the text-image library; and the image relation between the images to be synthesized is obtained by combining the semantic information.
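The two lookups above can be mirrored as simple table lookups; the library contents, file paths and relation labels below are hypothetical placeholders, since the patent does not specify storage formats:

```python
# Hypothetical text-image library and text relation graph.
text_image_library = {"orange cat": "images/orange_cat.png",
                      "sofa": "images/sofa.png"}
text_relation_graph = {("orange cat", "sofa"): "lies on"}

def search(target_objects):
    """Fetch each target object's image from the text-image library and
    the pairwise relation between objects from the text relation graph."""
    images = {obj: text_image_library[obj] for obj in target_objects}
    relations = {pair: rel for pair, rel in text_relation_graph.items()
                 if pair[0] in target_objects and pair[1] in target_objects}
    return images, relations

images, relations = search(["orange cat", "sofa"])
```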
207. Image synthesis is carried out on the images corresponding to the text information according to the image relationship, and an initial image is obtained;
the server generates a high-resolution target image by repeatedly running the image generator from low resolution to high resolution. When generating the target image, the image generator synthesizes the image region within each bounding box by identifying the words most relevant to the target object in that bounding box of the target image.
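The per-box synthesis can be pictured as pasting each object's patch into its bounding box on a canvas. A minimal NumPy sketch; the patch values, box coordinates and grayscale representation are illustrative assumptions:

```python
import numpy as np

def compose(canvas_shape, crops_and_boxes):
    """Paste each object's image patch into its bounding box on a blank
    canvas -- a minimal sketch of synthesizing image regions inside
    bounding boxes (later patches overwrite earlier ones)."""
    canvas = np.zeros(canvas_shape, dtype=np.uint8)
    for crop, (y, x) in crops_and_boxes:         # box given by top-left corner
        h, w = crop.shape[:2]
        canvas[y:y + h, x:x + w] = crop
    return canvas

sofa = np.full((4, 6), 200, dtype=np.uint8)      # hypothetical patches
cat = np.full((2, 3), 120, dtype=np.uint8)
image = compose((8, 8), [(sofa, (4, 1)), (cat, (2, 2))])
```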
208. Image detection is carried out on the initial image through a discriminator in a preset image generation model, so that a detection result is obtained, and the detection result comprises correct edges and incorrect edges;
the server uses the discriminator to check each bounding box, ensuring that the target object to be generated indeed matches the pre-generated semantic information. The discriminator uses Fast R-CNN, which improves detection efficiency and accuracy.
Optionally, step 208 includes: inputting the initial image into the discriminator in the preset image generation model, and acquiring the bounding boxes of the initial image through the discriminator; and setting a binary cross entropy loss function on the bounding boxes, and performing image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges.
The server inputs the initial image into the discriminator in the image generation model, acquires the bounding boxes of the initial image through the discriminator, classifies the initial image with Fast R-CNN to obtain the classification results of correct edges and incorrect edges, and sets a binary cross entropy loss function on each bounding box, as follows:
L_i = -[y·log(y') + (1 - y)·log(1 - y')];
where L_i denotes the loss function of the ith bounding box, y denotes the real edge condition, and y' denotes the generated edge condition. Setting the binary cross entropy loss improves the accuracy of the speech-synthesized image.
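The binary cross entropy above can be written directly in NumPy; the clipping constant and the averaging over a box's edge labels are implementation assumptions:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross entropy L_i = -[y*log(y') + (1-y)*log(1-y')],
    averaged over a bounding box's edge labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)       # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

# A well-predicted box incurs low loss, a mispredicted one high loss.
good = bce_loss(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
bad = bce_loss(np.array([1.0, 0.0]), np.array([0.1, 0.9]))
```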
Optionally, after step 208, the method further includes: if the detection result is an incorrect edge, feeding back the detection result to the image generator, and performing image optimization on the initial image through the image generator.
If the detection result is an incorrect edge, the server feeds the detection result back to the image generator, which performs image optimization on the initial image: the erroneous image is deleted, the image corresponding to the correct target object is then synthesized with the remaining image, and the target image is output once the image detection passes.
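The generate-detect-optimize feedback loop can be sketched as follows; the callables, their signatures, the `"edge_correct"` label and the retry bound are all hypothetical stand-ins for the generator and discriminator described above:

```python
def generate_with_feedback(generator, discriminator, text, semantics,
                           max_rounds=3):
    """Generate an image, detect edge errors, and feed errors back to the
    generator for optimization -- a sketch of the loop described above."""
    image = generator(text, semantics)
    for _ in range(max_rounds):
        if discriminator(image, semantics) == "edge_correct":
            return image                         # output as the target image
        image = generator(text, semantics, previous=image)  # optimize
    return image

# Toy stand-ins that succeed on the second attempt.
attempts = []
def gen(text, semantics, previous=None):
    attempts.append(1)
    return len(attempts)                         # "image" = attempt counter
def disc(image, semantics):
    return "edge_correct" if image >= 2 else "edge_error"

result = generate_with_feedback(gen, disc, "text", "semantics")
```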
209. And if the detection result is that the edge is correct, outputting the initial image as a target image through an image generation model.
If the detection result is that the edge is correct, the server outputs an initial image with the correct edge as a target image, and the target image synthesizes an image through the most relevant target object in the text information and the pre-generated semantic information, so that the high-quality target image can be quickly generated according to the text information.
In the embodiment of the invention, the target voice to be processed is obtained, and the target voice is preprocessed to obtain the standard voice; extracting features of standard voice to obtain voice feature vectors; performing similarity calculation on the voice feature vectors to obtain voice similarity, and performing voice template matching on target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query are carried out on the target voice, and text information and semantic information corresponding to the target voice are obtained; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through an image generation model. The invention detects the initial image generated by the image generator and optimizes the initial image with the detection result of edge error, thereby improving the accuracy of the speech synthesis image.
The method for generating a voice-based image according to the embodiment of the present invention is described above, and the apparatus for generating a voice-based image according to the embodiment of the present invention is described below, referring to fig. 3, where a first embodiment of the apparatus for generating a voice-based image according to the embodiment of the present invention includes:
The obtaining module 301 is configured to obtain a target voice to be processed, and pre-process the target voice to obtain a standard voice;
the feature extraction module 302 is configured to perform feature extraction on the standard speech to obtain a speech feature vector;
the matching module 303 is configured to perform similarity calculation on the speech feature vector to obtain a speech similarity, and perform speech template matching on the target speech according to the speech similarity to obtain a target speech template;
the query module 304 is configured to perform text and semantic query on the target voice according to the target voice template, so as to obtain text information and semantic information corresponding to the target voice;
the generating module 305 is configured to input the text information and the semantic information into an image generator in a preset image generation model to generate an image, obtain an initial image, and perform image detection on the initial image through a discriminator in the preset image generation model to obtain a detection result, where the detection result includes an edge correct and an edge incorrect;
and the output module 306 is configured to output the initial image as a target image through the image generation model if the detection result indicates that the edge is correct.
In the embodiment of the invention, the target voice to be processed is obtained, and the target voice is preprocessed to obtain the standard voice; extracting the characteristics of the standard voice to obtain a voice characteristic vector; performing similarity calculation on the voice feature vector to obtain voice similarity, and performing voice template matching on the target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query is carried out on the target voice to obtain text information and semantic information corresponding to the target voice; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through the image generation model. The invention detects the initial image generated by the image generator and optimizes the initial image with the detection result of edge error, thereby improving the accuracy of the speech synthesis image.
Referring to fig. 4, a second embodiment of a voice-based image generating apparatus according to an embodiment of the present invention includes:
the obtaining module 301 is configured to obtain a target voice to be processed, and pre-process the target voice to obtain a standard voice;
the feature extraction module 302 is configured to perform feature extraction on the standard speech to obtain a speech feature vector;
the matching module 303 is configured to perform similarity calculation on the speech feature vector to obtain a speech similarity, and perform speech template matching on the target speech according to the speech similarity to obtain a target speech template;
the query module 304 is configured to perform text and semantic query on the target voice according to the target voice template, so as to obtain text information and semantic information corresponding to the target voice;
the generating module 305 is configured to input the text information and the semantic information into an image generator in a preset image generation model to generate an image, obtain an initial image, and perform image detection on the initial image through a discriminator in the preset image generation model to obtain a detection result, where the detection result includes an edge correct and an edge incorrect;
And the output module 306 is configured to output the initial image as a target image through the image generation model if the detection result indicates that the edge is correct.
Optionally, the feature extraction module 302 is specifically configured to: performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice; filtering the voice spectrum by adopting a preset filter to obtain a target Mel spectrum corresponding to the standard voice; and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
Optionally, the matching module 303 is specifically configured to: extracting vector elements of the voice feature vector to obtain a plurality of vector elements; invoking a preset voice template vector, and performing similarity calculation on the plurality of vector elements to obtain voice similarity; and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
Optionally, the query module 304 is specifically configured to: performing voice matching on the target voice based on the target voice template to obtain voice data corresponding to the target voice; and respectively inquiring the text and the semantic corresponding to the voice data to obtain the text information and the semantic information corresponding to the target voice.
Optionally, the generating module 305 further includes: an image generation unit 3051 for inputting the text information and the semantic information into an image generator in a preset image generation model; extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information; invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information; and performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
Optionally, the generating module 305 further includes: an image discriminating unit 3052 for inputting the initial image into a discriminator in a preset image generation model, and acquiring a bounding box of the initial image through the discriminator; and setting a binary cross entropy loss function of the boundary box, and carrying out image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain detection results, wherein the detection results comprise edge correctness and edge errors.
Optionally, the image optimization module 307 is configured to, if the detection result is an edge error, feed back the detection result to the image generator, and perform image optimization on the initial image through the image generator.
The voice-based image generating apparatus in the embodiment of the present invention is described in detail above in terms of the modularized functional entity in fig. 3 and 4, and the voice-based image generating device in the embodiment of the present invention is described in detail below in terms of hardware processing.
Fig. 5 is a schematic structural diagram of a voice-based image generating apparatus according to an embodiment of the present invention, where the voice-based image generating apparatus 500 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 510 (e.g., one or more processors) and a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. Wherein memory 520 and storage medium 530 may be transitory or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the voice-based image generating apparatus 500. Still further, the processor 510 may be arranged to communicate with a storage medium 530, and to execute a series of instruction operations in the storage medium 530 on the speech-based image generating device 500.
The voice-based image generating apparatus 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the voice-based image generating device architecture shown in fig. 5 does not constitute a limitation of the voice-based image generating device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present invention also provides a voice-based image generating apparatus including a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform the steps of the voice-based image generating method in the above embodiments.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and which may also be a volatile computer readable storage medium, the computer readable storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the steps of the speech based image generation method.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice-based image generation method, characterized in that the voice-based image generation method comprises:
acquiring target voice to be processed, and preprocessing the target voice to obtain standard voice;
extracting the characteristics of the standard voice to obtain a voice characteristic vector;
performing similarity calculation on the voice feature vector to obtain voice similarity, and performing voice template matching on the target voice according to the voice similarity to obtain a target voice template;
according to the target voice template, text and semantic query is carried out on the target voice to obtain text information and semantic information corresponding to the target voice;
Inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges;
if the detection result is that the edge is correct, outputting the initial image as a target image through the image generation model;
inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image, wherein the step of obtaining an initial image comprises the following steps of:
inputting the text information and the semantic information into an image generator in a preset image generation model, extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information;
invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information;
And performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
2. The method of claim 1, wherein the feature extracting the standard speech to obtain a speech feature vector comprises:
performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice;
filtering the voice spectrum by adopting a preset filter to obtain a target Mel spectrum corresponding to the standard voice;
and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
3. The method of claim 1, wherein the performing similarity calculation on the speech feature vector to obtain a speech similarity, and performing speech template matching on the target speech according to the speech similarity, to obtain a target speech template comprises:
extracting vector elements of the voice feature vector to obtain a plurality of vector elements;
invoking a preset voice template vector, and performing similarity calculation on the plurality of vector elements to obtain voice similarity;
And carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
4. The method for generating a voice-based image according to claim 1, wherein the performing text and semantic query on the target voice according to the target voice template to obtain text information and semantic information corresponding to the target voice comprises:
performing voice matching on the target voice based on the target voice template to obtain voice data corresponding to the target voice;
and respectively inquiring the text and the semantic corresponding to the voice data to obtain the text information and the semantic information corresponding to the target voice.
5. The method for generating a voice-based image according to claim 1, wherein the detecting the initial image by a discriminator in a preset image generation model, and obtaining a detection result comprises:
inputting the initial image into a discriminator in a preset image generation model, and acquiring a boundary box of the initial image through the discriminator;
and setting a binary cross entropy loss function of the boundary box, and carrying out image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain detection results, wherein the detection results comprise edge correctness and edge errors.
6. The voice-based image generating method according to any one of claims 1 to 5, wherein after the image detection of the initial image by a discriminator in a preset image generating model, further comprising:
and if the detection result is an edge error, feeding back the detection result to the image generator, and performing image optimization on the initial image through the image generator.
7. A speech-based image generation apparatus, characterized in that the speech-based image generation apparatus comprises:
the acquisition module is used for acquiring target voice to be processed and preprocessing the target voice to obtain standard voice;
the feature extraction module is used for extracting features of the standard voice to obtain a voice feature vector;
the matching module is used for carrying out similarity calculation on the voice feature vectors to obtain voice similarity, and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template;
the query module is used for carrying out text and semantic query on the target voice according to the target voice template to obtain text information and semantic information corresponding to the target voice;
The generation module is used for inputting the text information and the semantic information into an image generator in a preset image generation model to generate images so as to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model so as to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges;
the output module is used for outputting the initial image as a target image through the image generation model if the detection result is that the edge is correct;
the generation module further includes: the image generation unit is used for inputting the text information and the semantic information into an image generator in a preset image generation model, extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information; invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information; and performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
8. The voice-based image generating apparatus according to claim 7, wherein the voice-based image generating apparatus further comprises: and the image optimization module is used for feeding back the detection result to the image generator if the detection result is edge error, and carrying out image optimization on the initial image through the image generator.
9. A voice-based image generating apparatus, characterized in that the voice-based image generating apparatus comprises: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the speech-based image generation device to perform the speech-based image generation method of any of claims 1-6.
10. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the speech-based image generation method of any of claims 1-6.
CN202110607680.3A 2021-06-01 2021-06-01 Voice-based image generation method, device, equipment and storage medium Active CN113256751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110607680.3A CN113256751B (en) 2021-06-01 2021-06-01 Voice-based image generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110607680.3A CN113256751B (en) 2021-06-01 2021-06-01 Voice-based image generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113256751A CN113256751A (en) 2021-08-13
CN113256751B true CN113256751B (en) 2023-09-29

Family

ID=77185708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110607680.3A Active CN113256751B (en) 2021-06-01 2021-06-01 Voice-based image generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113256751B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114343640B (en) * 2022-01-07 2023-10-13 北京师范大学 Attention assessment method and electronic equipment

Citations (3)

Publication number Priority date Publication date Assignee Title
KR20000056203A (en) * 1999-02-13 2000-09-15 이경목 Language study system by interactive conversation
WO2017050067A1 (en) * 2015-09-25 2017-03-30 中兴通讯股份有限公司 Video communication method, apparatus, and system
CN111477247A (en) * 2020-04-01 2020-07-31 宁波大学 GAN-based voice countermeasure sample generation method

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
EP3282447B1 (en) * 2015-03-31 2020-08-26 Sony Corporation PROGRESSIVE UTTERANCE ANALYSIS FOR SUCCESSIVELY DISPLAYING EARLY SUGGESTIONS BASED ON PARTIAL SEMANTIC PARSES FOR VOICE CONTROL. 
REAL TIME PROGRESSIVE SEMANTIC UTTERANCE ANALYSIS FOR VISUALIZATION AND ACTIONS CONTROL.
US10891969B2 (en) * 2018-10-19 2021-01-12 Microsoft Technology Licensing, Llc Transforming audio content into images


Non-Patent Citations (1)

Title
Application and Development of Intelligent Speech Recognition Technology in Commercial Banks in the FinTech Era; 王彦博; 桂小柯; 杨璇; 杜新凯; 卢佳慧; 中国金融电脑 (China Financial Computer), No. 05, pp. 37-40 *

Also Published As

Publication number Publication date
CN113256751A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
Sahidullah et al. A comparison of features for synthetic speech detection
Tan et al. Dynamic time warping and sparse representation classification for birdsong phrase classification using limited training data
Rajisha et al. Performance analysis of Malayalam language speech emotion recognition system using ANN/SVM
CN105702251B (en) Reinforce the speech-emotion recognition method of audio bag of words based on Top-k
Carbonneau et al. Feature learning from spectrograms for assessment of personality traits
CN115641834A (en) Voice synthesis method and device, electronic equipment and storage medium
Ludena-Choez et al. Bird sound spectrogram decomposition through Non-Negative Matrix Factorization for the acoustic classification of bird species
Hook et al. Automatic speech based emotion recognition using paralinguistics features
CN117116290B (en) Method and related equipment for positioning defects of numerical control machine tool parts based on multidimensional characteristics
CN113450828A (en) Music genre identification method, device, equipment and storage medium
CN113256751B (en) Voice-based image generation method, device, equipment and storage medium
CN115759071A (en) Government affair sensitive information identification system and method based on big data
Dutta et al. Language identification using phase information
Amid et al. Unsupervised feature extraction for multimedia event detection and ranking using audio content
Sharma et al. ASe: Acoustic Scene Embedding Using Deep Archetypal Analysis and GMM.
CN114786059B (en) Video generation method, video generation device, electronic device, and storage medium
Ruiz-Muñoz et al. Enhancing the dissimilarity-based classification of birdsong recordings
Yang et al. Sound event detection in real-life audio using joint spectral and temporal features
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
Noyum et al. Boosting the predictive accurary of singer identification using discrete wavelet transform for feature extraction
Paulino et al. A brazilian speech database
Jog et al. Indian language identification using cochleagram based texture descriptors and ANN classifier
Muñoz-Romero et al. Nonnegative OPLS for supervised design of filter banks: application to image and audio feature extraction
Kannapiran et al. Voice-based gender recognition model using FRT and light GBM
Spevak et al. Sound spotting–a frame-based approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant