CN113256751B - Voice-based image generation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113256751B
CN113256751B · Application CN202110607680.3A
Authority
CN
China
Prior art keywords: voice, image, target, preset, information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110607680.3A
Other languages
Chinese (zh)
Other versions
CN113256751A (en)
Inventor
张旭龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110607680.3A
Publication of CN113256751A
Application granted
Publication of CN113256751B
Status: Active
Anticipated expiration

Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06F 16/683: Retrieval of audio data using metadata automatically derived from the content
    • G06T 7/13: Edge detection
    • G10L 15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/26: Speech to text systems
    • G10L 25/03: Speech or voice analysis characterised by the type of extracted parameters
    • G10L 2015/0631: Creating reference templates; Clustering
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to the field of artificial intelligence and discloses a voice-based image generation method, device, equipment and storage medium for improving the accuracy of images synthesized from speech. The method comprises the following steps: acquiring a target voice to be processed and preprocessing it to obtain a standard voice; extracting features from the standard voice to obtain a voice feature vector; calculating the voice similarity of the voice feature vector and matching the target voice against voice templates according to the similarity to obtain a target voice template; querying the text and semantics of the target voice according to the target voice template to obtain text information and semantic information; inputting the text information and semantic information into an image generator to generate an initial image, and performing image detection on the initial image with a discriminator to obtain a detection result; and if the detection result is that the edges are correct, taking the initial image as the output of the image generation model to obtain the target image.

Description

Voice-based image generation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for generating an image based on voice.
Background
With the rapid development of artificial intelligence technology, computer-generated images have become practical. Automatically generating images by computer has important applications in artwork creation, data augmentation and other areas. At present, computers generate images mainly from textual descriptions, a technology with important applications in human-computer interaction and computer-aided design. A computer can generate an image semantically consistent with a creator's text instruction, which speeds up the creative workflow. The method has potential application value in daily life and in professional fields such as criminal investigation, the sharing of design ideas, and the recording of recollections.
Most image synthesis methods synthesize images from global sentence vectors, which may lose important fine-grained word-level information and thereby introduce errors into the generated images; consequently, the accuracy of images generated by conventional schemes is low.
Disclosure of Invention
The invention provides a voice-based image generation method, a device, equipment and a storage medium, which are used for improving the accuracy of a voice synthesized image.
The first aspect of the present invention provides a voice-based image generation method, which includes: acquiring target voice to be processed, and preprocessing the target voice to obtain standard voice; extracting the characteristics of the standard voice to obtain a voice characteristic vector; performing similarity calculation on the voice feature vector to obtain voice similarity, and performing voice template matching on the target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query is carried out on the target voice to obtain text information and semantic information corresponding to the target voice; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through the image generation model.
Optionally, in a first implementation manner of the first aspect of the present invention, the extracting features of the standard speech to obtain a speech feature vector includes: performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice; filtering the voice spectrum by adopting a preset filter to obtain a target Mel spectrum corresponding to the standard voice; and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
Optionally, in a second implementation manner of the first aspect of the present invention, the calculating the similarity of the speech feature vector to obtain a speech similarity, and performing speech template matching on the target speech according to the speech similarity, to obtain a target speech template includes: extracting vector elements of the voice feature vector to obtain a plurality of vector elements; invoking a preset voice template vector, and performing similarity calculation on the plurality of vector elements to obtain voice similarity; and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
Optionally, in a third implementation manner of the first aspect of the present invention, the performing text and semantic query on the target voice according to the target voice template to obtain text information and semantic information corresponding to the target voice includes: performing voice matching on the target voice based on the target voice template to obtain voice data corresponding to the target voice; and respectively inquiring the text and the semantic corresponding to the voice data to obtain the text information and the semantic information corresponding to the target voice.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the inputting the text information and the semantic information into an image generator in a preset image generation model to perform image generation, and obtaining an initial image includes: inputting the text information and the semantic information into an image generator in a preset image generation model, extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information; invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information; and performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the performing, by using a discriminator in a preset image generation model, image detection on the initial image, to obtain a detection result includes: inputting the initial image into a discriminator in a preset image generation model, and acquiring a boundary box of the initial image through the discriminator; and setting a binary cross entropy loss function of the boundary box, and carrying out image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain detection results, wherein the detection results comprise edge correctness and edge errors.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the performing image detection on the initial image by using a discriminator in a preset image generation model, the method further includes: and if the detection result is an edge error, feeding back the detection result to the image generator, and performing image optimization on the initial image through the image generator.
A second aspect of the present invention provides a voice-based image generation apparatus including: the acquisition module is used for acquiring target voice to be processed and preprocessing the target voice to obtain standard voice; the feature extraction module is used for extracting features of the standard voice to obtain a voice feature vector; the matching module is used for carrying out similarity calculation on the voice feature vectors to obtain voice similarity, and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template; the query module is used for carrying out text and semantic query on the target voice according to the target voice template to obtain text information and semantic information corresponding to the target voice; the generation module is used for inputting the text information and the semantic information into an image generator in a preset image generation model to generate images so as to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model so as to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and the output module is used for outputting the initial image as a target image through the image generation model if the detection result is that the edge is correct.
Optionally, in a first implementation manner of the second aspect of the present invention, the feature extraction module is specifically configured to: performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice; filtering the voice spectrum by adopting a preset filter to obtain a target Mel spectrum corresponding to the standard voice; and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
Optionally, in a second implementation manner of the second aspect of the present invention, the matching module is specifically configured to: extracting vector elements of the voice feature vector to obtain a plurality of vector elements; invoking a preset voice template vector, and performing similarity calculation on the plurality of vector elements to obtain voice similarity; and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
Optionally, in a third implementation manner of the second aspect of the present invention, the query module is specifically configured to: performing voice matching on the target voice based on the target voice template to obtain voice data corresponding to the target voice; and respectively inquiring the text and the semantic corresponding to the voice data to obtain the text information and the semantic information corresponding to the target voice.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the generating module further includes: the image generation unit is used for inputting the text information and the semantic information into an image generator in a preset image generation model, extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information; invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information; and performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the generating module further includes: the image distinguishing unit is used for inputting the initial image into a distinguishing device in a preset image generation model and acquiring a boundary box of the initial image through the distinguishing device; and setting a binary cross entropy loss function of the boundary box, and carrying out image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain detection results, wherein the detection results comprise edge correctness and edge errors.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the voice-based image generating apparatus further includes: and the image optimization module is used for feeding back the detection result to the image generator if the detection result is edge error, and carrying out image optimization on the initial image through the image generator.
A third aspect of the present invention provides a voice-based image generation apparatus, comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the speech-based image generation device to perform the speech-based image generation method described above.
A fourth aspect of the present invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the above-described speech-based image generation method.
In the technical scheme provided by the invention, target voice to be processed is obtained, and the target voice is preprocessed to obtain standard voice; extracting the characteristics of the standard voice to obtain a voice characteristic vector; performing similarity calculation on the voice feature vector to obtain voice similarity, and performing voice template matching on the target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query is carried out on the target voice to obtain text information and semantic information corresponding to the target voice; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through the image generation model. The invention detects the initial image generated by the image generator and optimizes the initial image with the detection result of edge error, thereby improving the accuracy of the speech synthesis image.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a speech-based image generation method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a second embodiment of a speech-based image generation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a first embodiment of a speech-based image generating apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a second embodiment of a speech-based image generating apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of a voice-based image generating apparatus in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a voice-based image generation method, device and equipment and a storage medium, which are used for improving the accuracy of a voice synthesized image. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to fig. 1, where a first embodiment of a speech-based image generation method according to an embodiment of the present invention includes:
101. acquiring target voice to be processed, and preprocessing the target voice to obtain standard voice;
It should be understood that the execution subject of the present invention may be a voice-based image generation apparatus or a server, which is not limited herein. The embodiment of the present invention is described taking a server as the execution subject.
The server acquires a target voice to be processed. The target voice may be speech from various professional fields or from daily life, for example: voice from a criminal investigation, voice sharing a design idea, voice recording a recollection, and the like. Preprocessing the target voice means removing additive noise from the target voice to obtain the standard voice.
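The preprocessing step above (removing additive noise to obtain the standard voice) is not tied to a specific algorithm in the text; a minimal sketch using spectral subtraction, one common choice for additive-noise removal, might look like this (the sample rate, window length, and noise-frame count are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

def denoise_spectral_subtraction(x, fs=16000, nperseg=256, noise_frames=10):
    """Remove additive noise via spectral subtraction (one common choice;
    the patent does not name a specific preprocessing algorithm)."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    # Estimate the noise floor from the first few (assumed speech-free) frames.
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise, 0.0)
    # Resynthesize the "standard voice" from the cleaned magnitude spectrum.
    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return y
```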
102. Extracting features of standard voice to obtain voice feature vectors;
The server performs a short-time Fourier transform on the standard voice to obtain the voice spectrum corresponding to the standard voice, then filters the voice spectrum with a preset filter to obtain the target Mel spectrum corresponding to the standard voice, and performs vector coding on the target Mel spectrum using a preset coding rule to obtain the voice feature vector. The target Mel spectrum comprises time-domain data, frequency-domain data and channel data; the number of channels may be 1, with the channel data set to an initial value. The analysis spectrum may be encoded by a preset coding method so that the variable-length analysis spectrum is represented by a fixed-length feature vector, yielding the voice feature vector.
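The pipeline described above (STFT, Mel filtering, then encoding into a fixed-length vector) can be sketched as follows. The triangular filterbank construction is a standard one, and the mean-over-time pooling used to obtain a fixed-length vector is an assumption, since the patent does not specify its coding rule:

```python
import numpy as np
from scipy.signal import stft

def mel_filterbank(n_mels, nfft, fs):
    """Triangular Mel filters (standard construction; parameters are
    illustrative, not taken from the patent)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):           # rising slope of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):           # falling slope of the triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def voice_feature_vector(x, fs=16000, nperseg=512, n_mels=40):
    """STFT -> Mel spectrum -> fixed-length vector (mean over time), so a
    variable-length signal yields a fixed-length feature vector."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    power = np.abs(Z) ** 2                           # (freq, time) power spectrum
    mel = mel_filterbank(n_mels, nperseg, fs) @ power
    logmel = np.log(mel + 1e-10)
    return logmel.mean(axis=1)                       # fixed length: n_mels
```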
103. Performing similarity calculation on the voice feature vectors to obtain voice similarity, and performing voice template matching on target voice according to the voice similarity to obtain a target voice template;
The server performs the similarity calculation between the voice feature vector and the template vector corresponding to a preset voice template, using the mean variance as the similarity measure, for example: when the voice feature vector is [2,3,5,0] and the template vector is [2,3,4,0], the voice similarity calculated by the mean-variance function is 0.33. The smaller the voice similarity, i.e. the smaller the mean variance, the more similar the voice feature vector and the template vector. The preset voice templates are compared for similarity with the voice feature vector of the input target voice, the templates matching the target voice are computed, and the voice template whose template vector is closest to the voice feature vector is taken as the target voice template.
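A sketch of the similarity calculation and template matching follows. The exact "mean variance" formula is not given in the text; dividing the sum of squared differences by n - 1 reproduces the worked example (0.33 for [2,3,5,0] vs [2,3,4,0]), so that reading is assumed here:

```python
import numpy as np

def voice_similarity(feat, tmpl):
    """'Mean variance' between a feature vector and a template vector.
    Interpreted as sum of squared differences over (n - 1), which matches
    the patent's worked example; this reading is an assumption."""
    feat, tmpl = np.asarray(feat, float), np.asarray(tmpl, float)
    return float(np.sum((feat - tmpl) ** 2) / (len(feat) - 1))

def match_template(feat, templates):
    """Pick the template with the smallest similarity value, since a
    smaller mean variance means a closer match per the embodiment."""
    return min(templates, key=lambda name: voice_similarity(feat, templates[name]))
```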
104. According to the target voice template, text and semantic query are carried out on the target voice, and text information and semantic information corresponding to the target voice are obtained;
The server queries the text and semantics of the target voice according to the target voice template. Text information and semantic information can be obtained from the target voice template by table lookup, since the target voice template contains the characters corresponding to each voice, for example: the syllable "mao" in the first tone corresponds to the word "cat". The text information and semantic information corresponding to the target voice are obtained from the stored voice most similar to the input target voice.
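The table lookup described above can be sketched as a plain mapping from matched voice units to characters and semantics. The table entries, keys, and semantic labels below are purely hypothetical illustrations, not data from the patent:

```python
# Hypothetical lookup table: the patent says the target voice template
# contains the character for each voice (e.g. "mao" in the first tone ->
# "cat"); the entries and the semantic labels here are illustrative only.
VOICE_TEMPLATE = {
    "mao1": {"text": "cat", "semantics": "animal:cat"},
    "gou3": {"text": "dog", "semantics": "animal:dog"},
}

def query_text_and_semantics(voice_units):
    """Table lookup from matched voice units to text and semantic info."""
    text = [VOICE_TEMPLATE[u]["text"] for u in voice_units]
    semantics = [VOICE_TEMPLATE[u]["semantics"] for u in voice_units]
    return text, semantics
```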
105. Inputting text information and semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges;
The server inputs the text information and semantic information into the image generator in the preset image generation model to generate an initial image. The image generation model is an ObjGAN model, in which the image generator is responsible for generating the initial image and the discriminator detects whether the initial image generated by the generator meets the preset requirements, yielding a detection result that is either edge-correct or edge-error.
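The discriminator's check can be sketched at a very high level. The real ObjGAN discriminator scores generated objects against their bounding boxes under a binary cross-entropy loss; the threshold-based verdict below is a toy stand-in for that decision, not the actual model:

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy loss, as used by the bounding-box discriminator
    described in the embodiment (implementation details assumed)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def detect_edges(edge_scores, threshold=0.5):
    """Toy stand-in for the discriminator's verdict: the real model scores
    each bounding box; here we simply threshold the mean predicted score."""
    return "edge-correct" if np.mean(edge_scores) >= threshold else "edge-error"
```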
106. And if the detection result is that the edge is correct, outputting the initial image as a target image through an image generation model.
If the detection result is edge-correct, the server outputs the initial image with correct edges as the target image. The target image is synthesized from the most relevant target objects in the text information and the pre-generated semantic layout, so a high-quality target image can be generated quickly from the text information.
In the embodiment of the invention, the target voice to be processed is obtained, and the target voice is preprocessed to obtain the standard voice; extracting features of standard voice to obtain voice feature vectors; performing similarity calculation on the voice feature vectors to obtain voice similarity, and performing voice template matching on target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query are carried out on the target voice, and text information and semantic information corresponding to the target voice are obtained; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through an image generation model. The invention detects the initial image generated by the image generator and optimizes the initial image with the detection result of edge error, thereby improving the accuracy of the speech synthesis image.
Referring to fig. 2, a second embodiment of a voice-based image generation method according to an embodiment of the present invention includes:
201. acquiring target voice to be processed, and preprocessing the target voice to obtain standard voice;
The server acquires a target voice to be processed. The target voice may be speech from various professional fields or from daily life, for example: voice from a criminal investigation, voice sharing a design idea, voice recording a recollection, and the like. Preprocessing the target voice means removing additive noise from the target voice to obtain the standard voice.
202. Extracting the characteristics of the standard voice to obtain a voice characteristic vector;
The server performs a short-time Fourier transform on the standard voice to obtain the voice spectrum corresponding to the standard voice, then filters the voice spectrum with a preset filter to obtain the target Mel spectrum corresponding to the standard voice, and performs vector coding on the target Mel spectrum using a preset coding rule to obtain the voice feature vector. The target Mel spectrum comprises time-domain data, frequency-domain data and channel data; the number of channels may be 1, with the channel data set to an initial value. The analysis spectrum may be encoded by a preset coding method so that the variable-length analysis spectrum is represented by a fixed-length feature vector, yielding the voice feature vector.
Optionally, step 202 includes: performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice; filtering the voice frequency spectrum by adopting a preset filter to obtain a target Mel frequency spectrum corresponding to the standard voice; and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
The server performs a short-time Fourier transform (STFT) on the standard voice. The STFT is a Fourier-related transform used to determine the frequency and phase of the sinusoidal components in a local region of a time-varying signal, and is a commonly used time-frequency analysis method: the signal within a time window represents the signal characteristics of the voice at a given moment. In the STFT, the window length determines the time resolution and frequency resolution of the spectrogram: the longer the window, the longer the intercepted signal, the higher the frequency resolution after the Fourier transform, and the worse the time resolution; conversely, the shorter the window, the shorter the intercepted signal, the worse the frequency resolution and the better the time resolution. In addition, the preset filter is a Mel filter; after the voice spectrum is filtered, the noise signal is removed and the target Mel spectrum of the target voice is obtained. The target Mel spectrum is encoded using the preset coding rule to obtain the voice feature vector.
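The window-length tradeoff described above can be demonstrated directly: a long window yields more frequency bins but fewer time frames, and a short window the reverse (the sample rate, tone frequency, and window lengths below are illustrative):

```python
import numpy as np
from scipy.signal import stft

# 1 kHz tone, 1 second at 16 kHz, purely for illustration.
x = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000.0)

# Long window: many frequency bins (fine frequency resolution), few frames.
f_long, t_long, _ = stft(x, fs=16000, nperseg=2048)
# Short window: few frequency bins, many frames (fine time resolution).
f_short, t_short, _ = stft(x, fs=16000, nperseg=128)

assert len(f_long) > len(f_short)   # better frequency resolution
assert len(t_short) > len(t_long)   # better time resolution
```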
203. Performing similarity calculation on the voice feature vectors to obtain voice similarity, and performing voice template matching on target voice according to the voice similarity to obtain a target voice template;
the server performs a similarity calculation between the voice feature vector and the template vector corresponding to a preset voice template, using a mean-variance calculation. For example: when the voice feature vector is [2,3,5,0] and the template vector is [2,3,4,0], the voice similarity calculated by the mean-variance function is 0.33; that is, the voice similarity between the voice feature vector and the preset template vector is 0.33. The smaller the voice similarity, the smaller the mean variance, and the closer the voice feature vector is to the template vector. The preset voice templates are compared for similarity with the voice feature vector of the input target voice, the templates matching the target voice are computed, and the voice template whose template vector is closest to the voice feature vector is taken as the target voice template.
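The patent does not define "mean variance" precisely; one reading that reproduces the 0.33 in the example above is the sum of squared element differences divided by (n-1), i.e. the sample variance of a zero-mean difference. A minimal sketch under that assumption:

```python
def mean_variance_similarity(feature, template):
    """Sum of squared element differences over (n-1) -- an assumed reading
    of the patent's 'mean variance'. Smaller values mean closer vectors."""
    assert len(feature) == len(template)
    n = len(feature)
    return sum((a - b) ** 2 for a, b in zip(feature, template)) / (n - 1)

sim = mean_variance_similarity([2, 3, 5, 0], [2, 3, 4, 0])
print(round(sim, 2))  # 0.33
```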
Optionally, step 203 includes: extracting vector elements of the voice feature vector to obtain a plurality of vector elements; invoking a preset voice template vector, and performing similarity calculation on a plurality of vector elements to obtain voice similarity; and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
The server extracts the vector elements of the voice feature vector to obtain a plurality of vector elements, invokes a preset voice template vector, and performs a similarity calculation on the plurality of vector elements to obtain the voice similarity. For example: when the voice feature vector is [1,2,3,4], its extracted vector elements are 1, 2, 3 and 4; a preset voice template vector is obtained, and its extracted vector elements are 1, 3 and 5; the mean-variance calculation between the template vector's elements and the voice feature vector's elements gives a result of 0.5, so the voice similarity is 0.5. The smaller the voice similarity, the smaller the mean variance, and the closer the voice feature vector is to the template vector. The voice template closest to the target voice is matched according to the voice similarity, giving the target voice template.
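The matching step then reduces to an arg-min over the preset templates. A sketch, reusing the assumed mean-variance score; the template names and vector values are hypothetical:

```python
def mean_variance(a, b):
    """Assumed similarity score: sum of squared element differences over
    (n-1); smaller means closer."""
    n = len(a)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / (n - 1)

def match_template(feature_vector, templates):
    """Return the name of the preset voice template whose vector minimizes
    the similarity score, i.e. the closest match to the target voice."""
    return min(templates, key=lambda name: mean_variance(feature_vector,
                                                         templates[name]))

# Hypothetical template library, for illustration only.
templates = {"cat": [1, 2, 3, 5], "dog": [4, 4, 4, 4]}
best = match_template([1, 2, 3, 4], templates)
print(best)  # cat
```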
204. According to the target voice template, text and semantic query are carried out on the target voice, and text information and semantic information corresponding to the target voice are obtained;
the server performs a text and semantic query on the target voice according to the target voice template; the text information and the semantic information can be obtained by table lookup in the target voice template, which contains the text corresponding to each voice. For example: the voice "mao" in the first tone corresponds to the word "cat", so the text information and semantic information corresponding to the target voice can be obtained from the stored voice most similar to the input target voice.
Optionally, step 204 includes: performing voice matching on target voice based on a target voice template to obtain voice data corresponding to the target voice; and respectively inquiring the text and the semantic corresponding to the voice data to obtain text information and semantic information corresponding to the target voice.
The server performs voice matching on the target voice based on the target voice template to obtain the voice data corresponding to the target voice, then performs the text and semantic queries on the voice data. The text information and semantic information can be obtained by table lookup in the target voice template, which contains the text corresponding to each voice. For example: the voice "sofa" corresponds to the word "sofa", so the text information and semantic information corresponding to the target voice can be obtained from the stored voice most similar to the input target voice.
205. Inputting text information and semantic information into an image generator in a preset image generation model, extracting image objects from the text information by adopting a preset attention mechanism to obtain target objects, and extracting semantic layout information from the semantic information to obtain target layout information;
the server inputs the text information and the semantic information into the image generator in a preset image generation model; the image generator processes the text information and the semantic information with an attention mechanism. The server extracts the image objects from the text information with the preset attention mechanism to obtain the target objects, for example: when the text information is "In the afternoon, an orange cat lies on a warm sofa for an afternoon nap", only the two target objects "orange cat" and "sofa" remain after the attention mechanism is applied. The server also extracts the semantic layout information from the semantic information to obtain the target layout information.
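The patent does not detail its attention mechanism; as a toy stand-in, one can score each token against an "object" query vector with dot-product attention and keep the top-weighted tokens. The 2-d embeddings, query vector and token set below are entirely hypothetical:

```python
import numpy as np

def extract_objects(tokens, embeddings, object_query, keep=2):
    """Score tokens against an object query via dot-product attention and
    keep the top-scoring ones -- a toy sketch of attention-based object
    extraction, not the patent's actual mechanism."""
    scores = np.array([embeddings[t] @ object_query for t in tokens])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax attention weights
    top = np.argsort(weights)[::-1][:keep]
    return [tokens[i] for i in sorted(top)]      # preserve sentence order

# Hypothetical embeddings: object-like words point along axis 0.
emb = {"afternoon": np.array([0.1, 0.9]),
       "orange-cat": np.array([0.9, 0.1]),
       "lies-on":    np.array([0.2, 0.8]),
       "sofa":       np.array([0.8, 0.2])}
objs = extract_objects(list(emb), emb, object_query=np.array([1.0, 0.0]))
print(objs)  # ['orange-cat', 'sofa']
```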
206. Calling a preset text image library to search an image of a target object to obtain an image corresponding to text information, and calling a preset text relation graph to search target layout information to obtain an image relation corresponding to semantic information;
the server obtains the image corresponding to the text information and the relations between images by searching a text-image library and a text relation graph, for example: "In the afternoon, an orange cat lies on a warm sofa for an afternoon nap" is the text information given to the generator; the two target objects "orange cat" and "sofa" in the text information are obtained through the attention mechanism; the two images to be synthesized, an orange-cat image and a sofa image, are obtained by searching the text-image library; and the image relation between the images to be synthesized is obtained by combining the semantic information.
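The two lookups above can be mirrored as simple table lookups; the library contents, file paths and relation labels below are hypothetical placeholders, since the patent does not specify storage formats:

```python
# Hypothetical text-image library and text relation graph.
text_image_library = {"orange cat": "images/orange_cat.png",
                      "sofa": "images/sofa.png"}
text_relation_graph = {("orange cat", "sofa"): "lies on"}

def search(target_objects):
    """Fetch each target object's image from the text-image library and
    the pairwise relation between objects from the text relation graph."""
    images = {obj: text_image_library[obj] for obj in target_objects}
    relations = {pair: rel for pair, rel in text_relation_graph.items()
                 if pair[0] in target_objects and pair[1] in target_objects}
    return images, relations

images, relations = search(["orange cat", "sofa"])
```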
207. Image synthesis is carried out on the images corresponding to the text information according to the image relationship, and an initial image is obtained;
the server generates a high-resolution target image by repeatedly running the image generator from low resolution to high resolution. When generating the target image, the image generator synthesizes the image region within each bounding box by identifying the words most relevant to the target object in that bounding box of the target image.
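The per-box synthesis can be pictured as pasting each object's patch into its bounding box on a canvas. A minimal NumPy sketch; the patch values, box coordinates and grayscale representation are illustrative assumptions:

```python
import numpy as np

def compose(canvas_shape, crops_and_boxes):
    """Paste each object's image patch into its bounding box on a blank
    canvas -- a minimal sketch of synthesizing image regions inside
    bounding boxes (later patches overwrite earlier ones)."""
    canvas = np.zeros(canvas_shape, dtype=np.uint8)
    for crop, (y, x) in crops_and_boxes:         # box given by top-left corner
        h, w = crop.shape[:2]
        canvas[y:y + h, x:x + w] = crop
    return canvas

sofa = np.full((4, 6), 200, dtype=np.uint8)      # hypothetical patches
cat = np.full((2, 3), 120, dtype=np.uint8)
image = compose((8, 8), [(sofa, (4, 1)), (cat, (2, 2))])
```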
208. Image detection is carried out on the initial image through a discriminator in a preset image generation model, so that a detection result is obtained, and the detection result comprises correct edges and incorrect edges;
the server uses the discriminator to check each bounding box, ensuring that the target object to be generated indeed matches the pre-generated semantic information. The discriminator uses Fast R-CNN, which improves detection efficiency and accuracy.
Optionally, step 208 includes: inputting the initial image into the discriminator in the preset image generation model, and acquiring the bounding boxes of the initial image through the discriminator; and setting a binary cross entropy loss function on the bounding boxes, and performing image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges.
The server inputs the initial image into the discriminator in the image generation model, acquires the bounding boxes of the initial image through the discriminator, classifies the initial image with Fast R-CNN to obtain the classification results of correct edges and incorrect edges, and sets a binary cross entropy loss function on each bounding box, as follows:
L_i = -[y·log(y') + (1 - y)·log(1 - y')];
where L_i denotes the loss function of the ith bounding box, y denotes the real edge condition, and y' denotes the generated edge condition. Setting the binary cross entropy loss improves the accuracy of the speech-synthesized image.
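The binary cross entropy above can be written directly in NumPy; the clipping constant and the averaging over a box's edge labels are implementation assumptions:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross entropy L_i = -[y*log(y') + (1-y)*log(1-y')],
    averaged over a bounding box's edge labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)       # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

# A well-predicted box incurs low loss, a mispredicted one high loss.
good = bce_loss(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
bad = bce_loss(np.array([1.0, 0.0]), np.array([0.1, 0.9]))
```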
Optionally, after step 208, the method further includes: if the detection result is an incorrect edge, feeding back the detection result to the image generator, and performing image optimization on the initial image through the image generator.
If the detection result is an incorrect edge, the server feeds the detection result back to the image generator, which performs image optimization on the initial image: the erroneous image is deleted, the image corresponding to the correct target object is then synthesized with the remaining image, and the target image is output once the image detection passes.
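The generate-detect-optimize feedback loop can be sketched as follows; the callables, their signatures, the `"edge_correct"` label and the retry bound are all hypothetical stand-ins for the generator and discriminator described above:

```python
def generate_with_feedback(generator, discriminator, text, semantics,
                           max_rounds=3):
    """Generate an image, detect edge errors, and feed errors back to the
    generator for optimization -- a sketch of the loop described above."""
    image = generator(text, semantics)
    for _ in range(max_rounds):
        if discriminator(image, semantics) == "edge_correct":
            return image                         # output as the target image
        image = generator(text, semantics, previous=image)  # optimize
    return image

# Toy stand-ins that succeed on the second attempt.
attempts = []
def gen(text, semantics, previous=None):
    attempts.append(1)
    return len(attempts)                         # "image" = attempt counter
def disc(image, semantics):
    return "edge_correct" if image >= 2 else "edge_error"

result = generate_with_feedback(gen, disc, "text", "semantics")
```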
209. And if the detection result is that the edge is correct, outputting the initial image as a target image through an image generation model.
If the detection result is that the edge is correct, the server outputs an initial image with the correct edge as a target image, and the target image synthesizes an image through the most relevant target object in the text information and the pre-generated semantic information, so that the high-quality target image can be quickly generated according to the text information.
In the embodiment of the invention, the target voice to be processed is obtained, and the target voice is preprocessed to obtain the standard voice; extracting features of standard voice to obtain voice feature vectors; performing similarity calculation on the voice feature vectors to obtain voice similarity, and performing voice template matching on target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query are carried out on the target voice, and text information and semantic information corresponding to the target voice are obtained; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through an image generation model. The invention detects the initial image generated by the image generator and optimizes the initial image with the detection result of edge error, thereby improving the accuracy of the speech synthesis image.
The method for generating a voice-based image according to the embodiment of the present invention is described above, and the apparatus for generating a voice-based image according to the embodiment of the present invention is described below, referring to fig. 3, where a first embodiment of the apparatus for generating a voice-based image according to the embodiment of the present invention includes:
The obtaining module 301 is configured to obtain a target voice to be processed, and pre-process the target voice to obtain a standard voice;
the feature extraction module 302 is configured to perform feature extraction on the standard speech to obtain a speech feature vector;
the matching module 303 is configured to perform similarity calculation on the speech feature vector to obtain a speech similarity, and perform speech template matching on the target speech according to the speech similarity to obtain a target speech template;
the query module 304 is configured to perform text and semantic query on the target voice according to the target voice template, so as to obtain text information and semantic information corresponding to the target voice;
the generating module 305 is configured to input the text information and the semantic information into an image generator in a preset image generation model to generate an image, obtain an initial image, and perform image detection on the initial image through a discriminator in the preset image generation model to obtain a detection result, where the detection result includes an edge correct and an edge incorrect;
and the output module 306 is configured to output the initial image as a target image through the image generation model if the detection result indicates that the edge is correct.
In the embodiment of the invention, the target voice to be processed is obtained, and the target voice is preprocessed to obtain the standard voice; extracting the characteristics of the standard voice to obtain a voice characteristic vector; performing similarity calculation on the voice feature vector to obtain voice similarity, and performing voice template matching on the target voice according to the voice similarity to obtain a target voice template; according to the target voice template, text and semantic query is carried out on the target voice to obtain text information and semantic information corresponding to the target voice; inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges; and if the detection result is that the edge is correct, outputting the initial image as a target image through the image generation model. The invention detects the initial image generated by the image generator and optimizes the initial image with the detection result of edge error, thereby improving the accuracy of the speech synthesis image.
Referring to fig. 4, a second embodiment of a voice-based image generating apparatus according to an embodiment of the present invention includes:
the obtaining module 301 is configured to obtain a target voice to be processed, and pre-process the target voice to obtain a standard voice;
the feature extraction module 302 is configured to perform feature extraction on the standard speech to obtain a speech feature vector;
the matching module 303 is configured to perform similarity calculation on the speech feature vector to obtain a speech similarity, and perform speech template matching on the target speech according to the speech similarity to obtain a target speech template;
the query module 304 is configured to perform text and semantic query on the target voice according to the target voice template, so as to obtain text information and semantic information corresponding to the target voice;
the generating module 305 is configured to input the text information and the semantic information into an image generator in a preset image generation model to generate an image, obtain an initial image, and perform image detection on the initial image through a discriminator in the preset image generation model to obtain a detection result, where the detection result includes an edge correct and an edge incorrect;
And the output module 306 is configured to output the initial image as a target image through the image generation model if the detection result indicates that the edge is correct.
Optionally, the feature extraction module 302 is specifically configured to: performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice; filtering the voice spectrum by adopting a preset filter to obtain a target Mel spectrum corresponding to the standard voice; and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
Optionally, the matching module 303 is specifically configured to: extracting vector elements of the voice feature vector to obtain a plurality of vector elements; invoking a preset voice template vector, and performing similarity calculation on the plurality of vector elements to obtain voice similarity; and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
Optionally, the query module 304 is specifically configured to: performing voice matching on the target voice based on the target voice template to obtain voice data corresponding to the target voice; and respectively inquiring the text and the semantic corresponding to the voice data to obtain the text information and the semantic information corresponding to the target voice.
Optionally, the generating module 305 further includes: an image generation unit 3051 for inputting the text information and the semantic information into an image generator in a preset image generation model; extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information; invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information; and performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
Optionally, the generating module 305 further includes: an image discriminating unit 3052 for inputting the initial image into a discriminator in a preset image generation model, and acquiring a bounding box of the initial image through the discriminator; and setting a binary cross entropy loss function of the boundary box, and carrying out image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain detection results, wherein the detection results comprise edge correctness and edge errors.
Optionally, the image optimization module 307 is configured to, if the detection result is an edge error, feed back the detection result to the image generator, and perform image optimization on the initial image through the image generator.
The voice-based image generating apparatus in the embodiment of the present invention is described in detail above in terms of the modularized functional entity in fig. 3 and 4, and the voice-based image generating device in the embodiment of the present invention is described in detail below in terms of hardware processing.
Fig. 5 is a schematic structural diagram of a voice-based image generating apparatus according to an embodiment of the present invention, where the voice-based image generating apparatus 500 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 510 (e.g., one or more processors) and a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. Wherein memory 520 and storage medium 530 may be transitory or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the voice-based image generating apparatus 500. Still further, the processor 510 may be arranged to communicate with a storage medium 530, and to execute a series of instruction operations in the storage medium 530 on the speech-based image generating device 500.
The voice-based image generating apparatus 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the voice-based image generating device architecture shown in fig. 5 does not constitute a limitation of the voice-based image generating device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present invention also provides a voice-based image generating apparatus including a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform the steps of the voice-based image generating method in the above embodiments.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and which may also be a volatile computer readable storage medium, the computer readable storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the steps of the speech based image generation method.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice-based image generation method, characterized in that the voice-based image generation method comprises:
acquiring target voice to be processed, and preprocessing the target voice to obtain standard voice;
extracting the characteristics of the standard voice to obtain a voice characteristic vector;
performing similarity calculation on the voice feature vector to obtain voice similarity, and performing voice template matching on the target voice according to the voice similarity to obtain a target voice template;
according to the target voice template, text and semantic query is carried out on the target voice to obtain text information and semantic information corresponding to the target voice;
Inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges;
if the detection result is that the edge is correct, outputting the initial image as a target image through the image generation model;
inputting the text information and the semantic information into an image generator in a preset image generation model to generate an image, wherein the step of obtaining an initial image comprises the following steps of:
inputting the text information and the semantic information into an image generator in a preset image generation model, extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information;
invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information;
And performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
2. The method of claim 1, wherein the feature extracting the standard speech to obtain a speech feature vector comprises:
performing short-time Fourier transform on the standard voice to obtain a voice frequency spectrum corresponding to the standard voice;
filtering the voice spectrum by adopting a preset filter to obtain a target Mel spectrum corresponding to the standard voice;
and carrying out vector coding on the target Mel frequency spectrum to obtain a voice characteristic vector.
3. The method of claim 1, wherein the performing similarity calculation on the speech feature vector to obtain a speech similarity, and performing speech template matching on the target speech according to the speech similarity, to obtain a target speech template comprises:
extracting vector elements of the voice feature vector to obtain a plurality of vector elements;
invoking a preset voice template vector, and performing similarity calculation on the plurality of vector elements to obtain voice similarity;
And carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template.
4. The method for generating a voice-based image according to claim 1, wherein the performing text and semantic query on the target voice according to the target voice template to obtain text information and semantic information corresponding to the target voice comprises:
performing voice matching on the target voice based on the target voice template to obtain voice data corresponding to the target voice;
and respectively inquiring the text and the semantic corresponding to the voice data to obtain the text information and the semantic information corresponding to the target voice.
5. The method for generating a voice-based image according to claim 1, wherein the detecting the initial image by a discriminator in a preset image generation model, and obtaining a detection result comprises:
inputting the initial image into a discriminator in a preset image generation model, and acquiring a boundary box of the initial image through the discriminator;
and setting a binary cross entropy loss function of the boundary box, and carrying out image detection on the initial image according to the binary cross entropy loss function and the semantic information to obtain detection results, wherein the detection results comprise edge correctness and edge errors.
6. The voice-based image generating method according to any one of claims 1 to 5, wherein after the image detection of the initial image by a discriminator in a preset image generating model, further comprising:
and if the detection result is an edge error, feeding back the detection result to the image generator, and performing image optimization on the initial image through the image generator.
7. A speech-based image generation apparatus, characterized in that the speech-based image generation apparatus comprises:
the acquisition module is used for acquiring target voice to be processed and preprocessing the target voice to obtain standard voice;
the feature extraction module is used for extracting features of the standard voice to obtain a voice feature vector;
the matching module is used for carrying out similarity calculation on the voice feature vectors to obtain voice similarity, and carrying out voice template matching on the target voice according to the voice similarity to obtain a target voice template;
the query module is used for carrying out text and semantic query on the target voice according to the target voice template to obtain text information and semantic information corresponding to the target voice;
The generation module is used for inputting the text information and the semantic information into an image generator in a preset image generation model to generate images so as to obtain an initial image, and detecting the initial image through a discriminator in the preset image generation model so as to obtain a detection result, wherein the detection result comprises correct edges and incorrect edges;
the output module is used for outputting the initial image as a target image through the image generation model if the detection result is that the edge is correct;
the generation module further includes: the image generation unit is used for inputting the text information and the semantic information into an image generator in a preset image generation model, extracting an image object from the text information by adopting a preset attention mechanism to obtain a target object, and extracting semantic layout information from the semantic information to obtain target layout information; invoking a preset text image library to search the image of the target object to obtain an image corresponding to the text information, and invoking a preset text relation graph to search the target layout information to obtain an image relation corresponding to the semantic information; and performing image synthesis on the image corresponding to the text information according to the image relation to obtain an initial image.
8. The voice-based image generating apparatus according to claim 7, wherein the voice-based image generating apparatus further comprises: and the image optimization module is used for feeding back the detection result to the image generator if the detection result is edge error, and carrying out image optimization on the initial image through the image generator.
9. A voice-based image generating apparatus, characterized in that the voice-based image generating apparatus comprises: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the speech-based image generation device to perform the speech-based image generation method of any of claims 1-6.
10. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the speech-based image generation method of any of claims 1-6.
CN202110607680.3A 2021-06-01 2021-06-01 Voice-based image generation method, device, equipment and storage medium Active CN113256751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110607680.3A CN113256751B (en) 2021-06-01 2021-06-01 Voice-based image generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110607680.3A CN113256751B (en) 2021-06-01 2021-06-01 Voice-based image generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113256751A CN113256751A (en) 2021-08-13
CN113256751B true CN113256751B (en) 2023-09-29

Family

ID=77185708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110607680.3A Active CN113256751B (en) 2021-06-01 2021-06-01 Voice-based image generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113256751B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114343640B (en) * 2022-01-07 2023-10-13 北京师范大学 Attention assessment method and electronic equipment

Citations (3)

Publication number Priority date Publication date Assignee Title
KR20000056203A (en) * 1999-02-13 2000-09-15 이경목 Language study system by interactive conversation
WO2017050067A1 (en) * 2015-09-25 2017-03-30 中兴通讯股份有限公司 Video communication method, apparatus, and system
CN111477247A (en) * 2020-04-01 2020-07-31 宁波大学 GAN-based voice countermeasure sample generation method

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
EP3282447B1 (en) * 2015-03-31 2020-08-26 Sony Corporation PROGRESSIVE UTTERANCE ANALYSIS FOR SUCCESSIVELY DISPLAYING EARLY SUGGESTIONS BASED ON PARTIAL SEMANTIC PARSES FOR VOICE CONTROL. 
REAL TIME PROGRESSIVE SEMANTIC UTTERANCE ANALYSIS FOR VISUALIZATION AND ACTIONS CONTROL.
US10891969B2 (en) * 2018-10-19 2021-01-12 Microsoft Technology Licensing, Llc Transforming audio content into images


Non-Patent Citations (1)

Title
Application and Development of Intelligent Speech Recognition Technology in Commercial Banks in the FinTech Era; 王彦博; 桂小柯; 杨璇; 杜新凯; 卢佳慧; 中国金融电脑 (China Financial Computer), No. 05, pp. 37-40 *

Also Published As

Publication number Publication date
CN113256751A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
Sahidullah et al. A comparison of features for synthetic speech detection
Tan et al. Dynamic time warping and sparse representation classification for birdsong phrase classification using limited training data
Rajisha et al. Performance analysis of Malayalam language speech emotion recognition system using ANN/SVM
CN105702251B (en) Reinforce the speech-emotion recognition method of audio bag of words based on Top-k
Carbonneau et al. Feature learning from spectrograms for assessment of personality traits
CN115641834A (en) Voice synthesis method and device, electronic equipment and storage medium
Ludena-Choez et al. Bird sound spectrogram decomposition through Non-Negative Matrix Factorization for the acoustic classification of bird species
Hook et al. Automatic speech based emotion recognition using paralinguistics features
CN117116290B (en) Method and related equipment for positioning defects of numerical control machine tool parts based on multidimensional characteristics
CN113450828A (en) Music genre identification method, device, equipment and storage medium
CN113256751B (en) Voice-based image generation method, device, equipment and storage medium
CN115759071A (en) Government affair sensitive information identification system and method based on big data
Dutta et al. Language identification using phase information
Amid et al. Unsupervised feature extraction for multimedia event detection and ranking using audio content
Sharma et al. ASe: Acoustic Scene Embedding Using Deep Archetypal Analysis and GMM.
CN114786059B (en) Video generation method, video generation device, electronic device, and storage medium
Ruiz-Muñoz et al. Enhancing the dissimilarity-based classification of birdsong recordings
Yang et al. Sound event detection in real-life audio using joint spectral and temporal features
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
Noyum et al. Boosting the predictive accurary of singer identification using discrete wavelet transform for feature extraction
Paulino et al. A brazilian speech database
Jog et al. Indian language identification using cochleagram based texture descriptors and ANN classifier
Muñoz-Romero et al. Nonnegative OPLS for supervised design of filter banks: application to image and audio feature extraction
Kannapiran et al. Voice-based gender recognition model using FRT and light GBM
Spevak et al. Sound spotting–a frame-based approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant