CN117992169A - Plane design display method based on AIGC technology

Plane design display method based on AIGC technology

Info

Publication number
CN117992169A
CN117992169A
Authority
CN
China
Prior art keywords
voice
text
speech
features
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410182259.6A
Other languages
Chinese (zh)
Inventor
史明
周晶璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Banding Network Technology Co ltd
Original Assignee
Shanghai Banding Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Banding Network Technology Co., Ltd.
Priority to CN202410182259.6A
Publication of CN117992169A
Legal status: Pending (current)

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of AIGC, and in particular to a plane design display method based on AIGC technology, comprising the following steps: input object features, appearance features, and dynamic features; analyze the object features with an encoder and output the result with a decoder; analyze the appearance features and dynamic features with an encoder, fuse the two analysis results, and output them with the decoder. Compared with the traditional display mode, the invention can generate content in real time according to the requirements and descriptions of operators rather than simply playing pre-made videos, which means the display device is no longer limited to content prepared in advance and can generate content automatically according to the actual needs of operators.

Description

Plane design display method based on AIGC technology
Technical Field
The invention relates to the technical field of AIGC, in particular to a plane design display method based on AIGC technology.
Background
Artificial Intelligence Generated Content (AIGC) technology refers to the generation of various forms of content, such as articles, music, images, and video, using artificial intelligence algorithms: models are trained on large amounts of data and can then generate new content according to the input conditions and requirements.
The conventional display mode generally copies prefabricated videos to the platform for display and relies heavily on manual operation. Its main problems are as follows:
1. Because the traditional display mode mainly plays pre-made videos, the display device cannot generate content from real-time images or operator descriptions; the degree of intelligence is low and the personalized requirements of different users cannot be met;
2. The traditional display mode cannot be adjusted in real time or generate content according to actual conditions, so it cannot adapt to the display requirements of different scenes;
3. In the traditional display mode, an operator must make videos in advance and load them into the display device, and control and adjustment within the device require complicated manual operation, which increases the difficulty and complexity of operation and reduces the convenience of the display platform.
Therefore, there is an urgent need for a plane design display method based on AIGC technology that solves the above problems.
Disclosure of Invention
The invention aims to provide a plane design display method based on AIGC technology that has the advantage of intelligence and solves the problems described in the background section.
In order to achieve the above purpose, the present invention provides the following technical solution: a plane design display method based on AIGC technology, comprising the following steps:
S1: Input object features, appearance features, and dynamic features. Analyze the object features with an encoder and output the result with a decoder; analyze the appearance features and dynamic features with an encoder, fuse the two analysis results, and output them with the decoder. The analysis results of the input object features, appearance features, and dynamic features are then fused to form key entities. On this basis, dedicated encoders analyze the specific object features, specific appearance features, and specific dynamic features respectively; after the dynamic entity features are obtained, the model splices them with the original dynamic features. After the key entities are generated, the extracted image features are semantically enhanced through feature fusion to obtain richer information, the interaction between the key entities and the fused features is merged, global visual information is connected with the description at the language level, and the description is finally generated;
S2: Input text information and generate voice output with a speech synthesis engine. Alternatively, speech can be produced by concatenative synthesis: recorded voice segments are prepared in advance and connected according to the input text, so that the text is converted into audible speech by a computer algorithm, and the synthesized speech is output through voice cloning;
S3: Input voice and recognize it with a speech recognizer; a recurrent neural network then generates continuous text by learning the context dependencies of the sequence data, taking the output of the previous time step as the input of the current time step (a minimal sketch of this generation loop is given after this list);
S4: After the text, image, and voice are generated, one encoder and two decoders analyze them. The encoder encodes the features of the text, image, and voice into a set of mutually aligned features; the decoders respectively learn the associations between static and dynamic information and the relationships among context information, and cross-match and fuse the text, image, and voice information to finally obtain the visual representation;
S5: An acquisition module collects the text, voice, and image information; a processing module processes the collected information and divides the text, voice, and images into different regions; when the text, voice, and images are generated, a decoder analyzes them and a generation module fuses the information to generate a video or image, which is then displayed on the AIGC platform.
In the invention, in step S2, the specific steps of converting the input text into speech are as follows:
S2.1.1: Preprocess the input text, including removing punctuation marks, tokenizing into words or characters, and handling numbers and specific abbreviations (a minimal preprocessing sketch is given after this list);
S2.1.2: Select a speech synthesis model suited to the task requirements; this step includes converting the text into a phoneme sequence, or using word-level phoneme sequences, and analyzing language characteristics such as syllables, phonemes, and tones;
S2.1.3: Train an acoustic model using a large amount of text and corresponding speech data; the model learns the correlation between text and audio as well as the acoustic features of the audio;
S2.1.4: Based on the input text and the trained model, generate a speech waveform through the acoustic model to produce the synthetic speech;
S2.1.5: Perform post-processing on the generated speech to improve its quality and naturalness, including adjusting pitch, volume, and speech rate.
In the invention, in step S2, the specific steps of synthesizing speech from the input text and the recorded voice segments are as follows:
S2.2.1: Build a speech database in advance, also called a speech library or speech unit library, containing a large number of recorded speech segments that cover the various possible phonemes, syllables, words, and phrases;
S2.2.2: Convert the input text into the corresponding phoneme sequence, which can be done with a text-to-phoneme tool or by the internal processing of the speech synthesis system;
S2.2.3: Based on the phoneme sequence, select from the speech library an appropriate speech segment for each phoneme, each segment generally corresponding to one phoneme or a group of phonemes;
S2.2.4: Splice the selected speech segments together in order to form continuous speech output, including appropriate adjustment and smoothing of the segments so that the spliced speech is smooth and natural (a splicing sketch is given after this list);
S2.2.5: Perform post-processing on the generated speech output, such as adjusting pitch, volume, and speech rate, to improve its quality and naturalness.
Preferably, in step S3, the voice clone includes a speech feature extraction module; this module extracts the synthesized speech and smooths it using signal processing techniques, and the result is output through the voice clone.
In the present invention, in step S3, the generated text is checked before it is output, as follows:
S3.1: First, a suitability check is set: after the text is generated, judge whether the result matches the display theme; if it does not, re-edit and regenerate the text;
S3.2: Then, check whether the sentences of the generated result are fluent; if so, output them, otherwise re-edit and regenerate the text (a minimal sketch of this two-step check is given after this list).
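A minimal sketch of this two-step check is given below, in Python. The theme check (keyword overlap) and the fluency check (sentence-length heuristic) are assumptions, since the patent does not specify the actual criteria; generate_text stands in for the recurrent generator of step S3.

def matches_theme(text: str, theme_keywords: set[str]) -> bool:
    # S3.1: does the generated result match the display theme?
    return bool(theme_keywords & set(text.lower().split()))

def is_fluent(text: str, max_sentence_words: int = 40) -> bool:
    # S3.2: crude fluency proxy that rejects overly long run-on sentences.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return all(len(s.split()) <= max_sentence_words for s in sentences)

def validated_output(generate_text, theme_keywords: set[str], max_attempts: int = 5):
    for _ in range(max_attempts):
        text = generate_text()
        if matches_theme(text, theme_keywords) and is_fluent(text):
            return text   # both checks pass: output the text
        # otherwise: re-edit and regenerate
    return None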
The beneficial effects of the technical solution of the application are as follows: compared with the traditional display mode, the application can generate content in real time according to the requirements and descriptions of operators rather than simply playing pre-made videos, which means the display device is no longer limited to content prepared in advance and can generate content automatically according to the actual needs of operators;
The operator can instruct the display device to generate content meeting his or her requirements by inputting information in various forms, such as pictures and text. In addition, the operator can interact with the display device by voice command without complicated manual operation: the device can recognize and understand the operator's voice commands, adjust the displayed content according to the instructions, or provide related information. This natural and convenient mode of interaction greatly improves the user experience and makes the display process more intelligent and personalized.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of the AIGC technical system of the present invention;
FIG. 2 is a functional block diagram of text generation in accordance with the present invention;
FIG. 3 is a functional block diagram of speech generation according to the present invention;
Fig. 4 is a functional block diagram of image generation of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. To better understand the technical content of the invention, specific embodiments are described with reference to the drawings, in which a number of illustrative embodiments are shown. It should be appreciated that the concepts and embodiments described here may be implemented in any of a wide variety of ways. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in FIGS. 1 to 4, this embodiment provides a plane design display method based on AIGC technology, comprising the following steps:
S1: Input object features, appearance features, and dynamic features. Analyze the object features with an encoder and output the result with a decoder; analyze the appearance features and dynamic features with an encoder, fuse the two analysis results, and output them with the decoder. The analysis results of the input object features, appearance features, and dynamic features are then fused to form key entities. On this basis, dedicated encoders analyze the specific object features, specific appearance features, and specific dynamic features respectively; after the dynamic entity features are obtained, the model splices them with the original dynamic features. After the key entities are generated, the extracted image features are semantically enhanced through feature fusion to obtain richer information, the interaction between the key entities and the fused features is merged, global visual information is connected with the description at the language level, and the description is finally generated;
S2: Input text information and generate voice output with a speech synthesis engine. Alternatively, speech can be produced by concatenative synthesis: recorded voice segments are prepared in advance and connected according to the input text, so that the text is converted into audible speech by a computer algorithm, and the synthesized speech is output through voice cloning;
S3: Input voice and recognize it with a speech recognizer; a recurrent neural network then generates continuous text by learning the context dependencies of the sequence data, taking the output of the previous time step as the input of the current time step;
S4: After the text, image, and voice are generated, one encoder and two decoders analyze them. The encoder encodes the features of the text, image, and voice into a set of mutually aligned features; the decoders respectively learn the associations between static and dynamic information and the relationships among context information, and cross-match and fuse the text, image, and voice information to finally obtain the visual representation (a minimal architecture sketch is given after this list);
S5: An acquisition module collects the text, voice, and image information; a processing module processes the collected information and divides the text, voice, and images into different regions; when the text, voice, and images are generated, a decoder analyzes them and a generation module fuses the information to generate a video or image, which is then displayed on the AIGC platform.
In the invention, in step S2, the specific steps of converting the input text into speech are as follows:
S2.1.1: Preprocess the input text, including removing punctuation marks, tokenizing into words or characters, and handling numbers and specific abbreviations;
S2.1.2: Select a speech synthesis model suited to the task requirements; this step includes converting the text into a phoneme sequence, or using word-level phoneme sequences, and analyzing language characteristics such as syllables, phonemes, and tones;
S2.1.3: Train an acoustic model using a large amount of text and corresponding speech data; the model learns the correlation between text and audio as well as the acoustic features of the audio;
S2.1.4: Based on the input text and the trained model, generate a speech waveform through the acoustic model to produce the synthetic speech;
S2.1.5: Perform post-processing on the generated speech to improve its quality and naturalness, including adjusting pitch, volume, and speech rate.
In the invention, in step S2, the specific steps of synthesizing speech from the input text and the recorded voice segments are as follows:
S2.2.1: Build a speech database in advance, also called a speech library or speech unit library, containing a large number of recorded speech segments that cover the various possible phonemes, syllables, words, and phrases;
S2.2.2: Convert the input text into the corresponding phoneme sequence, which can be done with a text-to-phoneme tool or by the internal processing of the speech synthesis system;
S2.2.3: Based on the phoneme sequence, select from the speech library an appropriate speech segment for each phoneme, each segment generally corresponding to one phoneme or a group of phonemes;
S2.2.4: Splice the selected speech segments together in order to form continuous speech output, including appropriate adjustment and smoothing of the segments so that the spliced speech is smooth and natural;
S2.2.5: Perform post-processing on the generated speech output, such as adjusting pitch, volume, and speech rate, to improve its quality and naturalness.
Specifically, in step S3, the voice clone includes a speech feature extraction module; this module extracts the synthesized speech and smooths it using signal processing techniques, and the result is output through the voice clone (a minimal smoothing sketch is given below).
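A minimal sketch of the smoothing step is given below, in Python with NumPy and SciPy. The zero-phase low-pass Butterworth filter is one possible choice of signal processing technique and is an assumption; the patent does not name the actual filter used.

import numpy as np
from scipy.signal import butter, filtfilt

def smooth_waveform(waveform: np.ndarray, sample_rate: int = 16_000, cutoff_hz: float = 6_000.0) -> np.ndarray:
    # Zero-phase low-pass filtering to attenuate splicing artifacts above the cutoff frequency.
    b, a = butter(N=4, Wn=cutoff_hz / (sample_rate / 2), btype="low")
    return filtfilt(b, a, waveform)

noisy = np.sin(2 * np.pi * 440 * np.arange(0, 1, 1 / 16_000)) + 0.1 * np.random.randn(16_000)
clean = smooth_waveform(noisy)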
In the present invention, in step S3, the generated text is checked before it is output, as follows:
S3.1: First, a suitability check is set: after the text is generated, judge whether the result matches the display theme; if it does not, re-edit and regenerate the text;
S3.2: Then, check whether the sentences of the generated result are fluent; if so, output them, otherwise re-edit and regenerate the text.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
While the invention has been described with reference to preferred embodiments, it is not intended to be limiting. Those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the spirit and scope of the present invention. Accordingly, the scope of the invention is defined by the appended claims.

Claims (5)

1. A plane design display method based on AIGC technology, characterized in that the method comprises the following steps:
S1: Input object features, appearance features, and dynamic features. Analyze the object features with an encoder and output the result with a decoder; analyze the appearance features and dynamic features with an encoder, fuse the two analysis results, and output them with the decoder. The analysis results of the input object features, appearance features, and dynamic features are then fused to form key entities. On this basis, dedicated encoders analyze the specific object features, specific appearance features, and specific dynamic features respectively; after the dynamic entity features are obtained, the model splices them with the original dynamic features. After the key entities are generated, the extracted image features are semantically enhanced through feature fusion to obtain richer information, the interaction between the key entities and the fused features is merged, global visual information is connected with the description at the language level, and the description is finally generated;
S2: Input text information and generate voice output with a speech synthesis engine. Alternatively, speech can be produced by concatenative synthesis: recorded voice segments are prepared in advance and connected according to the input text, so that the text is converted into audible speech by a computer algorithm, and the synthesized speech is output through voice cloning;
S3: Input voice and recognize it with a speech recognizer; a recurrent neural network then generates continuous text by learning the context dependencies of the sequence data, taking the output of the previous time step as the input of the current time step;
S4: After the text, image, and voice are generated, one encoder and two decoders analyze them. The encoder encodes the features of the text, image, and voice into a set of mutually aligned features; the decoders respectively learn the associations between static and dynamic information and the relationships among context information, and cross-match and fuse the text, image, and voice information to finally obtain the visual representation;
S5: An acquisition module collects the text, voice, and image information; a processing module processes the collected information and divides the text, voice, and images into different regions; when the text, voice, and images are generated, a decoder analyzes them and a generation module fuses the information to generate a video or image, which is then displayed on the AIGC platform.
2. The plane design display method based on AIGC technology according to claim 1, wherein in step S2 the specific steps of converting the input text into speech are as follows:
S2.1.1: Preprocess the input text, including removing punctuation marks, tokenizing into words or characters, and handling numbers and specific abbreviations;
S2.1.2: Select a speech synthesis model suited to the task requirements; this step includes converting the text into a phoneme sequence, or using word-level phoneme sequences, and analyzing language characteristics such as syllables, phonemes, and tones;
S2.1.3: Train an acoustic model using a large amount of text and corresponding speech data; the model learns the correlation between text and audio as well as the acoustic features of the audio;
S2.1.4: Based on the input text and the trained model, generate a speech waveform through the acoustic model to produce the synthetic speech;
S2.1.5: Perform post-processing on the generated speech to improve its quality and naturalness, including adjusting pitch, volume, and speech rate.
3. The plane design display method based on AIGC technology according to claim 1, wherein in step S2 the specific steps of synthesizing speech from the input text and the recorded voice segments are as follows:
S2.2.1: Build a speech database in advance, also called a speech library or speech unit library, containing a large number of recorded speech segments that cover the various possible phonemes, syllables, words, and phrases;
S2.2.2: Convert the input text into the corresponding phoneme sequence, which can be done with a text-to-phoneme tool or by the internal processing of the speech synthesis system;
S2.2.3: Based on the phoneme sequence, select from the speech library an appropriate speech segment for each phoneme, each segment generally corresponding to one phoneme or a group of phonemes;
S2.2.4: Splice the selected speech segments together in order to form continuous speech output, including appropriate adjustment and smoothing of the segments so that the spliced speech is smooth and natural;
S2.2.5: Perform post-processing on the generated speech output, such as adjusting pitch, volume, and speech rate, to improve its quality and naturalness.
4. The plane design display method based on AIGC technology according to claim 1, wherein in step S2 the voice clone includes a speech feature extraction module; the module extracts the synthesized speech and smooths it using signal processing techniques, and the result is output through the voice clone.
5. The plane design display method based on AIGC technology according to claim 1, wherein in step S3 the generated text is checked before it is output, as follows:
S3.1: First, a suitability check is set: after the text is generated, judge whether the result matches the display theme; if it does not, re-edit and regenerate the text;
S3.2: Then, check whether the sentences of the generated result are fluent; if so, output them, otherwise re-edit and regenerate the text.
Application CN202410182259.6A, filed 2024-02-19 (priority date 2024-02-19): Plane design display method based on AIGC technology, status Pending, published as CN117992169A (en).

Priority Applications (1)

Application Number: CN202410182259.6A | Priority Date: 2024-02-19 | Filing Date: 2024-02-19 | Title: Plane design display method based on AIGC technology

Applications Claiming Priority (1)

Application Number: CN202410182259.6A | Priority Date: 2024-02-19 | Filing Date: 2024-02-19 | Title: Plane design display method based on AIGC technology

Publications (1)

Publication Number: CN117992169A | Publication Date: 2024-05-07

Family

ID=90888538

Family Applications (1)

Application Number: CN202410182259.6A | Title: Plane design display method based on AIGC technology | Priority Date: 2024-02-19 | Filing Date: 2024-02-19 | Status: Pending

Country Status (1)

Country: CN | Publication: CN117992169A (en)


Legal Events

PB01: Publication