CN117992169A - Plane design display method based on AIGC technology - Google Patents
- Publication number
- CN117992169A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to the technical field of AIGC, and in particular to a plane design display method based on AIGC technology, comprising the following steps: input object features, appearance features, and dynamic features; analyze the object features with an encoder and output the result with a decoder; analyze the appearance features and dynamic features with an encoder, fuse the two analysis results, and output the result with the decoder. Compared with the traditional display mode, the invention can generate content in real time according to the requirements and descriptions of operators rather than simply playing pre-made video, which means that the display device is no longer limited to content prepared in advance and can generate content automatically according to the operators' actual needs.
Description
Technical Field
The invention relates to the technical field of AIGC, in particular to a plane design display method based on AIGC technology.
Background
Artificial Intelligence Generated Content (AIGC) technology refers to the use of artificial intelligence algorithms and techniques to generate various forms of content, such as articles, music, images, and video. Artificial intelligence models are trained on large amounts of data and can then generate new content based on the input conditions and requirements.
The conventional display mode generally copies prefabricated video to the platform for display and relies heavily on manual operation, which mainly causes the following problems:
1. Because the traditional display mode mainly plays pre-made video, the display device cannot generate content based on real-time images or descriptions from operators; the degree of intelligence is low, and the personalized requirements of different users cannot be met;
2. The traditional display mode cannot adjust in real time and generate content according to actual conditions, so it cannot adapt to the display requirements of different scenes;
3. Under the traditional display mode, an operator must make videos in advance and input them into the display device, and control and adjustment within the device require complicated manual operation, which increases the difficulty and complexity of operation and reduces the convenience of the display platform.
Therefore, there is an urgent need for a plane design display method based on AIGC technology that solves the above problems.
Disclosure of Invention
The invention aims to provide a plane design display method based on AIGC technology, which has the advantage of intelligence and solves the problems described in the background art.
In order to achieve the above purpose, the present invention provides the following technical solution: a plane design display method based on AIGC technology, comprising the following steps:
S1: Input object features, appearance features, and dynamic features. An encoder analyzes the object features and a decoder outputs the result; another encoder analyzes the appearance features and the dynamic features, the two analysis results are fused, and the decoder outputs the fused result. The analysis results of the input object features, appearance features, and dynamic features are then fused, forming the key entities. On this basis, dedicated one-to-one encoders analyze the specific object features, specific appearance features, and specific dynamic features respectively; after the dynamic entity features are obtained, the model splices them with the original dynamic features. Once the key entities are generated, the extracted image features are semantically enhanced through feature fusion to obtain rich information, and the interaction relationship between the key entities and the fused features is incorporated; global visual information is then connected with the description at the language level, and the final description is generated;
S2: Input text information and generate voice output with a speech synthesis engine. Further, speech can be synthesized by concatenation: recorded speech segments prepared in advance are joined according to the input text, a computer algorithm converts the text into audible speech output, and the synthesized speech is output through voice cloning;
S3: Input speech and recognize it with a speech recognizer; a recurrent neural network generates continuous text by learning the context dependencies of the sequence data, taking the output of the previous time step as the input of the current time step until the text is generated;
S4: After the text, image, and speech are generated, one encoder and two decoders analyze them. The encoder encodes the features of the text, image, and speech into a set of mutually aligned features; the decoders learn the associations between static and dynamic information and the relationships within the context, and the text, image, and speech information is cross-matched and fused to finally obtain the visual representation;
S5: An acquisition module collects the text, speech, and image information, a processing module processes the collected information, and the text, speech, and images are divided into different regions after processing. When the text, speech, and images are generated, a decoder analyzes them and a generation module fuses the information to generate a video or image, which is then displayed on an AIGC platform.
In the invention, in step S2, the specific steps of converting the input text into speech are as follows:
S2.1.1: Preprocess the input text, including removing punctuation marks, tokenizing into words or characters, and handling numbers and specific abbreviations;
S2.1.2: Select a speech synthesis model suited to the task requirements; this step includes converting the text into a phoneme sequence (or using word-level phoneme sequences) and analyzing linguistic features such as syllables, phonemes, and tones;
S2.1.3: Train an acoustic model using a large amount of text and corresponding speech data, so that the model learns the correlation between text and audio as well as the acoustic features of the audio;
S2.1.4: Based on the input text and the trained model, generate a speech waveform through the acoustic model to produce the synthesized speech;
S2.1.5: Perform post-processing on the generated speech, such as adjusting pitch, volume, and speech rate, to improve its quality and naturalness.
In the invention, in step S2, the specific steps of synthesizing speech from the input text using speech segments are as follows:
S2.2.1: Build a speech database, also called a speech library or speech-unit library, in advance; it contains a large number of recorded utterances covering the various possible phonemes, syllables, words, and phrases;
S2.2.2: Convert the input text into the corresponding phoneme sequence, either with a text-to-phoneme conversion tool or through the internal processing of the speech synthesis system;
S2.2.3: Based on the phoneme sequence, select from the speech library an appropriate speech segment for each phoneme, where each segment generally corresponds to one phoneme or a group of phonemes;
S2.2.4: Splice the selected speech segments together in order to form continuous speech output, adjusting and smoothing the segments as needed so that the spliced speech is fluent and natural;
S2.2.5: Perform post-processing on the generated speech output, such as adjusting pitch, volume, and speech rate, to improve its quality and naturalness.
Preferably, in step S2, the voice cloning includes a voice feature extraction module, which extracts features from the synthesized speech; signal processing techniques are used to smooth the speech, which is then output through voice cloning.
In the present invention, in step S3, before the generated text is output, it is checked as follows:
S3.1: First, set a check on whether the generated result fits the theme: after the text is generated, judge whether it matches the display theme; if it does, proceed to the next check; otherwise, re-edit and regenerate the text;
S3.2: Then, check whether the sentences of the generated result are fluent; if so, output the text; otherwise, re-edit and regenerate it.
The beneficial effects of the technical scheme of the application are as follows: compared with the traditional display mode, the application can generate content in real time according to the requirements and descriptions of operators rather than simply playing pre-made video, which means that the display device is no longer limited to content prepared in advance and can generate content automatically according to the operators' actual needs.
The operator can instruct the display device to generate content that meets their requirements by inputting information in various forms, such as pictures and text. In addition, the operator can interact with the display device through voice instructions without complicated manual operation: the display device recognizes and understands the operator's voice commands, adjusts the display content according to the instructions, or provides related information. This natural and convenient interaction mode greatly improves the user experience and makes the display process more intelligent and personalized.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, are incorporated in and constitute a part of this specification, and together with the embodiments serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of the AIGC technical system of the present invention;
FIG. 2 is a functional block diagram of text generation in accordance with the present invention;
FIG. 3 is a functional block diagram of speech generation according to the present invention;
Fig. 4 is a functional block diagram of image generation of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in order to better convey the technical content of the invention. The concepts and embodiments described here may be implemented in any of a wide variety of ways, and the illustrative embodiments shown are not limiting. All other embodiments obtained by those skilled in the art based on the embodiments of the invention, without inventive effort, fall within the scope of the invention.
As shown in fig. 1 to 4: the embodiment provides a plane design display method based on AIGC technology, which comprises the following steps:
S1: Input object features, appearance features, and dynamic features. An encoder analyzes the object features and a decoder outputs the result; another encoder analyzes the appearance features and the dynamic features, the two analysis results are fused, and the decoder outputs the fused result. The analysis results of the input object features, appearance features, and dynamic features are then fused, forming the key entities. On this basis, dedicated one-to-one encoders analyze the specific object features, specific appearance features, and specific dynamic features respectively; after the dynamic entity features are obtained, the model splices them with the original dynamic features. Once the key entities are generated, the extracted image features are semantically enhanced through feature fusion to obtain rich information, and the interaction relationship between the key entities and the fused features is incorporated; global visual information is then connected with the description at the language level, and the final description is generated;
S2: Input text information and generate voice output with a speech synthesis engine. Further, speech can be synthesized by concatenation: recorded speech segments prepared in advance are joined according to the input text, a computer algorithm converts the text into audible speech output, and the synthesized speech is output through voice cloning;
S3: Input speech and recognize it with a speech recognizer; a recurrent neural network generates continuous text by learning the context dependencies of the sequence data, taking the output of the previous time step as the input of the current time step until the text is generated;
S4: After the text, image, and speech are generated, one encoder and two decoders analyze them. The encoder encodes the features of the text, image, and speech into a set of mutually aligned features; the decoders learn the associations between static and dynamic information and the relationships within the context, and the text, image, and speech information is cross-matched and fused to finally obtain the visual representation;
S5: An acquisition module collects the text, speech, and image information, a processing module processes the collected information, and the text, speech, and images are divided into different regions after processing. When the text, speech, and images are generated, a decoder analyzes them and a generation module fuses the information to generate a video or image, which is then displayed on an AIGC platform.
In the invention, in step S2, the specific steps of converting the input text into speech are as follows:
S2.1.1: Preprocess the input text, including removing punctuation marks, tokenizing into words or characters, and handling numbers and specific abbreviations;
S2.1.2: Select a speech synthesis model suited to the task requirements; this step includes converting the text into a phoneme sequence (or using word-level phoneme sequences) and analyzing linguistic features such as syllables, phonemes, and tones;
S2.1.3: Train an acoustic model using a large amount of text and corresponding speech data, so that the model learns the correlation between text and audio as well as the acoustic features of the audio;
S2.1.4: Based on the input text and the trained model, generate a speech waveform through the acoustic model to produce the synthesized speech;
S2.1.5: Perform post-processing on the generated speech, such as adjusting pitch, volume, and speech rate, to improve its quality and naturalness.
In the invention, in step S2, the specific steps of synthesizing speech from the input text using speech segments are as follows:
S2.2.1: Build a speech database, also called a speech library or speech-unit library, in advance; it contains a large number of recorded utterances covering the various possible phonemes, syllables, words, and phrases;
S2.2.2: Convert the input text into the corresponding phoneme sequence, either with a text-to-phoneme conversion tool or through the internal processing of the speech synthesis system;
S2.2.3: Based on the phoneme sequence, select from the speech library an appropriate speech segment for each phoneme, where each segment generally corresponds to one phoneme or a group of phonemes;
S2.2.4: Splice the selected speech segments together in order to form continuous speech output, adjusting and smoothing the segments as needed so that the spliced speech is fluent and natural;
S2.2.5: Perform post-processing on the generated speech output, such as adjusting pitch, volume, and speech rate, to improve its quality and naturalness.
Specifically, in step S2, the voice cloning includes a voice feature extraction module, which extracts features from the synthesized speech; signal processing techniques are used to smooth the speech, which is then output through voice cloning.
In the present invention, in step S3, before the generated text is output, it is checked as follows:
S3.1: First, set a check on whether the generated result fits the theme: after the text is generated, judge whether it matches the display theme; if it does, proceed to the next check; otherwise, re-edit and regenerate the text;
S3.2: Then, check whether the sentences of the generated result are fluent; if so, output the text; otherwise, re-edit and regenerate it.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
While the invention has been described with reference to preferred embodiments, it is not intended to be limiting. Those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the spirit and scope of the present invention. Accordingly, the scope of the invention is defined by the appended claims.
Claims (5)
1. A plane design display method based on AIGC technology, characterized in that the method comprises the following steps:
S1: Input object features, appearance features, and dynamic features. An encoder analyzes the object features and a decoder outputs the result; another encoder analyzes the appearance features and the dynamic features, the two analysis results are fused, and the decoder outputs the fused result. The analysis results of the input object features, appearance features, and dynamic features are then fused, forming the key entities. On this basis, dedicated one-to-one encoders analyze the specific object features, specific appearance features, and specific dynamic features respectively; after the dynamic entity features are obtained, the model splices them with the original dynamic features. Once the key entities are generated, the extracted image features are semantically enhanced through feature fusion to obtain rich information, and the interaction relationship between the key entities and the fused features is incorporated; global visual information is then connected with the description at the language level, and the final description is generated;
S2: Input text information and generate voice output with a speech synthesis engine. Further, speech can be synthesized by concatenation: recorded speech segments prepared in advance are joined according to the input text, a computer algorithm converts the text into audible speech output, and the synthesized speech is output through voice cloning;
S3: Input speech and recognize it with a speech recognizer; a recurrent neural network generates continuous text by learning the context dependencies of the sequence data, taking the output of the previous time step as the input of the current time step until the text is generated;
S4: After the text, image, and speech are generated, one encoder and two decoders analyze them. The encoder encodes the features of the text, image, and speech into a set of mutually aligned features; the decoders learn the associations between static and dynamic information and the relationships within the context, and the text, image, and speech information is cross-matched and fused to finally obtain the visual representation;
S5: An acquisition module collects the text, speech, and image information, a processing module processes the collected information, and the text, speech, and images are divided into different regions after processing. When the text, speech, and images are generated, a decoder analyzes them and a generation module fuses the information to generate a video or image, which is then displayed on an AIGC platform.
2. The plane design display method based on AIGC technology according to claim 1, wherein in step S2, the specific steps of converting the input text into speech are as follows:
S2.1.1: Preprocess the input text, including removing punctuation marks, tokenizing into words or characters, and handling numbers and specific abbreviations;
S2.1.2: Select a speech synthesis model suited to the task requirements; this step includes converting the text into a phoneme sequence (or using word-level phoneme sequences) and analyzing linguistic features such as syllables, phonemes, and tones;
S2.1.3: Train an acoustic model using a large amount of text and corresponding speech data, so that the model learns the correlation between text and audio as well as the acoustic features of the audio;
S2.1.4: Based on the input text and the trained model, generate a speech waveform through the acoustic model to produce the synthesized speech;
S2.1.5: Perform post-processing on the generated speech, such as adjusting pitch, volume, and speech rate, to improve its quality and naturalness.
3. The plane design display method based on AIGC technology according to claim 1, wherein in step S2, the specific steps of synthesizing speech from the input text using speech segments are as follows:
S2.2.1: Build a speech database, also called a speech library or speech-unit library, in advance; it contains a large number of recorded utterances covering the various possible phonemes, syllables, words, and phrases;
S2.2.2: Convert the input text into the corresponding phoneme sequence, either with a text-to-phoneme conversion tool or through the internal processing of the speech synthesis system;
S2.2.3: Based on the phoneme sequence, select from the speech library an appropriate speech segment for each phoneme, where each segment generally corresponds to one phoneme or a group of phonemes;
S2.2.4: Splice the selected speech segments together in order to form continuous speech output, adjusting and smoothing the segments as needed so that the spliced speech is fluent and natural;
S2.2.5: Perform post-processing on the generated speech output, such as adjusting pitch, volume, and speech rate, to improve its quality and naturalness.
4. The plane design display method based on AIGC technology according to claim 1, wherein in step S2, the voice cloning includes a voice feature extraction module, which extracts features from the synthesized speech; signal processing techniques are used to smooth the speech, which is then output through voice cloning.
5. The plane design display method based on AIGC technology according to claim 1, wherein in step S3, before the generated text is output, it is checked as follows:
S3.1: First, set a check on whether the generated result fits the theme: after the text is generated, judge whether it matches the display theme; if it does, proceed to the next check; otherwise, re-edit and regenerate the text;
S3.2: Then, check whether the sentences of the generated result are fluent; if so, output the text; otherwise, re-edit and regenerate it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410182259.6A CN117992169A (en) | 2024-02-19 | 2024-02-19 | Plane design display method based on AIGC technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117992169A | 2024-05-07 |
Family
ID=90888538
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |