CN116389849A - Video generation method, device, equipment and storage medium - Google Patents

Video generation method, device, equipment and storage medium Download PDF

Info

Publication number
CN116389849A
Authority
CN
China
Prior art keywords
scene
description
sub
description information
information
Prior art date
Legal status
Pending
Application number
CN202310079103.0A
Other languages
Chinese (zh)
Inventor
陈曦
田浩
宋愷晟
张皛珏
Current Assignee
Baidu com Times Technology Beijing Co Ltd
Baidu USA LLC
Original Assignee
Baidu com Times Technology Beijing Co Ltd
Baidu USA LLC
Priority date
Filing date
Publication date
Application filed by Baidu com Times Technology Beijing Co Ltd and Baidu USA LLC
Priority to CN202310079103.0A
Publication of CN116389849A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/44012 Processing of video elementary streams involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The disclosure provides a video generation method, apparatus, device and storage medium, relating to the field of computer technology, in particular to the field of artificial intelligence, and specifically to the technical fields of deep learning, natural language processing, computer vision and the like. The specific implementation scheme is as follows: acquiring description information; determining a target 3D scene matching the description information; determining the sub-scene matching each description fragment in the description information; determining the camera movement mode of each sub-scene based on the semantic analysis result of that description fragment; determining the shot switching mode between sub-scenes based on the semantic analysis results of adjacent description fragments; generating a 3D video description file based on the ordering of the sub-scenes, the camera movement mode of each sub-scene, the shot switching modes between the sub-scenes and the description information; and processing the 3D video description file based on a 3D rendering engine to generate a video corresponding to the description information. Based on an understanding of the description information, the embodiments of the disclosure can automatically match its content to generate high-quality video.

Description

Video generation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to the field of artificial intelligence technology, and specifically, to the technical fields of deep learning, natural language processing, computer vision, and the like.
Background
With the continued proliferation of mobile terminals such as mobile phones and tablets, users' interest in watching video has grown. Manually recorded video, however, requires an investment of manpower, material resources and time.
Besides manual recording, video can also be obtained by splicing existing material such as video clips and pictures. However, the quality of video generated in this way is not controllable and the approach is not flexible enough.
Disclosure of Invention
The present disclosure provides a video generation method, apparatus, device and storage medium.
According to an aspect of the present disclosure, there is provided a video generating method including:
acquiring description information for describing video content;
determining a target 3D (three-dimensional) scene matching the description information;
determining a sub-scene matched with each description fragment in the description information from a plurality of sub-scenes included in the target 3D scene;
determining a camera movement mode for each sub-scene based on the semantic analysis result of that description fragment; and,
determining a shot switching mode between the sub-scenes of adjacent description fragments based on the semantic analysis results of the adjacent description fragments;
generating a 3D video description file based on the ordering of the sub-scenes in the description information, the camera movement mode of each sub-scene, the shot switching modes between the sub-scenes, and the description information;
and processing the 3D video description file based on a 3D rendering engine to generate a video corresponding to the description information.
According to another aspect of the present disclosure, there is provided a video generating apparatus including:
the acquisition module is used for acquiring description information for describing the video content;
the first matching module is used for determining a target 3D scene matched with the description information;
the second matching module is used for determining sub-scenes matched with each description fragment in the description information from a plurality of sub-scenes included in the target 3D scene;
the camera movement determining module is used for determining the camera movement mode of each sub-scene based on the semantic analysis result of each description fragment;
the shot switching determining module is used for determining a shot switching mode between sub-scenes of the adjacent description fragments based on semantic analysis results of the adjacent description fragments;
the file generation module is used for generating a 3D video description file based on the ordering of the sub-scenes in the description information, the camera movement mode of each sub-scene, the shot switching modes between the sub-scenes and the description information;
the video generation module is used for processing the 3D video description file based on a 3D rendering engine and generating a video corresponding to the description information.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
In the embodiments of the disclosure, the 3D scene, sub-scenes, camera movement modes and shot switching modes required to record the video are determined by analyzing the description information, so that a 3D video description file can be generated and a 3D rendering engine can produce the video corresponding to the description information. The disclosed embodiments can therefore automatically match the content of the description information, based on an understanding of it, to generate high-quality video, without being limited by the content of existing image material.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 (a) is a schematic diagram of an application scenario of a video generation method according to an embodiment of the present disclosure;
FIG. 1 (b) is a flow diagram of a video generation method according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of determining a sub-scenario in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic illustration of the locations of anchor points and current analysis points in an embodiment according to the present disclosure;
FIG. 4 (a) is a schematic diagram of a video generation method in accordance with another embodiment of the present disclosure;
FIG. 4 (b) is a schematic diagram of a video generation method in accordance with another embodiment of the present disclosure;
FIG. 4 (c) is an overall flow diagram of a video generation method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of scene configuration of a sub-scene based on scene elements in an embodiment in accordance with the disclosure;
FIG. 6 is a schematic diagram of a generation process of a video in accordance with another embodiment of the present disclosure;
Fig. 7 is a schematic structural view of a video generating apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing a video generation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, besides generating video by manual shooting, video can also be generated by splicing existing material. For example, video clips and pictures are spliced together to obtain a video. However, the quality of video generated in this way is not controllable and the approach is not flexible enough.
In view of this, an embodiment of the disclosure provides a video generating method, and fig. 1 (a) is a schematic view of a scene to which the method is applied. Fig. 1 (a) includes a server 11 and a terminal device 12.
The terminal device 12 and the server 11 are connected through a wireless or wired network, and the terminal device 12 includes, but is not limited to, electronic devices such as desktop computers, mobile phones, mobile computers, tablet computers, media players, intelligent wearable devices and smart televisions. The server 11 may be a single server, a server cluster formed by a plurality of servers, or a cloud computing center. Specifically, the server 11 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
In the embodiment of the disclosure, the user may upload the text to the server 11 based on the terminal device 12, generate a corresponding video based on the text by the server 11, and send the video to the terminal device 12 through a network so as to be convenient for the user to watch.
As shown in fig. 1 (b), a flowchart of a video generating method according to an embodiment of the present disclosure includes:
s101, description information for describing video content is acquired.
The description information may be text, a document file, audio, a Uniform Resource Locator (URL), and the like. It is understood that a URL is a character string used to obtain information resources on a network and is mainly used in various terminal applications and server programs. A URL can be used to obtain various information resources, including files, videos, pictures, directories and the like, in a uniform format. Description information in text form may be obtained from published works such as news, novels, blogs and papers, or may be authored by the user, which is not limited by the embodiments of the present disclosure.
The description information may also be in audio form. Description information in audio form can be obtained from published audio works such as films, TV dramas and radio dramas, or can be recorded by the user; for example, a piece of speech can be recorded and used to generate the video. The way the audio is obtained is not limited in the embodiments of the present disclosure.
The descriptive information may also be a composite resource that includes both text and image material. For example, the travel log may include a text description of travel conditions and scenery, and may include pictures and video clips. The multi-modal content can be understood based on the composite resource, thereby generating a video.
S102, determining a target 3D scene matched with the description information.
In implementation, a 3D scene resource library may be provided, and a target 3D scene matched with the description information may be obtained from the 3D scene resource library.
S103, determining the sub-scene matched with each description fragment in the description information from a plurality of sub-scenes included in the target 3D scene.
Wherein the target 3D scene may be a larger scene. The target 3D scene may be, for example, a city, and the 3D scene of the city may include a sub-scene of a theater, a concert hall, a school, a residential area, a street, a movie theater, etc.
S104, determining the camera movement mode of each sub-scene based on the semantic analysis result of each description fragment.
The camera movement mode can be determined from both the content of the description fragment and the way it is described.
A shot control library can be constructed in advance, containing a plurality of camera movement modes. For example, when the description fragment is lyrical, the camera may slowly push in on the sub-scene; when the description fragment tells a startling story, the camera position in the sub-scene may rise or fall abruptly, and so on. Note that the camera movement modes in the shot control library are not limited to those illustrated in the embodiments of the present disclosure.
In implementation, training samples of description information can be obtained, and the standard camera movement mode corresponding to each training sample is taken as its training label. After a training sample is fed into the first neural network model to be trained for semantic understanding, that model outputs a camera movement mode to be compared. The output camera movement mode to be compared is compared with the training label to determine a loss value, and the model parameters of the first neural network model to be trained are then adjusted based on the loss value until training ends when the model satisfies the training convergence condition, yielding a first neural network model capable of determining the camera movement mode of a sub-scene. The convergence condition may be that the loss value is smaller than a preset threshold, or that a preset number of iterations is reached.
In other embodiments, a sub-scene is not limited to a single camera movement mode; the camera movement mode of each sub-segment can be determined according to the semantic analysis results of different sub-segments in the description fragment. For example, the description fragment may be analyzed to obtain the objects it describes, and different sub-segments may be partitioned for different described objects. For example, content that continuously describes the same person forms one sub-segment and content that continuously describes the same cat forms another, so that the sub-segments of different described objects can undergo semantic analysis separately to obtain adapted camera movement modes.
The camera movement mode may include at least one of: slow push, slow pull, slow pan, slow track, slow lift, fast push, fast pull, fast pan, fast track, fast lift, combined movement and the like; in implementation, the camera movement mode can be determined according to actual requirements.
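As an illustrative, non-limiting sketch of the training procedure described above, the Python fragment below trains a small classifier that maps a description-fragment embedding to one of several camera movement classes. The class names, network shape and embedding dimension are placeholder assumptions rather than part of the disclosed implementation.

    import torch
    from torch import nn

    # Hypothetical camera movement classes drawn from a shot control library.
    MOVES = ["slow_push", "slow_pull", "slow_pan", "fast_push", "fast_lift"]

    class MoveClassifier(nn.Module):
        def __init__(self, emb_dim=256, num_classes=len(MOVES)):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, num_classes))

        def forward(self, x):
            return self.net(x)

    def train_first_model(model, samples, labels, epochs=10, lr=1e-3, threshold=0.05):
        # samples: (N, emb_dim) embeddings of description fragments;
        # labels: (N,) indices into MOVES, i.e. the standard camera movement modes.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = loss_fn(model(samples), labels)  # compare output with the training label
            loss.backward()
            optimizer.step()
            if loss.item() < threshold:             # simple convergence condition
                break
        return model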
S105, determining a shot switching mode between sub-scenes of the adjacent description fragments based on semantic analysis results of the adjacent description fragments.
The shot control library can also include shot switching modes between different sub-scenes. For example, when the semantic analysis result of adjacent description fragments indicates that the scene moves from one place to another, the switching between the sub-scenes may be a smooth transition; when the semantic analysis result of the adjacent description fragments indicates a recollection, the switching between the sub-scenes may be a blocked-shot transition. Note that the shot switching modes in the shot control library are not limited to those proposed in the embodiments of the present disclosure.
Similar to determining the camera movement mode of a sub-scene, in the embodiments of the present disclosure a second neural network model to be trained can be trained to obtain a second neural network model capable of giving the shot switching mode between sub-scenes.
In implementation, training samples of adjacent description fragments can be obtained, and the standard shot switching mode between the sub-scenes corresponding to the adjacent description fragments is taken as the training label. After a training sample is fed into the second neural network model to be trained for semantic understanding, that model outputs a shot switching mode to be compared. The output shot switching mode to be compared is compared with the training label to determine a loss value, and the model parameters of the second neural network model to be trained are then adjusted based on the loss value until training ends when the model satisfies the training convergence condition, yielding a second neural network model capable of determining the shot switching mode between sub-scenes. The convergence condition may be that the loss value is smaller than a preset threshold, or that a preset number of iterations is reached.
The shot switching mode needs to take into account the relation between different pictures and the switching technique. Shot switching can be divided into two forms: transitions with effects and transitions without effects. Transitions with effects include, for example: fade in/fade out, dissolve, freeze frame, wipe, page turn, split screen, virtual-real interchange, throw-in/throw-out, computer-generated effects, and the like. Transitions without effects use the semantic relation between the preceding and following shots in terms of content to shift time and space and connect the scenes, so that the shot switch is natural and smooth without any trace of added technique. The shot switching mode can be determined according to actual requirements in implementation.
S106, generating a 3D video description file based on the ordering of the sub-scenes in the description information, the camera movement mode of each sub-scene, the shot switching modes between the sub-scenes, and the description information.
The 3D video description file is a three-dimensional scene description file, and the 3D rendering engine may generate video based on the file.
S107, processing the 3D video description file based on the 3D rendering engine, and generating a video corresponding to the description information.
The 3D rendering engine may be, for example, the Unreal Engine; any 3D game engine capable of generating video based on a three-dimensional scene description file is suitable for use with the embodiments of the present disclosure.
In the embodiments of the disclosure, after the target 3D scene of the description information is determined, the adapted sub-scene can be determined for each description fragment, and the camera movement mode of each sub-scene and the shot switching mode between adjacent sub-scenes can be determined based on an understanding of the description fragments. On this basis, the scene matching the description information and the manner of recording the shots can be matched automatically, based on an understanding of the description information, to generate the video, and the generated video matches the description information more closely. Because a 3D video description file and a 3D rendering engine are used to generate the video, video meeting the quality requirement can be produced and the video quality is not limited by original material, so the quality of the generated video is controllable. In addition, even if similar description information uses the same target scene, the way the video is generated is more flexible and the generated video content will differ, because the camera movement modes and shot switching modes are determined from the semantics of the description information.
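As a rough data-flow sketch only, the following Python outline strings steps S101 to S107 together. Every helper here is a trivial stand-in (the sentence splitting rule, scene names and return values are invented for illustration) and would be replaced by the matching, semantic analysis and rendering components described above.

    def split_into_fragments(description):
        # Placeholder: split on the Chinese full stop; a real system segments by semantics.
        return [s for s in description.split("。") if s]

    def match_target_scene(description):          # S102
        return "city"

    def match_sub_scene(scene, fragment):         # S103
        return scene + "/sub_scene"

    def choose_camera_movement(fragment):         # S104
        return "slow_push"

    def choose_shot_transition(prev_frag, next_frag):   # S105
        return "smooth_switch"

    def build_sequence_file(description, sub_scenes, moves, transitions):  # S106
        return {"description": description, "sub_scenes": sub_scenes,
                "moves": moves, "transitions": transitions}

    def render_with_3d_engine(sequence_file):     # S107, handed to a 3D rendering engine
        return "output.mp4"

    def generate_video(description):
        fragments = split_into_fragments(description)                    # S101 input
        scene = match_target_scene(description)
        sub_scenes = [match_sub_scene(scene, f) for f in fragments]
        moves = [choose_camera_movement(f) for f in fragments]
        transitions = [choose_shot_transition(a, b)
                       for a, b in zip(fragments, fragments[1:])]
        sequence = build_sequence_file(description, sub_scenes, moves, transitions)
        return render_with_3d_engine(sequence)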
In some embodiments, when the total length of the original description information is greater than a length threshold, the original description information is compressed to obtain the description information. Thus, when the original description information is long and a shorter video needs to be generated, description information meeting the length requirement can be obtained through compression, laying a data foundation for subsequent video generation.
In the embodiments of the disclosure, compressing the original description information should preserve, as far as possible, the core content that the original description information is meant to express. The key content of the original description information can be extracted based on semantic understanding to generate the description information. For example, when the original description information is a long text, the long text may be rewritten based on a text rewriting model to obtain shorter text content.
In other embodiments, in the case where the original description information is text, a text summary of the original description information is extracted to obtain the description information.
For example, when the original description information is text, the length threshold is 500 words and the text of the original description information is 1000 words, summary extraction may be performed on it. Of course, the length threshold may also be a duration threshold: for example, when the original description information is text, the audio corresponding to the original description information is generated, and if the duration of the audio is greater than the duration threshold, the total length of the original description information is determined to be greater than the length threshold.
Whether the original description information is text, audio or a composite resource, the compression can be realized by extracting a summary from it. Taking text as an example, in implementation a summary can be extracted from the original description information based on a summarization model (Summarization model). The summarization model may also be called a text compression model (Compression model), which compresses a long text into a short text without losing the main information of the long text, so that the main content of the long text can be grasped by browsing the short text. Summarization models can be divided into two categories according to how the summary is generated: the extractive summarization model (Extractive Summarization model) and the abstractive summarization model (Abstractive Summarization model).
The extractive summarization model selects target sentences from the long text to obtain the short text. The target sentences need to satisfy two requirements: (1) informativeness, meaning that all the important information in the long text is contained and is logically consistent with it; (2) low redundancy, meaning that there is minimal redundancy, i.e. sentences containing similar information should not appear in the short text at the same time, and sentences containing unimportant information should not appear in the short text.
The extractive summarization task is generally treated as a sequence labeling task: an encoder in the summarization model performs binary classification for each clause of the long text and, when decoding each clause, decides whether the current clause should be selected as a target sentence according to the semantic information of the clause, the decoding state of the previous clause, and the global semantic information of the long text.
The abstractive summarization model automatically generates the short text based on the important information of the long text. Similar to the extractive summarization model, the generated summary still needs to meet the two requirements of informativeness and low redundancy. An abstractive summarization model typically uses a seq2seq (sequence to sequence) framework to adaptively select effective context information through an attention mechanism based on the semantic information of the long text, generating a coherent, informative summary word by word.
In the embodiments of the disclosure, when the original description information is text and longer than the length threshold, it can be compressed by summary extraction, so that, while the semantics of the description information remain unchanged, the length of the generated video can be adapted based on the compressed text and the generated video can express the core content of the original description information.
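A minimal sketch of the compression step is given below, assuming that a generic pretrained abstractive summarization model from the third-party Hugging Face transformers library stands in for the summarization model; the 500-character threshold and length limits are illustrative only.

    from transformers import pipeline

    LENGTH_THRESHOLD = 500  # illustrative threshold

    def compress_description(original_text):
        # Short descriptions are kept unchanged; only long ones are summarized.
        if len(original_text) <= LENGTH_THRESHOLD:
            return original_text
        summarizer = pipeline("summarization")  # generic abstractive summarization model
        result = summarizer(original_text, max_length=200, min_length=50, do_sample=False)
        return result[0]["summary_text"]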
In other embodiments, in the case where the original description information is audio, text corresponding to the audio may be obtained; and extracting a text abstract from the text corresponding to the audio to obtain descriptive information.
Under the condition that the description information is audio, the audio can be converted into text by using a voice recognition technology, and then the original description information is compressed based on the mode of extracting the text abstract, so that the compressed description information is obtained.
In the embodiments of the disclosure, because the original audio data may be too long, i.e. the text converted from it exceeds the length threshold, the description information may be compressed by summary extraction. Description information that expresses the key content of the original audio is thus obtained, so that a video of controllable length can be generated whose content still expresses the content of the original audio.
For the composite resource mentioned above, it can be determined whether the length of the text in the composite resource exceeds the length threshold, and in the case that the length threshold is exceeded, the text can be compressed in the manner of abstracting as described above, and the image elements in the composite resource can be used as scene materials required for generating video.
In some embodiments, the generated video requires audio corresponding to the text description information. When the description information is text, the audio of the description information may be generated, and the playing duration of the generated video is matched to the playing duration of the audio. That is, the playing duration of the video generated in the embodiments of the present disclosure is adapted to the duration of the audio of the text description information.
When the description information is text, the audio corresponding to the description information may be generated based on Text-To-Speech (TTS) technology so as to constrain the playing duration of the generated video.
In the embodiments of the disclosure, generating the audio corresponding to the description information allows the audio and the video to correspond to each other and provides audio data support for generating the video.
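The sketch below shows how the audio duration could constrain the planned video length. synthesize_speech is a hypothetical placeholder for any TTS engine that writes a WAV file; only the duration computation uses the standard library.

    import wave

    def synthesize_speech(text, wav_path):
        # Hypothetical placeholder: any TTS engine that renders `text` into a WAV file.
        raise NotImplementedError

    def audio_duration_seconds(wav_path):
        with wave.open(wav_path, "rb") as wav_file:
            return wav_file.getnframes() / float(wav_file.getframerate())

    # The generated video is then planned against this duration, for example:
    # total_frames = int(audio_duration_seconds("narration.wav") * frames_per_second)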
In some embodiments, the original description information has unlimited source and unlimited form, so that some noise may exist in the content of the original description information. For example, some advertisements may be interspersed with the original description information. To ensure the quality of the generated video, in implementations of the present disclosure, the advertising content in the original description information may be removed.
The advertising content may include advertising pictures, watermarks of information sources, advertising two-dimensional codes unrelated to descriptive information content, and the like.
By removing advertisements in the original description information, the description information is more accurate in describing the video, is not interfered by other factors, and further lays a data foundation for generating high-quality video.
In implementation, advertisement identification technology and watermark detection technology can be used to examine the description information; advertisement identification and watermark detection can be performed on the description information respectively, and in no particular order. Advertisement identification can be realized by keyword detection on the description information. For example, if keywords such as "product" or "commodity" are detected in a clause of the description information, semantic understanding is performed based on the context of the keywords to determine whether the clause is an advertisement. If the clause is an advertisement, it is deleted.
A watermark in the embodiments of the present disclosure may refer to a mark of provenance or copyright, e.g. the logo of a producer may be understood as a watermark that needs to be detected. After the position of the watermark region is determined in the description information, a rectangular target region containing the watermark is determined, whose vertices are related to the coordinates of the watermark; the watermark pixels within the rectangular target region are identified, and their pixel values are adjusted to the background color, so that the watermark can be effectively removed.
It will be appreciated that when the description information does not contain a watermark region, i.e. no watermark occludes the description information, the watermark-erasing process need not be performed.
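A hedged sketch of the watermark-erasing step follows: inside a given rectangular target region, pixels close to the watermark colour are repainted with an estimated background colour. OpenCV and NumPy are assumed to be available, and the region coordinates, watermark colour and tolerance are placeholders.

    import cv2
    import numpy as np

    def erase_watermark(image_path, rect, watermark_color, tolerance=40):
        # rect = (x, y, w, h): rectangular target region containing the watermark;
        # watermark_color = (B, G, R) colour of the watermark pixels.
        img = cv2.imread(image_path)
        x, y, w, h = rect
        roi = img[y:y + h, x:x + w]
        diff = np.abs(roi.astype(int) - np.array(watermark_color, dtype=int))
        mask = np.all(diff <= tolerance, axis=-1)            # watermark pixels in the region
        if (~mask).any():
            background = roi[~mask].mean(axis=0).astype(np.uint8)
        else:
            background = np.array([255, 255, 255], dtype=np.uint8)
        roi[mask] = background                               # repaint with the background colour
        return img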
For selection of a 3D scene, a 3D scene repository may be provided in embodiments of the present disclosure, where the 3D scene repository includes a plurality of candidate scenes. To facilitate screening out target 3D scenes from the 3D scene repository that are suitable for descriptive information, text labels may be added for each candidate scene in the 3D scene repository. On the basis of the obtained description information, determining the target 3D scene matched with the description information can be implemented as follows: determining the similarity between the description information and the text labels of the candidate scenes; and selecting the candidate scene with the highest similarity as a target 3D scene matched with the description information.
Taking text-form description information as an example, a natural language processing (Natural Language Processing, NLP) model can be used to process the description information to obtain the similarity between the description information and the text labels of the candidate scenes. When the similarity is greater than a preset threshold, the candidate scene corresponding to that text label can be used as the target 3D scene of the description information.
Under the condition that the descriptive information is audio, text information corresponding to the audio can be obtained, the descriptive information in a corresponding text form is obtained, and then a matched target 3D scene is obtained by adopting an NLP technology.
In the case that the description information is a composite resource, a 3D target scene of the description information can be obtained based on multi-modal content understanding. For example, text features conforming to texts in the resources and image features conforming to image materials in the resources can be extracted respectively, then feature fusion is carried out on the two features to obtain fusion features, and matching processing is carried out on the basis of the fusion features and text labels of candidate scenes to obtain corresponding 3D target scenes.
In the embodiment of the disclosure, the target 3D scene suitable for the descriptive information can be screened out by determining the similarity of the descriptive information and the text labels of the candidate scenes, and the screened target 3D scene semantically meets the requirement of the descriptive information, so that the target 3D scene suitable for recording the video can be accurately determined.
For example, the descriptive information is used to describe the content of financial interviews. The target 3D scene may be determined as a scene of a city street interview in order to adapt the description information of financial classes.
For another example, the description information is a literary story; when it describes an environment, it can be matched to the corresponding environment scene. For example, when it explains the structure of a historic building, the interior of the 3D model of that building can be matched as the scene for recording the video. Thus, by matching an appropriate scene, the target 3D scene and the description information are matched in content, improving the quality of the generated video.
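A sketch of selecting the target 3D scene by comparing the description information with the text labels of the candidate scenes is shown below; embed is a hypothetical sentence-embedding function and the similarity threshold of 0.6 is illustrative.

    import numpy as np

    def embed(text):
        # Hypothetical placeholder for any sentence-embedding model.
        raise NotImplementedError

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def match_target_scene(description, candidate_scenes, threshold=0.6):
        # candidate_scenes: {scene_name: text_label} taken from the 3D scene repository.
        d = embed(description)
        scores = {name: cosine(d, embed(label))
                  for name, label in candidate_scenes.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > threshold else None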
On the basis of acquiring the target 3D scene, since the target 3D scene includes a plurality of sub-scenes, and different contents are described in the description information based on different description fragments, the sub-scenes matching with the description fragments in the description information need to be determined, which can be implemented as shown in fig. 2:
s201, acquiring a plurality of anchor points of the descriptive information.
The video generated for the description information may be divided into a long video and a short video. For example, for a video with a video length of n minutes or less, it may be referred to as a short video, and for a video with a video length of more than n minutes, it may be referred to as a long video, n being a positive integer. In implementation, the value of n may be 2 or 5, and may be configured according to actual requirements, which is not limited in the embodiment of the present disclosure.
In the case where the generated video is a short video, the anchor point may be a sentence anchor point. For example, the description information is divided into a plurality of sentences with the start position of each sentence as the sentence anchor point. In practice, multiple sentences may be partitioned based on punctuation.
Considering that frequent scene switching may cause a degradation of the look and feel, in the embodiment of the present disclosure, the plurality of anchor points acquired in S201 are key anchor points regardless of long video or short video.
For example, in the case of obtaining the sentence anchor, according to a preset screening policy, the key anchor is screened out from the plurality of sentence anchors and is used as a plurality of anchors required in S201.
For example, for each sentence anchor, the sentences within m sentences of it may be regarded as its neighbor sentences. In implementation, the semantics of the neighbor sentences can be analyzed, and sentences whose semantic similarity is greater than a preset threshold are merged to obtain a merged sentence subset. The sentence anchor in the middle of the merged sentence subset may be taken as one of the anchors required in S201. Alternatively, the global semantics of the merged sentence subset can be determined as the reference semantics, the semantic similarity between each sentence in the merged sentence subset and the reference semantics is determined, and the mean of these semantic similarities is calculated as shown in expression (1). The sentence whose semantic similarity to the reference semantics equals the mean is then found in the merged sentence subset, and its sentence anchor is used as an anchor required in S201. Of course, the sentence with the highest semantic similarity to the reference semantics may instead be selected and its sentence anchor used as an anchor required in S201. In this way, sparse key anchors are obtained, which avoids wasting resources and avoids switching scenes too frequently.
O = (a_1 + a_2 + … + a_n) / n  (1)
In expression (1), O represents the mean of the semantic similarities between the sentences in the merged sentence subset and the reference semantics, a_i represents the semantic similarity between the i-th sentence in the merged sentence subset and the reference semantics, and n represents the number of sentences in the merged sentence subset.
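A sketch of the anchor-screening rule built on expression (1): the similarity of each sentence in the merged sentence subset to the reference semantics is computed, their mean is taken, and the sentence whose similarity is closest to that mean supplies the key anchor. similarity is a hypothetical semantic-similarity function passed in by the caller.

    def select_key_anchor(merged_sentences, reference_semantics, similarity):
        # similarity(a, b) -> float in [0, 1]; hypothetical semantic similarity measure.
        scores = [similarity(s, reference_semantics) for s in merged_sentences]
        mean_score = sum(scores) / len(scores)        # expression (1)
        # Anchor at the sentence whose similarity is closest to the mean
        # (alternatively, the sentence with the highest similarity could be chosen).
        return min(range(len(scores)), key=lambda i: abs(scores[i] - mean_score))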
S202, intercepting a first fragment of each anchor point from the description information.
In some embodiments, since there are multiple anchors in the description information, for each anchor, the content with a preset radius length is intercepted from the description information with the anchor as a center, so as to obtain a first segment of the anchor.
For example, the preset radius length may be measured by the number of sentences, e.g., the preset radius length is k sentences, and k is a positive integer. K sentences before the anchor point, k sentences after the anchor point and sentences where the anchor point is located can be obtained as the first fragment.
The preset radius length can also be described by using a time length, for example, the position of the anchor point is 5s, and the preset radius length is 3s, and then the content in 2-8 s is intercepted from the description information as the first segment of the anchor point.
It should be noted that the preset radius length may be set based on the actual situation, which is not limited by the embodiments of the present disclosure. When the content before or after the anchor is shorter than the preset radius length, all of that shorter content is acquired. For example, if the anchor is located at the 5th second, the preset radius length is 3 s, and only 2 s of content remain after the anchor, those 2 s of content are obtained.
In this embodiment, the anchor point is used as the center to determine the range of the first segment in combination with the context, and this operation mode is simple, no complex calculation is required, and the resource consumption of the computer can be reduced.
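Measured in sentences, the first segment around an anchor can be cut out with a simple slice, as in the sketch below; k is the preset radius length and truncation at the ends is automatic.

    def first_segment(sentences, anchor_index, k=2):
        # k sentences before the anchor, the anchor sentence itself, and k sentences after;
        # when fewer than k sentences remain on either side, whatever is left is taken.
        start = max(0, anchor_index - k)
        return sentences[start:anchor_index + k + 1]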
S203, carrying out semantic analysis on the first segment of each anchor point to obtain sub-scenes matched with each anchor point.
Since the target 3D scene comprises a plurality of sub-scenes, in order to facilitate screening out sub-scenes matching with anchor points from the target 3D scene, a text label may be added to each sub-scene of the target 3D scene.
In some embodiments, natural language processing (Natural Language Processing, NLP) techniques may be employed to determine the sub-scene corresponding to the text label that best matches the descriptive fragment.
For example, if the target 3D scene is a zoo panorama, the sub-scenes included in the target 3D scene may include a lion show sub-scene corresponding to a text tag lion, a tiger show sub-scene corresponding to a text tag tiger, a panda show scene corresponding to a text tag panda, and so on. The first segment corresponding to the anchor point A is 'in zoos, a fierce tiger is seen', the semantic label of the first segment is determined to be 'tiger', and then the tiger display sub-scene is determined to be a sub-scene matched with the anchor point A.
The first segment does not necessarily cover all the content belonging to the same sub-scene. In order to find accurate switching points between different sub-scenes, in S204 of the embodiments of the present disclosure, for the second segments between adjacent anchors in the description information, a natural language processing technique is used to analyze the semantic turning point in each second segment.
The semantic turning point is the place where the semantics change; different semantics correspond to different sub-scenes. Therefore, the semantic turning point can be used to segment the description information between two anchors and to realize the switching of sub-scenes.
In some embodiments, the content before the first anchor point in the description information may be rendered by using the sub-scene corresponding to the first anchor point, and the content after the last anchor point in the description information may be rendered by using the sub-scene corresponding to the last anchor point. And for content between adjacent anchor points, the switching position of the sub-scene needs to be clarified. For the second segments between adjacent anchor points in the description information, the semantic turning points in each second segment are analyzed by adopting a natural language processing technology, and the method can be implemented as follows:
and A1, acquiring a current analysis point in each second segment and contents with specified length before the current analysis point from the description information to obtain reference information of the current analysis point.
Wherein the current analysis point is any point in the second segment. It can be understood as the video playing time point, and also can be understood as any sentence position in the text.
The specified length is similar to the length of the preset radius from which the first segment is obtained, and may be a duration or a number of sentences, which is not limited in the embodiment of the disclosure.
And step A2, determining semantic similarity between the reference information and a first anchor point and a second anchor point in the second segment respectively, wherein the positions of the first anchor point, the second anchor point and the current analysis point are shown in figure 3, and the first anchor point is positioned before the second anchor point in the description information.
And A3, determining the current analysis point as a semantic turning point in the second segment under the condition that the semantic similarity between the current analysis point and the first anchor point is lower than the semantic similarity between the current analysis point and the second anchor point and the semantic similarity between the previous analysis point of the current analysis point and the first anchor point is higher than the semantic similarity between the current analysis point and the second anchor point.
In short, an analysis point belongs to the sub-scene of a given anchor when its semantic similarity to that anchor is higher than to the other anchor; when the sub-scene applicable to the current analysis point differs from that of the previous analysis point, the current analysis point is the semantic turning point.
Taking any second segment as an example, if the current analysis point is at 8 s and the specified length is 3 s, the content from 5 s to 8 s can be taken as the reference information of the current analysis point. Because the text content of the second segment is determined by two adjacent anchors, assume the first anchor is at 3 s and the second anchor at 10 s, and determine the semantic similarity between the reference information and each of the two anchors. NLP technology can be used to perform semantic understanding on the reference information, obtain the corresponding sub-scene, and determine the semantic similarity with the sub-scene. If the semantic similarity between the current analysis point and the first anchor is 50%, the semantic similarity between the current analysis point and the second anchor is 90%, and the semantic similarity between the previous analysis point and the first anchor is 98%, the current analysis point is determined to be the semantic turning point in the second segment.
In the embodiment of the disclosure, the semantic turning points can be accurately found based on the second segment and the semantic similarity, so that a data foundation is laid for accurately separating out the description segments used by different sub-scenes, and the quality of the generated video can be improved.
In other embodiments, the maximum semantic similarity between the content in the second segment and the first anchor may also be determined; if the semantic similarity between the current analysis point in the second segment and the first anchor is lower than this maximum and the difference from the maximum is greater than a difference threshold, the current analysis point is determined to be the semantic turning point.
In addition to the first anchor, the second anchor may be used as a reference to identify the semantic turning point: the maximum semantic similarity between the content in the second segment and the second anchor may be determined, and if the semantic similarity between the current analysis point and the second anchor is lower than this maximum and the difference from the maximum is greater than a difference threshold, the current analysis point is determined to be the semantic turning point.
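A sketch of steps A1 to A3 follows: for each analysis point in a second segment, the reference information (the point together with the specified length of preceding content) is compared with both anchors, and the turning point is where the closer anchor flips from the first to the second. similarity is again a hypothetical semantic-similarity function.

    def find_turning_point(points, first_anchor, second_anchor, similarity, window=3):
        # points: ordered analysis points (e.g. sentences) of the second segment.
        previous_closer_to_first = True          # content right after the first anchor
        for i in range(len(points)):
            reference = " ".join(points[max(0, i - window):i + 1])        # step A1
            closer_to_first = (similarity(reference, first_anchor)        # step A2
                               >= similarity(reference, second_anchor))
            if previous_closer_to_first and not closer_to_first:          # step A3
                return i                         # current analysis point is the turning point
            previous_closer_to_first = closer_to_first
        return None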
S205, determining the content between two semantic turning points adjacent to the same anchor point in the description information as a description fragment.
S206, determining the sub-scene matched with the anchor point included by the description fragment as the sub-scene of the description fragment.
In the embodiment of the disclosure, the description information is divided into a plurality of description fragments based on the semantic turning point, and then the sub-scene corresponding to the description fragment is determined based on each description fragment.
In some embodiments, once the sub-scene corresponding to a description fragment has been determined, the layout of the sub-scene may be adjusted based on the description fragment, which may be implemented as follows: for each description fragment, acquire the preset questions of the sub-scene of the description fragment; perform machine reading comprehension on the description fragment to obtain the answer to each preset question; and, based on the answers to the preset questions, select from the 3D scene element set the elements of the sub-scene and the attribute information of those elements.
Taking a zoo panorama as the target 3D scene, suppose the first segment corresponding to anchor B is "the teacher takes the students to visit the zoo; Xiaoguang stands first in the queue and Xiaohong stands behind Xiaoguang; Xiaoguang wears blue clothes and Xiaohong wears red clothes". The semantic label of this first segment is determined to be "queuing to visit the zoo", and the "queuing to visit" sub-scene may then be determined to be the sub-scene matching anchor B. The students in this scene may be male or female, and the clothes they wear may differ. Therefore, the preset questions of this scene may be "Who are the characters included in the scene?" and "What are the appearance characteristics of the characters?". Based on the preset questions, machine reading comprehension (Machine Reading and Comprehension, MRC) technology can be used to perform semantic understanding on the description fragment and find the answers in it, thereby determining that the characters are the queuing students and that the students are characterized by the colors of their clothes.
It should be noted that the scene elements in the sub-scene may be known elements in the 3D resource library, and these elements may be 3D images set in advance. Each sub-scene may be associated with some of these elements, and the preset questions of each sub-scene are questions raised about the elements associated with it, so the scene elements required within the sub-scene can be analyzed based on the description fragment.
For example, for broadcasting a weather forecast, the preset question may be what the gender of the presenter is, so that a suitable presenter character can be selected into the sub-scene.
For another example, in an art performance scene, a preset question may ask about the required instruments and the character traits of those playing them; the 3D elements of the corresponding instruments are then laid out into the sub-scene, together with the characters operating the instruments.
When the scene elements in the 3D resource library cannot accurately depict the description fragment, scene elements may also be generated with a generative technique, e.g. Generative Adversarial Networks (GAN). The video generated in this way can be more vivid and conform better to the description, and is not limited by the material in the 3D resource library.
In addition, for the description information, the user may also specify whether to use the material of the 3D resource library. For example, some literary works describe rather similar places; scene material can then be obtained entirely in a generative way, so that different description information yields videos with styles as varied as possible.
For another example, multiple camera positions may be laid out in the same sub-scene; based on the understanding of the description information, shots may be taken from different camera positions of the sub-scene, which also produces differences in visual effect.
In the embodiments of the disclosure, based on the preset questions, the characteristics of the scene elements in the description information can be obtained, so that the scene elements to be laid out for the description information can be accurately understood, providing support for generating high-quality video resources.
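The machine-reading-comprehension step can be sketched as below, assuming that a generic extractive question-answering model from the third-party transformers library is acceptable; the preset questions and the mapping of answers to 3D elements are illustrative placeholders.

    from transformers import pipeline

    def answer_preset_questions(fragment, preset_questions):
        # preset_questions, e.g. ["Who are the characters included in the scene?",
        #                         "What are the appearance characteristics of the characters?"]
        reader = pipeline("question-answering")  # generic extractive MRC model
        answers = {q: reader(question=q, context=fragment)["answer"]
                   for q in preset_questions}
        # The answers are then matched against the 3D scene element set to select
        # concrete scene elements and their attributes (clothing colour, gender, ...).
        return answers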
Because the video of the 3D scene is generated based on the description information in the embodiment of the disclosure, no matter whether the 3D scene belongs to an indoor scene or an outdoor scene, illumination is required in the scene. In order to use proper illumination, in the embodiment of the disclosure, for each description fragment, emotion analysis is performed on the description fragment under the condition that the description fragment does not explicitly record illumination conditions, so as to obtain emotion analysis results; and selecting the illumination condition matched with the emotion analysis result from the illumination set as the illumination condition adopted by the sub-scene of the descriptive fragment.
Taking a target 3D scene as a financial street panorama as an example, a sub-scene is street interview A, and a financial street data display scene B. The semantic analysis result of the descriptive fragment is a company c financial newspaper, and the sub-scene corresponding to the descriptive information is a financial street data display scene B. The street corner guideboard in the sub-scene can be replaced by 'company c financial report', and the financial street data display scene B displays a data histogram of recent income of company c. And carrying out text tendency analysis on the descriptive fragment based on NLP to obtain emotion analysis results, wherein under the condition that the emotion analysis results represent optimism, as many people are more happy in sunny days and represent some distress in overcast days, the weather of the sub-scene B can be set as sunny days, and the time can be set as midday.
In the embodiments of the present disclosure, the emotional tendency of the text is determined based on the description fragment and the illumination condition is set accordingly, so that the generated video is closer to the content expressed by the description information.
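As a rough illustration of this emotion-to-illumination mapping, the following sketch uses a generic sentiment classifier from the transformers library; the classifier, its labels and the illumination set shown here are assumptions for illustration, not the disclosed implementation:

# A minimal sketch: pick an illumination condition from a (hypothetical)
# illumination set when the fragment does not state one explicitly.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # generic text-tendency model (assumption)

ILLUMINATION_SET = {  # hypothetical emotion label -> illumination condition
    "POSITIVE": {"weather": "sunny", "time": "noon"},
    "NEGATIVE": {"weather": "overcast", "time": "dusk"},
}

def pick_illumination(description_fragment: str) -> dict:
    """Select the illumination condition matching the emotion analysis result."""
    label = sentiment(description_fragment)[0]["label"]
    return ILLUMINATION_SET.get(label, {"weather": "sunny", "time": "noon"})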
In some embodiments, an opening, an ending, opening and ending theme music, background music, and the like may be added to the generated video; if beautification is required, filter processing may further be applied to part of the content in the video.
In summary, according to the video generation method provided by the embodiments of the present disclosure, an AI (Artificial Intelligence) director can be realized: the scenes and roles are determined, the recording strategy (i.e., the 3D video description file) is determined, and the video is then generated. The whole process can be divided into two phases: "getting into place" and "shooting". The "getting into place" phase includes three steps: selecting a target scene from the 3D scene repository and determining the sub-scenes, determining the scene elements for each sub-scene, and constructing the scenes. The schematic diagram is shown in fig. 4 (a): semantic understanding is performed on the description information to determine the target scene and, because the target scene comprises a plurality of sub-scenes, semantic understanding is performed on each description fragment in the description information to determine the sub-scene matching each description fragment. After the sub-scene is determined, the scene elements in the sub-scene can be selected and arranged based on the semantic information of the description fragment, and scene construction is then completed based on the scene elements and their arrangement.
The "shooting" phase comprises two steps: "blueprint planning" and "shooting". The schematic diagram is shown in fig. 4 (b). "Blueprint planning" uses the time axis of the audio of the description information to string together the sub-scene shots and to position the camera positions in space, that is, to determine the lens-carrying (camera-movement) mode of each sub-scene and the switching mode between different sub-scenes.
With the above preparatory work completed, the 3D video description file can finally be obtained.
For example, assume that the target 3D scene corresponding to the 3D video description file (sequence) is "city", sub-scene A is shot by camera_a, and sub-scene B is shot by camera_b; when the output attribute is added to a camera object's tag, the output content of that camera is used to generate the video. The entire process is organized as a plurality of keyframes, each keyframe representing a series of object property values (e.g., position, rotation angle, etc.) at that point in time. The camera has a property "transition", which indicates the manner of transition (e.g., fly-in, direct cut, etc.). In this example, the form of the 3D video description file (sequence) is as follows:
<sequence scene="city">  // the target 3D scene is "city"
  <key-frame time="0">  // start time of sub-scene A
    <object output id="camera_a" position="-63164.158798,39656.181052,73.000096" rotation="0" transition=""/>  // the camera adopted by sub-scene A is camera_a, with attributes such as camera position, angle and transition mode
    <object id="host" position="-63162.198708,12315.181137,0.000096" rotation="0" transition=""/>  // position, angle and other information of the presenter in sub-scene A
  </key-frame>
  <key-frame time="5">  // sub-scene A at the 5 s mark
    <object output id="camera_a" position="-63164.158798,39656.181052,73.000096" rotation="-27" transition="fly-in"/>  // camera attributes adopted by sub-scene A at the 5 s mark
  </key-frame>
  <key-frame time="20">  // start time of sub-scene B
    <object output id="camera_b" position="-39142.009,812364.829102,28.1902" rotation="0" transition="cut-in"/>  // camera attributes adopted by sub-scene B
  </key-frame>
</sequence>
By analogy, the camera and shooting angle adopted by each sub-scene are recorded in turn through a series of such descriptions, and the arranged sub-scenes are shot. The resulting 3D video description file can clearly express the field of view, content and manner of each shot, thereby forming a clear recording strategy.
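As an illustration of how such a recording strategy could be assembled programmatically, the following is a minimal serialization sketch for the sequence form shown above; the data layout and helper names are assumptions made for illustration, not part of the disclosure or of any engine API:

# A minimal sketch that serializes a list of keyframes into the XML-like
# sequence form shown above; 'output' marks the camera whose view is used
# for the final video.
def object_tag(obj: dict) -> str:
    flags = "output " if obj.get("output") else ""
    attrs = " ".join(f'{k}="{v}"' for k, v in obj.items() if k != "output")
    return f"    <object {flags}{attrs}/>"

def write_sequence(scene: str, keyframes: list) -> str:
    lines = [f'<sequence scene="{scene}">']
    for kf in keyframes:
        lines.append(f'  <key-frame time="{kf["time"]}">')
        lines.extend(object_tag(obj) for obj in kf["objects"])
        lines.append("  </key-frame>")
    lines.append("</sequence>")
    return "\n".join(lines)

# Hypothetical usage:
# write_sequence("city", [{"time": 0, "objects": [
#     {"output": True, "id": "camera_a",
#      "position": "-63164.158798,39656.181052,73.000096",
#      "rotation": "0", "transition": ""}]}])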
After the preparatory work is ready, the Unreal Engine (UE) simulates real camera movement and broadcast-directing behavior based on the 3D video description file and shoots across the different camera positions. As shown in fig. 4 (b), 4 key anchor points are given on the time axis by way of example, each anchor point corresponding to a respective sub-scene.
Anchor point 1 corresponds to sub-scene camera position A, whose sub-scene is a presenter operating an interactive screen; anchor point 2 corresponds to sub-scene camera position B, whose sub-scene is a large screen with a digital human; anchor point 4 corresponds to sub-scene camera position C, which corresponds to a stereoscopic display of statistical data; anchor point 3 corresponds to sub-scene camera position D, whose sub-scene is three stereoscopic billboards. Based on the switching among these four camera positions, the camera movement within a scene and the switching between scenes are completed, and the video is finally generated. For example, the content that sub-scene camera position A is responsible for recording is: the presenter operates the interactive screen, related reports of the World Cup are displayed on the interactive screen, and the presenter can select the football star to be reported on. The view then switches to sub-scene camera position B, which is aimed at a large screen introduced by a digital human; the large screen displays the star's performance in the World Cup, records, growth history, and the like. The view can then switch to the billboards of sub-scene camera position D to display the sponsors of the World Cup, and finally to sub-scene camera position C to display some statistical analysis of the World Cup. In this way, a plurality of sub-scenes are connected with one another, and the corresponding content can be generated following the description information.
In order to understand the video generation method according to the embodiments of the present disclosure in more detail, taking description information in the form of an input text as an example, the overall flowchart of the method is shown in fig. 4 (c):
S401, acquiring an input text, and removing advertisement content in the input text.
S402, determining a target scene based on the input text and the scene resource library.
S403, determining the sub-scene corresponding to the text segment based on the text segment of the input text.
S404, determining scene elements in the sub-scene based on the question-answer model.
S405, performing scene construction and configuration on the sub-scene based on the scene elements.
For a sub-scene containing a screen presentation, such as the large screen at camera position B in fig. 4 (b), the material displayed on the screen may, as shown in fig. 5, come from an existing material library, be generated from text, or come from other material sources. When the large screen has to display a plurality of materials, the materials are arranged in order along the time axis, and the sub-scene is then configured accordingly.
For example, in a scene where a famous football star is interviewed, the star's goal highlights may be shown on the large screen. Since some existing pictures may be subject to copyright, the material displayed on the large screen may instead be acquired in the manner shown in fig. 5 and then arranged in order along the time axis, so as to obtain the configuration of the large screen in the sub-scene.
Even within the same scene, the scene layout, shots and the like can be adjusted in line with the text content, so that rich materials are generated rather than existing materials being reused, which reduces the dependence on existing materials. Moreover, materials of different resolutions can be generated based on the embodiments of the present disclosure, without being constrained by the resolution of the original materials.
S406, for each sub-scene, generating a corresponding lens-carrying mode based on the description fragment of the sub-scene.
S407, determining a shot switching mode among the sub-scenes according to the adjacent sub-scenes.
And S408, generating a 3D video description file.
S409, rendering the 3D video description file using the Unreal Engine and generating the video.
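Read together, steps S401-S409 form the following high-level pipeline. This is a structural sketch only; every helper below is a trivial placeholder standing in for the corresponding step, not the disclosed implementation or an API of any rendering engine:

# A structural sketch of the S401-S409 flow with trivial placeholder helpers.
from typing import List, Tuple

def remove_advertisements(text: str) -> str:
    return text  # S401 placeholder: a real system strips promotional passages

def match_target_scene(text: str, scene_library: List[str]) -> str:
    return scene_library[0] if scene_library else "city"  # S402 placeholder: label similarity in practice

def split_into_fragments(text: str) -> List[str]:
    return [p.strip() for p in text.split("\n") if p.strip()]  # placeholder segmentation

def match_sub_scene(fragment: str, target_scene: str) -> str:
    return f"{target_scene}/default"  # S403 placeholder: anchor/turning-point matching in practice

def plan_camera_movement(fragment: str, sub_scene: str) -> str:
    return "pan"  # S406 placeholder lens-carrying mode

def generate_video_plan(input_text: str, scene_library: List[str]) -> List[Tuple[str, str]]:
    text = remove_advertisements(input_text)               # S401
    target = match_target_scene(text, scene_library)       # S402
    shots = []
    for fragment in split_into_fragments(text):
        sub_scene = match_sub_scene(fragment, target)      # S403
        # S404/S405: preset-question answering and scene configuration would run here
        shots.append((sub_scene, plan_camera_movement(fragment, sub_scene)))  # S406
    # S407-S409: shot switching between adjacent sub-scenes, generation of the
    # 3D video description file, and rendering with a 3D engine follow from here
    return shots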
In summary, the overall video generation process can be regarded as a 3D film planning and shooting process led by AI (Artificial Intelligence). In this process, scene selection and prop placement are first carried out according to the understanding of the description information, making full use of spatial layout and scene decoration to arrange the scenes. In implementation, not only can existing scene elements be selected, but scene elements can also be generated as needed, finally realizing the customization of scene elements for the description information. The time axis is used to string together the transitions between spatial sub-scenes, and a 3D "live-action" shooting effect is achieved within a sub-scene by moving the camera among different camera positions. For existing pictures and videos, the method can generate the required material fragments in the scene, so that they can still be used in the film and existing materials are displayed more naturally within the scene elements.
Taking the case where the input description information is an article as an example, the whole video generation process is shown in fig. 6. After the article is input, a corresponding scene is selected from the scene library according to the content of the article; for example, financial articles and social news articles may select an urban scene, a football match report may select a studio scene, and free-form writing may select a natural landscape scene. After the large scene is selected, assuming an urban scene is selected, the sub-scenes corresponding to the different description fragments can be selected according to the understanding of the article content. In the scene library, sub-scenes exist in the form of templates. For example, the financial category may further select a financial street template, the social news category may select a residential-area template, and categories for which no suitable template is matched may have a scene template generated automatically based on the content. Of course, description fragments that do not match a suitable sub-scene may all have corresponding sub-scenes generated according to the content of the description fragments. Assuming the selected template is the financial street template, the description fragment can be subjected to self-question-and-answer to realize the arrangement of the scene: for example, questions may ask about the presenter, the emotional tendency of the text, and the time at which the event occurs; the profit situation can be summarized and analyzed, and a corresponding chart can be generated from the analyzed data for display in the sub-scene. In this way the scene arrangement of the different sub-scenes is completed, and the sub-scenes can be spliced according to the order of the time axis. The image materials needed by a sub-scene need not rely only on a material library: the existing image materials in the article can be used, videos can be generated by a video synthesis platform to serve as materials, and images can also be generated from text. For example, for the introduction of a football star, a video may be automatically generated based on the present disclosure and shown on the large screen. Corresponding elements are placed in corresponding locations for presentation in accordance with the understanding of the article. After the lens-carrying mode of the different sub-scenes and the shot switching mode between sub-scenes are determined, the 3D video description file can finally be generated. The 3D rendering engine renders the 3D video description file, the output video is stored, and the video recording is thereby completed. A link to the recorded video may be sent to the user for downloading and viewing.
In terms of post-processing, the user can add an opening, an ending, and filter effects to the generated video. The whole video generation process, however, is completed automatically by the AI director, and the user only needs to fine-tune and optimize the result according to his or her own needs.
Based on the same technical concept, in an embodiment of the present disclosure, there is provided a video generating apparatus 700, as shown in fig. 7, including:
an acquisition module 701, configured to acquire description information for describing video content;
a first matching module 702, configured to determine a target 3D scene that matches the description information;
a second matching module 703, configured to determine a sub-scene that matches each description fragment in the description information from among a plurality of sub-scenes included in the target 3D scene;
the lens-carrying determining module 704 is configured to determine a lens-carrying manner of each sub-scene based on a semantic analysis result of each description fragment;
the shot switching determining module 705 is configured to determine a shot switching manner between sub-scenes of adjacent description fragments based on a semantic analysis result of the adjacent description fragments;
the file generating module 706 is configured to generate a 3D video description file based on the ordering of the sub-scenes in the description information, the lens-carrying mode of each sub-scene, the shot switching mode between the sub-scenes, and the description information;
The video generating module 707 is configured to process the 3D video description file based on the 3D rendering engine, and generate a video corresponding to the description information.
In some embodiments, the second matching module comprises:
the acquisition unit is used for acquiring a plurality of anchor points of the descriptive information;
the first segment determining unit is used for intercepting a first segment of each anchor point from the description information;
the first matching unit is used for carrying out semantic analysis on the first segment of each anchor point to obtain sub-scenes matched with each anchor point respectively;
the turning point determining unit is used for analyzing semantic turning points in the second segments by adopting a natural language processing technology aiming at the second segments between adjacent anchor points in the description information;
the descriptive fragment determining unit is used for determining the content between two semantic turning points adjacent to the same anchor point in the descriptive information as descriptive fragments;
and the second matching unit is used for determining the sub-scene matched with the anchor point included in the description fragment as the sub-scene of the description fragment.
In some embodiments, the apparatus further comprises a scene element determination module configured to:
aiming at each description fragment, acquiring a preset problem of a sub-scene of the description fragment;
performing machine reading understanding on the description fragments to obtain answers corresponding to all preset questions;
and selecting, from the 3D scene element set, the elements of the sub-scene and the attribute information of the elements of the sub-scene based on the answers to all the preset questions.
In some embodiments, the turning point determining unit is configured to:
for each second segment, acquiring a current analysis point in the second segment and contents with specified length before the current analysis point from the description information to obtain reference information of the current analysis point;
determining semantic similarity of the reference information and a first anchor point and a second anchor point in the second segment respectively, wherein the first anchor point is positioned before the second anchor point in the description information;
and determining the current analysis point as a semantic turning point in the second segment under the condition that the semantic similarity between the current analysis point and the first anchor point is lower than the semantic similarity between the current analysis point and the second anchor point and the semantic similarity between the previous analysis point of the current analysis point and the first anchor point is higher than the semantic similarity between the current analysis point and the second anchor point.
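A rough sketch of this turning-point rule follows, assuming sentence-level analysis points and a simple bag-of-words cosine similarity in place of whatever semantic similarity measure an implementation would actually use; all names are illustrative:

# A minimal sketch of the turning-point rule described above.
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def find_turning_point(sentences: list, first_anchor: str, second_anchor: str, window: int = 3):
    """Index of the first analysis point whose reference information is closer to the
    second anchor while the previous analysis point's was still closer to the first."""
    def closer_to_second(i: int) -> bool:
        reference = " ".join(sentences[max(0, i - window):i + 1])  # current point plus preceding context
        return cosine_similarity(reference, first_anchor) < cosine_similarity(reference, second_anchor)
    for i in range(1, len(sentences)):
        if closer_to_second(i) and not closer_to_second(i - 1):
            return i
    return None  # no turning point found in this second segment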
In some embodiments, the first segment determining unit is configured to:
and aiming at each anchor point, taking the anchor point as a center, and intercepting the content with the preset radius length from the description information to obtain a first segment of the anchor point.
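A tiny sketch of this radius-based interception, assuming the anchor is given as a character offset into the description information; the default radius is an arbitrary illustration:

def first_segment(description: str, anchor_index: int, radius: int = 100) -> str:
    """Content of preset radius length centered on the anchor."""
    start = max(0, anchor_index - radius)
    return description[start:anchor_index + radius]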
In some embodiments, the acquisition module is configured to:
and under the condition that the total length of the original description information of the description information is larger than a length threshold value, compressing the original description information to obtain the description information.
In some embodiments, the acquisition module is further configured to:
and under the condition that the original descriptive information is text, extracting a text abstract of the original descriptive information to obtain the descriptive information.
In some embodiments, the acquisition module is further configured to:
under the condition that the original description information is audio, acquiring a text corresponding to the audio;
and extracting a text abstract from the text corresponding to the audio to obtain the description information.
In some embodiments, the apparatus further comprises an audio determination module configured to:
generating audio of the descriptive information under the condition that the descriptive information is text; the playing time length of the generated video is matched with the playing time length of the audio.
In some embodiments, the first matching module is configured to:
determining the similarity between the description information and the text labels of the candidate scenes;
and selecting the candidate scene with the highest similarity as a target 3D scene matched with the description information.
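A minimal sketch of this label-matching step, using the Python standard library's SequenceMatcher as a stand-in similarity measure; the candidate scene labels shown in the usage comment are hypothetical:

# A minimal sketch: choose the candidate 3D scene whose text label is most
# similar to the description information.
from difflib import SequenceMatcher

def match_target_scene(description: str, candidate_labels: dict) -> str:
    def similarity(label: str) -> float:
        return SequenceMatcher(None, description.lower(), label.lower()).ratio()
    return max(candidate_labels, key=lambda scene: similarity(candidate_labels[scene]))

# Hypothetical usage:
# scenes = {"city": "finance news city street studio", "studio": "football match report studio"}
# match_target_scene("company c releases its quarterly financial report", scenes)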
In some embodiments, the apparatus further comprises an advertisement removal module configured to:
Advertisement content in the original descriptive information of the descriptive information is removed.
In some embodiments, the apparatus further comprises an illumination determination module configured to:
carrying out emotion analysis on each description fragment under the condition that the description fragment does not clearly record the illumination condition, so as to obtain an emotion analysis result;
and selecting the illumination condition matched with the emotion analysis result from the illumination set as the illumination condition adopted by the sub-scene of the descriptive fragment.
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
In the technical solution of the present disclosure, the acquisition, storage, application and the like of the user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM802, and the RAM803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, a video generation method. For example, in some embodiments, the video generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM802 and/or communication unit 809. When a computer program is loaded into RAM803 and executed by computing unit 801, one or more steps of the video generation method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the video generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (27)

1. A video generation method, comprising:
acquiring description information for describing video content;
determining a target 3D scene matched with the description information;
determining sub-scenes matched with each description fragment in the description information from a plurality of sub-scenes included in the target 3D scene;
determining a lens-carrying mode of each sub-scene based on a semantic analysis result of each description fragment; and
determining a shot switching mode between sub-scenes of adjacent description fragments based on semantic analysis results of the adjacent description fragments;
generating a 3D video description file based on the ordering of the sub-scenes in the description information, the lens-carrying mode of each sub-scene, the shot switching mode between sub-scenes, and the description information;
and processing the 3D video description file based on a 3D rendering engine, and generating a video corresponding to the description information.
2. The method of claim 1, wherein determining a sub-scene that matches each description fragment in the description information from among a plurality of sub-scenes included in the target 3D scene comprises:
acquiring a plurality of anchor points of the description information;
intercepting a first segment of each anchor point from the description information;
carrying out semantic analysis on the first segment of each anchor point to obtain sub-scenes matched with each anchor point respectively;
aiming at second fragments between adjacent anchor points in the description information, analyzing semantic turning points in the second fragments by adopting a natural language processing technology;
determining the content between two semantic turning points adjacent to the same anchor point in the description information as a description fragment;
and determining the sub-scene matched with the anchor point included by the description fragment as the sub-scene of the description fragment.
3. The method of claim 1 or 2, further comprising:
Aiming at each description fragment, acquiring a preset problem of a sub-scene of the description fragment;
performing machine reading understanding on the description fragments to obtain answers corresponding to all preset questions;
and selecting, from the 3D scene element set, the elements of the sub-scene and the attribute information of the elements of the sub-scene based on the answers to all the preset questions.
4. The method of claim 2, wherein for the second segments between adjacent anchor points in the description information, analyzing semantic turning points in each of the second segments using natural language processing techniques comprises:
for each second segment, acquiring a current analysis point in the second segment and contents with specified length before the current analysis point from the description information to obtain reference information of the current analysis point;
determining semantic similarity of the reference information and a first anchor point and a second anchor point in the second segment respectively, wherein the first anchor point is positioned before the second anchor point in the description information;
and determining the current analysis point as the semantic turning point in the second segment under the condition that the semantic similarity between the current analysis point and the first anchor point is lower than the semantic similarity between the current analysis point and the second anchor point and the semantic similarity between the previous analysis point of the current analysis point and the first anchor point is higher than the semantic similarity between the current analysis point and the second anchor point.
5. The method of claim 2, wherein intercepting the first segment of each anchor from the description information comprises:
and aiming at each anchor point, taking the anchor point as a center, and intercepting the content with the preset radius length from the description information to obtain a first fragment of the anchor point.
6. The method of any of claims 1-5, further comprising:
and under the condition that the total length of the original description information of the description information is larger than a length threshold value, compressing the original description information to obtain the description information.
7. The method of claim 6, wherein compressing the original description information comprises:
and extracting a text abstract of the original descriptive information to obtain the descriptive information under the condition that the original descriptive information is text.
8. The method of claim 6, wherein compressing the original description information comprises:
acquiring a text corresponding to the audio under the condition that the original description information is the audio;
and extracting a text abstract from the text corresponding to the audio to obtain the description information.
9. The method of any of claims 1-7, further comprising:
Generating audio of the descriptive information under the condition that the descriptive information is text; the playing time length of the generated video is matched with the playing time length of the audio.
10. The method of any of claims 1-9, wherein determining a target 3D scene that matches the descriptive information comprises:
determining the similarity between the description information and the text labels of the candidate scenes;
and selecting the candidate scene with the highest similarity as a target 3D scene matched with the description information.
11. The method of any of claims 6-10, further comprising:
and removing advertisement content in the original description information of the description information.
12. The method of any of claims 1-11, further comprising:
carrying out emotion analysis on each description fragment under the condition that the description fragment does not explicitly record illumination conditions, so as to obtain emotion analysis results;
and selecting the illumination condition matched with the emotion analysis result from the illumination set as the illumination condition adopted by the sub-scene of the description fragment.
13. A video generating apparatus comprising:
the acquisition module is used for acquiring description information for describing the video content;
The first matching module is used for determining a target 3D scene matched with the description information;
the second matching module is used for determining sub-scenes matched with each description fragment in the description information from a plurality of sub-scenes included in the target 3D scene;
the lens-carrying determining module is used for determining the lens-carrying mode of each sub-scene based on the semantic analysis result of each description fragment;
the shot switching determining module is used for determining a shot switching mode between sub-scenes of the adjacent description fragments based on semantic analysis results of the adjacent description fragments;
the file generation module is used for generating a 3D video description file based on the ordering of the sub-scenes in the description information, the lens-carrying mode of each sub-scene, the shot switching mode between the sub-scenes, and the description information;
and the video generation module is used for processing the 3D video description file based on a 3D rendering engine and generating a video corresponding to the description information.
14. The apparatus of claim 13, wherein the second matching module comprises:
an acquisition unit, configured to acquire a plurality of anchor points of the description information;
the first segment determining unit is used for intercepting a first segment of each anchor point from the description information;
The first matching unit is used for carrying out semantic analysis on the first segment of each anchor point to obtain sub-scenes matched with each anchor point respectively;
the turning point determining unit is used for analyzing semantic turning points in the second segments by adopting a natural language processing technology aiming at the second segments between adjacent anchor points in the description information;
the description fragment determining unit is used for determining the content between two semantic turning points adjacent to the same anchor point in the description information as a description fragment;
and the second matching unit is used for determining the sub-scene matched with the anchor point included by the description fragment as the sub-scene of the description fragment.
15. The apparatus of claim 13 or 14, further comprising a scene element determination module to:
aiming at each description fragment, acquiring a preset problem of a sub-scene of the description fragment;
performing machine reading understanding on the description fragments to obtain answers corresponding to all preset questions;
and selecting, from the 3D scene element set, the elements of the sub-scene and the attribute information of the elements of the sub-scene based on the answers to all the preset questions.
16. The apparatus of claim 14, wherein the turning point determination unit is configured to:
For each second segment, acquiring a current analysis point in the second segment and contents with specified length before the current analysis point from the description information to obtain reference information of the current analysis point;
determining semantic similarity of the reference information and a first anchor point and a second anchor point in the second segment respectively, wherein the first anchor point is positioned before the second anchor point in the description information;
and determining the current analysis point as the semantic turning point in the second segment under the condition that the semantic similarity between the current analysis point and the first anchor point is lower than the semantic similarity between the current analysis point and the second anchor point and the semantic similarity between the previous analysis point of the current analysis point and the first anchor point is higher than the semantic similarity between the current analysis point and the second anchor point.
17. The apparatus of claim 14, wherein the first segment determination unit is configured to:
and aiming at each anchor point, taking the anchor point as a center, and intercepting the content with the preset radius length from the description information to obtain a first fragment of the anchor point.
18. The apparatus of any of claims 13-17, wherein the acquisition module is configured to:
And under the condition that the total length of the original description information of the description information is larger than a length threshold value, compressing the original description information to obtain the description information.
19. The apparatus of claim 18, wherein the acquisition module is further configured to:
and extracting a text abstract of the original descriptive information to obtain the descriptive information under the condition that the original descriptive information is text.
20. The apparatus of claim 18, wherein the acquisition module is further configured to:
acquiring a text corresponding to the audio under the condition that the original description information is the audio;
and extracting a text abstract from the text corresponding to the audio to obtain the description information.
21. The apparatus of any of claims 13-19, further comprising an audio determination module to:
generating audio of the descriptive information under the condition that the descriptive information is text; the playing time length of the generated video is matched with the playing time length of the audio.
22. The apparatus of any of claims 13-21, wherein the first matching module is to:
determining the similarity between the description information and the text labels of the candidate scenes;
And selecting the candidate scene with the highest similarity as a target 3D scene matched with the description information.
23. The apparatus of any of claims 18-22, further comprising an advertisement removal module to:
and removing advertisement content in the original description information of the description information.
24. The apparatus of any of claims 13-23, further comprising an illumination determination module to:
carrying out emotion analysis on each description fragment under the condition that the description fragment does not explicitly record illumination conditions, so as to obtain emotion analysis results;
and selecting the illumination condition matched with the emotion analysis result from the illumination set as the illumination condition adopted by the sub-scene of the description fragment.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-12.
CN202310079103.0A 2023-01-18 2023-01-18 Video generation method, device, equipment and storage medium Pending CN116389849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310079103.0A CN116389849A (en) 2023-01-18 2023-01-18 Video generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310079103.0A CN116389849A (en) 2023-01-18 2023-01-18 Video generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116389849A true CN116389849A (en) 2023-07-04

Family

ID=86975757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310079103.0A Pending CN116389849A (en) 2023-01-18 2023-01-18 Video generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116389849A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117835013A (en) * 2023-12-27 2024-04-05 北京智象未来科技有限公司 Multi-scene video generation method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination