CN117376502A - Video production system based on AI technology - Google Patents


Info

Publication number
CN117376502A
CN117376502A (application number CN202311671371.8A; granted as CN117376502B)
Authority
CN
China
Prior art keywords
video
module
sub
frame
image
Prior art date
Legal status
Granted
Application number
CN202311671371.8A
Other languages
Chinese (zh)
Other versions
CN117376502B (en)
Inventor
刘秋菊 (Liu Qiuju)
Current Assignee
Xiangfei Tianjin Intelligent Technology Co ltd
Original Assignee
Xiangfei Tianjin Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xiangfei Tianjin Intelligent Technology Co ltd filed Critical Xiangfei Tianjin Intelligent Technology Co ltd
Priority to CN202311671371.8A priority Critical patent/CN117376502B/en
Publication of CN117376502A publication Critical patent/CN117376502A/en
Application granted granted Critical
Publication of CN117376502B publication Critical patent/CN117376502B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs


Abstract

The invention relates to the technical field of video processing, and in particular to a video production system based on AI technology. In the invention, a deep learning decoding algorithm markedly improves the accuracy and efficiency of video frame extraction; a convolutional neural network is used for feature extraction, enabling the system to understand video content more deeply; a recurrent neural network is used for content identification, improving the accuracy of time-series analysis; reinforcement learning is used for shot selection, optimizing the selection process and enhancing the logic and watchability of the video; a generative adversarial network is used for video frame processing, improving quality and the diversity of artistic effects; video coding technology ensures compression efficiency without sacrificing quality; and the combined video summary generation and special-effect dynamic rendering modules strengthen the expressiveness of the video and attract the audience.

Description

Video production system based on AI technology
Technical Field
The invention relates to the technical field of video processing, in particular to a video production system based on an AI technology.
Background
The field of video processing technology concerns the use of computers and related algorithms to process video data, including editing, compositing, enhancing and analyzing video. Applications range from film production and video advertising to surveillance systems and virtual reality.
A video production system based on AI technology is a computer software or hardware system that uses artificial intelligence to automate and optimize the video production process. Such systems apply machine learning, computer vision, speech recognition and related techniques to automatically analyze, edit, synthesize and enhance video material and generate high-quality video content. Their main purpose is to improve the efficiency, quality and creativity of video production: reducing manual intervention, automatically identifying elements such as the best shots, audio and special effects so as to generate satisfactory video, and enabling personalized video generation to meet the needs of different purposes and audiences. By using AI technology, various aspects of the video content can be improved, including image quality, audio quality, special effects and post-production, ensuring the visual and auditory appeal of the video content.
Existing systems suffer from several disadvantages. They often lack highly automated feature extraction and content recognition capabilities, so the video production process requires significant manual intervention, is inefficient and is prone to errors. The traditional shot selection process relies mainly on manual editing, which is time-consuming and makes narrative consistency difficult to guarantee. Video frame processing lacks the support of advanced algorithms such as generative adversarial networks, so improvements in video quality and artistic style are limited. In addition, existing systems lack intelligence in video summarization and special effect rendering: they often fail to capture the key information in a video accurately, or fail to match special effects closely to the video content and emotion.
Disclosure of Invention
The invention aims to solve the defects in the prior art, and provides a video production system based on an AI technology.
In order to achieve the above purpose, the present invention adopts the following technical scheme: The video production system based on the AI technology comprises a video decoding module, a feature extraction module, a content identification module, a shot selection module, a video frame processing module, a video synthesis module, a video abstract generation module and a special effect dynamic rendering module;
The video decoding module analyzes the input video file by adopting a decoding algorithm based on deep learning, extracts continuous picture frames and generates a frame sequence;
the feature extraction module carries out image feature learning by adopting a convolutional neural network based on a video frame sequence and generates a feature data set;
the content identification module is used for automatically identifying and classifying video content based on the characteristic data set by adopting a recurrent neural network to analyze time sequence information and generating a content abstract;
the shot selection module selects shots based on the content abstract and by adopting a reinforcement learning strategy to refer to scene continuity and plot development, and generates a shot decision list;
the video frame processing module performs style migration and super-resolution reconstruction by adopting a generative adversarial network based on the shot decision list to generate a high-quality video frame;
the video synthesis module re-encodes and synthesizes the processed frames into a new video file by adopting a video encoding technology based on high-quality video frames to generate an enhanced video;
the video abstract generating module is used for extracting key information in the video by adopting a long short-term memory network and an attention mechanism based on the enhanced video to generate a video abstract;
The special effect dynamic rendering module dynamically adds or adjusts the special effect of the video based on the video abstract by utilizing scene analysis and deep learning technology, and matches the video content and emotion to generate a special effect video;
the frame sequence is a visual picture sequence sequenced in time sequence, the characteristic data set is visual characteristics and semantic labels of a plurality of groups of picture frames, the content abstract comprises time information, main events and characters in a segment, the shot decision list is a shot sequence and a time point selected based on plot requirements, and the high-quality video frame is a video frame subjected to super-resolution reconstruction and style migration.
As a further scheme of the invention, the video decoding module comprises a video stream analysis sub-module, a frame extraction sub-module and a format conversion sub-module;
the feature extraction module comprises an image recognition sub-module, a feature encoding sub-module and a semantic analysis sub-module;
the content identification module comprises a sequence analysis sub-module, a scenario extraction sub-module and a summary generation sub-module;
the shot selection module comprises a decision support sub-module, a shot evaluation sub-module and an editing planning sub-module;
The video frame processing module comprises a super-resolution sub-module, a style migration sub-module and a quality evaluation sub-module;
the video synthesis module comprises a frame synthesis sub-module, a coding optimization sub-module and a file encapsulation sub-module;
the video abstract generation module comprises a key information extraction sub-module, an abstract editing sub-module and a highlight moment selection sub-module;
the special effect dynamic rendering module comprises a scene analysis sub-module, a special effect matching sub-module and a rendering optimization sub-module.
As a further scheme of the invention, the video stream analysis submodule adopts a deep learning decoding algorithm to carry out deep analysis on the coding format and the frame structure of the video data stream based on the input video file so as to obtain the video stream characteristic data;
the frame extraction submodule adopts a frame-by-frame analysis method to divide the video stream based on the video stream characteristic data, extracts each frame of image and establishes a picture frame sequence;
the format conversion submodule converts an image frame format based on the image frame sequence by adopting an image format standardization technology to generate a standardized frame sequence;
the video stream characteristic data is specifically key parameters including coding information, frame rate and resolution in a video stream, the picture frame sequence comprises continuous unprocessed original image frames, and the standardized frame sequence is specifically converted into JPEG or PNG image format.
As a further scheme of the invention, the image recognition submodule adopts a convolutional neural network to detect and recognize characteristic points of the image based on a standardized frame sequence, and acquires an image characteristic point data set;
the feature coding sub-module is used for coding feature points by adopting a feature vector coding technology based on the image feature point data set, carrying out data compression and retaining key visual information, and generating a coded feature data set;
the semantic analysis sub-module adopts a semantic analysis algorithm to deeply understand the image content based on the encoded feature data set, extracts semantic tags of scenes and objects and establishes an image semantic information data set;
the image characteristic point data set specifically refers to characteristic information comprising edges, angular points and textures in an image, the coded characteristic data set specifically refers to coded low-dimensional characteristic representation, and the image semantic information data set specifically refers to high-level semantic description of objects and scenes in the image.
As a further scheme of the invention, the sequence analysis submodule adopts a long short-term memory network to perform time dependency analysis based on the characteristic data set, performs data preprocessing and generates a time sequence analysis result;
The scenario extraction submodule extracts scenario key elements and analyzes the emotion by adopting an entity identification technology in natural language processing based on a time sequence analysis result to generate a scenario extraction report;
the abstract generation sub-module adopts an extraction type abstract method to simplify information and select core sentences based on a scenario extraction report to generate a content abstract;
the long-term and short-term memory network is specifically a recurrent neural network and is used for capturing long-distance dependency relations in time sequence data, the entity recognition technology comprises named entity recognition and key phrase extraction, and the extraction type abstract method specifically refers to a technology for extracting key sentences or phrases from texts to construct an abstract.
As a further scheme of the invention, the decision support sub-module adopts model-based reinforcement learning to evaluate the shot value and performs decision optimization based on the content abstract to generate a shot selection scheme;
the shot evaluation submodule adopts an image quality evaluation algorithm to evaluate the content quality of each shot based on the shot selection scheme, and performs visual effect analysis to generate a shot quality evaluation report;
The editing planning sub-module adopts a sequence decision process to carry out editing planning and scene flow optimization based on the shot quality evaluation report to generate a shot decision list;
the model-based reinforcement learning specifically refers to a method for learning an optimal strategy by simulating and predicting environmental feedback, the image quality evaluation algorithm specifically refers to visual characteristic evaluation comprising image definition, color saturation and contrast, and the sequence decision process is used for selecting an optimal action sequence according to preset rules and targets.
As a further scheme of the invention, the super-resolution submodule adopts a deep learning convolutional neural network algorithm to perform super-resolution reconstruction based on the shot decision list, extracts a characteristic map of a video frame, performs an up-sampling operation, enhances picture details and generates a super-resolution video frame;
the style migration submodule carries out style migration by adopting transfer learning and a deep convolutional network based on the super-resolution video frame, and adjusts the visual style of the video frame by utilizing a pre-trained stylized model to generate a stylized video frame;
the quality evaluation submodule carries out quality evaluation based on the stylized video frame by adopting image quality evaluation indexes including the SSIM structural similarity index and the PSNR peak signal-to-noise ratio to generate a quality evaluation report;
The deep learning convolutional neural network comprises a feature extraction layer, a nonlinear mapping layer and a reconstruction layer, the transfer learning with a deep convolutional network specifically refers to using model parameters obtained by training on a large amount of labeled data, and the image quality evaluation indexes comprise calculation and analysis of local contrast, brightness and color fidelity.
As a further scheme of the invention, the frame synthesis submodule optimizes the continuity between frames by adopting an optical flow technology and a frame interpolation algorithm based on the high-quality video frames screened in the quality evaluation report to generate a synthesized video stream;
the coding optimization submodule compresses and optimizes the quality of the video stream by adopting an H.265/HEVC coding technology based on the synthesized video stream to generate an optimized video stream;
the file packaging submodule is used for carrying out MP4 or AVI packaging based on the optimized video stream by adopting a multimedia container formatting technology, integrating audio and video data streams and generating an enhanced video;
the optical flow technique is specifically to calculate motion vectors of intermediate frames by analyzing the motion of pixels between adjacent frames, the H.265/HEVC coding technique specifically includes utilizing intra-frame prediction, inter-frame prediction, transformation and quantization techniques to reduce redundant information, and the multimedia container formatting technique specifically refers to encapsulating video and audio data.
As a further scheme of the invention, the key information extraction submodule adopts a long short-term memory network and an attention mechanism to analyze a video frame sequence based on the enhanced video, identifies and extracts key frames and scenes, and establishes a key information data set;
the abstract editing submodule adopts a sequence decision algorithm to optimize information combination based on the key information data set and edits the video abstract draft;
the highlight moment selecting submodule screens highlight moment based on video abstract draft by adopting cluster analysis and user feedback learning to generate video abstract;
the sequence decision algorithm is specifically an algorithm for carrying out current decision according to historical information and is used for processing and generating sequence data, the clustering analysis is specifically to group video frames to identify similar characteristics, and the user feedback learning comprises analysis of user behavior data and optimization of highlight moment selection.
As a further scheme of the invention, the scene analysis submodule performs scene analysis by utilizing a convolutional neural network based on video abstraction, identifies elements and attributes and generates a scene analysis report;
the special effect matching sub-module dynamically matches video special effects by adopting a pattern matching algorithm based on a scene analysis report, and obtains special effect matching data;
The rendering optimization submodule adopts a real-time rendering technology and image synthesis based on the special effect matching data to adjust the special effect to match the video emotion so as to generate a special effect video;
the real-time rendering technique particularly refers to a technique for calculating and generating images on the fly in computer graphics, and the image synthesis comprises merging a plurality of image layers.
Compared with the prior art, the invention has the advantages and positive effects that:
in the invention, the deep learning decoding algorithm obviously improves the accuracy and efficiency of video frame extraction. The convolutional neural network applied in the feature extraction module enables the system to understand the video content more deeply. The content recognition module uses a recurrent neural network to improve the accuracy of time-series analysis, making content classification and summary generation more accurate. The shot selection module adopts reinforcement learning, effectively optimizing the shot selection process and making the video more logical and watchable. The video frame processing module uses a generative adversarial network, which not only improves video quality but also increases the diversity of artistic effects. Video coding technology guarantees the compression efficiency of video files without sacrificing quality. The combination of the video abstract generating module and the special effect dynamic rendering module further strengthens the expressiveness of the video, making the final video more attractive to the audience in both content and form.
Drawings
FIG. 1 is a system flow diagram of the present invention;
FIG. 2 is a schematic diagram of a system framework of the present invention;
FIG. 3 is a flow chart of a video decoding module according to the present invention;
FIG. 4 is a flow chart of a feature extraction module of the present invention;
FIG. 5 is a flow chart of a content identification module of the present invention;
FIG. 6 is a flowchart of a shot selection module according to the present invention;
FIG. 7 is a flow chart of a video frame processing module according to the present invention;
FIG. 8 is a flow chart of a video composition module of the present invention;
FIG. 9 is a flow chart of a video summary generation module of the present invention;
FIG. 10 is a flow chart of the special effect dynamic rendering module of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In the description of the present invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention. Furthermore, in the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Embodiment one: referring to fig. 1, a video production system based on AI technology includes a video decoding module, a feature extraction module, a content identification module, a shot selection module, a video frame processing module, a video synthesis module, a video abstract generation module, and a special effect dynamic rendering module;
the video decoding module analyzes the input video file by adopting a decoding algorithm based on deep learning, extracts continuous picture frames and generates a frame sequence;
the feature extraction module carries out image feature learning by adopting a convolutional neural network based on the video frame sequence and generates a feature data set;
the content recognition module is used for automatically recognizing and classifying video content based on the characteristic data set by adopting a recurrent neural network to analyze time sequence information and generating a content abstract;
the shot selection module selects shots based on the content abstract and by adopting a reinforcement learning strategy to refer to scene continuity and plot development, and generates a shot decision list;
the video frame processing module carries out style migration and super-resolution reconstruction by adopting a generative adversarial network based on the shot decision list, and generates a high-quality video frame;
the video synthesis module re-encodes and synthesizes the processed frames into a new video file by adopting a video encoding technology based on high-quality video frames, and generates an enhanced video;
The video abstract generating module is used for extracting key information in the video based on the enhanced video by adopting a long short-term memory network and an attention mechanism to generate a video abstract;
the special effect dynamic rendering module dynamically adds or adjusts the special effect of the video based on the video abstract by utilizing scene analysis and deep learning technology, and matches the video content and emotion to generate a special effect video;
the frame sequence is a visual picture sequence sequenced in time sequence, the characteristic data set is visual characteristics and semantic labels of a plurality of groups of picture frames, the content abstract comprises time information, main events and characters in fragments, the shot decision list is a shot sequence and time points selected based on plot requirements, and the high-quality video frame is a video frame subjected to super-resolution reconstruction and style migration.
The system realizes automated production, reducing production cost and time and cutting the manual workload. Through intelligent content recognition, shot selection optimization and high-quality video generation, video quality and the viewing experience are improved.
Intelligent content recognition enables the system to recognize and classify video content automatically, making video production more intelligent and accurate. Shot selection optimization improves video continuity and plot development and increases viewing appeal through the guidance of the reinforcement learning strategy and the content summary.
Video summary extraction enables a user to grasp the video content quickly, and the special effect dynamic rendering module enhances the appeal and emotional expression of the video. The system reduces the need for manual editing, lowers cost, improves efficiency and provides a more attractive video production scheme.
Referring to fig. 2, the video decoding module includes a video stream analysis sub-module, a frame extraction sub-module, and a format conversion sub-module;
the feature extraction module comprises an image recognition sub-module, a feature coding sub-module and a semantic analysis sub-module;
the content identification module comprises a sequence analysis sub-module, a scenario extraction sub-module and a summary generation sub-module;
the shot selection module comprises a decision support sub-module, a shot evaluation sub-module and an editing planning sub-module;
the video frame processing module comprises a super-resolution sub-module, a style migration sub-module and a quality evaluation sub-module;
the video synthesis module comprises a frame synthesis sub-module, a coding optimization sub-module and a file encapsulation sub-module;
the video abstract generating module comprises a key information extracting sub-module, an abstract editing sub-module and a highlight moment selecting sub-module;
the special effect dynamic rendering module comprises a scene analysis sub-module, a special effect matching sub-module and a rendering optimization sub-module.
In the video decoding module, a video stream analysis submodule analyzes an input video stream and determines a coding and decoding format and parameters. Next, the frame extraction sub-module parses the video stream frame-by-frame into a sequence of image frames. The format conversion sub-module ensures that all frames have a consistent universal format ready for subsequent processing.
In the feature extraction module, visual features of the image frames are extracted through a convolutional neural network. The feature encoding submodule encodes the features to form a feature data set. The semantic analysis sub-module uses the feature dataset to perform deep semantic analysis to identify objects, scenes, and episodes in the image frame.
In the content recognition module, the feature data set is parsed by using a recurrent neural network to understand the time series development of the video content. The scenario refinement submodule automatically identifies main scenarios and events and simplifies video content. The summary generation sub-module integrates these information to generate a content summary, including points in time, major events, and characters.
And in the shot selection module, the best shot is selected by using a reinforcement learning strategy according to the content abstract, so that video plot consistency is ensured. The shot evaluation submodule evaluates the quality and applicability of the selected shots. The edit planning sub-module plans the arrangement order and time point of the selected shots to create a complete video sequence.
In the video frame processing module, super resolution technology is used to improve image quality. The style migration submodule improves the visual effect of the frame to match the overall style. The quality evaluation submodule evaluates the processed frames and ensures high-quality output.
And in the video synthesis module, the processed frames are synthesized into a new video sequence. The encoding optimization sub-module optimizes file size and quality using video encoding techniques. The file packaging sub-module packages the final video file to prepare for playing or sharing.
And in the video abstract generating module, the generated content abstract is edited, so that a concise and clear summary is provided. The highlight moment selection submodule selects highlight moments in the video to emphasize important content.
And selecting and applying special effects in the special effect dynamic rendering module, and matching video content and emotion. The rendering optimization submodule optimizes the application of special effects and improves visual attractiveness.
Referring to fig. 3, the video stream analysis sub-module uses a deep learning decoding algorithm to perform deep analysis on the coding format and frame structure of the video data stream based on the input video file, so as to obtain video stream characteristic data;
the frame extraction submodule adopts a frame-by-frame analysis method to divide the video stream based on the video stream characteristic data, extracts each frame of image and establishes a picture frame sequence;
The format conversion submodule converts an image frame format based on the image frame sequence by adopting an image format standardization technology to generate a standardized frame sequence;
the video stream characteristic data is specifically key parameters including coding information, frame rate and resolution in the video stream, the picture frame sequence comprises continuous, unprocessed original image frames, and the standardized frame sequence is specifically converted into a JPEG or PNG image format.
In the video stream analysis submodule, an input video file is received, a deep learning decoding algorithm is adopted to analyze a video data stream, and the coding format and the frame structure of the video are deeply analyzed to obtain characteristic data of the video stream. The characteristic data comprise key parameters such as coding information, frame rate and resolution of the video stream. Providing important information about the video content for subsequent processing ensures that subsequent operations are based on accurate data.
And in the frame extraction sub-module, each frame image in the video stream is extracted one by adopting a frame-by-frame analysis method based on the video stream characteristic data, so that a picture frame sequence is established. This sequence of picture frames contains each frame of video, consecutive, unprocessed, representing the time axis of the video. By frame extraction, the original video stream is converted into a sequence of image frames, providing raw data for subsequent image processing and analysis.
In the format conversion sub-module, the sequence of picture frames created by the frame extraction sub-module is accepted and each image frame is converted to a standard image format, such as JPEG or PNG, using image format normalization techniques. This ensures that all frames are in the same format, simplifying the subsequent processing flow, resulting in a consistent format for the frame sequence. Finally, the converted frame sequences form a standardized frame sequence, and a unified data source is provided for each step of subsequent processing and video production.
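As a concrete illustration of the frame extraction and format conversion sub-modules, the following minimal sketch uses OpenCV's standard decoder in place of the deep-learning-based stream analysis described above; the output file pattern and the JPEG target format are assumptions.

import cv2

def extract_frames(video_path, out_pattern="frame_{:06d}.jpg"):
    cap = cv2.VideoCapture(video_path)
    # Video stream characteristic data: frame rate and resolution
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    print(f"fps={fps}, resolution={width}x{height}")
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()  # frame-by-frame analysis
        if not ok:
            break
        cv2.imwrite(out_pattern.format(index), frame)  # standardized JPEG frame
        frames.append(frame)
        index += 1
    cap.release()
    return frames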
Referring to fig. 4, the image recognition sub-module performs feature point detection and recognition on an image by using a convolutional neural network based on a standardized frame sequence to obtain an image feature point data set;
the feature coding sub-module is used for coding feature points by adopting a feature vector coding technology based on the image feature point data set, carrying out data compression and retaining key visual information, and generating a coded feature data set;
the semantic analysis sub-module is used for carrying out deep understanding on the image content by adopting a semantic analysis algorithm based on the encoded feature data set, extracting semantic tags of scenes and objects and establishing an image semantic information data set;
the image characteristic point data set specifically refers to characteristic information comprising edges, angular points and textures in an image, the coded characteristic data set specifically refers to coded low-dimensional characteristic representation, and the image semantic information data set specifically refers to high-level semantic description of objects and scenes in the image.
In the image recognition sub-module, the standardized frame sequence is received from the format conversion sub-module, with all frames in the same image format. Feature extraction is performed on each frame using a convolutional neural network. This relies on a deep learning framework (e.g., TensorFlow, PyTorch) and a pre-trained convolutional neural network model such as VGG, ResNet, or Inception. The following is an example code segment:
import tensorflow as tf
from tensorflow.keras.applications import VGG16
# Load a pre-trained VGG16 model without the classification head
model = VGG16(weights='imagenet', include_top=False)
# Extract features; frame is assumed to be a preprocessed image batch, e.g. of shape (1, 224, 224, 3)
features = model.predict(frame)
and extracting characteristic points from each frame in the image characteristic point data set, wherein the characteristic points comprise characteristic information such as edges, corner points, textures and the like. Feature points are detected using feature detection algorithms such as SIFT (scale invariant feature transform) or ORB (Oriented FAST and Rotated BRIEF).
And in the feature coding submodule, the feature point data set is acquired from the image recognition submodule, and each feature point is encoded with a feature vector coding technique. This involves converting the pixel values around each feature point into a low-dimensional feature vector, for example by using PCA (principal component analysis) or LDA (linear discriminant analysis) to reduce the dimensionality.
Encoded feature data set: the encoded feature data set contains a low-dimensional representation of each feature point, which facilitates data compression and preserves critical visual information. Example code segment:
from sklearn.decomposition import PCA
# Reduce dimensionality with PCA; feature_points is the (n_points, n_dims) descriptor array
# produced by the image recognition sub-module
pca = PCA(n_components=50)
encoded_features = pca.fit_transform(feature_points)
In the semantic analysis sub-module, the encoded feature data set from the feature encoding sub-module is received. The semantic analysis algorithm uses a deep learning method, such as a convolutional neural network or a recurrent neural network, to perform deep image content understanding on the encoded feature data. This includes tasks such as object detection, scene classification, and semantic segmentation. The following is an example code segment:
import tensorflow as tf
# Load a pre-trained image classification model
model = tf.keras.applications.InceptionV3(weights='imagenet')
# Predict on the encoded feature data; the input must first be decoded/reshaped into the
# 299 x 299 x 3 image tensor format that InceptionV3 expects
predictions = model.predict(encoded_features)
Finally, an image semantic information data set comprising object identifications, scene labels and high-level semantic descriptions is established according to the output of the deep learning model. This information is used for the subsequent content summary and special effect dynamic rendering.
referring to fig. 5, the sequence analysis sub-module performs time dependency analysis by using a long-short-term memory network based on the feature data set, and performs data preprocessing to generate a time sequence analysis result;
based on the time sequence analysis result, the scenario extraction submodule adopts an entity identification technology in natural language processing to extract scenario key elements and analyzes the emotion so as to generate a scenario extraction report;
The abstract generation sub-module adopts an extraction type abstract method to simplify information and select core sentences based on the scenario extraction report to generate a content abstract;
the long-term and short-term memory network is specifically a recurrent neural network and is used for capturing long-distance dependency relations in time sequence data, the entity recognition technology comprises named entity recognition and key phrase extraction, and the extraction type abstract method specifically refers to a technology for extracting key sentences or phrases from texts to construct an abstract.
In the sequence analysis sub-module, the feature data set is taken as input, and an LSTM model is built and trained to process the time-series data. The input data undergoes preprocessing, including normalization and padding, to ensure data consistency and quality. The sub-module generates a time-series analysis result, which may be a prediction over time, a detected pattern, or another result related to time dependency.
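The following minimal Keras sketch illustrates such an LSTM sequence model; the input shape (timesteps, feature dimension), layer sizes and the regression output are assumptions rather than values prescribed by the invention.

import tensorflow as tf

timesteps, feature_dim = 30, 128  # assumed shape of the per-frame feature vectors
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(timesteps, feature_dim)),
    tf.keras.layers.LSTM(64),              # captures time dependency across frames
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1)               # e.g., a per-segment importance score
])
model.compile(optimizer="adam", loss="mse")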
In the scenario refinement sub-module, text in the time series analysis result is analyzed using entity recognition techniques in natural language processing, such as named entity recognition and key phrase extraction, based on the results of the sequence analysis sub-module. Emotion analysis is performed to determine the emotion polarity (positive, negative or neutral) of the time series. Integrating the results of entity recognition and emotion analysis to generate a scenario refinement report including key elements and emotion information.
In the summary generation sub-module, the scenario refinement report, including key elements and emotion information, is received. An extractive summarization method is adopted to extract key sentences or phrases from the scenario refinement report and construct the content summary. This involves selecting the most relevant sentences using a text summarization algorithm such as TextRank or LexRank. Selecting the key sentences that build the summary typically relies on sentence importance scores calculated by the extractive algorithm. Finally, a content summary is generated, providing a condensed generalization of the original information.
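As a simple illustration of extractive summarization, the sketch below scores sentences by their average TF-IDF weight and keeps the top-ranked ones; it is a lightweight stand-in for graph-based methods such as TextRank or LexRank.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_summary(sentences, top_k=3):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()   # importance score per sentence
    best = np.argsort(scores)[::-1][:top_k]
    return [sentences[i] for i in sorted(best)]       # keep original sentence order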
Referring to fig. 6, the decision support sub-module performs a shot value evaluation and a decision optimization based on the content abstract by using reinforcement learning based on a model to generate a shot selection scheme;
the shot evaluation sub-module adopts an image quality evaluation algorithm to evaluate the content quality of each shot based on the shot selection scheme, and performs visual effect analysis to generate a shot quality evaluation report;
the editing planning sub-module adopts a sequence decision process to carry out editing planning and scene flow optimization based on the shot quality evaluation report to generate a shot decision list;
model-based reinforcement learning specifically refers to a method of learning an optimal strategy by simulating and predicting environmental feedback, an image quality evaluation algorithm specifically refers to visual characteristic evaluation including image definition, color saturation and contrast, and a sequence decision process is used for selecting an optimal action sequence according to a preset rule and a target.
In the decision support sub-module, a content digest is received as input. And (3) carrying out value evaluation on each potential lens by using a model-based reinforcement learning method, and determining the contribution degree in the whole narrative. And a decision optimization algorithm, such as reinforcement learning, is adopted to comprehensively consider the factors of the value, the narrative continuity and the like of different shots, so as to generate an optimal shot selection scheme, which is a group of carefully planned shot sequences.
In the shot evaluation sub-module, an image quality evaluation algorithm is used to evaluate visual characteristics such as image sharpness, color saturation and contrast of each shot. In addition to image quality, visual effect analysis is performed, taking into account transitions between shots and emotional expression. This information is combined to generate a shot quality assessment report that provides a detailed evaluation of each shot, covering both visual quality and narrative effect.
In the edit planning sub-module, an optimal edit action sequence is determined based on predetermined rules, targets and previous evaluation information using a sequence decision process. This includes operations such as reordering shots, adding or deleting clips, etc., creating the most attractive scene flow. The submodule generates a detailed shot decision list to guide the actual video editing process.
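The following purely illustrative sketch (not the patented decision process) shows how per-shot value and quality scores could be turned into an ordered shot decision list with a simple continuity penalty; the field names and weights are assumptions.

def plan_shots(shots, max_shots=10, continuity_weight=0.3):
    # shots: list of dicts with "id", "start", "scene", "value", "quality"
    ranked = sorted(shots, key=lambda s: s["value"] + s["quality"], reverse=True)
    selected, last_scene = [], None
    for shot in ranked:
        score = shot["value"] + shot["quality"]
        if last_scene is not None and shot["scene"] != last_scene:
            score -= continuity_weight        # discourage abrupt scene jumps
        if score > 0 and len(selected) < max_shots:
            selected.append(shot)
            last_scene = shot["scene"]
    return sorted(selected, key=lambda s: s["start"])  # order by narrative time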
Referring to fig. 7, the super-resolution submodule performs super-resolution reconstruction by adopting a deep learning convolutional neural network algorithm based on the shot decision list, and generates a super-resolution video frame by extracting a feature map of the video frame, performing an up-sampling operation, and enhancing picture details;
the style migration submodule carries out style migration by adopting transfer learning and a deep convolutional network based on the super-resolution video frame, and adjusts the visual style of the video frame by utilizing a pre-trained stylized model to generate a stylized video frame;
the quality evaluation submodule carries out quality evaluation based on the stylized video frame by adopting image quality evaluation indexes including SSIM structure similarity indexes and PSNR peak signal to noise ratio to generate a quality evaluation report;
the deep learning convolutional neural network comprises a feature extraction layer, a nonlinear mapping layer and a reconstruction layer, the transfer learning with a deep convolutional network specifically refers to using model parameters obtained by training on a large amount of labeled data, and the image quality evaluation indexes comprise calculation and analysis of local contrast, brightness and color fidelity.
And in the super-resolution sub-module, receiving the video frames selected from the shot decision list, and performing super-resolution reconstruction by using a deep learning convolutional neural network. The feature extraction layer is used for capturing features of the image, the nonlinear mapping layer is used for further processing the features, the reconstruction layer is used for performing up-sampling operation, and definition and detail of the image are enhanced. The output is high resolution video frames that will be used for the next processing.
In the style migration submodule, transfer learning and deep convolutional networks are used, together with pre-trained stylized models. Super-resolution video frames are taken as input, and the transfer learning and deep convolutional networks are used to apply the desired visual style to the video frames. This process includes adjustment of the visual style to ensure that the generated video frames conform to a predetermined style. The output result is a stylized video frame.
In the quality evaluation submodule, image quality evaluation indexes such as Structural Similarity Index (SSIM) and peak signal to noise ratio (PSNR), local contrast, brightness, color fidelity and other indexes are used for comprehensively evaluating the quality of the video frame. This assessment process generates a quality assessment report providing detailed quality information ensuring that the generated video frames meet the quality requirements.
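A minimal sketch of computing SSIM and PSNR with scikit-image is given below; it assumes a recent scikit-image release (channel_axis argument) and RGB frames of identical size and dtype.

from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_frame(reference, stylized):
    ssim = structural_similarity(reference, stylized, channel_axis=-1)
    psnr = peak_signal_noise_ratio(reference, stylized)
    return {"SSIM": ssim, "PSNR": psnr}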
Referring to fig. 8, the frame synthesis submodule optimizes the continuity between frames based on the high-quality video frames screened in the quality evaluation report by adopting an optical flow technology and a frame interpolation algorithm to generate a synthesized video stream;
the coding optimization submodule compresses and optimizes the quality of the video stream by adopting an H.265/HEVC coding technology based on the synthesized video stream to generate an optimized video stream;
The file packaging submodule adopts a multimedia container formatting technology to package MP4 or AVI based on the optimized video stream, integrates the audio and video data stream and generates an enhanced video;
the optical flow technique specifically calculates motion vectors of intermediate frames by analyzing the motion of pixels between adjacent frames, the h.265/HEVC coding technique specifically includes utilizing intra-frame prediction, inter-frame prediction, transformation, and quantization techniques to reduce redundant information, and the multimedia container formatting technique specifically refers to encapsulating video and audio data.
In the frame synthesis sub-module, high-quality video frames are screened from the quality evaluation report, then an optical flow technology is applied, and the motion vector of an intermediate frame is calculated by analyzing the pixel motion of adjacent frames, so that the continuity between frames is improved. The situation of insufficient frame rate is filled by adopting a frame interpolation algorithm, so that the video stream is ensured not to be blocked when being played. Finally, the frame composition sub-module generates a composite video stream containing high quality frames that have undergone optical flow analysis and frame interpolation processing, providing a smoother viewing experience.
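A minimal sketch of the dense optical flow step is given below, assuming OpenCV; the per-pixel motion vectors and mean motion magnitude it returns could then drive frame interpolation.

import cv2

def motion_between(frame_a, frame_b):
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    # Farneback dense optical flow between two adjacent frames
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return flow, float(magnitude.mean())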
In the coding optimization sub-module, the composite video stream is received as input. The H.265/HEVC coding technology is adopted, and the method comprises the steps of intra-frame prediction, inter-frame prediction, transformation, quantization and the like, so that the file size is reduced, and the video quality is improved. This process helps reduce storage and transmission costs and provides higher quality video content. The coding optimization submodule generates an optimized video stream subjected to coding processing.
And in the file packaging submodule, the optimized video stream is accepted as input. The video data is then combined with possible audio data (if any) using multimedia container formatting techniques and the appropriate multimedia container format is selected, such as MP4 or AVI. The file encapsulation sub-module generates an enhanced video file, including the composite video stream and audio data (if applicable), which is conveniently stored, transmitted and played.
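The encoding and packaging steps can be illustrated with the ffmpeg command-line tool, assuming it is installed with libx265 support; the file names and CRF value are illustrative.

import subprocess

def encode_and_package(raw_video, audio_track, output_mp4="enhanced.mp4"):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", raw_video,       # synthesized video stream
        "-i", audio_track,     # audio data, if available
        "-c:v", "libx265",     # H.265/HEVC encoding
        "-crf", "23",          # quality/size trade-off
        "-c:a", "aac",
        output_mp4             # MP4 container packaging
    ], check=True)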
Referring to fig. 9, the key information extraction sub-module analyzes the video frame sequence based on the enhanced video by using a long-short-term memory network and an attention mechanism, identifies and extracts key frames and scenes, and establishes a key information data set;
the abstract editing submodule adopts a sequence decision algorithm to optimize information combination based on the key information data set and edits the video abstract draft;
the highlight moment selecting submodule screens highlight moment based on video abstract draft by adopting cluster analysis and user feedback learning to generate video abstract;
the sequence decision algorithm is specifically an algorithm for carrying out current decision according to historical information and is used for processing and generating sequence data, the clustering analysis is specifically to identify similar features by grouping video frames, and the user feedback learning comprises analysis of user behavior data and optimization of highlight moment selection.
In the key information extraction sub-module, the enhanced video is received as input, and a long short term memory network (LSTM) and an attention mechanism are adopted to analyze the video frame sequence. LSTM helps capture the temporal correlation between frames, while the attention mechanism can identify the importance of the frames. By these techniques, the key information extraction submodule identifies key frames and scenes, such as climax moments, emotional climax, and key episodes. These key frames and scenes are consolidated into one key information dataset for subsequent summary editing.
And in the abstract editing sub-module, a sequence decision algorithm is adopted to optimize the combination of information according to the historical information and the editing target defined by the user. This includes determining the ordering and selection of key frames and deciding which scenes to include in the final video summary. In the process of editing the video abstract, the time sequence and the continuity are maintained, so that the generated abstract draft can be ensured to convey the main content and emotion of the video.
In the highlight moment selection submodule, a clustering analysis technology is utilized to group video frames and identify similar characteristics, and candidates possibly becoming highlight moments are screened out. User feedback learning is also integrated to further optimize the choice of highlight moments. The user feedback comprises feedback of the user on the generated abstract, such as click rate, play quantity and the like, and the interests and behaviors of the user are known. And combining cluster analysis and user feedback, the highlight moment selection submodule generates a final video abstract, including the most attractive part, for users to watch or share.
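A minimal sketch of the clustering step for highlight candidates is given below, assuming scikit-learn and a NumPy array of per-frame feature vectors; the cluster count is an assumption.

import numpy as np
from sklearn.cluster import KMeans

def highlight_candidates(frame_features, timestamps, n_clusters=5):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(frame_features)
    candidates = []
    for c in range(n_clusters):
        members = np.where(kmeans.labels_ == c)[0]
        # pick the frame closest to each cluster centre as its representative
        dists = np.linalg.norm(frame_features[members] - kmeans.cluster_centers_[c], axis=1)
        candidates.append(timestamps[members[np.argmin(dists)]])
    return sorted(candidates)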
Referring to fig. 10, the scene analysis sub-module performs scene analysis by using a convolutional neural network based on the video abstraction, identifies elements and attributes, and generates a scene analysis report;
the special effect matching sub-module dynamically matches video special effects by adopting a pattern matching algorithm based on the scene analysis report, and obtains special effect matching data;
the rendering optimization submodule adopts a real-time rendering technology and image synthesis based on the special effect matching data to adjust the special effect to match the video emotion so as to generate a special effect video;
real-time rendering techniques particularly refer to techniques in computer graphics that compute and generate images on the fly, and image synthesis involves merging multiple image layers.
In the scene analysis sub-module, a video abstract is received as input, a Convolutional Neural Network (CNN) is utilized to process video frames, and key information such as objects, emotions, colors and positions are extracted. And generating a scene analysis report according to the output of the CNN, wherein the scene analysis report comprises a detailed description of video content, and providing a basis for subsequent special effect matching and rendering.
And in the special effect matching sub-module, a scene analysis report is received, and a mode matching algorithm is adopted to dynamically select a proper video special effect according to elements and attribute information in the report. The matching algorithm will pick the special effects matching the current scene from the special effects library. Once the special effects are selected, the special effect matching sub-module can acquire special effect matching data, including parameters and settings of the selected special effects, for optimization use in subsequent rendering.
In the rendering optimization sub-module, real-time rendering technology and image synthesis are utilized to allow real-time adjustment of special effects so as to ensure that the special effects are consistent with video content and emotion. The real-time rendering technique can adjust special effects as needed to adapt to changes in video content. The rendering optimization submodule generates a video with special effects and ensures seamless fusion of the special effects and video contents.
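As a minimal illustration of image synthesis for effect rendering, the sketch below alpha-blends a rendered effect layer over a video frame with OpenCV; the blending weight is an assumption, and both images must share the same size and type.

import cv2

def composite_effect(frame, effect_layer, alpha=0.4):
    # merge the effect layer with the frame; alpha controls effect intensity
    return cv2.addWeighted(effect_layer, alpha, frame, 1.0 - alpha, 0)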
The present invention is not limited to the above embodiments. Equivalent embodiments obtained by changing or modifying the technical content disclosed above may be applied to other fields; however, any simple modification, equivalent change or variation of the above embodiments made according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (10)

1. A video production system based on AI technology, characterized in that: the video production system based on AI technology comprises a video decoding module, a feature extraction module, a content identification module, a shot selection module, a video frame processing module, a video synthesis module, a video abstract generation module and a special effect dynamic rendering module;
The video decoding module analyzes the input video file by adopting a decoding algorithm based on deep learning, extracts continuous picture frames and generates a frame sequence;
the feature extraction module carries out image feature learning by adopting a convolutional neural network based on a video frame sequence and generates a feature data set;
the content identification module, based on the feature data set, adopts a recurrent neural network to analyze time sequence information, automatically identifies and classifies the video content, and generates a content abstract;
the shot selection module, based on the content abstract, selects shots by adopting a reinforcement learning strategy with reference to scene continuity and plot development, and generates a shot decision list;
the video frame processing module performs style migration and super-resolution reconstruction by adopting a generative adversarial network based on the shot decision list to generate high-quality video frames;
the video synthesis module re-encodes and synthesizes the processed frames into a new video file by adopting a video encoding technology based on high-quality video frames to generate an enhanced video;
the video abstract generating module, based on the enhanced video, extracts key information in the video by adopting a long short-term memory network and an attention mechanism, and generates a video abstract;
The special effect dynamic rendering module dynamically adds or adjusts the special effect of the video based on the video abstract by utilizing scene analysis and deep learning technology, and matches the video content and emotion to generate a special effect video;
the frame sequence is a sequence of visual picture frames ordered in time, the feature data set comprises visual features and semantic labels of a plurality of groups of picture frames, the content abstract comprises time information, main events and characters in a segment, the shot decision list is a sequence of shots and time points selected based on plot requirements, and the high-quality video frame is a video frame subjected to super-resolution reconstruction and style migration.
2. The AI-technology-based video production system of claim 1, wherein: the video decoding module comprises a video stream analysis sub-module, a frame extraction sub-module and a format conversion sub-module;
the feature extraction module comprises an image recognition sub-module, a feature encoding sub-module and a semantic analysis sub-module;
the content identification module comprises a sequence analysis sub-module, a scenario extraction sub-module and a summary generation sub-module;
the shot selection module comprises a decision support sub-module, a shot evaluation sub-module and an editing planning sub-module;
The video frame processing module comprises a super-resolution sub-module, a style migration sub-module and a quality evaluation sub-module;
the video synthesis module comprises a frame synthesis sub-module, a coding optimization sub-module and a file encapsulation sub-module;
the video abstract generation module comprises a key information extraction sub-module, an abstract editing sub-module and a highlight moment selection sub-module;
the special effect dynamic rendering module comprises a scene analysis sub-module, a special effect matching sub-module and a rendering optimization sub-module.
3. The AI-technology-based video production system of claim 2, wherein: the video stream analysis submodule adopts a deep learning decoding algorithm to carry out deep analysis on the coding format and the frame structure of the video data stream based on the input video file, so as to obtain the characteristic data of the video stream;
the frame extraction submodule adopts a frame-by-frame analysis method to divide the video stream based on the video stream characteristic data, extracts each frame of image and establishes a picture frame sequence;
the format conversion submodule converts an image frame format based on the image frame sequence by adopting an image format standardization technology to generate a standardized frame sequence;
the video stream characteristic data specifically refers to key parameters of the video stream including coding information, frame rate and resolution, the picture frame sequence comprises continuous unprocessed original image frames, and the standardized frame sequence specifically refers to image frames converted into the JPEG or PNG format.
4. The AI-technology-based video production system of claim 2, wherein: the image recognition submodule is used for carrying out feature point detection and recognition on the image by adopting a convolutional neural network based on a standardized frame sequence to obtain an image feature point data set;
the feature coding sub-module is used for coding feature points by adopting a feature vector coding technology based on the image feature point data set, carrying out data compression and retaining key visual information, and generating a coded feature data set;
the semantic analysis sub-module adopts a semantic analysis algorithm to deeply understand the image content based on the encoded feature data set, extracts semantic tags of scenes and objects and establishes an image semantic information data set;
the image characteristic point data set specifically refers to characteristic information comprising edges, angular points and textures in an image, the coded characteristic data set specifically refers to coded low-dimensional characteristic representation, and the image semantic information data set specifically refers to high-level semantic description of objects and scenes in the image.
5. The AI-technology-based video production system of claim 2, wherein: the sequence analysis submodule performs time dependency analysis by adopting a long short-term memory network based on the feature data set, performs data preprocessing and generates a time sequence analysis result;
The scenario extraction submodule extracts scenario key elements and analyzes the emotion by adopting an entity identification technology in natural language processing based on a time sequence analysis result to generate a scenario extraction report;
the abstract generation sub-module, based on the scenario extraction report, adopts an extractive summarization method to simplify information and select core sentences, generating a content abstract;
the long short-term memory network is specifically a type of recurrent neural network used for capturing long-distance dependency relations in time sequence data, the entity recognition technology comprises named entity recognition and key phrase extraction, and the extractive summarization method specifically refers to a technology for extracting key sentences or phrases from texts to construct a summary.
6. The AI-technology-based video production system of claim 2, wherein: the decision support sub-module performs shot value assessment and decision optimization by adopting model-based reinforcement learning based on the content abstract, and generates a shot selection scheme;
the shot evaluation submodule adopts an image quality evaluation algorithm to evaluate the content quality of the shots based on the shot selection scheme, and performs visual effect analysis to generate a shot quality evaluation report;
The editing planning sub-module adopts a sequence decision process to carry out editing planning and scene flow optimization based on the shot quality evaluation report to generate a shot decision list;
the model-based reinforcement learning specifically refers to a method for learning an optimal strategy by simulating and predicting environmental feedback, the image quality evaluation algorithm specifically refers to visual characteristic evaluation comprising image definition, color saturation and contrast, and the sequence decision process is used for selecting an optimal action sequence according to preset rules and targets.
7. The AI-technology-based video production system of claim 2, wherein: the super-resolution submodule carries out super-resolution reconstruction by adopting a deep learning convolutional neural network algorithm based on the shot decision list, and generates a super-resolution video frame by extracting a characteristic map of the video frame, carrying out up-sampling operation and enhancing picture details;
the style migration submodule carries out style migration by adopting migration learning and a depth convolution network based on the super-resolution video frame, and adjusts the visual style of the video frame by utilizing a pre-trained stylized model to generate a stylized video frame;
the quality evaluation submodule carries out quality evaluation based on the stylized video frame by adopting image quality evaluation indexes including the SSIM structural similarity index and the PSNR peak signal-to-noise ratio, and generates a quality evaluation report;
The deep learning convolutional neural network comprises a feature extraction layer, a nonlinear mapping layer and a reconstruction layer, the migration learning and deep convolutional network specifically refers to model parameters obtained by training a large amount of marked data, and the image quality evaluation indexes comprise calculation and analysis of local contrast, brightness and color fidelity.
8. The AI-technology-based video production system of claim 2, wherein: the frame synthesis submodule optimizes the continuity between frames by adopting an optical flow technology and a frame interpolation algorithm based on the high-quality video frames screened in the quality evaluation report to generate a synthesized video stream;
the coding optimization submodule compresses and optimizes the quality of the video stream by adopting an H.265/HEVC coding technology based on the synthesized video stream to generate an optimized video stream;
the file packaging submodule is used for carrying out MP4 or AVI packaging based on the optimized video stream by adopting a multimedia container formatting technology, integrating audio and video data streams and generating an enhanced video;
the optical flow technique is specifically to calculate motion vectors of intermediate frames by analyzing the motion of pixels between adjacent frames, the H.265/HEVC coding technique specifically includes utilizing intra-frame prediction, inter-frame prediction, transformation and quantization techniques to reduce redundant information, and the multimedia container formatting technique specifically refers to encapsulating video and audio data.
9. The AI-technology-based video production system of claim 2, wherein: the key information extraction submodule analyzes the video frame sequence based on the enhanced video by adopting a long short-term memory network and an attention mechanism, identifies and extracts key frames and scenes, and establishes a key information data set;
the abstract editing submodule adopts a sequence decision algorithm to optimize information combination based on the key information data set and edits the video abstract draft;
the highlight moment selecting submodule screens highlight moments based on the video abstract draft by adopting cluster analysis and user feedback learning to generate the video abstract;
the sequence decision algorithm is specifically an algorithm that makes the current decision according to historical information and is used for processing and generating sequence data, the cluster analysis specifically groups video frames to identify similar characteristics, and the user feedback learning comprises analysis of user behavior data and optimization of highlight moment selection.
10. The AI-technology-based video production system of claim 2, wherein: the scene analysis submodule performs scene analysis by utilizing a convolutional neural network based on the video abstract, identifies elements and attributes and generates a scene analysis report;
The special effect matching sub-module dynamically matches video special effects by adopting a pattern matching algorithm based on a scene analysis report, and obtains special effect matching data;
the rendering optimization submodule adopts a real-time rendering technology and image synthesis based on the special effect matching data to adjust the special effect to match the video emotion so as to generate a special effect video;
the real-time rendering technique particularly refers to a technique for calculating and generating images on the fly in computer graphics, and the image synthesis comprises merging a plurality of image layers.
CN202311671371.8A 2023-12-07 2023-12-07 Video production system based on AI technology Active CN117376502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311671371.8A CN117376502B (en) 2023-12-07 2023-12-07 Video production system based on AI technology

Publications (2)

Publication Number Publication Date
CN117376502A true CN117376502A (en) 2024-01-09
CN117376502B CN117376502B (en) 2024-02-13

Family

ID=89394859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311671371.8A Active CN117376502B (en) 2023-12-07 2023-12-07 Video production system based on AI technology

Country Status (1)

Country Link
CN (1) CN117376502B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120123780A1 (en) * 2010-11-15 2012-05-17 Futurewei Technologies, Inc. Method and system for video summarization
CN106612468A (en) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 A video abstract automatic generation system and method
CN111460979A (en) * 2020-03-30 2020-07-28 上海大学 Key lens video abstraction method based on multi-layer space-time frame
CN114020964A (en) * 2021-11-15 2022-02-08 上海大学 Method for realizing video abstraction by using memory network and gated cyclic unit
CN114189754A (en) * 2021-12-08 2022-03-15 湖南快乐阳光互动娱乐传媒有限公司 Video plot segmentation method and system
CN114647758A (en) * 2022-02-21 2022-06-21 武光利 Video abstract generation network based on Transformer and deep reinforcement learning
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism
CN115525782A (en) * 2022-09-29 2022-12-27 武光利 Video abstract generation method of self-adaptive graph structure
CN115695950A (en) * 2023-01-04 2023-02-03 石家庄铁道大学 Video abstract generation method based on content perception
CN115731498A (en) * 2022-12-01 2023-03-03 石家庄铁道大学 Video abstract generation method combining reinforcement learning and contrast learning
WO2023050295A1 (en) * 2021-09-30 2023-04-06 中远海运科技股份有限公司 Multimodal heterogeneous feature fusion-based compact video event description method
CN116847123A (en) * 2023-08-01 2023-10-03 南拳互娱(武汉)文化传媒有限公司 Video later editing and video synthesis optimization method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576267A (en) * 2024-01-16 2024-02-20 广州光点信息科技股份有限公司 Digital person generation method based on LLM and ANN and application of digital person generation method in cloud video
CN117576267B (en) * 2024-01-16 2024-04-12 广州光点信息科技股份有限公司 Digital person generation method based on LLM and ANN and application of digital person generation method in cloud video
CN117596433A (en) * 2024-01-19 2024-02-23 自然语义(青岛)科技有限公司 International Chinese teaching audiovisual courseware editing system based on time axis fine adjustment
CN117596433B (en) * 2024-01-19 2024-04-05 自然语义(青岛)科技有限公司 International Chinese teaching audiovisual courseware editing system based on time axis fine adjustment

Also Published As

Publication number Publication date
CN117376502B (en) 2024-02-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant