CN114218437A - Adaptive picture clipping and fusing method, system, computer device and medium - Google Patents

Adaptive picture clipping and fusing method, system, computer device and medium

Info

Publication number
CN114218437A
CN114218437A
Authority
CN
China
Prior art keywords
content
label
audio
video
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111564539.6A
Other languages
Chinese (zh)
Inventor
肖冠正
郝德禄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202111564539.6A priority Critical patent/CN114218437A/en
Publication of CN114218437A publication Critical patent/CN114218437A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides an adaptive picture clipping and fusing method, system, computer device and medium. The method comprises the following steps: acquiring a video file, and extracting text content, audio content and video key frames from the video file; performing natural language analysis on the text content, and generating semantic labels from the analysis result; performing audio matching on the audio content, and generating audio labels from the matching result and an audio knowledge graph; performing content prediction on the video key frames to generate content labels; and clipping the video file according to the semantic labels, the audio labels and the content labels, and fusing the clipped materials to obtain a target video.

Description

Adaptive picture clipping and fusing method, system, computer device and medium
Technical Field
The invention relates to the technical field of video processing, and in particular to an adaptive picture clipping and fusing method, system, computer device and storage medium.
Background
Currently, related technologies mainly label long videos (longer than 60 seconds) with classification tags, and labeling is generally completed by analyzing the video content, mainly in the following two scenarios: first, manual editing, in which an editor reviews the entire video content and assigns classification labels to the video based on subjective judgment and understanding; second, AI (artificial intelligence) recognition, in which faces, scenes and objects appearing in the video frames are recognized and labels of the corresponding classes, such as celebrities, food or libraries, are extracted.
However, manual labeling is labor-intensive and requires editors with strong aesthetic judgment and patience, so it tends to be inefficient and slow, the label quality is subjective, and the coverage of video frames is low. Some related technologies instead use AI recognition for labeling, but that approach cannot handle application scenarios with too many interfering pictures, which limits its range of application. In addition, the AI recognition and labeling methods in the related art can identify only a limited set of target subjects, so the resulting labels suffer from insufficient effectiveness.
Disclosure of Invention
In view of the above, in order to at least partially solve one of the above technical problems, embodiments of the present invention provide an adaptive picture clipping and fusing method with wide applicability and comprehensive coverage of target subject identification; the technical solution of the present application also provides a system, a computer device and a computer-readable storage medium that can correspondingly implement the method.
On one hand, the technical scheme of the application provides a self-adaptive picture cutting and fusing method, which comprises the following steps:
acquiring a video file, and extracting text content, audio content and video key frames from the video file;
performing natural language analysis on the text content, and generating a semantic label according to an analysis result;
performing audio matching according to the audio content, and generating an audio label according to a matching result and an audio knowledge graph;
predicting the content according to the video key frame to generate a content label;
and clipping the video file according to the semantic label, the audio label and the content label, and fusing the clipped materials to obtain a target video.
In a possible embodiment of the present disclosure, the clipping the video file according to the semantic tag, the audio tag, and the content tag includes:
determining a tag weight value of each tag in a tag set, wherein the tag set comprises the semantic tag, the audio tag and the content tag;
and generating a label sequence according to the label weight value, and labeling the label information to the video file according to the label sequence and the confidence degree of each label in the sequence.
In a possible embodiment of the present disclosure, the text content includes description text and subtitle text; the step of performing natural language analysis on the text content and generating a semantic label according to an analysis result includes:
extracting the description text from the video file, and formatting the description text to obtain first formatting information;
extracting the subtitle text from the video file;
performing natural language processing on the first formatting information and the subtitle text to obtain a key entity matrix;
and inputting the key entity matrix into a semantic prediction model, and determining the semantic label according to a model prediction result.
In a possible embodiment of the present disclosure, after the step of performing natural language analysis on the text content and generating a semantic tag according to an analysis result, the method further includes:
acquiring structural information of the description text, and matching the structural information with a semantic knowledge graph to obtain a derivative label;
and cutting the video file according to the derived label, the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video.
In a possible embodiment of the present disclosure, the step of performing audio matching according to the audio content and generating an audio tag according to a matching result and an audio knowledge graph includes:
converting the audio content to obtain text information, and adding the text information into the text content;
extracting an audio fingerprint according to the audio content;
matching in a fingerprint database according to the audio fingerprints to determine candidate audio;
and inputting the candidate audio into the audio knowledge graph to be matched to obtain the audio label.
In a possible embodiment of the present disclosure, the step of generating a content tag according to content prediction performed by the video key frame includes:
slicing the video file to obtain a plurality of video frame files;
graying the video frame file to obtain a gray picture, and calculating according to the intra-class difference and the inter-class difference of the gray picture to obtain a content feature matrix;
reducing the dimension of the content feature matrix, and determining to obtain a key frame according to the difference between the content feature matrices after dimension reduction;
and inputting the key frame into a video content prediction model, and determining the content label according to a model prediction result.
In a possible embodiment of the present application, the step of performing dimension reduction on the content feature matrix, and determining to obtain a key frame according to a difference between the content feature matrices after the dimension reduction includes:
constructing a first feature class set according to the content feature matrix, and performing iteration to generate a plurality of second feature class sets;
extracting a first picture in a video frame file cluster corresponding to the second feature class set as the key frame;
the iterative process comprises:
calculating the distance between each category in the first feature class set and generating a distance matrix;
and extracting the minimum element in the distance matrix to construct a second feature class set.
In another aspect, the technical solution of the present application further provides an adaptive picture clipping and fusing system, comprising:
the system comprises a preprocessing module, a video processing module and a video processing module, wherein the preprocessing module is used for acquiring a video file and extracting text content, audio content and video key frames from the video file;
the semantic analysis module is used for carrying out natural language analysis on the text content and generating a semantic label according to an analysis result;
the audio analysis module is used for carrying out audio matching according to the audio content and generating an audio label according to a matching result and an audio knowledge graph;
the content analysis module is used for predicting the content according to the video key frame to generate a content label;
and the decision analysis module is used for cutting the video file according to the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video.
On the other hand, the technical solution of the present invention further provides a computer device for adaptive image cropping and fusion, comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to execute the adaptive picture cropping fusion method of the first aspect.
In another aspect, the present invention further provides a storage medium, in which a processor-executable program is stored, and the processor-executable program is used to execute the method in the first aspect when executed by a processor.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
according to the technical scheme, the corresponding semantic tags are generated through natural language analysis on the video files, the corresponding audio tags are generated through audio matching, the content tags are generated through content prediction of the video key frames, the content prediction rate can be greatly saved, and then the video frames are labeled according to a plurality of tags; according to the scheme, labels do not need to be added manually, so that the labor cost is saved, and a user can directly check the related information of the short video through the target label information when watching the short video, so that the convenience of short video application is improved; finally, cutting key content according to the marked content in the video frame, and finally fusing to obtain a target video, wherein the scheme adds a text and audio label prediction method, and has a wider application range; the method has wider application scenes, is suitable for storing the database and can provide real-time prediction interface service.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flowchart illustrating steps of an adaptive image cropping fusion method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of another adaptive image cropping and blending method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of content management adding video based on classification tags in the embodiment of the present invention;
FIG. 4 is a schematic flow chart of a recommendation preference model and a similar model based on classification labels according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating search flow index synchronization based on category labels according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In combination with the description in the foregoing background art, in some practical scenarios the schemes or methods for labeling video classification tags in the related art have the following technical problems: the labeling process requires editors with strong aesthetic judgment and patience, and suffers from low efficiency, slow speed, subjective label quality and low video frame coverage; AI content recognition technology places high demands on the video content, requiring the video frames to be relatively simple without too many interfering pictures; and AI content recognition can only obtain effective information from the video content itself, so when the video content is not sufficient to represent the key information of the video, other key information is easily ignored, resulting in insufficient effectiveness and failing to provide a meaningful reference value for actual service requirements.
Meanwhile, with the rapid development of short-video services, the number of short videos has grown explosively. However, given several characteristics of short videos, the traditional classification labeling methods for long videos are not applicable to them. Short-video content is more concise and shorter: the duration is generally around 60 seconds and does not contain a large amount of heterogeneous information, so there is no need to analyze a large number of frames, or even the complete set of frames, as traditional long-video methods do, which leaves considerable room to optimize the classification prediction time. The information of a short video is also more standardized and clearer, and is therefore highly valuable to mine. Because short videos are a latecomer, in today's era of widespread data analysis they usually carry clear and normative descriptive information such as {singer}_{song name}_{MV/performance}; the traditional classification labeling methods for long videos ignore these effective features, so there is also room to improve label accuracy.
Based on the above theoretical basis, in one aspect, as shown in fig. 1, an embodiment of the present application provides an adaptive image cropping and fusing method, where the method includes steps S100 to S500:
s100, acquiring a video file, and extracting text content, audio content and video key frames from the video file;
specifically, in the implementation process, the obtained video file is an original material video file without any target content marking or labeling; the text content is extracted from the video file with a text extraction tool, and the text content in this embodiment includes, but is not limited to, the title, description and captions of the video. Meanwhile, this embodiment can also acquire the source address information of the video from the music library, obtain the physical video file through the video source address, and extract the audio from the video file. For key frame extraction, the embodiment acquires the source address information of the video from the music library, obtains the physical video file through the video source address, and extracts the video frame files through the corresponding functions of the cv2 module of OpenCV; preliminary key frames are then extracted using a segmented key frame extraction method.
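For illustration only, the preprocessing step described above could be approximated by the following Python sketch; the file paths, the use of ffmpeg for audio extraction and the function name are assumptions rather than the patented implementation:

    import cv2
    import subprocess

    def extract_preprocessing_inputs(video_path, audio_path="audio.wav"):
        # Sample raw frames with OpenCV (cv2) and record the total frame count.
        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(frame)
            ok, frame = cap.read()
        cap.release()
        # Extract the audio track with ffmpeg (assumed to be available on the system).
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", audio_path], check=True)
        return total_frames, frames, audio_path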
S200, performing natural language analysis on the text content, and generating a semantic label according to an analysis result;
specifically, in the implementation process, semantic labels are generated from the text content acquired in step S100, such as the title, video description and subtitles. Illustratively, text information such as subtitles, titles and descriptions is analyzed by natural language processing (NLP), and the Top-K semantic labels are obtained by combining the semantic knowledge graph and text importance; weight decision calculation and the like are then performed.
S300, performing audio matching according to audio content, and generating an audio label according to a matching result and an audio knowledge graph;
specifically, in the implementation process, the audio length of the video is further determined from the audio content extracted in step S100; the audio spectrum can be used as a feature value, and an RF (random forest) algorithm is adopted to classify the quality of the video audio and output audio labels.
S400, performing content prediction according to the video key frame to generate a content label;
as shown in fig. 2, in the implementation process of an embodiment, a segmented key frame extraction method may be used to extract preliminary key frames; a hierarchical clustering algorithm based on content features is then used to reduce the dimensionality of the key frames, and the 10 key frames with the largest differences are extracted. The key frames are respectively fed into a scene recognition model, an object recognition/target detection model and a character recognition model for label prediction, generating content labels, which in this embodiment include but are not limited to scene, object and character labels; the obtained scene labels, object labels and character labels are subsequently passed to the decision analysis subsystem for weight decision calculation.
S500, cutting the video file according to the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video;
specifically, in the implementation process, a weight decision is made over the various types of labels obtained in steps S200-S400, and a unified label is finally generated. Specifically, the multiple types of labels are first collected, the weight and comprehensive confidence of each type of label are calculated through a weight formula, the label information is determined according to the label weight and comprehensive confidence, and the label information is assembled into a JSON body for output.
After the various types of labels have been assigned, the embodiment uses the YoloV3 model for key target detection and label information localization, as shown in fig. 2. YoloV3 is an open-source model implemented on the Darknet-53 framework; its Top-5 accuracy reaches 93.8% on the ImageNet dataset of more than 20,000 classes, and it can effectively detect the key targets in a picture, which facilitates the subsequent weight/confidence calculation and picture cropping. Finally, the picture materials, including uncropped or cropped pictures, are composited with video templates; the video templates are produced with the mainstream AE tool (Adobe After Effects) to realize a personalized picture migration and synthesis capability.
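As a minimal sketch of how a detected key-target bounding box might drive the cropping step (the box format and margin are assumptions, not the patented implementation):

    def crop_key_target(frame, box, margin=0.1):
        # box = (x, y, w, h) returned by the key-target detector for one frame.
        x, y, w, h = box
        dx, dy = int(w * margin), int(h * margin)
        x0, y0 = max(0, x - dx), max(0, y - dy)
        x1 = min(frame.shape[1], x + w + dx)
        y1 = min(frame.shape[0], y + h + dy)
        return frame[y0:y1, x0:x1]   # cropped picture material for template compositing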
In order to make the tagged video material more usable, in some alternative embodiments, the step S500 of the method clips the video file according to the semantic tag, the audio tag, and the content tag, and may include steps S510 and S520:
s510, determining a label weight value of each label in a label set, wherein the label set comprises a semantic label, an audio label and a content label;
s520, generating a label sequence according to the label weight value, and labeling the label information to the video file according to the label sequence and the confidence of each label in the sequence;
Illustratively, in this embodiment the corresponding label weights are calculated according to the quality function matrix; the content label CT, the audio label AT and the text label TT are sorted from high to low by weight, and the Top-K label information is output. For example, the correspondence between the information quality of a label and its weight value in this embodiment is shown in Table 1:
TABLE 1
Quality    Weight
High       1
Medium     0.5
Low        0
In addition, after the labels are sorted by the weight value of label quality, the embodiment can also perform a second round of sorting by the comprehensive confidence of each label, and select the top-ranked labels in the sequence as the final labeling result according to the two-round sorting result. It should be noted that in some other embodiments, corresponding weight thresholds and confidence thresholds may also be set to filter the labels.
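For illustration only, the two-round ranking and threshold filtering described above might look like the following Python sketch; the label record layout, the thresholds and the value of K are assumptions:

    QUALITY_WEIGHT = {"High": 1.0, "Medium": 0.5, "Low": 0.0}  # Table 1

    def rank_labels(labels, k=10, weight_threshold=0.0, confidence_threshold=0.0):
        # labels: list of dicts such as {"tag": ..., "quality": ..., "confidence": ...}
        scored = [
            dict(label, weight=QUALITY_WEIGHT[label["quality"]])
            for label in labels
            if QUALITY_WEIGHT[label["quality"]] >= weight_threshold
            and label["confidence"] >= confidence_threshold
        ]
        # Round one: sort by quality weight; round two: break ties by comprehensive confidence.
        scored.sort(key=lambda label: (label["weight"], label["confidence"]), reverse=True)
        return scored[:k]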
In some optional embodiments, the process of performing natural language analysis on the text content in the method step S200 and generating semantic tags according to the analysis result may include steps S210-S240:
s210, extracting a description text from a video file, and formatting the description text to obtain first formatting information;
specifically, in the implementation process, because short videos have good structural information, the title (Name) and Description of the video can be acquired; Name and Description are split on special characters such as "_", quotation marks or "|" and normalized, and {T_name1, ..., T_namej} denotes the resulting set of normalized text information.
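A minimal sketch of this normalization follows; the exact separator set is an assumption:

    import re

    def normalize_text_info(name, description):
        # Split Name and Description on separators such as "_", "|" and quotation marks,
        # producing the normalized text set {T_name1, ..., T_namej}.
        raw = f"{name} {description}"
        parts = re.split(r'[_|"“”]+', raw)
        return [p.strip() for p in parts if p.strip()]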
S220, extracting a subtitle text from the video file;
specifically, in the implementation process, the video physical file can be acquired through a video source address, and the video subtitles are extracted.
S230, performing natural language processing on the first formatting information and the subtitle text to obtain a key entity matrix;
specifically, in the implementation process, the text information matrix learns the features by using a GRU-based network structure, the features of all text information (including first formatting information and subtitle text) are accessed to a CRF decoding layer to complete sequence labeling, and corresponding word boundaries and parts of speech in the text information and the relation between entity categories are output. The output part-of-speech tags contain 24 part-of-speech tags (lower case letters), 4 professional category tags (upper case letters), and the names of people, places, organizations and TIME are marked by upper case and lower case (PER/LOC/ORG/TIME and nr/ns/nt/t), wherein the lower case represents information such as names of people with low degree of position. Finally, the key entity matrix is output by deleting non-key entities such as adjectives, adverbs, quantifiers, pronouns, prepositions, adverbs and the like:
The key entity matrix has the form [(Entity_1, IMP_1), (Entity_2, IMP_2), ..., (Entity_i, IMP_i), ...],
wherein Entity_i represents the i-th de-duplicated key entity obtained from {T_name1, ..., T_namej};
IMP_i indicates the corresponding degree of importance and is computed from a word-frequency term and an inverse-frequency term: the word-frequency term represents the number of occurrences of Entity_i in {T_name1, ..., T_namej} divided by j, and the inverse-frequency term is based on the number of T_name entries containing Entity_i, plus 1.
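The exact importance expression is given as a formula image in the publication; the following Python sketch therefore assumes a TF-IDF-style product consistent with the surrounding description:

    import math

    def entity_importance(entity, text_infos):
        # text_infos: the normalized set {T_name1, ..., T_namej}; j = len(text_infos).
        j = len(text_infos)
        occurrences = sum(t.count(entity) for t in text_infos)
        word_freq = occurrences / j                    # word-frequency term
        containing = sum(1 for t in text_infos if entity in t)
        inv_freq = math.log(j / (containing + 1))      # inverse-frequency term with +1 smoothing
        return word_freq * inv_freq                    # assumed TF-IDF-style combination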
S240, inputting the key entity matrix into a semantic prediction model, and determining a semantic label according to a model prediction result;
specifically, in the implementation process, the first column of the key entity matrix, array(Entity)[:, 0], is fed into the semantic prediction model, which outputs a corresponding label and confidence matrix of entries (Tag_iz, CL_iz), wherein Tag_iz represents a label obtained from Entity_i and CL_iz its confidence. The matrix is compressed by merging entries under the same Tag, and a de-duplicated prediction tag matrix PredTag, sorted from high to low according to text accuracy, is finally generated; the tag information of array(PredTag)[:K] is then returned.
In order to achieve a higher label coverage rate, in some possible embodiments labeling may also make use of derivative labels. Further, after the process of performing natural language analysis on the text content and generating semantic labels according to the analysis result, the method in the embodiment may further include step S250: acquiring the structured information of the description text, and matching the structured information with the semantic knowledge graph to obtain derivative labels;
specifically, in the implementation process, the embodiment can match the video knowledge graph using structured information such as the title and description to obtain the derivative labels; the semantic labels and the derivative labels are passed to the decision analysis subsystem for weight decision calculation. Finally, the video file is clipped according to the derivative labels, semantic labels, audio labels and content labels, and the clipped materials are fused to obtain the target video.
The semantic label extraction method of this embodiment can still produce highly usable classification labels even when the quality of the video and audio content is not high or the effective information is insufficient.
In some possible embodiments, the process of performing audio matching according to audio content in step S300 and generating an audio tag according to a matching result and an audio knowledge graph in the method may include steps S310 to S340:
s310, converting the audio content to obtain text information, and adding the text information into the text content;
specifically, in the implementation process, the audio extracted from the video is converted into corresponding text information by speech-to-text translation, the text information is added to the text content, and the corresponding semantic labels are then generated in step S200.
S320, extracting an audio fingerprint according to the audio content;
s330, matching in a fingerprint database according to the audio fingerprints to determine candidate audio;
s340, inputting the candidate audio into an audio knowledge graph for matching to obtain an audio label;
specifically, in the implementation process, besides the label generation mode of converting audio content into text content, the embodiment can also generate corresponding audio fingerprints for the audio, match them against the fingerprint database, and find the audio with a similarity above 80%; the similar audio is then input into the audio knowledge graph to match the graph audio labels, and decision calculation is performed according to those audio labels.
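A minimal sketch of the fingerprint matching and knowledge-graph lookup follows; similarity() and audio_kg.lookup() are placeholder helpers assumed for illustration, not the patented implementation:

    def match_audio_tags(audio_fingerprint, fingerprint_db, audio_kg, threshold=0.8):
        # fingerprint_db: iterable of (audio_id, fingerprint) pairs.
        candidates = [
            audio_id
            for audio_id, fp in fingerprint_db
            if similarity(audio_fingerprint, fp) >= threshold   # keep audio with >= 80% similarity
        ]
        # audio_kg.lookup() stands in for the audio knowledge-graph matching step.
        return [tag for audio_id in candidates for tag in audio_kg.lookup(audio_id)]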
In some possible embodiments, the process of generating content tags by performing content prediction according to video key frames in method step S400 may include steps S410-S440:
s410, fragmenting the video file to obtain a plurality of video frame files;
on the premise of retaining as much of the key video information as possible, the embodiment extracts preliminary key frames with a segmented key frame extraction method, then applies a content-feature-based hierarchical clustering algorithm for key frame dimensionality reduction, so that the video content prediction time is reduced by using the fewest key frames. Specifically, segmented key frame extraction is performed first: with the total number of video frames denoted frame_i, setting the number of segments segment to 60 and the number of frames truncated per segment perSegmentFrame to 1 yields a 60 × 1 video frame matrix (one frame taken from each segment).
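For illustration only, the segmented sampling could look like the following sketch; the seek-based sampling strategy is an assumption:

    import cv2

    def extract_segmented_frames(video_path, segments=60, per_segment=1):
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        step = max(total // segments, 1)
        frames = []
        for s in range(segments):
            cap.set(cv2.CAP_PROP_POS_FRAMES, min(s * step, max(total - 1, 0)))
            ok, frame = cap.read()
            if ok:
                frames.append(frame)   # one truncated frame per segment
        cap.release()
        return frames                  # roughly the segments x perSegmentFrame frame matrix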
S420, graying the video frame file to obtain a gray picture, and calculating according to the intra-class difference and the inter-class difference of the gray picture to obtain a content feature matrix;
specifically, in the implementation process, the segmented key frames are extracted and the content features are then constructed. The frame is grayed so that all picture pixels are represented by values 0-255, with Background defined as pixel values less than 120 and Foreground as pixel values greater than or equal to 120. The foreground color proportion F is the number of foreground pixels divided by the total number of pixels, and the background color proportion B is the number of background pixels divided by the total number of pixels. With the foreground color mean and variance FA, FV and the background color mean and variance BA, BV, the intra-class difference is

ID = F × FV² + B × BV²

and the inter-class difference is

OD = F × B × (FA − BA)²

Taking Min(ID) as a threshold value and comparing it with each pixel point yields the [0,1] content feature matrix of the picture.
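A minimal sketch of this content feature construction, assuming a fixed foreground/background split at 120 (the Min(ID)-based thresholding across frames is simplified here to a per-frame comparison):

    import cv2
    import numpy as np

    def content_feature(frame, split=120):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float64)
        fg, bg = gray[gray >= split], gray[gray < split]
        F, B = fg.size / gray.size, bg.size / gray.size          # foreground / background proportions
        FA, FV = (fg.mean(), fg.var()) if fg.size else (0.0, 0.0)
        BA, BV = (bg.mean(), bg.var()) if bg.size else (0.0, 0.0)
        intra = F * FV ** 2 + B * BV ** 2                        # ID = F x FV^2 + B x BV^2
        inter = F * B * (FA - BA) ** 2                           # OD = F x B x (FA - BA)^2
        feature = (gray >= split).astype(np.uint8)               # [0,1] content feature matrix
        return feature, intra, inter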
S430, reducing the dimension of the content feature matrix, and determining to obtain a key frame according to the difference between the reduced content feature matrices;
specifically, in the implementation process, after the content feature is constructed, the key frame dimension reduction is performed, and the key frame dimension reduction process in the embodiment may further include steps S331 and S332:
s331, constructing a first feature class set according to the content feature matrix, and performing iteration to generate a plurality of second feature class sets; wherein the iterative process comprises: calculating the distance between each category in the first characteristic class set and generating a distance matrix; extracting the minimum element in the distance matrix to construct a second feature class set;
s332, extracting a first picture in the video frame file cluster corresponding to the second feature class set as a key frame;
illustratively, in the embodiment a hierarchical clustering method is used to reduce the dimensionality of the 60-dimensional frame matrix and find the 10 pictures with the largest differences; the number of clusters is set to 10, and the steps are as follows:
A. First, the 60-dimensional frame matrix is used as the initial sample, and each frame forms its own content feature class: G1(0), ..., G60(0). The single-link distances between the classes are calculated, resulting in a 60 × 60 distance matrix, where "0" denotes the initial state.
B. Let D(n) be the obtained distance matrix (n is the number of successive clustering merges). Find the minimum element in D(n) and merge the two corresponding classes into one, thereby establishing a new classification: G1(n+1), G2(n+1), ...;
C. Then calculate the distances between the merged classes to obtain D(n+1).
D. Jump back to step B and repeat the calculation and merging.
E. The traversal ends when the classes have been reduced to G10; the first picture of each cluster is taken as a key frame after dimensionality reduction.
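A minimal sketch of this dimensionality reduction using SciPy's single-link agglomerative clustering; the flattened feature layout is an assumption:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def select_key_frames(features, frames, n_clusters=10):
        # features: one flattened [0,1] content feature matrix per candidate frame (60 rows).
        X = np.asarray(features, dtype=float).reshape(len(frames), -1)
        Z = linkage(X, method="single")                  # single-link merging, as in steps A-D
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        key_frames = []
        for c in range(1, n_clusters + 1):
            members = np.where(labels == c)[0]
            if members.size:
                key_frames.append(frames[members[0]])    # first picture of each cluster (step E)
        return key_frames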
S440, inputting the key frame into a video content prediction model, and determining a content label according to a model prediction result;
specifically, in the implementation process, the 10 frames after the dimension reduction are put into a video content prediction algorithm to obtain final content label information.
In some alternative embodiments, in order to generate highly accurate classification labels in the case of multimodal fusion, the multi-type label decision process may include:
A. calculating the video content quality:
the duration, total frame count and resolution of the short video are obtained as feature values, an RF random forest algorithm is adopted to classify the video content quality, and one of the three labels CQ = {High, Medium, Low} is output.
B. Calculating the video text quality:
the method comprises the steps of acquiring the number of title special symbols, the length and the OCR recognition result ratio (OCR result/frame number 100) of a video as characteristic values, classifying the video text quality by adopting an RF random forest algorithm, and outputting three labels of TQ { (High, Medium and Low }.
C. Calculating video and audio quality
The audio length of the video is obtained and the audio spectrum is used as a feature value; an RF random forest algorithm is adopted to classify the video audio quality, and one of the three labels AQ = {High, Medium, Low} is output.
D. The corresponding label weights are selected according to the quality function matrix, wherein CT is the content label, AT is the audio label and TT is the text label; finally, the labels are sorted from high to low by weight and the Top-K label information is output.
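For illustration only, the quality classification (steps A-C) and the weight-based decision (step D) might be sketched as follows; the feature layouts and the reuse of the Table 1 weight mapping are simplifications, not the exact quality function matrix:

    from sklearn.ensemble import RandomForestClassifier

    def train_quality_classifier(features, quality_labels):
        # features: per-video feature rows (e.g. duration, frame count, resolution);
        # quality_labels: values in {"High", "Medium", "Low"}.
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(features, quality_labels)
        return clf

    def decide_labels(ct, at, tt, cq, aq, tq, k=10):
        # ct/at/tt: (tag, confidence) lists for content, audio and text labels;
        # cq/aq/tq: predicted quality ("High"/"Medium"/"Low") for each modality.
        weights = {"High": 1.0, "Medium": 0.5, "Low": 0.0}
        pool = (
            [(tag, weights[cq], conf) for tag, conf in ct]
            + [(tag, weights[aq], conf) for tag, conf in at]
            + [(tag, weights[tq], conf) for tag, conf in tt]
        )
        pool.sort(key=lambda item: (item[1], item[2]), reverse=True)
        return pool[:k]   # Top-K label information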
Referring to fig. 3, a detailed description is made of a practical application process of the present application:
in a certain system, the application layer provides a video classification label prediction entry through which a user can upload a short video and edit the video text information; the prediction result can be manually corrected and finally stored in the database, so that the next time the short video is opened its label information can be consulted in real time, and the number and operation status of each classification label in the current music library can be counted and displayed. The flow for adding a new video is as follows:
1) Uploading and editing the short video: the user can upload the short video on the application-layer page, either locally or via a URL, and edit text information such as the short-video title, notes and singer.
2) And calling a classification label prediction interface and displaying a prediction result.
3) And manually correcting the prediction result, carrying out negative feedback on inaccurate labels, and manually removing improper labels.
Referring to fig. 4, another practical application process of the present application is described:
the existing short video resources of the music library are processed in batch in an off-line mode, and a corresponding short video classification label library is generated, so that rich label information is provided for short video recommendation. The process of recommending and constructing the user preference model and the video similarity model comprises the following steps:
1) and processing the short video resources of the song library in batch off line.
2) Constructing a user preference model: and associating the label information of the video according to the past user behavior of the user to generate a corresponding user preference model.
3) Constructing a video similarity model: and constructing a video similarity matrix according to the label relevance and similarity between the videos, and providing more video sources with the same label for video recommendation.
Referring to fig. 5, another practical application process of the present application is described:
the existing short video resources of the song library are processed in batch in an off-line mode, a corresponding short video classification label library is generated, and a search hit result is returned by adding a label matching strategy. The process of searching the synchronization index is as follows:
1) processing short video resources of the song library in batches in an off-line mode: and (4) performing T +1 off-line processing on the new video every day, generating corresponding label information, and storing the label information in a database.
2) Synchronization to the search index repository: and synchronizing the video tag information to the search index database at regular increments each day.
In a second aspect, the technical solution of the present application further provides a self-adaptive image cropping fusion system, including:
the preprocessing module is used for acquiring a video file and extracting text content, audio content and video key frames from the video file;
the semantic analysis module is used for carrying out natural language analysis on the text content and generating a semantic label according to the analysis result;
the audio analysis module is used for carrying out audio matching according to the audio content and generating an audio label according to a matching result and an audio knowledge graph;
the content analysis module is used for predicting content according to the video key frame to generate a content label;
and the decision analysis module is used for cutting the video file according to the semantic label, the audio label and the content label, and fusing the cut materials to obtain the target video.
In a third aspect, the present disclosure also provides a computer device for adaptive image cropping and fusion, including at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to perform the adaptive picture cropping fusion method as in the first aspect.
The embodiment of the invention also provides a storage medium in which a program is stored, and the program is executed by the processor to realize the self-adaptive picture cutting and fusing method.
From the above specific implementation process, it can be concluded that the technical solution provided by the present invention has the following advantages or advantages compared to the prior art:
1. The technical solution of the present application can solve the problems that manual labeling is inefficient and demanding on personnel, and can generate a large number of short-video labels rapidly.
2. It removes the limitation of high requirements on the video content, and can still assign highly usable classification labels when complex scenes appear in the video content.
3. It can solve the problem of insufficient classification label effectiveness, and can still assign highly effective classification labels when the video content does not reflect a large amount of information about the video.
4. After the user uploads a video, the video can be automatically analyzed and a classification label preview can be generated in real time.
5. It allows the user to classify and label single or batch videos and generate JSON label information, improving classification efficiency while providing a unified label service for third parties.
6. It allows the user to define the Top-K label information, reducing the interference of tail labels on short-video accuracy.
7. It introduces a comprehensive confidence concept for classification labels, so that the classification labels carry different weight ratios, while providing freely selectable reference values for third parties.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. The adaptive picture clipping and fusing method is characterized by comprising the following steps of:
acquiring a video file, and extracting text content, audio content and video key frames from the video file;
performing natural language analysis on the text content, and generating a semantic label according to an analysis result;
performing audio matching according to the audio content, and generating an audio label according to a matching result and an audio knowledge graph;
predicting the content according to the video key frame to generate a content label;
and clipping the video file according to the semantic label, the audio label and the content label, and fusing the clipped materials to obtain a target video.
2. The adaptive picture cropping fusion method of claim 1, wherein said step of cropping the video file according to the semantic tag, the audio tag, and the content tag comprises:
determining a tag weight value of each tag in a tag set, wherein the tag set comprises the semantic tag, the audio tag and the content tag;
and generating a label sequence according to the label weight value, and labeling the label information to the video file according to the label sequence and the confidence degree of each label in the sequence.
3. The adaptive picture cropping fusion method according to claim 1, wherein the text content comprises description text and subtitle text; the step of performing natural language analysis on the text content and generating a semantic label according to an analysis result includes:
extracting the description text from the video file, and formatting the description text to obtain first formatting information;
extracting the subtitle text from the video file;
performing natural language processing on the first formatting information and the subtitle text to obtain a key entity matrix;
and inputting the key entity matrix into a semantic prediction model, and determining the semantic label according to a model prediction result.
4. The adaptive image cropping and fusion method according to claim 3, wherein after the step of performing natural language analysis on the text content and generating semantic tags according to the analysis result, the method further comprises:
acquiring structural information of the description text, and matching the structural information with a semantic knowledge graph to obtain a derivative label;
and cutting the video file according to the derived label, the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video.
5. The adaptive picture cropping and fusing method according to claim 1, wherein the step of performing audio matching according to the audio content and generating an audio tag according to a matching result and an audio knowledge graph comprises:
converting the audio content to obtain text information, and adding the text information into the text content;
extracting an audio fingerprint according to the audio content;
matching in a fingerprint database according to the audio fingerprints to determine candidate audio;
and inputting the candidate audio into the audio knowledge graph to be matched to obtain the audio label.
6. The adaptive picture cropping fusion method according to claim 1, wherein said step of performing content prediction based on said video key frames to generate content tags comprises:
slicing the video file to obtain a plurality of video frame files;
graying the video frame file to obtain a gray picture, and calculating according to the intra-class difference and the inter-class difference of the gray picture to obtain a content feature matrix;
reducing the dimension of the content feature matrix, and determining to obtain a key frame according to the difference between the content feature matrices after dimension reduction;
and inputting the key frame into a video content prediction model, and determining the content label according to a model prediction result.
7. The adaptive picture cropping and fusing method according to claim 6, wherein the step of performing dimension reduction on the content feature matrix and determining to obtain a key frame according to a difference between the content feature matrices after the dimension reduction comprises:
constructing a first feature class set according to the content feature matrix, and performing iteration to generate a plurality of second feature class sets;
extracting a first picture in a video frame file cluster corresponding to the second feature class set as the key frame;
the iterative process comprises:
calculating the distance between each category in the first feature class set and generating a distance matrix;
and extracting the minimum element in the distance matrix to construct a second feature class set.
8. The adaptive picture cropping and fusing system is characterized by comprising:
the system comprises a preprocessing module, a video processing module and a video processing module, wherein the preprocessing module is used for acquiring a video file and extracting text content, audio content and video key frames from the video file;
the semantic analysis module is used for carrying out natural language analysis on the text content and generating a semantic label according to an analysis result;
the audio analysis module is used for carrying out audio matching according to the audio content and generating an audio label according to a matching result and an audio knowledge graph;
the content analysis module is used for predicting the content according to the video key frame to generate a content label;
and the decision analysis module is used for cutting the video file according to the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video.
9. Self-adaptation picture is tailor and is fused computer equipment, its characterized in that includes:
at least one processor;
at least one memory for storing at least one program;
when executed by the at least one processor, cause the at least one processor to perform the adaptive picture cropping fusion method according to any one of claims 1 to 7.
10. A storage medium having stored therein a processor-executable program, wherein the processor-executable program, when executed by a processor, is configured to execute the adaptive picture cropping fusion method according to any one of claims 1-7.
CN202111564539.6A 2021-12-20 2021-12-20 Adaptive picture clipping and fusing method, system, computer device and medium Pending CN114218437A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111564539.6A CN114218437A (en) 2021-12-20 2021-12-20 Adaptive picture clipping and fusing method, system, computer device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111564539.6A CN114218437A (en) 2021-12-20 2021-12-20 Adaptive picture clipping and fusing method, system, computer device and medium

Publications (1)

Publication Number Publication Date
CN114218437A true CN114218437A (en) 2022-03-22

Family

ID=80704391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111564539.6A Pending CN114218437A (en) 2021-12-20 2021-12-20 Adaptive picture clipping and fusing method, system, computer device and medium

Country Status (1)

Country Link
CN (1) CN114218437A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116112763A (en) * 2022-11-15 2023-05-12 国家计算机网络与信息安全管理中心 Method and system for automatically generating short video content labels

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100121844A1 (en) * 2008-11-07 2010-05-13 Yahoo! Inc. Image relevance by identifying experts
CN104090955A (en) * 2014-07-07 2014-10-08 科大讯飞股份有限公司 Automatic audio/video label labeling method and system
US20150082330A1 (en) * 2013-09-18 2015-03-19 Qualcomm Incorporated Real-time channel program recommendation on a display device
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
US20200084519A1 (en) * 2018-09-07 2020-03-12 Oath Inc. Systems and Methods for Multimodal Multilabel Tagging of Video
CN113407645A (en) * 2021-05-19 2021-09-17 福建福清核电有限公司 Intelligent sound image archive compiling and researching method based on knowledge graph
WO2021184153A1 (en) * 2020-03-16 2021-09-23 阿里巴巴集团控股有限公司 Summary video generation method and device, and server

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100121844A1 (en) * 2008-11-07 2010-05-13 Yahoo! Inc. Image relevance by identifying experts
US20150082330A1 (en) * 2013-09-18 2015-03-19 Qualcomm Incorporated Real-time channel program recommendation on a display device
CN104090955A (en) * 2014-07-07 2014-10-08 科大讯飞股份有限公司 Automatic audio/video label labeling method and system
US20200084519A1 (en) * 2018-09-07 2020-03-12 Oath Inc. Systems and Methods for Multimodal Multilabel Tagging of Video
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
WO2021184153A1 (en) * 2020-03-16 2021-09-23 阿里巴巴集团控股有限公司 Summary video generation method and device, and server
CN113407645A (en) * 2021-05-19 2021-09-17 福建福清核电有限公司 Intelligent sound image archive compiling and researching method based on knowledge graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
迪潘简•萨卡尔 (Dipanjan Sarkar): "Phrase extraction based on weighted labels", in "Data Science and Engineering Technology Series: Python Text Analysis" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116112763A (en) * 2022-11-15 2023-05-12 国家计算机网络与信息安全管理中心 Method and system for automatically generating short video content labels

Similar Documents

Publication Publication Date Title
CN112818906B (en) Intelligent cataloging method of all-media news based on multi-mode information fusion understanding
CN101021855B (en) Video searching system based on content
CN107590224B (en) Big data based user preference analysis method and device
CN109862397B (en) Video analysis method, device, equipment and storage medium
CN114297439B (en) Short video tag determining method, system, device and storage medium
CN111274442B (en) Method for determining video tag, server and storage medium
CN113590850A (en) Multimedia data searching method, device, equipment and storage medium
CN111190997A (en) Question-answering system implementation method using neural network and machine learning sequencing algorithm
CN109033060B (en) Information alignment method, device, equipment and readable storage medium
CN111414735A (en) Text data generation method and device
CN112199932A (en) PPT generation method, device, computer-readable storage medium and processor
CN113591530A (en) Video detection method and device, electronic equipment and storage medium
CN116738250A (en) Prompt text expansion method, device, electronic equipment and storage medium
CN114363695B (en) Video processing method, device, computer equipment and storage medium
JP6389296B1 (en) VIDEO DATA PROCESSING DEVICE, VIDEO DATA PROCESSING METHOD, AND COMPUTER PROGRAM
CN114938473A (en) Comment video generation method and device
CN114218437A (en) Adaptive picture clipping and fusing method, system, computer device and medium
US20220004773A1 (en) Apparatus for training recognition model, apparatus for analyzing video, and apparatus for providing video search service
CN116186310B (en) AR space labeling and displaying method fused with AI general assistant
CN114510564A (en) Video knowledge graph generation method and device
CN107656760A (en) Data processing method and device, electronic equipment
CN115909390B (en) Method, device, computer equipment and storage medium for identifying low-custom content
CN110888896A (en) Data searching method and data searching system thereof
CN114218407A (en) Content creation system based on digital automatic indexing
CN113869043A (en) Content labeling method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20220322