CN114218437A - Adaptive picture clipping and fusing method, system, computer device and medium - Google Patents

Adaptive picture clipping and fusing method, system, computer device and medium

Info

Publication number
CN114218437A
CN114218437A
Authority
CN
China
Prior art keywords
content
label
audio
video
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111564539.6A
Other languages
Chinese (zh)
Inventor
肖冠正
郝德禄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202111564539.6A priority Critical patent/CN114218437A/en
Publication of CN114218437A publication Critical patent/CN114218437A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides an adaptive picture clipping and fusing method, system, computer device and medium. The method comprises the following steps: acquiring a video file, and extracting text content, audio content and video key frames from the video file; performing natural language analysis on the text content, and generating semantic labels from the analysis result; performing audio matching on the audio content, and generating audio labels from the matching result and an audio knowledge graph; performing content prediction on the video key frames to generate content labels; and clipping the video file according to the semantic labels, the audio labels and the content labels, and fusing the clipped materials to obtain a target video.

Description

Adaptive picture clipping and fusing method, system, computer device and medium
Technical Field
The invention relates to the technical field of video processing, and in particular to an adaptive picture clipping and fusing method, system, computer device and storage medium.
Background
Currently, related technologies mainly label long videos (longer than 60 seconds) with classification tags, and labeling is generally completed by analyzing the video content, mainly in the following two scenarios: first, manual editing, in which an editor reviews the entire video content and assigns classification labels to the video based on subjective judgment and understanding; second, AI (artificial intelligence) recognition, in which faces, scenes and objects appearing in the video frames are recognized and labels of the corresponding classes, such as celebrities, food or libraries, are extracted.
However, manual labeling is labor-intensive and requires editors with strong aesthetic judgment and patience, so it tends to be inefficient and slow, the label quality is subjective, and the coverage of video frames is low. Some related technologies instead use AI recognition for labeling, but that approach cannot handle application scenarios with too many interfering pictures, which limits its range of application. In addition, the AI recognition and labeling methods in the related art can identify only a limited set of target subjects, so the resulting labels suffer from insufficient effectiveness.
Disclosure of Invention
In view of the above, in order to at least partially solve one of the above technical problems, embodiments of the present invention provide an adaptive picture clipping and fusing method with wide applicability and comprehensive coverage of target subject identification; the technical solution of the present application also provides a system, a computer device and a computer-readable storage medium that can correspondingly implement the method.
On one hand, the technical scheme of the application provides a self-adaptive picture cutting and fusing method, which comprises the following steps:
acquiring a video file, and extracting text content, audio content and video key frames from the video file;
performing natural language analysis on the text content, and generating a semantic label according to an analysis result;
performing audio matching according to the audio content, and generating an audio label according to a matching result and an audio knowledge graph;
predicting the content according to the video key frame to generate a content label;
and clipping the video file according to the semantic label, the audio label and the content label, and fusing the clipped materials to obtain a target video.
In a possible embodiment of the present disclosure, the clipping the video file according to the semantic tag, the audio tag, and the content tag includes:
determining a tag weight value of each tag in a tag set, wherein the tag set comprises the semantic tag, the audio tag and the content tag;
and generating a label sequence according to the label weight value, and labeling the label information to the video file according to the label sequence and the confidence degree of each label in the sequence.
In a possible embodiment of the present disclosure, the text content includes description text and subtitle text; the step of performing natural language analysis on the text content and generating a semantic label according to an analysis result includes:
extracting the description text from the video file, and formatting the description text to obtain first formatting information;
extracting the subtitle text from the video file;
performing natural language processing on the first formatting information and the subtitle text to obtain a key entity matrix;
and inputting the key entity matrix into a semantic prediction model, and determining the semantic label according to a model prediction result.
In a possible embodiment of the present disclosure, after the step of performing natural language analysis on the text content and generating a semantic tag according to an analysis result, the method further includes:
acquiring structural information of the description text, and matching the structural information with a semantic knowledge graph to obtain a derivative label;
and cutting the video file according to the derived label, the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video.
In a possible embodiment of the present disclosure, the step of performing audio matching according to the audio content and generating an audio tag according to a matching result and an audio knowledge graph includes:
converting the audio content to obtain text information, and adding the text information into the text content;
extracting an audio fingerprint according to the audio content;
matching in a fingerprint database according to the audio fingerprints to determine candidate audio;
and inputting the candidate audio into the audio knowledge graph to be matched to obtain the audio label.
In a possible embodiment of the present disclosure, the step of generating a content tag according to content prediction performed by the video key frame includes:
slicing the video file to obtain a plurality of video frame files;
graying the video frame file to obtain a gray picture, and calculating according to the intra-class difference and the inter-class difference of the gray picture to obtain a content feature matrix;
reducing the dimension of the content feature matrix, and determining to obtain a key frame according to the difference between the content feature matrices after dimension reduction;
and inputting the key frame into a video content prediction model, and determining the content label according to a model prediction result.
In a possible embodiment of the present application, the step of performing dimension reduction on the content feature matrix, and determining to obtain a key frame according to a difference between the content feature matrices after the dimension reduction includes:
constructing a first feature class set according to the content feature matrix, and performing iteration to generate a plurality of second feature class sets;
extracting a first picture in a video frame file cluster corresponding to the second feature class set as the key frame;
the iterative process comprises:
calculating the distance between each category in the first feature class set and generating a distance matrix;
and extracting the minimum element in the distance matrix to construct a second feature class set.
In another aspect, the technical solution of the present application further provides an adaptive picture clipping and fusing system, comprising:
the system comprises a preprocessing module, a video processing module and a video processing module, wherein the preprocessing module is used for acquiring a video file and extracting text content, audio content and video key frames from the video file;
the semantic analysis module is used for carrying out natural language analysis on the text content and generating a semantic label according to an analysis result;
the audio analysis module is used for carrying out audio matching according to the audio content and generating an audio label according to a matching result and an audio knowledge graph;
the content analysis module is used for predicting the content according to the video key frame to generate a content label;
and the decision analysis module is used for cutting the video file according to the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video.
On the other hand, the technical solution of the present invention further provides a computer device for adaptive image cropping and fusion, comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to execute the adaptive picture cropping fusion method of the first aspect.
In another aspect, the present invention further provides a storage medium, in which a processor-executable program is stored, and the processor-executable program is used to execute the method in the first aspect when executed by a processor.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
according to the technical scheme, the corresponding semantic tags are generated through natural language analysis on the video files, the corresponding audio tags are generated through audio matching, the content tags are generated through content prediction of the video key frames, the content prediction rate can be greatly saved, and then the video frames are labeled according to a plurality of tags; according to the scheme, labels do not need to be added manually, so that the labor cost is saved, and a user can directly check the related information of the short video through the target label information when watching the short video, so that the convenience of short video application is improved; finally, cutting key content according to the marked content in the video frame, and finally fusing to obtain a target video, wherein the scheme adds a text and audio label prediction method, and has a wider application range; the method has wider application scenes, is suitable for storing the database and can provide real-time prediction interface service.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flowchart illustrating steps of an adaptive image cropping fusion method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of another adaptive image cropping and blending method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of content management adding video based on classification tags in the embodiment of the present invention;
FIG. 4 is a schematic flow chart of a recommendation preference model and a similar model based on classification labels according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating search flow index synchronization based on category labels according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In combination with the description in the foregoing background art, in some practical scenarios the schemes or methods for labeling video classification tags in the related art have the following technical problems: the labeling process requires editors with strong aesthetic judgment and patience, and suffers from low efficiency, slow speed, subjective label quality and low video frame coverage; AI content recognition technology places high demands on the video content, requiring the video frames to be relatively simple without too many interfering pictures; and AI content recognition can only obtain effective information from the video content itself, so when the video content is not sufficient to represent the key information of the video, other key information is easily ignored, resulting in insufficient effectiveness and failing to provide a meaningful reference value for actual service requirements.
Meanwhile, with the rapid development of short-video services, the number of short videos has grown explosively. However, given several characteristics of short videos, the traditional classification labeling methods for long videos are not applicable to them. Short-video content is more concise and shorter: the duration is generally around 60 seconds and does not contain a large amount of heterogeneous information, so there is no need to analyze a large number of frames, or even the complete set of frames, as traditional long-video methods do, which leaves considerable room to optimize the classification prediction time. The information of a short video is also more standardized and clearer, and is therefore highly valuable to mine. Because short videos are a latecomer, in today's era of widespread data analysis they usually carry clear and normative descriptive information such as {singer}_{song name}_{MV/performance}; the traditional classification labeling methods for long videos ignore these effective features, so there is also room to improve label accuracy.
Based on the above theoretical basis, in one aspect, as shown in fig. 1, an embodiment of the present application provides an adaptive image cropping and fusing method, where the method includes steps S100 to S500:
s100, acquiring a video file, and extracting text content, audio content and video key frames from the video file;
specifically, in the implementation process, the obtained video file is an original material video file without any target content marking or labeling; the text content is extracted from the video file with a text extraction tool, and the text content in this embodiment includes, but is not limited to, the title, description and captions of the video. Meanwhile, this embodiment can also acquire the source address information of the video from the music library, obtain the physical video file through the video source address, and extract the audio from the video file. For key frame extraction, the embodiment acquires the source address information of the video from the music library, obtains the physical video file through the video source address, and extracts the video frame files through the corresponding functions of the cv2 module of OpenCV; preliminary key frames are then extracted using a segmented key frame extraction method.
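For illustration only, the preprocessing step described above could be approximated by the following Python sketch; the file paths, the use of ffmpeg for audio extraction and the function name are assumptions rather than the patented implementation:

    import cv2
    import subprocess

    def extract_preprocessing_inputs(video_path, audio_path="audio.wav"):
        # Sample raw frames with OpenCV (cv2) and record the total frame count.
        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(frame)
            ok, frame = cap.read()
        cap.release()
        # Extract the audio track with ffmpeg (assumed to be available on the system).
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", audio_path], check=True)
        return total_frames, frames, audio_path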
S200, performing natural language analysis on the text content, and generating a semantic label according to an analysis result;
specifically, in the implementation process, semantic labels are generated from the text content acquired in step S100, such as the title, video description and subtitles. Illustratively, text information such as subtitles, titles and descriptions is analyzed by natural language processing (NLP), and the Top-K semantic labels are obtained by combining the semantic knowledge graph and text importance; weight decision calculation and the like are then performed.
S300, performing audio matching according to audio content, and generating an audio label according to a matching result and an audio knowledge graph;
specifically, in the implementation process, the audio length of the video is further determined from the audio content extracted in step S100; the audio spectrum can be used as a feature value, and an RF (random forest) algorithm is adopted to classify the quality of the video audio and output audio labels.
S400, performing content prediction according to the video key frame to generate a content label;
as shown in fig. 2, in the implementation process of an embodiment, a segmented key frame extraction method may be used to extract preliminary key frames; a hierarchical clustering algorithm based on content features is then used to reduce the dimensionality of the key frames, and the 10 key frames with the largest differences are extracted. The key frames are respectively fed into a scene recognition model, an object recognition/target detection model and a character recognition model for label prediction, generating content labels, which in this embodiment include but are not limited to scene, object and character labels; the obtained scene labels, object labels and character labels are subsequently passed to the decision analysis subsystem for weight decision calculation.
S500, cutting the video file according to the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video;
specifically, in the implementation process, a weight decision is made over the various types of labels obtained in steps S200-S400, and a unified label is finally generated. Specifically, the multiple types of labels are first collected, the weight and comprehensive confidence of each type of label are calculated through a weight formula, the label information is determined according to the label weight and comprehensive confidence, and the label information is assembled into a JSON body for output.
After the various types of labels have been assigned, the embodiment uses the YoloV3 model for key target detection and label information localization, as shown in fig. 2. YoloV3 is an open-source model implemented on the Darknet-53 framework; its Top-5 accuracy reaches 93.8% on the ImageNet dataset of more than 20,000 classes, and it can effectively detect the key targets in a picture, which facilitates the subsequent weight/confidence calculation and picture cropping. Finally, the picture materials, including uncropped or cropped pictures, are composited with video templates; the video templates are produced with the mainstream AE tool (Adobe After Effects) to realize a personalized picture migration and synthesis capability.
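As a minimal sketch of how a detected key-target bounding box might drive the cropping step (the box format and margin are assumptions, not the patented implementation):

    def crop_key_target(frame, box, margin=0.1):
        # box = (x, y, w, h) returned by the key-target detector for one frame.
        x, y, w, h = box
        dx, dy = int(w * margin), int(h * margin)
        x0, y0 = max(0, x - dx), max(0, y - dy)
        x1 = min(frame.shape[1], x + w + dx)
        y1 = min(frame.shape[0], y + h + dy)
        return frame[y0:y1, x0:x1]   # cropped picture material for template compositing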
In order to make the tagged video material more usable, in some alternative embodiments, the step S500 of the method clips the video file according to the semantic tag, the audio tag, and the content tag, and may include steps S510 and S520:
s510, determining a label weight value of each label in a label set, wherein the label set comprises a semantic label, an audio label and a content label;
s520, generating a label sequence according to the label weight value, and labeling the label information to the video file according to the label sequence and the confidence of each label in the sequence;
Illustratively, in this embodiment the corresponding label weights are calculated according to the quality function matrix; the content label CT, the audio label AT and the text label TT are sorted from high to low by weight, and the Top-K label information is output. For example, the correspondence between the information quality of a label and its weight value in this embodiment is shown in Table 1:
TABLE 1
Quality    Weight
High       1
Medium     0.5
Low        0
In addition, after the labels are sorted by the weight value of label quality, the embodiment can also perform a second round of sorting by the comprehensive confidence of each label, and select the top-ranked labels in the sequence as the final labeling result according to the two-round sorting result. It should be noted that in some other embodiments, corresponding weight thresholds and confidence thresholds may also be set to filter the labels.
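For illustration only, the two-round ranking and threshold filtering described above might look like the following Python sketch; the label record layout, the thresholds and the value of K are assumptions:

    QUALITY_WEIGHT = {"High": 1.0, "Medium": 0.5, "Low": 0.0}  # Table 1

    def rank_labels(labels, k=10, weight_threshold=0.0, confidence_threshold=0.0):
        # labels: list of dicts such as {"tag": ..., "quality": ..., "confidence": ...}
        scored = [
            dict(label, weight=QUALITY_WEIGHT[label["quality"]])
            for label in labels
            if QUALITY_WEIGHT[label["quality"]] >= weight_threshold
            and label["confidence"] >= confidence_threshold
        ]
        # Round one: sort by quality weight; round two: break ties by comprehensive confidence.
        scored.sort(key=lambda label: (label["weight"], label["confidence"]), reverse=True)
        return scored[:k]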
In some optional embodiments, the process of performing natural language analysis on the text content in the method step S200 and generating semantic tags according to the analysis result may include steps S210-S240:
s210, extracting a description text from a video file, and formatting the description text to obtain first formatting information;
specifically, in the implementation process, because short videos have good structural information, the title (Name) and Description of the video can be acquired; Name and Description are split on special characters such as "_", quotation marks or "|" and normalized, and {T_name1, ..., T_namej} denotes the resulting set of normalized text information.
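A minimal sketch of this normalization follows; the exact separator set is an assumption:

    import re

    def normalize_text_info(name, description):
        # Split Name and Description on separators such as "_", "|" and quotation marks,
        # producing the normalized text set {T_name1, ..., T_namej}.
        raw = f"{name} {description}"
        parts = re.split(r'[_|"“”]+', raw)
        return [p.strip() for p in parts if p.strip()]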
S220, extracting a subtitle text from the video file;
specifically, in the implementation process, the video physical file can be acquired through a video source address, and the video subtitles are extracted.
S230, performing natural language processing on the first formatting information and the subtitle text to obtain a key entity matrix;
specifically, in the implementation process, the text information matrix learns the features by using a GRU-based network structure, the features of all text information (including first formatting information and subtitle text) are accessed to a CRF decoding layer to complete sequence labeling, and corresponding word boundaries and parts of speech in the text information and the relation between entity categories are output. The output part-of-speech tags contain 24 part-of-speech tags (lower case letters), 4 professional category tags (upper case letters), and the names of people, places, organizations and TIME are marked by upper case and lower case (PER/LOC/ORG/TIME and nr/ns/nt/t), wherein the lower case represents information such as names of people with low degree of position. Finally, the key entity matrix is output by deleting non-key entities such as adjectives, adverbs, quantifiers, pronouns, prepositions, adverbs and the like:
The key entity matrix has the form [(Entity_1, IMP_1), (Entity_2, IMP_2), ..., (Entity_i, IMP_i), ...],
wherein Entity_i represents the i-th de-duplicated key entity obtained from {T_name1, ..., T_namej};
IMP_i indicates the corresponding degree of importance and is computed from a word-frequency term and an inverse-frequency term: the word-frequency term represents the number of occurrences of Entity_i in {T_name1, ..., T_namej} divided by j, and the inverse-frequency term is based on the number of T_name entries containing Entity_i, plus 1.
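The exact importance expression is given as a formula image in the publication; the following Python sketch therefore assumes a TF-IDF-style product consistent with the surrounding description:

    import math

    def entity_importance(entity, text_infos):
        # text_infos: the normalized set {T_name1, ..., T_namej}; j = len(text_infos).
        j = len(text_infos)
        occurrences = sum(t.count(entity) for t in text_infos)
        word_freq = occurrences / j                    # word-frequency term
        containing = sum(1 for t in text_infos if entity in t)
        inv_freq = math.log(j / (containing + 1))      # inverse-frequency term with +1 smoothing
        return word_freq * inv_freq                    # assumed TF-IDF-style combination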
S240, inputting the key entity matrix into a semantic prediction model, and determining a semantic label according to a model prediction result;
specifically, in the implementation process, the first column of the key entity matrix, array(Entity)[:, 0], is fed into the semantic prediction model, which outputs a corresponding label and confidence matrix of entries (Tag_iz, CL_iz), wherein Tag_iz represents a label obtained from Entity_i and CL_iz its confidence. The matrix is compressed by merging entries under the same Tag, and a de-duplicated prediction tag matrix PredTag, sorted from high to low according to text accuracy, is finally generated; the tag information of array(PredTag)[:K] is then returned.
In order to achieve a higher label coverage rate, in some possible embodiments labeling may also make use of derivative labels. Further, after the process of performing natural language analysis on the text content and generating semantic labels according to the analysis result, the method in the embodiment may further include step S250: acquiring the structured information of the description text, and matching the structured information with the semantic knowledge graph to obtain derivative labels;
specifically, in the implementation process, the embodiment can match the video knowledge graph using structured information such as the title and description to obtain the derivative labels; the semantic labels and the derivative labels are passed to the decision analysis subsystem for weight decision calculation. Finally, the video file is clipped according to the derivative labels, semantic labels, audio labels and content labels, and the clipped materials are fused to obtain the target video.
The semantic label extraction method of this embodiment can still produce highly usable classification labels even when the quality of the video and audio content is not high or the effective information is insufficient.
In some possible embodiments, the process of performing audio matching according to audio content in step S300 and generating an audio tag according to a matching result and an audio knowledge graph in the method may include steps S310 to S340:
s310, converting the audio content to obtain text information, and adding the text information into the text content;
specifically, in the implementation process, the audio extracted from the video is converted into corresponding text information by speech-to-text translation, the text information is added to the text content, and the corresponding semantic labels are then generated in step S200.
S320, extracting an audio fingerprint according to the audio content;
s330, matching in a fingerprint database according to the audio fingerprints to determine candidate audio;
s340, inputting the candidate audio into an audio knowledge graph for matching to obtain an audio label;
specifically, in the implementation process, besides the label generation mode of converting audio content into text content, the embodiment can also generate corresponding audio fingerprints for the audio, match them against the fingerprint database, and find the audio with a similarity above 80%; the similar audio is then input into the audio knowledge graph to match the graph audio labels, and decision calculation is performed according to those audio labels.
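A minimal sketch of the fingerprint matching and knowledge-graph lookup follows; similarity() and audio_kg.lookup() are placeholder helpers assumed for illustration, not the patented implementation:

    def match_audio_tags(audio_fingerprint, fingerprint_db, audio_kg, threshold=0.8):
        # fingerprint_db: iterable of (audio_id, fingerprint) pairs.
        candidates = [
            audio_id
            for audio_id, fp in fingerprint_db
            if similarity(audio_fingerprint, fp) >= threshold   # keep audio with >= 80% similarity
        ]
        # audio_kg.lookup() stands in for the audio knowledge-graph matching step.
        return [tag for audio_id in candidates for tag in audio_kg.lookup(audio_id)]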
In some possible embodiments, the process of generating content tags by performing content prediction according to video key frames in method step S400 may include steps S410-S440:
s410, fragmenting the video file to obtain a plurality of video frame files;
on the premise of retaining as much of the key video information as possible, the embodiment extracts preliminary key frames with a segmented key frame extraction method, then applies a content-feature-based hierarchical clustering algorithm for key frame dimensionality reduction, so that the video content prediction time is reduced by using the fewest key frames. Specifically, segmented key frame extraction is performed first: with the total number of video frames denoted frame_i, setting the number of segments segment to 60 and the number of frames truncated per segment perSegmentFrame to 1 yields a 60 × 1 video frame matrix (one frame taken from each segment).
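For illustration only, the segmented sampling could look like the following sketch; the seek-based sampling strategy is an assumption:

    import cv2

    def extract_segmented_frames(video_path, segments=60, per_segment=1):
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        step = max(total // segments, 1)
        frames = []
        for s in range(segments):
            cap.set(cv2.CAP_PROP_POS_FRAMES, min(s * step, max(total - 1, 0)))
            ok, frame = cap.read()
            if ok:
                frames.append(frame)   # one truncated frame per segment
        cap.release()
        return frames                  # roughly the segments x perSegmentFrame frame matrix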
S420, graying the video frame file to obtain a gray picture, and calculating according to the intra-class difference and the inter-class difference of the gray picture to obtain a content feature matrix;
specifically, in the implementation process, the segmented key frames are extracted and the content features are then constructed. The frame is grayed so that all picture pixels are represented by values 0-255, with Background defined as pixel values less than 120 and Foreground as pixel values greater than or equal to 120. The foreground color proportion F is the number of foreground pixels divided by the total number of pixels, and the background color proportion B is the number of background pixels divided by the total number of pixels. With the foreground color mean and variance FA, FV and the background color mean and variance BA, BV, the intra-class difference is

ID = F × FV² + B × BV²

and the inter-class difference is

OD = F × B × (FA − BA)²

Taking Min(ID) as a threshold value and comparing it with each pixel point yields the [0,1] content feature matrix of the picture.
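A minimal sketch of this content feature construction, assuming a fixed foreground/background split at 120 (the Min(ID)-based thresholding across frames is simplified here to a per-frame comparison):

    import cv2
    import numpy as np

    def content_feature(frame, split=120):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float64)
        fg, bg = gray[gray >= split], gray[gray < split]
        F, B = fg.size / gray.size, bg.size / gray.size          # foreground / background proportions
        FA, FV = (fg.mean(), fg.var()) if fg.size else (0.0, 0.0)
        BA, BV = (bg.mean(), bg.var()) if bg.size else (0.0, 0.0)
        intra = F * FV ** 2 + B * BV ** 2                        # ID = F x FV^2 + B x BV^2
        inter = F * B * (FA - BA) ** 2                           # OD = F x B x (FA - BA)^2
        feature = (gray >= split).astype(np.uint8)               # [0,1] content feature matrix
        return feature, intra, inter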
S430, reducing the dimension of the content feature matrix, and determining to obtain a key frame according to the difference between the reduced content feature matrices;
specifically, in the implementation process, after the content feature is constructed, the key frame dimension reduction is performed, and the key frame dimension reduction process in the embodiment may further include steps S331 and S332:
s331, constructing a first feature class set according to the content feature matrix, and performing iteration to generate a plurality of second feature class sets; wherein the iterative process comprises: calculating the distance between each category in the first characteristic class set and generating a distance matrix; extracting the minimum element in the distance matrix to construct a second feature class set;
s332, extracting a first picture in the video frame file cluster corresponding to the second feature class set as a key frame;
illustratively, in the embodiment a hierarchical clustering method is used to reduce the dimensionality of the 60-dimensional frame matrix and find the 10 pictures with the largest differences; the number of clusters is set to 10, and the steps are as follows:
A. First, the 60-dimensional frame matrix is used as the initial sample, and each frame forms its own content feature class: G1(0), ..., G60(0). The single-link distances between the classes are calculated, resulting in a 60 × 60 distance matrix, where "0" denotes the initial state.
B. Let D(n) be the obtained distance matrix (n is the number of successive clustering merges). Find the minimum element in D(n) and merge the two corresponding classes into one, thereby establishing a new classification: G1(n+1), G2(n+1), ...;
C. Then calculate the distances between the merged classes to obtain D(n+1).
D. Jump back to step B and repeat the calculation and merging.
E. The traversal ends when the classes have been reduced to G10; the first picture of each cluster is taken as a key frame after dimensionality reduction.
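A minimal sketch of this dimensionality reduction using SciPy's single-link agglomerative clustering; the flattened feature layout is an assumption:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def select_key_frames(features, frames, n_clusters=10):
        # features: one flattened [0,1] content feature matrix per candidate frame (60 rows).
        X = np.asarray(features, dtype=float).reshape(len(frames), -1)
        Z = linkage(X, method="single")                  # single-link merging, as in steps A-D
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        key_frames = []
        for c in range(1, n_clusters + 1):
            members = np.where(labels == c)[0]
            if members.size:
                key_frames.append(frames[members[0]])    # first picture of each cluster (step E)
        return key_frames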
S440, inputting the key frame into a video content prediction model, and determining a content label according to a model prediction result;
specifically, in the implementation process, the 10 frames after the dimension reduction are put into a video content prediction algorithm to obtain final content label information.
In some alternative embodiments, in order to generate highly accurate classification labels in the case of multimodal fusion, the multi-type label decision process may include:
A. calculating the video content quality:
the duration, total frame count and resolution of the short video are obtained as feature values, an RF random forest algorithm is adopted to classify the video content quality, and one of the three labels CQ = {High, Medium, Low} is output.
B. Calculating the video text quality:
the method comprises the steps of acquiring the number of title special symbols, the length and the OCR recognition result ratio (OCR result/frame number 100) of a video as characteristic values, classifying the video text quality by adopting an RF random forest algorithm, and outputting three labels of TQ { (High, Medium and Low }.
C. Calculating video and audio quality
The audio length of the video is obtained and the audio spectrum is used as a feature value; an RF random forest algorithm is adopted to classify the video audio quality, and one of the three labels AQ = {High, Medium, Low} is output.
D. The corresponding label weights are selected according to the quality function matrix, wherein CT is the content label, AT is the audio label and TT is the text label; finally, the labels are sorted from high to low by weight and the Top-K label information is output.
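For illustration only, the quality classification (steps A-C) and the weight-based decision (step D) might be sketched as follows; the feature layouts and the reuse of the Table 1 weight mapping are simplifications, not the exact quality function matrix:

    from sklearn.ensemble import RandomForestClassifier

    def train_quality_classifier(features, quality_labels):
        # features: per-video feature rows (e.g. duration, frame count, resolution);
        # quality_labels: values in {"High", "Medium", "Low"}.
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(features, quality_labels)
        return clf

    def decide_labels(ct, at, tt, cq, aq, tq, k=10):
        # ct/at/tt: (tag, confidence) lists for content, audio and text labels;
        # cq/aq/tq: predicted quality ("High"/"Medium"/"Low") for each modality.
        weights = {"High": 1.0, "Medium": 0.5, "Low": 0.0}
        pool = (
            [(tag, weights[cq], conf) for tag, conf in ct]
            + [(tag, weights[aq], conf) for tag, conf in at]
            + [(tag, weights[tq], conf) for tag, conf in tt]
        )
        pool.sort(key=lambda item: (item[1], item[2]), reverse=True)
        return pool[:k]   # Top-K label information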
Referring to fig. 3, a detailed description is made of a practical application process of the present application:
in a certain system, the application layer provides a video classification label prediction entry through which a user can upload a short video and edit the video text information; the prediction result can be manually corrected and finally stored in the database, so that the next time the short video is opened its label information can be consulted in real time, and the number and operation status of each classification label in the current music library can be counted and displayed. The flow for adding a new video is as follows:
1) Uploading and editing the short video: the user can upload the short video on the application-layer page, either locally or via a URL, and edit text information such as the short-video title, notes and singer.
2) And calling a classification label prediction interface and displaying a prediction result.
3) And manually correcting the prediction result, carrying out negative feedback on inaccurate labels, and manually removing improper labels.
Referring to fig. 4, another practical application process of the present application is described:
the existing short video resources of the music library are processed in batch in an off-line mode, and a corresponding short video classification label library is generated, so that rich label information is provided for short video recommendation. The process of recommending and constructing the user preference model and the video similarity model comprises the following steps:
1) and processing the short video resources of the song library in batch off line.
2) Constructing a user preference model: and associating the label information of the video according to the past user behavior of the user to generate a corresponding user preference model.
3) Constructing a video similarity model: and constructing a video similarity matrix according to the label relevance and similarity between the videos, and providing more video sources with the same label for video recommendation.
Referring to fig. 5, another practical application process of the present application is described:
the existing short video resources of the song library are processed in batch in an off-line mode, a corresponding short video classification label library is generated, and a search hit result is returned by adding a label matching strategy. The process of searching the synchronization index is as follows:
1) processing short video resources of the song library in batches in an off-line mode: and (4) performing T +1 off-line processing on the new video every day, generating corresponding label information, and storing the label information in a database.
2) Synchronization to the search index repository: and synchronizing the video tag information to the search index database at regular increments each day.
In a second aspect, the technical solution of the present application further provides a self-adaptive image cropping fusion system, including:
the preprocessing module is used for acquiring a video file and extracting text content, audio content and video key frames from the video file;
the semantic analysis module is used for carrying out natural language analysis on the text content and generating a semantic label according to the analysis result;
the audio analysis module is used for carrying out audio matching according to the audio content and generating an audio label according to a matching result and an audio knowledge graph;
the content analysis module is used for predicting content according to the video key frame to generate a content label;
and the decision analysis module is used for cutting the video file according to the semantic label, the audio label and the content label, and fusing the cut materials to obtain the target video.
In a third aspect, the present disclosure also provides a computer device for adaptive image cropping and fusion, including at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to perform the adaptive picture cropping fusion method as in the first aspect.
The embodiment of the invention also provides a storage medium in which a program is stored, and the program is executed by the processor to realize the self-adaptive picture cutting and fusing method.
From the above specific implementation process, it can be concluded that the technical solution provided by the present invention has the following advantages or advantages compared to the prior art:
1. The technical solution of the present application can solve the problems that manual labeling is inefficient and demanding on personnel, and can generate a large number of short-video labels rapidly.
2. It removes the limitation of high requirements on the video content, and can still assign highly usable classification labels when complex scenes appear in the video content.
3. It can solve the problem of insufficient classification label effectiveness, and can still assign highly effective classification labels when the video content does not reflect a large amount of information about the video.
4. After the user uploads a video, the video can be automatically analyzed and a classification label preview can be generated in real time.
5. It allows the user to classify and label single or batch videos and generate JSON label information, improving classification efficiency while providing a unified label service for third parties.
6. It allows the user to define the Top-K label information, reducing the interference of tail labels on short-video accuracy.
7. It introduces a comprehensive confidence concept for classification labels, so that the classification labels carry different weight ratios, while providing freely selectable reference values for third parties.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. The adaptive picture clipping and fusing method is characterized by comprising the following steps of:
acquiring a video file, and extracting text content, audio content and video key frames from the video file;
performing natural language analysis on the text content, and generating a semantic label according to an analysis result;
performing audio matching according to the audio content, and generating an audio label according to a matching result and an audio knowledge graph;
predicting the content according to the video key frame to generate a content label;
and clipping the video file according to the semantic label, the audio label and the content label, and fusing the clipped materials to obtain a target video.
2. The adaptive picture cropping fusion method of claim 1, wherein said step of cropping the video file according to the semantic tag, the audio tag, and the content tag comprises:
determining a tag weight value of each tag in a tag set, wherein the tag set comprises the semantic tag, the audio tag and the content tag;
and generating a label sequence according to the label weight value, and labeling the label information to the video file according to the label sequence and the confidence degree of each label in the sequence.
3. The adaptive picture cropping fusion method according to claim 1, wherein the text content comprises description text and subtitle text; the step of performing natural language analysis on the text content and generating a semantic label according to an analysis result includes:
extracting the description text from the video file, and formatting the description text to obtain first formatting information;
extracting the subtitle text from the video file;
performing natural language processing on the first formatting information and the subtitle text to obtain a key entity matrix;
and inputting the key entity matrix into a semantic prediction model, and determining the semantic label according to a model prediction result.
4. The adaptive image cropping and fusion method according to claim 3, wherein after the step of performing natural language analysis on the text content and generating semantic tags according to the analysis result, the method further comprises:
acquiring structural information of the description text, and matching the structural information with a semantic knowledge graph to obtain a derivative label;
and cutting the video file according to the derived label, the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video.
5. The adaptive picture cropping and fusing method according to claim 1, wherein the step of performing audio matching according to the audio content and generating an audio tag according to a matching result and an audio knowledge graph comprises:
converting the audio content to obtain text information, and adding the text information into the text content;
extracting an audio fingerprint according to the audio content;
matching in a fingerprint database according to the audio fingerprints to determine candidate audio;
and inputting the candidate audio into the audio knowledge graph to be matched to obtain the audio label.
6. The adaptive picture cropping fusion method according to claim 1, wherein said step of performing content prediction based on said video key frames to generate content tags comprises:
slicing the video file to obtain a plurality of video frame files;
graying the video frame file to obtain a gray picture, and calculating according to the intra-class difference and the inter-class difference of the gray picture to obtain a content feature matrix;
reducing the dimension of the content feature matrix, and determining to obtain a key frame according to the difference between the content feature matrices after dimension reduction;
and inputting the key frame into a video content prediction model, and determining the content label according to a model prediction result.
7. The adaptive picture cropping and fusing method according to claim 6, wherein the step of performing dimension reduction on the content feature matrix and determining to obtain a key frame according to a difference between the content feature matrices after the dimension reduction comprises:
constructing a first feature class set according to the content feature matrix, and performing iteration to generate a plurality of second feature class sets;
extracting a first picture in a video frame file cluster corresponding to the second feature class set as the key frame;
the iterative process comprises:
calculating the distance between each category in the first feature class set and generating a distance matrix;
and extracting the minimum element in the distance matrix to construct a second feature class set.
8. The adaptive picture cropping and fusing system is characterized by comprising:
the system comprises a preprocessing module, a video processing module and a video processing module, wherein the preprocessing module is used for acquiring a video file and extracting text content, audio content and video key frames from the video file;
the semantic analysis module is used for carrying out natural language analysis on the text content and generating a semantic label according to an analysis result;
the audio analysis module is used for carrying out audio matching according to the audio content and generating an audio label according to a matching result and an audio knowledge graph;
the content analysis module is used for predicting the content according to the video key frame to generate a content label;
and the decision analysis module is used for cutting the video file according to the semantic label, the audio label and the content label, and fusing the cut materials to obtain a target video.
9. Self-adaptation picture is tailor and is fused computer equipment, its characterized in that includes:
at least one processor;
at least one memory for storing at least one program;
when executed by the at least one processor, cause the at least one processor to perform the adaptive picture cropping fusion method according to any one of claims 1 to 7.
10. A storage medium having stored therein a processor-executable program, wherein the processor-executable program, when executed by a processor, is configured to execute the adaptive picture cropping fusion method according to any one of claims 1-7.
CN202111564539.6A 2021-12-20 2021-12-20 Adaptive picture clipping and fusing method, system, computer device and medium Pending CN114218437A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111564539.6A CN114218437A (en) 2021-12-20 2021-12-20 Adaptive picture clipping and fusing method, system, computer device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111564539.6A CN114218437A (en) 2021-12-20 2021-12-20 Adaptive picture clipping and fusing method, system, computer device and medium

Publications (1)

Publication Number Publication Date
CN114218437A true CN114218437A (en) 2022-03-22

Family

ID=80704391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111564539.6A Pending CN114218437A (en) 2021-12-20 2021-12-20 Adaptive picture clipping and fusing method, system, computer device and medium

Country Status (1)

Country Link
CN (1) CN114218437A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116112763A (en) * 2022-11-15 2023-05-12 国家计算机网络与信息安全管理中心 Method and system for automatically generating short video content labels

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100121844A1 (en) * 2008-11-07 2010-05-13 Yahoo! Inc. Image relevance by identifying experts
CN104090955A (en) * 2014-07-07 2014-10-08 科大讯飞股份有限公司 Automatic audio/video label labeling method and system
US20150082330A1 (en) * 2013-09-18 2015-03-19 Qualcomm Incorporated Real-time channel program recommendation on a display device
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
US20200084519A1 (en) * 2018-09-07 2020-03-12 Oath Inc. Systems and Methods for Multimodal Multilabel Tagging of Video
CN113407645A (en) * 2021-05-19 2021-09-17 福建福清核电有限公司 Intelligent sound image archive compiling and researching method based on knowledge graph
WO2021184153A1 (en) * 2020-03-16 2021-09-23 阿里巴巴集团控股有限公司 Summary video generation method and device, and server

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100121844A1 (en) * 2008-11-07 2010-05-13 Yahoo! Inc. Image relevance by identifying experts
US20150082330A1 (en) * 2013-09-18 2015-03-19 Qualcomm Incorporated Real-time channel program recommendation on a display device
CN104090955A (en) * 2014-07-07 2014-10-08 科大讯飞股份有限公司 Automatic audio/video label labeling method and system
US20200084519A1 (en) * 2018-09-07 2020-03-12 Oath Inc. Systems and Methods for Multimodal Multilabel Tagging of Video
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization
WO2021184153A1 (en) * 2020-03-16 2021-09-23 阿里巴巴集团控股有限公司 Summary video generation method and device, and server
CN113407645A (en) * 2021-05-19 2021-09-17 福建福清核电有限公司 Intelligent sound image archive compiling and researching method based on knowledge graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
迪潘简•萨卡尔 (Dipanjan Sarkar): "Phrase extraction based on weighted labels", in "Data Science and Engineering Technology Series: Python Text Analysis" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116112763A (en) * 2022-11-15 2023-05-12 国家计算机网络与信息安全管理中心 Method and system for automatically generating short video content labels

Similar Documents

Publication Publication Date Title
CN112818906B (en) Intelligent cataloging method of all-media news based on multi-mode information fusion understanding
CN101021855B (en) Video searching system based on content
CN107590224B (en) Big data based user preference analysis method and device
CN109862397B (en) Video analysis method, device, equipment and storage medium
CN114297439B (en) Short video tag determining method, system, device and storage medium
CN111274442B (en) Method for determining video tag, server and storage medium
CN113590850A (en) Multimedia data searching method, device, equipment and storage medium
CN111190997A (en) Question-answering system implementation method using neural network and machine learning sequencing algorithm
CN109033060B (en) Information alignment method, device, equipment and readable storage medium
CN111414735A (en) Text data generation method and device
CN112199932A (en) PPT generation method, device, computer-readable storage medium and processor
CN113591530A (en) Video detection method and device, electronic equipment and storage medium
CN116738250A (en) Prompt text expansion method, device, electronic equipment and storage medium
CN114363695B (en) Video processing method, device, computer equipment and storage medium
JP6389296B1 (en) VIDEO DATA PROCESSING DEVICE, VIDEO DATA PROCESSING METHOD, AND COMPUTER PROGRAM
CN114938473A (en) Comment video generation method and device
CN114218437A (en) Adaptive picture clipping and fusing method, system, computer device and medium
US20220004773A1 (en) Apparatus for training recognition model, apparatus for analyzing video, and apparatus for providing video search service
CN116186310B (en) AR space labeling and displaying method fused with AI general assistant
CN114510564A (en) Video knowledge graph generation method and device
CN107656760A (en) Data processing method and device, electronic equipment
CN115909390B (en) Method, device, computer equipment and storage medium for identifying low-custom content
CN110888896A (en) Data searching method and data searching system thereof
CN114218407A (en) Content creation system based on digital automatic indexing
CN113869043A (en) Content labeling method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20220322