CN111143613B - Method, system, electronic device and storage medium for selecting video cover - Google Patents

Method, system, electronic device and storage medium for selecting video cover Download PDF

Info

Publication number
CN111143613B
Authority
CN
China
Prior art keywords
image
video
target
category
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911395856.2A
Other languages
Chinese (zh)
Other versions
CN111143613A (en)
Inventor
成丹妮
罗超
吉聪睿
胡泓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd filed Critical Ctrip Computer Technology Shanghai Co Ltd
Priority to CN201911395856.2A priority Critical patent/CN111143613B/en
Publication of CN111143613A publication Critical patent/CN111143613A/en
Application granted granted Critical
Publication of CN111143613B publication Critical patent/CN111143613B/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/738 Presentation of query results
    • G06F16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Abstract

The invention discloses a method, a system, an electronic device and a storage medium for selecting a video cover, wherein the method comprises the following steps: extracting multi-frame target images from a target video; identifying the image category corresponding to each frame of target image; determining one of the identified image categories as the video category of the target video; and selecting one frame from the target images corresponding to the video category as the cover of the target video. Based on an understanding of the video content, the invention uses a representative image that fits the video category as the video cover, which not only displays the video information accurately but also allows users to browse and screen videos quickly, thereby improving user stickiness and, in turn, the booking conversion rate on an OTA platform.

Description

Method, system, electronic device and storage medium for selecting video cover
Technical Field
The present invention relates to the field of computers, and in particular, to a method, a system, an electronic device, and a storage medium for selecting a video cover.
Background
With the continuous enrichment of Internet information and the continuous upgrading of Internet technology, traditional text and picture information can no longer meet users' needs for browsing information, which has driven the rapid development of video information technology. For example, for an OTA (Online Travel Agency), an accurate and aesthetically pleasing presentation of video information can greatly increase user stickiness and thus the booking conversion rate on the OTA platform. As the first impression of the video content, the video cover strongly influences the user's willingness to click, especially when the video display area is limited. Current OTA platforms usually use the first or last frame of the video as the video cover, which hides the highlights of the video, makes it difficult to attract users' interest, and results in a poor user experience.
Disclosure of Invention
The invention aims to overcome the defect in the prior art that the first or last frame of a video is used as the video cover, and provides a method, a system, an electronic device and a storage medium for selecting a video cover.
The invention solves the above technical problem through the following technical solutions:
a method of selecting a video cover, the method comprising:
extracting multi-frame target images from the target video;
identifying an image category corresponding to the target image of each frame;
determining one of the identified image categories as a video category of the target video;
and selecting one frame from the target images corresponding to the video category as the cover of the target video.
Preferably, after the step of extracting the multi-frame target image from the target video, the method further comprises:
filtering the extracted multi-frame target image according to the filtering condition; wherein the filtering conditions include:
at least one of: the brightness of the target image being smaller than a first threshold, the sharpness of the target image being smaller than a second threshold, and the color singleness of the target image being greater than a third threshold.
Preferably, the step of identifying the image category corresponding to the target image for each frame includes:
identifying the image category corresponding to the target image of each frame according to the image identification model;
the input of the image recognition model is the target image, and the output of the image recognition model is the image category corresponding to the target image;
and/or,
the step of selecting a frame from the target images corresponding to the video category as the cover of the target video includes:
determining a target image corresponding to the video category as a candidate image;
evaluating the image score corresponding to each frame of candidate image according to an image scoring model;
determining the candidate image with the highest image score as the cover of the target video;
and the input of the image scoring model is the candidate image, and the output of the image scoring model is the image score corresponding to the candidate image.
Preferably, the step of determining one of the identified image categories as the video category of the target video includes:
determining the image category with the largest number of corresponding target images as the video category;
or,
the step of determining one of the identified image categories as a video category of the target video includes:
acquiring comment information corresponding to the target video;
determining the image category matched with the comment information in the identified image categories as a candidate category;
and determining the candidate category with the largest number of corresponding target images as the video category.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing any of the methods of selecting a video cover described above when the computer program is executed.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the methods of selecting a video cover described above.
A system for selecting a video cover, the system comprising:
the extraction module is used for extracting multi-frame target images from the target video;
the identification module is used for identifying the image category corresponding to each frame of the target image;
a determining module, configured to determine one of the identified image categories as a video category of the target video;
and the selection module is used for selecting one frame from the target images corresponding to the video category as the cover of the target video.
Preferably, the system further comprises:
the filtering module is used for filtering the extracted multi-frame target image according to the filtering condition; wherein the filtering conditions include:
at least one of: the brightness of the target image being smaller than a first threshold, the sharpness of the target image being smaller than a second threshold, and the color singleness of the target image being greater than a third threshold.
Preferably, the identification module is specifically configured to identify, according to an image identification model, an image category corresponding to the target image of each frame;
the input of the image recognition model is the target image, and the output of the image recognition model is the image category corresponding to the target image;
and/or,
the selection module comprises:
the first determining unit is used for determining that the target image corresponding to the video category is a candidate image;
the image scoring unit is used for evaluating the image score corresponding to each frame of candidate image according to the image scoring model;
the second determining unit is used for determining the candidate image with the highest image score as the cover of the target video;
and the input of the image scoring model is the candidate image, and the output of the image scoring model is the image score corresponding to the candidate image.
Preferably, the determining module is specifically configured to determine, as the video category, an image category with the largest number of corresponding target images;
or,
the determining module includes:
the acquisition unit is used for acquiring comment information corresponding to the target video;
a third determination unit configured to determine, as a candidate category, an image category matching the comment information among the identified image categories;
and a fourth determining unit, configured to determine, as the video category, a candidate category with the largest number of corresponding target images.
The invention has the following positive effects: based on an understanding of the video content, the invention uses a representative image that fits the video category as the video cover, which not only displays the video information accurately but also allows users to browse and screen videos quickly, thereby improving user stickiness and, in turn, the booking conversion rate on an OTA platform.
Drawings
Fig. 1 is a flowchart of a method of selecting a video cover according to embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the hardware structure of an electronic device according to embodiment 2 of the present invention.
Fig. 3 is a block diagram of a system for selecting a video cover according to embodiment 4 of the present invention.
Detailed Description
The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.
Example 1
The present embodiment provides a method for selecting a video cover, and fig. 1 shows a flowchart of the present embodiment. Referring to fig. 1, the method of the present embodiment includes:
s101, extracting multi-frame target images from target videos.
In this embodiment, considering that the target video contains a large amount of information with high feature dimensionality, the computational load can be reduced by frame extraction. Specifically, for a video with a frame rate of n frames per second and a duration of t, the number of target images to extract can be set to f (f < n×t), and the target video is then sampled at uniform intervals to obtain f frames of target images.
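As a minimal sketch of this uniform-interval extraction, the following Python snippet samples f frames from a video. OpenCV and the helper name extract_target_images are assumptions for illustration; the patent does not prescribe any library:

```python
import cv2  # assumed decoder library; the patent does not name one
import numpy as np

def extract_target_images(video_path: str, f: int):
    """Uniformly sample f frames from the target video (hypothetical helper)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # roughly n * t frames
    assert f < total, "f must be smaller than the total frame count"
    # f evenly spaced frame indices across the whole video
    indices = np.linspace(0, total - 1, num=f, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```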
S102, filtering the extracted multi-frame target image according to the filtering condition.
In this embodiment, in order to further reduce the computational load of processing the target video, target images that perform poorly on objective indexes may be filtered out according to the filtering conditions, where the objective indexes may include, but are not limited to, brightness, sharpness, color singleness, and the like.
For example, in this embodiment, the filtering condition may include that the brightness of the target image is smaller than a first threshold, where the first threshold may be set in a customized manner according to the practical application, and the brightness may be calculated according to the following formula:
Luminance(I_rgb) = 0.2126·I_r + 0.7152·I_g + 0.0722·I_b

In the above formula, I_rgb denotes the color image, and I_r, I_g and I_b denote the red, green and blue channels of the color image, respectively.
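A sketch of this brightness filter, under the assumption that the per-pixel luminance is averaged over the image before comparison with the first threshold (the patent text gives only the per-pixel formula):

```python
import numpy as np

def luminance(img_rgb: np.ndarray) -> float:
    """Mean Rec. 709 luminance of an RGB image with channels in [0, 255]."""
    r, g, b = img_rgb[..., 0], img_rgb[..., 1], img_rgb[..., 2]
    return float(np.mean(0.2126 * r + 0.7152 * g + 0.0722 * b))

def too_dark(img_rgb: np.ndarray, first_threshold: float) -> bool:
    # A frame whose brightness falls below the first threshold is filtered out
    return luminance(img_rgb) < first_threshold
```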
For another example, in this embodiment, the filtering condition may include that the sharpness of the target image is smaller than a second threshold, where the second threshold may be set in a customized manner according to the practical application. The sharpness is computed from the gradients of the grayscale image (the formula itself is not reproduced in the text), where I_gray denotes the grayscale map of the color image, and δ_x and δ_y denote the gradient maps in the x and y directions of the image, respectively.
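Since the sharpness formula is omitted in the text, the sketch below uses the mean gradient magnitude of the grayscale image as one plausible reading of the description; this metric is an assumption, not the patent's confirmed formula:

```python
import numpy as np

def sharpness(img_gray: np.ndarray) -> float:
    """Mean gradient magnitude over the grayscale image I_gray (assumed metric)."""
    delta_y, delta_x = np.gradient(img_gray.astype(np.float64))  # gradients along y and x
    return float(np.mean(np.sqrt(delta_x ** 2 + delta_y ** 2)))

# A frame is filtered out when sharpness(frame) < second_threshold.
```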
For another example, in this embodiment, the filtering condition may include that the color singleness of the target image is greater than a third threshold, where the third threshold may be set in a customized manner according to the practical application. The color singleness is computed from the gray-level histogram hist(·) (the formula itself is not reproduced in the text): the gray levels are sorted by their pixel proportion, and the fraction of all pixels accounted for by the top 5% of gray levels is taken to characterize the color singleness.
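A sketch of the color-singleness filter under the reading above, where gray levels are sorted by pixel proportion and the share held by the top 5% of levels is the singleness score; all names are illustrative:

```python
import numpy as np

def color_singleness(img_gray: np.ndarray) -> float:
    """Fraction of all pixels covered by the top 5% most frequent gray levels."""
    hist, _ = np.histogram(img_gray, bins=256, range=(0, 256))
    top_k = max(1, int(0.05 * 256))    # top 5% of the 256 gray levels
    top = np.sort(hist)[::-1][:top_k]  # most frequent gray levels first
    return float(top.sum() / hist.sum())

# A frame is filtered out when color_singleness(frame) > third_threshold.
```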
S103, identifying the image category corresponding to each frame of target image.
In this embodiment, specifically, the image category corresponding to each frame of target image may be identified according to an image recognition model, where the input of the image recognition model is a frame of target image and the output is the image category corresponding to that frame of target image.
Specifically, the image recognition model of this embodiment may include 159 network layers and adopt 7 dense blocks, where the size of the feature map within each dense block is unchanged and different convolution layers within a dense block are connected by skip connections to ensure the transfer of feature information. The activation function of the last network layer may be a softmax function with N neurons, the output value p_i of neuron i (i being a positive integer less than or equal to N) lying between 0 and 1, and the network weights can be updated during training by back-propagating a cross-entropy loss. For each image category i, a threshold τ_i can be set; when p_i ≥ τ_i, the target image of that frame is considered to contain the label of image category i.
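The per-category thresholding step can be sketched as follows; probs stands for the network's output p_1..p_N for one frame and taus for the thresholds τ_i, both illustrative names not taken from the patent:

```python
import numpy as np

def frame_labels(probs: np.ndarray, taus: np.ndarray, categories: list) -> list:
    """Return every image category i whose output probability satisfies p_i >= tau_i."""
    return [c for c, p, t in zip(categories, probs, taus) if p >= t]

# e.g. frame_labels(np.array([0.7, 0.2, 0.1]),
#                   np.array([0.5, 0.5, 0.5]),
#                   ["swimming pool", "front desk", "transition frame"])
# -> ["swimming pool"]
```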
In this embodiment, a set of image categories is obtained after the extracted multi-frame target images are respectively input into the image recognition model. For example, the output of the image recognition model may include transition frame, front desk, swimming pool, appearance, and so on, where the transition frame label indicates that the frame carries no content that meaningfully represents an image category, while front desk, swimming pool, appearance, and the like denote actual image categories. The tag sequence obtained by processing a target video may, for instance, be: appearance, transition frame, front desk, front desk, hall, transition frame, transition frame, swimming pool, swimming pool, swimming pool, transition frame.
S104, determining one of the identified image categories as a video category of the target video.
Specifically, in this embodiment, the image category with the largest number of corresponding target images may be determined as the video category. For example, in the tag sequence of image categories shown above, the image category with the largest number of corresponding target images is swimming pool, so swimming pool can be determined as the video category of the target video.
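A minimal sketch of this majority vote over the per-frame labels, skipping transition frames since they carry no category meaning; the helper name is illustrative:

```python
from collections import Counter

def video_category(labels: list) -> str:
    """Pick the image category backed by the largest number of frames."""
    counts = Counter(l for l in labels if l != "transition frame")
    return counts.most_common(1)[0][0]

# With the tag sequence above, swimming pool (3 frames) wins over
# front desk (2 frames), appearance and hall (1 frame each).
```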
In this embodiment, the video category may also be determined in combination with comment information corresponding to the target video, where the comment information may include, but is not limited to, comments, descriptions, and the like. Specifically, step S104 may include: acquiring the comment information corresponding to the target video; determining the image categories that match the comment information among the identified image categories as candidate categories; and determining the candidate category with the largest number of corresponding target images as the video category. For example, when the obtained comment information is "this hotel is great, the front desk staff are enthusiastic, the swimming pool is large, and the rooms are comfortable", matching the keywords against the tag sequence of image categories shown above yields front desk and swimming pool as the matching image categories; since the number of target images corresponding to swimming pool is greater than the number corresponding to front desk, swimming pool can be determined as the video category of the target video.
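The comment-guided variant can be sketched as simple keyword matching between category names and the comment text, falling back to the plain majority vote when nothing matches (the fallback is an assumption, not stated in the patent):

```python
from collections import Counter

def video_category_from_comments(labels: list, comment: str) -> str:
    """Majority vote restricted to categories whose keyword occurs in the comment."""
    counts = Counter(l for l in labels if l != "transition frame")
    matched = {c: n for c, n in counts.items() if c in comment}
    chosen = matched or counts  # assumed fallback when no category matches
    return max(chosen, key=chosen.get)
```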
S105, selecting one frame from target images corresponding to the video categories as a cover of the target video.
In this embodiment, step S105 may include: determining the target images corresponding to the video category as candidate images; evaluating the image score corresponding to each frame of candidate image according to an image scoring model; and determining the candidate image with the highest image score as the cover of the target video, where the input of the image scoring model is a candidate image and the output is the image score corresponding to the candidate image.
Specifically, in this embodiment, the quality of each frame of image may first be evaluated and scored manually to construct a training set. For example, 1000 frames may be randomly extracted from 100 videos and scored by 3 art designers from the perspectives of picture color, composition, and the like, where the score range includes 1, 2, 3, 4 and 5, and the average of the 3 scores, rounded, is taken as the image score of the frame. The image scoring model is then trained on this training set; the image scoring model in this embodiment may include 43 network layers and adopt Res blocks, where the size of the feature map within each Res block is unchanged and different convolution layers within a Res block are connected by skip connections to ensure the transfer of feature information. The activation function of the last network layer may be a softmax function with 5 neurons, the output value p_i of each neuron lying between 0 and 1 and representing the probabilities of the five score-level categories; the network weights can be updated during training by back-propagating a cross-entropy loss. For each score-level category i, the image scoring model outputs the probability p_i, from which the image score of the target image is computed. In this embodiment, after the image score corresponding to each frame of candidate image is determined, if several candidate images share the highest image score, the first of these frames may be selected as the cover of the target video.
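Because the scoring formula is also omitted in the text, the sketch below takes the expectation of the five score levels under the softmax probabilities, one natural reading of "an image score representing the target image"; the tie-break on the earliest frame follows the description above:

```python
import numpy as np

def image_score(probs: np.ndarray) -> float:
    """Assumed expected score over the five levels: sum of i * p_i for i = 1..5."""
    return float(np.dot(np.arange(1, 6), probs))

def pick_cover(candidate_probs: list) -> int:
    """Index of the cover frame: highest score, earliest frame on ties."""
    scores = [image_score(p) for p in candidate_probs]
    return int(np.argmax(scores))  # argmax returns the first maximum
```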
Based on an understanding of the video content, this embodiment uses a representative image that fits the video category as the video cover. This not only displays the video information accurately but also allows users to browse and screen videos quickly, which improves user stickiness and, in turn, the booking conversion rate on the OTA platform. Moreover, images with poor objective quality are filtered out and the remaining images are scored on picture aesthetics, so that both the content and the quality of the video are considered and the finally selected cover is the most representative, high-quality frame, providing a better visual experience and increasing the user's click-through rate on the video.
Example 2
The present embodiment provides an electronic device, which may take the form of a computing device (for example, a server device) comprising a memory, a processor and a computer program stored on the memory and executable on the processor, where the processor implements the method of selecting a video cover provided in embodiment 1 when executing the computer program.
Fig. 2 shows a schematic hardware structure of the present embodiment, and as shown in fig. 2, the electronic device 9 specifically includes:
at least one processor 91, at least one memory 92, and a bus 93 for connecting the different system components (including the processor 91 and the memory 92), wherein:
the bus 93 includes a data bus, an address bus, and a control bus.
The memory 92 includes volatile memory such as Random Access Memory (RAM) 921 and/or cache memory 922, and may further include Read Only Memory (ROM) 923.
Memory 92 also includes a program/utility 925 having a set (at least one) of program modules 924, such program modules 924 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor 91 executes various functional applications and data processing, such as the method of selecting a video cover provided in embodiment 1 of the present invention, by running the computer program stored in the memory 92.
The electronic device 9 may further communicate with one or more external devices 94 (e.g., keyboard, pointing device, etc.). Such communication may occur through an input/output (I/O) interface 95. Also, the electronic device 9 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 96. The network adapter 96 communicates with other modules of the electronic device 9 via the bus 93. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in connection with the electronic device 9, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module according to embodiments of the present application; conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Example 3
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of selecting a video cover provided by embodiment 1.
More specifically, the readable storage medium may include, but is not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible embodiment, the invention may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to carry out the steps of the method of selecting a video cover in embodiment 1.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
Example 4
The present embodiment provides a system for selecting a video cover, and fig. 3 shows a schematic block diagram of the present embodiment. Referring to fig. 3, the system of the present embodiment includes:
and the extraction module 1 is used for extracting multi-frame target images from the target video.
In this embodiment, considering that the target video contains a large amount of information with high feature dimensionality, the computational load can be reduced by frame extraction. Specifically, for a video with a frame rate of n frames per second and a duration of t, the number of target images to extract can be set to f (f < n×t), and the target video is then sampled at uniform intervals to obtain f frames of target images.
And the filtering module 2 is used for filtering the extracted multi-frame target image according to the filtering condition.
In this embodiment, in order to further reduce the computational load of processing the target video, target images that perform poorly on objective indexes may be filtered out according to the filtering conditions, where the objective indexes may include, but are not limited to, brightness, sharpness, color singleness, and the like.
For example, in this embodiment, the filtering condition may include that the brightness of the target image is smaller than a first threshold, where the first threshold may be set in a customized manner according to the practical application, and the brightness may be calculated according to the following formula:
Luminance(I_rgb) = 0.2126·I_r + 0.7152·I_g + 0.0722·I_b

In the above formula, I_rgb denotes the color image, and I_r, I_g and I_b denote the red, green and blue channels of the color image, respectively.
For another example, in this embodiment, the filtering condition may include that the sharpness of the target image is smaller than a second threshold, where the second threshold may be set in a customized manner according to the practical application. The sharpness is computed from the gradients of the grayscale image (the formula itself is not reproduced in the text), where I_gray denotes the grayscale map of the color image, and δ_x and δ_y denote the gradient maps in the x and y directions of the image, respectively.
For another example, in this embodiment, the filtering condition may include that the color singleness of the target image is greater than a third threshold, where the third threshold may be set in a customized manner according to the practical application. The color singleness is computed from the gray-level histogram hist(·) (the formula itself is not reproduced in the text): the gray levels are sorted by their pixel proportion, and the fraction of all pixels accounted for by the top 5% of gray levels is taken to characterize the color singleness.
And the identification module 3 is used for identifying the image category corresponding to each frame of target image.
In this embodiment, the identification module 3 may specifically identify, according to an image recognition model, the image category corresponding to each frame of target image, where the input of the image recognition model is a frame of target image and the output is the image category corresponding to that frame of target image.
Specifically, the image recognition model of this embodiment may include 159 network layers and adopt 7 dense blocks, where the size of the feature map within each dense block is unchanged and different convolution layers within a dense block are connected by skip connections to ensure the transfer of feature information. The activation function of the last network layer may be a softmax function with N neurons, the output value p_i of neuron i (i being a positive integer less than or equal to N) lying between 0 and 1, and the network weights can be updated during training by back-propagating a cross-entropy loss. For each image category i, a threshold τ_i can be set; when p_i ≥ τ_i, the target image of that frame is considered to contain the label of image category i.
In this embodiment, a set of image categories is obtained after the extracted multi-frame target images are respectively input into the image recognition model. For example, the output of the image recognition model may include transition frame, front desk, swimming pool, appearance, and so on, where the transition frame label indicates that the frame carries no content that meaningfully represents an image category, while front desk, swimming pool, appearance, and the like denote actual image categories. The tag sequence obtained by processing a target video may, for instance, be: appearance, transition frame, front desk, front desk, hall, transition frame, transition frame, swimming pool, swimming pool, swimming pool, transition frame.
A determining module 4, configured to determine one of the identified image categories as a video category of the target video.
Specifically, in this embodiment, the image category with the largest number of corresponding target images may be determined as the video category. For example, in the tag sequence of image categories shown above, the image category with the largest number of corresponding target images is swimming pool, so swimming pool can be determined as the video category of the target video.
In this embodiment, the video category may also be determined in combination with comment information corresponding to the target video, where the comment information may include, but is not limited to, comments, descriptions, and the like. Specifically, the determining module 4 may include an obtaining unit for acquiring the comment information corresponding to the target video, a third determining unit for determining the image categories that match the comment information among the identified image categories as candidate categories, and a fourth determining unit for determining the candidate category with the largest number of corresponding target images as the video category. For example, when the obtained comment information is "this hotel is great, the front desk staff are enthusiastic, the swimming pool is large, and the rooms are comfortable", matching the keywords against the tag sequence of image categories shown above yields front desk and swimming pool as the matching image categories; since the number of target images corresponding to swimming pool is greater than the number corresponding to front desk, swimming pool can be determined as the video category of the target video.
And the selection module 5 is used for selecting one frame from the target images corresponding to the video categories as the cover of the target video.
In this embodiment, the selection module 5 may include a first determining unit for determining the target images corresponding to the video category as candidate images, an image scoring unit for evaluating the image score corresponding to each frame of candidate image according to an image scoring model, and a second determining unit for determining the candidate image with the highest image score as the cover of the target video, where the input of the image scoring model is a candidate image and the output is the image score corresponding to the candidate image.
Specifically, in this embodiment, the quality of each frame of image may first be evaluated and scored manually to construct a training set. For example, 1000 frames may be randomly extracted from 100 videos and scored by 3 art designers from the perspectives of picture color, composition, and the like, where the score range includes 1, 2, 3, 4 and 5, and the average of the 3 scores, rounded, is taken as the image score of the frame. The image scoring model is then trained on this training set; the image scoring model in this embodiment may include 43 network layers and adopt Res blocks, where the size of the feature map within each Res block is unchanged and different convolution layers within a Res block are connected by skip connections to ensure the transfer of feature information. The activation function of the last network layer may be a softmax function with 5 neurons, the output value p_i of each neuron lying between 0 and 1 and representing the probabilities of the five score-level categories; the network weights can be updated during training by back-propagating a cross-entropy loss. For each score-level category i, the image scoring model outputs the probability p_i, from which the image score of the target image is computed. In this embodiment, after the image score corresponding to each frame of candidate image is determined, if several candidate images share the highest image score, the first of these frames may be selected as the cover of the target video.
Based on an understanding of the video content, this embodiment uses a representative image that fits the video category as the video cover. This not only displays the video information accurately but also allows users to browse and screen videos quickly, which improves user stickiness and, in turn, the booking conversion rate on the OTA platform. Moreover, images with poor objective quality are filtered out and the remaining images are scored on picture aesthetics, so that both the content and the quality of the video are considered and the finally selected cover is the most representative, high-quality frame, providing a better visual experience and increasing the user's click-through rate on the video.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are merely examples and that the scope of the invention is defined by the appended claims. Those skilled in the art may make various changes and modifications to these embodiments without departing from the principle and spirit of the invention, and all such changes and modifications fall within the scope of the invention.

Claims (8)

1. A method of selecting a video cover, the method comprising:
extracting multi-frame target images from the target video;
filtering the extracted multi-frame target image according to the filtering condition; wherein the filtering conditions include:
at least one of: the brightness of the target image being less than a first threshold, the sharpness of the target image being less than a second threshold, and the color singleness of the target image being greater than a third threshold;
identifying an image category corresponding to the target image of each frame;
determining one of the identified image categories as a video category of the target video;
the step of determining one of the identified image categories as a video category of the target video includes:
acquiring comment information corresponding to the target video;
determining the image category matched with the comment information in the identified image categories as a candidate category;
determining the candidate category with the largest number of corresponding target images as the video category;
and selecting one frame from the target images corresponding to the video category as the cover of the target video.
2. The method of selecting a video cover as claimed in claim 1, wherein the step of identifying an image category corresponding to the target image for each frame includes:
identifying the image category corresponding to the target image of each frame according to the image identification model;
the input of the image recognition model is the target image, and the output of the image recognition model is the image category corresponding to the target image;
and/or,
the step of selecting a frame from the target images corresponding to the video category as the cover of the target video includes:
determining a target image corresponding to the video category as a candidate image;
evaluating the image score corresponding to each frame of candidate image according to an image scoring model;
determining the candidate image with the highest image score as the cover of the target video;
and the input of the image scoring model is the candidate image, and the output of the image scoring model is the image score corresponding to the candidate image.
3. The method of selecting a video cover as recited in claim 1, wherein the step of determining one of the identified image categories as the video category of the target video comprises:
and determining the image category with the largest number of corresponding target images as the video category.
4. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of selecting a video cover as claimed in any one of claims 1-3 when the computer program is executed.
5. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of selecting a video cover as claimed in any one of claims 1 to 3.
6. A system for selecting a video cover, the system comprising:
the extraction module is used for extracting multi-frame target images from the target video;
the filtering module is used for filtering the extracted multi-frame target image according to the filtering condition; wherein the filtering conditions include:
at least one of: the brightness of the target image being less than a first threshold, the sharpness of the target image being less than a second threshold, and the color singleness of the target image being greater than a third threshold;
the identification module is used for identifying the image category corresponding to each frame of the target image;
a determining module, configured to determine one of the identified image categories as a video category of the target video;
the determining module includes:
the acquisition unit is used for acquiring comment information corresponding to the target video;
a third determination unit configured to determine, as a candidate category, an image category matching the comment information among the identified image categories;
a fourth determining unit, configured to determine, as the video category, a candidate category having the largest number of corresponding target images;
and the selection module is used for selecting one frame from the target images corresponding to the video category as the cover of the target video.
7. The system for selecting a video cover as recited in claim 6, wherein the identification module is specifically configured to identify, according to an image recognition model, the image category corresponding to each frame of the target image;
the input of the image recognition model is the target image, and the output of the image recognition model is the image category corresponding to the target image;
and/or,
the selection module comprises:
the first determining unit is used for determining that the target image corresponding to the video category is a candidate image;
the image scoring unit is used for evaluating the image score corresponding to each frame of candidate image according to the image scoring model;
the second determining unit is used for determining the candidate image with the highest image score as the cover of the target video;
and the input of the image scoring model is the candidate image, and the output of the image scoring model is the image score corresponding to the candidate image.
8. The system for selecting a video cover as recited in claim 6, wherein the determination module is specifically configured to determine the image category that corresponds to the greatest number of target images as the video category.
CN201911395856.2A 2019-12-30 2019-12-30 Method, system, electronic device and storage medium for selecting video cover Active CN111143613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911395856.2A CN111143613B (en) 2019-12-30 2019-12-30 Method, system, electronic device and storage medium for selecting video cover

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911395856.2A CN111143613B (en) 2019-12-30 2019-12-30 Method, system, electronic device and storage medium for selecting video cover

Publications (2)

Publication Number Publication Date
CN111143613A CN111143613A (en) 2020-05-12
CN111143613B true CN111143613B (en) 2024-02-06

Family

ID=70521857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911395856.2A Active CN111143613B (en) 2019-12-30 2019-12-30 Method, system, electronic device and storage medium for selecting video cover

Country Status (1)

Country Link
CN (1) CN111143613B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831615B (en) * 2020-05-28 2024-03-12 北京达佳互联信息技术有限公司 Method, device and system for generating video file
CN111601160A (en) * 2020-05-29 2020-08-28 北京百度网讯科技有限公司 Method and device for editing video
CN111918130A (en) * 2020-08-11 2020-11-10 北京达佳互联信息技术有限公司 Video cover determining method and device, electronic equipment and storage medium
WO2022087826A1 (en) * 2020-10-27 2022-05-05 深圳市大疆创新科技有限公司 Video processing method and apparatus, mobile device, and readable storage medium
CN112363660B (en) * 2020-11-09 2023-03-24 北京达佳互联信息技术有限公司 Method and device for determining cover image, electronic equipment and storage medium
CN113794890B (en) * 2021-07-30 2023-10-24 北京达佳互联信息技术有限公司 Data processing method, device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105611413A (en) * 2015-12-24 2016-05-25 小米科技有限责任公司 Method and device for adding video clip class markers
CN107918656A (en) * 2017-11-17 2018-04-17 北京奇虎科技有限公司 Video front cover extracting method and device based on video title
CN108650524A (en) * 2018-05-23 2018-10-12 腾讯科技(深圳)有限公司 Video cover generation method, device, computer equipment and storage medium
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
CN109271542A (en) * 2018-09-28 2019-01-25 百度在线网络技术(北京)有限公司 Cover determines method, apparatus, equipment and readable storage medium storing program for executing
CN110263743A (en) * 2019-06-26 2019-09-20 北京字节跳动网络技术有限公司 The method and apparatus of image for identification
CN110390025A (en) * 2019-07-24 2019-10-29 百度在线网络技术(北京)有限公司 Cover figure determines method, apparatus, equipment and computer readable storage medium
CN110399848A (en) * 2019-07-30 2019-11-01 北京字节跳动网络技术有限公司 Video cover generation method, device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8990223B2 (en) * 2012-06-29 2015-03-24 Rovi Guides, Inc. Systems and methods for matching media content data

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105611413A (en) * 2015-12-24 2016-05-25 小米科技有限责任公司 Method and device for adding video clip class markers
CN107918656A (en) * 2017-11-17 2018-04-17 北京奇虎科技有限公司 Video front cover extracting method and device based on video title
CN108650524A (en) * 2018-05-23 2018-10-12 腾讯科技(深圳)有限公司 Video cover generation method, device, computer equipment and storage medium
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
CN109271542A (en) * 2018-09-28 2019-01-25 百度在线网络技术(北京)有限公司 Cover determines method, apparatus, equipment and readable storage medium storing program for executing
CN110263743A (en) * 2019-06-26 2019-09-20 北京字节跳动网络技术有限公司 The method and apparatus of image for identification
CN110390025A (en) * 2019-07-24 2019-10-29 百度在线网络技术(北京)有限公司 Cover figure determines method, apparatus, equipment and computer readable storage medium
CN110399848A (en) * 2019-07-30 2019-11-01 北京字节跳动网络技术有限公司 Video cover generation method, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Development Status and Trends of Short Video in the Context of Media Convergence; Huang Chuxin; People's Tribune · Academic Frontiers (Issue 23); pp. 42-49 *

Also Published As

Publication number Publication date
CN111143613A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111143613B (en) Method, system, electronic device and storage medium for selecting video cover
CN111696112B (en) Automatic image cutting method and system, electronic equipment and storage medium
US10671895B2 (en) Automated selection of subjectively best image frames from burst captured image sequences
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
CN112119388A (en) Training image embedding model and text embedding model
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN112074828A (en) Training image embedding model and text embedding model
CN107908641A (en) A kind of method and system for obtaining picture labeled data
WO2019118236A1 (en) Deep learning on image frames to generate a summary
CN113761253A (en) Video tag determination method, device, equipment and storage medium
CN111612010A (en) Image processing method, device, equipment and computer readable storage medium
CN111259245B (en) Work pushing method, device and storage medium
WO2022156534A1 (en) Video quality assessment method and device
CN113301382A (en) Video processing method, device, medium, and program product
CN114723652A (en) Cell density determination method, cell density determination device, electronic apparatus, and storage medium
CN110704650A (en) OTA picture tag identification method, electronic device and medium
CN113627342B (en) Method, system, equipment and storage medium for video depth feature extraction optimization
CN109960745A (en) Visual classification processing method and processing device, storage medium and electronic equipment
CN115964560A (en) Information recommendation method and equipment based on multi-mode pre-training model
CN108229263B (en) Target object identification method and device and robot
CN117795551A (en) Method and system for automatically capturing and processing user images
CN115630188A (en) Video recommendation method and device and electronic equipment
CN114372580A (en) Model training method, storage medium, electronic device, and computer program product
CN103729532A (en) Information supplying method and device based on images of fruits and vegetables
US20210157826A1 (en) Data model proposals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant