CN111209439B - Video clip retrieval method, device, electronic equipment and storage medium - Google Patents

Video clip retrieval method, device, electronic equipment and storage medium

Info

Publication number
CN111209439B
CN111209439B CN202010026271.XA
Authority
CN
China
Prior art keywords
video
training
retrieval
candidate
relevant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010026271.XA
Other languages
Chinese (zh)
Other versions
CN111209439A (en
Inventor
龙翔
周志超
李甫
何栋梁
王平
迟至真
赵翔
孙昊
文石磊
丁二锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010026271.XA priority Critical patent/CN111209439B/en
Publication of CN111209439A publication Critical patent/CN111209439A/en
Application granted granted Critical
Publication of CN111209439B publication Critical patent/CN111209439B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/738 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a video clip retrieval method and device, an electronic device, and a storage medium, and relates to the technical field of video processing. The specific implementation scheme is as follows: a video-level retrieval module in a video segment retrieval model retrieves the most relevant target video from a video library according to retrieval information input by a user; a segment positioning module in the video segment retrieval model then locates, within the target video, the target video segment most relevant to the retrieval information. The technical scheme of the application realizes video retrieval at segment granularity and, compared with the video-level retrieval of the prior art, can effectively improve both the accuracy and the efficiency of video retrieval.

Description

Video clip retrieval method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular to a video clip retrieval method and device, an electronic device, and a storage medium.
Background
As video services become richer, video retrieval is involved in more and more scenarios.
Current video-level retrieval mainly searches for the most relevant video according to a Query text input by the user. Such methods generally use one neural network to extract video features from the whole video and another neural network to extract text features from the Query text, rank all video features in a video library by their relevance to the text features, find the video most relevant to the Query text, and return it to the user.
However, when retrieving a video, the Query text entered by the user is often related to only one segment of the video, and the user is likewise often interested in only that segment. The above retrieval scheme can only return the whole video and cannot locate the video segment most relevant to the Query text within it, so the accuracy of existing video retrieval is not high.
Disclosure of Invention
In order to solve the technical problems, the application provides a video clip retrieval method, a video clip retrieval device, electronic equipment and a storage medium, which are used for improving the accuracy of video retrieval.
In one aspect, the present application provides a video clip retrieval method, including:
a video level retrieval module in a video segment retrieval model is adopted to retrieve the most relevant target video from a video library according to retrieval information input by a user;
and positioning the target video fragment most relevant to the retrieval information from the target video by adopting a fragment positioning module in the video fragment retrieval model.
Further alternatively, in the method as described above, the searching the most relevant target video from the video library according to the search information input by the user using the video level search module in the video clip search model includes:
Acquiring video level features of each video in the video library based on a pre-trained frame feature extraction model and a first attention mechanism module;
extracting corresponding text features based on the retrieval information;
respectively calculating the relevance of the video level features and the text features of each video;
and acquiring the video with the maximum correlation from the video library as the most correlated target video.
Further optionally, in the method as described above, the obtaining video level features of each video in the video library based on a pre-trained frame feature extraction model and a first attention mechanism module includes:
for each video in the video library, acquiring a frame-level feature of each video frame by adopting the frame feature extraction model;
and acquiring the video level characteristics of the corresponding video according to the characteristics of the frame levels of the videos and the first attention mechanism module.
Further optionally, in the method as described above, acquiring video-level features of the corresponding videos according to the features of the frame levels of the videos and the first attention mechanism module includes:
Inputting the frame level features of the videos to the first attention mechanism module according to the sequence in the videos, fusing the video level features of the videos based on the frame level features of the video frames by the first attention mechanism module, and outputting the fused video level features.
Further alternatively, in the method as described above, extracting the corresponding text feature based on the search information includes:
and after each word in the search information is embedded and expressed, inputting the embedded and expressed word into a pre-trained text feature extraction model to obtain the corresponding text feature.
Further alternatively, in the method as described above, the positioning, using a segment positioning module in the video segment search model, of a target video segment that is most relevant to the search information from the target video includes:
splicing the characteristics of the frame level of each video frame of the target video and the text characteristics of the search information to obtain splicing characteristics of the frame level;
and inputting the splicing characteristics of each frame level to a pre-trained second attention mechanism module, and acquiring the start and stop positions of the target video clips which are output by the second attention mechanism module and are most relevant to the retrieval information.
On the other hand, the application also provides a training method of the video clip retrieval model, which comprises the following steps:
collecting a plurality of pieces of training video data;
and training a video segment retrieval model by adopting the training video data, wherein the video segment retrieval model comprises a video level retrieval module, a segment positioning module and a joint ordering module.
Further optionally, in the method as described above, each piece of training video data includes one piece of training search information, a plurality of training videos, and a manually labelled training video clip, namely the clip most relevant to the training search information within the training video that is itself most relevant to the training search information among the plurality of training videos.
Further alternatively, in the method as described above, training the video clip search model using the plurality of pieces of training video data includes:
for each piece of training data, searching N candidate training videos most relevant to the training search information from the plurality of training videos by adopting the video-level search module;
positioning candidate video clips most relevant to the training retrieval information from the candidate training videos by adopting the clip positioning module to obtain N candidate video clips;
The relevance between the N candidate video clips and the training retrieval information is ranked by adopting the joint ranking module, and the candidate video clip most relevant to the training retrieval information is obtained;
detecting whether the obtained most relevant candidate video segments are consistent with the marked most relevant training video segments;
and if the video segments are inconsistent, adjusting parameters of the video-level retrieval module and the segment positioning module so that the acquired most relevant candidate video segments tend to be consistent with the marked most relevant training video segments.
Further optionally, in the method as described above, the ranking the relevance between the N candidate video clips and the training search information by using the joint ranking module, and obtaining a candidate video clip most relevant to the training search information includes:
acquiring segment level characteristics of each candidate video segment in the N candidate video segments;
acquiring text characteristics of the training retrieval information;
calculating the relevance between the segment level features of each candidate video segment and the text features of the training retrieval information;
and acquiring the candidate video segment with the maximum correlation degree from the N candidate video segments, and taking the candidate video segment as the candidate video segment most correlated with the training retrieval information.
Further optionally, in the method as described above, each piece of training video data includes a plurality of pieces of training search information, a training video, training search information most related to the training video among the plurality of pieces of training search information manually labeled, and a training video clip most related to the most related training search information among the training video.
Further alternatively, in the method as described above, training the video clip search model using the plurality of pieces of training video data includes:
for each piece of training data, the video-level retrieval module is adopted to retrieve N pieces of candidate training retrieval information which are most relevant to the training video from the plurality of pieces of training retrieval information;
positioning candidate video clips most relevant to each piece of candidate training retrieval information from the training video by adopting the clip positioning module to obtain N candidate video clips;
the relevance between the N candidate video clips and the training video is ranked by adopting the joint ranking module, and the candidate video clip most relevant to the training video is obtained;
detecting whether the obtained most relevant candidate video segments are consistent with the marked most relevant training video segments;
And if the video segments are inconsistent, adjusting parameters of the video-level retrieval module and the segment positioning module so that the acquired most relevant candidate video segments tend to be consistent with the marked most relevant training video segments.
Further optionally, in the method as described above, the ranking the relevance between the N candidate video segments and the training video by using the joint ranking module, and obtaining a candidate video segment most relevant to the training video includes:
acquiring segment level characteristics of each candidate video segment in the N candidate video segments;
acquiring video level characteristics of the training video;
calculating the correlation degree between the segment level features of each candidate video segment and the video level features of the training video;
and acquiring the candidate video segment with the maximum correlation degree from the N candidate video segments, and taking the candidate video segment as the candidate video segment most correlated with the training video.
In still another aspect, the present application further provides a video clip retrieving apparatus, including:
the video level retrieval module is used for retrieving the most relevant target video from the video library according to the retrieval information input by the user;
And the fragment positioning module is used for positioning the target video fragment most relevant to the retrieval information from the target video.
In still another aspect, the present application further provides a training device for a video clip retrieval model, including:
the acquisition module is used for acquiring a plurality of pieces of training video data;
the training module is used for training a video segment search model by adopting the plurality of pieces of training video data, and the video segment search model comprises a video level search module, a segment positioning module and a joint ordering module.
In still another aspect, the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods described above.
In yet another aspect, the present application also provides a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of the above.
One embodiment of the above application has the following advantages or benefits: retrieving the most relevant target video from the video library according to the retrieval information input by the user by adopting a video level retrieval module; the segment locating module is adopted to locate the target video segment most relevant to the retrieval information from the target video, so that video retrieval with segment granularity can be realized, and compared with video-level retrieval in the prior art, the accuracy and the retrieval efficiency of video retrieval can be effectively improved; and the method can help the user to quickly browse the information to be searched, and can greatly improve the use experience of the user.
In addition, in the application, a joint sequencing module is added in the training of the video segment retrieval model, the process of acquiring the target video segment is more detailed, and the video segment retrieval model can be trained more finely, so that the accuracy and the retrieval efficiency of video retrieval can be effectively improved; and the method can help the user to quickly browse the information to be searched, and can greatly improve the use experience of the user.
Furthermore, in the application, not only the video segment search model can be trained by adopting the training data comprising one piece of training search information and a plurality of pieces of training videos, but also the video segment search model can be trained by adopting the training data comprising one piece of training videos and a plurality of pieces of training search information, so that the performance of the video segment search model can be further improved, and the accuracy and the search efficiency of the video segment search model are higher.
Other effects of the above alternative will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is a schematic diagram of a first embodiment according to the present application;
FIG. 2 is a schematic diagram of a second embodiment according to the present application;
FIG. 3 is a schematic diagram of a third embodiment according to the present application;
FIG. 4 is a schematic diagram of a fourth embodiment according to the present application;
FIG. 5 is a schematic diagram of a fifth embodiment according to the present application;
FIG. 6 is a schematic diagram of a sixth embodiment according to the present application;
fig. 7 is a block diagram of an electronic device for implementing the above-described method of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flowchart of a video clip retrieval method according to a first embodiment of the present application. As shown in fig. 1, the video clip retrieval method of the present embodiment may specifically include the following steps:
s101, searching the most relevant target video from a video library by adopting a video level searching module in a video fragment searching model according to searching information input by a user;
S102, a segment positioning module in the video segment retrieval model is adopted to locate, from the target video, the target video segment most relevant to the retrieval information.
The execution body of the video clip retrieval method in this embodiment is a video clip retrieval device, which may be a physical electronic device or a software-integrated application system. In use, given the retrieval information input by the user and a video library available for retrieval, the video clip retrieval device outputs the target video clip most relevant to the retrieval information.
The video clip retrieval device of this embodiment may be implemented by using a video clip retrieval model, where the video clip retrieval model may include a video-level retrieval module and a clip positioning module. Therefore, the video clip search method of the present embodiment is specifically a method for using the video clip search model.
In this embodiment, the retrieval information input by the user may be in text form, for example a search Query typed by the user, or it may be text-form retrieval information obtained by performing speech recognition on voice input from the user.
For example, step S101 of the present embodiment adopts a video level search module to search the most relevant target video from the video library according to the search information input by the user, and specifically may include the following steps:
(a1) Acquiring video level features of each video in a video library based on a pre-trained frame feature extraction model and a first attention mechanism module;
For example, for each video in the video library, a pre-trained frame feature extraction model may first be used to obtain the frame-level feature of each video frame; the frame feature extraction model of this embodiment extracts the frame-level features of the video frames in a video. The video-level feature of the corresponding video is then acquired from the frame-level features of the video and the first attention mechanism module.
In this embodiment, the frame feature extraction model may include a pre-trained SENet-152 neural network model and a bidirectional GRU network model. Specifically, when extracting the frame-level features of each video frame, the pre-trained SENet-152 neural network model is first used to extract a primary feature of each video frame in the video; the extracted primary features of the video frames are then input to a pre-trained bidirectional GRU network model to obtain the corresponding frame-level features. When the frame feature extraction model is trained, the SENet-152 neural network model is pre-trained in advance and does not participate in the training; only the bidirectional GRU network model is trained. The process of acquiring the frame-level features of each video frame with the frame feature extraction model in this embodiment can be understood as a video encoding process and can be implemented by a video encoder.
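For illustration only, the following is a minimal sketch (not part of the original disclosure) of such a frame encoder in PyTorch. A torchvision ResNet-152 stands in for the pre-trained SENet-152 backbone, which torchvision does not provide; the backbone is frozen and only the bidirectional GRU would be trained, as described above. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FrameEncoder(nn.Module):
    def __init__(self, gru_hidden=512):
        super().__init__()
        backbone = models.resnet152(weights=None)   # stand-in for the pre-trained SENet-152
        backbone.fc = nn.Identity()                 # keep the 2048-d pooled frame features
        self.backbone = backbone
        for p in self.backbone.parameters():        # the CNN backbone is frozen (not trained)
            p.requires_grad = False
        self.gru = nn.GRU(input_size=2048, hidden_size=gru_hidden,
                          batch_first=True, bidirectional=True)

    def forward(self, frames):                      # frames: (T, 3, H, W) for one video
        with torch.no_grad():
            primary = self.backbone(frames)         # (T, 2048) primary per-frame features
        frame_feats, _ = self.gru(primary.unsqueeze(0))   # (1, T, 2 * gru_hidden)
        return frame_feats.squeeze(0)               # (T, 2 * gru_hidden) frame-level features
```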
The first attention mechanism module of this embodiment may be implemented by a Keyless attention module, which may be referred to herein as a first Keyless attention module.
For each video, after the frame-level features of each video frame are obtained in the above manner, the frame-level features of the video are input to the first Keyless attention module according to the sequence in the video, and the video-level features of the corresponding video output by the first Keyless attention module are obtained. The first Keyless attention module can identify the importance of each frame-level feature in the video and fuse the frame-level features into video-level features based on the importance. The video level features can comprehensively identify information of the video. The frame-level features in this embodiment are each represented in the form of vectors.
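Purely as a non-authoritative illustration, the first Keyless attention module can be sketched as follows: each frame-level feature is scored by a learned linear layer (with no external query or key), the scores are softmax-normalized over time, and the frame-level features are fused into a single video-level feature by a weighted sum. The feature dimension is an assumption.

```python
import torch
import torch.nn as nn

class KeylessAttention(nn.Module):
    """Fuses frame-level features into one video-level feature without an external query."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # importance score for each frame

    def forward(self, frame_feats):                # frame_feats: (T, feat_dim), in temporal order
        weights = torch.softmax(self.score(frame_feats), dim=0)   # (T, 1) importance weights
        video_feat = (weights * frame_feats).sum(dim=0)           # (feat_dim,) fused video-level feature
        return video_feat, weights.squeeze(-1)
```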
(b1) Extracting corresponding text features based on the retrieval information;
the process is understood to be a text encoding process and may also be implemented in a text encoder. Specifically, each word in the search information can be embedded and expressed and then input into a pre-trained text feature extraction model to obtain corresponding text features.
In this embodiment, each word in the search information may be embedded and expressed in a word2vec embedding manner, that is, converted into a vector form. Then splicing the embedded expressions of all words of the search information according to the sequence in the search information, inputting the embedded expressions into a text feature extraction model, and outputting corresponding text features by the text feature extraction model, wherein the text features can be in the form of a vector. The text feature extraction model of this embodiment may be a bi-directional GRU network and global average pooling model.
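A minimal sketch of this text encoding is given below, under the assumptions noted in the comments: the embedding lookup stands in for pre-trained word2vec vectors (which the sketch does not load), and the vocabulary size and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=300, gru_hidden=512):
        super().__init__()
        # Stand-in for a word2vec lookup; in the described scheme the embeddings would be pre-trained.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, gru_hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                      # token_ids: (L,) word indices of the query
        emb = self.embedding(token_ids).unsqueeze(0)   # (1, L, emb_dim)
        hidden, _ = self.gru(emb)                      # (1, L, 2 * gru_hidden)
        return hidden.mean(dim=1).squeeze(0)           # global average pooling -> (2 * gru_hidden,)
```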
(c1) Respectively calculating the correlation of video level features and text features of each video;
(d1) And acquiring the video with the highest correlation from the video library as the most correlated target video.
The steps (a1)-(d1) constitute the video retrieval process implemented by the video-level retrieval module.
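Steps (a1)-(d1) can be summarised, purely as an illustrative sketch, as a nearest-neighbour search over video-level features; cosine similarity is used here as one possible relevance measure, since the disclosure does not fix a particular one.

```python
import torch
import torch.nn.functional as F

def retrieve_most_relevant_video(video_feats, text_feat):
    """video_feats: (num_videos, D) video-level features; text_feat: (D,) query text feature."""
    scores = F.cosine_similarity(video_feats, text_feat.unsqueeze(0), dim=1)  # (num_videos,)
    best = torch.argmax(scores).item()
    return best, scores[best].item()   # index of the most relevant video and its relevance score
```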
The video level retrieval module of the embodiment is used for retrieving one target video most relevant to the retrieval information from the video library. And the segment locating module is used for locating the most relevant target video segment with the retrieval information from the most relevant target video. For example, step S102 of the present embodiment may specifically include the following steps:
(a2) Splicing the characteristics of the frame level of each video frame of the target video and the text characteristics of the search information to obtain splicing characteristics of the frame level;
(b2) And inputting the splicing characteristics of each frame level to a pre-trained second attention mechanism module, and acquiring the start and stop positions of the target video clips which are output by the second attention mechanism module and are most relevant to the retrieval information.
The second attention mechanism module of this embodiment may also be implemented using a Keyless attention module, which may be referred to herein as a second Keyless attention module.
The second Keyless attention module likewise identifies a relevance weight for the concatenated feature at each frame level. For example, a larger weight for a given frame indicates a stronger correlation between that frame-level feature and the text feature of the retrieval information, and vice versa. By setting a preset relevance-weight threshold, the second Keyless attention module then intercepts the run of consecutive frames whose weights all exceed the threshold, takes it as the target video segment, and outputs the start and stop positions of that segment. In addition, the second Keyless attention module can fuse the concatenated frame-level features within the target video segment into a segment-level feature of the segment and output it as well; during training, segment-level features need to be acquired in this way.
The steps (a2)-(b2) constitute the segment positioning process implemented by the segment positioning module.
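The segment positioning of steps (a2)-(b2) may be sketched as follows; this is an assumption-laden illustration rather than the exact implementation. The concatenated frame-level features are scored by a second learned attention layer, the per-frame relevance weights are compared against a preset threshold, the longest run of consecutive frames above the threshold is returned as the start and stop positions, and the in-segment concatenated features are fused into a segment-level feature. The sigmoid scoring, the threshold value, and the weighted-sum fusion are illustrative choices.

```python
import torch
import torch.nn as nn

class SegmentLocator(nn.Module):
    def __init__(self, frame_dim=1024, text_dim=1024, threshold=0.5):
        super().__init__()
        self.score = nn.Linear(frame_dim + text_dim, 1)   # second attention scoring layer
        self.threshold = threshold                        # preset relevance-weight threshold

    def forward(self, frame_feats, text_feat):            # (T, frame_dim), (text_dim,)
        # concatenate every frame-level feature with the text feature
        fused = torch.cat([frame_feats,
                           text_feat.expand(frame_feats.size(0), -1)], dim=1)
        weights = torch.sigmoid(self.score(fused)).squeeze(-1)   # (T,) per-frame relevance weights

        # take the longest run of consecutive frames whose weight exceeds the threshold
        best, run_start = (0, 0), None
        for t, flag in enumerate((weights > self.threshold).tolist() + [False]):
            if flag and run_start is None:
                run_start = t
            elif not flag and run_start is not None:
                if t - run_start > best[1] - best[0]:
                    best = (run_start, t)
                run_start = None
        start, stop = best

        # fuse the in-segment concatenated features into a segment-level feature
        seg_feat = (weights[start:stop, None] * fused[start:stop]).sum(dim=0)
        return (start, stop), seg_feat
```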
According to the video clip searching method, a video level searching module is adopted to search the most relevant target video from a video library according to searching information input by a user; the segment locating module is adopted to locate the target video segment most relevant to the retrieval information from the target video, so that video retrieval with segment granularity can be realized, and compared with video-level retrieval in the prior art, the accuracy and the retrieval efficiency of video retrieval can be effectively improved; and the method can help the user to quickly browse the information to be searched, and can greatly improve the use experience of the user.
Fig. 2 is a flowchart of a training method of a video clip retrieval model according to a second embodiment of the present application. As shown in fig. 2, the training method of the video clip retrieval model of the present embodiment may specifically include the following steps:
s201, collecting a plurality of pieces of training video data;
s202, training a video segment retrieval model by adopting a plurality of pieces of training video data, wherein the video segment retrieval model comprises a video level retrieval module, a segment positioning module and a joint ordering module.
The main execution body of the training method of the video clip retrieval model in this embodiment is a training device of the video clip retrieval model, and the training device of the video clip retrieval model may be an electronic entity, such as a large-scale computer device. Alternatively, a software-integrated application system may be employed, which in use runs on a computer device for training the video clip retrieval model as described in the embodiment of fig. 1 above.
According to the related description of the embodiment shown in fig. 1, when the video segment retrieval model is used, the video-level retrieval module only needs to retrieve a single most relevant target video from the video library, and the segment positioning module locates the most relevant target video segment from that target video; only the video-level retrieval module and the segment positioning module are involved. In the training of this embodiment, however, the video segment retrieval model further includes a joint ranking module. That is, the training process of the video segment retrieval model is not exactly the same as the usage process. For example, to enhance the training effect, the video-level retrieval module may not only select the single target video most relevant to the retrieval information but acquire a plurality of candidate target videos, which are then ranked by the final joint ranking module to determine the final target video segment. Training in this way can improve the accuracy of the video segment retrieval model.
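By way of illustration only, the difference from the video-level retrieval sketch given earlier is that during training the top N candidates are kept instead of the single best one; torch.topk expresses this directly. N is a hyperparameter not fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def retrieve_top_n_videos(video_feats, text_feat, n=5):
    """Keep the N most relevant candidate videos during training (N is an assumed hyperparameter)."""
    scores = F.cosine_similarity(video_feats, text_feat.unsqueeze(0), dim=1)   # (num_videos,)
    top = torch.topk(scores, k=min(n, scores.numel()))
    return top.indices.tolist(), top.values.tolist()
```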
In order to effectively improve the training effect, the training data of the present embodiment may include the following two cases:
first case: each piece of training video data comprises training retrieval information, a plurality of training videos and training video fragments which are most relevant to the training retrieval information in the training videos which are most relevant to the training retrieval information in the plurality of manually marked training videos.
Fig. 3 is a flowchart of a training method of a video clip retrieval model according to a third embodiment of the present application. As shown in fig. 3, a specific implementation manner of training the video clip search model by using several pieces of training video data in step S202 of the embodiment shown in fig. 2 when training data of the first case is used is described in detail, which may specifically include the following steps:
s301, for each piece of training data, a video level retrieval module is adopted to retrieve N candidate training videos most relevant to training retrieval information from a plurality of training videos;
The implementation of step S301 differs from step S101 of the embodiment shown in fig. 1 only in that step S101 retrieves a single most relevant target video, whereas step S301 retrieves N candidate training videos from the plurality of training videos. The implementation principle is otherwise similar; refer to the specific implementation of step S101 in the embodiment shown in fig. 1, which is not repeated here.
Referring to the description related to step S101 of the embodiment shown in fig. 1, it may be known that the models involved in the video level search module of this embodiment include a frame feature extraction model, a first attention mechanism module, a text feature extraction model, and a second attention mechanism module. The frame feature extraction model comprises a pre-trained senet152 neural network model and a bidirectional GRU network model, wherein the senet152 neural network model is pre-trained and does not participate in the training of the embodiment. The text feature extraction model may be specifically a bidirectional GRU network and global average pooling model. Thus, it can be known that in the video level retrieval module in this embodiment, there are a bidirectional GRU network model, a bidirectional GRU network and global average pooling model, a first attention mechanism module and a second attention mechanism module that need to be trained.
S302, positioning candidate video clips most relevant to training retrieval information from each candidate training video by adopting a clip positioning module to obtain N candidate video clips;
Specifically, with reference to the implementation of step S102 in the embodiment shown in fig. 1, the candidate video segment most relevant to the training search information is located from each candidate training video, yielding N candidate video segments in total. For details, refer to the relevant description of the embodiment shown in fig. 1, which is not repeated here.
S303, sequencing the correlation degree of N candidate video clips and training retrieval information by adopting a joint sequencing module, and acquiring the candidate video clip most correlated with the training retrieval information;
in the training process of this embodiment, N candidate video segments are acquired, and in order to acquire a final target video segment, in this embodiment, a joint ordering module is required to perform ordering to acquire an optimal video segment.
For example, the method can be realized in the following steps:
(a3) Obtaining segment level characteristics of each candidate video segment in the N candidate video segments;
specifically, reference may be made to the description of the second attention mechanism module, i.e. the second Keyless attention module in the embodiment shown in fig. 1, where it may be known that the second attention mechanism module may identify, according to the splicing feature of each frame level of the input video, the correlation weight of the splicing feature of each frame level, and obtain, based on the set correlation weight threshold, video segments in consecutive frames, where the correlation weight is greater than the correlation weight threshold. Then, the splicing characteristics of each frame level in the video segment can be spliced together to generate segment level characteristics of the video segment and output. In the above manner, segment level features of each candidate video segment may be obtained.
(b3) Acquiring text characteristics of training retrieval information;
This step may be implemented with reference to step (b1) of the embodiment shown in fig. 1, and is not described again here.
(c3) Calculating the relativity between the segment level characteristics of each candidate video segment and the text characteristics of the training retrieval information;
(d3) And acquiring the candidate video segment with the highest correlation degree from the N candidate video segments as the candidate video segment most correlated with the training retrieval information.
S304, detecting whether the obtained most relevant candidate video segments are consistent with the marked most relevant training video segments; if not, executing step S305; otherwise, step S306 is performed;
s305, adjusting parameters of a video level retrieval module and a fragment positioning module to enable the obtained most relevant candidate video fragments to be consistent with the marked most relevant training video fragments, and returning to the step S301 to continue training;
s306, judging whether the obtained most relevant candidate video clips are consistent with the marked most relevant training video clips in continuous training of the preset number of rounds, if so, ending the training, otherwise, returning to the step S301 to continue the training.
For example, in combination with the above analysis, the parameters of the bidirectional GRU network model, the bidirectional GRU network and global average pooling model, the first attention mechanism module and the second attention mechanism module may be adjusted; the adjustment may change the parameters of one model at a time or of several models at a time, as long as the final training effect is ensured.
Steps S301 to S305 constitute one round of parameter adjustment in training. After step S305, since the parameters of the model have just been adjusted, the next round of training is necessarily performed. In practical application, a termination condition for training also needs to be designed: for example, whenever a round of training finds that the obtained most relevant candidate video segment is consistent with the labelled most relevant training video segment, it can be judged whether this consistency has held over a preset number of consecutive rounds; if so, the video segment retrieval model is considered to be trained to maturity, training can be terminated, and the parameters of the model at that moment are kept unchanged for subsequent video segment retrieval. Otherwise, training continues according to the above steps. The preset number of consecutive rounds may be 100 rounds, 200 rounds or another integer number of rounds, which is not limited here.
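The control flow of steps S301-S306 may be sketched as below. This is a hypothetical skeleton: the forward_with_loss helper, the labelled_segment field, the optimizer and the loss are all assumptions used only to show when parameters are adjusted and when training terminates (after a preset number of consecutive consistent rounds, e.g. 100).

```python
PATIENCE = 100   # preset number of consecutive consistent rounds (illustrative value)

def train(model, optimizer, training_data):
    consecutive_hits = 0
    while consecutive_hits < PATIENCE:
        for sample in training_data:                                  # one round of training
            # forward_with_loss and labelled_segment are hypothetical helpers / fields
            predicted_segment, loss = model.forward_with_loss(sample)
            if predicted_segment == sample.labelled_segment:          # consistent with the label
                consecutive_hits += 1
            else:
                consecutive_hits = 0
                optimizer.zero_grad()
                loss.backward()   # adjust video-level retrieval and segment positioning parameters
                optimizer.step()
            if consecutive_hits >= PATIENCE:                          # trained to maturity
                break
```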
Second case: each piece of training video data comprises a plurality of pieces of training retrieval information, training video, training retrieval information most relevant to the training video in the plurality of pieces of training retrieval information of manual annotation, and training video fragments most relevant to the most relevant training retrieval information in the training video.
Fig. 4 is a flowchart of a training method of a video clip retrieval model according to a fourth embodiment of the present application. As shown in fig. 4, a specific implementation manner of training the video clip search model by using several pieces of training video data in step S202 of the embodiment shown in fig. 2 when training data of the second case is used is described in detail, which may specifically include the following steps:
s401, for each piece of training data, searching N pieces of candidate training search information which are most relevant to training video from a plurality of pieces of training search information by adopting a video level search module;
The training data in this case differs from the previous case: it consists of several pieces of training search information and one training video. Such data does not occur in practical use, but it is employed in the training of this embodiment in order to improve the performance of the video segment retrieval model. In practical applications, the video segment retrieval model may be put into use after only the training of the embodiment shown in fig. 3, or the training of the embodiments shown in fig. 3 and fig. 4 may be used together to further enhance the performance of the video segment retrieval model.
Specifically, in this embodiment, the step S401 may specifically include the following steps:
(a4) Acquiring video level features of a training video based on a pre-trained frame feature extraction model and a first attention mechanism module;
(b4) Extracting corresponding text features based on each piece of training retrieval information;
(c4) Respectively calculating the correlation between the video level characteristics of the video and the text characteristics of each training retrieval information;
(d4) N candidate training search information with the largest correlation is obtained from the plurality of training search information.
The implementation principles of steps (a4)-(d4) are similar to those of steps (a1)-(d1) in the above embodiment and are not repeated here.
S402, positioning candidate video clips most relevant to each piece of candidate training retrieval information from the training video by adopting a clip positioning module to obtain N candidate video clips;
specifically, for each piece of candidate training search information, reference may be made to the specific implementation manner of step S102 in the embodiment shown in fig. 1, and the candidate video segments most relevant to the candidate training search information may be located from the training video, so that N candidate video segments may be obtained in total.
S403, sequencing the correlation degree of the N candidate video clips and the training video by adopting a joint sequencing module, and acquiring the candidate video clip most correlated with the training video;
Unlike step S303 of the embodiment shown in fig. 3, in this embodiment, the candidate video segments most relevant to the training video are obtained by sorting N candidate video segments based on their relevance to the training video.
For example, the method can be realized in the following steps:
(a5) Obtaining segment level characteristics of each candidate video segment in the N candidate video segments;
reference may be made in detail to the implementation of step (a 3) above, and details are not repeated here.
(b5) Acquiring video level characteristics of a training video;
reference may be made in detail to the implementation of step (a 1) above, and details are not repeated here.
(c5) Calculating the correlation degree between the segment level characteristics of each candidate video segment and the video level characteristics of the training video;
(d5) And acquiring the candidate video segment with the highest correlation degree from the N candidate video segments as the candidate video segment most correlated with the training video.
S404, detecting whether the obtained most relevant candidate video segments are consistent with the marked most relevant training video segments; if not, step S405 is performed; otherwise, step S406 is performed;
s405, adjusting parameters of a video level retrieval module and a fragment positioning module to enable the obtained most relevant candidate video fragments to be consistent with the marked most relevant training video fragments; returning to the step S401 to continue training;
S406, judging whether the obtained most relevant candidate video clips are consistent with the marked most relevant training video clips in continuous training of the preset number of rounds, if so, ending the training, otherwise, returning to the step S401 to continue the training.
Similarly, steps S401 to S405 constitute one round of training. After step S405, since the parameters of the model have just been adjusted, the next round of training is necessarily performed. In practical application, a termination condition for training is also needed: for example, whenever a round of training finds that the obtained most relevant candidate video segment is consistent with the labelled most relevant training video segment, it can be judged whether this consistency has held over a preset number of consecutive rounds; if so, the video segment retrieval model is considered to be trained to maturity, training can be terminated, and the parameters of the model at that moment are kept unchanged for subsequent video segment retrieval. Otherwise, training continues according to the above steps.
According to the training method of the video segment retrieval model, the joint sequencing module is added in the training, the process of obtaining the target video segment is more detailed, and the video segment retrieval model can be trained more finely, so that the accuracy and the retrieval efficiency of video retrieval can be effectively improved; and the method can help the user to quickly browse the information to be searched, and can greatly improve the use experience of the user.
In addition, the training method of the video segment search model in the embodiment not only can train the video segment search model by adopting training data comprising one piece of training search information and a plurality of pieces of training videos, but also can train the video segment search model by adopting training data comprising one piece of training videos and a plurality of pieces of training search information, and can further improve the performance of the video segment search model, so that the accuracy and the search efficiency of the video segment search model are higher.
Fig. 5 is a block diagram of a video clip retrieving apparatus according to a fifth embodiment of the present application. As shown in fig. 5, the video clip retrieval apparatus 500 of the present embodiment includes:
a video level retrieval module 501, configured to retrieve the most relevant target video from the video library according to the retrieval information input by the user;
And a segment positioning module 502, configured to position a target video segment most relevant to the retrieval information from the target video.
Further alternatively, in the video clip retrieval apparatus 500 of the present embodiment, a video level retrieval module 501 is configured to:
acquiring video level features of each video in a video library based on a pre-trained frame feature extraction model and a first attention mechanism module;
extracting corresponding text features based on the retrieval information;
respectively calculating the correlation of video level features and text features of each video;
and acquiring the video with the highest correlation from the video library as the most correlated target video.
Further alternatively, in the video clip retrieval apparatus 500 of the present embodiment, a video level retrieval module 501 is configured to:
for each video in a video library, a frame feature extraction model is adopted to acquire the frame level features of each video frame;
and acquiring video level characteristics of the corresponding videos according to the characteristics of each frame level of each video and the first attention mechanism module.
Further alternatively, in the video clip retrieval apparatus 500 of the present embodiment, a video level retrieval module 501 is configured to:
and inputting the frame level characteristics of each video to a first attention mechanism module according to the sequence in the video, and fusing the video level characteristics of the video and outputting the video based on the frame level characteristics of each video frame by the first attention mechanism module.
Further alternatively, in the video clip retrieval apparatus 500 of the present embodiment, a video level retrieval module 501 is configured to:
and after each word in the search information is embedded and expressed, inputting the embedded and expressed word into a pre-trained text feature extraction model to obtain corresponding text features.
Further alternatively, in the video clip retrieving apparatus 500 of the present embodiment, a clip positioning module 502 is configured to:
splicing the characteristics of the frame level of each video frame of the target video and the text characteristics of the search information to obtain splicing characteristics of the frame level;
and inputting the splicing characteristics of each frame level to a pre-trained second attention mechanism module, and acquiring the start and stop positions of the target video clips which are output by the second attention mechanism module and are most relevant to the retrieval information.
The implementation principle and technical effects of the video clip retrieval device 500 in this embodiment by using the above modules are the same as those of the above related method embodiments, and details of the above related method embodiments may be referred to for details, which are not described herein.
Fig. 6 is a block diagram of a training device for a video clip search model according to a sixth embodiment of the present application. As shown in fig. 6, the training apparatus 600 of the video clip retrieval model of the present embodiment includes:
The acquisition module 601 is configured to acquire a plurality of pieces of training video data;
the training module 602 is configured to train a video clip search model using a plurality of pieces of training video data, where the video clip search model includes a video level search module, a clip positioning module, and a joint ordering module.
Optionally, in this embodiment, each piece of training video data includes one piece of training search information, a plurality of training videos, and a manually labelled training video clip, namely the clip most relevant to the training search information within the training video that is itself most relevant to that search information among the plurality of training videos.
At this point, correspondingly, training module 602 is configured to:
for each piece of training data, a video level retrieval module is adopted to retrieve N candidate training videos most relevant to training retrieval information from a plurality of training videos;
a segment positioning module is adopted to position candidate video segments most relevant to training retrieval information from each candidate training video, and N candidate video segments are obtained in total;
the relevance of N candidate video clips and training retrieval information is ranked by adopting a joint ranking module, and the candidate video clip most relevant to the training retrieval information is obtained;
detecting whether the obtained most relevant candidate video segments are consistent with the marked most relevant training video segments;
And if the video segments are inconsistent, adjusting parameters of the video-level retrieval module and the segment positioning module so that the acquired most relevant candidate video segments tend to be consistent with the marked most relevant training video segments.
Further, the training module 602 is configured to:
obtaining segment level characteristics of each candidate video segment in the N candidate video segments;
acquiring text characteristics of training retrieval information;
calculating the relativity between the segment level characteristics of each candidate video segment and the text characteristics of the training retrieval information;
and acquiring the candidate video segment with the highest correlation degree from the N candidate video segments as the candidate video segment most correlated with the training retrieval information.
In addition, optionally, in this embodiment, each piece of training video data includes a plurality of pieces of training search information, a training video, the manually labelled training search information most relevant to the training video among the plurality of pieces of training search information, and the training video clip most relevant to that most relevant training search information within the training video.
At this point, correspondingly, training module 602 is configured to:
for each piece of training data, a video level retrieval module is adopted to retrieve N pieces of candidate training retrieval information which are most relevant to training video from a plurality of pieces of training retrieval information;
A segment positioning module is adopted to position candidate video segments most relevant to each piece of candidate training retrieval information from the training video, and N candidate video segments are obtained in total;
the relevance between N candidate video clips and the training video is ranked by adopting a joint ranking module, and the candidate video clip most relevant to the training video is obtained;
detecting whether the obtained most relevant candidate video segments are consistent with the marked most relevant training video segments;
and if the video segments are inconsistent, adjusting parameters of the video-level retrieval module and the segment positioning module so that the acquired most relevant candidate video segments tend to be consistent with the marked most relevant training video segments.
Further, a training module 602 is configured to:
obtaining segment level characteristics of each candidate video segment in the N candidate video segments;
acquiring video level characteristics of a training video;
calculating the correlation degree between the segment level characteristics of each candidate video segment and the video level characteristics of the training video;
and acquiring the candidate video segment with the highest correlation degree from the N candidate video segments as the candidate video segment most correlated with the training video.
The training device 600 for the video clip retrieval model according to the present embodiment adopts the above modules to implement the training principle and the technical effect of the video clip retrieval model, which are the same as those of the above related method embodiments, and detailed description of the above related method embodiments may be referred to, and will not be repeated here.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
As shown in fig. 7, this is a block diagram of an electronic device implementing the above-described methods according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the application described and/or claimed herein. For example, the electronic device of this embodiment may be used to implement the above-mentioned video clip retrieval method, or the training method of the above-mentioned video clip retrieval model.
As shown in fig. 7, the electronic device includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including high-speed and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Likewise, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 701 is illustrated in fig. 7.
Memory 702 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the video clip retrieval method or the training method of the video clip retrieval model provided by the application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the video clip search method or the training method of the video clip search model provided by the present application.
The memory 702 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the relevant modules shown in fig. 5 and the relevant modules shown in fig. 6) corresponding to the video clip search method or the training method of the video clip search model in the embodiment of the present application. The processor 701 executes various functional applications of the server and data processing, i.e., implements the video clip search method or the training method of the video clip search model in the above-described method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 702.
Memory 702 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by use of an electronic device implementing a video clip retrieval method or a training method of a video clip retrieval model, and the like. In addition, the memory 702 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 702 optionally includes memory remotely located relative to processor 701, which may be connected via a network to an electronic device implementing a video clip retrieval method or a training method for a video clip retrieval model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device implementing the video clip retrieval method or the training method of the video clip retrieval model may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or in other manners; connection by a bus is taken as an example in fig. 7.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of an electronic device implementing the video clip search method or training method of the video clip search model, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, or the like. The output device 704 may include a display apparatus, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the present application, the video-level retrieval module retrieves the most relevant target video from the video library according to the retrieval information input by the user, and the segment locating module locates the target video clip most relevant to the retrieval information from the target video, so that video retrieval at clip granularity is realized; compared with video-level retrieval in the prior art, the accuracy and efficiency of video retrieval can be effectively improved, the user can be helped to quickly browse the information to be retrieved, and the user experience can be greatly improved.
In addition, according to the technical solution of the embodiments of the present application, a joint ranking module is added when training the video clip retrieval model, so that the process of acquiring the target video clip is refined and the video clip retrieval model can be trained more finely, which further improves the accuracy and efficiency of video retrieval, helps the user to quickly browse the information to be retrieved, and greatly improves the user experience.
Furthermore, according to the technical solution of the embodiments of the present application, the video clip retrieval model can be trained not only with training data comprising one piece of training retrieval information and a plurality of training videos, but also with training data comprising one training video and a plurality of pieces of training retrieval information, so that the performance of the video clip retrieval model can be further improved and its retrieval accuracy and efficiency are higher.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (24)

1. A video clip retrieval method, comprising:
retrieving, by adopting a video-level retrieval module in a video clip retrieval model, the most relevant target video from a video library according to retrieval information input by a user;
locating, by adopting a segment locating module in the video clip retrieval model, a target video clip most relevant to the retrieval information from the target video;
wherein retrieving the most relevant target video from the video library by adopting the video-level retrieval module in the video clip retrieval model according to the retrieval information input by the user comprises:
for each video in the video library, acquiring a frame-level feature of each video frame by adopting a pre-trained frame feature extraction model;
for each video, inputting the frame-level features of the video into a pre-trained first attention mechanism module according to the order of the corresponding video frames in the video, identifying, by the first attention mechanism module, the importance of each frame-level feature in the video, and fusing the frame-level features into a video-level feature of the video according to the importance of each frame-level feature;
wherein the frame feature extraction model comprises a pre-trained Squeeze-and-Excitation network (SENet-152) neural network model and a bidirectional gated recurrent unit (GRU) network model;
and wherein, for each video in the video library, acquiring the frame-level feature of each video frame by adopting the pre-trained frame feature extraction model comprises:
for each video in the video library, extracting a primary feature of each video frame in the video by adopting the pre-trained Squeeze-and-Excitation network (SENet-152) neural network model;
and inputting the extracted primary feature of each video frame into the pre-trained bidirectional gated recurrent unit (GRU) network model to obtain the corresponding frame-level feature.
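As a non-limiting illustration of the frame feature extraction and attention-based fusion recited above, the following Python (PyTorch, torchvision >= 0.13 API) sketch uses a torchvision ResNet-50 as a stand-in for the SENet-152 backbone, a bidirectional GRU over the frame sequence, and a simple learned scoring layer as the first attention mechanism module; the class names, dimensions, and backbone substitution are assumptions of this sketch, not the claimed implementation.

import torch
import torch.nn as nn
import torchvision.models as models

class FrameFeatureExtractor(nn.Module):
    # Per-frame CNN features followed by a bidirectional GRU over the frame sequence.
    # A torchvision ResNet-50 stands in here for the SENet-152 backbone of the claim.
    def __init__(self, hidden: int = 256):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.gru = nn.GRU(2048, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> frame-level features: (B, T, 2 * hidden)
        b, t = frames.shape[:2]
        primary = self.cnn(frames.flatten(0, 1)).flatten(1)   # primary per-frame features
        feats, _ = self.gru(primary.view(b, t, -1))
        return feats

class AttentionFusion(nn.Module):
    # "First attention mechanism module": scores the importance of each frame-level
    # feature and fuses the features into one video-level feature by weighted summation.
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(frame_feats), dim=1)   # (B, T, 1) importance
        return (weights * frame_feats).sum(dim=1)                 # (B, dim) video-level feature

# Hypothetical usage: 2 videos, 8 frames each, 112x112 RGB frames.
extractor, fusion = FrameFeatureExtractor(), AttentionFusion(512)
video_feature = fusion(extractor(torch.randn(2, 8, 3, 112, 112)))
print(video_feature.shape)  # torch.Size([2, 512])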
2. The method of claim 1, wherein retrieving the most relevant target video from the video library by adopting the video-level retrieval module in the video clip retrieval model according to the retrieval information input by the user further comprises:
extracting corresponding text features based on the retrieval information;
respectively calculating the relevance between the video-level feature of each video and the text feature;
and acquiring, from the video library, the video with the highest relevance as the most relevant target video.
3. The method of claim 1, wherein extracting the corresponding text feature based on the retrieval information comprises:
performing embedding representation on each word in the retrieval information, and inputting the embedding representations into a pre-trained text feature extraction model to obtain the corresponding text feature.
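Purely for illustration of the text branch recited in claims 2 and 3, the text feature extraction can be sketched as a word-embedding layer followed by a GRU encoder whose final hidden state serves as the text feature, with retrieval performed by ranking video-level features against that text feature. The vocabulary size, dimensions, and the use of cosine similarity as the relevance measure below are assumptions, not limitations of the claims.

import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    # Embed each word of the retrieval information, then encode the embedding
    # sequence with a GRU; the final hidden state is taken as the text feature.
    def __init__(self, vocab_size: int = 10000, embed_dim: int = 300, out_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, out_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, L) word indices of the retrieval information.
        _, h = self.encoder(self.embedding(token_ids))
        return h[-1]                                    # (B, out_dim)

def retrieve_most_relevant_video(video_feats: torch.Tensor, text_feat: torch.Tensor) -> int:
    # Relevance of each video-level feature to the text feature, here cosine
    # similarity (an assumed relevance measure); return the index of the maximum.
    sims = torch.cosine_similarity(video_feats, text_feat, dim=1)
    return int(sims.argmax())

# Hypothetical usage: a library of 5 videos with 512-dimensional video-level features.
text_model = TextFeatureExtractor()
query_feat = text_model(torch.randint(0, 10000, (1, 6)))   # one query of 6 words
library = torch.randn(5, 512)
print("target video index:", retrieve_most_relevant_video(library, query_feat))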
4. The method of claim 1, wherein locating the target video clip most relevant to the retrieval information from the target video by adopting the segment locating module in the video clip retrieval model comprises:
concatenating the frame-level feature of each video frame of the target video with the text feature of the retrieval information to obtain frame-level concatenated features;
and inputting the frame-level concatenated features into a pre-trained second attention mechanism module, and acquiring start and end positions, output by the second attention mechanism module, of the target video clip most relevant to the retrieval information.
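A minimal Python (PyTorch) sketch of the segment locating step of claim 4 follows. Here the second attention mechanism module is approximated by a single linear layer that scores start and end boundaries per frame; this simplification, together with the class and variable names, is an assumption of the sketch rather than the module actually defined in the specification.

import torch
import torch.nn as nn

class SegmentLocator(nn.Module):
    # Simplified stand-in for the second attention mechanism module: each
    # frame-level feature is concatenated with the text feature, scored, and the
    # start/end positions are taken from per-frame boundary logits.
    def __init__(self, frame_dim: int = 512, text_dim: int = 512):
        super().__init__()
        self.boundary = nn.Linear(frame_dim + text_dim, 2)   # start/end logits per frame

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor):
        # frame_feats: (T, frame_dim); text_feat: (text_dim,)
        t = frame_feats.size(0)
        concat = torch.cat([frame_feats, text_feat.expand(t, -1)], dim=1)
        logits = self.boundary(concat)                        # (T, 2)
        start = int(logits[:, 0].argmax())
        end = int(logits[:, 1].argmax())
        return start, max(start, end)                          # start and end frame indices

# Hypothetical usage: a target video with 30 frames.
locator = SegmentLocator()
s, e = locator(torch.randn(30, 512), torch.randn(512))
print("target video clip spans frames", s, "to", e)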
5. A method for training a video clip retrieval model, comprising:
collecting a plurality of pieces of training video data;
training a video clip retrieval model by adopting the plurality of pieces of training video data, wherein the video clip retrieval model comprises a video-level retrieval module, a segment locating module and a joint ranking module; and the video clip retrieval model is the video clip retrieval model used in the method of any one of the preceding claims 1-4.
6. The method of claim 5, wherein each piece of the training video data comprises training retrieval information, a plurality of training videos, and a manually annotated training video clip most relevant to the training retrieval information, the training video clip being located in the training video that is most relevant to the training retrieval information among the plurality of training videos.
7. The method of claim 6, wherein training the video clip search model using the plurality of pieces of training video data comprises:
for each piece of training video data, retrieving, by adopting the video-level retrieval module, N candidate training videos most relevant to the training retrieval information from the plurality of training videos;
locating, by adopting the segment locating module, a candidate video clip most relevant to the training retrieval information from each candidate training video, to obtain N candidate video clips;
ranking, by adopting the joint ranking module, the relevance between the N candidate video clips and the training retrieval information, and acquiring the candidate video clip most relevant to the training retrieval information;
detecting whether the acquired most relevant candidate video clip is consistent with the annotated most relevant training video clip;
and if they are inconsistent, adjusting parameters of the video-level retrieval module and the segment locating module so that the acquired most relevant candidate video clip tends to be consistent with the annotated most relevant training video clip.
8. The method of claim 7, wherein ranking the relevance of the N candidate video clips to the training search information using the joint ranking module and obtaining the candidate video clip that is most relevant to the training search information comprises:
acquiring segment level characteristics of each candidate video segment in the N candidate video segments;
acquiring text characteristics of the training retrieval information;
calculating the relevance between the segment level features of each candidate video segment and the text features of the training retrieval information;
and acquiring the candidate video segment with the maximum correlation degree from the N candidate video segments, and taking the candidate video segment as the candidate video segment most correlated with the training retrieval information.
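One training iteration of claims 7 and 8 might be sketched in Python as follows. The tiny linear encoders standing in for the video-level retrieval and segment locating modules, the cosine-similarity relevance, and the cross-entropy loss used to pull the prediction towards the annotation are all assumptions made for illustration; the patent text does not fix these choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins for the modules whose parameters are adjusted during training;
# their real architectures are described elsewhere in the specification.
segment_encoder = nn.Linear(128, 64)
text_encoder = nn.Linear(128, 64)
optimizer = torch.optim.SGD(list(segment_encoder.parameters()) +
                            list(text_encoder.parameters()), lr=0.01)

def training_step(raw_segments: torch.Tensor,   # (N, 128) raw candidate clip descriptors
                  raw_text: torch.Tensor,        # (128,)   raw retrieval-information descriptor
                  annotated_index: int) -> float:
    seg_feats = segment_encoder(raw_segments)                  # segment-level features
    txt_feat = text_encoder(raw_text.unsqueeze(0))             # text feature
    scores = F.cosine_similarity(seg_feats, txt_feat, dim=1)   # joint ranking relevance
    predicted = int(scores.argmax())

    # When the predicted most relevant clip is inconsistent with the annotated one,
    # adjust the module parameters so the prediction tends towards the annotation.
    if predicted != annotated_index:
        loss = F.cross_entropy(scores.unsqueeze(0), torch.tensor([annotated_index]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return float(loss)
    return 0.0

# Hypothetical usage with N = 4 candidate clips.
print(training_step(torch.randn(4, 128), torch.randn(128), annotated_index=2))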
9. The method of claim 5, wherein each piece of training video data comprises a plurality of pieces of training retrieval information, a training video, manually annotated training retrieval information that is most relevant to the training video among the plurality of pieces of training retrieval information, and a training video clip most relevant to the most relevant training retrieval information in the training video.
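For clarity, the two kinds of training records described in claim 6 and claim 9 could be represented as follows; the field names and the use of (start, end) times in seconds are hypothetical and serve only to illustrate the data layout.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class QueryCentricRecord:
    # One piece of training retrieval information with several training videos (claim 6):
    # the annotation marks the most relevant video and the most relevant clip inside it.
    query: str
    videos: List[str]                  # e.g. file paths or video ids
    relevant_video_index: int
    relevant_clip: Tuple[float, float]

@dataclass
class VideoCentricRecord:
    # One training video with several pieces of training retrieval information (claim 9):
    # the annotation marks the most relevant query and the most relevant clip inside the video.
    video: str
    queries: List[str]
    relevant_query_index: int
    relevant_clip: Tuple[float, float]

# Hypothetical examples of the two record types.
r1 = QueryCentricRecord("a dog catches a frisbee", ["v1.mp4", "v2.mp4", "v3.mp4"], 1, (12.0, 18.5))
r2 = VideoCentricRecord("v1.mp4", ["a dog catches a frisbee", "a man cooks"], 0, (3.0, 7.0))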
10. The method of claim 9, wherein training the video clip search model using the plurality of pieces of training video data comprises:
for each piece of training video data, retrieving, by adopting the video-level retrieval module, N pieces of candidate training retrieval information most relevant to the training video from the plurality of pieces of training retrieval information;
locating, by adopting the segment locating module, a candidate video clip most relevant to each piece of candidate training retrieval information from the training video, to obtain N candidate video clips;
ranking, by adopting the joint ranking module, the relevance between the N candidate video clips and the training video, and acquiring the candidate video clip most relevant to the training video;
detecting whether the acquired most relevant candidate video clip is consistent with the annotated most relevant training video clip;
and if they are inconsistent, adjusting parameters of the video-level retrieval module and the segment locating module so that the acquired most relevant candidate video clip tends to be consistent with the annotated most relevant training video clip.
11. The method of claim 10, wherein ranking the relevance of the N candidate video segments to the training video using the joint ranking module and obtaining the candidate video segment that is most relevant to the training video comprises:
acquiring segment level characteristics of each candidate video segment in the N candidate video segments;
acquiring video level characteristics of the training video;
calculating the correlation degree between the segment level features of each candidate video segment and the video level features of the training video;
and acquiring the candidate video segment with the maximum correlation degree from the N candidate video segments, and taking the candidate video segment as the candidate video segment most correlated with the training video.
12. A video clip retrieval apparatus, the apparatus comprising:
the video level retrieval module is used for retrieving the most relevant target video from the video library according to the retrieval information input by the user;
the segment locating module is used for locating a target video clip most relevant to the retrieval information from the target video;
the video level retrieval module is used for:
for each video in the video library, acquiring a frame-level feature of each video frame by adopting a pre-trained frame feature extraction model;
for each video, inputting the frame-level features of the video into a pre-trained first attention mechanism module according to the order of the corresponding video frames in the video, identifying, by the first attention mechanism module, the importance of each frame-level feature in the video, and fusing the frame-level features into a video-level feature of the video according to the importance of each frame-level feature;
wherein the frame feature extraction model comprises a pre-trained Squeeze-and-Excitation network (SENet-152) neural network model and a bidirectional gated recurrent unit (GRU) network model;
and the video-level retrieval module is used for:
for each video in the video library, extracting a primary feature of each video frame in the video by adopting the pre-trained Squeeze-and-Excitation network (SENet-152) neural network model;
and inputting the extracted primary feature of each video frame into the pre-trained bidirectional gated recurrent unit (GRU) network model to obtain the corresponding frame-level feature.
13. The apparatus of claim 12, wherein the video-level retrieval module is further configured to:
extracting corresponding text features based on the retrieval information;
respectively calculating the relevance between the video-level feature of each video and the text feature;
and acquiring, from the video library, the video with the highest relevance as the most relevant target video.
14. The apparatus of claim 13, wherein the video level retrieval module is configured to:
performing embedding representation on each word in the retrieval information, and inputting the embedding representations into a pre-trained text feature extraction model to obtain the corresponding text feature.
15. The apparatus of claim 12, wherein the segment locating module is configured to:
concatenating the frame-level feature of each video frame of the target video with the text feature of the retrieval information to obtain frame-level concatenated features;
and inputting the frame-level concatenated features into a pre-trained second attention mechanism module, and acquiring start and end positions, output by the second attention mechanism module, of the target video clip most relevant to the retrieval information.
16. A training device for a video clip retrieval model, comprising:
the acquisition module is used for acquiring a plurality of pieces of training video data;
the training module is used for training a video clip retrieval model by adopting the plurality of pieces of training video data, wherein the video clip retrieval model comprises a video-level retrieval module, a segment locating module and a joint ranking module; the video-level retrieval module and the segment locating module are respectively the video-level retrieval module and the segment locating module used in the apparatus of any one of the preceding claims 12-15.
17. The apparatus of claim 16, wherein each piece of the training video data comprises training retrieval information, a plurality of training videos, and a manually annotated training video clip most relevant to the training retrieval information, the training video clip being located in the training video that is most relevant to the training retrieval information among the plurality of training videos.
18. The apparatus of claim 17, wherein the training module is configured to:
for each piece of training video data, retrieving, by adopting the video-level retrieval module, N candidate training videos most relevant to the training retrieval information from the plurality of training videos;
locating, by adopting the segment locating module, a candidate video clip most relevant to the training retrieval information from each candidate training video, to obtain N candidate video clips;
ranking, by adopting the joint ranking module, the relevance between the N candidate video clips and the training retrieval information, and acquiring the candidate video clip most relevant to the training retrieval information;
detecting whether the acquired most relevant candidate video clip is consistent with the annotated most relevant training video clip;
and if they are inconsistent, adjusting parameters of the video-level retrieval module and the segment locating module so that the acquired most relevant candidate video clip tends to be consistent with the annotated most relevant training video clip.
19. The apparatus of claim 18, wherein the training module is configured to:
acquiring segment level characteristics of each candidate video segment in the N candidate video segments;
acquiring text characteristics of the training retrieval information;
calculating the relevance between the segment level features of each candidate video segment and the text features of the training retrieval information;
and acquiring the candidate video segment with the maximum correlation degree from the N candidate video segments, and taking the candidate video segment as the candidate video segment most correlated with the training retrieval information.
20. The apparatus of claim 16, wherein each piece of training video data comprises a plurality of pieces of training retrieval information, a training video, manually annotated training retrieval information that is most relevant to the training video among the plurality of pieces of training retrieval information, and a training video clip most relevant to the most relevant training retrieval information in the training video.
21. The apparatus of claim 20, wherein the training module is configured to:
for each piece of training video data, retrieving, by adopting the video-level retrieval module, N pieces of candidate training retrieval information most relevant to the training video from the plurality of pieces of training retrieval information;
locating, by adopting the segment locating module, a candidate video clip most relevant to each piece of candidate training retrieval information from the training video, to obtain N candidate video clips;
ranking, by adopting the joint ranking module, the relevance between the N candidate video clips and the training video, and acquiring the candidate video clip most relevant to the training video;
detecting whether the acquired most relevant candidate video clip is consistent with the annotated most relevant training video clip;
and if they are inconsistent, adjusting parameters of the video-level retrieval module and the segment locating module so that the acquired most relevant candidate video clip tends to be consistent with the annotated most relevant training video clip.
22. The apparatus of claim 21, wherein the training module is configured to:
acquiring segment level characteristics of each candidate video segment in the N candidate video segments;
acquiring video level characteristics of the training video;
calculating the correlation degree between the segment level features of each candidate video segment and the video level features of the training video;
and acquiring the candidate video segment with the maximum correlation degree from the N candidate video segments, and taking the candidate video segment as the candidate video segment most correlated with the training video.
23. An electronic device, comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4 or 5-11.
24. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-4 or 5-11.
CN202010026271.XA 2020-01-10 2020-01-10 Video clip retrieval method, device, electronic equipment and storage medium Active CN111209439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010026271.XA CN111209439B (en) 2020-01-10 2020-01-10 Video clip retrieval method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010026271.XA CN111209439B (en) 2020-01-10 2020-01-10 Video clip retrieval method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111209439A CN111209439A (en) 2020-05-29
CN111209439B true CN111209439B (en) 2023-11-21

Family

ID=70790011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010026271.XA Active CN111209439B (en) 2020-01-10 2020-01-10 Video clip retrieval method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111209439B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590881B (en) * 2021-08-09 2024-03-19 北京达佳互联信息技术有限公司 Video clip retrieval method, training method and device for video clip retrieval model
CN114390365B (en) * 2022-01-04 2024-04-26 京东科技信息技术有限公司 Method and apparatus for generating video information
CN115495677B (en) * 2022-11-21 2023-03-21 阿里巴巴(中国)有限公司 Method and storage medium for spatio-temporal localization of video
CN116186329B (en) * 2023-02-10 2023-09-12 阿里巴巴(中国)有限公司 Video processing, searching and index constructing method, device, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890700A (en) * 2012-07-04 2013-01-23 北京航空航天大学 Method for retrieving similar video clips based on sports competition videos
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 It is a kind of to solve the method for video question and answer using multilayer notice network mechanism
CN108304506A (en) * 2018-01-18 2018-07-20 腾讯科技(深圳)有限公司 Search method, device and equipment
CN109089133A (en) * 2018-08-07 2018-12-25 北京市商汤科技开发有限公司 Method for processing video frequency and device, electronic equipment and storage medium
CN109165573A (en) * 2018-08-03 2019-01-08 百度在线网络技术(北京)有限公司 Method and apparatus for extracting video feature vector
CN109409221A (en) * 2018-09-20 2019-03-01 中国科学院计算技术研究所 Video content description method and system based on frame selection
CN109960747A (en) * 2019-04-02 2019-07-02 腾讯科技(深圳)有限公司 The generation method of video presentation information, method for processing video frequency, corresponding device
WO2019174439A1 (en) * 2018-03-13 2019-09-19 腾讯科技(深圳)有限公司 Image recognition method and apparatus, and terminal and storage medium
WO2019219083A1 (en) * 2018-05-18 2019-11-21 北京中科寒武纪科技有限公司 Video retrieval method, and method and apparatus for generating video retrieval mapping relationship
CN110598048A (en) * 2018-05-25 2019-12-20 北京中科寒武纪科技有限公司 Video retrieval method and video retrieval mapping relation generation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532996B (en) * 2017-09-15 2021-01-22 腾讯科技(深圳)有限公司 Video classification method, information processing method and server

Also Published As

Publication number Publication date
CN111209439A (en) 2020-05-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant