CN112784102B - Video retrieval method and device and electronic equipment - Google Patents


Info

Publication number
CN112784102B
CN112784102B (granted from application CN202110076616.7A; earlier publication CN112784102A)
Authority
CN
China
Prior art keywords
video
feature vector
cluster center
determining
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110076616.7A
Other languages
Chinese (zh)
Other versions
CN112784102A (en)
Inventor
薛学通
杨敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110076616.7A
Publication of CN112784102A
Application granted; publication of CN112784102B
Legal status: Active
Anticipated expiration

Classifications

    • G06F16/73 Information retrieval of video data; Querying
    • G06F16/75 Information retrieval of video data; Clustering; Classification
    • G06F16/783 Retrieval of video data using metadata automatically derived from the content
    • G06F18/23 Pattern recognition; Analysing; Clustering techniques
    • G06F18/24147 Classification techniques based on distances to closest patterns, e.g. nearest neighbour classification
    • G06N3/02 Computing arrangements based on biological models; Neural networks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video retrieval method, a video retrieval apparatus, and an electronic device, relating to the technical field of artificial intelligence and in particular to computer vision and deep learning. The implementation scheme is as follows: acquire a plurality of first feature vectors corresponding to a plurality of video frames of each video to be retrieved, and acquire a second feature vector corresponding to a query picture; cluster the first feature vectors corresponding to the videos to be retrieved to generate a plurality of cluster centers and the plurality of first feature vectors corresponding to the cluster centers; and determine the target video among the videos to be retrieved according to the time sequence dependency among the video frames, the second feature vector, the cluster centers, and the first feature vectors corresponding to the cluster centers. Because the method takes into account the influence of the time sequence dependency among video frames on the retrieval result, the result is more accurate; and because retrieval is performed on the clustering result, the number of first feature vectors to compare is reduced, making the method suitable for large-scale video retrieval.

Description

Video retrieval method and device and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence in the field of computer technology, and in particular, to a video retrieval method, apparatus, electronic device, storage medium, and computer program product.
Background
Currently, video retrieval is mostly performed by comparing pictures. However, videos contain a large number of frames; for example, a short video of about 2 minutes may include more than 3000 frames. As a result, the number of pictures to be compared becomes excessive, which consumes more computing resources, slows down video retrieval, and requires a long wait for the retrieval result.
Disclosure of Invention
Provided are a video retrieval method, apparatus, electronic device, storage medium, and computer program product.
According to a first aspect, there is provided a video retrieval method comprising: acquiring a plurality of first feature vectors corresponding to a plurality of video frames of each video to be retrieved, and acquiring a second feature vector corresponding to a query picture; clustering the plurality of first feature vectors corresponding to the plurality of videos to be retrieved to generate a plurality of cluster centers and a plurality of first feature vectors corresponding to the plurality of cluster centers; and determining a target video among the plurality of videos to be retrieved according to the time sequence dependency among the plurality of video frames, the second feature vector, the plurality of cluster centers, and the plurality of first feature vectors corresponding to the plurality of cluster centers.
According to a second aspect, there is provided a video retrieval apparatus comprising: an acquisition module, configured to acquire a plurality of first feature vectors corresponding to a plurality of video frames of each video to be retrieved and to acquire a second feature vector corresponding to a query picture; a clustering module, configured to cluster the plurality of first feature vectors corresponding to the plurality of videos to be retrieved to generate a plurality of cluster centers and a plurality of first feature vectors corresponding to the plurality of cluster centers; and a determining module, configured to determine a target video among the plurality of videos to be retrieved according to the time sequence dependency among the plurality of video frames, the second feature vector, the plurality of cluster centers, and the plurality of first feature vectors corresponding to the plurality of cluster centers.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video retrieval method of the first aspect of the present disclosure.
According to a fourth aspect, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the video retrieval method of the first aspect of the present disclosure.
According to a fifth aspect, there is provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the video retrieval method of the first aspect of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
fig. 1 is a flow diagram of a video retrieval method according to a first embodiment of the present disclosure;
FIG. 2 is a flow chart of determining a target video of a plurality of videos to be retrieved in a video retrieval method according to a second embodiment of the present disclosure;
FIG. 3 is a flow chart of determining a target video of a plurality of videos to be retrieved in a video retrieval method according to a third embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a video retrieval method according to a fourth embodiment of the present disclosure;
fig. 5 is a block diagram of a video retrieval device according to a first embodiment of the present disclosure;
fig. 6 is a block diagram of a video retrieval device according to a second embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing a video retrieval method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
AI (Artificial Intelligence) is a technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. AI technology currently offers a high degree of automation, high accuracy, and low cost, and is widely applied.
CV (Computer Vision) is the science of how to make machines "see": using cameras and computers in place of human eyes to recognize, track, and measure targets, and to further process images so that they become more suitable for human observation or for transmission to instruments for inspection.
DL (Deep Learning) is a newer research direction in the field of ML (Machine Learning). It learns the inherent regularities and representation hierarchies of sample data, giving machines an analytical capability similar to that of humans; it can recognize data such as text, images, and sound, and is widely applied in speech and image recognition.
Fig. 1 is a flow chart of a video retrieval method according to a first embodiment of the present disclosure.
As shown in fig. 1, a video retrieval method according to a first embodiment of the present disclosure includes:
s101, acquiring a plurality of first feature vectors corresponding to a plurality of video frames of a video to be retrieved, and acquiring a second feature vector corresponding to a query picture.
It should be noted that the video retrieval method of the embodiments of the present disclosure may be executed by a hardware device having data-processing capability and/or the software necessary to drive such a device. Optionally, the executing device may include workstations, servers, computers, user terminals, and other intelligent devices; user terminals include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, and the like.
It is understood that the video to be retrieved may include a plurality of video frames, each of which may correspond to one of the first feature vectors, and then the plurality of video frames of the video to be retrieved may correspond to a plurality of the first feature vectors.
Optionally, a mapping relationship or mapping table between the video frames of a video to be retrieved and their first feature vectors may be established in advance; after the video frames of the video to be retrieved are acquired, the mapping relationship or mapping table is queried to obtain the plurality of first feature vectors corresponding to them. It should be noted that the mapping relationship or mapping table may be set according to the actual situation.
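Such a lookup can be sketched as follows. The table contents, the (video id, frame index) key layout, and the helper name are hypothetical, purely to illustrate querying a pre-built mapping instead of recomputing features:

```python
# Hypothetical precomputed mapping: (video id, frame index) -> first feature vector.
feature_table = {
    ("video_a", 0): [0.12, 0.80, 0.05],
    ("video_a", 1): [0.10, 0.75, 0.09],
    ("video_b", 0): [0.90, 0.02, 0.40],
}

def lookup_first_vectors(video_id, frame_indices):
    """Fetch precomputed first feature vectors for a video's frames by
    querying the mapping table instead of re-running the extractor."""
    return [feature_table[(video_id, i)] for i in frame_indices]

print(lookup_first_vectors("video_a", [0, 1]))
# [[0.12, 0.8, 0.05], [0.1, 0.75, 0.09]]
```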
In an embodiment of the disclosure, a second feature vector corresponding to the query picture may be obtained. Wherein, the query picture refers to a picture for video retrieval. It is understood that different query pictures may correspond to different second feature vectors.
Optionally, acquiring the second feature vector corresponding to the query picture may include acquiring the query picture and generating the corresponding second feature vector according to the query picture and a CNN (Convolutional Neural Network) model. For example, the query picture may be input into the convolutional neural network model to obtain the corresponding second feature vector.
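A minimal sketch of this feature-extraction step, with a hash-based stand-in for the CNN; the `embed` helper and its 8-dimensional output are illustrative assumptions, not the patent's model:

```python
import hashlib

def embed(image_bytes: bytes, dim: int = 8) -> list:
    """Stand-in for the CNN feature extractor: map a picture to a
    fixed-length vector. A real system would feed the picture through
    a trained convolutional network and read off an intermediate
    layer's activations; here a hash supplies a deterministic vector
    purely for illustration."""
    digest = hashlib.sha256(image_bytes).digest()
    # Scale each byte into [0, 1] so vectors are comparable.
    return [digest[i] / 255.0 for i in range(dim)]

second_feature_vector = embed(b"raw bytes of the query picture")
print(len(second_feature_vector))  # 8
```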
It should be noted that, in the embodiment of the present disclosure, the type of the video to be retrieved is not excessively limited. For example, videos to be retrieved include, but are not limited to, long videos, short videos, and the like.
S102, clustering the plurality of first feature vectors corresponding to the plurality of videos to be retrieved to generate a plurality of cluster centers and a plurality of first feature vectors corresponding to the plurality of cluster centers.
In the embodiment of the disclosure, the first feature vectors corresponding to the videos to be retrieved may be clustered to generate the cluster centers and, for each cluster center, the first feature vectors belonging to it. It is understood that there is at least one cluster center, and each cluster center may correspond to a plurality of first feature vectors.
It should be noted that a cluster center also takes the form of a vector, and its dimensionality is the same as that of the first feature vectors.
Optionally, a preset algorithm may be used to cluster the first feature vectors corresponding to the videos to be retrieved, and the preset algorithm may be set according to the actual situation. For example, a KNN (K-nearest neighbor) algorithm may be used to cluster the first feature vectors corresponding to the plurality of videos to be retrieved.
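The clustering step can be sketched with a minimal k-means implementation. The text leaves the preset algorithm open, so k-means here is one common choice shown purely for illustration, not the patent's mandated method:

```python
import math
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal k-means sketch: cluster the first feature vectors into k
    groups and return (centers, assignments). Each returned center has
    the same dimensionality as the input vectors."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest center.
        assign = [min(range(k), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
centers, assign = kmeans(vectors, k=2)
print(assign)  # two well-separated groups, e.g. [0, 0, 1, 1]
```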
S103, determining a target video among the plurality of videos to be retrieved according to the time sequence dependency among the plurality of video frames, the second feature vector, the plurality of cluster centers, and the plurality of first feature vectors corresponding to the plurality of cluster centers.
In the embodiment of the disclosure, the target video can be determined from the videos to be retrieved according to the time sequence dependency among the video frames, the second feature vector, the cluster centers, and the first feature vectors corresponding to the cluster centers. In this way, the influence of the time sequence dependency among video frames on the retrieval result is taken into account, making the result more accurate; and because the first feature vectors of each video are clustered and retrieval is performed against the cluster centers and their corresponding first feature vectors, the number of first feature vectors to compare is greatly reduced, which speeds up retrieval and makes the method suitable for large-scale video retrieval.
Wherein, the time sequence dependency relationship among a plurality of video frames comprises, but is not limited to, the time sequence relationship among video frames and the like. For example, the time of video frame a is before the time of video frame B, and the time of video frame C is after the time of video frame B.
Optionally, determining the target video of the plurality of videos to be retrieved may include determining an identification of the target video of the plurality of videos to be retrieved, so as to determine the target video according to the identification of the target video. The identification of the target video includes, but is not limited to, the name, number, etc. of the target video, which is not limited herein.
In summary, the video retrieval method of the embodiments of the present disclosure determines the target video among the videos to be retrieved according to the time sequence dependency among the video frames, the second feature vector, the cluster centers, and the first feature vectors corresponding to the cluster centers. The influence of the time sequence dependency among video frames on the retrieval result is therefore taken into account, making the result more accurate; and because the first feature vectors of each video are clustered and retrieval is performed against the cluster centers and their corresponding first feature vectors, the number of first feature vectors to compare is greatly reduced, which speeds up retrieval and makes the method suitable for large-scale video retrieval.
On the basis of any of the foregoing embodiments, acquiring the plurality of first feature vectors corresponding to the plurality of video frames of the video to be retrieved in step S101 may include acquiring the video to be retrieved, extracting the plurality of video frames from the video to be retrieved, and generating the corresponding first feature vectors according to the video frames and a CNN (Convolutional Neural Network) model.
Optionally, extracting the plurality of video frames from the video to be retrieved may include extracting the plurality of video frames from the video to be retrieved at preset time intervals. The preset time interval may be set according to practical situations, for example, may be set to 2 seconds.
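The fixed-interval frame extraction can be sketched as index arithmetic; the 25 fps figure and the 2-second interval are the example values suggested by the text, both tunable:

```python
def sample_frame_indices(total_frames: int, fps: float, interval_s: float = 2.0):
    """Pick frame indices at a fixed time interval (e.g. one frame every
    2 seconds), so a 2-minute clip yields ~60 frames instead of 3000+."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))

# A 2-minute video at 25 fps: 3000 frames -> 60 sampled frames.
indices = sample_frame_indices(total_frames=3000, fps=25)
print(len(indices))  # 60
```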
Optionally, generating the corresponding first feature vector according to the video frame and the convolutional neural network model may include inputting the video frame into the convolutional neural network model to obtain the corresponding first feature vector.
The method can acquire the video to be searched, extract a plurality of video frames from the video to be searched, and generate corresponding first feature vectors according to the video frames and the convolutional neural network model so as to acquire a plurality of first feature vectors corresponding to the plurality of video frames of the video to be searched.
On the basis of any of the above embodiments, as shown in fig. 2, determining, in step S103, a target video of the plurality of videos to be retrieved according to the time sequence dependency relationship among the plurality of video frames, the second feature vector, the plurality of cluster centers, and the plurality of first feature vectors corresponding to the plurality of cluster centers may include:
s201, according to time sequence dependency relations among a plurality of video frames, determining an association cluster center with a link relation with a cluster center and the weight of the link relation.
In the embodiment of the disclosure, the associated cluster centers having a link relationship with each cluster center, and the weights of those link relationships, can be determined according to the time sequence dependency among the plurality of video frames, so that the influence of the time sequence dependency on the associated cluster centers and the link weights is taken into account.
It is understood that different cluster centers may correspond to different associated cluster centers, and one cluster center may correspond to at least one associated cluster center.
Optionally, determining the association cluster center having the link relation with the cluster center and the weight of the link relation according to the time sequence dependency relation among the plurality of video frames may include determining that a first video frame of the plurality of video frames is an adjacent video frame of a second video frame, determining the cluster center corresponding to the first video frame as the association cluster center having the link relation with the cluster center corresponding to the second video frame, and adding one to the weight of the link relation between the cluster center corresponding to the second video frame and the cluster center corresponding to the first video frame. Therefore, the associated clustering center can be determined according to the clustering center corresponding to the adjacent video frames, and the weight of the link relation between the adjacent video frames is increased by one.
For example, the cluster centers, associated cluster centers, and link weights may be represented and updated as a graph network: the cluster centers serve as the vertices of the graph, and the edges are determined from the time sequence dependency among the plurality of video frames. For instance, if video frame A is the first video frame after video frame B, the cluster center corresponding to frame A is determined as an associated cluster center having a link relationship with the cluster center corresponding to frame B, and the two cluster centers directly form one edge. After all edges are determined, the weight of each link relationship can be obtained from the number of links in the one-hop neighborhood of the vertex.
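The edge-weight update described above can be sketched as follows, given the time-ordered sequence of cluster assignments for the frames; skipping self-links (adjacent frames in the same cluster) is an illustrative assumption not stated in the text:

```python
from collections import defaultdict

def build_link_weights(frame_cluster_ids):
    """Build the cluster-center graph from the frames' temporal order:
    for each pair of adjacent frames, add one to the weight of the link
    between their cluster centers. `frame_cluster_ids` lists, in time
    order, the cluster center assigned to each video frame."""
    weights = defaultdict(int)
    for prev, nxt in zip(frame_cluster_ids, frame_cluster_ids[1:]):
        if prev != nxt:  # assumption: a self-link carries no signal
            weights[(prev, nxt)] += 1
    return dict(weights)

# Frames visit centers A, A, B, C, B over time.
print(build_link_weights(["A", "A", "B", "C", "B"]))
# {('A', 'B'): 1, ('B', 'C'): 1, ('C', 'B'): 1}
```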
S202, determining the target video among the plurality of videos to be retrieved according to the second feature vector, the plurality of cluster centers, the plurality of first feature vectors corresponding to the plurality of cluster centers, the associated cluster centers having a link relationship with the cluster centers, and the weights of the link relationships.
In this way, the target video can be determined while taking into account the influence on the retrieval result of the associated cluster centers having a link relationship with each cluster center and of the weights of those link relationships, making the video retrieval result more accurate.
On the basis of any of the above embodiments, as shown in fig. 3, determining, in step S202, a target video of the plurality of videos to be retrieved according to the second feature vector, the plurality of cluster centers, the plurality of first feature vectors corresponding to the plurality of cluster centers, the associated cluster center having a link relationship with the cluster center, and the weight of the link relationship may include:
s301, calculating the first similarity between the second feature vector and the clustering center.
In embodiments of the present disclosure, the first similarity between the second feature vector and each cluster center may be calculated. It is understood that the second feature vector has a different first similarity with different cluster centers.
Alternatively, the second feature vector and the cluster center may be input into a similarity model, and the first similarity between the second feature vector and the cluster center may be calculated by the similarity model. The similarity model can be set according to actual conditions.
S302, determining a first candidate cluster center in the plurality of cluster centers according to the first similarity.
In embodiments of the present disclosure, a first candidate cluster center of a plurality of cluster centers may be determined according to a first similarity, such that the first candidate cluster center is selected from the plurality of cluster centers according to the first similarity.
Optionally, determining the first candidate cluster centers among the plurality of cluster centers according to the first similarity may include sorting the first similarities from high to low and taking the cluster centers corresponding to the top N first similarities as the first candidate cluster centers. In this way, the N cluster centers with the highest first similarity are screened out from the plurality of cluster centers. N is an integer greater than 1 and may be set according to the actual situation.
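The top-N selection can be sketched as follows; cosine similarity stands in for the unspecified similarity model, purely for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_n_centers(query_vec, centers, n):
    """Rank cluster centers by their first similarity to the query
    vector and keep the indices of the top N as first candidates."""
    ranked = sorted(range(len(centers)),
                    key=lambda i: cosine(query_vec, centers[i]),
                    reverse=True)
    return ranked[:n]

centers = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_n_centers([1.0, 0.1], centers, n=2))  # [0, 2]
```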
S303, determining a second candidate cluster center in the cluster centers with the link relation with the first candidate cluster center according to the weight of the link relation corresponding to the first candidate cluster center.
In the embodiment of the disclosure, the second candidate cluster center in the cluster centers having the link relationship with the first candidate cluster center can be determined according to the weight of the link relationship corresponding to the first candidate cluster center, so that the second candidate cluster center having the link relationship with the first candidate cluster center is screened out from the plurality of cluster centers according to the weight of the link relationship.
Optionally, determining the second candidate cluster centers among the cluster centers having a link relationship with a first candidate cluster center according to the weights of the corresponding link relationships may include sorting those weights from high to low and taking the cluster centers corresponding to the top M weights as the second candidate cluster centers. In this way, the M cluster centers with the highest link weights are screened out. M is an integer greater than 1 and may be set according to the actual situation.
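The top-M selection over link weights can be sketched in the same style; representing the link relationships as a (source, destination) -> weight dictionary is an illustrative assumption:

```python
def top_m_linked(first_candidate, link_weights, m):
    """From the centers that share a link with a first candidate center,
    keep the M with the highest link weights as second candidates."""
    neighbours = [(dst, w) for (src, dst), w in link_weights.items()
                  if src == first_candidate]
    neighbours.sort(key=lambda pair: pair[1], reverse=True)
    return [dst for dst, _ in neighbours[:m]]

weights = {("A", "B"): 5, ("A", "C"): 2, ("A", "D"): 7, ("B", "C"): 3}
print(top_m_linked("A", weights, m=2))  # ['D', 'B']
```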
S304, calculating a second similarity between the second feature vector and the candidate first feature vector, wherein the candidate first feature vector is the first feature vector corresponding to the first candidate cluster center and the second candidate cluster center.
In an embodiment of the disclosure, a second similarity between the second feature vector and a candidate first feature vector may be calculated, where the candidate first feature vector is a first feature vector corresponding to the first candidate cluster center and the second candidate cluster center. That is, the candidate first feature vector includes a first feature vector corresponding to the first candidate cluster center and a first feature vector corresponding to the second candidate cluster center.
It should be noted that, for the related content of calculating the second similarity between the second feature vector and the candidate first feature vector, reference may be made to the above embodiment, and details thereof are not repeated here.
And S305, determining target videos in the videos to be retrieved according to the second similarity.
Optionally, determining the target video of the plurality of videos to be retrieved according to the second similarity may include determining a target first feature vector of the candidate first feature vectors according to the second similarity, and determining the corresponding target video according to the target first feature vector.
Determining the target first feature vectors among the candidate first feature vectors according to the second similarity may include sorting the second similarities from high to low and taking the candidate first feature vectors corresponding to the top S second similarities as the target first feature vectors. In this way, the S candidate first feature vectors with the highest second similarity are screened out. S is an integer greater than 1 and may be set according to the actual situation.
The determining the corresponding target video according to the target first feature vector may include obtaining a target video identifier corresponding to the target first feature vector, and determining the corresponding target video according to the target video identifier.
In this method, the first similarity between the second feature vector and each cluster center may be calculated, a first candidate cluster center may be determined among the plurality of cluster centers according to the first similarity, and a second candidate cluster center may be determined among the cluster centers linked to the first candidate cluster center according to the weights of the link relationships corresponding to the first candidate cluster center. A second similarity between the second feature vector and each candidate first feature vector may then be calculated, where the candidate first feature vectors are the first feature vectors corresponding to the first and second candidate cluster centers, and the target video among the plurality of videos to be retrieved may be determined according to the second similarity.
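The query-time flow summarized above can be sketched as follows. This is a minimal illustration, not the patented implementation: cosine similarity is an assumption (the embodiment does not fix a similarity measure), and the index layout and all names (`retrieve`, `n_first`, `n_second`, `s`) are ours:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity; an assumption, since the embodiment does not fix a measure.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, centers, members, link_weights, n_first=1, n_second=1, s=2):
    # 1. First similarity: query vs. every cluster center.
    first_sim = {cid: cosine(query_vec, vec) for cid, vec in centers.items()}
    first_candidates = sorted(first_sim, key=first_sim.get, reverse=True)[:n_first]

    # 2. Second candidate centers: follow the highest-weight link relations.
    second_candidates = []
    for cid in first_candidates:
        linked = link_weights.get(cid, {})
        second_candidates += sorted(linked, key=linked.get, reverse=True)[:n_second]

    # 3. Second similarity: query vs. the first feature vectors of the
    #    candidate centers; return the video ids of the s best matches.
    scored = []
    for cid in set(first_candidates) | set(second_candidates):
        for video_id, vec in members.get(cid, []):
            scored.append((cosine(query_vec, vec), video_id))
    scored.sort(reverse=True)
    return [video_id for _, video_id in scored[:s]]

# Toy index: two cluster centers, one member vector each, one link 0 -> 1.
centers = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
members = {0: [("vidA", np.array([1.0, 0.1]))], 1: [("vidB", np.array([0.1, 1.0]))]}
link_weights = {0: {1: 3}}

result = retrieve(np.array([1.0, 0.0]), centers, members, link_weights)
```

With this toy index, the query vector is closest to center 0, center 1 is pulled in through the link weight, and both member vectors are re-ranked by the second similarity.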
As shown in fig. 4, a plurality of video frames may be extracted from the videos to be retrieved and input to a convolutional neural network model to obtain a plurality of corresponding first feature vectors, which are then clustered to generate a plurality of cluster centers and the first feature vectors corresponding to each cluster center. A query picture may be input to the same convolutional neural network model to obtain a corresponding second feature vector, and the target video among the plurality of videos to be retrieved may then be determined according to the time-sequence dependency relationship among the plurality of video frames, the second feature vector, the plurality of cluster centers, and the first feature vectors corresponding to the cluster centers.
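A minimal sketch of the offline indexing half of fig. 4, under loud assumptions: a fixed random projection stands in for the convolutional neural network model, and plain k-means is substituted for the clustering step (the embodiment names a k-nearest-neighbor classification algorithm); all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_features(frames, dim=8):
    # Stand-in for the convolutional neural network model: a fixed random
    # projection of flattened frames (our assumption, not the patent's model).
    proj = np.random.default_rng(42).normal(size=(frames.shape[1], dim))
    return frames @ proj

def kmeans(vectors, k, iters=10):
    # Minimal k-means substituted for the clustering step described above.
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        dists = ((vectors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = vectors[assign == j].mean(axis=0)
    return centers, assign

# Offline indexing: frames -> first feature vectors -> cluster centers.
frames = rng.normal(size=(20, 32))        # 20 frames, flattened to 32 "pixels"
first_vectors = cnn_features(frames)      # one first feature vector per frame
centers, assignments = kmeans(first_vectors, k=3)
```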
Fig. 5 is a block diagram of a video retrieval device according to a first embodiment of the present disclosure.
As shown in fig. 5, a video retrieval apparatus 500 of an embodiment of the present disclosure includes: an acquisition module 501, a clustering module 502 and a determination module 503.
The obtaining module 501 is configured to obtain a plurality of first feature vectors corresponding to a plurality of video frames of a video to be retrieved, and obtain a second feature vector corresponding to a query picture;
the clustering module 502 is configured to cluster a plurality of first feature vectors corresponding to a plurality of videos to be retrieved, so as to generate a plurality of cluster centers and a plurality of first feature vectors corresponding to the plurality of cluster centers;
The determining module 503 is configured to determine a target video among the plurality of videos to be retrieved according to the time-sequence dependency relationship among the plurality of video frames, the second feature vector, the plurality of cluster centers, and the plurality of first feature vectors corresponding to the plurality of cluster centers.
In one embodiment of the present disclosure, the obtaining module 501 is specifically configured to: acquiring the video to be retrieved; extracting the plurality of video frames from the video to be retrieved; and generating the corresponding first feature vector according to the video frame and the convolutional neural network model.
In one embodiment of the present disclosure, the obtaining module 501 is specifically configured to: acquiring the query picture; and generating the corresponding second feature vector according to the query picture and the convolutional neural network model.
In one embodiment of the present disclosure, the clustering module 502 is specifically configured to: cluster the plurality of first feature vectors corresponding to the plurality of videos to be retrieved by using a k-nearest-neighbor classification algorithm.
In summary, the video retrieval device according to the embodiment of the present disclosure may determine a target video among a plurality of videos to be retrieved according to the time-sequence dependency relationship among a plurality of video frames, the second feature vector, a plurality of cluster centers, and the first feature vectors corresponding to the plurality of cluster centers. Taking the influence of the time-sequence dependency relationship among video frames into account makes the retrieval result more accurate. In addition, because the first feature vectors corresponding to the videos are clustered and retrieval is performed against the cluster centers and the first feature vectors corresponding to them, the number of first feature vectors to compare is greatly reduced, which accelerates retrieval and makes the method suitable for video collections of large scale.
Fig. 6 is a block diagram of a video retrieval device according to a second embodiment of the present disclosure.
As shown in fig. 6, a video retrieval apparatus 600 of an embodiment of the present disclosure includes: an acquisition module 601, a clustering module 602 and a determination module 603.
Wherein the acquisition module 601 has the same function and structure as the acquisition module 501, and the clustering module 602 has the same function and structure as the clustering module 502.
In one embodiment of the present disclosure, the determining module 603 includes: a first determining unit 6031 configured to determine, according to the time-sequence dependency relationship between the plurality of video frames, an associated cluster center having a link relationship with the cluster center and the weight of the link relationship; and a second determining unit 6032 configured to determine the target video among the plurality of videos to be retrieved according to the second feature vector, the plurality of cluster centers, the plurality of first feature vectors corresponding to the plurality of cluster centers, the associated cluster center having a link relationship with the cluster center, and the weight of the link relationship.
In one embodiment of the present disclosure, the second determining unit 6032 includes: a first calculating subunit, configured to calculate a first similarity between the second feature vector and the cluster center; a first determining subunit configured to determine a first candidate cluster center of the plurality of cluster centers according to the first similarity; a second determining subunit, configured to determine, according to the weight of the link relationship corresponding to the first candidate cluster center, a second candidate cluster center in cluster centers having a link relationship with the first candidate cluster center; a second computing subunit, configured to compute a second similarity between the second feature vector and a candidate first feature vector, where the candidate first feature vector is the first feature vector corresponding to the first candidate cluster center and the second candidate cluster center; and the third determining subunit is used for determining the target video in the videos to be retrieved according to the second similarity.
In one embodiment of the disclosure, the third determining subunit is specifically configured to: determining a target first feature vector in the candidate first feature vectors according to the second similarity; and determining the corresponding target video according to the target first feature vector.
In one embodiment of the present disclosure, the first determining unit 6031 is specifically configured to: and determining a cluster center corresponding to the first video frame as an associated cluster center with a link relation with the cluster center corresponding to the second video frame, and adding one to the weight of the link relation between the cluster center corresponding to the second video frame and the cluster center corresponding to the first video frame.
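The weight update described above can be sketched as follows, assuming that the "first" and "second" video frames are temporally adjacent, that the link runs from the second frame's cluster center to the first frame's, and that self-links between identical centers are not recorded (the last point is our assumption):

```python
from collections import defaultdict

def build_link_weights(frame_cluster_ids):
    # frame_cluster_ids: cluster center id of each frame, in playback order.
    # Returns weights[second_center][first_center] over adjacent frame pairs.
    weights = defaultdict(lambda: defaultdict(int))
    for first, second in zip(frame_cluster_ids, frame_cluster_ids[1:]):
        if first != second:  # assumption: self-links are not recorded
            weights[second][first] += 1
    return weights

# Cluster assignments of five consecutive frames of one video.
links = build_link_weights([0, 0, 1, 2, 1])
```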
In summary, the video retrieval device according to the embodiment of the present disclosure may determine a target video among a plurality of videos to be retrieved according to the time-sequence dependency relationship among a plurality of video frames, the second feature vector, a plurality of cluster centers, and the first feature vectors corresponding to the plurality of cluster centers. Taking the influence of the time-sequence dependency relationship among video frames into account makes the retrieval result more accurate. In addition, because the first feature vectors corresponding to the videos are clustered and retrieval is performed against the cluster centers and the first feature vectors corresponding to them, the number of first feature vectors to compare is greatly reduced, which accelerates retrieval and makes the method suitable for video collections of large scale.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as the video retrieval method described in fig. 1-4. For example, in some embodiments, the video retrieval method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the video retrieval method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the video retrieval method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
According to an embodiment of the present disclosure, there is also provided a computer program product, including a computer program, where the computer program, when executed by a processor, implements the video retrieval method according to the above embodiments of the present disclosure.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (16)

1. A video retrieval method comprising:
acquiring a plurality of first feature vectors corresponding to a plurality of video frames of a video to be retrieved, and acquiring a second feature vector corresponding to a query picture;
clustering the plurality of first feature vectors corresponding to the plurality of videos to be retrieved to generate a plurality of cluster centers and a plurality of first feature vectors corresponding to the plurality of cluster centers;
determining a target video among the plurality of videos to be retrieved according to the time-sequence dependency relationship among the plurality of video frames, the second feature vector, the plurality of cluster centers and the plurality of first feature vectors corresponding to the plurality of cluster centers; wherein
the determining the target video among the plurality of videos to be retrieved according to the time-sequence dependency relationship among the plurality of video frames, the second feature vector, the plurality of cluster centers, and the plurality of first feature vectors corresponding to the plurality of cluster centers includes:
determining an associated cluster center having a link relationship with the cluster center and the weight of the link relationship according to the time-sequence dependency relationship among the plurality of video frames; and
determining the target video among the plurality of videos to be retrieved according to the second feature vector, the plurality of cluster centers, the plurality of first feature vectors corresponding to the plurality of cluster centers, the associated cluster center having a link relationship with the cluster center, and the weight of the link relationship.
2. The video retrieval method according to claim 1, wherein the determining the target video of the plurality of videos to be retrieved according to the second feature vector, the plurality of cluster centers, the plurality of first feature vectors corresponding to the plurality of cluster centers, the associated cluster center having a link relationship with the cluster center, and the weight of the link relationship includes:
calculating the first similarity between the second feature vector and the clustering center;
determining a first candidate cluster center of the plurality of cluster centers according to the first similarity;
determining a second candidate cluster center in the cluster centers with the link relation with the first candidate cluster center according to the weight of the link relation corresponding to the first candidate cluster center;
calculating a second similarity between the second feature vector and a candidate first feature vector, wherein the candidate first feature vector is the first feature vector corresponding to the first candidate cluster center and the second candidate cluster center;
and determining the target video in the videos to be retrieved according to the second similarity.
3. The video retrieval method according to claim 2, wherein the determining the target video of the plurality of videos to be retrieved according to the second similarity includes:
determining a target first feature vector in the candidate first feature vectors according to the second similarity;
and determining the corresponding target video according to the target first feature vector.
4. The video retrieval method according to claim 1, wherein the determining, according to the time-sequence dependency relationship among the plurality of video frames, the associated cluster center having a link relationship with the cluster center and the weight of the link relationship includes:
and determining a cluster center corresponding to the first video frame as an associated cluster center with a link relation with the cluster center corresponding to the second video frame, and adding one to the weight of the link relation between the cluster center corresponding to the second video frame and the cluster center corresponding to the first video frame.
5. The video retrieval method according to claim 1, wherein the acquiring a plurality of first feature vectors corresponding to a plurality of video frames of the video to be retrieved includes:
acquiring the video to be retrieved;
extracting the plurality of video frames from the video to be retrieved;
and generating the corresponding first feature vector according to the video frame and the convolutional neural network model.
6. The video retrieval method according to claim 1, wherein the obtaining the second feature vector corresponding to the query picture includes:
acquiring the query picture;
and generating the corresponding second feature vector according to the query picture and the convolutional neural network model.
7. The video retrieval method according to claim 1, wherein the clustering the plurality of first feature vectors corresponding to the plurality of videos to be retrieved includes:
and clustering the plurality of first feature vectors corresponding to the plurality of videos to be retrieved by using a k-nearest-neighbor classification algorithm.
8. A video retrieval apparatus comprising:
the acquisition module is used for acquiring a plurality of first feature vectors corresponding to a plurality of video frames of the video to be retrieved and acquiring a second feature vector corresponding to the query picture;
the clustering module is used for clustering the plurality of first feature vectors corresponding to the plurality of videos to be retrieved to generate a plurality of cluster centers and a plurality of first feature vectors corresponding to the plurality of cluster centers;
the determining module is used for determining a target video among the plurality of videos to be retrieved according to the time-sequence dependency relationship among the plurality of video frames, the second feature vector, the plurality of cluster centers and the plurality of first feature vectors corresponding to the plurality of cluster centers; wherein
The determining module includes:
a first determining unit, configured to determine, according to a time-sequence dependency relationship between the plurality of video frames, an association cluster center having a link relationship with the cluster center and a weight of the link relationship;
and the second determining unit is used for determining the target video among the plurality of videos to be retrieved according to the second feature vector, the plurality of cluster centers, the plurality of first feature vectors corresponding to the plurality of cluster centers, the associated cluster center having a link relationship with the cluster center, and the weight of the link relationship.
9. The video retrieval device according to claim 8, wherein the second determination unit includes:
a first calculating subunit, configured to calculate a first similarity between the second feature vector and the cluster center;
a first determining subunit configured to determine a first candidate cluster center of the plurality of cluster centers according to the first similarity;
a second determining subunit, configured to determine, according to the weight of the link relationship corresponding to the first candidate cluster center, a second candidate cluster center in cluster centers having a link relationship with the first candidate cluster center;
a second computing subunit, configured to compute a second similarity between the second feature vector and a candidate first feature vector, where the candidate first feature vector is the first feature vector corresponding to the first candidate cluster center and the second candidate cluster center;
and the third determining subunit is used for determining the target video in the videos to be retrieved according to the second similarity.
10. The video retrieval device according to claim 9, wherein the third determination subunit is specifically configured to:
determining a target first feature vector in the candidate first feature vectors according to the second similarity;
and determining the corresponding target video according to the target first feature vector.
11. The video retrieval device according to claim 8, wherein the first determining unit is specifically configured to:
and determining a cluster center corresponding to the first video frame as an associated cluster center with a link relation with the cluster center corresponding to the second video frame, and adding one to the weight of the link relation between the cluster center corresponding to the second video frame and the cluster center corresponding to the first video frame.
12. The video retrieval device of claim 8, wherein the acquisition module is specifically configured to:
acquiring the video to be retrieved;
extracting the plurality of video frames from the video to be retrieved;
and generating the corresponding first feature vector according to the video frame and the convolutional neural network model.
13. The video retrieval device of claim 8, wherein the acquisition module is specifically configured to:
acquiring the query picture;
and generating the corresponding second feature vector according to the query picture and the convolutional neural network model.
14. The video retrieval device of claim 8, wherein the clustering module is specifically configured to:
and clustering the plurality of first feature vectors corresponding to the plurality of videos to be retrieved by using a k-nearest-neighbor classification algorithm.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video retrieval method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the video retrieval method of any one of claims 1-7.
CN202110076616.7A 2021-01-20 2021-01-20 Video retrieval method and device and electronic equipment Active CN112784102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110076616.7A CN112784102B (en) 2021-01-20 2021-01-20 Video retrieval method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110076616.7A CN112784102B (en) 2021-01-20 2021-01-20 Video retrieval method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112784102A CN112784102A (en) 2021-05-11
CN112784102B true CN112784102B (en) 2023-07-28

Family

ID=75757939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110076616.7A Active CN112784102B (en) 2021-01-20 2021-01-20 Video retrieval method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112784102B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837022A (en) * 2021-09-02 2021-12-24 北京新橙智慧科技发展有限公司 Method for rapidly searching video pedestrian
CN114741544B (en) * 2022-04-29 2023-02-07 北京百度网讯科技有限公司 Image retrieval method, retrieval library construction method, device, electronic equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000341631A (en) * 1999-05-25 2000-12-08 Nippon Telegr & Teleph Corp <Ntt> Method and device for retrieving video and storage medium recording video retrieval program
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN109919220A (en) * 2019-03-04 2019-06-21 北京字节跳动网络技术有限公司 Method and apparatus for generating the feature vector of video
CN111241345A (en) * 2020-02-18 2020-06-05 腾讯科技(深圳)有限公司 Video retrieval method and device, electronic equipment and storage medium
CN111368133A (en) * 2020-04-16 2020-07-03 腾讯科技(深圳)有限公司 Method and device for establishing index table of video library, server and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Huiyu Zhou, Abdul H. Sadka, Mohammad R. Swash, Jawid Azizi, Umar A. Sadiq. Feature extraction and clustering for dynamic video summarisation. Neurocomputing 73 (2010). Full text. *
A video spatio-temporal feature extraction algorithm and research on its application; Zeng Fanzhi; Cheng Yong; Zhou Yan; Journal of Foshan University (Natural Science Edition) (03); full text. *
A kernel clustering retrieval algorithm based on PCA key-frame similarity; Zhang Jie, Qi Guanhong, Ye Peng, Chen Yi; Control Engineering of China; Vol. 24 (No. 4); full text. *

Also Published As

Publication number Publication date
CN112784102A (en) 2021-05-11

CN116188875B (en) Image classification method, device, electronic equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant