CN113821687A - Content retrieval method and device and computer readable storage medium - Google Patents

Content retrieval method and device and computer readable storage medium

Info

Publication number
CN113821687A
Authority
CN
China
Prior art keywords
content
sample
features
video
modal
Prior art date
Legal status
Pending
Application number
CN202110733613.6A
Other languages
Chinese (zh)
Inventor
王文哲
张梦丹
彭湃
孙星
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110733613.6A
Publication of CN113821687A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a content retrieval method, a content retrieval device and a computer-readable storage medium. After the content to be retrieved for retrieving the target content is obtained, when the content to be retrieved is video content, multi-modal feature extraction is performed on the video content to obtain the modal features of each modality, feature extraction is performed on the modal features of each modality to obtain the modal content features of each modality, the modal content features are fused to obtain the video features of the video content, and the target text content corresponding to the video content is retrieved from a preset content set according to the video features. The scheme can improve the accuracy of content retrieval.

Description

Content retrieval method and device and computer readable storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a content retrieval method, a content retrieval device, and a computer-readable storage medium.
Background
In recent years, a huge amount of content has been generated on the Internet, and this content may be of various types, for example, text and video. In order to better retrieve the required content from such massive content, matching content of one type is usually retrieved based on content of another type; for example, matching text content may be retrieved based on video content provided by the user. Existing content retrieval usually adopts a feature extraction network to directly extract video features and text features for feature matching, so as to complete the content retrieval.
In the research and practice of the prior art, the inventor of the present invention found that a video comprises multiple modalities and complex semantics, and the video features extracted by a single feature extraction network are not accurate enough to correspond to the text semantics, so the accuracy of content retrieval is insufficient.
Disclosure of Invention
The embodiment of the invention provides a content retrieval method, a content retrieval device and a computer readable storage medium, which can improve the accuracy of content retrieval.
A content retrieval method, comprising:
acquiring content to be retrieved for retrieving target content;
when the content to be retrieved is video content, performing multi-modal feature extraction on the video content to obtain modal features of each modality;
respectively extracting the characteristics of the modal characteristics of each mode to obtain the modal content characteristics of each mode;
and fusing the modal content features to obtain video features of the video content, and retrieving target text content corresponding to the video content in a preset content set according to the video features.
Accordingly, an embodiment of the present invention provides a content retrieval apparatus, including:
an acquisition unit configured to acquire content to be retrieved for retrieving target content;
the first extraction unit is used for performing multi-modal feature extraction on the video content to obtain modal features of each mode when the content to be retrieved is the video content;
the second extraction unit is used for respectively extracting the characteristics of the modal characteristics of each mode to obtain the modal video characteristics of each mode;
and the text retrieval unit is used for fusing the modal video features to obtain the video features of the video content, and retrieving the target text content corresponding to the video content in the preset content set according to the video features.
Optionally, in some embodiments, the first extraction unit may be specifically configured to perform multi-modal feature extraction on the video content by using a trained content retrieval model to obtain an initial modal feature of each modality in the video content; extract video frames from the video content, and perform multi-modal feature extraction on the video frames by adopting the trained content retrieval model to obtain basic modal features of each video frame; and screen out target modal features corresponding to each modality from the basic modal features, and fuse the target modal features and the corresponding initial modal features to obtain the modal features of the video content of each modality.
Optionally, in some embodiments, the second extraction unit may be specifically configured to identify a target video feature extraction network corresponding to each modality in the video feature extraction networks of the trained content retrieval model; and performing feature extraction on the modal features by adopting the target video feature extraction network to obtain modal video features of each modal.
Optionally, in some embodiments, the content retrieval apparatus may further include a training unit, where the training unit may be specifically configured to obtain a content sample set, where the content sample set includes a video sample and a text sample, and the text sample includes at least one text word; perform multi-modal feature extraction on the video sample by adopting a preset content retrieval model to obtain sample modal features of each modality; respectively perform feature extraction on the sample modal features of each modality to obtain the sample modal content features of the video sample, and fuse the sample modal content features to obtain the sample video features of the video sample; and perform feature extraction on the text sample to obtain sample text features and text word features corresponding to each text word, and converge the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain the trained content retrieval model.
Optionally, in some embodiments, the training unit may be specifically configured to determine feature loss information of the content sample set according to the sample modal content features and the text word features; determining content loss information for the set of content samples based on the sample video features and sample text features; and fusing the characteristic loss information and the content loss information, and converging a preset content retrieval model based on the fused loss information to obtain a trained content retrieval model.
Optionally, in some embodiments, the training unit may be specifically configured to calculate a feature similarity between the sample modal content feature and a text word feature, so as to obtain a first feature similarity; determining the sample similarity between the video sample and the text sample according to the first feature similarity; and calculating the characteristic distance between the video sample and the text sample based on the sample similarity so as to obtain the characteristic loss information of the content sample set.
Optionally, in some embodiments, the training unit may be specifically configured to perform feature interaction on the sample modal content features and text word features according to the first feature similarity, so as to obtain post-interaction video features and post-interaction text word features; calculating the feature similarity between the video features after the interaction and the text word features after the interaction to obtain a second feature similarity; and fusing the second feature similarity to obtain the sample similarity between the video sample and the text sample.
Optionally, in some embodiments, the training unit may be specifically configured to perform normalization processing on the first feature similarity to obtain a target feature similarity; determining an association weight of the sample modal content features according to the target feature similarity, wherein the association weight is used for indicating an association relation between the sample modal content features and text word features; and weighting the sample modal content features based on the associated weights, and updating the text word features based on the weighted sample modal content features to obtain the video features after interaction and the text word features after interaction.
Optionally, in some embodiments, the training unit may be specifically configured to take the weighted sample modal content features as initial post-interaction video features, and update the text word features based on the initial post-interaction video features to obtain initial post-interaction text word features; calculating the feature similarity of the video features after the initial interaction and the text word features after the initial interaction to obtain a third feature similarity; and updating the video characteristics after the initial interaction and the text word characteristics after the initial interaction according to the third characteristic similarity to obtain the video characteristics after the interaction and the text word characteristics after the interaction.
Optionally, in some embodiments, the training unit may be specifically configured to perform feature interaction on the initially interacted video features and the initially interacted text word features according to the third feature similarity, so as to obtain target interacted video features and target interacted text word features; taking the video features after the target interaction as initial video features after the target interaction, and taking the text word features after the target interaction as initial text word features after the target interaction; and returning to the step of calculating the feature similarity of the video features after the initial interaction and the text word features after the initial interaction until the feature interaction times of the video features after the initial interaction and the text word features after the initial interaction reach preset times, and obtaining the video features after the interaction and the text word features after the interaction.
Optionally, in some embodiments, the training unit may be specifically configured to obtain a preset feature boundary value corresponding to the content sample set; screening out a first content sample pair of which the video sample is matched with the text sample and a second content sample pair of which the video sample is not matched with the text sample from the content sample set according to the sample similarity; and calculating the characteristic distance between the first content sample pair and the second content sample pair based on the preset characteristic boundary value to obtain the characteristic loss information of the content sample set.
Optionally, in some embodiments, the training unit may be specifically configured to screen out, in the second content sample pair, a content sample pair with a largest sample similarity, so as to obtain a target content sample pair; calculating a similarity difference between the sample similarity of the first content sample pair and the sample similarity of the target content sample pair to obtain a first similarity difference; and fusing the preset characteristic boundary value and the first similarity difference value to obtain the characteristic loss information of the content sample set.
Optionally, in some embodiments, the training unit may be specifically configured to calculate a feature similarity between the sample video feature and the sample text feature, so as to obtain a content similarity between the video sample and the text sample; screen out, from the content sample set according to the content similarity, a third content sample pair in which the video sample matches the text sample and a fourth content sample pair in which the video sample does not match the text sample; and acquire a preset content boundary value corresponding to the content sample set, and calculate a content difference value between the third content sample pair and the fourth content sample pair according to the preset content boundary value to obtain the content loss information of the content sample set.
Optionally, in some embodiments, the training unit may be specifically configured to calculate a similarity difference between the content similarity of the third content sample pair and the content similarity of the fourth content sample pair, so as to obtain a second similarity difference; fusing the second similarity difference value with a preset content boundary value to obtain a content difference value between the third content sample pair and the fourth content sample pair; and carrying out standardization processing on the content difference value to obtain the content loss information of the content sample set.
Optionally, in some embodiments, the content retrieval device may further include a video retrieval unit, where the video retrieval unit is specifically configured to, when the content to be retrieved is text content, perform feature extraction on the text content to obtain a text feature of the text content; and retrieve target video content corresponding to the text content from the preset content set according to the text feature.
In addition, an embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores an application program, and the processor is configured to run the application program in the memory to implement the content retrieval method provided in the embodiment of the present invention.
In addition, the embodiment of the present invention further provides a computer-readable storage medium, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to perform the steps in any content retrieval method provided by the embodiment of the present invention.
After the content to be retrieved for retrieving the target content is obtained, when the content to be retrieved is video content, multi-modal feature extraction is performed on the video content to obtain the modal features of each modality, feature extraction is performed on the modal features of each modality to obtain the modal content features of each modality, the modal content features are fused to obtain the video features of the video content, and the target text content corresponding to the video content is retrieved from a preset content set according to the video features. According to the scheme, multi-modal feature extraction is first performed on the video content, and then the modal content features are respectively extracted from the modal features corresponding to each modality, which improves the accuracy of the per-modality features of the video; the modal content features are then fused to obtain the video features of the video content, so that the extracted video features can better express the information in the video, and therefore the accuracy of content retrieval can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of a scene of a content retrieval method provided in an embodiment of the present invention;
FIG. 2 is a flow chart of a content retrieval method provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of modal feature extraction for video content according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating training of a preset content retrieval model according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a content retrieval method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a content retrieval device according to an embodiment of the present invention;
fig. 7 is another schematic structural diagram of a content retrieval device according to an embodiment of the present invention;
fig. 8 is another schematic structural diagram of a content retrieval device according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a content retrieval method, a content retrieval device and a computer-readable storage medium. The content retrieval device may be integrated into an electronic device, and the electronic device may be a server or a terminal.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN) and other network acceleration services, big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
For example, referring to fig. 1, taking the content retrieval device being integrated in the electronic device as an example, after the electronic device obtains the content to be retrieved for retrieving the target content, when the content to be retrieved is video content, the electronic device performs multi-modal feature extraction on the video content to obtain the modal features of each modality, performs feature extraction on the modal features of each modality to obtain the modal content features of each modality, fuses the modal content features to obtain the video features of the video content, and retrieves the target text content corresponding to the video content from a preset content set according to the video features, thereby improving the accuracy of content retrieval.
It should be noted that the content retrieval method provided in the embodiment of the present application relates to a computer vision technology in the field of artificial intelligence, that is, in the embodiment of the present application, feature extraction may be performed on text content and video content by using the computer vision technology of artificial intelligence, and target content is screened from a preset content set based on the extracted features.
So-called Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
Computer Vision (CV) technology is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to perform machine vision tasks such as recognition, tracking and measurement on targets, and further performs image processing so that the processed images are more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation and other technologies, and also includes common biometric identification technologies such as face recognition and fingerprint recognition.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiment will be described from the perspective of a content retrieval device, which may be specifically integrated in an electronic device, where the electronic device may be a server or a terminal; the terminal may include a tablet Computer, a notebook Computer, a Personal Computer (PC), a wearable device, a virtual reality device, or other intelligent devices capable of content retrieval.
A content retrieval method, comprising:
the method comprises the steps of obtaining content to be retrieved for retrieving target content, when the content to be retrieved is video content, conducting multi-mode feature extraction on the video content to obtain modal features of each mode, conducting feature extraction on the modal features of each mode respectively to obtain modal content features of each mode, fusing the modal content features to obtain video features of the video content, and retrieving target text content corresponding to the video content in a preset content set according to the video features.
As shown in fig. 2, the specific flow of the content retrieval method is as follows:
101. and acquiring the content to be retrieved for retrieving the target content.
The content to be retrieved may be understood as content in a retrieval condition for retrieving the target content, and the content type of the content to be retrieved may be various, for example, text content or video content.
The method for acquiring the content to be retrieved may be various, and specifically may be as follows:
for example, the content to be retrieved sent by the user through the terminal may be directly received, or the content to be retrieved may be obtained from a network or a third-party database; or, when the content to be retrieved is large in size or number, a content retrieval request may be received, where the content retrieval request carries the storage address of the content to be retrieved, and the content to be retrieved is obtained from a memory, a cache or a third-party database according to the storage address.
102. And when the content to be retrieved is video content, performing multi-modal feature extraction on the video content to obtain modal features corresponding to multiple modalities.
The modal features may be understood as the feature information corresponding to each modality in the video content. The video content may include multiple modalities, for example, action (motion), audio, scene, face, OCR text and/or entity modalities, and the like.
The multi-modal feature extraction method for video content may be various, and specifically may be as follows:
for example, a trained content retrieval model is used for performing multi-modal feature extraction on video content to obtain initial modal features of each modality in the video content, video frames are extracted from the video content, the trained content retrieval model is used for performing multi-modal feature extraction on the video frames to obtain basic modal features of each video frame, target modal features corresponding to each modality are screened out from the basic modal features, and the target modal features and the corresponding initial modal features are fused to obtain modal features of each modality.
The video content and the video frames each include multiple modalities, and different feature extraction methods can be adopted for different modalities when performing multi-modal feature extraction on the video content and the video frames. For example, for the action modality, feature extraction may be performed using an S3D model (a motion recognition model) pre-trained on a motion recognition data set; for the audio modality, feature extraction may be performed using a pre-trained VGGish model (an audio extraction model); for the scene modality, feature extraction may be performed using a pre-trained DenseNet-161 model (a depth model); for the face modality, feature extraction may be performed using a pre-trained SSD model and a ResNet50 model; for the OCR modality, feature extraction may be performed using the Google API; and for the entity modality, feature extraction may be performed using a pre-trained SENet-154 model (a feature extraction network). The extracted initial modal features and basic modal features may include image features, expert features, time features, and the like.
The target modality features and the corresponding initial modality features are fused, and there are various fusion modes, for example, the image features (F), the expert features (E), and the time features (T) in the target modality features and the initial modality features may be added to obtain the modality features (Ω) of each modality, which may be specifically shown in fig. 3. Or, a weighting coefficient of the target modal characteristics and the initial modal characteristics can be obtained, the target modal characteristics and the initial modal characteristics are weighted according to the weighting coefficient, and the weighted target modal characteristics and the weighted initial modal characteristics are fused to obtain modal characteristics of each mode.
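As an illustration of the additive fusion described above, the following is a minimal Python sketch that builds the modality feature (Ω) of one modality by adding the image features (F), the expert features (E) and the time features (T); the array shapes and the helper name build_modality_feature are illustrative assumptions, not part of this description.

```python
import numpy as np

def build_modality_feature(image_feat, expert_feat, time_feat):
    """Additively fuse image (F), expert (E) and time (T) features of one
    modality into the modality feature (Omega), as described above."""
    return image_feat + expert_feat + time_feat

# illustrative shapes: one feature vector per sampled frame/segment of one modality
num_segments, d = 16, 128
F = np.random.randn(num_segments, d)   # image features
E = np.random.randn(num_segments, d)   # expert features from the modality-specific extractor
T = np.random.randn(num_segments, d)   # time (position) features
omega = build_modality_feature(F, E, T)  # (num_segments, d) modality features
```

A weighted variant, as mentioned above, would simply scale F, E and T with weighting coefficients before the addition.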
The trained content retrieval model may be set according to requirements of practical applications, and it should be noted that the trained content retrieval model may be set in advance by a maintenance person, or may be trained by a content retrieval device, that is, before the step of performing multi-modal feature extraction on video content by using the trained content retrieval model to obtain an initial modal feature of each modality in the video content, the content retrieval method may further include:
obtaining a content sample set, wherein the content sample set comprises a video sample and a text sample, and the text sample comprises at least one text word; performing multi-modal feature extraction on the video sample by adopting a preset content retrieval model to obtain the sample modal features of each modality; performing feature extraction on the sample modal features of each modality respectively to obtain the sample modal content features of the video sample, and fusing the sample modal content features to obtain the sample video features of the video sample; performing feature extraction on the text sample to obtain the sample text features and the text word features corresponding to each text word; and converging the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain the trained content retrieval model, which specifically comprises the following steps:
and S1, acquiring a content sample set.
Wherein the set of content samples comprises a video sample and a text sample, the text sample comprising at least one text word.
The content sample set may be obtained in various ways, which may specifically be as follows:
for example, the video sample and the text sample may be directly obtained, resulting in a set of content samples, or the original video content and the original text content may be obtained, then, the original video content and the original text content are sent to an annotation server, a matching label between the original video content and the original text content returned by the annotation server is received, the matching label is added to the original video content and the original text content, thereby obtaining a video sample and a text sample, combining the video sample and the text sample to obtain a content sample set, or, when the number of content samples in the content sample set is large or the memory is large, the model training request may be received, the model training request carries a storage address of the content sample set, and the content sample set is obtained in a memory, a cache or a third-party database according to the storage address.
And S2, performing multi-modal feature extraction on the video sample by adopting a preset content retrieval model to obtain the modal features of the sample of each mode.
For example, a preset content retrieval model is used to perform multi-modal feature extraction on a video sample to obtain the initial sample modal features of each modality in the video sample, video frames are extracted from the video sample, the preset content retrieval model is used to perform multi-modal feature extraction on the video frames to obtain the basic sample modal features of each video frame, the target sample modal features corresponding to each modality are screened out from the basic sample modal features, and the target sample modal features and the corresponding initial sample modal features are fused to obtain the sample modal features of each modality.
And S3, respectively extracting the characteristics of the sample modal characteristics of each mode to obtain the sample modal content characteristics of the video sample, and fusing the sample modal content characteristics to obtain the sample video characteristics of the video sample.
For example, according to the modality of the sample modal features, a target video feature extraction network corresponding to each modality is identified among the video feature extraction networks of the preset content retrieval model, and feature extraction is performed on the sample modal features by adopting the target video feature extraction network to obtain the sample modal content features corresponding to each modality. The sample modal content features are then fused to obtain the sample video features of the video sample.
The modality of the video feature extraction network of the preset content retrieval model is fixed, so that the video feature extraction network corresponding to the modality can be identified only according to the modality of the sample modality feature, and the identified video feature extraction network is used as a target video feature extraction network.
After the target video feature extraction network is identified, the target video feature extraction network may be used to perform feature extraction on the sample modal features, and the feature extraction process may take various forms. For example, the target video feature extraction network may be the encoder of a modality-specific Transformer (a Transformer network), which encodes the sample modal features so as to extract the sample modal content features of each modality.
After the sample modal content features are extracted, the sample modal content features may be fused, and the fusion process may be multiple, for example, the sample modal content features of each modality may be combined to obtain a sample modal content feature set of the video sample, the sample modal content feature set is input to a Transformer to be encoded to calculate the association weight of the sample modal content features, the sample modal content features are weighted according to the association weight, and the weighted sample modal content features are fused to obtain the sample video features of the video sample.
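The aggregation of the sample modal content features into a single sample video feature can be pictured with the sketch below, which stands in for the Transformer encoder with a simple attention-style weighting purely for illustration; the query vector and the function name fuse_modal_content_features are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modal_content_features(modal_feats, query):
    """Weight and fuse per-modality content features into one video feature.

    modal_feats: (num_modalities, d) one content feature per modality
    query:       (d,) learnable query standing in for the Transformer aggregation
    """
    weights = softmax(modal_feats @ query)             # association weight per modality
    return (weights[:, None] * modal_feats).sum(axis=0)  # weighted fusion -> (d,)

num_modalities, d = 6, 256
modal_content = np.random.randn(num_modalities, d)      # e.g. action/audio/scene/face/OCR/entity
video_feature = fuse_modal_content_features(modal_content, np.random.randn(d))
```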
And S4, extracting the characteristics of the text sample to obtain sample text characteristics and text word characteristics corresponding to each text word, and converging the preset content retrieval model according to the sample modal content characteristics, the sample video characteristics, the sample text characteristics and the text word characteristics to obtain the trained content retrieval model.
For example, feature extraction is performed on a text sample by using a text feature extraction network of a preset content retrieval model to obtain text features of the text sample and text word features of text words, and then the preset content retrieval model is converged according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain a trained content retrieval model.
For example, a text encoder may be used to perform feature extraction on the text sample to obtain the sample text features and the text word features. The type of the text encoder may be various and may include, for example, Bert (a text encoder) or word2vector (a word vector generation model).
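As one possible illustration of the text branch, the sketch below extracts a pooled sample text feature and per-word (token) features with a BERT encoder via the HuggingFace transformers library; the model name, the mean pooling and the function name encode_text are assumptions not fixed by the description above.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # model choice is illustrative
encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_text(text):
    """Return (sample text feature, per-token text word features)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    word_features = outputs.last_hidden_state[0]   # (num_tokens, 768) text word features
    text_feature = word_features.mean(dim=0)       # pooled sample text feature, (768,)
    return text_feature, word_features

text_feature, text_word_features = encode_text("a man is playing basketball outdoors")
```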
After the text features and the text word features are extracted, the preset content retrieval model can be converged according to the sample modal content features, the sample video features, the sample text features and the text word features, and the convergence mode can be various and specifically can be as follows:
for example, the feature loss information of the content sample set may be determined according to the sample modal content feature and the text word feature, the content loss information of the content sample set is determined based on the sample video feature and the sample text feature, the feature loss information and the content loss information are fused, and the preset content retrieval model is converged based on the fused loss information to obtain the trained content retrieval model, which may specifically be as follows:
(1) and determining the characteristic loss information of the content sample set according to the sample modal content characteristics and the text word characteristics.
For example, the feature similarity between the content features of the sample modality and the features of the text words can be calculated to obtain a first feature similarity, the sample similarity between the video sample and the text sample is determined according to the first feature similarity, and the feature distance between the video sample and the text sample is calculated based on the sample similarity to obtain the feature loss information of the content sample set.
For example, cosine similarity between the sample modal content feature and the text word feature may be calculated, and the cosine similarity is used as the first feature similarity, which may be specifically shown in formula (1):
$$S_{ij} = \frac{w_i^{\top} v_j}{\lVert w_i \rVert\,\lVert v_j \rVert} \qquad (1)$$

wherein S_{ij} is the first feature similarity, w_i is a text word feature, and v_j is a sample modal content feature.
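A minimal numerical sketch of formula (1), computing the first feature similarity matrix between all text word features and all sample modal content features; the shapes and names are illustrative.

```python
import numpy as np

def cosine_similarity_matrix(word_feats, modal_feats, eps=1e-8):
    """First feature similarity S_ij between every text word feature w_i and
    every sample modal content feature v_j (formula (1))."""
    w = word_feats / (np.linalg.norm(word_feats, axis=1, keepdims=True) + eps)
    v = modal_feats / (np.linalg.norm(modal_feats, axis=1, keepdims=True) + eps)
    return w @ v.T   # (num_words, num_modalities)

num_words, num_modalities, d = 8, 6, 256
S = cosine_similarity_matrix(np.random.randn(num_words, d), np.random.randn(num_modalities, d))
```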
After the first feature similarity is calculated, the sample similarity between the video sample and the text sample can be determined according to the first feature similarity, and the determination manner can be various, for example, feature interaction can be performed on the sample modal content features and the text word features according to the first feature similarity to obtain video features after interaction and text word features after interaction, feature similarity between the video features after interaction and the text word features after interaction is calculated to obtain second feature similarity, and the second feature similarity is fused to obtain the sample similarity between the video sample and the text sample.
For example, the first feature similarity may be standardized to obtain a target feature similarity, an association weight of the sample modal content features is determined according to the target feature similarity, the association weight is used to indicate an association relationship between the sample modal content features and the text word features, the sample modal content features are weighted based on the association weight, and the text word features are updated based on the weighted sample modal content features to obtain post-interaction video features and post-interaction text word features.
For example, the first feature similarity may be normalized by using an activation function, and the type of the activation function may be various, for example, relu(x) = max(0, x); the normalization process may be as shown in formula (2):

$$\hat{S}_{ij} = \frac{\mathrm{relu}(S_{ij})}{\sqrt{\sum_{i}\mathrm{relu}(S_{ij})^{2}}} \qquad (2)$$

wherein \hat{S}_{ij} is the target feature similarity, S_{ij} is the first feature similarity, and relu is the activation function.
For example, preset associated parameters may be obtained, and the associated parameters and the target feature similarity are fused to obtain associated weights, which may also be understood as attention weights, and specifically may be as shown in formula (3):
$$a_{ij} = \frac{\exp(\lambda \hat{S}_{ij})}{\sum_{j}\exp(\lambda \hat{S}_{ij})} \qquad (3)$$

wherein a_{ij} is the association weight, λ is a preset association parameter (which may be an inverse temperature parameter of the softmax), and \hat{S}_{ij} is the target feature similarity.
After determining the association weight of the sample modal content features, weighting the sample modal content features based on the association weight, and fusing the weighted sample modal content features to obtain weighted sample modal content features, where the weighted sample modal content features are used as initial post-interaction video features of the video sample, and may specifically refer to formula (4):
$$a_{i} = \sum_{j} a_{ij}\, v_{j} \qquad (4)$$

wherein a_i is the initial post-interaction video feature, a_{ij} is the association weight, and v_j is a sample modal content feature.
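The chain of formulas (2)-(4) can be sketched as follows; the inverse temperature value lam=4.0 and the normalization axes follow the reconstruction above and should be read as assumptions.

```python
import numpy as np

def cross_attention(S, modal_feats, lam=4.0, eps=1e-8):
    """Formulas (2)-(4): relu-normalize the first feature similarity, turn it
    into association (attention) weights with inverse temperature lambda, and
    compute the initial post-interaction video feature a_i for every word."""
    S_relu = np.maximum(S, 0.0)                                              # relu(S_ij)
    S_hat = S_relu / (np.sqrt((S_relu ** 2).sum(axis=0, keepdims=True)) + eps)  # formula (2)
    e = np.exp(lam * S_hat)
    a_weights = e / e.sum(axis=1, keepdims=True)                             # formula (3)
    attended = a_weights @ modal_feats                                       # formula (4)
    return attended, a_weights

num_words, num_modalities, d = 8, 6, 256
S = np.random.randn(num_words, num_modalities)
attended_video, attn = cross_attention(S, np.random.randn(num_modalities, d))
```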
After the initial post-interaction video features are calculated, the text word features can be updated based on the initial post-interaction video features to obtain post-interaction video features and post-interaction text word features, for example, the text word features can be updated based on the initial post-interaction video features to obtain the initial post-interaction text word features, feature similarity between the initial post-interaction video features and the initial post-interaction text word features is calculated to obtain third feature similarity, and the initial post-interaction video features and the initial post-interaction text word features are updated according to the third feature similarity to obtain post-interaction video features and post-interaction text word features.
The method for updating the text word features based on the video features after the initial interaction may be various, for example, a preset update parameter may be obtained, and the preset update parameter, the video features after the initial interaction, and the text word features are fused to obtain the text word features after the initial interaction, which may specifically be as shown in formula (5):
$$g_{i} = \sigma\big(F_{g}[w_{i};\,a_{i}] + b_{g}\big), \quad o_{i} = \tanh\big(F_{o}[w_{i};\,a_{i}] + b_{o}\big), \quad f(w_{i}, a_{i}) = g_{i} \odot o_{i} + (1 - g_{i}) \odot w_{i} \qquad (5)$$

wherein f(w_i, a_i) is the initial post-interaction text word feature, w_i is a text word feature, a_i is the initial post-interaction video feature, g_i is a gate operation for selecting the most useful information, o_i is a fusion feature for enhancing the interaction between the text word features and the initial post-interaction video features, and F_g, b_g, F_o and b_o are preset update parameters. For the multi-step operation (multiple feature interactions), the text word features are updated multiple times, so formula (5) can be integrated as F_a, and the formula for K rounds of feature interaction (cross attention operation) can then be obtained, as shown in formula (6):

$$T_{w}^{k} = F_{a}\big(T_{w}^{k-1},\, A^{k-1}\big), \qquad A^{k} = \mathrm{Att}\big(T_{w}^{k},\, V_{rep}\big) \qquad (6)$$

wherein k denotes the k-th round of interaction, T_w^{k} and T_w^{k-1} respectively denote the text word features after the k-th and (k-1)-th rounds, A^{k} denotes the post-interaction video features of the k-th round, and V_{rep} denotes the sample modal content features.
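A sketch of the gated update of formula (5); the exact gating form (sigmoid gate plus tanh fusion over the concatenated pair) is reconstructed from the symbols F_g, b_g, F_o and b_o named above and should be read as an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(word_feats, attended, Fg, bg, Fo, bo):
    """One application of formula (5): update the text word features with the
    initial post-interaction video features via a gate g_i and a fusion term o_i."""
    x = np.concatenate([word_feats, attended], axis=1)  # pair each word with its attended video feature
    g = sigmoid(x @ Fg + bg)                            # gate: selects the most useful information
    o = np.tanh(x @ Fo + bo)                            # fusion feature o_i
    return g * o + (1.0 - g) * word_feats               # updated (post-interaction) text word features

num_words, d = 8, 256
words = np.random.randn(num_words, d)
attended = np.random.randn(num_words, d)
Fg, bg = np.random.randn(2 * d, d), np.zeros(d)
Fo, bo = np.random.randn(2 * d, d), np.zeros(d)
updated_words = gated_update(words, attended, Fg, bg, Fo, bo)
```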
For example, according to the third feature similarity, feature interaction can be performed on the initially interacted video features and the initially interacted text word features to obtain target interacted video features and target interacted text word features, the target interacted video features are used as the initially interacted video features, the target interacted text word features are used as the initially interacted text word features, the step of calculating the feature similarity between the initially interacted video features and the initially interacted text word features is returned until the number of feature interactions between the initially interacted video features and the initially interacted text word features reaches a preset number, and the interacted video features and the interacted text word features are obtained.
The feature interaction process can be regarded as a multi-step cross attention calculation, through which the post-interaction video features and the post-interaction text word features are obtained. The number of feature interactions may be set according to the requirements of the actual application.
After the video features and the text word features after interaction are obtained, the sample similarity between the video sample and the text sample can be calculated, and the calculation modes can be various, for example, the feature similarity between the video features after interaction and the text word features after interaction can be calculated to obtain a second feature similarity, and the second feature similarity is fused to obtain the sample similarity between the video sample and the text sample, as shown in formula (7):
$$S(T_{w}, V_{rep}) = \sum_{k=1}^{K} \frac{1}{N}\sum_{i=1}^{N} \frac{w_{ki}^{\top} a_{ki}}{\lVert w_{ki}\rVert\,\lVert a_{ki}\rVert} \qquad (7)$$

wherein S(T_w, V_{rep}) is the sample similarity, w_{ki} is a post-interaction text word feature of the k-th round, a_{ki} is a post-interaction video feature of the k-th round, and N is the number of text words.
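Putting the pieces together, the following sketch runs K rounds of feature interaction and fuses the per-round similarities into the sample similarity of formula (7); the update step F_a is stubbed out with a residual average purely for illustration, and all names are assumptions.

```python
import numpy as np

def sample_similarity(word_feats, modal_feats, K=2, eps=1e-8):
    """Formulas (6)-(7): K rounds of cross attention between text word features
    and sample modal content features, accumulating per-round cosine similarity."""
    def cos(a, b):
        return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps)

    total = 0.0
    w = word_feats
    for _ in range(K):
        S = w @ modal_feats.T
        S = S - S.max(axis=1, keepdims=True)         # numerical stability
        attn = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
        a = attn @ modal_feats                       # post-interaction video features of this round
        total += cos(w, a).mean()                    # accumulate per-round similarity
        w = 0.5 * (w + a)                            # placeholder for the gated update F_a (formula (5))
    return total

sim = sample_similarity(np.random.randn(8, 256), np.random.randn(6, 256), K=2)
```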
After the sample similarity is calculated, the feature distance between the video sample and the text sample may be calculated, so as to obtain the feature loss information of the content sample set, and the calculation manners may be various, for example, a preset feature boundary value corresponding to the content sample set may be obtained, a first content sample pair in which the video sample matches the text sample and a second content sample pair in which the video sample does not match the text sample are screened out from the content sample set according to the sample similarity, and the feature distance between the first content sample pair and the second content sample pair is calculated based on the preset feature boundary value, so as to obtain the feature loss information of the content sample set.
For example, the sample similarity may be compared with a preset similarity threshold, video samples and corresponding text samples, of which the sample similarity exceeds the preset similarity threshold, are screened from the content sample set, so that the first content sample pair may be obtained, and video samples and corresponding text samples, of which the sample similarity does not exceed the preset similarity threshold, are screened from the content sample set, so that the second content sample pair may be obtained.
After the first content sample pair and the second content sample pair are screened out, the characteristic distance between the first content sample pair and the second content sample pair may be calculated, and the calculation manner may be various, for example, a content sample pair with the largest sample similarity may be screened out in the second content sample pair to obtain a target content sample pair, a similarity difference between the sample similarity of the first content sample pair and the sample similarity of the target content sample pair is calculated to obtain a first similarity difference, and the preset characteristic boundary value and the first similarity difference are fused to obtain the characteristic loss information of the content sample set, as shown in formula (8):
$$L_{Tri} = \sum_{b=1}^{B}\Big(\big[\Delta - S(T_{w}^{b}, V_{rep}^{b}) + S(T_{w}^{b}, V_{rep}^{\hat{b}})\big]_{+} + \big[\Delta - S(T_{w}^{b}, V_{rep}^{b}) + S(T_{w}^{\hat{b}}, V_{rep}^{b})\big]_{+}\Big) \qquad (8)$$

wherein L_{Tri} is the feature loss information, Δ is the preset feature boundary value, B is the number of content sample pairs in the content sample set, b indicates that the video sample matches the text sample, \hat{b} denotes the difficult negative sample, namely the video sample or the text sample of the content sample pair with the largest sample similarity among the second content sample pairs, and [x]_{+} = max(x, 0). It can be seen that there may be two target content sample pairs; after the preset feature boundary value and the first similarity difference are fused, the fused similarity difference may be normalized, and the normalized similarity differences are fused, so that the feature loss information of the content sample set may be obtained.
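A sketch of the feature loss of formula (8) over a batch of sample similarities, where the hardest unmatched pair in each row and column plays the role of the difficult negative; the margin value is illustrative.

```python
import numpy as np

def triplet_loss_hard_negative(sim_matrix, margin=0.2):
    """Formula (8): hinge-based triplet loss with hard negatives. sim_matrix[b, d]
    is the sample similarity between video b and text d, so the diagonal holds the
    matched (first) content sample pairs; margin is the preset feature boundary value."""
    B = sim_matrix.shape[0]
    pos = np.diag(sim_matrix)                                   # similarities of matched pairs
    mask = np.eye(B, dtype=bool)
    neg_for_video = np.where(mask, -np.inf, sim_matrix).max(axis=1)  # hardest negative text per video
    neg_for_text = np.where(mask, -np.inf, sim_matrix).max(axis=0)   # hardest negative video per text
    loss = np.maximum(0.0, margin - pos + neg_for_video) + np.maximum(0.0, margin - pos + neg_for_text)
    return loss.sum()

print(triplet_loss_hard_negative(np.random.rand(4, 4)))
```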
The feature loss information can be regarded as the loss information obtained by performing back propagation and parameter updating with a triplet loss (a loss function), and is mainly used for shortening the distance between matched video and text in the feature space.
(2) Content loss information for the set of content samples is determined based on the sample video features and the sample text features.
For example, the feature similarity between the sample video features and the sample text features may be calculated to obtain the content similarity between the video samples and the text samples; a third content sample pair in which the video sample matches the text sample and a fourth content sample pair in which the video sample does not match the text sample are screened out from the content sample set according to the content similarity; a preset content boundary value corresponding to the content sample set is obtained; and a content difference between the third content sample pair and the fourth content sample pair is calculated according to the preset content boundary value to obtain the content loss information of the content sample set.
For example, a similarity difference between the content similarity of the third content sample pair and the content similarity of the fourth content sample pair may be calculated to obtain a second similarity difference, the second similarity difference is fused with a preset content boundary value to obtain a content difference between the third content sample pair and the fourth content sample pair, and the content difference is normalized to obtain the content loss information of the content sample set, as shown in formula (9):
$$L_{mar} = \frac{1}{B}\sum_{b=1}^{B}\sum_{d \neq b}\Big(\big[\theta + S_{h}(v_{b}, t_{d}) - S_{h}(v_{b}, t_{b})\big]_{+} + \big[\theta + S_{h}(v_{d}, t_{b}) - S_{h}(v_{b}, t_{b})\big]_{+}\Big) \qquad (9)$$

wherein L_{mar} is the content loss information, B is the number of content sample pairs in the content sample set, b indicates that the video sample matches the text sample, d denotes a content sample other than b in the content sample set, θ is the preset content boundary value, S_h is the content similarity, t_b is the sample text feature, and v_b is the sample video feature of the video sample matched with t_b. The content loss information can be regarded as the loss information obtained by back propagation and parameter updating through a bidirectional max-margin ranking loss.
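A corresponding sketch of the bidirectional max-margin ranking loss of formula (9) over a batch of content similarities; the margin value is illustrative.

```python
import numpy as np

def max_margin_ranking_loss(content_sim, margin=0.2):
    """Formula (9): content_sim[b, d] is the content similarity between sample
    video feature b and sample text feature d; the diagonal holds the matched
    (third) content sample pairs and every off-diagonal entry is an unmatched
    (fourth) pair. margin is the preset content boundary value."""
    B = content_sim.shape[0]
    pos = np.diag(content_sim)[:, None]                            # matched similarities
    off_diag = ~np.eye(B, dtype=bool)
    v2t = np.maximum(0.0, margin + content_sim - pos)[off_diag]    # video -> unmatched texts
    t2v = np.maximum(0.0, margin + content_sim.T - pos)[off_diag]  # text -> unmatched videos
    return (v2t.sum() + t2v.sum()) / B

print(max_margin_ranking_loss(np.random.rand(4, 4)))
```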
(3) And fusing the characteristic loss information and the content loss information, and converging the preset content retrieval model based on the fused loss information to obtain the trained content retrieval model.
For example, a preset balance parameter may be obtained, the preset balance parameter is fused with the feature loss information to obtain balanced feature loss information, and the balanced feature loss information is added to the content loss information to obtain fused loss information, as shown in formula (10):
$$L = L_{mar} + \beta \cdot L_{Tri} \qquad (10)$$

wherein L is the fused loss information, L_{mar} is the content loss information, L_{Tri} is the feature loss information, and β is a preset balance parameter used to balance the two loss functions in scale.
Optionally, weighting parameters of the feature loss information and the content loss information may also be obtained, the feature loss information and the content loss information are weighted based on the weighting parameters, and the weighted feature loss information and the weighted content loss information are fused to obtain the fused loss information.
after the fused loss information is obtained, the preset content retrieval model can be converged based on the fused loss information, and the convergence mode can be various, for example, the preset content retrieval model can be converged to obtain the trained content retrieval model by updating the network parameters in the preset content retrieval model by using a gradient descent algorithm according to the fused loss information, or the preset content retrieval model can be converged to obtain the trained content retrieval model by updating the network parameters in the preset content retrieval model by using the fused loss information by using other algorithms.
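A sketch of one convergence step on the fused loss of formula (10); the model interface (video_forward, text_forward, triplet_loss, ranking_loss) and the optimizer are assumptions used only to show where the fused loss and the gradient-descent parameter update enter.

```python
import torch

def train_step(model, optimizer, video_batch, text_batch, beta=0.5):
    """One convergence step on L = L_mar + beta * L_Tri (formula (10));
    beta is the preset balance parameter, the model interface is assumed."""
    modal_content_feats, sample_video_feats = model.video_forward(video_batch)
    text_word_feats, sample_text_feats = model.text_forward(text_batch)

    loss_tri = model.triplet_loss(modal_content_feats, text_word_feats)    # feature loss information
    loss_mar = model.ranking_loss(sample_video_feats, sample_text_feats)   # content loss information
    loss = loss_mar + beta * loss_tri                                      # fused loss information

    optimizer.zero_grad()
    loss.backward()      # back propagation
    optimizer.step()     # gradient-descent parameter update
    return loss.item()
```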
It should be noted that, in the training process of the content retrieval model, multi-step cross attention calculation and content similarity calculation are performed on the text sample and the video sample, and a triplet loss and a bidirectional max-margin ranking loss are respectively adopted for back propagation and parameter updating, so as to obtain the trained content retrieval model, which may be specifically shown in fig. 4.
103. And respectively extracting the characteristics of the modal characteristics corresponding to each modal to obtain the modal content characteristics corresponding to each modal.
Wherein, the modal content characteristics can be the overall characteristics of each modal content, and are used for indicating the content characteristics in the modal.
The mode of extracting the features of the mode features may be various, and specifically may be as follows:
for example, according to the modality of the modality features, a target video feature extraction network corresponding to each modality is identified in a video feature extraction network of the trained content retrieval model, and the target video feature extraction network is adopted to perform feature extraction on the modality features to obtain the modality content features corresponding to each modality.
The modality of the video feature extraction network of the trained content retrieval model is fixed, so that the video feature extraction network corresponding to the modality can be identified only according to the modality of the modality feature, and the identified video feature extraction network is used as a target video feature extraction network.
After the target video feature extraction network is identified, the target video feature extraction network may be used to perform feature extraction on the modal features, and the feature extraction process may take various forms. For example, the target video feature extraction network may be the encoder of a modality-specific Transformer, which encodes the modal features so as to extract the modal content features corresponding to each modality of the video content.
104. And fusing the modal content features to obtain the video features of the video content, and retrieving the target text content corresponding to the video content in the preset content set according to the video features.
The modal content features may be fused in various ways, which may specifically be as follows:
for example, the modal content features of each modality may be combined to obtain a modal content feature set of the video content, the modal content feature set is input to a transform model to be encoded to calculate the associated weight of the modal content features, the modal content features are weighted according to the associated weight, and the weighted modal content features are fused to obtain the video features of the video content, or a weighting parameter corresponding to each modality is obtained, the modal content features are weighted based on the weighting parameter, and the weighted modal content features are fused to obtain the video features of the video content, or the modal content features are directly spliced to obtain the video features of the video content.
After the video features of the video content are obtained, the target text content corresponding to the video content can be retrieved from the preset content set according to the video features, and the retrieval modes can be various, for example, feature similarities between the video features and the text features of the candidate text content in the preset content set can be respectively calculated, and the target text content corresponding to the video content is screened from the candidate text content according to the feature similarities.
For example, a text encoder may be used to perform feature extraction on the candidate text content to obtain the text features of the candidate text content. The text encoder may be of various types, for example Bert or word2vector; alternatively, the features of each text word in the candidate text content may be extracted, the association weights between the text words calculated, and the text word features weighted based on the association weights, thereby obtaining the text features of the candidate text content. The text features of the candidate text content in the preset content set may be extracted at various times. They may be extracted in real time, for example when the obtained content to be retrieved is video content; or the text features of the candidate text content in the preset content set may be extracted before the content to be retrieved is obtained, so that the feature similarity between the text features and the video features can be calculated off-line and the target text content corresponding to the video content can be screened out from the candidate text content more quickly.
For example, cosine similarity between the video features and the text features of the candidate text content may be calculated, so that feature similarity may be obtained, or feature distance between the video features and the text features of the candidate text content may be calculated, and feature similarity between the video features and the text features may be determined according to the feature distance.
After the feature similarity is calculated, the target text content corresponding to the video content can be screened out from the candidate text content according to the feature similarity, and the screening may be done in various ways. For example, candidate text content whose feature similarity exceeds a preset similarity threshold may be screened out from the candidate text content, the screened candidate text content is ranked, and the ranked candidate text content is taken as the target text content corresponding to the video content. Alternatively, the candidate text content may be ranked according to the feature similarity, and the target text content corresponding to the video content is screened out from the ranked candidate text content. One or more items of target text content may be screened out: when there is a single item of target text content, the candidate text content with the maximum feature similarity to the video features may be taken as the target text content; when there are multiple items, the TOP N candidate text contents with the highest feature similarity to the video features may be screened out from the ranked candidate text content as the target text content.
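For illustration, the cosine-similarity computation and top-N screening described above might look as follows (NumPy; the variable names and the default value of top_n are assumptions):

import numpy as np

def retrieve_texts(video_feature: np.ndarray, candidate_text_features: np.ndarray, top_n: int = 5) -> list:
    """Return the indices of the top-N candidate texts ranked by cosine similarity."""
    v = video_feature / np.linalg.norm(video_feature)
    t = candidate_text_features / np.linalg.norm(candidate_text_features, axis=1, keepdims=True)
    sims = t @ v                   # cosine similarity between the video and every candidate text
    order = np.argsort(-sims)      # candidates ranked from most to least similar
    return order[:top_n].tolist()

hits = retrieve_texts(np.random.rand(512), np.random.rand(1000, 512))  # 1000 candidate texts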
Optionally, when the content to be retrieved is a text content, feature extraction may be performed on the text content, and a target video content corresponding to the text content is retrieved from a preset content set according to the extracted text feature, which may specifically be as follows:
for example, when the content to be retrieved is text content, feature extraction is performed on the text content by adopting a text feature extraction network of the trained content retrieval model, so that text features of the text content are obtained. Respectively calculating the feature similarity between the text features and the video features of the candidate video contents in the preset content set, and screening target video contents corresponding to the text contents from the candidate video contents according to the feature similarity.
The text content may be extracted in various ways, for example, a text encoder may be used to extract overall features in the text content to obtain text features, the type of the text encoder may be various, for example, the text encoder may include Bert and word2vector, or features of each text word in the text content may be extracted, then, an association weight between each text word is calculated, and the text word features are weighted based on the association weights to obtain the text features.
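One possible way to obtain text features and per-word (token) features with a BERT-style encoder is sketched below; the Hugging Face transformers package, the checkpoint name and the mean pooling are assumptions for illustration (the application itself describes attention-based weighting of the word features):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_text(sentence: str) -> tuple:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    word_features = outputs.last_hidden_state  # per-token text word features
    text_feature = word_features.mean(dim=1)   # simple pooled sentence-level feature
    return text_feature.squeeze(0), word_features.squeeze(0)

text_feature, word_features = encode_text("a man is playing the guitar on stage")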
After the text features of the text content are extracted, feature similarity between the text features and the video features can be calculated, and various ways for calculating the feature similarity are available, for example, feature extraction can be performed on candidate video contents in a preset content set to obtain the video features of each candidate video content, and then cosine similarity between the text features and the video features is calculated, so that the feature similarity can be obtained.
For example, a trained content retrieval model can be used to perform multi-modal feature extraction on the candidate video content to obtain modal features corresponding to multiple modalities, feature extraction is performed on the modal features corresponding to each modality to obtain modal video features corresponding to each modality, and the modal video features are fused to obtain the video features of each candidate video content. The time for extracting the video features of the candidate video content may be various, for example, the video features of the candidate video content may be extracted in real time, for example, each time the content to be retrieved is acquired, the video features may be extracted from the candidate video content, or before the content to be retrieved is acquired, feature extraction may be performed on each candidate video content in a preset content set to extract the video features, so that feature similarity between the text features and the video features may be calculated off-line, and thus target video content corresponding to the text content may be screened out from the candidate video content more quickly.
For example, candidate video content whose feature similarity exceeds a preset similarity threshold may be screened out from the candidate video content, the screened candidate video content is ranked, and the ranked candidate video content is taken as the target video content corresponding to the text content. Alternatively, the candidate video content may be ranked according to the feature similarity, and the target video content corresponding to the text content is screened out from the ranked candidate video content. One or more items of target video content may be screened out: when there is a single item of target video content, the candidate video content with the maximum feature similarity to the text features may be taken as the target video content; when there are multiple items, the TOP N candidate video contents with the highest feature similarity to the text features may be screened out from the ranked candidate video content as the target video content. According to the scheme, the multi-modal information in the video is better extracted and the more important words in the retrieval text receive more attention, which leads to better retrieval results. On the MSR-VTT, LSMDC and ActivityNet data sets, the content retrieval performance is greatly improved compared with current mainstream methods, and the results are shown in tables 1, 2 and 3. In the tables, R1, R5, R10 and R50 denote the recall at rank 1, 5, 10 and 50, respectively, and MdR and MnR denote the median rank and the mean rank.
TABLE 1 Results on the MSR-VTT dataset (table reproduced as an image in the original document)
TABLE 2 Results on the LSMDC dataset (table reproduced as an image in the original document)
TABLE 3 Results on the ActivityNet dataset (table reproduced as an image in the original document)
As can be seen from the above, in the embodiment of the application, after the content to be retrieved for retrieving the target content is obtained, when the content to be retrieved is the video content, performing multi-mode feature extraction on the video content to obtain the modal feature of each mode, performing feature extraction on the modal feature of each mode respectively to obtain the modal content feature of each mode, fusing the modal content features to obtain the video feature of the video content, and retrieving the target text content corresponding to the video content in the preset content set according to the video feature; according to the scheme, multi-modal feature extraction is firstly carried out on video content, and then modal video features are respectively extracted from the modal features corresponding to each mode, so that the accuracy of the modal video features in the video is improved, the modal video features are fused to obtain the video features of the video content, the extracted video features can better express information in the video, and therefore the accuracy of content retrieval can be improved.
The method described in the above examples is further illustrated in detail below by way of example.
In this embodiment, the content retrieval apparatus is integrated in an electronic device, and a server is taken as an example of the electronic device for description.
The server trains the content retrieval model:
C1, the server obtains the content sample set.
For example, the server may directly obtain the video samples and the text samples to obtain the content sample set. Alternatively, the server may obtain original video content and original text content, send them to an annotation server, receive the matching tags between the original video content and the original text content returned by the annotation server, and add the matching tags to the original video content and the original text content, thereby obtaining the video samples and the text samples, which are combined into the content sample set. Alternatively, when the content sample set contains many samples or occupies a large amount of storage, the server may receive a model training request carrying a storage address of the content sample set and obtain the content sample set from a memory, a cache or a third-party database according to the storage address.
And C2, the server adopts a preset content retrieval model to perform multi-modal feature extraction on the video sample to obtain the sample modal feature of each mode.
For example, the server performs multi-modal feature extraction on a video sample by using a preset content retrieval model to obtain initial sample modal features of each modality in the video sample, extracts a video frame from the video sample, performs multi-modal feature extraction on the video frame by using the preset content retrieval model to obtain basic sample modal features of each video frame, screens out target sample modal features corresponding to each modality from the basic sample modal features, and fuses the target sample modal features and the corresponding initial sample modal features to obtain mode sample modal features of each modality.
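A toy sketch of fusing the video-level (initial) and frame-level (basic) sample modal features of one modality; the mean pooling and concatenation here are assumptions, since the application does not fix a particular fusion operator:

import numpy as np

def fuse_sample_modal_features(initial: np.ndarray, frame_level: np.ndarray) -> np.ndarray:
    """initial: (dim,) video-level feature of one modality.
       frame_level: (num_frames, dim) per-frame features of the same modality."""
    pooled_frames = frame_level.mean(axis=0)          # aggregate the frame-level features
    return np.concatenate([initial, pooled_frames])   # fused sample modal feature

fused = fuse_sample_modal_features(np.random.rand(512), np.random.rand(16, 512))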
C3, the server performs feature extraction on the sample modal features of each modality respectively to obtain the sample modal content features of the video sample, and fuses the sample modal content features to obtain the sample video features of the video sample.
For example, according to the modality of the sample modal features, the server identifies the Transformer network corresponding to each modality among the video feature extraction networks of the preset content retrieval model as the target video feature extraction network, and encodes the sample modal features with the encoder of that Transformer network, thereby extracting the sample modal content features of each modality. The sample modal content features of the respective modalities are then combined to obtain the sample modal content feature set of the video sample; the sample modal content feature set is input into an overall Transformer network for encoding so as to calculate the association weights of the sample modal content features, the sample modal content features are weighted according to the association weights, and the weighted sample modal content features are fused to obtain the sample video features of the video sample.
And C4, the server extracts features of the text sample to obtain sample text features and text word features corresponding to each text word, and the preset content retrieval model is converged according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain the trained content retrieval model.
For example, the server may perform feature extraction on the text features of the text sample by using a text encoder such as a Bert or word2vector, so as to obtain the text features and text word features. Determining characteristic loss information of a content sample set according to sample modal content characteristics and text word characteristics, determining content loss information of the content sample set based on sample video characteristics and sample text characteristics, fusing the characteristic loss information and the content loss information, and converging a preset content retrieval model based on the fused loss information to obtain a trained content retrieval model, wherein the method specifically comprises the following steps:
(1) and the server determines the characteristic loss information of the content sample set according to the sample modal content characteristics and the text word characteristics.
For example, the server may calculate the cosine similarity between the sample modal content features and the text word features and use it as the first feature similarity, as specifically shown in formula (1). The first feature similarity is normalized with an activation function; various activation functions may be used, for example ReLU(x) = max(0, x), and the normalization may be as shown in formula (2), so as to obtain the normalized target feature similarity. A preset association parameter is obtained and fused with the target feature similarity to obtain the association weight, which can also be understood as an attention weight, as specifically shown in formula (3). The sample modal content features are weighted based on the association weights, and the weighted sample modal content features are fused to obtain the weighted modal video features, which are used as the video features of the video sample after the initial interaction, as specifically shown in formula (4).
After calculating the video features after the initial interaction, the server may obtain a preset update parameter and fuse the preset update parameter, the video features after the initial interaction and the text word features to obtain the text word features after the initial interaction, as specifically shown in formula (5). The feature similarity between the video features after the initial interaction and the text word features after the initial interaction is then calculated to obtain a third feature similarity, and feature interaction is performed on them according to the third feature similarity to obtain the video features after the target interaction and the text word features after the target interaction. The video features after the target interaction are then taken as the video features after the initial interaction, the text word features after the target interaction are taken as the text word features after the initial interaction, and the step of calculating the feature similarity between them is repeated until the number of feature interactions reaches a preset number, thereby obtaining the video features after the interaction and the text word features after the interaction.
After the server obtains the video features after the interaction and the text word features after the interaction, the server can calculate the feature similarity between the video features after the interaction and the text word features after the interaction to obtain a second feature similarity, and the second feature similarity is fused to obtain the sample similarity between the video sample and the text sample, as shown in formula (7). And comparing the sample similarity with a preset similarity threshold, screening out video samples and corresponding text samples of which the sample similarity exceeds the preset similarity threshold from the content sample set so as to obtain a first content sample pair, and screening out video samples and corresponding text samples of which the sample similarity does not exceed the preset similarity threshold from the content sample set so as to obtain a second content sample pair. Acquiring a preset characteristic boundary value corresponding to the content sample set, screening out a content sample pair with the maximum sample similarity in the second content sample pair to obtain a target content sample pair, calculating a similarity difference value between the sample similarity of the first content sample pair and the sample similarity of the target content sample pair to obtain a first similarity difference value, and fusing the preset characteristic boundary value and the first similarity difference value to obtain characteristic loss information of the content sample set, as shown in formula (8).
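The feature loss described above behaves like a triplet loss with hard-negative mining; a hedged sketch follows (the reduction over the batch and the margin value are assumptions):

import torch

def triplet_feature_loss(pos_sim: torch.Tensor, neg_sims: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """pos_sim: (batch,) sample similarities of matched video/text pairs.
       neg_sims: (batch, num_negatives) sample similarities of unmatched pairs."""
    hardest_neg = neg_sims.max(dim=1).values  # the unmatched pair with the largest similarity
    # penalize when a matched pair is not ahead of its hardest negative by the margin
    return torch.clamp(margin + hardest_neg - pos_sim, min=0).mean()

loss = triplet_feature_loss(torch.rand(8), torch.rand(8, 31))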
(2) The server determines content loss information for the set of content samples based on the sample video features and the sample text features.
For example, the server may calculate the feature similarity between the sample video features and the sample text features to obtain the content similarity between the video samples and the text samples, screen out from the content sample set, according to the content similarity, third content sample pairs in which the video sample and the text sample are matched and fourth content sample pairs in which the video sample and the text sample are not matched, and obtain the preset content boundary value corresponding to the content sample set. A similarity difference between the content similarity of the third content sample pair and the content similarity of the fourth content sample pair is then calculated to obtain a second similarity difference, the second similarity difference is fused with the preset content boundary value to obtain the content difference between the third content sample pair and the fourth content sample pair, and the content difference is normalized to obtain the content loss information of the content sample set, as shown in formula (9).
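For the content loss, a bidirectional max-margin ranking loss over a batch similarity matrix is a common formulation; the sketch below assumes the diagonal entries are the matched (third) pairs and the off-diagonal entries the unmatched (fourth) pairs, which is an illustrative simplification:

import torch

def bidirectional_max_margin_loss(sim_matrix: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """sim_matrix[i, j]: content similarity between video i and text j; diagonal = matched pairs."""
    batch = sim_matrix.size(0)
    pos = sim_matrix.diag().view(batch, 1)
    cost_v2t = torch.clamp(margin + sim_matrix - pos, min=0)      # video-to-text direction
    cost_t2v = torch.clamp(margin + sim_matrix - pos.t(), min=0)  # text-to-video direction
    mask = torch.eye(batch, dtype=torch.bool, device=sim_matrix.device)
    cost_v2t = cost_v2t.masked_fill(mask, 0)
    cost_t2v = cost_t2v.masked_fill(mask, 0)
    return (cost_v2t.sum() + cost_t2v.sum()) / batch

loss = bidirectional_max_margin_loss(torch.rand(8, 8))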
(3) And the server fuses the characteristic loss information and the content loss information, and converges the preset content retrieval model based on the fused loss information to obtain the trained content retrieval model.
For example, the server may obtain a preset balance parameter, fuse the preset balance parameter with the feature loss information to obtain the balanced feature loss information, and add the balanced feature loss information to the content loss information to obtain the fused loss information, as shown in formula (10). Then, according to the fused loss information, the network parameters of the preset content retrieval model are updated with a gradient descent algorithm so as to converge the preset content retrieval model and obtain the trained content retrieval model; alternatively, another optimization algorithm may be used to update the network parameters of the preset content retrieval model with the fused loss information so as to converge the preset content retrieval model and obtain the trained content retrieval model.
As shown in fig. 5, a content retrieval method specifically includes the following processes:
201. the server acquires content to be retrieved for retrieving the target content.
For example, the server may directly receive the content to be retrieved sent by the user through the terminal, or may obtain the content to be retrieved from the network or the third-party database, or, when the memory of the content to be retrieved is large or large in number, receive a content retrieval request, where the content retrieval request carries a storage address of the content to be retrieved, and obtain the content to be retrieved in the memory, the cache, or the third-party database according to the storage address.
202. And when the content to be retrieved is video content, the server performs multi-modal feature extraction on the video content to obtain modal features corresponding to multiple modalities.
For example, when the content to be retrieved is video content, the server performs multi-modal feature extraction on the video content by using the trained content retrieval model to obtain initial modal features of each modality in the video content, extracts video frames from the video content, performs multi-modal feature extraction on the video frames by using the trained content retrieval model to obtain basic modal features of each video frame, screens out target modal features corresponding to each modality from the basic modal features, and fuses the target modal features and the corresponding initial modal features to obtain modal features of each modality.
The video content and the video frames in the video content can include multiple modalities. For example, for the action modality, an S3D model pre-trained on an action recognition data set can be used for feature extraction; for the audio modality, a pre-trained VGGish model can be used; for the scene modality, a pre-trained DenseNet-161 model can be used; for the face modality, a pre-trained SSD model and a ResNet50 model can be used; for the speech modality, a Google API can be used; and for the entity modality, a pre-trained SENet-154 model can be used. The extracted initial modal features and basic modal features may include image features, expert features, time features, and the like.
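Purely as an organizational sketch (the extractor callables below are placeholders standing in for the pre-trained expert models listed above; no specific library API is implied):

from typing import Callable, Dict
import numpy as np

ExpertFn = Callable[[object], np.ndarray]  # maps a video (or its frames/audio) to modal features

def extract_multimodal_features(video, experts: Dict[str, ExpertFn]) -> Dict[str, np.ndarray]:
    """Run every per-modality expert on the same video and collect its modal features."""
    return {modality: expert(video) for modality, expert in experts.items()}

# assumed wiring; in practice each entry wraps one of the pre-trained expert models
experts = {
    "action": lambda v: np.zeros((1, 1024)),  # placeholder for an S3D-based extractor
    "audio":  lambda v: np.zeros((1, 128)),   # placeholder for a VGGish-based extractor
    "scene":  lambda v: np.zeros((1, 2208)),  # placeholder for a DenseNet-161-based extractor
}
features = extract_multimodal_features(object(), experts)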
203. And the server respectively extracts the characteristics of the modal characteristics corresponding to each modality to obtain the modal content characteristics corresponding to each modality.
For example, according to the modality of the modality features, a Transformer network corresponding to each modality is identified in the video feature extraction network of the trained content retrieval model as a target video feature extraction network, and the modality features are encoded by using a modality-specific Transformer encoder, so as to extract the modality content features corresponding to each modality.
204. And the server fuses the modal content features to obtain the video features of the video content.
For example, the server may combine the modal content features of each modality to obtain the modal content feature set of the video content, input the modal content feature set into a Transformer model for encoding so as to calculate the association weights of the modal content features, weight the modal content features according to the association weights, and fuse the weighted modal content features to obtain the video features of the video content. Alternatively, a weighting parameter corresponding to each modality may be obtained, the modal content features weighted based on the weighting parameters, and the weighted modal content features fused to obtain the video features of the video content. Alternatively, the modal content features may be directly spliced to obtain the video features of the video content.
205. And the server retrieves the target text content corresponding to the video content from the preset content set according to the video characteristics.
For example, the server may extract features of the candidate text content by using a text encoder such as a Bert or word2vector, to obtain text features of the candidate text content, or may extract features of each text word in the candidate text content, then calculate an association weight between each text word, and weight the text word features based on the association weight, to obtain the text features of the candidate text content.
The server calculates cosine similarity between the video features and the text features of the candidate text contents, so that feature similarity can be obtained, or characteristic distance between the video features and the text features of the candidate text contents can be calculated, and the feature similarity between the video features and the text features is determined according to the characteristic distance.
The server screens candidate text contents with the characteristic similarity exceeding a preset similarity threshold from the candidate text contents, and the screened candidate text contents are ranked, and the ranked candidate text contents are used as target text contents corresponding to the video contents, or according to the feature similarity, the candidate text contents are sorted, the target text contents corresponding to the video contents are screened out from the sorted candidate text contents, one or more target text contents can be screened out, when the number of the target text contents is one, the candidate text contents with the maximum feature similarity with the video features can be used as the target text contents, when the number of the target text contents is multiple, TOP N candidate text contents ranked at the TOP with the feature similarity of the video features can be screened out from the sorted candidate text contents as the target text contents.
The time for extracting the text features of the candidate text contents in the preset content set may be various, for example, the time may be real-time extraction, for example, when the obtained content to be retrieved is video content, the text features of the candidate text contents may be obtained by extracting the text features of the candidate text contents, or before the content to be retrieved is obtained, the text features of the candidate text contents in the preset content set may be obtained by extracting the text features of the candidate text contents, so that feature similarity between the text features and the video features may be calculated off-line, and the target text content corresponding to the video content may be screened out from the candidate text contents more quickly.
206. And when the content to be retrieved is text content, the server extracts the characteristics of the text content, and retrieves the target video content corresponding to the text content in a preset content set according to the extracted text characteristics.
For example, when the content to be retrieved is text content, the server may extract the overall features in the text content by using a text encoder such as Bert or word2vector, so as to obtain the text features of the text content. And performing multi-modal feature extraction on the candidate video content by adopting the trained content retrieval model to obtain modal features corresponding to a plurality of modalities, performing feature extraction on the modal features corresponding to each modality respectively to obtain modal video features corresponding to each modality, and fusing the modal video features to obtain the video features of each candidate video content. Then, cosine similarity between the text feature and the video feature is calculated, so that feature similarity can be obtained. Screening candidate video contents with the characteristic similarity exceeding a preset similarity threshold from the candidate video contents, and the screened candidate video contents are sequenced, and the sequenced candidate video contents are used as target video contents corresponding to the text contents, or according to the feature similarity, the candidate video contents are sequenced, the target video contents corresponding to the text contents are screened from the sequenced candidate video contents, one or more target video contents can be screened, when the number of the target video contents is one, the candidate video contents with the maximum feature similarity with the text features can be used as the target video contents, when the number of the target video contents is multiple, TOP N candidate video contents ranked at the TOP with the feature similarity of the text feature can be screened out from the sorted candidate video contents as the target video contents.
The time for extracting the video features of the candidate video content may be various, for example, the video features of the candidate video content may be extracted in real time, for example, each time the content to be retrieved is acquired, the video features may be extracted from the candidate video content, or before the content to be retrieved is acquired, feature extraction may be performed on each candidate video content in a preset content set to extract the video features, so that feature similarity between the text features and the video features may be calculated off-line, and thus target video content corresponding to the text content may be screened out from the candidate video content more quickly.
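A small sketch of the off-line pre-computation idea described above: the candidate video features are extracted and normalized once, and each text query then only needs one matrix product (the class and variable names are assumptions):

import numpy as np

class OfflineVideoIndex:
    """Pre-computes candidate video features so query-time work is a single matrix product."""
    def __init__(self, candidate_video_features: np.ndarray):
        norms = np.linalg.norm(candidate_video_features, axis=1, keepdims=True)
        self.index = candidate_video_features / norms  # built before any query arrives

    def search(self, text_feature: np.ndarray, top_n: int = 10) -> np.ndarray:
        q = text_feature / np.linalg.norm(text_feature)
        sims = self.index @ q                          # cosine similarities, no re-extraction
        return np.argsort(-sims)[:top_n]

index = OfflineVideoIndex(np.random.rand(10000, 512))  # 10,000 candidate videos, indexed off-line
top_videos = index.search(np.random.rand(512))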
As can be seen from the above, after the server acquires the content to be retrieved for retrieving the target content, when the content to be retrieved is the video content, performing multi-mode feature extraction on the video content to obtain a modal feature of each mode, performing feature extraction on the modal feature of each mode respectively to obtain a modal content feature of each mode, fusing the modal content features to obtain a video feature of the video content, retrieving the target text content corresponding to the video content in the preset content set according to the video feature, when the content to be retrieved is the text content, performing feature extraction on the text content, and retrieving the target video content corresponding to the text content in the preset content set according to the extracted text feature; according to the scheme, multi-mode feature extraction is firstly carried out on video content, and then modal video features are respectively extracted from the modal features corresponding to each mode, so that the accuracy of the modal video features in the video is improved, the modal video features are fused to obtain the video features of the video content, the extracted video features can better express information in the video, bidirectional retrieval of texts and the video is realized, and therefore the accuracy of content retrieval can be improved.
In order to better implement the above method, the embodiment of the present invention further provides a content retrieval apparatus, which may be integrated in an electronic device, such as a server or a terminal, and the terminal may include a tablet computer, a notebook computer, and/or a personal computer.
For example, as shown in fig. 6, the content retrieval apparatus may include an acquisition unit 301, a first extraction unit 302, a second extraction unit 303, and a text retrieval unit 304 as follows:
(1) an acquisition unit 301;
an obtaining unit 301, configured to obtain content to be retrieved for retrieving target content.
For example, the obtaining unit 301 may be specifically configured to receive a content to be retrieved, which is sent by a user through a terminal, or may obtain the content to be retrieved from a network or a third-party database, or receive a content retrieval request when a memory of the content to be retrieved is large or large in number, where the content retrieval request carries a storage address of the content to be retrieved, and obtain the content to be retrieved from the memory, a cache, or the third-party database according to the storage address.
(2) A first extraction unit 302;
the first extraction unit 302 is configured to, when the content to be retrieved is video content, perform multi-modal feature extraction on the video content to obtain a modal feature of each modality.
For example, the first extraction unit 302 may be specifically configured to, when the content to be retrieved is video content, perform multi-modal feature extraction on the video content by using a trained content retrieval model to obtain an initial modal feature of each modality in the video content, extract a video frame in the video content, perform multi-modal feature extraction on the video frame by using the trained content retrieval model to obtain a basic modal feature of each video frame, screen out a target modal feature corresponding to each modality from the basic modal feature, and fuse the target modal feature and the corresponding initial modal feature to obtain a modal feature of the video content of each modality.
(3) A second extraction unit 303;
the second extracting unit 303 is configured to perform feature extraction on the modal features of each modality respectively to obtain the modal content features of each modality.
For example, the second extracting unit 303 may be specifically configured to identify, according to the modality of the modality feature, a target video feature extraction network corresponding to each modality in the video feature extraction networks of the trained content retrieval model, and perform feature extraction on the modality feature by using the target video feature extraction network to obtain the modality content feature of each modality.
(4) A text retrieval unit 304;
and the text retrieval unit 304 is configured to fuse the modal content features to obtain video features of the video content, and retrieve target text content corresponding to the video content from a preset content set according to the video features.
For example, the text retrieval unit 304 may be specifically configured to combine the modal content features of each modality to obtain a modal content feature set of the video content, input the modal content feature set into a Transformer model for encoding to calculate the association weights of the modal content features, weight the modal content features according to the association weights, fuse the weighted modal content features to obtain the video features of the video content, calculate the feature similarities between the video features and the text features of the candidate text contents in the preset content set, and screen out the target text contents corresponding to the video content from the candidate text contents according to the feature similarities.
Optionally, the content retrieval apparatus may further include a training unit 305, as shown in fig. 7, which may specifically be as follows:
the training unit 305 is configured to train a preset content retrieval model to obtain a trained content retrieval model.
For example, the training unit 305 may be specifically configured to: obtain a content sample set that includes video samples and text samples, each text sample comprising at least one text word; perform multi-modal feature extraction on the video sample with a preset content retrieval model to obtain the sample modal features of each modality; perform feature extraction on the sample modal features of each modality respectively to obtain the sample modal content features of the video sample, and fuse the sample modal content features to obtain the sample video features of the video sample; perform feature extraction on the text sample to obtain the sample text features and the text word features corresponding to each text word; and converge the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain the trained content retrieval model.
Optionally, the content retrieval apparatus may further include a video retrieval unit 306, as shown in fig. 8, which may specifically be as follows:
the video retrieval unit 306 is configured to, when the content to be retrieved is a text content, perform feature extraction on the text content, and retrieve a target video content corresponding to the text content from a preset content set according to the extracted text feature.
For example, the video retrieval unit 306 may be specifically configured to, when the content to be retrieved is text content, perform feature extraction on the text content by using a text feature extraction network of a trained content retrieval model to obtain text features of the text content. Respectively calculating the feature similarity between the text features and the video features of the candidate video contents in the preset content set, and screening target video contents corresponding to the text contents from the candidate video contents according to the feature similarity.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, in this embodiment, after the obtaining unit 301 obtains the content to be retrieved for retrieving the target content, when the content to be retrieved is the video content, the first extracting unit 302 performs multi-modal feature extraction on the video content to obtain a modal feature of each modality, the second extracting unit 303 performs feature extraction on the modal feature of each modality to obtain a modal content feature of each modality, the text retrieving unit 304 fuses the modal content features to obtain a video feature of the video content, and retrieves the target text content corresponding to the video content in the preset content set according to the video feature; according to the scheme, multi-modal feature extraction is firstly carried out on video content, and then modal video features are respectively extracted from the modal features corresponding to each mode, so that the accuracy of the modal video features in the video is improved, the modal video features are fused to obtain the video features of the video content, the extracted video features can better express information in the video, and therefore the accuracy of content retrieval can be improved.
An embodiment of the present invention further provides an electronic device, as shown in fig. 9, which shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 9 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
the method comprises the steps of obtaining content to be retrieved for retrieving target content, when the content to be retrieved is video content, conducting multi-mode feature extraction on the video content to obtain modal features of each mode, conducting feature extraction on the modal features of each mode respectively to obtain modal content features of each mode, fusing the modal content features to obtain video features of the video content, and retrieving target text content corresponding to the video content in a preset content set according to the video features.
For example, the electronic device receives a content to be retrieved, which is sent by a user through a terminal, or may obtain the content to be retrieved from a network or a third-party database, or, when the memory of the content to be retrieved is large or large in number, receives a content retrieval request, where the content retrieval request carries a storage address of the content to be retrieved, and obtains the content to be retrieved from the memory, a cache, or the third-party database according to the storage address. When the content to be retrieved is video content, performing multi-modal feature extraction on the video content by adopting a trained content retrieval model to obtain initial modal features of each modal in the video content, extracting video frames from the video content, performing multi-modal feature extraction on the video frames by adopting the trained content retrieval model to obtain basic modal features of each video frame, screening out target modal features corresponding to each modal from the basic modal features, and fusing the target modal features with the corresponding initial modal features to obtain the modal features of each modal. According to the mode of the mode features, identifying a target video feature extraction network corresponding to each mode in a video feature extraction network of the trained content retrieval model, and performing feature extraction on the mode features by adopting the target video feature extraction network to obtain the mode content features corresponding to each mode. Combining modal content features of each modal to obtain a sample modal content feature set of the video content, inputting the modal content feature set into a transform model for encoding to calculate the associated weight of the modal content features, weighting the modal content features according to the associated weight, fusing the weighted modal content features to obtain the video features of the video content, respectively calculating the feature similarity between the video features and the text features of candidate text contents in a preset content set, and screening out target text contents corresponding to the video content from the candidate text contents according to the feature similarity.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, after the content to be retrieved for retrieving the target content is obtained, when the content to be retrieved is the video content, performing multi-mode feature extraction on the video content to obtain a modal feature of each mode, performing feature extraction on the modal feature of each mode respectively to obtain a modal content feature of each mode, fusing the modal content features to obtain a video feature of the video content, and retrieving the target text content corresponding to the video content in the preset content set according to the video feature; according to the scheme, multi-modal feature extraction is firstly carried out on video content, and then modal video features are respectively extracted from the modal features corresponding to each mode, so that the accuracy of the modal video features in the video is improved, the modal video features are fused to obtain the video features of the video content, the extracted video features can better express information in the video, and therefore the accuracy of content retrieval can be improved.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the embodiment of the present invention provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any content retrieval method provided by the embodiment of the present invention. For example, the instructions may perform the steps of:
the method comprises the steps of obtaining content to be retrieved for retrieving target content, when the content to be retrieved is video content, conducting multi-mode feature extraction on the video content to obtain modal features of each mode, conducting feature extraction on the modal features of each mode respectively to obtain modal content features of each mode, fusing the modal content features to obtain video features of the video content, and retrieving target text content corresponding to the video content in a preset content set according to the video features.
For example, a content to be retrieved sent by a user through a terminal is received, or the content to be retrieved can be obtained from a network or a third-party database, or when the memory of the content to be retrieved is large or the number of the content to be retrieved is large, a content retrieval request is received, the content retrieval request carries the storage address of the content to be retrieved, and the content to be retrieved is obtained from the memory, the cache or the third-party database according to the storage address. When the content to be retrieved is video content, performing multi-modal feature extraction on the video content by adopting a trained content retrieval model to obtain initial modal features of each modal in the video content, extracting video frames from the video content, performing multi-modal feature extraction on the video frames by adopting the trained content retrieval model to obtain basic modal features of each video frame, screening out target modal features corresponding to each modal from the basic modal features, and fusing the target modal features with the corresponding initial modal features to obtain the modal features of each modal. According to the mode of the mode features, identifying a target video feature extraction network corresponding to each mode in a video feature extraction network of the trained content retrieval model, and performing feature extraction on the mode features by adopting the target video feature extraction network to obtain the mode content features corresponding to each mode. Combining modal content features of each modal to obtain a sample modal content feature set of the video content, inputting the modal content feature set into a transform model for encoding to calculate the associated weight of the modal content features, weighting the modal content features according to the associated weight, fusing the weighted modal content features to obtain the video features of the video content, respectively calculating the feature similarity between the video features and the text features of candidate text contents in a preset content set, and screening out target text contents corresponding to the video content from the candidate text contents according to the feature similarity.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any content retrieval method provided by the embodiment of the present invention, the beneficial effects that can be achieved by any content retrieval method provided by the embodiment of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
According to an aspect of the application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the various alternative implementations of the content retrieval aspect or the video/text bidirectional retrieval aspect described above.
The content retrieval method, the content retrieval device and the computer-readable storage medium according to the embodiments of the present invention are described in detail, and the principles and embodiments of the present invention are described herein by applying specific embodiments, and the description of the embodiments is only used to help understanding the method and the core concept of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (17)

1. A method for retrieving content, comprising:
acquiring content to be retrieved for retrieving target content;
when the content to be retrieved is video content, performing multi-modal feature extraction on the video content to obtain modal features of each modality;
respectively extracting the characteristics of the modal characteristics of each mode to obtain the modal content characteristics of each mode;
and fusing the modal content features to obtain video features of the video content, and retrieving target text content corresponding to the video content in a preset content set according to the video features.
2. The content retrieval method according to claim 1, wherein the performing multi-modal feature extraction on the video content to obtain modal features of each modality comprises:
performing multi-modal feature extraction on the video content by adopting a trained content retrieval model to obtain initial modal features of each mode in the video content;
extracting video frames from the video content, and performing multi-mode feature extraction on the video frames by adopting the trained content retrieval model to obtain basic modal features of each video frame;
and screening out target modal characteristics corresponding to each mode from the basic modal characteristics, and fusing the target modal characteristics and the corresponding initial modal characteristics to obtain modal characteristics corresponding to the video content of each mode.
3. The content retrieval method according to claim 2, wherein the feature extraction of the modal features of each modality to obtain the modal video features of each modality comprises:
identifying a target video feature extraction network corresponding to each mode in the video feature extraction networks of the trained content retrieval model;
and performing feature extraction on the modal features by adopting the target video feature extraction network to obtain modal video features of each modal.
4. The content retrieval method according to claim 2, wherein before performing multi-modal feature extraction on the video content by using the trained content retrieval model to obtain the initial modal feature of each modality in the video content, the method further comprises:
obtaining a content sample set, wherein the content sample set comprises a video sample and a text sample, and the text sample comprises at least one text word;
performing multi-modal feature extraction on the video sample by adopting a preset content retrieval model to obtain sample modal features of each mode;
respectively extracting the characteristics of the sample modal characteristics of each mode to obtain the sample modal content characteristics of the video sample, and fusing the sample modal content characteristics to obtain the sample video characteristics of the video sample;
and performing feature extraction on the text sample to obtain sample text features and text word features corresponding to each text word, and converging the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain the trained content retrieval model.
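The training-side data flow of claim 4 might be organized as below. Every model method shown (extract_multimodal, extract_content, fuse, extract_text) is a hypothetical placeholder; how the four feature groups feed the loss is sketched in the notes after claims 5 to 14.

```python
def training_forward(model, video_sample, text_sample):
    # Sample modal features of each modality of the video sample.
    sample_modal = model.extract_multimodal(video_sample)
    # Per-modality feature extraction, then fusion into the sample video features.
    sample_modal_content = [model.extract_content(m, f) for m, f in enumerate(sample_modal)]
    sample_video_feat = model.fuse(sample_modal_content)
    # Text side: one sentence-level feature plus one feature per text word.
    sample_text_feat, text_word_feats = model.extract_text(text_sample)
    # These four outputs drive the fine-grained and coarse loss terms of claim 5.
    return sample_modal_content, sample_video_feat, sample_text_feat, text_word_feats
```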
5. The content retrieval method according to claim 4, wherein the converging the preset content retrieval model according to the sample modal content feature, the sample video feature, the sample text feature and the text word feature to obtain the trained content retrieval model comprises:
determining feature loss information of the content sample set according to the sample modal content features and the text word features;
determining content loss information for the set of content samples based on the sample video features and sample text features;
and fusing the characteristic loss information and the content loss information, and converging a preset content retrieval model based on the fused loss information to obtain a trained content retrieval model.
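Claim 5 leaves the fusion of the two loss terms open; a weighted sum, as in this short sketch, is one common choice, and the weights alpha and beta are assumptions rather than values taken from the claims.

```python
def fused_loss(feature_loss_value, content_loss_value, alpha=1.0, beta=1.0):
    # Weighted sum of the fine-grained feature loss and the coarse content loss.
    return alpha * feature_loss_value + beta * content_loss_value
```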
6. The method of claim 5, wherein determining feature loss information for the set of content samples based on the sample modal content features and text word features comprises:
calculating the feature similarity between the sample modal content features and the text word features to obtain first feature similarity;
determining the sample similarity between the video sample and the text sample according to the first feature similarity;
and calculating the feature distance between the video sample and the text sample based on the sample similarity to obtain the feature loss information of the content sample set.
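The first feature similarity of claim 6 can be read as a pairwise similarity matrix between the sample modal content features and the text word features; the cosine form below is an illustrative assumption.

```python
import numpy as np

def first_feature_similarity(modal_content_feats, text_word_feats):
    # Cosine similarity between every sample modal content feature (M, D)
    # and every text word feature (N, D); result is an (M, N) similarity matrix.
    V = modal_content_feats / (np.linalg.norm(modal_content_feats, axis=1, keepdims=True) + 1e-8)
    W = text_word_feats / (np.linalg.norm(text_word_feats, axis=1, keepdims=True) + 1e-8)
    return V @ W.T
```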
7. The method according to claim 6, wherein the determining the sample similarity between the video sample and the text sample according to the first feature similarity comprises:
performing feature interaction on the sample modal content features and text word features according to the first feature similarity to obtain video features and text word features after interaction;
calculating the feature similarity between the video features after the interaction and the text word features after the interaction to obtain a second feature similarity;
and fusing the second feature similarity to obtain the sample similarity between the video sample and the text sample.
8. The content retrieval method according to claim 7, wherein the performing feature interaction on the sample modal content features and text word features according to the first feature similarity to obtain post-interaction video features and post-interaction text word features comprises:
carrying out standardization processing on the first feature similarity to obtain a target feature similarity;
determining an association weight of the sample modal content features according to the target feature similarity, wherein the association weight is used for indicating an association relation between the sample modal content features and text word features;
and weighting the sample modal content features based on the associated weights, and updating the text word features based on the weighted sample modal content features to obtain the video features after interaction and the text word features after interaction.
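One plausible reading of claims 7 and 8 is a cross-attention style interaction. In the sketch below, the softmax normalization stands in for the "standardization processing" and the residual word update stands in for the unspecified update rule; both are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interact(modal_content_feats, text_word_feats):
    # First feature similarity between modal content features (M, D) and word features (N, D).
    sim = modal_content_feats @ text_word_feats.T                 # (M, N)
    # Normalized similarity as association weights between modal content and word features.
    assoc = softmax(sim, axis=0)                                  # (M, N)
    # Weight the modal content features by the association weights for each word ...
    video_after = assoc.T @ modal_content_feats                   # (N, D) video features after interaction
    # ... and update each text word feature with its attended video feature.
    words_after = text_word_feats + video_after                   # (N, D) word features after interaction
    return video_after, words_after
```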
9. The content retrieval method of claim 8, wherein the updating the text word features based on the weighted sample modal content features to obtain the post-interaction video features and post-interaction text word features comprises:
taking the weighted sample modal content features as initial interacted video features, and updating the text word features based on the initial interacted video features to obtain initial interacted text word features;
calculating the feature similarity of the video features after the initial interaction and the text word features after the initial interaction to obtain a third feature similarity;
and updating the video characteristics after the initial interaction and the text word characteristics after the initial interaction according to the third characteristic similarity to obtain the video characteristics after the interaction and the text word characteristics after the interaction.
10. The content retrieval method according to claim 9, wherein the updating the initial post-interaction video features and the initial post-interaction text word features according to the third feature similarity to obtain the post-interaction video features and the post-interaction text word features comprises:
performing feature interaction on the initially interacted video features and the initially interacted text word features according to the third feature similarity to obtain target interacted video features and target interacted text word features;
taking the video features after the target interaction as initial video features after the target interaction, and taking the text word features after the target interaction as initial text word features after the target interaction;
and returning to the step of calculating the feature similarity between the video features after the initial interaction and the text word features after the initial interaction until the number of feature interactions between the video features after the initial interaction and the text word features after the initial interaction reaches a preset number, so as to obtain the video features after the interaction and the text word features after the interaction.
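Claims 9 and 10 repeat that interaction a preset number of times. Reusing the interact() sketch given after claim 8 (itself an assumption about the interaction step), the loop might look as follows.

```python
def iterative_interaction(modal_content_feats, text_word_feats, preset_times=3):
    # First round of interaction (claims 7-8), then repeated rounds (claims 9-10)
    # until the preset number of feature interactions is reached.
    video_feats, word_feats = interact(modal_content_feats, text_word_feats)
    for _ in range(preset_times - 1):
        video_feats, word_feats = interact(video_feats, word_feats)
    return video_feats, word_feats
```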
11. The method of claim 6, wherein the calculating a feature distance between the video sample and a text sample based on the sample similarity to obtain feature loss information of the content sample set comprises:
acquiring a preset characteristic boundary value corresponding to the content sample set;
screening out a first content sample pair of which the video sample is matched with the text sample and a second content sample pair of which the video sample is not matched with the text sample from the content sample set according to the sample similarity;
and calculating the characteristic distance between the first content sample pair and the second content sample pair based on the preset characteristic boundary value to obtain the characteristic loss information of the content sample set.
12. The method of claim 11, wherein the calculating a feature distance between the first content sample pair and the second content sample pair based on the preset feature boundary value to obtain the feature loss information of the content sample set comprises:
screening out a content sample pair with the maximum sample similarity in the second content sample pair to obtain a target content sample pair;
calculating a similarity difference between the sample similarity of the first content sample pair and the sample similarity of the target content sample pair to obtain a first similarity difference;
and fusing the preset characteristic boundary value and the first similarity difference value to obtain the characteristic loss information of the content sample set.
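Claims 11 and 12 describe a margin-based loss over the matched pair and the hardest unmatched pair. The hinge form max(0, margin - difference) in this sketch is an assumed way of fusing the preset feature boundary value with the first similarity difference.

```python
import numpy as np

def feature_loss(sample_sim, margin=0.2):
    # sample_sim[i, j]: sample similarity between video sample i and text sample j;
    # diagonal entries are matched (first) pairs, off-diagonal entries are unmatched (second) pairs.
    n = sample_sim.shape[0]
    losses = []
    for i in range(n):
        positive = sample_sim[i, i]
        # Target content sample pair: the unmatched pair with maximum sample similarity (claim 12).
        hardest_negative = np.delete(sample_sim[i], i).max()
        # Fuse the preset feature boundary value (margin) with the similarity difference.
        losses.append(max(0.0, margin - (positive - hardest_negative)))
    return float(np.mean(losses))
```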
13. The method of claim 5, wherein determining the content loss information for the set of content samples based on the sample video features and sample text features comprises:
calculating the feature similarity between the sample video features and the sample text features to obtain the content similarity between the video sample and the text sample;
screening out, from the content sample set according to the content similarity, a third content sample pair of which the video sample is matched with the text sample and a fourth content sample pair of which the video sample is not matched with the text sample;
and acquiring a preset content boundary value corresponding to the content sample set, and calculating a content difference value between the third content sample pair and the fourth content sample pair according to the preset content boundary value to obtain content loss information of the content sample set.
14. The method of claim 13, wherein the calculating a content difference value between the third content sample pair and the fourth content sample pair according to the preset content boundary value to obtain content loss information of the content sample set comprises:
calculating a similarity difference between the content similarity of the third content sample pair and the content similarity of the fourth content sample pair to obtain a second similarity difference;
fusing the second similarity difference value with a preset content boundary value to obtain a content difference value between the third content sample pair and the fourth content sample pair;
and carrying out standardization processing on the content difference value to obtain the content loss information of the content sample set.
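Claims 13 and 14 describe the analogous coarse-grained loss over the sample video and sample text features. In this sketch, clamping at zero stands in for the "standardization processing", and summing both retrieval directions is an added assumption.

```python
import numpy as np

def content_loss(sample_video_feats, sample_text_feats, margin=0.2):
    # Row i of each matrix belongs to the same video/text pair (third, matched pairs);
    # every other combination is an unmatched (fourth) pair.
    v = sample_video_feats / (np.linalg.norm(sample_video_feats, axis=1, keepdims=True) + 1e-8)
    t = sample_text_feats / (np.linalg.norm(sample_text_feats, axis=1, keepdims=True) + 1e-8)
    sim = v @ t.T                                   # content similarity matrix (B, B)
    pos = np.diag(sim)
    # Similarity difference fused with the preset content boundary value, then
    # "standardized" by clamping negative values to zero (the clamp is an assumption).
    loss_v2t = np.maximum(0.0, margin + sim - pos[:, None])
    loss_t2v = np.maximum(0.0, margin + sim - pos[None, :])
    off_diag = 1.0 - np.eye(sim.shape[0])           # ignore the matched diagonal
    return float(((loss_v2t + loss_t2v) * off_diag).mean())
```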
15. The content retrieval method according to claim 1, further comprising:
when the content to be retrieved is text content, performing feature extraction on the text content to obtain text features of the text content;
and retrieving target video content corresponding to the text content in the preset content set according to the text characteristics.
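Claim 15 mirrors claim 1 in the text-to-video direction. A corresponding sketch, relying on the hypothetical cosine() helper from the first example and an assumed extract_text_feature function, is:

```python
def retrieve_video_for_text(text_content, video_index, extract_text_feature):
    # A text query retrieves the target video content from the preset content set.
    text_feat = extract_text_feature(text_content)
    scores = {vid: cosine(text_feat, vfeat) for vid, vfeat in video_index.items()}
    return max(scores, key=scores.get)
```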
16. A content retrieval apparatus, comprising:
an acquisition unit configured to acquire content to be retrieved for retrieving target content;
a first extraction unit configured to perform, when the content to be retrieved is video content, multi-modal feature extraction on the video content to obtain modal features of each modality;
a second extraction unit configured to perform feature extraction on the modal features of each modality respectively to obtain modal content features of each modality;
and a text retrieval unit configured to fuse the modal content features to obtain video features of the video content, and retrieve target text content corresponding to the video content in a preset content set according to the video features.
17. A computer-readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the content retrieval method according to any one of claims 1 to 15.
CN202110733613.6A 2021-06-30 2021-06-30 Content retrieval method and device and computer readable storage medium Pending CN113821687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110733613.6A CN113821687A (en) 2021-06-30 2021-06-30 Content retrieval method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110733613.6A CN113821687A (en) 2021-06-30 2021-06-30 Content retrieval method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113821687A true CN113821687A (en) 2021-12-21

Family

ID=78924061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110733613.6A Pending CN113821687A (en) 2021-06-30 2021-06-30 Content retrieval method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113821687A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024045929A1 (en) * 2022-09-01 2024-03-07 腾讯科技(深圳)有限公司 Model training method and apparatus, and computer device and storage medium

Similar Documents

Publication Publication Date Title
CN110532996B (en) Video classification method, information processing method and server
KR20220113881A (en) Method and apparatus for generating pre-trained model, electronic device and storage medium
CN111046275B (en) User label determining method and device based on artificial intelligence and storage medium
CN111930518B (en) Knowledge graph representation learning-oriented distributed framework construction method
CN111400601A (en) Video recommendation method and related equipment
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN113591918A (en) Training method of image processing model, image processing method, device and equipment
CN112418302A (en) Task prediction method and device
CN112765387A (en) Image retrieval method, image retrieval device and electronic equipment
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN114187486A (en) Model training method and related equipment
CN112988954B (en) Text classification method and device, electronic equipment and computer-readable storage medium
CN112784102B (en) Video retrieval method and device and electronic equipment
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN114358109A (en) Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment
CN111260074B (en) Method for determining hyper-parameters, related device, equipment and storage medium
CN113408282A (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN113821687A (en) Content retrieval method and device and computer readable storage medium
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
CN113824989B (en) Video processing method, device and computer readable storage medium
CN113395584B (en) Video data processing method, device, equipment and medium
CN115129908A (en) Model optimization method, device, equipment, storage medium and program product
CN116821781A (en) Classification model training method, text analysis method and related equipment
CN116415624A (en) Model training method and device, and content recommendation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination