WO2022188644A1 - Method, apparatus, device and medium for generating word weights - Google Patents

Method, apparatus, device and medium for generating word weights

Info

Publication number
WO2022188644A1
WO2022188644A1 · PCT/CN2022/078183 · CN2022078183W
Authority
WO
WIPO (PCT)
Prior art keywords
video
vector
word
text
module
Prior art date
Application number
PCT/CN2022/078183
Other languages
English (en)
French (fr)
Inventor
黄剑辉
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022188644A1
Priority to US17/975,519 (published as US20230057010A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635Overlay text, e.g. embedded captions in a TV program
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Definitions

  • the present application relates to the field of information processing, and in particular, to a method, apparatus, device and medium for generating word weights.
  • the video title is text information used to describe the video content of the video.
  • it is necessary to pre-extract the weight value of each word in the video title based on the understanding of the semantics of the video content, so as to facilitate the subsequent video search process. For example, the higher the weight value of a word in the video title, the higher the relevance of the word and the video content, and therefore the higher the importance of the word during search.
  • in the related art, the word weight is generated mainly by encoding the sentence of the video title and each word in the video title separately to obtain a sentence vector and a word vector, performing feature fusion on the encoded sentence vector and word vector to obtain a fusion vector, performing a binary classification on the fusion vector to determine whether the current word is a core word, and then outputting the word weight of the current word.
  • the present application provides a method, apparatus, device and medium for generating word weights, which can improve the accuracy and reliability of word weight values by incorporating the picture feature information of a video.
  • the technical solution is as follows:
  • a method for generating word weights executed by a computer device, the method comprising:
  • the video-related text includes at least one word
  • the video-related text is text information that is associated with the content of the video
  • Multi-modal feature fusion is performed on the features of video, video-related text and words to generate intermediate vectors of words;
  • the word weights of the words are generated.
  • an apparatus for generating word weights comprising:
  • an acquisition module configured to acquire a video and a video-related text, the video-related text includes at least one word, and the video-related text is text information that is associated with the content of the video;
  • the generation module is used to perform multi-modal feature fusion on the features of video, video-related text and words to generate intermediate vectors of words;
  • the generation module is also used to generate word weights of words based on the intermediate vectors of words.
  • the generation module includes an extraction module and a fusion module.
  • the extraction module is used to extract the video feature vector of the video; extract the text feature vector of the associated text of the video; and extract the word feature vector of the word;
  • the fusion module is configured to fuse the video feature vector, the text feature vector and the word feature vector to obtain an intermediate vector of the word.
  • the fusion module includes a first fusion submodule and a second fusion submodule.
  • the first fusion submodule is configured to perform the first fusion of the video feature vector, the text feature vector and the word feature vector to obtain a first fusion vector;
  • the second fusion sub-module is configured to perform a second fusion between the first fusion vector and the word feature vector to obtain an intermediate vector of the word.
  • the first fusion sub-module includes a first splicing module and a first mapping module.
  • the first splicing module is used to sequentially splice the video feature vector, the text feature vector and the word feature vector to obtain a first splicing vector
  • the first mapping module is configured to perform fully-connected feature mapping on the first splicing vector to obtain the first fusion vector.
  • the second fusion sub-module includes a second splicing module and a second mapping module.
  • the second splicing module is used to splice the first fusion vector and the word feature vector in sequence to obtain a second splicing vector
  • the second mapping module is configured to perform fully-connected feature mapping on the second spliced vector to obtain the intermediate vector of the word.
  • the generation module further includes a conversion module.
  • the conversion module is configured to perform dimension transformation on the intermediate vector to obtain a one-dimensional vector
  • the conversion module is further configured to normalize the one-dimensional vector to obtain the word weight of the word.
  • the conversion module is configured to convert the one-dimensional vector through a threshold function to obtain the word weight of the word.
  • the extraction module includes a video extraction module, a text extraction module and a word extraction module.
  • the video extraction module includes a framing module, an extraction sub-module and a calculation module.
  • the framing module is configured to perform a framing operation on the video to obtain at least two video frames
  • the extraction submodule is used to extract the video frame vectors of at least two video frames
  • the calculation module is configured to calculate the average vector of the video frame vectors of the at least two video frames, and determine the average vector as the video feature vector; or, calculate the weighted vector of the video frame vectors of the at least two video frames, and determine the weighted vector as the video feature vector.
  • the computing module is used to: determine the target object included in each video frame through a target detection model; classify the target object through a classification model to obtain the target object classification corresponding to each video frame; calculate the similarity between the target object classification corresponding to each video frame and the word, and determine the weight of the video frame vector of each video frame according to the similarity, the weight being positively correlated with the similarity;
  • a weighted vector of the video frame vectors of the at least two video frames is calculated, and the weighted vector is determined as the video feature vector.
  • the extraction submodule is further configured to invoke the residual neural network to extract video frame vectors of at least two video frames in the video.
  • the text extraction module is configured to call the bidirectional encoding conversion (BERT) network to extract the text feature vector of the video-related text, or call the long short-term memory network to extract the text feature vector of the video-related text.
  • the word extraction module includes a word segmentation module and a word extraction sub-module.
  • a word segmentation module is used to perform word segmentation on the video-related text to obtain words
  • the word extraction sub-module is configured to invoke a deep neural network to extract word feature vectors of words.
  • the word segmentation module is further configured to call a Chinese word segmentation tool to segment the video-related text to obtain words.
  • the extraction module is used to:
  • At least two of the video frame vector, the audio frame vector and the text subtitle vector are fused to obtain a video feature vector.
  • a computer device comprising: a processor and a memory, the memory storing a computer program, the computer program being loaded and executed by the processor to implement the above word weight generation method.
  • a computer-readable storage medium storing a computer program, the computer program being loaded and executed by a processor to implement the word weight generating method as described above.
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above-mentioned word weight generation method.
  • by performing multi-modal feature fusion on the features of the video, the video-related text and the words, an intermediate vector is generated, and based on the intermediate vector, the word weight of the word is generated.
  • the above-mentioned word weight generation method is used to pre-extract the weight value of the word; it not only considers the features of the text dimension but also introduces the features of the video dimension, and generates the word weight based on the multi-dimensional features, which is beneficial to improving the accuracy and reliability of the output word weights and the distinction between key words and confusing words in the video-related text.
  • FIG. 1 is a schematic diagram of a word weight generation system provided according to an exemplary embodiment
  • FIG. 2 is a flowchart of a method for generating word weights provided by an exemplary embodiment of the present application
  • FIG. 3 is a schematic diagram of a word weight generation model provided by an exemplary embodiment of the present application.
  • FIG. 4 is a schematic diagram of a word weight generation model provided by another exemplary embodiment of the present application.
  • FIG. 5 is a flowchart of a method for generating word weights provided by an exemplary embodiment of the present application
  • FIG. 6 is a flowchart of generating a video feature vector provided by an exemplary embodiment of the present application.
  • FIG. 7 is a flowchart of generating a text feature vector provided by an exemplary embodiment of the present application.
  • FIG. 8 is a flowchart of generating a word feature vector provided by an exemplary embodiment of the present application.
  • FIG. 9 is a flowchart of a method for training a word weight generation model provided by an exemplary embodiment of the present application.
  • FIG. 10 is a flowchart of a method for generating word weights provided by an exemplary embodiment of the present application.
  • FIG. 11 is a structural block diagram of an apparatus for generating word weights provided by an exemplary embodiment of the present application.
  • FIG. 12 shows a structural block diagram of a computer device provided by an exemplary embodiment of the present application.
  • Word importance: refers to how much a word contributes to the meaning of a sentence.
  • the components of a complete sentence include subject, predicate, object, attributive, adverbial, and complement.
  • An exemplary sentence is "Double-click this video, you will find that roasting pork is easier than roasting fish". After removing the conjunctions and personal pronouns, the sentence is mainly composed of the words "double-click", "video", "discovery", "roast pork", "roast fish", "practice" and "simple". Based on an understanding of the overall meaning of the sentence, it is easy to see that "roast pork" and "roast fish" play a key role in the meaning of the sentence. More specifically, "roast pork" contributes more to the meaning of the sentence than "roast fish", that is, the importance of the word "roast pork" is higher than that of the word "roast fish".
  • the weight value of the word in the sentence is used to represent the importance of the word.
  • the weight value of "roast pork” is 0.91
  • the weight value of "roast fish” is 0.82, that is, by comparing the weight values, it can be concluded that "roast pork” is more important than “roast fish”.
  • Residual Neural Network (ResNet): a deep learning-based feature extraction neural network.
  • When a neural network can converge, as the depth of the network increases, the performance of the network first gradually increases to saturation and then decreases rapidly, which is the network degradation problem; traditional deep learning also suffers from the gradient dispersion (vanishing gradient) problem; the residual neural network adds identity mappings to the deep learning neural network, which solves the above-mentioned network degradation problem and gradient dispersion problem.
  • a residual neural network is used to convert an image into a mathematical language in which operations can be performed.
  • for example, a residual neural network converts a video frame into a video frame vector containing information that reflects the content of the video frame, that is, the video frame vector can be used in place of the video frame.
  • BERT: Bidirectional Encoder Representations from Transformers.
  • Deep Neural Network (DNN): a multi-layer neural network with a fully connected neuron structure that converts objective things in the real world into vectors that can be manipulated by mathematical formulas.
  • the DNN converts the input word into a word vector, where the word vector contains information reflecting the content of the word, that is, the word vector can be used to replace the above-mentioned word.
  • Threshold function: realizes the conversion between numerical intervals; for example, if a number x lies in the interval [0, 100], the threshold function converts x into a number y in the interval [0, 1].
  • The sigmoid function is one example of a threshold function.
  • Fig. 1 is a schematic diagram of a word weight generation system according to an exemplary embodiment.
  • In the model training stage, the model training device 110 trains a word weight generation model with high accuracy using a preset training sample set; in the word weight prediction stage, the word weight prediction device 120 uses the word weight generation model and the input video and text to predict the weight values of the words in the text.
  • the above-mentioned model training device 110 and word weight prediction device 120 may be computer devices with machine learning capabilities, for example, the computer device may be a terminal or a server.
  • the model training device 110 and the word weight prediction device 120 may be the same computer device, or the model training device 110 and the word weight prediction device 120 may also be different computer devices. Also, when the model training device 110 and the word weight prediction device 120 are different devices, they may be the same type of device, for example, both may be servers; or, the model training device 110 and the word weight prediction device 120 may also be different types of devices.
  • the above server may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
  • the above-mentioned terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • FIG. 2 shows a flowchart of a method for generating word weights provided by an exemplary embodiment of the present application. The method is performed by a computer device. As shown in Figure 2, the method includes:
  • Step 201 Obtain a video and a video-related text, where the video-related text includes at least one word.
  • the video-related text is text information that is related to the content of the video.
  • the video associated text is the title of the video or the video introduction of the video.
  • the video-related text includes at least two words.
  • the video-related text is a title corresponding to the video
  • the video-related text and the video are independent of each other, wherein the title is manually annotated or generated by a machine, and is used to briefly explain the central meaning of the video.
  • the content of the video is an introduction to the practice of roasting meat, and the title is "Teach you how to cook roasted meat”.
  • the video associated text is a video introduction corresponding to the video
  • the video associated text and the video are independent of each other, wherein the video introduction is written by humans or generated by a machine, and is used to briefly explain the specific content of the video.
  • the content of the video is to introduce the practice of roasting meat
  • the video introduction is "This video introduces the practice of roasting meat, which is divided into three steps: pretreatment, marinating, and roasting".
  • the computer device obtains the video and video associated text through a local database or content server.
  • the content server is used to store a large number of videos and video-related texts corresponding to the videos, and push them to the user side for display.
  • the content server is a background server of a video-on-demand application, a short video application, and a song application.
  • the computer device and the content server are the same or different devices.
  • Step 202 Perform multi-modal feature fusion on the features of three types of information, video, video-related text and words, to generate intermediate vectors of words.
  • multi-modal feature fusion refers to the computer device performing feature extraction on the video, the video-related text and the words to obtain a video feature vector, a text feature vector and a word feature vector, and then performing a vector fusion operation on the video feature vector, the text feature vector and the word feature vector.
  • since the video, the video-related text and the words are information of different modalities, feature fusion of the features of the video, the video-related text and the words can be called multi-modal feature fusion.
  • the above-mentioned multimodal feature fusion includes the following two steps:
  • the video feature vector indicates the video feature information
  • the text feature vector indicates the feature information of the video-related text
  • the word feature vector indicates the word feature information.
  • the video feature vector is used to reflect the feature of the video content
  • the text feature vector is used to reflect the semantic feature of the text
  • the word feature vector is used to reflect the semantic feature of the word.
  • the video feature vector, text feature vector and word feature vector are fused to obtain the intermediate vector of the word.
  • the intermediate vector obtained by feature fusion also contains feature information of video, video-related text and words.
  • Step 203 Generate word weights of words based on the intermediate vectors of words.
  • generating the word weight of the word based on the intermediate vector of the word includes the following two steps:
  • the intermediate vector is dimensionally transformed to obtain a one-dimensional vector, and the one-dimensional vector is normalized to obtain the word weight of the word.
  • the computer device can generate intermediate vectors of words based on multimodal feature fusion of the features of videos, video-related texts and words.
  • the intermediate vector is a multi-dimensional vector containing feature information of the video, video-related text and words; in one embodiment, the intermediate vector is transformed into a one-dimensional vector through fully connected mapping.
  • the computer device performs dimension transformation on the intermediate vector through the above-mentioned fully connected mapping. For example, if the dimension of the intermediate vector is 388 dimensions, the dimension transformation is performed to obtain a 1-dimensional vector. Among them, the 1-dimensional vector contains the importance information of the word vector in the sentence vector. By normalizing the 1-dimensional vector, the 1-dimensional vector can be converted into a value in the interval [0, 1], and the value is the word weight of the word. In one embodiment, the computer device can normalize the one-dimensional vector by using a threshold function. For example, the one-dimensional vector can be converted to a numerical interval by a sigmoid function, and the one-dimensional vector can be mapped to the interval [0, 1 ] to get the word weight.
  • the normalization process for the 1-dimensional vector can also be implemented through a linear function, such as a min-max scaling function. The present application does not limit the manner of realizing the normalization processing.
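  • As an illustrative sketch only (the layer size, weights and bias below are random placeholders, not values from the present application), the following Python fragment shows how a 388-dimensional intermediate vector could be reduced to a single value by a fully connected mapping and then normalized with a sigmoid threshold function, with min-max scaling shown as the linear alternative:

```python
import numpy as np

def sigmoid(x: float) -> float:
    # Threshold function: maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def min_max_scale(x: float, x_min: float, x_max: float) -> float:
    # Linear alternative: maps x from [x_min, x_max] into [0, 1].
    return (x - x_min) / (x_max - x_min)

# Hypothetical 388-dimensional intermediate vector and projection parameters.
rng = np.random.default_rng(0)
intermediate = rng.normal(size=388)
w = rng.normal(size=388)              # fully connected mapping reducing 388 -> 1
b = 0.0

score = float(intermediate @ w + b)   # dimension transformation to a 1-dimensional value
word_weight = sigmoid(score)          # normalized word weight in the interval [0, 1]
print(round(word_weight, 3))
```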
  • by performing multi-modal feature fusion on the features of the video, the video-related text and the words, an intermediate vector is generated, and based on the intermediate vector, the word weight of the word is generated.
  • the above-mentioned word weight generation method is used to pre-extract the weight value of the word; it not only considers the features of the text dimension but also introduces the features of the video dimension, and generates the word weight based on the multi-dimensional features, which is beneficial to improving the accuracy and reliability of the output word weights and the distinction between key words and confusing words in the video-related text.
  • FIG. 3 shows a schematic diagram of a word weight generation model provided by an exemplary embodiment of the present application.
  • the word weight generation model 300 in FIG. 3 includes a word segmentation network 310 , a transformation network 320 , a fusion network 330 and a mapping network 340 .
  • the word segmentation network 310 is used to segment the video associated text to obtain at least one word; the conversion network 320 is used to convert the video into a video feature vector, convert the video associated text into a text feature vector, and convert the word into a word feature vector; fusion; The network 330 is used to fuse the video feature vector, the text feature vector and the word feature vector to obtain an intermediate vector; the mapping network 340 is used to map the intermediate vector to the word weight of the word corresponding to the intermediate vector.
  • FIG. 4 shows a schematic diagram of a word weight generation model provided by another exemplary embodiment of the present application.
  • the word weight generation model includes a word segmentation network 310 , a conversion network 320 , a fusion network 330 , and a mapping network 340 .
  • the conversion network 320 includes a first conversion sub-network 321 , a second conversion sub-network 322 and a third conversion sub-network 323 .
  • the fusion network 330 includes a first fusion sub-network 331 and a second fusion sub-network 332 .
  • FIG. 5 shows a flowchart of a method for generating word weights according to an exemplary embodiment of the present application. The method is performed by a computer device.
  • the word weight generation method includes:
  • Step 510 call the first conversion sub-network 321 to process the video, and output the video feature vector of the video;
  • the first conversion sub-network 321 is used to perform framing operations on the video
  • the computer device can obtain at least two video frames of the video by calling the first conversion sub-network 321 to process the video, and then extract the video frame vectors of the at least two video frames;
  • the computer device then calculates the average vector of the video frame vectors of the at least two video frames and determines the average vector as the video feature vector; or, calculates the weighted vector of the video frame vectors of the at least two video frames and determines the weighted vector as the video feature vector.
  • the computer device determines the target object included in each video frame through the target detection model, and classifies the target object through the classification model to obtain the target object classification corresponding to each video frame.
  • the target detection model is used to detect target objects included in video frames, such as people, animals, plants, and different types of objects.
  • the classification model is used to classify the detected target objects, thereby obtaining the target object classification.
  • the detected target object is the area where the animal in the video frame is located, and inputting the area into the classification model can obtain that the target object is a cat.
  • the target detection model and classification model are implemented based on Convolutional Neural Network (CNN).
  • the target detection model can not only detect target objects in video frames, but also classify target objects. In this case, the computer equipment can directly obtain the target object classification through the target detection model.
  • the computer device calculates the similarity between the target object classification and the word corresponding to each video frame, and determines the weight of the video frame vector of each video frame according to the similarity corresponding to each video frame, and the weight is positively correlated with the similarity. That is, the higher the similarity between the target object classification and the word, the higher the weight of the video frame vector of the video frame corresponding to the target object classification.
  • the computer device calculates the video frame of the at least two video frames according to the video frame vectors of the at least two video frames and the respective weights of the video frame vectors of the at least two video frames A weighted vector of vectors, and determine the weighted vector as a video feature vector.
  • In this way, the weight of the video frame vector associated with the word can be increased, so that the video frame associated with the word plays a greater role in the process of determining the video feature vector. As a result, the determined video feature vector can highlight the features in the video that are strongly associated with the word, thereby improving the accuracy of determining the video feature vector.
  • the above weights can also be manually set.
  • the above frame division operation includes at least the following two processing methods:
  • the computer device performing the frame division operation on the video refers to collecting video frames every 0.2s.
  • the video duration is 30s, it is preset within the first 20% of the video duration to capture video frames every 1s, and within the middle 60% of the video duration, capture video frames every 0.2s, During the last 20% of the video duration, video frames are collected every 1s.
  • extracting the video frame vectors of the at least two video frames by the computer device includes: invoking a residual neural network to extract the video frame vectors of the at least two video frames in the video.
  • the computer device divides the video 601 into frames to obtain four video frames 602 .
  • the above four frame vectors are averaged or weighted-summed to obtain the video feature vector.
  • calculating the average vector of the video frame vectors of the at least two video frames refers to calculating the average value after accumulating the first frame vector, the second frame vector, the third frame vector and the fourth frame vector.
  • the above calculation of the weighted vector of the video frame vectors of the at least two video frames refers to the weighted summation of the first frame vector, the second frame vector, the third frame vector and the fourth frame vector.
  • the first frame vector is a
  • the second frame vector is b
  • the third frame vector is c
  • the fourth frame vector is d.
  • the first frame vector is given a weight of 0.3
  • the second frame vector is given a weight of 0.1
  • the third frame vector is given a weight of 0.2
  • the fourth frame vector is given a weight of 0.4
  • the resulting video feature vector is 0.3a+0.1b+0.2c+0.4d.
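  • A small numpy sketch of the average and weighted combinations described above; the four frame vectors stand in for a, b, c and d, the weights 0.3, 0.1, 0.2 and 0.4 mirror the example, and the 318-dimensional size is only an assumption borrowed from the splicing example later in this description:

```python
import numpy as np

dim = 318                              # assumed frame-vector dimension
rng = np.random.default_rng(1)
frames = rng.normal(size=(4, dim))     # stand-ins for frame vectors a, b, c, d

# Option 1: average vector of the video frame vectors.
video_feature_avg = frames.mean(axis=0)

# Option 2: weighted vector, e.g. 0.3*a + 0.1*b + 0.2*c + 0.4*d as in the example above.
weights = np.array([0.3, 0.1, 0.2, 0.4])
video_feature_weighted = weights @ frames

print(video_feature_avg.shape, video_feature_weighted.shape)
```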
  • Step 520 call the second conversion sub-network 322 to process the video-related text, and output the text feature vector of the video-related text;
  • the second transformation sub-network 322 includes a Bidirectional Encoder Representation from Transformers (BERT) network and/or a Long Short-Term Memory (Long Short-Term Memory, LSTM) network.
  • the computer device invokes the bidirectional encoding conversion network to extract the text feature vector of the video-related text, or invokes the long short-term memory network to extract the text feature vector of the video-related text.
  • the computer device can also extract the text feature vector of the video-related text by invoking the bidirectional encoding conversion network and the long short-term memory network respectively, and then calculate the average or weighted average of the text feature vectors extracted by the two networks to obtain the final text feature vector.
  • the video-related text 701 is input to the BERT network 702 to obtain the text feature vector.
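  • The embodiment does not specify the BERT implementation; as one hedged illustration, a text feature vector could be obtained with the Hugging Face transformers library and the public bert-base-chinese checkpoint, taking the [CLS] token representation as the text feature vector:

```python
# Sketch only: the checkpoint name, the example title and the use of the [CLS] token
# are assumptions for illustration, not taken from the present application.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

video_related_text = "教你做烤肉"  # hypothetical title, e.g. "Teach you how to cook roasted meat"
inputs = tokenizer(video_related_text, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

text_feature_vector = outputs.last_hidden_state[:, 0, :]  # [CLS] embedding, shape (1, 768)
print(text_feature_vector.shape)
```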
  • Step 530 call the word segmentation network 310 to segment the video-related text to obtain words
  • jieba (a third-party Chinese word segmentation library) is built into the word segmentation network, and jieba supports three word segmentation modes. First, precise mode: the sentence is segmented as accurately as possible, without redundant data, which is suitable for text analysis; second, full mode: all the words in the sentence that may be words are segmented out, which is very fast but produces redundant data; third, search engine mode: on the basis of the precise mode, the long words are segmented again. In an actual usage scenario, the mode is selected according to the type, length, etc. of the video-related text, and finally the video-related text is converted into at least one word. By invoking the above word segmentation network, the computer device can implement word segmentation for the video-related text (see the sketch after the example below).
  • the associated text of the video is "This Luban is out of service, the economy is suppressed, and I can't get up at all, the mobile phone is here for you to play!";
  • the words obtained from word segmentation in the precise mode include "this", "Luban", "no help", "economy", "by", "suppressed", "completely", "can't get up", "mobile phone", "give", "you come and play", "!";
  • the words obtained from word segmentation in the full mode include "Luban", "no rescue", "economy", "suppression", "completely", "mobile phone";
  • the words obtained from word segmentation in the search engine mode include "this", "Luban", "no", "saved", "le", "economy", "by", "suppressed", "completely", "start", "no", "mobile", "give", "you", "come", "play", "!".
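  • The sketch below illustrates the three jieba modes named above; the input string is a hypothetical back-translation of the example title, not necessarily the exact text used in this embodiment:

```python
import jieba

text = "这个鲁班没救了,经济被压制,完全起不来,手机给你来玩!"  # hypothetical video-related text

precise = list(jieba.cut(text, cut_all=False))  # precise mode: no redundant words
full = list(jieba.cut(text, cut_all=True))      # full mode: every possible word, with redundancy
search = list(jieba.cut_for_search(text))       # search engine mode: long words segmented again

print(precise)
print(full)
print(search)
```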
  • Step 540 call the third conversion sub-network 323 to process the word, and output the word feature vector of the word;
  • the third transformation sub-network 323 comprises a deep neural network.
  • the computer equipment invokes the deep neural network to extract the word feature vector of the word.
  • the computer device inputs the word into the DNN 801 to obtain the word feature vector.
  • Step 550 call the first fusion sub-network 331 to perform the first fusion of the video feature vector, the text feature vector and the word feature vector to obtain the first fusion vector;
  • the computer device invokes the first fusion sub-network to sequentially splice the video feature vector, the text feature vector and the word feature vector to obtain the first splicing vector. Afterwards, fully-connected feature mapping is performed on the first splicing vector to obtain the first fusion vector.
  • the above splicing refers to concatenating all the vectors along the feature dimension.
  • the dimension of the original video frame vector is 318 dimensions
  • the text vector is 50 dimensions
  • the word vector is 10 dimensions
  • the obtained first splice vector dimension is 378 dimensions.
  • the above-mentioned fully connected feature mapping refers to mapping the obtained first splicing vector to obtain the first fusion vector.
  • the first splicing vector is [a, b, c], where a, b, c respectively indicate video information, video-related text information and word information
  • the first fusion vector [0.9a, 3b, 10c], where 0.9a, 3b, 10c, respectively indicate video information, video-related text information and word information
  • the fully connected feature map changes the degree of fusion between video information, video-related text information and word information.
  • the above examples are only used for explanation.
  • the actual fully connected feature map is implemented in a high-dimensional space, and the degree of fusion changes with the changes of the input video, video-related text and words.
  • Step 560 call the second fusion sub-network 332 to perform the second fusion of the first fusion vector and the word feature vector to obtain the intermediate vector of the word;
  • the computer device invokes the second fusion sub-network 332 to sequentially splice the first fusion vector and the word feature vector to obtain the second splicing vector. After that, fully-connected feature mapping is performed on the second splicing vector to obtain the intermediate vector of the word.
  • the splicing and fully-connected feature maps are similar to the first fusion sub-network, and are not repeated here.
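  • The following PyTorch sketch mirrors the two fusions described above, using the 318/50/10 dimensions from the splicing example; the sizes of the fully connected layers and the random inputs are assumptions for illustration:

```python
import torch
import torch.nn as nn

video_feature = torch.randn(1, 318)  # video feature vector
text_feature = torch.randn(1, 50)    # text feature vector of the video-related text
word_feature = torch.randn(1, 10)    # word feature vector

# First fusion: splice the three vectors, then apply fully-connected feature mapping.
first_splice = torch.cat([video_feature, text_feature, word_feature], dim=-1)  # (1, 378)
fc1 = nn.Linear(378, 378)
first_fusion = fc1(first_splice)

# Second fusion: splice the first fusion vector with the word feature vector again.
second_splice = torch.cat([first_fusion, word_feature], dim=-1)                # (1, 388)
fc2 = nn.Linear(388, 388)
intermediate_vector = fc2(second_splice)  # intermediate vector of the word
print(intermediate_vector.shape)
```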
  • Step 570 Call the mapping network 340 to perform dimension transformation on the intermediate vector to obtain a one-dimensional vector, and normalize the one-dimensional vector to obtain the word weight of the word.
  • dimension transformation is performed on the intermediate vector through the above-mentioned full connection mapping.
  • the dimension transformation is performed to obtain a 1-dimensional vector.
  • the 1-dimensional vector contains the importance information of the word feature vector in the text feature vector.
  • the computer device uses a sigmoid function to convert the 1-dimensional vector into a numerical interval (ie, normalize it), and obtain the word weight by mapping the 1-dimensional vector to the interval [0, 1].
  • the method provided in this embodiment obtains the video feature vector, the text feature vector and the word feature vector by performing feature extraction on the video, the video-related text and the words; the feature vectors of the above three modalities are then spliced and fully-connected mapped to obtain the first fusion vector, and the first fusion vector and the word feature vector are spliced and fully-connected mapped to obtain an intermediate vector. Based on the intermediate vector, the word weight of the word is obtained.
  • the method provided in this embodiment also strengthens the proportion of information from the current word feature vector in the intermediate vector by splicing the word feature vector twice, which helps improve the discrimination between the weight values of different words in the video-related text.
  • the method provided in this embodiment also fuses the video feature vector, the text feature vector and the word feature vector; it not only considers the features of the text dimension but also introduces the features of the video dimension, and performs word weight generation based on the multi-dimensional features, which is beneficial to improving the accuracy and reliability of the output word weight and improving the distinction between key words and confusing words in the video-related text.
  • the method provided in this embodiment further realizes feature extraction of the video by using a residual neural network, feature extraction of the video-related text by using a bidirectional encoding conversion network or a long short-term memory network, and feature extraction of the words by using a deep neural network. Converting natural language into feature vectors capable of mathematical operations simplifies the mathematical operations of the word weight generation method of the present application.
  • FIG. 9 is a flowchart of a training method for a word weight generation model provided by an exemplary embodiment of the present application.
  • the method is performed by a computer device.
  • the method includes:
  • Step 901 Obtain sample video, sample video associated text and sample word weights.
  • the sample video associated text is text information that is associated with the content of the sample video.
  • the sample video associated text includes at least one word.
  • the sample word weight is obtained by manually calibrating the importance of the words in the associated text of the sample video.
  • Step 902 Input the sample video and the associated text of the sample video into the word weight generation model.
  • Step 903 Obtain the predicted word weight output by the word weight generation model.
  • the predicted word weight refers to the word weight output by the word weight generation model obtained by inputting the sample video and the associated text of the sample video into the word weight generation model by the computer equipment.
  • Step 904 Calculate the error between the weight of the sample word and the weight of the predicted word.
  • Step 905 According to the error, optimize the network parameters of the word weight generation model.
  • the network parameters of the word weight generation model are used to adjust the performance of the word weight generation model.
  • the network parameters of the word weight generation model include at least the network parameters of the ResNet, the network parameters of the BERT, the network parameters of the DNN, and the fusion parameters among the video feature vector, the text feature vector and the word feature vector.
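  • A hedged sketch of one training step covering steps 902 to 905, assuming the word weight generation model is a PyTorch module that maps a sample video and its associated text to predicted word weights in [0, 1]; the binary cross-entropy loss and the optimizer interface are illustrative choices, since the embodiment only states that the error between the sample word weight and the predicted word weight is used to optimize the network parameters:

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, sample_video, sample_text, sample_word_weights):
    model.train()
    predicted_word_weights = model(sample_video, sample_text)   # steps 902-903
    loss = nn.functional.binary_cross_entropy(                   # step 904: error between
        predicted_word_weights, sample_word_weights              # sample and predicted weights
    )
    optimizer.zero_grad()
    loss.backward()                                               # step 905: optimize the
    optimizer.step()                                              # network parameters
    return loss.item()
```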
  • step 201 acquires video and video-related text, where the video-related text includes at least one word, and the video acquisition method includes: acquiring video files in a target video library one by one as target videos for subsequent processing.
  • the target video is a video clip of a stored video file in the video library
  • the extraction of the target video includes at least one of the following methods:
  • the video files stored in the video library are input into the video extraction model, and after the video extraction model performs feature extraction on the above-mentioned video files, the correlation between the frames in the above-mentioned video files is analyzed, so that the target video is extracted from the video files.
  • the target video is a video file that meets the screening conditions
  • the target video is a video file uploaded by a specified user
  • the target video is a video file of a video type that meets the requirements, or a video file whose video duration reaches a threshold.
  • the target video is a video file of a video type that meets the requirements
  • exemplarily, when the video file is a video of a certain episode in a TV series, a movie video, a movie clip, a documentary video, etc., the video file is used as the target video.
  • target video is a video file uploaded by a designated user, exemplarily, when the video file is a video uploaded by a professional organization, a video uploaded by a public figure, or a video uploaded by an authoritative person, the video file is used as the target video.
  • extracting the video feature vector of the video includes the following steps:
  • the video includes video frames and audio frames, and the video frames here are represented as pictures.
  • the picture features refer to the features extracted from the visual presentation of the video, wherein the picture features include features corresponding to text content such as theme names, bullet screens, and dialogues, as well as features corresponding to video picture frames.
  • the computer device uses ResNet features to extract the video frame to obtain the video frame vector, that is, convert the video frame from the original natural language into a vector capable of performing mathematical operations.
  • the audio frame is expressed as the sound in the video.
  • in one embodiment, the audio frame and the picture frame are matched, that is, the audio and picture are synchronized, meaning the audio frame and the picture frame are extracted at the same time point;
  • in another embodiment, the audio frame and the picture frame are mismatched, that is, the audio and picture are asynchronous, meaning the time points for extracting the audio frame and the picture frame are inconsistent.
  • the computer device uses a convolutional neural network (Convolutional Neural Networks, CNN) to perform feature extraction on the audio frame to obtain an audio frame vector, that is, the audio frame is converted from its original form into a vector on which mathematical operations can be performed.
  • the text in the video frame refers to the text content related to the target video and involved in the target video.
  • the text in the video frame includes the content of the bullet screen, the content appearing in the screen, the content of the dialogue, and so on.
  • the computer device uses the BERT feature to extract the text on the screen to obtain a text subtitle vector, that is, the text on the screen is converted from the original natural language into a vector capable of performing mathematical operations.
  • At least two of a video frame vector, an audio frame vector, and a text subtitle vector are fused in a weighted manner.
  • the video frame vector is x
  • the audio frame vector is y
  • the text subtitle vector is z.
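  • Since the embodiment only states that at least two of the vectors x, y and z are fused in a weighted manner, the fusion weights and the common dimension in the sketch below are assumptions:

```python
import numpy as np

dim = 318
rng = np.random.default_rng(2)
x = rng.normal(size=dim)  # video frame vector
y = rng.normal(size=dim)  # audio frame vector
z = rng.normal(size=dim)  # text subtitle vector

w_x, w_y, w_z = 0.5, 0.3, 0.2                        # assumed fusion weights
video_feature_vector = w_x * x + w_y * y + w_z * z   # weighted fusion
print(video_feature_vector.shape)
```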
  • FIG. 10 is a flowchart of a method for generating word weights provided by an exemplary embodiment of the present application.
  • the input word is xi
  • the extracted video key frame is fi
  • the input of the second fusion is the first fusion vector Fusion1 obtained from the first fusion and the word vector.
  • the second fusion vector is Fusion2 = fusion(Fusion1, Vword); the two fusions in the model's feature fusion process strengthen the importance of the word and can effectively identify the importance of the word in the sentence, that is, the word weight value.
  • In FIG. 10, three kinds of circles are used to indicate one dimension of information of the key frame encoding vector, one dimension of the sentence encoding vector, and one dimension of the word encoding vector, respectively.
  • the first fusion vector Fusion1 and the second fusion vector Fusion2 use the proportion relationship of the above three circles to represent the fusion degree of the key frame encoding vector, the sentence encoding vector and the word encoding vector.
  • the process of realizing video search can be divided into three stages: a model training stage, a preprocessing stage, and a search stage.
  • Model training phase: the server obtains the sample video, the associated text of the sample video, and the sample word weights.
  • the sample video associated text is text information that is associated with the content of the sample video, and the sample word weight is obtained by manually calibrating the importance of words in the sample video associated text.
  • the server obtains the above training samples through a local database or through a content server. After acquiring the training samples, the server inputs the training samples into the word weight generation model, and obtains the predicted word weights of the words in the video-related text predicted by the word weight generation model. After that, the server trains the word weight generation model according to the error between the sample word weight and the predicted word weight.
  • the implementation process of the above model training phase can be implemented as a training method for a word weight generation model executed by a computer device, or, alternatively, implemented as a training apparatus for the word weight generation model.
  • Preprocessing stage: the server obtains the video and the video associated text.
  • the video associated text includes a video title and/or a video introduction of the video.
  • the video-related text includes at least one word
  • the video-related text is text information that is associated with the content of the video
  • the video-related text is either marked by humans or generated by a machine.
  • the server obtains the above information through a local database or through a content server.
  • the above video is used to push to the user's client for playback.
  • the server performs word segmentation on the video-related text to obtain each word in the video-related text.
  • the server extracts the video feature vector of the video, the text feature vector of the video-related text, and the word feature vector of the words in the video-related text through the above-mentioned word weight generation model that has completed the training. And perform feature fusion on the three feature vectors to generate the intermediate vector of the word.
  • the feature fusion refers to the vector fusion operation on the video feature vector, the text feature vector and the word feature vector.
  • the server can generate the word weights of the words, so as to obtain the word weights of each word in the video-related text.
  • the server performs dimension transformation on the intermediate vector to obtain a one-dimensional vector, and then converts the one-dimensional vector through a threshold function to obtain the word weight of the word.
  • Search stage: when a user searches for a video on the client of the terminal through a search term, the server receives a video search request sent by the client.
  • the client is wired or wirelessly connected to the server, the server is a background server of the client, and the video search request includes at least one search term.
  • the server matches the search term with words in the video associated text of each video, and determines whether the search term matches the video according to the similarity obtained by the matching, thereby obtaining a matching video.
  • the server uses the word weights obtained in the preprocessing stage.
  • the server calculates the similarity between the search term and each word in the video-related text, and calculates, according to the word weights corresponding to the words, the weighted average of these similarities as the similarity between the search term and the video-related text.
  • the search term is x
  • the video-related text includes the terms o, p, q.
  • the similarity between x and o is 0.8
  • the similarity between x and p is 0.3
  • the similarity between x and q is 0.5
  • the word weight of o is 0.5
  • the word weight of p is 0.2
  • the word weight of q is 0.3.
  • when the similarity between the search term and the video-related text reaches the similarity threshold, the server determines the video corresponding to the video-related text as a matching video. For example, if the similarity threshold is 0.5, the search term is x, and the video-related text includes the words o, p and q as above, the weighted similarity is 0.8×0.5 + 0.3×0.2 + 0.5×0.3 = 0.61, which exceeds the threshold, so the computer device determines the video corresponding to the video-related text as a matching video.
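  • A worked version of the search-stage example, using only the similarities and word weights given above:

```python
# Similarities between search term x and the words o, p, q, and their word weights.
similarities = {"o": 0.8, "p": 0.3, "q": 0.5}
word_weights = {"o": 0.5, "p": 0.2, "q": 0.3}

text_similarity = sum(similarities[w] * word_weights[w] for w in similarities)
print(text_similarity)               # 0.8*0.5 + 0.3*0.2 + 0.5*0.3 = 0.61

threshold = 0.5
print(text_similarity >= threshold)  # True, so the video is returned as a matching video
```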
  • the computer device first performs a quick pre-retrieval among the videos according to the search term, and after obtaining a subset of the videos, searches these videos in the above manner to improve efficiency. After the server determines the matching video, it sends the matching video to the client for playback.
  • the implementation process of the above search phase can be implemented as a video search method executed by a computer device (server or terminal). Alternatively, it is implemented as a video search device.
  • the above-mentioned preprocessing stage is executed before the search stage.
  • the above-mentioned preprocessing stage is interspersed with the search stage, for example, after the server receives the search request, the preprocessing stage and the subsequent steps of the search stage are executed.
  • the above description only takes the method provided in the present application applied to a video search scenario as an example for description, and is not intended to limit the application scenario of the present application.
  • FIG. 11 is a structural block diagram of an apparatus for generating word weights according to an exemplary embodiment of the present application. As shown in Figure 11, the device includes:
  • an acquisition module 1120 configured to acquire a video and a video-related text, the video-related text includes at least one word, and the video-related text is text information associated with the content of the video;
  • the generating module 1140 is used to perform multi-modal feature fusion on the features of the three types of information of video, video-related text and words, to generate intermediate vectors of words;
  • the generating module 1140 is further configured to generate word weights of the words based on the intermediate vectors of the words.
  • the generation module 1140 includes an extraction module 41 and a fusion module 42:
  • the extraction module 41 is used to extract the video feature vector of the video; extract the text feature vector of the video associated text; and extract the word feature vector of the word;
  • the fusion module 42 is configured to fuse the video feature vector, the text feature vector and the word feature vector to obtain an intermediate vector of the word.
  • the fusion module 42 includes a first fusion sub-module 421 and a second fusion sub-module 422 .
  • the first fusion submodule 421 is configured to perform a first fusion of the video feature vector, the text feature vector and the word feature vector to obtain a first fusion vector;
  • the second fusion sub-module 422 is configured to perform a second fusion of the first fusion vector and the word feature vector to obtain an intermediate vector of the word.
  • the first fusion sub-module 421 includes a first splicing module 211 and a first mapping module 212 .
  • the first splicing module 211 is used to sequentially splice the video feature vector, the text feature vector and the word feature vector to obtain a first spliced vector;
  • the first mapping module 212 is configured to perform fully connected feature mapping on the first spliced vector to obtain a first fusion vector.
  • the second fusion sub-module 422 includes a second splicing module 221 and a second mapping module 222 .
  • the second splicing module 221 is configured to sequentially splice the first fusion vector and the word feature vector to obtain a second spliced vector;
  • the second mapping module 222 is configured to perform fully-connected feature mapping on the second spliced vector to obtain the intermediate vector of the word.
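  • a minimal sketch of the two splicing-and-mapping steps performed by modules 211/212 and 221/222 is given below. The use of PyTorch, the ReLU activation and the hidden size are assumptions made for illustration; the description itself only specifies concatenation followed by fully connected feature mapping, with example dimensions of 318 (video), 50 (text) and 10 (word) and a 388-dimensional intermediate vector.

    import torch
    import torch.nn as nn

    class TwoStageFusion(nn.Module):
        # dimensions follow the example in the description; all other choices are illustrative
        def __init__(self, video_dim=318, text_dim=50, word_dim=10, hidden=388):
            super().__init__()
            self.fc1 = nn.Linear(video_dim + text_dim + word_dim, hidden)  # first fully connected mapping
            self.fc2 = nn.Linear(hidden + word_dim, hidden)                # second fully connected mapping
            self.act = nn.ReLU()

        def forward(self, video_vec, text_vec, word_vec):
            # first fusion: splice the three modality vectors, then map
            first_spliced = torch.cat([video_vec, text_vec, word_vec], dim=-1)
            first_fusion = self.act(self.fc1(first_spliced))
            # second fusion: splice the first fusion vector with the word vector again
            second_spliced = torch.cat([first_fusion, word_vec], dim=-1)
            return self.act(self.fc2(second_spliced))  # the intermediate vector of the word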
  • the generation module 1140 further includes a conversion module 43 .
  • the conversion module 43 is used to perform dimension transformation on the intermediate vector to obtain a one-dimensional vector
  • the conversion module 43 is further configured to normalize the one-dimensional vector to obtain the word weight of the word.
  • the conversion module 43 is configured to convert the one-dimensional vector through a threshold function to obtain the word weight of the word.
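  • continuing the sketch above, the conversion module's dimension transformation and normalization can be read as a single linear layer followed by a sigmoid threshold function; this is an illustrative interpretation, not the patent's exact implementation.

    import torch
    import torch.nn as nn

    class WeightHead(nn.Module):
        def __init__(self, hidden=388):
            super().__init__()
            self.to_scalar = nn.Linear(hidden, 1)  # dimension transformation to a one-dimensional vector

        def forward(self, intermediate):
            # the sigmoid threshold function maps the value into [0, 1], giving the word weight
            return torch.sigmoid(self.to_scalar(intermediate))

  • because every word in the same video-related text is mapped onto the same [0, 1] scale, the resulting weights of different words are directly comparable.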
  • the extraction module includes a video extraction module 411 , a text extraction module 412 and a word extraction module 413 .
  • the video extraction module 411 includes a frame segmentation module 111 , an extraction sub-module 112 and a calculation module 113 .
  • the framing module 111 is configured to perform a framing operation on the video to obtain at least two video frames;
  • the extraction sub-module 112 is configured to extract video frame vectors of at least two video frames
  • the calculation module 113 is configured to calculate the average vector of the video frame vectors of the at least two video frames and determine the average vector as the video feature vector; or, to calculate the weighted vector of the video frame vectors of the at least two video frames and determine the weighted vector as the video feature vector.
  • the computing module 113 is used for:
  • the target object included in each video frame is determined by the target detection model.
  • the target object is classified by the classification model, and the target object classification corresponding to each video frame is obtained.
  • the similarity between the target object classification corresponding to each video frame and the word is calculated.
  • the weight of the video frame vector of each video frame is determined according to the similarity corresponding to each video frame, and the weight is positively correlated with the similarity.
  • a weighted vector of the video frame vectors of the at least two video frames is calculated according to the video frame vectors and their respective weights, and the weighted vector is determined as the video feature vector.
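  • a rough sketch of this similarity-driven weighting is shown below. How the target object classification and the word are embedded, the use of cosine similarity, and the softmax normalization of the weights are assumptions introduced here; the description only requires that each frame's weight be positively correlated with its similarity to the word.

    import numpy as np

    def weighted_video_feature(frame_vecs, class_vecs, word_vec):
        # frame_vecs: one embedding per sampled video frame (e.g. from a residual neural network)
        # class_vecs: an embedding of the detected target object classification for each frame
        # word_vec:   an embedding of the current word from the video-related text
        sims = np.array([
            float(np.dot(c, word_vec) / (np.linalg.norm(c) * np.linalg.norm(word_vec)))
            for c in class_vecs
        ])
        # softmax keeps the weights positive and positively correlated with the similarities
        weights = np.exp(sims) / np.sum(np.exp(sims))
        return np.sum(weights[:, None] * np.asarray(frame_vecs), axis=0)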
  • the extraction sub-module 112 is further configured to invoke the residual neural network to extract video frame vectors of at least two video frames in the video.
  • the text extraction module 412 is configured to call a bidirectional encoding conversion (BERT) network to extract the text feature vector of the video-related text, or to call a long short-term memory (LSTM) network to extract the text feature vector of the video-related text.
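  • a hedged sketch of extracting the text feature vector with a pretrained BERT model follows. The Hugging Face transformers library, the bert-base-chinese checkpoint, and the use of the [CLS] token embedding are assumptions; the patent does not name a concrete BERT or LSTM implementation.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")

    def text_feature_vector(video_related_text: str) -> torch.Tensor:
        inputs = tokenizer(video_related_text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = bert(**inputs)
        # take the [CLS] token embedding as the sentence-level text feature vector
        return outputs.last_hidden_state[:, 0, :].squeeze(0)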
  • the word extraction module 413 includes a word segmentation module 131 and a word extraction sub-module 132 .
  • the word segmentation module 131 is configured to perform word segmentation on the video-related text to obtain words
  • the word extraction sub-module 132 is configured to invoke a deep neural network to extract word feature vectors of words.
  • the word segmentation module 131 is further configured to call a Chinese word segmentation tool to segment the video-related text to obtain words.
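  • the description elsewhere names jieba as one such Chinese word segmentation tool, with precise, full and search-engine modes; a minimal sketch of segmenting a title with it is shown below (the choice of tool and mode is an implementation detail, not a requirement of the method).

    import jieba

    title = "双击这个视频，你会发现烤猪肉比烤鱼肉的做法更简单"
    words = jieba.lcut(title)                    # precise mode
    search_words = jieba.lcut_for_search(title)  # search-engine mode
    print(words)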
  • the extraction module 41 is used for:
  • the video frame vector is extracted based on the video frames in the video.
  • the audio frame vector is extracted based on the audio frames in the video.
  • the text subtitle vector is extracted based on the text in the video frames. At least two of the video frame vector, the audio frame vector and the text subtitle vector are fused to obtain the video feature vector.
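  • the description gives a weighted combination as one way to perform this fusion, using 0.5, 0.1 and 0.4 as example weights for the video frame vector, audio frame vector and text subtitle vector respectively; a one-function sketch (the weights are only the document's example values):

    import numpy as np

    def fuse_video_feature(frame_vec, audio_vec, subtitle_vec, w=(0.5, 0.1, 0.4)):
        # weighted fusion of the three per-modality vectors into a single video feature vector
        return (w[0] * np.asarray(frame_vec)
                + w[1] * np.asarray(audio_vec)
                + w[2] * np.asarray(subtitle_vec))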
  • in summary, the apparatus obtains the video feature vector, the text feature vector and the word feature vector by extracting features from the video, the video-related text and the words; the feature vectors of these three modalities are then spliced and mapped through a fully connected layer to obtain the first fusion vector, after which the first fusion vector and the word feature vector are spliced and mapped through a fully connected layer to obtain the intermediate vector, and the word weight of the word is obtained based on the intermediate vector.
  • the above apparatus splices the word feature vector twice, which increases the proportion of information contributed by the current word feature vector to the intermediate vector and helps improve the discrimination between the weight values of different words in the video-related text.
  • in the video search process, the above apparatus uses the above word weight generation method to pre-extract the weight values of the words; it not only considers features of the text dimension but also introduces and fuses features of the video dimension, and generates the word weights based on these multi-dimensional features, which helps improve the accuracy and reliability of the output word weights and improves the distinction between key words and confusing words in the video-related text.
  • the above apparatus uses a residual neural network to perform feature extraction on the video, a bidirectional encoding conversion (BERT) network or a long short-term memory network to perform feature extraction on the video-related text, and a deep neural network to perform feature extraction on the words, thereby converting natural language into feature vectors on which mathematical operations can be performed and simplifying the mathematical operations of the word weight generation apparatus of the present application.
  • the embodiments of the present application also provide a computer device, the computer device including a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the word weight generation method provided by the above method embodiments.
  • FIG. 12 shows a structural block diagram of a computer device 1200 provided by an exemplary embodiment of the present application.
  • the computer device 1200 may be a portable mobile terminal, such as a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop or a desktop computer.
  • Computer device 1200 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the computer device 1200 may also be a server.
  • computer device 1200 includes: processor 1201 and memory 1202 .
  • the processor 1201 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1201 may be implemented in at least one hardware form among a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array) and a PLA (Programmable Logic Array).
  • the processor 1201 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 1201 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed.
  • the processor 1201 may further include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • the memory 1202 may include one or more computer-readable storage media, which may be non-transitory. The memory 1202 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1202 is used to store at least one instruction, and the at least one instruction is executed by the processor 1201 to implement the word weight generation method provided by the method embodiments in this application.
  • the memory further stores one or more programs, and the one or more programs include instructions for performing the word weight generation method provided by the embodiments of the present application.
  • the present application further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set or an instruction set is stored in the storage medium and is loaded and executed by a processor to implement the word weight generation method provided by the above method embodiments.
  • the present application provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the word weight generation method provided by the above method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

A word weight generation method, apparatus, device and medium, belonging to the field of information processing. The method includes: acquiring a video and video-related text, the video-related text including at least one word (201); performing multi-modal feature fusion on the features of the three types of information, namely the video, the video-related text and the word, to generate an intermediate vector of the word (202); and generating a word weight of the word based on the intermediate vector of the word (203).

Description

词权重的生成方法、装置、设备及介质
本申请要求于2021年3月9日提交的申请号为202110258046.3、发明名称为“词权重的生成方法、装置、设备及介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及信息处理领域,特别涉及一种词权重的生成方法、装置、设备及介质。
背景技术
视频标题是用于描述视频的视频内容的文字信息。在视频搜索***中需要基于对视频内容的语义的理解,对视频标题中的各个词语的权重值进行预先提取,以便于后续的视频搜索过程。例如视频标题中某个词语的权重值越高,则该词语与视频内容的关联度越高,因此搜索时该词语的重要程度也会越高。
相关技术中,对词权重的生成方法主要是对视频标题的句子和视频标题中的各词语进行分别编码,得到句子向量和词语向量。对编码完成后的句子向量和词语向量进行特征融合,得到融合向量,对融合向量进行二分类判断,从而判断当前词语是否为核心词,进而输出当前词语的词权重。
上述方法生成的词权重在某些情况下是不准确的。比如,视频标题是“双击这个视频,你会发现烤猪肉比烤鱼肉的做法更简单”,上述方法难以对“烤猪肉”和“烤鱼肉”的权重做出有力区分。
发明内容
本申请提供了一种词权重生成方法、装置、设备及介质,通过融入视频的画面特征信息,能够提高词权重值的准确率和可靠程度。所述技术方案如下:
根据本申请的一个方面,提供了一种词权重生成方法,由计算机设备执行,所述方法包括:
获取视频和视频关联文本,视频关联文本包括至少一个词语,视频关联文本是与视频的内容存在关联关系的文本信息;
对视频、视频关联文本和词语三种信息的特征进行多模态特征融合,生成词语的中间向量;
基于词语的中间向量,生成词语的词权重。
根据本申请的一个方面,提供了一种词权重的生成装置,所述装置包括:
获取模块,用于获取视频和视频关联文本,视频关联文本包括至少一个词语,视频关联文本是与视频的内容存在关联关系的文本信息;
生成模块,用于对视频、视频关联文本和词语三种信息的特征进行多模态特征融合,生成词语的中间向量;
生成模块,还用于基于词语的中间向量,生成词语的词权重。
在一个可选的实施例中,生成模块包括提取模块和融合模块。
在一个可选的实施例中,提取模块,用于提取视频的视频特征向量;提取视频关联文本的文本特征向量;以及提取词语的词语特征向量;
在一个可选的实施例中,融合模块,用于将视频特征向量、文本特征向量和词语特征向量进行融合,得到词语的中间向量。
在一个可选的实施例中,融合模块包括第一融合子模块和第二融合子模块。
在一个可选的实施例中,第一融合子模块,用于将视频特征向量、文本特征向量和词语特征向量进行第一融合,得到第一融合向量;
在一个可选的实施例中,第二融合子模块,用于将第一融合向量和词语特征向量进行第二融合,得到词语的中间向量。
在一个可选的实施例中,第一融合子模块包括第一拼接模块和第一映射模块。
在一个可选的实施例中,第一拼接模块,用于将视频特征向量、文本特征向量和词语特征向量进行依次拼接,得到第一拼接向量;
在一个可选的实施例中,第一映射模块,用于将第一拼接向量进行全连接特征映射,得到第一融合向量。
在一个可选的实施例中,第二融合子模块包括第二拼接模块和第二映射模块。
在一个可选的实施例中,第二拼接模块,用于将第一融合向量和词语特征向量进行依次拼接,得到第二拼接向量;
在一个可选的实施例中,第二映射模块,用于将第二拼接向量进行全连接特征映射,得到词语的中间向量。
在一个可选的实施例中,生成模块还包括转换模块。
在一个可选的实施例中,转换模块,用于将中间向量进行维度变换,得到一维向量;
在一个可选的实施例中,转换模块,还用于将一维向量进行归一化处理,得到词语的词权重。
在一个可选的实施例中,转换模块,用于将一维向量通过阈值函数进行转换,得到词语的词权重。
在一个可选的实施例中,提取模块包括视频提取模块、文本提取模块和词语提取模块。其中,视频提取模块包括分帧模块、提取子模块和计算模块。
在一个可选的实施例中,分帧模块用于对视频进行分帧操作,得到至少两个视频帧;
在一个可选的实施例中,提取子模块用于提取至少两个视频帧的视频帧向量;
在一个可选的实施例中,计算模块用于计算至少两个视频帧的视频帧向量的平均向量,将平均向量确定为视频特征向量;或,计算至少两个视频帧的视频帧向量的加权向量,将加权向量确定为视频特征向量。
在一个可选的实施例中,计算模块,用于:
通过目标检测模型确定每个视频帧包括的目标对象;
通过分类模型对目标对象进行分类,得到每个视频帧对应的目标对象分类;
计算每个视频帧对应的目标对象分类与词语的相似度;
根据每个视频帧对应的相似度确定每个视频帧的视频帧向量的权重,权重与相似度正相关;
根据至少两个视频帧的视频帧向量,以及至少两个视频帧的视频帧向量各自的权重,计算至少两个视频帧的视频帧向量的加权向量,将加权向量确定为视频特征向量。
在一个可选的实施例中,提取子模块还用于调用残差神经网络提取视频中的至少两个视频帧的视频帧向量。
在一个可选的实施例中,文本提取模块,用于调用双向编码转换网络提取视频关联文本的文本特征向量,或,调用长短期记忆网络提取视频关联文本的文本特征向量。
在一个可选的实施例中,词语提取模块包括分词模块和词语提取子模块。
在一个可选的实施例中,分词模块,用于对视频关联文本进行分词,得到词语;
在一个可选的实施例中,词语提取子模块,用于调用深度神经网络提取词语的词语特征向量。
在一个可选的实施例中,分词模块还用于调用中文分词工具对视频关联文本进行分词,得到词语。
在一个可选的实施例中,提取模块,用于:
基于视频中的视频帧,提取得到视频帧向量;
基于视频中的音频帧,提取得到音频帧向量;
基于视频帧中的文本,提取得到文本幕向量;
将视频帧向量、音频帧向量和文本幕向量中的至少两种进行融合,得到视频特征向量。
根据本申请的一个方面,提供了一种计算机设备,所述计算机设备包括:处理器和存储器,所述存储器存储有计算机程序,所述计算机程序由所述处理器加载并执行以实现如上所述的词权重生成方法。
根据本申请的另一方面,提供了一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序由处理器加载并执行以实现如上所述的词权重生成方法。
根据本申请的另一个方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述词权重生成方法。
本申请实施例提供的技术方案带来的有益效果至少包括:
通过将视频、视频关联文本和词语进行多维度的特征结合,生成中间向量,基于中间向量,生成词语的词权重。在视频搜索过程中,采用上述词权重生成方法来预先提取词语的权重值,不仅考虑了文本维度的特征,还引入融合了视频维度的特征,基于多维度的特征来进行词权重生成,有利于提升输出的词权重的准确率和可靠程度,提高了视频关联文本中对关键词语和混淆词语之间的区分度。
附图说明
图1是根据一示例性实施例提供的一种词权重生成***的示意图;
图2是本申请一个示例性实施例提供的词权重的生成方法的流程图;
图3是本申请一个示例性实施例提供的词权重生成模型的示意图;
图4是本申请另一个示例性实施例提供的词权重生成模型的示意图;
图5是本申请一个示例性实施例提供的词权重生成方法的流程图;
图6是本申请一个示例性实施例提供的生成视频特征向量的流程图;
图7是本申请一个示例性实施例提供的生成文本特征向量的流程图;
图8是本申请一个示例性实施例提供的生成词语特征向量的流程图;
图9是本申请一个示例性实施例提供的词权重生成模型的训练方法的流程图;
图10是本申请一个示例性实施例提供的词权重生成方法流程图;
图11是本申请的一个示例性实施例提供的词权重生成装置的结构框图;
图12示出了本申请一个示例性实施例提供的计算机设备的结构框图。
具体实施方式
首先,对本申请实施例中涉及的名词进行简单介绍:
词语重要度:指词语对句子表达的意思起到的作用大小。常见的,一个完整句子的组成部分包括主语、谓语、宾语、定语、状语和补语,示例性的,句子为“双击这个视频,你会发现烤猪肉比烤鱼肉的做法更简单”,去掉连接词和人称代词,该句子主要由词语“双击”“视频”“发现”“烤猪肉”“烤鱼肉”“做法”“简单”组成。基于对句子整体的意思理解,容易得到“烤猪肉”、“烤鱼肉”对句子的意思表达起到关键作用。更为具体的,“烤猪肉”比“烤鱼肉”对句子意思表达起的作用更进一步,即,“烤猪肉”词语的重要度比“烤鱼肉”词语的重要度要高。
在一个实施例中,采用词语在句子中的权重值来表示词语的重要度。示意性的,上述句子中,“烤猪肉”的权重值为0.91,“烤鱼肉”的权重值为0.82,即通过权重值的大小比较,可得“烤猪肉”比“烤鱼肉”更重要。
残差神经网络(Residual Network,ResNet):一种基于深度学习的特征提取神经网络。在传统的深度学习中,在神经网络可以收敛的前提下,随着网络深度增加,网络的表现先是逐渐增加至饱和,然后迅速下降,即为网络退化问题;在传统的深度学习中,存在梯度弥散 问题;残差神经网络为深度学习神经网络添加一个恒等映射,解决了上述网络退化问题和梯度弥散问题。
在本申请中,残差神经网络用于将图像转换为可进行运算的数学语言,示例性的,残差神经网络将视频帧转换为视频帧向量,该视频帧向量包含了反映视频帧的内容的信息,即可用该视频帧向量替换上述视频帧。
双向编码转换模型(Bidirectional Encoder Representations from Transformers,BERT):一种句子转换模型,可实现将真实世界抽象存在的文字转换成能够进行数学公式操作的向量。在一个实施例中,BERT将输入的文本转换为文本向量,该文本向量包含反映文本的内容的信息,即可用该文本向量替换上述文本。
深度神经网络(Deep Neural Networks,DNN):含有全连接的神经元结构的多层神经网络,实现把真实世界存在的客观事物转换为可以进行数学公式操作的向量。在一个实施例中,DNN将输入的词语转换为词向量,该词向量包含反映词语的内容的信息,即可用该词向量替换上述词语。
阈值函数:实现数值区间的转换,例如,数字x所处区间为[0,100],通过阈值函数,将数字x转换为区间[0,1]的数字y。通过S型(sigmoid)函数(一种阈值函数),能够实现将一维向量映射为区间[0,1]上的数字,在本申请中,通过将一维向量映射到区间[0,1]上,得到词权重。
本申请实施例的方案包括模型训练阶段和词权重预测阶段。图1是根据一示例性实施例示出的一种词权重生成***的示意图。如图1所示,在模型训练阶段,模型训练设备110通过预先设置好的训练样本集训练出准确性较高的词权重生成模型,在词权重预测阶段,词权重生成设备120根据训练出的词权重生成模型以及输入的视频和文本,预测文本中词语的权重值。
其中,上述模型训练设备110和词权重预测设备120可以是具有机器学习能力的计算机设备,比如,该计算机设备可以是终端或服务器。
可选的,上述模型训练设备110和词权重预测设备120可以是同一个计算机设备,或者,模型训练设备110和词权重预测设备120也可以是不同的计算机设备。并且,当模型训练设备110和词权重预测设备120是不同的设备时,模型训练设备110和词权重预测设备120可以是同一类型的设备,比如模型训练设备110和词权重预测设备120可以都是服务器;或者,模型训练设备110和词权重预测设备120也可以是不同类型的设备。上述服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式***,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content Delivery Network,CDN)、以及大数据和人工智能平台等基础云计算服务的云服务器。上述终端可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表等,但并不局限于此。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请在此不做限制。
针对词权重的预测阶段进行介绍:
为提升生成的词权重值的准确率和可靠程度,采用图2所述的方法生成词语的词权重。图2示出了本申请一个示例性实施例提供的词权重的生成方法流程图。该方法由计算机设备执行。如图2所示,该方法包括:
步骤201:获取视频和视频关联文本,视频关联文本包括至少一个词语。
视频和视频关联文本之间存在对应关系,具体的,视频关联文本是与视频的内容存在关联关系的文本信息。例如,视频关联文本为视频的标题或视频的视频简介。可选地,在需要比较不同词语之间的词权重的情况下,该视频关联文本包括至少两个词语。
在一个实施例中,视频关联文本是与视频相对应的标题,视频关联文本与视频彼此独立,其中,该标题由人为标注或机器生成,用于简要阐述该视频的中心含义。示例地,视频的内容为介绍烤肉的做法,标题为“教你学会做烤肉”。
在一个实施例中,视频关联文本是与视频相对应的视频简介,视频关联文本与视频彼此独立,其中,该视频简介由人为撰写或机器生成,用于简要阐述该视频的具体内容。示例地,视频的内容为介绍烤肉的做法,视频介绍为“本视频介绍了烤肉的做法,分为预处理、腌制、和烧烤三个步骤”。
可选地,计算机设备通过本地数据库或内容服务器获取视频和视频关联文本。该内容服务器用于存储大量的视频以及视频对应的视频关联文本,并推送至用户侧进行展示。例如该内容服务器为视频点播应用、短视频应用、歌曲应用的后台服务器。该计算机设备与内容服务器为相同或不同的设备。
步骤202:对视频、视频关联文本和词语三种信息的特征进行多模态特征融合,生成词语的中间向量。
其中,多模态特征融合指计算机设备对视频、视频关联文本、词语分别进行特征提取,得到视频特征向量、文本特征向量和词语特征向量,之后对视频特征向量、文本特征向量和词语特征向量进行向量融合运算。由于视频、视频关联文本以及词语是不同模态的信息,因此对视频、视频关联文本以及词语的特征进行特征融合可称为多模态特征融合。
可选地,上述多模态特征融合包括以下两个步骤:
第一,提取视频的视频特征向量;提取视频关联文本的文本特征向量;以及提取词语的词语特征向量;
其中,视频特征向量指示视频特征信息、文本特征向量指示视频关联文本的特征信息、词语特征向量指示词语特征信息。视频特征向量用于反映视频的内容的特征,文本特征向量用于反映文本的语义的特征,词语特征向量用于反映词语的语义的特征。
第二,将视频特征向量、文本特征向量和词语特征向量进行融合,得到词语的中间向量。
其中,通过特征融合得到的中间向量,同时包含视频、视频关联文本和词语的特征信息。
步骤203:基于词语的中间向量,生成词语的词权重。
可选地,基于词语的中间向量生成词语的词权重包括以下两个步骤:
第一、将中间向量进行维度变换,得到一维向量;
计算机设备基于对视频、视频关联文本和词语的特征进行多模态特征融合,能够生成词语的中间向量。其中,中间向量为包含视频、视频关联文本和词语的特征信息的多维度向量;在一个实施例中,通过全连接映射实现将中间向量变换为一维向量。
第二、将一维向量进行归一化处理,得到词语的词权重。
在一个实施例中,计算机设备通过上述全连接映射对中间向量进行维度变换,如中间向量维度为388维,进行维度变换得到1维向量。其中,1维向量包含了词向量在句子向量的重要度信息。通过对1维向量进行归一化处理,能够实现将1维向量转化为区间[0,1]上的一个数值,该数值即为词语的词权重。在一个实施例中,计算机设备通过阈值函数能够实现对一维向量的归一化处理,例如通过sigmoid函数对1维向量进行数值区间的转换,可实现将一维向量映射到区间[0,1]上,得到词权重。可选地,通过线性函数,也能够实现对1维向量的归一化处理,例如最小最大缩放(Min-Max scaling)函数。本申请对实现归一化处理的方式不作限制。
综上所述,通过将视频、视频关联文本和词语的特征结合,生成中间向量,基于中间向量,生成词语的词权重。在视频搜索过程中,采用上述词权重生成方法来预先提取词语的权重值,不仅考虑了文本维度的特征,还引入融合了视频维度的特征,基于多维度的特征来进行词权重生成,有利于提升输出的词权重的准确率和可靠程度,提高了视频关联文本中对关键词语和混淆词语之间的区分度。
图3示出了本申请一个示例性实施例提供的词权重生成模型的示意图,图3中词权重生成模型300包括:分词网络310、转换网络320、融合网络330和映射网络340。其中,分词网络310用于将视频关联文本进行分词得到至少一个词语;转换网络320用于将视频转换为视频特征向量、将视频关联文本转化为文本特征向量、将词语转换为词语特征向量;融合网络330用于将视频特征向量、文本特征向量和词语特征向量融合得到中间向量;映射网络340用于将中间向量映射为中间向量对应的词语的词权重。
图4示出了本申请另一个示例性实施例提供的词权重生成模型的示意图。词权重生成模型包括分词网络310、转换网络320、融合网络330、映射网络340。转换网络320包括第一转换子网络321、第二转换子网322和第三转换子网络323。融合网络330包括第一融合子网络331和第二融合子网络332。
图5示出了本申请一个示例性实施例的词权重生成方法流程图。该方法由计算机设备执行。结合参考图4的词权重生成模型,该词权重生成方法包括:
步骤510:调用第一转换子网络321对视频进行处理,输出视频的视频特征向量;
示例性的,第一转换子网络321用于对视频进行分帧操作,计算机设备通过调用第一转换子网络321对视频进行处理,能够得到视频的至少两个视频帧,然后提取至少两个视频帧的视频帧向量,再计算至少两个视频帧的视频帧向量的平均向量,将平均向量确定为视频特征向量;或,计算至少两个视频帧的视频帧向量的加权向量,将加权向量确定为视频特征向量。
可选地,计算机设备通过目标检测模型确定每个视频帧包括的目标对象,并通过分类模型对目标对象进行分类,得到每个视频帧对应的目标对象分类。该目标检测模型用于检测视频帧包括的目标对象,例如人物、动物、植物、以及不同类型的物体。该分类模型用于对检测到的目标对象进行分类,从而得到目标对象分类。例如检测的目标对象为视频帧中的动物所在的区域,将该区域输入分类模型可得到该目标对象为猫。该目标检测模型以及分类模型基于卷积神经网络(Convolutional Neural Network,CNN)实现。可选地,该目标检测模型不仅能够实现检测视频帧中的目标对象,还能够实现对目标对象进行分类。在该情况下计算机设备通过该目标检测模型即可直接得到目标对象分类。
之后,计算机设备计算每个视频帧对应的目标对象分类与词语的相似度,并根据每个视频帧对应的相似度确定每个视频帧的视频帧向量的权重,该权重与相似度正相关。即目标对象分类与词语的相似越高,该目标对象分类对应的视频帧的视频帧向量的权重越高。在确定每个视频帧的视频帧向量的权重后,计算机设备根据至少两个视频帧的视频帧向量,以及至少两个视频帧的视频帧向量各自的权重,计算至少两个视频帧的视频帧向量的加权向量,并将加权向量确定为视频特征向量。通过上述方式确定视频帧向量的权重,能够实现提升与词语关联的视频帧向量的权重,从而使与词语关联的视频帧在确定视频特征向量的过程中起到更大的作用。能够实现使确定的视频特征向量,更突出视频的特征中与词语存在强关联的特征,进而能够提升确定视频特征向量的准确度。可选地,上述权重还能够是人工设置的。
可选的,上述分帧操作至少包括以下两种处理方式:
第一、根据固定时间间隔提取视频帧;
示意性的,假设视频时长为30s,预设采样时长间隔为0.2s,则计算机设备对视频进行分帧操作指每隔0.2s采集视频帧。
第二、根据预设的采集规则提取视频帧。
在一个实施例中,假设视频时长为30s,预先设定在视频时长的前20%时长内,每隔1s采集视频帧,在视频时长的中间60%时长内,每隔0.2s采集视频帧,在视频时长的后20%时长内,每隔1s采集视频帧。
可选的,上述计算机设备提取至少两个视频帧的视频帧向量包括:调用残差神经网络提取视频中的至少两个视频帧的视频帧向量。
示意性的,如图6所示,计算机设备对视频601分帧,得到四个视频帧602。将四个视频帧602输入至ResNet603,分别得到第一帧向量、第二帧向量、第三帧向量和第四帧向量。将上述四个帧向量取平均或加权得到视频帧向量。在一个实施例中,上述计算至少两个视频帧的视频帧向量的平均向量,指对第一帧向量、第二帧向量、第三帧向量和第四帧向量进行累加之后求平均值。在一个实施例中,上述计算至少两个视频帧的视频帧向量的加权向量,指对第一帧向量、第二帧向量、第三帧向量和第四帧向量进行加权求和。例如,第一帧向量为a、第二帧向量为b、第三帧向量为c和第四帧向量为d,假设对第一帧向量赋予权重0.3,第二帧向量赋予权重0.1,第三帧向量赋予权重0.2和第四帧向量赋予权重0.4,则得到的视频特征向量为0.3a+0.1b+0.2c+0.4d。
步骤520:调用第二转换子网络322对视频关联文本进行处理,输出视频关联文本的文本特征向量;
在一个实施例中,第二转换子网络322包括双向编码转换(Bidirectional Encoder Representation from Transformers,BERT)网络和/或长短期记忆(Long Short-Term Memory,LSTM)网络。计算机设备调用双向编码转换网络提取视频关联文本的文本特征向量,或,调用长短期记忆网络提取所述视频关联文本的文本特征向量。可选地,计算机设备还能够通过分别调用双向编码转换网络和长短期记忆网络,提取视频关联文本的文本特征向量。之后计算两个网络提取到的文本特征向量的平均值或加权平均值,从而得到最终确定的文本特征向量。
可选的,如图7所示,视频关联文本701输入至Bert网络702,得到文本特征向量。
步骤530:调用分词网络310对视频关联文本进行分词,得到词语;
在一个可选的实施例中,分词网络内设jieba(一种第三方中文分词库),jieba内支持三种分词模式,第一、精确模式:将语句进行最精确的切分,不存在冗余数据,适合做文本分析;第二、全模式:将语句中所有可能是词的词语都切分出来,速度很快,但是存在冗余数据;第三、搜索引擎模式:在精确模式的基础上,对长词再次进行切分。在实际使用场景中,根据视频关联文本的类型、长短等对模式进行选择,最终实现将视频关联文本转换为至少一个词语。计算机设备通过调用上述分词网络,能够实现对视频关联文本进行分词。
在一个实施例中,视频关联文本为“这鲁班没救了,经济被压制,完全起不来,手机给你来玩!”,其中,在精确模式下分词得到的词语包括“这”“鲁班”“没救了”“经济”“被”“压制”“完全”“起不来”“手机”“给”“你来玩”“!”;在全模式下分词得到的词语包括“鲁班”“没救”“经济”“压制”“完全”“手机”;在搜索引擎模式下分词得到的词语包括“这”“鲁班”“没”“救”“了”“经济”“被”“压制”“完全”“起”“不”来”“手机”“给”“你”“来”“玩”“!”。
步骤540:调用第三转换子网络323对词语进行处理,输出词语的词语特征向量;
在一个实施例中,第三转换子网络323包括深度神经网络。计算机设备调用深度神经网络提取词语的词语特征向量。示意性的,如图8所示,计算机设备将词语输入DNN801,得到词语向量。
步骤550:调用第一融合子网络331将视频特征向量、文本特征向量和词语特征向量进行第一融合,得到第一融合向量;
在一个实施例中,计算机设备调用第一融合子网络将视频特征向量、文本特征向量和词语特征向量进行依次拼接,得到第一拼接向量。之后将第一拼接向量进行全连接特征映射,得到第一融合向量。
上述拼接指对所有向量进行维度拼接,如原本视频帧向量维度为318维、文本向量为50维、词向量为10维,则得到的第一拼接向量维度为378维。在一个实施例中,上述全连接特征映射指对得到的第一拼接向量进行映射,得到第一融合向量。示意性的,第一拼接向量为 [a,b,c],其中a,b,c分别指示视频信息、视频关联文本信息和词语信息,通过全连接层映射得到第一融合向量[0.9a,3b,10c],其中,0.9a、3b、10c、分别指示视频信息、视频关联文本信息和词语信息,即全连接特征映射改变了视频信息、视频关联文本信息和词语信息之间的融合程度。上述示例仅起到解释说明作用,实际全连接特征映射实现在高维空间,且融合的程度随着输入的视频、视频关联文本和词语的改变随之发生变化。
步骤560:调用第二融合子网络332将第一融合向量和词语特征向量进行第二融合,得到词语的中间向量;
在一个实施例中,计算机设备调用第二融合子网络332将第一融合向量和词语特征向量进行依次拼接,得到第二拼接向量。之后将第二拼接向量进行全连接特征映射,得到词语的中间向量。拼接和全连接特征映射同第一融合子网络相类似,不再赘述。通过上述第一融合子网络的拼接和第二融合子网络的拼接,强化了当前词语的重要性,提升了词语特征向量在中间向量的权重。
步骤570:调用映射网络340将中间向量进行维度变换,得到一维向量,并将一维向量进行归一化处理,得到词语的词权重。
在一个实施例中,通过上述全连接映射对中间向量进行维度变换,如中间向量维度为388维,进行维度变换得到1维向量。其中,1维向量包含了词语特征向量在文本特征向量的重要度信息。计算机设备通过将1维向量进行归一化处理,能够得到区间[0,1]上的一个值,该值即为词语的词权重。在一个实施例中,计算机设备采用sigmoid函数对1维向量进行数值区间的转换(即归一化处理),通过将一维向量映射到区间[0,1]上得到词权重。
综上所述,本实施例提供的方法,通过对视频、视频关联文本和词语进行特征提取得到视频特征向量、文本特征向量和词语特征向量,再将上述三种模态的特征向量进行拼接和全连接映射,得到第一融合向量,然后将第一融合向量和词语特征向量进行拼接和全连接映射,得到中间向量,基于中间向量,得到词语的词权重。
本实施例提供的方法,还通过对词语特征向量进行了两次拼接,强化了当前词语特征向量在中间向量的信息量占比,有利于提高视频关联文本中不同词语的权重值区分度。
本实施例提供的方法,还对视频特征向量、文本特征向量和词语特征向量进行融合,不仅考虑了文本维度的特征,还引入融合了视频维度的特征,基于多维度的特征来进行词权重生成,有利于提升输出的词权重的准确率和可靠程度,提高了视频关联文本中对关键词语和混淆词语之间的区分度。
本实施例提供的方法,还通过采用残差神经网络对视频进行特征提取、采用双向编码转换网络或长短期记忆网络对视频关联文本进行特征提取和采用深度神经网络对词语进行特征提取,实现了将自然语言转换为可进行数学运算的特征向量,简化了本申请词权重生成方法的数学运算。
针对模型训练阶段进行介绍:
上述词权重生成模型是采用训练方法训练得到的。图9是本申请的一个示例性实施例提供的词权重生成模型的训练方法的流程图。该方法由计算机设备执行。该方法包括:
步骤901:获取样本视频、样本视频关联文本和样本词权重。
样本视频和样本视频关联文本之间存在对应关系。样本视频关联文本是与样本视频的内容存在关联关系的文本信息。可选地,该样本视频关联文本包括至少一个词语。该样本词权重是人工对样本视频关联文本中的词语进行重要程度标定得到的。
步骤902:将样本视频、样本视频关联文本输入词权重生成模型。
步骤903:获取词权重生成模型输出的预测词权重。
预测词权重指的是计算机设备通过将样本视频和样本视频关联文本输入词权重生成模型,从而得到的词权重生成模型输出的词权重。
步骤904:计算样本词权重和预测词权重的误差。
步骤905:根据误差,优化词权重生成模型的网络参数。
词权重生成模型的网络参数用于调整词权重生成模型的性能,在本申请中,词权重生成模型的网络参数至少包括ResNet的网络参数、BERT的网络参数、DNN的网络参数,视频特征向量、文本特征向量和词语特征向量之间的融合参数。
基于图2的可选实施例中,步骤201获取视频和视频关联文本,视频关联文本包括至少一个词语中,视频的获取方法包括:逐个获取目标视频库中的视频文件作为目标视频进行后续处理。
在一个实施例中,目标视频为视频库中已存储视频文件的一个视频片段,该目标视频的提取包括如下方式中的至少一种:
(1)基于预设时长区间对视频文件进行划分,如:提取视频文件开头前两分钟的视频片段作为视频。
(2)通过人工手动对目标视频进行提取。
也即,基于人工对视频库中已存储的视频文件进行提取,如:观看人认为视频文件中第5至第6分钟的视频片段为本视频文件的核心视频,观看人提取该核心视频作为目标视频。
(3)通过视频提取模型对目标视频进行提取。
即,将视频库中存储的视频文件输入视频提取模型,由视频提取模型对上述视频文件进行特征提取后,对上述视频文件中的帧与帧之间的关联性进行分析,从而对视频文件进行提取,得到目标视频。
在一个实施例中,目标视频为符合筛选条件的视频文件,示意性的,目标视频为指定用户上传的视频文件,或,目标视频为符合要求的视频类型的视频文件,或,视频时长达到阈值的视频文件。
针对目标视频为符合要求的视频类型的视频文件的情况,示例性的,当视频文件为电视剧中的某一集、电影视频、电影片段、纪录片视频等类型的视频时,将该视频文件作为目标视频进行获取。
针对上述目标视频为指定用户上传的视频文件,示例性的,当视频文件为某专业机构上传的视频、某公共人物上传的视频、某权威人士上传的视频时,将该视频文件作为目标视频进行获取。
基于图2的可选实施例中,提取视频的视频特征向量包括以下步骤:
第一、基于视频中的视频帧,提取得到视频帧向量;
其中,视频包括视频帧和音频帧,此处的视频帧表现为画面。其中,画面特征是指从视频的界面表现上提取得到的特征,其中,画面特征中包括与主题名称、弹幕、对白等文本内容对应的特征,也包括与视频画面帧对应的特征。
在一个可选的实施例中,计算机设备采用ResNet特征提取视频帧得到视频帧向量,即将视频帧由原本的自然语言转换为能进行数学运算的向量。
第二、基于视频中的音频帧,提取得到音频帧向量;
音频帧表现为视频中的声音,在一个实施例中,音频帧与画面帧之间达成匹配,即音画同步,即在同一时间点同时提取音频帧和画面帧;在一个实施例中,音频帧与画面帧之间不匹配,即音画异步,即提取音频帧和画面帧的时间点不一致。
在一个可选的实施例中,计算机设备采用卷积神经网络(Convolutional Neural Networks,CNN)对音频帧进行特征提取,得到音频帧向量,即将音频帧由原本的自然语言转换为能进行数学运算的向量。
第三、基于视频帧中的文本,提取得到文本幕向量;
视频帧中的文本是指与目标视频相关的,目标视频所涉及的文本内容,示意性的,视频帧中的文本包括弹幕内容、画面中出现的内容、对白内容等。
在一个可选的实施例中,计算机设备采用BERT特征提取画面上的文本,得到文本幕向量,即将画面上的文本由原本的自然语言转换为能进行数学运算的向量。
第四、将视频帧向量、音频帧向量和文本幕向量中的至少两种进行融合,得到视频特征向量。
在一个实施例中,采用加权方式实现视频帧向量、音频帧向量和文本幕向量中的至少两种进行融合,示意性的,视频帧向量为x、音频帧向量为y、文本幕向量为z,假设对视频帧向量赋予权重0.5,音频帧向量赋予权重0.1,文本幕向量赋予权重0.4,则得到的视频特征向量为0.5x+0.1y+0.4z。
图10是本申请一个示例性实施例提供的词权重生成方法的流程图。示例性的,输入句子“双击这个视频,你会发现烤猪肉比烤鱼肉的做法更简单”可表示为text=[x0,x1,…],输入词语为xi,抽取得到的视频关键帧为fi,则句子的编码向量为Vtext=BERT(text),关键帧编码向量为Vimg=ResNet(fi),词语的编码向量为Vword=DNN(xi),则第一融合向量Fusion1=fusion(Vtext,Vimg,Vword),其中fusion为多类特征向量拼接后通过全连接方式完成特征映射得到。第二次融合的输入为首次融合得到的Fusion1向量和词语向量,第二融合向量Fusion2=fusion(Fusion1,Vword),模型特征融合过程中的两次融合强化了词语的重要性,可以有效的识别该词在句子的重要程度,即词权重值。图10中用
Figure PCTCN2022078183-appb-000001
指示关键帧编码向量一个维度信息,用
Figure PCTCN2022078183-appb-000002
指示句子编码向量的一个维度信息,用
Figure PCTCN2022078183-appb-000003
指示词语编码向量的一个维度信息。第一次融合向量Fusion1和第二次融合向量Fusion2采用上述三种圆的占比关系来表示关键帧编码向量、句子编码向量和词语编码向量的融合程度。
对本申请涉及的应用场景进行介绍:
以本申请提供的方法应用于视频搜索场景为例进行说明,实现视频搜索的过程可分为3个阶段:模型训练阶段、预处理阶段以及搜索阶段。
模型训练阶段:服务器获取样本视频、样本视频关联文本和样本词权重。其中,样本视频关联文本是与样本视频的内容存在关联关系的文本信息,样本词权重是人工对样本视频关联文本中的词语进行重要程度标定得到的。可选地,服务器通过本地数据库或通过内容服务器,获取上述训练样本。在获取到训练样本后,服务器将训练样本输入词权重生成模型,得到词权重生成模型预测的视频关联文本中的词语的预测词权重。之后,服务器根据样本词权重和预测词权重之间的误差,训练词权重生成模型。上述模型训练阶段的实现过程,可实现成为由计算机设备执行的词权重生成模型的训练方法。或者,实现成为词权重生成模型的训练装置。
预处理阶段:服务器获取视频和视频关联文本。示例地,该视频关联文本包括视频的视频标题和/或视频简介。该视频关联文本包括至少一个词语,视频关联文本是与视频的内容存在关联关系的文本信息,该视频关联文本由人为标注或机器生成。可选地,服务器通过本地数据库或通过内容服务器,获取上述信息。上述视频用于推送至用户的客户端进行播放。服务器会对视频关联文本进行分词,从而得到视频关联文本中的各个词语。服务器通过上述完成训练的词权重生成模型提取视频的视频特征向量、视频关联文本的文本特征向量以及视频关联文本中的词语的词语特征向量。并对该三种特征向量进行特征融合,从而生成词语的中间向量。其中,特征融合指对视频特征向量、文本特征向量和词语特征向量进行向量融合运算。具体特征融合过程参考上述图5所示的实施例所示的细节。基于词语的中间向量,服务器能够生成词语的词权重,从而得到视频关联文本中各词语的词权重。示意性的,服务器将中间向量进行维度变换,得到一维向量,再将一维向量通过阈值函数进行转换,得到词语的词权重。
搜索阶段:当用户在终端的客户端上通过搜索词进行视频搜索时,服务器会接收到客户 端发送的视频搜索请求。该客户端与服务器有线或无线连接,服务器为客户端的后台服务器,该视频搜索请求包括至少一个搜索词。服务器将该搜索词与各视频的视频关联文本中的词语进行匹配,并根据匹配得到的相似度确定搜索词与视频是否匹配,从而得到匹配视频。在确定相似度的过程中,服务器会使用预处理阶段得到的词权重。具体的,服务器计算搜索词与视频关联文本中的各词语的相似度,并根据词语对应的词权重,计算搜索词与各词语的平均相似度作为搜索词与视频关联文本的相似度。例如,搜索词为x,视频关联文本包括词语o、p、q。x与o的相似度为0.8,x与p的相似度为0.3,x与q的相似度为0.5。o的词权重为0.5,p的词权重为0.2,q的词权重为0.3。则服务器确定的搜索词与视频关联文本的相似度为0.8*0.5+0.3*0.2+0.5*0.3=0.61。在搜索词与视频关联文本的相似度大于相似度阈值的情况下,服务器会将视频关联文本对应的视频确定为匹配视频。例如该相似度阈值为0.5,则在搜索词为x,视频关联文本包括词语o、p、q的情况下,计算机设备会将该视频关联文本对应的视频确定为匹配视频。可选地,计算机设备会根据搜索词先在视频中快速进行预检索,得到部分视频后,再通过上述方式搜索视频以提升效率。服务器在确定匹配视频后,会将匹配视频发送至客户端以进行播放。上述搜索阶段的实现过程,可实现成为由计算机设备(服务器或终端)执行的视频搜索方法。或者,实现成为视频搜索装置。
需要说明的是,上述预处理阶段在搜索阶段之前执行。或者,上述预处理阶段与搜索阶段穿插进行,例如在服务器接收到搜索请求后,执行预处理阶段,以及搜索阶段的后续步骤。上述介绍仅以本申请提供的方法应用于视频搜索场景为例进行说明,并不用于限制本申请的应用场景。
图11是本申请一个示例性实施例的词权重生成装置的结构框图。如图11所示,该装置包括:
获取模块1120,用于获取视频和视频关联文本,视频关联文本包括至少一个词语,视频关联文本是与视频的内容存在关联关系的文本信息;
生成模块1140,用于对视频、视频关联文本和词语三种信息的特征进行多模态特征融合,生成词语的中间向量;
生成模块1140,还用于基于词语的中间向量,生成词语的词权重。
在一个可选的实施例中,生成模块1140包括提取模块41和融合模块42:
在一个可选的实施例中,提取模块41,用于提取视频的视频特征向量;提取视频关联文本的文本特征向量;以及提取词语的词语特征向量;
在一个可选的实施例中,融合模块42,用于将视频特征向量、文本特征向量和词语特征向量进行融合,得到词语的中间向量。
在一个可选的实施例中,融合模块42包括第一融合子模块421和第二融合子模块422。
在一个可选的实施例中,第一融合子模块421,用于将视频特征向量、文本特征向量和词语特征向量进行第一融合,得到第一融合向量;
在一个可选的实施例中,第二融合子模块422,用于将第一融合向量和词语特征向量进行第二融合,得到词语的中间向量。
在一个可选的实施例中,第一融合子模块421包括第一拼接模块211和第一映射模块212。
在一个可选的实施例中,第一拼接模块211,用于将视频特征向量、文本特征向量和词语特征向量进行依次拼接,得到第一拼接向量;
在一个可选的实施例中,第一映射模块212,用于将第一拼接向量进行全连接特征映射,得到第一融合向量。
在一个可选的实施例中,第二融合子模块422包括第二拼接模块221和第二映射模块222。
在一个可选的实施例中,第二拼接模块221,用于将第一融合向量和词语特征向量进行依次拼接,得到第二拼接向量;
在一个可选的实施例中,第二映射模块222,用于将第二拼接向量进行全连接特征映射,得到词语的中间向量。
在一个可选的实施例中,生成模块1140还包括转换模块43。
在一个可选的实施例中,转换模块43,用于将中间向量进行维度变换,得到一维向量;
在一个可选的实施例中,转换模块43,还用于将一维向量进行归一化处理,得到词语的词权重。
在一个可选的实施例中,转换模块43,用于将一维向量通过阈值函数进行转换,得到词语的词权重。
在一个可选的实施例中,提取模块包括视频提取模块411、文本提取模块412和词语提取模块413。其中,视频提取模块411包括分帧模块111、提取子模块112和计算模块113。
在一个可选的实施例中,分帧模块111用于对视频进行分帧操作,得到至少两个视频帧;
在一个可选的实施例中,提取子模块112用于提取至少两个视频帧的视频帧向量;
在一个可选的实施例中,计算模块113用于计算至少两个视频帧的视频帧向量的平均向量,将平均向量确定为视频特征向量;或,计算至少两个视频帧的视频帧向量的加权向量,将加权向量确定为视频特征向量。
在一个可选的实施例中,计算模块113,用于:
通过目标检测模型确定每个视频帧包括的目标对象。通过分类模型对目标对象进行分类,得到每个视频帧对应的目标对象分类。计算每个视频帧对应的目标对象分类与词语的相似度。根据每个视频帧对应的相似度确定每个视频帧的视频帧向量的权重,权重与相似度正相关。根据至少两个视频帧的视频帧向量,以及至少两个视频帧的视频帧向量各自的权重,计算至少两个视频帧的视频帧向量的加权向量,将加权向量确定为视频特征向量。
在一个可选的实施例中,提取子模块112还用于调用残差神经网络提取视频中的至少两个视频帧的视频帧向量。
在一个可选的实施例中,文本提取模块412,用于调用双向编码转换网络提取视频关联文本的文本特征向量,或,调用长短期记忆网络提取视频关联文本的文本特征向量。
在一个可选的实施例中,词语提取模块413包括分词模块131和词语提取子模块132。
在一个可选的实施例中,分词模块131,用于对视频关联文本进行分词,得到词语;
在一个可选的实施例中,词语提取子模块132,用于调用深度神经网络提取词语的词语特征向量。
在一个可选的实施例中,分词模块131还用于调用中文分词工具对视频关联文本进行分词,得到词语。
在一个可选的实施例中,提取模块41,用于:
基于视频中的视频帧,提取得到视频帧向量。基于视频中的音频帧,提取得到音频帧向量。基于视频帧中的文本,提取得到文本幕向量。将视频帧向量、音频帧向量和文本幕向量中的至少两种进行融合,得到视频特征向量。
综上所述,本装置通过对视频、视频关联文本和词语进行特征提取得到视频特征向量、文本特征向量和词语特征向量,再将上述三种模态的特征向量进行拼接和全连接映射,得到第一融合向量,然后将第一融合向量和词语特征向量进行拼接和全连接映射,得到中间向量,基于中间向量,得到词语的词权重。
上述装置对词语特征向量进行了两次拼接,强化了当前词语特征向量在中间向量的信息量占比,有利于提高视频关联文本中不同词语的权重值区分度。
上述装置实现了在视频搜索过程中,采用上述词权重生成方法来预先提取词语的权重值,不仅考虑了文本维度的特征,还引入融合了视频维度的特征,基于多维度的特征来进行词权重生成,有利于提升输出的词权重的准确率和可靠程度,提高了视频关联文本中对关键词语和混淆词语之间的区分度。
上述装置采用残差神经网络对视频进行特征提取、采用双向编码转换网络或长短期记忆网络对视频关联文本进行特征提取和采用深度神经网络对词语进行特征提取,实现了将自然语言转换为可进行数学运算的特征向量,简化了本申请词权重生成装置的数学运算。
本申请的实施例还提供了一种计算机设备,该计算机设备包括:处理器和存储器,存储器中存储有至少一条指令、至少一段程序、代码集或指令集,至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行以实现上述各方法实施例提供的词权重生成方法。
图12示出了本申请一个示例性实施例提供的计算机设备1200的结构框图。该计算机设备1200可以是便携式移动终端,比如:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。计算机设备1200还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。该计算机设备1200还能够指服务器。
通常,计算机设备1200包括有:处理器1201和存储器1202。
处理器1201可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器1201可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器1201也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器1201可以集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示的内容的渲染和绘制。一些实施例中,处理器1201还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器1202可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1202还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1202中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器1201所执行以实现本申请中方法实施例提供的词权重生成方法。
所述存储器还包括一个或者一个以上的程序,所述一个或者一个以上程序存储于存储器中,所述一个或者一个以上程序包含用于进行本申请实施例提供的词权重生成方法。
本申请还提供一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现上述方法实施例提供的词权重生成方法。
本申请提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述方法实施例提供的词权重生成方法。

Claims (31)

  1. 一种词权重的生成方法,其特征在于,所述方法由计算机设备执行,所述方法包括:
    获取视频和视频关联文本,所述视频关联文本包括至少一个词语,所述视频关联文本是与所述视频的内容存在关联关系的文本信息;
    对所述视频、所述视频关联文本和所述词语三种信息的特征进行多模态特征融合,生成所述词语的中间向量;
    基于所述词语的中间向量,生成所述词语的词权重。
  2. 根据权利要求1所述的方法,其特征在于,所述对所述视频、所述视频关联文本和所述词语三种信息的特征进行多模态特征融合,生成所述词语的中间向量,包括:
    提取所述视频的视频特征向量;提取所述视频关联文本的文本特征向量;以及提取所述词语的词语特征向量;
    将所述视频特征向量、所述文本特征向量和所述词语特征向量进行融合,得到所述词语的中间向量。
  3. 根据权利要求2所述的方法,其特征在于,所述将所述视频特征向量、所述文本特征向量和所述词语特征向量进行融合,得到所述词语的中间向量,包括:
    将所述视频特征向量、所述文本特征向量和所述词语特征向量进行第一融合,得到第一融合向量;
    将所述第一融合向量和所述词语特征向量进行第二融合,得到所述词语的中间向量。
  4. 根据权利要求3所述的方法,其特征在于,所述将所述视频特征向量、所述文本特征向量和所述词语特征向量进行第一融合,得到第一融合向量,包括:
    将所述视频特征向量、所述文本特征向量和所述词语特征向量进行依次拼接,得到第一拼接向量;
    将所述第一拼接向量进行全连接特征映射,得到所述第一融合向量。
  5. 根据权利要求3所述的方法,其特征在于,所述将所述第一融合向量和所述词语特征向量进行第二融合,得到所述词语的中间向量,包括:
    将所述第一融合向量和所述词语特征向量进行依次拼接,得到第二拼接向量;
    将所述第二拼接向量进行全连接特征映射,得到所述词语的中间向量。
  6. 根据权利要求1至5任一所述的方法,其特征在于,所述基于所述词语的中间向量,生成所述词语的词权重,包括:
    将所述中间向量进行维度变换,得到一维向量;
    将所述一维向量进行归一化处理,得到所述词语的词权重。
  7. 根据权利要求6所述的方法,其特征在于,所述将所述一维向量进行归一化处理,得到所述词语的词权重,包括:
    将所述一维向量通过阈值函数进行转换,得到所述词语的词权重。
  8. 根据权利要求2至5任一所述的方法,其特征在于,所述提取所述视频的视频特征向量,包括:
    对所述视频进行分帧操作,得到至少两个视频帧;
    提取所述至少两个视频帧的视频帧向量;
    计算所述至少两个视频帧的视频帧向量的平均向量,将所述平均向量确定为所述视频特征向量;或,计算所述至少两个视频帧的视频帧向量的加权向量,将所述加权向量确定为所述视频特征向量。
  9. 根据权利要求8所述的方法,其特征在于,所述计算所述至少两个视频帧的视频帧向量的加权向量,将所述加权向量确定为所述视频特征向量,包括:
    通过目标检测模型确定每个视频帧包括的目标对象;
    通过分类模型对所述目标对象进行分类,得到所述每个视频帧对应的目标对象分类;
    计算所述每个视频帧对应的目标对象分类与所述词语的相似度;
    根据所述每个视频帧对应的相似度确定所述每个视频帧的视频帧向量的权重,所述权重与所述相似度正相关;
    根据所述至少两个视频帧的视频帧向量,以及所述至少两个视频帧的视频帧向量各自的权重,计算所述至少两个视频帧的视频帧向量的加权向量,将所述加权向量确定为所述视频特征向量。
  10. 根据权利要求8所述的方法,其特征在于,所述提取所述至少两个视频帧的视频帧向量,包括:
    调用残差神经网络提取所述视频中的所述至少两个视频帧的视频帧向量。
  11. 根据权利要求2至5任一所述的方法,其特征在于,所述提取所述视频关联文本的文本特征向量,包括:
    调用双向编码转换网络提取所述视频关联文本的文本特征向量;
    或,
    调用长短期记忆网络提取所述视频关联文本的文本特征向量。
  12. 根据权利要求2至5任一所述的方法,其特征在于,所述提取所述词语的词语特征向量,包括:
    对所述视频关联文本进行分词,得到所述词语;
    调用深度神经网络提取所述词语的词语特征向量。
  13. 根据权利要求12所述的方法,其特征在于,所述对所述视频关联文本进行分词,得到所述词语,包括:
    调用中文分词工具对所述视频关联文本进行分词,得到所述词语。
  14. 根据权利要求8所述的方法,其特征在于,所述提取所述视频的视频特征向量,包括:
    基于所述视频中的所述视频帧,提取得到所述视频帧向量;
    基于所述视频中的音频帧,提取得到音频帧向量;
    基于所述视频帧中的文本,提取得到文本幕向量;
    将所述视频帧向量、所述音频帧向量和所述文本幕向量中的至少两种进行融合,得到所述视频特征向量。
  15. 一种词权重的生成装置,其特征在于,所述装置包括:
    获取模块,用于获取视频和视频关联文本,所述视频关联文本包括至少一个词语,所述视频关联文本是与所述视频的内容存在关联关系的文本信息;
    生成模块,用于对所述视频、所述视频关联文本和所述词语三种信息的特征进行多模态特征融合,生成所述词语的中间向量;
    生成模块,还用于基于所述词语的中间向量,生成所述词语的词权重。
  16. 根据权利要求15所述的装置,其特征在于,所述生成模块包括提取模块和融合模块;
    所述提取模块,用于提取所述视频的视频特征向量;提取所述视频关联文本的文本特征向量;以及提取所述词语的词语特征向量;
    所述融合模块,用于将所述视频特征向量、所述文本特征向量和所述词语特征向量进行融合,得到所述词语的中间向量。
  17. 根据权利要求16所述的装置,其特征在于,所述融合模块包括第一融合子模块和第二融合子模块;
    所述第一融合子模块,用于将所述视频特征向量、所述文本特征向量和所述词语特征向量进行第一融合,得到第一融合向量;
    所述第二融合子模块,用于将所述第一融合向量和所述词语特征向量进行第二融合,得到所述词语的中间向量。
  18. 根据权利要求17所述的装置,其特征在于,所述第一融合子模块包括第一拼接模块和第一映射模块;
    所述第一拼接模块,用于将所述视频特征向量、所述文本特征向量和所述词语特征向量进行依次拼接,得到第一拼接向量;
    所述第一映射模块,用于将所述第一拼接向量进行全连接特征映射,得到所述第一融合向量。
  19. 根据权利要求17所述的装置,其特征在于,所述第二融合子模块包括第二拼接模块和第二映射模块;
    所述第二拼接模块,用于将所述第一融合向量和所述词语特征向量进行依次拼接,得到第二拼接向量;
    所述第二映射模块,用于将所述第二拼接向量进行全连接特征映射,得到所述词语的中间向量。
  20. 根据权利要求15至19任一所述的装置,其特征在于,所述生成模块还包括转换模块;
    所述转换模块,用于将所述中间向量进行维度变换,得到一维向量;
    所述转换模块,还用于将所述一维向量进行归一化处理,得到所述词语的词权重。
  21. 根据权利要求20所述的装置,其特征在于,所述转换模块,用于:
    将所述一维向量通过阈值函数进行转换,得到所述词语的词权重。
  22. 根据权利要求16至19任一所述的装置,其特征在于,所述提取模块包括视频提取模块,所述视频提取模块包括分帧模块、提取子模块和计算模块;
    所述分帧模块,用于对所述视频进行分帧操作,得到至少两个视频帧;
    所述提取子模块,用于提取所述至少两个视频帧的视频帧向量;
    所述计算模块,用于计算所述至少两个视频帧的视频帧向量的平均向量,将所述平均向量确定为所述视频特征向量;或,计算所述至少两个视频帧的视频帧向量的加权向量,将所述加权向量确定为所述视频特征向量。
  23. 根据权利要求22所述的装置,其特征在于,所述计算模块,用于:
    通过目标检测模型确定每个视频帧包括的目标对象;
    通过分类模型对所述目标对象进行分类,得到所述每个视频帧对应的目标对象分类;
    计算所述每个视频帧对应的目标对象分类与所述词语的相似度;
    根据所述每个视频帧对应的相似度确定所述每个视频帧的视频帧向量的权重,所述权重与所述相似度正相关;
    根据所述至少两个视频帧的视频帧向量,以及所述至少两个视频帧的视频帧向量各自的权重,计算所述至少两个视频帧的视频帧向量的加权向量,将所述加权向量确定为所述视频特征向量。
  24. 根据权利要求22所述的装置,其特征在于,所述提取子模块,用于:
    调用残差神经网络提取所述视频中的所述至少两个视频帧的视频帧向量。
  25. 根据权利要求16至19任一所述的装置,其特征在于,所述提取模块包括文本提取模块,所述文本提取模块,用于:
    调用双向编码转换网络提取所述视频关联文本的文本特征向量;
    或,
    调用长短期记忆网络提取所述视频关联文本的文本特征向量。
  26. 根据权利要求16至19任一所述的装置,其特征在于,所述提取模块包括词语提取模块,所述词语提取模块包括分词模块和词语提取子模块;
    所述分词模块,用于对所述视频关联文本进行分词,得到所述词语;
    所述词语提取子模块,用于调用深度神经网络提取所述词语的词语特征向量。
  27. 根据权利要求26所述的装置,其特征在于,所述分词模块,用于:
    调用中文分词工具对所述视频关联文本进行分词,得到所述词语。
  28. 根据权利要求22所述的装置,其特征在于,所述提取模块,用于:
    基于所述视频中的所述视频帧,提取得到所述视频帧向量;
    基于所述视频中的音频帧,提取得到音频帧向量;
    基于所述视频帧中的文本,提取得到文本幕向量;
    将所述视频帧向量、所述音频帧向量和所述文本幕向量中的至少两种进行融合,得到所述视频特征向量。
  29. 一种计算机设备,其特征在于,所述计算机设备包括:处理器和存储器,所述存储器存储有计算机程序,所述计算机程序由所述处理器加载并执行以实现如权利要求1至14任一所述的词权重生成方法。
  30. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至14任一所述的词权重生成方法。
  31. 一种计算机程序产品,其中,所述计算机程序产品存储有计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至14任一所述的词权重生成方法。
PCT/CN2022/078183 2021-03-09 2022-02-28 词权重的生成方法、装置、设备及介质 WO2022188644A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/975,519 US20230057010A1 (en) 2021-03-09 2022-10-27 Term weight generation method, apparatus, device and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110258046.3 2021-03-09
CN202110258046.3A CN113010740B (zh) 2021-03-09 2021-03-09 词权重的生成方法、装置、设备及介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/975,519 Continuation US20230057010A1 (en) 2021-03-09 2022-10-27 Term weight generation method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
WO2022188644A1 true WO2022188644A1 (zh) 2022-09-15

Family

ID=76403513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078183 WO2022188644A1 (zh) 2021-03-09 2022-02-28 词权重的生成方法、装置、设备及介质

Country Status (3)

Country Link
US (1) US20230057010A1 (zh)
CN (1) CN113010740B (zh)
WO (1) WO2022188644A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010740B (zh) * 2021-03-09 2023-05-30 腾讯科技(深圳)有限公司 词权重的生成方法、装置、设备及介质
CN113723166A (zh) * 2021-03-26 2021-11-30 腾讯科技(北京)有限公司 内容识别方法、装置、计算机设备和存储介质
CN113868519B (zh) * 2021-09-18 2023-11-14 北京百度网讯科技有限公司 信息搜索方法、装置、电子设备和存储介质
CN115221875B (zh) * 2022-07-28 2023-06-20 平安科技(深圳)有限公司 词权重生成方法、装置、电子设备及存储介质
CN116029294B (zh) * 2023-03-30 2023-06-09 华南师范大学 词项配对方法、装置及设备
CN117573925A (zh) * 2024-01-15 2024-02-20 腾讯科技(深圳)有限公司 预估播放时长的确定方法、装置、电子设备及存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170177708A1 (en) * 2015-12-17 2017-06-22 Linkedin Corporation Term weight optimization for content-based recommender systems
CN109635081A (zh) * 2018-11-23 2019-04-16 上海大学 一种基于词频幂律分布特性的文本关键词权重计算方法
CN110781347A (zh) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 一种视频处理方法、装置、设备以及可读存储介质
CN111241811A (zh) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 确定搜索词权重的方法、装置、计算机设备和存储介质
CN111581437A (zh) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 一种视频检索方法及装置
CN111767461A (zh) * 2020-06-24 2020-10-13 北京奇艺世纪科技有限公司 数据处理方法及装置
CN111988668A (zh) * 2020-08-28 2020-11-24 腾讯科技(深圳)有限公司 一种视频推荐方法、装置、计算机设备及存储介质
CN113010740A (zh) * 2021-03-09 2021-06-22 腾讯科技(深圳)有限公司 词权重的生成方法、装置、设备及介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6810580B2 (ja) * 2016-11-22 2021-01-06 日本放送協会 言語モデル学習装置およびそのプログラム
KR20190115319A (ko) * 2018-04-02 2019-10-11 필아이티 주식회사 문장을 복수의 클래스들로 분류하는 모바일 장치 및 방법
CN109543714B (zh) * 2018-10-16 2020-03-27 北京达佳互联信息技术有限公司 数据特征的获取方法、装置、电子设备及存储介质
CN110309304A (zh) * 2019-06-04 2019-10-08 平安科技(深圳)有限公司 一种文本分类方法、装置、设备及存储介质
CN111767726B (zh) * 2020-06-24 2024-02-06 北京奇艺世纪科技有限公司 数据处理方法及装置
CN111930999B (zh) * 2020-07-21 2022-09-30 山东省人工智能研究院 逐帧跨模态相似度关联实施文本查询定位视频片段方法
CN112287070A (zh) * 2020-11-16 2021-01-29 腾讯科技(深圳)有限公司 词语的上下位关系确定方法、装置、计算机设备及介质

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170177708A1 (en) * 2015-12-17 2017-06-22 Linkedin Corporation Term weight optimization for content-based recommender systems
CN109635081A (zh) * 2018-11-23 2019-04-16 上海大学 一种基于词频幂律分布特性的文本关键词权重计算方法
CN110781347A (zh) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 一种视频处理方法、装置、设备以及可读存储介质
CN111241811A (zh) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 确定搜索词权重的方法、装置、计算机设备和存储介质
CN111581437A (zh) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 一种视频检索方法及装置
CN111767461A (zh) * 2020-06-24 2020-10-13 北京奇艺世纪科技有限公司 数据处理方法及装置
CN111988668A (zh) * 2020-08-28 2020-11-24 腾讯科技(深圳)有限公司 一种视频推荐方法、装置、计算机设备及存储介质
CN113010740A (zh) * 2021-03-09 2021-06-22 腾讯科技(深圳)有限公司 词权重的生成方法、装置、设备及介质

Also Published As

Publication number Publication date
CN113010740A (zh) 2021-06-22
US20230057010A1 (en) 2023-02-23
CN113010740B (zh) 2023-05-30

Similar Documents

Publication Publication Date Title
WO2022188644A1 (zh) 词权重的生成方法、装置、设备及介质
CN113762322B (zh) 基于多模态表示的视频分类方法、装置和设备及存储介质
KR102416558B1 (ko) 영상 데이터 처리 방법, 장치 및 판독 가능 저장 매체
CN111488489B (zh) 视频文件的分类方法、装置、介质及电子设备
CN110674350B (zh) 视频人物检索方法、介质、装置和计算设备
CN111581437A (zh) 一种视频检索方法及装置
CN111382555B (zh) 数据处理方法、介质、装置和计算设备
WO2023065617A1 (zh) 基于预训练模型和召回排序的跨模态检索***及方法
CN112348111B (zh) 视频中的多模态特征融合方法、装置、电子设备及介质
CN114328807A (zh) 一种文本处理方法、装置、设备及存储介质
CN109670073B (zh) 一种信息转换方法及装置、交互辅助***
CN113392265A (zh) 多媒体处理方法、装置及设备
CN113673613A (zh) 基于对比学习的多模态数据特征表达方法、装置及介质
WO2023197749A1 (zh) 背景音乐的***时间点确定方法、装置、设备和存储介质
JP2023535108A (ja) ビデオタグ推薦モデルのトレーニング方法及びビデオタグの決定方法、それらの装置、電子機器、記憶媒体及びコンピュータプログラム
CN112188306A (zh) 一种标签生成方法、装置、设备及存储介质
CN114359775A (zh) 关键帧检测方法、装置、设备及存储介质、程序产品
CN115408488A (zh) 用于小说场景文本的分割方法及***
CN114398505A (zh) 目标词语的确定方法、模型的训练方法、装置及电子设备
CN111488813A (zh) 视频的情感标注方法、装置、电子设备及存储介质
CN114510564A (zh) 视频知识图谱生成方法及装置
CN113409803A (zh) 语音信号处理方法、装置、存储介质及设备
CN117009577A (zh) 一种视频数据处理方法、装置、设备及可读存储介质
CN115019137A (zh) 一种多尺度双流注意力视频语言事件预测的方法及装置
CN113704544A (zh) 一种视频分类方法、装置、电子设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22766170

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 120224)