CN113822138A - Similar video determination method and device - Google Patents

Similar video determination method and device

Info

Publication number
CN113822138A
CN113822138A (application CN202110843235.7A)
Authority
CN
China
Prior art keywords
video
semantic feature
semantic
image
feature information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110843235.7A
Other languages
Chinese (zh)
Inventor
Liu Gang (刘刚)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110843235.7A priority Critical patent/CN113822138A/en
Publication of CN113822138A publication Critical patent/CN113822138A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for determining similar videos, belonging to the technical field of artificial intelligence. The method includes: acquiring a first video; performing content-based feature information extraction on the first video to obtain semantic feature information of the first video; and determining that the first video is a similar video of a second video when the semantic feature information of the first video and the semantic feature information of the second video meet a preset condition. By performing content-based feature extraction on the video, the emphasis of feature extraction is kept on the video content itself, so the extracted video semantic information is only slightly affected by video editing operations. Whether two videos are similar is then judged by comparing semantic information that characterizes the video content itself, which effectively improves the accuracy of similar video identification.

Description

Similar video determination method and device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for determining similar videos.
Background
In the era of rapid internet development, information flow content services have become widely popular, and a large amount of high-quality original content appears on information flow content service platforms. At the same time, however, a large number of transport accounts have emerged on these platforms: the owner of a transport account copies or lightly edits original content and then redistributes transported content that is similar to the original, which harms the interests of the original author and is detrimental to the healthy development of the content ecosystem.
To identify transported content that is similar to original content, the related art may compute hash values of relevant components of the original content or verify a message digest of the original content. These methods, however, have poor robustness to modification: if the transported content is obtained from the original content through several editing operations, it is difficult to identify.
Disclosure of Invention
The embodiments of the present application provide a similar video determination method and device, which can accurately identify similar videos obtained through video editing operations and effectively improve the accuracy of similar video identification.
According to an aspect of the embodiments of the present application, there is provided a method for determining a similar video, the method including:
acquiring a first video;
extracting content-based feature information of the first video to obtain semantic feature information of the first video, wherein the semantic feature information is feature information whose degree of influence by video editing operations is lower than a first influence degree requirement;
and under the condition that the semantic feature information of the first video and the semantic feature information of the second video accord with preset conditions, determining that the first video is a similar video of the second video.
In one possible design, the semantic feature information of the first video includes a first video semantic feature, where the first video semantic feature is used to represent the semantic feature information of the first video from the overall dimension of the video, and the extracting the content-based feature information of the first video to obtain the semantic feature information of the first video includes:
inputting the first video into a video semantic extraction model for feature information extraction to obtain the first video semantic feature;
the video semantic extraction model is a machine learning model obtained by training based on a triple loss constraint condition by taking triples as training samples, wherein a positive sample pair of the triples comprises a first sample video and a second sample video, a negative sample pair of the triples comprises the first sample video and a third sample video, the second sample video is a video obtained by performing video editing operation on the first sample video, and the third sample video is a video with different content from the first sample video.
In one possible design, the inputting of the first video into the video semantic extraction model for feature information extraction to obtain the first video semantic feature includes:
obtaining a target image set according to at least one video frame in the first video and a video cover of the first video;
embedding each image in the target image set to obtain an embedding feature set of the target image set, wherein the embedding feature set represents visual modal information of the first video;
extracting content-based feature information of each embedded feature in the embedded feature set to obtain embedded semantic features;
and carrying out average pooling processing on each embedded semantic feature to obtain the first video semantic feature.
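For illustration, this pipeline can be sketched in PyTorch-style Python as below. The module structure, dimensions, and names are assumptions made for exposition, not the disclosed implementation; any image backbone that outputs one embedding per image could play the role of the embedding step.

```python
import torch
import torch.nn as nn

class VideoSemanticExtractor(nn.Module):
    """Sketch: sampled frames + cover image -> per-image embeddings ->
    content-based semantic features -> average pooling -> video semantic feature."""

    def __init__(self, image_backbone: nn.Module, embed_dim: int = 2048, out_dim: int = 256):
        super().__init__()
        self.backbone = image_backbone                 # produces one embedding per image
        self.semantic_head = nn.Sequential(            # content-based extraction on each embedding
            nn.Linear(embed_dim, 512), nn.ReLU(), nn.Linear(512, out_dim)
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (N, 3, H, W), where N = sampled video frames plus the video cover
        embeddings = self.backbone(images)                    # (N, embed_dim) visual-modality information
        per_image_semantics = self.semantic_head(embeddings)  # (N, out_dim) embedded semantic features
        return per_image_semantics.mean(dim=0)                # average pooling -> first video semantic feature
```

A pretrained CNN with its classification head removed would be a natural choice for the image backbone in such a sketch.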
In one possible design, the semantic feature information of the first video includes a first image semantic feature sequence, where the first image semantic feature sequence is used to represent the semantic feature information of the first video from a video frame dimension, and the extracting the content-based feature information of the first video to obtain the semantic feature information of the first video includes:
and extracting image-content-based feature information from the video frames of the first video to obtain the first image semantic feature sequence, wherein each first image semantic feature in the first image semantic feature sequence is a feature whose degree of influence by image editing operations is lower than a second influence degree requirement.
In one possible design, the method further includes:
determining a video duration of the first video;
and under the condition that the video duration meets a preset duration condition, performing the operation of extracting image-content-based feature information from the video frames of the first video to obtain the first image semantic feature sequence.
In a possible design, the extracting, based on image content, feature information of a video frame of the first video to obtain the first image semantic feature sequence when the video duration of the first video meets a preset duration condition includes:
performing frame extraction operation on the first video to obtain a plurality of video frames;
for each video frame, determining interframe difference information of the video frame and an adjacent video frame;
screening the plurality of video frames according to the interframe difference information to obtain a target frame sequence;
and extracting feature information based on image content for each target frame in the target frame sequence to obtain the first image semantic feature sequence.
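A minimal sketch of this frame-selection step is shown below, assuming OpenCV for decoding and a mean absolute inter-frame difference as the difference measure; the sampling interval and threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def select_target_frames(video_path: str, sample_every: int = 30, diff_threshold: float = 12.0):
    """Sketch: extract frames, compute inter-frame difference information,
    and keep only frames that differ enough from their neighbour."""
    cap = cv2.VideoCapture(video_path)
    sampled, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:                               # frame extraction operation
            sampled.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        idx += 1
    cap.release()

    target_frames = []
    for i, frame in enumerate(sampled):
        if i == 0:
            target_frames.append(frame)
            continue
        diff = cv2.absdiff(frame, sampled[i - 1])                 # inter-frame difference information
        if float(np.mean(diff)) >= diff_threshold:                # screen out near-duplicate frames
            target_frames.append(frame)
    return target_frames                                           # target frame sequence
```

Each retained target frame would then go through the image-level semantic extraction to form the first image semantic feature sequence.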
In one possible design, the determining that the first video is a similar video of the second video when the semantic feature information of the first video and the semantic feature information of the second video meet a preset condition includes:
determining a second video semantic feature in the semantic feature information of the second video under the condition that the semantic feature information of the first video includes the first video semantic feature, and if the first video semantic feature and the second video semantic feature meet a first preset condition, determining that the first video is a content similar video of the second video;
and under the condition that the semantic feature information of the first video includes the first image semantic feature sequence, determining a second image semantic feature sequence in the semantic feature information of the second video, and if the first image semantic feature sequence and the second image semantic feature sequence meet a second preset condition, determining that the first video is a content similar video of the second video.
In a possible design, the determining that the first video is a video with similar content to the second video if the first image semantic feature sequence and the second image semantic feature sequence meet a second preset condition includes:
determining a matching feature pair according to the first image semantic feature sequence and the second image semantic feature sequence, wherein the distance between a first image semantic feature in the matching feature pair and a second image semantic feature in the matching feature pair is smaller than a distance threshold;
determining a matching target frame according to the matching feature pair, wherein the matching target frame comprises a first target frame and a second target frame, the first target frame is a video frame corresponding to the semantic feature of the first image in the first video, and the second target frame is a video frame corresponding to the semantic feature of the second image in the second video;
and if the number of the matched target frames meets the number condition, determining that the first video is the content similar video of the second video.
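A minimal sketch of this matching logic, assuming Euclidean distance between image semantic features; the distance threshold and the required number of matches are illustrative assumptions.

```python
import numpy as np

def count_matching_target_frames(first_seq, second_seq, dist_threshold: float = 0.3):
    """Sketch: a matching feature pair is a (first, second) image semantic feature
    pair whose distance is below dist_threshold; the corresponding video frames
    are the first and second target frames."""
    matched_first, matched_second = set(), set()
    for i, fa in enumerate(first_seq):
        for j, fb in enumerate(second_seq):
            if np.linalg.norm(np.asarray(fa) - np.asarray(fb)) < dist_threshold:
                matched_first.add(i)    # first target frame index
                matched_second.add(j)   # second target frame index
    return len(matched_first), len(matched_second)

def meets_second_preset_condition(first_seq, second_seq, min_matched: int = 5) -> bool:
    """Number condition (assumed): enough matched target frames on both sides."""
    n_first, n_second = count_matching_target_frames(first_seq, second_seq)
    return n_first >= min_matched and n_second >= min_matched
```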
In one possible design, the determining that the first video is a similar video of the second video when the semantic feature information of the first video and the semantic feature information of the second video meet a preset condition includes:
under the condition that the semantic feature information of the first video and the semantic feature information of the second video meet the preset condition, determining a first audio feature sequence of the first video, and acquiring a second audio feature sequence of the second video;
and determining that the first video is a similar video of the second video when the similarity between the first audio feature sequence and the second audio feature sequence meets an audio similarity condition.
In one possible design, the determining a first sequence of audio features of the first video includes:
acquiring audio data corresponding to the first video;
performing frequency domain conversion processing on the audio data to obtain frequency domain characteristics of the audio data;
generating the first sequence of audio features based on the frequency-domain features.
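A minimal NumPy sketch of one way to realize these steps, assuming a framed FFT with log band-energy pooling as the frequency-domain conversion; the frame length, hop size and number of bands are illustrative assumptions.

```python
import numpy as np

def first_audio_feature_sequence(samples: np.ndarray, frame_len: int = 1024,
                                 hop: int = 512, n_bands: int = 32) -> np.ndarray:
    """Sketch: frame the audio data, convert each frame to the frequency domain,
    and pool the magnitude spectrum into a small band-energy vector per frame."""
    window = np.hanning(frame_len)
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))                  # frequency-domain features of this frame
        bands = np.array_split(spectrum, n_bands)
        features.append(np.log1p([band.sum() for band in bands]))
    return np.stack(features)                                   # one feature vector per audio frame
```

The similarity between two such sequences could then be scored, for example, by aligning them frame by frame and averaging a per-frame distance, and compared against the audio similarity condition.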
In one possible design, the second video is any one of videos in a video database, the video database comprises a feature information base and an online index base, the online index base comprises stock indexes and incremental indexes, the stock indexes comprise index information of the stock videos, the stock videos refer to videos in a target historical period, the incremental indexes comprise index information of the incremental videos, and the incremental videos refer to videos newly added after the target historical period;
the method further comprises the following steps:
under the condition that the second video is the stock video, acquiring target index information of the second video from the stock index; acquiring semantic feature information of the second video from the feature information base based on the target index information;
under the condition that the second video is the incremental video, acquiring target index information of the second video from the incremental index; and acquiring semantic feature information of the second video from the feature information base based on the target index information.
In one possible design, the online index store is a first index store, the target historical period is a first historical period, and the method further includes:
determining a second historical period, wherein the second historical period takes, as its right-boundary time node, the latest time at which a video is allowed to be added to the first index base, and the time span of the second historical period is the same as that of the first historical period;
performing index reconstruction on the basis of the first index base to obtain a second index base, wherein the target historical period of the second index base is the second historical period;
and switching the online index library from the first index library to the second index library.
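A simplified sketch of the stock/incremental index design and of the rebuild-and-switch step is given below; the index structures are reduced to dictionaries for exposition and are not the disclosed implementation.

```python
class OnlineIndexStore:
    """Sketch of an online index base split into a stock index and an incremental index."""

    def __init__(self, stock_index: dict, target_period):
        self.stock_index = stock_index        # index info of videos within the target historical period
        self.incremental_index = {}           # index info of videos added after that period
        self.target_period = target_period

    def add_incremental_video(self, video_id, index_info):
        self.incremental_index[video_id] = index_info

    def lookup(self, video_id):
        # stock videos are found in the stock index, incremental videos in the incremental index
        return self.stock_index.get(video_id, self.incremental_index.get(video_id))

def rebuild_and_switch(first_store: OnlineIndexStore, second_period) -> OnlineIndexStore:
    """Sketch of index reconstruction: build a second index base from the first one
    (a real rebuild would also drop videos falling outside the new period so the
    time span stays the same), then switch the online index base to it."""
    merged = {**first_store.stock_index, **first_store.incremental_index}
    return OnlineIndexStore(stock_index=merged, target_period=second_period)
```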
In one possible design, the method further includes:
determining that the first video is a transport video under the condition that the first video is a similar video of the second video, wherein the transport video is a non-original video;
restricting the first video from being pushed.
According to an aspect of embodiments of the present application, there is provided a similar video determining apparatus, including:
the video acquisition module is used for acquiring a first video;
the semantic feature extraction module is used for extracting content-based feature information of the first video to obtain semantic feature information of the first video, wherein the semantic feature information is feature information whose degree of influence by video editing operations is lower than a first influence degree requirement;
and the similar video determining module is used for determining that the first video is the similar video of the second video under the condition that the semantic feature information of the first video and the semantic feature information of the second video accord with preset conditions.
According to an aspect of embodiments of the present application, there is provided a computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the above-described similar video determination method.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement the above-mentioned similar video determining method.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the similar video determination method described above.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
By performing content-based feature information extraction on a video, the emphasis of feature extraction is kept on the video content, and the influence of other irrelevant factors on feature extraction is reduced, so the extracted video semantic information is only slightly affected by video editing operations. If the video semantic information of two videos meets the preset condition, the two videos can be determined to be similar videos. Judging whether two videos are similar by comparing video semantic information that characterizes the video content itself effectively improves the accuracy of similar video identification.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 illustrates a schematic diagram of a video transport;
FIG. 2 is a schematic diagram of an application execution environment provided by one embodiment of the present application;
fig. 3 is a flowchart of a similar video determination method according to an embodiment of the present application;
FIG. 4 illustrates a schematic view of a page of a video editing tool;
fig. 5 is a flowchart of a similar video determination method provided in an embodiment of the present application;
FIG. 6 illustrates a training flow diagram of a video semantic extraction model;
fig. 7 is a flowchart of a similar video determination method according to another embodiment of the present application;
FIG. 8 illustrates a flow chart for determining semantic feature vectors for video;
FIG. 9 illustrates a flow chart for extracting a sequence of audio features;
FIG. 10 illustrates a schematic diagram of a distributed vector retrieval service;
fig. 11 is a flowchart illustrating a similar video determining method according to an embodiment of the present application;
fig. 12 is a flowchart illustrating a similar video determining method according to another embodiment of the present application;
fig. 13 illustrates a technical framework diagram of an information flow content service system;
fig. 14 is a block diagram of a similar video determining apparatus provided in an embodiment of the present application;
fig. 15 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
In the age of rapid internet development, as the threshold for content production in information flow content services has been lowered, the upload and distribution volumes of video and image-text content have grown exponentially. These content sources include internet users and various content production institutions, for example PGC (Professional Generated Content) from media outlets and organizations, UGC (User Generated Content), and multimedia information flow content services that rely on social networks. Over the past year, the peak daily upload volume from all sources into the content warehouse has exceeded millions of items. Since the upload volume of content has increased so greatly, the review of uploaded content needs to be completed in a short time in order to ensure the security and timeliness of content distribution in the information flow content service and the legitimate interests of content copyright holders.
As user demands and expectations grow, platforms increasingly hope that original authors and high-quality works will appear, and self-media platforms all compete for authors and high-quality content. Moreover, on some premium content platforms the incentives can be very generous, which helps retain high-quality creators. Under the temptation of these incentives, and because original creation is costly, a black market of video transporting has emerged and large numbers of transport accounts have appeared on the platforms. To increase the income and attention of the users who register and operate these transport accounts, original videos are lightly edited and modified and then re-uploaded so as to bypass the platforms' similarity identification and deduplication, and large numbers of similar videos consequently appear on short video platforms.
In one example, as shown in FIG. 1, a schematic diagram of video transporting is illustrated. An original author can produce a creative video only through a series of complex steps such as designing the script, casting actors, setting up scenes, shooting the raw footage and finally doing post-production; the author then publishes the video to a platform, hoping to gain exposure and attract followers, and finally to monetize it, for example through advertising, live streaming and product placement. A black/gray-market producer, by contrast, simply edits a transported hot video with some tool software. There are many such processing methods, such as adding a video title, adding a watermark to the cover image, performing various editing, cropping and transformation on the video content, modifying the audio, adding black borders, picture-in-picture, adding subtitles, and so on. After this simple processing and editing, the video is distributed to the platform to divert traffic, attract followers, and eventually monetize.
Black/gray-market producers can transport large quantities of hot videos in batches at extremely low cost, in assembly-line fashion, easily pirating the fruits of original creators' labor and occupying a large amount of traffic, which is detrimental to the healthy development of the content ecosystem. Because video content has to be reviewed manually, manual review on the one hand adds considerable cost and on the other hand is not efficient enough. As the volume of content grows rapidly, the processing cost becomes very high; if content cannot be reviewed and processed quickly, it cannot be distributed quickly, which greatly harms the user experience. With the explosion of short videos, the means of modifying and editing short video content to bypass similarity identification systems keep increasing, and a multi-dimensional capability for detecting transported content is urgently needed.
In order to solve the above problems, the present application provides a similar video determining method, which makes full use of information of each dimension of video content and identifies transport content in an information stream based on a machine learning technology.
The similar video determination method provided by the embodiments of the present application involves artificial intelligence technology and blockchain technology, which are briefly described below to facilitate understanding by those skilled in the art.
Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Deep learning: the concept of deep learning stems from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations (attribute categories or features) in order to discover distributed feature representations of data.
DNN: deep Neural Networks (Deep Neural Networks) are understood to be Neural Networks with many hidden layers, sometimes called Multi-Layer perceptrons (MLPs). The neural network layers inside the DNN can be divided into three categories, an input layer, a hidden layer and an output layer. Typically the first layer is the input layer, the last layer is the output layer, and the number of layers in between are all hidden layers. The layers are all connected, that is, any neuron of the ith layer is necessarily connected with any neuron of the (i + 1) th layer, and i is a positive integer.
Depth Metric Learning (DML) is a method of Metric Learning, and its objective is to learn a mapping from original features to a low-dimensional dense vector Space (called Embedding Space), so that the distances calculated by using common distance functions (euclidean distance, Cosine distance, etc.) on the Embedding Space for similar objects are relatively close, and the distances between objects of different classes are relatively far.
Computer Vision (CV) technology is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to recognize, track and measure targets and perform other machine-vision tasks, and further processes the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The block chain underlying platform can comprise processing modules such as user management, basic service, intelligent contract and operation monitoring. The user management module is responsible for identity information management of all blockchain participants, and comprises public and private key generation maintenance (account management), key management, user real identity and blockchain address corresponding relation maintenance (authority management) and the like, and under the authorization condition, the user management module supervises and audits the transaction condition of certain real identities and provides rule configuration (wind control audit) of risk control; the basic service module is deployed on all block chain node equipment and used for verifying the validity of the service request, recording the service request to storage after consensus on the valid request is completed, for a new service request, the basic service firstly performs interface adaptation analysis and authentication processing (interface adaptation), then encrypts service information (consensus management) through a consensus algorithm, transmits the service information to a shared account (network communication) completely and consistently after encryption, and performs recording and storage; the intelligent contract module is responsible for registering and issuing contracts, triggering the contracts and executing the contracts, developers can define contract logics through a certain programming language, issue the contract logics to a block chain (contract registration), call keys or other event triggering and executing according to the logics of contract clauses, complete the contract logics and simultaneously provide the function of upgrading and canceling the contracts; the operation monitoring module is mainly responsible for deployment, configuration modification, contract setting, cloud adaptation in the product release process and visual output of real-time states in product operation, such as: alarm, monitoring network conditions, monitoring node equipment health status, and the like.
The platform product service layer provides basic capability and an implementation framework of typical application, and developers can complete block chain implementation of business logic based on the basic capability and the characteristics of the superposed business. The application service layer provides the application service based on the block chain scheme for the business participants to use.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 2, a schematic diagram of an application execution environment according to an embodiment of the present application is shown. The application execution environment may include: a terminal 10 and a server 20.
The terminal 10 may be an electronic device such as a mobile phone, a tablet computer, a game console, an electronic book reader, a multimedia playing device, a wearable device, a PC (Personal Computer), and the like. A client of the application may be installed in the terminal 10.
In the embodiment of the present application, the application may be any application capable of providing a video information streaming content service. Typically, the application is a video-type application. Of course, streaming content services may be provided in other types of applications besides video-type applications. For example, the application may be a news application, a social interaction application, an interactive entertainment application, a browser application, a shopping application, a content sharing application, a Virtual Reality (VR) application, an Augmented Reality (AR) application, and the like, which is not limited in this embodiment. In addition, for different applications, videos pushed by the applications may also be different, and corresponding functions may also be different, which may be configured in advance according to actual requirements, and this is not limited in this embodiment of the application. Optionally, a client of the above application program runs in the terminal 10. In some embodiments, the streaming content service covers many vertical contents such as art, movie, news, finance, sports, entertainment, games, etc., and users can enjoy many forms of content services such as articles, pictures, videos, short videos, live broadcasts, titles, columns, etc. through the streaming content service.
The server 20 is used to provide background services for clients of applications in the terminal 10. For example, the server 20 may be a backend server for the application described above. The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. Optionally, the server 20 provides background services for applications in multiple terminals 10 simultaneously.
Alternatively, the terminal 10 and the server 20 may communicate with each other through the network 30. The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Before describing the method embodiments provided in the present application, a brief description is given to the application scenarios, related terms, or terms that may be involved in the method embodiments of the present application, so as to facilitate understanding by those skilled in the art of the present application.
Faiss is an open-source library for similarity search and clustering. It provides efficient similarity search and clustering of dense vectors, supports searching billion-scale vector sets, and is currently the most mature approximate nearest neighbor search library. It contains a number of algorithms for searching sets of vectors of arbitrary size, as well as supporting code for algorithm evaluation and parameter tuning.
Faiss vector retrieval: a conventional database consists of structured tables containing symbolic information. For example, a collection of images may be represented as a table with one indexed photo per row, each row containing information such as an image identifier and a descriptive statement. Each row may also be associated with entries of other tables, such as a photo associated with a list of people's names. Many AI tools produce high-dimensional vectors, such as text embedding tools like word2vec (word vectorization) and Convolutional Neural Network (CNN) descriptors trained with deep learning. These representations are more powerful and flexible than fixed symbolic representations. However, traditional databases retrieved with Structured Query Language (SQL) are not adapted to these new vector representations and handle them very inefficiently. First, the huge influx of new multimedia streams creates billions of vectors. Second, and more importantly, finding similar entries means finding similar high-dimensional vectors, which is extremely inefficient, or even impossible, with standard query languages. For similarity search and classification, the following operations are required: given a query vector, return the list of database objects closest to this vector in Euclidean distance; and given a query vector, return the list of database objects with the highest vector dot product. Traditional SQL database systems are poorly suited to this because they are optimized for hash-based searches or one-dimensional interval searches.
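As a point of reference, a minimal usage example of the public Faiss Python API for the kind of nearest-vector lookup described above is given below; the dimensionality and data are placeholders, and an exact IndexFlatL2 index is used, whereas a large-scale system would typically use an IVF/PQ variant.

```python
import numpy as np
import faiss  # the open-source similarity search library described above

dim = 256                                               # dimensionality of the semantic feature vectors
index = faiss.IndexFlatL2(dim)                          # exact L2 index

database_vectors = np.random.rand(10000, dim).astype("float32")  # stand-in for stored video features
index.add(database_vectors)

query = np.random.rand(1, dim).astype("float32")        # semantic feature vector of a newly uploaded video
distances, ids = index.search(query, 5)                 # ids of the 5 nearest stored videos
```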
Referring to fig. 3, a flowchart of a similar video determination method according to an embodiment of the present application is shown. The method can be applied to a computer device, which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the server 20 in the application program running environment shown in fig. 2. The method can include the following steps (310-330).
At step 310, a first video is obtained.
The first video may be any video. Video generally refers to various storage formats of moving images. Optionally, the first video is a video recommended to the user for viewing in the information flow content service, such as a portrait version of a small video and a landscape version of a short video.
Optionally, the first video is presented in a Feed stream, such as Web Feed, News Feed, and multimedia Feed. Through which the web site propagates the latest information to the user. For example, the website presents multimedia contents of various video types to the user through Feeds. Feeds are usually arranged in a Timeline, Timeline being the most primitive and basic presentation of Feeds. It should be noted that, in the embodiment of the present application, the source of the first video is not limited, and the first video may be from Feeds stream, or may be in other network media forms.
In one possible implementation, the first video is a video uploaded by a user into a streaming content service. For example, the first video is a video uploaded by the user in real time. The first video may be a complete video or may be a portion of a complete video. For example, an entry for uploading a video may be displayed on an operation interface displayed by the terminal device, a user may select a video to be uploaded, and the terminal device detects an upload request for uploading the video and may upload the video to a designated server; after receiving the video, the server may use the video as the first video.
In one possible implementation, the first video is a short video. Short videos, i.e., short-film videos, are a way of content dissemination. In some application scenarios, short videos are high-frequency pushed video content played on various new media platforms, suitable for viewing in mobile and short-time leisure states, varying from seconds to minutes. The short video can also be video broadcast content which is broadcast on the new internet media within 30 minutes of the time. The definition of the short video is not limited in the embodiment of the application and can be determined according to actual conditions.
It should be noted that the present application does not limit the category of the first video, the first video may be a sports video, a life video, a variety video, a short video, a game video, etc., and the manner of obtaining is not limited to the above description. Similarly, the format of the video is not limited in the embodiments of the present application.
And 320, extracting the feature information of the first video based on the content to obtain the semantic feature information of the first video.
In some application scenarios, such as an information flow content service, a video carrier usually performs certain video editing operations on a video to evade machine review of the transported video, and these editing operations vary widely. Such video editing operations change the original video data and introduce image transformation information that is only weakly related to the video content itself, such as watermark information added to the video, black-border style information around the picture, subtitle information, and the like. This image transformation information can cause the feature information that a machine extracts from the edited video to differ greatly from the feature information of the original video, so the machine may mistakenly conclude that the edited video is not similar to the original video, when in fact the edited video is a transported copy of the original video.
In order to protect high-quality original videos in the information flow content service from being plagiarized and transported at will, content-based feature information extraction therefore needs to be performed on the video, so that the extraction is insensitive to the data changes caused by video editing operations, the extracted feature information is not strongly affected by image transformation information, and semantic feature information whose degree of influence by video editing operations is lower than the first influence degree requirement is obtained.
The content-based feature information extraction in step 320 is a feature extraction process for semantic information of video content, and does not pay attention to image conversion information generated by video editing operation in a video, so that the extracted semantic feature information is feature information that is influenced by the video editing operation to a degree lower than a first influence degree requirement.
The video editing operations include, but are not limited to, operations of cropping video pictures, changing video resolution, removing/watermarking, adding/subtracting frames from head to tail, picture-in-picture, and black-side subtitles. In one example, as shown in FIG. 4, a schematic of a page of a video editing tool is illustrated. The video editing tool page 40 includes controls 41 corresponding to various video editing operations, and a user can perform corresponding operations on the controls 41 to perform corresponding editing operations on a video. In a video carrying scene, the video editing tool can be used as a tool for a video carrier to tamper the original video. For example, the original video is often added with the watermark information of the original account, but the carrier often erases the watermark information of the original account by using the video editing tool, thereby avoiding watermark detection.
The degree of influence of the semantic feature information by the video editing operation can be represented as the difference degree between the semantic feature information of the transport video and the semantic feature information of the original video, and the larger the difference degree between the semantic feature information of the transport video and the semantic feature information of the original video is, the larger the degree of influence of the semantic feature information by the video editing operation is. Therefore, the semantic feature information extracted based on the feature information of the content is influenced by the video editing operation to a lower degree than the first influence degree requirement.
The first influence degree requirement is a constraint condition that the difference degree between the semantic feature information corresponding to the video after the video editing operation and the semantic feature information corresponding to the original video is lower than a threshold. Optionally, the first degree of influence is required to limit a degree of difference between the semantic feature information of the transport video and the semantic feature information of the original video to be lower than a threshold value. Optionally, the transport video may be a video obtained by performing a video editing operation on the original video.
In some application scenarios, the mathematical representation of the semantic feature information is typically a semantic feature vector. Accordingly, the first degree of influence requirement includes: and the similarity between the semantic feature vectors corresponding to the video subjected to the video editing operation and the original video is smaller than a similarity threshold. Optionally, the similarity is any one of cosine similarity, euclidean distance, edit distance, or manhattan distance between two semantic feature vectors. It should be noted that, in the embodiment of the present application, the determination method of the similarity is not limited, and an appropriate similarity calculation method may be selected according to actual situations.
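A minimal sketch of checking such a requirement is shown below, assuming Euclidean distance between the two semantic feature vectors and an illustrative threshold; the text above equally allows cosine, edit or Manhattan measures.

```python
import numpy as np

def meets_first_influence_requirement(original_feat, edited_feat, dist_threshold: float = 0.5) -> bool:
    """Sketch: the distance between the semantic feature vectors of the edited video
    and of the original video should stay below a threshold (distance measure and
    threshold value are assumptions)."""
    a = np.asarray(original_feat, dtype=float)
    b = np.asarray(edited_feat, dtype=float)
    return float(np.linalg.norm(a - b)) < dist_threshold
```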
In a possible implementation manner, the content-based feature information extraction manner may be to extract image-level semantic feature information corresponding to a video frame, and by performing the above-mentioned image-content-based feature information extraction on a video, an image semantic feature corresponding to each of at least one video frame in the video may be obtained. In another possible implementation manner, the content-based feature information may be extracted by extracting overall video semantic feature information, and the video semantic features capable of representing the overall content of the video may be obtained by performing the above-mentioned content-based feature information extraction on the video.
Since the contents of the two embodiments are relatively long, they are only described in general terms, and the details of the two embodiments will be described in the following examples.
Step 330, determining that the first video is a similar video of the second video when the semantic feature information of the first video and the semantic feature information of the second video meet a preset condition.
The second video may be any historical video in a video database. The semantic feature information of the second video may be obtained by extracting content-based feature information, which is the same as that of the first video, from the second video after the second video is uploaded.
The preset condition is a condition for judging video similarity based on semantic feature information. The preset condition may be that a difference degree between the semantic feature information of the first video and the semantic feature information of the second video is smaller than a preset difference degree, or that a similarity degree between the semantic feature information of the first video and the semantic feature information of the second video is greater than or equal to the preset similarity degree.
In a possible implementation manner, the preset condition includes that a difference degree between image semantic features corresponding to the video frame of the first video and image semantic features corresponding to the video frame of the second video is smaller than a preset difference degree, or a similarity degree between image semantic features corresponding to the video frame of the first video and image semantic features corresponding to the video frame of the second video is greater than or equal to the preset similarity degree.
In another possible implementation, the preset condition includes that a difference degree between the video semantic feature information of the first video and the video semantic feature information of the second video is smaller than a preset difference degree, or a similarity degree between the video semantic feature information of the first video and the video semantic feature information of the second video is greater than or equal to the preset similarity degree.
The semantic feature information of the first video and the semantic feature information of the second video are evaluated against the preset condition to judge whether they meet it. If the semantic feature information of the first video and the semantic feature information of the second video meet the preset condition, the first video is determined to be a similar video of the second video; if they do not meet the preset condition, the first video is determined not to be a similar video of the second video.
In some application scenarios, such as an information flow content service scenario, the similar video mentioned above refers to a transport video. In an exemplary embodiment, the following steps are further included after step 330: under the condition that the first video is a similar video of the second video, determining that the first video is a transport video, wherein the transport video is a non-original video; and restricting the first video from being pushed. There are a variety of ways to restrict the pushing of the first video, including but not limited to: reducing the distribution priority of the first video; limiting the distribution range of the first video; or not distributing the first video at all. For the account that published the first video, the publishing account of the first video may be determined to be a video carrying account; the rating score of the video carrying account may be reduced; the videos already under the video carrying account and its newly added videos may be restricted from being pushed; and in some cases, the video carrying account may be disabled.
In summary, according to the technical scheme provided by the embodiment of the application, the video is subjected to content-based feature information extraction processing, so that the key point of feature extraction can be focused on the video content, and the influence of other irrelevant factors on feature extraction is reduced, so that the extracted video semantic information is less influenced by video editing operation. If the video semantic information between the two videos meets the preset condition, the two videos can be determined to be similar videos, whether the two videos are similar or not is judged by comparing the video semantic information capable of representing the characteristics of the video content, and the accuracy rate of similar video identification can be effectively improved.
The beneficial effects of the technical solutions provided in the embodiments of the present application are further described below in combination with the application background. In order to increase account income and account influence, some content creators in information flow content services upload large amounts of similar content (videos that have been lightly edited and modified, for example by adding watermarks, editing and cropping, adding an advertisement at the beginning or end, or modifying the audio) or directly copy and re-upload the content of other accounts. As a result, transported content crowds out the content of normal users, occupies a large amount of traffic, and does not contribute to the healthy development of the content ecosystem.
For deduplication of massive amounts of text, Simhash can be used: a hash value (a 64-bit integer) is computed with the Simhash algorithm, and two articles are considered similar if the distance between their Simhash values is no greater than 3. The distance is the Hamming distance, i.e., the two Simhash values are XORed once, and if N bits of the result are 1, the distance is N. For cover images, similarity is judged by computing the pHash (perceptual hash) and dHash (difference hash) of the picture, while the video content itself is verified by the MD5 (Message-Digest Algorithm) of the video file.
The scheme based on the video MD5 has very poor robustness to modification, so it cannot effectively resist the editing changes made by black/gray-market producers, such as a cropped or translated picture or a slight change in shooting angle, and its recognition effect is poor. For information flow video content, part of the newly added videos are original video content uploaded by users, and another part is existing platform content that has been transported. Transporting causes videos with the same content to exist on the platform at the same time, which is a heavy blow to original video authors; and even when the content is the same, the frame rate and resolution may still differ, so the MD5 check of the video file cannot detect it. Meanwhile, in the review process, the repeated review of transported similar videos also needs to be optimized: for example, only the version with higher definition and resolution may be reviewed while other versions are not, so as to reduce unnecessary labor.
Manually reviewing transported videos increases cost and reduces efficiency. As the volume of content grows rapidly, if content cannot be reviewed and processed quickly, it cannot be distributed quickly, which greatly harms the user experience and increases the operating and storage pressure on the server. The technical solution of the present application performs content-based feature information extraction on videos uploaded to the information flow content service, which reduces, to a certain extent, the influence of various video modification and editing operations on the extracted feature information and yields video semantic information whose degree of influence by video editing operations meets the preset requirement. Finally, by comparing the video semantic information of two videos, it can be determined whether the two videos are similar and whether a newly uploaded video is a transport video. This effectively reduces the influence of video editing operations on the identification of transport videos, effectively counters the practice of editing videos to evade similarity identification, and effectively improves the accuracy of identifying transport videos.
Referring to fig. 5, a flowchart of a similar video determination method according to an embodiment of the present application is shown. The method can be applied to a computer device, which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the server 20 in the application program running environment shown in fig. 2. The method can include the following steps (501-509).
Step 501, a first video is acquired.
Step 502, inputting the first video into a video semantic extraction model for feature information extraction, so as to obtain a first video semantic feature.
The semantic feature information of the first video comprises semantic features of the first video, and the semantic features of the first video are used for representing the semantic feature information of the first video from the overall dimension of the video. Optionally, the first video semantic features are used to characterize semantic information in video content of the first video from an overall perspective. The influence degree of the first video semantic features by the video editing operation meets the preset requirement.
The video semantic extraction model is a machine learning model trained with triplets as training samples under a triplet loss constraint condition. The positive sample pair of a triplet comprises a first sample video and a second sample video, and the negative sample pair of the triplet comprises the first sample video and a third sample video, where the second sample video is a video obtained by performing a video editing operation on the first sample video, and the third sample video is a video whose content differs from that of the first sample video.
The triplet loss constraint condition may be a preset triplet loss function (triplet loss); the specific setting of the triplet loss function is not limited in the embodiments of the present application and may be determined according to the specific implementation situation. The video semantic extraction model can be trained through deep metric learning, and extracts a video semantic feature vector from the video as a whole. In the embodiments of the present application, the first influence-degree requirement is achieved by setting the triplet loss function.
By constructing the triplets in this manner and constraining the model training with the triplet loss function, the difference between the video semantic features of an edited video obtained by a video editing operation and those of the original video is kept small, so that content-based feature information can be extracted and the influence of video editing operations on the finally extracted features is reduced.
In one example, as shown in fig. 6, a training flow diagram of the video semantic extraction model is exemplarily shown. The training process of the video semantic extraction model comprises the following steps. A video training set is acquired, the video training set comprising a plurality of sample videos. The sample videos in the video training set are grouped to obtain a plurality of triplets. The i-th training triplet comprises a first sample video V_i, a second sample video V_i+ and a third sample video V_i-. The first sample video V_i and the second sample video V_i+ belong to the same category, and the second sample video V_i+ may be a video obtained by performing a video editing operation on the first sample video V_i; the third sample video V_i- and the first sample video V_i belong to different categories. The first sample video V_i, the second sample video V_i+ and the third sample video V_i- are respectively input into a weight-shared DNN (deep neural network) model to obtain the video semantic feature vector F_θ(V_i) of the first sample video V_i, the video semantic feature vector F_θ(V_i+) of the second sample video V_i+ and the video semantic feature vector F_θ(V_i-) of the third sample video V_i-. In the process of training the weight-shared DNN model, a triplet loss function (triplet loss) is used to constrain the output of the weight-shared DNN model, so that the video semantic feature vectors output by the model for two similar videos are close to each other, and the video semantic feature vectors output for two dissimilar videos are far apart.
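The following is a minimal sketch of such a triplet training step in PyTorch; the network architecture, input dimension, margin value and learning rate are illustrative assumptions rather than the configuration used in this application:

```python
import torch
import torch.nn as nn

class VideoSemanticDNN(nn.Module):
    # Weight-shared DNN: the same module processes the anchor, positive and negative videos.
    def __init__(self, in_dim: int = 1536, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = VideoSemanticDNN()
triplet_loss = nn.TripletMarginLoss(margin=0.2)  # margin is an assumed value
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(v_i: torch.Tensor, v_i_pos: torch.Tensor, v_i_neg: torch.Tensor) -> float:
    # F_theta(V_i), F_theta(V_i+), F_theta(V_i-) are produced by the same (weight-shared) model.
    f_a, f_p, f_n = model(v_i), model(v_i_pos), model(v_i_neg)
    # Pull edited copies close to the original, push different-content videos apart.
    loss = triplet_loss(f_a, f_p, f_n)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```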
Optionally, step 502 is a sub-step of step 320 in the above embodiment.
In an exemplary embodiment, as shown in fig. 7, a flow chart of a similar video determination method provided in another embodiment of the present application is shown. The step 502 can be realized by the following steps (502a to 502 d):
step 502a, obtaining a target image set according to at least one video frame in the first video and a video cover of the first video.
Optionally, dense frame extraction is performed on the first video, for example one frame every 1 s (second), resulting in a total of K video frames, where K is a positive integer. The K video frames and the video cover then constitute the target image set. The video frames and the video cover together form the visual modality information of the first video, from which video embedding feature (Video Embedding) information can further be generated; the video frames are the main body of the video content and contain the main content information, while the cover picture is the essence of the video content and supplements the video frames.
And step 502b, embedding each image in the target image set to obtain an embedded feature set of the target image set.
The embedded feature set characterizes visual modality information of the first video.
Optionally, each image in the target image set is input into an embedding processing model pre-trained based on the ImageNet image set, and the embedding feature of each image is extracted through the embedding processing model and output in a vector form to obtain an embedding feature set of the target image set. Optionally, the embedded feature set includes an embedded feature vector corresponding to each target image.
Among classic image classification models such as the VGG (Visual Geometry Group) 16 network model, the Inception (multi-size convolution) series models, and the Residual Network (ResNet), the Inception-ResNet v2 (multi-size convolution-residual network) model is selected as the embedding processing model.
And 502c, extracting the feature information based on the content of each embedded feature in the embedded feature set to obtain embedded semantic features.
Optionally, the embedded feature vector corresponding to each target image in the embedded feature set is input into the deep neural network model sharing the weight to perform content-based feature information extraction, so as to obtain the embedded semantic features of each image, and the embedded semantic features are output in a vector form, that is, the embedded semantic feature vector of each image.
Step 502d, performing average pooling on each embedded semantic feature to obtain a first video semantic feature.
Optionally, performing average pooling on each embedded semantic feature vector, and obtaining a final first video semantic feature by using the average pooling. Optionally, the first video semantic feature is a first video semantic feature vector.
In one example, as shown in fig. 8, a flow chart for determining a video semantic feature vector is exemplarily shown. For any video v, the process of determining the video semantic feature vector F_θ(v) of the video v is as follows:
the video v is divided into K video segments F_1, …, F_k, …, F_K. Dense frame extraction processing is performed on the K video segments, for example one video frame is extracted every second; the video frame of each video segment is input into an Inception-ResNet v2 model trained on the ImageNet training set, and the embedded feature vector of each video frame is output through the Inception-ResNet v2 model.
For the video frame vF_1 in the video segment F_1, the video frame vF_1 is input into the weight-shared Inception-ResNet v2 model, which performs embedding (Embedding) on the video frame vF_1 and outputs the embedded feature vector f_θ(vF_1) of the video frame vF_1.
For the video frame vF_k in the video segment F_k, the video frame vF_k is input into the weight-shared Inception-ResNet v2 model, which performs embedding (Embedding) on the video frame vF_k and outputs the embedded feature vector f_θ(vF_k) of the video frame vF_k.
For the video frame vF_K in the video segment F_K, the video frame vF_K is input into the weight-shared Inception-ResNet v2 model, which performs embedding (Embedding) on the video frame vF_K and outputs the embedded feature vector f_θ(vF_K) of the video frame vF_K.
After the embedded feature vectors of the video frames extracted from the K video segments are obtained, average pooling is performed on the embedded feature vectors of the video frames to obtain the video semantic feature vector F_θ(v) of the video v.
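A minimal sketch of this frame-embedding and average-pooling flow is given below; the frame encoder and semantic DNN are assumed to be provided elsewhere (for example, a pretrained Inception-ResNet v2 backbone and the triplet-trained weight-shared DNN), and the function name is illustrative:

```python
import torch

def video_semantic_feature(frames: torch.Tensor,
                           frame_encoder: torch.nn.Module,
                           semantic_dnn: torch.nn.Module) -> torch.Tensor:
    """frames: (K, 3, H, W) tensor of frames sampled densely, e.g. one per second.

    frame_encoder stands in for the weight-shared embedding model (assumed to be
    Inception-ResNet v2), semantic_dnn for the weight-shared DNN trained with the
    triplet loss; both are assumptions supplied by the caller.
    """
    with torch.no_grad():
        frame_embeddings = frame_encoder(frames)      # f_theta(vF_1) ... f_theta(vF_K)
        frame_semantics = semantic_dnn(frame_embeddings)
        # Average pooling over the frame axis yields the video semantic feature F_theta(v).
        return frame_semantics.mean(dim=0)
```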
Step 503, extracting feature information based on image content from the video frame of the first video to obtain a first image semantic feature sequence.
The semantic feature information of the first video comprises a first image semantic feature sequence used for representing the semantic feature information of the first video from the video frame dimension. In some embodiments, the first image semantic feature sequence includes image semantic features corresponding to at least one video frame in the first video.
A first image semantic feature in the first image semantic feature sequence is a feature whose degree of influence by image editing operations is low enough to meet the second influence-degree requirement.
The second influence degree requirement is a constraint condition that the difference degree between the image semantic features corresponding to the video frame image after the video editing operation and the image semantic features corresponding to the original video frame image is lower than a threshold value. Optionally, the second degree of influence is required to limit the degree of difference between the image semantic features of the images of the transport video frame and the original video frame to be lower than a threshold value.
In some application scenarios, the mathematical expression of the image semantic features is usually an image semantic feature vector corresponding to the video frame image. Accordingly, the second degree of influence requirement includes: and the similarity between the semantic feature vectors of the video frame image after the video editing operation and the corresponding image of the original video frame image is smaller than a similarity threshold. Optionally, the similarity is any one of cosine similarity, euclidean distance, edit distance, or manhattan distance between two semantic feature vectors of the image. It should be noted that, in the embodiment of the present application, the determination method of the similarity is not limited, and an appropriate similarity calculation method may be selected according to actual situations.
In some application scenarios, the manner and content of feature information extraction may be determined according to the video duration. Optionally, under the condition that the video duration meets a preset duration condition, extracting feature information based on image content from a video frame of the first video to obtain a first image semantic feature sequence. The preset duration condition is used for determining the extraction mode and content of the feature information according to the video duration. Optionally, the preset duration condition includes that the video duration is less than or equal to a duration threshold. Correspondingly, the video time length of the first video meets the preset time length condition, that is, the video time length of the first video is smaller than or equal to the time length threshold. For some short videos with short duration, in addition to determining the first video semantic feature of the short time frequency to perform feature comparison, the feature comparison can be performed by determining the first image semantic feature sequence of the short video to judge whether the two videos are similar.
It should be noted that, for a video whose video duration does not meet the preset duration condition, step 502 may be executed separately to obtain video semantic features of the video, so as to perform fast feature comparison. However, this does not mean that the video with the video duration meeting the preset duration condition cannot perform step 502, step 503 may be a step of performing feature information extraction for the short video, and step 502 may be a step of performing fast feature information extraction for the long video, but the short video may still use step 502 to perform fast feature information extraction.
Optionally, when the video duration of the first video does not meet the preset duration condition, the feature information extraction based on the image content may not be performed on the video frame of the first video, but only the first video is input into the video semantic extraction model for feature information extraction, and the subsequent feature comparison step is performed on the obtained first video semantic features.
In some embodiments, step 502 may not be executed, and only step 503 is executed, and correspondingly, the preset duration condition may also be an unlimited duration condition, that is, step 503 is executed for the first video to be put in storage, and only the image semantic feature sequence of the video is extracted to perform subsequent feature comparison.
In some embodiments, a compact image feature extraction algorithm insensitive to various image transformations is adopted, so that the distances between the image semantic features extracted from variously edited and transformed pictures remain small (below a distance threshold), and such edited and transformed pictures can still be matched and recalled.
In one possible implementation, among classic image classification models such as VGG16, the Inception series models, and ResNet, Inception-ResNet v2 is selected as the image semantic extraction model. The video frames of the first video are input into the image semantic extraction model for image-content-based feature information extraction, so as to obtain the image semantic feature vector of each image. The image semantic feature vectors can be arranged in time order to form the image semantic feature sequence.
Optionally, step 503 is a sub-step of step 320 in the above embodiment.
In some application scenarios, step 503 and step 502 may not be performed at the same time. In some embodiments, the content-based feature information extraction can be completed only by performing the above step 502 or step 503, so as to obtain the semantic feature information of the first video. In the case where step 502 is performed and step 503 is not performed, the semantic feature information includes only video semantic features. Correspondingly, whether the two videos are similar videos can be determined only by carrying out similarity judgment on the video semantic features of the two videos. In the case where step 503 is performed and step 502 is not performed, the semantic feature information includes only the image semantic feature sequence. Correspondingly, whether the two videos are similar videos can be determined only by carrying out similarity judgment on the image semantic feature sequences of the two videos. There is no timing limitation between the above steps 502 and 503, and the deployment can be performed according to the actual application scenario.
In an exemplary embodiment, since the redundancy between adjacent video frames in a video is high, performing image-content-based feature information extraction and storage for every video frame of the video content is inefficient, so a stable target frame extraction scheme is provided. As shown in fig. 7, the step 503 includes steps (503a to 503d), wherein the steps 503a to 503c are used for extracting the target frames in the first video.
Step 503a, performing frame extraction on the first video to obtain a plurality of video frames.
For example, the frame decimation operation may be to decimate a video frame every second of the video, resulting in multiple video frames.
In step 503b, for each video frame, the inter-frame difference information between the video frame and the adjacent video frame is determined.
Optionally, the inter-frame difference information includes a difference value. In a possible implementation manner, for each video frame, a difference value between the video frame and a previous frame and a next frame (i.e., an adjacent frame) of the video frame is calculated, for example, a difference value between corresponding pixel points in the video frame and the adjacent frame is calculated, then, absolute values of the difference values are taken, the absolute values are summed to obtain a sum of the absolute values, and then, an average value is taken as the difference value according to the sum of the absolute values. The embodiment of the application does not limit the calculation mode of the difference value, and the determination mode of the difference value or the difference information can be selected according to the actual situation.
And 503c, screening the plurality of video frames according to the interframe difference information to obtain a target frame sequence.
In a possible implementation manner, the difference values corresponding to the video frames are arranged in a descending order, and N video frames corresponding to the first N difference values are taken as target frames, where N is a positive integer. Optionally, the N video frames are arranged according to their respective timestamps and a time sequence, so as to obtain a target frame sequence.
For each type of transform on a video image, the ordering of differences between previous and subsequent frames of the video is not changed, and thus the above method is "stable" in most cases.
It should be noted that some video segments may show the same scene for a long time and then switch scenes many times within a short final period (for example, a simple slideshow video). In this case, the extracted target frames may all be concentrated on the scene-switching frames in that short period, while only one or two target frames are extracted from the long preceding scene. Therefore, a fallback frame extraction scheme is used to ensure that at least one target frame is extracted every 5 s (seconds).
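A hedged sketch of the target frame selection in steps 503a to 503c, including the 5-second fallback just mentioned, is given below; the difference measure and all parameter values are illustrative assumptions:

```python
import numpy as np

def select_target_frames(frames: list, n_target: int, fallback_window: int = 5) -> list:
    """frames: one frame per second, each a uint8 numpy array of identical shape.
    Returns the indices of the selected target frames in time order."""
    diffs = []
    for i, frame in enumerate(frames):
        neighbors = [frames[j] for j in (i - 1, i + 1) if 0 <= j < len(frames)]
        # Mean absolute pixel difference against the adjacent frames.
        d = np.mean([np.abs(frame.astype(np.int16) - nb.astype(np.int16)).mean()
                     for nb in neighbors])
        diffs.append(d)
    # Keep the N frames with the largest inter-frame differences.
    selected = set(np.argsort(diffs)[::-1][:n_target].tolist())
    # Fallback: ensure every 5-second window contributes at least one target frame.
    for start in range(0, len(frames), fallback_window):
        window = range(start, min(start + fallback_window, len(frames)))
        if not selected.intersection(window):
            selected.add(max(window, key=lambda j: diffs[j]))
    return sorted(selected)
```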
Step 503d, extracting feature information based on image content for each target frame in the target frame sequence to obtain a first image semantic feature sequence.
Optionally, each target frame in the target frame sequence is input to the image semantic extraction model for extracting feature information based on image content, so as to obtain an image semantic feature vector of each target frame. Each image semantic feature vector can be arranged in time sequence to form an image semantic feature sequence.
Step 504, determining the semantic features of the second video in the semantic feature information of the second video under the condition that the semantic feature information of the first video comprises the semantic features of the first video.
In a possible implementation manner, the semantic feature information of the second video is already stored in the database, the semantic feature information of the second video includes a semantic feature of the second video, and the semantic feature of the second video may be determined according to an index of the second video.
In step 505, if the semantic features of the first video and the semantic features of the second video meet a first preset condition, it is determined that the first video is a content-similar video of the second video.
In a possible implementation, the first preset condition may be that the characteristic distance is smaller than a characteristic distance threshold. And determining a feature distance between the first video semantic feature and the second video semantic feature, and judging whether the first video semantic feature is similar to the second video semantic feature or not according to the feature distance. For example, if the feature distance is smaller than the feature distance threshold, it may be determined that the semantic feature of the first video is similar to the semantic feature of the second video, and under the condition that the semantic feature of the first video is similar to the semantic feature of the second video, it is determined that the first video is a content-similar video of the second video.
In another possible implementation, the first preset condition may be that the vector distance is smaller than a vector distance threshold. And determining the vector distance between the first video semantic feature vector and the second video semantic feature vector, and judging whether the first video semantic feature vector is close to the second video semantic feature vector or not according to the vector distance. For example, if the vector distance is smaller than the vector distance threshold, it may be determined that the first video semantic feature vector is similar to the second video semantic feature vector, and under the condition that the first video semantic feature vector is similar to the second video semantic feature vector, it is determined that the first video is a content-similar video of the second video.
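A minimal sketch of such a feature-distance comparison is given below; the distance measure and threshold value are illustrative assumptions:

```python
import numpy as np

def is_content_similar(feat_a: np.ndarray, feat_b: np.ndarray,
                       distance_threshold: float = 0.3) -> bool:
    # Distance between the two video semantic feature vectors (Euclidean here);
    # the threshold value is illustrative only.
    distance = float(np.linalg.norm(feat_a - feat_b))
    return distance < distance_threshold
```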
The content-similar video refers to video with similar video content. Alternatively, the content-similar video is a video whose picture content is similar.
Step 506, in the case that the semantic feature information of the first video includes the semantic feature sequence of the first image, determining a semantic feature sequence of a second image in the semantic feature information of the second video.
In a possible implementation, the semantic feature information of the second video is already stored in the database, and the semantic feature information of the second video includes a second image semantic feature sequence, and the second image semantic feature sequence may be determined according to an index of the second video.
Step 507, if the first image semantic feature sequence and the second image semantic feature sequence meet a second preset condition, determining that the first video is a content similar video of the second video.
In a possible implementation manner, the second preset condition includes that a difference degree between the first image semantic feature sequence and the second image semantic feature sequence is smaller than a preset difference degree, or a similarity degree between the first image semantic feature sequence and the second image semantic feature sequence is greater than or equal to the preset similarity degree.
Optionally, at least one image semantic feature in the first image semantic feature sequence is compared with at least one image semantic feature in the second image semantic feature sequence, the number of similar image semantic features in the first image semantic feature sequence and the second image semantic feature sequence is determined according to the feature distance between the image semantic features, and then the proportion of the similar image semantic features is determined, if the proportion is higher than a proportion threshold, the similarity degree between the first image semantic feature sequence and the second image semantic feature sequence is determined to be greater than or equal to a preset similarity degree, that is, the first image semantic feature sequence and the second image semantic feature sequence meet a second preset condition.
In an exemplary embodiment, as shown in fig. 7, the above step 507 may be implemented by the following steps (507a to 507 c).
And step 507a, determining a matching feature pair according to the first image semantic feature sequence and the second image semantic feature sequence.
The distance between a first image semantic feature in the pair of matching features and a second image semantic feature in the pair of matching features is less than a distance threshold.
In one possible mode, at least one image semantic feature in the first image semantic feature sequence is compared with at least one image semantic feature in the second image semantic feature sequence respectively, the distance between the first image semantic feature and the second image semantic feature is determined, and if the distance between the first image semantic feature and the second image semantic feature is smaller than a distance threshold value, the first image semantic feature and the second image semantic feature can be determined as a matching feature pair.
And step 507b, determining a matching target frame according to the matching feature pair.
The matching target frames include a first target frame and a second target frame.
The first target frame is a video frame corresponding to the semantic features of the first image in the first video, and the second target frame is a video frame corresponding to the semantic features of the second image in the second video.
In one possible implementation, according to a first image semantic feature in the matching feature pair, a first target frame corresponding to the first image semantic feature is determined; and determining a second target frame corresponding to the semantic features of the second image according to the semantic features of the second image in the matching feature pair.
And step 507c, if the number of the matched target frames meets the number condition, determining that the first video is a content similar video of the second video.
The number condition may be that the number of matching target frames is higher than a number threshold, or that a ratio of the matching target frames calculated based on the number of matching target frames is higher than a ratio threshold.
In the above process, the ratio of matched target frames is calculated and thresholds are set; for example, if 10 frames are extracted from each video and 6 of them are matched, the videos can be considered matched similar videos. Optionally, the ratio threshold of matched target frames is 0.6. Besides judging similar videos by extracting the image-level semantic feature information corresponding to the video frames, the judgment may also be made in combination with other manners.
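A hedged sketch of the matched-frame ratio calculation is given below; the matching rule and threshold values are illustrative assumptions:

```python
import numpy as np

def frame_match_ratio(seq_a: np.ndarray, seq_b: np.ndarray,
                      frame_distance_threshold: float = 0.3) -> float:
    """seq_a: (Na, D) image semantic features of video A's target frames,
    seq_b: (Nb, D) features of video B's target frames. A frame of A counts as
    matched when some frame of B lies within the distance threshold."""
    matched = 0
    for fa in seq_a:
        distances = np.linalg.norm(seq_b - fa, axis=1)
        if distances.min() < frame_distance_threshold:
            matched += 1
    return matched / len(seq_a)

# e.g. with 10 extracted frames and 6 matched frames, the ratio 0.6 reaches the
# assumed threshold and the two videos are judged content-similar.
```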
The image semantic features take scaling, cropping, mirroring, color difference and the like into consideration. In some practical services, there are more advanced processing methods used to evade inspection, such as: cropping the top and bottom, adding black borders, adding subtitles over a large area, and picture-in-picture. For such transport methods, in addition to judging similar videos by extracting the image-level semantic feature information corresponding to the video frames, similar videos may be determined in combination with other manners, for example by extracting the overall video semantic feature information.
For some transported videos, certain processing of the video may cause deviations in target frame extraction; in this case, similar videos may also be determined in combination with other manners, for example by extracting the overall video semantic feature information.
In some implementation scenarios, the storage pressure of video frames is very high. For videos longer than 5 minutes the content is larger, and even if redundancy is reduced by extracting target frames, one video usually still requires dozens of frames, which means that billions of features would need to be stored; the storage overhead is too large and the implementation cost is high. Therefore, the manner of extracting the image-level semantic feature information corresponding to the video frames to determine similar videos is usually adopted for short video content, while for long videos, similar videos can be determined by extracting the overall video semantic feature information.
Step 508, determining a first audio feature sequence of the first video and acquiring a second audio feature sequence of the second video under the condition that the semantic feature information of the first video and the semantic feature information of the second video meet preset conditions.
Optionally, the first video is a content-similar video of the second video, that is, it can be shown that the semantic feature information of the first video and the semantic feature information of the second video meet a preset condition.
In an exemplary embodiment, as shown in fig. 7, the above step 508 may alternatively be implemented by the following step 508 a.
Step 508a, in the case that the first video is a video with similar content to the second video, determining a first audio feature sequence of the first video, and acquiring a second audio feature sequence of the second video.
In some embodiments, there are videos, such as training videos, lecture videos and weather forecast videos, whose main pictures are very close to each other but whose audio differs. In this case, judging transported videos and repeated videos only according to the content-based feature information easily causes misjudgment, so verification needs to be performed through audio feature matching.
In an exemplary embodiment, the above process of determining the first audio feature sequence of the first video includes: acquiring audio data corresponding to a first video; carrying out frequency domain conversion processing on the audio data to obtain frequency domain characteristics of the audio data; based on the frequency domain features, a first sequence of audio features is generated.
The manner of acquiring the audio data corresponding to the first video may be to separate the audio data corresponding to the first video from the video data of the first video. The frequency domain conversion processing on the audio data may be mapping the audio time domain data to a frequency domain space, for example, fourier transform, to obtain frequency domain features of the audio data.
The first audio feature sequence may be considered as a hash value of the audio, the same audio having the same audio feature sequence, and different audio having different audio feature sequences. Unlike hash values, however, the audio signature sequence of the audio of a video file is not a single number or string of characters, but rather a sequence of numbers with time attributes attached. Alternatively, a Landmark (feature point extraction) algorithm, a Chromaprint (chroma print) algorithm, and an Echoprint (open source music fingerprint) algorithm are adopted as the audio feature sequence calculation method.
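A minimal sketch of the frequency-domain conversion step described above is given below, assuming the audio has already been separated from the video and decoded into a mono sample array; the segment length and windowing are illustrative assumptions:

```python
import numpy as np

def audio_frequency_features(samples: np.ndarray, sample_rate: int,
                             segment_seconds: float = 1.0) -> np.ndarray:
    """Split the audio into fixed-length segments and map each segment to the
    frequency domain with an FFT, yielding one frequency-feature row per segment."""
    seg_len = int(sample_rate * segment_seconds)
    n_segments = len(samples) // seg_len
    features = []
    for k in range(n_segments):
        segment = samples[k * seg_len:(k + 1) * seg_len].astype(np.float64)
        # Window the segment before the FFT to reduce spectral leakage.
        spectrum = np.abs(np.fft.rfft(segment * np.hanning(seg_len)))
        features.append(spectrum)
    return np.stack(features)
```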
In step 509, in the case that the similarity between the first audio feature sequence and the second audio feature sequence meets the audio similarity condition, it is determined that the first video is a similar video of the second video.
The audio similarity condition includes that the similarity between the first audio feature sequence and the second audio feature sequence is smaller than a preset similarity threshold.
In one possible implementation, a sequence value distance between the first audio feature sequence and the second audio feature sequence is determined as the similarity, the sequence value distance is compared with a sequence value threshold distance (as a similarity threshold), and if the sequence value distance is smaller than the sequence value threshold distance, it can be determined that the similarity between the first audio feature sequence and the second audio feature sequence meets the audio similarity condition. Accordingly, the audio similarity condition includes a sequence value distance between the first audio feature sequence and the second audio feature sequence being less than a sequence value threshold distance.
In one example, as shown in fig. 9, a flow chart for extracting a sequence of audio features is illustrated. As shown in fig. 9, the method obtains the audio feature sequence through a filter-based extraction algorithm, and comprises the following steps:
step 1, video is obtained. For example, a source file of a video is obtained.
And 2, extracting the audio of the video from the video. For example, an audio file of a video is extracted from a source file of the video. For example, audio is extracted by FFmpeg (Fast Forward Mpeg, multimedia video processing tool).
And 3, storing the audio in an Object Storage service (COS). The object storage service is a distributed storage service for storing massive files, and has the advantages of high expansibility, low cost, reliability, safety and the like. Through diversified modes such as a console, an Application Programming Interface (API), a Software Development Kit (SDK), tools and the like, a user can simply and quickly access the COS, upload, download and manage multi-format files, and mass data storage and management are realized.
And 4, acquiring the audio from the object storage service.
And 5, dividing the audio into a plurality of audio segments. For example, audio may be divided into overlapping audio segments. The duration of an audio clip may be 1 second, with about 6 samples per second of audio, each sample point corresponding to 32bit feature data.
And 6, converting each audio segment into a spectrogram. The spectrogram is used for representing the change of the audio energy in each audio segment over time. For example, the Short-Time Fourier Transform (STFT) is used to convert the audio segments into spectrograms. The short-time Fourier transform is a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the local sinusoid of a time-varying signal. A time-frequency localized window function g(t) is selected and assumed to be stationary (pseudo-stationary) within a short time interval; the window function is moved so that f(t)g(t) is a stationary signal within different limited time widths, and the power spectrum at different moments is thereby calculated. The short-time Fourier transform uses a fixed window function whose shape does not change once determined, so its resolution is fixed; to change the resolution, the window function needs to be reselected. As a windowed Fourier transform, the STFT makes the signal effective only within a certain interval through the time window, which gives the Fourier transform the ability of local positioning. The audio waveform corresponding to an ordinary audio segment does not describe well how the intensity of a specific frequency varies with time, so this embodiment converts it into a spectrogram, which can describe the variation of the intensity of specific frequencies over time.
And 7, converting the spectrogram into a note map. For example, the spectral energy within a certain frequency band range (e.g., 200-2000 Hz) can be quantized by the Chromaprint algorithm into M note classes (e.g., 12 note classes), each corresponding to a frequency range. A 'Chroma' feature can thus be obtained, which is essentially a 1×M one-dimensional feature vector representing the melody information of the audio; the note map shows the change of the Chroma feature over time. Chromaprint provides a common client library that can compute 64-bit audio fingerprint values through a particular algorithm.
And 8, screening out filters based on the training data. When determining the filters, several filters may be selected based on the training data using an Asymmetric Pair Boosting (APB) algorithm.
And 9, performing binarization filtering on the note graph by using the screened filter.
And step 10, outputting the audio characteristic sequence. For each audio segment, a 64-bit binary fingerprint sequence is obtained through binarization filtering. For example, 1001011011.
And step 11, storing the audio characteristic sequence of the video into a key value storage service. The Key-Value storage service is a high-performance, low-latency, persistent, distributed KV (Key-Value) storage service, and is compatible with open-source protocols such as Redis (Remote Dictionary Server), Memcached (distributed cache system), and the like.
And step 12, acquiring the audio feature sequence pair to be detected from the key value storage service.
And step 13, determining the similarity based on the editing distance.
The edit distance quantifies the difference between two strings (e.g., English text) by the minimum number of edit operations required to change one string into the other.
The edit distance d between the two audio feature sequences in the pair of audio feature sequences to be detected is determined, where a substitution operation counts as a distance of 2. The lengths of the two audio feature sequences are denoted l1 and l2 respectively; the similarity (Similarity) can then be calculated by the following formula:
Similarity = 1 - d / (l1 + l2)
And step 14, judging whether the similarity meets a preset similarity threshold condition.
If so, judging that the videos corresponding to the two audio characteristic sequences are similar, otherwise, judging that the videos corresponding to the two audio characteristic sequences are not similar.
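A hedged sketch of the edit-distance-based similarity in steps 13 and 14 is given below, using the substitution cost of 2 and the Similarity formula above; representing each fingerprint sequence as a list of integers is an illustrative assumption:

```python
def edit_distance(seq_a: list, seq_b: list, substitution_cost: int = 2) -> int:
    # Standard dynamic-programming edit distance; insert/delete cost 1, substitution cost 2.
    la, lb = len(seq_a), len(seq_b)
    dp = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        dp[i][0] = i
    for j in range(lb + 1):
        dp[0][j] = j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            sub = 0 if seq_a[i - 1] == seq_b[j - 1] else substitution_cost
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub)
    return dp[la][lb]

def audio_similarity(seq_a: list, seq_b: list) -> float:
    d = edit_distance(seq_a, seq_b)
    return 1 - d / (len(seq_a) + len(seq_b))  # Similarity = 1 - d / (l1 + l2)
```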
In an exemplary embodiment, management of video representation vectors for multiple videos may be achieved through a distributed vector retrieval service, wherein a Faiss library may be utilized to provide an efficient similarity search and clustering service for dense video representation vectors.
The second video is any one of videos in a video database, the video database comprises a characteristic information base and an online index base, the online index base comprises stock indexes and incremental indexes, the stock indexes comprise index information of the stock videos, the stock videos are videos in a target historical period, the incremental indexes comprise index information of the incremental videos, and the incremental videos are videos newly added after the target historical period. Optionally, the online index library is a Faiss library.
The target history period is a set historical period, for example the data of the previous 90 days. The target history period may change with time; the duration of the period does not change, only the start time and the end time of the target history period change.
Accordingly, as shown in fig. 7, the method further includes:
and 510, acquiring target index information of the second video from the stock index in the case that the second video is the stock video.
If the second video is a stock video, the target index information of the second video is located in the stock index; therefore, when the second video is a stock video, the target index information of the second video needs to be acquired from the stock index. Optionally, the stock videos are videos uploaded within the 89 days before the current day.
Index (Index): the index is used for pointing to one or more data in the database, and when the data volume stored in the database is huge, the index can greatly accelerate the query speed. In the embodiment of the present application, the video representation vector and the index of the video representation vector may be stored in a special index library.
And 511, acquiring semantic feature information of the second video from the feature information base based on the target index information.
Optionally, the target index information of the second video has a corresponding relationship with the semantic feature information of the second video, and the target index information may point to the position information of the semantic feature information of the second video in the feature information base.
In a possible implementation manner, the semantic feature information of the second video includes at least one of a second image semantic feature sequence, a video semantic feature and an audio feature sequence, and the second image semantic feature sequence, the video semantic feature or the audio feature sequence may also be used as the target index information of the second video. Accordingly, in a variation of the above-described implementation of step 510, in the case that the second video is the stock quantity video, the semantic feature information of the second video is obtained from the stock quantity index.
And step 512, under the condition that the second video is the incremental video, acquiring target index information of the second video from the incremental index set.
If the second video is an incremental video, the target index information of the second video is located in the incremental index, and therefore, when the second video is an incremental video, the target index information of the second video needs to be acquired from the incremental index. Optionally, the incremental video is a video uploaded on the current day.
Step 513, based on the target index information, obtaining semantic feature information of the second video in the feature information base.
Optionally, the target index information of the second video has a corresponding relationship with the semantic feature information of the second video, and the target index information may point to the position information of the semantic feature information of the second video in the feature information base.
In a possible implementation manner, the semantic feature information of the second video includes at least one of a second image semantic feature sequence, a video semantic feature and an audio feature sequence, and the second image semantic feature sequence, the video semantic feature or the audio feature sequence may also be used as the target index information of the second video. Accordingly, in a variation of the implementation manner of step 512, when the second video is an incremental video, the semantic feature information of the second video is obtained from the incremental index.
In an exemplary embodiment, the online index store is a first index store and the target historical period is a first historical period. Accordingly, as shown in fig. 7, the method further includes:
at step 514, a second history period is determined.
The second history period takes the latest time at which a video is allowed to be added to the first index library as its right-boundary time node, and the time span of the second history period is the same as that of the first history period.
Step 515, index reconstruction is performed based on the first index library to obtain a second index library.
The target history period of the second index repository is a second history period. And performing index reconstruction on the video in the second historical time period to obtain a second index database.
At step 516, the online index library is switched from the first index library to the second index library.
In some embodiments, the oldest 1 day of data is eliminated from the index library every day. Deleting a large amount of data from the Faiss library seriously affects performance, so a double-index-library switching mechanism is constructed to avoid the performance impact caused by deleting a large amount of data from the index library. The online index library may be switched from the first index library to the second index library according to a switching period; in the next period, the online index library is switched from the second index library back to the first index library.
In one example, as shown in fig. 10, a schematic diagram of a distributed vector retrieval service is illustrated. Since millions of videos may be newly added to a content link every day, the videos need to be added to a database in real time, and video semantic features, image semantic features and audio feature sequences of the videos are added to an index library (here, a Faiss index library is taken as an example) for performing similar calculation, that is, the Faiss index library includes video semantic features, image semantic features, audio feature sequences of the videos and indexes of the videos. In the embodiment of the present application, the generation manner of the index is not limited, for example, the video semantic feature, the image semantic feature, and the audio feature sequence of the video may be used as the index of the video, or the video semantic feature, the image semantic feature, and the audio feature sequence of the video may be subjected to hash processing (by means of a hash function body of Faiss), and the obtained hash value is used as the index of the video. In the distributed retrieval process, in order to avoid adverse performance influence caused by a large amount of mixed reading and writing of the Faiss index library, a reading and writing separation mechanism is applied in the embodiment of the application. The specific rule is as follows:
(1) 2 sets of indexes, stock indexes and increment indexes are established. The stock index is read only, and the data of the previous 89 days is saved (the validity period of searching the similar repeated content library is assumed to be within 3 months, which can be determined by the product policy). The increment index is read and written simultaneously, and the real-time data of the latest 1 day is stored;
(2) For each newly stored video, 1 index entry is written into the incremental index; the stock index and the incremental index are retrieved simultaneously, and the retrieval results are combined;
As for how outdated index data is eliminated: considering the detection period of transported content, the video repetition range is assumed to be 90 days, so the oldest 1 day of data is eliminated from the sample pool each day. Since deleting a large amount of data from the Faiss library seriously affects performance, this embodiment adopts a double-cache switching mechanism, which is specifically as follows:
First, one set of index libraries is used for online service calls, namely the online index library (the first index library); the other set of index libraries (the second index library) rebuilds its index offline using the latest 90 days of data and, once ready, is switched in as the online index library. The first index library is thereby retired, continues to rebuild its index offline with the latest 90 days of data, and the process repeats according to the switching period.
The index is rebuilt every day to maintain vector retrieval precision; the index names are stored in a Redis database, and the Redis index state is modified during switching.
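A minimal sketch of the stock/incremental read-write separation and combined retrieval described in (1) and (2) above is given below, using a flat Faiss index as an assumed index type; the dimension and function names are illustrative:

```python
import faiss
import numpy as np

dim = 128                                 # assumed feature dimension
stock_index = faiss.IndexFlatL2(dim)      # read-only: vectors of the previous 89 days
delta_index = faiss.IndexFlatL2(dim)      # read/write: vectors added on the current day

def add_new_video(feature: np.ndarray) -> None:
    # Each newly stored video is written into the incremental index only.
    delta_index.add(feature.reshape(1, dim).astype("float32"))

def search_similar(feature: np.ndarray, top_k: int = 10):
    query = feature.reshape(1, dim).astype("float32")
    # Retrieve from both indexes and merge the results by distance.
    d_stock, i_stock = stock_index.search(query, top_k)
    d_delta, i_delta = delta_index.search(query, top_k)
    merged = sorted(
        [(d, ("stock", i)) for d, i in zip(d_stock[0], i_stock[0]) if i != -1] +
        [(d, ("delta", i)) for d, i in zip(d_delta[0], i_delta[0]) if i != -1],
        key=lambda item: item[0],
    )
    return merged[:top_k]
```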
In the face of complex vector retrieval scenarios with different purposes, multiple sample libraries need to be managed, such as the image semantic feature vectors of video frames, the video semantic feature vectors and the audio feature sequences, each with its corresponding video content validity period. An abstract common component, the Faiss Manager (vector retrieval service management component), is completely decoupled from the services and serves as a general Faiss vector management framework to manage the different vector libraries, so that different feature vector retrieval and recall processes can be reused. It has the general characteristics of high abstraction, reuse by multiple parties with one set of components for reading and writing, support for standardized access, separation of reading and writing between the stock index and the incremental index without affecting performance, seamless online switching of the double index libraries, and horizontal expansion of all components, so that transported videos can be identified very efficiently.
Next, the various modules included in the Faiss Manager will be described:
1) version management: the index management system is used for managing indexes of different versions, such as indexes in a library 1 and a library 2 in the figure, and when a switching period arrives, the Faiss Proxy is informed to apply the index of a new version.
2) Model training: used for managing feature vectors of different versions; for example, different versions of the model used to perform feature extraction on the video are different, and the resulting feature vectors are also different.
3) Configuration management: the method is mainly used for recording addresses of equipment (such as servers) and modules (modules with independent functions are deployed on the servers, a plurality of modules can be deployed in one server, for example, a content reading module and a content writing module are deployed in the servers for storage), and the method is also used for recording relevant storage size and configuration information of a table structure.
4) Incremental sampling: for extracting portions (not shown) from the incremental index space for analysis and locating problems. For example, a portion of the index may be randomly sampled from the delta index to identify similar ones of the videos corresponding to the portion of the index. The incremental sampling module is also used for recording which part is extracted, the number of extracted indexes, the extraction time and the like.
5) File management: used for reading video representation vectors from a Cloud DataBase (CDB) and building the corresponding indexes. Here, the CDB may serve as a storage instance of the MySQL database for storing the feature vectors of a small number of videos, for example the multi-dimensional feature vectors of the videos released on the current day, and the CDB may be stand-alone; the Faiss index library is a distributed storage used for storing the multi-dimensional feature vectors and corresponding indexes of a large number of videos, for example those of the videos within several months. The file management module may further be configured to store the index in a sharded manner, that is, divided into several blocks, where each block contains a plurality of files (FILE). After the files are stored in shards, the file management module may further be configured to record the original file information of FILE_1 to FILE_N, including but not limited to the file size, file saving format, file version, file validity period, and file creation time.
In summary, according to the technical scheme provided by the embodiment of the application, the similar video is determined in two ways, one way is to perform content-based feature extraction on the video through a trained video semantic extraction model, so as to obtain video semantic features capable of representing the whole content of the video, and the similar video is determined by comparing the video semantic features between the two videos so as to be convenient for recall; the other method is that a plurality of image semantic features of the video are obtained by extracting the features of the video frames in the video based on the image content, and similar videos are determined by comparing the matching degree of the image semantic features of the two videos so as to be convenient for recall; the video can be determined as a transport video as long as the video meets the similar video determination rule of any one of the two modes; finally, the audio characteristics of the video are utilized to further verify the carried video, so that misjudgment is avoided, the determination efficiency of the similar video is improved, and the accuracy of determining the similar video can be guaranteed.
In some application scenarios, for example in an information flow content service, the similar video determination method provided by the embodiments of the present application can make full use of the information of each dimension of the video content for deep machine learning. By adopting the image semantic feature vector matching method for video frames together with video semantic feature extraction based on Deep Metric Learning (DML) of the video content, obtaining a video semantic feature vector that can represent the overall content features of the video through average pooling after the embedded features of all the video frames pass through the weight-shared DNN network, and extracting audio feature vectors with the Chromaprint algorithm, multi-path recall of approximately duplicated transported content and verification by audio feature vectors are realized, so that transported video content and its authors have nowhere to hide and the healthy development of the content ecosystem is ensured. The method provided by the embodiments of the present application can effectively cope with the various modifications and edits (including pictures, audio and video content) that video content transporters apply to bypass repetition and similarity checks. When the original account has not yet been introduced, the weight of the transporting account is reduced during recommendation and distribution, or its distribution is limited or even cancelled, so that the introduction of original accounts is accelerated and traffic can be concentrated on real content creators. On the content auditing link, because auditing resources are limited, in order to let the content of original head accounts be processed and distributed as soon as possible, the transporting accounts are placed at the end of the auditing schedule or low-quality transported content is directly forbidden during auditing scheduling, so that the whole content ecosystem enters a virtuous cycle and the living space of transporting accounts is compressed.
In some application scenarios, step 503 and step 502 in the above embodiments may not be performed at the same time or performed differently, which is described below.
In one possible implementation, as shown in fig. 11, a flowchart of a similar video determination method provided in an embodiment of the present application is shown. In the embodiment shown in fig. 11, only the content corresponding to the step 502 is needed to be executed to complete the content-based feature information extraction, so as to obtain the video semantic features of the first video. Specifically, the method comprises the following steps (1101-1106).
Step 1101, a first video is acquired.
Step 1102, inputting the first video into a video semantic extraction model for feature information extraction, so as to obtain a first video semantic feature.
Step 1103, determining semantic features of the second video in the semantic feature information of the second video.
In step 1104, if the semantic features of the first video and the semantic features of the second video meet a first preset condition, it is determined that the first video is a content-similar video of the second video.
Step 1105, under the condition that the first video is a video with similar content to the second video, determining a first audio feature sequence of the first video, and obtaining a second audio feature sequence of the second video.
In step 1106, under the condition that the similarity between the first audio feature sequence and the second audio feature sequence meets the audio similarity condition, the first video is determined to be a similar video of the second video.
In this embodiment, content-based feature information extraction processing is performed on the video, which can reduce, to a certain extent, the influence of various video modification and editing operations on the feature extraction result; video semantic information whose degree of influence by video editing operations meets the preset requirement is thus obtained. Finally, whether two videos are similar videos can be determined by comparing their video semantic information, so that the influence of video editing operations on the similar-video recognition result is effectively reduced, the behavior of evading similar-video recognition by clipping videos is effectively combated, and the accuracy of similar video recognition can be effectively improved.
In another possible implementation, as shown in fig. 12, a flowchart of a similar video determination method provided in another embodiment of the present application is shown. In the embodiment shown in fig. 12, the corresponding contents of step 502 and step 503 are performed differently. Specifically, the method comprises the following steps (1201-1210).
Step 1201, a first video is acquired.
Step 1202, determining a video duration of a first video.
Step 1203, under the condition that the video duration meets the preset duration condition, extracting feature information based on image content from the video frame of the first video to obtain a first image semantic feature sequence.
The preset duration condition is used for determining the extraction mode and content of the feature information according to the video duration. The preset duration condition has already been described in the above embodiments, and is not described herein again.
In a possible embodiment, the preset duration condition is that the video duration is less than or equal to a duration threshold. Accordingly, the video duration of the first video meeting the preset duration condition means that the video duration of the first video is less than or equal to the duration threshold.
If the video duration of the first video is less than or equal to the duration threshold, image-content-based feature information extraction is performed on the video frames of the first video to obtain the first image semantic feature sequence.
Step 1204, determining a second image semantic feature sequence in the semantic feature information of the second video.
Step 1205, if the semantic feature sequence of the first image and the semantic feature sequence of the second image meet a second preset condition, determining that the first video is a content similar video of the second video.
Step 1206, in the case that the video duration does not meet the preset duration condition, inputting the first video into the video semantic extraction model for feature information extraction to obtain the first video semantic feature.
In the foregoing embodiment, the video duration of the first video does not meet the preset duration condition, that is, the video duration of the first video is greater than the duration threshold.
If the video duration of the first video is greater than the duration threshold, the first video is input into the video semantic extraction model for feature information extraction to obtain the first video semantic feature.
Step 1207, determining semantic features of the second video in the semantic feature information of the second video.
In step 1208, if the semantic features of the first video and the semantic features of the second video meet the first preset condition, determining that the first video is a content-similar video of the second video.
Step 1209, in the case that the first video is a video with similar content to the second video, determining a first audio feature sequence of the first video, and acquiring a second audio feature sequence of the second video.
In step 1210, determining that the first video is a similar video of the second video when the similarity between the first audio feature sequence and the second audio feature sequence meets the audio similarity condition.
In this embodiment, different similar video determination modes are selected according to the video duration. For a long video, similar video determination can be carried out at the overall video dimension by determining its video semantic feature; for a short video, similar video determination can be carried out at the finer-grained video frame dimension by determining its image semantic features. This effectively improves both the efficiency and the accuracy of similar video determination.
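A possible sketch of this duration-based branching is shown below. The 60-second duration threshold and the extractor callables are assumptions introduced only for illustration.

def extract_semantic_feature_info(first_video,
                                  get_duration,                     # callable: video -> duration in seconds
                                  extract_video_semantic_feature,   # callable: video -> overall video semantic feature
                                  extract_image_semantic_sequence,  # callable: video -> per-frame image semantic features
                                  duration_threshold=60.0):
    """Select the feature extraction mode by video duration, as in fig. 12."""
    if get_duration(first_video) <= duration_threshold:
        # Short video: image semantic features at the video frame dimension (steps 1203-1205).
        return {"first_image_semantic_feature_sequence": extract_image_semantic_sequence(first_video)}
    # Long video: video semantic feature at the overall video dimension (steps 1206-1208).
    return {"first_video_semantic_feature": extract_video_semantic_feature(first_video)}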
Based on the above embodiments, the similar video determination method in the embodiments of the present application is further described below with reference to specific application scenarios.
Referring to fig. 13, a technical framework diagram of an information flow content service system is shown as an example. As shown in fig. 13, the information flow content service system supports services such as video uploading, video review, manual review, and video publishing. The similar video determination method provided by the embodiments of the present application can be applied to this information flow content service system. Specifically, each service module in the information flow content service system and its main function are as follows:
A PGC (Professionally Generated Content), UGC (User Generated Content), or MCN (Multi-Channel Network) content producer uploads the content to be published to the uplink and downlink content interface server through a content production end (e.g., a mobile terminal or a backend API system). In the following description, the published content is taken as the first video in the above embodiments.
The uplink and downlink content interface server receives the data of the first video submitted by the content production end. Optionally, the data of the first video includes content meta information and a content entity file. The uplink and downlink content interface server writes the content meta information of the first video into the content database, uploads the content entity file of the first video to the content storage service, and synchronizes the first video to the dispatching center server for subsequent content processing and circulation.
Optionally, the content meta information stored in the content database includes information such as a video cover image link, a video bitrate, a video format, a video title, a release time, an author, a video file size, and an original mark or first-publication mark. The original mark or first-publication mark can be determined by the similar video determination method provided by the embodiments of the present application. For example, whether the newly uploaded first video is similar to an existing video in the system is determined by the similar video determination method provided by the embodiments of the present application; if the first video is not similar to any existing video in the system, an original mark or first-publication mark can be added to the content meta information of the first video.
After receiving the first video, the dispatching center server calls a transport video identification service to start identifying the first video. The transport video identification service calls a distributed vector retrieval service to perform distributed management and retrieval matching on the vectors. Optionally, the distributed vector retrieval service adopts a distributed Faiss library and a read-write-separated double-buffer mechanism to manage and retrieve massive video index information.
On one hand, the distributed vector retrieval service stores the feature vectors of existing videos, namely the second video in the embodiments of the present application; on the other hand, it calls a multi-dimensional embedding vector generation service to generate the feature vector of the first video, and reads and stores that feature vector. The multi-dimensional embedding vector generation service is communicatively connected with the content database and reads the content meta information to construct a multi-dimensional content feature vector; it is also communicatively connected with the frame extraction and audio extraction service to read video frame data and audio frame data as source data for feature information extraction, so as to generate the feature vector of the first video. The frame extraction and audio extraction service calls a file download system to download the video file, namely the content entity file of the first video, from the content storage service, and then processes the video file to extract frames and audio; the related method can refer to the description in the above embodiments and is not repeated here. Optionally, the frame extraction and audio extraction service may temporarily store the extracted video frame and audio frame data in the content storage service to avoid repeated extraction.
Optionally, the feature vector of the first video includes at least one of the first video semantic feature, the first image semantic feature sequence, and the first audio feature sequence. In one possible implementation, the feature vector of the first video may be determined according to the duration of the first video. For example, if the video duration of the first video is less than or equal to the duration threshold, either the first video semantic feature or the first image semantic feature sequence may be determined as the feature vector of the first video according to the related methods provided in the above embodiments, and either of the two ways may be chosen according to the actual situation; if the video duration of the first video is greater than the duration threshold, the first video semantic feature is determined as the feature vector of the first video according to the related method provided in the above embodiments.
The distributed vector retrieval service reads vectors from the multi-dimensional embedding vector generation service. After reading and storing the feature vector of the first video, the distributed vector retrieval service performs vector retrieval matching, determines similar videos according to the matching result, determines the transport videos among the similar videos, and returns them to the transport video identification service, which informs the dispatching center server of the transport videos.
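The retrieval matching step can be illustrated with a minimal single-node Faiss sketch, since the text names Faiss as the underlying library; the distributed sharding and the read-write-separated double-buffer management are omitted here, and the embedding dimension, the stand-in vectors, and the similarity threshold are assumptions.

import numpy as np
import faiss

d = 256                                              # assumed embedding dimension
index = faiss.IndexFlatIP(d)                         # inner-product index; equals cosine similarity after L2 normalization

stored = np.random.rand(10000, d).astype("float32")  # stand-in for feature vectors of existing (second) videos
faiss.normalize_L2(stored)
index.add(stored)

query = np.random.rand(1, d).astype("float32")       # stand-in for the feature vector of the first video
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)                # retrieve the top-10 nearest existing videos

similar_ids = [int(i) for i, s in zip(ids[0], scores[0]) if s >= 0.85]  # assumed similarity threshold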
If the first video is identified as a transport video, corresponding processing can be performed, such as taking the video offline or penalizing the account. The accuracy of transport video identification can be ensured through recall matching based on the image semantic features of video frames, recall based on the video semantic feature extraction scheme, and verification matching based on the video audio feature vectors.
If the first video is not identified as a transport video, the dispatching center server synchronizes the first video to the manual review system for manual review. The manual review system records the review result of the first video content during the manual review process. Optionally, the review result is classification information of the first video, and the classification information includes first-, second-, and third-level classifications and tag information. For example, for a video explaining an xx-brand mobile phone, the first-level class is technology, the second-level class is smartphones, the third-level class is domestic mobile phones, and the tag information is the xx brand and xx model. During manual review, the manual review system reads the content meta information from the content database and writes the review result back to the content database.
If the first video passes manual review, the dispatching center server can call the content distribution service to distribute the index information of the first video to the content consumption end; the content consumption end can also obtain the index information of the first video from the uplink and downlink content interface server, and then download the content entity file of the first video from the content storage service to play the first video.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 14, a block diagram of a similar video determination apparatus provided in an embodiment of the present application is shown. The apparatus has the function of implementing the above similar video determination method, and the function can be implemented by hardware, or by hardware executing corresponding software. The apparatus can be a computer device, or can be disposed in a computer device. The apparatus 1400 may include: a video acquisition module 1410, a semantic feature extraction module 1420, and a similar video determination module 1430.
A video obtaining module 1410, configured to obtain a first video.
The semantic feature extraction module 1420 is configured to perform content-based feature information extraction on the first video to obtain semantic feature information of the first video, where the semantic feature information is feature information that is influenced by a video editing operation to a degree lower than a first influence degree requirement.
The similar video determining module 1430 is configured to determine that the first video is a similar video of the second video when the semantic feature information of the first video and the semantic feature information of the second video meet a preset condition.
In an exemplary embodiment, the semantic feature information of the first video includes a first video semantic feature for characterizing the semantic feature information of the first video from the overall video dimension, and the semantic feature extraction module 1420 includes a video semantic extraction unit.
And the video semantic extraction unit is used for inputting the first video into a video semantic extraction model to extract the characteristic information to obtain the first video semantic characteristic.
The video semantic extraction model is a machine learning model obtained by training based on a triple loss constraint condition by taking triples as training samples, wherein a positive sample pair of the triples comprises a first sample video and a second sample video, a negative sample pair of the triples comprises the first sample video and a third sample video, the second sample video is a video obtained by performing video editing operation on the first sample video, and the third sample video is a video with different content from the first sample video.
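The following PyTorch sketch illustrates the triplet-loss training constraint described above. The placeholder encoder, the margin value, the random stand-in batch, and the optimizer settings are assumptions and do not reflect the actual model architecture or training configuration.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 256))  # placeholder video encoder
criterion = nn.TripletMarginLoss(margin=0.3, p=2)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# Stand-in batch: pooled features of the first sample video (anchor), the second
# sample video (an edited copy of the first), and the third sample video (different content).
anchor_in, positive_in, negative_in = (torch.randn(8, 2048) for _ in range(3))

anchor = encoder(anchor_in)
positive = encoder(positive_in)
negative = encoder(negative_in)
loss = criterion(anchor, positive, negative)   # pulls edited copies together, pushes different content apart
loss.backward()
optimizer.step()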
In an exemplary embodiment, the video semantic extraction unit includes: the device comprises an image screening subunit, an embedding processing subunit, a semantic feature extraction subunit and a semantic feature generation subunit.
And the image screening subunit is used for obtaining a target image set according to at least one video frame in the first video and the video cover of the first video.
And the embedding processing subunit is configured to perform embedding processing on each image in the target image set to obtain an embedding feature set of the target image set, where the embedding feature set represents visual modality information of the first video.
And the semantic feature extraction subunit is used for extracting content-based feature information of each embedded feature in the embedded feature set to obtain embedded semantic features.
And the semantic feature generation subunit is used for performing average pooling processing on each embedded semantic feature to obtain the first video semantic feature.
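A minimal sketch of these sub-units is given below: each image in the target image set is embedded, a semantic feature is extracted per embedding, and the results are average-pooled into the first video semantic feature. The embedder and semantic extractor callables are placeholders, not the actual model components.

import numpy as np

def first_video_semantic_feature(target_images, embed_image, extract_semantic):
    # embed_image: callable mapping an image to its embedding vector (assumed interface)
    # extract_semantic: callable mapping an embedding to an embedded semantic feature (assumed interface)
    embeddings = [embed_image(img) for img in target_images]        # embedding feature set of the target image set
    semantic_features = [extract_semantic(e) for e in embeddings]   # embedded semantic features
    return np.mean(np.stack(semantic_features), axis=0)             # average pooling over the embedded semantic features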
In an exemplary embodiment, the semantic feature information of the first video includes a first image semantic feature sequence for characterizing semantic feature information of the first video from a video frame dimension, the semantic feature extraction module 1420 includes: and an image semantic extraction unit.
And the image semantic extracting unit is used for extracting the feature information of the video frame of the first video based on the image content to obtain the first image semantic feature sequence, wherein the first image semantic feature in the first image semantic feature sequence is a feature which is influenced by the image editing operation to a degree lower than the requirement of a second influence degree.
In an exemplary embodiment, the apparatus 1400 further comprises: and a duration determination module.
And the time length determining module is used for determining the video time length of the first video.
In the case that the video duration meets the preset duration condition, the image semantic extraction unit performs the operation of extracting image-content-based feature information from the video frames of the first video to obtain the first image semantic feature sequence.
In an exemplary embodiment, the image semantic extracting unit includes: the video frame extraction sub-unit, the inter-frame difference determination sub-unit, the target frame determination sub-unit and the image semantic extraction sub-unit.
And the video frame extraction subunit is used for performing frame extraction operation on the first video to obtain a plurality of video frames.
And the inter-frame difference determining subunit is used for determining the inter-frame difference information of the video frame and the adjacent video frame for each video frame.
And the target frame determining subunit is used for screening the plurality of video frames according to the interframe difference information to obtain a target frame sequence.
And the image semantic extraction subunit is used for extracting the feature information based on the image content for each target frame in the target frame sequence to obtain the first image semantic feature sequence.
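A possible OpenCV-based sketch of the frame extraction, inter-frame difference, and target frame screening sub-units is shown below; the sampling stride and the difference threshold are assumed values for illustration.

import cv2
import numpy as np

def select_target_frames(video_path, stride=10, diff_threshold=12.0):
    cap = cv2.VideoCapture(video_path)
    target_frames, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:                                        # frame extraction operation
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Keep a sampled frame only if it differs enough from the previous sampled frame.
            if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
                target_frames.append(frame)
            prev_gray = gray
        idx += 1
    cap.release()
    return target_frames                                             # target frame sequence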
In an exemplary embodiment, the similar video determination module 1430 includes: a first determination unit and a second determination unit.
The first determining unit is configured to determine a second video semantic feature in the semantic feature information of the second video when the semantic feature information of the first video includes the semantic feature of the first video, and determine that the first video is a content-similar video of the second video if the semantic feature of the first video and the semantic feature of the second video meet a first preset condition.
A second determining unit, configured to determine a second image semantic feature sequence in the semantic feature information of the second video when the semantic feature information of the first video includes the first image semantic feature sequence, and if the first image semantic feature sequence and the second image semantic feature sequence meet a second preset condition, determine that the first video is a content-similar video of the second video.
In an exemplary embodiment, the second determining unit includes: the device comprises a feature matching subunit, a matching frame determining subunit and a content similarity determining subunit.
And the feature matching subunit is used for determining a matching feature pair according to the first image semantic feature sequence and the second image semantic feature sequence, wherein the distance between the first image semantic feature in the matching feature pair and the second image semantic feature in the matching feature pair is smaller than a distance threshold value.
A matching frame determining subunit, configured to determine a matching target frame according to the matching feature pair, where the matching target frame includes a first target frame and a second target frame, the first target frame is a video frame in the first video that corresponds to the first image semantic feature, and the second target frame is a video frame in the second video that corresponds to the second image semantic feature.
And the content similarity determining subunit is configured to determine that the first video is a content similarity video of the second video if the number of the matching target frames meets a number condition.
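The matching logic of these sub-units can be sketched as follows: first- and second-image semantic features whose distance falls below a threshold form matching feature pairs, and content similarity is decided by the number of matched target frames. Both thresholds are illustrative assumptions.

import numpy as np

def is_content_similar(first_sequence, second_sequence, distance_threshold=0.3, min_matches=5):
    matches = 0
    for f1 in first_sequence:                                  # first image semantic feature sequence
        distances = [np.linalg.norm(f1 - f2) for f2 in second_sequence]
        if distances and min(distances) < distance_threshold:  # a matching feature pair is found
            matches += 1                                       # its frames form a matching target frame
    return matches >= min_matches                              # number condition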
In an exemplary embodiment, the similar video determination module 1430 includes:
and the audio characteristic determining unit is used for determining a first audio characteristic sequence of the first video and acquiring a second audio characteristic sequence of the second video under the condition that the semantic characteristic information of the first video and the semantic characteristic information of the second video accord with the preset condition.
And the audio feature verification unit is used for determining that the first video is a similar video of the second video when the similarity between the first audio feature sequence and the second audio feature sequence meets an audio similarity condition.
In an exemplary embodiment, the audio feature determination unit includes: the device comprises an audio acquisition subunit, a frequency domain conversion subunit and an audio sequence generation subunit.
And the audio acquisition subunit is used for acquiring the audio data corresponding to the first video.
And the frequency domain conversion subunit is used for performing frequency domain conversion processing on the audio data to obtain the frequency domain characteristics of the audio data.
An audio sequence generating subunit, configured to generate the first audio feature sequence based on the frequency-domain feature.
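A simple NumPy sketch of the audio feature determination unit follows: the audio signal is framed and converted to the frequency domain, and the per-frame spectra form the first audio feature sequence. The frame length, hop size, and number of retained bins are assumed values.

import numpy as np

def audio_feature_sequence(samples, frame_len=2048, hop=1024, n_bins=64):
    samples = np.asarray(samples, dtype=np.float32)
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))      # frequency-domain conversion of the audio frame
        features.append(spectrum[:n_bins])         # keep the low-frequency bins as the frame feature
    return np.array(features)                      # first audio feature sequence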
In an exemplary embodiment, the second video is any one of videos in a video database, the video database includes a feature information base and an online index base, the online index base includes an inventory index and an increment index, the inventory index includes index information of inventory videos, the inventory videos are videos in a target historical period, the increment index includes index information of increment videos, and the increment videos are videos newly added after the target historical period.
The apparatus 1400 further comprises: the index query module and the characteristic information acquisition module.
The index query module is used for acquiring target index information of the second video from the stock index under the condition that the second video is the stock video; and the characteristic information acquisition module is used for acquiring the semantic characteristic information of the second video from the characteristic information base based on the target index information.
The index query module is further configured to obtain the target index information of the second video from the incremental index when the second video is an incremental video; the feature information acquisition module is further configured to obtain the semantic feature information of the second video from the feature information base based on the target index information.
In an exemplary embodiment, the online index bank is a first index bank, the target historical period is a first historical period, and the apparatus 1400 further includes: the device comprises a time interval determining module, an index rebuilding module and an index library switching module.
And the time period determining module is used for determining a second historical time period, the second historical time period takes the latest time allowed to be added with the video in the first index base as a right boundary time node, and the time span of the second historical time period is the same as that of the first historical time period.
And the index reconstruction module is used for reconstructing an index based on the first index base to obtain a second index base, and the target historical time period of the second index base is the second historical time period.
And the index base switching module is used for switching the online index base from the first index base to the second index base.
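The stock/increment organization and the index bank switch can be sketched with simplified in-memory structures, as below; the real system uses a distributed index service, so these classes are only illustrative stand-ins under assumed interfaces.

class OnlineIndexBank:
    def __init__(self, stock_index, target_period):
        self.stock_index = dict(stock_index)    # index info of stock videos within the target historical period
        self.increment_index = {}               # index info of videos added after the target historical period
        self.target_period = target_period

    def add_increment_video(self, video_id, index_info):
        self.increment_index[video_id] = index_info

    def lookup(self, video_id):
        return self.stock_index.get(video_id) or self.increment_index.get(video_id)

def rebuild_and_switch(first_index_bank, second_period):
    # Rebuild: fold the increment into a new stock index whose target period is the second historical period.
    merged = {**first_index_bank.stock_index, **first_index_bank.increment_index}
    second_index_bank = OnlineIndexBank(merged, second_period)
    return second_index_bank                    # the online index bank is then switched to this new bank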
In an exemplary embodiment, the apparatus 1400 further comprises: the device comprises a transport video determining module and a video pushing module.
The transport video determining module is configured to determine, in the case that the first video is a similar video of the second video, that the first video is a transport video, where a transport video refers to a non-original video.
And the video pushing module is used for limiting the pushing of the first video.
In summary, according to the technical scheme provided by the embodiment of the application, through performing content-based feature information extraction processing on videos, it is ensured that key points of feature extraction can focus on the video contents, so that the extracted video semantic information is less influenced by video editing operation, whether two videos are similar or not is judged by comparing whether video semantic information capable of representing features of the video contents per se meets preset conditions, and the accuracy of similar video identification can be effectively improved.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Referring to fig. 15, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a server configured to perform the similar video determination method described above. The details are as follows:
the computer device 1500 includes a Central Processing Unit (CPU) 1501, a system Memory 1504 including a Random Access Memory (RAM) 1502 and a Read Only Memory (ROM) 1503, and a system bus 1505 connecting the system Memory 1504 and the Central Processing Unit 1501. The computer device 1500 also includes a basic Input/Output system (I/O) 1506, which facilitates transfer of information between devices within the computer, and a mass storage device 1507 for storing an operating system 1513, application programs 1514, and other program modules 1515.
The basic input/output system 1506 includes a display 1508 for displaying information and an input device 1509 such as a mouse, keyboard, etc. for inputting information by a user. Wherein a display 1508 and an input device 1509 are connected to the central processing unit 1501 via an input output controller 1510 connected to the system bus 1505. The basic input/output system 1506 may also include an input/output controller 1510 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 1510 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1507 is connected to the central processing unit 1501 through a mass storage controller (not shown) connected to the system bus 1505. The mass storage device 1507 and its associated computer-readable media provide non-volatile storage for the computer device 1500. That is, the mass storage device 1507 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1504 and mass storage device 1507 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1500 may also be run by connecting to a remote computer on a network through a network such as the Internet. That is, the computer device 1500 may be connected to the network 1512 through the network interface unit 1511 connected to the system bus 1505, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1511.
The memory further includes a computer program stored in the memory and configured to be executed by one or more processors to implement the similar video determination method described above.
In an exemplary embodiment, there is also provided a computer readable storage medium having stored therein at least one instruction, at least one program, code set, or set of instructions which, when executed by a processor, implements the similar video determination method described above.
Optionally, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random Access Memory), SSD (Solid State Drive), or an optical disc. The random access memory may include ReRAM (Resistive Random Access Memory) and DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the similar video determination method described above.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution sequence among the steps, and in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers are executed simultaneously, or two steps with different numbers are executed in a reverse order to the order shown in the figure, which is not limited by the embodiment of the present application.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for similar video determination, the method comprising:
acquiring a first video;
extracting content-based feature information of the first video to obtain semantic feature information of the first video, wherein the semantic feature information is feature information which is influenced by video editing operation and has a degree lower than a first influence degree requirement;
and under the condition that the semantic feature information of the first video and the semantic feature information of the second video accord with preset conditions, determining that the first video is a similar video of the second video.
2. The method according to claim 1, wherein the semantic feature information of the first video comprises a first video semantic feature, the first video semantic feature is used for representing the semantic feature information of the first video from an overall video dimension, and the extracting the content-based feature information of the first video to obtain the semantic feature information of the first video comprises:
inputting the first video into a video semantic extraction model to extract characteristic information to obtain the semantic characteristics of the first video;
the video semantic extraction model is a machine learning model obtained by training based on a triple loss constraint condition by taking triples as training samples, wherein a positive sample pair of the triples comprises a first sample video and a second sample video, a negative sample pair of the triples comprises the first sample video and a third sample video, the second sample video is a video obtained by performing video editing operation on the first sample video, and the third sample video is a video with different content from the first sample video.
3. The method according to claim 2, wherein the extracting feature information of the first video input video semantic extraction model to obtain the first video semantic features comprises:
obtaining a target image set according to at least one video frame in the first video and a video cover of the first video;
embedding each image in the target image set to obtain an embedding feature set of the target image set, wherein the embedding feature set represents visual modal information of the first video;
extracting content-based feature information of each embedded feature in the embedded feature set to obtain embedded semantic features;
and carrying out average pooling processing on each embedded semantic feature to obtain the first video semantic feature.
4. The method according to claim 1 or 2, wherein the semantic feature information of the first video comprises a first image semantic feature sequence, the first image semantic feature sequence is used for representing the semantic feature information of the first video from a video frame dimension, and the extracting content-based feature information of the first video to obtain the semantic feature information of the first video comprises:
and extracting feature information based on image content from the video frame of the first video to obtain the first image semantic feature sequence, wherein the first image semantic feature in the first image semantic feature sequence is a feature which is influenced by image editing operation to a degree lower than a second influence degree requirement.
5. The method of claim 4, further comprising:
determining a video duration of the first video;
and under the condition that the video duration meets a preset duration condition, performing the operation of extracting image-content-based feature information from the video frames of the first video to obtain the first image semantic feature sequence.
6. The method according to claim 4 or 5, wherein the extracting feature information based on image content from the video frames of the first video to obtain the first image semantic feature sequence comprises:
performing frame extraction operation on the first video to obtain a plurality of video frames;
for each video frame, determining interframe difference information of the video frame and an adjacent video frame;
screening the plurality of video frames according to the interframe difference information to obtain a target frame sequence;
and extracting feature information based on image content for each target frame in the target frame sequence to obtain the first image semantic feature sequence.
7. The method according to claim 4, wherein the determining that the first video is a similar video of the second video when the semantic feature information of the first video and the semantic feature information of the second video meet a preset condition comprises:
determining a second video semantic feature in the semantic feature information of the second video under the condition that the semantic feature information of the first video comprises the semantic feature of the first video, and if the semantic feature of the first video and the semantic feature of the second video meet a first preset condition, determining that the first video is a content similar video of the second video;
and under the condition that the semantic feature information of the first video comprises the semantic feature sequence of the first image, determining a semantic feature sequence of a second image in the semantic feature information of the second video, and if the semantic feature sequence of the first image and the semantic feature sequence of the second image accord with a second preset condition, determining that the first video is a content similar video of the second video.
8. The method according to claim 7, wherein the determining that the first video is a video with similar content to the second video if the first image semantic feature sequence and the second image semantic feature sequence meet a second preset condition comprises:
determining a matching feature pair according to the first image semantic feature sequence and the second image semantic feature sequence, wherein the distance between a first image semantic feature in the matching feature pair and a second image semantic feature in the matching feature pair is smaller than a distance threshold;
determining a matching target frame according to the matching feature pair, wherein the matching target frame comprises a first target frame and a second target frame, the first target frame is a video frame corresponding to the semantic feature of the first image in the first video, and the second target frame is a video frame corresponding to the semantic feature of the second image in the second video;
and if the number of the matched target frames meets the number condition, determining that the first video is the content similar video of the second video.
9. The method according to claim 1, wherein the determining that the first video is a similar video of a second video when the semantic feature information of the first video and the semantic feature information of the second video meet a preset condition comprises:
under the condition that the semantic feature information of the first video and the semantic feature information of the second video meet the preset condition, determining a first audio feature sequence of the first video, and acquiring a second audio feature sequence of the second video;
and determining that the first video is a similar video of the second video when the similarity between the first audio feature sequence and the second audio feature sequence meets an audio similarity condition.
10. A similar video determination apparatus, the apparatus comprising:
the video acquisition module is used for acquiring a first video;
the semantic feature extraction module is used for extracting content-based feature information of the first video to obtain semantic feature information of the first video, wherein the semantic feature information is feature information which is influenced by video editing operation and has a degree lower than a first influence degree requirement;
and the similar video determining module is used for determining that the first video is the similar video of the second video under the condition that the semantic feature information of the first video and the semantic feature information of the second video accord with preset conditions.
CN202110843235.7A 2021-07-26 2021-07-26 Similar video determination method and device Pending CN113822138A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110843235.7A CN113822138A (en) 2021-07-26 2021-07-26 Similar video determination method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110843235.7A CN113822138A (en) 2021-07-26 2021-07-26 Similar video determination method and device

Publications (1)

Publication Number Publication Date
CN113822138A true CN113822138A (en) 2021-12-21

Family

ID=78923950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110843235.7A Pending CN113822138A (en) 2021-07-26 2021-07-26 Similar video determination method and device

Country Status (1)

Country Link
CN (1) CN113822138A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115225930A (en) * 2022-07-25 2022-10-21 广州博冠信息科技有限公司 Processing method and device for live interactive application, electronic equipment and storage medium
CN115225930B (en) * 2022-07-25 2024-01-09 广州博冠信息科技有限公司 Live interaction application processing method and device, electronic equipment and storage medium
CN115273892A (en) * 2022-07-27 2022-11-01 腾讯科技(深圳)有限公司 Audio processing method, device, equipment, storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination