CN113469152A - Similar video detection method and device

Info

Publication number
CN113469152A
CN113469152A (application CN202111030237.0A)
Authority
CN
China
Prior art keywords
video
feature
dimension
verification
recall
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111030237.0A
Other languages
Chinese (zh)
Other versions
CN113469152B (en)
Inventor
刘刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111030237.0A priority Critical patent/CN113469152B/en
Publication of CN113469152A publication Critical patent/CN113469152A/en
Application granted granted Critical
Publication of CN113469152B publication Critical patent/CN113469152B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a similar video detection method and device, belonging to the technical field of artificial intelligence. The method includes: obtaining a feature embedding vector of a first video; comparing the feature embedding vector of the first video with the feature embedding vector of at least one second video in a recall process to obtain a recalled video associated with the first video; re-verifying the first video against the recalled video, and determining a feature verification result of the first video and the recalled video in at least one verification dimension; and determining video feature data of the first video and the recalled video in a target verification dimension. By decoupling the video recall process from the video similarity verification process, the technical scheme recalls videos on the basis of their feature embedding vectors; once the recalled video is obtained, the first video and the recalled video are verified more precisely across multiple verification dimensions before the similarity judgment is made, which effectively improves video deduplication efficiency.

Description

Similar video detection method and device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for detecting similar videos.
Background
In the era of rapid internet development, information flow content services are widely popular, and a large amount of high-quality original content appears on information flow content service platforms. At the same time, however, a large number of similar videos are also generated on these platforms: similar video publishers copy original videos, or simply edit them, and republish them, which damages the interests of the original authors and is not conducive to the healthy development of the whole content ecology.
In order to effectively identify similar videos in a massive video content library, the related art compares the video to be detected with all videos in the video content library one by one, judging whether the video to be detected is similar to the videos in the library and thereby determining whether it is a similar video.
However, in this approach video recall and video similarity verification are coupled together, so the video deduplication task requires a large amount of computation, runs slowly and has low efficiency.
Disclosure of Invention
The embodiments of the application provide a similar video detection method and device, which can effectively reduce the amount of computation required for video recall, improve the accuracy of similar video detection, and improve video deduplication efficiency.
According to an aspect of the embodiments of the present application, there is provided a method for detecting similar videos, the method including:
acquiring a feature embedding vector of a first video;
comparing the feature embedding vector of the first video with the feature embedding vector of at least one second video in a recall process to obtain a recalled video associated with the first video;
re-verifying the first video against the recalled video, and determining a feature verification result of the first video and the recalled video in at least one verification dimension;
and if the feature verification result in the at least one verification dimension meets a preset video similarity condition, determining that the first video is a similar video of the recalled video.
According to an aspect of the embodiments of the present application, there is provided a similar video detection apparatus, including:
the embedded feature acquisition module is used for acquiring a feature embedding vector of the first video;
the video recall module is used for comparing the feature embedding vector of the first video with the feature embedding vector of at least one second video in a recall process to obtain a recalled video associated with the first video;
the video verification module is used for re-verifying the first video against the recalled video and determining a feature verification result of the first video and the recalled video in at least one verification dimension;
and the similar video determining module is used for determining that the first video is a similar video of the recalled video if the feature verification result in the at least one verification dimension meets a preset video similarity condition.
According to an aspect of embodiments of the present application, there is provided a computer device comprising a processor and a memory, wherein at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the similar video detection method described above.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement the similar video detection method described above.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the similar video detection method described above.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
the method comprises the steps of performing video recall processing on at least one second video by taking an obtained feature embedding vector of a first video as a recall basis to obtain a recall video associated with the first video, further rechecking and checking the first video and the recall video, performing similar video checking on the first video and the recall video from multiple dimensions to obtain multi-dimensional feature checking results and judging whether the feature checking results meet preset video similar conditions or not, further finishing similar video judgment, and realizing decoupling of a video recall process and a video similar checking process. The video recall process does not need to carry out accurate verification on the video, the calculated amount of a video duplicate removal task can be effectively reduced, the recall rate is improved, after the second video associated with the first video is rapidly screened out as a recall result, further rechecking verification is carried out on the recall result, the accuracy of similar video detection is ensured, and finally the video duplicate removal efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of an application execution environment provided by one embodiment of the present application;
FIG. 2 is a first flowchart of a similar video detection method according to an embodiment of the present application;
FIG. 3 is a second flowchart of a similar video detection method according to another embodiment of the present application;
FIG. 4 illustrates a flow diagram for determining text feature embedding vectors for a video;
FIG. 5 illustrates a flow diagram for determining feature embedding vectors for a video;
FIG. 6 is a third flowchart of a similar video detection method according to an embodiment of the present application;
FIG. 7 illustrates a schematic diagram of a template video;
FIG. 8 illustrates a diagram of a review check of similar videos;
FIG. 9 illustrates a flow chart for video deduplication;
FIG. 10 illustrates a technical framework diagram of an information flow content service system;
fig. 11 is a block diagram of a similar video detection apparatus provided in an embodiment of the present application;
fig. 12 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
In the age of rapid internet development, as the threshold for content production in information flow content services falls, the upload and distribution volumes of video and image-text content grow exponentially. These content sources include internet users and various content production organizations, such as PGC (Professionally Generated Content), UGC (User Generated Content), self-media and institutions, for example multimedia content streaming services built on social networks. Over the past year, the peak daily volume of content uploaded to the library from all sources has exceeded millions of items. Because the amount of uploaded content has grown so much, and in order to ensure the security and timeliness of content distribution and the legal interests of copyright holders in information flow content services, the review of uploaded content must be completed in a short time, for example checking whether the content contains information unsuitable for dissemination, and identifying and handling content quality and security issues.
At present, content is mainly identified and handled by large amounts of manpower, assisted by machine algorithm capabilities. Taking short video content as an example, the current short video distribution process, from initial upload through successful storage to user consumption, is roughly as follows. A video is first shot with a terminal shooting tool, for example in a social application or a short video sharing platform. The video is then uploaded through the terminal; during upload it may be transcoded again, the video file is normalized, the video's meta-information is stored, and playback compatibility across platforms is improved. Next, the video is reviewed manually; while it is being reviewed, a machine can extract some auxiliary characteristics of the content through algorithms, such as determining a classification or adding tags. Manual standardized labeling is then performed on top of the machine algorithm processing, filling in relevant information for the video such as tags, categories and celebrity information, sometimes together with a piece of descriptive text, so that the content is standardized. After the video passes review, it enters the content library of the video platform. Finally, the video is distributed directly to operational and external distribution channels or to a recommendation engine, which recommends it based on the user's profile features through recommendation algorithms such as collaborative filtering, matrix factorization, Factorization Machines (FM) and Gradient Boosted Decision Trees (GBDT) combined with deep learning.
Meanwhile, each short video platform has subsidy and incentive mechanisms to encourage content creation. Some content creators therefore upload large numbers of similar videos, which occupy a large share of traffic and are not conducive to the healthy development of the content ecology. Because video content has to be reviewed manually, manual review on the one hand adds considerable cost and on the other hand limits content processing efficiency. As the amount of content grows rapidly, processing costs become very high; if content cannot be reviewed and processed quickly it cannot be distributed quickly, which also greatly affects user experience. With the explosion of short video, the means of modifying and editing short video content in order to bypass deduplication recognition systems keep multiplying, and a multi-dimensional, precise deduplication capability is urgently needed.
In some possible embodiments, embedding vectors based on the video content, or embedding vectors of the title and cover image, are extracted from the content to be deduplicated; a similarity comparison is performed for each type of vector, with the distance between vectors measuring the degree of similarity, and whether two videos are similar is judged from the comparison results across the different dimensions. These embodiments have some problems: the video recall process and the video similarity verification process in the deduplication pipeline are coupled together, the required amount of computation is very large, and deduplication is very slow, so that hot content which needs to pass quickly through the content pipeline cannot be processed in time. Coupling the two processes together is also unfavourable for optimizing the accuracy of video deduplication and improving its effectiveness.
To solve the above problems, the present application provides a similar video detection method that decouples the video recall process from the video similarity verification process, which facilitates independent maintenance and optimization of each part of the system and improves both the accuracy and the recall rate of video deduplication.
The similar video detection method provided by the embodiment of the application relates to an artificial intelligence technology and a block chain technology, which are briefly described below to facilitate understanding by those skilled in the art.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how computers can simulate or realize human learning behaviour in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
Deep learning: the concept of deep learning stems from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, thereby discovering distributed feature representations of data.
Computer Vision (CV) technology is a science that studies how to make machines "see": it uses cameras and computers in place of human eyes to recognize, track and measure targets, and performs further image processing so that the result is better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, video processing, video semantic understanding, video content/behaviour recognition, three-dimensional object reconstruction, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform may include processing modules such as user management, basic services, smart contracts and operation monitoring. The user management module is responsible for managing the identity information of all blockchain participants, including maintaining the generation of public and private keys (account management), key management, and the correspondence between users' real identities and their blockchain addresses (permission management); with authorization, it can supervise and audit the transactions of certain real identities and provide rule configuration for risk control (risk-control auditing). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and, after consensus is reached on valid requests, to record them to storage; for a new service request, the basic service first performs interface adaptation, parsing and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering and contract execution; developers can define contract logic through a programming language and publish it to the blockchain (contract registration), and according to the logic of the contract terms the contract is triggered by keys or other events and executed to complete the contract logic, while the module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract settings and cloud adaptation during product release, and for visual output of real-time status during product operation, for example alarms, monitoring of network conditions and monitoring of the health of node devices.
The platform product service layer provides basic capability and an implementation framework of typical application, and developers can complete block chain implementation of business logic based on the basic capability and the characteristics of the superposed business. The application service layer provides the application service based on the block chain scheme for the business participants to use.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an application execution environment according to an embodiment of the present application is shown. The application execution environment may include: a terminal 10 and a server 20.
The terminal 10 includes, but is not limited to, a mobile phone, a Computer, a tablet Computer, a smart voice interaction device, a smart appliance, a vehicle-mounted terminal, a game console, an e-book reader, a multimedia playing device, a wearable device, a PC (Personal Computer), and other electronic devices. A client of the application may be installed in the terminal 10.
In the embodiment of the present application, the application may be any application capable of providing a video information streaming content service. Typically, the application is a video-type application. Of course, streaming content services may be provided in other types of applications besides video-type applications. For example, the application may be a news application, a social interaction application, an interactive entertainment application, a browser application, a shopping application, a content sharing application, a Virtual Reality (VR) application, an Augmented Reality (AR) application, and the like, which is not limited in this embodiment. In addition, for different applications, videos pushed by the applications may also be different, and corresponding functions may also be different, which may be configured in advance according to actual requirements, and this is not limited in this embodiment of the application. Optionally, a client of the above application program runs in the terminal 10. In some embodiments, the streaming content service covers many vertical contents such as art, movie, news, finance, sports, entertainment, games, etc., and users can enjoy many forms of content services such as articles, pictures, videos, short videos, live broadcasts, titles, columns, etc. through the streaming content service.
The server 20 is used to provide background services for clients of applications in the terminal 10. For example, the server 20 may be a backend server for the application described above. The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. Optionally, the server 20 provides background services for applications in multiple terminals 10 simultaneously.
Alternatively, the terminal 10 and the server 20 may communicate with each other through the network 30. The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Before describing the method embodiments provided in the present application, a brief description is given to the application scenarios, related terms, or terms that may be involved in the method embodiments of the present application, so as to facilitate understanding by those skilled in the art of the present application.
Social network: social networks originate from network societies, the starting point of which is email. The internet is essentially a network between computers, and early E-mail (electronic mail) solved the problem of remote mail transmission, which is the most popular application on the internet today, and is also the starting point of social networking. The BBS (Bulletin Board System, internet forum) further normalizes "mass sending" and "forwarding" and theoretically realizes the functions of issuing information to all people and discussing topics. Becoming a platform for the spontaneous generation of early internet content. In recent 2 years, due to The widespread use of smart phones, there is a ubiquitous wi-fi (wireless Communication Technology) facility, The universal reduction of The 4G (The 4th Generation Mobile Communication Technology, fourth Generation Mobile Communication Technology) tariff, The coming of The 5G (5 th Generation Mobile Communication Technology, fifth Generation Mobile Communication Technology) era, and under The strong context of The current Mobile internet era, The demand of users for receiving information is transitioning from The text era to The video era. Therefore, short videos will gradually become one of the dominant content forms of the mobile internet, replace the consumption of the teletext content to a certain extent, and gradually take a leading position in teletext media such as news and social platforms. These contents are usually presented in the form of Feeds stream (information stream) for users to refresh quickly, the feed is your friends or public people concerned about, and the contents are the dynamic of their open publication. When the number of friends is large and active, continuously updated content can be received, which is a common form of Feed. Time is the ultimate dimension followed by Feed because updates to content are the result of constant requests to the server. Timeline (Timeline) is the most primitive and most basic presentation form of Feed, and further Feed stream mode design can be carried out on the basis of Timeline.
Video generally refers to various techniques for capturing, recording, processing, storing, transmitting, and reproducing a series of still images as electrical signals. Optionally, the video comprises a video recommended by the streaming content service to the user for reading, and comprises a vertical version of a small video and a horizontal version of a short video, and is provided in the form of a Feeds stream.
Short video: that is, short-form video, a way of disseminating internet content, generally video of no more than 30 minutes disseminated on new internet media. With the popularization of mobile terminals and ever faster networks, short, fast, high-volume content has gradually won the favour of the major platforms, fans and capital. Short video refers to video content pushed at high frequency and played on various new media platforms, suitable for viewing on the move and in short leisure moments, lasting from a few seconds to a few minutes. Optionally, short video content covers topics such as skill sharing, humour, fashion trends, social hotspots, street interviews, public education, advertising creativity and business customization. Because the content is short, it can be a stand-alone piece or part of a series of columns. Unlike micro-films and live streaming, short video production does not require the specific forms of expression and team configuration of micro-films; it has a simple production process, a low production threshold and strong participation, and it has a dissemination value that live streaming lacks, while its ultra-short production cycle and the need for interesting content pose a certain challenge to the copywriting and planning work of short video production teams. The emergence of short video has also enriched the forms of native advertising in new media. At present, with UGC and PGC users uploading short videos, and with organizations specializing in short video production, MCNs (Multi-Channel Networks), professional short video applications and other head traffic platforms continuing to grow, short video has become one of the important modes of dissemination for content creation and social media platforms. While stimulating the enthusiasm of content creators and challenging video media platforms, the influence of short video has been further upgraded, and all the major information platforms compete around it. A wide variety of short video content is becoming increasingly rich, and both producers and consumers of short video content form a huge group.
MCN: the method is a product form of a multi-channel network, combines PGC (Professional produced Content) contents, and guarantees continuous output of the contents under the powerful support of capital, thereby finally realizing stable business change.
PGC refers to professionally produced content (for example, video on a video website) or expert-produced content (content in a social network), and is generally used to connote personalized content, diversified viewpoints, democratized dissemination and virtualized social relationships. It is also known as PPC (Professionally-Produced Content).
NetVLAD (a network-based Vector of Locally Aggregated Descriptors) is a scene recognition algorithm that improves on VLAD (Vector of Locally Aggregated Descriptors). VLAD encodes local features such as SIFT (Scale-Invariant Feature Transform) descriptors into a short feature representation; NetVLAD attaches this encoding to a convolutional neural network used as the basic feature extraction structure, enabling end-to-end training.
Faiss is an open-source library for clustering and similarity search. It provides efficient similarity search and clustering of dense vectors, supports searching over billions of vectors, and is currently the most mature approximate nearest neighbour search library. It contains a number of algorithms that search sets of vectors of arbitrary size, together with supporting code for algorithm evaluation and parameter tuning.
Faiss vector retrieval: a conventional database consists of structured tables containing symbolic information. For example, a collection of images may be represented as a table with one indexed photo per row, each row containing information such as an image identifier and a descriptive caption; a row may also be linked to entries of other tables, for example a photo linked to a list of people's names. Many AI tools produce high-dimensional vectors, such as text embedding tools like Word2vec (Word To Vector) and Convolutional Neural Network (CNN) descriptors trained with deep learning. These representations are more powerful and flexible than fixed symbolic representations, but traditional databases queried with Structured Query Language (SQL) are not adapted to them and handle them very inefficiently. First, the huge volume of new multimedia streams creates billions of vectors. Second, and more importantly, finding similar entries means finding similar high-dimensional vectors, which is extremely inefficient or even impossible for the standard query languages. Similarity search and classification require operations such as: given a query vector, return the list of database objects closest to it in Euclidean distance; or given a query vector, return the list of database objects with the highest vector dot product. Traditional SQL database systems are poorly suited to this because they are optimized for hash-based searches or one-dimensional interval searches.
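As an illustration of the vector retrieval described above, the following is a minimal sketch of a Faiss-based nearest-neighbour search over one recall dimension; the embedding dimensionality, index type, neighbour count and distance threshold are illustrative assumptions rather than values specified by this application.

    # Illustrative Faiss-based vector recall for one recall dimension.
    # Dimensionality, index type and threshold are assumptions.
    import numpy as np
    import faiss

    dim = 128                                    # assumed embedding dimensionality
    index = faiss.IndexFlatL2(dim)               # exact L2 search; IVF/HNSW indexes scale further

    # Feature embedding vectors of the candidate (second) videos.
    library_vectors = np.random.rand(10000, dim).astype("float32")
    index.add(library_vectors)

    # Feature embedding vector of the first video in the same recall dimension.
    query = np.random.rand(1, dim).astype("float32")
    distances, ids = index.search(query, 50)     # 50 nearest neighbours

    # Keep only neighbours within an assumed recall distance threshold.
    recalled_ids = ids[0][distances[0] < 1.0]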
Please refer to fig. 2, which shows a first flowchart of a similar video detection method according to an embodiment of the present application. The method can be applied to a computer device, which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the server 20 in the application program running environment shown in fig. 1. The method can include the following steps (210-240).
Step 210, a feature embedding vector of the first video is obtained.
In one possible implementation, a first video is obtained, and a feature embedding vector of the first video is determined.
The first video may be any video. Optionally, the first video is a video recommended to the user for viewing in the information flow content service, such as a portrait version of a small video and a landscape version of a short video.
Optionally, the first video is presented in a Feed stream, such as Web Feed, News Feed, and multimedia Feed. Through which the web site propagates the latest information to the user. For example, the website presents multimedia contents of various video types to the user through Feeds. Feeds are usually arranged in a Timeline, Timeline being the most primitive and basic presentation of Feeds. It should be noted that, in the embodiment of the present application, the source of the first video is not limited, and the first video may be from Feeds stream, or may be in other network media forms.
Optionally, the first video is a video uploaded to a streaming content service by a user. For example, the first video is a video uploaded by the user in real time. The first video may be a complete video or may be a portion of a complete video. For example, an entry for uploading a video may be displayed on an operation interface displayed by the terminal device, a user may select a video to be uploaded, and the terminal device detects an upload request for uploading the video and may upload the video to a designated server; after receiving the video, the server may use the video as the first video.
Optionally, the first video is a short video. The present application does not limit the category of the first video, which may be a sports video, a life video, a variety video, a short video, a game video, etc., and the manner of acquisition is not limited to the above description. Similarly, the format of the video is not limited in the embodiments of the present application.
The feature embedding vector is a mathematical representation of the video over a certain feature space. In some embodiments, the feature embedding vector for the first video may be determined by a feature extraction algorithm, which is described below and is not described here due to space limitations.
In another possible implementation, the feature embedding vector of the first video is pre-stored, and the feature embedding vector of the first video can be obtained according to the index of the first video.
In an exemplary embodiment, the feature embedding vector comprises the feature embedding vector in at least one recall dimension, the at least one recall dimension comprising at least one of a keyframe feature dimension, an audio feature dimension, a cover feature dimension, a text feature dimension, and a video content feature dimension.
The recall dimension refers to a feature dimension which needs to be referred to in recall processing, the key frame feature dimension refers to a feature dimension for judging video similarity according to image features of key frames, the audio feature dimension refers to a feature dimension for judging video similarity according to audio data features, the cover feature dimension refers to a feature dimension for judging video similarity according to image features of video covers, the text feature dimension refers to a feature dimension for judging video similarity according to text content corresponding to videos, and the video content feature dimension refers to a feature dimension for judging video similarity according to features of the whole content of the videos.
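For illustration only, a per-video record holding one feature embedding vector per recall dimension could be organized as in the following sketch; the field names, types and the choice of storing one vector per key frame are assumptions for exposition, not structures defined by this application.

    # Illustrative container for the per-dimension feature embedding vectors of one video.
    # Field names and types are assumptions for exposition.
    from dataclasses import dataclass, field
    from typing import List, Optional
    import numpy as np

    @dataclass
    class VideoEmbeddings:
        video_id: str
        keyframe: List[np.ndarray] = field(default_factory=list)  # one vector per key frame
        audio: Optional[np.ndarray] = None       # audio feature embedding vector
        cover: Optional[np.ndarray] = None       # cover feature embedding vector
        text: Optional[np.ndarray] = None        # text feature embedding vector
        content: Optional[np.ndarray] = None     # video content feature embedding vector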
Accordingly, as shown in fig. 3, it shows a flowchart of a similar video detection method provided in another embodiment of the present application. In FIG. 3, the implementation of step 210 includes the following steps (211-215).
Step 211, determining a key frame feature embedding vector of the first video on the key frame feature dimension in case that the at least one recall dimension includes the key frame feature dimension.
The key frame feature embedding vector is a mathematical representation of the video in the key frame feature dimension. The first video includes at least one keyframe, each keyframe corresponding to a keyframe feature embedding vector, so the first video has at least one keyframe feature embedding vector in a keyframe feature dimension.
The key frame may be determined from the video frame in the first video through a specific key frame extraction algorithm, which is not limited in this embodiment of the application.
In a possible implementation manner, a key frame extraction process is performed on a first video to obtain a key frame of the first video; and performing image feature extraction processing on the key frames to obtain key frame feature embedded vectors corresponding to each key frame.
Optionally, the image feature extraction processing is performed on the key frames by an image feature extraction model. Specifically, the key frames are input into the image feature extraction model in sequence, and the model outputs the key frame feature embedding vector corresponding to each key frame. Optionally, the image feature extraction model is an image classification model such as a VGG16 model, an Inception-series model or a ResNet model; the specific model structure is not limited in the embodiments of the present application. The embodiment of the application selects Inception-ResNet v2 as the image feature extraction model to extract key frame feature embedding vectors from the visual modality information of the first video.
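A minimal sketch of this step is given below, assuming a pretrained Inception-ResNet v2 backbone obtained through the timm library; the model name, preprocessing helper and pooled-output behaviour are assumptions, and any image classification backbone could be substituted.

    # Illustrative key-frame feature extraction with a pretrained backbone.
    # The timm model name and preprocessing are assumptions.
    import torch
    import timm
    from timm.data import resolve_data_config, create_transform
    from PIL import Image

    model = timm.create_model("inception_resnet_v2", pretrained=True, num_classes=0)
    model.eval()
    transform = create_transform(**resolve_data_config({}, model=model))

    def keyframe_embedding(frame: Image.Image) -> torch.Tensor:
        x = transform(frame).unsqueeze(0)        # 1 x C x H x W
        with torch.no_grad():
            return model(x).squeeze(0)           # pooled key frame feature embedding vector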
In the event that the at least one recall dimension includes an audio feature dimension, an audio feature embedding vector for the first video in the audio feature dimension is determined, step 212.
The audio feature embedding vector is a mathematical representation of audio data corresponding to the video in the dimension of the audio feature.
In one possible implementation, corresponding audio data is separated from the first video; and carrying out audio feature extraction processing on the audio data to obtain an audio feature embedded vector.
The audio signal is first separated from the video; it can be converted into an image-like input by computing MFCC (Mel Frequency Cepstral Coefficients) features, and an audio feature sequence is then extracted with VGGish (an audio feature extraction model). The audio feature sequence is analogous to the video feature sequence: NetVLAD is used to extract the audio features corresponding to different shots, which are then fused with learnable weights to generate the global feature vector of the audio modality, that is, the audio feature embedding vector. Optionally, VGGish is an audio feature extraction model trained on the AudioSet data set that produces 128-dimensional embedding feature vectors.
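As a sketch of the first part of this audio path, the MFCC computation could be done with librosa as below; the sampling rate and number of coefficients are assumptions, and the VGGish and NetVLAD stages described above are not reproduced here.

    # Illustrative MFCC extraction for the audio separated from a video.
    # Sampling rate and n_mfcc are assumed values.
    import librosa
    import numpy as np

    def audio_mfcc(audio_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
        y, sr = librosa.load(audio_path, sr=sr, mono=True)      # audio track of the video
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return mfcc.T                                           # frames x n_mfcc, image-like input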
In the event that the at least one recall dimension includes a cover features dimension, a cover features embedding vector of the first video in the cover features dimension is determined, step 213.
The cover feature embedding vector is a mathematical representation of a video cover corresponding to the video on the cover feature dimension, and is a mathematical feature of the video cover.
The video and its cover together constitute the visual modality information: the video is the main body of the content and carries the main information, while the cover distils the essence of the video content, so the two complement each other.
In one possible embodiment, a video cover of a first video is obtained; and carrying out feature extraction processing on the video cover to obtain the cover feature embedding vector.
If the first video has a cover, the video cover of the first video is directly obtained, and if the first video does not have a cover, the video cover of the first video can be determined through a cover determination algorithm.
Optionally, the feature extraction process for the video cover is performed by a cover feature extraction model. Specifically, the video cover is input into a cover feature extraction model, and a cover feature embedding vector is output.
In training the cover feature extraction model, metric learning is adopted so that the feature distance between a video cover before and after transformation is small, while the feature distance between different covers is large. Optionally, Contrastive Loss or Triplet Loss is used as the loss function of the cover feature extraction model. The basic idea is that two cover images of the same category should be close together in feature space, while two covers of different categories should be far apart.
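A minimal sketch of this metric-learning objective with Triplet Loss is shown below; the margin value and the placeholder encoder are assumptions.

    # Illustrative triplet-loss training step for the cover feature extraction model.
    # The margin is an assumed value; `encoder` is any image embedding network.
    import torch
    import torch.nn as nn

    triplet_loss = nn.TripletMarginLoss(margin=0.3)

    def cover_metric_step(encoder: nn.Module,
                          anchor: torch.Tensor,     # original cover
                          positive: torch.Tensor,   # same cover after transformation
                          negative: torch.Tensor    # a different cover
                          ) -> torch.Tensor:
        return triplet_loss(encoder(anchor), encoder(positive), encoder(negative))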
In the event that the at least one recall dimension includes a text feature dimension, a text feature embedding vector for the first video in the text feature dimension is determined, step 214.
The text feature embedding vector is a mathematical representation of the text content corresponding to the video in the text feature dimension.
In one possible implementation, the text content corresponding to the first video is determined, and text feature extraction processing is performed on the text content to obtain a text feature embedding vector. Specifically, the text content is input into a text feature extraction model, which outputs the text feature embedding vector. Optionally, the text feature extraction model is a BERT (Bidirectional Encoder Representations from Transformers) model.
The text content corresponding to a video mainly comprises the video title; for image-text content there is also the body text in addition to the title. Here the BERT model is used to generate text feature embedding vectors for video titles and image-text titles (or image-text body content). Processing text with the BERT model extracts its semantic features, that is, converts the text string into a vector. Generally the output of the second-to-last layer of the BERT model is taken as the text representation vector, because the last layer is too close to the pre-training objectives and would introduce a bias on new tasks. The BERT model is a large-scale text pre-training model that improves the benchmark performance of NLP (Natural Language Processing) tasks with a 12-layer Transformer encoder. Compared with word2vec, a BERT model pre-trained on massive text can bring more transferable knowledge into a video text classification algorithm and provide more accurate text feature expression.
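A sketch of extracting a title representation from the second-to-last BERT layer is given below; the checkpoint name "bert-base-chinese" and the mean pooling over tokens are assumptions.

    # Illustrative extraction of a text feature embedding vector from the
    # second-to-last BERT layer. Checkpoint name and pooling are assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = AutoModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
    model.eval()

    def title_embedding(title: str) -> torch.Tensor:
        inputs = tokenizer(title, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden_states = model(**inputs).hidden_states
        second_to_last = hidden_states[-2]                      # layer recommended above
        mask = inputs["attention_mask"].unsqueeze(-1)
        return (second_to_last * mask).sum(1) / mask.sum(1)     # mean over real tokens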
In one example, as shown in fig. 4, a flow for determining the text feature embedding vectors of a video is illustrated. In some application scenarios there is a text feature extraction pre-training model 410, such as a BERT model. Pre-training is carried out on sample sentences such as sentence A and sentence B: the sentences are first tokenized into characters, so that if sentence A has N characters the tokenization result is characters 1 to N, and if sentence B has M characters the result is characters 1 to M. After tokenization, a start token [CLS] is added before the characters and a separator [SEP] is added between sentence A and sentence B, giving the character sequence. The feature of each character is then extracted: the start token [CLS] corresponds to feature E_[CLS]; the characters 1 to N of sentence A correspond to features E_1 to E_N; the separator [SEP] corresponds to feature E_[SEP]; the characters 1 to M of sentence B correspond to features E'_1 to E'_M. Based on these character features, the hidden feature of each character is extracted: the start token [CLS] corresponds to hidden feature C; the characters 1 to N of sentence A correspond to hidden features T_1 to T_N; the separator [SEP] corresponds to hidden feature T_[SEP]; the characters 1 to M of sentence B correspond to hidden features T'_1 to T'_M. Finally, the text feature embedding vectors of the sample sentences A and B are generated from the hidden features. The text feature extraction pre-training model 410 already has a certain accuracy; after fine-tuning it introduces more transferable knowledge, yielding a text feature extraction model 413 that can be applied to different text feature extraction tasks with higher accuracy and output more accurate text feature expressions.
The text feature extraction model 413 is a text feature extraction model after fine tuning adopted in the embodiment of the present application, and text contents corresponding to a video, such as a sentence C and a sentence D, can be input into the text feature extraction model 413, and based on the same method, a text feature embedding vector corresponding to the video can be obtained.
In the event that the at least one recall dimension includes a video content feature dimension, a video content feature embedding vector for the first video in the video content feature dimension is determined, step 215.
The video content feature embedding vector is a mathematical representation of the video overall content on the video content feature dimension.
In a possible implementation, frame extraction is performed on the first video to obtain a video frame sequence; the video frames in the sequence are input in order into an image feature extraction model to obtain a video feature sequence, which is the ordered arrangement of the image embedding features of the extracted video frames. The video feature sequence contains richer information than isolated image features, and there is temporal correlation between the different features. The video feature sequence is input into a video frame feature aggregation model to obtain the feature embedding vectors corresponding to a plurality of video shots of the first video, and the image feature embedding vectors corresponding to the plurality of video shots are then weighted and summed with weights obtained by machine learning to obtain the video content feature embedding vector.
The video frame feature aggregation model is NetVLAD, which implements the VLAD algorithm with a convolutional neural network structure, forming a newly generated VLAD layer. Compared with average pooling, NetVLAD can convert the video feature sequence into the features of multiple video shots through cluster centres and then obtain a global feature vector through a weighted sum of the shots with learnable weights; this global vector is the video content feature embedding vector, used as the feature representation of the overall video content.
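A simplified NetVLAD-style aggregation layer is sketched below; the cluster count, feature dimensionality and the omission of the learnable shot-weighting stage are assumptions made to keep the example short.

    # Simplified NetVLAD-style aggregation of a per-frame feature sequence.
    # Cluster count and dimensionality are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NetVLAD(nn.Module):
        def __init__(self, num_clusters: int = 64, dim: int = 1536):
            super().__init__()
            self.assign = nn.Linear(dim, num_clusters)             # soft cluster assignment
            self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, num_frames, dim) video feature sequence
            soft = F.softmax(self.assign(frames), dim=-1)           # (B, N, K)
            residual = frames.unsqueeze(2) - self.centroids         # (B, N, K, D)
            vlad = (soft.unsqueeze(-1) * residual).sum(dim=1)       # (B, K, D)
            vlad = F.normalize(vlad, dim=-1)                        # intra-normalization
            return F.normalize(vlad.flatten(1), dim=-1)             # (B, K*D) global descriptor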
In one example, as shown in fig. 5, a flow for determining the feature embedding vectors of a video is illustrated. In the first video 510 in fig. 5, a text box 511 and a product information bar 512 are displayed. The text box 511 displays text content, which is one kind of text information of the first video 510. The product information bar 512 displays the text content corresponding to the product 560, which is also a kind of text information of the first video 510. After the system acquires the first video 510, feature extraction is performed for the different recall dimensions. For the video content feature dimension, the video frames in the video frame sequence 520 corresponding to the first video 510 are input in order into an image feature extraction model (Inception-ResNet v2), and the image features of each video frame extracted by the model are input into the video frame feature aggregation model NetVLAD to obtain the feature embedding vectors corresponding to a plurality of video shots of the first video 510; these are then weighted and summed with weights obtained by machine learning to obtain the video content feature embedding vector. For the audio feature dimension, the audio data 530 corresponding to the first video 510 is input into the audio feature extraction model VGGish, and each audio feature extracted by VGGish is input into the audio feature aggregation model NetVLAD to obtain the audio features corresponding to the different shots of the first video 510; these are then weighted and summed with learned weights to obtain the audio feature embedding vector. For the cover feature dimension, the video cover 540 corresponding to the first video 510 is input into the image feature extraction model to obtain the cover feature embedding vector. For the text feature dimension, the first text information 550 corresponding to the text box 511 and the second text information 570 corresponding to the product information bar 512 are input together into the text feature extraction model to obtain their respective text feature embedding vectors. Finally, the extracted vectors are merged; the merging may be concatenation or weighted addition, or the vectors may not be merged at all, with the feature embedding vector of each recall dimension stored separately, which is not limited in the embodiments of the present application.
On the main pipeline of the information flow content service, short and small video content services, for example, are developing very rapidly, and a large amount of new video is produced every day. Part of the newly added video is original content uploaded by users, and another part is existing content carried over from other platforms. Such carrying causes videos with similar content to coexist on the platform, which greatly discourages original video authors and can also cause traffic dispersion and copyright disputes. The feature embedding vectors in the multiple recall dimensions facilitate subsequent video recall and help the information flow content service platform detect similar videos, so that the platform can deduplicate videos more effectively: for example, videos with higher frame rate and resolution are retained, and during review only the version with higher definition and resolution needs to be examined while other versions are skipped, reducing unnecessary labour. After the feature embedding vectors of the multiple recall dimensions have been generated, video recall is managed through a distributed vector recall service, that is, the following step 220 is performed.
And step 220, comparing the feature embedded vector of the first video with the feature embedded vector of at least one second video and recalling to obtain a recalled video related to the first video.
The recall video refers to a second video that is similar to the first video in at least one recall dimension.
In an exemplary embodiment, the feature embedding vector comprises a feature embedding vector in at least one recall dimension. Accordingly, as shown in FIG. 3, the implementation of step 220 includes the following steps (221-222).
And step 221, for the target recall dimension in the at least one recall dimension, performing comparison recall processing on the feature embedded vector of the first video in the target recall dimension and the feature embedded vector of the at least one second video in the target recall dimension to obtain a recall video associated with the first video in the target recall dimension.
When the target recall dimension is the key frame feature dimension, the key frame feature embedding vector of the first video is compared with the key frame feature embedding vector of the at least one second video for recall processing, obtaining the recall video associated with the first video in the key frame feature dimension; the key frames of the recall video associated in the key frame feature dimension are similar to the key frames of the first video. Optionally, the number of similar key frames between the two exceeds a preset threshold.
When the target recall dimension is the audio feature dimension, the audio feature embedding vector of the first video is compared with the audio feature embedding vector of the at least one second video for recall processing, obtaining the recall video associated with the first video in the audio feature dimension; the audio of the recall video associated in the audio feature dimension is similar to the audio of the first video. Optionally, the vector distance between the audio feature embedding vectors of the two is smaller than a threshold.
When the target recall dimension is the cover feature dimension, the cover feature embedding vector of the first video is compared with the cover feature embedding vector of the at least one second video for recall processing, obtaining the recall video associated with the first video in the cover feature dimension; the cover of the recall video associated in the cover feature dimension is similar to the cover of the first video. Optionally, the vector distance between the cover feature embedding vectors of the two is smaller than a threshold.
When the target recall dimension is the text feature dimension, the text feature embedding vector of the first video is compared with the text feature embedding vector of the at least one second video for recall processing, obtaining the recall video associated with the first video in the text feature dimension; the text content of the recall video associated in the text feature dimension is similar to the text content of the first video. Optionally, the vector distance between the text feature embedding vectors of the two is smaller than a threshold.
When the target recall dimension is the video content feature dimension, the video content feature embedding vector of the first video is compared with the video content feature embedding vector of the at least one second video for recall processing, obtaining the recall video associated with the first video in the video content feature dimension; the video content of the recall video associated in the video content feature dimension is similar to the video content of the first video. Optionally, the vector distance between the video content feature embedding vectors of the two is smaller than a threshold.
In this way, multi-channel video recall is realized and the video recall rate is improved.
Step 222, obtaining a recall video associated with the first video based on the recall videos associated with the first video in the recall dimensions.
And obtaining a recall video associated with the first video based on the recall video associated with the first video in the key frame feature dimension, the recall video associated with the audio feature dimension, the recall video associated with the cover feature dimension, the recall video associated with the text feature dimension and the recall video associated with the video content feature dimension respectively.
The recall videos from the different recall dimensions may or may not overlap; when they overlap, a duplicate video is counted only once.
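The multi-channel recall and the union of the per-dimension results can be sketched as follows; the threshold-based nearest-neighbour comparison, data layout and function names are illustrative assumptions, not the exact recall service of the embodiment.

```python
import numpy as np

def recall_by_dimension(query_vec, candidate_vecs, candidate_ids, threshold):
    """Recall the second videos whose embedding lies within `threshold` of the
    first video's embedding in one recall dimension (L2 distance here)."""
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
    return {vid for vid, d in zip(candidate_ids, dists) if d < threshold}

def multi_channel_recall(query_vecs, index_per_dim, thresholds):
    """Union of the per-dimension recall sets; a video recalled in several
    dimensions is counted only once."""
    recalled = set()
    for dim, query_vec in query_vecs.items():
        candidate_vecs, candidate_ids = index_per_dim[dim]
        recalled |= recall_by_dimension(query_vec, candidate_vecs, candidate_ids, thresholds[dim])
    return recalled
```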
And step 230, performing rechecking verification processing on the first video and the recall video, and determining a feature verification result of the first video and the recall video on at least one verification dimension.
And independently checking the video characteristic data of the first video and the recall video in at least one checking dimension to obtain the characteristic checking result of the first video and the recall video in at least one checking dimension.
The feature verification results correspond to verification dimensions, and one feature verification result represents whether the first video and the recall video are similar in the corresponding verification dimensions.
In an exemplary embodiment, as shown in FIG. 3, the implementation of step 230 includes the following steps (231-233).
Step 231, determining video feature data of the first video and the recall video in the target verification dimension respectively.
The target verification dimension is any one of the at least one verification dimension. The video feature data in a verification dimension is generally of higher precision than the feature embedding vector in a recall dimension, and its data dimensions are denser; the recall is only a coarse-grained recall that narrows the range for rechecking verification, so high-precision verification can be performed without slowing down deduplication, and the detection accuracy of similar videos is improved.
In a possible implementation manner, the target verification dimension is a video frame verification dimension, as shown in fig. 6, which shows a flowchart three of a similar video detection method provided in an embodiment of the present application. In fig. 6, the implementation of the above step 231 includes the following step 231 a.
In step 231a, the keyframe feature vectors of the first video and the recalled video, respectively, in the video frame check dimension are determined.
The key frame feature vector is video feature data corresponding to the video frame check dimension.
Optionally, the image feature extraction processing is performed on the key frame to obtain a key frame feature vector.
In a possible embodiment, the target verification dimension is an audio verification dimension, and as shown in fig. 6, the implementation process of the above step 231 includes the following step 231 b.
Step 231b, determining the audio feature sequences of the first video and the recalled video in the audio verification dimension respectively.
The audio feature sequence is video feature data corresponding to the audio verification dimension.
In one possible implementation, the audio feature sequence is an MFCC sequence. The mel frequency scale is derived from the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz (hertz). Mel-Frequency Cepstral Coefficients (MFCC) are spectral features computed using this relationship and are mainly used for speech feature extraction and dimensionality reduction. For example, for a frame of 512-dimensional (sampling point) data, the most important 40 dimensions (typically) can be extracted by MFCC, achieving dimensionality reduction while extracting features. MFCC extraction typically goes through the following steps: pre-emphasis, framing, windowing, Fast Fourier Transform (FFT), mel filter bank, and Discrete Cosine Transform (DCT); the resulting MFCC can be used directly for verification.
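A minimal sketch of extracting such an MFCC sequence, assuming the librosa library is available; the 16 kHz sampling rate and the 40-coefficient setting are illustrative values.

```python
import librosa

def extract_mfcc_sequence(audio_path, n_mfcc=40):
    """Framing, windowing, FFT, mel filter bank and DCT are handled inside
    librosa.feature.mfcc; pre-emphasis is applied explicitly beforehand.
    The result is an (n_mfcc, n_frames) matrix used as the audio feature
    sequence for verification."""
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    y = librosa.effects.preemphasis(y)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
```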
In another possible embodiment, the audio in the first video is extracted by FFmpeg (Fast Forward MPEG, a multimedia video processing tool). The audio is stored in a Cloud Object Storage (COS) service, a distributed storage service for massive files with advantages such as high scalability, low cost, reliability and security. Through diversified means such as a console, an Application Programming Interface (API), a Software Development Kit (SDK) and other tools, a user can easily access COS, upload, download and manage files in multiple formats, and realize mass data storage and management. Next, the audio is divided into multiple audio segments by the Chromaprint algorithm; for example, the audio may be divided into overlapping segments. The duration of an audio segment may be 1 second, each second of audio yields about 6 sampling points, and each sampling point corresponds to 32 bits of feature data, which is stored in CKV, a distributed KV storage service compatible with open-source protocols such as Redis and Memcached.
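A hedged sketch of this pipeline, assuming FFmpeg and the Chromaprint fpcalc command-line tool are installed; the exact commands are illustrative, and the storage to COS/CKV is omitted.

```python
import json
import subprocess

def extract_audio(video_path, wav_path):
    """Strip the audio track with FFmpeg (mono, 16 kHz PCM WAV)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )

def chromaprint_fingerprint(wav_path):
    """Raw Chromaprint fingerprint via the fpcalc tool: a sequence of 32-bit
    integers covering overlapping slices of the audio."""
    out = subprocess.run(
        ["fpcalc", "-raw", "-json", wav_path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["fingerprint"]
```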
In a possible embodiment, the target verification dimension is a pixel value verification dimension, as shown in fig. 6, the implementation process of the above step 231 includes the following step 231 c.
In step 231c, the key frame hash values of the first video and the recalled video in the pixel value check dimension are determined.
The key frame hash value is video feature data corresponding to the pixel value check dimension.
Optionally, the key frame hash value may be calculated by a perceptual hash algorithm. Perceptual hash algorithms are a family of algorithms including average hash (aHash), perceptual hash (pHash) and difference hash (dHash). As the name implies, a perceptual hash does not compute the hash value in a strict way but in a relative way, since whether two images are "similar" is itself a relative judgment. Optionally, the key frame hash value is the average hash aHash, which is fast to compute but less accurate. Optionally, the key frame hash value is the perceptual hash pHash, which is more accurate but slower. Optionally, the key frame hash value is the difference hash dHash, which has high accuracy and is very fast; dHash can be selected as a supplementary verification algorithm for picture similarity judgment.
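A minimal dHash sketch, together with the Hamming distance used for comparing hash values, assuming Pillow is available; the 8x8 hash size is the common convention, not a requirement of the embodiment.

```python
from PIL import Image

def dhash(image_path, hash_size=8):
    """Difference hash: shrink the key frame to (hash_size+1) x hash_size
    grayscale and compare each pixel with its right-hand neighbour."""
    img = Image.open(image_path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits; a small distance means the key frames look alike."""
    return bin(h1 ^ h2).count("1")
```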
Step 232, based on the video feature data, determining a feature verification intermediate result of the first video and the recalled video in the target verification dimension.
The characteristic checking intermediate result is used for representing the similarity degree among various video characteristic data of different videos.
In the embodiment where the target verification dimension is a video frame verification dimension, as shown in fig. 6, the implementation process of the step 232 includes the following step 232 a.
Step 232a, determining a key frame check intermediate result between the first video and the recalled video based on the key frame feature vector.
The key frame check intermediate result comprises the similarity between the key frames of the first video and the key frames of the recalled video. The similarity may be the vector distance between the key frame feature vectors, and the key frame check intermediate result indicates whether two key frames are similar: similar key frames correspond to a similar flag, such as the number 1, and dissimilar key frames correspond to a dissimilar flag, such as the number 0.
In the embodiment where the target verification dimension is an audio verification dimension, as shown in fig. 6, the implementation process of the step 232 includes the following step 232 b.
Step 232b, determining an audio check intermediate result between the first video and the recalled video based on the audio feature sequence.
The audio check intermediate result comprises an audio feature distance between respective audio feature sequences of the first video and the recalled video.
The audio feature sequences of the audio to be detected corresponding to the first video and the recalled video are acquired, the feature lengths of the audio feature sequence of the first video and of the recalled video are recorded as L1 and L2 respectively, and the edit distance d of the binary bitstreams is calculated (where the replacement operation has a cost of 2). The similarity (i.e., the audio feature distance) is expressed as: Similarity = 1 - d / (L1 + L2).
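A sketch of this similarity computation; the edit distance uses a replacement cost of 2 as stated above, and the sequence elements are assumed to be the fingerprint values of the two audio tracks.

```python
def edit_distance(a, b, substitution_cost=2):
    """Edit distance over two fingerprint sequences, with the replacement
    operation counted as 2 as described above."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if a[i - 1] == b[j - 1] else substitution_cost
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + cost)
            prev = cur
    return dp[n]

def audio_similarity(seq1, seq2):
    """Similarity = 1 - d / (L1 + L2)."""
    d = edit_distance(seq1, seq2)
    return 1.0 - d / (len(seq1) + len(seq2))
```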
In the embodiment where the target verification dimension is a pixel value verification dimension, as shown in fig. 6, the implementation process of the step 232 includes the following step 232 c.
Step 232c, determining a pixel value check intermediate result between the first video and the recalled video based on the key frame hash value.
The pixel value check intermediate result comprises the hash value similarity between the key frames of the first video and the key frames of the recalled video. The hash value similarity may be the distance between the hash values of different key frames, and the pixel value check intermediate result indicates whether two key frames are similar: similar key frames correspond to a similar flag, such as the number 1, and dissimilar key frames correspond to a dissimilar flag, such as the number 0.
And step 233, determining the feature verification results of the first video and the recall video in the target verification dimension according to the feature verification intermediate result.
The feature verification result is used for representing the overall similarity degree of different videos in the target verification dimension.
In the embodiment where the target verification dimension is a video frame verification dimension, as shown in fig. 6, the implementation process of step 233 includes the following step 233 a.
Step 233a, determining a key frame verification result of the first video and the recall video in the video frame verification dimension according to the similarity between the key frame of the first video and the key frame of the recall video.
The key frame checking result is a feature checking result corresponding to the video frame checking dimension, and the key frame checking result is used for representing the similarity degree of the key frame of the first video and the key frame of the recalled video.
In the embodiment where the target verification dimension is an audio verification dimension, as shown in fig. 6, the implementation process of step 233 includes the following step 233 b.
And step 233b, determining an audio verification result of the first video and the recalled video in the audio verification dimension according to the audio characteristic distance.
The audio verification result is a characteristic verification result corresponding to the audio verification dimension, and the audio verification result is used for representing the similarity degree of the audio of the first video and the audio of the recalled video.
In the embodiment where the target verification dimension is a pixel value verification dimension, as shown in fig. 6, the implementation process of step 233 includes the following step 233 c.
And step 233c, determining a local area verification result of the first video and the recall video in the pixel value verification dimension according to the similarity of the hash values.
The local area verification result is a feature verification result corresponding to the pixel value verification dimension, and the local area verification result is used for representing the similarity degree of the first video and the recall video in the local area.
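One possible way to turn the per-item 0/1 flags of a verification dimension (per key frame or per local area) into that dimension's feature verification result; the 0.8 ratio threshold is an assumed example value, not one from the embodiment.

```python
def dimension_verification_result(similar_flags, ratio_threshold=0.8):
    """Aggregate the per-key-frame (or per-region) 0/1 flags of one verification
    dimension into the dimension's feature verification result: the videos are
    considered similar in this dimension when enough items match."""
    if not similar_flags:
        return False
    return sum(similar_flags) / len(similar_flags) >= ratio_threshold
```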
In one example, as shown in FIG. 7, a schematic diagram of a template video is illustrated. FIG. 7 shows 3 videos, i.e., video 710, video 720 and video 730, produced from the same video template 700. The background image in the video template 700 is fixed and has a video display area 740; video 710, video 720 and video 730 display different content in the video display area 740, and possibly different audio content. Although the background images of videos 710, 720 and 730 are the same, they are substantially three different videos. To avoid videos 710, 720 and 730 being identified as similar videos, the differences between them can be effectively detected by comparing the difference hash values dHash of their key frames.
The key frame hash value dHash and the audio feature sequence (MFCC + Chromaprint feature) provide enhanced identification in the fuzzy middle region between the duplicate and non-duplicate thresholds, and can effectively distinguish videos whose pictures look similar but are actually different; by setting different subdivided check thresholds within that interval, the overall detection efficiency of similar videos is effectively improved.
In an exemplary embodiment, as shown in FIG. 3, the above step 231 further comprises the following steps (234 to 235).
Step 234, obtain the first level cache result.
The first-level cache result is used for storing a characteristic checking intermediate result between different videos on at least one checking dimension.
In an information flow content service scenario, hot videos exist in the content stream, so videos that duplicate a hot video are recalled repeatedly; adding a cache effectively reduces reads and writes during the recall process, thereby speeding up rechecking verification and deduplication judgment.
The first-level cache result comprises a key frame check intermediate result of different videos in a key frame check dimension, an audio check intermediate result in an audio check dimension and a pixel value check intermediate result in a pixel value check dimension.
Optionally, the first-level cache result includes a version number corresponding to each feature verification intermediate result in the at least one verification dimension; vector comparison results from different version numbers are not comparable, so the version number needs to be checked.
In step 235, if there is an intermediate result of feature verification of the first video and the recalled video in the target verification dimension in the first-level cache result, obtaining the intermediate result of feature verification, and performing step 233.
In an exemplary embodiment, as shown in FIG. 3, after the step 235, the following steps (236-238) are included.
And step 236, if the first-level cache result does not have a characteristic check intermediate result of the first video and the recalled video in the target check dimension, acquiring a second-level cache result.
The second-level cache result is used for storing video characteristic data of different videos on at least one checking dimension.
The second-level cache result comprises key frame feature vectors of different videos in a key frame checking dimension, audio feature sequences in an audio checking dimension and key frame hash values in a pixel value checking dimension.
Optionally, the second-level cache result includes compression encoding of key frame feature vectors of different videos in a key frame check dimension, compression encoding of audio feature sequences in an audio check dimension, and compression encoding of key frame hash values in a pixel value check dimension. Subsequently, the compressed code of the key frame feature vector can be decoded to obtain the key frame feature vector; the compressed codes of the audio characteristic sequence can be decoded to obtain the audio characteristic sequence; the compressed code of the key frame hash value can be decoded to obtain the key frame hash value.
Optionally, the second-level cache result includes a version number corresponding to the video feature data in at least one verification dimension, and the video feature data between different version numbers are not comparable and need to be verified according to the version number.
In step 237, if the video feature data of the first video and the recalled video in the target verification dimension exist in the second-level cache result, the video feature data is obtained, and step 232 is executed.
To facilitate independent optimization of the algorithms that extract video feature data, the method supports comparison and deduplication over multiple versions of video feature data, which also facilitates subsequent independent optimization of the extracted features; because adding new video feature data does not affect the system structure, the scheme has high extensibility, and the two-level cache mechanism accelerates the rechecking verification process.
In step 238, if the video feature data of the first video and the recalled video in the target verification dimension do not exist in the second-level cache result, step 231 is executed.
Video recall and rechecking verification can thus be seamlessly connected: a recalled video is rechecked and verified as soon as it appears, without waiting for all recall results. The embodiments of the present application complete accurate content identification through accurate rechecking verification, accelerate the rechecking verification process through the cache service, and improve the verification process and the content processing efficiency.
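A minimal sketch of the two-level cache lookup with version checking described in steps 234 to 238; every helper (extract_feature, compare_features, decide_result, encode, decode) is a hypothetical placeholder for the real feature extraction, comparison, decision and compression services.

```python
# Hypothetical placeholders for the real services used below.
def extract_feature(video, dim):   return video["features"][dim]        # step 231
def compare_features(f1, f2, dim): return {"similar": f1 == f2}         # step 232
def decide_result(intermediate):   return intermediate["similar"]       # step 233
def encode(feature):               return feature                       # stand-in for compression encoding
def decode(blob):                  return blob                          # stand-in for decoding

def verify_with_cache(video_a, video_b, dim, l1_cache, l2_cache, version):
    """Two-level cache lookup with version checking for one verification dimension."""
    l1_key = (video_a["id"], video_b["id"], dim)
    hit = l1_cache.get(l1_key)
    if hit and hit["version"] == version:                 # step 235: first-level cache hit
        return decide_result(hit["intermediate"])

    feats = []
    for v in (video_a, video_b):                          # steps 236-238: second-level lookup or recompute
        entry = l2_cache.get((v["id"], dim))
        if entry and entry["version"] == version:
            feats.append(decode(entry["compressed"]))
        else:
            feats.append(extract_feature(v, dim))
            l2_cache[(v["id"], dim)] = {"version": version, "compressed": encode(feats[-1])}

    intermediate = compare_features(feats[0], feats[1], dim)
    l1_cache[l1_key] = {"version": version, "intermediate": intermediate}
    return decide_result(intermediate)
```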
And 240, if the feature verification result on at least one verification dimension meets the preset video similarity condition, determining that the first video is a similar video of the recalled video.
The preset video similarity condition may be that the feature verification result in every verification dimension indicates that the first video and the recall video are similar in that dimension. If the feature verification result in every verification dimension indicates that the first video is similar to the recall video, the first video is determined to be a similar video of the recall video; otherwise it is not.
The preset video similarity condition may alternatively be that the feature verification results in a preset number of verification dimensions indicate that the first video and the recall video are similar in those dimensions. If the feature verification results in the preset number of verification dimensions indicate that the first video is similar to the recall video in those dimensions, the first video is determined to be a similar video of the recall video; otherwise it is not.
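A sketch of these two preset video similarity conditions; the dictionary keys and the required count are illustrative.

```python
def is_similar_video(check_results, required=None):
    """check_results maps each verification dimension to True/False.
    required=None -> every dimension must indicate similarity (first condition);
    required=N    -> at least N dimensions must indicate similarity (second condition)."""
    passed = sum(1 for ok in check_results.values() if ok)
    if required is None:
        return passed == len(check_results)
    return passed >= required

# e.g. is_similar_video({"frame": True, "audio": True, "pixel": False}, required=2) -> True
```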
Through the decoupling of the video recall process and the video similarity check process, video recall is carried out on the basis of the characteristic embedding vector of the video, after the recall video is obtained, checking check is carried out on the first video and the recall video more accurately from a plurality of checking dimensions, then similar video judgment is completed, and video duplicate removal efficiency is effectively improved.
In one example, as shown in fig. 8, a schematic diagram of rechecking verification of similar videos is exemplarily shown. In the rechecking verification process, with the assistance of a security authentication module, a scheduling module, a timeout retry module and an error retry module, a vectorization module (PreProcessor) and a judgment module (JudgeNode) determine whether two videos are similar, i.e., duplicates. The security authentication module prevents the judgment module from being called by unauthorized modules; the vectorization module generates the feature data of the video in each verification dimension using the video feature determination methods of the multiple verification dimensions provided by the above embodiments; the judgment module can retrieve the comparison results and their version numbers from the first-level cache results, and the compressed vectors, features and corresponding version numbers from the second-level cache, using the cache service to accelerate the rechecking verification service; in addition, the judgment module introduces subdivided classification judgments for pictures that need no cropping (for example, no interfering text around the video picture), lace cropping, rotation and 1/3 cropping, so as to improve the accuracy of the whole process. The feature data of each verification dimension to be called is arranged in the vectorization module queue, the feature data of different videos are compared and verified in each verification dimension in the rechecking verification queue, and an aggregated result, i.e., whether the two videos are similar, is obtained according to the feature verification result in each verification dimension.
To sum up, according to the technical scheme provided by the embodiment of the application, by using the feature embedding vector of the acquired first video as a recall basis, video recall processing is performed on at least one second video, after the recall video associated with the first video is obtained, the first video and the recall video are further rechecked and checked, similar video checking is performed on the first video and the recall video from multiple dimensions, a multi-dimensional feature checking result is obtained, whether the feature checking result meets preset video similarity conditions or not is judged, then similar video judgment is completed, and decoupling of the video recall process and the video similarity checking process is achieved. The video recall process does not need to carry out accurate verification on the video, the calculated amount of a video duplicate removal task can be effectively reduced, the recall rate is improved, after the second video associated with the first video is rapidly screened out as a recall result, further rechecking verification is carried out on the recall result, the accuracy of similar video detection is ensured, and finally the video duplicate removal efficiency is improved.
Based on the above embodiments, the method is described below with reference to a specific video deduplication task in the information stream content service. In one example, as shown in fig. 9, a flow chart of video deduplication is exemplarily shown. The whole information flow content service system can be split into a multi-layer structure comprising a preprocessing layer, a feature extraction layer, a recall layer, a precision layer, a decision layer and a monitoring layer, where each layer can be scaled horizontally. The preprocessing layer performs the key frame extraction task, the audio preprocessing task, the cover picture preprocessing task and the average frame extraction task. The feature extraction layer generates the feature embedding vectors of the video in the different recall dimensions. On the content link of the information flow content service, the newly warehoused videos and image-text contents reach the million level and need to be added to the deduplication sample library in real time for similarity calculation; after the feature embedding vectors of the videos in the different recall dimensions are generated, similarity retrieval and recall are carried out through the distributed vector recall service of the recall layer. In the distributed retrieval process, mixing a large number of reads and writes in the Faiss library seriously affects performance, so read-write separation is needed in the implementation. The video recall in the recall layer includes multiple channels such as text vector recall, audio recall, cover recall, title recall and video content vector recall. Facing complex vector retrieval scenarios with different purposes, such as the feature embedding vectors of different recall dimensions, the feature embedding vector of each recall dimension is stored independently and each recall dimension performs vector recall separately; vector retrieval and recall process reuse is realized with the Faiss library, improving recall efficiency. To accelerate and independently optimize the similar video verification process, multiple verification modes are adopted after the recall layer to improve the precision of similar video detection. The main reason for this is that the judgment of similar videos involves many difficult cases and various interferences, and facing massive content the recall process cannot balance recall speed and detection precision at the same time, so the recall layer and the precision layer are separated and decoupled. Some similar videos with low distinctiveness that are difficult to identify, such as (1) videos with watermarks, picture cropping and interference from picture filters; (2) the same video picture configured with different lace borders; and (3) videos with similar picture backgrounds, such as game, lecture and news videos, can be accurately verified by the precision layer and identified in the information flow content service.
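A minimal sketch of the coarse recall with a Faiss index, assuming the faiss Python package is available; in the described system one such index would be maintained per recall dimension, and the embedding size, library size and score threshold here are assumed values, not values from the embodiment.

```python
import faiss
import numpy as np

dim = 128                                      # assumed embedding size of one recall dimension
index = faiss.IndexFlatIP(dim)                 # one independent index per recall dimension

library = np.random.rand(10000, dim).astype("float32")   # embeddings of the second videos
faiss.normalize_L2(library)                    # cosine similarity via normalized inner product
index.add(library)

query = np.random.rand(1, dim).astype("float32")          # embedding of the first video
faiss.normalize_L2(query)
scores, ids = index.search(query, 50)          # coarse, high-recall candidate set
candidates = [int(i) for i, s in zip(ids[0], scores[0]) if s > 0.8]
```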
The video deduplication process shown in fig. 9 upgrades the existing multidimensional vector deduplication system for content, decouples the deduplication recall and verification capabilities, and improves the accuracy of similar video detection after similar content is recalled; the introduced multi-level cache mechanism accelerates the verification processing, especially for hot video content. By separating the recall and verification of content deduplication, the amount of computation in the deduplication process can be effectively reduced and the processing speed improved; the two-level cache mechanism introduced in the verification process further accelerates verification and improves the distribution efficiency of hot video content. By introducing multiple verification means, the recall rate can be improved by lowering the similarity threshold in the recall stage while accurate verification is strengthened in the verification stage; as a result, both the deduplication accuracy and the recall are greatly improved, and the recall and verification processes can be optimized independently.
In another example, as shown in fig. 10, a technical framework diagram of an information flow content service system is exemplarily shown. The respective service modules and their main functions in the information streaming contents service system shown in fig. 10 are as follows.
Content production end and content consumption end
(1) PGC (Professionally Generated Content), UGC (User Generated Content) or MCN (Multi-Channel Network) content producers provide video content and image-text content through a mobile terminal or a backend API (Application Programming Interface) system; the multimedia content acquired through these publishing portals is the main content source of the information stream content service.
(2) The content production end uploads the published content through communication with the uplink and downlink content interface service. Video content is usually published through a terminal with a shooting function serving as the shooting end; during shooting, the user can select matching music for the local video content, perform corresponding editing, and choose a cover picture, filter templates, video beautification functions and the like. Image-text content is usually published through an image-text editor and a typesetting system.
(3) The content consumption end obtains content index information through communication with the uplink and downlink content interface service, directly obtains a content source file from the content storage service according to the content index information, and then loads the content source file to be displayed for a user. The index information may be index information of contents to which the user subscribes. The content storage server stores content entities such as video source files, picture source files of cover art, and meta information of contents such as title, author, cover art, category, tag information, etc. is stored in the content database.
(4) During uploading and downloading, the content production end and the content consumption end also report behavior data such as playback stutter, loading time and play clicks to the uplink and downlink content interface server or another backend server for subsequent statistical analysis.
(5) The content consuming end usually presents the content to the user in a Feeds stream manner, so that the user can browse and consume the content data.
Second, up and down going content interface server
(1) And directly communicating with the content production end to obtain the content submitted from the content production end. Usually, the content meta information includes the title, the publisher, the abstract, the cover page, the publishing time, etc. of the content.
(2) And writing content meta information into the content database, such as file size, cover map link, title, release time, author and the like.
(3) The content released and submitted by the content production end is synchronized to the scheduling center server (in fig. 10, the content is referred to as the content entering the scheduling center for short), so that the scheduling center server performs subsequent content processing and circulation.
Third, content database
(1) The content database is the core database of content; the meta-information of the content released by all content production ends is stored in it, with emphasis on meta-information such as file size, cover picture link, bit rate, file format, title, release time, author, video file size, video format, originality mark and first-release mark, as well as the classification label information produced during manual review. The classification label information includes first-level, second-level and third-level categories as well as tag information; for example, for a video explaining a mobile phone of a certain brand, the first-level category is technology, the second-level category is smartphone, the third-level category is domestic mobile phone, and the tag information is the specific brand and model.
(2) The manual review system reads information in the content database (in fig. 10, the original content is simply read) during the manual review process, and the manual review result and status are also returned to the content database by the manual review system.
(3) The processing scheduled by the dispatching center server mainly comprises machine processing and manual review. The core machine processing includes various judgments of video quality, such as low-quality video filtering, and content labeling, such as adding video classification information and tag information; it also includes content similarity examination, the result of which can be written into the content database, so that completely identical content does not undergo repeated secondary processing and review manpower is saved.
Fourth, dispatch center server and manual auditing system
(1) The dispatching center server is responsible for the whole dispatching process of content circulation, receives the warehoused content through the uplink and downlink content interface server, and then obtains the meta-information of the content from the content database.
(2) And scheduling a manual auditing system and accurate checking service, and controlling scheduling sequence and priority.
(3) The content is enabled by the dispatch center server to start distribution, and the content is distributed to the consuming end through a content distribution service (usually a recommendation engine or a search engine or an operation platform) so that the content consuming end displays the page content and directly provides the content to the content consumer of the terminal. That is, the content consumption end obtains content index information, such as URL address information.
(4) For video content and tag expansion service communication, tags of the video content are enriched and expanded, and the efficiency of content cold start and operation is improved.
(5) The dispatch center server updates the meta information to the content database.
(6) The manual review system is the carrier of manual service capability; it is mainly used to review and filter content that is sensitive or not permitted by law, and content that machines cannot reliably judge, and to label the video content.
Content storage service
(1) Stores content entity information other than the meta information of the content, such as video source files and the picture source files of image-text content. Source files are written to the content storage service through communication with the uplink and downlink interface server; the download file system can download video files from the content storage service, which also provides content source files to the content consumption end.
(2) When the video feature embedding vectors are generated, provides temporary storage for the intermediate frame-extraction content and audio information of the video source files, avoiding repeated extraction.
Sixth, download the file system
(1) The method comprises the steps of downloading and acquiring original content from a content storage service, controlling the downloading speed and progress, generally comprising a group of parallel servers and related task scheduling and distribution clusters.
(2) The downloaded files are passed to the frame-extraction and audio service to obtain the necessary key frames and audio information from the source files, providing the download service for the subsequent construction of video feature embedding vectors.
Seven, frame extraction audio service
(1) The files downloaded by the downloaded file system from the content storage service are pre-processed for file characteristics according to the algorithms and policies mentioned above.
(2) The frame sequence and audio features of the video are extracted according to the feature construction methods for the video modality and audio modality described above. The core problem of frame extraction is handling videos of different durations: if a uniform frame-extraction strategy is used throughout (for example, 1 frame per second), the sampling frequency is too high for long videos, increasing the frame-extraction burden and the amount of computation and sharply raising computing cost; conversely, extracting frames at longer intervals (for example, every 1, 3, 5 or 7 seconds) results in an insufficient frame rate for short videos. Therefore, a variable-length frame-extraction strategy is adopted: the core is to extract scene-switching frames with obvious brightness changes, and then supplement frames at equal intervals, so that the number of extracted frames falls between the two strategies.
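A minimal sketch of such a variable-length frame-extraction strategy, assuming OpenCV is available; the brightness-difference threshold and maximum gap are illustrative values, not parameters from the embodiment.

```python
import cv2

def extract_key_frames(video_path, diff_threshold=30.0, max_gap_s=5.0):
    """Variable-length frame extraction: keep frames with an obvious brightness
    change against the previous frame (scene switching), and supplement a frame
    at equal intervals whenever max_gap_s passes without one."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    key_frames, prev_gray, last_kept_t = [], None, None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        t = idx / fps
        changed = prev_gray is not None and cv2.absdiff(gray, prev_gray).mean() > diff_threshold
        if prev_gray is None or changed or (t - last_kept_t) >= max_gap_s:
            key_frames.append((t, frame))
            last_kept_t = t
        prev_gray = gray
        idx += 1
    cap.release()
    return key_frames
```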
Eight, multi-dimensional embedded vector generation service
(1) Generates feature embedding vectors according to the specific methods described above for image-text and video content. Image-text content has a title, a cover picture and a body, where the title and the body are text; video also contains rich information, with modalities such as the video content stream, cover picture, title text and audio, and the information of the different dimensions of the video is described separately to extract the corresponding feature embedding vectors.
(2) And embedding the generated features into a vector library of the distributed vector recall service according to types for subsequent processing.
(3) For newly warehoused content, the corresponding feature embedding vectors are obtained by the same method; the feature embedding vectors and the algorithm that generated them carry corresponding version numbers, and only comparison and recall between vectors of the same version are meaningful.
Nine, distributed vector recall service
(1) As described above, on the basis of the constructed multidimensional embedded vector generation service, the vector indexes are managed and retrieved/matched in a distributed manner; specifically, a distributed Faiss library is adopted to manage and retrieve the massive video index information. This is mainly the coarse recall procedure of deduplication retrieval.
(2) Different types of vectors can realize single-path recalls in different independent libraries.
Ten, accurate verification service
(1) As described in detail above, the distributed vector retrieval service is invoked to implement vector similarity recalls of different dimensions according to different vector libraries, such as text-based feature embedding vectors, cover map feature embedding vectors, audio feature embedding vectors, video content feature embedding vectors, and key frame feature embedding vectors, which are respectively used for multi-channel video recalls.
(2) Receives scheduling from the dispatch center server, performs the accurate verification service that deduplicates the complete content of newly warehoused content, and outputs the accurate result of the final deduplication calculation together with the similarity relation chain, i.e., for one piece of content, the set of similar and near-duplicate contents and the corresponding similarity values.
(3) And communicating with the verification acceleration and cache service to finish acceleration processing judged by verification.
Eleven, check acceleration service
(1) According to the detailed method and steps described above, the verification acceleration and the caching process after the rough recall is completed are realized, the verification efficiency and the calculation speed of the hot video content are improved, and the calculation efficiency is improved.
(2) Communicating with the precision check service.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 11, a block diagram of a similar video detection apparatus provided in an embodiment of the present application is shown. The device has the function of realizing the similar video detection method, and the function can be realized by hardware or hardware executing corresponding software. The device can be a computer device and can also be arranged in the computer device. The apparatus 1100 may include: embedded feature acquisition module 1110, video recall module 1120, video verification module 1130, and similar video determination module 1140.
An embedded feature obtaining module 1110, configured to obtain a feature embedded vector of the first video.
And a video recall module 1120, configured to perform recall processing on the feature embedded vector of the first video and the feature embedded vector of the at least one second video, so as to obtain a recalled video associated with the first video.
The video verification module 1130 is configured to perform review verification processing on the first video and the recalled video, and determine a feature verification result of the first video and the recalled video in at least one verification dimension.
A similar video determining module 1140, configured to determine that the first video is a similar video of the recalled video if the feature verification result in the at least one verification dimension meets a preset video similarity condition.
In an exemplary embodiment, the video verification module 1130 includes: the device comprises a characteristic determining unit, a characteristic verifying unit and a result determining unit.
A feature determining unit, configured to determine video feature data of the first video and the recalled video in a target verification dimension, where the target verification dimension is any one of the at least one verification dimension.
And the feature verification unit is used for determining a feature verification intermediate result of the first video and the recalled video in the target verification dimension based on the video feature data, wherein the feature verification intermediate result is used for representing the similarity degree between various items of video feature data of different videos.
And the result determining unit is used for determining the feature verification results of the first video and the recalled video in the target verification dimension according to the feature verification intermediate result, wherein the feature verification results are used for representing the overall similarity degree of different videos in the target verification dimension.
In an exemplary embodiment, the video verification module 1130 further includes: the device comprises a first-level cache unit and a cache calling unit.
And the first-level cache unit is used for acquiring a first-level cache result, and the first-level cache result is used for storing a characteristic check intermediate result between different videos on the at least one check dimension.
And the cache calling unit is used for acquiring a feature verification intermediate result if the feature verification intermediate result of the first video and the recall video in the target verification dimension exists in the first-level cache result, and executing the operation of determining the feature verification result of the first video and the recall video in the target verification dimension according to the feature verification intermediate result.
In an exemplary embodiment, the video verification module 1130 further includes: and a second level cache unit.
A second-level cache unit, configured to obtain a second-level cache result if a feature verification intermediate result of the first video and the recalled video in the target verification dimension does not exist in the first-level cache result, where the second-level cache result is used to store video feature data of different videos in the at least one verification dimension.
The cache retrieval unit is further configured to, if video feature data of the first video and the recalled video in the target verification dimension respectively exists in the second-level cache result, obtain the video feature data, and perform the operation of determining a feature verification intermediate result of the first video and the recalled video in the target verification dimension based on the video feature data.
The cache retrieval unit is further configured to, if video feature data of the first video and the recalled video in the target verification dimension do not exist in the second-level cache result, perform the operation of determining the video feature data of the first video and the recalled video in the target verification dimension.
In an exemplary embodiment, the target verification dimension is a video frame verification dimension, and the feature determining unit is specifically configured to determine a key frame feature vector of each of the first video and the recalled video in the video frame verification dimension, where the key frame feature vector is video feature data corresponding to the video frame verification dimension.
The feature checking unit is specifically configured to determine, based on the key frame feature vector, a key frame checking intermediate result between the first video and the recalled video, where the key frame checking intermediate result includes a similarity between a key frame of the first video and a key frame of the recalled video.
The result determining unit is specifically configured to determine, according to a similarity between a key frame of the first video and a key frame of the recalled video, a key frame verification result of the first video and the recalled video in the video frame verification dimension, where the key frame verification result is a feature verification result corresponding to the video frame verification dimension, and the key frame verification result is used to represent a similarity between the key frame of the first video and the key frame of the recalled video.
In an exemplary embodiment, the target verification dimension is an audio verification dimension, and the feature determining unit is specifically configured to determine an audio feature sequence of each of the first video and the recalled video in the audio verification dimension, where the audio feature sequence is video feature data corresponding to the audio verification dimension.
The feature verification unit is specifically configured to determine an audio verification intermediate result between the first video and the recalled video based on the audio feature sequence, where the audio verification intermediate result includes an audio feature distance between the respective audio feature sequences of the first video and the recalled video.
The result determining unit is specifically configured to determine, according to the audio feature distance, an audio verification result of the first video and the recalled video in the audio verification dimension, where the audio verification result is a feature verification result corresponding to the audio verification dimension, and the audio verification result is used to represent a degree of similarity between an audio of the first video and an audio of the recalled video.
In an exemplary embodiment, the target verification dimension is a pixel value verification dimension, and the feature determining unit is specifically configured to determine a keyframe hash value of each of the first video and the recalled video in the pixel value verification dimension, where the keyframe hash value is video feature data corresponding to the pixel value verification dimension.
The feature verification unit is specifically configured to determine, based on the key frame hash value, a pixel value verification intermediate result between the first video and the recalled video, where the pixel value verification intermediate result includes hash value similarity between a key frame of the first video and a key frame of the recalled video.
The result determining unit is specifically configured to determine, according to the hash value similarity, a local area verification result of the first video and the recalled video in the pixel value verification dimension, where the local area verification result is a feature verification result corresponding to the pixel value verification dimension, and the local area verification result is used to represent a degree of similarity between the first video and the recalled video in a local area.
In an exemplary embodiment, the feature embedding vectors include feature embedding vectors in at least one recall dimension, the video recall module 1120 including: vector comparison unit and video recall unit.
And the vector comparison unit is used for carrying out comparison recall processing on the feature embedded vector of the first video in the target recall dimension and the feature embedded vector of the at least one second video in the target recall dimension for a target recall dimension in the at least one recall dimension to obtain a recall video associated with the first video in the target recall dimension.
A video recall unit, configured to obtain the recall video associated with the first video based on recall videos associated with the first video in each recall dimension.
In an exemplary embodiment, the at least one recall dimension includes at least one of a key frame feature dimension, an audio feature dimension, a cover feature dimension, a text feature dimension, and a video content feature dimension, and the embedded feature acquisition module 1110 includes: the system comprises a key frame characteristic determining unit, an audio characteristic determining unit, a cover characteristic determining unit, a text characteristic determining unit and a video content characteristic determining unit.
A key frame feature determination unit, configured to determine a key frame feature embedding vector of the first video in the key frame feature dimension if the at least one recall dimension includes the key frame feature dimension.
An audio feature determination unit to determine an audio feature embedding vector of the first video in the audio feature dimension if the at least one recall dimension comprises the audio feature dimension.
A cover feature determination unit to determine a cover feature embedding vector of the first video in the cover feature dimension if the at least one recall dimension includes the cover feature dimension.
A text feature determination unit to determine a text feature embedding vector of the first video in the text feature dimension if the at least one recall dimension includes the text feature dimension.
A video content feature determination unit, configured to determine a video content feature embedding vector of the first video in the video content feature dimension if the at least one recall dimension includes the video content feature dimension.
To sum up, according to the technical scheme provided by the embodiment of the application, by using the feature embedding vector of the acquired first video as a recall basis, video recall processing is performed on at least one second video, after the recall video associated with the first video is obtained, the first video and the recall video are further rechecked and checked, similar video checking is performed on the first video and the recall video from multiple dimensions, a multi-dimensional feature checking result is obtained, whether the feature checking result meets preset video similarity conditions or not is judged, then similar video judgment is completed, and decoupling of the video recall process and the video similarity checking process is achieved. The video recall process does not need to carry out accurate verification on the video, the calculated amount of a video duplicate removal task can be effectively reduced, the recall rate is improved, after the second video associated with the first video is rapidly screened out as a recall result, further rechecking verification is carried out on the recall result, the accuracy of similar video detection is ensured, and finally the video duplicate removal efficiency is improved.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Referring to fig. 12, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a server for performing the similar video detection method described above. Specifically, the method comprises the following steps:
the computer apparatus 1200 includes a Central Processing Unit (CPU) 1201, a system Memory 1204 including a Random Access Memory (RAM) 1202 and a Read Only Memory (ROM) 1203, and a system bus 1205 connecting the system Memory 1204 and the Central Processing Unit 1201. The computer device 1200 also includes a basic Input/Output system (I/O) 1206 for facilitating information transfer between various devices within the computer, and a mass storage device 1207 for storing an operating system 1213, application programs 1214, and other program modules 1215.
The basic input/output system 1206 includes a display 1208 for displaying information and an input device 1209, such as a mouse or keyboard, through which a user inputs information. The display 1208 and the input device 1209 are both connected to the central processing unit 1201 through an input/output controller 1210 coupled to the system bus 1205. The basic input/output system 1206 may also include the input/output controller 1210 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1210 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the computer device 1200. That is, the mass storage device 1207 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1204 and mass storage device 1207 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1200 may also be connected, through a network such as the Internet, to a remote computer on the network for operation. That is, the computer device 1200 may connect to the network 1212 through a network interface unit 1211 connected to the system bus 1205, or may connect to other types of networks or remote computer systems (not shown) using the network interface unit 1211.
The memory further includes a computer program which is stored in the memory and configured to be executed by one or more processors, so as to implement the similar video detection method described above.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set which, when executed by a processor, implements the similar video detection method described above.
Optionally, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random Access Memory), SSD (Solid State Drive), or an optical disc. The Random Access Memory may include a ReRAM (Resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the similar video detection method described above.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that: only A exists, both A and B exist, or only B exists. The character "/" generally indicates an "or" relationship between the associated objects before and after it. In addition, the step numbers described herein only exemplarily show one possible execution order of the steps; in some other embodiments, the steps may also be performed out of the numbered order, for example, two steps with different numbers may be performed simultaneously, or two steps with different numbers may be performed in an order opposite to that shown in the figures, which is not limited in the embodiments of the present application.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for detecting similar video, the method comprising:
acquiring a feature embedding vector of a first video;
performing recall processing by comparing the feature embedding vector of the first video with a feature embedding vector of at least one second video, to obtain a recalled video associated with the first video;
performing review verification on the first video and the recalled video, and determining a feature verification result of the first video and the recalled video in at least one verification dimension;
and determining that the first video is a similar video of the recalled video if the feature verification result in the at least one verification dimension meets a preset video similarity condition.
2. The method of claim 1, wherein the performing review verification on the first video and the recalled video and determining a feature verification result of the first video and the recalled video in at least one verification dimension comprises:
determining video feature data of the first video and the recalled video in a target verification dimension, wherein the target verification dimension is any one of the at least one verification dimension;
determining a feature verification intermediate result of the first video and the recalled video in the target verification dimension based on the video feature data, wherein the feature verification intermediate result is used for representing a degree of similarity between individual items of video feature data of different videos;
and determining a feature verification result of the first video and the recalled video in the target verification dimension according to the feature verification intermediate result, wherein the feature verification result is used for representing an overall degree of similarity of different videos in the target verification dimension.
3. The method of claim 2, wherein before the determining video feature data of the first video and the recalled video in the target verification dimension, the method further comprises:
acquiring a first-level cache result, wherein the first-level cache result is used for storing feature verification intermediate results between different videos in the at least one verification dimension;
and if a feature verification intermediate result of the first video and the recalled video in the target verification dimension exists in the first-level cache result, acquiring the feature verification intermediate result, and executing the operation of determining the feature verification result of the first video and the recalled video in the target verification dimension according to the feature verification intermediate result.
4. The method of claim 3, further comprising:
if no feature verification intermediate result of the first video and the recalled video in the target verification dimension exists in the first-level cache result, acquiring a second-level cache result, wherein the second-level cache result is used for storing video feature data of different videos in the at least one verification dimension;
if video feature data of the first video and the recalled video in the target verification dimension respectively exist in the second-level cache result, acquiring the video feature data, and executing the operation of determining a feature verification intermediate result of the first video and the recalled video in the target verification dimension based on the video feature data;
and if no video feature data of the first video and the recalled video in the target verification dimension exists in the second-level cache result, executing the operation of determining the video feature data of the first video and the recalled video in the target verification dimension.
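A minimal Python sketch of the two-level cache lookup described in claims 3 and 4, with plain dictionaries standing in for the first-level and second-level cache results; the callables compute_feature, compute_intermediate, and compute_result are hypothetical stand-ins for feature extraction, intermediate-result computation, and final-result computation in one verification dimension.

def verify_dimension(first_id, recall_id, dim, l1_cache, l2_cache,
                     compute_feature, compute_intermediate, compute_result):
    # First-level cache: feature verification intermediate results per video pair.
    pair_key = (min(first_id, recall_id), max(first_id, recall_id), dim)
    intermediate = l1_cache.get(pair_key)

    if intermediate is None:
        # Second-level cache: per-video feature data in this verification dimension.
        feats = []
        for vid in (first_id, recall_id):
            feat = l2_cache.get((vid, dim))
            if feat is None:
                feat = compute_feature(vid, dim)   # fall back to fresh extraction
                l2_cache[(vid, dim)] = feat
            feats.append(feat)
        intermediate = compute_intermediate(feats[0], feats[1])
        l1_cache[pair_key] = intermediate

    # The per-dimension feature verification result is derived from the intermediate.
    return compute_result(intermediate)

Keeping the pairwise intermediate results and the per-video feature data in separate caches means a previously compared video pair skips both extraction and comparison, while a new pair of already-processed videos skips only the extraction step.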
5. The method of any one of claims 2 to 4, wherein the target verification dimension is a video frame verification dimension, and wherein the determining video feature data of the first video and the recalled video in the target verification dimension comprises:
determining a key frame feature vector of each of the first video and the recalled video in the video frame verification dimension, wherein the key frame feature vector is the video feature data corresponding to the video frame verification dimension;
wherein the determining, based on the video feature data, a feature verification intermediate result of the first video and the recalled video in the target verification dimension comprises:
determining a key frame verification intermediate result between the first video and the recalled video based on the key frame feature vectors, the key frame verification intermediate result comprising similarities between key frames of the first video and key frames of the recalled video;
and wherein the determining, according to the feature verification intermediate result, a feature verification result of the first video and the recalled video in the target verification dimension comprises:
determining a key frame verification result of the first video and the recalled video in the video frame verification dimension according to the similarities between the key frames of the first video and the key frames of the recalled video, wherein the key frame verification result is the feature verification result corresponding to the video frame verification dimension, and the key frame verification result is used for representing a degree of similarity between the key frames of the first video and the key frames of the recalled video.
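As one possible reading of the video frame verification dimension, the sketch below computes pairwise cosine similarities between key frame feature vectors (the intermediate result) and aggregates them into the fraction of key frames that have a close counterpart (the key frame verification result); both the cosine metric and the aggregation rule are illustrative assumptions.

import numpy as np

def keyframe_verification(first_frames, recall_frames, frame_thresh=0.85):
    # first_frames, recall_frames: arrays of shape (n_frames, dim) holding the
    # key frame feature vectors of the first video and the recalled video.
    a = first_frames / (np.linalg.norm(first_frames, axis=1, keepdims=True) + 1e-12)
    b = recall_frames / (np.linalg.norm(recall_frames, axis=1, keepdims=True) + 1e-12)
    sim_matrix = a @ b.T                    # intermediate result: pairwise similarities
    best_match = sim_matrix.max(axis=1)     # closest counterpart for each key frame
    matched_ratio = float((best_match >= frame_thresh).mean())
    return matched_ratio, sim_matrix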
6. The method of any one of claims 2 to 4, wherein the target verification dimension is an audio verification dimension, and wherein the determining video feature data of the first video and the recalled video in the target verification dimension comprises:
determining an audio feature sequence of each of the first video and the recalled video in the audio verification dimension, wherein the audio feature sequence is the video feature data corresponding to the audio verification dimension;
wherein the determining, based on the video feature data, a feature verification intermediate result of the first video and the recalled video in the target verification dimension comprises:
determining an audio verification intermediate result between the first video and the recalled video based on the audio feature sequences, the audio verification intermediate result comprising an audio feature distance between the respective audio feature sequences of the first video and the recalled video;
and wherein the determining, according to the feature verification intermediate result, a feature verification result of the first video and the recalled video in the target verification dimension comprises:
determining an audio verification result of the first video and the recalled video in the audio verification dimension according to the audio feature distance, wherein the audio verification result is the feature verification result corresponding to the audio verification dimension, and the audio verification result is used for representing a degree of similarity between the audio of the first video and the audio of the recalled video.
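A sketch of one way to realize the audio verification dimension: each video is represented by a sequence of per-window audio feature vectors, the audio feature distance is taken as the mean Euclidean distance after naive truncation alignment, and a threshold turns the distance into the audio verification result; the specific features, the alignment, and the threshold are assumptions rather than details fixed by the claim.

import numpy as np

def audio_verification(seq_a, seq_b, dist_thresh=0.3):
    # seq_a, seq_b: audio feature sequences of shape (n_windows, dim), one
    # feature vector per short audio window of each video.
    n = min(len(seq_a), len(seq_b))                     # naive alignment by truncation
    diff = np.asarray(seq_a[:n]) - np.asarray(seq_b[:n])
    audio_distance = float(np.linalg.norm(diff, axis=1).mean())  # intermediate result
    return audio_distance <= dist_thresh, audio_distance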
7. The method of any one of claims 2 to 4, wherein the target verification dimension is a pixel value verification dimension, and the determining video feature data of the first video and the recalled video in the target verification dimension comprises:
determining a key frame hash value of each of the first video and the recalled video in the pixel value verification dimension, wherein the key frame hash value is the video feature data corresponding to the pixel value verification dimension;
wherein the determining, based on the video feature data, a feature verification intermediate result of the first video and the recalled video in the target verification dimension comprises:
determining a pixel value verification intermediate result between the first video and the recalled video based on the key frame hash values, the pixel value verification intermediate result comprising hash value similarities between key frames of the first video and key frames of the recalled video;
and wherein the determining, according to the feature verification intermediate result, a feature verification result of the first video and the recalled video in the target verification dimension comprises:
determining a local area verification result of the first video and the recalled video in the pixel value verification dimension according to the hash value similarities, wherein the local area verification result is the feature verification result corresponding to the pixel value verification dimension, and the local area verification result is used for representing a degree of similarity between the first video and the recalled video in local areas.
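The pixel value verification dimension can be illustrated with a simple average hash over a downsampled grayscale key frame and a Hamming-style similarity between the resulting bit strings; the claim only specifies a key frame hash value and a hash value similarity, so this particular hashing scheme is an assumption.

import numpy as np

def average_hash(gray_frame, hash_size=8):
    # gray_frame: 2-D array of pixel intensities for one key frame.
    h, w = gray_frame.shape
    rows = np.arange(hash_size) * h // hash_size
    cols = np.arange(hash_size) * w // hash_size
    small = gray_frame[np.ix_(rows, cols)]          # crude downsampling
    return (small > small.mean()).flatten()         # 64-bit boolean hash

def hash_similarity(hash_a, hash_b):
    # Share of matching bits between two key frame hashes (1.0 means identical).
    return float((hash_a == hash_b).mean())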
8. The method of claim 1, wherein the feature embedding vector comprises a feature embedding vector in at least one recall dimension, and wherein the performing recall processing by comparing the feature embedding vector of the first video with a feature embedding vector of at least one second video to obtain a recalled video associated with the first video comprises:
for a target recall dimension of the at least one recall dimension, performing recall processing by comparing the feature embedding vector of the first video in the target recall dimension with the feature embedding vector of the at least one second video in the target recall dimension, to obtain a recalled video associated with the first video in the target recall dimension;
and obtaining the recalled video associated with the first video based on the recalled videos associated with the first video in the respective recall dimensions.
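A sketch of the per-dimension recall and merge step of claim 8, assuming a recall_fn such as the cosine-similarity recall sketched earlier and a simple union as the merging strategy; how the per-dimension recall results are actually combined is not fixed by the claim.

def multi_dimension_recall(first_embs, library_embs, recall_fn, top_k=50):
    # first_embs:   {dimension: embedding vector of the first video}
    # library_embs: {dimension: {video_id: embedding vector}} for the second videos
    recalled = set()
    for dim, query in first_embs.items():
        # Recall separately in every dimension, then merge by union so that a
        # candidate matching in any single dimension is kept for verification.
        recalled |= set(recall_fn(query, library_embs[dim], top_k))
    return recalled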
9. The method of claim 8, wherein the at least one recall dimension comprises at least one of a key frame feature dimension, an audio feature dimension, a cover feature dimension, a text feature dimension, and a video content feature dimension, and wherein the acquiring a feature embedding vector of the first video comprises:
determining a key frame feature embedding vector of the first video in the key frame feature dimension if the at least one recall dimension comprises the key frame feature dimension;
determining an audio feature embedding vector of the first video in the audio feature dimension if the at least one recall dimension comprises the audio feature dimension;
determining a cover feature embedding vector of the first video in the cover feature dimension if the at least one recall dimension comprises the cover feature dimension;
determining a text feature embedding vector of the first video in the text feature dimension if the at least one recall dimension comprises the text feature dimension;
determining a video content feature embedding vector for the first video in the video content feature dimension if the at least one recall dimension comprises the video content feature dimension.
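The conditional structure of claim 9 can be summarized as below; the dimension names and the extractor-callable interface are hypothetical, and only the recall dimensions that are actually configured get an embedding computed.

def acquire_feature_embeddings(video, recall_dims, extractors):
    # recall_dims: the recall dimensions configured for recall, e.g. any subset of
    # {"key_frame", "audio", "cover", "text", "video_content"}.
    # extractors: {dimension: callable(video) -> embedding vector} (hypothetical).
    embeddings = {}
    for dim in ("key_frame", "audio", "cover", "text", "video_content"):
        if dim in recall_dims:
            embeddings[dim] = extractors[dim](video)
    return embeddings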
10. A similar video detection apparatus, comprising:
the embedded feature acquisition module is used for acquiring a feature embedded vector of the first video;
the video recall module is used for performing recall processing by comparing the feature embedding vector of the first video with a feature embedding vector of at least one second video, to obtain a recalled video associated with the first video;
the video verification module is used for performing review verification on the first video and the recalled video, and determining a feature verification result of the first video and the recalled video in at least one verification dimension;
and the similar video determining module is used for determining that the first video is a similar video of the recalled video if the feature verification result in the at least one verification dimension meets a preset video similarity condition.
CN202111030237.0A 2021-09-03 2021-09-03 Similar video detection method and device Active CN113469152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111030237.0A CN113469152B (en) 2021-09-03 2021-09-03 Similar video detection method and device

Publications (2)

Publication Number Publication Date
CN113469152A 2021-10-01
CN113469152B CN113469152B (en) 2022-02-11

Family

ID=77867459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111030237.0A Active CN113469152B (en) 2021-09-03 2021-09-03 Similar video detection method and device

Country Status (1)

Country Link
CN (1) CN113469152B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134829A (en) * 2019-04-28 2019-08-16 腾讯科技(深圳)有限公司 Video locating method and device, storage medium and electronic device
CN110427895A (en) * 2019-08-06 2019-11-08 李震 A kind of video content similarity method of discrimination based on computer vision and system
CN110751124A (en) * 2019-10-28 2020-02-04 贵州永兴科技有限公司 Video detection comparison system
CN111046227A (en) * 2019-11-29 2020-04-21 腾讯科技(深圳)有限公司 Video duplicate checking method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN LIANG et al.: "DNN based mobile online video recommendation", Chinese Journal of Computers *
ZHAO Nan et al.: "A video recommendation algorithm for multi-dimensional feature analysis and filtering", Computer Science *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328884A (en) * 2021-12-03 2022-04-12 腾讯科技(深圳)有限公司 Image-text duplication removing method and device
CN114443904A (en) * 2022-01-20 2022-05-06 腾讯科技(深圳)有限公司 Video query method, video query device, computer equipment and computer readable storage medium
CN114443904B (en) * 2022-01-20 2024-02-02 腾讯科技(深圳)有限公司 Video query method, device, computer equipment and computer readable storage medium
CN114090962A (en) * 2022-01-24 2022-02-25 湖北长江传媒数字出版有限公司 Intelligent publishing system and method based on big data
CN114090962B (en) * 2022-01-24 2022-05-13 湖北长江传媒数字出版有限公司 Intelligent publishing system and method based on big data
CN115410135A (en) * 2022-11-01 2022-11-29 中国民航大学 Autonomous-type-carried aviation luggage feature perception reconstruction method and system and application thereof

Also Published As

Publication number Publication date
CN113469152B (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN113469152B (en) Similar video detection method and device
CN112163122B (en) Method, device, computing equipment and storage medium for determining label of target video
CN113010703B (en) Information recommendation method and device, electronic equipment and storage medium
CN101556617B (en) Systems and methods for associating metadata with media
US7853582B2 (en) Method and system for providing information services related to multimodal inputs
CN111507097B (en) Title text processing method and device, electronic equipment and storage medium
US20220083583A1 (en) Systems, Methods and Computer Program Products for Associating Media Content Having Different Modalities
US20140255003A1 (en) Surfacing information about items mentioned or presented in a film in association with viewing the film
CN116702737B (en) Document generation method, device, equipment, storage medium and product
Deldjoo et al. MMTF-14K: a multifaceted movie trailer feature dataset for recommendation and retrieval
CN112749608A (en) Video auditing method and device, computer equipment and storage medium
CN112418011A (en) Method, device and equipment for identifying integrity of video content and storage medium
CN111444357A (en) Content information determination method and device, computer equipment and storage medium
CN112231563B (en) Content recommendation method, device and storage medium
US8370323B2 (en) Providing information services related to multimodal inputs
CN113688951B (en) Video data processing method and device
CN111368141B (en) Video tag expansion method, device, computer equipment and storage medium
CN113704506A (en) Media content duplication eliminating method and related device
CN117011745A (en) Data processing method, device, computer equipment and readable storage medium
CN116976327A (en) Data processing method, device, computer equipment and readable storage medium
CN113822138A (en) Similar video determination method and device
CN116935170B (en) Processing method and device of video processing model, computer equipment and storage medium
CN113407775B (en) Video searching method and device and electronic equipment
CN115734024A (en) Audio data processing method, device, equipment and storage medium
CN116980665A (en) Video processing method, device, computer equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40053998

Country of ref document: HK