CN110413603B - Method and device for determining repeated data, electronic equipment and computer storage medium

Info

Publication number: CN110413603B
Application number: CN201910723196.XA
Authority: CN
Inventors: 李�根; 何轶; 陈扬羽; 李磊
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2019-08-06
Filing date: 2019-08-06
Publication date: 2023-02-24
Anticipated expiration: 2039-08-06
Also published as: CN110413603A

Abstract

The present disclosure provides a method, an apparatus, an electronic device and a computer storage medium for determining duplicate data. The method includes: acquiring reference data; determining, among the data to be compared, candidate data that duplicates the reference data, based on a first feature vector of a first data type of the reference data and a second feature vector of the first data type of each item of data to be compared; and determining, from the candidate data, the data that duplicates the reference data according to a third feature vector of a second data type of the reference data and a fourth feature vector of the second data type of each item of candidate data. In the embodiments of the disclosure, the candidate data is first selected using the first feature vector of the reference data and the second feature vector of each item of data to be compared, which improves data processing efficiency; the data that duplicates the reference data is then accurately determined from the candidate data using the third feature vector and the fourth feature vector of each item of candidate data.

Description

Method and device for determining repeated data, electronic equipment and computer storage medium

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining duplicate data, an electronic device, and a computer storage medium.

Background

In the prior art, a database usually contains some identical or similar data. When related processing is performed on the data in the database, such identical or similar data often degrades the experience of using the data. How to determine the repeated data (identical or similar data) in a database quickly and accurately is therefore a problem that currently needs to be solved.

Disclosure of Invention

The purpose of the present disclosure is to solve at least one of the above technical drawbacks, and to improve the data processing efficiency and the accuracy of determining duplicate data. The technical scheme adopted by the disclosure is as follows:

In a first aspect, the present disclosure provides a method for determining duplicate data, the method including:

acquiring reference data, wherein the reference data is an image or a video;

determining, among data to be compared, candidate data that duplicates the reference data, based on a first feature vector of a first data type of the reference data and a second feature vector of the first data type of each item of data to be compared, wherein the data to be compared includes at least one of an image and a video;

and determining, from the candidate data, data that duplicates the reference data according to a third feature vector of a second data type of the reference data and a fourth feature vector of the second data type of each item of candidate data, wherein the precision of data of the second data type is higher than that of data of the first data type.

In an embodiment of the first aspect of the disclosure, the first data type is integer and the second data type is floating point.

In an embodiment of the first aspect of the present disclosure, determining, among the data to be compared, candidate data that duplicates the reference data based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each item of data to be compared includes:

determining, as candidate data, the data whose second feature vector has the same value as the first feature vector in at least one dimension.

In an embodiment of the first aspect of the present disclosure, determining, as candidate data, the data whose second feature vector has the same value as the first feature vector in at least one dimension includes:

acquiring an inverted index, wherein the inverted index is established based on the second feature vectors;

and determining, based on the first feature vector and the inverted index, the data whose second feature vector has the same value as the first feature vector in at least one dimension as candidate data.

In an embodiment of the first aspect of the present disclosure, determining, from the candidate data, data that duplicates the reference data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each item of candidate data includes:

determining the feature similarity between the third feature vector and the fourth feature vector;

and determining the candidate data whose feature similarity is larger than a similarity threshold as data that duplicates the reference data.

In an embodiment of the first aspect of the present disclosure, determining a feature similarity between the third feature vector and the fourth feature vector includes:

determining a cosine distance between the third feature vector and the fourth feature vector based on the third feature vector and the fourth feature vector;

and determining the feature similarity according to the cosine distance.
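The similarity test described above can be sketched as follows. This is only an illustration under stated assumptions: the disclosure does not say how the feature similarity is derived from the cosine distance, so the sketch assumes cosine distance = 1 - cosine similarity and takes the feature similarity as 1 - distance. The function names and the threshold value 0.9 are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length floating-point vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


def feature_similarity(third_vec, fourth_vec):
    # Assumption: distance = 1 - cosine similarity, similarity = 1 - distance.
    # The source does not specify this mapping.
    cosine_distance = 1.0 - cosine_similarity(third_vec, fourth_vec)
    return 1.0 - cosine_distance


def duplicates_of(reference_vec, candidate_vecs, threshold=0.9):
    """Return ids of candidates whose similarity exceeds the threshold."""
    return [
        cid
        for cid, vec in candidate_vecs.items()
        if feature_similarity(reference_vec, vec) > threshold
    ]
```

In practice the threshold would be tuned on labelled duplicate pairs; a higher value reports fewer items as duplicates.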

In an embodiment of the first aspect of the present disclosure, the reference data is data in a first database, and the data to be compared is the data in the first database other than the reference data;

or,

acquiring reference data, including:

acquiring a search keyword;

and acquiring a search result from a second database according to the search keyword, wherein the reference data is data in the search result, and the data to be compared is either the data in the search result other than the reference data or data in the second database.

In an embodiment of the first aspect of the present disclosure, if the reference data is data in the first database, after determining data that is duplicated with the reference data, the method further includes:

de-duplicating the first database based on data that is duplicative of the reference data;

if the data to be compared is data except the reference data in the search result, after the data which is repeated with the reference data is determined, the method further comprises the following steps:

de-duplicating the search result based on the data that duplicates the reference data, and taking the de-duplicated search result as the final search result; or deleting the reference data and the data that duplicates the reference data;

if the data to be compared is data in the second database, after determining data that is repeated with the reference data, the method further includes:

the second database is deduplicated based on data that is duplicative of the reference data.

In an embodiment of the first aspect of the present disclosure, if the reference data is a video, the first feature vector and the third feature vector of the reference data are determined by:

extracting image features of frame images in the video;

and determining the first feature vector and the third feature vector of the video based on the extracted image features of the frame images.

In an embodiment of the first aspect of the present disclosure, determining the first feature vector and the third feature vector of the video based on the extracted image features of the frame images includes:

determining the average of the image features of the frame images to obtain an average image feature, and determining the first feature vector and the third feature vector of the video based on the average image feature;

or,

determining a first feature vector and a third feature vector of each frame image based on the extracted image features of the frame images; determining the average of the first feature vectors of the frame images as the first feature vector of the video, and determining the average of the third feature vectors of the frame images as the third feature vector of the video.

In a second aspect, the present disclosure provides an apparatus for determining duplicate data, the apparatus comprising:

the data acquisition module is used for acquiring reference data, and the reference data is an image or a video;

the candidate data determining module is used for determining candidate data which are repeated with the reference data in the data to be compared based on a first feature vector of the first data type of the reference data and a second feature vector of the first data type of each data to be compared, wherein the data to be compared comprises at least one of an image and a video;

and the repeated data determining module is used for determining, from the candidate data, data that duplicates the reference data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each item of candidate data, wherein the precision of data of the second data type is higher than that of data of the first data type.

In an embodiment of the second aspect of the disclosure, the first data type is integer and the second data type is floating point.

In an embodiment of the second aspect of the present disclosure, when determining, among the data to be compared, candidate data that duplicates the reference data based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each item of data to be compared, the candidate data determining module is specifically configured to:

determine, as candidate data, the data whose second feature vector has the same value as the first feature vector in at least one dimension.

In an embodiment of the second aspect of the present disclosure, when determining, as candidate data, the data whose second feature vector has the same value as the first feature vector in at least one dimension, the candidate data determining module is specifically configured to:

acquire an inverted index, wherein the inverted index is established based on the second feature vectors;

and determine, based on the first feature vector and the inverted index, the data whose second feature vector has the same value as the first feature vector in at least one dimension as candidate data.

In an embodiment of the second aspect of the present disclosure, when determining, from the candidate data, data that duplicates the reference data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each item of candidate data, the repeated data determining module is specifically configured to:

determine the feature similarity between the third feature vector and the fourth feature vector;

and determine the candidate data whose feature similarity is larger than a similarity threshold as data that duplicates the reference data.

In an embodiment of the second aspect of the present disclosure, when determining the feature similarity between the third feature vector and the fourth feature vector, the repeated data determining module is specifically configured to:

determine a cosine distance between the third feature vector and the fourth feature vector;

and determine the feature similarity according to the cosine distance.

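In an embodiment of the second aspect of the present disclosure, the reference data is data in a first database, and the data to be compared is the data in the first database other than the reference data;

or,

when the data acquisition module acquires the reference data, the data acquisition module is specifically configured to:

acquire a search keyword;

and acquire a search result from a second database according to the search keyword, wherein the reference data is data in the search result, and the data to be compared is either the data in the search result other than the reference data or data in the second database.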

In an embodiment of the second aspect of the present disclosure, if the reference data is data in the first database, after determining data that overlaps with the reference data, the apparatus further includes:

the first data processing module is used for carrying out duplicate removal on the first database based on data which are repeated with the reference data;

if the data to be compared is data except the reference data in the search result, after determining the data repeated with the reference data, the device further comprises:

the second data processing module is used for carrying out duplicate removal on the search result based on the data which is repeated with the reference data; taking the search result after the duplication removal as a final search result; or deleting the reference data and the data which is overlapped with the reference data;

if the data to be compared is data in the second database, after determining the data repeated with the reference data, the apparatus further comprises:

and the third data processing module is used for carrying out duplicate removal on the second database based on the data which is repeated with the reference data.

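In an embodiment of the second aspect of the present disclosure, if the reference data is a video, the apparatus further includes a feature vector determining module, configured to determine the first feature vector and the third feature vector of the reference data by:

extracting image features of frame images in the video;

and determining the first feature vector and the third feature vector of the video based on the extracted image features of the frame images.

In an embodiment of the second aspect of the present disclosure, when determining the first feature vector and the third feature vector of the video based on the extracted image features of the frame images, the feature vector determining module is specifically configured to:

determine the average of the image features of the frame images to obtain an average image feature, and determine the first feature vector and the third feature vector of the video based on the average image feature;

or,

determine a first feature vector and a third feature vector of each frame image based on the extracted image features of the frame images; determine the average of the first feature vectors of the frame images as the first feature vector of the video, and determine the average of the third feature vectors of the frame images as the third feature vector of the video.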

In a third aspect, the present disclosure provides an electronic device comprising:

a processor and a memory;

a memory for storing computer operating instructions;

a processor for performing the method as shown in any embodiment of the first aspect of the present disclosure by invoking computer operation instructions.

In a fourth aspect, the present disclosure provides a computer readable storage medium having stored thereon at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement a method as shown in any embodiment of the first aspect of the present disclosure.

The technical scheme provided by the embodiment of the disclosure has the following beneficial effects:

According to the method, the apparatus, the electronic device and the computer storage medium for determining repeated data, both the feature vectors of the first data type (the first feature vector and the second feature vector) and the feature vectors of the second data type (the third feature vector and the fourth feature vector) reflect the features of an image. The feature vectors of the second data type have higher precision than those of the first data type and therefore reflect the features of the image in more detail. Candidate data that duplicates the reference data can thus first be determined among the data to be compared, based on the first feature vector of the reference data and the second feature vector of each item of data to be compared, which improves data processing efficiency. The data that duplicates the reference data can then be accurately determined from the candidate data, based on the third feature vector of the reference data and the fourth feature vector of each item of candidate data. The scheme of the present disclosure can therefore determine, quickly and accurately, the data that duplicates the reference data among the data to be compared. In addition, the scheme makes it possible to determine repeated data from mixed data to be compared (data including both videos and images) based on the features of a single image or a single video.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings used in the description of the embodiments of the present disclosure will be briefly described below.

Fig. 1 is a schematic flowchart of a method for determining duplicate data according to an embodiment of the present disclosure;

Fig. 2 is a schematic structural diagram of an apparatus for determining duplicate data according to an embodiment of the present disclosure;

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below by referring to the drawings are exemplary only for explaining technical aspects of the present disclosure, and are not construed as limiting the present disclosure.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

The following describes the technical solutions of the present disclosure and how to solve the above technical problems in specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.

An embodiment of the present disclosure provides a method for determining duplicate data, as shown in fig. 1, the method may include:

step S110, reference data is acquired, and the reference data is an image or a video.

The source of the reference data is not limited in the embodiments of the present disclosure, and the reference data may be determined based on actual requirements. For example, the reference data may be a preset image or video. The reference data may also be data in a first database, for example an image or a video in a blacklist database. The reference data may also be data obtained based on a search keyword, for example a picture or a video obtained by a keyword search, where the keyword may be a search keyword obtained in real time or a pre-configured keyword. For example, when prohibited images or videos need to be screened out, the keyword may be a forbidden word, a sensitive word, or the like.

As an example, suppose the keyword is a. When pictures and/or videos corresponding to keyword a are searched for in the database, there are usually several pictures and/or videos corresponding to the keyword. Any of the pictures or videos found may be used as the reference data; if the search results can be sorted, the first picture or video in the search result may be used as the reference data.

Step S120, determining candidate data that is repeated with the reference data in the data to be compared based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each data to be compared, where the data to be compared includes at least one of an image and a video.

Specifically, image features reflect the features of the data they are extracted from, and in the present disclosure a feature vector is used to reflect those features. A second feature vector of the first data type is determined for each item of data to be compared, and the second feature vectors that are similar or identical to the first feature vector are identified. The data corresponding to those second feature vectors is the candidate data that duplicates the reference data. Not every item of candidate data is necessarily a duplicate of the reference data: all of the candidate data may duplicate the reference data, or only part of it may.

The source of the data to be compared is not limited, and the data may be data from the same database or data from different databases.

In the scheme of the present disclosure, data that duplicates the reference data refers to data identical or similar to the reference data. The reference data may be a video or an image, and the data identical or similar to it may likewise be an image or a video. The data to be compared may include at least one of a video and an image. If the data to be compared includes both videos and images, the candidate data may also include at least one of a video and an image. That is, starting from a single image or a single video, the candidate data determined from the data to be compared may be images, videos, or both.

Step S130, determining, from the candidate data, data that duplicates the reference data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each item of candidate data, wherein the precision of data of the second data type is higher than that of data of the first data type.

Specifically, because the precision of data of the second data type is higher than that of data of the first data type, the third feature vector and the fourth feature vector are more precise than the first feature vector and the second feature vector, and describe the features of an image more accurately. Based on the third feature vector of the reference data and the fourth feature vector of each item of candidate data, the data that duplicates the reference data can therefore be determined more accurately from the candidate data. Since the candidate data includes at least one of an image and a video, the data that duplicates the reference data may likewise include at least one of an image and a video.

According to the scheme in the embodiment of the disclosure, both the feature vectors of the first data type (the first feature vector and the second feature vector) and the feature vectors of the second data type (the third feature vector and the fourth feature vector) reflect the features of an image. The feature vectors of the second data type have higher precision than those of the first data type and therefore reflect the features of the image in more detail. Candidate data that duplicates the reference data can thus first be determined among the data to be compared, based on the first feature vector of the reference data and the second feature vector of each item of data to be compared, which improves data processing efficiency. The data that duplicates the reference data can then be accurately determined from the candidate data, based on the third feature vector of the reference data and the fourth feature vector of each item of candidate data. In addition, the scheme makes it possible to determine repeated data from mixed data to be compared (data including both videos and images) based on the features of a single image or a single video.
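To make the interplay of steps S120 and S130 concrete, the following is a minimal sketch of the two-stage flow. It is an illustration, not the patented implementation: the `Item` class, the function names and the threshold 0.9 are hypothetical, the coarse stage uses the "same value in at least one dimension" rule, and the fine stage uses plain cosine similarity on the floating-point vectors.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    int_vec: list    # feature vector of the first data type (integer)
    float_vec: list  # feature vector of the second data type (floating point)


def coarse_match(int_a, int_b):
    """Stage 1 (S120): same value in at least one dimension."""
    return any(a == b for a, b in zip(int_a, int_b))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_repeated(reference, to_compare, threshold=0.9):
    # Cheap integer comparison first, so the costlier floating-point
    # comparison only runs on the surviving candidates.
    candidates = [it for it in to_compare
                  if coarse_match(reference.int_vec, it.int_vec)]
    # Stage 2 (S130): keep candidates whose similarity exceeds the threshold.
    return [it.item_id for it in candidates
            if cosine_similarity(reference.float_vec, it.float_vec) > threshold]
```

An item that shares an integer value with the reference but has a dissimilar floating-point vector passes the first stage and is rejected by the second, which is why the candidate data is not guaranteed to consist only of duplicates.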

In an embodiment of the disclosure, the first data type may be integer and the second data type may be floating point.

Specifically, floating-point data has higher precision than integer data. In the scheme of the present disclosure, for ease of understanding, the first feature vector is hereinafter referred to as the first integer feature vector, the second feature vector as the second integer feature vector, the third feature vector as the first floating-point feature vector, and the fourth feature vector as the second floating-point feature vector.

In the embodiment of the present disclosure, if the reference data is a video, the first feature vector and the third feature vector of the reference data are determined by:

extracting image features of frame images in the video;

and determining the first feature vector and the third feature vector of the video based on the extracted image features of the frame images.

Specifically, if the reference data is a video consisting of consecutive frame images, the first feature vector and the third feature vector may be determined based on the image features of frame images in the video. The frame images may be obtained in any of the following manners. In a first manner, all the frame images in the video are taken as the frame images of the video. In a second manner, images are extracted uniformly from the video as frame images, for example at a preset interval that may be configured based on actual requirements; if the preset interval is 5, one image is extracted every 5 frame images as a frame image of the video. In a third manner, images are extracted from the video as frame images according to key frames, which may be configured based on actual requirements; if the key frames are the 5th, 25th and 38th frames, the 5th, 25th and 38th frames of the video are extracted as the frame images of the video.
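The three frame-selection manners can be sketched as a single helper that returns the indices of the frames to keep. The function name, the mode names and the 0-based index convention are assumptions made for illustration; in particular, the sketch assumes uniform sampling starts at the first frame and that key frames are given as 1-based positions, as in the "5th, 25th and 38th frame" example.

```python
def select_frame_indices(num_frames, mode="all", interval=5, key_frames=()):
    """Return 0-based indices of the frames used to describe a video."""
    if mode == "all":
        # First manner: use every frame of the video.
        return list(range(num_frames))
    if mode == "uniform":
        # Second manner: one frame every `interval` frames.
        if interval < 1:
            raise ValueError("interval must be at least 1")
        return list(range(0, num_frames, interval))
    if mode == "key":
        # Third manner: configured key frames, given as 1-based positions.
        return [k - 1 for k in key_frames if 1 <= k <= num_frames]
    raise ValueError(f"unknown mode: {mode}")
```

Using all frames gives the most complete description at the highest cost, while uniform or key-frame sampling trades some detail for fewer feature extractions.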

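In the embodiment of the present disclosure, determining the first feature vector and the third feature vector of the video based on the extracted image features of the frame images may include any one of the following manners:

determining the average of the image features of the frame images to obtain an average image feature, and determining the first feature vector and the third feature vector of the video based on the average image feature;

or,

determining a first feature vector and a third feature vector of each frame image based on the extracted image features of the frame images; determining the average of the first feature vectors of the frame images as the first feature vector of the video, and determining the average of the third feature vectors of the frame images as the third feature vector of the video.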

Specifically, if the reference data is a video, the first feature vector and the third feature vector of the reference data are the first feature vector and the third feature vector of the video. A first feature vector and a third feature vector are usually defined for a single image, so for a video they can be obtained from its frame images in any one of the following manners:

Determining the first feature vector of the video based on the extracted image features of the frame images may include at least one of:

First, determining the average of the image features of the frame images to obtain an average image feature, and determining the first feature vector of the video based on the average image feature.

Second, determining a first feature vector of each frame image based on the extracted image features of the frame images, and determining the average of the first feature vectors of the frame images as the first feature vector of the video.

Determining the third feature vector of the video based on the extracted image features of the frame images may include at least one of:

First, determining the average of the image features of the frame images to obtain an average image feature, and determining the third feature vector of the video based on the average image feature.

Second, determining a third feature vector of each frame image based on the extracted image features of the frame images, and determining the average of the third feature vectors of the frame images as the third feature vector of the video.
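The two aggregation orders described above ("average the image features, then derive the vector" versus "derive a vector per frame, then average the vectors") can be sketched as follows. The mapping from image features to a feature vector is not specified in this passage, so the sketch takes it as a parameter `to_vector`; the `l2_normalize` function is only a hypothetical stand-in for that mapping, and all names are illustrative.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(vec):
    # Hypothetical stand-in for the feature-to-vector mapping.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def average_then_map(frame_features, to_vector):
    """First manner: average the frame features, then derive one vector."""
    return to_vector(mean_vector(frame_features))


def map_then_average(frame_features, to_vector):
    """Second manner: derive a vector per frame, then average the vectors."""
    return mean_vector([to_vector(f) for f in frame_features])
```

When the mapping is not linear, the two manners generally give different video-level vectors, so the same manner should presumably be used for the reference data and for the data to be compared.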

It is understood that the second feature vector and the fourth feature vector of the video in the data to be compared can also be determined in the above manner.

In this embodiment of the disclosure, in step S120, determining, among the data to be compared, candidate data that duplicates the reference data based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each item of data to be compared may include:

determining, as candidate data, the data whose second feature vector has the same value as the first feature vector in at least one dimension.

Specifically, an integer feature vector may include elements in multiple dimensions. In the solution of the present disclosure, the integer feature vector corresponding to an image is usually a one-dimensional feature vector, which may for example be represented as a = [a1, a2, …, a9], where a denotes the integer feature vector and a1, a2, …, a9 denote its elements; different elements may take the same or different values.

Whether two items of data are similar can be judged from the values of the elements of their integer feature vectors: if the two integer feature vectors have the same value in at least one dimension, the two items of data can be considered similar.

As an example, suppose the two items of data are data A and data B, where the integer feature vector of data A includes the elements a1, a2, …, a9 and the integer feature vector of data B includes the elements b1, b2, …, b9. If a1 and b1 take the same value, data A and data B can be determined to be similar.
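The rule in the example above amounts to a position-by-position comparison that succeeds as soon as one pair of elements matches. A minimal sketch, with a hypothetical function name and made-up element values:

```python
def shares_a_dimension(vec_a, vec_b):
    """True if the two integer vectors agree in at least one dimension."""
    return any(a == b for a, b in zip(vec_a, vec_b))


# a1 == b1, so data A and data B count as similar.
data_a = [7, 2, 9]
data_b = [7, 5, 1]
is_similar = shares_a_dimension(data_a, data_b)
```

Note that the comparison is per dimension: the same value appearing at different positions, as in [1, 2, 3] and [3, 1, 2], does not count as a match in this sketch.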

In the embodiment of the present disclosure, determining data corresponding to a second feature vector having at least one dimension and the same value as the first feature vector as candidate data may include:

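acquiring an inverted index, wherein the inverted index is established based on the second feature vectors;

and determining, based on the first feature vector and the inverted index, the data whose second feature vector has the same value as the first feature vector in at least one dimension as candidate data.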

Specifically, an inverted index may be established in advance based on a second feature vector (a second integer feature vector) of a first data type of data in the data to be compared, and then candidate data of the reference data corresponding to the first integer feature vector may be acquired from the database based on the established inverted index and the first integer feature vector. Because the inverted index is established, the first integer characteristic vector of the reference data does not need to be compared with the integer characteristic vectors of all data in the data to be compared one by one, and the data processing efficiency can be further improved.

The establishing process of the inverted index may be:

1. acquiring images and videos in data to be compared;

2. selecting frame images in the video, where usually a plurality of frame images are selected; the frame images in the video may be obtained in the manner described above, which is not repeated here;

3. extracting image features of the image and of the frame images. The extraction method is not limited in this disclosure; examples include a CNN (Convolutional Neural Network) algorithm, the SIFT (Scale-Invariant Feature Transform) algorithm, the SURF (Speeded-Up Robust Features) algorithm, the VLAD (Vector of Locally Aggregated Descriptors) algorithm, and the like;

4. obtaining an integer feature vector of the image based on the image features of the image, and obtaining an integer feature vector of the video based on the image features of the frame images. The integer feature vector of the video may be determined in the manner described above, which is not repeated here. In the present disclosure, the integer feature vector is a feature vector obtained by binarizing the image features, for example, a string of 0s and 1s;

5. establishing an inverted index based on the integer feature vector of the image and the integer feature vector of the video. To distinguish the videos in the database from one another, each video may be identified by a video identifier, for example, a video sequence number; similarly, each image in the database may be identified by an image identifier, for example, an image sequence number.
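Steps 3 to 5 above can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the thresholding rule in `binarize`, the packing of bits into integers, the `bits_per_dim` parameter, and the choice of a (dimension index, value) pair as the index key are all assumptions made for illustration, and the identifiers such as `image_1` are hypothetical.

```python
from collections import defaultdict


def binarize(features, threshold=0.0, bits_per_dim=8):
    """Turn float features into an integer feature vector (assumed scheme).

    Each feature becomes 1 if it exceeds `threshold`, else 0; consecutive
    groups of `bits_per_dim` bits are packed into one integer.
    """
    bits = [1 if x > threshold else 0 for x in features]
    ints = []
    for start in range(0, len(bits), bits_per_dim):
        value = 0
        for b in bits[start:start + bits_per_dim]:
            value = (value << 1) | b
        ints.append(value)
    return ints


def build_inverted_index(int_vectors):
    """Map each (dimension, value) pair to the identifiers that contain it.

    `int_vectors` maps an image or video identifier to its integer
    feature vector.
    """
    index = defaultdict(set)
    for item_id, vec in int_vectors.items():
        for dim, value in enumerate(vec):
            index[(dim, value)].add(item_id)
    return index
```

Keeping images and videos in one mapping corresponds to the combined index variant; two separate calls to `build_inverted_index` would give one index per media type.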

As an example, the inverted index may be as shown in table 1:

TABLE 1

Vector dimension value	Image sequence number	Video sequence number
A1	Image 1	Video 20
A2	Image 10	Video 67
……	……	……
A100	Image 1 and image 5	Video 10 and video 20
……	……	……

In the inverted index, each vector dimension value of an integer feature vector is mapped to the images and videos in which that value occurs; data whose integer feature vector contains that value is regarded as similar to those images and videos. As shown in Table 1, the vector dimension value A1 occurs in image 1 and video 20, so the candidate data for data (a video or an image) whose integer feature vector contains A1 is image 1 and video 20; the value A2 occurs in image 10 and video 67, so the corresponding candidate data is image 10 and video 67; and the value A100 occurs in image 1, image 5, video 10 and video 20, so the corresponding candidate data is image 1, image 5, video 10 and video 20. Thus, once the integer feature vector of a given item of reference data is determined, the images and videos corresponding to that integer feature vector, that is, the candidate data, can be determined based on the inverted index.
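The lookup described for Table 1 can be sketched as follows. The sketch treats each label (A1, A2, A100) as an opaque key and collects the union of the entries for every value that occurs in the reference data's integer feature vector; the function name `find_candidates` and the dictionary layout are assumptions for illustration only.

```python
def find_candidates(index, query_values):
    """Return every identifier that shares at least one value with the query.

    `index` maps a vector dimension value to a set of image/video identifiers;
    `query_values` are the dimension values of the reference data's integer
    feature vector.
    """
    candidates = set()
    for value in query_values:
        candidates |= index.get(value, set())
    return candidates


# The rows of Table 1 that are spelled out in the text.
table_1 = {
    "A1": {"image 1", "video 20"},
    "A2": {"image 10", "video 67"},
    "A100": {"image 1", "image 5", "video 10", "video 20"},
}

candidates = find_candidates(table_1, ["A1", "A100"])
```

Because only the entries for the query's own values are read, the cost depends on the number of matching entries rather than on the total amount of data to be compared.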

It should be noted that, in practical applications, separate inverted indexes may be established for images and for videos, or a single inverted index may be established for images and videos together, which is not limited in this disclosure.

In this embodiment of the disclosure, in step S130, determining data that is repeated with the reference data from the candidate data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each data in the candidate data may include:

and determining the candidate data corresponding to a feature similarity larger than the similarity threshold as the data that is repeated with the reference data.

Specifically, the data repeated with the reference data may be determined from the candidate data through the similarity between the feature vectors. In practical applications, a preset similarity threshold may be configured: if the feature similarity between two items of data is not smaller than the similarity threshold, the two items are regarded as similar; if the feature similarity is smaller than the similarity threshold, the two items are regarded as dissimilar.
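The threshold test can be sketched in a few lines of Python. The function name `select_duplicates` and the `inclusive` flag are hypothetical; the flag exists only because the text uses both "larger than" and "not smaller than" for the comparison, and the default follows the "larger than" wording of the step.

```python
def select_duplicates(similarities, threshold, inclusive=False):
    """Keep the candidates whose feature similarity passes the threshold.

    `similarities` maps a candidate identifier to its feature similarity
    with the reference data.
    """
    if inclusive:
        return [cid for cid, s in similarities.items() if s >= threshold]
    return [cid for cid, s in similarities.items() if s > threshold]
```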

In an embodiment of the present disclosure, determining a feature similarity between the third feature vector and the fourth feature vector may include:

determining a cosine distance between the third feature vector and the fourth feature vector;

and determining the feature similarity according to the cosine distance.

Specifically, the similarity between two feature vectors can be determined from the cosine distance between them. Generally, the larger the cosine distance between two feature vectors, the lower their similarity; conversely, the smaller the cosine distance, the higher their similarity, that is, the more similar the two feature vectors are.
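A minimal sketch of this relationship is given below. It assumes the common convention that cosine distance equals one minus cosine similarity; the disclosure does not state a formula, so this convention, the function names, and the handling of zero vectors are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length float vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed convention: distance = 1 - similarity, so larger means less alike."""
    return 1.0 - cosine_similarity(u, v)
```

Under this convention a distance near 0 means the two feature vectors point in almost the same direction, and a distance of 1 means they are orthogonal.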

In an embodiment of the disclosure, if the reference data is the first image, the candidate data includes the first video, and the determining the feature similarity between the third feature vector and the fourth feature vector based on the third feature vector and the fourth feature vector may include at least one of the following manners:

In the first manner, a first similarity between the first floating-point type feature vector of the first image and the second floating-point type feature vector of each frame image among the frame images to be processed in the first video is determined, and the average of the first similarities is determined as the feature similarity.

In the second manner, the average of the second floating-point type feature vectors of the frame images to be processed in the first video is determined to obtain a third floating-point type feature vector, and the similarity between the third floating-point type feature vector and the first floating-point type feature vector is determined as the feature similarity.

Specifically, the frame images to be processed in the first video may be acquired in the manner described above. The acquisition manner includes, but is not limited to, any one of: uniformly extracting frames from the first video, extracting key frames, and taking all frame images in the first video as the frame images to be processed. The specific acquisition manner is not repeated here.
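The two manners of comparing an image with a video can be contrasted in a short Python sketch. Cosine similarity is used as the per-pair similarity, which is an assumption here, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_of_frame_similarities(image_vec, frame_vecs):
    """First manner: compare the image with every frame, then average the scores."""
    scores = [cosine_similarity(image_vec, f) for f in frame_vecs]
    return sum(scores) / len(scores)


def similarity_to_mean_frame(image_vec, frame_vecs):
    """Second manner: average the frame vectors first, then compare once."""
    n = len(frame_vecs)
    mean_vec = [sum(column) / n for column in zip(*frame_vecs)]
    return cosine_similarity(image_vec, mean_vec)
```

The two manners generally give different values for the same input: the first needs one comparison per frame, while the second needs a single comparison after averaging. The same two functions apply when the reference data is a video and the candidate is an image, with the roles of the arguments unchanged.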

If the reference data is the second video, the candidate data includes the second image, and the determining the feature similarity between the third feature vector and the fourth feature vector may include at least one of the following:

In the first manner, a second similarity between the first floating-point type feature vector of each frame image among the frame images to be processed in the second video and the second floating-point type feature vector of the second image is determined, and the average of the second similarities is determined as the feature similarity.

In the second manner, the average of the first floating-point type feature vectors of the frame images to be processed in the second video is determined to obtain a fourth floating-point type feature vector, and the similarity between the fourth floating-point type feature vector and the second floating-point type feature vector is determined as the feature similarity.

Specifically, the frame images to be processed in the second video may be obtained in the manner described above. The obtaining manner includes, but is not limited to, any one of: uniformly extracting frames from the second video, extracting key frames, and taking all frame images in the second video as the frame images to be processed. The specific obtaining manner is not repeated here.

If the reference data is a third video, the candidate data includes a fourth video, and the determining the feature similarity between the third feature vector and the fourth feature vector may include at least one of the following:

In the first manner, based on the first floating-point type feature vector of each frame image among the frame images to be processed in the third video and the second floating-point type feature vector of each frame image among the frame images to be processed in the fourth video, the similarity between each first floating-point type feature vector and each second floating-point type feature vector is determined to obtain third similarities, and the average of the third similarities is determined as the feature similarity.

In the second manner, the average of the first floating-point type feature vectors of the frame images to be processed in the third video is determined to obtain a fifth floating-point type feature vector; the average of the second floating-point type feature vectors of the frame images to be processed in the fourth video is determined to obtain a sixth floating-point type feature vector; and the similarity between the fifth floating-point type feature vector and the sixth floating-point type feature vector is determined as the feature similarity.

Specifically, the frame images to be processed in the third video and the fourth video may be obtained in the manner described above. The obtaining manner includes, but is not limited to, any one of: uniformly extracting frames from the third video or the fourth video, extracting key frames, and taking all frame images in the third video or the fourth video as the corresponding frame images to be processed. The specific obtaining manner is not repeated here.
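For the video-to-video case, the two manners can be sketched as follows. The sketch assumes cosine similarity as the per-pair measure and reads "each first vector and each second vector" as all frame pairs across the two videos; both points, and the function names, are assumptions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(frame_vecs):
    """Element-wise average of a list of equal-length frame vectors."""
    n = len(frame_vecs)
    return [sum(column) / n for column in zip(*frame_vecs)]


def pairwise_mean_similarity(frames_a, frames_b):
    """First manner: average the similarity over every cross-video frame pair."""
    scores = [cosine_similarity(a, b) for a in frames_a for b in frames_b]
    return sum(scores) / len(scores)


def mean_vector_similarity(frames_a, frames_b):
    """Second manner: compare the two per-video average vectors once."""
    return cosine_similarity(mean_vector(frames_a), mean_vector(frames_b))
```

With m and n frames to be processed, the first manner performs m × n comparisons and the second performs one, which may matter when many frames are extracted.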

In the embodiment of the disclosure, the reference data is data in the first database, and the data to be compared is the data in the first database other than the reference data;

or,

acquiring the reference data may include:

acquiring a search keyword;

Specifically, the first database and the second database may be the same database or different databases, which is not limited in this disclosure. The search keyword may be a keyword provided by a user in real time, or may be a preset keyword, for example, a keyword in a blacklist such as a forbidden word or a sensitive word.

In an embodiment of the present disclosure, if the reference data is data in the first database, after determining data that is duplicated with the reference data, the method may further include:

if the data to be compared is data in the search result except the reference data, after determining the data repeated with the reference data, the method may further include:

if the data to be compared is data in the second database, after determining data that is repeated with the reference data, the method may further include:

Specifically, for different application scenarios, different processing may be performed on the determined data overlapping with the reference data, and the following description is given with reference to the specific application scenarios:

The first application scenario is deduplication of repeated data in a database:

In practical applications, a large amount of repeated data may exist in a database, and this repeated data may directly affect the experience of using the data.

In a first case, when the reference data is data in the first database, after the data repeated with the reference data is determined based on the scheme described above, the data in the first database may be deduplicated based on the repeated data to reduce the amount of data in the first database and thereby improve the experience of using the database.

In a second case, if the data to be compared is data in the second database, after the data repeated with the reference data is determined based on the method described above, the data in the second database may be deduplicated based on the repeated data.
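One possible shape of such a deduplication pass is sketched below. The disclosure does not prescribe how reference data is chosen or in what order; here each surviving item in turn serves as the reference data, and `find_duplicates` is a hypothetical callable standing in for the two-stage determination described above.

```python
def deduplicate(item_ids, find_duplicates):
    """Return the identifiers kept after removing repeated data.

    `find_duplicates(reference, others)` is assumed to return the members of
    `others` that are repeated with `reference`.
    """
    kept = []
    removed = set()
    for pos, reference in enumerate(item_ids):
        if reference in removed:
            continue  # already identified as a repeat of an earlier item
        kept.append(reference)
        others = [i for i in item_ids[pos + 1:] if i not in removed]
        removed.update(find_duplicates(reference, others))
    return kept
```

The same loop could be applied to a list of search results instead of a whole database, in which case the returned list would be the deduplicated search result.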

The second application scenario is providing a search result to the user based on a search keyword input by the user:

In practical applications, the user usually searches for corresponding images and/or videos based on text information (keywords). However, since a large amount of repeated data may exist in the second database, the search result obtained based on the search keyword may include much repeated data, which reduces the user experience.

According to the scheme, a search result corresponding to the search keyword can be obtained by searching in the first database based on the search keyword. One item of data is then selected from the search result as the reference data, the data repeated with the reference data is determined based on the method described above, and the data in the search result other than the repeated data is displayed to the user as the final search result. The final search result therefore contains no repeated data, which improves the user's search experience.

As an example, if the keyword input by the user is "liujialing", some data in the first database that includes the keyword "liujialing" (hereinafter referred to as a first search result) may be found based on the keyword. The first search result may include images and videos related to liujialing, and similar images and videos may exist within the first search result; if the first search result were presented to the user as is, the user's search experience would be reduced.

The third application scenario is taking data corresponding to a keyword off the shelf based on the keyword input by a user:

In practical applications, if a maintainer of a database wants to take certain data stored in the database off the shelf, the corresponding data is usually searched for in the database based on a keyword and then taken off the shelf. However, the data corresponding to the keyword cannot be accurately found based on the keyword alone, which may cause some data that does not correspond to the keyword to be taken off the shelf.

According to the scheme of the disclosure, based on the keyword input by the user, data corresponding to the keyword (hereinafter referred to as a second search result) may first be obtained from the database. One item of data is then arbitrarily selected from the second search result as the reference data, and the data repeated with the reference data is determined based on the method described above. Finally, the repeated data and the reference data are taken off the shelf, that is, deleted from the database. With this scheme, the data in the database that corresponds to the keyword and needs to be taken off the shelf can be found accurately.
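The off-shelf flow can be sketched as follows. Everything concrete in the sketch is an assumption: the database is modelled as a dictionary from identifier to a text description, the keyword search is a plain substring test, the first hit is used as the arbitrarily selected reference data, the whole database serves as the data to be compared, and `find_duplicates` is a hypothetical callable standing in for the determination method described above.

```python
def take_down(database, keyword, find_duplicates):
    """Delete the reference data and its repeats; return the removed identifiers."""
    hits = [item_id for item_id, text in database.items() if keyword in text]
    if not hits:
        return []
    reference = hits[0]  # stand-in for an arbitrary choice from the search result
    others = [item_id for item_id in database if item_id != reference]
    to_remove = [reference] + list(find_duplicates(reference, others))
    for item_id in to_remove:
        database.pop(item_id, None)
    return to_remove
```

In this sketch a repeat can be removed even when its own description does not contain the keyword, because repeats are identified from features rather than from text.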

Based on the same principle as the method shown in fig. 1, an embodiment of the present disclosure also provides an apparatus 20, as shown in fig. 2, where the apparatus 20 may include: a data acquisition module 210, a candidate data determination module 220, and a duplicate data determination module 230, wherein:

a data obtaining module 210, configured to obtain reference data, where the reference data is an image or a video;

a candidate data determining module 220, configured to determine candidate data that is repeated with reference data in the data to be compared based on a first feature vector of the first data type of the reference data and a second feature vector of the first data type of each data to be compared, where the data to be compared includes at least one of an image and a video;

and the repeated data determining module 230 is configured to determine data repeated with the reference data from the candidate data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each data in the candidate data, where the precision of the data of the second data type is higher than that of the data of the first data type.

According to the scheme in the embodiment of the disclosure, both the feature vectors of the first data type (the first feature vector and the second feature vector) and the feature vectors of the second data type (the third feature vector and the fourth feature vector) can reflect the characteristics of an image. The feature vectors of the second data type have higher precision than those of the first data type and can therefore reflect the characteristics of the image in more detail. Accordingly, candidate data that is repeated with the reference data can first be determined from the data to be compared based on the first feature vector of the reference data and the second feature vector of each data to be compared, which improves data processing efficiency; the data that is repeated with the reference data can then be accurately determined from the candidate data based on the third feature vector of the reference data and the fourth feature vector of each data in the candidate data. In this way, the data repeated with the reference data can be determined quickly and accurately from the data to be compared. In addition, based on the scheme of the disclosure, repeated data can be determined from the data to be compared (mixed data, that is, data including both videos and images) based on a single feature, namely the feature of the image or the video.
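The coarse-then-fine structure described above can be sketched end to end in Python. This is an illustration under stated assumptions rather than the implementation of the modules: integer vectors are matched on (dimension index, value) pairs, cosine similarity is used for the floating-point comparison, the comparison is strict, and all names are hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_repeated(ref_id, int_vecs, float_vecs, threshold):
    """Two-stage determination of the data repeated with `ref_id`."""
    # Stage 1 (coarse): inverted index over the integer feature vectors.
    index = defaultdict(set)
    for item_id, vec in int_vecs.items():
        if item_id == ref_id:
            continue
        for dim, value in enumerate(vec):
            index[(dim, value)].add(item_id)
    candidates = set()
    for dim, value in enumerate(int_vecs[ref_id]):
        candidates |= index.get((dim, value), set())
    # Stage 2 (fine): floating-point comparison on the candidates only.
    ref_vec = float_vecs[ref_id]
    return sorted(
        c for c in candidates
        if cosine_similarity(ref_vec, float_vecs[c]) > threshold
    )
```

Only the candidates that survive the first stage reach the floating-point comparison, which is where the efficiency gain described above would come from; the accuracy of the final result depends on the second stage.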

In an embodiment of the disclosure, the first data type is integer and the second data type is floating point.

In an embodiment of the disclosure, when determining candidate data that is repeated with reference data in data to be compared based on a first feature vector of a first data type of the reference data and a second feature vector of the first data type of each data to be compared, the candidate data determination module 220 is specifically configured to:

In an embodiment of the present disclosure, when determining data corresponding to a second feature vector having at least one dimension and the same value as the first feature vector as a candidate data, the candidate data determining module 220 is specifically configured to:

In an embodiment of the disclosure, when determining data that is duplicated with the reference data from the candidate data according to the third eigenvector of the second data type of the reference data and the fourth eigenvector of the second data type of each data in the candidate data, the repeated data determining module 230 is specifically configured to:

In an embodiment of the disclosure, when determining the feature similarity between the third feature vector and the fourth feature vector, the repeated data determining module 230 is specifically configured to:

and determining the feature similarity according to the cosine distance.

or,

when acquiring the reference data, the data acquiring module 210 is specifically configured to:

acquiring a search keyword;

In an embodiment of the disclosure, if the reference data is data in the first database, after determining data that is duplicated with the reference data, the apparatus further includes:

the second data processing module is used for removing the duplicate of the search result based on the data which is repeated with the reference data; taking the search result after the duplication removal as a final search result; or deleting the reference data and the data which is overlapped with the reference data;

In an embodiment of the present disclosure, if the reference data is a video, the apparatus further includes a feature vector determining module, configured to determine a first feature vector of the reference data by:

extracting image characteristics of frame images in a video;

In an embodiment of the present disclosure, the feature vector determination module, when determining the first feature vector and the third feature vector of the video based on the image features of the extracted frame image, is specifically configured to:

or,

The apparatus for determining duplicate data of the embodiment of the present disclosure may execute the method for determining duplicate data shown in fig. 1, and its implementation principle is similar. The actions executed by each module of the apparatus correspond to the steps of the method for determining duplicate data of the embodiments of the present disclosure. For a detailed functional description of each module, reference may be made to the description of the corresponding method above, and details are not repeated here.

Based on the same principle as the method in the embodiments of the present disclosure, the present disclosure provides an electronic device including a processor and a memory; a memory for storing operating instructions; a processor for executing the method as shown in any embodiment of the method of the present disclosure by calling the operation instruction.

Based on the same principles as the method in the embodiments of the present disclosure, the present disclosure provides a computer-readable storage medium storing at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the method as shown in any one of the embodiments of the data processing method of the present disclosure.

In the embodiment of the present disclosure, as shown in fig. 3, a schematic structural diagram of an electronic device 50 (for example, a terminal device or a server implementing the method shown in fig. 1) suitable for implementing the embodiment of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 3, electronic device 50 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic device 50 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 50 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 illustrates an electronic device 50 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of the embodiments of the present disclosure.

It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method shown in the above method embodiments.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".

The foregoing description covers only the preferred embodiments of the disclosure and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with features having similar functions disclosed in (but not limited to) this disclosure.

Claims

1. A method for determining duplicate data, comprising:

acquiring reference data, wherein the reference data is an image or a frame image in a video;

acquiring an inverted index, wherein the inverted index is established based on a second feature vector of a first data type of each data to be compared; determining data corresponding to a second feature vector which has at least one dimension and has the same value as the first feature vector as repetitive candidate data based on the first feature vector of the first data type of the reference data and the inverted index, wherein the data to be compared comprises at least one of an image and a frame image in a video, and the first data type is integer;

and determining data which is repeated with the reference data from the candidate data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each data in the candidate data, wherein the first data type and the second data type correspond to a single feature of an image or a video, the precision of the data of the second data type is higher than that of the data of the first data type, and the second data type is a floating point type.

2. The method of claim 1, wherein determining duplicate data from the candidate data based on the third eigenvector of the second data type for the reference data and the fourth eigenvector of the second data type for each data in the candidate data comprises:

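determining a feature similarity between the reference data and each data in the candidate data based on the third feature vector and the fourth feature vector; and determining the candidate data whose feature similarity is larger than a similarity threshold as data which is repeated with the reference data.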
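
By way of non-limiting illustration, the threshold comparison of claim 2 may be sketched as follows. The claim does not name a similarity measure; cosine similarity is assumed here purely for illustration, and the function names, threshold value, and sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length float vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def select_duplicates(third_vector, fourth_vectors, threshold):
    """Keep the candidates whose similarity to the reference exceeds the threshold."""
    return [
        item_id
        for item_id, vector in fourth_vectors.items()
        if cosine_similarity(third_vector, vector) > threshold
    ]


# Hypothetical floating-point (second data type) vector of the reference data.
third_vector = [0.12, 0.80, 0.58]

# Hypothetical floating-point vectors of the candidate data.
fourth_vectors = {
    "img_a": [0.11, 0.81, 0.57],
    "img_c": [0.90, 0.10, 0.42],
}
duplicates = select_duplicates(third_vector, fourth_vectors, threshold=0.95)
```

Here `img_c`, although it matched the reference in one integer dimension, falls below the assumed threshold once the higher-precision vectors are compared.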

3. The method of claim 1, wherein:

the reference data is data in a first database, and the data to be compared is the data in the first database other than the reference data;

or,

the acquiring of the reference data comprises:

acquiring a search keyword;

and acquiring a search result from a second database according to the search keyword, wherein the reference data is data in the search result, and the data to be compared is the data in the search result other than the reference data, or data in the second database.

4. The method of claim 3, wherein:

if the reference data is the data in the first database, after determining the data which is repeated with the reference data, the method further comprises:

de-duplicating the first database based on the data that is repeated with the reference data;

if the data to be compared is the data in the search result except the reference data, after the data which is repeated with the reference data is determined, the method further comprises the following steps:

de-duplicating the search result based on the data that is repeated with the reference data, and taking the de-duplicated search result as a final search result; or deleting the reference data and the data that is repeated with the reference data;

if the data to be compared is data in a second database, after determining data repeated with the reference data, the method further comprises:

and performing deduplication on the second database based on the data repeated with the reference data.

5. The method of claim 1, wherein if the reference data is a video, the first feature vector and the third feature vector of the reference data are determined by:

extracting image features of frame images in the video;

determining a first feature vector and a third feature vector of the video based on the image features of the extracted frame image.

6. The method of claim 5, wherein determining the first feature vector and the third feature vector of the video based on the image features of the extracted frame image comprises:

determining an average value of the image features of the frame images based on the extracted image features of the frame images to obtain an average image feature; and determining the first feature vector and the third feature vector of the video based on the average image feature;

or,

determining a first feature vector and a third feature vector of each frame image in the frame images based on the image features of the extracted frame images; and determining the average value of the first feature vectors of the frame images as the first feature vector of the video, and determining the average value of the third feature vectors of the frame images as the third feature vector of the video.
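
By way of non-limiting illustration, the two alternatives of claim 6 may be sketched as follows. The claims do not specify how an integer-type vector is derived from image features; the `quantize` function below, its `levels` parameter, the rounding applied after averaging integer vectors, and the sample frame features are all assumptions made only for this sketch.

```python
def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def quantize(vector, levels=16):
    """Hypothetical float-to-integer mapping; assumes components lie in [0, 1]."""
    return [min(int(x * levels), levels - 1) for x in vector]


# Hypothetical image features extracted from three frame images of a video.
frame_features = [
    [0.10, 0.50, 0.90],
    [0.20, 0.50, 0.70],
    [0.30, 0.50, 0.80],
]

# First alternative: average the frame features, then derive both vectors
# of the video from the average image feature.
third_vector = mean_vector(frame_features)   # floating-point vector
first_vector = quantize(third_vector)        # integer-type vector

# Second alternative: derive an integer-type vector per frame, then average.
# Rounding back to integers is an assumption so the result stays integer-typed.
per_frame_first = [quantize(features) for features in frame_features]
first_vector_alt = [round(x) for x in mean_vector(per_frame_first)]
```

With these sample values both alternatives happen to yield the same integer-type vector, although in general the two orders of averaging and quantizing need not agree.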

7. An apparatus for determining duplicate data, comprising:

the data acquisition module is used for acquiring reference data, wherein the reference data is an image or a frame image in a video;

the candidate data determining module is used for acquiring an inverted index, wherein the inverted index is established based on a second feature vector of a first data type of each data to be compared; and determining, based on a first feature vector of the first data type of the reference data and the inverted index, the data corresponding to each second feature vector that has the same value as the first feature vector in at least one dimension as candidate data that duplicates the reference data, wherein the data to be compared comprises at least one of an image and a frame image in a video, and the first data type is an integer type;

and the repeated data determining module is used for determining data repeated with the reference data from the candidate data according to a third feature vector of a second data type of the reference data and a fourth feature vector of the second data type of each data in the candidate data, wherein the first data type and the second data type correspond to a single feature of an image or a video, the precision of the data of the second data type is higher than that of the data of the first data type, and the second data type is a floating point type.

8. An electronic device, comprising:

a processor and a memory;

the memory is used for storing computer operation instructions;

the processor is used for executing the method of any one of claims 1 to 6 by calling the computer operation instructions.

9. A computer-readable storage medium, characterized in that it stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method of any one of claims 1 to 6.