CN113553461A - Picture clustering method and related device - Google Patents

Info

Publication number
CN113553461A
Authority
CN
China
Prior art keywords
picture
algorithm
clustering
pictures
repeated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010338401.3A
Other languages
Chinese (zh)
Inventor
董国盛
孙玉玺
周泽南
陈炜鹏
许静芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN202010338401.3A
Publication of CN113553461A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a picture clustering method and a related device. The method comprises: first, performing data slicing on the pictures by using a multilayer clustering algorithm, so that the pictures are divided among the cluster devices in a balanced manner; then, for the pictures on each cluster device, performing retrieval and merging by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets. Data slicing with the multilayer clustering algorithm divides the pictures approximately evenly among the cluster devices, and retrieval with the neighbor graph algorithm reduces the number of picture comparisons, which greatly shortens the clustering time and improves the clustering speed of duplicate picture clustering. Retrieval and merging with the neighbor graph algorithm and the union-find algorithm achieve label diffusion and thus a high recall rate, which improves the clustering effect of duplicate picture clustering.

Description

Picture clustering method and related device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method for clustering images and a related apparatus.
Background
At present, a large number of pictures exist on the Internet, and many of them are duplicates. Clustering the duplicate pictures yields duplicate picture sets. Compared with a single picture, a duplicate picture set allows the features of a picture to be mined from multiple angles, which can effectively reduce the index scale of a picture search task and greatly improve the picture search effect.
In the prior art, large-scale duplicate picture clustering generally adopts a feature hash coding clustering method or a large-scale cluster clustering method. The feature hash coding clustering method hashes the picture features of each picture to obtain hash codes and performs duplicate picture clustering by using the hash codes; the large-scale cluster clustering method performs duplicate picture clustering directly on the picture features of each picture.
However, the inventors found that hashing the picture features loses a considerable amount of picture feature information, so that duplicate picture clustering based on hash codes has a low recall rate and the subsequent picture search effect is poor; and that duplicate picture clustering performed directly on the picture features is time-consuming. In other words, the feature hash coding clustering method tends to give a poor clustering effect, and the large-scale cluster clustering method tends to give a low clustering speed.
Disclosure of Invention
The technical problem to be solved by the application is to provide a picture clustering method and a related device that improve both the clustering speed and the clustering effect of duplicate picture clustering.
In a first aspect, an embodiment of the present application provides a method for clustering pictures, where the method includes:
performing data slicing on the pictures by using a multilayer clustering algorithm, and dividing the pictures among the cluster devices in a balanced manner;
and, for the pictures on each cluster device, performing retrieval and merging by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets.
Optionally, the performing data slicing on the pictures by using a multilayer clustering algorithm, and dividing the pictures among the cluster devices in a balanced manner includes:
clustering the pictures by using a first-layer clustering algorithm in the multilayer clustering algorithm to obtain a plurality of large clusters and a plurality of first small clusters;
clustering each large cluster by using a second-layer clustering algorithm in the multilayer clustering algorithm to obtain a plurality of second small clusters;
and merging the first small clusters and the second small clusters based on the load condition of each cluster device, and dividing the pictures among the cluster devices in a balanced manner.
Optionally, the multilayer clustering algorithm includes a multilayer K-Means algorithm.
Optionally, the performing retrieval and merging on the pictures on each cluster device by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets includes:
for each picture on each cluster device, performing retrieval by using the neighbor graph algorithm to obtain a duplicate picture set of the picture;
and, for the duplicate picture sets of any two pictures, performing merging by using the union-find algorithm to obtain the plurality of target duplicate picture sets.
Optionally, the performing merging by using the union-find algorithm for the duplicate picture sets of any two pictures to obtain the plurality of target duplicate picture sets specifically includes:
for the duplicate picture sets of any two pictures, if the intersection of the two duplicate picture sets is non-empty, merging the two duplicate picture sets, to obtain the plurality of target duplicate picture sets.
Optionally, the method further includes:
if the number of pictures in a target duplicate picture set is greater than a preset threshold, verifying the target duplicate picture set;
and if the verification result of the target duplicate picture set is abnormal, rejecting the target duplicate picture set.
Optionally, the verifying the target duplicate picture set specifically includes:
obtaining the statistical distribution of picture features of the target duplicate picture set, and determining, according to the statistical distribution, whether the verification result of the target duplicate picture set is abnormal.
In a second aspect, an embodiment of the present application provides an apparatus for clustering pictures, where the apparatus includes:
a dividing unit, configured to perform data slicing on the pictures by using a multilayer clustering algorithm and divide the pictures among the cluster devices in a balanced manner;
and an obtaining unit, configured to perform, for the pictures on each cluster device, retrieval and merging by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets.
Optionally, the dividing unit includes:
a first obtaining subunit, configured to cluster the pictures by using a first-layer clustering algorithm in the multilayer clustering algorithm to obtain a plurality of large clusters and a plurality of first small clusters;
a second obtaining subunit, configured to cluster each large cluster by using a second-layer clustering algorithm in the multilayer clustering algorithm to obtain a plurality of second small clusters;
and a dividing subunit, configured to merge the first small clusters and the second small clusters based on the load condition of each cluster device, and divide the pictures among the cluster devices in a balanced manner.
Optionally, the multilayer clustering algorithm includes a multilayer K-Means algorithm.
Optionally, the obtaining unit includes:
a third obtaining subunit, configured to perform, for each picture on each cluster device, retrieval by using the neighbor graph algorithm to obtain a duplicate picture set of the picture;
and a fourth obtaining subunit, configured to perform, for the duplicate picture sets of any two pictures, merging by using the union-find algorithm to obtain the plurality of target duplicate picture sets.
Optionally, the fourth obtaining subunit is specifically configured to:
for the duplicate picture sets of any two pictures, if the intersection of the two duplicate picture sets is non-empty, merge the two duplicate picture sets, to obtain the plurality of target duplicate picture sets.
Optionally, the apparatus further comprises:
a checking unit, configured to verify a target duplicate picture set if the number of pictures in the target duplicate picture set is greater than a preset threshold;
and a rejecting unit, configured to reject the target duplicate picture set if the verification result of the target duplicate picture set is abnormal.
Optionally, the rejecting unit is specifically configured to:
obtain the statistical distribution of picture features of the target duplicate picture set, and determine, according to the statistical distribution, whether the verification result of the target duplicate picture set is abnormal.
In a third aspect, an embodiment of the present application provides a device for picture clustering, the device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for:
performing data slicing on the pictures by using a multilayer clustering algorithm, and dividing the pictures among the cluster devices in a balanced manner;
and, for the pictures on each cluster device, performing retrieval and merging by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets.
In a fourth aspect, the present application provides a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the method for clustering pictures described in any implementation of the first aspect.
Compared with the prior art, the present application has the following advantages:
With the technical solution of the embodiments of the present application, data slicing is first performed on the pictures by using a multilayer clustering algorithm, so that the pictures are divided among the cluster devices in a balanced manner; then, for the pictures on each cluster device, retrieval and merging are performed by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets. Data slicing with the multilayer clustering algorithm divides the pictures approximately evenly among the cluster devices, and retrieval with the neighbor graph algorithm reduces the number of picture comparisons, which greatly shortens the clustering time and improves the clustering speed of duplicate picture clustering. Retrieval and merging with the neighbor graph algorithm and the union-find algorithm achieve label diffusion and thus a high recall rate, which improves the clustering effect of duplicate picture clustering.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a system framework related to an application scenario in an embodiment of the present application;
Fig. 2 is a schematic flowchart of a method for clustering pictures according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a target duplicate picture set according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an apparatus for clustering pictures according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an apparatus for clustering pictures according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, duplicate picture clustering is generally performed with a feature hash coding clustering method or a large-scale cluster clustering method. The feature hash coding clustering method hashes the picture features of each picture to obtain hash codes and performs duplicate picture clustering by using the hash codes. Hashing, however, loses a considerable amount of picture feature information, so clustering based on hash codes has a low recall rate and the subsequent picture search effect is poor; in other words, the feature hash coding clustering method tends to give a poor clustering effect. The large-scale cluster clustering method performs duplicate picture clustering directly on the picture features of each picture, which is time-consuming; in other words, the large-scale cluster clustering method tends to give a low clustering speed.
To solve these problems, in the embodiment of the present application, data slicing is first performed on the pictures by using a multilayer clustering algorithm, so that the pictures are divided among the cluster devices in a balanced manner; then, for the pictures on each cluster device, retrieval and merging are performed by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets. Data slicing with the multilayer clustering algorithm divides the pictures approximately evenly among the cluster devices, and retrieval with the neighbor graph algorithm reduces the number of picture comparisons, which greatly shortens the clustering time and improves the clustering speed of duplicate picture clustering. Retrieval and merging with the neighbor graph algorithm and the union-find algorithm achieve label diffusion and thus a high recall rate, which improves the clustering effect of duplicate picture clustering.
For example, the embodiment of the present application may be applied to the scenario shown in Fig. 1, which includes a picture database 101, a processor 102, and cluster devices 103. The picture database 101 stores a large number of pictures, many of which are duplicates. The processor 102 performs data slicing on the pictures in the picture database 101 by using a multilayer clustering algorithm and divides the pictures among the cluster devices 103 in a balanced manner; the processor 102 then performs retrieval and merging on the pictures on each cluster device 103 by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets.
It is to be understood that, in the above application scenario, although the actions of the embodiment of the present application are described as being performed by the processor 102, the present application does not limit the execution subject, as long as the actions disclosed in the embodiment of the present application are performed.
It is to be understood that the above scenario is only one example of a scenario provided in the embodiment of the present application, and the embodiment of the present application is not limited to this scenario.
The following describes in detail a specific implementation manner of the method for clustering pictures and the related apparatus in the embodiment of the present application with reference to the drawings.
Exemplary method
Referring to Fig. 2, a schematic flowchart of a method for clustering pictures in an embodiment of the present application is shown. In this embodiment, the method may include, for example, the following steps:
Step 201: Perform data slicing on the pictures by using a multilayer clustering algorithm, and divide the pictures among the cluster devices in a balanced manner.
It should be noted that, when duplicate picture clustering must be performed on a large number of pictures, placing all the pictures on one cluster device would make the clustering very slow because of the load on that device. The embodiment of the present application therefore divides the pictures among multiple cluster devices in a balanced manner: data slicing with the multilayer clustering algorithm divides the pictures approximately evenly among the cluster devices, which improves the speed of the subsequent duplicate picture clustering. Data slicing with the multilayer clustering algorithm also ensures, as far as possible, that identical or similar pictures are placed on the same cluster device; for example, pictures of people are grouped on the same cluster device, as are pictures of cars and pictures of landscapes.
Specifically, data slicing with the multilayer clustering algorithm proceeds as follows. First, a first-layer clustering algorithm performs coarse-grained clustering on the pictures to obtain a plurality of large clusters and a plurality of small clusters (referred to as first small clusters). Then, a second-layer clustering algorithm performs finer-grained clustering on each large cluster to obtain a plurality of small clusters (referred to as second small clusters). Finally, with the load of each cluster device known, the first small clusters and the second small clusters are merged so that the pictures are divided among the cluster devices in a balanced manner. Therefore, in an optional implementation of this embodiment, step 201 may include, for example, the following steps:
Step A: Cluster the pictures by using a first-layer clustering algorithm in the multilayer clustering algorithm to obtain a plurality of large clusters and a plurality of first small clusters.
Step B: Cluster each large cluster by using a second-layer clustering algorithm in the multilayer clustering algorithm to obtain a plurality of second small clusters.
Step C: Merge the first small clusters and the second small clusters based on the load condition of each cluster device, and divide the pictures among the cluster devices in a balanced manner.
It should be further noted that the multilayer clustering algorithm may be, for example, a multilayer K-Means algorithm. The idea of K-Means is to take K points in the feature space as cluster centers, assign each object to its nearest center, and iteratively update the centers until the clustering converges. Therefore, in an optional implementation of this embodiment, the multilayer clustering algorithm includes a multilayer K-Means algorithm, and steps A and B are specifically: clustering the pictures by using a first-layer K-Means algorithm in the multilayer K-Means algorithm to obtain a plurality of large clusters and a plurality of first small clusters; and clustering each large cluster by using a second-layer K-Means algorithm in the multilayer K-Means algorithm to obtain a plurality of second small clusters.
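As an illustration only, steps A to C might be sketched in Python as follows. The text does not state how a cluster is judged "large", how many clusters each layer uses, or how the load condition is measured, so the `max_size` size threshold, the `k1` and `k2` cluster counts, and the greedy least-loaded placement below are all assumptions made for this sketch, and picture features are represented as plain lists of floats.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style K-Means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def group_by_label(ids, labels):
    """Collect picture ids into lists keyed by cluster label."""
    groups = {}
    for i, lab in zip(ids, labels):
        groups.setdefault(lab, []).append(i)
    return list(groups.values())

def shard(features, k1, k2, max_size, num_devices):
    """Two-layer K-Means followed by load-balanced placement (hypothetical parameters)."""
    small = []
    # Step A: coarse first-layer clustering of all pictures.
    for members in group_by_label(range(len(features)), kmeans(features, k1)):
        if len(members) <= max_size:
            small.append(members)  # a "first small cluster"
        else:
            # Step B: re-cluster each large cluster at a finer granularity.
            sub = kmeans([features[i] for i in members], min(k2, len(members)), seed=1)
            small.extend(group_by_label(members, sub))  # "second small clusters"
    # Step C (assumed policy): largest cluster first, onto the least-loaded device.
    devices = [[] for _ in range(num_devices)]
    for cluster in sorted(small, key=len, reverse=True):
        min(devices, key=len).extend(cluster)
    return devices
```

Because whole small clusters are placed on a device, similar pictures stay together while the per-device picture counts remain roughly even; other balancing policies would fit the description equally well.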
Step 202: For the pictures on each cluster device, perform retrieval and merging by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets.
It should be noted that the large-scale cluster clustering method is time-consuming, which tends to make duplicate picture clustering slow, and that the feature hash coding clustering method loses picture feature information, which lowers the recall rate and tends to make the clustering effect poor. To address both problems, on the basis of step 201, duplicate picture clustering of the pictures on each cluster device is converted into retrieval and merging based on a neighbor graph algorithm and a union-find algorithm, so as to obtain a plurality of target duplicate picture sets; Fig. 3 shows a schematic diagram of one such target duplicate picture set. Retrieval with the neighbor graph algorithm reduces the number of comparisons per picture and greatly shortens the retrieval time, which improves the clustering speed; it also avoids losing picture feature information and achieves label diffusion, which raises the recall rate and improves the clustering effect. Merging with the union-find algorithm extends the label diffusion and raises the recall rate further, which further improves the clustering effect.
Specifically, for each picture on each cluster device, a duplicate picture set of the picture is first obtained by retrieval with the neighbor graph algorithm. Once the duplicate picture sets of two pictures are available, they can be merged with the union-find algorithm to extend the label diffusion. Merging continues over the duplicate picture sets of all pictures on all cluster devices until the plurality of target duplicate picture sets is obtained. Therefore, in an optional implementation of this embodiment, step 202 may include the following steps:
Step D: For each picture on each cluster device, perform retrieval by using the neighbor graph algorithm to obtain a duplicate picture set of the picture.
Step E: For the duplicate picture sets of any two pictures, perform merging by using the union-find algorithm to obtain a plurality of target duplicate picture sets.
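The input and output of step D can be illustrated with a small Python sketch. It is deliberately not a neighbor graph index: it compares every pair of pictures on a device by brute force, which is exactly the cost a neighbor graph search is meant to avoid, and it only shows what a per-picture duplicate set looks like. The `radius` distance threshold and the use of Euclidean distance are assumptions; the text does not specify the similarity criterion.

```python
import math

def duplicate_sets(features, radius):
    """Brute-force stand-in for step D.

    For each picture index, return the set of picture indices whose feature
    vectors lie within `radius` of it (the picture itself is always included).
    """
    result = {}
    for i, p in enumerate(features):
        result[i] = {j for j, q in enumerate(features) if math.dist(p, q) <= radius}
    return result
```

In practice the inner loop would be replaced by a query against an approximate nearest-neighbor graph, so that each picture is compared with only a small number of candidates.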
It should be noted that merging the duplicate picture sets of any two pictures with the union-find algorithm means the following: whenever the two duplicate picture sets contain at least one picture in common, that is, whenever their intersection is non-empty, the two sets are merged, and this continues until the plurality of target duplicate picture sets is obtained. Therefore, in an optional implementation of this embodiment, step E may specifically be, for example: for the duplicate picture sets of any two pictures, if the intersection of the two duplicate picture sets is non-empty, merging the two duplicate picture sets, to obtain the plurality of target duplicate picture sets.
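The merging rule of step E corresponds to a union-find (disjoint-set) structure, which can be sketched in Python as below. The class and function names are hypothetical, and pictures are assumed to be identified by integer indices from 0 to n-1. Linking all members of each duplicate picture set places any two sets that share a picture in the same component, which is equivalent to merging every pair of sets with a non-empty intersection.

```python
class UnionFind:
    """Minimal disjoint-set structure over the integers 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Walk up to the root, shortening the path along the way (path halving).
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def merge_duplicate_sets(dup_sets, n):
    """Merge duplicate picture sets that share at least one picture."""
    uf = UnionFind(n)
    for members in dup_sets:
        members = list(members)
        # Link every member of one set to the first member of that set.
        for m in members[1:]:
            uf.union(members[0], m)
    groups = {}
    for i in range(n):
        groups.setdefault(uf.find(i), set()).add(i)
    return list(groups.values())
```

For instance, the sets {0, 1} and {1, 2} share picture 1 and therefore end up in one target set {0, 1, 2}, even though pictures 0 and 2 were never retrieved together; this is the label diffusion effect the text describes.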
It should be noted that, because step E merges the duplicate picture sets of two pictures as soon as they share a single picture, duplicate picture sets of unrelated pictures may well be merged with one another in a chain. When the number of pictures in a target duplicate picture set obtained in step 202 is particularly large, the set is therefore likely to include many unrelated pictures. For this reason, a picture quantity threshold is set in advance as the preset threshold. After the plurality of target duplicate picture sets is obtained in step 202, it is determined whether the number of pictures in each target duplicate picture set is greater than the preset threshold, and any target duplicate picture set exceeding the threshold is verified. When the verification result of a target duplicate picture set is abnormal, the set amounts to a garbage cluster and needs to be rejected. Therefore, in an optional implementation of this embodiment, the following steps may further be included after step 202:
Step F: If the number of pictures in a target duplicate picture set is greater than a preset threshold, verify the target duplicate picture set.
It should be noted that the statistical distribution of picture features of a target duplicate picture set can indicate whether the set includes many unrelated pictures. The verification may therefore be performed, for example, by obtaining the statistical distribution of picture features of the target duplicate picture set; when that distribution is abnormal, the verification result of the target duplicate picture set is abnormal. Therefore, in an optional implementation of this embodiment, step F may specifically be, for example: obtaining the statistical distribution of picture features of the target duplicate picture set, and determining, according to the statistical distribution, whether the verification result of the target duplicate picture set is abnormal.
Step G: If the verification result of the target duplicate picture set is abnormal, reject the target duplicate picture set.
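Steps F and G might look like the following Python sketch. The text does not say which statistic of the picture feature distribution is used or what counts as abnormal, so the mean distance to the centroid and the `max_spread` limit below are purely illustrative assumptions, as are the function names.

```python
import math

def is_abnormal(features, max_spread):
    """Assumed check: a set is abnormal if its features are widely scattered.

    The spread is measured as the mean Euclidean distance to the centroid.
    """
    centroid = [sum(col) / len(features) for col in zip(*features)]
    spread = sum(math.dist(f, centroid) for f in features) / len(features)
    return spread > max_spread

def filter_target_sets(target_sets, features, size_threshold, max_spread):
    """Keep small sets unchecked; verify large sets and reject abnormal ones."""
    kept = []
    for picture_ids in target_sets:
        # Step F: only sets larger than the preset threshold are verified.
        if len(picture_ids) > size_threshold:
            # Step G: an abnormal set is treated as a garbage cluster and dropped.
            if is_abnormal([features[i] for i in picture_ids], max_spread):
                continue
        kept.append(picture_ids)
    return kept
```

Checking only the oversized sets keeps the verification cheap, since most target duplicate picture sets are expected to be small.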
According to the implementations provided by this embodiment, data slicing is first performed on the pictures by using a multilayer clustering algorithm, so that the pictures are divided among the cluster devices in a balanced manner; then, for the pictures on each cluster device, retrieval and merging are performed by using a neighbor graph algorithm and a union-find algorithm to obtain a plurality of target duplicate picture sets. Data slicing with the multilayer clustering algorithm divides the pictures approximately evenly among the cluster devices, and retrieval with the neighbor graph algorithm reduces the number of picture comparisons, which greatly shortens the clustering time and improves the clustering speed of duplicate picture clustering. Retrieval and merging with the neighbor graph algorithm and the union-find algorithm achieve label diffusion and thus a high recall rate, which improves the clustering effect of duplicate picture clustering.
Exemplary devices
Referring to fig. 4, a schematic structural diagram of an apparatus for clustering pictures in the embodiment of the present application is shown. In this embodiment, the apparatus may specifically include:
the dividing unit 401 is configured to perform data slicing on each picture by using a multilayer clustering algorithm, and divide each picture into all cluster devices in a balanced manner;
an obtaining unit 402, configured to perform, for the picture on each cluster device, retrieval and merging by using a neighbor graph algorithm and a merging and searching algorithm, so as to obtain a plurality of target repeated picture sets.
In an optional implementation manner of the embodiment of the present application, the dividing unit 401 includes:
the first obtaining subunit is used for clustering the pictures by using a first-layer clustering algorithm in the multi-layer clustering algorithm to obtain a plurality of large clusters and a plurality of first small clusters;
the second obtaining subunit is configured to cluster each large cluster by using a second-layer clustering algorithm in the multi-layer clustering algorithm to obtain a plurality of second small clusters;
and the dividing subunit is configured to merge the first small clusters and the second small clusters based on the load conditions of the cluster devices, and divide the pictures among the cluster devices in a balanced manner.
In an optional implementation manner of the embodiment of the present application, the multi-layer clustering algorithm includes a multi-layer K-Means algorithm.
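The two-layer slicing described above can be sketched as follows, under stated assumptions. The plain Lloyd's K-Means with deterministic seeding, the size cutoff `large_size` that separates large clusters from small ones, and the greedy largest-first assignment are all choices made for this illustration; the application does not specify them, and every function and parameter name is hypothetical.

```python
import math


def kmeans(points, k, iters=10):
    """Plain Lloyd's K-Means on tuples; returns the non-empty clusters."""
    # Deterministic seeding (an assumption): take evenly spaced points as centers.
    centers = points[:: max(1, len(points) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return [c for c in clusters if c]


def shard(points, k1, k2, large_size, num_devices):
    """Two-layer clustering followed by load-balanced assignment to devices."""
    first = kmeans(points, k1)
    # First layer: clusters at or below the cutoff are the first small clusters.
    small = [c for c in first if len(c) <= large_size]
    # Second layer: each large cluster is clustered again into second small clusters.
    for big in (c for c in first if len(c) > large_size):
        small.extend(kmeans(big, k2))
    # Merge small clusters by load: largest first, always onto the least-loaded device.
    devices = [[] for _ in range(num_devices)]
    for cluster in sorted(small, key=len, reverse=True):
        min(devices, key=len).extend(cluster)
    return devices
```

Because whole small clusters are assigned to a device, pictures that are close in feature space stay on the same device, while the device loads remain roughly equal.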
In an optional implementation manner of this embodiment of this application, the obtaining unit 402 includes:
a third obtaining subunit, configured to, for each picture of each cluster device, perform retrieval by using the neighbor graph algorithm to obtain a repeated picture set of each picture;
and the fourth obtaining subunit is configured to merge the repeated picture sets of any two pictures by using the union-find algorithm to obtain a plurality of target repeated picture sets.
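The retrieval performed by the third obtaining subunit can be illustrated with a small sketch. It assumes that each picture is represented by a feature vector, that the neighbor graph is a k-nearest-neighbor graph, and that two pictures count as repeated when their distance is within `max_dist`; none of these details are stated in the text. The graph here is built by brute force purely for readability, whereas a practical system would presumably use an approximate graph index so that far fewer pictures are compared.

```python
import math


def build_knn_graph(features, k):
    """Map each picture id to its k nearest other pictures as (distance, id) pairs.

    Brute-force construction, used only as a stand-in for a real neighbor graph index.
    """
    graph = {}
    for i, fi in enumerate(features):
        scored = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        graph[i] = scored[:k]
    return graph


def duplicate_sets(features, k, max_dist):
    """Repeated picture set of each picture: itself plus graph neighbors within max_dist."""
    graph = build_knn_graph(features, k)
    return {
        i: {i} | {j for d, j in neighbors if d <= max_dist}
        for i, neighbors in graph.items()
    }
```

With five pictures whose features are `(0, 0)`, `(0.1, 0)`, `(0.2, 0)`, `(5, 5)` and `(5.1, 5)`, and with `k=2` and `max_dist=0.15`, picture 1 is retrieved together with pictures 0 and 2, while pictures 3 and 4 only retrieve each other.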
In an optional implementation manner of the embodiment of the present application, the fourth obtaining subunit is specifically configured to:
for the repeated picture sets of any two pictures, if the intersection of the two repeated picture sets is non-empty, merge the two repeated picture sets, so as to obtain a plurality of target repeated picture sets.
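The merging rule above maps naturally onto a union-find (disjoint-set) structure: linking all members of each repeated picture set means that any two sets sharing a picture end up under the same root. The following is a minimal sketch; the function name and the use of integer picture ids are assumptions for illustration.

```python
def merge_duplicate_sets(dup_sets):
    """Merge repeated picture sets that share at least one picture."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for members in dup_sets:
        members = list(members)
        for m in members:
            # Linking every member to the first one also registers single-picture sets.
            union(members[0], m)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

For example, the sets `{0, 1}`, `{1, 2}`, `{3, 4}` and `{5}` merge into `{0, 1, 2}`, `{3, 4}` and `{5}`: the first two share picture 1, so the label of one set diffuses to the other, which is how pictures that were not retrieved directly for each other can still land in the same target repeated picture set.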
In an optional implementation manner of the embodiment of the present application, the apparatus further includes:
the checking unit is used for checking the target repeated picture set if the number of the pictures of the target repeated picture set is greater than a preset threshold value;
and the rejecting unit is used for rejecting the target repeated picture set if the checking processing result of the target repeated picture set is abnormal.
In an optional implementation manner of the embodiment of the present application, the checking unit is specifically configured to:
acquire the picture feature statistical distribution of the target repeated picture set, and judge whether the checking result of the target repeated picture set is abnormal according to the picture feature statistical distribution.
According to the various implementation manners provided by this embodiment, a multi-layer clustering algorithm is first used to perform data slicing on the pictures, so that the pictures are divided among the cluster devices in a balanced manner; then, for the pictures on each cluster device, a neighbor graph algorithm and a union-find algorithm are used for retrieval and merging, so as to obtain a plurality of target repeated picture sets. The data slicing of the multi-layer clustering algorithm divides the pictures approximately uniformly among the cluster devices, and retrieval with the neighbor graph algorithm reduces the number of picture comparisons, which greatly shortens the clustering time and can improve the clustering speed of repeated picture clustering. Retrieval and merging with the neighbor graph algorithm and the union-find algorithm uses label diffusion to provide a high recall rate, which can improve the clustering effect of repeated picture clustering.
Fig. 5 is a block diagram illustrating an apparatus 500 for picture clustering according to an example embodiment. For example, the apparatus 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 500 may include one or more of the following components: processing component 502, memory 504, power component 506, multimedia component 508, audio component 510, input/output (I/O) interface 512, sensor component 514, and communication component 516.
The processing component 502 generally controls overall operation of the device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operation at the device 500. Examples of such data include instructions for any application or method operating on device 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 504 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 506 provides power to the various components of the device 500. The power component 506 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 500.
The multimedia component 508 includes a screen that provides an output interface between the device 500 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure correlated to the touch or slide operation. In some embodiments, the multimedia component 508 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 500 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 510 is configured to output and/or input audio signals. For example, audio component 510 includes a Microphone (MIC) configured to receive external audio signals when apparatus 500 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 504 or transmitted via the communication component 516. In some embodiments, audio component 510 further includes a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 514 includes one or more sensors for providing various aspects of status assessment for the device 500. For example, the sensor assembly 514 may detect an open/closed state of the device 500 and the relative positioning of components, such as a display and keypad of the apparatus 500. The sensor assembly 514 may also detect a change in the position of the apparatus 500 or a component of the apparatus 500, the presence or absence of user contact with the apparatus 500, the orientation or acceleration/deceleration of the apparatus 500, and a change in the temperature of the apparatus 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate communication between the apparatus 500 and other devices in a wired or wireless manner. The apparatus 500 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 516 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 504 comprising instructions, executable by the processor 520 of the apparatus 500 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a method of picture clustering, the method comprising:
performing data slicing on the pictures by using a multi-layer clustering algorithm, and dividing the pictures among the cluster devices in a balanced manner;
and, for the pictures on each cluster device, performing retrieval and merging by using a neighbor graph algorithm and a union-find algorithm, so as to obtain a plurality of target repeated picture sets.
Fig. 6 is a schematic structural diagram of a server in an embodiment of the present application. The server 600 may vary significantly due to configuration or performance, and may include one or more Central Processing Units (CPUs) 622 (e.g., one or more processors) and memory 632, one or more storage media 630 (e.g., one or more mass storage devices) storing applications 642 or data 644. Memory 632 and storage medium 630 may be, among other things, transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 622 may be configured to communicate with the storage medium 630 and execute a series of instruction operations in the storage medium 630 on the server 600.
The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input-output interfaces 658, one or more keyboards 656, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The embodiments in the present description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application in any way. Although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application. Those skilled in the art can now make numerous possible variations and modifications to the disclosed embodiments, or modify equivalent embodiments, using the methods and techniques disclosed above, without departing from the scope of the claimed embodiments. Therefore, any simple modification, equivalent change and modification made to the above embodiments according to the technical essence of the present application still fall within the protection scope of the technical solution of the present application without departing from the content of the technical solution of the present application.

Claims (10)

1. A method for clustering pictures, characterized by comprising the following steps:
performing data slicing on the pictures by using a multi-layer clustering algorithm, and dividing the pictures among the cluster devices in a balanced manner;
and, for the pictures on each cluster device, performing retrieval and merging by using a neighbor graph algorithm and a union-find algorithm, so as to obtain a plurality of target repeated picture sets.
2. The method of claim 1, wherein the performing data slicing on the pictures by using the multi-layer clustering algorithm, and dividing the pictures among the cluster devices in a balanced manner, comprises:
clustering the pictures by using a first-layer clustering algorithm in the multi-layer clustering algorithm to obtain a plurality of large clusters and a plurality of first small clusters;
clustering each large cluster by using a second-layer clustering algorithm in the multi-layer clustering algorithm to obtain a plurality of second small clusters;
and merging the first small clusters and the second small clusters based on the load condition of each cluster device, and dividing the pictures among the cluster devices in a balanced manner.
3. The method of claim 1 or 2, wherein the multi-layer clustering algorithm comprises a multi-layer K-Means algorithm.
4. The method according to claim 1, wherein the performing retrieval and merging on the pictures on each cluster device by using the neighbor graph algorithm and the union-find algorithm to obtain a plurality of target repeated picture sets comprises:
for each picture on each cluster device, performing retrieval by using the neighbor graph algorithm to obtain a repeated picture set of each picture;
and, for the repeated picture sets of any two pictures, performing merging by using the union-find algorithm to obtain a plurality of target repeated picture sets.
5. The method according to claim 4, wherein the performing merging by using the union-find algorithm on the repeated picture sets of any two pictures to obtain a plurality of target repeated picture sets is specifically:
for the repeated picture sets of any two pictures, if the intersection of the two repeated picture sets is non-empty, merging the two repeated picture sets to obtain a plurality of target repeated picture sets.
6. The method of claim 1, further comprising:
if the number of pictures in the target repeated picture set is greater than a preset threshold, checking the target repeated picture set;
and if the checking result of the target repeated picture set is abnormal, rejecting the target repeated picture set.
7. The method according to claim 6, wherein the checking the target repeated picture set specifically comprises:
acquiring the picture feature statistical distribution of the target repeated picture set, and judging whether the checking result of the target repeated picture set is abnormal according to the picture feature statistical distribution.
8. An apparatus for clustering pictures, comprising:
a dividing unit, configured to perform data slicing on the pictures by using a multi-layer clustering algorithm, and divide the pictures among the cluster devices in a balanced manner;
and an obtaining unit, configured to perform retrieval and merging on the pictures on each cluster device by using a neighbor graph algorithm and a union-find algorithm, so as to obtain a plurality of target repeated picture sets.
9. An apparatus for picture clustering comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
performing data slicing on the pictures by using a multi-layer clustering algorithm, and dividing the pictures among the cluster devices in a balanced manner;
and, for the pictures on each cluster device, performing retrieval and merging by using a neighbor graph algorithm and a union-find algorithm, so as to obtain a plurality of target repeated picture sets.
10. A machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the method of picture clustering of any of claims 1-7.
CN202010338401.3A 2020-04-26 2020-04-26 Picture clustering method and related device Pending CN113553461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010338401.3A CN113553461A (en) 2020-04-26 2020-04-26 Picture clustering method and related device

Publications (1)

Publication Number Publication Date
CN113553461A true CN113553461A (en) 2021-10-26

Family

ID=78101453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010338401.3A Pending CN113553461A (en) 2020-04-26 2020-04-26 Picture clustering method and related device

Country Status (1)

Country Link
CN (1) CN113553461A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination