CN111061894A - Processing method and device of peer data, electronic equipment and storage medium

Info

Publication number: CN111061894A
Application number: CN201911081819.4A
Authority: CN
Inventors: 戴世稳
Original assignee: Shenzhen Intellifusion Technologies Co Ltd
Current assignee: Shenzhen Intellifusion Technologies Co Ltd
Priority date: 2019-11-07
Filing date: 2019-11-07
Publication date: 2020-04-24

Abstract

Embodiments of the invention provide a method and an apparatus for processing peer data, an electronic device, and a storage medium. The method comprises: acquiring archive data to be processed, wherein the archive data to be processed includes an archive ID and is archive data from the same acquisition device sorted by time; extracting the archive data to be processed through a preset sliding window to obtain a peer data set corresponding to each piece of archive data to be processed; deduplicating all peer data sets based on the archive ID, wherein the deduplication comprises deduplication within each peer data set and deduplication between peer data sets; and merging the deduplicated peer data sets to obtain target peer data. Because no repeated calculation occurs within a single peer data set, the peer relationship is computed faster; because repeated co-occurrences no longer distort the result, the peer relationship is computed more accurately. Both the calculation speed and the calculation accuracy of the target peer data are thereby improved.

Description

Processing method and device of peer data, electronic equipment and storage medium

Technical Field

The present invention relates to the field of image recognition technologies, and in particular, to a method and an apparatus for processing peer data, an electronic device, and a storage medium.

Background

With the development of artificial intelligence, face recognition has been widely applied to analyzing relationships between people. For example, a peer relationship analysis system based on face recognition can help police officers uncover the peers of a target person and thereby analyze the relationship between the target person and those peers. Conventionally, the peer relationship is obtained by traversing the snapshot archives of the target person against the snapshot archives of the other persons in the archive library. Because this traversal is a single-threaded calculation, with massive data every snapshot archive of the target person must traverse the whole archive library, and the amount of calculation is very large. The traversal also produces repeated peer relationships: for example, if target person A and peer C are computed as peers once at 2019/1/1-12:01:10 and again at 2019/1/1-12:01:13, two peer occurrences are recorded even though the second merely repeats the first. Existing peer calculation systems therefore suffer from low calculation speed and inaccurate results.

Disclosure of Invention

The embodiment of the invention provides a processing method of peer data, which can improve the calculation speed and the calculation accuracy of peer relationships.

In a first aspect, an embodiment of the present invention provides a method for processing peer data, including:

acquiring archive data to be processed, wherein the archive data to be processed includes an archive ID and is archive data from the same acquisition device sorted by time;

extracting the archive data to be processed through a preset sliding window to obtain a peer data set corresponding to each piece of archive data to be processed;

deduplicating all peer data sets based on the archive ID, wherein the deduplication comprises deduplication within each peer data set and deduplication between peer data sets;

and merging the deduplicated peer data sets to obtain target peer data.

Optionally, the step of acquiring archive data to be processed includes:

grouping archive data acquired from an archive database by image acquisition device ID to obtain m data sets to be processed, and storing the archive data of the m data sets in m corresponding cache regions in time order, to obtain archive data to be processed that is sorted by time under the same image acquisition device, wherein m is the number of image acquisition devices that acquired the archive data to be processed.

Optionally, extracting the archive data to be processed through a preset sliding window to obtain a peer data set corresponding to each piece of archive data to be processed includes:

extracting the archive data to be processed in the m cache regions in parallel through m preset sliding windows, to obtain a peer data set corresponding to each piece of archive data to be processed.

Optionally, the preset sliding window includes a preset extraction window and a preset sliding step, and extracting the archive data to be processed through the preset sliding window to obtain a peer data set corresponding to each piece of archive data to be processed includes:

presetting an extraction window with an upper edge of t-x1 and a lower edge of t+x1, wherein t is the position of the reference archive data, so that the extraction window extracts the x1 pieces of archive data to be processed above the reference archive data and the x1 pieces below it;

presetting a sliding step, and sliding the extraction window by the sliding step so that each piece of archive data to be processed corresponds to one extraction window;

and extracting the archive IDs located in the extraction window once per sliding step, to obtain a plurality of peer data sets corresponding to different extraction windows.

Optionally, deduplicating the peer data sets based on the archive ID includes:

deleting the archive data to be processed that has the same archive ID as the reference archive, while retaining the reference archive;

for the remaining archive data to be processed, retaining under each archive ID the one piece with the latest time and deleting the others, to obtain a deduplicated data set;

and performing relation deduplication on the deduplicated data sets obtained in the same cache region according to the archive IDs in those data sets.

Optionally, the archive IDs have an ordering relationship, and performing relation deduplication on the deduplicated data sets obtained in the same cache region includes:

sorting the archive data to be processed in the deduplicated data set according to the ordering relationship of the archive IDs;

pairing the archive IDs in the deduplicated data set according to that order to obtain the peer relationships of the reference archive, wherein each peer relationship includes a peer time and a peer ID, the peer time is the time corresponding to one of the ordered pieces of archive data to be processed in the peer ID, and the peer ID includes the archive ID of the reference archive and another archive ID paired with it;

and performing relation deduplication on identical peer IDs according to the peer time.

Optionally, the target peer data includes a target peer count, and the method further includes:

after relation deduplication has been performed in all cache regions, counting the peer relationships that have the same peer ID but different peer times, to obtain the target peer count.
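The counting step can be illustrated with a minimal Python sketch. It assumes each deduplicated peer relationship is available as a `(peer_id, peer_time)` tuple; that layout, the function name `count_peer_times`, and the sample timestamps are hypothetical choices for illustration, not part of the described method.

```python
from collections import defaultdict

def count_peer_times(relations):
    """Count, per peer ID, how many distinct peer times were recorded.

    `relations` is an iterable of (peer_id, peer_time) tuples collected
    from all cache regions (an assumed layout for this sketch).
    """
    times_by_pair = defaultdict(set)
    for peer_id, peer_time in relations:
        # A set collapses relations with identical peer ID and peer time.
        times_by_pair[peer_id].add(peer_time)
    return {peer_id: len(times) for peer_id, times in times_by_pair.items()}

# Hypothetical sample relations.
relations = [
    ("a_c", "2019/1/1-12:01:15"),
    ("a_c", "2019/1/1-12:01:15"),  # same ID and time: counted once
    ("a_c", "2019/1/1-12:05:40"),
    ("b_c", "2019/1/1-12:01:15"),
]
peer_counts = count_peer_times(relations)
```

Here `peer_counts` maps each peer ID to its target peer count under the stated assumptions.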

In a second aspect, an embodiment of the present invention provides an apparatus for processing peer data, including:

an acquisition module, configured to acquire archive data to be processed, wherein the archive data to be processed includes an archive ID and is archive data from the same acquisition device sorted by time;

an extraction module, configured to extract the archive data to be processed through a preset sliding window to obtain a peer data set corresponding to each piece of archive data to be processed;

a processing module, configured to deduplicate all peer data sets based on the archive ID, wherein the deduplication comprises deduplication within each peer data set and deduplication between peer data sets;

and a merging module, configured to merge the deduplicated peer data sets to obtain target peer data.

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for processing peer data provided by the embodiment of the invention.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the method for processing peer data provided in the embodiment of the present invention.

In the embodiment of the invention, archive data to be processed is acquired, wherein the archive data to be processed includes an archive ID and is archive data from the same acquisition device sorted by time; the archive data to be processed is extracted through a preset sliding window to obtain a peer data set corresponding to each piece of archive data to be processed; all peer data sets are deduplicated based on the archive ID, the deduplication comprising deduplication within each peer data set and deduplication between peer data sets; and the deduplicated peer data sets are merged to obtain target peer data. Because the archive data acquired by the same acquisition device is extracted through the sliding window and each resulting peer data set is deduplicated internally, repeated calculation caused by several co-occurrences falling within the same sliding window is avoided. With no repeated calculation inside a peer data set, the peer relationship is computed faster; with no distortion from repeated co-occurrences, it is computed more accurately. In addition, because deduplication is also performed between peer data sets, repeated peer relationships across the multiple peer data sets of the same acquisition device cannot make the merged target peer data inaccurate, which further improves the calculation accuracy of the target peer data.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a method for processing peer data according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of another method for processing peer data according to an embodiment of the present invention;

Fig. 3 is a schematic flow chart of another method for processing peer data according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an apparatus for processing peer data according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an extraction module according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a processing module according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a third deduplication unit according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of another apparatus for processing peer data according to an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Fig. 1, which is a schematic flow chart of a method for processing peer data according to an embodiment of the present invention, the method includes the following steps:

101. Acquire archive data to be processed.

The archive data to be processed can be obtained from an archive database and consists of image data of captured persons. Snapshots of a person are captured by image acquisition devices and filed through image clustering, yielding the archive data of each captured person; an archive is then created for that data and stored in the archive database. Each archive has a unique archive ID, and every piece of archive data in an archive carries that archive's ID. A captured person may have one or more archives. In one embodiment, a portrait picture may serve as archive data, as may a video frame that contains a portrait.

After archive data is acquired from the archive database, it is grouped by the image acquisition device that captured it. For example, suppose archive data A1 and A2 of captured person A were captured by image acquisition devices No. 1 and No. 2 respectively, and archive data B2 and B3 of captured person B were captured by image acquisition devices No. 2 and No. 3 respectively. Then A1 is placed in group 1, A2 and B2 in group 2, and B3 in group 3. In A1, A2, B2, and B3, the letters A and B denote the archive IDs of captured persons A and B, and the numbers 1, 2, and 3 denote the number of the capturing image acquisition device, also called the image acquisition device ID. These letters and numbers are used only for convenience of description and should not be construed as limiting the archive ID or the image acquisition device ID in this embodiment.

After grouping, the archive data in each group is sorted, for example by snapshot time. Each sorted group is then loaded into its own cache region as hot data, yielding the archive data to be processed.
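As a rough illustration of the grouping and sorting described for step 101, the following Python sketch groups records by device and sorts each group by snapshot time. The record fields (`archive_id`, `device_id`, `time`), the function name, and the sample timestamps are assumptions made for this sketch, and a plain dictionary stands in for the cache regions.

```python
from collections import defaultdict
from datetime import datetime

def group_and_sort(records):
    """Group archive records by device ID, then sort each group by time."""
    buffers = defaultdict(list)  # one list per device, standing in for a cache region
    for rec in records:
        buffers[rec["device_id"]].append(rec)
    for group in buffers.values():
        group.sort(key=lambda r: r["time"])  # chronological order
    return dict(buffers)

# Hypothetical records mirroring the A1/A2/B2/B3 example.
records = [
    {"archive_id": "B", "device_id": 2, "time": datetime(2019, 1, 1, 12, 1, 12)},
    {"archive_id": "A", "device_id": 1, "time": datetime(2019, 1, 1, 12, 1, 10)},
    {"archive_id": "A", "device_id": 2, "time": datetime(2019, 1, 1, 12, 1, 11)},
    {"archive_id": "B", "device_id": 3, "time": datetime(2019, 1, 1, 12, 1, 13)},
]
buffers = group_and_sort(records)
```

Group 2 ends up holding the records of both A and B in time order, matching the grouping in the example.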

It should be noted that the image capturing device may be a monitoring device disposed at a regular installation point, and the monitoring device may be a monitoring camera.

102. And extracting the archive data to be processed through a preset sliding window to obtain a peer data set corresponding to each archive data to be processed.

In this step, starting from the first piece of archive data to be processed, the sliding window takes each piece in turn and extracts it together with the archive data to be processed above and below it, thereby obtaining a peer data set for each piece. The sliding window is defined by an extraction window and a sliding step: the extraction window determines how many pieces of archive data to be processed are extracted each time, and the sliding step determines the distance or time by which the extraction window moves each time. The sliding step may be 1, that is, the distance or time of one piece of archive data to be processed.

103. Deduplicate all peer data sets based on the archive ID and merge them to obtain target peer data.

In this step, the deduplication comprises deduplication within each peer data set and deduplication between peer data sets. Deduplication within a peer data set may consist of detecting whether the peer data set contains several entries with the same archive ID and, if so, deleting the archive data to be processed that corresponds to the redundant entries, so that each archive ID appears only once in the peer data set. Deduplication between peer data sets may proceed as follows. First, the peer relationships of the archives in one peer data set are matched according to the archive ID. For example, in a peer data set (a, b, c), permutation and combination yields the peer relationships (a-b, a-c, b-a, b-c, c-a, c-b), in which a-b and b-a, a-c and c-a, and b-c and c-b are each the same peer relationship. Only one of each identical pair is retained; the retained one can be chosen by order, for example alphabetical order, giving (a-b, a-c, b-c). In the next peer data set (b, c, d) obtained through the sliding window, permutation and combination yields (b-c, b-d, c-b, c-d, d-b, d-c), and retaining by order gives (b-c, b-d, c-d). Between (a-b, a-c, b-c) and (b-c, b-d, c-d), b-c is repeated, so after deduplication the peer relationships are (a-b, a-c, b-c, b-d, c-d).
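The (a, b, c) and (b, c, d) example can be reproduced with a short Python sketch. It uses `itertools.combinations` on the alphabetically sorted IDs, which directly yields one ordered pair per relationship instead of generating both directions and discarding one; the function names are hypothetical.

```python
from itertools import combinations

def peer_pairs(peer_set):
    """Return one ordered pair per peer relationship in a peer data set."""
    ids = sorted(set(peer_set))  # drop repeated IDs, fix alphabetical order
    return list(combinations(ids, 2))

def merge_pairs(peer_sets):
    """Merge pairs from successive peer data sets, skipping repeats."""
    seen = set()
    merged = []
    for peer_set in peer_sets:
        for pair in peer_pairs(peer_set):
            if pair not in seen:  # e.g. ("b", "c") appears in both sets
                seen.add(pair)
                merged.append(pair)
    return merged

merged = merge_pairs([("a", "b", "c"), ("b", "c", "d")])
```

With these two sets, `merged` contains a-b, a-c, b-c, b-d, and c-d, each exactly once.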

In some possible embodiments, the reference archive data of an extraction window is taken as the center of that window, and the archive data to be processed above and below the reference archive data is extracted as the peer data set. For example, consider a peer data set (a₁, b₁, c₁, b₂, c₂, a₂, b₃), where the subscripts only distinguish order and b₂ is the reference archive data. Deduplication within the peer data set gives (a, b₂, c), and matching the peer relationships gives (a-b₂, b₂-c). The next peer data set obtained through the sliding window is (b₁, c₁, b₂, c₂, a₂, b₃, d₁), with c₂ as the reference archive data. Deduplication within this peer data set gives (a, b, c₂, d), and matching the peer relationships gives (a-c₂, b-c₂, c₂-d). Between (a-b₂, b₂-c) and (a-c₂, b-c₂, c₂-d), b₂-c and b-c₂ are duplicates once the subscripts are ignored, so the peer relationships are (a-b, a-c, b-c, c-d). Because pairs are formed only around the central reference archive data, this embodiment reduces the amount of peer relationship calculation.
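The reference-centered variant can be sketched in Python as follows. The sketch assumes each window is given as a list of archive IDs with the reference at the middle index, and it pairs only the reference with the other distinct IDs; `reference_pairs` is a hypothetical helper name.

```python
def reference_pairs(window_ids):
    """Pair the central reference ID with every other distinct ID in the window."""
    ref = window_ids[len(window_ids) // 2]  # reference sits at the window center
    others = set(window_ids) - {ref}        # within-set deduplication by archive ID
    return {tuple(sorted((ref, other))) for other in others}

# Archive IDs of the two windows from the example (subscripts dropped).
first = reference_pairs(["a", "b", "c", "b", "c", "a", "b"])   # reference: b
second = reference_pairs(["b", "c", "b", "c", "a", "b", "d"])  # reference: c

# A set union removes the pair that both windows produce.
relations = sorted(first | second)
```

Each window contributes at most one pair per distinct neighbor, rather than every pairwise combination of the window's contents.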

Merging may first combine the peer relationships within the same cache region and then combine the peer relationships of different cache regions; no deduplication step is needed when combining across cache regions.

It should be noted that the processing method for peer data provided in the embodiment of the present invention may be applied to a data processing device, for example: a database, a server, a computer, or the like capable of performing peer data processing.

Optionally, referring to Fig. 2, which is a schematic flow chart of another method for processing peer data according to an embodiment of the present invention, the method includes the following steps:

201. Acquire archive data from an archive database.

In this step, archive data may be obtained from the archive database by a fetch thread. There may be one fetch thread or several; when there are several, they may run in parallel.

In a possible embodiment, the threads are m parallel fetch threads corresponding to m image acquisition devices. The image acquisition devices may be monitoring devices, such as monitoring cameras, deployed at planned installation points. Each image acquisition device has an image acquisition device ID that distinguishes it from the other image acquisition devices, and the image data it captures carries that ID.

The thread names of the m threads correspond to the m image acquisition device IDs, and each thread fetches from the archive database only the data associated with its image acquisition device ID. For example, thread 1 fetches only the archive data captured by image acquisition device No. 1, and thread 2 fetches only the archive data captured by image acquisition device No. 2.

202. Group the acquired archive data by thread, load each group into its own cache region, and sort it by snapshot time.

In this embodiment, each piece of archive data in the archive database includes both the image acquisition device ID and the snapshot time. Each thread fetches from the archive database the archive data for its image acquisition device ID. The m image acquisition devices correspond to m image acquisition device IDs and m fetch threads, so m cache regions are set up, and all archive data loaded into one cache region shares the same image acquisition device ID. Within each cache region, the archive data is sorted by snapshot time, in either chronological or reverse chronological order, to obtain the archive data to be processed for that cache region.
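One way to picture the per-device fetch threads of steps 201 and 202 is the sketch below, which uses Python's standard `concurrent.futures.ThreadPoolExecutor` with one worker per device. The in-memory list `ARCHIVE_DB`, the record fields, and the function names are hypothetical stand-ins; a real archive database and its query interface are not specified here.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

# Hypothetical stand-in for the archive database.
ARCHIVE_DB = [
    {"archive_id": "a", "device_id": 1, "time": datetime(2019, 1, 1, 12, 1, 12)},
    {"archive_id": "b", "device_id": 2, "time": datetime(2019, 1, 1, 12, 1, 11)},
    {"archive_id": "c", "device_id": 1, "time": datetime(2019, 1, 1, 12, 1, 10)},
]

def fetch_for_device(device_id):
    """Fetch only the records of one device and sort them by snapshot time."""
    rows = [r for r in ARCHIVE_DB if r["device_id"] == device_id]
    return sorted(rows, key=lambda r: r["time"])

def load_buffers(device_ids):
    """Run one fetch task per device in parallel; return {device_id: sorted rows}."""
    with ThreadPoolExecutor(max_workers=len(device_ids)) as pool:
        results = pool.map(fetch_for_device, device_ids)  # results keep input order
        return dict(zip(device_ids, results))

buffers = load_buffers([1, 2])
```

Each dictionary entry plays the role of one cache region holding time-sorted records from a single device.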

203. Set a sliding window for each cache region to extract the archive data to be processed in that cache region, obtaining peer data sets.

The sliding window comprises an extraction window and a sliding step. The extraction window has an upper-edge parameter and a lower-edge parameter, which determine the size of the window and hence the number of pieces of archive data to be processed that it extracts. The sliding step is the length by which the extraction window slides and may be a time step or a distance step. A time step is an amount of time; for example, if the minimum time granularity of the snapshot times is one second, the time step may be 1 second. A distance step is measured in rows, one unit being the distance between one piece of archive data to be processed and the next.

In addition, the sliding window may be given a reference archive data position t. Taking chronological ordering as an example, relative to the reference archive data the upper edge of the extraction window is t-x1 and the lower edge is t+x1, so the extraction range is [t-x1, t+x1]; that is, the extraction window extracts the archive data to be processed that lies within [t-x1, t+x1]. Here x1 is a time length when t is a time position and a distance length when t is a distance position. In some possible embodiments, the upper edge of the extraction window is t-x1 and the lower edge is t+x2, where x1 and x2 are different values. In short, the window [t-x1, t+x1] extracts an odd number of pieces of archive data to be processed, whereas the window [t-x1, t+x2] may extract an even or an odd number.
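A minimal Python sketch of a distance-based extraction window is given below. It assumes 0-based list positions and, as a simplification chosen for this sketch, only yields windows that fit entirely inside the data; `extract_windows` and its parameters are hypothetical names.

```python
def extract_windows(records, x1, x2=None, step=1):
    """Yield (t, window) pairs for the window [t - x1, t + x2] over a sorted list.

    Positions are 0-based. Only reference positions whose window lies fully
    inside the list are visited (a simplifying assumption).
    """
    if x2 is None:
        x2 = x1  # symmetric window [t - x1, t + x1]
    for t in range(x1, len(records) - x2, step):
        yield t, records[t - x1 : t + x2 + 1]

rows = list(range(1, 14))  # thirteen hypothetical rows numbered 1..13

symmetric = list(extract_windows(rows, x1=5))         # 2 * 5 + 1 = 11 rows each
asymmetric = list(extract_windows(rows, x1=1, x2=2))  # 1 + 1 + 2 = 4 rows each
```

A symmetric window always holds 2 * x1 + 1 rows, which is odd, while an asymmetric window holds x1 + x2 + 1 rows, which can be even or odd.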

Additionally, in some possible embodiments, taking chronological ordering as an example: if the archive data to be processed is continuous at the minimum time granularity, it is sorted directly; if it is discontinuous, placeholder records can be interpolated so that the archive data to be processed obtained by the extraction window is continuous. Table 1 shows an example:

Sort order	Image acquisition device ID	Archive ID	Snapshot time
1	1	a	2019/1/1-12:01:10
2	1	b	2019/1/1-12:01:12

TABLE 1

Table 1 shows acquired archive data to be processed, including the sort order, image acquisition device ID, archive ID, and snapshot time. The snapshot times in Table 1 have a minimum granularity of one second, and the records for archive IDs a and b are not contiguous: no archive data to be processed exists for 2019/1/1-12:01:11. The archive ID for that time can be set to Null, as shown in Table 2, which is the archive data to be processed after the Null record is inserted. The extraction window extracts the Null record like any other record. When the Null record would serve as the reference data, however, it is skipped, or the resulting peer data set is defined as empty.

Sort order	Image acquisition device ID	Archive ID	Snapshot time
1	1	a	2019/1/1-12:01:10
2	1	Null	2019/1/1-12:01:11
3	1	b	2019/1/1-12:01:12

TABLE 2
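The step from Table 1 to Table 2 can be sketched in Python as follows. Records are represented as `(archive_id, snapshot_time)` tuples and Python's `None` plays the role of Null; the function name `fill_gaps` and this tuple layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y/%m/%d-%H:%M:%S"  # timestamp layout used in the tables

def fill_gaps(records, granularity=timedelta(seconds=1)):
    """Insert (None, time) placeholders for every missing time slot.

    `records` must already be sorted by time.
    """
    filled = []
    prev_time = None
    for archive_id, time in records:
        if prev_time is not None:
            slot = prev_time + granularity
            while slot < time:  # one placeholder per missing slot
                filled.append((None, slot))
                slot += granularity
        filled.append((archive_id, time))
        prev_time = time
    return filled

table_1 = [
    ("a", datetime.strptime("2019/1/1-12:01:10", FMT)),
    ("b", datetime.strptime("2019/1/1-12:01:12", FMT)),
]
table_2 = fill_gaps(table_1)
```

The result has three rows, with a placeholder at 12:01:11 between the records for a and b.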

For example, the sliding-window extraction may proceed as follows. The archive data to be processed in the cache region is shown in Table 3:

TABLE 3

For Table 3, the extraction window is set to [t-5, t+5] and the sliding step to 1 second. The first five records have fewer than five records above them, and the peer data set extracted at any of those positions would be a proper subset of the peer data set extracted at t = 6, so they are not extracted separately and extraction starts directly at t = 6. The first peer data set obtained through the extraction window is Table 4:

Sort order	Image acquisition device ID	Archive ID	Snapshot time
1	1	a	2019/1/1-12:01:10
2	1	b	2019/1/1-12:01:11
3	1	a	2019/1/1-12:01:12
4	1	c	2019/1/1-12:01:13
5	1	b	2019/1/1-12:01:14
6	1	c	2019/1/1-12:01:15
7	1	d	2019/1/1-12:01:16
8	1	c	2019/1/1-12:01:17
9	1	g	2019/1/1-12:01:18
10	1	h	2019/1/1-12:01:19
11	1	g	2019/1/1-12:01:20

TABLE 4

After the first peer data set is extracted, the sliding window slides downwards by a sliding step length, at this time, t is 7, and a second peer data set is obtained through extraction of the extraction window, which is shown in table 5:

TABLE 5

After the second peer data set is extracted, the sliding window slides down by another sliding step, so that t is 8, and the third peer data set obtained through the extraction window is shown in Table 6:

Sort order	Image acquisition device ID	Archive ID	Snapshot time
3	1	a	2019/1/1-12:01:12
4	1	c	2019/1/1-12:01:13
5	1	b	2019/1/1-12:01:14
6	1	c	2019/1/1-12:01:15
7	1	d	2019/1/1-12:01:16
8	1	c	2019/1/1-12:01:17
9	1	g	2019/1/1-12:01:18
10	1	h	2019/1/1-12:01:19
11	1	g	2019/1/1-12:01:20
12	1	h	2019/1/1-12:01:21
13	1	g	2019/1/1-12:01:22

TABLE 6

204. Deduplicate the peer data sets based on the archive ID of the archive data to be processed, to obtain peer data.

The deduplication comprises deduplication within each peer data set and deduplication between peer data sets. Deduplication within a peer data set may consist of detecting whether the peer data set contains several entries with the same archive ID and, if so, deleting the archive data to be processed that corresponds to the redundant entries, so that each archive ID appears only once in the peer data set. Deduplication between peer data sets may consist of matching the peer relationships of the archives in each peer data set according to the archive ID and then removing duplicates among those peer relationships.

Optionally, deduplication within a peer data set may specifically be: deleting the archive data to be processed that has the same archive ID as the reference archive, while retaining the reference archive; then, for the remaining archive data to be processed, retaining under each archive ID the piece with the latest time and deleting the others, to obtain the deduplicated data set, as illustrated by Tables 4, 4a, 5a, 6, and 6a.

Specifically, Table 4 is the first peer data set, and its peer relationships can be obtained by permutation and combination. Taking c, the reference archive data at t = 6, as the subject, the peer relationships of c are: a_c, b_c, a_c, c_c, b_c, c_d, c_c, c_g, c_h, c_g. These peer relationships clearly contain repetitions. Before the peer relationships are determined, the first peer data set can therefore be deduplicated by archive ID to obtain a deduplicated data set. Specifically, the data in the first peer data set is sorted by archive ID; any other record with the same archive ID as the current reference data is deleted; and where several pieces of archive data to be processed share an archive ID, only the one with the latest time is retained. The resulting first deduplicated data set is Table 4a:

Sort order	Image acquisition device ID	Archive ID	Snapshot time
3	1	a	2019/1/1-12:01:12
5	1	b	2019/1/1-12:01:14
6	1	c	2019/1/1-12:01:15
7	1	d	2019/1/1-12:01:16
11	1	g	2019/1/1-12:01:20
10	1	h	2019/1/1-12:01:19

TABLE 4a

At this point, the peer relationships of the first deduplicated data set are obtained from Table 4a by permutation and combination. Taking c, the reference archive data at t = 6, as the subject, the first peer relationships of c are: a_c, b_c, c_d, c_g, c_h. These peer relationships contain no repetition.
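The derivation of Table 4a from Table 4 can be checked with a small Python sketch. Rows are written as `(sort_order, archive_id, time)` tuples with the date omitted, since every row falls on 2019/1/1; the function names are hypothetical, and the sketch is only one possible reading of the deduplication rule described above.

```python
def dedupe_window(window, ref_index):
    """Keep the reference row plus the latest row of every other archive ID."""
    ref = window[ref_index]
    latest = {}
    for row in window:
        _, archive_id, time = row
        if archive_id == ref[1]:
            continue  # rows sharing the reference's archive ID are dropped
        if archive_id not in latest or time > latest[archive_id][2]:
            latest[archive_id] = row  # fixed-width HH:MM:SS strings compare correctly
    latest[ref[1]] = ref  # the reference row itself is retained
    return sorted(latest.values(), key=lambda r: r[1])  # order by archive ID

def reference_relations(deduped, ref_id):
    """Pair the reference ID with each other ID, smaller ID first."""
    return ["_".join(sorted((row[1], ref_id))) for row in deduped if row[1] != ref_id]

table_4 = [
    (1, "a", "12:01:10"), (2, "b", "12:01:11"), (3, "a", "12:01:12"),
    (4, "c", "12:01:13"), (5, "b", "12:01:14"), (6, "c", "12:01:15"),
    (7, "d", "12:01:16"), (8, "c", "12:01:17"), (9, "g", "12:01:18"),
    (10, "h", "12:01:19"), (11, "g", "12:01:20"),
]
table_4a = dedupe_window(table_4, ref_index=5)  # row 6 (archive ID c) is the reference
relations = reference_relations(table_4a, "c")
```

The surviving rows are 3, 5, 6, 7, 11, and 10, in the same order as Table 4a, and `relations` contains the five non-repeating relationships listed for c.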

Table 5 is the second peer data set. Similarly, the second deduplicated data set is obtained as Table 5a:

Sort order	Image acquisition device ID	Archive ID	Snapshot time
3	1	a	2019/1/1-12:01:12
5	1	b	2019/1/1-12:01:14
8	1	c	2019/1/1-12:01:17
7	1	d	2019/1/1-12:01:16
11	1	g	2019/1/1-12:01:20
12	1	h	2019/1/1-12:01:21

TABLE 5a

At this point, the peer relationships of the second deduplication data set are obtained from Table 5a by permutation and combination. Taking d at t = 7 as the reference archive data, the second peer relationship of d is: a_d, b_d, c_d, d_g, d_h. These peer relationships contain no repetition.

Table 6 is the third peer data set. Similarly, a third deduplication data set is obtained as Table 6a:

3	1	a	2019/1/1-12:01:12
5	1	b	2019/1/1-12:01:14
8	1	c	2019/1/1-12:01:17
7	1	d	2019/1/1-12:01:16
13	1	g	2019/1/1-12:01:22
12	1	h	2019/1/1-12:01:21

TABLE 6a

At this point, the peer relationships of the third deduplication data set are obtained from Table 6a by permutation and combination. Taking c at t = 8 as the reference archive data, the third peer relationship of c is: a_c, b_c, c_d, c_g, c_h. These peer relationships contain no repetition.

Optionally, the deduplication between different peer data sets may specifically be: performing relational deduplication on the deduplication data sets obtained in the same cache region according to the archive IDs in those data sets. More specifically, each deduplication data set is sorted according to the archive ID, and the archive IDs in the deduplication data set are paired according to that ordering to obtain the peer relationships of the reference archive. Each peer relationship comprises a peer time and a peer ID, where the peer time is the time corresponding to the piece of archive data to be processed that is sorted later in the peer ID.

The peer ID includes the archive ID of the reference archive and another archive ID paired with it. Relational deduplication is then performed on identical peer IDs according to the peer time to obtain the peer data.

The archive IDs have a sorting relationship, which may be sorting by size or sorting in dictionary order. When sorting by size, the archive ID is a numeric archive ID, with smaller numbers sorted first and larger numbers sorted later; for example, 562319_562378 represents the peer ID of archive ID 562319 and archive ID 562378. Sorting in dictionary order may follow the dictionary order of the name field of the user archive; for example, Liquan_Wang II represents the peer ID of archive ID Liquan and archive ID Wang II. It should be noted that the dictionary may be any dictionary with an ordered sequence, such as an alphabet, a Chinese character dictionary, or a dictionary of common names, and may be a single dictionary or a combination of several dictionaries (in which case the archive ID has several fields, each corresponding to a different dictionary).
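As an illustrative sketch only, a peer ID could be formed by joining two archive IDs in their sorted order, so that the same pair always yields the same peer ID regardless of which archive is the reference. The function name `make_peer_id` is an assumption of this sketch; Python's built-in ordering (numeric for integers, code-point order for strings) stands in for the sorting relationship and does not implement a custom dictionary.

```python
def make_peer_id(id_a, id_b):
    """Join two archive IDs in sorted order so that (a, b) and (b, a)
    produce the same peer ID."""
    first, second = sorted((id_a, id_b))
    return f"{first}_{second}"


# Numeric archive IDs: the smaller number is placed first.
numeric_peer_id = make_peer_id(562378, 562319)

# String archive IDs: ordered by Python's default string comparison.
string_peer_id = make_peer_id("d", "c")
```

One design point worth noting: numeric archive IDs stored as strings compare character by character ("10" sorts before "9"), so they would need to be converted to integers before sorting by size.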

Specifically, taking Tables 4a, 5a, and 6a as examples, they correspond to the first peer relationship (a_c, b_c, c_d, c_g, c_h), the second peer relationship (a_d, b_d, c_d, d_g, d_h), and the third peer relationship (a_c, b_c, c_d, c_g, c_h), respectively, and the peer time is the time of the piece of archive data to be processed that is sorted later in the peer ID. That is, in the first peer relationship:

a_c_2019/1/1-12:01:15: c is sorted later in the peer ID a_c, so the time corresponding to c in Table 4a is taken.

b_c_2019/1/1-12:01:15: c is sorted later in the peer ID b_c, so the time corresponding to c in Table 4a is taken.

c_d_2019/1/1-12:01:16: d is sorted later in the peer ID c_d, so the time corresponding to d in Table 4a is taken.

c_g_2019/1/1-12:01:20: g is sorted later in the peer ID c_g, so the time corresponding to g in Table 4a is taken.

c_h_2019/1/1-12:01:19: h is sorted later in the peer ID c_h, so the time corresponding to h in Table 4a is taken.

In the second peer relationship:

a_d_2019/1/1-12:01:16: d is sorted later in the peer ID a_d, so the time corresponding to d in Table 5a is taken.

b_d_2019/1/1-12:01:16: d is sorted later in the peer ID b_d, so the time corresponding to d in Table 5a is taken.

c_d_2019/1/1-12:01:16: d is sorted later in the peer ID c_d, so the time corresponding to d in Table 5a is taken.

d_g_2019/1/1-12:01:20: g is sorted later in the peer ID d_g, so the time corresponding to g in Table 5a is taken.

d_h_2019/1/1-12:01:21: h is sorted later in the peer ID d_h, so the time corresponding to h in Table 5a is taken.

When both the peer ID and the peer time are the same, the peer relationship is considered duplicated. The duplicated peer data in the first and second peer relationships is c_d_2019/1/1-12:01:16, so only one piece needs to be retained.

In the third peer relationship:

a_c_2019/1/1-12:01:17: c is sorted later in the peer ID a_c, so the time corresponding to c in Table 6a is taken.

b_c_2019/1/1-12:01:17: c is sorted later in the peer ID b_c, so the time corresponding to c in Table 6a is taken.

c_d_2019/1/1-12:01:16: d is sorted later in the peer ID c_d, so the time corresponding to d in Table 6a is taken.

c_g_2019/1/1-12:01:22: g is sorted later in the peer ID c_g, so the time corresponding to g in Table 6a is taken.

c_h_2019/1/1-12:01:21: h is sorted later in the peer ID c_h, so the time corresponding to h in Table 6a is taken.

When both the peer ID and the peer time are the same, the peer relationship is considered duplicated. The duplicated peer data in the first, second, and third peer relationships is c_d_2019/1/1-12:01:16, so only one piece needs to be retained.
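The pairing and relational deduplication walked through above could be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `peer_relations` is hypothetical, each deduplication data set is reduced to (archive ID, time) pairs taken from Tables 4a, 5a, and 6a, and the times are shortened to HH:MM:SS.

```python
def peer_relations(dedup_set, reference_id):
    """Pair the reference archive with every other archive of one
    deduplication data set. The peer time is the time of whichever
    archive ID is sorted later in the peer ID."""
    times = dict(dedup_set)  # archive ID -> retained time
    relations = []
    for archive_id in times:
        if archive_id == reference_id:
            continue
        first, second = sorted((reference_id, archive_id))
        relations.append((f"{first}_{second}", times[second]))
    return relations


table_4a = [("a", "12:01:12"), ("b", "12:01:14"), ("c", "12:01:15"),
            ("d", "12:01:16"), ("g", "12:01:20"), ("h", "12:01:19")]
table_5a = [("a", "12:01:12"), ("b", "12:01:14"), ("c", "12:01:17"),
            ("d", "12:01:16"), ("g", "12:01:20"), ("h", "12:01:21")]
table_6a = [("a", "12:01:12"), ("b", "12:01:14"), ("c", "12:01:17"),
            ("d", "12:01:16"), ("g", "12:01:22"), ("h", "12:01:21")]

all_relations = (peer_relations(table_4a, "c")
                 + peer_relations(table_5a, "d")
                 + peer_relations(table_6a, "c"))

# Relational deduplication: identical (peer ID, peer time) pairs collapse to one.
peer_data = sorted(set(all_relations))
```

In this sketch a set is used for the relational deduplication because a duplicate is defined as an exact match of both peer ID and peer time; the three copies of c_d at 12:01:16 collapse into one entry while a_c at 12:01:15 and a_c at 12:01:17 are both kept.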

Within the same cache region, the peer data of that cache region are obtained through deduplication within each peer data set (archive deduplication) and deduplication between peer data sets (relationship deduplication).

In addition, in some possible embodiments, among peer data that have the same peer ID but different times within a preset time period, only one piece is retained; the peer data closest to the middle time point of the preset time period may be retained preferentially.

205. Merge the peer data of the plurality of cache regions to obtain target peer data.

The merging may first merge the peer data within each cache region and then merge the peer data of different cache regions; no deduplication step is required when merging the peer data of different cache regions. Merging the peer data of all cache regions yields the target peer data.
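One possible reading of this optional step is sketched below. The text does not specify how the preset time period is delimited, so the sketch assumes fixed-length periods aligned to whole multiples of the period length; the function name `thin_by_period`, the 10-second period, and the sample records are likewise assumptions.

```python
from datetime import datetime

FMT = "%Y/%m/%d-%H:%M:%S"
EPOCH = datetime(1970, 1, 1)


def thin_by_period(relations, period_s):
    """For each (peer ID, period) keep only the record whose time is
    closest to the midpoint of that period."""
    best = {}
    for peer_id, time_str in relations:
        ts = (datetime.strptime(time_str, FMT) - EPOCH).total_seconds()
        bucket = int(ts // period_s)                 # index of the period
        midpoint = bucket * period_s + period_s / 2  # middle time point
        distance = abs(ts - midpoint)
        key = (peer_id, bucket)
        if key not in best or distance < best[key][0]:
            best[key] = (distance, time_str)
    return sorted((peer_id, time_str)
                  for (peer_id, _), (_, time_str) in best.items())


# Hypothetical peer data with repeated peer IDs a few seconds apart.
relations = [
    ("a_c", "2019/1/1-12:01:15"),
    ("a_c", "2019/1/1-12:01:17"),
    ("c_g", "2019/1/1-12:01:20"),
    ("c_g", "2019/1/1-12:01:22"),
]
thinned = thin_by_period(relations, period_s=10)
```

With these assumptions, records of the same peer ID that fall in the same 10-second period are reduced to a single record, which would prevent one encounter from being counted as several.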

Optionally, the target peer data includes a target peer count, which may be obtained by counting, after relational deduplication has been performed in each cache region, the number of pieces of peer data that have the same peer ID and correspond to different peer times.

In some possible embodiments, before the peer count is calculated, among peer data that have the same peer ID but different times within a preset time period, only one piece is retained; the peer data closest to the middle time point of the preset time period may be retained preferentially.

In this embodiment, data are acquired through parallel threads, which increases the data acquisition speed; peer calculation is performed on different cache regions through a plurality of sliding windows, which increases the calculation speed; and deduplication is performed both within the same window and between different windows, which increases the calculation accuracy.

Optionally, referring to fig. 3, fig. 3 is a schematic flowchart of another method for processing peer data according to an embodiment of the present invention. As shown in fig. 3, the method includes the following steps:

301. Acquire the archive data to be processed from the archive database in parallel through n threads.
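A minimal sketch of the merge and count follows, assuming each cache region has already produced its deduplicated peer data as (peer ID, peer time) pairs. The function name `target_peer_counts` and the sample data are hypothetical.

```python
from collections import Counter


def target_peer_counts(peer_data_per_cache):
    """Merge the peer data of all cache regions without further
    deduplication, then count the records per peer ID."""
    merged = [rel for cache in peer_data_per_cache for rel in cache]
    return Counter(peer_id for peer_id, _ in merged)


# Hypothetical deduplicated peer data of two cache regions (two devices).
cache_1 = [("a_c", "12:01:15"), ("a_c", "12:01:17"), ("c_d", "12:01:16")]
cache_2 = [("a_c", "12:30:02")]
counts = target_peer_counts([cache_1, cache_2])
```

Because each cache region is deduplicated beforehand, every remaining record of a peer ID has a distinct peer time within its region, so counting records per peer ID gives the peer count described above.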

The n threads correspond to n preset first cache regions, and a first cache region may be located in the memory of a CPU or in the video memory of a GPU. The archive data to be processed includes the number of the image acquisition device, i.e. the image acquisition device ID, which distinguishes that device from the other image acquisition devices. The image data captured by an image acquisition device includes its image acquisition device ID.

The archive database contains the archives of all captured persons, and each archive stores the archive data of the corresponding captured person. The archive data to be processed are image data of captured persons: snapshots of the captured persons are obtained through the image acquisition devices and filed through image clustering to obtain the archive data of each captured person; archives are established accordingly and stored in the archive database, and each archive corresponds to an archive ID.

Specifically, the archive data to be processed may be acquired from the archive database through a data acquisition condition. The acquisition condition may be aid % n, where aid is the archive ID and n corresponds to the threads; that is, the archive data are partitioned by the remainder of the archive ID modulo n, so that the amount of data acquired by each thread is approximately equal and data skew is prevented.

302. Load the acquired archive data to be processed into the n first cache regions.

The archive data to be processed acquired by each thread are stored in one and the same first cache region to facilitate subsequent retrieval and processing.

303. Regroup the archive data to be processed in the n first cache regions according to the image acquisition device ID to obtain m data sets to be processed, store the m data sets to be processed in m second cache regions respectively, and sort each data set by snapshot time.
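The modulo-based acquisition condition could be sketched as below. The in-memory list `archive_db` is a hypothetical stand-in for the archive database, the function names are assumptions, and taking n as the number of threads with one remainder per thread is one reading of the condition rather than a detail confirmed by the text.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_partition(archive_db, n, remainder):
    """Fetch the rows whose archive ID leaves the given remainder modulo n."""
    return [row for row in archive_db if row["aid"] % n == remainder]


def load_first_caches(archive_db, n):
    """Run n fetches in parallel; the k-th result plays the role of the
    k-th first cache region."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(fetch_partition, archive_db, n, k)
                   for k in range(n)]
        return [future.result() for future in futures]


# Hypothetical archive rows with numeric archive IDs 1..20.
archive_db = [{"aid": aid} for aid in range(1, 21)]
first_caches = load_first_caches(archive_db, n=4)
```

When archive IDs are roughly uniformly distributed, the remainders are too, which is why this partitioning keeps the per-thread load approximately equal.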

The first cache regions and the second cache regions may occupy the same cache; the first cache regions store the archive data acquired from the database, while the second cache regions store the regrouped archive data, i.e. a second cache region may be obtained by repartitioning the storage area of the first cache regions. The archive data in the archive database include the image acquisition device ID and the snapshot time corresponding to each piece of archive data. The m image acquisition devices correspond to m image acquisition device IDs; accordingly, the number of second cache regions is set to m, and all archive data loaded into one second cache region have the same image acquisition device ID. Within each second cache region, all archive data to be processed are sorted by their snapshot times, in either chronological or reverse chronological order, to obtain the archive data to be processed corresponding to that cache region.

304. Set a sliding window for each cache region to extract the archive data to be processed in that cache region, obtaining peer data sets.
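The regrouping of step 303 could be sketched as follows, under the assumption that each row carries a `device_id` and a sortable `time` field; these field names, the function name `regroup_by_device`, and the sample rows are hypothetical.

```python
from collections import defaultdict


def regroup_by_device(first_caches):
    """Regroup rows from the n first cache regions into one list per image
    acquisition device ID, each sorted by snapshot time (ascending)."""
    second_caches = defaultdict(list)
    for cache in first_caches:
        for row in cache:
            second_caches[row["device_id"]].append(row)
    for rows in second_caches.values():
        rows.sort(key=lambda row: row["time"])
    return dict(second_caches)


# Hypothetical contents of two first cache regions (times as plain integers).
first_caches = [
    [{"aid": 4, "device_id": 1, "time": 30}, {"aid": 8, "device_id": 2, "time": 10}],
    [{"aid": 5, "device_id": 1, "time": 20}, {"aid": 9, "device_id": 2, "time": 40}],
]
second_caches = regroup_by_device(first_caches)
```

Passing `reverse=True` to the sort would give the reverse chronological order that the description also allows.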

This step is similar to step 203 in the implementation of fig. 2, and is not described herein again.

305. Deduplicate the peer data sets according to the archive IDs of the archive data to be processed to obtain peer data.

This step is similar to step 204 in the implementation of fig. 2, and is not described herein again.

306. Merge the peer data of the plurality of cache regions to obtain target peer data.

This step is similar to step 205 in the implementation of fig. 2, and is not described herein again.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a device for processing peer data according to an embodiment of the present invention, as shown in fig. 4, including:

an obtaining module 401, configured to obtain to-be-processed archive data, where the to-be-processed archive data includes an archive ID, and the to-be-processed archive data is archive data that is sorted according to time by the same acquisition device;

an extracting module 402, configured to extract the to-be-processed archive data through a preset sliding window to obtain a peer data set corresponding to each to-be-processed archive data;

a processing module 403, configured to perform deduplication on all peer data sets based on the archive ID, where the deduplication includes deduplication in the peer data sets and deduplication between the peer data sets;

and a merging module 404, configured to merge the deduplicated peer data sets to obtain target peer data.

Optionally, as shown in fig. 4, the obtaining module 401 is further configured to group the archive data obtained from the archive database according to the image acquisition device ID to obtain m data sets to be processed, and to store the archive data of the m data sets to be processed in m cache regions respectively, sorted by time, so as to obtain archive data to be processed that are sorted by time under the same image acquisition device, where m is the number of image acquisition devices that acquired the archive data to be processed.

Optionally, as shown in fig. 4, the extracting module 402 is further configured to extract, in parallel, each piece of to-be-processed archive data in the m cache regions through m preset sliding windows, so as to obtain a peer data set corresponding to each piece of to-be-processed archive data.

Optionally, as shown in fig. 5, the preset sliding window includes a preset extraction window and a preset sliding step, and the extraction module 402 includes:

a first preset unit 4021, configured to preset an extraction window with an upper edge of t-x1 and a lower edge of t + x1, where t is a corresponding position of the reference archive data, and the extraction window extracts x1 pieces of to-be-processed archive data above and below the reference archive data;

a second preset unit 4022, configured to preset a sliding step length, where the extraction window slides according to the sliding step length, so that each to-be-processed file corresponds to one extraction window;

an extracting unit 4023, configured to extract the archive IDs located in the extraction window once each time the window slides by the sliding step, so as to obtain a plurality of peer data sets corresponding to different extraction windows.
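The extraction window and sliding step handled by these units could be sketched as follows. The function name `extract_peer_sets` and the archive sequence are hypothetical, positions are 0-based in the code, and clipping the window at the ends of the cache region is an assumption, since the behavior at the boundaries is not described.

```python
def extract_peer_sets(records, x1, step=1):
    """Slide a window covering positions t - x1 .. t + x1 over the
    time-sorted records and return one (reference, window) pair per stop."""
    peer_sets = []
    for t in range(0, len(records), step):
        upper = max(0, t - x1)                   # upper edge, clipped at the start
        lower = min(len(records), t + x1 + 1)    # lower edge, clipped at the end
        peer_sets.append((records[t], records[upper:lower]))
    return peer_sets


# Hypothetical time-sorted archive IDs of one cache region (positions 0..12).
records = list("abacbcdcghghg")
peer_sets = extract_peer_sets(records, x1=5)
reference, window = peer_sets[5]
```

With a sliding step of 1, every piece of archive data becomes the reference of exactly one window, which matches the requirement that each archive to be processed corresponds to one extraction window.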

Optionally, as shown in fig. 6, the processing module 403 includes:

a first deduplication unit 4031, configured to delete the to-be-processed archive data having the same archive ID as the reference archive, and retain the reference archive;

a second deduplication unit 4032, configured to retain, among the remaining to-be-processed archive data, the one piece with the largest time under each archive ID, and to delete the rest to obtain a deduplication data set;

a third deduplication unit 4033, configured to perform relational deduplication on the deduplication data sets obtained in the same cache region according to the archive IDs in the deduplication data sets.

Optionally, as shown in fig. 7, the archive IDs have a sorting relationship, and the third deduplication unit 4033 includes:

a sorting subunit 40331, configured to sort, according to the sorting relationship of the archive ID, the archive data to be processed in the duplicate removal data set;

a matching subunit 40332, configured to pair the archive IDs in the deduplication data set according to the sorting to obtain the peer relationships of the reference archive, where each peer relationship includes a peer time and a peer ID, the peer time is the time corresponding to the piece of to-be-processed archive data that is sorted later in the peer ID, and the peer ID includes the archive ID of the reference archive and another archive ID paired with it;

a duplicate removal subunit 40333, configured to perform relational duplicate removal on the same peer IDs according to the peer time.

Optionally, as shown in fig. 8, the apparatus further includes:

a counting module 405, configured to count, after relational deduplication has been performed in all the cache regions, the number of peer relationships that have the same peer ID and correspond to different peer times, so as to obtain the target peer count.

It should be noted that the apparatus for processing peer data provided in the embodiment of the present invention may be applied to a data processing device, for example: a database, a server, a computer, or the like capable of performing peer data processing.

The apparatus for processing peer data provided in the embodiment of the present invention can implement each implementation manner in the method embodiments of fig. 1 to fig. 3 and achieves the corresponding beneficial effects; to avoid repetition, details are not described here again.

Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 9, including: a memory 1002, a processor 1001 and a computer program stored on said memory and executable on said processor, wherein:

the processor 1001 is used for calling the computer program stored in the memory 1002, and executes the following steps:

Optionally, the step of acquiring the archive data to be processed performed by the processor 1001 includes:

Optionally, the extracting, performed by the processor 1001, the to-be-processed archive data through a preset sliding window to obtain a peer data set corresponding to each to-be-processed archive data includes:

Optionally, the preset sliding window includes a preset extraction window and a preset sliding step length, and the processor 1001 extracts the to-be-processed archive data through the preset sliding window to obtain a peer data set corresponding to each to-be-processed archive data, including:

Optionally, the performing, by the processor 1001, deduplication on the peer data set based on the archive ID includes:

Optionally, the archive IDs have an ordering relationship, and the deduplication processing performed by the processor 1001 on the deduplication data sets obtained in the same cache region includes:

sorting the archive data to be processed in the duplicate removal data set according to the sorting relation of the archive ID;

Optionally, the target peer-to-peer data includes a target peer-to-peer number, and the processor 1001 further performs the following steps:

It should be noted that the electronic device provided in the embodiment of the present invention may be applied to a processing device for peer data, for example: a database, a server, a computer, or the like capable of performing peer data processing.

The electronic device provided by the embodiment of the present invention can implement each implementation manner in the method embodiments of fig. 1 to fig. 3 and achieves the corresponding beneficial effects; to avoid repetition, details are not described here again.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the embodiment of the processing method for peer data provided in the embodiment of the present invention, and can achieve the same technical effect, and in order to avoid repetition, details are not described here again.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above disclosure merely illustrates preferred embodiments of the present invention and is not intended to limit the scope of its claims.

Claims

1. A method for processing peer data, characterized by comprising the following steps:

2. The method of claim 1, wherein the step of obtaining archival data to be processed comprises:

3. The method of claim 2, wherein the extracting the to-be-processed archive data through a preset sliding window to obtain a peer data set corresponding to each to-be-processed archive data comprises:

4. The method of claim 2, wherein the preset sliding window includes a preset extraction window and a preset sliding step, and the extracting the to-be-processed archive data through the preset sliding window to obtain the peer data set corresponding to each to-be-processed archive data includes:

5. The method of claim 4, wherein said deduplicating the peer dataset based on the archive ID comprises:

6. The method of claim 5, wherein the archive IDs have an ordering relationship, and wherein the deduplication of the resulting deduplication data sets in the same cache comprises:

7. The method of claim 6, wherein the target peer data comprises a target peer count, the method further comprising:

8. An apparatus for processing peer data, comprising:

9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the method for processing peer data according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, realizes the steps in the method for processing peer data according to any one of claims 1 to 7.