CN112416912A

CN112416912A - Method, device, terminal equipment and medium for removing duplicate of longitudinal federal data statistics

Info

Publication number: CN112416912A
Application number: CN202011099097.8A
Authority: CN
Inventors: 谭明超; 马国强; 范涛; 陈天健; 杨强
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-10-14
Filing date: 2020-10-14
Publication date: 2021-02-26
Anticipated expiration: 2040-10-14
Also published as: CN112416912B

Abstract

The invention discloses a deduplication method, a deduplication device, terminal equipment and a storage medium for longitudinal federated data statistics. Any participant in a longitudinal federation receives a result matrix sent by the other participants, where the result matrix is obtained by multiplying a preset random matrix by a first feature matrix that the other participants construct locally from the first data to be deduplicated in their own data. The participant then constructs a second feature matrix locally from the second data to be deduplicated in its own data and longitudinally splices the second feature matrix with the result matrix to obtain a spliced matrix. Finally, it detects the target rows of the spliced matrix whose elements at corresponding positions are identical, and deduplicates the first data to be deduplicated and the second data to be deduplicated that those target rows point to. Because no separate encryption operation has to be performed on the data to be deduplicated, deduplication can be carried out while data privacy and security are preserved, which improves deduplication efficiency.

Description

Method, device, terminal equipment and medium for removing duplicate of longitudinal federal data statistics

Technical Field

The invention relates to the technical field of federal data duplicate removal, in particular to a duplicate removal method and device for longitudinal federal data statistics, terminal equipment and a storage medium.

Background

Technology has entered the era of data informatization, and statistical applications of data are increasingly widespread. In data statistics, deduplicating repeated data is a very common operation. For example, a user selects a specific statistical feature for the data it holds locally; when several identical pieces of data are detected under that statistical feature, the user deduplicates them so that only one piece of data is kept under the feature.

However, once several users that each hold their own data jointly form a longitudinal federation scenario, privacy and security considerations usually require that the data to be judged for deduplication be encrypted before it is transmitted among the users for joint deduplication. When the amount of data to be deduplicated is large, preprocessing, encrypting and otherwise handling the data consumes a great deal of time, and the overall efficiency of deduplication is difficult to guarantee.

Disclosure of Invention

The invention mainly aims to provide a deduplication method and device for longitudinal federated data statistics, terminal equipment and a storage medium, and aims to solve the technical problem that in the prior art, the overall efficiency of deduplication processing performed on statistical data in a longitudinal federated scene is difficult to guarantee.

In order to achieve the above object, the present invention provides a deduplication method for longitudinal federated data statistics, which is applied to any one of the participants in a longitudinal federation and includes:

receiving a result matrix sent by other participants in a longitudinal federation, wherein the result matrix is obtained by multiplying a preset random matrix by a first feature matrix after the other participants construct the first feature matrix locally based on first to-be-deduplicated data in self-owned data;

constructing a second feature matrix locally based on second data to be deduplicated in the own data, and longitudinally splicing the second feature matrix and the result matrix to obtain a spliced matrix;

and detecting each target row with the same corresponding position element in the splicing matrix, and performing deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed by each target row.

Further, the step of locally constructing a second feature matrix based on second data to be deduplicated in the owned data includes:

extracting second data to be deduplicated from local owned data, wherein the number of the second data to be deduplicated is more than one;

and sequentially corresponding each feature data in each piece of second data to be deduplicated to each element of a row in the matrix to construct a second feature matrix.

Further, the step of extracting the second data to be deduplicated from the local owned data includes:

receiving a preset data deduplication request, and analyzing the preset data deduplication request to obtain statistical characteristics of data to be deduplicated, wherein the number of the statistical characteristics of the data to be deduplicated is greater than or equal to one;

locally detecting characteristic data belonging to the statistical characteristics of the data to be deduplicated under each data label in the own data;

and taking the feature data under the same data label as a piece of second data to be deduplicated to extract each piece of second data to be deduplicated.

Further, the step of detecting each target row with the same corresponding position element in the spliced matrix includes:

sequentially taking each row of the splicing matrix as a reference row, and traversing each element in each to-be-detected row below the reference row in the splicing matrix;

and if all the elements in the current row to be detected are the same as the elements at the corresponding positions in the reference row, marking the row to be detected and the reference row as target rows.

Further, the step of performing deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed to by each target row includes:

determining, among the pieces of second data to be deduplicated, the target second deduplication data pointed to by each target row, and determining a unique piece of reserved data among the target second deduplication data;

removing the target second deduplication data other than the reserved data;

determining, among the target rows, the reserved target row corresponding to the reserved data, and generating a row identifier for each target row other than the reserved target row;

and sending the row identifier to the other participants, so that the other participants determine, among the pieces of first data to be deduplicated, the target first deduplication data pointed to by the row identifier and clear the target first deduplication data.
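
The deduplication step above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: it assumes that the row index in the spliced matrix doubles as the row identifier, and that the first row of each group of identical rows is the one reserved (the text only requires that exactly one be kept). The names `plan_deduplication` and `clear_rows`, and all numeric values, are hypothetical.

```python
def plan_deduplication(spliced_matrix):
    """Return (reserved_rows, row_identifiers_to_clear).

    Assumption: a row identifier is simply the 0-based row index, and the
    first row of every group of identical rows is the reserved one.
    """
    groups = {}
    for index, row in enumerate(spliced_matrix):
        groups.setdefault(tuple(row), []).append(index)

    reserved, to_clear = [], []
    for indices in groups.values():
        if len(indices) > 1:          # only groups of identical rows are target rows
            reserved.append(indices[0])
            to_clear.extend(indices[1:])
    return reserved, sorted(to_clear)


def clear_rows(pieces, row_identifiers):
    """Drop the pieces of data whose row index appears in row_identifiers."""
    drop = set(row_identifiers)
    return [piece for index, piece in enumerate(pieces) if index not in drop]


# Made-up spliced matrix: rows 0 and 2 are identical in every position.
spliced = [
    [30, 5000, 0, 71, 92, 13],
    [45, 8000, 1, 88, 40, 27],
    [30, 5000, 0, 71, 92, 13],
]
reserved_rows, row_ids = plan_deduplication(spliced)

# The current participant clears its own pieces locally ...
second_pieces = ["piece-0", "piece-1", "piece-2"]
kept_locally = clear_rows(second_pieces, row_ids)
# ... and would then send row_ids to the other participants, which apply
# clear_rows to their own first data to be deduplicated.
```

Sending only row identifiers means that no feature values leave the participant during this step.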

In addition, in order to achieve the above object, the present invention further provides a deduplication method for longitudinal federated data statistics, which is applied to a coordinator in a longitudinal federation and includes:

receiving a result matrix sent by each participant in a longitudinal federation, wherein the result matrix is obtained by multiplying a preset random matrix by a feature matrix after each participant locally constructs the feature matrix based on data to be deduplicated in own data;

longitudinally splicing the result matrixes to obtain a spliced matrix, and detecting target rows with the same corresponding position elements in the spliced matrix;

and controlling each participant to perform deduplication processing on the data to be deduplicated according to each target row.

Further, the step of controlling each participant to perform deduplication processing on the data to be deduplicated according to each target row includes:

determining a unique reserved target row among the target rows, and generating a row identifier for each target row other than the reserved target row;

and sending the row identifier to each participant, so that each participant locally determines, among its data to be deduplicated, the data pointed to by the row identifier and clears that data.
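
The coordinator variant can be sketched in the same spirit. The sketch below is only an illustration under assumptions that the text does not fix: every participant's result matrix has one row per data label in the same order, splicing appends the columns of each result matrix side by side, the 0-based row index serves as the row identifier, and the first occurrence of a row is the reserved one. The function names and values are hypothetical.

```python
def splice_all(result_matrices):
    """Splice the participants' result matrices row by row (columns side by side)."""
    row_count = len(result_matrices[0])
    if any(len(matrix) != row_count for matrix in result_matrices):
        raise ValueError("all result matrices must have the same number of rows")
    return [
        [value for matrix in result_matrices for value in matrix[i]]
        for i in range(row_count)
    ]


def coordinate(result_matrices):
    """Return the row identifiers every participant should clear."""
    first_seen = set()
    row_identifiers = []
    for index, row in enumerate(splice_all(result_matrices)):
        key = tuple(row)
        if key in first_seen:
            row_identifiers.append(index)   # duplicate of a reserved target row
        else:
            first_seen.add(key)             # first occurrence is reserved
    return row_identifiers


def participant_clear(pieces, row_identifiers):
    """Run locally by each participant after receiving the row identifiers."""
    drop = set(row_identifiers)
    return [piece for index, piece in enumerate(pieces) if index not in drop]


# Made-up result matrices from two participants; rows 0 and 2 coincide in both.
from_a = [[7, 1], [3, 9], [7, 1]]
from_b = [[4, 4], [5, 6], [4, 4]]
ids_to_clear = coordinate([from_a, from_b])
remaining_a = participant_clear(["a0", "a1", "a2"], ids_to_clear)
remaining_b = participant_clear(["b0", "b1", "b2"], ids_to_clear)
```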

In addition, in order to achieve the above object, the present invention further provides a deduplication device for longitudinal federated data statistics, which is applied to any one of the participants in a longitudinal federation and includes:

the first receiving module is used for receiving a result matrix sent by other participants in a longitudinal federation, wherein the result matrix is obtained by multiplying a preset random matrix by a first feature matrix after the other participants construct the first feature matrix locally based on first to-be-deduplicated data in self-owned data;

the first matrix splicing module is used for constructing a second feature matrix locally based on second data to be deduplicated in the owned data, and longitudinally splicing the second feature matrix and the result matrix to obtain a spliced matrix;

the first deduplication module is used for detecting each target row with the same corresponding position element in the spliced matrix and performing deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed to by each target row;

wherein the deduplication device for longitudinal federated data statistics is further applied to a coordinator in the longitudinal federation, and further includes:

the second receiving module is used for receiving a result matrix sent by each participant in the longitudinal federation, wherein the result matrix is obtained by multiplying a preset random matrix by a feature matrix after each participant locally constructs the feature matrix based on data to be deduplicated in own data;

the second matrix splicing module is used for longitudinally splicing the result matrixes to obtain a spliced matrix and detecting each target row with the same corresponding position element in the spliced matrix;

and the second deduplication module is used for controlling each participant to perform deduplication processing on the data to be deduplicated according to each target row.

In addition, to achieve the above object, the present invention also provides a terminal device, including a memory, a processor, and a deduplication program for longitudinal federated data statistics that is stored in the memory and can run on the processor, wherein the program, when executed by the processor, implements the steps of the deduplication method for longitudinal federated data statistics described above.

In addition, to achieve the above object, the present invention further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the deduplication method of longitudinal federated data statistics as described above.

According to the duplication elimination method, the duplication elimination device, the terminal equipment and the storage medium for the longitudinal federal data statistics, a result matrix sent by other participants in the longitudinal federal is received by any participant in the longitudinal federal, wherein the result matrix is obtained by multiplying a preset random matrix by a first feature matrix after the other participants construct the first feature matrix locally based on first to-be-duplicated data in self-owned data; constructing a second feature matrix locally based on second data to be deduplicated in the own data, and longitudinally splicing the second feature matrix and the result matrix to obtain a spliced matrix; and detecting each target row with the same corresponding position element in the splicing matrix, and performing deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed by each target row.

In a longitudinal federation scenario, each participant constructs a feature matrix from its local data to be deduplicated, so that each piece of data to be deduplicated corresponds one-to-one to a row of the matrix. This meets the requirement of data consistency and makes it easy for the participants to jointly judge and deduplicate identical data. In addition, the participants exchange only the result matrix obtained by multiplying the random matrix by the feature matrix, so the data is encrypted directly by multiplication with random numbers in matrix form, and no other complex encryption operation needs to be performed on the data to be deduplicated. Deduplication can therefore be carried out while data privacy and security are preserved, which greatly simplifies the deduplication process, improves the overall efficiency of data deduplication, and scales to larger data volumes.
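
Why identical rows remain detectable after the multiplication can be seen from a short derivation. It rests on an assumption that the text does not state explicitly: the feature matrix $F$ (one row $f_i$ per piece of data) is multiplied on the right by the $n \times n$ random matrix $R$, giving the result matrix $M$ with rows $m_i$.

```latex
\begin{align*}
M &= F R, \qquad m_i = f_i R \\
f_i = f_j \;&\Longrightarrow\; m_i = f_i R = f_j R = m_j \\
m_i = m_j \;&\Longrightarrow\; (f_i - f_j) R = 0 \;\Longrightarrow\; f_i = f_j
  \quad \text{(if $R$ is invertible)}
\end{align*}
```

Under this assumption, identical pieces of data always produce identical rows of the result matrix, and if the random matrix is additionally invertible, identical rows of the result matrix can only come from identical pieces of data.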

Drawings

FIG. 1 is a schematic structural diagram of the hardware operating environment of a terminal device according to an embodiment of the present invention;

FIG. 2 is a flow chart of an embodiment of a deduplication method for longitudinal federated data statistics of the present invention;

FIG. 3 is a diagram illustrating an application scenario involved in an embodiment of a deduplication method for longitudinal federated data statistics of the present invention;

FIG. 4 is a schematic flow chart of another embodiment of the deduplication method for longitudinal federated data statistics of the present invention;

FIG. 5 is a schematic block structure diagram of a deduplication device for longitudinal federal data statistics according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic structural diagram of a hardware operating environment related to a terminal device according to an embodiment of the present invention.

The terminal device of the embodiment of the invention may be a PC, a portable computer, or similar terminal equipment.

As shown in fig. 1, the terminal device may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the terminal device configuration shown in fig. 1 is not intended to be limiting of the terminal device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a deduplication program for longitudinal federated data statistics. The operating system is a program that manages and controls the hardware and software resources of the terminal device and supports the running of the deduplication program and of other software or programs.

In the terminal device shown in fig. 1, the user interface 1003 is mainly used for data communication with each terminal; the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; and the processor 1001 may be configured to invoke the deduplication program for longitudinal federated data statistics stored in the memory 1005 and perform the steps of the deduplication method described in the embodiments below.

Based on the structure, the invention provides various embodiments of the deduplication method for longitudinal federal data statistics.

Referring to fig. 2, fig. 2 is a schematic flow chart of a deduplication method for longitudinal federated data statistics according to a first embodiment of the present invention.

While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in a different order than that shown.

The duplication elimination method for the longitudinal federation data statistics in the embodiment of the present invention is applied to the terminal device serving as one of the participating parties in the longitudinal federation, and the terminal device in the embodiment of the present invention may be a terminal device such as a PC, a portable computer, or the like, and is not specifically limited herein.

The duplicate removal method for longitudinal federal data statistics in the embodiment comprises the following steps:

step S100, receiving a result matrix sent by other participants in a longitudinal federation, wherein the result matrix is obtained by multiplying a preset random matrix by a first feature matrix after the other participants construct the first feature matrix locally based on first to-be-deduplicated data in own data;

It should be noted that, in this embodiment, each participant builds a data statistics table in which the different feature dimensions of the users serve as candidate columns and each user serves as a row-wise data label. The participant fills each user's feature data for each feature dimension into the corresponding position, so that each row forms one piece of data; the resulting table is the own data held at the local terminal of that participant.

In a vertical federation formed by a plurality of participants, each other participant except the current participant selects first data to be deduplicated from own data based on a received preset data deduplication request, a first feature matrix is constructed and obtained based on the first data to be deduplicated, a result matrix is obtained by multiplying the first feature matrix by a preset random matrix, and then the current participant receives the result matrix transmitted by each other participant.

It should be noted that, in this embodiment, the preset data deduplication request may be triggered automatically by the terminal device serving as the other participant, or manually by a user through that terminal device. When the request is triggered, two items are configured with it: the data labels used to extract, from the local own data, the data whose deduplication is to be judged, and the feature dimensions of that data, which serve as the statistical features of the data to be deduplicated. This makes it convenient to extract the data to be deduplicated and to construct the feature matrix.

In addition, the preset random matrix is generated as follows: with n denoting the number of feature dimensions configured in the preset data deduplication request, an n × n matrix is generated at random.

Specifically, for example, in a longitudinal federation built jointly by participant A and participant B, participant B locally uses the feature dimensions "age", "income", "transaction times" and "overdue" as candidate columns and compiles its own data with the device numbers U1, U2 and U3 as row-wise data labels. If participant B then receives a preset data deduplication request that configures U1, U2 and U3 as the data labels of the data to be deduplicated and "age", "income" and "transaction times" as the statistical features of the data to be deduplicated, participant B locally extracts the 9 pieces of feature data that belong to the feature dimensions "age", "income" and "transaction times" in the three rows U1, U2 and U3 as the first data to be deduplicated, and constructs a 3 × 3 first feature matrix from them. That is, participant B takes the three pieces of feature data of row U1 as the three elements of the first row of the first feature matrix, those of row U2 as the three elements of the second row, and those of row U3 as the three elements of the third row (rows counted from top to bottom).

In addition, participant B locally generates a 3 × 3 random matrix, because the received preset data deduplication request configures 3 statistical features of the data to be deduplicated ("age", "income" and "transaction times"). After constructing the 3 × 3 first feature matrix, participant B multiplies the generated random matrix by the first feature matrix using a mature matrix multiplication algorithm to obtain a 3 × 3 result matrix, and transmits the result matrix to participant A.
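
Participant B's side of this example can be sketched as follows. This is a minimal sketch under assumptions, not the implementation of the invention: the feature values are made up, the random matrix holds small positive integers so that equality checks stay exact, and the feature matrix is multiplied on the right by the random matrix (the text does not fix the order of multiplication). The helpers `random_matrix` and `matmul` are hypothetical.

```python
import random


def random_matrix(n, seed=None):
    """Generate an n x n matrix of random positive integers (assumed value range)."""
    rng = random.Random(seed)
    return [[rng.randint(1, 9) for _ in range(n)] for _ in range(n)]


def matmul(a, b):
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Made-up values under "age", "income", "transaction times".
first_feature_matrix = [
    [30, 5000, 12],  # U1
    [45, 8000, 7],   # U2
    [30, 5000, 12],  # U3, identical to U1 in this illustration
]

n = len(first_feature_matrix[0])          # number of configured statistical features
preset_random_matrix = random_matrix(n, seed=0)

# Assumption: right-multiplication, so each row is masked independently.
result_matrix = matmul(first_feature_matrix, preset_random_matrix)
```

In this sketch the rows for U1 and U3 stay equal to each other after masking, while the raw feature values are not sent to participant A.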

Step S200, constructing a second feature matrix locally based on second data to be deduplicated in the own data, and longitudinally splicing the second feature matrix and the result matrix to obtain a spliced matrix;

After the current participant receives the result matrices transmitted by the other participants, it immediately extracts the second data to be deduplicated from its own data locally, constructs a second feature matrix from that data, and then locally and longitudinally splices the second feature matrix with the received result matrices to obtain a spliced matrix.

It should be noted that, in this embodiment, every participant in the longitudinal federation follows the same process to extract its data to be deduplicated locally and to construct its feature matrix from that data. In addition, the current participant A splices the second feature matrix and the received result matrix locally using an existing, mature matrix splicing operation. Specifically, for example, if the second feature matrix that participant A constructs from the second data to be deduplicated in its own data is a 3 × 3 matrix, and the result matrix that participant A receives from participant B is also a 3 × 3 matrix, then splicing the two yields a 3 × 6 spliced matrix. The three columns of the result matrix are placed directly after the third column of the second feature matrix, or the three columns of the second feature matrix are placed directly after the third column of the result matrix.
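
The splicing described here, which appends the columns of one 3 × 3 matrix after the columns of the other to give a 3 × 6 matrix, can be sketched as below. The function name `splice` and the values are hypothetical, and the sketch assumes both matrices list their rows in the same order.

```python
def splice(second_feature_matrix, result_matrix):
    """Place the columns of result_matrix after those of second_feature_matrix."""
    if len(second_feature_matrix) != len(result_matrix):
        raise ValueError("both matrices must have the same number of rows")
    return [
        list(own_row) + list(received_row)
        for own_row, received_row in zip(second_feature_matrix, result_matrix)
    ]


# Made-up 3 x 3 inputs.
second_feature_matrix = [[30, 5000, 0], [45, 8000, 1], [30, 5000, 0]]
received_result_matrix = [[71, 92, 13], [88, 40, 27], [71, 92, 13]]

spliced_matrix = splice(second_feature_matrix, received_result_matrix)
```

Swapping the two arguments gives the alternative column order mentioned in the text; either order works as long as every row is built the same way.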

Further, in a possible embodiment, the step S200 may include:

Step S201, extracting second data to be deduplicated from local own data, wherein the number of the second data to be deduplicated is more than one;

It should be noted that, in this embodiment, the purpose is to determine whether the pieces of data to be deduplicated identified by the data labels in the own data are identical. The number of pieces of data to be deduplicated that each participant extracts locally from its own data is therefore greater than one: at least two pieces identified by different data labels must be extracted before it is possible to determine whether they are identical and whether deduplication is needed.

Step S202, sequentially corresponding each feature data in each piece of the second data to be deduplicated to each element in a row in the matrix, so as to construct a second feature matrix.

Based on the preset data deduplication request received at the local terminal, the current participant extracts, from the own data compiled at the local terminal, the several pieces of second data to be deduplicated whose deduplication is to be judged. The current participant then takes the feature data in each piece of second data to be deduplicated, in order, as the elements of one row of a matrix, and thereby constructs a second feature matrix in which each row corresponds to one piece of second data to be deduplicated.

Specifically, for example, the current participant A also locally uses the feature dimensions "age", "income", "transaction times" and "overdue" as candidate columns and compiles its own data with the device numbers U4, U5 and U6 as row-wise data labels. If participant A then receives a preset data deduplication request that configures U4, U5 and U6 as the data labels of the data to be deduplicated and "age", "income" and "overdue" as the statistical features of the data to be deduplicated, participant A locally extracts the 9 pieces of feature data that belong to the feature dimensions "age", "income" and "overdue" in the three rows U4, U5 and U6 as the second data to be deduplicated, and constructs a 3 × 3 second feature matrix from them. That is, the three pieces of feature data of row U4 become the three elements of the first row of the second feature matrix, those of row U5 become the three elements of the second row, and those of row U6 become the three elements of the third row (rows counted from top to bottom).
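
Turning the extracted pieces of data into a feature matrix, one row per piece, might look like the following sketch. It also records which data label each row came from, since later steps refer back to the data through row positions; that mapping, the function name `build_feature_matrix`, and the values are assumptions made for illustration.

```python
def build_feature_matrix(pieces):
    """Build a feature matrix from (data_label, feature_values) pairs.

    Returns the matrix and a row-index -> data-label mapping.
    """
    if len(pieces) < 2:
        raise ValueError("more than one piece of data is needed to compare rows")
    width = len(pieces[0][1])
    matrix, row_to_label = [], {}
    for row_index, (label, values) in enumerate(pieces):
        if len(values) != width:
            raise ValueError("every piece must cover the same statistical features")
        matrix.append(list(values))      # feature data in order becomes one row
        row_to_label[row_index] = label
    return matrix, row_to_label


# Made-up values under "age", "income", "overdue".
pieces = [
    ("U4", [30, 5000, 0]),
    ("U5", [45, 8000, 1]),
    ("U6", [30, 5000, 0]),
]
second_feature_matrix, row_to_label = build_feature_matrix(pieces)
```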

Further, in a possible embodiment, in the step S201, the step of "extracting the second data to be deduplicated from the local owned data" may include:

Step S2011, receiving a preset data deduplication request, and analyzing the preset data deduplication request to obtain statistical features of data to be deduplicated, wherein the number of the statistical features of the data to be deduplicated is greater than or equal to one;

After receiving the result matrix transmitted by the other participants, the current participant can output a prompt message at the local terminal, then receive a preset data deduplication request triggered based on the prompt message, and further analyze the preset data deduplication request to extract one or more statistical features of the data to be deduplicated from it.

It should be noted that, in this embodiment, the statistical features of the data to be deduplicated are the one or more feature dimensions configured synchronously when the preset data deduplication request is triggered, either automatically by the terminal device serving as the current participant or manually by a user of that terminal device. In addition, the prompt message may specifically remind the user of the terminal device serving as the current participant that other participants are currently initiating a request to determine whether their own data needs deduplication, so that the user can confirm whether to respond to the request and select the data to be deduplicated in the local own data for the deduplication judgment and processing.

Specifically, for example, referring to the application scenario shown in fig. 3, in a longitudinal federation jointly constructed by participant A and participant B, after the current participant A receives the result matrix delivered by participant B, the current participant A outputs a prompt message at the local terminal. If the current participant A then receives a preset data deduplication request triggered by confirmation feedback entered by a user or generated automatically by the local terminal, it parses the request using a mature instruction parsing technique and extracts from it three statistical features of the data to be deduplicated, namely "age", "income" and "overdue".

Step S2012, locally detecting feature data belonging to the statistical features of the data to be deduplicated under each data label in the own data;

Step S2013, using the feature data under the same data label as a piece of second data to be deduplicated to extract each piece of second data to be deduplicated.

After the current participant analyzes the preset data deduplication request to extract statistical features of the data to be deduplicated, the current participant immediately detects feature data belonging to the statistical features of the data to be deduplicated under each data label in own data which is already counted by the local terminal, and then the current participant takes each feature data belonging to the same data label in the feature data as a piece of second data to be deduplicated, thereby obtaining each piece of second data to be deduplicated classified according to each data label.

Specifically, for example, the current participant A locally uses the same statistical manner as participant B: the feature dimensions "age", "income", "transaction times" and "overdue" serve as candidate columns, and the own data is compiled with the device numbers U4, U5 and U6 as row-wise data labels. After the current participant A parses the three statistical features "age", "income" and "overdue" of the data to be deduplicated from the received preset data deduplication request, it detects, in the rows of feature data identified by the data labels U4, U5 and U6 in the own data, the 9 pieces of feature data that belong to these three statistical features. Finally, the current participant A takes the 3 pieces of feature data in the row identified by the data label U4 as one piece of second data to be deduplicated, the 3 pieces in the row identified by U5 as another, and the 3 pieces in the row identified by U6 as a third, thereby obtaining three pieces of second data to be deduplicated from the own data of the local terminal.
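
Steps S2011 to S2013 can be illustrated with a small sketch. The request layout (a dictionary with `data_labels` and `statistical_features` keys), the function name `extract_pieces`, and all values are assumptions made for illustration; the text does not specify how the preset data deduplication request is encoded.

```python
# Own data of the current participant: data label -> feature dimension -> value.
own_data = {
    "U4": {"age": 30, "income": 5000, "transaction times": 12, "overdue": 0},
    "U5": {"age": 45, "income": 8000, "transaction times": 7, "overdue": 1},
    "U6": {"age": 30, "income": 5000, "transaction times": 3, "overdue": 0},
}

# Hypothetical encoding of a preset data deduplication request.
request = {
    "data_labels": ["U4", "U5", "U6"],
    "statistical_features": ["age", "income", "overdue"],
}


def extract_pieces(own_data, request):
    """Return one (data_label, feature_values) pair per configured data label."""
    features = request["statistical_features"]        # S2011: parse the request
    if len(features) < 1:
        raise ValueError("at least one statistical feature is required")
    pieces = []
    for label in request["data_labels"]:
        row = own_data[label]
        values = [row[feature] for feature in features]  # S2012: detect feature data
        pieces.append((label, values))                   # S2013: one piece per label
    return pieces


second_pieces = extract_pieces(own_data, request)
```

Here "transaction times" is ignored because the request does not configure it, so U4 and U6 yield identical pieces even though their full rows differ.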

Step S300, detecting the target rows whose elements at corresponding positions are the same in the spliced matrix, and performing deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed to by each target row.

After locally and longitudinally splicing the second feature matrix with the received result matrix to obtain the spliced matrix, the current participant traverses the elements of each row of the spliced matrix to detect the target rows whose elements at corresponding positions are the same. The current participant then locally performs deduplication processing on the second data to be deduplicated pointed to by the target rows, and generates a deduplication instruction according to the target rows to control the other participants to perform deduplication processing on the first data to be deduplicated pointed to by the target rows.

Further, in a possible embodiment, the step of "detecting each target row with the same corresponding position element in the spliced matrix" in step S300 may include:

Step S301, sequentially taking each row of the spliced matrix as a reference row, and traversing each element in each row to be checked below the reference row in the spliced matrix;

Step S302, if each element in the current row to be checked is the same as the element at the corresponding position in the reference row, marking the row to be checked and the reference row as target rows.

Specifically, for example, the current participant A receives the 3 × 3-dimensional result matrix transmitted by participant B and longitudinally splices it with the second feature matrix, which is also 3 × 3-dimensional and is constructed locally based on the second data to be deduplicated, to obtain a 3 × 6-dimensional spliced matrix. The current participant A first takes the first row of the 3 × 6-dimensional spliced matrix as the reference row and traverses the 6 elements of the second row to detect whether they are identical, in order, to the 6 elements of the reference row (that is, whether the 1st element of the second row is identical to the 1st element of the reference row, and so on up to the 6th element of the second row). The current participant A then traverses the 6 elements of the third row in the same way. Finally, the current participant A takes the second row of the spliced matrix as the reference row and traverses the 6 elements of the third row to detect whether they are identical, in order, to the 6 elements of the reference row.

Thus, if the current participant A detects that only the 6 elements of the second row are identical, in order, to the 6 elements of the reference row, the current participant A marks both the first row, which is currently serving as the reference row, and the second row of the spliced matrix as target rows. Similarly, if the current participant A takes the second row as the reference row and detects that the 6 elements of the third row are identical, in order, to the 6 elements of the reference row, the current participant A marks both the second row, which is currently serving as the reference row, and the third row of the spliced matrix as target rows.

Further, in another possible embodiment, if the current participant A takes the first row as the reference row and detects that the 6 elements of the second row and the 6 elements of the third row are each identical, in order, to the 6 elements of the reference row, the current participant A does not need to perform the subsequent detection with the second row as the reference row, and directly marks the first row, which is currently serving as the reference row, the second row, and the third row of the spliced matrix as target rows.
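The reference-row traversal of steps S301 and S302, including the shortcut of skipping a row that has already matched an earlier reference row, can be sketched as below. The function name `find_target_groups`, the 0-based row indices, and the grouping of target rows into lists are assumptions made for illustration; the disclosure itself only speaks of marking rows as target rows.

```python
def find_target_groups(matrix):
    """Return groups of 0-based row indices whose rows are element-wise identical."""
    groups = []
    seen = set()
    for ref in range(len(matrix)):
        if ref in seen:
            # This row already matched an earlier reference row, so every row
            # equal to it has been found and it need not serve as a reference.
            continue
        group = [ref]
        for cand in range(ref + 1, len(matrix)):
            # Compare elements at corresponding positions, first to last.
            if all(a == b for a, b in zip(matrix[ref], matrix[cand])):
                group.append(cand)
        if len(group) > 1:
            groups.append(group)
            seen.update(group)
    return groups


# A made-up 3 x 6 spliced matrix whose first two rows are identical.
spliced = [
    [1, 2, 3, 7, 8, 9],
    [1, 2, 3, 7, 8, 9],
    [4, 5, 6, 1, 1, 1],
]
print(find_target_groups(spliced))  # -> [[0, 1]]
```

Returning groups rather than a flat set of target rows keeps unrelated duplicates apart, which matters later when one row per group is reserved.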

Further, in a possible embodiment, the step of "performing deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed to by each target row" in step S300 may include:

Step S303, determining the target second deduplication data pointed to by each target row among the pieces of second data to be deduplicated, and determining the unique reserved data among the target second deduplication data;

The current participant determines, based on the one-to-one correspondence between the feature data in each piece of second data to be deduplicated and the elements of each row of the second feature matrix, the target second data to be deduplicated that corresponds to the elements of the second feature matrix contained in each target row of the spliced matrix, and then randomly determines one unique piece of reserved data to be kept among the target second data to be deduplicated.

Specifically, for example, after determining the first row and the second row of the 3 × 6-dimensional spliced matrix as target rows, the current participant A extracts the first 3 elements of the first row, which belong to the second feature matrix, and determines the piece of second data to be deduplicated identified by the data tag U4 as target second data to be deduplicated, according to the correspondence between these 3 elements and the 3 feature data in the row identified by U4. The current participant A similarly determines the piece of second data to be deduplicated identified by the data tag U5 as target second data to be deduplicated. The current participant A then randomly determines, from the two pieces of target second data to be deduplicated, the piece identified by the data tag U4 as the reserved data to be kept.

Step S304, clearing the target second deduplication data other than the reserved data among the target second deduplication data;

After determining the unique piece of reserved data to be kept among the target second data to be deduplicated, the current participant clears all target second data to be deduplicated other than the reserved data, so that no two pieces of second data to be deduplicated with the same feature data exist at the same time.

Specifically, for example, after the current participant A determines, from the two pieces of target second data to be deduplicated identified by U4 and U5, that the piece identified by the data tag U4 is the reserved data to be kept, the current participant A clears the piece of target second data to be deduplicated identified by U5.
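Steps S303 and S304 amount to keeping one randomly chosen record among the duplicates and clearing the rest. The following is a minimal sketch under assumed names (`dedupe_local`, `records`) and made-up values; the seeded generator is used only so that the illustration is repeatable.

```python
import random


def dedupe_local(records, target_tags, rng=random):
    """Keep one randomly chosen record among target_tags and clear the others in place."""
    reserved = rng.choice(target_tags)
    for tag in target_tags:
        if tag != reserved:
            del records[tag]
    return reserved


# Hypothetical second data to be deduplicated; U4 and U5 hold the same feature data.
records = {"U4": [30, 5000, 0], "U5": [30, 5000, 0], "U6": [45, 8000, 1]}
reserved = dedupe_local(records, ["U4", "U5"], random.Random(0))
print("reserved:", reserved, "remaining:", sorted(records))
```

Whichever of U4 and U5 is chosen, exactly one of the two survives and U6 is untouched.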

Step S305, determining the reserved target row corresponding to the reserved data among the target rows, and generating a row identifier for each target row other than the reserved target row;

Step S306, sending the row identifier to the other participants, so that the other participants determine the target first deduplication data pointed to by the row identifier among the pieces of first data to be deduplicated, and clear the target first deduplication data.

After determining the target rows in the spliced matrix, the current participant randomly determines one unique reserved target row among them, generates a corresponding row identifier for each target row other than the reserved target row, and transmits the row identifiers to the other participants. After receiving a row identifier, each of the other participants locally determines the target row pointed to by the row identifier in the first feature matrix, and then determines, based on the one-to-one correspondence between the feature data in each piece of first data to be deduplicated and the row elements of the first feature matrix, the target first data to be deduplicated to which that target row points. The other participant then clears the target first data to be deduplicated, so that no two pieces of first data to be deduplicated with the same feature data exist at the same time.

Specifically, for example, after determining the first row and the second row of the 3 × 6-dimensional spliced matrix as target rows, the current participant A determines the first row as the reserved target row, generates the row identifier "row 2" for the second row, and transmits the row identifier "row 2" to participant B. After receiving the row identifier "row 2", participant B locally determines the second row of the 3 × 3-dimensional first feature matrix, constructed based on the first data to be deduplicated, as the target row, determines the piece of first data to be deduplicated identified by the data tag U2 as the target first data to be deduplicated according to the correspondence between the 3 elements of the second row and the 3 feature data in the row identified by U2, and then clears the piece of target first data to be deduplicated identified by U2.
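The exchange in steps S305 and S306 can be sketched as two small functions, one for the sender and one for the receiver. The "row N" string format follows the example in the text; how it is parsed, the function names, and the sample values are assumptions made for illustration.

```python
import random


def make_row_identifiers(target_rows, rng=random):
    """Reserve one target row at random; return it and identifiers for the others.

    Rows are numbered from 1, matching the "row 2" identifier in the text.
    """
    reserved = rng.choice(target_rows)
    identifiers = [f"row {r}" for r in target_rows if r != reserved]
    return reserved, identifiers


def apply_row_identifiers(tags_by_row, records, identifiers):
    """Receiving participant: map each identifier to a data tag and clear that record."""
    cleared = []
    for ident in identifiers:
        row = int(ident.split()[1])   # "row 2" -> 2
        tag = tags_by_row[row - 1]    # row 2 of the feature matrix -> "U2"
        records.pop(tag, None)
        cleared.append(tag)
    return cleared


# Participant B's side, using the identifier from the example above.
tags_b = ["U1", "U2", "U3"]
records_b = {"U1": [30, 5000, 0], "U2": [30, 5000, 0], "U3": [45, 8000, 1]}
print(apply_row_identifiers(tags_b, records_b, ["row 2"]))  # -> ['U2']
```

Only the row position crosses the boundary between participants; the receiver resolves it to its own data tag locally.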

In this embodiment, in a longitudinal federation scenario composed of a plurality of participants each holding own data, the current participant receives the result matrix transmitted by any one of the other participants (the result matrix is obtained by that participant locally constructing a first feature matrix based on the first data to be deduplicated in its own data and multiplying a preset random matrix by the first feature matrix). The current participant then locally constructs a second feature matrix based on the second data to be deduplicated in its own data and splices the second feature matrix with the received result matrix to obtain a spliced matrix. Finally, the current participant detects whether the elements at corresponding positions of the rows in the spliced matrix are the same so as to find the target rows, locally performs deduplication processing on the second data to be deduplicated according to the target rows, and controls the other participants to perform deduplication processing on the first data to be deduplicated.

Further, referring to fig. 5, fig. 5 is a flowchart illustrating a second embodiment of the deduplication method for longitudinal federated data statistics according to the present invention.

The second embodiment of the deduplication method for longitudinal federated data statistics of the present invention is applied to the coordinator in the longitudinal federation. It should be noted that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that shown here.

The deduplication method for longitudinal federated data statistics in this embodiment of the invention is applied to a terminal device serving as the coordinator in the longitudinal federation. The terminal device may be a PC (personal computer), a portable computer, or the like, and is not particularly limited herein.

Step S400, receiving the result matrix sent by each participant in the longitudinal federation, wherein the result matrix is obtained by each participant locally constructing a feature matrix based on the data to be deduplicated in its own data and then multiplying a preset random matrix by the feature matrix;

In a longitudinal federation formed by a plurality of participants, each participant selects data to be deduplicated from its own data based on a received preset data deduplication request, constructs a feature matrix based on the data to be deduplicated, and multiplies a preset random matrix by the feature matrix to obtain a result matrix. Each participant then uploads the locally calculated result matrix to the coordinator in the longitudinal federation, and the coordinator receives the result matrix uploaded by each participant.

Specifically, for example, in a longitudinal federation built jointly by coordinator X, participant A, and participant B, participant B locally uses the feature dimensions "age", "income", "transaction times", and "overdue" as candidate columns and compiles its own data with the device numbers U1, U2, and U3 as horizontal data tags; participant A likewise locally uses the feature dimensions "age", "income", "transaction times", and "overdue" as candidate columns and compiles its own data with the device numbers U4, U5, and U6 as horizontal data tags.

Then, participant B receives a preset data deduplication request, triggered by configuring U1, U2, and U3 as the data tags of the data to be deduplicated and configuring "age", "income", and "transaction times" as the statistical features of the data to be deduplicated. Participant B then locally extracts from its own data, as the data to be deduplicated, the 9 feature data belonging to the feature dimensions "age", "income", and "transaction times" in the three rows U1, U2, and U3, and constructs a 3 × 3-dimensional feature matrix based on these 9 feature data. That is, participant B locally takes the three feature data of row U1 as the three elements of the first row (from top to bottom) of the feature matrix, the three feature data of row U2 as the three elements of the second row, and the three feature data of row U3 as the three elements of the third row.

In addition, participant B locally generates a 3 × 3-dimensional random matrix according to the number, 3, of statistical features of the data to be deduplicated ("age", "income", and "transaction times") configured in the received preset data deduplication request. After constructing the 3 × 3-dimensional feature matrix, participant B multiplies the generated random matrix by the feature matrix using a mature matrix multiplication algorithm to obtain a 3 × 3-dimensional result matrix, and transmits the result matrix to coordinator X.
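The construction of the result matrix can be sketched as follows. Several choices here are assumptions rather than statements of the disclosed method: the product is taken as feature matrix times random matrix, so that each result row depends only on the matching feature row; the random matrix has small integer entries, so that equality checks are exact; and the helper names and sample values are invented for illustration.

```python
import random


def random_matrix(size, rng):
    """Square matrix of small random integers (an assumption; avoids float rounding)."""
    return [[rng.randint(1, 9) for _ in range(size)] for _ in range(size)]


def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical 3 x 3 feature matrix of participant B (rows U1, U2, U3);
# rows U1 and U2 carry the same feature data.
feature_matrix = [
    [30, 5000, 12],
    [30, 5000, 12],
    [45, 8000, 3],
]
rng = random.Random(42)
mask = random_matrix(3, rng)  # one column per statistical feature
result_matrix = matmul(feature_matrix, mask)

# Identical feature rows remain identical after masking.
print(result_matrix[0] == result_matrix[1])  # -> True
```

Under this ordering, duplicate records stay detectable by row comparison while the uploaded values differ from the raw feature data. For rows from different participants to be comparable, the same random matrix would presumably have to be used by all of them, which is how the term "preset" is read in this sketch.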

It should be noted that, in this embodiment, every participant in the longitudinal federation follows the same process of locally constructing a feature matrix based on the data to be deduplicated in its own data and then multiplying the random matrix by the feature matrix to obtain a result matrix. The process by which participant A locally extracts its own data, constructs the feature matrix, and obtains the result matrix is therefore not described again here.

Step S500, longitudinally splicing the result matrices to obtain a spliced matrix, and detecting the target rows whose elements at corresponding positions are the same in the spliced matrix;

After receiving the result matrices uploaded by the participants, the coordinator longitudinally splices the result matrices in pairs to obtain a spliced matrix, and then traverses the elements of each row of the spliced matrix to detect the target rows whose elements at corresponding positions are the same.

Specifically, for example, if coordinator X receives result matrix A and result matrix B, uploaded by participant A and participant B respectively and both 3 × 3-dimensional, coordinator X longitudinally splices result matrix A and result matrix B (the three columns of elements of result matrix A are arranged after the third column of elements of result matrix B, or the three columns of elements of result matrix B are arranged after the third column of elements of result matrix A) to obtain a 3 × 6-dimensional spliced matrix.
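As the parenthetical above indicates, the splice places the columns of one matrix after the columns of the other, so two 3 × 3 matrices give a 3 × 6 matrix. A minimal sketch, with the function name `splice` and the sample values assumed for illustration:

```python
def splice(left, right):
    """Place the columns of `right` after the columns of `left`, row by row."""
    if len(left) != len(right):
        raise ValueError("both matrices need the same number of rows")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


# Made-up 3 x 3 result matrices from participants A and B.
result_a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
result_b = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]

spliced = splice(result_a, result_b)
print(len(spliced), "rows x", len(spliced[0]), "columns")  # -> 3 rows x 6 columns
```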

Then, coordinator X first takes the first row of the 3 × 6-dimensional spliced matrix as the reference row and traverses the 6 elements of the second row to detect whether they are identical, in order, to the 6 elements of the reference row (that is, whether the 1st element of the second row is identical to the 1st element of the reference row, and so on up to the 6th element of the second row). Coordinator X then traverses the 6 elements of the third row of the spliced matrix to check in the same way whether they are identical, in order, to the 6 elements of the reference row. Finally, coordinator X takes the second row of the spliced matrix as the reference row and traverses the 6 elements of the third row to detect whether they are identical, in order, to the 6 elements of the reference row.

Thus, if coordinator X detects that only the 6 elements of the second row are identical, in order, to the 6 elements of the reference row, coordinator X marks both the first row, which is currently serving as the reference row, and the second row of the spliced matrix as target rows. Similarly, if coordinator X takes the second row as the reference row and detects that the 6 elements of the third row are identical, in order, to the 6 elements of the reference row, coordinator X marks both the second row, which is currently serving as the reference row, and the third row of the spliced matrix as target rows.

Step S600, controlling each participant to perform deduplication processing on the data to be deduplicated according to each target row.

After determining the target rows in the spliced matrix, the coordinator randomly determines one unique reserved target row among them, generates a corresponding row identifier for each target row other than the reserved target row, and transmits the row identifiers to the participants. After receiving a row identifier, each participant locally determines the target row pointed to by the row identifier in its feature matrix, and then determines, based on the one-to-one correspondence between the feature data in each piece of data to be deduplicated and the row elements of the feature matrix, the target data to be deduplicated to which that target row points. The participant then clears the target data to be deduplicated, so that no two pieces of data to be deduplicated with the same feature data exist at the same time.

Specifically, for example, after determining the first row and the second row of the 3 × 6-dimensional spliced matrix as target rows, coordinator X determines the first row as the reserved target row, generates the row identifier "row 2" for the second row, and transmits the row identifier "row 2" to participant A and participant B.

After receiving the row identifier "row 2", participant A locally determines the second row of the 3 × 3-dimensional feature matrix, constructed based on the data to be deduplicated, as the target row, determines the piece of data to be deduplicated identified by the data label U5 as the target data to be deduplicated according to the one-to-one correspondence between the 3 elements of the second row and the 3 feature data in the data to be deduplicated, and then clears the piece of target data to be deduplicated identified by U5.

After receiving the row identifier "row 2", participant B locally determines the second row of the 3 × 3-dimensional feature matrix, constructed based on the data to be deduplicated, as the target row, determines the piece of data to be deduplicated identified by the data label U2 as the target data to be deduplicated according to the one-to-one correspondence between the 3 elements of the second row and the 3 feature data in the data to be deduplicated, and then clears the piece of target data to be deduplicated identified by U2.

In this embodiment, in a longitudinal federation scenario composed of a plurality of participants each holding own data, the coordinator receives the result matrix transmitted by each participant (the result matrix is obtained by each participant locally constructing a feature matrix based on the data to be deduplicated in its own data and multiplying a preset random matrix by the feature matrix). The coordinator then splices the received result matrices in pairs to obtain a spliced matrix. Finally, the coordinator detects whether the elements at corresponding positions of the rows in the spliced matrix are the same so as to find the target rows, and sends a deduplication instruction to each participant according to the target rows to control each participant to locally perform deduplication processing on the data to be deduplicated.
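The overall flow of steps S400 to S600 can be put together in one short sketch. It rests on assumptions that are not taken from the disclosure: a single integer random matrix shared by all participants, the product taken as feature matrix times random matrix, and identical rows grouped with a dictionary instead of the pairwise reference-row traversal (the two find the same rows). The function name `coordinator_round` and all sample values are hypothetical.

```python
import random


def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def coordinator_round(feature_matrices, rng):
    """Return the 1-based row numbers that every participant is asked to clear.

    feature_matrices: {participant name: feature matrix}, all of the same shape.
    """
    names = sorted(feature_matrices)
    width = len(feature_matrices[names[0]][0])
    # Assumption: one integer random matrix shared by all participants.
    shared = [[rng.randint(1, 9) for _ in range(width)] for _ in range(width)]
    # S400: each participant computes its result matrix locally and uploads it.
    results = [matmul(feature_matrices[name], shared) for name in names]
    # S500: splice the result matrices side by side, then group identical rows.
    spliced = [sum(rows, []) for rows in zip(*results)]
    groups = {}
    for number, row in enumerate(spliced, start=1):
        groups.setdefault(tuple(row), []).append(number)
    # S600: reserve one random row per group and report the others for clearing.
    to_clear = []
    for rows in groups.values():
        if len(rows) > 1:
            keep = rng.choice(rows)
            to_clear.extend(r for r in rows if r != keep)
    return sorted(to_clear)


features = {
    "A": [[30, 5000, 12], [30, 5000, 12], [45, 8000, 3]],  # rows U4, U5, U6
    "B": [[1, 2, 3], [1, 2, 3], [9, 9, 9]],                # rows U1, U2, U3
}
print(coordinator_round(features, random.Random(7)))
```

In this made-up input, rows 1 and 2 are identical for both participants, so the coordinator reports exactly one of them; each participant would then map that row number to its own data tag and clear the matching record locally.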

In a longitudinal federation scenario, each participant in the invention constructs a feature matrix based on the data to be deduplicated at its local end, which forms a one-to-one correspondence between each piece of data to be deduplicated and a row of matrix elements. This meets the requirement of data consistency and makes it convenient for the participants to jointly identify and deduplicate identical data. In addition, the result matrix obtained by multiplying the random matrix by the feature matrix is transmitted to the coordinator for joint judgment, so data encryption is realized directly through random-number multiplication in matrix form, and no other complicated encryption operation needs to be performed on the data to be deduplicated. Data deduplication can therefore be completed while data privacy and security are ensured, which simplifies the deduplication process, improves the overall efficiency of data deduplication, and makes the method applicable to deduplication of larger data volumes with strong scalability.

In addition, referring to fig. 5, an embodiment of the present invention further provides a longitudinal federal data statistics deduplication device, where the deduplication device is applied to any one of participants in a longitudinal federation, and the deduplication device includes:

and the first deduplication module is used for detecting each target row with the same corresponding position element in the splicing matrix and performing deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed by each target row.

Preferably, the first matrix splicing module comprises:

an extraction unit, configured to extract the second data to be deduplicated from local own data, wherein the number of pieces of second data to be deduplicated is more than one;

and the construction unit is used for sequentially corresponding each piece of feature data in each piece of second data to be deduplicated to each element in a row in the matrix so as to construct and obtain a second feature matrix.

Preferably, the extraction unit includes:

the receiving subunit is configured to receive a preset data deduplication request, and analyze the preset data deduplication request to obtain statistical features of data to be deduplicated, where the number of the statistical features of the data to be deduplicated is greater than or equal to one;

the detection subunit is used for locally detecting the characteristic data belonging to the statistical characteristics of the data to be deduplicated under each data label in the own data;

and the marking subunit is used for taking the feature data under the same data label as a piece of second data to be deduplicated to extract each piece of second data to be deduplicated.

Preferably, the first deduplication module comprises:

the traversing unit is used for sequentially taking each row of the splicing matrix as a reference row and traversing each element in each to-be-detected row below the reference row in the splicing matrix;

and the marking unit is used for marking the to-be-detected row and the reference row as target rows if each element in the to-be-detected row is the same as the element at the corresponding position in the reference row.

Preferably, the first deduplication module further comprises:

a first determining unit, configured to determine target second deduplication data pointed to by each target row in each second data to be deduplicated, and determine unique retained data in each target second deduplication data;

a duplicate removal clearing unit, configured to clear other target second duplicate removal data, excluding the retained data, in each target second duplicate removal data;

a second determining unit, configured to determine a reserved target row corresponding to the reserved data in each target row, and generate a row identifier for another target row except the reserved target row in each target row;

and the first sending unit is used for sending the row identifier to the other participants so that the other participants can determine the target first deduplication data pointed to by the row identifier in each piece of first deduplication data and clear the target first deduplication data.

Preferably, the deduplication device for longitudinal federal data statistics is further applied to a coordinator in a longitudinal federation, and the deduplication device for longitudinal federal data statistics further includes:

Preferably, the second deduplication module comprises:

a third determining unit, configured to determine a unique reserved target row in each target row, and generate a row identifier for another target row in each target row except the reserved target row;

and the second sending unit is used for sending the row identifier to each participant so that each participant locally determines the deduplication data pointed to by the row identifier in each piece of data to be deduplicated and clears the deduplication data.

The steps implemented when the functional modules of the deduplication device for longitudinal federated data statistics run may refer to the above embodiments of the deduplication method for longitudinal federated data statistics, and are not described herein again.

In addition, an embodiment of the present invention further provides a terminal device, where the terminal device includes: the system comprises a memory, a processor and a longitudinal federal data statistics deduplication program stored on the memory and capable of running on the processor, wherein the longitudinal federal data statistics deduplication program realizes the steps of the longitudinal federal data statistics deduplication method when being executed by the processor.

The steps implemented when the deduplication program for longitudinal federated data statistics running on the processor is executed may refer to various embodiments of the deduplication method for longitudinal federated data statistics of the present invention, and are not described herein again.

In addition, an embodiment of the present invention further provides a storage medium, which is applied to a computer, and the storage medium may be a non-volatile computer-readable storage medium, on which a deduplication program for longitudinal federated data statistics is stored, and when the deduplication program for longitudinal federated data statistics is executed by a processor, the steps of the deduplication method for longitudinal federated data statistics as described above are implemented.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A duplication elimination method for longitudinal federal data statistics is characterized in that the duplication elimination method for longitudinal federal data statistics is applied to any one participant in longitudinal federation, and comprises the following steps:

2. The method for deduplication based on longitudinal federated data statistics as recited in claim 1, wherein the step of building a second feature matrix locally based on second data to be deduplicated in the owned data comprises:

3. The method for deduplication based on longitudinal federated data statistics as recited in claim 2, wherein the step of extracting the second data to be deduplicated from the local owned data comprises:

4. The method for deduplication based on longitudinal federated data statistics as claimed in claim 1, wherein the step of detecting each target row in the mosaic matrix with the same corresponding position element comprises:

5. The method for deduplication based on vertical federal data statistics as claimed in claim 1, wherein the step of performing deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed by each of the target rows comprises:

determining target second deduplication data pointed by each target row in each second data to be deduplicated, and determining unique reserved data in each target second deduplication data;

6. A duplication elimination method for longitudinal federal data statistics is characterized in that the duplication elimination method for longitudinal federal data statistics is applied to a coordinator in longitudinal federal, and comprises the following steps:

7. The method for deduplication based on longitudinal federated data statistics as claimed in claim 6, wherein the step of controlling each of the participants to perform deduplication processing on the data to be deduplicated according to each of the target rows comprises:

8. A longitudinal federal data statistics deduplication device is applied to any one participant in a longitudinal federal, and comprises:

9. A terminal device, characterized in that the terminal device comprises: a memory, a processor, and a longitudinal federated data statistical deduplication program stored on the memory and executable on the processor, the longitudinal federated data statistical deduplication program, when executed by the processor, implementing the steps of the longitudinal federated data statistical deduplication method of any one of claims 1 to 5 or 6 to 7.

10. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the longitudinal federated data statistical deduplication method of any one of claims 1 to 5 or 6 to 7.