CN112416912B - Deduplication method, device, terminal equipment and medium for longitudinal federal data statistics

Info

Publication number: CN112416912B
Application number: CN202011099097.8A
Authority: CN
Inventors: 谭明超; 马国强; 范涛; 陈天健; 杨强
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-10-14
Filing date: 2020-10-14
Publication date: 2024-05-07
Anticipated expiration: 2040-10-14
Also published as: CN112416912A

Abstract

The invention discloses a deduplication method, device, terminal equipment and storage medium for longitudinal federal data statistics. Any one of the participants in a longitudinal federation receives a result matrix sent by the other participants in the longitudinal federation, wherein the result matrix is obtained by the other participants by constructing a first feature matrix locally based on first data to be deduplicated in their own data and then multiplying the first feature matrix with a preset random matrix. The participant constructs a second feature matrix locally based on second data to be deduplicated in its own data, and transversely splices the second feature matrix with the result matrix to obtain a spliced matrix. The participant then detects the target rows in the spliced matrix whose elements at corresponding positions are the same, and performs deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed to by the target rows. According to the invention, no encryption operation needs to be performed on the data to be deduplicated, so data deduplication can be completed while the privacy security of the data is ensured, and the efficiency of data deduplication is improved.

Description

Deduplication method, device, terminal equipment and medium for longitudinal federal data statistics

Technical Field

The invention relates to the technical field of federal data deduplication, in particular to a method, a device, terminal equipment and a storage medium for deduplication of longitudinal federal data statistics.

Background

At present, science and technology have entered the age of data informatization, and statistical applications of data have become increasingly widespread. In data statistics scenarios, it is quite common to deduplicate repeated data. For example, a user selects a specific statistical feature for the data the user owns locally, and when several pieces of data under that statistical feature are identical, the user deduplicates them so that only one piece of data is retained under the statistical feature.

However, after a plurality of users each having their own data jointly form a longitudinal federation scenario, out of consideration for the privacy security of each user's own data, the data to be judged for deduplication is generally encrypted and then transmitted between the users for joint deduplication processing. As a result, when the amount of data to be deduplicated is large, a large amount of time is needed to complete preprocessing, encryption and other processing of the data, and the overall efficiency of data deduplication is difficult to guarantee.

Disclosure of Invention

The main purpose of the invention is to provide a deduplication method, device, terminal equipment and storage medium for longitudinal federal data statistics, aiming to solve the technical problem in the prior art that the overall efficiency of deduplicating statistical data in a longitudinal federation scenario is difficult to guarantee.

In order to achieve the above object, the present invention provides a method for deduplicating longitudinal federal data statistics, where the method for deduplicating longitudinal federal data statistics is applied to any one of the participants in the longitudinal federal, and the method for deduplicating longitudinal federal data statistics includes:

Receiving result matrixes sent by other participants in a longitudinal federation, wherein the result matrixes are obtained by multiplying a preset random matrix by the first characteristic matrix after the other participants construct the first characteristic matrix locally based on first data to be deduplicated in own data;

constructing a second feature matrix based on second data to be deduplicated in the self data in a local mode, and transversely splicing the second feature matrix and the result matrix to obtain a spliced matrix;

detecting each target row with the same corresponding position element in the splicing matrix, and performing de-duplication processing on the first data to be de-duplicated and the second data to be de-duplicated pointed by each target row.

Further, the step of constructing a second feature matrix locally based on the second data to be deduplicated in the own data includes:

Extracting second data to be de-duplicated from the local self-owned data, wherein the number of the second data to be de-duplicated is greater than one;

and sequentially taking each piece of characteristic data in the second data to be de-duplicated as each element of one row in the matrix to construct and obtain a second characteristic matrix.

Further, the step of extracting the second data to be deduplicated from the local self-owned data includes:

Receiving a preset data deduplication request, and analyzing the preset data deduplication request to obtain statistical features of data to be deduplicated, wherein the number of the statistical features of the data to be deduplicated is greater than or equal to one;

feature data belonging to the statistical features of the data to be deduplicated under each data tag in the self-owned data are detected locally;

and taking the characteristic data under the same data label as a piece of second data to be de-duplicated, so as to extract each piece of second data to be de-duplicated.

Further, the step of detecting each target row with the same corresponding position element in the splicing matrix includes:

taking each row of the splicing matrix as a reference row in sequence, and traversing each element in each row to be detected, which is positioned below the reference row, in the splicing matrix;

And if the elements in the current line to be detected are the same as the elements at the corresponding positions in the reference line, marking the line to be detected and the reference line as target lines.

Further, the step of performing deduplication processing on the first to-be-deduplicated data and the second to-be-deduplicated data pointed by each target line includes:

determining, among the pieces of second data to be deduplicated, the target second deduplication data pointed to by each target row, and determining a unique piece of reserved data among the target second deduplication data;

Removing the target second deduplication data except the reserved data in the target second deduplication data;

Determining a reserved target row corresponding to the reserved data in each target row, and generating row identifiers for other target rows except the reserved target row in each target row;

and sending the row identification to the other participants so that the other participants can determine target first deduplication data pointed by the row identification in each piece of first deduplication data to be deduplicated, and clearing the target first deduplication data.
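The deduplication steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `group_identical_rows`, `plan_deduplication` and `clear_rows` are hypothetical, row identifiers are assumed to be 0-based row indices, and keeping the first row of each group as the reserved row is an assumption (the text only requires that exactly one be kept). All values are made up for illustration.

```python
from collections import defaultdict


def group_identical_rows(spliced_matrix):
    """Group row indices whose elements are identical at every position."""
    groups = defaultdict(list)
    for index, row in enumerate(spliced_matrix):
        groups[tuple(row)].append(index)
    # Only groups with at least two rows contain target rows.
    return [indices for indices in groups.values() if len(indices) > 1]


def plan_deduplication(groups):
    """Keep the first row of each group (assumption); the rest become row identifiers."""
    row_identifiers = []
    for indices in groups:
        row_identifiers.extend(indices[1:])
    return sorted(row_identifiers)


def clear_rows(records, row_identifiers):
    """Drop the records pointed to by the row identifiers."""
    drop = set(row_identifiers)
    return [record for index, record in enumerate(records) if index not in drop]


# Illustrative values only: rows 0 and 2 of the spliced matrix are identical.
spliced = [[1, 2, 9, 9], [3, 4, 8, 8], [1, 2, 9, 9]]
second_data = ["second-0", "second-1", "second-2"]  # current participant, by row
first_data = ["first-0", "first-1", "first-2"]      # other participant, by row

row_ids = plan_deduplication(group_identical_rows(spliced))
second_data = clear_rows(second_data, row_ids)  # cleared locally
first_data = clear_rows(first_data, row_ids)    # cleared by the other participant on receipt
```

Only the row identifiers cross the network in this sketch; each side clears its own records.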

In addition, to achieve the above object, the present invention further provides a method for deduplicating longitudinal federal data statistics, where the method for deduplicating longitudinal federal data statistics is applied to a coordinator in a longitudinal federal, and the method for deduplicating longitudinal federal data statistics includes:

Receiving a result matrix sent by each participant in a longitudinal federation, wherein the result matrix is obtained by multiplying a characteristic matrix by a preset random matrix after each participant constructs the characteristic matrix locally based on data to be deduplicated in own data;

transversely splicing the result matrixes to obtain spliced matrixes, and detecting target rows with the same corresponding position elements in the spliced matrixes;

And controlling each participant to perform deduplication processing on the data to be deduplicated according to each target row.

Further, the step of controlling each of the participants to perform deduplication processing on the data to be deduplicated according to each of the target rows includes:

determining a unique reserved target row in each target row, and generating row identifiers for other target rows except the reserved target row in each target row;

And sending the row identification to each participant so that each participant can respectively and locally determine the deduplication data pointed by the row identification in each piece of data to be deduplicated, and clearing the deduplication data.
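The coordinator variant can be sketched in the same spirit. The function name `coordinate_deduplication` is hypothetical, row identifiers are assumed to be 0-based row indices, and treating the first occurrence of a row as the reserved target row is an assumption; the result matrices below hold made-up values.

```python
def coordinate_deduplication(result_matrices):
    """Splice the participants' result matrices side by side and
    return the row identifiers every participant should clear."""
    spliced = [
        [element for part in row_parts for element in part]
        for row_parts in zip(*result_matrices)
    ]
    reserved = {}          # row content -> index of the reserved target row
    row_identifiers = []
    for index, row in enumerate(spliced):
        key = tuple(row)
        if key in reserved:
            row_identifiers.append(index)  # duplicate of an earlier row
        else:
            reserved[key] = index          # first occurrence is kept (assumption)
    return row_identifiers


# Illustrative result matrices from two participants (3 rows each).
result_a = [[5, 6], [7, 8], [5, 6]]
result_b = [[1], [2], [1]]
row_ids = coordinate_deduplication([result_a, result_b])
```

A row counts as a duplicate here only if it matches in every participant's columns, which is why the matrices are spliced before comparison.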

In addition, in order to achieve the above object, the present invention further provides a longitudinal federal data statistics deduplication device, where the longitudinal federal data statistics deduplication device is applied to any one of the participants in the longitudinal federal, and the longitudinal federal data statistics deduplication device includes:

The first receiving module is used for receiving a result matrix sent by other participants in the longitudinal federation, wherein the result matrix is obtained by multiplying a preset random matrix by the first characteristic matrix after the other participants construct the first characteristic matrix locally based on first to-be-deduplicated data in own data;

The first matrix splicing module is used for constructing a second characteristic matrix locally based on second data to be deduplicated in the own data, and transversely splicing the second characteristic matrix with the result matrix to obtain a splicing matrix;

the first deduplication module is used for detecting each target row with the same corresponding position element in the splicing matrix, and performing deduplication processing on the first to-be-deduplicated data and the second to-be-deduplicated data pointed by each target row;

the deduplication device of the longitudinal federal data statistics is also applied to a coordinator in the longitudinal federal, and the deduplication device of the longitudinal federal data statistics further comprises:

The second receiving module is used for receiving a result matrix sent by each participant in the longitudinal federation, wherein the result matrix is obtained by multiplying a preset random matrix by the feature matrix after each participant constructs the feature matrix based on the data to be deduplicated in the own data;

the second matrix splicing module is used for transversely splicing the result matrixes to obtain splicing matrixes and detecting target rows with the same corresponding position elements in the splicing matrixes;

and the second deduplication module is used for controlling each participant to perform deduplication processing on the data to be deduplicated according to each target row.

In addition, to achieve the above object, the present invention also provides a terminal device including: a memory, a processor, and a deduplication program for longitudinal federal data statistics that is stored on the memory and executable on the processor, wherein the deduplication program, when executed by the processor, implements the steps of the deduplication method for longitudinal federal data statistics described above.

In addition, to achieve the above object, the present invention further provides a storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the deduplication method for vertical federal data statistics as described above.

The invention provides a deduplication method, device, terminal equipment and storage medium for longitudinal federal data statistics. Any one of the participants in a longitudinal federation receives a result matrix sent by the other participants in the longitudinal federation, wherein the result matrix is obtained by the other participants by constructing a first feature matrix locally based on first data to be deduplicated in their own data and then multiplying the first feature matrix with a preset random matrix. The participant constructs a second feature matrix locally based on second data to be deduplicated in its own data, and transversely splices the second feature matrix with the result matrix to obtain a spliced matrix. The participant then detects the target rows in the spliced matrix whose elements at corresponding positions are the same, and performs deduplication processing on the first data to be deduplicated and the second data to be deduplicated pointed to by the target rows.

In a longitudinal federation scenario, the feature matrix is constructed from a participant's data to be deduplicated, so that each piece of data to be deduplicated corresponds one-to-one to a row of elements in the matrix. This meets the data consistency requirement and makes it convenient for each participant to identify and deduplicate identical data. In addition, what is transmitted between participants is the result matrix obtained by multiplying the feature matrix with the random matrix, so the data is directly encrypted in matrix form by random-number multiplication, and no other complex encryption operation needs to be performed on the data to be deduplicated. Data deduplication can therefore be completed while the privacy security of the data is ensured, the overall efficiency of data deduplication is improved, and the method is applicable to deduplicating larger amounts of data and has strong scalability.
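Why row comparison still works on the result matrix can be stated in one line. The following is a sketch under two assumptions that the text does not spell out: the feature matrix $F$ (with rows $f_i$) is the left factor of the product, and the random matrix $R$ is invertible.

```latex
f_i = f_j \;\Longrightarrow\; f_i R = f_j R,
\qquad
f_i R = f_j R \;\Longrightarrow\; (f_i - f_j)\,R = 0 \;\Longrightarrow\; f_i = f_j
\quad \text{(if } R \text{ is invertible)}
```

Under these assumptions, two rows of $FR$ are identical exactly when the corresponding rows of $F$ are identical, so duplicates can be detected without exchanging $F$ itself.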

Drawings

Fig. 1 is a schematic structural diagram of hardware operation of a terminal device according to an embodiment of the present invention;

FIG. 2 is a flow chart of an embodiment of a method for deduplication of vertical federal data statistics according to the present invention;

FIG. 3 is a schematic diagram of an application scenario involved in one embodiment of a deduplication method for vertical federal data statistics according to the present invention;

FIG. 4 is a flow chart of another embodiment of a deduplication method for vertical federal data statistics according to the present invention;

Fig. 5 is a schematic block diagram of a longitudinal federal data statistics deduplication apparatus according to the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

As shown in fig. 1, fig. 1 is a schematic structural diagram of a hardware running environment related to a terminal device according to an embodiment of the present invention.

It should be noted that fig. 1 may be a schematic structural diagram of a hardware operating environment of a terminal device. The terminal equipment of the embodiment of the invention can be PC, portable computer and other terminal equipment.

As shown in fig. 1, the terminal device may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. The memory 1005 may optionally also be a storage device separate from the aforementioned processor 1001.

It will be appreciated by those skilled in the art that the terminal device structure shown in fig. 1 is not limiting of the terminal device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.

As shown in fig. 1, an operating system, a network communication module, a user interface module, and a processing program of distributed tasks may be included in a memory 1005 as one type of computer storage medium. The operating system is a program for managing and controlling hardware and software resources of the sample terminal device, and supports the running of processing programs of distributed tasks and other software or programs.

In the terminal device shown in fig. 1, the user interface 1003 is mainly used for data communication with each terminal; the network interface 1004 is mainly used for connecting to a background server and carrying out data communication with the background server; and the processor 1001 may be configured to invoke the deduplication program for longitudinal federal data statistics stored in the memory 1005 and perform the operations of the deduplication method for longitudinal federal data statistics described in the embodiments below.

Based on the above structure, various embodiments of the deduplication method of longitudinal federal data statistics of the present invention are presented.

Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a deduplication method for vertical federation data statistics according to the present invention.

This embodiment of the present invention provides an embodiment of a deduplication method for longitudinal federal data statistics. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.

The deduplication method of the longitudinal federation data statistics of the embodiment of the invention is applied to the terminal equipment serving as one of the participants in the longitudinal federation, and the terminal equipment of the embodiment of the invention can be terminal equipment such as PC, portable computer and the like, and is not particularly limited.

The deduplication method for longitudinal federal data statistics in this embodiment includes:

Step S100, receiving a result matrix sent by other participants in a longitudinal federation, wherein the result matrix is obtained by multiplying a preset random matrix by a first feature matrix after the other participants construct the first feature matrix locally based on first data to be deduplicated in own data;

In this embodiment, each participant builds a data statistics table that uses the different feature dimensions of users as the candidate columns (vertical) and uses each user as the data tag of a row (horizontal), and fills the feature data of each user under the different feature dimensions into the corresponding positions. The own data held at the local end of each participant is obtained by compiling statistics in this way.
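As a rough illustration of such a table, the sketch below stores one row per data tag and one column per feature dimension. The dictionary layout, the helper `feature_value` and all numeric values are assumptions made for illustration; the feature dimension names are the ones used in the example later in this description.

```python
# Hypothetical own-data table of one participant:
# one row per data tag (user), one column per feature dimension.
FEATURE_DIMENSIONS = ["age", "income", "number of transactions", "whether overdue"]

own_data = {
    "U1": {"age": 30, "income": 5000, "number of transactions": 12, "whether overdue": 0},
    "U2": {"age": 45, "income": 8000, "number of transactions": 7, "whether overdue": 1},
    "U3": {"age": 30, "income": 5000, "number of transactions": 12, "whether overdue": 0},
}


def feature_value(table, data_tag, feature_dimension):
    """Look up the feature data at (data tag, feature dimension)."""
    return table[data_tag][feature_dimension]


income_of_u2 = feature_value(own_data, "U2", "income")
```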

In a longitudinal federation formed by a plurality of participants, each other participant except the current participant locally selects first data to be deduplicated which needs to be subjected to data deduplication from own data based on a received preset data deduplication request, a first feature matrix is formed based on the first data to be deduplicated, the first feature matrix is multiplied by a preset random matrix to obtain a result matrix, and then the current participant receives the result matrix transmitted by each other participant.

It should be noted that, in this embodiment, the preset data deduplication request may be triggered automatically by the terminal device serving as the other participant, or triggered manually by a user of that terminal device. When the request is triggered, it is configured at the same time with the data tags used to extract, from the own data of the local end, the data to be judged for deduplication, and with the feature dimensions to which that data belongs, which serve as the statistical features of the data to be deduplicated. This makes it convenient to extract the data to be deduplicated and construct the feature matrix.

In addition, the preset random matrix is a randomly generated n*n matrix, where n is the number of feature dimensions configured in the preset data deduplication request.
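A minimal sketch of generating such a matrix is shown below. The function name `make_random_matrix`, the value range and the use of a seeded generator are assumptions for illustration; the text does not specify how the random entries are drawn.

```python
import random


def make_random_matrix(n, rng=None):
    """Return an n*n matrix (list of rows) with randomly drawn entries."""
    rng = rng or random.Random()
    # The range [1.0, 10.0] is an arbitrary choice for this sketch.
    return [[rng.uniform(1.0, 10.0) for _ in range(n)] for _ in range(n)]


# n is the number of statistical features configured in the request.
statistical_features = ["age", "income", "number of transactions"]
random_matrix = make_random_matrix(len(statistical_features), random.Random(0))
```

Seeding the generator only makes the sketch reproducible; a real deployment would presumably not use a fixed seed.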

Specifically, for example, in a longitudinal federation formed by participant A and participant B, participant B locally obtains its own data by compiling statistics with the feature dimensions "age", "income", "number of transactions" and "whether overdue" as the candidate columns and the device numbers U1, U2 and U3 as the row data tags. If participant B then receives a preset data deduplication request that configures U1, U2 and U3 as the data tags of the data to be deduplicated and configures "age", "income" and "number of transactions" as the statistical features of the data to be deduplicated, participant B locally extracts from its own data the 9 pieces of feature data that belong to the feature dimensions "age", "income" and "number of transactions" in the three rows U1, U2 and U3 as the first data to be deduplicated, and constructs a 3*3-dimensional first feature matrix from these 9 pieces of feature data. That is, participant B takes the three pieces of feature data in row U1 as the three elements of the first row (counting from top to bottom) of the first feature matrix, the three pieces of feature data in row U2 as the three elements of the second row, and the three pieces of feature data in row U3 as the three elements of the third row.

In addition, participant B locally generates a 3*3-dimensional random matrix according to the number, 3, of statistical features of the data to be deduplicated ("age", "income" and "number of transactions") configured in the received preset data deduplication request. After constructing the 3*3-dimensional first feature matrix, participant B multiplies the generated random matrix with the first feature matrix using a mature matrix multiplication algorithm to obtain a 3*3-dimensional result matrix, and transmits the result matrix to participant A.
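Participant B's side of this example can be sketched as follows. Several things here are assumptions: the helper names `build_feature_matrix` and `multiply`, the feature values, the fixed integer matrix that stands in for the random matrix so the numbers are reproducible, and the choice to use the feature matrix as the left factor so that each result row depends only on the matching feature row.

```python
def build_feature_matrix(own_data, data_tags, features):
    """One row per data tag, one column per statistical feature."""
    return [[own_data[tag][feature] for feature in features] for tag in data_tags]


def multiply(left, right):
    """Plain matrix product of two matrices stored as lists of rows."""
    if len(left[0]) != len(right):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right)))
         for j in range(len(right[0]))]
        for i in range(len(left))
    ]


# Illustrative own data of participant B; U1 and U3 are deliberately identical.
own_data_b = {
    "U1": {"age": 30, "income": 5000, "number of transactions": 12},
    "U2": {"age": 45, "income": 8000, "number of transactions": 7},
    "U3": {"age": 30, "income": 5000, "number of transactions": 12},
}
features = ["age", "income", "number of transactions"]

first_feature_matrix = build_feature_matrix(own_data_b, ["U1", "U2", "U3"], features)
random_matrix = [[2, 1, 0], [0, 1, 3], [1, 0, 1]]  # fixed stand-in for a random 3*3 matrix
result_matrix = multiply(first_feature_matrix, random_matrix)
# result_matrix == [[72, 5030, 15012], [97, 8045, 24007], [72, 5030, 15012]]
```

The first and third rows of the result matrix are equal because the corresponding feature rows are equal, while the raw values of "age" or "income" do not appear in what is sent.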

Step S200, constructing a second feature matrix based on second data to be deduplicated in the own data locally, and transversely splicing the second feature matrix and the result matrix to obtain a spliced matrix;

After receiving the result matrix delivered by the other participants, the current participant locally extracts the second data to be deduplicated from its own data and constructs a second feature matrix based on the second data to be deduplicated. The current participant then transversely splices the second feature matrix with the received result matrix locally to obtain a spliced matrix.

It should be noted that, in this embodiment, every participant in the longitudinal federation follows the same process to locally extract its data to be deduplicated and construct its feature matrix from that data. In addition, current participant A transversely splices the second feature matrix and the received result matrix locally using existing mature matrix splicing operations. Specifically, for example, if the second feature matrix constructed by participant A from the second data to be deduplicated in its own data is a 3*3-dimensional matrix, and the result matrix received by participant A from participant B is also a 3*3-dimensional matrix, participant A obtains a 3*6-dimensional spliced matrix by transversely splicing the two (either by placing the three columns of the result matrix directly after the third column of the second feature matrix, or by placing the three columns of the second feature matrix directly after the third column of the result matrix).
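Transverse splicing of two matrices with the same number of rows can be sketched as below. The function name `splice_transversely` and the matrix values are illustrative assumptions.

```python
def splice_transversely(left, right):
    """Place the columns of `right` directly after the columns of `left`, row by row."""
    if len(left) != len(right):
        raise ValueError("both matrices must have the same number of rows")
    return [left_row + right_row for left_row, right_row in zip(left, right)]


# Illustrative 3*3 matrices: A's second feature matrix and the result matrix from B.
second_feature_matrix = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
result_matrix = [[10, 20, 30], [40, 50, 60], [10, 20, 30]]

spliced_matrix = splice_transversely(second_feature_matrix, result_matrix)
rows, columns = len(spliced_matrix), len(spliced_matrix[0])  # 3 rows, 6 columns
```

Swapping the two arguments gives the other column order mentioned in the text; either order works as long as it is the same for every row.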

Further, in a possible embodiment, step S200 may include:

step S201, extracting second data to be de-duplicated from local self-owned data, wherein the number of the second data to be de-duplicated is greater than one;

It should be noted that, in this embodiment, in order to judge whether the pieces of data to be deduplicated identified by the data tags in the own data of the local end are the same, the number of pieces of data to be deduplicated that each participant extracts locally from its own data is greater than one. That is, at least two pieces of data to be deduplicated identified by different data tags need to be extracted, so that it can be determined whether the two pieces are the same and therefore whether deduplication processing needs to be performed.

Step S202, each piece of characteristic data in the second data to be deduplicated is sequentially used as each element of one row in the matrix, so that a second characteristic matrix is constructed.

Based on the preset data deduplication request it receives, the current participant likewise extracts, from the own data compiled at its local end, the multiple pieces of second data to be deduplicated for which it must be judged whether deduplication is needed. The current participant then takes the pieces of feature data in each piece of second data to be deduplicated, in order, as the elements of one row of the matrix, thereby constructing a second feature matrix in which each row corresponds to the feature data of one piece of second data to be deduplicated.

Specifically, for example, current participant A also locally obtains its own data by compiling statistics with the feature dimensions "age", "income", "number of transactions" and "whether overdue" as the candidate columns and the device numbers U4, U5 and U6 as the row data tags. If participant A then receives a preset data deduplication request that configures U4, U5 and U6 as the data tags of the data to be deduplicated and configures "age", "income" and "whether overdue" as the statistical features of the data to be deduplicated, participant A locally extracts from its own data the 9 pieces of feature data that belong to the feature dimensions "age", "income" and "whether overdue" in the three rows U4, U5 and U6 as the second data to be deduplicated, and constructs a 3*3-dimensional second feature matrix from these 9 pieces of feature data. That is, participant A takes the three pieces of feature data in row U4 as the three elements of the first row (counting from top to bottom) of the second feature matrix, the three pieces of feature data in row U5 as the three elements of the second row, and the three pieces of feature data in row U6 as the three elements of the third row.

Further, in a possible embodiment, in the step S201, the step of extracting the second data to be deduplicated from the local self-owned data may include:

Step S2011, receiving a preset data deduplication request, and parsing the preset data deduplication request to obtain the statistical features of the data to be deduplicated, wherein the number of the statistical features of the data to be deduplicated is greater than or equal to one;

After receiving the result matrix transmitted by the other participants, the current participant may output a prompt message at the local end, receive a preset data deduplication request triggered based on the prompt message, and then parse the request to extract from it one or more statistical features of the data to be deduplicated.

It should be noted that, in this embodiment, the statistical features of the data to be deduplicated are one or more feature dimensions configured at the same time as the preset data deduplication request is triggered, either automatically by the terminal device serving as the current participant or manually by a user of that terminal device. In addition, the prompt message may be used to inform the user of the terminal device serving as the current participant that other participants are currently initiating a request to judge whether own data should be deduplicated, so that the user can confirm whether to respond to the request and select the data to be deduplicated in the own data of the local end for judgment and deduplication processing.

Specifically, for example, referring to the application scenario shown in fig. 3, in a longitudinal federation formed by participant A and participant B, current participant A outputs a prompt message at the local end after receiving the result matrix transmitted by participant B. If current participant A then receives a preset data deduplication request, triggered either by confirmation feedback input by the user or automatically by the local end, current participant A parses the request using a mature command parsing technology and extracts from it the three statistical features of the data to be deduplicated: "age", "income" and "whether overdue".

Step S2012, locally detecting the feature data that belongs to the statistical features of the data to be deduplicated under each data tag in the own data;

And step S2013, taking the characteristic data under the same data label as a piece of second data to be de-duplicated, so as to extract each piece of second data to be de-duplicated.

After analyzing a preset data deduplication request to extract statistical characteristics of data to be deduplicated, the current participant immediately detects characteristic data belonging to the statistical characteristics of the data to be deduplicated under each data label in own data counted by the local end, and then the current participant takes each characteristic data belonging to the same data label in the characteristic data as one piece of second data to be deduplicated, so that each piece of second data to be deduplicated classified according to each data label is obtained.

Specifically, for example, in the same statistical manner as participant B, current participant A locally obtains its own data by compiling statistics with the feature dimensions "age", "income", "number of transactions" and "whether overdue" as the candidate columns and the device numbers U4, U5 and U6 as the row data tags. Then, based on the three statistical features "age", "income" and "whether overdue" parsed from the received preset data deduplication request, current participant A detects, in each row of feature data identified by the data tags U4, U5 and U6 in its own data, the 9 pieces of feature data belonging to these three statistical features. Finally, current participant A takes the 3 pieces of feature data in the row identified by data tag U4 as one piece of second data to be deduplicated, the 3 pieces in the row identified by U5 as another piece, and the 3 pieces in the row identified by U6 as a third piece, thereby obtaining three pieces of second data to be deduplicated from the own data of the local end.
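Steps S2011 to S2013 can be sketched as follows. The request format (a dictionary with a `statistical_features` key), the helper names `parse_request` and `extract_pieces`, and all feature values are hypothetical; the text does not define how the request is encoded.

```python
def parse_request(request):
    """S2011: read the statistical features out of a (hypothetical) request dict."""
    features = list(request["statistical_features"])
    if len(features) < 1:
        raise ValueError("at least one statistical feature is required")
    return features


def extract_pieces(own_data, features):
    """S2012 and S2013: collect, per data tag, the feature data under the requested features."""
    return {tag: [row[feature] for feature in features] for tag, row in own_data.items()}


# Illustrative own data of current participant A.
own_data_a = {
    "U4": {"age": 30, "income": 5000, "number of transactions": 3, "whether overdue": 0},
    "U5": {"age": 45, "income": 8000, "number of transactions": 9, "whether overdue": 1},
    "U6": {"age": 30, "income": 5000, "number of transactions": 5, "whether overdue": 0},
}
request = {"statistical_features": ["age", "income", "whether overdue"]}

features = parse_request(request)
second_data = extract_pieces(own_data_a, features)
# Each value of second_data is one piece of second data to be deduplicated.
```

Note that U4 and U6 differ in "number of transactions" but yield the same piece here, because that feature dimension is not among the requested statistical features.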

Step S300, detecting each target row with the same corresponding position element in the splicing matrix, and performing deduplication processing on the first to-be-deduplicated data and the second to-be-deduplicated data pointed by each target row.

After longitudinally splicing the second feature matrix with the received result matrix locally to obtain the splicing matrix, the current participant traverses the elements of each row in the splicing matrix to detect the target rows whose elements at corresponding positions are the same. The current participant then locally performs deduplication processing on the second data to be deduplicated pointed to by the target rows, and generates a deduplication instruction according to the target rows to control the other participants to perform deduplication processing on the first data to be deduplicated pointed to by the target rows.

Further, in a possible embodiment, in step S300, "detecting each target row with the same corresponding position element in the splicing matrix" may include:

Step S301, taking each row of the splicing matrix as a reference row in sequence, and traversing each element in each row to be detected, which is positioned below the reference row, in the splicing matrix;

Step S302, if each element in the current row to be detected is the same as the element at the corresponding position in the reference row, marking the row to be detected and the reference row as target rows.

Specifically, for example, after receiving the 3*3-dimensional result matrix transmitted by participant B and transversely splicing it with the 3*3-dimensional second feature matrix constructed locally based on the second data to be deduplicated, so as to obtain a 3*6-dimensional splicing matrix, the current participant A first takes the first row of the splicing matrix as the reference row and traverses the 6 elements in the second row to detect whether they are identical, in order, to the 6 elements in the reference row (that is, it detects whether the 1st element of the second row is identical to the 1st element of the reference row, and so on up to the 6th element of the second row). The current participant A then traverses the 6 elements in the third row and detects in the same way whether they are identical, in order, to the 6 elements in the reference row. Finally, the current participant A takes the second row of the splicing matrix as the reference row and traverses the 6 elements in the third row to detect whether they are identical, in order, to the 6 elements in the reference row.

Thus, if the current participant A detects that only the 6 elements in the second row are identical, in order, to the 6 elements in the reference row, it marks both the first row, which is currently serving as the reference row, and the second row in the splicing matrix as target rows. Similarly, if the current participant A takes the second row as the reference row and detects that the 6 elements in the third row are identical, in order, to the 6 elements in the reference row, it marks both the second row, which is currently serving as the reference row, and the third row in the splicing matrix as target rows.

Further, in another possible embodiment, if the current participant A takes the first row as the reference row and detects at the same time that the 6 elements in the second row and the 6 elements in the third row are each identical, in order, to the 6 elements in the reference row, the current participant A does not need to take the second row as the reference row for subsequent detection. Instead, it directly marks the first row, which is currently serving as the reference row, together with the second row and the third row in the splicing matrix as target rows.
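The reference-row traversal of steps S301 and S302, including the shortcut of skipping rows that have already matched an earlier reference row, might be sketched as follows. The function name `find_target_rows`, the grouping of matching rows into lists, and the matrix values are illustrative assumptions, and exact equality is used here because the example values are integers.

```python
def find_target_rows(matrix):
    """Group the indices of rows whose elements match position by position.

    Each row is taken in turn as the reference row and compared with every
    row below it. Rows already matched to an earlier reference row are
    skipped, so they never serve as a reference row themselves.
    """
    groups = []
    matched = set()
    for ref, ref_row in enumerate(matrix):
        if ref in matched:
            continue
        group = [ref]
        for cand in range(ref + 1, len(matrix)):
            cand_row = matrix[cand]
            same = len(cand_row) == len(ref_row) and all(
                a == b for a, b in zip(ref_row, cand_row)
            )
            if cand not in matched and same:
                group.append(cand)
        if len(group) > 1:
            matched.update(group)
            groups.append(group)
    return groups


# Hypothetical 3*6 splicing matrix in which the first two rows coincide.
splicing_matrix = [
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 1, 2, 3],
]
print(find_target_rows(splicing_matrix))
```

The indices are zero-based, so the group `[0, 1]` corresponds to the first and second rows in the wording of the description.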

Further, in a possible embodiment, in step S300, the step of performing the deduplication processing on the first to-be-deduplicated data and the second to-be-deduplicated data pointed to by the target rows may include:

Step S303, determining target second deduplication data pointed by each target row in each second to-be-deduplicated data, and determining unique reserved data in each target second deduplication data;

Based on the one-to-one correspondence between the feature data in each piece of second data to be deduplicated and the elements of each row of the second feature matrix, the current participant determines the elements of the second feature matrix contained in each target row of the splicing matrix, and thereby determines the target second data to be deduplicated among the pieces of second data to be deduplicated. The current participant then randomly determines, among the target second data to be deduplicated, the single piece of reserved data to be kept.

Specifically, for example, after determining the first row and the second row of the 3*6-dimensional splicing matrix as target rows, the current participant A extracts from the first row the first 3 elements, which belong to the second feature matrix, and, according to the correspondence between these 3 elements and the 3 pieces of feature data in the row identified by the data label U4, determines the first piece of second data to be deduplicated, identified by U4, as target second data to be deduplicated. In the same way, it determines the second piece of second data to be deduplicated, identified by the data label U5, as target second data to be deduplicated. The current participant A then determines that, of these two pieces of target second data to be deduplicated, the first piece, identified by U4, is the reserved data to be kept.

Step S304, removing the target second deduplication data except the reserved data in the target second deduplication data;

After determining the single piece of reserved data among the target second data to be deduplicated, the current participant clears all target second data to be deduplicated other than the reserved data, so that no two pieces of second data to be deduplicated with identical feature data exist at the same time.

Specifically, for example, after determining that, of the two pieces of target second data to be deduplicated identified by U4 and U5, the piece identified by the data label U4 is the reserved data to be kept, the current participant A clears the piece of target second data to be deduplicated identified by U5.
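Steps S303 and S304 can be sketched in Python under stated assumptions. The function name `remove_duplicates` and its optional `keep` argument are hypothetical; when `keep` is omitted, the reserved piece is chosen at random, as the description states, while the example passes `keep="U4"` to reproduce the case above.

```python
import random


def remove_duplicates(records, target_labels, keep=None):
    """Keep one record among the target labels and clear the others in place.

    Returns the label of the reserved record and the labels that were cleared.
    """
    if keep is None:
        keep = random.choice(target_labels)  # random reservation
    cleared = [label for label in target_labels if label != keep]
    for label in cleared:
        del records[label]
    return keep, cleared


# Hypothetical second data to be deduplicated; U4 and U5 carry identical features.
second_data = {"U4": [30, 5000, 0], "U5": [30, 5000, 0], "U6": [45, 8000, 1]}
kept, cleared = remove_duplicates(second_data, ["U4", "U5"], keep="U4")
print(kept, cleared, sorted(second_data))
```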

Step S305, determining a reserved target row corresponding to the reserved data in each target row, and generating row identifiers for other target rows except the reserved target row in each target row;

Step S306, sending the line identifier to the other participants, so that the other participants determine the target first deduplication data pointed by the line identifier in each piece of first deduplication data, and clear the target first deduplication data.

After determining the target rows in the splicing matrix, the current participant randomly determines a single reserved target row among them and generates a corresponding row identifier for each of the other target rows. The current participant then transmits the row identifiers to the other participants. After receiving the row identifiers, each of the other participants first determines the target rows pointed to by the row identifiers in the first feature matrix, then determines, based on the one-to-one correspondence between the feature data in each piece of first data to be deduplicated and the elements of each row of the first feature matrix, the target first data to be deduplicated pointed to by those target rows, and clears it, so that no two pieces of first data to be deduplicated with identical feature data exist at the same time.

Specifically, for example, after determining the first row and the second row of the 3*6-dimensional splicing matrix as target rows, the current participant A determines the first row as the reserved target row, generates the row identifier "row 2" for the second row, and transmits the row identifier "row 2" to participant B. After receiving the row identifier "row 2", participant B determines the second row as the target row in the 3*3-dimensional first feature matrix constructed based on the first data to be deduplicated, determines, according to the correspondence between the 3 elements in the second row and the 3 pieces of feature data in the row identified by the data label U2, the piece of first data to be deduplicated identified by U2 as the target first data to be deduplicated, and then clears it.
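The exchange of row identifiers in steps S305 and S306 might look like the following sketch. The identifier format "row N" follows the example above; the function names, the list that records which data label sits in which matrix row, and the feature values are assumptions for illustration.

```python
def make_row_identifiers(target_rows, reserved_row):
    """Build 1-based identifiers such as "row 2" for every non-reserved target row."""
    return [f"row {row + 1}" for row in target_rows if row != reserved_row]


def clear_identified_records(labels_in_row_order, records, identifiers):
    """On the receiving side, map each identifier back to a data label and clear it."""
    cleared = []
    for identifier in identifiers:
        index = int(identifier.split()[1]) - 1  # "row 2" -> zero-based index 1
        label = labels_in_row_order[index]
        if records.pop(label, None) is not None:
            cleared.append(label)
    return cleared


# Current participant A: rows 0 and 1 are target rows, row 0 is reserved.
identifiers = make_row_identifiers([0, 1], reserved_row=0)

# Participant B: hypothetical first data, stored in the row order of its feature matrix.
first_data = {"U1": [30, 5000, 12], "U2": [30, 5000, 12], "U3": [45, 8000, 3]}
cleared = clear_identified_records(["U1", "U2", "U3"], first_data, identifiers)
print(identifiers, cleared, sorted(first_data))
```

Only the row identifiers cross the boundary between the participants in this sketch; the feature values themselves stay local.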

In this embodiment, in a longitudinal federal scenario formed by multiple participants each having own data, the current participant receives a result matrix transmitted by another participant (the result matrix is obtained by each of the other participants constructing a first feature matrix locally based on the first data to be deduplicated in its own data and multiplying the first feature matrix by a preset random matrix). The current participant likewise constructs a second feature matrix locally based on the second data to be deduplicated in its own data, and splices the second feature matrix with the received result matrix to obtain a splicing matrix. Finally, the current participant detects whether the elements at corresponding positions of the rows in the splicing matrix are identical, so as to obtain the target rows whose corresponding position elements are the same, then performs deduplication processing on the second data to be deduplicated locally according to the target rows, and controls the other participants to perform deduplication processing on the first data to be deduplicated.

Further, refer to fig. 5, which is a flowchart illustrating a second embodiment of the deduplication method of longitudinal federal data statistics according to the present invention.

The second embodiment of the deduplication method of longitudinal federation data statistics of the present invention is applied to a coordinator in the longitudinal federation, and it should be noted that although the second embodiment of the deduplication method of longitudinal federation data statistics of the present invention shows a logical order in the flowchart, in some cases, the steps shown or described may be performed in an order different from that herein.

The deduplication method of the longitudinal federation data statistics of the embodiment of the invention is applied to the terminal equipment serving as a coordinator in the longitudinal federation. The terminal equipment of the embodiment of the invention may be a PC, a portable computer or other terminal equipment, and is not specifically limited here.

Step S400, receiving a result matrix sent by each participant in a longitudinal federation, wherein the result matrix is obtained by multiplying a preset random matrix by a feature matrix after each participant constructs the feature matrix based on to-be-deduplicated data in own data locally;

In a longitudinal federation formed by a plurality of participants, each participant locally selects data to be deduplicated which needs to be subjected to data deduplication from own data based on a received preset data deduplication request, a feature matrix is constructed based on the data to be deduplicated, the feature matrix is multiplied by a preset random matrix to obtain a result matrix, each participant then uploads the result matrix obtained by local calculation to a coordinator in the longitudinal federation, and the coordinator receives the result matrix uploaded by each participant.

Specifically, for example, in a longitudinal federation jointly built by coordinator X, participant A and participant B, participant B locally compiles its own data using the feature dimensions "age", "income", "transaction number" and "whether overdue" as the candidate columns and the device numbers U1, U2 and U3 as the row data labels; participant A likewise locally compiles its own data using the feature dimensions "age", "income", "transaction number" and "whether overdue" as the candidate columns and the device numbers U4, U5 and U6 as the row data labels.

Then, participant B receives the preset data deduplication request, which is triggered by configuring U1, U2 and U3 as the data labels of the data to be deduplicated and configuring "age", "income" and "transaction number" as the statistical features of the data to be deduplicated. Participant B then locally extracts from its own data the 9 pieces of feature data belonging to the feature dimensions "age", "income" and "transaction number" in the three rows U1, U2 and U3, and constructs a 3*3-dimensional feature matrix based on these 9 pieces of feature data. That is, counting rows from top to bottom, participant B takes the three pieces of feature data in the row of U1 as the three elements of the first row of the feature matrix, the three pieces in the row of U2 as the three elements of the second row, and the three pieces in the row of U3 as the three elements of the third row.

In addition, the participant B generates a 3*3-dimensional random matrix locally according to the number "3" of the statistical features of the data to be deduplicated, "age", "income" and "transaction number" configured in the received preset data deduplication request. Thus, after constructing the feature matrix of 3*3 dimensions, the party B multiplies the generated random matrix by the feature matrix based on a mature matrix multiplication algorithm to obtain a result matrix of 3*3 dimensions, and transmits the result matrix to the coordinator X.
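The construction of the result matrix can be sketched in plain Python. Several points here are assumptions rather than statements of the described method: the multiplication order (feature matrix on the left, so that identical feature rows yield identical result rows), the use of a seed to stand in for the "preset" random matrix, the integer value range, and all feature values.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def random_matrix(n, seed):
    """Generate an n*n matrix of random integers from a hypothetical preset seed."""
    rng = random.Random(seed)
    return [[rng.randint(1, 9) for _ in range(n)] for _ in range(n)]


# Hypothetical feature matrix of participant B: rows U1, U2, U3 and
# columns "age", "income", "transaction number". U1 and U2 are duplicates.
feature_matrix = [
    [30, 5000, 12],
    [30, 5000, 12],
    [45, 8000, 3],
]

mask = random_matrix(3, seed=2020)         # 3 = number of statistical features
result_matrix = matmul(feature_matrix, mask)

# Identical feature rows remain identical after multiplication.
print(result_matrix[0] == result_matrix[1])
```

Because each result row is the corresponding feature row multiplied by the same matrix, duplicate rows stay detectable by row comparison while the raw feature values are not transmitted directly.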

It should be noted that, in this embodiment, because the process that each party in the vertical federation locally builds a feature matrix based on the data to be deduplicated in the own data and then multiplies the feature matrix by the random matrix to obtain a result matrix is the same, the process that party a locally extracts the own data and builds the feature matrix and multiplies the feature matrix by the random matrix to obtain the result matrix is not repeated here.

Step S500, transversely splicing the result matrices to obtain a splicing matrix, and detecting target rows with the same corresponding position elements in the splicing matrix;

After receiving the result matrices uploaded by the participants, the coordinator transversely splices the result matrices two by two to obtain a splicing matrix, and then traverses the elements of each row in the splicing matrix to detect the target rows whose elements at corresponding positions are the same.

Specifically, for example, if the result matrix A and the result matrix B received by coordinator X from participant A and participant B are both 3*3-dimensional matrices, coordinator X obtains a 3*6-dimensional splicing matrix by transversely splicing result matrix A and result matrix B (either the three columns of result matrix A are placed after the third column of result matrix B, or the three columns of result matrix B are placed directly after the third column of result matrix A).
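The transverse splicing of two 3*3 result matrices into one 3*6 splicing matrix can be sketched as follows. The function name `splice_horizontally` and the matrix values are illustrative assumptions; the sketch places the columns of the second argument after those of the first, which corresponds to one of the two orders mentioned above.

```python
def splice_horizontally(left, right):
    """Append the columns of `right` after the columns of `left`, row by row."""
    if len(left) != len(right):
        raise ValueError("both matrices need the same number of rows")
    return [left_row + right_row for left_row, right_row in zip(left, right)]


# Hypothetical 3*3 result matrices uploaded by participant A and participant B.
result_a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
result_b = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]

splicing_matrix = splice_horizontally(result_a, result_b)
for row in splicing_matrix:
    print(row)
```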

Then, coordinator X first takes the first row of the 3*6-dimensional splicing matrix as the reference row and traverses the 6 elements in the second row to detect whether they are identical, in order, to the 6 elements in the reference row (that is, it detects whether the 1st element of the second row is identical to the 1st element of the reference row, and so on up to the 6th element of the second row). Coordinator X then traverses the 6 elements in the third row and detects in the same way whether they are identical, in order, to the 6 elements in the reference row. Finally, coordinator X takes the second row of the splicing matrix as the reference row and traverses the 6 elements in the third row to detect whether they are identical, in order, to the 6 elements in the reference row.

Thus, if coordinator X detects that only the 6 elements in the second row are identical, in order, to the 6 elements in the reference row, it marks both the first row, which is currently serving as the reference row, and the second row in the splicing matrix as target rows. Similarly, if coordinator X takes the second row as the reference row and detects that the 6 elements in the third row are identical, in order, to the 6 elements in the reference row, it marks both the second row, which is currently serving as the reference row, and the third row in the splicing matrix as target rows.

And step S600, controlling each participant to perform de-duplication processing on the data to be de-duplicated according to each target row.

After determining the target rows in the splicing matrix, the coordinator randomly determines a single reserved target row among them and generates a corresponding row identifier for each of the other target rows. The coordinator then transmits the row identifiers to each participant. After receiving the row identifiers, each participant first determines the target rows pointed to by the row identifiers in its feature matrix, then determines, based on the one-to-one correspondence between the feature data in each piece of data to be deduplicated and the elements of each row of the feature matrix, the target data to be deduplicated pointed to by those target rows, and clears it, so that no two pieces of data to be deduplicated with identical feature data exist at the same time.

Specifically, for example, after determining the first row and the second row of the 3*6-dimensional splicing matrix as target rows, coordinator X determines the first row as the reserved target row, generates the row identifier "row 2" for the second row, and transmits the row identifier "row 2" to participant A and participant B.

After receiving the row identifier "row 2", participant A determines the second row as the target row in the 3*3-dimensional feature matrix constructed based on its data to be deduplicated, determines, according to the correspondence between the 3 elements in the second row and the 3 pieces of feature data in the data to be deduplicated, the piece of data to be deduplicated identified by the data label U5 as the target data to be deduplicated, and then clears the piece identified by U5.

After receiving the row identifier "row 2", participant B determines the second row as the target row in the 3*3-dimensional feature matrix constructed based on its data to be deduplicated, determines, according to the correspondence between the 3 elements in the second row and the 3 pieces of feature data in the data to be deduplicated, the piece of data to be deduplicated identified by the data label U2 as the target data to be deduplicated, and then clears the piece identified by U2.

In this embodiment, in a longitudinal federal scenario formed by a plurality of participants each having own data, the coordinator receives the result matrix transmitted by each participant (the result matrix is obtained by each participant constructing a feature matrix locally based on the data to be deduplicated in its own data and multiplying the feature matrix by a preset random matrix). The coordinator then splices the received result matrices two by two to obtain a splicing matrix. Finally, the coordinator detects whether the elements at corresponding positions of the rows in the splicing matrix are identical, so as to obtain the target rows whose corresponding position elements are the same, and sends a deduplication instruction to each participant according to the target rows to control each participant to perform deduplication processing locally on its data to be deduplicated.

In a longitudinal federal scenario, the feature matrix is constructed based on the data to be deduplicated of each participant, so that a one-to-one correspondence is formed between the data to be deduplicated and the elements of each row of the matrix. This meets the data consistency requirement and facilitates joint judgment and deduplication of the same data by all participants. In addition, the result matrices obtained by multiplying the feature matrices by the random matrix are transmitted by the participants to the coordinator for joint judgment, so the data is encrypted in matrix form directly by random-number multiplication, and no other complex encryption operations on the data to be deduplicated are needed. The data deduplication process can therefore be completed while ensuring the privacy and security of the data, which improves the overall efficiency of data deduplication; the method is applicable to deduplication of larger data volumes and is highly scalable.

In addition, referring to fig. 5, an embodiment of the present invention further provides a longitudinal federal data statistics deduplication device, where the longitudinal federal data statistics deduplication device is applied to any one of the participants in the longitudinal federal, and the longitudinal federal data statistics deduplication device includes:

the first deduplication module is used for detecting each target row with the same corresponding position element in the splicing matrix, and performing deduplication processing on the first to-be-deduplicated data and the second to-be-deduplicated data pointed by each target row.

Preferably, the first matrix stitching module includes:

The extraction unit is used for extracting second data to be de-duplicated from the local self-owned data, wherein the number of the second data to be de-duplicated is greater than one;

the construction unit is used for sequentially taking each piece of characteristic data in the second data to be de-duplicated as each element of one row in the matrix to construct and obtain a second characteristic matrix.

Preferably, the extraction unit includes:

the receiving subunit is used for receiving a preset data deduplication request and analyzing the preset data deduplication request to obtain statistical features of data to be deduplicated, wherein the number of the statistical features of the data to be deduplicated is greater than or equal to one;

The detection subunit is used for locally detecting characteristic data belonging to the statistical characteristics of the data to be deduplicated under each data tag in the own data;

and the marking subunit is used for taking the characteristic data under the same data label as a piece of second data to be de-duplicated so as to extract each piece of second data to be de-duplicated.

Preferably, the first deduplication module comprises:

the traversing unit is used for taking each row of the splicing matrix as a reference row in sequence and traversing each element in each row to be detected, which is positioned below the reference row, in the splicing matrix;

And the marking unit is used for marking the line to be detected and the reference line as target lines if the elements in the current line to be detected are the same as the elements at the corresponding positions in the reference line.

Preferably, the first deduplication module further comprises:

A first determining unit, configured to determine target second deduplication data pointed by each target row in each second to-be-deduplicated data, and determine unique reserved data in each target second deduplication data;

The duplicate removal unit is used for removing other target second duplicate removal data except the reserved data in the target second duplicate removal data;

a second determining unit, configured to determine a reserved target line corresponding to the reserved data in each target line, and generate a line identifier for other target lines except the reserved target line in each target line;

And the first sending unit is used for sending the row identification to the other participants so that the other participants can determine target first deduplication data pointed by the row identification in each piece of first deduplication data, and clear the target first deduplication data.

Preferably, the deduplication device of the longitudinal federal data statistics is further applied to a coordinator in the longitudinal federal, and the deduplication device of the longitudinal federal data statistics further includes:

Preferably, the second deduplication module comprises:

A third determining unit configured to determine a unique reserved target line among the target lines, and generate a line identifier for other target lines among the target lines except the reserved target line;

And the second sending unit is used for sending the row identification to each participant so that each participant can respectively and locally determine the de-duplication data pointed by the row identification in each data to be de-duplicated and clear the de-duplication data.

The steps implemented by each functional module of the deduplication device for longitudinal federal data statistics according to the present invention during operation may refer to each embodiment of the deduplication method for longitudinal federal data statistics according to the present invention, and will not be described herein.

In addition, the embodiment of the invention also provides a terminal device, which comprises: a memory, a processor, and a longitudinal federation data statistics deduplication program stored on the memory and executable on the processor, wherein the longitudinal federation data statistics deduplication program, when executed by the processor, implements the steps of the longitudinal federation data statistics deduplication method described above.

The steps implemented when the deduplication program of the longitudinal federation data statistics running on the processor is executed may refer to various embodiments of the deduplication method of the longitudinal federation data statistics of the present invention, and are not described herein.

In addition, the embodiment of the invention also provides a storage medium, which is applied to a computer and can be a nonvolatile computer readable storage medium, wherein a longitudinal federal data statistical deduplication program is stored on the storage medium, and the longitudinal federal data statistical deduplication program is executed by a processor to realize the steps of the longitudinal federal data statistical deduplication method.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process based on the disclosure herein, whether employed directly or indirectly in other related arts, is likewise covered.

Claims

1. A method for deduplication of longitudinal federal data statistics, wherein the method for deduplication of longitudinal federal data statistics is applied to any one of the participants in the longitudinal federal, and the method for deduplication of longitudinal federal data statistics comprises:

2. The method of deduplication of longitudinal federal data statistics according to claim 1, wherein the step of constructing a second feature matrix locally based on second data to be deduplicated in the own data comprises:

3. The method for deduplication of longitudinal federal data statistics according to claim 2, wherein the step of extracting the second data to be deduplicated from the local owned data comprises:

4. The method of deduplication of longitudinal federal data statistics according to claim 1, wherein the step of detecting target rows having identical corresponding position elements in the splice matrix comprises:

5. The method of deduplication of longitudinal federal data statistics according to claim 1, wherein the step of deduplicating the first to-be-deduplicated data and the second to-be-deduplicated data directed for each of the target rows comprises:

determining target second deduplication data pointed by each target row in each second to-be-deduplicated data respectively, and determining unique reserved data in each target second deduplication data;

6. A method for deduplication of longitudinal federal data statistics, wherein the method for deduplication of longitudinal federal data statistics is applied to a coordinator in a longitudinal federal, the method for deduplication of longitudinal federal data statistics comprising:

7. The method for deduplication of longitudinal federal data statistics according to claim 6, wherein the step of controlling each of the participants to perform deduplication processing on the data to be deduplicated according to each of the target rows comprises:

8. A deduplication device for longitudinal federal data statistics, wherein the deduplication device for longitudinal federal data statistics is applied to any one of the participants in the longitudinal federal, and the deduplication device for longitudinal federal data statistics comprises:

9. A terminal device, characterized in that the terminal device comprises: memory, a processor and a longitudinal federation data statistics deduplication program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the longitudinal federation data statistics deduplication method of any of claims 1 to 5 or 6 to 7.

10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the deduplication method of longitudinal federal data statistics of any of claims 1 to 5 or 6 to 7.