CN116303405B

CN116303405B - Data duplicate checking method and device and computer equipment

Info

Publication number: CN116303405B
Application number: CN202310533575.9A
Authority: CN
Inventors: 王成军; 何涛; 张立杰; 曾明; 史晓婧; 李荣新
Original assignee: Shenzhen Zhuyun Technology Co ltd
Current assignee: Shenzhen Zhuyun Technology Co ltd
Priority date: 2023-05-12
Filing date: 2023-05-12
Publication date: 2023-11-10
Anticipated expiration: 2043-05-12
Also published as: CN116303405A

Abstract

The application relates to a data duplicate checking method. The method comprises performing the following single-character query: acquiring a single character in the data to be checked, and converting the character into data in a preset format according to a mapping rule; using the preset format data as a subscript of a pointer array to acquire node data from a node, thereby completing the query of the single character, wherein the node information comprises the preset format data, an address of the preset format data, and the pointer array, and the preset format data is the same as the node data; and after the query of the single character is completed, taking out the next character in the data to be checked and executing the character query, wherein the position of the next character query is determined according to the subscript of the pointer array of the previous character node, until the query of the last character in the data to be checked is completed, and returning a check result. The method can improve the efficiency of data duplicate checking.

Description

Data duplicate checking method and device and computer equipment

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data duplication checking method, apparatus, and computer device.

Background

In the field of data duplicate checking, a large amount of stored data generally has to be compared one by one with the data to be checked in order to obtain a duplicate checking result. The data may include license plate numbers of automobiles, engine numbers, and the like.

In the related art, a hash value can be obtained by performing a hash transformation on the data. After the remainder of the hash value is taken, the data is placed in the linked list corresponding to the resulting array subscript. When duplicates are checked, the linked list corresponding to the array subscript is located, and the linked list is traversed to search for matching data.
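The related-art approach described above can be illustrated with a minimal sketch. The bucket count `NUM_BUCKETS`, the function names, and the use of Python's built-in `hash()` are assumptions made for illustration only, and Python lists stand in for the linked lists; none of these come from the cited related art.

```python
# Illustrative sketch of hash-and-chain duplicate checking (related art).
# NUM_BUCKETS, the helper names, and hash() are assumptions for this example.
NUM_BUCKETS = 8


def make_table():
    # One chain (here a Python list) per array subscript.
    return [[] for _ in range(NUM_BUCKETS)]


def bucket_index(record: str) -> int:
    # Hash transformation, then take the remainder to get an array subscript.
    return hash(record) % NUM_BUCKETS


def add_record(table, record: str) -> None:
    table[bucket_index(record)].append(record)


def is_duplicate(table, record: str) -> bool:
    # Locate the chain by subscript, then loop over it comparing entries.
    for stored in table[bucket_index(record)]:
        if stored == record:
            return True
    return False
```

In this scheme the cost of a lookup grows with the length of the chain that the record happens to fall into, which is the loop the paragraph above refers to.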

Disclosure of Invention

Based on this, it is necessary to provide a data duplication checking method that addresses the above technical problem. A linked list structure is constructed that can include multiple layers of branch nodes, where each node position corresponds to a character position in the data to be checked, and characters can be used as subscripts of a pointer array, so that the speed of data duplicate checking is improved.

In a first aspect, the present application provides a data duplication checking method. The method comprises the following steps:

the following single character queries are performed:

acquiring a single character in the data to be checked, and converting the character into data in a preset format according to a mapping rule;

acquiring node data from a node by taking the preset format data as a subscript of a pointer array, and finishing the inquiry of the single character, wherein the node information comprises the preset format data, an address of the preset format data and the pointer array, and the preset format data is the same as the node data;

and after the query of the single character is completed, taking out the next character in the data to be checked and executing the character query, wherein the position of the next character query is determined according to the subscript of the pointer array of the previous character node, until the query of the last character in the data to be checked is completed, and returning a check result.

In one embodiment, a character range is determined according to the to-be-checked duplicate data, the characters in the character range are ordered, the ordered characters are converted into data in a preset format according to the mapping rule, and the data are stored in an array.

In one embodiment, the order of the duplicate data to be checked is associated with a linked list structure.

In one embodiment, the linked list includes a root node, the root node includes at least one layer of branch nodes, each layer of branch nodes includes at least one branch node, a position of the one branch node corresponds to a position of the one character, and the one layer of branch nodes corresponds to a character matched with the number of layers of branch nodes.

In one embodiment, if the characters of the data to be checked, from the first position up to a given position, are the same as the corresponding characters of already-checked data, the data to be checked and the checked data share the same link for those characters.

In a second aspect, the present application further provides a data duplication checking device, where the device includes:

the following single character queries are performed:

the conversion module is used for obtaining single characters in the duplication data to be checked and converting the characters into data in a preset format according to the mapping rule;

the query module is used for taking the preset format data as a subscript of the pointer array to acquire node data from nodes and completing the query of the single character, wherein the node information comprises the preset format data, the address of the preset format data and the pointer array, and the preset format data is the same as the node data;

and the confirmation module is used for taking out the next character in the data to be checked to execute character inquiry after the inquiry of the single character is completed, determining the position of the next character inquiry according to the subscript of the pointer array of the previous character node until the inquiry of the last character in the data to be checked is completed, and returning a check result.

In a third aspect, the present disclosure also provides a computer device. The computer device comprises a memory storing a computer program and a processor that implements the steps of the data duplication checking method when executing the computer program.

In a fourth aspect, the present disclosure also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the data duplication checking method.

In a fifth aspect, the present disclosure also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the data duplication checking method.

The data duplication checking method at least comprises the following beneficial effects:

According to the embodiments provided by the present disclosure, the data can be stored in an array, and the finally generated data structure is a tree structure. The data can be used as a subscript of the pointer array to obtain node data from a node, the node data is compared with the data being queried, and the position of the next character query is determined according to the subscript of the pointer array of the previous character node, which improves the speed of duplicate checking.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments or the conventional techniques of the present disclosure, the drawings required for the descriptions of the embodiments or the conventional techniques will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to the drawings without inventive effort to those of ordinary skill in the art.

FIG. 1 is an application environment diagram of a data duplication checking method in one embodiment;

FIG. 2 is a flow diagram of a data duplication checking method in one embodiment;

FIG. 3 is a schematic diagram of a linked list structure in one embodiment;

FIG. 4 is a block diagram of a data duplication checking apparatus in one embodiment;

FIG. 5 is an internal block diagram of a computer device in one embodiment;

FIG. 6 is an internal structural diagram of a server in one embodiment.

Detailed Description

In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprises a" does not exclude the presence of additional identical or equivalent elements in the process, method, article, or apparatus that comprises the element. Words such as "first" and "second" are used to indicate names and do not indicate any particular order.

The embodiment of the disclosure provides a data duplication checking method, which can be applied to an application environment as shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.

In some embodiments of the present disclosure, as shown in FIG. 2, a data duplication checking method is provided. The method is described by taking as an example its application to the server in FIG. 1 for processing the data to be checked. It will be appreciated that the method may be applied to a server, and may also be applied to a system comprising a terminal and a server, and implemented by interaction of the terminal and the server. In a specific embodiment, the method may include the steps of:

S20: the following single character queries are performed:

S202: acquiring a single character in the data to be checked, and converting the character into data in a preset format according to a mapping rule.

The data to be checked may be an identification card number, a license plate number of an automobile, an engine number, and the like, and may be composed of a plurality of characters. The characters may include letters, numbers, operators, punctuation marks, and the like. In some embodiments of the present disclosure, the data to be checked may be an identification card number, which generally consists of letters and numbers. A single character of the identification card number is obtained and converted into data in a preset format according to a mapping rule. The mapping rule may convert all characters into numbers, which facilitates sorting and storage.

S204: taking the preset format data as the subscript of a pointer array to acquire node data from a node, thereby completing the query of the single character, wherein the node information comprises the preset format data, the address of the preset format data and the pointer array, and the preset format data is the same as the node data.

A plurality of nodes are connected to form a linked list structure. The linked list structure may include multiple layers of node data, and one layer may include a plurality of node data. The node information may include data in a preset format, an address of the data in the preset format, and a pointer array, where the entry at a given subscript of the pointer array may store the node for the next character to be checked. The position of the data in the preset format in the linked list corresponds to the position of the character in the identification card number.
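A node of the kind described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `Node`, the field names `value` and `children`, and the constant `ALPHABET_SIZE = 62` are hypothetical, and in Python the "address" of the data is represented implicitly by the object reference rather than by an explicit field.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: 10 digits plus 26 letters in each of two cases.
ALPHABET_SIZE = 62


@dataclass
class Node:
    # Preset-format data held by this node (-1 marks the root, which holds none).
    value: int = -1
    # Pointer array: subscript i leads to the node for the next character
    # whose preset-format value is i, or None if no such node exists.
    children: List[Optional["Node"]] = field(
        default_factory=lambda: [None] * ALPHABET_SIZE
    )


root = Node()
# Link a first-layer node for a character whose preset-format value is 3.
root.children[3] = Node(value=3)
```

Because the subscript and the stored value coincide, fetching `root.children[3]` both locates the node and confirms that the node data equals the preset format data.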

FIG. 3 is a diagram of a linked list structure in one embodiment. In some embodiments of the present disclosure, if the single character currently being checked is the first character of the identification card number, the query starts from the root node of the linked list structure, whose pointer array is indexed by the first character, and the position of the first character in the linked list may be represented as 0. The first character is used as a subscript to take the corresponding node data out of the pointer array of the root node. The next layer of the linked list structure below the root node may include a plurality of node data. If that layer contains no node data matching the first character, the data to be checked is not a duplicate. If node data matching the first character is found, the second character of the data to be checked is taken out and used as a subscript into the pointer array of the matched node to take out the corresponding node data. The remaining characters are queried in sequence in the same way. If every character of the data to be checked obtains matching node data from the pointer arrays in query order, the data to be checked already exists.

S206: after the query of the single character is completed, taking out the next character in the data to be checked and executing the character query, wherein the position of the next character query is determined according to the subscript of the pointer array of the previous character node, until the query of the last character in the data to be checked is completed, and returning a check result.

The position of the next character query can be determined according to the subscript of the pointer array of the previous character node. Because the two characters are adjacent in query order, the nodes storing them are also adjacent in the linked list structure, so the position of the next character can be obtained quickly through the subscript of the pointer array of the previous character node. In some embodiments of the present disclosure, the identification card number may be used as the data to be checked. An identification card number is generally 18 characters long, so whether the data to be checked exists can be determined with at most 18 data fetches. If all the characters, taken one by one in query order, form a link, the data to be checked exists. If a character of the data to be checked is not stored in the linked list, the data to be checked does not exist. If a character of the data to be checked is stored in the linked list but its position in the linked list is inconsistent with its position in the data to be checked, the data to be checked likewise does not exist.
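Steps S202 to S206 can be sketched as a character-by-character walk. This is a minimal sketch under assumptions, not the implementation of the disclosure: the names `CHARSET`, `TO_INDEX`, `Node`, `insert`, and `exists` are hypothetical, the mapping rule is assumed to be the sorted order of digits and ASCII letters, and the 18-character sample value is made up.

```python
import string

# Assumed mapping rule: sort the character range, then map each character
# to its position in the sorted order.
CHARSET = sorted(string.digits + string.ascii_letters)
TO_INDEX = {ch: i for i, ch in enumerate(CHARSET)}


class Node:
    def __init__(self, value=None):
        self.value = value                      # preset-format data
        self.children = [None] * len(CHARSET)   # pointer array


def insert(root, record):
    """Store a record, creating one node per character where needed."""
    node = root
    for ch in record:
        idx = TO_INDEX[ch]
        if node.children[idx] is None:
            node.children[idx] = Node(idx)
        node = node.children[idx]


def exists(root, record):
    """Return True if every character finds matching node data in order."""
    node = root
    for ch in record:
        idx = TO_INDEX.get(ch)           # S202: convert by the mapping rule
        if idx is None:
            return False
        child = node.children[idx]       # S204: subscript into pointer array
        if child is None or child.value != idx:
            return False                 # no matching node: not a duplicate
        node = child                     # S206: continue from this node
    return True


root = Node()
insert(root, "11010519900101123X")       # made-up 18-character sample value
```

With records of a fixed length, as in the 18-character example, a lookup performs at most one pointer-array fetch per character. The sketch does not mark where a record ends, so a strict prefix of a stored record would also be reported as present; records of varying length would need an additional end marker, which is an extension not described in the text.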

In the data duplication checking method, a single character of the data to be checked is taken out and used as a subscript of a pointer array to acquire node data from a node, and whether the data to be checked may exist is judged according to whether the character matches the node data. If they match, the next character of the data to be checked is taken out and queried, and the position of the next character query is determined according to the subscript of the pointer array of the previous character node, which improves the query speed.

In some embodiments of the present disclosure, a character range is determined according to the to-be-checked duplication data, and the characters in the character range are ordered, so that the ordered characters are converted into data in a preset format according to the mapping rule, and are stored in an array.

The data to be checked may include an identification card number, a license plate number of an automobile, an engine number, and the like. In some embodiments of the present disclosure, the identification card number may be used as the data to be checked, where the identification card number is a combination of numbers and letters, and the letters may be converted into numbers. The array entries for the digits are a[0] = 0, a[1] = 1, ..., a[9] = 9. For the letters, a[10] through a[35] hold the 26 letters a to z in one case, and a[36] through a[61] hold the same 26 letters in the other case. The character at each position can be stored in a linked list structure that is used to search for whether the data to be checked exists, which improves the search speed.
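The construction of the array from a sorted character range can be sketched as below. The function name `build_mapping` and the choice of sorting by code point are assumptions for illustration; under that sort order the uppercase letters come before the lowercase ones, which is one possible ordering and not necessarily the one used in the embodiment.

```python
import string


def build_mapping(characters):
    """Sort the character range and assign each character an array subscript."""
    # Sorting by code point: digits, then uppercase, then lowercase letters.
    array = sorted(set(characters))
    # Reverse lookup: character -> subscript (the preset-format data).
    index_of = {ch: i for i, ch in enumerate(array)}
    return array, index_of


array, index_of = build_mapping(string.digits + string.ascii_letters)
```

Here `array[i]` plays the role of a[i] in the text, and `index_of[ch]` gives the number that a character is converted to before it is used as a pointer-array subscript.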

In some embodiments of the present disclosure, the linked list includes one root node, each root node includes at least one layer of branch nodes, each layer of branch nodes includes at least one branch node, a position of the one branch node corresponds to a position of the one character, and the one layer of branch nodes corresponds to a character matching the number of layers of branch nodes.

The node data stored in the pointer array of the root node corresponds to the first character. The linked list includes a root node, the root node includes at least one layer of branch nodes, and each layer of branch nodes corresponds to the character whose position matches the layer number. For example, if the data to be checked is an identification card number, which is generally 18 characters long, the linked list structure may include 18 layers of branch nodes, and the node data contained in each layer corresponds to the character at that position in the identification card number. One layer may include a plurality of branch nodes. Taking the identification card number as an example, the branch nodes under one node may number up to 62, one for each array subscript from 0 to 61.

In some embodiments of the present disclosure, the order of the duplicate data to be checked is associated with a linked list structure.

The linked list structure can be used to store the character at each position. The linked list may include multiple layers of branch nodes, the front-to-back relation of the linked list represents the character order of the identification card number, and the depth of the linked list represents the length of the identification card number. A node in the linked list may have several next directions, which represent the characters that may appear at the next position of the identification card number. Node data can be obtained from the subscript of the pointer array, and the position to which the node points can be determined from the node data, which improves duplicate checking efficiency.

In some embodiments of the present disclosure, if the characters of the data to be checked, from the first position up to a given position, are the same as the corresponding characters of already-checked data, the data to be checked and the checked data share the same link for those characters.

During duplicate checking, some characters of the data to be checked may be the same as the corresponding characters of already-checked data. If these shared characters include the first character, the link of the checked data can be acquired, the last of the shared characters obtained, and the node at which that last character is located found in the link of the checked data. The search for the data to be checked can then continue from that node position, which improves duplicate checking efficiency.

If the shared characters do not include the first character, the link of the data to be checked partially overlaps the link of the checked data in the linked list. The first of the shared characters can be obtained and its position in the link found, so that the existing link for the shared characters can be reused, which reduces search time. The last of the shared characters is then obtained and its position in the linked list found, and the search for the data to be checked continues from that position, which improves search efficiency.
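The sharing of a link between records with a common leading part can be sketched as follows. For brevity the sketch assumes digit-only records and a pointer array of size 10; the names `Node`, `insert`, and `shared_prefix_length` and the sample values are hypothetical. It covers only the case in which the shared characters include the first character.

```python
class Node:
    def __init__(self):
        # Assumption: digit-only records, so the pointer array has 10 entries.
        self.children = [None] * 10


def insert(root, record):
    """Insert a record and return how many new nodes had to be created."""
    created = 0
    node = root
    for ch in record:
        idx = int(ch)
        if node.children[idx] is None:
            node.children[idx] = Node()
            created += 1
        node = node.children[idx]
    return created


def shared_prefix_length(root, record):
    """Count the leading characters of record already present as a link."""
    node, depth = root, 0
    for ch in record:
        child = node.children[int(ch)]
        if child is None:
            break
        node, depth = child, depth + 1
    return depth
```

For example, after storing "110105", the record "110199" finds its first four characters on the existing link, so only the nodes for the last two characters are new and the search for the remainder starts from the node of the fourth character.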

It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the disclosure also provides a data duplicate checking device for implementing the above related duplicate checking method. The implementation scheme of the device for solving the problem is similar to that described in the above method, so the specific limitation in the embodiment of the data duplication checking device provided below can be referred to the limitation of the data duplication checking method hereinabove, and will not be repeated here.

The apparatus may comprise a system (including a distributed system), software (applications), modules, components, servers, clients, etc. that employ the methods described in the embodiments of the present specification, in combination with the necessary hardware. Based on the same inventive concept, embodiments of the present disclosure provide devices in one or more embodiments as described in the following examples. Because the way the device solves the problem is similar to that of the method, the implementation of the device in the embodiments of the present disclosure may refer to the implementation of the foregoing method, and repeated details are omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.

In one embodiment, as shown in fig. 4, a data duplication checking apparatus is provided, where the apparatus may be the foregoing server, or a module, an assembly, a device, a unit, etc. integrated with the server. The apparatus may include:

a single character query 40 is performed as follows:

the conversion module 402 is configured to obtain a single character in the duplication data to be checked, and convert the character into data in a preset format according to a mapping rule;

the query module 404 is configured to obtain node data from a node by using the preset format data as a subscript of a pointer array, and complete the query of the single character, where the node information includes the preset format data, an address of the preset format data, and the pointer array, and the preset format data is the same as the node data;

and the confirmation module 406 is configured to, after the query of the single character is completed, take out a next character in the data to be queried, execute the character query, where the position of the next character query is determined according to the subscript of the pointer array of the previous character node, until the last character in the data to be queried is completed, and return the query result.

In one embodiment, a character range is determined according to the to-be-checked data, the characters in the character range are ordered, the ordered characters are converted into data in a preset format according to the mapping rule, and the data are stored in an array.

The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.

The above modules in the data duplication checking device may be implemented in whole or in part by software, hardware, or combinations thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing the data to be checked. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data duplication checking method.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a data duplication checking method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the structures shown in fig. 5 and 6 are merely block diagrams of partial structures associated with the disclosed aspects and do not constitute a limitation of the computer device on which the disclosed aspects may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, implements the method of any of the embodiments of the present disclosure.

In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method described in any of the embodiments of the present disclosure.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided by the present disclosure may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can take a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), and the like. The databases referred to in the various embodiments provided by the present disclosure may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors involved in the embodiments provided by the present disclosure may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic, quantum computing-based data processing logic, etc., without limitation thereto.

The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

The foregoing embodiments represent only a few implementations of the present disclosure. They are described specifically and in detail, but are not to be construed as limiting the scope of the present disclosure. It should be noted that those skilled in the art can make variations and modifications without departing from the spirit of the disclosure, and all of these fall within the scope of the disclosure. Accordingly, the scope of the present disclosure should be determined by the following claims.

Claims

1. A data duplication checking method, the method comprising:

the following single character queries are performed:

2. The method of claim 1, wherein a character range is determined according to the data to be checked, the characters in the character range are ordered, the ordered characters are converted into data in a preset format according to the mapping rule, and the data are stored in an array.

3. The method of claim 1, wherein the order of the data to be checked is associated with a linked list structure.

4. The method of claim 3, wherein the linked list comprises a root node, the root node comprising at least one tier of branch nodes, each tier of branch nodes comprising at least one branch node, the position of one branch node corresponding to the position of one character, and the number of characters matching the number of tiers of branch nodes.

5. The method of claim 1, wherein if the characters from the first position up to a given position of the data to be checked are identical to the corresponding characters of already-checked data, the identical characters of the data to be checked and of the already-checked data occupy the same link.

6. A data duplication checking apparatus, the apparatus comprising:

the following single character queries are performed:

the query module is used for taking the preset format data as the subscript of the pointer array to acquire node data from the nodes, so as to complete the query of the single character, wherein the node information comprises the preset format data, the address of the preset format data and the pointer array, and the preset format data is the same as the node data;

7. The apparatus of claim 6, wherein a character range is determined according to the data to be checked, the characters in the character range are ordered, the ordered characters are converted into data in a preset format according to the mapping rule, and the data are stored in an array.

8. The apparatus of claim 6, wherein the order of the data to be checked is associated with a linked list structure.

9. The apparatus of claim 8, wherein the linked list comprises a root node, the root node comprising at least one tier of branch nodes, each tier of branch nodes comprising at least one branch node, the position of one branch node corresponding to the position of one character, and the number of characters matching the number of tiers of branch nodes.

10. The apparatus of claim 6, wherein if the characters from the first position up to a given position of the data to be checked are identical to the corresponding characters of already-checked data, the identical characters of the data to be checked and of the already-checked data occupy the same link.

11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.

12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
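
By way of illustration only, and not as part of the claims, the single-character query recited above can be sketched as a prefix tree in which each node holds a pointer array indexed by the mapped character. The sketch below is one possible reading under stated assumptions, not the applicant's implementation: the names `build_mapping`, `Node`, `DuplicateChecker`, and `seen_before` are hypothetical, the preset format is assumed to be a small integer index, and the `is_end` flag marking a complete record is an assumption that the source text does not mention.

```python
from typing import Dict, List, Optional


def build_mapping(samples: List[str]) -> Dict[str, int]:
    """Determine the character range, sort it, and map each character to an index."""
    charset = sorted({ch for s in samples for ch in s})
    return {ch: i for i, ch in enumerate(charset)}


class Node:
    def __init__(self, value: Optional[int], width: int) -> None:
        self.value = value  # preset-format data for this character (None at the root)
        self.children: List[Optional["Node"]] = [None] * width  # pointer array
        self.is_end = False  # assumption: marks the end of a stored record


class DuplicateChecker:
    def __init__(self, mapping: Dict[str, int]) -> None:
        self.mapping = mapping
        self.root = Node(None, len(mapping))

    def seen_before(self, record: str) -> bool:
        """Return True if record was already stored; otherwise store it and return False."""
        node = self.root
        for ch in record:
            idx = self.mapping[ch]      # character -> preset-format data
            child = node.children[idx]  # preset-format data used as the array subscript
            if child is None:
                # No stored record has this prefix yet, so a new branch node is created.
                child = Node(idx, len(self.mapping))
                node.children[idx] = child
            node = child                # the next character is looked up from this node
        duplicate = node.is_end
        node.is_end = True
        return duplicate


# Hypothetical sample data: two records share the prefix "AB12", one repeats exactly.
plates = ["AB123", "AB124", "AB123"]
checker = DuplicateChecker(build_mapping(plates))
results = [checker.seen_before(p) for p in plates]
```

In this sketch each record is checked in a number of steps equal to its length, regardless of how many records have already been stored, and records with a common prefix share the same chain of nodes. A character outside the mapped range raises `KeyError`; how such characters should be handled is not specified in the source.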