CN116501781B

CN116501781B - Data rapid statistical method for enhanced prefix tree

Info

Publication number: CN116501781B
Application number: CN202310768136.6A
Authority: CN
Inventors: 余志淼
Original assignee: Zhongbo Information Technology Research Institute Co ltd
Current assignee: Zhongbo Information Technology Research Institute Co ltd
Priority date: 2023-06-28
Filing date: 2023-06-28
Publication date: 2023-09-12
Anticipated expiration: 2043-06-28
Also published as: CN116501781A

Abstract

The invention relates to the technical field of data processing, and in particular to a rapid data statistics method based on an enhanced prefix tree. The enhanced prefix tree comprises a root node, a plurality of branch nodes and leaf nodes, wherein each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer; arranging all characters on the path from the root node to a leaf node in order from top to bottom yields the path character string of that leaf node; the left pointer of a leaf node points to the leaf node on its left, the right pointer points to the leaf node on its right, the left pointer of the leftmost leaf node points to null, and the right pointer of the rightmost leaf node points to null. The invention meets the need for rapid data statistics in different service scenarios and helps reduce the construction difficulty and cost of an information system.

Description

Data rapid statistical method for enhanced prefix tree

Technical Field

The invention relates to the technical field of data processing, in particular to a data rapid statistical method for an enhanced prefix tree.

Background

With the advent of the big data era, real-time statistical analysis and processing of data in information systems is becoming increasingly common. Depending on the service scenario, it is often necessary to perform statistical analysis on massive data, to reduce duplicate storage by de-duplicating large file data, or to perform idempotency checks on high-frequency network traffic in order to prevent repeated submission and network attacks.

Existing data statistics techniques struggle to meet these requirements simultaneously, so different big data frameworks have to be combined, which increases the complexity of designing and implementing the information system as well as its construction difficulty and operation and maintenance cost.

Disclosure of Invention

The invention provides a data rapid statistical method for enhancing prefix trees, which can meet the requirements of rapid statistics of data in different service scenes and is beneficial to reducing the construction difficulty and cost of an informatization system.

To achieve the purpose of the invention, the following technical scheme is adopted: a rapid data statistics method based on an enhanced prefix tree, wherein the enhanced prefix tree comprises a root node, a plurality of branch nodes and leaf nodes; each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer; arranging all characters on the path from the root node to a leaf node in order from top to bottom yields the path character string of that leaf node; the left pointer of a leaf node points to the leaf node on its left, the right pointer points to the leaf node on its right, the left pointer of the leftmost leaf node points to null, and the right pointer of the rightmost leaf node points to null;

the data rapid statistical method comprises the following steps:

step S1: converting the input data content into a character string with a fixed length;

step S2: searching leaf nodes where the path character string matched with the character string is located in the enhanced prefix tree, if the leaf nodes can be found, executing the step S3, otherwise, executing the step S4;

step S3: adding 1 to the statistical value of the searched leaf nodes;

step S4: creating a path matching the character string, and a branch node and a leaf node passing through the path, and setting the statistical value of the created leaf node to be 1;

step S5: traversing all the leaf nodes by using the left and right pointers of the leaf nodes to obtain the ordered character strings and their statistic values.

As an optimization scheme of the present invention, in step S1, the input data content is converted into a character string with a fixed length, and the specific method is as follows:

if numbers are being counted, a pre-zero-filling method is used to obtain a fixed-length numeric character string; if words or short character strings are being counted, a post-space filling method is used to obtain a fixed-length character string; where needed, a hash algorithm is used to obtain a fixed-length hash character string.
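The three conversions can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the helper names `pad_number`, `pad_word` and `hash_content` are hypothetical, and SHA-256 is an assumed choice, since the text does not name a specific hash algorithm.

```python
import hashlib


def pad_number(value: int, width: int) -> str:
    """Pre-zero-filling: left-pad a non-negative integer with zeros."""
    text = str(value)
    if value < 0 or len(text) > width:
        raise ValueError("value does not fit in the fixed width")
    return text.zfill(width)


def pad_word(word: str, width: int) -> str:
    """Post-space filling: right-pad a short string with spaces."""
    if len(word) > width:
        raise ValueError("word is longer than the fixed width")
    return word.ljust(width)


def hash_content(data: bytes) -> str:
    """Hashing: map arbitrary content to a fixed-length hex string.

    SHA-256 is an assumption; its hex digest is always 64 characters long.
    """
    return hashlib.sha256(data).hexdigest()


print(pad_number(42, 5))                  # 00042
print(repr(pad_word("bx", 3)))            # 'bx '
print(len(hash_content(b"any payload")))  # 64
```

Zero-padding on the left keeps numeric order consistent with character order (for example "00042" sorts before "00100"), which is what allows a left-to-right leaf traversal to return numbers in ascending order.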

As an optimization scheme of the present invention, in step S4, newly created paths are arranged from left to right in the order of characters from small to large.

As an optimization scheme of the present invention, in step S4, the left pointer of the newly created leaf node points to its left leaf node, the right pointer points to its right leaf node, the left pointer of the leftmost leaf node points to null, and the right pointer of the rightmost leaf node points to null.

In step S5, starting from the leftmost leaf node, traversing all leaf nodes by using the right pointer of each leaf node to obtain a statistic value from small to large according to characters; starting from the rightmost leaf node, traversing all leaf nodes by using the left pointer of each leaf node can obtain the statistics value from large to small according to the characters.
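Both traversal directions can be sketched with two small generators. The `Leaf` class, the function names and the sample keys and counts below are hypothetical, chosen only to show the pointer walk.

```python
from typing import Iterator, List, Optional, Tuple


class Leaf:
    def __init__(self, path: str, count: int) -> None:
        self.path = path
        self.count = count
        self.left: Optional["Leaf"] = None
        self.right: Optional["Leaf"] = None


def chain(pairs: List[Tuple[str, int]]) -> Tuple[Optional[Leaf], Optional[Leaf]]:
    """Link leaves given in ascending path order; return (leftmost, rightmost)."""
    leaves = [Leaf(path, count) for path, count in pairs]
    for a, b in zip(leaves, leaves[1:]):
        a.right, b.left = b, a
    return (leaves[0], leaves[-1]) if leaves else (None, None)


def ascending(leftmost: Optional[Leaf]) -> Iterator[Tuple[str, int]]:
    """Follow right pointers from the leftmost leaf: small to large."""
    leaf = leftmost
    while leaf is not None:
        yield leaf.path, leaf.count
        leaf = leaf.right


def descending(rightmost: Optional[Leaf]) -> Iterator[Tuple[str, int]]:
    """Follow left pointers from the rightmost leaf: large to small."""
    leaf = rightmost
    while leaf is not None:
        yield leaf.path, leaf.count
        leaf = leaf.left


leftmost, rightmost = chain([("001", 4), ("002", 1), ("010", 7)])
print(list(ascending(leftmost)))    # [('001', 4), ('002', 1), ('010', 7)]
print(list(descending(rightmost)))  # [('010', 7), ('002', 1), ('001', 4)]
```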

The invention has the following positive effects: 1) the invention adds left and right pointers to the leaf nodes and orders the paths, which enhances the capability of the prefix tree; because all leaf nodes are at the same level, the statistical results can be traversed quickly, which simplifies the program and in turn reduces the complexity and construction cost of systems using the method;

2) the invention meets the need for rapid statistics over large amounts of data in different service scenarios and helps reduce the construction difficulty and cost of an information system; applied to idempotency checking of high-frequency network traffic, it can prevent repeated submission and network attacks.

Drawings

For a clearer description of the technical solutions of embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and should not be considered limiting in scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art, wherein:

FIG. 1 is a schematic block diagram of an enhanced prefix tree of the present invention;

FIG. 2 is a schematic diagram of the connection of leaf nodes of the present invention;

FIG. 3 is a schematic flow chart of the method of the present invention;

FIG. 4 is a schematic structural diagram of the new enhanced prefix tree obtained in step S4 of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

In order to facilitate understanding of the embodiments of the present invention, first, the enhanced prefix tree in the embodiments of the present invention is described as follows:

the enhanced prefix tree includes a root node, a plurality of branch nodes and leaf nodes, wherein all leaf nodes are at the same level; each path from the root node or a branch node to one of its child nodes (which may be branch nodes or leaf nodes) carries one character, and the paths are arranged from left to right in ascending character order.

By way of example, an exemplary enhanced prefix tree structure is presented, as shown in fig. 1. In fig. 1, there are 1 root node, 7 branch nodes, and 9 leaf nodes.

Specifically, in the enhanced prefix tree provided by the embodiment of the invention, a leaf node is composed of a path character string, a statistic value, a left pointer and a right pointer, all characters on a path from a root node to the leaf node are arranged according to the sequence from top to bottom, and the obtained character string is the path character string of the leaf node; the left pointer of a leaf node points to its left leaf node, the right pointer points to its right leaf node, the left pointer of the leftmost leaf node points to null, and the right pointer of the rightmost leaf node points to null.

Illustratively, given the above example, the specific structures of leaf node 1, leaf node 2 and leaf node 9 in FIG. 1 are shown in FIG. 2. In fig. 1, the path from the root node to leaf node 1 is: branch node 1 → branch node 4 → leaf node 1, and arranging all characters on the path in order from top to bottom gives the character string "add"; the path from the root node to leaf node 2 is: branch node 1 → branch node 4 → leaf node 2, and arranging all characters on the path in order from top to bottom gives the character string "adg"; the path from the root node to leaf node 9 is: branch node 3 → branch node 7 → leaf node 9, and arranging all characters on the path in order from top to bottom gives the character string "ecm". As shown in fig. 2, the path character string of leaf node 1 is "add", that of leaf node 2 is "adg", and that of leaf node 9 is "ecm". Also as shown in fig. 2, the left pointer of leaf node 1 points to null and its right pointer points to leaf node 2; the left pointer of leaf node 2 points to leaf node 1 and its right pointer points to leaf node 3; the left pointer of leaf node 9 points to leaf node 8 and its right pointer points to null.
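The leaf layout described for fig. 2 can be sketched as a small record type. The class name `LeafNode`, its field names and the helper `link_leaves` are hypothetical; the nine path strings are those listed for the existing leaves in the embodiment, and the statistic values are left at 0 because only the pointers are being illustrated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # compare leaves by identity, not by field values
class LeafNode:
    path: str                           # path character string
    count: int = 0                      # statistic value
    left: Optional["LeafNode"] = None   # neighbouring leaf on the left
    right: Optional["LeafNode"] = None  # neighbouring leaf on the right


def link_leaves(paths: List[str]) -> List[LeafNode]:
    """Create leaves for already-sorted paths and chain them left to right."""
    leaves = [LeafNode(p) for p in paths]
    for a, b in zip(leaves, leaves[1:]):
        a.right, b.left = b, a
    return leaves


leaves = link_leaves(
    ["add", "adg", "bav", "bgb", "bgc", "ech", "ecj", "eck", "ecm"]
)
leaf1, leaf2, leaf9 = leaves[0], leaves[1], leaves[8]
print(leaf1.left, leaf1.right.path)       # None adg
print(leaf2.left.path, leaf2.right.path)  # add bav
print(leaf9.left.path, leaf9.right)       # eck None
```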

Based on the enhanced prefix tree, the embodiment of the invention provides a data rapid statistical method of the enhanced prefix tree, as shown in fig. 3.

Referring to fig. 3, a method for quickly counting data of an enhanced prefix tree according to an embodiment of the present invention includes the following steps:

step S1: according to the actual business requirement, the input data content is converted into a character string with fixed length.

Illustratively, assume that the current prefix tree state is as shown in fig. 1 and that the statistic values of the leaf nodes are, in order from left to right: 7, 1, 3, 1, 6, 3, 5, 2, where the value in brackets at each leaf node in fig. 1 is the statistic value of that leaf node. The character strings adg, bxx and bx are now input. Applying the post-space filling method yields character strings with a fixed length of 3: adg, bxx and bx≡, where "≡" represents the padding space.

Step S2: search the prefix tree for the leaf node whose path character string matches the character string; if such a leaf node is found, execute step S3, otherwise execute step S4.

Step S3: add 1 to the statistic value of the found leaf node.

Illustratively, assume that the character string being searched for is adg. Leaf node 2 can be found in FIG. 1, so 1 is added to the statistic value of leaf node 2.

Step S4: create a path matching the character string, together with the branch nodes and the leaf node along that path; set the statistic value of the new leaf node to 1; point its left and right pointers to the leaf nodes on its left and right; and update the pointers of those left and right leaf nodes to point to the new leaf node.

Illustratively, assume that the character strings being searched for are bxx and bx≡. No matching leaf node can be found in fig. 1, so new leaf nodes are created, giving the prefix tree shown in fig. 4. Compared with fig. 1, branch node 8, leaf node 9 and leaf node 10 are newly created. The left pointer of leaf node 10 points to leaf node 5 and its right pointer points to leaf node 9. The left pointer of leaf node 9 points to leaf node 10 and its right pointer points to leaf node 6. The right pointer of leaf node 5 is modified to point to leaf node 10, and the left pointer of leaf node 6 is modified to point to leaf node 9.
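The pointer updates of step S4 amount to an insertion into a doubly linked list. The sketch below assumes the left and right neighbours have already been located; `Leaf` and `splice` are hypothetical names, the counts of the two existing leaves are placeholders, and a real space character stands in for the padding symbol.

```python
from typing import Optional


class Leaf:
    def __init__(self, path: str, count: int = 1) -> None:
        self.path = path
        self.count = count  # a newly created leaf starts at 1
        self.left: Optional["Leaf"] = None
        self.right: Optional["Leaf"] = None


def splice(new: Leaf, left: Optional[Leaf], right: Optional[Leaf]) -> None:
    """Insert `new` between two adjacent leaves (either may be None)."""
    new.left, new.right = left, right
    if left is not None:
        left.right = new
    if right is not None:
        right.left = new


# Existing neighbours: "bgc" and "ech" (fifth and sixth strings in order).
bgc, ech = Leaf("bgc"), Leaf("ech")
bgc.right, ech.left = ech, bgc

bxx = Leaf("bxx")
splice(bxx, bgc, ech)       # bgc <-> bxx <-> ech
bx_space = Leaf("bx ")
splice(bx_space, bgc, bxx)  # bgc <-> "bx " <-> bxx <-> ech

leaf, order = bgc, []
while leaf is not None:
    order.append(leaf.path)
    leaf = leaf.right
print(order)  # ['bgc', 'bx ', 'bxx', 'ech']
```

The space character sorts before "x", which is why the space-padded string lands to the left of "bxx".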

Step S5: traverse all the leaf nodes using their left and right pointers to obtain the ordered character strings and their statistic values. A statistic value is the number of times the corresponding character string has been input.

Illustratively, traversing all leaf nodes shown in fig. 4 from left to right yields the ordered character strings add, adg, bav, bgb, bgc, bx≡, bxx, ech, ecj, eck, ecm, with the statistic values 7, 2, 3, 1, 6, 1, 3, 5, 2, where the value in brackets at each leaf node in fig. 4 is the statistic value of that leaf node.
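Steps S2 to S5 can be combined into one compact sketch. This is an illustration under stated assumptions, not the patented implementation: the names `Node`, `EnhancedTrie`, `add` and `items` are hypothetical, child edges are kept in sorted lists searched with `bisect_left`, and the tree is built from empty, so the counts differ from the values in the figures. A real space character stands in for the padding symbol.

```python
from bisect import bisect_left
from typing import Iterator, List, Optional, Tuple


class Node:
    """Root, branch or leaf; the leaf-only fields stay unused on other nodes."""

    def __init__(self) -> None:
        self.keys: List[str] = []         # edge characters, kept sorted
        self.children: List["Node"] = []  # child nodes, parallel to keys
        self.path: Optional[str] = None   # leaf: path character string
        self.count = 0                    # leaf: statistic value
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


class EnhancedTrie:
    def __init__(self, width: int) -> None:
        if width < 1:
            raise ValueError("width must be at least 1")
        self.width = width
        self.root = Node()
        self.leftmost: Optional[Node] = None
        self.rightmost: Optional[Node] = None

    def add(self, key: str) -> int:
        """Steps S2 to S4 for one fixed-length key; returns its new count."""
        if len(key) != self.width:
            raise ValueError("key must already be padded to the fixed width")
        node, pred, created = self.root, None, False
        for ch in key:
            i = bisect_left(node.keys, ch)
            if i > 0:
                pred = node.children[i - 1]  # nearest subtree to the left
            if i < len(node.keys) and node.keys[i] == ch:
                node = node.children[i]      # S2: follow the existing edge
            else:
                child = Node()               # S4: create the missing edge
                node.keys.insert(i, ch)
                node.children.insert(i, child)
                node, created = child, True
        if not created:                      # S3: the leaf already exists
            node.count += 1
            return node.count
        node.path, node.count = key, 1
        left = pred
        while left is not None and left.children:
            left = left.children[-1]         # rightmost leaf of that subtree
        right = left.right if left is not None else self.leftmost
        node.left, node.right = left, right
        if left is not None:
            left.right = node
        else:
            self.leftmost = node
        if right is not None:
            right.left = node
        else:
            self.rightmost = node
        return 1

    def items(self, reverse: bool = False) -> Iterator[Tuple[str, int]]:
        """S5: walk the leaf chain in ascending (or descending) order."""
        leaf = self.rightmost if reverse else self.leftmost
        while leaf is not None:
            yield leaf.path, leaf.count
            leaf = leaf.left if reverse else leaf.right


trie = EnhancedTrie(3)
for s in ["add", "adg", "bav", "bgb", "bgc", "ech", "ecj", "eck", "ecm"]:
    trie.add(s)
for s in ["adg", "bxx", "bx "]:
    trie.add(s)
print([path for path, _ in trie.items()])
# ['add', 'adg', 'bav', 'bgb', 'bgc', 'bx ', 'bxx', 'ech', 'ecj', 'eck', 'ecm']
print(dict(trie.items())["adg"])  # 2
```

The predecessor leaf is found by remembering the deepest left sibling seen on the way down and then descending along its rightmost edges; the successor is simply that leaf's old right neighbour, so no second search is needed.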

The foregoing is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical scheme and inventive concept of the present invention, shall fall within the scope of protection of the present invention.

Claims

1. A rapid data statistics method based on an enhanced prefix tree, characterized in that: the enhanced prefix tree comprises a root node, a plurality of branch nodes and leaf nodes, wherein each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer; arranging all characters on the path from the root node to a leaf node in order from top to bottom yields the path character string of that leaf node; the left pointer of a leaf node points to the leaf node on its left, the right pointer points to the leaf node on its right, the left pointer of the leftmost leaf node points to null, and the right pointer of the rightmost leaf node points to null;

the data rapid statistical method comprises the following steps:

step S1: converting the input data content into a character string with a fixed length;

step S3: adding 1 to the statistical value of the searched leaf nodes;

2. The method for quickly counting data of an enhanced prefix tree according to claim 1, wherein: in step S1, the input data content is converted into a character string with a fixed length, and the specific method is as follows:

3. The method for quickly counting data of an enhanced prefix tree according to claim 1, wherein: in step S4, the newly created paths are arranged from left to right in the order of characters from small to large.

4. The method for quickly counting data of an enhanced prefix tree according to claim 1, wherein: in step S4, the left pointer of the newly created leaf node points to its left leaf node, the right pointer points to its right leaf node, the left pointer of the leftmost leaf node points to null, and the right pointer of the rightmost leaf node points to null.

5. The method for quickly counting data of an enhanced prefix tree according to claim 1, wherein: in step S5, starting from the leftmost leaf node, traversing all leaf nodes by using the right pointer of each leaf node to obtain a statistic value from small to large according to characters; starting from the rightmost leaf node, traversing all leaf nodes by using the left pointer of each leaf node can obtain the statistics value from large to small according to the characters.