CN116501781B - Data rapid statistical method for enhanced prefix tree - Google Patents
- Publication number
- CN116501781B (application CN202310768136.6A)
- Authority
- CN
- China
- Prior art keywords
- leaf node
- character string
- pointer
- leaf
- points
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Fuzzy Systems (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of data processing, and in particular to a rapid data statistics method using an enhanced prefix tree. The enhanced prefix tree comprises a root node, a plurality of branch nodes and a plurality of leaf nodes. Each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer; the path character string of a leaf node is obtained by concatenating, from top to bottom, all characters on the path from the root node to that leaf node. The left pointer of a leaf node points to the leaf node on its left and the right pointer points to the leaf node on its right; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null. The invention meets the need for rapid data statistics in different service scenarios and helps reduce the construction difficulty and cost of an information system.
Description
Technical Field
The invention relates to the technical field of data processing, and in particular to a rapid data statistics method using an enhanced prefix tree.
Background
With the arrival of the big-data era, real-time statistical analysis and processing of data in information systems is becoming increasingly common. Depending on the service scenario, a system may need to run statistical analysis over massive data, to de-duplicate large file data in order to reduce redundant storage, or to perform idempotency checks on high-frequency network traffic in order to prevent repeated submissions and network attacks.
Existing data statistics techniques have difficulty meeting all of these needs at once, so different big-data frameworks must be combined. This complicates the design and implementation of an information system and raises its construction difficulty and its operation and maintenance costs.
Disclosure of Invention
The invention provides a rapid data statistics method using an enhanced prefix tree, which meets the need for rapid data statistics in different service scenarios and helps reduce the construction difficulty and cost of an information system.
To achieve this purpose, the technical scheme adopted is as follows. In a rapid data statistics method using an enhanced prefix tree, the enhanced prefix tree comprises a root node, a plurality of branch nodes and a plurality of leaf nodes. Each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer; the path character string of a leaf node is obtained by concatenating, from top to bottom, all characters on the path from the root node to that leaf node. The left pointer of a leaf node points to the leaf node on its left and the right pointer points to the leaf node on its right; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null.
The rapid data statistics method comprises the following steps:
Step S1: convert the input data content into a character string of fixed length;
Step S2: search the enhanced prefix tree for the leaf node whose path character string matches the character string; if such a leaf node is found, execute step S3, otherwise execute step S4;
Step S3: add 1 to the statistic value of the leaf node found;
Step S4: create a path matching the character string, together with the branch nodes and the leaf node on that path, and set the statistic value of the newly created leaf node to 1;
Step S5: traverse all leaf nodes using their left and right pointers to obtain the character strings in order together with their statistic values.
As an optimization of the present invention, in step S1 the input data content is converted into a character string of fixed length as follows:
if the data to be counted is numeric, leading-zero padding is used to obtain a numeric character string of fixed length; if the data to be counted is a word or a short character string, trailing-space padding is used to obtain a character string of fixed length; where needed for other content, a hash algorithm is used to obtain a hash character string of fixed length.
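The three conversions in step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and SHA-256 is an assumed choice because the text does not name a hash algorithm.

```python
import hashlib


def pad_number(value: int, width: int) -> str:
    """Leading-zero padding for a non-negative integer (hypothetical helper)."""
    if value < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    text = str(value)
    if len(text) > width:
        raise ValueError("number does not fit in the fixed width")
    return text.zfill(width)


def pad_text(value: str, width: int) -> str:
    """Trailing-space padding for a word or short string (hypothetical helper)."""
    if len(value) > width:
        raise ValueError("string does not fit in the fixed width")
    return value.ljust(width)


def hash_content(data: bytes) -> str:
    """Fixed-length key for arbitrary content; SHA-256 is an assumption."""
    return hashlib.sha256(data).hexdigest()  # always 64 hex characters


number_key = pad_number(42, 6)
word_key = pad_text("bx", 3)
content_key = hash_content(b"large file content")
```

Because every key in one tree must have the same length, a given tree would use only one of these conversions with a single width.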
As an optimization of the present invention, in step S4 newly created paths are arranged from left to right in ascending character order.
As an optimization of the present invention, in step S4 the left pointer of the newly created leaf node points to the leaf node on its left and its right pointer points to the leaf node on its right; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null.
In step S5, starting from the leftmost leaf node and following the right pointer of each leaf node visits all leaf nodes and yields the statistic values in ascending character order; starting from the rightmost leaf node and following the left pointer of each leaf node yields the statistic values in descending character order.
The positive effects of the invention are: 1) the invention adds left and right pointers to the leaf nodes and orders the paths, which enhances the capability of the prefix tree; because all leaf nodes are at the same level, the statistical results can be traversed quickly, which simplifies the program and in turn reduces the complexity and construction cost of a system that uses the method;
2) the invention meets the need for rapid statistics over large amounts of data in different service scenarios and helps reduce the construction difficulty and cost of an information system; when applied to idempotency checks on high-frequency network traffic, it can prevent repeated submissions and network attacks.
Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should not be regarded as limiting its scope; a person skilled in the art can obtain other related drawings from them without inventive effort. In the drawings:
FIG. 1 is a schematic structural diagram of the enhanced prefix tree of the present invention;
FIG. 2 is a schematic diagram of the connections between leaf nodes of the present invention;
FIG. 3 is a schematic flow chart of the method of the present invention;
FIG. 4 is a schematic structural diagram of the new enhanced prefix tree obtained in step S4 of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described here only illustrate the invention and are not intended to limit its scope.
To make the embodiments easier to follow, the enhanced prefix tree used in them is described first:
The enhanced prefix tree includes a root node, a plurality of branch nodes and a plurality of leaf nodes, and all leaf nodes are at the same level. Each edge from the root node or a branch node to one of its child nodes (which may be a branch node or a leaf node) carries one character, and the edges under a node are arranged from left to right in ascending character order.
As an example, FIG. 1 shows an enhanced prefix tree with 1 root node, 7 branch nodes and 9 leaf nodes.
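One way to keep the edges under a node in ascending character order is to store them in a sorted list and locate positions by binary search. The sketch below is only an assumption about representation (the patent does not prescribe one), and `BranchNode`, `child` and `add_child` are hypothetical names.

```python
import bisect


class BranchNode:
    """Root or branch node whose outgoing edges stay sorted by character."""

    def __init__(self):
        self.chars = []     # edge characters, always in ascending order
        self.children = []  # children[i] is reached through chars[i]

    def _index(self, ch):
        # Position where ch is, or where it would be inserted.
        return bisect.bisect_left(self.chars, ch)

    def child(self, ch):
        """Return the child reached through ch, or None if there is no such edge."""
        i = self._index(ch)
        if i < len(self.chars) and self.chars[i] == ch:
            return self.children[i]
        return None

    def add_child(self, ch, node):
        """Insert an edge for ch at its sorted position; keep an existing edge."""
        i = self._index(ch)
        if i < len(self.chars) and self.chars[i] == ch:
            return self.children[i]
        self.chars.insert(i, ch)
        self.children.insert(i, node)
        return node


root = BranchNode()
for ch in "eab":          # edges arrive out of order
    root.add_child(ch, BranchNode())
print(root.chars)         # ['a', 'b', 'e']
```

A plain dictionary with `min`/`max` over its keys would also work for small alphabets; the sorted list simply makes the left-to-right order explicit.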
Specifically, in the enhanced prefix tree provided by the embodiment of the invention, each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer. The path character string of a leaf node is obtained by concatenating, from top to bottom, all characters on the path from the root node to that leaf node. The left pointer of a leaf node points to the leaf node on its left and the right pointer points to the leaf node on its right; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null.
Continuing the example, FIG. 2 shows the structure of leaf node 1, leaf node 2 and leaf node 9 of FIG. 1. In FIG. 1, the path from the root node to leaf node 1 is branch node 1 → branch node 4 → leaf node 1, and concatenating the characters on this path from top to bottom gives the character string "add". The path from the root node to leaf node 2 is branch node 1 → branch node 4 → leaf node 2, which gives the character string "adg". The path from the root node to leaf node 9 is branch node 3 → branch node 7 → leaf node 9, which gives the character string "ecm". Accordingly, as shown in FIG. 2, the path character string of leaf node 1 is "add", that of leaf node 2 is "adg", and that of leaf node 9 is "ecm". Also as shown in FIG. 2, the left pointer of leaf node 1 is null and its right pointer points to leaf node 2; the left pointer of leaf node 2 points to leaf node 1 and its right pointer points to leaf node 3; the left pointer of leaf node 9 points to leaf node 8 and its right pointer is null.
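The leaf-node layout described above (path character string, statistic value, left pointer, right pointer) can be sketched as a small record type. The class and function names are hypothetical, and the three paths are used only to show how neighbouring leaves reference each other.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through the pointers
class LeafNode:
    path: str                           # characters from root to leaf, top to bottom
    count: int = 0                      # statistic value
    left: Optional["LeafNode"] = None   # leaf node on the left, None if leftmost
    right: Optional["LeafNode"] = None  # leaf node on the right, None if rightmost


def link(leaves: List[LeafNode]) -> List[LeafNode]:
    """Chain already-ordered leaves so that each one knows both neighbours."""
    for a, b in zip(leaves, leaves[1:]):
        a.right = b
        b.left = a
    return leaves


leaves = link([LeafNode("add"), LeafNode("adg"), LeafNode("bav")])
first, second, last = leaves
```

Here `first.left` and `last.right` stay `None`, matching the null pointers of the leftmost and rightmost leaf nodes.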
Based on this enhanced prefix tree, the embodiment of the invention provides a rapid data statistics method, shown in FIG. 3.
Referring to FIG. 3, the rapid data statistics method using the enhanced prefix tree according to the embodiment of the present invention includes the following steps:
Step S1: according to the actual business requirement, convert the input data content into a character string of fixed length.
Illustratively, assume the current prefix tree is as shown in FIG. 1 and the statistic values of the leaf nodes, from left to right, are: 7, 1, 3, 1, 6, 3, 5, 2; the value in brackets next to each leaf node in FIG. 1 is that leaf node's statistic value. The strings adg, bxx and bx are now input. Trailing-space padding yields fixed-length strings of 3 characters: adg, bxx and bx␣, where "␣" represents a space.
Step S2: search the prefix tree for the leaf node whose path character string matches the character string; if such a leaf node is found, execute step S3, otherwise execute step S4.
Step S3: add 1 to the statistic value of the leaf node found.
Illustratively, suppose the string searched for is adg. Leaf node 2 is found in FIG. 1, so 1 is added to the statistic value of leaf node 2.
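Steps S2 and S3 amount to following one edge per character and incrementing the statistic value when a leaf is reached. The sketch below assumes a dictionary of children per branch node; `Branch`, `Leaf` and `find_leaf` are hypothetical names, and only the "add"/"adg" corner of FIG. 1 is built, using the first two statistic values from the example.

```python
class Branch:
    def __init__(self):
        self.children = {}  # character -> Branch or Leaf


class Leaf:
    def __init__(self, path, count):
        self.path = path    # path character string
        self.count = count  # statistic value


def find_leaf(root, key):
    """Step S2: return the leaf whose path equals key, or None if no path matches."""
    node = root
    for ch in key:
        if not isinstance(node, Branch) or ch not in node.children:
            return None
        node = node.children[ch]
    return node if isinstance(node, Leaf) else None


# root -> 'a' -> 'd' -> {'d': leaf "add", 'g': leaf "adg"}
root, a, ad = Branch(), Branch(), Branch()
root.children["a"] = a
a.children["d"] = ad
ad.children["d"] = Leaf("add", 7)
ad.children["g"] = Leaf("adg", 1)

leaf = find_leaf(root, "adg")
if leaf is not None:
    leaf.count += 1         # step S3

missing = find_leaf(root, "bxx")  # no such path yet, so step S4 would apply
```

The lookup cost depends only on the fixed string length, not on how many strings the tree already holds.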
Step S4: create a path matching the character string, together with the branch nodes and the leaf node on that path; set the statistic value of the new leaf node to 1; point its left and right pointers to the leaf nodes on its left and right respectively; and update the pointers of those neighbouring leaf nodes so that they point to the new leaf node.
Illustratively, suppose the strings searched for are bxx and bx␣ (bx padded with a trailing space). No matching leaf node can be found in FIG. 1, so new leaf nodes are created, giving the prefix tree shown in FIG. 4. Compared with FIG. 1, branch node 8, leaf node 9 and leaf node 10 are newly created. The left pointer of leaf node 10 points to leaf node 5 and its right pointer points to leaf node 9. The left pointer of leaf node 9 points to leaf node 10 and its right pointer points to leaf node 6. The right pointer of leaf node 5 is modified to point to leaf node 10, and the left pointer of leaf node 6 is modified to point to leaf node 9.
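The pointer updates in step S4 are the usual insertion into a doubly linked list. The sketch below shows only that splice, assuming the left and right neighbours have already been located; `Leaf` and `splice` are hypothetical names, and the paths "bgc" and "ech" stand for the two existing leaves between which the new ones land in the ordered result.

```python
class Leaf:
    def __init__(self, path):
        self.path = path
        self.left = None   # neighbour on the left, None if leftmost
        self.right = None  # neighbour on the right, None if rightmost


def splice(new, left, right):
    """Place `new` between `left` and `right`; either neighbour may be None."""
    new.left, new.right = left, right
    if left is not None:
        left.right = new
    if right is not None:
        right.left = new


# Two existing neighbouring leaves.
bgc, ech = Leaf("bgc"), Leaf("ech")
splice(ech, bgc, None)

# New leaves for "bx " (bx padded with a space) and "bxx".
bx_space, bxx = Leaf("bx "), Leaf("bxx")
splice(bx_space, bgc, ech)   # bgc <-> "bx " <-> ech
splice(bxx, bx_space, ech)   # bgc <-> "bx " <-> bxx <-> ech
```

Each splice touches at most three nodes, so keeping the leaf chain ordered adds constant work once the neighbours are known.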
Step S5: traverse all leaf nodes using their left and right pointers to obtain the character strings in order together with their statistic values. The statistic value is the number of times a given string has been input.
Illustratively, traversing all leaf nodes shown in FIG. 4 from left to right yields the ordered strings add, adg, bav, bgb, bgc, bx␣, bxx, ech, ecj, eck, ecm (bx␣ being bx padded with a trailing space), with statistic values: 7, 2, 3, 1, 6, 1, 3, 5, 2. The value in brackets next to each leaf node in FIG. 4 is that leaf node's statistic value.
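Steps S1 to S5 can be combined into one small class, sketched below under stated assumptions: the patent does not say how the neighbouring leaves are located, so this sketch finds them during the descent from the nearest smaller and larger sibling edges; all names are hypothetical; and each FIG. 1 string is inserted once, so the counts differ from the figure's values except where noted.

```python
class Node:
    def __init__(self, path=""):
        self.children = {}  # character -> Node; empty for a leaf
        self.path = path
        self.count = 0      # statistic value (leaves only)
        self.left = None
        self.right = None


class EnhancedPrefixTree:
    def __init__(self, width):
        self.width = width
        self.root = Node()
        self.leftmost = None
        self.rightmost = None

    @staticmethod
    def _edge_leaf(node, rightmost):
        # Follow the largest (or smallest) edge down to a leaf.
        while node.children:
            ch = max(node.children) if rightmost else min(node.children)
            node = node.children[ch]
        return node

    def add(self, text):
        key = text.ljust(self.width)               # S1: trailing-space padding
        if len(key) != self.width:
            raise ValueError("input longer than the fixed width")
        node, pred, succ = self.root, None, None
        for ch in key:                             # S2/S4: descend, creating nodes
            smaller = [c for c in node.children if c < ch]
            larger = [c for c in node.children if c > ch]
            if smaller:                            # deeper siblings are closer neighbours
                pred = self._edge_leaf(node.children[max(smaller)], True)
            if larger:
                succ = self._edge_leaf(node.children[min(larger)], False)
            node = node.children.setdefault(ch, Node(node.path + ch))
        if node.count == 0:                        # S4: new leaf, splice into the chain
            node.left, node.right = pred, succ
            if pred is None:
                self.leftmost = node
            else:
                pred.right = node
            if succ is None:
                self.rightmost = node
            else:
                succ.left = node
        node.count += 1                            # S3, or count = 1 for a new leaf

    def ascending(self):                           # S5: follow right pointers
        leaf = self.leftmost
        while leaf is not None:
            yield leaf.path, leaf.count
            leaf = leaf.right

    def descending(self):                          # S5: follow left pointers
        leaf = self.rightmost
        while leaf is not None:
            yield leaf.path, leaf.count
            leaf = leaf.left


tree = EnhancedPrefixTree(width=3)
for s in ["ecm", "add", "bgc", "adg", "bav", "eck", "bgb", "ech", "ecj"]:
    tree.add(s)
for s in ["adg", "bxx", "bx"]:                     # the three inputs of the example
    tree.add(s)
ordered = list(tree.ascending())
```

After these inputs, the ascending traversal lists the paths in the same order as the example (add, adg, bav, bgb, bgc, "bx ", bxx, ech, ecj, eck, ecm), and "adg" has a count of 2 because it was input twice.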
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed herein, according to the technical scheme and inventive concept of the present invention, shall fall within the scope of protection of the present invention.
Claims (5)
1. A rapid data statistics method using an enhanced prefix tree, characterized in that: the enhanced prefix tree comprises a root node, a plurality of branch nodes and a plurality of leaf nodes; each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer; the path character string of a leaf node is obtained by concatenating, from top to bottom, all characters on the path from the root node to that leaf node; the left pointer of a leaf node points to the leaf node on its left and the right pointer points to the leaf node on its right; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null;
the rapid data statistics method comprises the following steps:
Step S1: converting the input data content into a character string of fixed length;
Step S2: searching the enhanced prefix tree for the leaf node whose path character string matches the character string; if such a leaf node is found, executing step S3, otherwise executing step S4;
Step S3: adding 1 to the statistic value of the leaf node found;
Step S4: creating a path matching the character string, together with the branch nodes and the leaf node on that path, and setting the statistic value of the newly created leaf node to 1;
Step S5: traversing all leaf nodes using their left and right pointers to obtain the character strings in order together with their statistic values.
2. The rapid data statistics method using an enhanced prefix tree according to claim 1, characterized in that: in step S1, the input data content is converted into a character string of fixed length as follows:
if the data to be counted is numeric, leading-zero padding is used to obtain a numeric character string of fixed length; if the data to be counted is a word or a short character string, trailing-space padding is used to obtain a character string of fixed length; where needed for other content, a hash algorithm is used to obtain a hash character string of fixed length.
3. The rapid data statistics method using an enhanced prefix tree according to claim 1, characterized in that: in step S4, newly created paths are arranged from left to right in ascending character order.
4. The rapid data statistics method using an enhanced prefix tree according to claim 1, characterized in that: in step S4, the left pointer of the newly created leaf node points to the leaf node on its left and its right pointer points to the leaf node on its right; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null.
5. The rapid data statistics method using an enhanced prefix tree according to claim 1, characterized in that: in step S5, starting from the leftmost leaf node and following the right pointer of each leaf node visits all leaf nodes and yields the statistic values in ascending character order; starting from the rightmost leaf node and following the left pointer of each leaf node yields the statistic values in descending character order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310768136.6A CN116501781B (en) | 2023-06-28 | 2023-06-28 | Data rapid statistical method for enhanced prefix tree |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116501781A (en) | 2023-07-28 |
CN116501781B (en) | 2023-09-12 |
Family
ID=87320639
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310768136.6A Active CN116501781B (en) | 2023-06-28 | 2023-06-28 | Data rapid statistical method for enhanced prefix tree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116501781B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104765882A (en) * | 2015-04-29 | 2015-07-08 | 中国互联网络信息中心 | Internet website statistics method based on web page characteristic strings |
CN110851722A (en) * | 2019-11-12 | 2020-02-28 | 腾讯云计算(北京)有限责任公司 | Search processing method, device and equipment based on dictionary tree and storage medium |
CN111382323A (en) * | 2018-12-29 | 2020-07-07 | 贵州白山云科技股份有限公司 | Data retrieval optimization method and device and computer equipment |
CN111597185A (en) * | 2020-04-01 | 2020-08-28 | 深圳英飞拓智能技术有限公司 | Real-time state number rapid statistical method based on tree structure resource distribution |
CN111782892A (en) * | 2020-06-30 | 2020-10-16 | 中国平安人寿保险股份有限公司 | Similar character recognition method, device, apparatus and storage medium based on prefix tree |
CN115269935A (en) * | 2022-07-26 | 2022-11-01 | 北京科能腾达信息技术股份有限公司 | Integrated circuit flattening design character string storage and query system and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11409732B2 (en) * | 2020-01-17 | 2022-08-09 | Intuit Inc. | Computer estimations based on statistical tree structures |
Also Published As
Publication number | Publication date |
---|---|
CN116501781A (en) | 2023-07-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||