CN116501781B - Data rapid statistical method for enhanced prefix tree - Google Patents


Info

Publication number
CN116501781B
Authority
CN
China
Prior art keywords
leaf node
character string
pointer
leaf
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310768136.6A
Other languages
Chinese (zh)
Other versions
CN116501781A (en)
Inventor
余志淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongbo Information Technology Research Institute Co ltd
Original Assignee
Zhongbo Information Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongbo Information Technology Research Institute Co ltd
Priority to CN202310768136.6A
Publication of CN116501781A
Application granted
Publication of CN116501781B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, and in particular to a rapid data statistics method based on an enhanced prefix tree. The enhanced prefix tree comprises a root node, a plurality of branch nodes and a plurality of leaf nodes. Each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer; the path character string of a leaf node is obtained by arranging, from top to bottom, all characters on the path from the root node to that leaf node. The left pointer of a leaf node points to its left neighboring leaf node and the right pointer points to its right neighboring leaf node; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null. The invention can meet the need for rapid data statistics in different service scenarios and helps reduce the construction difficulty and cost of an information system.

Description

Data rapid statistical method for enhanced prefix tree
Technical Field
The invention relates to the technical field of data processing, and in particular to a rapid data statistics method based on an enhanced prefix tree.
Background
With the advent of the big data era, real-time statistical analysis and processing of data in information systems is becoming increasingly common. Depending on the service scenario, a system may need to perform statistical analysis on massive data, deduplicate large file data to reduce redundant storage, or perform idempotency checks on high-frequency network traffic to prevent repeated submissions and network attacks.
Existing data statistics techniques struggle to meet all of these requirements at once, so different big data frameworks have to be combined. This increases the complexity of designing and implementing the information system, as well as its construction difficulty and its operation and maintenance cost.
Disclosure of Invention
The invention provides a rapid data statistics method based on an enhanced prefix tree, which can meet the need for rapid data statistics in different service scenarios and helps reduce the construction difficulty and cost of an information system.
To achieve this purpose, the invention adopts the following technical scheme: a rapid data statistics method based on an enhanced prefix tree, wherein the enhanced prefix tree comprises a root node, a plurality of branch nodes and a plurality of leaf nodes. Each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer; the path character string of a leaf node is obtained by arranging, from top to bottom, all characters on the path from the root node to that leaf node. The left pointer of a leaf node points to its left neighboring leaf node and the right pointer points to its right neighboring leaf node; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null;
The rapid data statistics method comprises the following steps:
Step S1: convert the input data content into a character string of fixed length;
Step S2: search the enhanced prefix tree for the leaf node whose path character string matches the character string; if such a leaf node is found, execute step S3, otherwise execute step S4;
Step S3: add 1 to the statistic value of the leaf node found;
Step S4: create a path matching the character string, together with the branch nodes and the leaf node along that path, and set the statistic value of the newly created leaf node to 1;
Step S5: traverse all leaf nodes using their left and right pointers to obtain the character strings in order together with their statistic values.
As a preferred scheme of the invention, in step S1 the input data content is converted into a fixed-length character string as follows:
if the data to be counted are numbers, they are padded with leading zeros to obtain a fixed-length numeric string; if the data to be counted are words or short character strings, they are padded with trailing spaces to obtain a fixed-length character string; in other cases where statistics are needed, a hash algorithm is used to obtain a fixed-length hash string.
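The three conversions in step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `to_fixed_length`, the `kind` labels, and the choice of SHA-256 (truncated to the requested width) are all assumptions, since the text only says "a hash algorithm". Numbers are assumed to be non-negative integers.

```python
import hashlib


def to_fixed_length(value, width, kind):
    """Hypothetical helper for step S1: return a string of exactly `width` characters."""
    text = str(value)
    if kind == "number":
        out = text.zfill(width)      # pad with leading zeros
    elif kind == "word":
        out = text.ljust(width)      # pad with trailing spaces
    elif kind == "hash":
        # SHA-256 is an assumption; truncating the digest raises the collision risk.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if width > len(digest):
            raise ValueError("width exceeds the digest length")
        return digest[:width]
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    if len(out) != width:
        # Padding never shortens, so the input was longer than the fixed length.
        raise ValueError(f"{text!r} is longer than {width} characters")
    return out
```

For instance, `to_fixed_length("bx", 3, "word")` pads the two-character input with one trailing space, which matches the post-space filling used in the embodiment.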
As a preferred scheme of the invention, in step S4 the newly created paths are arranged from left to right in ascending character order.
As a preferred scheme of the invention, in step S4 the left pointer of the newly created leaf node points to its left neighboring leaf node and its right pointer points to its right neighboring leaf node; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null.
In step S5, traversing all leaf nodes from the leftmost leaf node via the right pointer of each leaf node yields the statistic values in ascending character order; traversing all leaf nodes from the rightmost leaf node via the left pointer of each leaf node yields the statistic values in descending character order.
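The two traversal directions of step S5 can be sketched over a plain doubly linked chain of leaves. The names `Leaf`, `chain`, `ascending` and `descending` are hypothetical, and the chain is built directly from already sorted pairs rather than from a tree, purely to keep the illustration short.

```python
class Leaf:
    """Leaf with the four fields named in the text."""
    __slots__ = ("path", "count", "left", "right")

    def __init__(self, path, count):
        self.path, self.count = path, count
        self.left = self.right = None


def chain(pairs):
    """Link (path, count) pairs that are already sorted by path.

    Returns the leftmost and rightmost leaves.
    """
    head = tail = None
    for path, count in pairs:
        leaf = Leaf(path, count)
        if tail is None:
            head = leaf
        else:
            tail.right, leaf.left = leaf, tail
        tail = leaf
    return head, tail


def ascending(head):
    """Follow right pointers from the leftmost leaf."""
    leaf = head
    while leaf is not None:
        yield leaf.path, leaf.count
        leaf = leaf.right


def descending(tail):
    """Follow left pointers from the rightmost leaf."""
    leaf = tail
    while leaf is not None:
        yield leaf.path, leaf.count
        leaf = leaf.left


head, tail = chain([("add", 7), ("adg", 2), ("bav", 3)])
```

Both generators visit each leaf once and never touch the branch nodes, which is the point of keeping the neighbor pointers on the leaves.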
The invention has the following positive effects: 1) The invention enhances the prefix tree by adding left and right pointers to the leaf nodes and by ordering the paths; because all leaf nodes are at the same level, the statistical results can be traversed quickly, which simplifies the program and in turn reduces the complexity and construction cost of systems that use the method;
2) The invention can meet the need for rapid statistics on large amounts of data in different service scenarios and helps reduce the construction difficulty and cost of an information system; when applied to idempotency checks on high-frequency network traffic, it can prevent repeated submissions and network attacks.
Drawings
To describe the technical solutions of the embodiments of the invention more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings illustrate only some embodiments of the invention and should not be regarded as limiting its scope; a person skilled in the art can obtain other related drawings from them without inventive effort. In the drawings:
FIG. 1 is a schematic block diagram of an enhanced prefix tree of the present invention;
FIG. 2 is a schematic diagram of the connection of leaf nodes of the present invention;
FIG. 3 is a schematic flow chart of the method of the present invention;
FIG. 4 is a schematic structural diagram of the new enhanced prefix tree obtained in step S4 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
To facilitate understanding of the embodiments of the invention, the enhanced prefix tree used in them is described first:
The enhanced prefix tree includes a root node, a plurality of branch nodes and a plurality of leaf nodes, with all leaf nodes at the same level. Each path from the root node or a branch node to one of its child nodes (which may be a branch node or a leaf node) carries one character, and these paths are arranged from left to right in ascending character order.
By way of example, FIG. 1 shows an enhanced prefix tree with 1 root node, 7 branch nodes and 9 leaf nodes.
Specifically, in the enhanced prefix tree provided by the embodiment of the invention, each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer. The path character string of a leaf node is obtained by arranging, from top to bottom, all characters on the path from the root node to that leaf node. The left pointer of a leaf node points to its left neighboring leaf node and the right pointer points to its right neighboring leaf node; the left pointer of the leftmost leaf node and the right pointer of the rightmost leaf node are null.
Continuing the above example, FIG. 2 shows the specific structures of leaf node 1, leaf node 2 and leaf node 9 of FIG. 1. In FIG. 1, the path from the root node to leaf node 1 is: branch node 1 → branch node 4 → leaf node 1, and arranging all characters on this path from top to bottom gives the character string "add"; the path from the root node to leaf node 2 is: branch node 1 → branch node 4 → leaf node 2, which gives the character string "adg"; the path from the root node to leaf node 9 is: branch node 3 → branch node 7 → leaf node 9, which gives the character string "ecm". Accordingly, as shown in FIG. 2, the path character string of leaf node 1 is "add", that of leaf node 2 is "adg", and that of leaf node 9 is "ecm". Also as shown in FIG. 2, the left pointer of leaf node 1 is null and its right pointer points to leaf node 2; the left pointer of leaf node 2 points to leaf node 1 and its right pointer points to leaf node 3; the left pointer of leaf node 9 points to leaf node 8 and its right pointer is null.
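The leaf node layout described above can be modeled as a small record type. The following is a sketch under assumptions: the names `Leaf` and `link` are hypothetical, the nine path strings are those of the FIG. 1 leaves as listed in the embodiment, and the statistic values are left at a placeholder of 1 because only the pointer wiring is being illustrated.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison avoids recursing through neighbors
class Leaf:
    path: str                       # path character string
    count: int = 1                  # statistic value (placeholder here)
    left: Optional["Leaf"] = field(default=None, repr=False)
    right: Optional["Leaf"] = field(default=None, repr=False)


def link(leaves):
    """Wire neighboring leaves together; the two ends keep None."""
    for a, b in zip(leaves, leaves[1:]):
        a.right, b.left = b, a
    return leaves


paths = ["add", "adg", "bav", "bgb", "bgc", "ech", "ecj", "eck", "ecm"]
leaves = link([Leaf(p) for p in paths])
leaf1, leaf2, leaf9 = leaves[0], leaves[1], leaves[8]
```

With this wiring, `leaf1.left` and `leaf9.right` are `None`, while `leaf2` sits between leaf node 1 and leaf node 3, mirroring the description of FIG. 2.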
Based on this enhanced prefix tree, the embodiment of the invention provides a rapid data statistics method, as shown in FIG. 3.
Referring to FIG. 3, the rapid data statistics method based on an enhanced prefix tree provided by the embodiment of the invention includes the following steps:
Step S1: according to the actual business requirement, convert the input data content into a character string of fixed length.
Illustratively, assume that the current state of the prefix tree is as shown in FIG. 1 and that the statistic values of all leaf nodes are, from left to right: 7, 1, 3, 1, 6, 3, 5, 2, where the value in brackets at each leaf node in FIG. 1 is that leaf node's statistic value. The character strings adg, bxx and bx are now input. Padding with trailing spaces gives character strings with a fixed length of 3: adg, bxx and bx␣, where "␣" represents a space.
Step S2: search the prefix tree for the leaf node whose path character string matches the character string; if such a leaf node is found, execute step S3, otherwise execute step S4.
Step S3: add 1 to the statistic value of the leaf node found.
Illustratively, when searching for the character string adg, leaf node 2 is found in FIG. 1, and 1 is added to the statistic value of leaf node 2.
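Steps S2 and S3 amount to an ordinary prefix tree lookup followed by an increment. The sketch below assumes a dictionary of children per node and keys that already have the fixed length from step S1, so the node reached after the last character is a leaf; `Node`, `find_leaf` and `count_if_present` are hypothetical names.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> child Node
        self.count = 0      # statistic value, meaningful on leaves only


def find_leaf(root, key):
    """Step S2: follow one child per character; None means no match."""
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def count_if_present(root, key):
    """Step S3 when the leaf exists; returns False so the caller can run S4."""
    leaf = find_leaf(root, key)
    if leaf is None:
        return False
    leaf.count += 1
    return True


# Hand-built path for "adg" with a statistic value of 1, as in the example.
root = Node()
node = root
for ch in "adg":
    node = node.children.setdefault(ch, Node())
node.count = 1
adg_leaf = node

found_adg = count_if_present(root, "adg")   # existing leaf: value becomes 2
found_bxx = count_if_present(root, "bxx")   # no such path: S4 would follow
```

The lookup cost depends only on the fixed key length, not on how many distinct strings have been counted so far.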
Step S4: create a path matching the character string, together with the branch nodes and the leaf node along that path; set the statistic value of the new leaf node to 1; point its left and right pointers to the leaf nodes on its left and right, respectively; and modify the pointers of those left and right leaf nodes to point to the new leaf node.
Illustratively, when searching for the character strings bxx and bx␣ (bx padded with a trailing space), no matching leaf node can be found in FIG. 1, so new leaf nodes are created, giving the prefix tree shown in FIG. 4. Compared with FIG. 1, branch node 8, leaf node 9 and leaf node 10 are newly created. The left pointer of leaf node 10 points to leaf node 5 and its right pointer points to leaf node 9. The left pointer of leaf node 9 points to leaf node 10 and its right pointer points to leaf node 6. The right pointer of leaf node 5 is modified to point to leaf node 10, and the left pointer of leaf node 6 is modified to point to leaf node 9.
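One way to realize step S4, including the pointer updates on the neighboring leaves, is sketched below. The text does not say how the left and right neighbors are located, so the predecessor search used here (remember the closest smaller sibling on the way down, then take its rightmost leaf) is an assumption, as are the names `Node`, `EnhancedTrie`, `head` and `tail`.

```python
class Node:
    """Trie node; the leaf-only fields stay unused on root and branch nodes."""
    __slots__ = ("children", "path", "count", "left", "right")

    def __init__(self):
        self.children = {}   # character -> child Node
        self.path = None     # path character string (leaves only)
        self.count = 0       # statistic value (leaves only)
        self.left = None     # left neighboring leaf, or None
        self.right = None    # right neighboring leaf, or None


class EnhancedTrie:
    def __init__(self, width):
        self.width = width   # fixed key length produced by step S1
        self.root = Node()
        self.head = None     # leftmost leaf
        self.tail = None     # rightmost leaf

    def add(self, key):
        """Steps S2 to S4: count an existing key or insert a new leaf."""
        if len(key) != self.width:
            raise ValueError("key must already have the fixed length")
        node, pred_root, created = self.root, None, False
        for ch in key:
            # Remember the closest smaller sibling seen on the way down;
            # its rightmost leaf is the in-order predecessor of `key`.
            smaller = [c for c in node.children if c < ch]
            if smaller:
                pred_root = node.children[max(smaller)]
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = Node()
                created = True
            node = child
        if not created:                      # S3: the leaf already exists
            node.count += 1
            return node
        node.path, node.count = key, 1       # S4: new leaf, statistic value 1
        pred = pred_root
        while pred is not None and pred.children:
            pred = pred.children[max(pred.children)]
        succ = pred.right if pred is not None else self.head
        node.left, node.right = pred, succ
        if pred is not None:
            pred.right = node
        else:
            self.head = node
        if succ is not None:
            succ.left = node
        else:
            self.tail = node
        return node


trie = EnhancedTrie(width=3)
for key in ["add", "adg", "bav", "bgb", "bgc", "ech", "ecj", "eck", "ecm"]:
    trie.add(key)
bxx = trie.add("bxx")
bx_space = trie.add("bx ")
```

After the two insertions, the leaf for "bx " sits between the leaves for "bgc" and "bxx", and the leaf for "bxx" is followed by the leaf for "ech", which agrees with the pointer changes described for FIG. 4. Python compares characters by code point, so a space sorts before any letter.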
Step S5: traverse all leaf nodes using their left and right pointers to obtain the character strings in order together with their statistic values. The statistic value is the number of times a given character string has been input.
Illustratively, traversing all leaf nodes shown in FIG. 4 from left to right gives the ordered character strings: add, adg, bav, bgb, bgc, bx␣, bxx, ech, ecj, eck, ecm, with the statistic values: 7, 2, 3, 1, 6, 1, 3, 5, 2, where the value in brackets at each leaf node in FIG. 4 is that leaf node's statistic value.
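The shape of the step S5 result, ordered strings paired with their counts, can be cross-checked against a plain reference computation. The sketch below deliberately swaps in a different technique, `collections.Counter` followed by a sort, rather than the enhanced prefix tree; the input stream and the name `ordered_counts` are hypothetical.

```python
from collections import Counter


def ordered_counts(keys):
    """Reference result: (string, count) pairs in ascending string order."""
    return sorted(Counter(keys).items())


# Hypothetical stream of fixed-length keys; "bx " is bx padded with a space.
stream = ["bxx", "add", "bx ", "adg", "add"]
result = ordered_counts(stream)
```

A left-to-right leaf traversal of a tree fed with the same stream should produce the same pairs. The difference is that the tree keeps the leaves in order as they are inserted, so no sort is needed at read time.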
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical scheme and inventive concept, shall fall within the scope of protection of the invention.

Claims (5)

1. A rapid data statistics method based on an enhanced prefix tree, characterized in that: the enhanced prefix tree comprises a root node, a plurality of branch nodes and a plurality of leaf nodes, wherein each leaf node consists of a path character string, a statistic value, a left pointer and a right pointer, and the path character string of a leaf node is obtained by arranging, from top to bottom, all characters on the path from the root node to that leaf node; the left pointer of a leaf node points to its left neighboring leaf node, the right pointer points to its right neighboring leaf node, the left pointer of the leftmost leaf node is null, and the right pointer of the rightmost leaf node is null;
the rapid data statistics method comprises the following steps:
step S1: converting the input data content into a character string of fixed length;
step S2: searching the enhanced prefix tree for the leaf node whose path character string matches the character string; if such a leaf node is found, executing step S3, otherwise executing step S4;
step S3: adding 1 to the statistic value of the leaf node found;
step S4: creating a path matching the character string, together with the branch nodes and the leaf node along that path, and setting the statistic value of the newly created leaf node to 1;
step S5: traversing all leaf nodes using their left and right pointers to obtain the character strings in order together with their statistic values.
2. The rapid data statistics method based on an enhanced prefix tree according to claim 1, characterized in that: in step S1, the input data content is converted into a fixed-length character string as follows:
if the data to be counted are numbers, they are padded with leading zeros to obtain a fixed-length numeric string; if the data to be counted are words or short character strings, they are padded with trailing spaces to obtain a fixed-length character string; in other cases where statistics are needed, a hash algorithm is used to obtain a fixed-length hash string.
3. The rapid data statistics method based on an enhanced prefix tree according to claim 1, characterized in that: in step S4, the newly created paths are arranged from left to right in ascending character order.
4. The rapid data statistics method based on an enhanced prefix tree according to claim 1, characterized in that: in step S4, the left pointer of the newly created leaf node points to its left neighboring leaf node, its right pointer points to its right neighboring leaf node, the left pointer of the leftmost leaf node is null, and the right pointer of the rightmost leaf node is null.
5. The rapid data statistics method based on an enhanced prefix tree according to claim 1, characterized in that: in step S5, traversing all leaf nodes from the leftmost leaf node via the right pointer of each leaf node yields the statistic values in ascending character order; traversing all leaf nodes from the rightmost leaf node via the left pointer of each leaf node yields the statistic values in descending character order.
CN202310768136.6A 2023-06-28 2023-06-28 Data rapid statistical method for enhanced prefix tree Active CN116501781B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310768136.6A CN116501781B (en) 2023-06-28 2023-06-28 Data rapid statistical method for enhanced prefix tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310768136.6A CN116501781B (en) 2023-06-28 2023-06-28 Data rapid statistical method for enhanced prefix tree

Publications (2)

Publication Number Publication Date
CN116501781A (en) 2023-07-28
CN116501781B (en) 2023-09-12

Family

ID=87320639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310768136.6A Active CN116501781B (en) 2023-06-28 2023-06-28 Data rapid statistical method for enhanced prefix tree

Country Status (1)

Country Link
CN (1) CN116501781B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765882A (en) * 2015-04-29 2015-07-08 中国互联网络信息中心 Internet website statistics method based on web page characteristic strings
CN110851722A (en) * 2019-11-12 2020-02-28 腾讯云计算(北京)有限责任公司 Search processing method, device and equipment based on dictionary tree and storage medium
CN111382323A (en) * 2018-12-29 2020-07-07 贵州白山云科技股份有限公司 Data retrieval optimization method and device and computer equipment
CN111597185A (en) * 2020-04-01 2020-08-28 深圳英飞拓智能技术有限公司 Real-time state number rapid statistical method based on tree structure resource distribution
CN111782892A (en) * 2020-06-30 2020-10-16 中国平安人寿保险股份有限公司 Similar character recognition method, device, apparatus and storage medium based on prefix tree
CN115269935A (en) * 2022-07-26 2022-11-01 北京科能腾达信息技术股份有限公司 Integrated circuit flattening design character string storage and query system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11409732B2 (en) * 2020-01-17 2022-08-09 Intuit Inc. Computer estimations based on statistical tree structures

Also Published As

Publication number Publication date
CN116501781A (en) 2023-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant