CN113792099B

CN113792099B - Data stream high-utility item set mining system based on historical utility table pruning

Info

Publication number: CN113792099B
Application number: CN202110922923.2A
Authority: CN
Inventors: 闫凤麒; 陈欣如
Original assignee: Shanghai Xiye Information Technology Co ltd
Current assignee: Shanghai Xiye Information Technology Co ltd
Priority date: 2021-08-12
Filing date: 2021-08-12
Publication date: 2023-08-25
Anticipated expiration: 2041-08-12
Also published as: CN113792099A

Abstract

A data stream high-utility item set mining system based on historical utility table pruning. Sliding-window-based high-utility item set mining over data streams is one of the most challenging problems in data mining. Current algorithms generate large numbers of candidate item sets and redundant items, so performance degrades when mining large-scale data streams, and they make little use of historical mining results during the mining process. The innovations of the invention are that a historical utility value table is established, so that historical data is used to prune the search space of the data stream and reduce candidates and redundant items, and that the data mining system is built on a distributed architecture, so that the historical utility value table is created and updated without affecting data stream mining. Together these effectively improve the efficiency of mining high-utility item sets from data streams.

Description

Data stream high-utility item set mining system based on historical utility table pruning

Technical Field

The invention relates to frequent pattern mining algorithms and data stream mining systems.

High-utility item set mining is an important branch of frequent pattern mining.

Background

Frequent item set mining is an important branch of data mining: from all transactions in a data set, it finds the item sets whose frequency of occurrence exceeds a user-set threshold. As frequent item sets came into wide use, it was found that some infrequent item sets can create higher value than frequent ones. To address this, researchers proposed high-utility item set mining, which overcomes the shortcoming that frequent item set mining does not consider factors such as the number of occurrences, price, profit and regional distribution, and instead evaluates the importance of an item set through a comprehensive utility measure.

Among high-utility item set mining algorithms for data streams, pattern-growth methods are currently effective. The HUM-UT algorithm builds a global header table for the data in the sliding window, estimates the utility values of the data stream, and mines high-utility item sets using the global header table and a global utility tree; however, the global header table and utility tree still contain many redundant data items and low-utility item sets. To address this, the IHUM-UT algorithm improves time efficiency by compressing the size of the global header table, the SHUGrowth algorithm optimizes the mining process by constructing an SHU-Tree structure, and the HUISW algorithm optimizes the global header table by constructing a HUIL-Tree.

However, too many candidates and redundant items often give the constructed data structures (especially tree structures) high space complexity, so the mining process recurses frequently, causing memory overflow and reduced efficiency. Pruning and filtering redundant item sets is therefore one of the main optimization objectives of current algorithms.

Algorithms based on sliding-window techniques mostly focus on building better global structures. They ignore the fact that, in practical data analysis, long-term historical data offers useful guidance for mining future data streams and can help an algorithm filter out redundancy and candidates effectively. Meanwhile, high-utility item set mining algorithms built on distributed frameworks are still scarce, and as data streams grow ever larger, improving the real-time performance and efficiency of data stream mining algorithms is very challenging.

Disclosure of Invention

Current pattern-growth algorithms inevitably suffer from excessive candidate item sets and redundant items and from wasted processing of low-utility data. This often gives the constructed high-utility tree structure high space complexity, so subtrees are created recursively many times during mining, which ultimately leads to memory overflow and low efficiency. How to screen candidate sets effectively is one of the main optimization goals of high-utility item set mining algorithms.

With the development of distributed systems and data stream engines, there are now many solutions for processing large-scale data streams, including a number of excellent stream engines (Spark Streaming, Storm, Flink). In practical data mining and analysis, the analysis of long-term historical data has reference value for mining future data streams. The invention therefore uses a distributed data processing framework to mine historical data effectively and, at the same time, to assist and optimize the current data stream mining algorithm, moving it from a single machine to a distributed setting. This reduces the time and storage cost of mining and gives better scalability and stability on large data sets.

The invention designs a distributed high-utility item set mining system that analyzes historical mining data stably while preserving the real-time performance of high-utility item set mining on the current data stream. The invention also makes effective use of the results of historical mining by constructing a historical utility value table, which reduces the redundant items handled by the data stream mining algorithm and improves its efficiency.

In order to achieve the above object, the present invention provides the following solutions:

step 1, creating and updating a historical utility value table;

step 2, constructing, updating and optimizing a global header table and a global tree;

step 3, performing high-utility item set mining on the optimized global data structure;

step 4, constructing a distributed high-utility item set mining system.

Advantageous Effects

The invention removes data items with low utility values from the optimized global header table. Suppose there are currently N data items and tn transactions of average length L, the window size is WinSize and the batch size is BatchSize. When all data items are used, the space complexity can reach O(WinSize × BatchSize × L). With the window size and batch size unchanged, removing low-utility data items reduces the average transaction length, and it also reduces the number of subtrees generated in the global tree and the number of recursions. On this basis, the time and space complexity of the algorithm are effectively improved.

Comparison experiments were carried out on four classical mining data sets, and a significant performance improvement was observed. This demonstrates that constructing the historical utility value table and building the distributed high-utility item set mining system improve the efficiency of the algorithm. The invention is of great significance for improving the efficiency of high-utility item set mining on current data streams, guaranteeing the real-time performance of the algorithm and broadening the application of high-utility item set mining.

Drawings

The accompanying drawings provide a further understanding of the invention and constitute a part of this specification; together with the description, they serve to explain the principles of the invention. In the drawings:

FIG. 1 is a block diagram of a distributed high utility item set mining system;

FIG. 2 is a technical roadmap of a data stream high utility item set mining algorithm based on historical utility table pruning;

FIG. 3 is a flowchart of step 1, the creation and updating of the historical utility value table;

FIG. 4 is a flowchart of step 2, the construction, updating and optimization of the global header table and global tree;

FIG. 5 is a diagram of step 3, high-utility item set mining on the optimized global data structure.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the following detailed description of the present invention will be made with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.

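The implementation process of the system of the invention is shown in FIG. 2:

step 1, creating and updating a historical utility value table;

step 2, constructing, updating and optimizing a global header table and a global tree;

step 3, performing high-utility item set mining on the optimized global data structure;

step 4, constructing a distributed high-utility item set mining system.

The individual steps are described in detail below.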

Step 1: creation and updating of a historical utility value table, as shown in FIG. 3

1.1 initialization of a historical utility value Table

The table is initialized when no utility value table exists in the current cache. A separate historical utility value table is constructed for each threshold, and each table entry has the following fields:

item_index: index representing current data item

item_profile: external utility value representing current data item

item_utility: mean utility value of the current data item

item_level: representing the level of the current data item (for reference during mining)

The table is initialized from the data items: after initialization, the item_index of each entry is the name of the data item, item_profile is set to its external utility value, and item_utility and item_level are both initialized to 0. The resulting historical utility value table is then used in step 1.2.
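The initialization in step 1.1 can be pictured with a short Python sketch. The four field names come from the description above; the class name `HistoryEntry`, the function `init_history_table` and the sample external utilities are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass


@dataclass
class HistoryEntry:
    """One row of the historical utility value table (field names follow the text)."""
    item_index: str             # name of the data item
    item_profile: float         # external utility value of the item
    item_utility: float = 0.0   # mean utility value, initialized to 0
    item_level: int = 0         # level used during mining, initialized to 0


def init_history_table(external_utility):
    """Build the initial table from a mapping {item name: external utility}.

    The input mapping is an assumed representation of the data items.
    """
    return {
        name: HistoryEntry(item_index=name, item_profile=value)
        for name, value in external_utility.items()
    }


# Hypothetical data items with made-up external utilities.
table = init_history_table({"a": 5.0, "b": 2.0, "c": 1.0})
```

Keying the table by item name mirrors the later statement that the data item index serves as the cache key.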

1.2 creation of historic utility value tables

After the first sliding window has been mined, item_utility of each data item is updated according to the mining result and the calculated mean utility value. A data item that is identified as low-utility and whose utility is significantly lower than the transaction-weighted utility of the window has its item_level marked as -1; data items belonging to the mined high-utility item sets with multiple items have item_level marked as 2; and data items that form high-utility 1-item sets are marked as 1. All of this marking and updating is applied to the historical utility value table initialized in step 1.1.

After the first initialization and creation are completed, the table is stored in the cache with the data item index as the key and the remaining fields as the value. The table is named by the transaction data set name together with the minimum utility threshold, so a historical utility value table produced by steps 1.1 and 1.2 corresponds to exactly one minimum utility threshold on one data set.
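A minimal Python sketch of the labelling and naming rules in step 1.2 follows. The text does not say how much lower than the window's transaction-weighted utility counts as "significantly lower", so the `low_ratio` cut-off is an assumption, as are the function names, the plain-dict table layout, the `dataset:threshold` name format, and the choice to let level 1 take precedence when an item qualifies for both 1 and 2.

```python
def assign_levels(table, mean_utility, high_utility_itemsets, window_twu, low_ratio=0.1):
    """Update item_utility and item_level in place after one window is mined.

    table: {item: {"item_profile": ..., "item_utility": ..., "item_level": ...}}
    mean_utility: {item: mean utility observed in the window}
    high_utility_itemsets: iterable of frozensets returned by the miner
    low_ratio: ASSUMED cut-off for "significantly lower" (not given in the text)
    """
    singles = {next(iter(s)) for s in high_utility_itemsets if len(s) == 1}
    in_multi = {i for s in high_utility_itemsets if len(s) > 1 for i in s}
    for item, row in table.items():
        row["item_utility"] = mean_utility.get(item, 0.0)
        if item in singles:
            row["item_level"] = 1       # high-utility 1-item set
        elif item in in_multi:
            row["item_level"] = 2       # member of a high-utility multi-item set
        elif row["item_utility"] < low_ratio * window_twu:
            row["item_level"] = -1      # low utility
        else:
            row["item_level"] = 0       # ordinary item
    return table


def table_name(dataset, min_utility):
    """Name the table by data set and minimum utility threshold (assumed format)."""
    return f"{dataset}:{min_utility}"


# Hypothetical window result: {a} and {a, b} were found to be high-utility.
table = {i: {"item_profile": 1.0, "item_utility": 0.0, "item_level": 0} for i in "abcd"}
assign_levels(
    table,
    mean_utility={"a": 40, "b": 12, "c": 15, "d": 1},
    high_utility_itemsets=[frozenset({"a"}), frozenset({"a", "b"})],
    window_twu=100,
)
cache = {table_name("retail", 5000): table}  # a dict stands in for the cache
```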

1.3 updating of a historical utility value Table

After the window slides forward and the mining process is completed, the historical utility value table to be updated is fetched from the cache by data set name and minimum utility threshold, and its item_utility and item_level fields are updated according to the same rules used at creation.

Step 2: construction, updating and optimization of the global header table and global tree, as shown in FIG. 4

2.1 initialization of Global header tables and Global trees

The global header table must contain all data items at initialization, and each entry in the header table is the transaction-weighted utility (TWU) of a data item in the current batch, which serves as an estimate of its utility. The global tree consists of several TN-Tree subtrees, and a TN-Tree has three types of node: the root node, general nodes and tail nodes. The root node is an empty node under which all child nodes are merged. General nodes and tail nodes hold the name of the current data item, a pointer to the parent node and pointers to the child nodes. A tail node additionally stores all utility values of the current transaction in a two-dimensional array: n arrays are created according to the window size, and each array stores the utility values of the data items in order. Step 2.1 produces an empty global header table containing all data items and the global root node.
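The structures in step 2.1 can be sketched in Python as follows. The sketch assumes that a transaction is represented as a mapping from item name to the utility of that item in the transaction, and that TWU is computed in the usual way, as the sum of the utilities of all transactions containing the item. The class names `Node` and `TailNode` and the function `batch_twu` are illustrative, not names from the patent.

```python
from collections import defaultdict


class Node:
    """Root or general node: item name, parent pointer and child pointers."""

    def __init__(self, item=None, parent=None):
        self.item = item          # None for the empty root node
        self.parent = parent
        self.children = {}        # item name -> child node


class TailNode(Node):
    """Tail node: additionally keeps one utility array per batch in the window."""

    def __init__(self, item, parent, window_size):
        super().__init__(item, parent)
        # Two-dimensional storage: window_size arrays of utility values.
        self.batch_utilities = [[] for _ in range(window_size)]


def batch_twu(batch):
    """Header-table entries: TWU of each item in one batch of transactions."""
    twu = defaultdict(float)
    for tx in batch:
        tu = sum(tx.values())     # utility of the whole transaction
        for item in tx:
            twu[item] += tu
    return dict(twu)


root = Node()                                  # empty global root
tail = TailNode("c", root, window_size=3)      # hypothetical tail node
batch = [{"a": 5, "b": 4}, {"b": 2, "c": 1}, {"a": 10, "c": 3}]
header = batch_twu(batch)                      # {'a': 22.0, 'b': 12.0, 'c': 16.0}
```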

2.2 building updates of Global header tables and Global trees

The data stream is read in batch by batch according to the window size and batch size to fill the global header table, and subtrees are constructed according to the TN-Tree rules and merged into the global tree. The global header table is the empty table initialized in step 2.1, and the subtrees are merged under the root node generated in step 2.1. Note that transactions with the same prefix share the same tree nodes. After the window slides forward, the global header table and global tree are updated by removing the oldest batch of data and adding the newest batch, following the same rules.
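The window update in step 2.2 (drop the oldest batch, add the newest) can be illustrated for the header table alone. This Python sketch keeps only the TWU counts and leaves out the tree; the class name `SlidingHeader` and the dict-based transaction format are assumptions made for the example.

```python
from collections import Counter, deque


class SlidingHeader:
    """Header-table TWU values maintained over a window of win_size batches."""

    def __init__(self, win_size):
        self.win_size = win_size
        self.batches = deque()    # batches currently inside the window
        self.twu = Counter()      # item -> TWU over the window

    def push(self, batch):
        """Slide forward: evict the oldest batch if the window is full, then add."""
        if len(self.batches) == self.win_size:
            self._apply(self.batches.popleft(), -1)
        self.batches.append(batch)
        self._apply(batch, +1)

    def _apply(self, batch, sign):
        for tx in batch:
            tu = sum(tx.values())             # transaction utility
            for item in tx:
                self.twu[item] += sign * tu
                if self.twu[item] == 0:       # exact for integer utilities
                    del self.twu[item]


# Window of two batches; the third push evicts the first batch.
h = SlidingHeader(win_size=2)
h.push([{"a": 5, "b": 4}])
h.push([{"b": 2, "c": 1}])
h.push([{"a": 10, "c": 3}])
```

Subtracting the evicted batch avoids recomputing TWU over the whole window after each slide.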

2.3 optimization of Global header tables and Global trees

When the cache contains a historical utility value table corresponding to the current mining window, the global header table is reconstructed from the data in that table, and is optimized mainly according to item_level. The historical utility value table is generated in steps 1.1 and 1.2, and step 1.3 is triggered to update it whenever the sliding window slides forward.

The algorithm moves the absolute high-utility data items (item_level = 1) and the potential high-utility data items (item_level = 2) to the head of the table so that they are mined first during tree construction, keeps the ordinary data items (item_level = 0) in lexicographic order, and prunes the low-utility data items (item_level = -1). To preserve recall as far as possible, the TWU value of each low-utility data item (item_level = -1) is calculated, and the item is retained if this value is significantly higher than the minimum utility value of the current window. The structure of the global tree is then adjusted according to the optimized global header table: the absolute high-utility data items (item_level = 1) and the potential high-utility data items (item_level = 2) are moved to the tail nodes so that they are mined first.
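The reordering and pruning of the header table can be sketched in Python. Several details are assumptions, because the text leaves them open: "significantly higher" is simplified to TWU >= min_utility, level 1 items are placed before level 2 items, and retained level -1 items are sorted together with ordinary items. The function name `optimize_header` is illustrative.

```python
def optimize_header(header_twu, levels, min_utility):
    """Return the header-table item order after pruning by item_level.

    header_twu: {item: TWU in the current window}
    levels: {item: item_level from the historical utility value table}
    """
    group = {1: 0, 2: 1}   # level 1 first, then level 2 (assumed order)
    kept = []
    for item, twu in header_twu.items():
        level = levels.get(item, 0)
        # Prune low-utility items unless their TWU still reaches the threshold
        # (simplified stand-in for "significantly higher").
        if level == -1 and twu < min_utility:
            continue
        kept.append(item)
    # High-utility groups go to the head; everything else is lexicographic.
    return sorted(kept, key=lambda i: (group.get(levels.get(i, 0), 2), i))


# Hypothetical window: d is pruned, e is rescued by its TWU.
order = optimize_header(
    header_twu={"a": 22, "b": 12, "c": 16, "d": 4, "e": 30},
    levels={"a": 0, "b": 2, "c": 1, "d": -1, "e": -1},
    min_utility=10,
)
```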

Step 3: high-utility item set mining on the optimized global data structure, as shown in FIG. 5

3.1 preprocessing before mining

Mining first reads in the data of one complete sliding window and optimizes the global tree and global header table of the current window as described in step 2.3. A utility cache is then added to each leaf node, link pointers from the header table to the corresponding tree nodes are established, and the items are mined one by one in header-table order.

3.2 mining process

From the global data structure and mining order obtained in step 3.1, the TWU value, utility cache, position in the global tree and transaction path information of a given data item can be obtained. Because mining proceeds through the tail node table, the data item being mined corresponds to leaf nodes; after it has been mined, its nodes pass their utility caches up to their parent nodes.

Once mining starts, a data item whose utility value is greater than or equal to the minimum utility value is a high-utility item set. In addition, whenever the TWU value of the data item is greater than or equal to the minimum utility value, a sub-header table and a subtree are created for it. If the TWU value of the data item is less than the minimum utility value, then by the downward closure property of TWU no superset of the data item can be a high-utility item set, so mining of that data item ends.

If a sub-header table and a subtree must be created for the current data item, the utility caches of the nodes reached through its link pointers are read, which yields all paths in the global tree that contain the current data item. All data items on these paths are read and their TWU values calculated, and a data item is added to the sub-header table when its TWU value is greater than or equal to the minimum utility value. The utility value of the current data item is then added to the tail nodes of the subtree as the base utility value, which completes the subtree. Subtrees and sub-header tables are then created and mined recursively until all high-utility item sets of the current sliding window have been obtained.
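The decision rule in step 3.2 (report when utility >= minimum utility, expand only when TWU >= minimum utility) can be demonstrated without the tree structures. The Python sketch below is not the TN-Tree procedure of the invention: it rescans a plain list of transactions for every candidate, which is slow but makes the pruning rule easy to check by hand. The function name `mine` and the transaction format are assumptions.

```python
def mine(transactions, min_utility):
    """Return {itemset: utility} for all high-utility item sets.

    transactions: list of {item: utility of the item in that transaction}
    """
    results = {}

    def utility_and_twu(itemset):
        u = twu = 0
        for tx in transactions:
            if all(i in tx for i in itemset):
                u += sum(tx[i] for i in itemset)   # utility of the itemset
                twu += sum(tx.values())            # whole-transaction utility
        return u, twu

    def grow(prefix, candidates):
        for k, item in enumerate(candidates):
            itemset = prefix | {item}
            u, twu = utility_and_twu(itemset)
            if u >= min_utility:
                results[itemset] = u               # high-utility item set
            if twu >= min_utility:
                # TWU bounds the utility of every superset, so only these
                # item sets can have high-utility extensions.
                grow(itemset, candidates[k + 1:])

    items = sorted({i for tx in transactions for i in tx})
    grow(frozenset(), items)
    return results


txs = [{"a": 5, "b": 4}, {"b": 2, "c": 1}, {"a": 10, "c": 3}]
found = mine(txs, min_utility=12)
```

In this example {a, b} has TWU 9, below the threshold of 12, so no superset of it is explored.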

After mining is completed, the sliding window slides forward. Step 1.3 is then executed to update the historical utility value table, and steps 2.2 and 2.3 are executed to update, reconstruct and optimize the global data structure.

Step 4: construction of the distributed high-utility item set mining system, as shown in FIG. 1

The whole high-utility item set mining system is divided into three modules. The historical batch data processing module is mainly responsible for the processing in step 1 and runs on worker 1. The data stream processing module is mainly responsible for the processing in steps 2 and 3 and runs on worker 2. The historical utility value table caching module uses a lightweight caching system (Redis) to store the historical utility value table, which is the result of processing the historical batch data and is used to prune the search space of high-utility item sets in the data stream, thereby improving the efficiency of the high-utility item set mining algorithm.
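The division of labour between the three modules can be mimicked in a single process. In the Python sketch below, an in-memory class stands in for the Redis cache and two plain functions stand in for worker 1 and worker 2; all names, the JSON serialization and the key format are assumptions, and the pruning is simplified to dropping every item whose level is -1.

```python
import json


class TableCache:
    """In-memory stand-in for the lightweight cache (Redis in the text)."""

    def __init__(self):
        self._store = {}

    def put(self, dataset, min_utility, table):
        self._store[f"{dataset}:{min_utility}"] = json.dumps(table)

    def get(self, dataset, min_utility):
        raw = self._store.get(f"{dataset}:{min_utility}")
        return None if raw is None else json.loads(raw)


def worker1_update_history(cache, dataset, min_utility, levels):
    """Historical batch module: merge new item levels into the cached table."""
    table = cache.get(dataset, min_utility) or {}
    table.update(levels)
    cache.put(dataset, min_utility, table)


def worker2_prune(cache, dataset, min_utility, window_items):
    """Stream module: drop items the cached table marks as low utility."""
    table = cache.get(dataset, min_utility)
    if table is None:                  # no history yet: nothing to prune
        return list(window_items)
    return [i for i in window_items if table.get(i, 0) != -1]


cache = TableCache()
before = worker2_prune(cache, "retail", 5000, ["a", "b", "c"])
worker1_update_history(cache, "retail", 5000, {"a": 1, "b": -1})
after = worker2_prune(cache, "retail", 5000, ["a", "b", "c"])
```

Because the stream side only reads from the cache, it keeps working when no table exists yet, which matches the goal of updating history without interrupting stream mining.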

Innovation point

The invention optimizes the current data stream high-utility item set mining algorithm on the basis of a historical utility value table and builds a distributed high-utility item set mining system. Because damped-window and landmark-window stream processing put great pressure on the data warehouse and message queue, the invention processes the data stream with a sliding window and focuses on real-time processing. On this basis, a distributed framework splits the work into historical data processing and data stream processing, so that historical mining data is analyzed stably while the real-time performance of high-utility item set mining on the current data stream is preserved. Data is stored in a lightweight cache, which reduces the storage pressure on the data warehouse, and the search space of data stream mining is pruned using the reference value of historical data, which improves the overall efficiency of the data stream mining algorithm.

The system performs well on the Retail, Connect, TLC_trip and other data sets, and improves the time and space efficiency of the mining algorithm while maintaining high recall.

Claims

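1. A data stream high-utility item set mining system based on historical utility table pruning, characterized by comprising the following implementation process:

step 1, creating and updating a historical utility value table;

step 2, constructing, updating and optimizing a global header table and a global tree;

step 3, performing high-utility item set mining on the optimized global data structure;

step 4, constructing a distributed high-utility item set mining system;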

wherein, step 1: creation and updating of historical utility value tables

1.1 initialization of a historical utility value Table

item_index: index representing current data item

item_profile: external utility value representing current data item

item_utility: mean utility value of the current data item

item_level: representing the level of the current data item for reference during mining;

initializing according to the data item to obtain a history utility value table;

1.2 creation of a historical utility value Table

After the first sliding window has been mined, item_utility of each data item is updated according to the mining result and the calculated mean utility value; a data item that is identified as low-utility and whose utility is significantly lower than the transaction-weighted utility of the window has its item_level marked as -1, data items in the mined high-utility item sets with multiple items have item_level marked as 2, and data items that form high-utility 1-item sets are marked as 1;

after the first initialization and creation are completed, the table is stored in a cache with the data item index as the key and the remaining fields as the value, and the historical utility value table is named by the transaction data set name and the minimum utility threshold;

1.3 updating the historical utility value Table

After the window slides forward and the mining process is completed, the item_utility and item_level fields are updated according to the rules used at creation; after the update is completed, the historical utility value table is written back to the cache, and its name is still the transaction data set name and the minimum utility threshold;

step 2: construction, updating and optimization of global head table and global tree

2.1 initialization of Global header tables and Global trees

The global header table must contain all data items at initialization, and each entry in the header table is an estimate of the utility value of a data item in the current batch, the estimate being the TWU value of the data item in that batch; the global tree consists of several TN-Tree subtrees, and a TN-Tree has three types of node: the root node, general nodes and tail nodes; the root node is an empty node under which all child nodes are merged, and general nodes and tail nodes hold the name of the current data item, a pointer to the parent node and pointers to the child nodes; a tail node additionally stores all utility values of the current transaction in a two-dimensional array; n arrays are created according to the window size, and each array stores the utility values of the data items in order;

2.2 building updates of Global header tables and Global trees

The data stream is read in batch by batch according to the window size and batch size to fill the global header table, and subtrees are constructed according to the TN-Tree rules and merged into the global tree, where transactions with the same prefix share the same tree nodes; after the window slides forward, the global header table and global tree are updated by removing the oldest batch of data and adding the newest batch, following the same rules;

2.3 optimization of Global header tables and Global trees

When the cache contains a historical utility value table corresponding to the current mining window, the global header table is reconstructed from the data in that table and optimized according to item_level;

step 3, performing high-utility item set mining on the optimized global data structure

The first step of mining adds a utility_cache to each leaf node in the global tree; as described above, any leaf node can store the utility values on its current path, the data of each batch is enclosed in "{ }", and the data of all batches is stored in the utility_cache; after optimization using the historical utility value table, the header table is constructed from the screened header table, with its data items in the order of the optimized tail node table, and each link pointer in the header table stores the position of the corresponding data item in the global tree; mining then proceeds item by item in header-table order;

the position of the currently mined data item in the global tree is obtained from its link pointer; because the data item is the last item and the global header table and global tree have been adjusted, the system starts mining from the data items of the tail nodes, and the current data item corresponds to leaf nodes, which necessarily contain a utility_cache; when mining of the item is completed, the child node passes its utility_cache to its parent node, so that the tree nodes of the next data item to be mined also hold a utility_cache; since the last value in each utility_cache is the utility value of the current data item, the last items of all utility_cache units held by the data item together give the utility value of the data item in the current window; when the utility value is greater than or equal to the minimum utility value, the data item is a high-utility item set; in addition, whenever the TWU value of the data item is greater than or equal to the minimum utility value, a sub-header table and a subtree are created for the data item; if the TWU value of the data item is less than the minimum utility value, then by the downward closure property of TWU no superset of the data item can be a high-utility item set, so mining of the data item ends;

if a sub-header table and a subtree must be created for the current data item, the utility_cache values of the nodes reached through its link pointers are read, which yields all paths in the global tree that contain the current data item; all data items on these paths are read and their TWU values calculated, and a data item is added to the sub-header table when its TWU value is greater than or equal to the minimum utility value; the utility value of the current data item is then added to the tail nodes of the subtree as the base utility value, which completes the subtree; subtrees and sub-header tables are then created and mined recursively until all high-utility item sets of the current sliding window are obtained;

step 4, distributed high-utility item set mining system

The whole high-utility item set mining system is divided into three modules: the historical batch data processing module is responsible for the processing in step 1 and runs on worker 1; the data stream processing module is responsible for the processing in steps 2 and 3 and runs on worker 2; the historical utility value table caching module uses the lightweight caching system Redis to store the historical utility value table, which is the result of processing the historical batch data and is used to prune the search space of high-utility item sets in the data stream, thereby improving the efficiency of the high-utility item set mining algorithm.

2. The system of claim 1, wherein in step 2.3:

the system moves the absolute high-utility data items with item_level = 1 and the potential high-utility data items with item_level = 2 to the head of the table so that they are mined first when the tree is constructed in the later step, keeps the ordinary data items with item_level = 0 in lexicographic order, and prunes the low-utility data items with item_level = -1;

to preserve system recall, the TWU value of each low-utility data item with item_level = -1 is calculated, and the item is retained if this value is significantly higher than the minimum utility value of the current window; the structure of the global tree is then adjusted according to the optimized global header table, and the system moves the absolute high-utility data items with item_level = 1 and the potential high-utility data items with item_level = 2 to the tail nodes so that they are mined first.