WO2018054352A1 - Item set determining method, apparatus, processing device, and storage medium - Google Patents

Item set determining method, apparatus, processing device, and storage medium

Info

Publication number
WO2018054352A1
Authority
WO
WIPO (PCT)
Prior art keywords
item set
processed
item
time
determining
Prior art date
Application number
PCT/CN2017/102908
Other languages
English (en)
French (fr)
Inventor
林浚玮
甘文生
肖磊
陈伟
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018054352A1
Priority to US16/023,611 (published as US20180322125A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Definitions

  • The present invention relates to the field of data processing technologies, and in particular to an item set determining method, apparatus, processing device, and storage medium.
  • A high expected weight item set of effective time is an item set in a database that is both time-effective and highly expected; it represents an item set that has recently carried a high expected weight in the database.
  • A database usually records at least one transaction, such as a trade or a news item. Each transaction includes at least one data item, and at least one data item is aggregated into an item set to represent the association rules between the data items in the database.
  • Mining algorithms based on weight factors are generally used to mine high expected weight item sets of effective time from a database.
  • These algorithms mine item sets simply on the basis of weight factors and can only be used for databases that store precise data.
  • In actual mining, however, data types vary and the data in a database often contains uncertainty (that is, the database often stores uncertain data). When item sets are to be mined from a database that stores uncertain data, the current weight-based mining algorithms are not applicable.
  • For example, suppose a database stores the transaction records of the past three years and the data items are different kinds of merchandise, where the notebook has a weight value of 0.4, the bread has a weight value of 0.001, and the electric fan has a weight value of 0.05; that is, the weight values corresponding to the data items differ.
  • If the high expected weight item sets within a month are to be found, a current weight-factor-based mining algorithm mines the whole database and therefore produces high expected weight item sets that have no time validity. Information pushed on the basis of item sets determined this way has low accuracy and poor timeliness.
  • An embodiment of the present invention provides an item set determining method, apparatus, processing device, and storage medium, so as to determine high expected weight item sets of effective time from an uncertain database.
  • A method for determining a high expected weight item set of effective time, comprising:
  • determining, by a processor, at least one target transaction corresponding to a to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that includes all data items of the to-be-processed item set;
  • multiplying, by the processor, the expected support degree of the to-be-processed item set by the item set weight value of the to-be-processed item set to determine the expected weight support degree of the to-be-processed item set, where the item set weight value of the to-be-processed item set is determined according to a predefined weight value of each data item in the to-be-processed item set; and
  • determining, by the processor, that the to-be-processed item set is a high expected weight item set of effective time.
  • An embodiment of the present invention further provides an apparatus for determining a high expected weight item set of effective time, including a processor and a memory, where the memory stores the following instruction modules executable by the processor:
  • a target transaction determining module, configured to determine at least one target transaction corresponding to the to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that contains all data items of the to-be-processed item set;
  • a module for determining the time effective value of the item set in a transaction, configured to determine the time effective value of the to-be-processed item set in each target transaction according to a predefined time decay factor;
  • a module for determining the time effective value of the item set, configured to add the time effective values of the to-be-processed item set in the target transactions to determine the time effective value of the to-be-processed item set in the uncertain database;
  • an item set probability determining module, configured to determine the item set probability of the to-be-processed item set in each target transaction;
  • an expected support degree determining module, configured to add the item set probabilities of the to-be-processed item set in the target transactions to determine the expected support degree of the to-be-processed item set;
  • a weighted support degree determining module, configured to multiply the expected support degree of the to-be-processed item set by the item set weight value of the to-be-processed item set to determine the expected weight support degree of the to-be-processed item set, where the item set weight value is determined according to a predefined weight value of each data item in the to-be-processed item set;
  • a high expected weight item set determining module, configured to determine that the to-be-processed item set is a high expected weight item set of effective time if the time effective value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time effective threshold and the expected weight support degree of the to-be-processed item set is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database.
  • An embodiment of the present invention further provides a processing apparatus, including the high expected weight item set determining apparatus of the effective time described above.
  • Embodiments of the present invention also provide a non-volatile storage medium in which processor readable instructions are stored. When the instruction is executed, the processor is caused to perform the high expected weight item set determination method of the effective time described above.
  • The embodiment of the present invention uses a predefined time decay factor, minimum expected weight threshold, minimum time effective threshold, and weight value of each data item to calculate the time effective value of the to-be-processed item set in the uncertain database and the expected weight support degree of the to-be-processed item set.
  • When the time effective value of the to-be-processed item set in the uncertain database is not less than the predefined minimum time effective threshold, and the expected weight support degree of the to-be-processed item set is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the to-be-processed item set is determined to be a high expected weight item set of effective time, which realizes the mining of such item sets.
  • The method provided by the embodiment of the present invention takes into account the inherent uncertainty of the data, which could otherwise make the determined result inaccurate and poorly timed, and combines multiple metrics: the time decay factor, the minimum time effective threshold, and the minimum expected weight support.
  • It thereby determines high expected weight item sets of effective time in uncertain databases, which not only makes the determination applicable to uncertain databases but also improves the accuracy, timeliness, and efficiency of the results.
  • Items to be recommended to the user terminal are selected from the high expected weight item sets of effective time, so that information is pushed more accurately and in a more timely manner.
  • FIG. 1 is a schematic structural diagram of an application system for determining an item set according to an embodiment of the present application
  • FIG. 2 is a flowchart of a method for determining an item set provided by the present application
  • FIG. 3 is a structural block diagram of an item set determining apparatus provided by the present application.
  • FIG. 4 is a structural block diagram of a time effective value determining module of an item set in a transaction provided by the present application
  • FIG. 5 is a block diagram showing the hardware structure of a processing device provided by the present application.
  • Transaction: a record in the uncertain database. For example, if a trade-type uncertain database records trades of goods, each transaction can correspond to one trade record of goods;
  • Data item (item): an information item recorded in a transaction. A transaction contains at least one data item.
  • A transaction can record at least one data item and the occurrence probability of each data item. For example, in a trade-type uncertain database, each transaction can contain the data items for the goods traded and the transaction probability (a form of occurrence probability) of each commodity;
  • For example, a trade-type uncertain database contains 10 transactions. Each transaction represents one trade record and contains at least one commodity-name data item together with the transaction probability of each commodity. Each transaction can be distinguished by its transaction identifier (TID), and each transaction has a corresponding transaction time;
  • the transaction T1 occurs at 9:10 on January 8, 2015.
  • the transaction probability of commodity a is 0.3
  • the transaction probability of commodity b is 0.8
  • the transaction probability of commodity c is 1.
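One possible in-memory representation of such a transaction is sketched below. The `Transaction` class and its field names are illustrative assumptions, not part of the patent; only the values of T1 come from the example above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Transaction:
    """One record of an uncertain database (hypothetical layout)."""
    tid: str          # transaction identifier (TID)
    time: datetime    # transaction time
    items: dict       # data item -> occurrence probability


# Transaction T1 from the example: commodities a, b, c with
# transaction probabilities 0.3, 0.8 and 1.
T1 = Transaction(
    tid="T1",
    time=datetime(2015, 1, 8, 9, 10),
    items={"a": 0.3, "b": 0.8, "c": 1.0},
)
```

Keeping the probabilities in a mapping keyed by data item makes the later "does this transaction contain every item of the item set" test a simple key lookup.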
  • Itemset: a collection of at least one data item, used to represent an association rule inherent in an uncertain database. The difference between a transaction and an item set is that a transaction is usually a record in the uncertain database triggered by an actual event, whereas an item set is usually mined from the uncertain database.
  • k-itemset: an item set containing k data items. For example, a 1-item set is an item set containing one data item, such as item set A containing only data item A; a 2-item set is an item set containing two data items, such as item set AB containing only data items A and B; and so on.
  • Uncertain database: a database in which the data items in the transactions have certain occurrence probabilities; the structure of an illustrative uncertain database is shown in Table 1. For example, if an uncertain database records future weather conditions, each weather condition in the database corresponds to an occurrence probability; that is, each data item in each transaction of the database has an occurrence probability.
  • Weight of a data item in the uncertain database: the weight value of a data item may be defined by the user for each data item according to prior knowledge or the application background.
  • The weight value ranges from 0 to 1 and can represent the importance, risk, profit weight, freshness, etc. of the data item.
  • For example, suppose the uncertain database contains six data items a, b, c, d, e, and f. If the user defines the weight values of the six data items, a weight table is obtained; Table 2 below shows an optional form of the weight table;
  • Item set weight in the uncertain database (itemset weight in database): the weight value of an item set in the uncertain database, which reflects the importance of the item set in the uncertain database. The item set weight value of an item set may be the total weight of the data items in the item set divided by the number of data items in the item set, that is, $w(X) = \frac{\sum_{i_j \in X} w(i_j)}{|X|}$.
  • The weight value of an item set in a corresponding target transaction may be equal to the item set weight of the item set (that is, the weight value of the item set in the uncertain database); the target transactions corresponding to an item set are the transactions that contain all data items of the item set.
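The averaging rule for the item set weight can be sketched as follows. The function name `itemset_weight` is an assumption made for illustration; the weight values are the ones this document quotes for data items a to f in its RHEWI-PS example.

```python
def itemset_weight(itemset, weights):
    """Total weight of the data items divided by the number of data items."""
    if not itemset:
        raise ValueError("an item set contains at least one data item")
    return sum(weights[item] for item in itemset) / len(itemset)


# Weight table for the six data items a..f.
WEIGHTS = {"a": 0.3, "b": 0.4, "c": 1.0, "d": 0.55, "e": 0.8, "f": 0.7}

w_ab = itemset_weight({"a", "b"}, WEIGHTS)   # (0.3 + 0.4) / 2
w_c = itemset_weight({"c"}, WEIGHTS)         # a 1-item set keeps its item's weight
```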
  • Time effective value of a transaction (recency of a transaction): indicates the time validity of the transaction. In the embodiment of the present invention, it may be calculated from a predefined time decay factor, that is, a time-related valid value of a transaction is computed with the predefined time decay factor. The calculation involves the following quantities:
  • the predefined time decay factor, which takes a value in (0, 1);
  • R(T_q), the time effective value of transaction T_q;
  • t_current, the current time;
  • t_q, the time at which transaction T_q occurs.
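The formula itself is not reproduced in this text, so the sketch below is only one plausible reading of the quantities listed above. It assumes an exponential decay of the form R(T_q) = (1 - decay)^(t_current - t_q) and assumes that the time difference is measured in days; both the form and the unit are assumptions, as are the function and parameter names.

```python
from datetime import datetime


def transaction_recency(t_q, t_current, decay, unit_seconds=86400.0):
    """Time effective value R(T_q) under an ASSUMED exponential decay.

    Assumption: R(T_q) = (1 - decay) ** (t_current - t_q), with the time
    difference expressed in units of `unit_seconds` (days by default).
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("the time decay factor must lie in (0, 1)")
    elapsed = (t_current - t_q).total_seconds() / unit_seconds
    return (1.0 - decay) ** elapsed


# T1 occurred at 9:10 on January 8, 2015; evaluate it two days later.
r = transaction_recency(
    datetime(2015, 1, 8, 9, 10), datetime(2015, 1, 10, 9, 10), decay=0.1
)
```

Under this assumed form, a transaction that has just occurred has a time effective value of 1, and older transactions decay toward 0.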
  • Time effective value of an item set in a transaction (recency of an itemset in a transaction): can be equal to the time effective value of the transaction.
  • Time effective value of an item set in the uncertain database (recency of an itemset in a database): can be equal to the sum of the time effective values of the item set in its target transactions;
  • For example, if the target transactions corresponding to item set a are T1, T4, T7, and T9 (that is, transactions T1, T4, T7, and T9 all contain all data items of item set a), then the time effective value of item set a in the uncertain database is the sum of the time effective values of item set a in transactions T1, T4, T7, and T9.
  • Expected weighted support of the item set (expWSed, ie Expected weighted support): The expected weight support of an item set is the product of the expected support degree of the item set and the item set weight value of the item set.
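  • High expected weight item set (High Expected Weighted Itemset): if the expected weight support of an item set is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the item set is a high expected weight item set.
  • High expected weight item set of effective time (Recent High Expected Weighted Itemset, RHEWI): if, in the uncertain database, the time effective value of an item set is not less than the predefined minimum time effective threshold and its expected weight support is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the item set is a high expected weight item set of effective time.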
  • Transaction upper bound weight (tubw): the transaction weight upper limit of a transaction can be equal to the maximum of the weight values of the data items in the transaction. As shown in Table 1 and Table 2, the transaction weight upper limit of transaction T1 in Table 1 is the weight value of the data item with the largest weight value in T1, that is, the weight value 1 of data item c.
  • Transaction upper bound probability (tubp): the transaction probability upper limit of a transaction can be equal to the maximum of the occurrence probabilities of the data items in the transaction. As shown in Table 1, the transaction probability upper limit of transaction T2 in Table 1 is the occurrence probability of the data item with the highest occurrence probability in T2, that is, the probability of occurrence of data item d1.
  • Transaction upper bound weighted probability (tubwp): the transaction weighted probability upper limit of a transaction can be equal to the product of the transaction weight upper limit and the transaction probability upper limit of the transaction.
  • Transaction accumulation upper bound weighted probability (taubwp): the transaction accumulated weighted probability upper limit of an item set may be equal to the sum of the transaction weighted probability upper limits of the target transactions corresponding to the item set.
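The four upper bounds defined above can be sketched as follows. The function names mirror the abbreviations in the text; the two-transaction database is hypothetical (only T1 and the weight values are taken from this document, T2 is invented for illustration).

```python
WEIGHTS = {"a": 0.3, "b": 0.4, "c": 1.0, "d": 0.55, "e": 0.8, "f": 0.7}

# Hypothetical uncertain database: TID -> {data item: occurrence probability}.
DATABASE = {
    "T1": {"a": 0.3, "b": 0.8, "c": 1.0},
    "T2": {"a": 0.5, "d": 0.9},          # invented for this sketch
}


def tubw(items, weights):
    """Transaction upper bound weight: largest item weight in the transaction."""
    return max(weights[item] for item in items)


def tubp(items):
    """Transaction upper bound probability: largest occurrence probability."""
    return max(items.values())


def tubwp(items, weights):
    """Transaction upper bound weighted probability: tubw times tubp."""
    return tubw(items, weights) * tubp(items)


def taubwp(itemset, database, weights):
    """Sum of tubwp over the target transactions of `itemset`."""
    return sum(
        tubwp(items, weights)
        for items in database.values()
        if all(item in items for item in itemset)
    )


bound_a = taubwp({"a"}, DATABASE, WEIGHTS)   # 1.0 * 1.0 + 0.55 * 0.9
```

Because each bound replaces per-item values by the maximum in the transaction, `taubwp` can be computed in a single scan without enumerating item sets.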
  • the high expected weight upper limit item set of the effective time indicates the recent high upper bound expected weighted itemset (RHUBEWI);
  • the product of the item is the set of high expected weight upper limit items of the effective time.
  • FIG. 1 is a schematic structural diagram of an application system for determining an item set according to an embodiment of the present application, and shows the implementation environment involved in the embodiment. The system includes a server 101 and at least one terminal 102.
  • the terminal 102 is connected to the server 101 through a wireless or wired network.
  • the terminal 102 can be an electronic device such as a computer, a smart phone or a tablet computer, and includes a processor and a display device.
  • the server 101 can be an internet application server, which can provide background services for internet applications.
  • An internet application is an application that provides information interaction services such as voice, video, picture, and text for intelligent terminals.
  • Internet applications can transmit voice, video, pictures, and text across communication operators and across operating system platforms.
  • The Internet application server can be configured as a server that provides services through the Internet. It can be a social application server, for example an instant messaging server or a server corresponding to a social networking website such as a forum or Weibo, and can also implement payment through the Internet.
  • The embodiment of the present application does not specifically limit the type of the Internet application server.
  • The server 101 may also be another server, such as a multimedia resource sharing server.
  • The type of the server is not specifically limited in this embodiment of the present application.
  • FIG. 2 is a flowchart of a method for determining an item set according to an embodiment of the present invention.
  • The method is applicable to a data processing device, such as a data processing server on the network side.
  • The mining of high expected weight item sets of effective time may also be performed on a device such as a computer on the user side.
  • The method for determining an item set provided by the embodiment of the present invention may include:
  • Step S200: determine at least one target transaction corresponding to the to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that includes all data items of the to-be-processed item set;
  • For an item set to be processed, the embodiment of the present invention may determine its target transactions, which are the transactions in the uncertain database that include all data items of the item set;
  • The to-be-processed item set may be any item set mined from the uncertain database, and an item set includes at least one data item;
  • For example, in the uncertain database shown in Table 1, the target transactions corresponding to item set ab are transactions T1 and T7; that is, only transactions T1 and T7 contain all the data items a and b of item set ab;
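Step S200 amounts to a containment filter over the transactions. The sketch below uses a hypothetical four-transaction database chosen so that, as in the example, only T1 and T7 contain both a and b; the contents of T4, T7, and T9 are invented, and `target_transactions` is an illustrative name.

```python
# Hypothetical uncertain database: TID -> {data item: occurrence probability}.
DATABASE = {
    "T1": {"a": 0.3, "b": 0.8, "c": 1.0},
    "T4": {"a": 0.6, "d": 0.5},
    "T7": {"a": 0.9, "b": 0.4},
    "T9": {"a": 0.7, "e": 0.2},
}


def target_transactions(itemset, database):
    """TIDs of the transactions that contain every data item of `itemset`."""
    wanted = set(itemset)
    return [tid for tid, items in database.items() if wanted.issubset(items)]


targets_ab = target_transactions({"a", "b"}, DATABASE)
targets_a = target_transactions({"a"}, DATABASE)
```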
  • The embodiment of the present invention may first determine the 1-item sets, each containing one data item in the database, mine from them the 1-item sets that are high expected weight item sets of effective time, and then, starting from each such 1-item set, mine the high expected weight item sets of effective time that extend it.
  • Step S210: determine, according to a predefined time decay factor, the time effective value of the to-be-processed item set in each target transaction; add the time effective values of the to-be-processed item set in the target transactions to determine the time effective value of the to-be-processed item set in the uncertain database;
  • The time effective value of the to-be-processed item set in a target transaction may be equal to the time effective value of that target transaction; the time effective value of a transaction may be determined from the predefined time decay factor, the current time, and the time at which the transaction occurred;
  • After the time effective value of the to-be-processed item set in each target transaction is obtained, these values may be added, and the sum is taken as the time effective value of the to-be-processed item set in the uncertain database.
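The summation in step S210 can be sketched as below. The per-transaction time effective values are hypothetical numbers supplied directly, so the sketch does not depend on any particular decay formula; `itemset_recency` is an illustrative name.

```python
# Hypothetical database (TID -> data items) and per-transaction
# time effective values; all numbers are invented for illustration.
DATABASE = {
    "T1": {"a", "b", "c"},
    "T2": {"c", "d"},
    "T4": {"a", "d"},
    "T7": {"a", "b"},
    "T9": {"a", "e"},
}
RECENCY = {"T1": 0.5, "T2": 0.55, "T4": 0.6, "T7": 0.8, "T9": 0.9}


def itemset_recency(itemset, database, recency):
    """Sum of the time effective values of the item set's target transactions."""
    wanted = set(itemset)
    return sum(
        recency[tid] for tid, items in database.items() if wanted.issubset(items)
    )


r_a = itemset_recency({"a"}, DATABASE, RECENCY)        # T1 + T4 + T7 + T9
r_ab = itemset_recency({"a", "b"}, DATABASE, RECENCY)  # T1 + T7
```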
  • Step S220: determine the item set probability of the to-be-processed item set in each target transaction; add the item set probabilities of the to-be-processed item set in the target transactions to determine the expected support degree of the to-be-processed item set;
  • A transaction may record at least one data item and the occurrence probability of each data item.
  • For each target transaction, the embodiment of the present invention may take the product of the occurrence probabilities, in that target transaction, of the data items of the to-be-processed item set as the item set probability of the to-be-processed item set in that target transaction. Doing this for every target transaction yields the item set probability of the to-be-processed item set in each target transaction.
  • The item set probabilities of the to-be-processed item set in the target transactions are then added, and the sum is taken as the expected support degree of the to-be-processed item set.
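A minimal sketch of step S220 follows. The database is hypothetical apart from T1, and the function names are assumptions; the product-then-sum structure is the one described above.

```python
from math import prod

# Hypothetical uncertain database: TID -> {data item: occurrence probability}.
DATABASE = {
    "T1": {"a": 0.3, "b": 0.8, "c": 1.0},
    "T4": {"a": 0.6, "d": 0.5},
    "T7": {"a": 0.9, "b": 0.4},
}


def itemset_probability(itemset, items):
    """Product of the occurrence probabilities of the item set's data items."""
    return prod(items[item] for item in itemset)


def expected_support(itemset, database):
    """Sum of the item set probabilities over the target transactions."""
    wanted = set(itemset)
    return sum(
        itemset_probability(wanted, items)
        for items in database.values()
        if wanted.issubset(items)
    )


exp_sup_ab = expected_support({"a", "b"}, DATABASE)   # 0.3*0.8 + 0.9*0.4
```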
  • Step S230: multiply the expected support degree of the to-be-processed item set by the item set weight value of the to-be-processed item set to determine the expected weight support degree of the to-be-processed item set, where the item set weight value of the to-be-processed item set is determined according to a predefined weight value of each data item in the to-be-processed item set;
  • The embodiment of the present invention may predefine a weight table that records the weight value of each data item in the uncertain database. To determine the item set weight value of the to-be-processed item set, the weight value of each of its data items is looked up in the weight table, the total of these weight values is computed, and the total is divided by the number of data items in the to-be-processed item set.
  • Step S240: if the time effective value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time effective threshold, and the expected weight support degree of the to-be-processed item set is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, determine that the to-be-processed item set is a high expected weight item set of effective time.
  • The conditions for determining whether the to-be-processed item set is a high expected weight item set of effective time are as follows. Both conditions must be met for the to-be-processed item set to be so determined; if either condition is not satisfied, it cannot be determined to be a high expected weight item set of effective time:
  • Condition 1: the time effective value of the to-be-processed item set in the uncertain database is not less than the predefined minimum time effective threshold;
  • Condition 2: the expected weight support of the to-be-processed item set is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database.
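The decision in steps S230 and S240 can be sketched as a single predicate. The function name, parameter names, and all numeric values below are assumptions made for illustration.

```python
def is_rhewi(recency, exp_sup, itemset_weight, n_transactions,
             min_recency, min_exp_weight):
    """Check both conditions for a high expected weight item set of effective time."""
    # Step S230: expected weight support = expected support * item set weight.
    exp_w_sup = exp_sup * itemset_weight
    # Condition 1: time effective value reaches the minimum time effective threshold.
    recent_enough = recency >= min_recency
    # Condition 2: expected weight support reaches threshold * number of transactions.
    supported_enough = exp_w_sup >= min_exp_weight * n_transactions
    return recent_enough and supported_enough


# Hypothetical item sets in a 10-transaction database.
passes = is_rhewi(2.8, 4.0, 0.35, 10, min_recency=1.5, min_exp_weight=0.1)
too_light = is_rhewi(2.8, 0.6, 0.35, 10, min_recency=1.5, min_exp_weight=0.1)
too_old = is_rhewi(1.0, 4.0, 0.35, 10, min_recency=1.5, min_exp_weight=0.1)
```

Scaling the minimum expected weight threshold by the number of transactions lets the same threshold value be reused for databases of different sizes.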
  • In the embodiment of the present invention, the predefined time decay factor, minimum expected weight threshold, minimum time effective threshold, and weight value of each data item are used to calculate the time effective value of the to-be-processed item set in the uncertain database and the expected weight support degree of the to-be-processed item set. When the time effective value is not less than the predefined minimum time effective threshold and the expected weight support degree is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the to-be-processed item set is determined to be a high expected weight item set of effective time, which realizes the mining of such item sets.
  • The method for determining an item set provided by the embodiment of the present invention takes into account the inherent uncertainty of the data, which could otherwise make the result inaccurate and poorly timed, and combines multiple metrics: the time decay factor, the minimum time effective threshold, and the minimum expected weight support. It thereby determines high expected weight item sets of effective time in an uncertain database, which not only makes the determination applicable to uncertain databases but also improves the accuracy, timeliness, and efficiency of item set determination.
  • The mined high expected weight item sets of effective time may be as shown in Table 3 below; the specific parameter values here are merely illustrative;
  • the time valid value of the to-be-processed item set in a target transaction may be equal to the time valid value of the target transaction; the embodiment may be based on a predefined time decay factor, current time, and occurrence time of each target transaction. Determining, respectively, a time valid value of each target transaction; thereby determining the determined time valid value of each target transaction as a time valid value of the to-be-processed item set in each target transaction;
  • the process of determining the time effective value of the to-be-processed item set in each target transaction may be implemented by using the following formula:
  • the time effective value of each target transaction is determined as the time effective value of the to-be-processed item set in each target transaction.
  • the embodiment of the present invention may first determine a set of items in the database that includes a data item, and extract a set of high expected weight items (ie, one data) that includes a valid time of the data item from the item set including one data item.
  • the technology processes the high expected weight upper limit 1 - item set RHEWUBI 1 of each effective time one by one, and mines all the extended items set prefixed by each data item (ie, the high expected weight upper limit 1 - item set of each effective time) And extracting the expanded item set into the to-be-processed item set according to the mining time, calculating the expected weight support degree and the time effective value of each to-be-processed item set, thereby performing mining of the high
  • Embodiments of the present invention provide two mining models based on a projection technique: the first is RHEWI-P, and the second is the sort-based RHEWI-PS.
  • the algorithm pseudo code of the RHEWI-P model is as shown in Algorithm 1 and Algorithm 2 below.
  • In the following algorithms, the lowest expected weight support threshold denotes the predefined minimum expected weight threshold, the lowest recent effective threshold denotes the predefined minimum time effective threshold, and the time decay factor is the predefined one; each is represented by its own parameter. The text that follows the code can be regarded as a textual explanation of the code.
  • Lines 1-4 indicate that the first scan of the database calculates the related information of each 1-item set: the time effective value R(T_q) of the target transactions of each 1-item set, the transaction weight upper limit tubw(T_q) of those target transactions, their transaction probability upper limit tubp(T_q), and their transaction weighted probability upper limit taubwp(T_q);
  • The embodiment of the present invention may determine an order for the objects in the database, either sorting them arbitrarily or sorting them after calculation. Specifically, in the RHEWI-P model, as shown in line 11, the mined high expected weight upper limit item sets of effective time that contain one data item are sorted in lexicographical order, that is, by the dictionary order of each item set in the set RHEWUBI 1 . After sorting, the RHEWI-P model iteratively calls the function Mining-RHEWI(i j , db
  • the RHEWI-PS model is similar to the RHEWI-P model. The differences between the two are:
  • in line 11 of Algorithm 1, the RHEWI-PS model uses the descending order of the item weights as the sort order.
  • in the example database, the calculated weights of the 1-item sets are {w(a): 0.3, w(b): 0.4, w(c): 1.0, w(d): 0.55, w(e): 0.8, w(f): 0.7}
  • so the sort order in RHEWI-PS is c < e < f < d < b < a (c < e means that data item c is sorted before e), that is,
  • the mined high expected weight upper limit item sets of valid time containing one data item are sorted by weight value from large to small; every subsequent projection is a database operation that first applies this ordering to the items of each transaction and then performs the projection.
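The weight-descending order used by RHEWI-PS can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the weights are those of the example weight table, and breaking ties by item name is an assumption the source does not specify.

```python
# Item weights taken from the example weight table (Table 2 of the description).
WEIGHTS = {"a": 0.3, "b": 0.4, "c": 1.0, "d": 0.55, "e": 0.8, "f": 0.7}

def weight_descending_order(weights):
    """Return the items sorted by weight, largest first.

    Ties are broken by item name; that tie rule is an assumption.
    """
    return sorted(weights, key=lambda item: (-weights[item], item))

def reorder_transaction(items, order):
    """Reorder the items of one transaction so they follow the global order."""
    rank = {item: position for position, item in enumerate(order)}
    return sorted(items, key=rank.__getitem__)

ORDER = weight_descending_order(WEIGHTS)
```

With the example weights, `ORDER` lists c first and a last, and a transaction holding a, b, and c is rearranged to c, b, a before any projection is taken.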
  • the specific operations in Mining-RHEWI(i_j, db|i_j, k) are different: the upper bound value can be applied in advance to filter out unpromising item sets, so that
  • no subsequent database projection or mining has to be performed on these unpromising item sets and their extended item sets.
  • the specific operation of Mining-RHEWI(i_j, db|i_j, k)' is shown in Algorithm 3.
  • the RHEWI-PS model uses the sorted upper-bound downward closure property (SUBDC property) for pre-filtering, thereby avoiding a large number of sub-database projection and mining operations.
  • this greatly improves mining performance while preserving the completeness and accuracy of the mining results.
  • the SUBDC property rests mainly on the following three theorems, detailed below.
  • Theorem 1: assume X_k is a k-item set
  • and the (k-1)-item set X_{k-1} is a subset of X_k, that is, the data items of a subset of an item set are all contained in that item set.
  • assume also that the high expected weight upper limit 1-item sets of valid time are sorted by weight value from large to small, such as w(i_1) ≥ w(i_2) ≥ ... ≥ w(i_k) > 0; then w(X_k) ≤ w(X_{k-1}) holds (with X_k obtained from X_{k-1} by appending items that come later in this order, as in prefix-based mining); that is, the item set weight value of an item set is less than or equal to that of its subset;
  • Theorem 2: the expected support expSup of an item set is anti-monotone: if X_{k-1} is a (k-1)-item set and X_k is any superset of X_{k-1}, then expSup(X_{k-1}) ≥ expSup(X_k); a superset of an item set is a set containing all data items of the item set and possibly other data items; that is, the expected support of an item set is not less than that of any of its supersets;
  • Theorem 3: assume that all 1-item sets are sorted by weight value from large to small, such as w(i_1) ≥ w(i_2) ≥ ... ≥ w(i_k) > 0; then the expected weight support of a k-item set X is never less than the expected weight support of any of its supersets;
  • that is, assume X_{k-1} is a (k-1)-item set and X_k is any superset of X_{k-1}; by Theorem 1 and Theorem 2, w(X_k) ≤ w(X_{k-1}) and expSup(X_{k-1}) ≥ expSup(X_k) hold. Therefore w(X_{k-1}) × expSup(X_{k-1}) ≥ w(X_k) × expSup(X_k), i.e. expWSup(X_{k-1}) ≥ expWSup(X_k); the expected weight support of an item set is not less than that of any of its supersets.
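  • from Theorem 3 the following core pruning strategy can be obtained: the sorted upper-bound downward closure property.
  • during projection-based mining, when the expected weight support of an item set is less than the predefined minimum expected weight threshold, or its time effective value is less than the predefined minimum time effective threshold, neither the item set nor any of its extended sets can be a high expected weight item set of valid time (that is, a recent high expected weighted item set), so the item set and its extended sets can be safely filtered out.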
  • after the high expected weight item sets of valid time are determined, they may be recommended when recommending content to the user.
  • the items in the high expected weight item sets of valid time, for example web pages, news, and commodities, are pushed to the terminal logged in to the user account of the social application software.
  • the item set determining method provided by the embodiment of the present invention takes into account that the inherent uncertainty of the data can make the determined result inaccurate and untimely, and therefore relies on multiple measures such as the time decay factor, the lowest recent effective threshold, and the minimum expected weight support.
  • with these measures it achieves the determination of high expected weight item sets of valid time in an uncertain database, which not only makes the determination applicable to uncertain databases but also improves the accuracy, timeliness, and efficiency of the item set determination.
  • items selected from the high expected weight item sets of valid time are recommended to the user terminal, so that the pushed information is more accurate and timely.
  • the item set determining apparatus provided by the embodiment of the present invention is described below.
  • the item set determining apparatus described below and the method for determining a high expected weight item set of valid time described above may be cross-referenced.
  • FIG. 3 is a structural block diagram of an item set determining apparatus according to an embodiment of the present invention.
  • the apparatus may include:
  • a target transaction determining module 100, configured to determine at least one target transaction corresponding to the to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that includes all data items of the to-be-processed item set;
  • an in-transaction time effective value determining module 200, configured to determine a time effective value of the to-be-processed item set in each target transaction according to a predefined time decay factor;
  • an item set time effective value determining module 300, configured to add up the time effective values of the to-be-processed item set in the target transactions to determine the time effective value of the to-be-processed item set in the uncertain database;
  • an item set probability determining module 400, configured to determine an item set probability of the to-be-processed item set in each target transaction;
  • an expected support determining module 500, configured to add up the item set probabilities of the to-be-processed item set in the target transactions to determine the expected support of the to-be-processed item set;
  • an expected weight support determining module 600, configured to multiply the expected support of the to-be-processed item set by its item set weight value to determine the expected weight support of the to-be-processed item set,
  • where the item set weight value of the to-be-processed item set is determined according to the predefined weight values of the data items in the to-be-processed item set;
  • a high expected weight item set determining module 700, configured to: if the time effective value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time effective threshold, and the expected weight support of the to-be-processed item set is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database,
  • determine that the to-be-processed item set is a high expected weight item set of valid time.
  • the time effective value of the to-be-processed item set in a target transaction may be equal to the time effective value of that target transaction; correspondingly, FIG. 4 illustrates an optional structure of the in-transaction time effective value determining module 200.
  • referring to FIG. 4, the in-transaction time effective value determining module 200 may include:
  • a transaction time effective value determining unit 210, configured to determine the time effective value of each target transaction according to the predefined time decay factor, the current time, and the occurrence time of each target transaction;
  • an assignment unit 220, configured to take the determined time effective value of each target transaction as the time effective value of the to-be-processed item set in that target transaction.
  • the transaction time effective value determining unit 210 may specifically be configured to determine the time effective value of the target transaction T_q according to the formula given in the description (Figure PCTCN2017102908-appb-000013), where δ ∈ (0, 1) is the predefined time decay factor, R(T_q) is the time effective value of the target transaction T_q, t_current is the current time, and t_q is the occurrence time of the target transaction T_q.
  • a transaction records at least one data item and the occurrence probability of each data item; the item set probability determining module 400 may specifically be configured to, for each target transaction, take the product of the occurrence probabilities
  • of the data items of the to-be-processed item set in that target transaction as the item set probability of the to-be-processed item set in that target transaction, so as to determine the item set probability of the to-be-processed item set in each target transaction.
  • when determining the item set weight value of the to-be-processed item set, the item set determining apparatus may specifically be configured to: determine the weight value of each data item of the to-be-processed item set from a predefined weight table, where the weight table records the weight value of each data item in the uncertain database; determine the total weight value of the data items of the to-be-processed item set; and divide that total weight value by the number of data items of the to-be-processed item set to obtain the item set weight value of the to-be-processed item set.
  • the item set determining apparatus may be further configured to: after mining the high expected weight upper limit item sets of valid time containing one data item (RHEWUBI_1) from the item sets of the database that contain one data item, process, based on the pseudo-projection technique,
  • each high expected weight upper limit item set of valid time containing one data item one by one, mine all extended item sets prefixed by each data item, and determine the mined extended item sets, in the order in which they were mined, as to-be-processed item sets.
  • the mined high expected weight upper limit item sets of valid time containing one data item may be sorted by lexicographic value, or may be sorted by weight value from large to small.
  • correspondingly, the item set determining apparatus may determine that the item set weight value of an item set is not greater than the item set weight value of a subset of the item set, where the data items of a subset of an item set are all contained in the item set;
  • and/or it may determine that the expected support of an item set is not less than the expected support of a superset of the item set, where a superset of an item set is a set containing all data items of the item set;
  • and/or it may determine that the expected weight support of an item set is not less than the expected weight support of a superset of the item set.
  • the item set determining apparatus may further, when the expected weight support of an item set is less than the predefined minimum expected weight threshold or its time effective value is less than the predefined minimum time effective threshold, determine that neither the item set nor any of its extended sets
  • is a high expected weight item set of valid time, and filter out the item set and its extended sets.
  • the embodiment of the invention realizes the determination of high expected weight item sets of valid time in an uncertain database, which not only makes the determination applicable to uncertain databases but also improves the accuracy and timeliness of the result and the mining efficiency.
  • the embodiment of the invention further provides a processing device, which may include the item set determining device described above.
  • FIG. 5 shows a hardware structural block diagram of a processing device.
  • the processing device may include: a processor 1, a communication interface 2, a memory 3, and a communication bus 4;
  • the processor 1, the communication interface 2, and the memory 3 complete communication with each other through the communication bus 4;
  • the communication interface 2 can be an interface of the communication module, such as an interface of the GSM module;
  • a processor 1 for executing a program
  • a memory 3 for storing a program
  • the program can include program code, the program code including computer operating instructions.
  • the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
  • the memory 3 may include a high speed RAM memory and may also include a non-volatile memory such as at least one disk memory.
  • the program can be specifically used to:
  • determine at least one target transaction corresponding to the to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that includes all data items of the to-be-processed item set;
  • determine, according to a predefined time decay factor, the time effective value of the to-be-processed item set in each target transaction; add up these time effective values to determine the time effective value of the to-be-processed item set in the uncertain database;
  • determine the item set probability of the to-be-processed item set in each target transaction; add up these item set probabilities to determine the expected support of the to-be-processed item set;
  • multiply the expected support of the to-be-processed item set by its item set weight value to determine the expected weight support of the to-be-processed item set, where the item set weight value is determined according to the predefined weight values of the data items in the to-be-processed item set;
  • if the time effective value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time effective threshold, and the expected weight support of the to-be-processed item set is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database, determine that the to-be-processed item set is a high expected weight item set of valid time.
  • the steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of both.
  • the software module can be placed in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

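An item set determining method, apparatus, and processing device. The method includes: determining at least one target transaction corresponding to a to-be-processed item set; determining a time effective value of the to-be-processed item set in an uncertain database; determining an expected support of the to-be-processed item set; multiplying the expected support of the to-be-processed item set by its item set weight value to determine its expected weight support; and, if the time effective value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time effective threshold and its expected weight support is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database, determining that the to-be-processed item set is a high expected weight item set of valid time (S240). The determination of high expected weight item sets of valid time in an uncertain database is thereby achieved.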

Description

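Item set determining method, apparatus, processing device, and storage medium
This application claims priority to Chinese Patent Application No. 201610847309.3, filed with the Chinese Patent Office on September 23, 2016 and entitled "Method, apparatus, and processing device for mining high expected weight item sets of valid time", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of data processing technologies, and in particular to an item set determining method, apparatus, processing device, and storage medium.
Background
Currently, when recommending content of interest to users (such as web pages, news, and commodities) or mining frequently searched hot words, it is often necessary to mine high expected weight item sets of valid time from a database. A high expected weight item set of valid time is an item set in the database that is both timely and expected to be frequent, that is, a recently valid high expected weight item set. Note that a database usually records at least one transaction such as a trade or a news item, each transaction includes at least one data item, and, to represent association rules between the data items in the database, at least one data item is grouped to form an item set.
At present, high expected weight item sets of valid time are generally mined with weight-based mining algorithms. These algorithms simply mine item sets according to weight factors and can only handle databases that store precise data. In practice, however, data takes many forms, and the data in a database often carries uncertainty (that is, the database often stores uncertain data). When high expected weight item sets of valid time are to be mined from a database storing uncertain data (an uncertain database for short), the existing weight-based algorithms are not applicable. For example, suppose a database stores the transaction records of the past three years and its data items are different commodities, where a notebook has a weight of 0.4, bread has a weight of 0.001, and an electric fan has a weight of 0.05, so the weights of the data items differ. If the high expected weight item sets of the last six months need to be mined, the existing weight-based algorithms cannot mine the uncertain database, so no high expected weight item set of valid time can be found, and pushing information based on items determined by the existing algorithms results in poor accuracy and timeliness.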
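Summary
In view of this, embodiments of the present invention provide an item set determining method, apparatus, processing device, and storage medium, so as to determine high expected weight item sets of valid time from an uncertain database.
A method for determining a high expected weight item set of valid time includes:
a processor determines at least one target transaction corresponding to a to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that contains all data items of the to-be-processed item set;
the processor determines, according to a predefined time decay factor, the time effective value of the to-be-processed item set in each target transaction, and adds up these time effective values to determine the time effective value of the to-be-processed item set in the uncertain database;
the processor determines the item set probability of the to-be-processed item set in each target transaction, and adds up these item set probabilities to determine the expected support of the to-be-processed item set;
the processor multiplies the expected support of the to-be-processed item set by its item set weight value to determine the expected weight support of the to-be-processed item set, where the item set weight value is determined according to the predefined weight values of the data items in the to-be-processed item set;
if the time effective value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time effective threshold, and the expected weight support of the to-be-processed item set is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the processor determines that the to-be-processed item set is a high expected weight item set of valid time.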
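An embodiment of the present invention further provides an apparatus for determining a high expected weight item set of valid time, including a processor and a memory, where the memory stores the following instruction modules executable by the processor:
a target transaction determining module, configured to determine at least one target transaction corresponding to a to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that contains all data items of the to-be-processed item set;
an in-transaction time effective value determining module, configured to determine, according to a predefined time decay factor, the time effective value of the to-be-processed item set in each target transaction;
an item set time effective value determining module, configured to add up the time effective values of the to-be-processed item set in the target transactions to determine the time effective value of the to-be-processed item set in the uncertain database;
an item set probability determining module, configured to determine the item set probability of the to-be-processed item set in each target transaction;
an expected support determining module, configured to add up the item set probabilities of the to-be-processed item set in the target transactions to determine the expected support of the to-be-processed item set;
an expected weight support determining module, configured to multiply the expected support of the to-be-processed item set by its item set weight value to determine the expected weight support of the to-be-processed item set, where the item set weight value is determined according to the predefined weight values of the data items in the to-be-processed item set;
a high expected weight item set determining module, configured to determine that the to-be-processed item set is a high expected weight item set of valid time if its time effective value in the uncertain database is not less than a predefined minimum time effective threshold and its expected weight support is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database.
An embodiment of the present invention further provides a processing device, including the foregoing apparatus for determining a high expected weight item set of valid time.
An embodiment of the present invention further provides a non-volatile storage medium storing processor-readable instructions. When executed, the instructions cause a processor to perform the foregoing method for determining a high expected weight item set of valid time.
Based on the foregoing technical solutions, the embodiments of the present invention predefine a time decay factor, a minimum weight support threshold, a minimum recent effective threshold, and the weight value of each data item, and calculate the time effective value of the to-be-processed item set in the uncertain database and its expected weight support. When the time effective value of the to-be-processed item set in the uncertain database is not less than the predefined minimum time effective threshold, and its expected weight support is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the to-be-processed item set is determined to be a high expected weight item set of valid time, which achieves the determination of high expected weight item sets. The method provided by the embodiments takes into account that the inherent uncertainty of the data can make the determined result inaccurate and untimely, and therefore relies on multiple measures such as the time decay factor, the minimum recent effective threshold, and the minimum expected weight support to determine high expected weight item sets of valid time in an uncertain database. This not only makes the determination applicable to uncertain databases but also improves the accuracy and timeliness of the result and the efficiency of the determination. Items selected from the high expected weight item sets of valid time are recommended to the user terminal, so that the pushed information is more accurate and timely.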
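Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a system to which an item set determining method provided by an embodiment of this application is applied;
FIG. 2 is a flowchart of the item set determining method provided by this application;
FIG. 3 is a structural block diagram of the item set determining apparatus provided by this application;
FIG. 4 is a structural block diagram of the in-transaction time effective value determining module provided by this application;
FIG. 5 is a hardware structural block diagram of the processing device provided by this application.
Detailed Description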
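To make the technical solutions provided by the embodiments of the present invention easier to understand, some defined concepts are introduced first.
1. Transaction: a record in the uncertain database. For example, an uncertain database of the trading type records commodity trades, and each transaction may correspond to one trade record.
2. Data item (item): an information item recorded in a transaction. A transaction contains at least one data item and may record the occurrence probability of each data item. For example, in an uncertain database of the trading type, each transaction may contain the data items of the traded commodities and the trade probability of each commodity (one form of occurrence probability).
As shown in Table 1 below, the uncertain database of the trading type contains 10 transactions. Each transaction represents one trade record and contains at least one commodity-name data item together with the trade probability of each commodity. Each transaction is identified by a transaction number (TID) and records its occurrence time (Transaction Time).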
TID    Transaction Time      Transaction (item:probability)
T1     2015/1/08, 09:10      a:0.3, b:0.8, c:1.0
T2     2015/1/09, 11:20      d:1.0, f:0.5
T3     2015/1/11, 08:20      b:0.6, c:0.7, d:0.9, e:1.0, f:0.7
T4     2015/1/12, 09:15      a:0.5, c:0.45, f:1.0
T5     2015/1/12, 15:20      c:0.9, d:1.0, e:0.7
T6     2015/1/14, 08:30      b:0.7, d:0.3
T7     2015/1/14, 15:25      a:0.8, b:0.4, c:0.9, d:1.0, e:0.85
T8     2015/1/15, 09:10      c:0.9, d:0.5, f:1.0
T9     2015/1/16, 08:30      a:0.5, e:0.4
T10    2015/1/18, 09:00      b:1.0, c:0.9, d:0.7, e:1.0, f:1.0
Table 1
As shown in Table 1, transaction T1 occurred at 09:10 on January 8, 2015; in transaction T1, the trade probability of commodity a is 0.3, that of commodity b is 0.8, and that of commodity c is 1.
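3. Item set (itemset): a set of at least one data item, used to represent an association rule inherent in the uncertain database. A transaction differs from an item set in that a transaction is usually a record generated in the uncertain database by an event that actually occurred, whereas an item set is usually mined from the uncertain database.
4. k-item set (k-itemset): a set containing k data items. For example, a 1-item set contains one data item, such as the item set A containing only data item A; a 2-item set contains two data items, such as the item set AB containing only data items A and B; and so on.
5. Uncertain database: a database in which the data items of the transactions have occurrence probabilities. An illustrative structure of an uncertain database is shown in Table 1. For example, if an uncertain database records future weather conditions, each weather condition corresponds to an occurrence probability; that is, every data item in every transaction of the uncertain database corresponds to an occurrence probability.
6. Weight of a data item in the uncertain database: the weight value corresponding to each data item in the uncertain database. The weight value of a data item may be defined by the user according to prior knowledge or the application background. Weight values range from 0 to 1 and may indicate the importance, risk, profit share, freshness, and so on of a data item.
The uncertain database shown in Table 1 contains the six data items a, b, c, d, e, and f. When the user sets the weight values of these six data items, a weight table is obtained; Table 2 below shows an optional example of a weight table for reference.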
Data item    a      b      c      d      e      f
Weight       0.3    0.4    1.0    0.55   0.8    0.7
Table 2
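7. Item set weight value (itemset weight in database): the weight value of an item set in the uncertain database, which reflects the importance of the item set in the uncertain database. The item set weight value of an item set may be the total weight of its data items divided by the number of its data items. The formula may be:
$$w(X) = \frac{\sum_{i_j \in X} w(i_j)}{|X|}$$
where X denotes an item set, |X| is the number of data items in X, i is a data item in X, j is an index, and i_j is the j-th data item in X;
$$\sum_{i_j \in X} w(i_j)$$
denotes the sum of the weight values of the data items in the item set X.
Optionally, the weight value of an item set in a corresponding target transaction may be equal to the item set weight of the item set (that is, its weight value in the uncertain database); a target transaction corresponding to an item set is a transaction that contains all data items of the item set.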
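The item set weight defined above (the total weight of the data items divided by their count) can be sketched as follows. The function name `itemset_weight` is hypothetical and the weights are those of Table 2; this is a minimal illustration rather than the patent's implementation.

```python
# Weight table from Table 2 of the description.
WEIGHTS = {"a": 0.3, "b": 0.4, "c": 1.0, "d": 0.55, "e": 0.8, "f": 0.7}

def itemset_weight(itemset, weights):
    """Average weight of the distinct data items in the item set."""
    items = set(itemset)
    if not items:
        raise ValueError("an item set contains at least one data item")
    return sum(weights[item] for item in items) / len(items)
```

For the item set cd this gives (1.0 + 0.55) / 2 = 0.775.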
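8. Time effective value of a transaction: the recency of a transaction, which expresses how valid the transaction is with respect to time. In the embodiments of the present invention, the time effective value of a transaction may be calculated from a predefined time decay factor, that is, the time-related effective value of a transaction is derived from the predefined time decay factor. The formula may be:
Figure PCTCN2017102908-appb-000003
where δ ∈ (0, 1) is the predefined time decay factor, R(T_q) is the time effective value of transaction T_q, t_current is the current time, and t_q is the occurrence time of transaction T_q.
9. Time effective value of an item set in a transaction: the recency of the item set in that transaction (recency of an itemset in a transaction), which may be equal to the time effective value of the transaction.
10. Time effective value of an item set in the uncertain database: the recency of the item set in the uncertain database (recency of an itemset in a database), which may be equal to the sum of the time effective values of the item set in its target transactions.
For item set a, for example, Table 1 shows that its target transactions are T1, T4, T7, and T9 (each of which contains all data items of item set a), so the time effective value of item set a in the uncertain database is the sum of the time effective values of item set a in transactions T1, T4, T7, and T9.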
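The summation in definition 10 can be sketched in Python. The decay formula itself appears only as a figure in this text, so the form `(1 - delta) ** age_in_days` used below is an assumption made purely for illustration, as is measuring age in days; the function names are hypothetical.

```python
from datetime import datetime

def transaction_recency(occurred, now, delta):
    """Recency of one transaction.

    ASSUMPTION: exponential decay (1 - delta) ** age_in_days; the patent's
    own formula is given only as a figure and may differ.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    age_days = (now - occurred).total_seconds() / 86400.0
    return (1.0 - delta) ** age_days

def itemset_recency(itemset, transactions, now, delta):
    """Sum the recency of every target transaction of the item set.

    `transactions` is a list of (occurrence time, {item: probability}) pairs.
    """
    items = set(itemset)
    return sum(
        transaction_recency(occurred, now, delta)
        for occurred, probabilities in transactions
        if items.issubset(probabilities)  # target transaction check
    )
```

Whatever the exact decay formula, the structure is the same: only transactions containing every data item of the item set contribute, and their per-transaction values are added.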
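11. Item set probability of an item set in a transaction (itemset probability in a transaction): the item set probability of an item set in one of its target transactions is the product of the occurrence probabilities of its data items in that target transaction. As shown in Table 1, the item set probability of item set ab in target transaction T1 is the product of the occurrence probabilities of data items a and b in T1, that is, 0.3 × 0.8 = 0.24.
12. Expected support of an item set (expSup): the sum of the item set probabilities of the item set in its target transactions. For item set a, for example, Table 1 shows that its target transactions are T1, T4, T7, and T9, so the expected support of item set a is the sum of its item set probabilities in T1, T4, T7, and T9, that is, 0.3 (in T1) + 0.5 (in T4) + 0.8 (in T7) + 0.5 (in T9) = 2.1.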
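Definitions 11 and 12 translate directly into code. The sketch below uses the probabilities of Table 1; the names `DB`, `itemset_probability`, and `expected_support` are hypothetical, chosen for this illustration.

```python
from math import prod

# Uncertain database of Table 1: transaction id -> {data item: probability}.
DB = {
    "T1": {"a": 0.3, "b": 0.8, "c": 1.0},
    "T2": {"d": 1.0, "f": 0.5},
    "T3": {"b": 0.6, "c": 0.7, "d": 0.9, "e": 1.0, "f": 0.7},
    "T4": {"a": 0.5, "c": 0.45, "f": 1.0},
    "T5": {"c": 0.9, "d": 1.0, "e": 0.7},
    "T6": {"b": 0.7, "d": 0.3},
    "T7": {"a": 0.8, "b": 0.4, "c": 0.9, "d": 1.0, "e": 0.85},
    "T8": {"c": 0.9, "d": 0.5, "f": 1.0},
    "T9": {"a": 0.5, "e": 0.4},
    "T10": {"b": 1.0, "c": 0.9, "d": 0.7, "e": 1.0, "f": 1.0},
}

def itemset_probability(itemset, transaction):
    """Product of the occurrence probabilities of the item set's data items."""
    return prod(transaction[item] for item in set(itemset))

def expected_support(itemset, db):
    """Sum of the item set probabilities over all target transactions."""
    items = set(itemset)
    return sum(
        itemset_probability(items, transaction)
        for transaction in db.values()
        if items.issubset(transaction)
    )
```

This reproduces the two worked values in the text: 0.24 for item set ab in T1 and 2.1 for the expected support of item set a.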
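13. Expected weight support of an item set (expWSup, expected weighted support): the product of the expected support of the item set and its item set weight value.
14. High expected weight item set (High Expected Weighted Itemset, HEWI): if the expected weight support of an item set is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the item set is a high expected weight item set.
15. High expected weight item set of valid time: a recently valid high expected weight item set (Recent High Expected Weighted Itemset, RHEWI). If the time effective value of an item set in the uncertain database is not less than the predefined minimum time effective threshold, and its expected weight support is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the item set is a high expected weight item set of valid time.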
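16. Transaction weight upper limit (transaction upper bound weight, tubw): the transaction weight upper limit of a transaction may be equal to the maximum of the weight values of its data items. Combining Table 1 and Table 2, the transaction weight upper limit of transaction T1 is the weight value of its heaviest data item, namely the weight value 1 of data item c.
17. Transaction probability upper limit (transaction upper bound probability, tubp): the transaction probability upper limit of a transaction may be equal to the maximum of the occurrence probabilities of its data items. In Table 1, the transaction probability upper limit of transaction T2 is the occurrence probability of its most probable data item, namely the occurrence probability 1 of data item d.
18. Transaction weighted probability upper limit (transaction upper bound weighted probability, tubwp): the transaction weighted probability upper limit of a transaction may be equal to the product of its transaction weight upper limit and its transaction probability upper limit.
19. Transaction accumulated weighted probability upper limit of an item set (transaction accumulation upper bound weighted probability, taubwp): may be equal to the sum of the transaction weighted probability upper limits of the target transactions of the item set.
20. High expected weight upper limit item set of valid time: a recently valid high expected weight upper limit item set (Recent high upper bound expected weighted itemset, RHUBEWI). If the time effective value of an item set in the uncertain database is not less than the predefined minimum time effective threshold, and its transaction accumulated weighted probability upper limit is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the item set is a high expected weight upper limit item set of valid time.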
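The four upper bounds in definitions 16 to 19 can be sketched as below. The function names mirror the abbreviations in the text, but the code itself is an illustrative assumption; only the rows of Table 1 needed for the example are included.

```python
WEIGHTS = {"a": 0.3, "b": 0.4, "c": 1.0, "d": 0.55, "e": 0.8, "f": 0.7}

# Subset of Table 1: every transaction containing item a, plus T2.
DB = {
    "T1": {"a": 0.3, "b": 0.8, "c": 1.0},
    "T2": {"d": 1.0, "f": 0.5},
    "T4": {"a": 0.5, "c": 0.45, "f": 1.0},
    "T7": {"a": 0.8, "b": 0.4, "c": 0.9, "d": 1.0, "e": 0.85},
    "T9": {"a": 0.5, "e": 0.4},
}

def tubw(transaction, weights):
    """Transaction weight upper limit: largest item weight in the transaction."""
    return max(weights[item] for item in transaction)

def tubp(transaction):
    """Transaction probability upper limit: largest occurrence probability."""
    return max(transaction.values())

def tubwp(transaction, weights):
    """Transaction weighted probability upper limit: tubw times tubp."""
    return tubw(transaction, weights) * tubp(transaction)

def taubwp(itemset, db, weights):
    """Sum of tubwp over the target transactions of the item set."""
    items = set(itemset)
    return sum(tubwp(t, weights) for t in db.values() if items.issubset(t))
```

For item set a the accumulated bound is 1.0 + 1.0 + 1.0 + 0.4 = 3.4, comfortably above its actual expected weight support of 2.1 × 0.3 = 0.63, which is what makes it usable as an overestimate when screening 1-item sets.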
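The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 1 is a schematic structural diagram of a system to which an item set determining method provided by an embodiment of this application is applied. As shown in FIG. 1, which illustrates the implementation environment of the embodiment, the system includes a server 101 and at least one terminal 102.
The terminal 102 is connected to the server 101 through a wireless or wired network. The terminal 102 may be an electronic device such as a computer, a smartphone, or a tablet, and includes a processor and a display apparatus.
The server 101 may be an Internet application server that provides back-end services for an Internet application. An Internet application is an application program that provides smart terminals with information exchange services such as voice, video, pictures, and text, and has the advantage of sending voice, video, pictures, and text across communication operators and across operating system platforms.
The Internet application server may be configured as a server that provides services over the Internet. It may be a social application server, for example an instant messaging server or a server of a social website such as a forum or a microblog, or it may be a server that implements services such as payment over the Internet. The embodiments of this application do not specifically limit the type of the Internet application server.
Of course, the server 101 may also be another server, such as a multimedia resource sharing server. The embodiments of this application do not specifically limit the type of the server.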
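FIG. 2 is a flowchart of the item set determining method provided by an embodiment of the present invention. The method is applicable to a processing device with data processing capability, such as a data processing server on the network side, and the embodiment determines item sets by data mining. Optionally, depending on the data mining scenario, the mining of high expected weight item sets of valid time may also be performed on a device such as a computer on the user side. Referring to FIG. 2, the item set determining method provided by the embodiment may include:
Step S200: determine at least one target transaction corresponding to a to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that contains all data items of the to-be-processed item set.
Optionally, for each to-be-processed item set, the embodiment may determine the target transactions corresponding to it. A target transaction corresponding to an item set is a transaction in the uncertain database that contains all data items of that item set. A to-be-processed item set may be any item set mined from the uncertain database, and an item set includes at least one data item.
As shown in Table 1, if the to-be-processed item set is ab, its target transactions are T1 and T7; that is, in the uncertain database of Table 1, only transactions T1 and T7 contain both data items a and b of item set ab.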
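Step S200 amounts to a subset test per transaction. A minimal sketch, using only the item contents of Table 1 (probabilities omitted because they are not needed for this step) and a hypothetical function name:

```python
# Item contents of the ten transactions in Table 1.
DB_ITEMS = {
    "T1": "abc", "T2": "df", "T3": "bcdef", "T4": "acf", "T5": "cde",
    "T6": "bd", "T7": "abcde", "T8": "cdf", "T9": "ae", "T10": "bcdef",
}

def target_transactions(itemset, db_items):
    """Ids of the transactions that contain every data item of the item set."""
    items = set(itemset)
    return [tid for tid, contents in db_items.items() if items.issubset(contents)]
```

For item set ab this returns T1 and T7, matching the example in the text; for item set a it returns T1, T4, T7, and T9.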
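Optionally, the embodiment may first determine the 1-item sets of the database, each containing one data item, mine the high expected weight 1-item sets of valid time from them, and then, based on each high expected weight 1-item set of valid time, mine the high expected weight item sets of valid time that belong to it.
Step S210: determine, according to a predefined time decay factor, the time effective value of the to-be-processed item set in each target transaction; add up these time effective values to determine the time effective value of the to-be-processed item set in the uncertain database.
Optionally, the time effective value of the to-be-processed item set in a target transaction may be equal to the time effective value of that target transaction. The time effective value of a transaction may be determined from the predefined time decay factor, the current time, and the occurrence time of the transaction.
After the time effective values of the to-be-processed item set in the target transactions are obtained, they are added up, and the sum is taken as the time effective value of the to-be-processed item set in the uncertain database.
Step S220: determine the item set probability of the to-be-processed item set in each target transaction; add up these item set probabilities to determine the expected support of the to-be-processed item set.
Optionally, a transaction may record at least one data item and the occurrence probability of each data item. After the target transactions of the to-be-processed item set are determined, for each target transaction, the product of the occurrence probabilities of the data items of the to-be-processed item set in that target transaction is taken as the item set probability of the to-be-processed item set in that target transaction. Doing this for every target transaction yields the item set probability of the to-be-processed item set in each target transaction.
These item set probabilities are then added up, and the sum is taken as the expected support of the to-be-processed item set.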
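Step S230: multiply the expected support of the to-be-processed item set by its item set weight value to determine the expected weight support of the to-be-processed item set, where the item set weight value is determined according to the predefined weight values of the data items in the to-be-processed item set.
Optionally, the embodiment may predefine a weight table that records the weight value of each data item in the uncertain database. When determining the item set weight value of the to-be-processed item set, the weight value of each of its data items is looked up in the weight table, the total weight of its data items is determined, and that total is divided by the number of data items in the to-be-processed item set to obtain its item set weight value.
Step S240: if the time effective value of the to-be-processed item set in the uncertain database is not less than the predefined minimum time effective threshold, and its expected weight support is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, determine that the to-be-processed item set is a high expected weight item set of valid time.
After the time effective value of the to-be-processed item set in the uncertain database and its expected weight support are obtained, the following two conditions decide whether it is a high expected weight item set of valid time. Both conditions must hold; if either fails, the to-be-processed item set cannot be determined to be a high expected weight item set of valid time:
Condition 1: the time effective value of the to-be-processed item set in the uncertain database is not less than the predefined minimum time effective threshold;
Condition 2: the expected weight support of the to-be-processed item set is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database.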
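Steps S230 and S240 reduce to one multiplication and two comparisons. The sketch below assumes the time effective value and expected support have already been computed elsewhere; the function and parameter names are hypothetical.

```python
def expected_weighted_support(expected_support, itemset_weight):
    """Step S230: expected support multiplied by the item set weight value."""
    return expected_support * itemset_weight

def is_rhewi(recency, exp_w_sup, n_transactions, min_recency, min_exp_weight):
    """Step S240: both conditions must hold.

    Condition 1: recency >= minimum time effective threshold.
    Condition 2: expected weight support >= threshold * number of transactions.
    """
    return recency >= min_recency and exp_w_sup >= min_exp_weight * n_transactions
```

As a usage example with Table 1 (10 transactions) and a minimum expected weight threshold of 15%, the bar for condition 2 is 1.5. Item set c has expected support 5.75 and weight 1.0, so it passes condition 2, while item set a, with 2.1 × 0.3 = 0.63, does not, whatever its time effective value.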
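The embodiment of the present invention predefines a time decay factor, a minimum weight support threshold, a minimum recent effective threshold, and the weight value of each data item, and calculates the time effective value of the to-be-processed item set in the uncertain database and its expected weight support. When the time effective value of the to-be-processed item set in the uncertain database is not less than the predefined minimum time effective threshold, and its expected weight support is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the to-be-processed item set is determined to be a high expected weight item set of valid time, which achieves the mining of high expected weight item sets. The item set determining method provided by the embodiment takes into account that the inherent uncertainty of the data can make the determined result inaccurate and untimely, and therefore relies on multiple measures such as the time decay factor, the minimum recent effective threshold, and the minimum expected weight support to determine high expected weight item sets of valid time in an uncertain database. This not only makes the determination applicable to uncertain databases but also improves the accuracy, timeliness, and efficiency of the item set determination.
If the time decay factor is set to 0.15, the minimum expected weight threshold to 15%, and the minimum time effective threshold to 20, then, combining Table 1 and Table 2, the mined high expected weight item sets of valid time may be as shown in Table 3 below. The specific parameter values here are only optional values given by way of example.
Figure PCTCN2017102908-appb-000004
Figure PCTCN2017102908-appb-000005
Table 3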
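Optionally, the time effective value of the to-be-processed item set in a target transaction may be equal to the time effective value of that target transaction. The embodiment may determine the time effective value of each target transaction according to the predefined time decay factor, the current time, and the occurrence time of each target transaction, and then take the determined time effective value of each target transaction as the time effective value of the to-be-processed item set in that target transaction.
Optionally, the process of determining, according to the predefined time decay factor, the time effective value of the to-be-processed item set in each target transaction may be implemented with the following formula:
for each target transaction, according to the formula
Figure PCTCN2017102908-appb-000006
determine the time effective value of the target transaction T_q, where δ ∈ (0, 1) is the predefined time decay factor, R(T_q) is the time effective value of the target transaction T_q, t_current is the current time, and t_q is the occurrence time of the target transaction T_q;
the time effective value of each target transaction is then taken as the time effective value of the to-be-processed item set in that target transaction.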
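Optionally, the embodiment may first determine the item sets of the database that contain one data item and, from them, mine the high expected weight item sets of valid time containing one data item (that is, the recently valid high expected weight item sets containing one data item), obtaining the high expected weight 1-item sets of valid time (RHEWI_1 for short) and the high expected weight upper limit 1-item sets of valid time (RHEWUBI_1). Then, based on the pseudo-projection technique, each item set in RHEWUBI_1 is processed one by one to mine all extended item sets prefixed by each data item (that is, by each high expected weight upper limit 1-item set of valid time). The mined extended item sets are determined, in the order in which they were mined, as to-be-processed item sets, and the expected weight support and time effective value of each to-be-processed item set are calculated so as to mine the high expected weight item sets of valid time.
On this basis, the embodiment provides two mining models based on the pseudo-projection technique. Both models are projection-based: the first is RHEWI-P, and the second is the sort-based RHEWI-PS.
The pseudocode of the RHEWI-P model is shown in Algorithm 1 and Algorithm 2 below. In these algorithms, the lowest expected weight support threshold denotes the predefined minimum expected weight threshold, represented by the parameter α; the lowest recent effective threshold denotes the predefined minimum time effective threshold, represented by the parameter β; and the parameter δ denotes the predefined time decay factor. The text following each listing can be regarded as a textual explanation of it.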
Figure PCTCN2017102908-appb-000007
Figure PCTCN2017102908-appb-000008
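In Algorithm 1, lines 1-4 indicate that the first scan of the database calculates the related information of each 1-item set, including the time effective value R(T_q), the transaction weight upper limit tubw(T_q), the transaction probability upper limit tubp(T_q), and the transaction weighted probability upper limit tubwp(T_q) of each target transaction of the 1-item set.
Then the recency R(i_j) and the transaction accumulated weighted probability upper limit taubwp(i_j) are calculated, and the recently valid high expected weight upper limit 1-item sets RHEWUBI_1 and the recently valid high expected weight 1-item sets RHEWI_1 are found (lines 5-10).
In implementation, the embodiment may determine an order of the objects in the database, either by sorting them randomly or by sorting them after calculation. Specifically, in the RHEWI-P model, as shown in line 11, the mined high expected weight upper limit item sets of valid time containing one data item are arranged in lexicographical order, that is, sorted by the lexicographic value of each item set in the set RHEWUBI_1. Afterwards, the RHEWI-P model iteratively calls the function Mining-RHEWI(i_j, db|i_j, k) to mine, based on the projection technique, all extended item sets prefixed by each item set containing one data item (that is, by each data item).
The specific operation of Mining-RHEWI(i_j, db|i_j, k) is shown in Algorithm 2.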
Figure PCTCN2017102908-appb-000009
Figure PCTCN2017102908-appb-000010
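The RHEWI-PS model is largely similar to the RHEWI-P model. The differences between the two are:
1. In line 11 of Algorithm 1, the RHEWI-PS model uses the descending order of the item weights as the sort order. In the example database, the calculated weight values of the 1-item sets are {w(a): 0.3, w(b): 0.4, w(c): 1.0, w(d): 0.55, w(e): 0.8, w(f): 0.7}, so the sort order in RHEWI-PS is c < e < f < d < b < a (c < e means that data item c is sorted before e); that is, the mined high expected weight upper limit item sets of valid time containing one data item are sorted by weight value from large to small. Every subsequent projection is a database operation that first applies this ordering to the items of each transaction and then performs the projection.
2. The specific operations in Mining-RHEWI(i_j, db|i_j, k) are different: the upper bound value can be applied in advance to filter out unpromising item sets, so that no subsequent database projection or mining has to be performed on these unpromising item sets and their extended item sets. The specific operation of Mining-RHEWI(i_j, db|i_j, k)' is shown in Algorithm 3.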
Figure PCTCN2017102908-appb-000011
Figure PCTCN2017102908-appb-000012
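In implementation, the RHEWI-PS model uses the sorted upper-bound downward closure property (SUBDC property) for pre-filtering. This avoids a large number of sub-database projection and mining operations and greatly improves mining performance while preserving the completeness and accuracy of the mining results. The SUBDC property rests mainly on the following three theorems, detailed below.
Theorem 1. Assume X_k is a k-item set and the (k-1)-item set X_{k-1} is a subset of X_k, that is, the data items of a subset of an item set are all contained in that item set. Assume also that the high expected weight upper limit 1-item sets of valid time are sorted by weight value from large to small, such as w(i_1) ≥ w(i_2) ≥ … ≥ w(i_k) > 0. Then w(X_k) ≤ w(X_{k-1}) holds (with X_k obtained from X_{k-1} by appending items that come later in this order, as in prefix-based mining); that is, the item set weight value of an item set is less than or equal to that of its subset.
For example, in the example database, sorting all 1-item sets by weight value from large to small gives c < e < f < d < b < a; the weight value of the item set (cd) is then never less than that of any of its supersets (cdb), (cda), and (cdba). Their weight values are w(cd) = (1.0 + 0.55)/2 = 0.775, w(cdb) = (1.0 + 0.55 + 0.4)/3 = 0.650, w(cda) = (1.0 + 0.55 + 0.3)/3 ≈ 0.617, and w(cdba) = (1.0 + 0.55 + 0.4 + 0.3)/4 = 0.5625. Hence the weight value of each of the supersets (cdb), (cda), and (cdba) is less than or equal to that of the item set (cd).
Theorem 2. The expected support expSup of an item set is always anti-monotone.
That is, assume X_{k-1} is a (k-1)-item set and X_k is any superset of X_{k-1}; then expSup(X_{k-1}) ≥ expSup(X_k) holds. A superset of an item set is a set containing all data items of the item set, and may contain other data items as well. In other words, the expected support of an item set is not less than the expected support of any of its supersets.
Theorem 3. Assume all 1-item sets are sorted by weight value from large to small, such as w(i_1) ≥ w(i_2) ≥ … ≥ w(i_k) > 0; then the expected weight support of a k-item set X is never less than the expected weight support of any of its supersets.
That is, assume X_{k-1} is a (k-1)-item set and X_k is any superset of X_{k-1}. By Theorem 1 and Theorem 2, w(X_k) ≤ w(X_{k-1}) and expSup(X_{k-1}) ≥ expSup(X_k) hold. Therefore w(X_{k-1}) × expSup(X_{k-1}) ≥ w(X_k) × expSup(X_k), that is, expWSup(X_{k-1}) ≥ expWSup(X_k); the expected weight support of an item set is not less than that of any of its supersets.
From Theorem 3 the following core pruning strategy is obtained: the sorted upper-bound downward closure property. During projection-based mining, when the expected weight support of an item set is less than the predefined minimum expected weight threshold, or its time effective value is less than the predefined minimum time effective threshold, neither the item set nor any of its extended sets can be a high expected weight item set of valid time (that is, a recently valid high expected weight item set), so the item set and its extended sets can be safely filtered out.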
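The pruning strategy can be illustrated with a simplified depth-first search. This is a sketch under several assumptions and not the patent's Algorithm 3: it tracks lists of transaction ids instead of building projected databases, it takes precomputed per-transaction time effective values as input, it compares the expected weight support against the threshold multiplied by the number of transactions (as in definition 15), and all names are hypothetical.

```python
from math import prod

def mine_rhewi(db, weights, recency, min_exp_weight, min_recency):
    """Return {item set tuple: (expWSup, time effective value)}.

    db:      {tid: {item: probability}}
    recency: {tid: time effective value of the transaction}, precomputed
    """
    # Weight-descending order, so every extension only appends lighter items.
    order = sorted(weights, key=lambda item: (-weights[item], item))
    bar = min_exp_weight * len(db)
    found = {}

    def extend(prefix, tids, start):
        for position in range(start, len(order)):
            item = order[position]
            itemset = prefix + (item,)
            # Target transactions of the extended item set.
            targets = [tid for tid in tids if item in db[tid]]
            exp_sup = sum(prod(db[tid][i] for i in itemset) for tid in targets)
            weight = sum(weights[i] for i in itemset) / len(itemset)
            exp_w_sup = exp_sup * weight
            rec = sum(recency[tid] for tid in targets)
            if exp_w_sup < bar or rec < min_recency:
                continue  # prune: no extension of this item set can qualify
            found[itemset] = (exp_w_sup, rec)
            extend(itemset, targets, position + 1)

    extend((), list(db), 0)
    return found

# Small example: three rows of Table 1, with every recency set to 1.0
# (an arbitrary placeholder, not a value from the patent).
EXAMPLE_DB = {
    "T1": {"a": 0.3, "b": 0.8, "c": 1.0},
    "T5": {"c": 0.9, "d": 1.0, "e": 0.7},
    "T9": {"a": 0.5, "e": 0.4},
}
EXAMPLE_WEIGHTS = {"a": 0.3, "b": 0.4, "c": 1.0, "d": 0.55, "e": 0.8, "f": 0.7}
EXAMPLE_RECENCY = {"T1": 1.0, "T5": 1.0, "T9": 1.0}
RESULT = mine_rhewi(EXAMPLE_DB, EXAMPLE_WEIGHTS, EXAMPLE_RECENCY, 0.2, 1.0)
```

Skipping a failed item set is safe here because extensions only add items that are no heavier (so the average weight cannot rise) and can only shrink the set of target transactions (so neither the expected support nor the summed time effective value can rise). In the small example the search keeps (c), (c, d), and (e) and never explores the extensions of the pruned candidates.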
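Optionally, after the high expected weight item sets of valid time are determined, they may be recommended when recommending content to the user.
Optionally, after the high expected weight item sets of valid time are determined, the items in them, for example web pages, news, and commodities, are pushed to the terminal logged in to the user account of the social application software.
The item set determining method provided by the embodiment takes into account that the inherent uncertainty of the data can make the determined result inaccurate and untimely, and therefore relies on multiple measures such as the time decay factor, the minimum recent effective threshold, and the minimum expected weight support to determine high expected weight item sets of valid time in an uncertain database. This not only makes the determination applicable to uncertain databases but also improves the accuracy and timeliness of the item set determination result and the efficiency of the determination. Items selected from the high expected weight item sets of valid time are recommended to the user terminal, so that the pushed information is more accurate and timely.
The item set determining apparatus provided by the embodiment is introduced below. The apparatus described below and the method for determining a high expected weight item set of valid time described above may be cross-referenced.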
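FIG. 3 is a structural block diagram of the item set determining apparatus provided by an embodiment of the present invention. Referring to FIG. 3, the apparatus may include:
a target transaction determining module 100, configured to determine at least one target transaction corresponding to a to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that contains all data items of the to-be-processed item set;
an in-transaction time effective value determining module 200, configured to determine, according to a predefined time decay factor, the time effective value of the to-be-processed item set in each target transaction;
an item set time effective value determining module 300, configured to add up the time effective values of the to-be-processed item set in the target transactions to determine the time effective value of the to-be-processed item set in the uncertain database;
an item set probability determining module 400, configured to determine the item set probability of the to-be-processed item set in each target transaction;
an expected support determining module 500, configured to add up the item set probabilities of the to-be-processed item set in the target transactions to determine the expected support of the to-be-processed item set;
an expected weight support determining module 600, configured to multiply the expected support of the to-be-processed item set by its item set weight value to determine the expected weight support of the to-be-processed item set, where the item set weight value is determined according to the predefined weight values of the data items in the to-be-processed item set;
a high expected weight item set determining module 700, configured to determine that the to-be-processed item set is a high expected weight item set of valid time if its time effective value in the uncertain database is not less than a predefined minimum time effective threshold and its expected weight support is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database.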
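Optionally, the time effective value of the to-be-processed item set in a target transaction may be equal to the time effective value of that target transaction. Correspondingly, FIG. 4 shows an optional structure of the in-transaction time effective value determining module 200. Referring to FIG. 4, the module 200 may include:
a transaction time effective value determining unit 210, configured to determine the time effective value of each target transaction according to the predefined time decay factor, the current time, and the occurrence time of each target transaction;
an assignment unit 220, configured to take the determined time effective value of each target transaction as the time effective value of the to-be-processed item set in that target transaction.
Optionally, the transaction time effective value determining unit 210 may specifically be configured to, according to the formula
Figure PCTCN2017102908-appb-000013
determine the time effective value of the target transaction T_q, where δ ∈ (0, 1) is the predefined time decay factor, R(T_q) is the time effective value of the target transaction T_q, t_current is the current time, and t_q is the occurrence time of the target transaction T_q.
Optionally, a transaction records at least one data item and the occurrence probability of each data item. The item set probability determining module 400 may specifically be configured to, for each target transaction, take the product of the occurrence probabilities of the data items of the to-be-processed item set in that target transaction as the item set probability of the to-be-processed item set in that target transaction, so as to determine the item set probability of the to-be-processed item set in each target transaction.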
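Optionally, when determining the item set weight value of the to-be-processed item set, the item set determining apparatus may specifically be configured to: determine the weight value of each data item of the to-be-processed item set from a predefined weight table, where the weight table records the weight value of each data item in the uncertain database; determine the total weight of the data items of the to-be-processed item set; and divide that total by the number of data items of the to-be-processed item set to obtain its item set weight value.
Optionally, the item set determining apparatus may further be configured to: after mining the high expected weight upper limit item sets of valid time containing one data item (RHEWUBI_1) from the item sets of the database that contain one data item, process each of them one by one based on the pseudo-projection technique, mine all extended item sets prefixed by each data item, and determine the mined extended item sets, in the order in which they were mined, as to-be-processed item sets.
Optionally, the mined high expected weight upper limit item sets of valid time containing one data item may be sorted by lexicographic value, or may be sorted by weight value from large to small.
Correspondingly, the item set determining apparatus may determine that the item set weight value of an item set is not greater than the item set weight value of a subset of the item set, where the data items of a subset of an item set are all contained in the item set;
and/or it may determine that the expected support of an item set is not less than the expected support of a superset of the item set, where a superset of an item set is a set containing all data items of the item set;
and/or it may determine that the expected weight support of an item set is not less than the expected weight support of a superset of the item set.
Optionally, the item set determining apparatus may further, when the expected weight support of an item set is less than the predefined minimum expected weight threshold or its time effective value is less than the predefined minimum time effective threshold, determine that neither the item set nor any of its extended sets is a high expected weight item set of valid time, and filter out the item set and its extended sets.
The embodiment of the present invention achieves the determination of high expected weight item sets of valid time in an uncertain database, which not only makes the determination applicable to uncertain databases but also improves the accuracy and timeliness of the result and the mining efficiency.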
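An embodiment of the present invention further provides a processing device, which may include the item set determining apparatus described above.
Optionally, FIG. 5 shows a hardware structural block diagram of the processing device. Referring to FIG. 5, the processing device may include a processor 1, a communication interface 2, a memory 3, and a communication bus 4;
the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;
optionally, the communication interface 2 may be an interface of a communication module, such as an interface of a GSM module;
the processor 1 is configured to execute a program;
the memory 3 is configured to store the program;
the program may include program code, and the program code includes computer operating instructions.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 3 may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory.
The program may specifically be used to:
determine at least one target transaction corresponding to a to-be-processed item set, where a target transaction corresponding to the to-be-processed item set is a transaction in the uncertain database that contains all data items of the to-be-processed item set;
determine, according to a predefined time decay factor, the time effective value of the to-be-processed item set in each target transaction; add up these time effective values to determine the time effective value of the to-be-processed item set in the uncertain database;
determine the item set probability of the to-be-processed item set in each target transaction; add up these item set probabilities to determine the expected support of the to-be-processed item set;
multiply the expected support of the to-be-processed item set by its item set weight value to determine the expected weight support of the to-be-processed item set, where the item set weight value is determined according to the predefined weight values of the data items in the to-be-processed item set;
if the time effective value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time effective threshold, and its expected weight support is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database, determine that the to-be-processed item set is a high expected weight item set of valid time.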
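The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, it is described briefly; for related details, refer to the description of the method.
A person of ordinary skill in the art will understand that the units and algorithm steps of the examples described in the disclosed embodiments can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of the examples have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. A person of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of both. The software module can be placed in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The above description of the disclosed embodiments enables a person of ordinary skill in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to a person of ordinary skill in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.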

Claims (17)

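  1. An item set determining method, performed by a processor, comprising:
    determining at least one target transaction corresponding to a to-be-processed item set, wherein a target transaction corresponding to the to-be-processed item set is a transaction in an uncertain database that contains all data items of the to-be-processed item set, and each transaction in the uncertain database contains at least one data item and an occurrence probability of that data item;
    determining, according to a predefined time decay factor, a time validity value of the to-be-processed item set in each target transaction, and adding up the time validity values of the to-be-processed item set in the target transactions to determine a time validity value of the to-be-processed item set in the uncertain database;
    determining an item set probability of the to-be-processed item set in each target transaction, and adding up the item set probabilities of the to-be-processed item set in the target transactions to determine an expected support of the to-be-processed item set;
    multiplying the expected support of the to-be-processed item set by an item set weight value of the to-be-processed item set to determine an expected weighted support of the to-be-processed item set, wherein the item set weight value of the to-be-processed item set is determined according to predefined weight values of the data items in the to-be-processed item set; and
    if the time validity value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time validity threshold, and the expected weighted support of the to-be-processed item set is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database, determining that the to-be-processed item set is a time-valid high expected weight item set.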
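  2. The item set determining method according to claim 1, wherein the time validity value of the to-be-processed item set in a target transaction is equal to the time validity value of that target transaction, and the determining, according to a predefined time decay factor, a time validity value of the to-be-processed item set in each target transaction comprises:
    determining the time validity value of each target transaction according to the predefined time decay factor, the current time, and the occurrence time of each target transaction; and
    determining the determined time validity values of the target transactions as the time validity values of the to-be-processed item set in the target transactions.
  3. The item set determining method according to claim 2, wherein the determining the time validity value of each target transaction according to the predefined time decay factor, the current time, and the occurrence time of each target transaction comprises:
    according to the formula
    Figure PCTCN2017102908-appb-100001
    determining the time validity value of a target transaction T_q, wherein δ∈(0,1) is the predefined time decay factor, R(T_q) is the time validity value of the target transaction T_q, t_current denotes the current time, and t_q denotes the occurrence time of the target transaction T_q.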
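  4. The item set determining method according to claim 1, wherein the determining an item set probability of the to-be-processed item set in each target transaction comprises:
    for each target transaction, using the product of the occurrence probabilities, in the target transaction, of the data items of the to-be-processed item set as the item set probability of the to-be-processed item set in that target transaction, so as to determine the item set probability of the to-be-processed item set in each target transaction.
  5. The item set determining method according to claim 1, wherein a process of determining the item set weight value of the to-be-processed item set comprises:
    determining the weight value of each data item of the to-be-processed item set from a predefined weight table, wherein the weight table records the weight value corresponding to each data item in the uncertain database;
    determining the total weight of the data items of the to-be-processed item set; and
    dividing the total weight of the data items of the to-be-processed item set by the number of data items in the to-be-processed item set to obtain the item set weight value of the to-be-processed item set.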
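  6. The item set determining method according to any one of claims 1 to 5, wherein the method further comprises:
    after determining, from the item sets in the database that each contain one data item, the time-valid high expected weight upper-bound item sets containing one data item, processing each of the time-valid high expected weight upper-bound item sets containing one data item one by one based on a pseudo-projection technique, determining all extended item sets prefixed by each data item, and determining the determined extended item sets, in order of determination time, as to-be-processed item sets, wherein each transaction in the uncertain database contains at least one data item and an occurrence probability of that data item;
    wherein, if the time validity value of an item set in the uncertain database is not less than the predefined minimum time validity threshold, and the transaction cumulative weighted probability upper bound of the item set is not less than the product of the predefined minimum expected weight threshold and the total number of transactions in the uncertain database, the item set is a time-valid high expected weight upper-bound item set.
  7. The item set determining method according to claim 6, wherein the determined time-valid high expected weight upper-bound item sets containing one data item are sorted in lexicographic order.
  8. The item set determining method according to claim 6, wherein the determined time-valid high expected weight upper-bound item sets containing one data item are sorted in descending order of weight value.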
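  9. The item set determining method according to claim 8, wherein
    the item set weight value of an item set is not greater than the item set weight value of a subset of that item set, and the data items in a subset of an item set are contained in that item set.
  10. The item set determining method according to claim 8, wherein the expected support of an item set is not less than the expected support of a superset of that item set, and a superset of an item set is a set that contains all data items of that item set.
  11. The item set determining method according to claim 8, wherein the expected weighted support of an item set is not less than the expected weighted support of a superset of that item set.
  12. The item set determining method according to claim 9, wherein the method further comprises:
    when the expected weighted support of an item set is less than the predefined minimum expected weight threshold, or its time validity value is less than the predefined minimum time validity threshold, determining that neither the item set nor any of its extended sets is a time-valid high expected weight item set; and
    filtering out the item set and its extended sets.
  13. The item set determining method according to claim 1, wherein the method further comprises: after determining the time-valid high expected weight item set, pushing the items in the time-valid high expected weight item set to a terminal logged in to a user account of application software.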
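  14. An item set determining apparatus, comprising a processor and a memory, wherein the memory stores the following instruction modules executable by the processor:
    a target transaction determining module, configured to determine at least one target transaction corresponding to a to-be-processed item set, wherein a target transaction corresponding to the to-be-processed item set is a transaction in an uncertain database that contains all data items of the to-be-processed item set;
    an in-transaction time validity value determining module, configured to determine, according to a predefined time decay factor, a time validity value of the to-be-processed item set in each target transaction;
    an item set time validity value determining module, configured to add up the time validity values of the to-be-processed item set in the target transactions to determine a time validity value of the to-be-processed item set in the uncertain database;
    an item set probability determining module, configured to determine an item set probability of the to-be-processed item set in each target transaction;
    an expected support determining module, configured to add up the item set probabilities of the to-be-processed item set in the target transactions to determine an expected support of the to-be-processed item set;
    an expected weighted support determining module, configured to multiply the expected support of the to-be-processed item set by an item set weight value of the to-be-processed item set to determine an expected weighted support of the to-be-processed item set, wherein the item set weight value of the to-be-processed item set is determined according to predefined weight values of the data items in the to-be-processed item set; and
    a high expected weight item set determining module, configured to: if the time validity value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time validity threshold, and the expected weighted support of the to-be-processed item set is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database, determine that the to-be-processed item set is a time-valid high expected weight item set.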
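  15. The item set determining apparatus according to claim 14, wherein the in-transaction time validity value determining module comprises:
    a transaction time validity value determining unit, configured to determine the time validity value of each target transaction according to the predefined time decay factor, the current time, and the occurrence time of each target transaction; and
    a designation unit, configured to determine the determined time validity values of the target transactions as the time validity values of the to-be-processed item set in the target transactions.
  16. A processing device, comprising the item set determining apparatus according to any one of claims 14 to 15.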
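  17. A non-volatile storage medium, configured to store one or more computer programs, wherein the computer programs comprise instructions executable by one or more processors, and the instructions, when executed by a processor, cause the processor to perform the following operations:
    determining at least one target transaction corresponding to a to-be-processed item set, wherein a target transaction corresponding to the to-be-processed item set is a transaction in an uncertain database that contains all data items of the to-be-processed item set, and each transaction in the uncertain database contains at least one data item and an occurrence probability of that data item;
    determining, according to a predefined time decay factor, a time validity value of the to-be-processed item set in each target transaction, and adding up the time validity values of the to-be-processed item set in the target transactions to determine a time validity value of the to-be-processed item set in the uncertain database;
    determining an item set probability of the to-be-processed item set in each target transaction, and adding up the item set probabilities of the to-be-processed item set in the target transactions to determine an expected support of the to-be-processed item set;
    multiplying the expected support of the to-be-processed item set by an item set weight value of the to-be-processed item set to determine an expected weighted support of the to-be-processed item set, wherein the item set weight value of the to-be-processed item set is determined according to predefined weight values of the data items in the to-be-processed item set; and
    if the time validity value of the to-be-processed item set in the uncertain database is not less than a predefined minimum time validity threshold, and the expected weighted support of the to-be-processed item set is not less than the product of a predefined minimum expected weight threshold and the total number of transactions in the uncertain database, determining that the to-be-processed item set is a time-valid high expected weight item set.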
PCT/CN2017/102908 2016-09-23 2017-09-22 Itemset determining method and apparatus, processing device, and storage medium WO2018054352A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/023,611 US20180322125A1 (en) 2016-09-23 2018-06-29 Itemset determining method and apparatus, processing device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610847309.3 2016-09-23
CN201610847309.3A CN107870913B (zh) 2016-09-23 2016-09-23 Method, apparatus, and processing device for mining time-valid high expected weight item sets

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/023,611 Continuation US20180322125A1 (en) 2016-09-23 2018-06-29 Itemset determining method and apparatus, processing device, and storage medium

Publications (1)

Publication Number Publication Date
WO2018054352A1 true WO2018054352A1 (zh) 2018-03-29

Family

ID=61689350

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/102908 WO2018054352A1 (zh) 2016-09-23 2017-09-22 Itemset determining method and apparatus, processing device, and storage medium

Country Status (3)

Country Link
US (1) US20180322125A1 (zh)
CN (1) CN107870913B (zh)
WO (1) WO2018054352A1 (zh)

Also Published As

Publication number Publication date
CN107870913B (zh) 2021-12-14
CN107870913A (zh) 2018-04-03
US20180322125A1 (en) 2018-11-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17852420

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17852420

Country of ref document: EP

Kind code of ref document: A1