WO2014177181A1 - A method of processing a ratings dataset - Google Patents

A method of processing a ratings dataset

Info

Publication number
WO2014177181A1
Authority
WO
WIPO (PCT)
Prior art keywords
ratings
function
dataset
items
itemsets
Prior art date
Application number
PCT/EP2013/058931
Other languages
French (fr)
Other versions
WO2014177181A9 (en)
Inventor
Ihab Francis Ilyas Kaldas
Sihem Amer-Yahia
Anup K. CHALAMALLA
Original Assignee
Qatar Foundation
Hoarton, Lloyd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qatar Foundation, Hoarton, Lloyd
Priority to PCT/EP2013/058931
Publication of WO2014177181A1
Publication of WO2014177181A9

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282: Rating or review of business operators or products

Definitions

  • the present invention relates to a method of processing a ratings dataset, and more particularly relates to a method of mining a ratings dataset for promotional patterns.
  • Collaborative rating systems for products, movies, businesses, and news have proliferated rapidly on the Web.
  • Websites such as yelp, imdb, amazon and news broadcasting sites such as digg provide a platform for users to evaluate the content-items hosted on these sites by rating them, thereby helping other users in the system make informed decisions pertaining to content of their interest.
  • Such a collaborative rating system has a large number of users, items, and a very large number of ratings between them.
  • An interesting type of pattern in such systems is 'who' rated 'what' and 'how'.
  • PATTERN EXAMPLE 1. 80% of ratings for cold weather accessories as rated by female users between age 18-25 are 5/5.
  • PATTERN EXAMPLE 2. The average of ratings for James Cameron's action movies given by male students between age 25-35 is greater than 8/10.
  • Patterns of this kind may be known as promotional patterns.
  • a promotional pattern is a summarized description of ratings between a subset of users and a subset of items in the system satisfying certain prespecified constraints on the ratings between them and the sets themselves.
  • a subset of users is denoted by a userset and a subset of items is denoted by an itemset.
  • association rule mining is formulated as a market basket problem, in which a set of items rated by a user are organized into a transaction and the dataset comprises transactions of all users. The goal of this task is to compute significant associations between two disjoint sets of items, A and B, satisfying a given support and confidence.
  • a constrained association query focuses on a subset of the transaction database from which candidate itemsets A and B are mined efficiently.
  • In promotional patterns, user and item are two primary entities. Mining them requires evaluating the ratings between all candidate pairs of usersets and itemsets to discover significant rating patterns among them.
  • Consider a ratings dataset assuming a binary rating model, for instance as shown in Fig. 1(a). Representing this data by the transaction model (Fig. 1(b) and Fig. 1(c)), a promotional pattern query triggers an exponential number of constrained association mining queries, each corresponding to a subset of the transaction database (equivalent to a userset).
  • A naïve approach is prohibitively expensive.
  • Recommender systems seek to predict the 'rating' that a user would give to an item they had not yet considered, using models built from the item content (content-based systems) or from items rated by users similar to the given user (collaborative filtering), or a combination of both (hybrid systems).
  • recommendation techniques compute some significant associations as pairs of usersets and itemsets with coherent ratings between them.
  • A system known as Flexrecs is proposed to provide users the flexibility to filter recommendations provided by the system based on certain criteria.
  • such systems are designed for answering constrained personalized recommendation queries of each user and are not suitable for mining promotional patterns.
  • the present invention seeks to provide an improved method of processing a ratings dataset.
  • One aspect of the present invention provides a method of processing a ratings dataset, the ratings dataset incorporating data identifying a plurality of users U, a plurality of items I and a set of ratings R allocated by the users U to the items I, the method comprising:
  • the method further comprises a rank-join method to identify pairs of usersets and itemsets, the rank-join method comprising:
  • the ratings summary function is a ratings sum function
  • the ratings summary function is a ratings density function
  • $t = (Q^U, Q^I, R[Q^U, Q^I])$ is a pattern and $d_t$ is the ratings density of the pattern $t$, and wherein the ratings summary function is a ratings variance function $g(R[Q^U, Q^I])$
  • the ratings summary function $g$ is the entropy of the ratings distribution $R(u,i)$, $u \in Q^U$, $i \in Q^I$.
  • the ratings dataset incorporating data identifying a plurality of users U, a plurality of items I and a set of ratings R allocated by the users U to the items I, the method comprising:
  • processing the matrix using a biclustering algorithm to detect biclusters of subsets of the rows and columns that exhibit a high similarity score.
  • the biclustering algorithm comprises a mean square residue (MSR) function.
  • the biclustering algorithm comprises a Delta biclustering algorithm.
  • Figures 1 (a-c) show an example of a transaction data model for encoding binary ratings
  • Figure 2 is a table of projections of ratings summary functions
  • Figure 3 is a schematic diagram representing a rank-join method
  • Figure 4 is a schematic representation of the bottom up computation of the iceberg cube algorithm to compute user and item data cubes for four user attributes
  • Figure 5 is a schematic representation showing the rank join of cuboids
  • Figure 6 is a schematic representation illustrating computing tree bounds
  • Figure 7 is a matrix of binary ratings for users and items to illustrate biclustering
  • Figure 8 is a table showing dataset size
  • Figures 9 (a-d) show a comparison of the overall running time for ratings summary functions for the DCRJN and rank join algorithms
  • Figures 10 (a-d) show a comparison of the enumeration time for ratings summary functions for the DCRJN and rank join algorithms
  • Figures 11 (a-d) show a comparison of the aggregation time for ratings summary functions for the DCRJN and rank join algorithms
  • Figures 12 (a-d) show a comparison of the sorting and rank join time for ratings summary functions for the DCRJN and rank join algorithms
  • Figure 13 is a table showing sample query results
  • Figures 14 (a-b) show the overall running time for ratings summary functions
  • Figures 15 (a-d) show a comparison of the threshold experiment for ratings summary functions for the DCRJN and rank join algorithms
  • Figures 16 (a-d) show the results of the biclustering efficiency experiments
  • Figure 17 is a table showing mean residue values
  • Figures 18 (a-d) show the biclustering efficiency for different sized clusters.
  • An embodiment of the invention seeks to provide a method and system for querying and efficiently mining interesting promotional patterns.
  • the notion of constrained promotional pattern queries is introduced as a means to specify interestingness of patterns using constraints on usersets, itemsets, ratings between them, and patterns themselves.
  • Examples of queries on this ratings system include: I. Find patterns $(Q^U, Q^I)$ such that the count of ratings given by users in $Q^U$ to movies in $Q^I$ is greater than 10.
  • Other measures such as percentage, sum, avg, and variance of ratings can be defined for a threshold $\delta$.
  • Space pruning algorithms in embodiments of the invention were developed by exploiting the monotonicity properties of the rating constraints vis-a-vis the user and item data cubes.
  • Important aspects of the space pruning algorithms are as follows.
  • A given rating constraint (e.g., count of ratings > $\delta$) is projected onto the respective spaces of usersets and itemsets, and pairs of usersets and itemsets are enumerated in the order of their likelihood to satisfy the given constraint.
  • Upper bounding techniques are used to evaluate the likelihood of a userset or an itemset with which patterns satisfying the given constraint can be constructed.
  • biclustering techniques are employed to directly discover patterns in the data.
  • the ratings dataset is represented as a matrix and is input to a biclustering algorithm which outputs biclusters of users and items between which the ratings satisfy a given constraint.
  • a bicluster corresponds to a pair of userset and itemset.
  • Not all rating constraints can be handled using biclustering techniques.
  • Schema-driven and ratings-driven pattern mining approaches together cover a substantial number of constraints in embodiments of the invention.
  • greedy algorithms are integrated into the system for holistic pattern constraints.
  • a new association paradigm called promotional pattern between a subset of users and a subset of items based on the ratings given by the users to items in a collaborative rating system.
  • A suite of algorithms: a) space pruning algorithms, which take advantage of the monotonicity properties of rating constraints vis-a-vis the lattices of usersets and itemsets and use a schema-driven definition of usersets and itemsets; b) ratings-driven mining algorithms, which use biclustering models to mine coherent rating patterns between usersets and itemsets; c) greedy algorithms for holistic pattern constraints (Section 3).
  • Table 2 An example of item database
  • V include {0, 1}, {1, 2, 3, 4,
  • Constrained promotional pattern mining is a means to discover interesting rating patterns between usersets and itemsets.
  • the end-user of the system specifies the data to be mined which includes the user data, item data and ratings. Additionally, the user needs to specify the promotional patterns he is interested in through a set of constraints.
  • the result of a constrained promotional pattern query is a set of triples satisfying C and of the form:
  • $R[Q^U, Q^I]$ denotes the set of ratings $R : Q^U \times Q^I \to V$ between a userset $Q^U$ and an itemset $Q^I$ respectively.
  • Rating Constraints specify constraints that the ratings between a userset $Q^U$ and an itemset $Q^I$ need to satisfy.
  • Holistic Constraints are defined on a set of patterns that are together considered to be interesting to the end-user of our system
  • a single set constraint is denoted by $c_s$
  • a rating constraint is denoted by $c_r$
  • a holistic constraint is denoted by $c_h$.
  • a group of set constraints is denoted by $C_s$
  • rating constraints are denoted by $C_r$
  • holistic constraints are denoted by $C_h$.
  • the property p ranges over several different types of definitions, e.g., support of sets, aggregate value on an attribute of set objects, multi-dimensional variables, etc.
  • Let $Q^U$ and $Q^I$ be a userset and an itemset respectively.
  • Rating constraints specify constraints on the set of ratings between a userset
  • RSF: Ratings Summary Function
  • A Ratings Summary Function is a function of ratings between a set of users $Q^U$ and a set of items $Q^I$, $g : \{(Q^U, Q^I, R[Q^U, Q^I])\} \to \mathbb{R}$, simply denoted as $g(R[Q^U, Q^I])$.
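As a minimal sketch of what a ratings summary function could look like in code, the block below implements three RSFs over a dictionary of ratings. The data layout and the function names are hypothetical, and the density definition (the fraction of user-item pairs in the pattern that carry a rating) is an assumption, since the text does not spell it out.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout: (user, item) -> rating value.
Ratings = Dict[Tuple[str, str], float]


def restrict(ratings: Ratings, userset: Iterable[str], itemset: Iterable[str]) -> List[float]:
    """Return the ratings R[Q^U, Q^I] between a userset and an itemset."""
    users, items = set(userset), set(itemset)
    return [r for (u, i), r in ratings.items() if u in users and i in items]


def ratings_sum(ratings: Ratings, userset, itemset) -> float:
    """Sum of all ratings given by the userset to the itemset."""
    return sum(restrict(ratings, userset, itemset))


def ratings_count(ratings: Ratings, userset, itemset) -> int:
    """Number of ratings given by the userset to the itemset."""
    return len(restrict(ratings, userset, itemset))


def ratings_density(ratings: Ratings, userset, itemset) -> float:
    """Assumed definition: rated pairs divided by all possible pairs."""
    cells = len(set(userset)) * len(set(itemset))
    return ratings_count(ratings, userset, itemset) / cells if cells else 0.0


R: Ratings = {("u1", "i1"): 5, ("u1", "i2"): 4, ("u2", "i1"): 3}
print(ratings_sum(R, {"u1"}, {"i1", "i2"}))
print(ratings_density(R, {"u1", "u2"}, {"i1", "i2"}))
```

A rating constraint such as "count of ratings greater than a threshold" is then a comparison of one of these values against the threshold.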
  • Typical rating constraints include:
  • The monotonicity properties of RSFs are critical to space pruning algorithms. Let $G_U$ represent the powerset lattice of $U$, and $G_I$ the powerset lattice of $I$. Let $Q_1^U, Q_2^U$ be any two elements in $G_U$, and $Q_1^I$ and $Q_2^I$ be any two elements in $G_I$, such that $Q_1^U \subseteq Q_2^U$ and $Q_1^I \subseteq Q_2^I$. The following properties of RSFs are defined based on their monotonicity on $G_U$ and $G_I$.
  • [U-monotonic] $g$ is said to be U-monotonic if $\forall Q_1^U, Q_2^U \in G_U$ s.t. $Q_2^U$ contains $Q_1^U$ and $\forall Q^I \in G_I$, $g(Q_2^U, Q^I, R(Q_2^U, Q^I)) \geq g(Q_1^U, Q^I, R(Q_1^U, Q^I))$
  • [I-monotonic] $g$ is said to be I-monotonic if $\forall Q_1^I, Q_2^I \in G_I$ s.t. $Q_2^I$ contains $Q_1^I$ and $\forall Q^U \in G_U$, $g(Q^U, Q_2^I, R(Q^U, Q_2^I)) \geq g(Q^U, Q_1^I, R(Q^U, Q_1^I))$
  • Ratings Sum and Ratings Cover are both U-monotonic and I-monotonic
  • Average Ratings Cover is U-monotonic but not I-monotonic
  • Ratings Density is neither a U-monotonic nor an I-monotonic function.
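The monotonicity claims can be illustrated with a brute-force check over the powerset lattice of a tiny dataset. This is only a sketch: the helper names are hypothetical, the density definition (rated pairs over possible pairs) is an assumption, and passing the check on one dataset does not prove monotonicity in general, whereas a single failure does disprove it.

```python
from itertools import combinations


def powerset(xs):
    """All non-empty subsets of xs, as frozensets."""
    xs = list(xs)
    return [frozenset(c) for n in range(1, len(xs) + 1) for c in combinations(xs, n)]


def ratings_sum(ratings, userset, itemset):
    return sum(r for (u, i), r in ratings.items() if u in userset and i in itemset)


def ratings_density(ratings, userset, itemset):
    # Assumed definition: rated pairs divided by all possible pairs.
    rated = sum(1 for (u, i) in ratings if u in userset and i in itemset)
    return rated / (len(userset) * len(itemset))


def is_u_monotonic(g, ratings, users, items):
    """True if growing the userset never lowers g, for every fixed itemset."""
    usersets, itemsets = powerset(users), powerset(items)
    return all(
        g(ratings, q1, qi) <= g(ratings, q2, qi)
        for q1 in usersets for q2 in usersets if q1 <= q2
        for qi in itemsets
    )


def is_i_monotonic(g, ratings, users, items):
    """True if growing the itemset never lowers g, for every fixed userset."""
    usersets, itemsets = powerset(users), powerset(items)
    return all(
        g(ratings, qu, q1) <= g(ratings, qu, q2)
        for q1 in itemsets for q2 in itemsets if q1 <= q2
        for qu in usersets
    )


R = {("u1", "i1"): 5, ("u2", "i2"): 3}  # non-negative ratings
U, I = ["u1", "u2"], ["i1", "i2"]
print(is_u_monotonic(ratings_sum, R, U, I), is_i_monotonic(ratings_sum, R, U, I))
print(is_u_monotonic(ratings_density, R, U, I), is_i_monotonic(ratings_density, R, U, I))
```

Here the density of ({u1}, {i1}) is 1, but adding u2 to the userset halves it, which is the kind of counterexample that makes Ratings Density non-monotonic.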
  • Space pruning algorithms in embodiments of the invention are developed to handle both monotonic and non-monotonic functions.
  • Computing constrained promotional pattern queries involves, given a set of constraints, efficiently enumerating candidate usersets, candidate itemsets and patterns satisfying the constraints for all pairs of usersets and itemsets. Though all types of constraints are equally important from the perspective of constrained promotional pattern queries, due to space limitations, embodiments of the present invention seek to provide efficient algorithms for rating constraints and holistic constraints. A suite of algorithms for rating constraints, schema-driven space pruning algorithms and ratings-driven biclustering techniques are discussed in the following sections.
  • In schema-driven pattern mining, usersets and itemsets are defined based on a schema. The following terms concerning schema-driven pattern mining are defined first.
  • A user $u$ satisfies query $Q^U$, denoted by $u \models Q^U$, if $u$ satisfies each atomic query in $Q^U$.
  • An item query $Q^I$ is a conjunction of a set of $m \le d_I$ atomic queries.
  • An item $i \in I$ satisfies the query $Q^I$, denoted by $i \models Q^I$, if $i$ satisfies each atomic query in $Q^I$.
  • the space of user queries is the output of all group-bys on the user database, known as
  • This space consists of the resulting groups of each of the 7 group-by operations on dimensions $A_1$, $A_2$, $A_3$, $A_1A_2$, $A_2A_3$, $A_1A_3$, $A_1A_2A_3$.
  • the space of item queries is computed by the item data cube.
  • the output of a user query is a userset and that of an item query is an itemset.
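To make the space of user queries concrete, the following sketch materializes every group-by over three user attributes, giving the 7 cuboids mentioned above. The attribute names and the tiny user table are invented for illustration, and this exhaustive enumeration is not the iceberg-cube computation discussed later.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, attributes):
    """Map each non-empty attribute subset (cuboid) to its groups.

    rows: {object_id: {attribute: value}}
    Returns {attribute_tuple: {value_tuple: set_of_object_ids}}.
    """
    cube = {}
    for n in range(1, len(attributes) + 1):
        for dims in combinations(attributes, n):
            groups = defaultdict(set)
            for key, row in rows.items():
                groups[tuple(row[a] for a in dims)].add(key)
            cube[dims] = dict(groups)
    return cube


# Hypothetical user database with three attributes.
users = {
    "u1": {"gender": "F", "age": "18-25", "occupation": "student"},
    "u2": {"gender": "M", "age": "25-35", "occupation": "student"},
    "u3": {"gender": "F", "age": "18-25", "occupation": "engineer"},
}

cube = data_cube(users, ["gender", "age", "occupation"])
print(len(cube))                                  # number of cuboids
print(cube[("gender", "age")][("F", "18-25")])    # one userset
```

Each group in a cuboid is the userset returned by one user query, for example the conjunction gender = F and age = 18-25.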
  • the main challenge is to efficiently enumerate candidate usersets and itemsets, and compute patterns that satisfy the constraint.
  • Let $c_r$ be a rating constraint (e.g., Ratings Density > $\delta$) and $C_s$ comprise all the set constraints.
  • An approach based on the rank-join method is then discussed to compute $c_r$. It takes two sorted lists of usersets and itemsets satisfying the set constraints $C_s$, and outputs a set of patterns satisfying $c_r$.
  • the general idea is as follows.
  • The value of an RSF on a pattern is denoted by score.
  • A naïve approach computes the score for every pair of candidate userset and itemset in no particular order and outputs pairs that have scores > $\delta$.
  • this search is achieved by projecting the RSF on the spaces of usersets and itemsets separately, and selecting sets in the order of their likeliness to contribute to a higher score.
  • An RSF is upper bounded by a composite monotonic function of two score projection functions which operate on the spaces of usersets and itemsets respectively. This is illustrated in Equation 4.1 as follows using Ratings Density.
  • In Equation 4.1, the bounding function is the composite function of two score projection functions $H_U(Q^U, R)$ and $H_I(Q^I, R)$. $H_U$ upper bounds the score contribution of $Q^U$ in patterns comprising $Q^U$ by aggregating its ratings on the entire item database. Fig. 2 lists the projections of other aggregate RSFs.
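The pruning idea can be sketched for Ratings Sum rather than Ratings Density, since the text of Equation 4.1 is not reproduced here. The sketch assumes non-negative ratings and uses min(H_U, H_I) as the composite monotonic bound, which is an assumption of this example and not necessarily the projection listed in Fig. 2. It is a simplified sorted nested loop with early termination, not the full rank-join operator.

```python
def user_projection(ratings, userset):
    """H_U: the userset's ratings summed over the entire item database."""
    return sum(r for (u, _), r in ratings.items() if u in userset)


def item_projection(ratings, itemset):
    """H_I: the itemset's ratings summed over the entire user database."""
    return sum(r for (_, i), r in ratings.items() if i in itemset)


def ratings_sum(ratings, userset, itemset):
    return sum(r for (u, i), r in ratings.items() if u in userset and i in itemset)


def patterns_above(ratings, usersets, itemsets, delta):
    """Return patterns with Ratings Sum > delta and the number of pairs scored."""
    by_user = sorted(((user_projection(ratings, q), q) for q in usersets),
                     key=lambda p: p[0], reverse=True)
    by_item = sorted(((item_projection(ratings, q), q) for q in itemsets),
                     key=lambda p: p[0], reverse=True)
    results, evaluated = [], 0
    for h_u, q_u in by_user:
        if h_u <= delta:          # no later userset can exceed delta
            break
        for h_i, q_i in by_item:
            if min(h_u, h_i) <= delta:   # bound fails for all later itemsets
                break
            evaluated += 1
            score = ratings_sum(ratings, q_u, q_i)
            if score > delta:
                results.append((q_u, q_i, score))
    return results, evaluated


R = {("u1", "i1"): 5, ("u1", "i2"): 4, ("u2", "i1"): 1, ("u3", "i3"): 2}
usersets = [frozenset(s) for s in ({"u1"}, {"u2"}, {"u3"}, {"u1", "u2"})]
itemsets = [frozenset(s) for s in ({"i1"}, {"i2"}, {"i3"}, {"i1", "i2"})]
found, evaluated = patterns_above(R, usersets, itemsets, delta=5)
print(len(found), "patterns found after scoring", evaluated, "of", 16, "pairs")
```

The bound is valid because, with non-negative ratings, the sum over a pattern can never exceed the sum of either side taken over the whole database.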
  • The bottom-up computation of an iceberg cube is used to efficiently compute user and item data cubes. This is illustrated in Fig. 4 when there are four user attributes. Each node in this tree, called a cuboid, is a group-by on a subset of attributes, and partitions the database into a number of disjoint sets. To compute rating constraints, the rank join computation between the sorted lists of all usersets and itemsets (from the previous section) is modelled as multiple rank join computations between pairs of user and item cuboids (Fig. 5). The user and item data cubes are materialized on-the-fly. The pseudocode for this algorithm is given in Algorithm 2. This approach does not offer significant improvement in efficiency by itself compared to the approach considered in the previous section. Several optimization strategies are discussed below using this approach. Prior to that, some preliminaries on data cubes are presented that enable those skilled in the art to understand the optimization strategies.
  • Algorithm 1: GroupRJN($S_U$, $S_I$)
  • A set $Q_l^U \in P^l$ is a parent of $Q_m^U \in P^m$ (or equivalently, $Q_m^U$ is a child of $Q_l^U$) if all the attributes of $P^l$ are in $P^m$ as well.
  • A set $Q^U$ is said to be a most specific set (MSS) if all the attributes of the database are assigned some values. All the usersets which belong to the cuboid $A_1A_2A_3A_4$ in Fig. 4 are most specific sets.
  • The most specific descendant set of a set $Q^U$ is the set of all most specific sets which have the same values for the attributes on which $Q^U$ is grouped.
  • The score projection functions $H_U$ and $H_I$ are functions on sets computed by user and item data cubes respectively. Such functions can be classified into three types depending on their monotonicity properties on the data cube. For example:
  • $H_U(Q_2^U, R) \le H_U(Q_1^U, R)$.
  • $H_U(Q^U, R) = \sum_{u \in Q^U} \sum_{i \in I} R(u,i)$ is monotonic.
  • $H_U$ is said to be antimonotonic if, $\forall Q_1^U, Q_2^U$ s.t. $Q_1^U$ is a parent of $Q_2^U$, the inequality is reversed.
  • Algorithm 2: DCRJN
  • The rank join between the sets of an item and user cuboid proceeds similarly to the rank join method discussed in Algorithm 1. Additionally, vertical pruning is employed to avoid computing the children of a set $Q^U$ (or $Q^I$) if they are unlikely to produce patterns that satisfy the given rating constraints.
  • The component function score is the value of $H_U(Q^U, R)$ (or $H_I(Q^I, R)$), and the tree upper bound is the upper bound of $H_U(Q^U, R)$ on all sets that are children of $Q^U$ in the data cube. Two conditions are checked during the rank-join between the cuboids.
  • the computation of tree upper bounds for various score projection functions is discussed below.
  • The tree upper bound of a set $Q^U$ is the score computed by the component function on $Q^U$ itself, $H_U(Q^U, R)$.
  • For $H_U(Q^U, R)$, the tree upper bound is the minimum of the scores computed by $H_U$ on all most specific descendants of the set $Q^U$.
  • The most specific descendants of several sets are computed before computing the sets themselves. For example, in Fig. 6 the process begins with $B_1$ and proceeds in a depth-first manner until the leaf node $B_1B_2B_3B_4$ is reached. For all sets which are parents of sets in $B_1B_2B_3B_4$ and belong to cuboids not processed yet, the tree upper bound corresponds to the lower bound on the scores of their most specific descendants in
  • A nontrivial upper bound is the maximum of the scores computed by $H_U$ on all most specific descendants of the set $Q^U$.
  • Sharing Aggregation involves aggregating on the users (items) of the entire group. For example, involves aggregating for each user the ratings on the
  • Ratings aggregates for individual users $u$ ($\sum_{i} R(u,i)$) are pre-computed and stored in a hash table.
  • The aggregate score of a group $Q^U$ can be computed from its children if the aggregate scores of all its children are known. Since BUC proceeds in a depth-first manner, aggregates for the children of several groups are available before the groups are computed. For example, in Fig. 4 scores for a group $Q^U$ of the cuboid $A_1A_2A_4$ can be computed by simply aggregating the scores of all its children in the cuboid $A_1A_2A_3A_4$.
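The sharing of aggregates can be sketched as follows for an additive score such as a ratings sum. The per-user totals play the role of the hash table, and a parent group's score is obtained by rolling up the scores of its children instead of rescanning the ratings. All names and the sample data are hypothetical, and the roll-up shown applies only to aggregates that add across disjoint groups.

```python
from collections import defaultdict


def user_totals(ratings):
    """Hash table: user -> sum of that user's ratings over all items."""
    totals = defaultdict(float)
    for (user, _item), rating in ratings.items():
        totals[user] += rating
    return dict(totals)


def group_scores(users, totals, positions):
    """Score of every group in the cuboid defined by attribute positions."""
    scores = defaultdict(float)
    for user, attrs in users.items():
        scores[tuple(attrs[p] for p in positions)] += totals.get(user, 0.0)
    return dict(scores)


def roll_up(child_scores, keep):
    """Aggregate child groups into parent groups by dropping attributes.

    keep: positions within the child key that the parent cuboid retains.
    """
    parent = defaultdict(float)
    for key, score in child_scores.items():
        parent[tuple(key[p] for p in keep)] += score
    return dict(parent)


# Hypothetical users with attributes (gender, age, occupation).
users = {
    "u1": ("F", "18-25", "student"),
    "u2": ("F", "18-25", "engineer"),
    "u3": ("M", "25-35", "student"),
}
ratings = {("u1", "i1"): 5, ("u1", "i2"): 4, ("u2", "i1"): 3, ("u3", "i2"): 2}

totals = user_totals(ratings)
leaf = group_scores(users, totals, (0, 1, 2))   # most specific cuboid
parent = roll_up(leaf, keep=(0, 1))             # cuboid on (gender, age)
print(parent)
```

Because the children of a group partition it, summing their scores gives the same value as aggregating the group directly.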
  • pruning algorithms which compute rating constraints guided by the monotonicity properties of ratings summary functions.
  • the techniques are extendible to any aggregate RSFs which can be projected on the spaces of usersets and itemsets and scores of patterns can be upper bounded by composite monotonic functions of the projections.
  • The pruning power of the algorithms depends on three factors: 1) the component functions chosen to upper bound the scoring function; 2) the histograms of database objects over different attributes; 3) the distribution of ratings. For example, for the component functions of Ratings Density, a long tail of ratings along with a large number of itemsets with high cardinality can lead to effective pruning. This can be explained as follows.
  • Biclustering or co-clustering is an effective technique to discover subsets of rows in a data matrix that exhibit similar behavior across a subset of columns.
  • Biclustering algorithms enable certain types of rating constraints to be computed efficiently. Biclusters can be both overlapping as well as non-overlapping.
  • Biclusters with constant values correspond to a scenario where all the users in a bicluster give the same rating to all the items.
  • Biclusters with constant values on columns or rows corresponds to biclusters in which all users have the same distribution of ratings for the itemset, or all items have the same distribution of ratings for a set of users.
  • Biclusters with coherent additive values corresponds to the scenario where the ratings on each row of a bicluster add up to the same value.
  • Minimum Entropy Biclustering discovers biclusters which have an entropy less than a given threshold. Such an algorithm is useful in minimizing the variance in ratings between a set of users and a set of items. A subset of rows and a subset of columns is considered to be a bi-cluster if they together exhibit high similarity score, which is measured by a function defined as mean squared residue (MSR).
  • the algorithm starts with the original data matrix by computing its mean squared residue and at each step, removes a row or column which results in maximum drop in the MSR until its value reaches below a threshold. It then adds a row or column with maximum rise in MSR until the value reaches above the threshold.
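A minimal sketch of the mean squared residue and of the deletion phase just described is given below. It assumes the standard residue definition (entry minus row mean minus column mean plus overall mean, squared and averaged), which the text does not reproduce, and it omits the subsequent addition phase. Function names and the sample matrix are hypothetical.

```python
def msr(matrix, rows, cols):
    """Mean squared residue of the submatrix given by rows and cols."""
    rows, cols = list(rows), list(cols)
    row_mean = {r: sum(matrix[r][c] for c in cols) / len(cols) for r in rows}
    col_mean = {c: sum(matrix[r][c] for r in rows) / len(rows) for c in cols}
    overall = sum(row_mean.values()) / len(rows)
    return sum(
        (matrix[r][c] - row_mean[r] - col_mean[c] + overall) ** 2
        for r in rows for c in cols
    ) / (len(rows) * len(cols))


def delete_nodes(matrix, threshold):
    """Greedily drop the row or column whose removal lowers the MSR most."""
    rows = set(range(len(matrix)))
    cols = set(range(len(matrix[0])))
    while msr(matrix, rows, cols) > threshold:
        candidates = []
        if len(rows) > 1:
            candidates += [("row", r) for r in rows]
        if len(cols) > 1:
            candidates += [("col", c) for c in cols]
        if not candidates:
            break  # a single cell is left; nothing more to remove

        def after_removal(candidate):
            kind, index = candidate
            if kind == "row":
                return msr(matrix, rows - {index}, cols)
            return msr(matrix, rows, cols - {index})

        kind, index = min(candidates, key=after_removal)
        (rows if kind == "row" else cols).discard(index)
    return rows, cols


# Rows 0-2 differ only by a constant offset; row 3 breaks the pattern.
M = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [9, 0, 7]]
print(delete_nodes(M, threshold=0.01))
```

In this example a single deletion of the incoherent row brings the residue to zero, so the loop stops with the first three rows and all columns.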
  • Delta biclustering employs a different heuristic under the same quality metric, mean squared residue.
  • the algorithm starts with several randomly generated biclusters. At each step, it determines the best action for each row and each column with an action being deleting or adding a row or column to a bicluster. It performs best actions for every row and every column sequentially until no further improvement can be gained in the mean squared residue of the biclusters.
  • the main advantage of Delta biclustering is that it generates overlapping biclusters with the number of biclusters specified a priori.
  • The time complexity of both the algorithms is $O((N + M) \cdot N \cdot M \cdot k \cdot p)$ where $k$ is the number of bi-clusters and $p$ is the number of iterations needed to converge to stable biclusters.
  • the mean squared residue can be modified accordingly in the implementation based on the type of biclusters queried for, e.g., biclusters with constant values, bicluster with constant rows, and biclusters with constant columns.
  • A top-$l$ constraint can be handled in the algorithms discussed in the previous section by assigning the parameter $\delta$ to the score of the $l^{th}$ pattern and updating it when the $l^{th}$ pattern changes as the algorithm progresses.
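One way to picture the dynamic threshold is a bounded min-heap whose smallest element is the current l-th best pattern. The sketch below is an illustration under that assumption, with hypothetical names, and shows only the threshold maintenance, not its integration with the pruning algorithms.

```python
import heapq


def top_l(scored_patterns, l):
    """Keep the l best-scoring patterns and the running threshold delta.

    scored_patterns: iterable of (pattern, score) pairs in arrival order.
    Returns (patterns sorted by descending score, final delta).
    """
    heap = []                      # min-heap of (score, arrival index, pattern)
    delta = float("-inf")          # no threshold until l patterns are seen
    for index, (pattern, score) in enumerate(scored_patterns):
        if len(heap) < l:
            heapq.heappush(heap, (score, index, pattern))
        elif score > delta:
            heapq.heapreplace(heap, (score, index, pattern))
        if len(heap) == l:
            delta = heap[0][0]     # score of the current l-th pattern
    best = [(p, s) for s, _, p in sorted(heap, reverse=True)]
    return best, delta


stream = [("a", 3), ("b", 7), ("c", 5), ("d", 9), ("e", 1)]
best, delta = top_l(stream, l=2)
print(best, delta)
```

Any candidate whose upper bound does not exceed the current delta can be skipped, which is how the top-l constraint reuses the threshold-based pruning.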
  • we post-process the promotional patterns satisfying the rating constraints using greedy heuristics which take as input a set of patterns possibly in the descending order of their RSF scores, and output a set of patterns that satisfy the given threshold constraints.
  • One such heuristic is listed in Algorithm 3 below for Threshold Coverage constraint.
  • The heuristic takes as input two threshold parameters $\beta_1$ and $\beta_2$ for the user and item databases.
  • The algorithm proceeds as follows. It maintains two sets: one is the set of users covered by the usersets in patterns seen so far, denoted by $T_U$, and the other is the set of items covered by the itemsets in the patterns seen so far, denoted by $T_I$. For every new pattern added to the set, if $|T_U| \ge \beta_1$ and
  • the pseudocode for the algorithm is given in Algorithm 3 below. Algorithm 3: Computing Holistic Constraints
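Since the pseudocode of Algorithm 3 is not reproduced in this text, the following is only a rough sketch of a greedy coverage heuristic in its spirit. It assumes that the two thresholds are fractions of the user and item databases to be covered, that patterns are consumed in descending score order, and that the procedure stops as soon as both coverage thresholds are met. All names are hypothetical.

```python
def threshold_coverage(patterns, n_users, n_items, beta_users, beta_items):
    """Greedily select patterns until both coverage thresholds are reached.

    patterns: list of (userset, itemset, score) tuples.
    Returns the selected patterns, or None if the thresholds are never met.
    """
    covered_users, covered_items = set(), set()
    selected = []
    ordered = sorted(patterns, key=lambda p: p[2], reverse=True)
    for userset, itemset, score in ordered:
        selected.append((userset, itemset, score))
        covered_users |= userset
        covered_items |= itemset
        if (len(covered_users) >= beta_users * n_users
                and len(covered_items) >= beta_items * n_items):
            return selected
    return None


patterns = [
    ({"u3"}, {"i2", "i3"}, 7.0),
    ({"u1", "u2"}, {"i1"}, 9.0),
    ({"u4"}, {"i4"}, 2.0),
]
chosen = threshold_coverage(patterns, n_users=4, n_items=4,
                            beta_users=0.75, beta_items=0.75)
print(chosen)
```

With four users and four items, the two highest-scoring patterns already cover three of each, so the third pattern is not needed at a 75% threshold.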
  • The analysis includes (i) performance evaluation of the space pruning algorithms in terms of running time and performance of queries involving multiple types of constraints including set, rating and holistic constraints (Section 4.1); (ii) performance evaluation of biclustering techniques in terms of running time and quality evaluation of the biclusters generated (Section 4.2).
  • Dataset Size: The running time of the algorithms was considered for different sizes of the dataset relevant to the mining task specified by a promotional pattern query. Six subsets of the MovieLens dataset of increasing size were considered for each experiment. The dataset size is characterized by the number of users, number of items, the number of ratings, and the number of usersets and itemsets for the given schema (shown in Figure 8).
  • the algorithms were evaluated for different types of ratings summary functions characterized by their monotonicity properties on the userset and itemset lattices.
  • The four selected RSFs were Ratings Count and Ratings Sum, both of which are U-monotonic and I-monotonic, Ratings Density, which is neither U-monotonic nor I-monotonic, and Average Ratings Cover, which is U-monotonic but not I-monotonic.
  • Constraint Type: Lastly, the performance of the algorithms was evaluated for two types of rating constraint on RSFs, namely (1) the threshold $\delta$-constraint and (2) the top-$l$ constraint. The $\delta$-constraint computes all the patterns whose RSF score is greater than $\delta$.
  • The top-$l$ constraint computes the patterns with the top-$l$ scores.
  • The value of $\delta$ was varied to cover the entire range of scores induced by the RSF on all pairs of usersets and itemsets.
  • The $\delta$ value for Ratings Count is varied from $10^1$ to $10^5$.
  • Running Time which measures the overall running time of the algorithms
  • Enumeration Time measures the time to materialize usersets and itemsets for result computation
  • Aggregation Time measures the score aggregation time for the given RSF ($g$) on pairs of usersets and itemsets, their score projection functions ($H_U$ and $H_I$) on individual sets and their tree upper bounds
  • Sorting and Rank Join Time measures the time to perform sorting and rank join in both algorithms. The performance of the algorithms is demonstrated in the following experiments.
  • For certain RSFs, over larger datasets DCRJN is 10-20 times faster. Tree upper bounds control the number of usersets and itemsets materialized. Hence, the number of usersets and itemsets between which ratings are aggregated is also significantly lowered, and the enumeration time, aggregate score computation and rank join time are several orders better than in the basic rank join implementation.
  • Threshold: Unlike in the previous experiment, which assumed a top-10 constraint by which the value of threshold $\delta$ is dynamically computed as the algorithm progresses, the value of $\delta$ is fixed and varied over the range of scores induced by an RSF for experiments in this evaluation.
  • The performance comparison (overall running time) of DCRJN with the GroupRJN algorithm is presented in Figures 15 (a-d). It was observed that for higher values of $\delta$ both algorithms perform better, and DCRJN performs 3 to 10 orders of magnitude faster for all RSFs except Ratings Density. For Ratings Density a better running time was obtained than the basic rank join implementation by an order of 0.25 (on average).
  • Biclustering algorithms were implemented for three different types of biclusters: 1) Biclusters with constant values (Type 1); 2) Biclusters with constant values on rows (Type 2); 3) Biclusters with constant values on columns (Type 3).
  • The running times of the implementation are plotted by varying the number of clusters generated and dataset size in Figures 16 and 18 (a-d).
  • The running time for delta biclustering on a dataset in which the range of matrix values is large is usually of the order of $10^4$ seconds for a 3000 x 500 data matrix.
  • the range of matrix values is much smaller (1 to 5) compared to other datasets and the running time is expected to be much higher.
  • Figure 16(a-d) plots the running time against number of clusters for four different sizes of the data matrix: (1) 200 x 100 (2) 400 x 200 (3) 600 x 300 (4) 800 x 400.
  • Figure 18(a-d) plots running time against the dataset size by varying the number of clusters generated. The running time increases rapidly for Type 1 clusters with increase in the number of biclusters and the dataset size. They achieve a much smaller running time for Type 2 and Type 3 biclusters.
  • Collaborative rating systems generate large amounts of data in the form of ratings and text reviews given by users to items which can be leveraged to extract business intelligence for promoting sets of items to sets of users. It is important to have an expressive language for constrained promotional pattern queries specifying different types of constraints on usersets, itemsets, ratings and patterns. It is equally important, given the complexity of mining and exploration tasks involved, for the techniques employed to be computationally efficient and scalable at this level.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Collaborative rating systems have evolved as important tools for users in dealing with information overload while making decisions pertaining to content hosted on the Web. Such systems allow users to evaluate content in the form of ratings. For example, websites such as yelp.com, imdb.com and amazon.com allow users to express their preferences by rating content-items. An interesting type of pattern in such systems is 'who' rated 'what' and 'how'. A data mining system known as PromPt is disclosed for exploring patterns in ratings given by users to items. A new type of association paradigm called promotional pattern is introduced. Promotional patterns are summarized descriptions of ratings given by a subset of users to a subset of items in the system, and the goal is to mine interesting patterns. Such functionality is demonstrated as being useful in a wide variety of real application scenarios such as business intelligence in promotion and advertising.

Description

A Method of Processing a Ratings Dataset
Description of Invention
The present invention relates to a method of processing a ratings dataset, and more particularly relates to a method of mining a ratings dataset for promotional patterns.
Collaborative rating systems for products, movies, businesses, and news have proliferated rapidly on the Web. Websites such as yelp, imdb, amazon and news broadcasting sites such as digg provide a platform for users to evaluate the content-items hosted on these sites by rating them, thereby helping other users in the system make informed decisions pertaining to content of their interest. Such a collaborative rating system has a large number of users, items, and a very large number of ratings between them. An interesting type of pattern in such systems is 'who' rated 'what' and 'how'.
Users rate products on sites such as amazon.com and movies on imdb.com; each user rates many items (movies/products) and each item is rated by many users. Examples of interesting patterns in such rating systems are given below.
PATTERN EXAMPLE 1. 80% of ratings for cold weather accessories as rated by female users between age 18-25 are 5/5.
PATTERN EXAMPLE 2. The average of ratings for James Cameron's action movies given by male students between age 25-35 is greater than 8/10.
Patterns of this kind may be known as promotional patterns. A promotional pattern is a summarized description of ratings between a subset of users and a subset of items in the system satisfying certain prespecified constraints on the ratings between them and the sets themselves. A subset of users is denoted by a userset and a subset of items is denoted by an itemset. There are limitations with conventional techniques for mining a collaborative rating system for such patterns in ratings between usersets and itemsets. Nevertheless, such patterns are a direct indication of how cross-sections of users have historically rated various categories of items, and hence offer a rich source of business intelligence in promotion and advertising.
From a micro-economic perspective, promoting specific types of products to specific communities of customers is a low cost, profit-driven marketing strategy. Retailers often design such promotions, e.g., 50% discount on cold weather accessories to women. More recently, it is known to promote individual objects through ranking in appropriate communities identified using a multi-dimensional customer database.
One of the most well studied mining tasks on such datasets is association rule mining. Often, it is formulated as a market basket problem, in which a set of items rated by a user are organized into a transaction and the dataset comprises transactions of all users. The goal of this task is to compute significant associations between two disjoint sets of items, A and B, satisfying a given support and confidence. Interestingness of a rule is specified as a query involving constraints on the composition of sets A, B and association between them. Constraints involving either set A or B are called single-variable constraints, e.g., multi-dimensional constraints such as A.att = val, aggregation constraints such as sum(B.item.price) > $1000. Efficient algorithms based on frequent itemset mining have been developed by exploiting the monotonicity properties of the constraints on the itemset lattice. Here, the focus is primarily on finding sets of items co-occurring frequently. Information about preferences of cross-sections of users towards sets of items is not available. A constrained association query focuses on a subset of the transaction database from which candidate itemsets A and B are mined efficiently.
However, in promotional patterns user and item are two primary entities. Mining them requires evaluating the ratings between all candidate pairs of usersets and itemsets to discover significant rating patterns among them. Consider a ratings dataset assuming a binary rating model, for instance as shown in Fig. 1 (a). Representing this data by the transaction model (Fig. 1 (b) and Fig. 1 (c)), a promotional pattern query triggers an exponential number of constrained association mining queries, each corresponding to a subset of the transaction database (equivalent to a userset). Hence, a naive approach is prohibitively expensive.
Recommender systems seek to predict the 'rating' that a user would give to an item they had not yet considered, using models built from the item content (content-based systems) or from items rated by users similar to the given user (collaborative filtering), or a combination of both (hybrid systems). In the process, recommendation techniques compute some significant associations as pairs of usersets and itemsets with coherent ratings between them. However, they do not offer the flexibility needed for exploratory mining of promotional patterns. More recently, a system known as Flexrecs is proposed to provide users the flexibility to filter recommendations provided by the system based on certain criteria. However, such systems are designed for answering constrained personalized recommendation queries of each user and are not suitable for mining promotional patterns.
The present invention seeks to provide an improved method of processing a ratings dataset.
One aspect of the present invention provides a method of processing a ratings dataset, the ratings dataset incorporating data identifying a plurality of users U, a plurality of items I and a set of ratings R allocated by the users U to the items I, the method comprising:
defining a subset of users U in the dataset as a userset $Q^U$,
defining a subset of items I in the dataset as an itemset $Q^I$,
receiving at least one rating constraint specifying at least one constraint on the set of ratings between the userset $Q^U$ and the itemset $Q^I$,
inputting each rating constraint into a ratings summary function $g(R[Q^U, Q^I])\ \theta\ \delta$ to define a function of ratings between the userset $Q^U$ and the itemset $Q^I$, where $\theta$ is selected from one of $=, >, <, \geq, \leq$ and $\delta \in \mathbb{R}$,
projecting the ratings summary function separately as $H_U(Q^U, R)$ on the space of all usersets and $H_I(Q^I, R)$ on the space of all itemsets to identify pairs of usersets and itemsets $(Q^U, Q^I)$ that have a score for the ratings summary function that is greater than $\delta$, and
analysing the identified pairs of usersets and itemsets to find patterns in the ratings dataset according to each rating constraint.
Preferably, the method further comprises a rank-join method to identify pairs of usersets and itemsets, the rank-join method comprising:
calculating a score $H_U(Q^U, R)$ for all usersets in the space,
sorting a list of usersets by the calculated scores,
calculating a score $H_I(Q^I, R)$ for all itemsets belonging to the space of itemsets,
sorting a list of itemsets by the calculated scores,
merging the list of usersets with the list of itemsets in their sorted orders and ranking them.
Conveniently, the ratings summary function is ratings count function $g(R[Q^U, Q^I]) = \sum_{u \in Q^U} \sum_{i \in Q^I} \mathbb{I}(u, i)$, where $\mathbb{I}$ is an indicator function with value 1 if u has rated i, and 0 otherwise.
Advantageously, the ratings summary function is ratings sum function $g(R[Q^U, Q^I]) = \sum_{u \in Q^U} \sum_{i \in Q^I} R(u, i)$.
Preferably, the ratings dataset is a binary ratings model where $L(u_i)$ denotes the set of items rated 1 by user $u_i$ and $L(Q^U) = \bigcup_{u_i \in Q^U} L(u_i)$, and wherein the ratings summary function is ratings cover function $g(R[Q^U, Q^I])$ =
Figure imgf000006_0001
Conveniently, the ratings summary function is ratings density function
Figure imgf000006_0002
Advantageously, $t = (Q^U, Q^I, R[Q^U, Q^I])$ is a pattern and $d_t$ is the ratings density of the pattern t, and wherein the ratings summary function is ratings variance function $g(R[Q^U, Q^I])$
Figure imgf000006_0003
Preferably, the ratings dataset is a binary ratings model where $L(u_i)$ denotes the set of items rated 1 by user $u_i$ and $L(Q^U) = \bigcup_{u_i \in Q^U} L(u_i)$, and wherein the ratings summary function is average ratings cover function
Figure imgf000006_0004
Conveniently, the ratings summary function g is the entropy of ratings distribution $\{R(u, i) \mid u \in Q^U, i \in Q^I\}$.
Advantageously, the ratings dataset incorporating data identifying a plurality of users U, a plurality of items I and a set of ratings R allocated by the users U to the items I, the method comprising:
storing the ratings dataset as a matrix with the users and the items in respective rows and columns; and
processing the matrix using a biclustering algorithm to detect biclusters of subsets of the rows and columns that exhibit a high similarity score.
Preferably, the biclustering algorithm comprises a mean square residue (MSR) function.
Conveniently, the biclustering algorithm comprises a Delta biclustering algorithm.
In order that the invention may be more readily understood, and so that further features thereof may be appreciated, embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings in which:
Figures 1 (a-c) show an example of a transaction data model for encoding binary ratings,
Figure 2 is a table of projections of ratings summary functions,
Figure 3 is a schematic diagram representing a rank-join method,
Figure 4 is a schematic representation of the bottom up computation of the iceberg cube algorithm to compute user and item data cubes for four user attributes,
Figure 5 is a schematic representation showing the rank join of cuboids,
Figure 6 is a schematic representation illustrating computing tree bounds,
Figure 7 is a matrix of binary ratings for users and items to illustrate biclustering,
Figure 8 is a table showing dataset size,
Figures 9 (a-d) show a comparison of the overall running time for ratings summary functions for the DCRJN and rank join algorithms,
Figures 10 (a-d) show a comparison of the enumeration time for ratings summary functions for the DCRJN and rank join algorithms,
Figures 11 (a-d) show a comparison of the aggregation time for ratings summary functions for the DCRJN and rank join algorithms,
Figures 12 (a-d) show a comparison of the sorting and rank join time for ratings summary functions for the DCRJN and rank join algorithms,
Figure 13 is a table showing sample query results,
Figures 14 (a-b) show the overall running time for ratings summary functions,
Figures 15 (a-d) show a comparison of the threshold experiment for ratings summary functions for the DCRJN and rank join algorithms,
Figures 16 (a-d) show the results of the biclustering efficiency experiments,
Figure 17 is a table showing mean residue values, and
Figures 18 (a-d) show the biclustering efficiency for different sized clusters.
An embodiment of the invention, known as PromPt, seeks to provide a method and system for querying and efficiently mining interesting promotional patterns. The notion of constrained promotional pattern queries is introduced as a means to specify interestingness of patterns using constraints on usersets, itemsets, ratings between them, and patterns themselves. Denoting a subset of users in a ratings system (e.g., imdb) by $Q^U$ and a subset of items by $Q^I$, examples of queries on this ratings system include:
I. Find patterns $(Q^U, Q^I)$ such that the count of ratings given by users in $Q^U$ to movies in $Q^I$ is greater than 10. In general, other measures such as percentage, sum, avg, and variance of ratings can be defined for a threshold $\delta$.
II. Find patterns $(Q^U, Q^I)$ such that $Q^U$ comprises users in the age group 25-35 and $Q^I$ comprises only movies directed by James Cameron.
III. Find a set of k patterns $\{(Q^U, Q^I)\}$ such that usersets from no two patterns overlap by more than $\beta_1$ and itemsets from no two patterns overlap by more than $\beta_2$. In this example, we are interested in a holistic constraint such as diversity in the composition of usersets and itemsets of the patterns mined.
Conventional approaches have severe limitations with regard to constrained promotional pattern mining. In constrained association mining, certain constraints are characterized by their monotonicity property on the itemsets, thus enabling efficient algorithms to be developed. Constraints on promotional patterns are not necessarily monotonic on the usersets or itemsets. As an example, the constraint that the count of ratings given by users in $Q^U$ to movies in $Q^I$ be greater than a threshold $\delta$ is not monotonic in the transaction data model. Consider the transaction model representation (Fig. 1 (b)). For an itemset X, the set of transactions (userset) it is contained in is denoted by T(X). One is interested in $|T(X)| \cdot |X| > \delta$. Let $X_1$ and $X_2$ be two itemsets such that $X_1 \subset X_2$; then $|X_1| < |X_2|$ and $|T(X_1)| \geq |T(X_2)|$, so the product $|T(X)| \cdot |X|$ may either increase or decrease as X grows. Clearly, such constraints induce technical challenges that cannot be easily tackled using conventional techniques for constrained association mining.
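The non-monotonicity of the count constraint can be checked on a toy example. The following Python sketch is illustrative only: the transactions and the helper names `supporting_users` and `rating_count` are hypothetical and are not taken from the specification.

```python
# Hypothetical binary-rating transactions: user -> set of items rated 1.
transactions = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"a", "b"},
    "u4": {"a"},
}


def supporting_users(itemset, transactions):
    """T(X): the users whose transaction contains every item of X."""
    return {u for u, items in transactions.items() if itemset <= items}


def rating_count(itemset, transactions):
    """|T(X)| * |X|: number of ratings between T(X) and X in the binary model."""
    return len(supporting_users(itemset, transactions)) * len(itemset)


# A chain of growing itemsets X1 < X2 < X3.
chain = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
counts = [rating_count(x, transactions) for x in chain]
print(counts)
```

Along this chain the count first rises and then falls, so a threshold on the count cannot be pushed down the itemset lattice in the way a support threshold can.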
Furthermore, the expressiveness of the languages used to specify constrained association queries is insufficient to express constraints on promotional patterns. Constraints on usersets and itemsets can be specified and handled similarly to approaches developed for single-variable constraints. However, rating constraints and holistic constraints such as diversity are inexpressible. Constraints are discussed in detail below in Section 2. Finally, the transaction data model in Fig. 1 is a convenient encoding for only binary ratings (0/1), where a user has either rated or not rated an item. However, for numerical ratings, such as an item rated on a scale of 1 to 5, extending the transaction model representation leads to space explosion.
While promotional pattern mining is challenging, two approaches in embodiments of the invention have been developed to answer queries involving constraints on ratings, namely schema-driven and ratings-driven mining, and greedy algorithms are integrated for holistic pattern constraints such as diversity. Two scenarios are possible in applications related to promotion. It is sometimes useful to obtain a description of the userset and itemset of a pattern. In Pattern Example 2 above, the itemset is defined by items satisfying the query {director='James Cameron' ∧ genre='Action'} on a multi-dimensional item database, and the userset by {gender='male' ∧ occupation='student' ∧ 25 < age < 35}, which is a query on the user database. In schema-driven pattern mining, the solution space of promotional patterns is confined to usersets and itemsets constructed from group-bys of a multi-dimensional data cube. In ratings-driven pattern mining, pairs of ad-hoc usersets and itemsets with ratings are discovered directly from the data.
For schema-driven pattern mining, space pruning algorithms in embodiments of the invention were developed by exploiting the monotonicity properties of the rating constraints vis-a-vis the user and item data cubes. Important aspects of the space pruning algorithms are as follows. A given rating constraint (e.g., count of ratings > δ) is projected onto the respective spaces of usersets and itemsets, and pairs of usersets and itemsets are enumerated in the order of their likelihood to satisfy the given constraint. Upper bounding techniques are used to evaluate the likelihood of a userset or an itemset with which patterns satisfying the given constraint can be constructed. A technique similar to rank join is then used between the space of usersets and the space of itemsets to prune "unpromising" pairs. The tree structure of the user and item data cubes is exploited for sharing computation between sets (e.g., sharing aggregation for measures such as count, sum).
For ratings-driven pattern mining, biclustering techniques are employed to directly discover patterns in the data. The ratings dataset is represented as a matrix and is input to a biclustering algorithm which outputs biclusters of users and items between which the ratings satisfy a given constraint. A bicluster corresponds to a pair of userset and itemset. Not all rating constraints can be handled using biclustering techniques. However, the schema-driven and ratings-driven pattern mining approaches together cover a substantial number of constraints in embodiments of the invention. Finally, greedy algorithms are integrated into the system for holistic pattern constraints.
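One way a biclustering step could score a candidate pair of userset and itemset is the standard mean squared residue of the corresponding submatrix. The following Python sketch is a minimal illustration under that assumption; the function name `mean_squared_residue` and the sample matrix are hypothetical, and the sketch is not the biclustering algorithm of the embodiment.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix picked out by rows x cols.

    residue(r, c) = a_rc - row_mean(r) - col_mean(c) + overall_mean
    A score of 0 means the submatrix follows a perfectly additive pattern.
    """
    sub = [[matrix[r][c] for c in cols] for r in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for r in range(n_rows):
        for c in range(n_cols):
            residue = sub[r][c] - row_means[r] - col_means[c] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical ratings matrix: rows are users, columns are items.
ratings = [
    [1, 2, 3],
    [2, 3, 4],
    [5, 1, 2],
]
coherent = mean_squared_residue(ratings, [0, 1], [0, 1, 2])
mixed = mean_squared_residue(ratings, [0, 1, 2], [0, 1, 2])
```

The first two users rate the three items with the same shape, shifted by one, so their bicluster has residue 0; adding the third user breaks the coherence and gives a positive residue.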
1. Summary
In summary, the development of embodiments of the present invention resulted in the following:
1. A new association paradigm called promotional pattern between a subset of users and a subset of items based on the ratings given by the users to items in a collaborative rating system.
2. The notion of constrained promotional pattern queries and study of different types of constraints involving usersets, itemsets, ratings and patterns. The monotonicity properties of rating constraints are critical to pruning algorithms for promotional pattern mining (Section 2.1).
3. A suite of algorithms:
a) Space pruning algorithms which take advantage of the monotonicity properties of rating constraints vis-a-vis the lattices of usersets and itemsets. The algorithms use a schema-driven definition of usersets and itemsets.
b) Ratings-driven mining algorithms which use biclustering models to mine coherent rating patterns between usersets and itemsets.
c) Greedy algorithms for holistic pattern constraints (Section 3).
2. DATA MODEL
The data model consists of users, items, and the ratings between them. Users are considered in a $d_U$-dimensional space $U = \{A_1, A_2, \ldots, A_{d_U}\}$, with the domain of each attribute $A_j$ defined as $dom(A_j)$. Items are considered in a $d_I$-dimensional space $I = \{B_1, B_2, \ldots, B_{d_I}\}$, with the domain of each attribute $B_k$ defined as $dom(B_k)$. Examples of databases U and I in a movie ratings dataset are shown in Tables 1 and 2 below.
Table 1: An example of user database
Figure imgf000013_0001
Table 2: An example of item database
Figure imgf000013_0002
The function $R : U \times I \to V$, where $V \subseteq \mathbb{R}$, assigns a unique rating value for a pair of user and item, e.g., $R(u, i) = 4$. Examples of V include {0, 1}, {1, 2, 3, 4, 5}, [-1, 1]. A userset is denoted by $Q^U$ and an itemset is denoted by $Q^I$.
2.1 Constrained Promotional Pattern Queries
Constrained promotional pattern mining is a means to discover interesting rating patterns between usersets and itemsets. The end-user of the system specifies the data to be mined which includes the user data, item data and ratings. Additionally, the user needs to specify the promotional patterns he is interested in through a set of constraints.
DEFINITION 1. Constrained Promotional Pattern Query.
Given U, I, R, and a set of constraints C, the result of a constrained promotional pattern query is a set of triples satisfying C and of the form:
$\{(Q^U, Q^I, R[Q^U, Q^I])\}$,
where $R[Q^U, Q^I]$ denotes the set of ratings $R : Q^U \times Q^I \to V$ between a userset $Q^U$ and an itemset $Q^I$ respectively. Three fundamental classes of constraints for promotional pattern queries are discussed below.
• Set Constraints confine the composition of the userset $Q^U$ and itemset $Q^I$ between which ratings patterns are evaluated
• Rating Constraints specify constraints that the ratings between a userset $Q^U$ and an itemset $Q^I$ need to satisfy
• Holistic Constraints are defined on a set of patterns that are together considered to be interesting to the end-user of our system
A single set constraint is denoted by $c^s$, a rating constraint is denoted by $c^r$ and a holistic constraint is denoted by $c^h$. Similarly, a group of set constraints is denoted by $C^s$, rating constraints are denoted by $C^r$ and holistic constraints are denoted by $C^h$.
2.1.1 Set Constraints
Let p be a generic property of the sets. A set constraint is of the form $S.p\ \theta\ \delta$, where S is a userset or itemset, $\theta$ is one of the operators $=, \neq, >, <, \leq, \geq, \subset, \subseteq$ and $\delta$ is a real value or Boolean depending on the operator. The property p ranges over several different types of definitions, e.g., support of sets, aggregate value on an attribute of set objects, multi-dimensional variables, etc. Let $Q^U$ and $Q^I$ be a userset and an itemset respectively. Some examples of set constraints are:
• $Q^U$ such that $|Q^U| > 10$, $Q^I$ such that $|Q^I| \geq 5$
• $Q^U$ satisfies a multi-dimensional constraint, u.zipcode = 70011 $\forall u \in Q^U$
• $Q^I$ such that agg(i.price) > $100 for $i \in Q^I$, where agg is one of sum, count, min, max, avg
2.1.2 Rating Constraints
Rating constraints specify constraints on the set of ratings between a userset $Q^U$ and an itemset $Q^I$. Ratings Summary Function (RSF) is defined first, and then examples of RSFs and constraints are listed based on the RSFs.
DEFINITION 2. Ratings Summary Function, denoted by g, is a function of ratings between a set of users $Q^U$ and a set of items $Q^I$, $g : \{(Q^U, Q^I, R[Q^U, Q^I])\} \to \mathbb{R}$, simply denoted as $g(R[Q^U, Q^I])$.
Rating constraints are expressed on an RSF, e.g., $g(R[Q^U, Q^I])\ \theta\ \delta$, where $\theta$ can be one of $=, >, <, \geq, \leq$ and $\delta \in \mathbb{R}$. Some examples of g based on aggregation of ratings are listed below.
• [Ratings Count] $g(R[Q^U, Q^I]) = \sum_{u \in Q^U} \sum_{i \in Q^I} \mathbb{I}(u, i)$, where $\mathbb{I}$ is an indicator function with value 1 if u has rated i, and 0 otherwise.
• [Ratings Sum] $g(R[Q^U, Q^I]) = \sum_{u \in Q^U} \sum_{i \in Q^I} R(u, i)$
• [Ratings Cover] Assuming binary ratings model of V, let $L(u_i)$ denote the set of items rated 1 by user $u_i$, then $L(Q^U) = \bigcup_{u_i \in Q^U} L(u_i)$. Define
Figure imgf000015_0001
• [Ratings Density] $g(R[Q^U, Q^I]) = \frac{\sum_{u \in Q^U} \sum_{i \in Q^I} R(u, i)}{|Q^U| |Q^I|}$
• [Ratings Variance] Let $t = (Q^U, Q^I, R[Q^U, Q^I])$ be a pattern. Let $d_t$ denote the ratings density of the pattern t. Then ratings variance is defined as $g(R[Q^U, Q^I]) = \frac{\sum_{u \in Q^U} \sum_{i \in Q^I} (R(u, i) - d_t)^2}{|Q^U| |Q^I|}$
• [Average Ratings Cover] Let $L(Q^U)$ be as defined above. Define g as
Figure imgf000016_0001
• [Entropy] g is the entropy of ratings distribution $\{R(u, i) \mid u \in Q^U, i \in Q^I\}$
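The aggregate RSFs listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation of the embodiment. It assumes that ratings are held in a dictionary keyed by (user, item), that unrated pairs contribute 0 to sums, and that entropy is Shannon entropy in base 2 over the observed rating values; all function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def pair_ratings(ratings, userset, itemset):
    """Observed ratings between the userset and itemset (unrated pairs skipped)."""
    return [ratings[(u, i)] for u in userset for i in itemset if (u, i) in ratings]


def ratings_count(ratings, userset, itemset):
    return len(pair_ratings(ratings, userset, itemset))


def ratings_sum(ratings, userset, itemset):
    return sum(pair_ratings(ratings, userset, itemset))


def ratings_density(ratings, userset, itemset):
    # Sum of ratings normalised by the number of (user, item) pairs.
    return ratings_sum(ratings, userset, itemset) / (len(userset) * len(itemset))


def ratings_variance(ratings, userset, itemset):
    d_t = ratings_density(ratings, userset, itemset)
    # Assumption: an unrated pair is treated as a rating of 0.
    total = sum((ratings.get((u, i), 0) - d_t) ** 2
                for u in userset for i in itemset)
    return total / (len(userset) * len(itemset))


def ratings_entropy(ratings, userset, itemset):
    # Assumption: base-2 Shannon entropy of the observed rating values.
    values = pair_ratings(ratings, userset, itemset)
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical data: two users, two items, ratings on a 1-5 scale.
ratings = {("u1", "i1"): 4, ("u1", "i2"): 4, ("u2", "i1"): 2, ("u2", "i2"): 2}
users, items = ["u1", "u2"], ["i1", "i2"]
summary = {
    "count": ratings_count(ratings, users, items),
    "sum": ratings_sum(ratings, users, items),
    "density": ratings_density(ratings, users, items),
    "variance": ratings_variance(ratings, users, items),
    "entropy": ratings_entropy(ratings, users, items),
}
```

A rating constraint such as Ratings Density ≥ 3.0 then reduces to comparing `summary["density"]` against the threshold.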
Typical rating constraints include:
• Ratings Density$(Q^U, Q^I, R[Q^U, Q^I]) \geq 3.0$
• $0.5 <$ Ratings Variance$(Q^U, Q^I, R[Q^U, Q^I]) < 0.8$
• Entropy$(R[Q^U, Q^I]) < 1.0$
• Compute top-k patterns ranked by score computed by an RSF, e.g., Ratings Count.
2.1.3 Holistic Constraints
While set and rating constraints operate on the usersets, itemsets and set of ratings between them, a holistic constraint operates on a set of discovered patterns to select a subset of patterns that together satisfy the holistic constraint. This is often challenging as the number of candidate subsets of patterns is exponential. Below, we list two holistic constraints.
• [Threshold Coverage Constraint] Let P be a set of patterns and $t_j = (Q_j^U, Q_j^I, R[Q_j^U, Q_j^I])$
Figure imgf000016_0002
, then compute the smallest such P subject to the constraint that $U'$ has at least $\beta_U$% of the database U. Similarly, one can define $I'$ and discover the smallest set of patterns which cover at least $\beta_I$% of the database I.
• [Threshold Diversity Constraint] Let P be a set of patterns. Compute a subset P' of top-k patterns (based on the score of an RSF) from P such that
Figure imgf000017_0001
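A greedy heuristic for the user side of the Threshold Coverage Constraint might look like the following Python sketch. It is an assumption-laden illustration: the specification states only that greedy algorithms are integrated, so the selection rule (pick the pattern covering the most uncovered users), the function name `greedy_user_coverage`, and the treatment of $\beta_U$ as a fraction between 0 and 1 are all hypothetical. A greedy choice is not guaranteed to return the smallest covering set.

```python
def greedy_user_coverage(patterns, all_users, beta_u):
    """Pick patterns until their usersets cover at least beta_u of all users.

    patterns: list of (userset, itemset) pairs, each a set.
    Returns the chosen patterns and the set of covered users.
    """
    target = beta_u * len(all_users)
    covered = set()
    chosen = []
    remaining = list(patterns)
    while len(covered) < target and remaining:
        # Greedy step: the pattern adding the most uncovered users.
        best = max(remaining, key=lambda p: len(p[0] - covered))
        gain = best[0] - covered
        if not gain:
            break  # no remaining pattern adds coverage
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen, covered


# Hypothetical discovered patterns over four users.
all_users = {"u1", "u2", "u3", "u4"}
patterns = [
    ({"u1", "u2"}, {"i1"}),
    ({"u2", "u3"}, {"i2"}),
    ({"u3", "u4"}, {"i1"}),
    ({"u1"}, {"i2"}),
]
full_cover, covered_full = greedy_user_coverage(patterns, all_users, 1.0)
half_cover, covered_half = greedy_user_coverage(patterns, all_users, 0.5)
```

The same routine applied to the itemsets of the patterns would give the item-side coverage.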
An example to illustrate the above constraints on the movie ratings dataset (Tables 1 and 2) is as follows.
EXAMPLE 3. Let $Q_1^U$ and $Q_2^U$ be defined by the usersets
Figure imgf000017_0002
and {gender='female'} respectively. Let $Q_1^I$ and $Q_2^I$ be defined by the itemsets {year='1997'} and {year='1996'} respectively. Let $t_1 = (Q_1^U, Q_1^I, R[Q_1^U, Q_1^I])$ and $t_2 = (Q_2^U, Q_2^I, R[Q_2^U, Q_2^I])$. Let $\beta_U = 1.0$ and $\beta_I = 0.75$. The set of patterns $\{t_1, t_2\}$ is said to satisfy the Threshold Coverage Constraint.
2.2 Properties of RSFs
The monotonicity properties of RSFs are critical to space pruning algorithms. Let $G_U$ represent the powerset lattice of U, and $G_I$ the powerset lattice of I. Let $Q_1^U, Q_2^U$ be any two elements in $G_U$, and $Q_1^I$ and $Q_2^I$ be any two elements in $G_I$, such that $Q_2^U \subseteq Q_1^U$ and $Q_2^I \subseteq Q_1^I$. The following properties of RSFs are defined based on their monotonicity on $G_U$ and $G_I$.
1. [U-monotonic] g is said to be U-monotonic if $\forall Q_1^U, Q_2^U \in G_U$ s.t. $Q_1^U$ contains $Q_2^U$ and $\forall Q^I \in G_I$, $g(Q_2^U, Q^I, R(Q_2^U, Q^I)) \leq g(Q_1^U, Q^I, R(Q_1^U, Q^I))$
2. [I-monotonic] g is said to be I-monotonic if $\forall Q_1^I, Q_2^I \in G_I$ s.t. $Q_1^I$ contains $Q_2^I$ and $\forall Q^U \in G_U$, $g(Q^U, Q_2^I, R(Q^U, Q_2^I)) \leq g(Q^U, Q_1^I, R(Q^U, Q_1^I))$
Among the examples of RSFs discussed above, Ratings Sum and Ratings Cover are both U-monotonic and I-monotonic, Average Ratings Cover is U-monotonic but not I-monotonic, and Ratings Density is neither a U-monotonic nor an I-monotonic function. Space pruning algorithms in embodiments of the invention are developed to handle both monotonic and non-monotonic functions.
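The contrast between a monotonic and a non-monotonic RSF can be seen on a small chain of growing usersets. The Python sketch below is illustrative; it assumes non-negative ratings (with a rating range such as [-1, 1] the sum would not be monotonic either), and the data and function names are hypothetical.

```python
def ratings_sum(ratings, userset, itemset):
    """Ratings Sum; unrated pairs are assumed to contribute 0."""
    return sum(ratings.get((u, i), 0) for u in userset for i in itemset)


def ratings_density(ratings, userset, itemset):
    """Ratings Sum normalised by the number of (user, item) pairs."""
    return ratings_sum(ratings, userset, itemset) / (len(userset) * len(itemset))


# Hypothetical non-negative ratings of one item by three users.
ratings = {("u1", "i1"): 5, ("u2", "i1"): 1, ("u3", "i1"): 5}
itemset = ["i1"]

# A chain of growing usersets: each one contains the previous one.
chain = [["u1"], ["u1", "u2"], ["u1", "u2", "u3"]]
sums = [ratings_sum(ratings, q, itemset) for q in chain]
densities = [ratings_density(ratings, q, itemset) for q in chain]
print(sums)
print(densities)
```

The sums never decrease as the userset grows, whereas the density drops when the low rating of `u2` is added and rises again with `u3`, so no monotone pruning rule applies to it directly.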
3. ALGORITHMS
Computing constrained promotional pattern queries involves, given a set of constraints, efficiently enumerating candidate usersets, candidate itemsets and patterns satisfying the constraints for all pairs of usersets and itemsets. Though all types of constraints are equally important from the perspective of constrained promotional pattern queries, due to space limitations, embodiments of the present invention seek to provide efficient algorithms for rating constraints and holistic constraints. A suite of algorithms for rating constraints, schema-driven space pruning algorithms and ratings-driven biclustering techniques are discussed in the following sections.
3.1 Schema-Driven Pattern Mining
In schema-driven pattern mining, usersets and itemsets are defined based on a schema. We first define the following terms concerning schema-driven pattern mining below.
DEFINITION 3. User Query. An atomic query $q^U$ on a user database is an assignment of an attribute $A_j = v_j$. A query $Q^U$ of length l is a conjunction of $l \leq d_U$ atomic queries on l different attributes of the user database, $(A_{i_1} = v_{i_1} \wedge \ldots \wedge A_{i_l} = v_{i_l})$. A user $u \in U$ satisfies an atomic query $A_j = v_j$ if the attribute value of u for $A_j$ is $v_j$. A user u satisfies query $Q^U$, denoted by $u \models Q^U$, if u satisfies each atomic query in $Q^U$.
DEFINITION 4. Item Query. An atomic query $q^I$ on an item database is an assignment of an attribute $B_j = v_j$. Similarly, an item query $Q^I$ is a conjunction of $m \leq d_I$ atomic queries. An item $i \in I$ satisfies the query $Q^I$, denoted by $i \models Q^I$, if i satisfies each atomic query in $Q^I$.
Examples of user query and item query are {gender='male' ∧ occupation='student'} and {director='James Cameron' ∧ genre='Action'} respectively. The space of user queries is the output of all group-bys on the user database, known as the data cube. For three attributes $A_1$, $A_2$, and $A_3$, this space consists of the resulting groups of each of the 7 group-by operations on dimensions $A_1$, $A_2$, $A_3$, $A_1A_2$, $A_2A_3$, $A_1A_3$, $A_1A_2A_3$. Similarly, the space of item queries is computed by the item data cube. The output of a user query is a userset and that of an item query is an itemset.
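The space of user queries described above can be enumerated directly for a small database. The following Python sketch is illustrative only; the record layout, the attribute names and the helpers `cuboids` and `group_by` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cuboids(attributes):
    """All non-empty attribute subsets, i.e. one group-by per subset."""
    return [combo
            for k in range(1, len(attributes) + 1)
            for combo in combinations(attributes, k)]


def group_by(records, attrs):
    """Map each conjunctive query on attrs to the userset that satisfies it."""
    groups = defaultdict(set)
    for rec in records:
        query = tuple((a, rec[a]) for a in attrs)
        groups[query].add(rec["id"])
    return dict(groups)


# Hypothetical user database.
users = [
    {"id": "u1", "gender": "male", "occupation": "student"},
    {"id": "u2", "gender": "male", "occupation": "student"},
    {"id": "u3", "gender": "female", "occupation": "engineer"},
]

three_attribute_cuboids = cuboids(["A1", "A2", "A3"])
by_gender = group_by(users, ("gender",))
by_both = group_by(users, ("gender", "occupation"))
```

Each key of `by_both` is a user query such as gender='male' ∧ occupation='student', and its value is the corresponding userset.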
For a given rating constraint (e.g., Ratings Density > $\delta$), the main challenge is to efficiently enumerate candidate usersets and itemsets, and compute patterns that satisfy the constraint. A general framework that lays the foundation to address the above challenge progressively will now be discussed.
For a constrained promotional pattern query, let $c^r$ be a rating constraint (e.g., Ratings Density > $\delta$) and $C^s$ comprise all the set constraints. An approach based on the rank-join method is then discussed to compute $c^r$. It takes two sorted lists of usersets and itemsets satisfying the set constraints $C^s$, and outputs a set of patterns satisfying $c^r$. The general idea is as follows.
The value of an RSF on a pattern is denoted by score. A naive approach computes the score for every pair of candidate userset and itemset in no particular order and outputs pairs that have scores > $\delta$. However, one can limit this search to pairs which are more likely to have a score > $\delta$. This is achieved by projecting the RSF on the spaces of usersets and itemsets separately, and selecting sets in the order of their likeliness to contribute to a higher score. In other words, an RSF is upper bounded by a composite monotonic function of two score projection functions which operate on the spaces of usersets and itemsets respectively. This is illustrated in Equation 4.1 as follows using Ratings Density.
$g(R[Q^U, Q^I]) \leq H_U(Q^U, R)\,.\,H_I(Q^I, R)$, $H_U(Q^U, R) = \sum_{u \in Q^U} \sum_{i \in I} R(u, i)$
Figure imgf000020_0001
In Equation 4.1, "." is the composite function of two score projection functions $H_U(Q^U, R)$ and $H_I(Q^I, R)$. $H_U$ upper bounds the score contribution of $Q^U$ in patterns comprising $Q^U$ by aggregating its ratings on the entire item database. Fig. 2 lists the projections of other aggregate RSFs.
Rank-Join. The score $H_U(Q^U, R)$ is computed first for all usersets $Q^U$ and they are sorted by their scores. Similarly, itemsets $Q^I$ ranked by $H_I(Q^I, R)$ are listed. A join between the two lists is then performed in the sorted order (on the lines of sort-merge join). The pseudocode for the method is given in Algorithm 1. The method is illustrated in Fig. 3 for the Ratings Density function with $\delta = 0.3$. The search on a given userset terminates at an itemset $Q_j^I$, and the algorithm terminates after processing $Q_k^U$, as any pair involving $Q_k^U$ and usersets ranked below it has an upper bound less than $\delta$.
There are three computational bottlenecks in this approach:
• Materializing all usersets and itemsets, which are very large in number
• Computing the aggregate component scores $H_U(Q^U, R)$ and $H_I(Q^I, R)$ for each userset and itemset
• Sorting all usersets and itemsets by their $H_U(Q^U, R)$ and $H_I(Q^I, R)$ scores respectively and performing a rank join between them
In the next section, these bottlenecks are addressed by providing pruning optimizations. Hereinafter, the algorithms are illustrated using the Ratings Density function, and it is demonstrated below that they can be extended to other RSFs easily.
3.2 Algorithm DCRJN
The bottom up computation of an iceberg cube (BUC algorithm) is used to efficiently compute user and item data cubes. This is illustrated in Fig. 4 when there are four user attributes. Each node in this tree, called a cuboid, is a group-by on a subset of attributes, and partitions the database into a number of disjoint sets. To compute rating constraints, the rank join computation between the sorted lists of all usersets and itemsets (from the previous section) is decomposed into multiple rank join computations between pairs of user and item cuboids (Fig. 5). The user and item data cubes are materialized on-the-fly. The pseudocode for this algorithm is given in Algorithm 2. This approach does not offer significant improvement in efficiency by itself compared to the approach considered in the previous section. Several optimization strategies are discussed below using this approach. Prior to that, some preliminaries on data cubes are presented that enable those skilled in the art to understand the optimization strategies.
Algorithm 1: GroupRJN(S_U, S_I)
Input: Two sorted lists of sets S_U, S_I and δ
Output: A result set of patterns
N ← newPriorityQueue()
while notEmpty(S_U) do
    Q^U ← nextSubspace(S_U)
    while notEmpty(S_I) do
        Q^I ← nextSubspace(S_I)
        if H_U(Q^U, R).H_I(Q^I, R) < δ then
            move to next Q^U
        else
            compute g(Q^U, Q^I, R[Q^U, Q^I]), insert in N
    if H_U(Q^U, R).H_I(*, R) < δ then
        return N
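The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of the embodiment: it uses Ratings Sum with non-negative ratings as the RSF, takes `min` as the composite function (for non-negative ratings, the sum over a pair cannot exceed either projection), keeps only pairs whose score is at least δ, and returns a plain list rather than a priority queue. The names `group_rank_join` and `ratings_sum` and the sample data are hypothetical.

```python
def ratings_sum(ratings, userset, itemset):
    """Ratings Sum; unrated pairs are assumed to contribute 0."""
    return sum(ratings.get((u, i), 0) for u in userset for i in itemset)


def group_rank_join(usersets, itemsets, all_users, all_items, ratings,
                    delta, combine=min):
    # Score projections: a userset is scored against every item and an
    # itemset against every user; both lists are sorted in descending order.
    s_u = sorted(((ratings_sum(ratings, q, all_items), q) for q in usersets),
                 reverse=True)
    s_i = sorted(((ratings_sum(ratings, all_users, q), q) for q in itemsets),
                 reverse=True)
    results, evaluated = [], 0
    if not s_i:
        return results, evaluated
    best_item_bound = s_i[0][0]
    for h_u, userset in s_u:
        # No itemset can lift this userset, or any lower-ranked one, to delta.
        if combine(h_u, best_item_bound) < delta:
            break
        for h_i, itemset in s_i:
            # Lower-ranked itemsets cannot reach delta with this userset.
            if combine(h_u, h_i) < delta:
                break
            evaluated += 1
            score = ratings_sum(ratings, userset, itemset)
            if score >= delta:
                results.append((userset, itemset, score))
    return results, evaluated


# Hypothetical binary ratings.
ratings = {("u1", "i1"): 1, ("u1", "i2"): 1, ("u2", "i1"): 1, ("u3", "i3"): 1}
all_users = ("u1", "u2", "u3")
all_items = ("i1", "i2", "i3")
usersets = [("u3",), ("u1", "u2")]
itemsets = [("i3",), ("i1", "i2")]
results, evaluated = group_rank_join(
    usersets, itemsets, all_users, all_items, ratings, delta=2)
```

In this example only one of the four candidate pairs is actually scored; the other three are pruned by the upper bounds.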
Let $P^l$ and $P^m$ be two cuboids of the user data cube. A set $Q_l^U \in P^l$ is a parent of $Q_m^U \in P^m$ (or equivalently, $Q_m^U$ is a child of $Q_l^U$) if all the attributes of $P^l$ are in $P^m$ as well. Further, $Q_m^U$ has the same values as $Q_l^U$ for the common attributes. For example, the set {gender='male'} is a parent of {gender='male', occupation='student'}, which is a child of both {gender='male'} and {occupation='student'}. A set $Q^U$ is said to be a most specific set (MSS) if all the attributes of the database are assigned some values. All the usersets which belong to the cuboid $A_1A_2A_3A_4$ in Fig. 4 are most specific sets. Further, the most specific descendant set of a set $Q^U$ is the set of all most specific sets which have the same values for the attributes on which $Q^U$ is grouped. For example, the most specific descendant set of $(A_1 = a_1, A_2 = a_2)$ is the set of all sets $(A_1 = a_1, A_2 = a_2, A_3 = *, A_4 = *)$, where * denotes that the corresponding attribute takes all values from its domain.
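The parent relation and the most specific descendant set can be expressed compactly when a set is represented by its attribute assignments. The Python sketch below is illustrative; representing a set as a dictionary, treating any strictly less specific compatible set as a parent, and the names `is_parent` and `most_specific_descendants` are assumptions made for the example.

```python
from itertools import product


def is_parent(parent, child):
    """True if child assigns every attribute of parent the same value
    and also constrains at least one further attribute."""
    return (len(parent) < len(child)
            and all(child.get(a) == v for a, v in parent.items()))


def most_specific_descendants(query, domains):
    """All full assignments that agree with query on its grouped attributes."""
    free = [a for a in domains if a not in query]
    descendants = []
    for values in product(*(domains[a] for a in free)):
        full = dict(query)
        full.update(zip(free, values))
        descendants.append(full)
    return descendants


# Hypothetical attribute domains for a four-attribute user database.
domains = {
    "A1": ["a1", "b1"],
    "A2": ["a2"],
    "A3": ["x", "y"],
    "A4": ["p", "q"],
}
mss = most_specific_descendants({"A1": "a1", "A2": "a2"}, domains)

male = {"gender": "male"}
male_student = {"gender": "male", "occupation": "student"}
```

Here `mss` contains one entry for each combination of values of the two free attributes.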
The score projection functions $H_U$ and $H_I$ are functions on sets computed by user and item data cubes respectively. Such functions can be classified into three types depending on their monotonicity properties on the data cube. For example:
1. $H_U$ is said to be monotonic if $\forall Q_1^U, Q_2^U$ s.t. $Q_1^U$ is a parent of $Q_2^U$, $H_U(Q_2^U, R) \leq H_U(Q_1^U, R)$. For example, $H_U(Q^U, R) = \sum_{u \in Q^U, i \in I} R(u, i)$ is monotonic.
2. $H_U$ is said to be antimonotonic if $\forall Q_1^U, Q_2^U$ s.t. $Q_1^U$ is a parent of $Q_2^U$, $H_U(Q_2^U, R) \geq H_U(Q_1^U, R)$. For example, the following function is antimonotonic:
Figure imgf000022_0001
s said to be nonmonotonic if it is neither monotonic nor antimonotonic
y. R(u,i)
-,— is nonmonotonic. Similarly, H, can be categorized into monotonic, antimonotonic or nonmonotonic. Optimization techniques for Algorithm 2 are presented below.
Algorithm 2: Algorithm DCRJN
Input: Databases U, I and δ
Output: A result set of patterns
Procedure DCRankJoin:
1: for all j = 1 to du do
2:   DepthFirstUC(Aj, j)
Procedure DepthFirstUC(Pu, j):
Input: A user cuboid Pu, last attribute of Pu
1: DepthFirstIC(Pu, *I)
2: if j = du then
3:   Update tree upper bounds in parents of Pu
4: else
5:   for all k = j + 1 to du do
6:     Project U on attribute Ak;
7:     P'u ← set of unique values (Ak = v)
8:     [step given only as Figure imgf000023_0001]
9:     DepthFirstUC(P'u, k)
Procedure DepthFirstIC(Pu, Pi):
Input: A user cuboid Pu, item cuboid Pi
1: GroupRJN(Pu, Pi)
2: if Pi is a leaf node then
3:   Update tree upper bounds in parents of Pi
4: else
5:   for all children P'i of Pi do
6:     DepthFirstIC(Pu, P'i)
Vertical Pruning. The rank join between the sets of an item cuboid and a user cuboid proceeds similarly to the rank join method of Algorithm 1. Additionally, vertical pruning is employed to avoid computing the children of a set $Q^u$ (or $Q^i$) if they are unlikely to produce patterns that satisfy the given rating constraints. Two measures are associated with each set: the component function score and the tree upper bound score. The component function score is the value of $H_U(Q^u,R)$ (or $H_I(Q^i,R)$), and the tree upper bound, denoted by $\bar{H}_U(Q^u,R)$, is the upper bound of $H_U$ on all sets that are children of $Q^u$ in the data cube. Two conditions are checked during the rank join between the cuboids. If $H_U(Q^u,R) \cdot H_I(Q^i,R) < \delta$, the search on $Q^u$ is discarded. Additionally, if $H_U(Q^u,R) \cdot \bar{H}_I(Q^i,R) < \delta$, the computation of the children of $Q^i$ can be avoided when the corresponding cuboid is expanded in a depth-first manner. The tree upper bound is propagated up to the root node, where it is denoted by $\bar{H}_I(*,R)$ on the item data cube. For any $Q^u$, if $\bar{H}_U(Q^u,R) \cdot \bar{H}_I(*,R) < \delta$, the computation of the children of $Q^u$ can be avoided, and the algorithm terminates if $H_U(*,R) \cdot H_I(*,R) < \delta$. The computation of tree upper bounds for various score projection functions is discussed below.
Computing Tree Upper Bounds. For monotonic functions, by definition the tree upper bound of a set $Q^u$ is the score computed by the component function on $Q^u$ itself, $H_U(Q^u,R)$. For anti-monotonic functions such as $H_U(Q^u,R) =$ [expression missing in source], the tree upper bound is the minimum of the scores computed by $H_U$ on all most specific descendants of the set $Q^u$. In bottom-up data cube computation, the most specific descendants of several sets are computed before the sets themselves. For example, in Fig. 6 the process begins with $B_1$ and proceeds in a depth-first manner until the leaf node $B_1B_2B_3B_4$ is reached. For all sets which are parents of sets in $B_1B_2B_3B_4$ and belong to cuboids not yet processed, the tree upper bound corresponds to the lower bound on the scores of their most specific descendants in [Figure imgf000025_0001] ... nontrivial upper bound is the maximum of the scores computed by $H_U$ on all most specific descendants of the set $Q^u$.
Sharing Aggregation. Computing the score projection functions $H_U$ and $H_I$ involves aggregating over the users (items) of the entire group. For example, the function given in Figure imgf000025_0002 involves aggregating, for each user, the ratings on the entire item database. To minimize redundant computation, two strategies are employed. First, the rating aggregates for individual users $u$ ($\sum_{i} R(u,i)$) are pre-computed and stored in a hash table. Second, the aggregate score of a group $Q^u$ can be computed from its children if the aggregate scores of all its children are known. Since BUC proceeds in a depth-first manner, aggregates for the children of several groups are available before the groups themselves are computed. For example, in Fig. 4 the scores for a group $Q^u$ of the cuboid $A_1A_2A_4$ can be computed by simply aggregating the scores of all its children in the cuboid $A_1A_2A_3A_4$.
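The two aggregation-sharing strategies can be sketched as follows. This is a minimal sketch that assumes an additive projection (a sum of ratings), for which rolling up child groups is exact; the function names and the dictionary-based layout are hypothetical.

```python
from collections import defaultdict


def per_user_totals(ratings):
    """Strategy 1: pre-compute each user's rating total once (hash table)."""
    totals = defaultdict(float)
    for (user, _item), rating in ratings.items():
        totals[user] += rating
    return dict(totals)


def cuboid_scores(users, attrs, totals):
    """Score every group of the cuboid that groups users on `attrs`."""
    scores = defaultdict(float)
    for user, profile in users.items():
        key = tuple(profile[a] for a in attrs)
        scores[key] += totals.get(user, 0.0)
    return dict(scores)


def roll_up(child_scores, child_attrs, parent_attrs):
    """Strategy 2: derive a parent cuboid's scores from a child cuboid's
    scores instead of scanning the users again (valid for sums)."""
    positions = [child_attrs.index(a) for a in parent_attrs]
    parent = defaultdict(float)
    for key, score in child_scores.items():
        parent[tuple(key[p] for p in positions)] += score
    return dict(parent)
```

Rolling the (gender, occupation) cuboid up to (gender) gives the same scores as aggregating the users directly, but touches one entry per child group rather than one per user. For non-additive functions such as averages, a count would have to be carried alongside the sum.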
In the preceding sections, pruning algorithms are disclosed which compute rating constraints guided by the monotonicity properties of ratings summary functions. The techniques are extendible to any aggregate RSFs which can be projected on the spaces of usersets and itemsets and for which the scores of patterns can be upper bounded by composite monotonic functions of the projections. The pruning power of the algorithms depends on three factors: (1) the component functions chosen to upper bound the scoring function; (2) the histograms of database objects over different attributes; and (3) the distribution of ratings. For example, for the component functions of Ratings Density, a long tail of ratings along with a large number of itemsets with high cardinality can lead to effective pruning. This can be explained as follows. There are a large number of users who rate very few items, and hence the function $H_U$ has a smaller numerator score for many usersets. Also, because of the high cardinality of item groups, the product $H_U \cdot H_I$ is small for most patterns, which effectively means greater pruning power. Later, in Section 4, the effect of these factors is discussed using the results of the experiments. Further, Algorithm 2 can be parallelized easily, as the rank join computations between cuboids can be performed in parallel.
3.3 Ratings-Driven Pattern Mining
In this section, a ratings-driven approach is proposed to discover promotional patterns using biclustering of the ratings matrix. Biclustering, or co-clustering, is an effective technique to discover subsets of rows in a data matrix that exhibit similar behavior across a subset of columns. Given a ratings data matrix with user data along rows and item data along columns, biclustering techniques discover a set of biclusters $p_k = (Q^u_k, Q^i_k)$ such that each bicluster $p_k$ satisfies specific characteristics of homogeneity in ratings between the userset $Q^u_k$ and itemset $Q^i_k$, where homogeneity is defined by an objective function. Biclustering algorithms enable certain types of rating constraints to be computed efficiently. Biclusters can be overlapping as well as non-overlapping. Some important classes of biclustering algorithms relevant to promotional pattern mining are described as follows:
1. Biclusters with constant values correspond to a scenario where all the users in a bicluster give the same rating to all the items.
2. Biclusters with constant values on columns or rows correspond to biclusters in which all users have the same distribution of ratings for the itemset, or all items have the same distribution of ratings for a set of users.
3. Biclusters with coherent additive values correspond to the scenario where the ratings on each row of a bicluster add up to the same value.
4. Minimum Entropy Biclustering discovers biclusters which have an entropy less than a given threshold. Such an algorithm is useful in minimizing the variance in ratings between a set of users and a set of items.
A subset of rows and a subset of columns is considered to be a bicluster if together they exhibit a high similarity score, which is measured by a function defined as the mean squared residue (MSR). The algorithm starts with the original data matrix by computing its mean squared residue and, at each step, removes the row or column which results in the maximum drop in the MSR until its value falls below a threshold. It then adds a row or column with maximum rise in MSR until the value rises above the threshold.
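The mean squared residue can be computed as in the sketch below. The source names the measure without spelling out its formula, so the sketch assumes the standard Cheng and Church definition, in which each entry's residue is its value minus its row mean and column mean plus the overall mean of the submatrix. It also assumes a dense submatrix (no missing ratings); the function name is hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by `rows` x `cols`.

    matrix: list of lists of numbers; rows / cols: lists of indices.
    Assumption: every selected cell holds a rating (no missing values).
    """
    n, m = len(rows), len(cols)
    row_mean = {r: sum(matrix[r][c] for c in cols) / m for r in rows}
    col_mean = {c: sum(matrix[r][c] for r in rows) / n for c in cols}
    overall = sum(row_mean.values()) / n
    squared = sum(
        (matrix[r][c] - row_mean[r] - col_mean[c] + overall) ** 2
        for r in rows
        for c in cols
    )
    return squared / (n * m)
```

Under this definition a bicluster with constant values, or one whose rows differ only by an additive offset, has a residue of zero, while less coherent selections score higher.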
Delta biclustering employs a different heuristic under the same quality metric, mean squared residue. The algorithm starts with several randomly generated biclusters. At each step, it determines the best action for each row and each column, an action being the deletion of a row or column from a bicluster or its addition to one. It performs the best actions for every row and every column sequentially until no further improvement can be gained in the mean squared residue of the biclusters. The main advantage of Delta biclustering is that it generates overlapping biclusters with the number of biclusters specified a priori. The time complexity of both algorithms is $O((N + M) \cdot N \cdot M \cdot k \cdot p)$, where k is the number of biclusters and p is the number of iterations needed to converge to stable biclusters. The mean squared residue can be modified in the implementation according to the type of biclusters queried for, e.g., biclusters with constant values, biclusters with constant rows, and biclusters with constant columns.
3.4 Top-k and Holistic Constraints
Techniques for handling the holistic constraints listed in Section 2 will now be discussed. A top-k constraint can be handled in the algorithms discussed in the previous section by setting the parameter δ to the score of the k-th best pattern found so far and updating it whenever the k-th pattern changes as the algorithm progresses. For diversity and coverage threshold constraints, the promotional patterns satisfying the rating constraints are post-processed using greedy heuristics which take as input a set of patterns, possibly in descending order of their RSF scores, and output a set of patterns that satisfy the given threshold constraints. One such heuristic is listed in Algorithm 3 below for the Threshold Coverage constraint.
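The top-k handling described above, where δ tracks the score of the current k-th best pattern, can be sketched with a bounded min-heap. The class name `TopKThreshold` and its methods are hypothetical, and the sketch assumes that patterns are represented by comparable values (for example strings) so that ties on score can be ordered.

```python
import heapq


class TopKThreshold:
    """Keeps the k best (score, pattern) pairs and exposes the running delta."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap: the root is the current k-th best pattern

    @property
    def delta(self):
        # Until k patterns have been seen, nothing may be pruned.
        if len(self._heap) < self.k:
            return float("-inf")
        return self._heap[0][0]

    def offer(self, score, pattern):
        """Record a scored pattern; delta rises when the k-th best changes."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (score, pattern))
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, (score, pattern))

    def results(self):
        """Current top-k pairs, best first."""
        return sorted(self._heap, reverse=True)
```

A rank join would read `delta` before each pruning test, so the threshold tightens as better patterns are found and later pairs are discarded earlier.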
The heuristic takes as input two threshold parameters $\beta_1$ and $\beta_2$ for the user and item databases. The algorithm proceeds as follows. It maintains two sets: one is the set of users covered by the usersets in the patterns seen so far, denoted by $T_U$, and the other is the set of items covered by the itemsets in the patterns seen so far, denoted by $T_I$. For every new pattern added to the set, if $|T_U| \ge \beta_1$ and $|T_I| \ge \beta_2$ the algorithm terminates. The pseudocode for the algorithm is given in Algorithm 3 below.
Algorithm 3: Computing Holistic Constraints
Input: A set of patterns, coverage thresholds β1, β2
Output: A set of patterns
1: TU ← ∅
2: TI ← ∅
3: OutputSet ← {}
4: while |TU| < β1 and |TI| < β2 do
5:   NextPattern ← t ≡ (Qu,Qi,R[Qu,Qi]) which maximizes the number of elements
6:   TU ← TU ∪ Qu
7:   TI ← TI ∪ Qi
8:   Add t to OutputSet
4. EXPERIMENTS
In this section, the performance and effectiveness of the schema-driven pruning algorithms and biclustering techniques of embodiments of the invention are discussed. The experiments were conducted using a MovieLens dataset from the GroupLens project site (http://www.grouplens.org/node/12). The dataset consists of 1 million ratings from 6040 users on 3638 movies. The user tuples have four attributes (age, gender, location, occupation). For the movie database, the attributes include 3 genres, director, writers, actors (with an associated rank indicating the significance of the actor's role), year, country, etc.
Five meaningful attributes were extracted for experimental purposes: (year, main genre, country, director, main actor). Each user has rated on average 100 movies on a scale of 1 to 5. The experiments were conducted on a Linux machine running a Dual Core AMD Opteron 2.2GHz processor with 8GB of memory. All the algorithms were implemented in Java and executed in main memory, with all the data loaded into memory at once and no further disk access. The analysis includes (i) a performance evaluation of the space pruning algorithms in terms of running time, and of queries involving multiple types of constraints including set, rating and holistic constraints (Section 4.1); and (ii) a performance evaluation of the biclustering techniques in terms of running time, together with a quality evaluation of the biclusters generated (Section 4.2).
4.1 Performance of Pruning Techniques
The performance of the pruning techniques was evaluated based on the following three properties of a promotional pattern query:
Dataset Size. The running time of the algorithms was considered for different sizes of the dataset relevant to the mining task specified by a promotional pattern query. Six subsets of the MovieLens dataset of increasing size were considered for each experiment. The dataset size is characterized by the number of users, the number of items, the number of ratings, and the number of usersets and itemsets for the given schema (shown in Figure 8).
Ratings Summary Functions. The algorithms were evaluated for different types of ratings summary functions characterized by their monotonicity properties on the userset and itemset lattices. The four selected RSFs were Ratings Count and Ratings Sum, both of which are U-monotonic and I-monotonic; Ratings Density, which is neither U-monotonic nor I-monotonic; and Average Ratings Cover, which is U-monotonic but not I-monotonic.
Constraint Type. Lastly, the performance of the algorithms was evaluated for two types of rating constraint on RSFs, namely (1) the threshold δ-constraint and (2) the top-k constraint. The δ-constraint computes all the patterns whose RSF score is greater than δ. The top-k constraint computes the patterns with the top-k scores. For each RSF considered above, the value of δ was varied to cover the entire range of scores induced by the RSF on all pairs of usersets and itemsets. For example, the δ value for Ratings Count is varied from $10^1$ to $10^5$.
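Three of the RSFs named above, and the threshold δ-constraint, can be illustrated with the following sketch. It assumes ratings are held in a dictionary keyed by (user, item), and it takes Ratings Density to be the sum of ratings divided by the product of the userset and itemset sizes, which is consistent with density scores ranging from 0 to 5 on a 1-to-5 scale. Average Ratings Cover is omitted because its formula is not recoverable from the text. All function names are hypothetical.

```python
def ratings_count(ratings, qu, qi):
    """Number of ratings given by users in qu to items in qi."""
    return sum(1 for (u, i) in ratings if u in qu and i in qi)


def ratings_sum(ratings, qu, qi):
    """Sum of the ratings given by users in qu to items in qi."""
    return sum(r for (u, i), r in ratings.items() if u in qu and i in qi)


def ratings_density(ratings, qu, qi):
    """Ratings sum normalised by the number of (user, item) pairs.

    Assumption: unrated pairs contribute zero to the numerator.
    """
    return ratings_sum(ratings, qu, qi) / (len(qu) * len(qi))


def satisfies_delta(score, delta):
    """Threshold constraint: keep patterns scoring strictly above delta."""
    return score > delta
```

Shrinking a userset can only lower the count and the sum, which is the monotonicity the pruning relies on, whereas the density can move in either direction because its denominator shrinks as well.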
It was observed that a naive algorithm, which enumerates all pairs of usersets and itemsets in no particular order and evaluates the rating constraints on each of them, is several orders of magnitude slower than the basic rank join algorithm even for smaller datasets (e.g., a 1000x600 ratings matrix). Hence, the experiments focussed on analyzing and comparing the performance of the basic GroupRJN algorithm and the advanced DCRJN algorithm (Section 3). The two space pruning algorithms were compared using four measures corresponding to the optimization principles discussed in Section 3: (1) Running Time, the overall running time of the algorithms; (2) Enumeration Time, the time to materialize usersets and itemsets for result computation; (3) Aggregation Time, the score aggregation time for the given RSF (g) on pairs of usersets and itemsets, for their score projection functions ($H_U$ and $H_I$) on individual sets, and for their tree upper bounds; and (4) Sorting and Rank Join Time, the time to perform sorting and rank join in both algorithms. The performance of the algorithms is demonstrated in the following experiments.
Dataset Experiment. The performance of DCRJN is compared with that of the basic rank join algorithm in Figures 9 to 12 (a-d) for the four measures discussed above by varying the dataset size (Fig. 8). In this experiment, the top-10 patterns are retrieved, and hence the value of the threshold δ for a ratings summary function is the smallest score computed by an RSF among the top-10 patterns. The running time is measured in seconds. It was observed that for Ratings Count and Ratings Sum, which are U-monotonic and I-monotonic, and to an extent for Average Ratings Cover, which is U-monotonic but not I-monotonic, the principles of tree upper bound and aggregation sharing together are effective in lowering the running time by 4 to 18 orders of magnitude. For certain RSFs, over larger datasets DCRJN is 10-20 times faster. Tree upper bounds control the number of usersets and itemsets materialized. Hence, the number of usersets and itemsets between which ratings are aggregated is also significantly lowered. As a result, the enumeration time, aggregate score computation time and rank join time are several orders better than in the basic rank join implementation.
For Ratings Density, which is non-monotonic in both the usersets and itemsets, it is not possible to achieve a significant improvement in pruning, except for minimizing the aggregate score computation time through sharing of aggregation. This can be explained as follows. The pruning power of the score projection functions ($H_U$ and $H_I$) chosen for Ratings Density is minimal in both the basic rank join and DCRJN algorithms. A large statistical difference was observed between the range of values computed by the composite monotonic function of $H_U$ and $H_I$ and the actual scores of patterns for Ratings Density. The actual scores for Ratings Density vary from 0 to 5, while 82% of the values computed by $H_U(Q^u,R) \cdot H_I(Q^i,R)$ for patterns $(Q^u, Q^i, R[Q^u,Q^i])$ are greater than 5. Hence, a large number of patterns evaluated by the GroupRJN algorithm (and hence DCRJN) in this scenario have scores greater than δ at any given time. Therefore, it performs as badly as a naive baseline which computes scores for all pairs. Sample query results (top-3 patterns) are provided for the four RSFs in the table shown in Fig. 13 on the largest dataset (6000x3600 ratings matrix).
Threshold Experiment. Unlike the previous experiment, in which a top-10 constraint was assumed and the value of the threshold δ was computed dynamically as the algorithm progressed, in this evaluation the value of δ is fixed for each run and varied over the range of scores induced by an RSF. The overall running time of DCRJN is compared with that of the GroupRJN algorithm in Figures 15 (a-d). It was observed that for higher values of δ both algorithms perform better, and DCRJN performs 3 to 10 orders of magnitude faster for all RSFs except Ratings Density. For Ratings Density, a better running time was obtained than with the basic rank join implementation by an order of 0.25 (on average).
Set and Holistic Constraints. An experiment was performed with queries involving multiple constraints. A set constraint of support > 5 on the individual usersets and itemsets was added, together with a holistic constraint of diversity on the usersets and itemsets of the patterns. The overall running time is plotted in Figures 14 (a-b). The grey blocks on the top represent the additional time taken for evaluating set and holistic constraints. It was observed that the running time for queries that also involve set and holistic constraints scales proportionately to the time for queries involving only rating constraints.
4.2 Performance and Effectiveness of Biclustering
The efficiency of the biclustering techniques discussed in Section 3 is evaluated first. Biclustering algorithms were implemented for three different types of biclusters: (1) biclusters with constant values (Type 1); (2) biclusters with constant values on rows (Type 2); and (3) biclusters with constant values on columns (Type 3). The running times of the implementation are plotted by varying the number of clusters generated and the dataset size in Figures 16 and 18 (a-d). The running time for delta biclustering on a dataset in which the range of matrix values is large is usually of the order of $10^4$ seconds for a 3000x500 data matrix. For the MovieLens ratings dataset, the range of matrix values is much smaller (1 to 5) compared to other datasets, and the running time is expected to be much higher.
An implementation of an embodiment of the invention was tested on smaller datasets extracted from the MovieLens dataset. Figure 16(a-d) plots the running time against the number of clusters for four different sizes of the data matrix: (1) 200 x 100, (2) 400 x 200, (3) 600 x 300 and (4) 800 x 400. Figure 18(a-d) plots the running time against the dataset size by varying the number of clusters generated. The running time increases rapidly for Type 1 clusters with an increase in the number of biclusters and the dataset size. A much smaller running time is achieved for Type 2 and Type 3 biclusters.
Cluster Quality. In biclustering techniques the ground truth about correct biclusters is not available in advance. Hence, the biclustering results are evaluated using a quality measure called the residue function. For a perfect bicluster the residue is zero, and a smaller residue represents a more coherent bicluster. The mean values of the residue function for four different numbers of clusters generated are presented in the table shown in Fig. 17. The mean residue values can in general be arbitrarily large (the upper bound being ∞). The mean residue values in the implementation are all less than 1 and tend to be closer to 0, indicating good cluster quality.
Embodiments of the invention present Prompt, an exploratory system for mining promotional patterns in large collaborative rating systems. Collaborative rating systems generate large amounts of data in the form of ratings and text reviews given by users to items, which can be leveraged to extract business intelligence for promoting sets of items to sets of users. It is important to have an expressive language for constrained promotional pattern queries specifying different types of constraints on usersets, itemsets, ratings and patterns. It is equally important, given the complexity of the mining and exploration tasks involved, for the techniques employed to be computationally efficient and scalable at this level.
When used in this specification and claims, the terms "comprises" and "comprising" and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the presence of other features, steps or components.
Techniques available for implementing aspects of embodiments of the invention:
[1] G. Adomavicius, R. Sankaranarayanan, S. Sen, and A. Tuzhilin. Incorporating contextual information in recommender systems using a multidimensional approach. ACM Transactions on Information Systems (TOIS), 23(1):103-145, 2005.
[2] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, IEEE Transactions on, 17(6):734-749, 2005.
[3] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207-216. ACM, 1993.
[4] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cube. ACM SIGMOD Record, 28(2):359-370, 1999.
[5] F. Bonchi, F. Giannotti, C. Lucchese, S. Orlando, R. Perego, and R. Trasarti. A constraint-based querying system for exploratory pattern discovery. Information Systems, 34(1):3-27, 2009.
[6] R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331-370, 2002.
[7] Y. Cheng and G. Church. Biclustering of expression data. In Proceedings of the eighth international conference on intelligent systems for molecular biology, volume 1, pages 93-103, 2000.
[8] C. Das, P. Maji, and S. Chattopadhyay. A novel biclustering algorithm for discovering value-coherent overlapping σ-biclusters. In Advanced Computing and Communications, 2008. ADCOM 2008. 16th International Conference on, pages 148-156. IEEE, 2008.
[9] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[10] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. In ACM SIGMOD Record, volume 25, pages 13-23. ACM, 1996.
[11] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals, pages 152-159, Feb-1 Mar 1996.
[12] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery, 15(1):55-86, 2007.
[13] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the International Conference on Very Large Data Bases, pages 420-431, 1995.
[14] J. Han, L. Lakshmanan, and R. Ng. Constraint-based, multidimensional data mining. Computer, 32(8):46-50, 1999.
[15] I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid. Joining ranked inputs in practice. In VLDB, pages 950-961, 2002.
[16] T. Imieliński and A. Virmani. Msql: A query language for database mining. Data Mining and Knowledge Discovery, 3(4):373-408, 1999.
[17] M. Kamber, J. Han, and J. Chiang. Metarule-guided mining of multidimensional association rules using data cubes. In KDD, volume 97, page 207, 1997.
[18] J. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. Data Min. Knowl. Discov., 2(4):311-324, Dec. 1998.
[19] P. Kotler and K. Keller. A framework for marketing management. 2003.
[20] G. Koutrika, B. Bercovitz, and H. Garcia-Molina. Flexrecs: expressing and combining flexible recommendations. In Proceedings of the 35th SIGMOD international conference on Management of data, pages 745-758. ACM, 2009.
[21] S. Madeira and A. Oliveira. Biclustering algorithms for biological data analysis: a survey. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 1(1):24-45, 2004.
[22] R. Ng, L. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In ACM SIGMOD Record, volume 27, pages 13-24. ACM, 1998.
[23] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In ACM SIGMOD Record, volume 25, pages 1-12. ACM, 1996.
[24] R. Srikant and R. Agrawal. Mining generalized association rules. Future Generation Computer Systems, 13(2):161-180, 1997.
[25] T. Wu, D. Xin, Q. Mei, and J. Han. Promotion analysis in multi-dimensional space. Proc. VLDB Endow., 2(1):109-120, 2009.
[26] X. Zhang, P. L. Chou, and G. Dong. Efficient computation of iceberg cubes by bounding aggregate functions. IEEE Trans. on Knowl. and Data Eng., 19(7):903-918, July 2007.

CLAIMS:
1. A method of processing a ratings dataset, the ratings dataset incorporating data identifying a plurality of users U, a plurality of items I and a set of ratings R allocated by the users U to the items I, the method comprising:
defining a subset of users U in the dataset as a userset $Q^u$,
defining a subset of items I in the dataset as an itemset $Q^i$,
receiving at least one rating constraint specifying at least one constraint on the set of ratings between the userset $Q^u$ and the itemset $Q^i$,
inputting each rating constraint into a ratings summary function $g(R[Q^u,Q^i])\ \theta\ \delta$ to define a function of ratings between the userset $Q^u$ and the itemset $Q^i$, where $\theta$ is selected from one of $=, >, <, \ge, \le$ and $\delta \in \mathbb{R}$,
projecting the ratings summary function separately as $H_U(Q^u,R)$ on the space of all usersets and $H_I(Q^i,R)$ on the space of all itemsets to identify pairs of usersets and itemsets $(Q^u, Q^i)$ that have a score for the ratings summary function that is greater than δ, and
analysing the identified pairs of usersets and itemsets to find patterns in the ratings dataset according to each rating constraint.
2. A method according to claim 1, wherein the method further comprises a rank-join method to identify pairs of usersets and itemsets, the rank-join method comprising:
calculating a score $H_U(Q^u,R)$ for all usersets in the space,
sorting a list of usersets by the calculated scores,
calculating a score $H_I(Q^i,R)$ for all itemsets belonging to the space of itemsets,
sorting a list of itemsets by the calculated scores,
merging the list of usersets with the list of itemsets in their sorted orders and ranking them.
3. A method according to claim 1 or claim 2, wherein the ratings summary function is the ratings count function $g(R[Q^u,Q^i]) = \sum_{u \in Q^u, i \in Q^i} I(u,i)$, where I is an indicator function with value 1 if u has rated i, and 0 otherwise.
4. A method according to claim 1 or claim 2, wherein the ratings summary function is the ratings sum function $g(R[Q^u,Q^i]) = \sum_{u \in Q^u} \sum_{i \in Q^i} R(u,i)$.
5. A method according to claim 1 or claim 2, wherein the ratings dataset is a binary ratings model where $L(u_i)$ denotes the set of items rated 1 by user $u_i$ and $L(Q^u) = L(u_i), u_i \in Q^u$, and wherein the ratings summary function is the ratings cover function $g(R[Q^u,Q^i]) =$ [Figure imgf000039_0001].
6. A method according to claim 1 or claim 2, wherein the ratings summary function is the ratings density function $g(R[Q^u,Q^i]) = \frac{\sum_{u \in Q^u} \sum_{i \in Q^i} R(u,i)}{|Q^u||Q^i|}$.
7. A method according to claim 1 or claim 2, wherein $t = (Q^u,Q^i,R[Q^u,Q^i])$ is a pattern and $d_t$ is the ratings density of the pattern t and wherein the ratings summary function is the ratings variance function [Figure imgf000039_0002].
8. A method according to claim 1 or claim 2, wherein the ratings dataset is a binary ratings model where $L(u_i)$ denotes the set of items rated 1 by user $u_i$ and $L(Q^u) = L(u_i), u_i \in Q^u$, and wherein the ratings summary function is the average ratings cover function [Figure imgf000039_0003].
9. A method according to claim 1 or claim 2, wherein the ratings summary function g is the entropy of the ratings distribution $\{R(u,i) : u \in Q^u, i \in Q^i\}$.
10. A method of processing a ratings dataset, the ratings dataset incorporating data identifying a plurality of users U, a plurality of items I and a set of ratings R allocated by the users U to the items I, the method comprising:
storing the ratings dataset as a matrix with the users and the items in respective rows and columns; and
processing the matrix using a biclustering algorithm to detect biclusters of subsets of the rows and columns that exhibit a high similarity score.
11. A method of processing a ratings dataset according to claim 10, wherein the biclustering algorithm comprises a mean square residue (MSR) function.
12. A method of processing a ratings dataset according to claim 10, wherein the biclustering algorithm comprises a Delta biclustering algorithm.
PCT/EP2013/058931 2013-04-29 2013-04-29 A method of processing a ratings dataset WO2014177181A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2013/058931 WO2014177181A1 (en) 2013-04-29 2013-04-29 A method of processing a ratings dataset

Publications (2)

Publication Number Publication Date
WO2014177181A1 true WO2014177181A1 (en) 2014-11-06
WO2014177181A9 WO2014177181A9 (en) 2015-05-14

Family

ID=48190522

Country Status (1)

Country Link
WO (1) WO2014177181A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113472582A (en) * 2020-07-15 2021-10-01 北京沃东天骏信息技术有限公司 System and method for alarm correlation and alarm aggregation in information technology monitoring
US11449514B2 (en) 2019-12-27 2022-09-20 Interset Software LLC Approximate aggregation queries

Non-Patent Citations (28)

* Cited by examiner, † Cited by third party
Title
"STATEMENT IN ACCORDANCE WITH THE NOTICE FROM THE EUROPEAN PATENT OFFICE DATED 1 OCTOBER 2007 CONCERNING BUSINESS METHODS - PCT / ERKLAERUNG GEMAESS DER MITTEILUNG DES EUROPAEISCHEN PATENTAMTS VOM 1.OKTOBER 2007 UEBER GESCHAEFTSMETHODEN - PCT / DECLARATION CONFORMEMENT AU COMMUNIQUE DE L'OFFICE EUROP", 20071101, 1 November 2007 (2007-11-01) - 1 November 2007 (2007-11-01), XP007905525 *
C. DAS; P. MAJI; S. CHATTOPADHYAY: "A novel biclustering algorithm for discovering value-coherent overlapping o-bidusters", ADVANCED COMPUTING AND COMMUNICATIONS, 2008
CONFERENCE, 2008, pages 148 - 156
F. BONCHI; F. GIANNOTTI; C. LUCCHESE; S. ORLANDO; R. PEREGO; R. TRASARTI.: "A constraint-based querying system for exploratory pattern discovery", INFORMATION SYSTEMS, vol. 34, no. 1, 2009, pages 3 - 27
G. ADOMAVICIUS; A. TUZHILIN: "Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions", KNOWLEDGE AND DATA ENGINEERING, IEEE TRANSACTIONS ON, vol. 17, no. 6, 2005, pages 734 - 749, XP011130675, DOI: doi:10.1109/TKDE.2005.99
G. ADOMAVICIUS; R. SANKARANARAYANAN; S. SEN; A. TUZHILIN: "Incorporating contextual information in recommender systems using a multidimensional approach", ACM TRANSACTIONS ON INFORMATION SYSTEMS (TOIS), vol. 23, no. 1, 2005, pages 103 - 145, XP002602128, DOI: doi:10.1145/1055709.1055714
G. KOUTRIKA; B. BERCOVITZ; H. GARCIA-MOLINA: "Proceedings of the 35th SIGMOD international conference on Management of data", 2009, ACM, article "Flexrecs: expressing and combining flexible recommendations", pages: 745 - 758
I. F. ILYAS; W. G. AREF; A. K. ELMAGARMID: "Joining ranked inputs in practice", VLDB, 2002, pages 950 - 961
J. DEAN; S. GHEMAWAT: "Mapreduce: Simplified data processing on large clusters", COMMUNICATIONS OF THE ACM, vol. 51, no. 1, 2008, pages 107 - 113, XP002630192, DOI: doi:10.1145/1327452.1327492
J. GRAY; A. BOSWORTH; A. LAYMAN; H. PIRAHESH: "Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-total", February 1996 (1996-02-01), pages 152 - 159, XP010158909, DOI: doi:10.1109/ICDE.1996.492099
J. HAN; H. CHENG; D. XIN; X. YAN: "Frequent pattern mining: current status and future directions", DATA MINING AND KNOWLEDGE DISCOVERY, vol. 15, no. 1, 2007, pages 55 - 86, XP019525924, DOI: doi:10.1007/s10618-006-0059-1
J. HAN; L. LAKSHMANAN; R. NG: "Constraint-based, multidimensional data mining", COMPUTER, vol. 32, no. 8, 1999, pages 46 - 50, XP000923708, DOI: doi:10.1109/2.781634
J. HAN; Y. FU: "Discovery of multiple-level association rules from large databases", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES, 1995, pages 420 - 431
J. KLEINBERG; C. PAPADIMITRIOU; P. RAGHAVAN: "A microeconomic view of data mining", DATA MIN. KNOWL. DISCOV., vol. 2, no. 4, December 1998 (1998-12-01), pages 311 - 324
K. BEYER; R. RAMAKRISHNAN: "Bottom-up computation of sparse and iceberg cube", ACM SIGMOD RECORD, vol. 28, no. 2, 1999, pages 359 - 370, XP058236134, DOI: doi:10.1145/304182.304214
M. KAMBER; J. HAN; J. CHIANG: "Metarule-guided mining of multi-dimensional association rules using data cubes", KDD, vol. 97, 1997, pages 207
P. KOTLER; K. KELLER, A FRAMEWORK FOR MARKETING MANAGEMENT, 2003
R. AGRAWAL; T. IMIELIŃSKI; A. SWAMI: "ACM SIGMOD Record", vol. 22, 1993, ACM, article "Mining association rules between sets of items in large databases", pages: 207 - 216
R. BURKE: "Hybrid recommender systems: Survey and experiments", USER MODELING AND USER-ADAPTED INTERACTION, vol. 12, no. 4, 2002, pages 331 - 370
R. NG; L. LAKSHMANAN; J. HAN; A. PANG: "ACM SIGMOD Record", vol. 27, 1998, ACM, article "Exploratory mining and pruning optimizations of constrained associations rules", pages: 13 - 24
R. SRIKANT; R. AGRAWAL: "ACM SIGMOD Record", vol. 25, 1996, ACM, article "Mining quantitative association rules in large relational tables", pages: 1 - 12
R. SRIKANT; R. AGRAWAL: "Mining generalized association rules", FUTURE GENERATION COMPUTER SYSTEMS, vol. 13, no. 2, 1997, pages 161 - 180, XP004099492, DOI: doi:10.1016/S0167-739X(97)00019-8
S. MADEIRA; A. OLIVEIRA: "Biclustering algorithms for biological data analysis: a survey", COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, IEEE/ACM TRANSACTIONS ON, vol. 1, no. 1, 2004, pages 24 - 45, XP011117813, DOI: doi:10.1109/TCBB.2004.2
T. FUKUDA; Y. MORIMOTO; S. MORISHITA; T. TOKUYAMA: "ACM SIGMOD Record", vol. 25, 1996, ACM, article "Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization", pages: 13 - 23
T. IMIELIŃSKI; A. VIRMANI: "Msql: A query language for database mining", DATA MINING AND KNOWLEDGE DISCOVERY, vol. 3, no. 4, 1999, pages 373 - 408
T. WU; D. XIN; Q. MEI; J. HAN: "Promotion analysis in multi-dimensional space", PROC. VLDB ENDOW, vol. 2, no. 1, 2009, pages 109 - 120
X. ZHANG; P. L. CHOU; G. DONG: "Efficient computation of iceberg cubes by bounding aggregate functions", IEEE TRANS. ON KNOWL. AND DATA ENG., vol. 19, no. 7, July 2007 (2007-07-01), pages 903 - 918, XP011185010, DOI: doi:10.1109/TKDE.2007.1053
Y. CHENG; G. CHURCH: "Biclustering of expression data", PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY, vol. 1, 2000, pages 93 - 103

Also Published As

Publication number Publication date
WO2014177181A9 (en) 2015-05-14

Similar Documents

Publication Publication Date Title
Gan et al. A survey of utility-oriented pattern mining
Liu et al. Graph summarization methods and applications: A survey
Yuan et al. Index-based densest clique percolation community search in networks
Xiang et al. Summarizing transactional databases with overlapped hyperrectangles
Pool et al. Description-driven community detection
US20160328406A1 (en) Interactive recommendation of data sets for data analysis
Chen et al. Location-aware top-k term publish/subscribe
US20030120630A1 (en) Method and system for similarity search and clustering
Krishnamoorthy Efficient mining of high utility itemsets with multiple minimum utility thresholds
Jiang et al. Probabilistic skylines on uncertain data: model and bounding-pruning-refining methods
Wu et al. Mining association rules for low-frequency itemsets
Chung et al. Categorization for grouping associative items using data mining in item-based collaborative filtering
Deepak et al. Operators for similarity search: Semantics, techniques and usage scenarios
Leung et al. Finding efficiencies in frequent pattern mining from big uncertain data
Liu et al. Collaborative prediction for multi-entity interaction with hierarchical representation
Silva et al. Constrained pattern mining in the new era
Zhao et al. Monochromatic and bichromatic ranked reverse boolean spatial keyword nearest neighbors search
Gao et al. Efficient algorithms for finding the most desirable skyline objects
Dalal et al. Review on high utility itemset mining algorithms for big data
Yang et al. LAZY R-tree: The R-tree with lazy splitting algorithm
WO2014177181A1 (en) A method of processing a ratings dataset
Dong et al. Aggregate reverse rank queries
Yang et al. A keyword-based scholar recommendation framework for biomedical literature
Georgoulas et al. User-centric similarity search
Opoku-Mensah et al. Understanding user situational relevance in ranking web search results

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 13718859; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 13718859; Country of ref document: EP; Kind code of ref document: A1)