CN106250526A

CN106250526A - Text recommendation method and apparatus based on content and user behavior

Info

Publication number: CN106250526A
Application number: CN201610635123.1A
Authority: CN
Inventors: 张达; 亓开元; 苏志远
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2016-08-05
Filing date: 2016-08-05
Publication date: 2016-12-21

Abstract

The present invention provides a text recommendation method based on content and user behavior. The method comprises the steps of: obtaining a document collection to be analyzed and performing Chinese word segmentation on its documents to obtain a plurality of terms; calculating the information gain of the terms in the document collection, sorting them by information gain, and selecting a plurality of terms as a reference vector; converting the texts in the document collection into a multidimensional vector space model according to the reference vector; performing TF-IDF calculation on the vector space model to obtain a text vector matrix; calculating the similarity between different text vector matrices to form a document relationship matrix; and analyzing user behavior data and, in combination with the document relationship matrix, forming a recommendation list that is recommended to the user. The apparatus comprises a word segmentation module, an IG calculation module, a dimensionality reduction module, a TF-IDF calculation module, a similarity calculation module and a recommendation module. The method and apparatus can improve the effectiveness of text content recommendation for users.

Description

Text recommendation method and apparatus based on content and user behavior

Technical field

The present invention relates to the technical field of data mining, and in particular to a text recommendation method and apparatus based on content and user behavior.

Background art

The emergence and spread of the Internet have brought users a large amount of information and met their demand for information in the information age. However, with the rapid development of the network, the amount of online information has grown so sharply that users facing it cannot pick out the part that is actually useful to them, and the efficiency with which information is used falls instead. This is the so-called information overload problem.

At present, one solution to the information overload problem is the information retrieval system, represented by the search engine, which plays an extremely important role in helping users obtain online information. However, current search engines can usually only match the characters entered by the user; the same keywords return the same results, so different results cannot be obtained for diverse search needs. On the other hand, information and its dissemination are diverse, and users' demands for information are diversified and personalized. The results returned by information retrieval systems represented by search engines therefore cannot meet users' individual needs, and still cannot solve the information overload problem well.

Summary of the invention

The present invention provides a text recommendation method and apparatus based on content and user behavior to solve the above technical problem.

The text recommendation method based on content and user behavior provided by the present invention comprises the steps of:

Step A: obtaining a document collection to be analyzed, and performing Chinese word segmentation on the documents in the document collection to obtain a plurality of terms;

Step B: calculating the information gain of the terms in the document collection, sorting them by information gain, and selecting a plurality of terms as a reference vector;

Step C: converting the texts in the document collection into a multidimensional vector space model according to the reference vector;

Step D: performing TF-IDF calculation on the vector space model to obtain a text vector matrix;

Step E: calculating the similarity between different text vector matrices to form a document relationship matrix;

Step F: analyzing user behavior data and, in combination with the document relationship matrix, forming a recommendation list and recommending it to the user.

Before step A, the method further comprises:

generating a dictionary and persisting the dictionary in a dictionary library.

The step of generating the dictionary comprises:

obtaining a plurality of original dictionaries that record vocabulary information; using a TreeSet to automatically sort, load, filter and aggregate the words recorded in the original dictionaries; generating a List according to the attributes of the words; and then generating the dictionary through a double-array trie and persisting it in the dictionary library.

In step A, performing Chinese word segmentation on the documents in the document collection comprises:

collecting the loaded dictionary from the dictionary library into memory; performing multi-pattern matching based on the DATrie to generate candidate paths, while splitting the words in the document and filtering out special characters; and calculating the optimal path for matching to generate the word segmentation result.

After step A and before step B, the method further comprises:

extracting noun terms and filtering out stop words.

In step B, calculating the information gain of the terms in the document collection comprises:

treating each text as a category and the terms in the text as features, and calculating the information gain according to the following formula:

$$IG(T) = -\sum_{i=1}^{n} P(C_i) \log_2 P(C_i) + P(t) \sum_{i=1}^{n} P(C_i \mid t) \log_2 P(C_i \mid t) + P(\bar{t}) \sum_{i=1}^{n} P(C_i \mid \bar{t}) \log_2 P(C_i \mid \bar{t})$$

where N is the total number of texts in the text set; $P(C_i)$ is the probability that category $C_i$ occurs; $P(t)$ is the probability that feature T occurs; $P(\bar{t})$ is the probability that feature T does not occur; $P(C_i \mid t)$ is the probability that a text containing feature T belongs to category $C_i$; and $P(C_i \mid \bar{t})$ is the probability that a text not containing feature T belongs to category $C_i$.

In step D, performing TF-IDF calculation on the vector space model comprises:

calculating TF-IDF according to the following formulas:

$$\mathrm{TF} = \log(1 + f_{t,d})$$

$$\mathrm{IDF} = \log\left(1 + \frac{N}{n_t}\right)$$

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$

where TF is the term frequency and IDF is the inverse document frequency.

The step of calculating the similarity between different text vector matrices comprises:

obtaining the similarity of two different text vector matrices by calculating the cosine of the angle between them, using the following cosine similarity formula:

$$\mathrm{sim}(i, j) = \frac{\sum_{k=1}^{n} w_{i,k} \times w_{j,k}}{\sqrt{\sum_{k=1}^{n} w_{i,k}^{2}} \times \sqrt{\sum_{k=1}^{n} w_{j,k}^{2}}}$$

where $w_{i,k}$ is the TF-IDF result of a term.

After step A and before step B, the method further comprises:

vectorizing the texts in the document collection and storing them using a Hash data structure.

The present invention also provides an apparatus for text recommendation based on content and user behavior, comprising:

a word segmentation module, configured to obtain a document collection to be analyzed and perform Chinese word segmentation on the documents in the document collection to obtain a plurality of terms;

an IG calculation module, configured to calculate the information gain of the terms in the document collection, sort them by information gain, and select a plurality of terms as a reference vector;

a dimensionality reduction module, configured to convert the texts in the document collection into a multidimensional vector space model according to the reference vector;

a TF-IDF calculation module, configured to perform TF-IDF calculation on the vector space model to obtain a text vector matrix;

a similarity calculation module, configured to calculate the similarity between different text vector matrices to form a document relationship matrix;

a recommendation module, configured to analyze user behavior data and, in combination with the document relationship matrix, form a recommendation list and recommend it to the user.

The embodiments of the present invention provide a text recommendation method and apparatus based on content and user behavior. Through dictionary generation, text segmentation, feature selection, TF-IDF calculation, similarity calculation and user behavior analysis, the similarity between texts is mined, and the user's interests are analyzed and modeled from the user's historical data, so that information meeting the user's interests and needs is actively recommended. This achieves personalized data recommendation based on user behavior, reduces the blindness and ineffectiveness of recommended data, and improves the accuracy and efficiency of data recommendation.

Brief description of the drawings

Fig. 1 is a schematic flowchart of one embodiment of the text recommendation method based on content and user behavior of the present invention;

Fig. 2 is a schematic flowchart of the method provided by embodiment two of the present invention;

Fig. 3 is a schematic flowchart of dictionary generation in embodiment two of the present invention;

Fig. 4 is a schematic flowchart of word segmentation in embodiment two of the present invention;

Fig. 5 is a schematic diagram of the similarity calculation flow in embodiment two of the present invention;

Fig. 6 is a schematic structural diagram of the apparatus provided by embodiment three of the present invention.

Detailed description of the invention

The embodiments of the present invention provide a text recommendation method and apparatus based on content and user behavior.

Embodiment one

As shown in Fig. 1, the method provided by embodiment one of the present invention comprises the steps of:

Step S110: obtain a document collection to be analyzed, and perform Chinese word segmentation on the documents in the document collection to obtain a plurality of terms.

Step S111: calculate the information gain of the terms in the document collection, sort them by information gain, and select a plurality of terms as a reference vector.

Step S112: convert the texts in the document collection into a multidimensional vector space model according to the reference vector.

Step S113: perform TF-IDF calculation on the vector space model to obtain a text vector matrix.

Step S114: calculate the similarity between different text vector matrices to form a document relationship matrix.

Step S115: analyze user behavior data and, in combination with the document relationship matrix, form a recommendation list and recommend it to the user.

Optionally, a dictionary is built in advance before Chinese word segmentation is performed, and the dictionary is persisted in a dictionary library.

Dictionary generation mainly takes various dictionaries (internal and external) and user dictionaries, uses a TreeSet to automatically sort, load, filter and aggregate them, and persists the result in the dictionary library. A List is generated according to the attributes of the words, and the final dictionary, Dict DATrie, is then generated through a Double-Array Trie (DAT). The Double-Array Trie is a variant of the trie, a data structure proposed to improve space utilization while preserving the trie's retrieval speed. It is essentially a deterministic finite automaton (DFA): each node represents one state of the automaton, state transitions are made according to the input, and the query is complete when an end state is reached or no transition is possible.
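The lookup behavior described above (one state per node, one transition per input character, stopping at an end state or when no transition exists) can be sketched with a plain dictionary-based trie. This is a simplified illustration only: it does not use the double-array layout, and the class and method names (`Trie`, `contains`, `prefixes`) are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # character -> next state
        self.is_word = False    # True if this state ends a dictionary word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        # Sorted and de-duplicated first, similar in spirit to a TreeSet.
        for word in sorted(set(words)):
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:        # no transition possible: lookup fails
                return False
        return node.is_word

    def prefixes(self, text, start=0):
        """Return every dictionary word that begins at text[start]."""
        found, node = [], self.root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            if node.is_word:
                found.append(text[start:end + 1])
        return found


trie = Trie(["中国", "中国人", "人民"])
```

A call such as `trie.prefixes(text, i)` returns all dictionary words starting at position `i`, which is the kind of multi-pattern match a segmenter needs when it builds candidate paths.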

The purpose of generating the dictionary is to collect a large vocabulary; the richer the dictionary, the more accurate the word segmentation result.

Text segmentation converts text into words with a given algorithm and gathers statistics about the text, such as document frequency, term frequency and total word count. The text is vectorized and stored in the in-memory database for use in subsequent steps.

Text segmentation collects the loaded dictionary from the dictionary library into memory and performs multi-pattern matching based on the previously generated DATrie dictionary to generate candidate paths. At the same time it splits words, filters out special characters, adds word-frequency weights and so on, and calculates path costs to obtain the optimal path for matching and generate the segmentation result. "Special characters" is a term well known in the art and covers non-text symbols such as punctuation marks and spaces.
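The candidate-path and optimal-path idea can be sketched with dynamic programming over all dictionary matches. The patent does not state how path cost is computed; the sketch below assumes the cost is the sum of log word probabilities derived from word frequencies, and the names `segment`, `strip_special` and the sample frequencies are illustrative only.

```python
import math


def strip_special(text):
    """Drop punctuation and whitespace (the 'special characters')."""
    return "".join(ch for ch in text if ch.isalnum())


def segment(text, freq):
    """Split text by maximizing the sum of log word probabilities.

    Assumption: path score = sum of log(freq / total); the source only
    says that word-frequency weights enter the path cost.
    """
    total = sum(freq.values())
    max_len = max(len(w) for w in freq)
    n = len(text)
    # best[i] = (score of the best split of text[i:], end of its first word)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            if word in freq:
                candidates.append((math.log(freq[word] / total) + best[j][0], j))
        if not candidates:
            # Unknown single character: keep it, with a heavy penalty.
            candidates.append((math.log(1 / (total * 10)) + best[i + 1][0], i + 1))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


freq = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 30, "的": 200}
result = segment(strip_special("研究生命的起源。"), freq)
```

With these sample frequencies the path through "研究" and "生命" scores higher than the path through "研究生" and "命", so the ambiguity is resolved by cost rather than by longest match.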

After segmentation, noun extraction and stop-word filtering are further applied to the segmentation result: it is matched against noun and stop-word dictionaries to filter out words unrelated to the text's features. Optionally, the stop-word dictionary and parts of speech are collected in advance. Nouns are generally considered the most important part of speech for distinguishing one text from another. The result is therefore screened by noun part of speech and by a stop-word filter, to ensure the distinctiveness and accuracy of the text similarity calculation.
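A minimal sketch of this filtering step, assuming the segmenter returns (word, part-of-speech) pairs and that noun tags start with "n". The tag scheme and the stop-word list below are illustrative assumptions, not taken from the patent.

```python
# Illustrative stop-word list (assumption); a real list would be loaded
# from the stop-word dictionary collected in advance.
STOP_WORDS = {"的", "得", "地"}


def keep_nouns(tagged_tokens, stop_words=STOP_WORDS):
    """Keep tokens tagged as nouns ('n...') that are not stop words."""
    return [word for word, pos in tagged_tokens
            if pos.startswith("n") and word not in stop_words]


tagged = [("研究", "v"), ("生命", "n"), ("的", "u"), ("起源", "n")]
nouns = keep_nouns(tagged)
```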

After the document set is prepared, the dictionary loaded, Chinese word segmentation performed, stop words filtered, and noun features extracted to produce a noun set, statistics must be gathered for each term, including its frequency of occurrence, its document frequency and the word count of each document. The text is then vectorized and stored in the in-memory database using a Hash key data structure.
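The statistics named above can be gathered in one pass. In this sketch ordinary Python dictionaries stand in for the hash structures of the in-memory database, and `collect_stats` is a hypothetical helper name.

```python
from collections import Counter


def collect_stats(docs):
    """docs: mapping doc_id -> list of noun terms after filtering."""
    # Term frequency per document (one hash per document).
    term_freq = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    # Document frequency: each document counts once per distinct term.
    doc_freq = Counter()
    for counts in term_freq.values():
        doc_freq.update(counts.keys())
    # Word count of each document.
    doc_len = {doc_id: len(terms) for doc_id, terms in docs.items()}
    return term_freq, doc_freq, doc_len


docs = {"d1": ["生命", "起源", "生命"], "d2": ["起源", "宇宙"]}
term_freq, doc_freq, doc_len = collect_stats(docs)
```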

Information gain measures how much information the presence or absence of a feature in a text provides for determining the category to which the text belongs. Information gain (IG) calculation is commonly used to convert a high-dimensional space into a low-dimensional one: based on training data, the information gain of each feature item is calculated, items with very small information gain are deleted, and the remaining items are sorted by information gain in descending order, thereby reducing dimensionality. Specifically, IG is calculated for all words in the whole text system, the words are sorted by the amount of information they contribute to the whole system, and the top N words are selected as the reference vector, which is persisted to the in-memory database using a LIST data structure.

Calculating information gain identifies features that occur frequently in positive samples and rarely in negative samples. Information gain involves considerable mathematical theory and complex entropy formulas. The embodiment of the present invention defines it as the amount of information that a given feature item provides to the whole classification, that is, the difference between the entropy when no feature is considered and the entropy after this feature is considered. The information gain formula provided by the embodiment of the present invention is as follows:

$$IG(T) = -\sum_{i=1}^{n} P(C_i) \log_2 P(C_i) + P(t) \sum_{i=1}^{n} P(C_i \mid t) \log_2 P(C_i \mid t) + P(\bar{t}) \sum_{i=1}^{n} P(C_i \mid \bar{t}) \log_2 P(C_i \mid \bar{t})$$

Symbol description:

N is the total number of texts, which is also the total number of categories. $P(C_i)$ is the probability that category $C_i$ occurs, i.e. the probability that text $D_i$ occurs, and equals $1/N$. $P(t)$ is the probability that feature T occurs, computed as the number of texts containing feature T divided by the total number of texts N, i.e. $P(t) = DF_T / N$, where $DF_T$ is the document frequency of feature T. $P(\bar{t})$ is the probability that feature T does not occur, equal to $1 - P(t)$. $P(C_i \mid t)$ is the probability that a text containing feature T belongs to category $C_i$. $P(C_i \mid \bar{t})$ is the probability that a text not containing feature T belongs to category $C_i$.
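The information gain terms can be evaluated directly once each text is treated as its own category. The patent does not say how $P(C_i \mid t)$ is estimated; this sketch assumes it is uniform over the texts that contain the feature (and likewise for texts that do not), and the function names are hypothetical. Under that assumption the sum reduces to $\log_2 N - P(t)\log_2 DF_T - P(\bar{t})\log_2 (N - DF_T)$.

```python
import math


def information_gain(term, docs):
    """docs: list of term sets, one per text; each text is one category."""
    n = len(docs)
    containing = [term in d for d in docs]
    df = sum(containing)                 # document frequency DF_T
    p_t = df / n
    p_not_t = 1 - p_t

    def plogp(p):
        return p * math.log2(p) if p > 0 else 0.0   # convention: 0*log 0 = 0

    ig = -sum(plogp(1 / n) for _ in docs)            # P(C_i) = 1/N
    # Assumption: P(C_i | t) is uniform over texts containing t, and
    # P(C_i | not t) is uniform over texts not containing t.
    if df:
        ig += p_t * sum(plogp(1 / df) for has in containing if has)
    if n - df:
        ig += p_not_t * sum(plogp(1 / (n - df)) for has in containing if not has)
    return ig


def select_reference_terms(docs, top_n):
    """Rank the vocabulary by information gain and keep the top_n terms."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(t, docs), reverse=True)
    return ranked[:top_n]


docs = [{"a", "b"}, {"a", "c"}, {"a", "d"}, {"b", "d"}]
reference = select_reference_terms(docs, 2)
```

In this toy set the terms found in exactly half of the texts rank highest, since they split the collection most evenly.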

After the information gain calculation, the word segmentation result of each text is converted into the vector space model: according to the IG result, the words of each text are filtered, and every text is represented as an n-dimensional feature vector. That is, a text d can be expressed as the n-dimensional space vector W₁,W₂,…,W_n, where W_i is the weight of the i-th feature item in text d, as follows:

d1→W₁₁,W₁₂,…,W_1n

d2→W₂₁,W₂₂,…,W_2n

…

dn→W_n1,W_n2,…,W_nn
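The conversion amounts to projecting each text onto the fixed list of reference terms and discarding every other word. A minimal sketch, using raw counts as provisional weights and a hypothetical helper `to_vector`:

```python
def to_vector(term_counts, reference_terms):
    """Project a text's term counts onto the fixed reference dimensions."""
    return [term_counts.get(term, 0) for term in reference_terms]


# Assumed example: the reference vector holds the top terms ranked by IG.
reference_terms = ["生命", "起源", "宇宙"]
d1 = to_vector({"生命": 2, "起源": 1, "研究": 4}, reference_terms)
d2 = to_vector({"起源": 1, "宇宙": 3}, reference_terms)
```

The word "研究" is dropped because it is not in the reference vector, which is how the IG ranking reduces dimensionality.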

Next, TF-IDF calculation is performed on the text vectors to obtain the importance of each term to its text, forming a new text matrix that is stored in the in-memory database. TF-IDF is simply TF*IDF, where TF is the term frequency and IDF is the inverse document frequency. TF represents how often a term occurs in document d. IDF assigns each word an "importance" weight on top of its term frequency: the most common words receive the smallest weight, fairly common words (such as "Chinese") receive a smaller weight, and rarer words (such as "Bayes") receive a larger weight. The weight varies inversely with how common a word is. The TF-IDF formulas are as follows:

TF formula: $\mathrm{TF} = \log(1 + f_{t,d})$

IDF formula: $\mathrm{IDF} = \log\left(1 + \frac{N}{n_t}\right)$

N is the total number of texts, $f_{t,d}$ is the frequency with which feature t occurs in text d, and $n_t$ is the number of texts in which feature t occurs.
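The two formulas can be applied to a matrix of raw counts as follows. The source does not state the base of the logarithm; this sketch assumes the natural logarithm, and `tf_idf_matrix` is a hypothetical name.

```python
import math


def tf_idf_matrix(count_vectors):
    """count_vectors: equal-length raw count vectors, one per text."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # n_t: number of texts in which term t occurs at least once
    doc_freq = [sum(1 for vec in count_vectors if vec[t] > 0)
                for t in range(n_terms)]
    matrix = []
    for vec in count_vectors:
        row = []
        for t, f in enumerate(vec):
            if f == 0 or doc_freq[t] == 0:
                row.append(0.0)          # absent term: TF = log(1 + 0) = 0
                continue
            tf = math.log(1 + f)                      # TF = log(1 + f_{t,d})
            idf = math.log(1 + n_docs / doc_freq[t])  # IDF = log(1 + N / n_t)
            row.append(tf * idf)
        matrix.append(row)
    return matrix


weights = tf_idf_matrix([[2, 1, 0], [0, 1, 3]])
```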

Then, from the TF-IDF results, the similarity between texts is calculated with the cosine similarity formula to form the document relationship matrix. The closer the value is to 1, the more similar the two vectors are; conversely, the less similar they are.

Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them. Using the space vectors above, the similarity between a target vector and a candidate vector is calculated. The result lies in the range [0, 1]; the closer it is to 1, the more similar the two vectors are, and conversely the less similar they are. The cosine similarity formula is as follows:

$$\mathrm{sim}(i, j) = \frac{\sum_{k=1}^{n} w_{i,k} \times w_{j,k}}{\sqrt{\sum_{k=1}^{n} w_{i,k}^{2}} \times \sqrt{\sum_{k=1}^{n} w_{j,k}^{2}}}$$

where $w_{i,k}$ is the TF-IDF result of a word.
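A minimal sketch of the cosine calculation and of building the document relationship matrix from it. Returning 0 for an all-zero vector is a choice made here to avoid division by zero, not something the patent specifies.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0   # assumption: zero vector -> 0


def relation_matrix(vectors):
    """Pairwise cosine similarity between all text vectors."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


sim = relation_matrix([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
```

Because TF-IDF weights are never negative, every entry falls in [0, 1], which matches the range stated in the text.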

Based on the technical solution of the present invention, those skilled in the art can determine the definitions of the remaining parameters, which are not enumerated here.

From the cosine similarity results between texts, the document relationship matrix is obtained. By analyzing user behavior data, the labels the user is interested in are mined; in combination with the document relationship matrix, the recommendation list is scored with certain weight proportions, filtered and sorted to form the final recommendation list, which is recommended to the user.
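The patent leaves the weight proportions unspecified. As one possible reading, the sketch below scores each unread document by its similarity to the documents in the user's browsing history, giving recent items more weight. The decay factor, the function name `recommend` and the sample matrix are all assumptions for illustration.

```python
def recommend(history, sim, top_k=3, decay=0.8):
    """history: indices of documents the user viewed, most recent first.

    Assumption: the weight of a viewed document decays geometrically with
    its position in the history; the source only mentions 'certain weight
    proportions'.
    """
    scores = {}
    for rank, viewed in enumerate(history):
        weight = decay ** rank
        for candidate, s in enumerate(sim[viewed]):
            if candidate in history or s <= 0:
                continue            # filter already-read and unrelated documents
            scores[candidate] = scores.get(candidate, 0.0) + weight * s
    # Sort by score, highest first; ties broken by document index.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ranked[:top_k]]


sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.3, 0.8, 1.0],
]
```

For example, `recommend([2, 0], sim)` scores document 1 from both viewed documents and document 3 from the most recent one only.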

Embodiment one of the present invention is based on user behavior analysis: it builds an analysis model around the user's historical data and mines that data in depth with effective algorithms to uncover the user's needs and preferences. It thereby provides personalized recommendation, improves the effectiveness and relevance of data recommendation, and improves the user experience.

Embodiment two

The flow of the text recommendation method based on content and user behavior provided by embodiment two of the present invention is shown in Fig. 2 and specifically comprises:

Step S210: obtain the original document set from an RDBMS or from text.

Step S211: use a segmenter to perform Chinese word segmentation on the original document set.

Step S212: use a noun filter to screen nouns and obtain a noun set.

Step S213: gather document frequency statistics and store them in Redis, then proceed to step S214 and step S217.

Step S214: build an inverted index and store the index in Redis, then proceed to step S221.

Step S215: gather term frequency statistics and store them in Redis, then proceed to step S216 and step S217.

Step S216: build the document forward index, then proceed to step S219.

Step S217: perform IG calculation.

Step S218: persist the feature words obtained from the IG calculation to Redis, then proceed to step S219.

Step S219: perform TF-IDF calculation.

Step S220: generate document vectors, convert them into the document vector space and store it in Redis.

Step S221: perform cosine similarity calculation.

Step S222: build the document relationship matrix from the cosine similarity results and store it in Redis.

Step S223: obtain the user's recent browsing records.

Step S224: compute recommendation scores by combining the document relationship matrix with the user's recent browsing records.

Step S225: filter and sort according to the scores and the user's recent browsing records.

Step S226: obtain the recommended document list and recommend it to the user.
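The flow sends the inverted index of step S214 into the cosine step S221 without saying why. One plausible use, offered here only as an assumption, is to restrict similarity calculation to document pairs that share at least one term, since all other pairs have cosine 0. In the sketch, plain dictionaries stand in for Redis and the function names are hypothetical.

```python
from collections import defaultdict


def build_indexes(docs):
    """docs: mapping doc_id -> list of terms. Returns (forward, inverted)."""
    # Forward index: document -> its distinct terms.
    forward = {doc_id: sorted(set(terms)) for doc_id, terms in docs.items()}
    # Inverted index: term -> documents containing it.
    inverted = defaultdict(set)
    for doc_id, terms in forward.items():
        for term in terms:
            inverted[term].add(doc_id)
    return forward, dict(inverted)


def candidate_pairs(inverted):
    """Document pairs sharing at least one term; other pairs have cosine 0."""
    pairs = set()
    for doc_ids in inverted.values():
        ordered = sorted(doc_ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


forward, inverted = build_indexes(
    {"d1": ["生命", "起源"], "d2": ["起源"], "d3": ["宇宙"]}
)
pairs = candidate_pairs(inverted)
```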

In embodiment two of the present invention, the dictionary generation flow is shown in Fig. 3, the segmenter's word segmentation flow is shown in Fig. 4, and the text similarity calculation flow is shown in Fig. 5.

Embodiment three

The embodiment of the present invention also provides an apparatus for text recommendation based on content and user behavior, as shown in Fig. 6.

The embodiments of the present invention provide a method and apparatus capable of personalized recommendation, which recommend the information, products and so on that interest a user to that user in a targeted way, according to the user's information needs, interests and the like. Compared with a traditional search engine, this recommendation system studies the user's interest preferences and performs personalized computation; the system discovers the user's points of interest and the similarity between texts, thereby guiding users to discover their own information needs and providing a more effective data recommendation service.

It should be noted that the apparatus or system embodiments of the present invention may be implemented in software, in hardware, or in a combination of the two. At the hardware level, Fig. 6 is a schematic block diagram of a hardware structure of an embodiment of the present invention: besides the CPU, memory, network interface and non-volatile memory, the device in which the apparatus resides may generally include other hardware, such as a forwarding chip responsible for processing messages. Taking a software implementation as an example, the apparatus, as a logical entity, is formed when the CPU of the device in which it resides reads the corresponding computer program instructions from non-volatile memory into memory and runs them.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A text recommendation method based on content and user behavior, characterized by comprising the steps of:

Step A: obtaining a document collection to be analyzed, and performing Chinese word segmentation on the documents in the document collection to obtain a plurality of terms;

Step B: calculating the information gain of the terms in the document collection, sorting them by information gain, and selecting a plurality of terms as a reference vector;

Step C: converting the texts in the document collection into a multidimensional vector space model according to the reference vector;

Step D: performing TF-IDF calculation on the vector space model to obtain a text vector matrix;

2. The text recommendation method based on content and user behavior according to claim 1, characterized in that, before step A, the method further comprises:

generating a dictionary and persisting the dictionary in a dictionary library.

3. The text recommendation method based on content and user behavior according to claim 2, characterized in that the step of generating the dictionary comprises:

obtaining a plurality of original dictionaries that record vocabulary information; using a TreeSet to automatically sort, load, filter and aggregate the words recorded in the original dictionaries; generating a List according to the attributes of the words; and then generating the dictionary through a double-array trie and persisting it in the dictionary library.

4. The text recommendation method based on content and user behavior according to claim 3, characterized in that, in step A, performing Chinese word segmentation on the documents in the document collection comprises:

collecting the loaded dictionary from the dictionary library into memory; performing multi-pattern matching based on the DATrie to generate candidate paths, while splitting the words in the document and filtering out special characters; and calculating the optimal path for matching to generate the word segmentation result.

5. The text recommendation method based on content and user behavior according to claim 1, characterized in that, after step A and before step B, the method further comprises:

extracting noun terms and filtering out stop words.

6. The text recommendation method based on content and user behavior according to claim 1, characterized in that, in step B, calculating the information gain of the terms in the document collection comprises:

treating each text as a category and the terms in the text as features, and calculating the information gain according to the following formula:

$$IG(T) = -\sum_{i=1}^{n} P(C_i) \log_2 P(C_i) + P(t) \sum_{i=1}^{n} P(C_i \mid t) \log_2 P(C_i \mid t) + P(\bar{t}) \sum_{i=1}^{n} P(C_i \mid \bar{t}) \log_2 P(C_i \mid \bar{t})$$

where N is the total number of texts in the text set; $P(C_i)$ is the probability that category $C_i$ occurs; $P(t)$ is the probability that feature T occurs; $P(\bar{t})$ is the probability that feature T does not occur; $P(C_i \mid t)$ is the probability that a text containing feature T belongs to category $C_i$; and $P(C_i \mid \bar{t})$ is the probability that a text not containing feature T belongs to category $C_i$.

7. The text recommendation method based on content and user behavior according to claim 1, characterized in that, in step D, performing TF-IDF calculation on the vector space model comprises:

calculating TF-IDF according to the following formulas:

$$\mathrm{TF} = \log(1 + f_{t,d})$$

$$\mathrm{IDF} = \log\left(1 + \frac{N}{n_t}\right)$$

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$

where TF is the term frequency and IDF is the inverse document frequency.

8. The text recommendation method based on content and user behavior according to claim 1, characterized in that the step of calculating the similarity between different text vector matrices comprises:

obtaining the similarity of two different text vector matrices by calculating the cosine of the angle between them, using the following cosine similarity formula:

$$\mathrm{sim}(i, j) = \frac{\sum_{k=1}^{n} w_{i,k} \times w_{j,k}}{\sqrt{\sum_{k=1}^{n} w_{i,k}^{2}} \times \sqrt{\sum_{k=1}^{n} w_{j,k}^{2}}}$$

where $w_{i,k}$ is the TF-IDF result of a term.

9. The text recommendation method based on content and user behavior according to claim 1, characterized in that, after step A and before step B, the method further comprises:

10. An apparatus for text recommendation based on content and user behavior, characterized by comprising:

a word segmentation module, configured to obtain a document collection to be analyzed and perform Chinese word segmentation on the documents in the document collection to obtain a plurality of terms;

an IG calculation module, configured to calculate the information gain of the terms in the document collection, sort them by information gain, and select a plurality of terms as a reference vector;

a dimensionality reduction module, configured to convert the texts in the document collection into a multidimensional vector space model according to the reference vector;

a similarity calculation module, configured to calculate the similarity between different text vector matrices to form a document relationship matrix;

a recommendation module, configured to analyze user behavior data and, in combination with the document relationship matrix, form a recommendation list and recommend it to the user.