CN106250526A - Text recommendation method and apparatus based on content and user behavior - Google Patents

Text recommendation method and apparatus based on content and user behavior

Info

Publication number
CN106250526A
Authority
CN
China
Prior art keywords
document
text
vector
user behavior
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610635123.1A
Other languages
Chinese (zh)
Inventor
张达
亓开元
苏志远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201610635123.1A
Publication of CN106250526A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text recommendation method based on content and user behavior. The method includes the steps of: obtaining a document collection to be analyzed and performing Chinese word segmentation on the documents in the collection to obtain multiple terms; calculating the information gain of the terms in the document collection, sorting the terms by information gain, and selecting multiple terms as a reference vector; converting the texts in the document collection into a multidimensional vector space model according to the reference vector; performing TF-IDF calculation on the vector space model to obtain text vector matrices; calculating the similarity between different text vector matrices to form a document relationship matrix; and analyzing user behavior data and, in combination with the document relationship matrix, forming a recommendation list and recommending it to the user. The apparatus includes a word segmentation module, an IG calculation module, a dimensionality reduction module, a TF-IDF calculation module, a similarity calculation module, and a recommendation module. The method and apparatus can improve the effectiveness of text content recommendation for users.

Description

Text recommendation method and apparatus based on content and user behavior
Technical field
The present invention relates to the technical field of data mining, and in particular to a text recommendation method and apparatus based on content and user behavior.
Background
The emergence and spread of the Internet has brought users a large amount of information and satisfied their demand for information in the information age. However, as the network develops rapidly and the volume of online information grows sharply, users facing a large amount of information cannot pick out the part that is actually useful to them, so the efficiency with which information is used decreases instead. This is the so-called information overload problem.
At present, one solution to the information overload problem is the information retrieval system, represented by the search engine, which plays an extremely important role in helping users obtain online information. However, current search engines can often only match against the characters the user enters; the same keyword returns the same results for every user, so different results cannot be obtained for diverse search needs. On the other hand, information and the ways it spreads are diverse, and users' demands for information are diverse and personalized. The results returned by information retrieval systems represented by search engines therefore cannot meet users' individual needs, and still cannot solve the information overload problem well.
Summary of the invention
The present invention provides a text recommendation method and apparatus based on content and user behavior to solve the above technical problem.
The text recommendation method based on content and user behavior provided by the present invention includes the steps of:
Step A: obtain a document collection to be analyzed, and perform Chinese word segmentation on the documents in the document collection to obtain multiple terms;
Step B: calculate the information gain of the terms in the document collection, sort the terms by information gain, and select multiple terms as a reference vector;
Step C: according to the reference vector, convert the texts in the document collection into a multidimensional vector space model;
Step D: perform TF-IDF calculation on the vector space model to obtain text vector matrices;
Step E: calculate the similarity between different text vector matrices to form a document relationship matrix;
Step F: analyze user behavior data and, in combination with the document relationship matrix, form a recommendation list and recommend it to the user.
Before step A, the method further includes the step of:
generating a dictionary and persisting the dictionary in a dictionary library.
Generating the dictionary includes the steps of:
obtaining multiple original dictionaries that record vocabulary information; using a TreeSet to automatically sort, load, filter, and merge the vocabulary recorded in the original dictionaries; generating a List according to the attributes of the vocabulary; and then generating the dictionary with a double-array trie and persisting it in the dictionary library.
In step A, performing Chinese word segmentation on the documents in the document collection includes the steps of:
loading the dictionary from the dictionary library into memory; performing multi-pattern matching based on the DATrie to generate candidate paths while splitting the vocabulary in the document and filtering out special characters; and calculating the optimal path and matching along it to generate the word segmentation result.
After step A and before step B, the method further includes the step of:
extracting noun terms and filtering out stop words.
In step B, calculating the information gain of the terms in the document collection includes the steps of:
treating each text as a class and the terms in the text as features, and calculating the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
where N is the total number of texts in the text collection; $P(C_i)$ is the probability that class $C_i$ occurs; $P(t)$ is the probability that feature T occurs; $P(\bar{t})$ is the probability that feature T does not occur; $P(C_i \mid t)$ is the probability that a text contains feature T and belongs to class $C_i$; and $P(C_i \mid \bar{t})$ is the probability that a text does not contain feature T and belongs to class $C_i$.
In step D, performing TF-IDF calculation on the vector space model includes the steps of:
calculating TF-IDF according to the following formulas:
$$TF = \log(1 + f_{t,d})$$
$$IDF = \log\left(1 + \frac{N}{n_t}\right)$$
$$TF\text{-}IDF = TF \times IDF$$
where TF is the term frequency and IDF is the inverse document frequency.
Calculating the similarity between different text vector matrices includes the steps of:
obtaining the similarity of two different text vector matrices by calculating the cosine of the angle between them, using the following cosine similarity formula:
$$sim(i,j) = \frac{\sum_{k=1}^{n} w_{i,k} \times w_{j,k}}{\sqrt{\sum_{k=1}^{n} w_{i,k}^{2}} \times \sqrt{\sum_{k=1}^{n} w_{j,k}^{2}}}$$
where $w_{i,k}$ is the TF-IDF result of a term.
After step A and before step B, the method further includes the step of:
storing the text vectors of the document collection using a hash data structure.
The present invention also provides a text recommendation apparatus based on content and user behavior, including:
a word segmentation module, configured to obtain a document collection to be analyzed and perform Chinese word segmentation on the documents in the document collection to obtain multiple terms;
an IG calculation module, configured to calculate the information gain of the terms in the document collection, sort the terms by information gain, and select multiple terms as a reference vector;
a dimensionality reduction module, configured to convert the texts in the document collection into a multidimensional vector space model according to the reference vector;
a TF-IDF calculation module, configured to perform TF-IDF calculation on the vector space model to obtain text vector matrices;
a similarity calculation module, configured to calculate the similarity between different text vector matrices to form a document relationship matrix;
a recommendation module, configured to analyze user behavior data and, in combination with the document relationship matrix, form a recommendation list and recommend it to the user.
The embodiments of the present invention provide a text recommendation method and apparatus based on content and user behavior. Through steps such as dictionary generation, text word segmentation, feature selection, TF-IDF calculation, similarity calculation, and user behavior analysis, the similarity between texts is mined, and the user's interests are analyzed and modeled from the user's historical data. Information that matches users' interests and needs can thus be actively recommended to them, achieving personalized data recommendation based on user behavior, reducing the blindness and ineffectiveness of recommended data, and improving the accuracy and efficiency of data recommendation.
Brief description of the drawings
Fig. 1 is a schematic flowchart of one embodiment of the text recommendation method based on content and user behavior of the present invention;
Fig. 2 is a schematic flowchart of the method provided by embodiment two of the present invention;
Fig. 3 is a schematic flowchart of dictionary generation in embodiment two of the present invention;
Fig. 4 is a schematic flowchart of word segmentation in embodiment two of the present invention;
Fig. 5 is a schematic diagram of the similarity calculation flow in embodiment two of the present invention;
Fig. 6 is a schematic structural diagram of the apparatus provided by embodiment three of the present invention.
Detailed description
The embodiments of the present invention provide a text recommendation method and apparatus based on content and user behavior.
Embodiment one
Referring to Fig. 1, the method provided by embodiment one of the present invention includes the steps of:
Step S110: obtain a document collection to be analyzed, and perform Chinese word segmentation on the documents in the document collection to obtain multiple terms.
Step S111: calculate the information gain of the terms in the document collection, sort the terms by information gain, and select multiple terms as a reference vector.
Step S112: according to the reference vector, convert the texts in the document collection into a multidimensional vector space model.
Step S113: perform TF-IDF calculation on the vector space model to obtain text vector matrices.
Step S114: calculate the similarity between different text vector matrices to form a document relationship matrix.
Step S115: analyze user behavior data and, in combination with the document relationship matrix, form a recommendation list and recommend it to the user.
Optionally, a dictionary needs to be built in advance before Chinese word segmentation is performed, and the dictionary is persisted in a dictionary library.
Dictionary generation mainly takes various dictionaries (internal and external) and user dictionaries; uses a TreeSet to automatically sort, load, filter, and merge them; persists the result in the dictionary library; generates a List according to the attributes of the words; and then generates the final dictionary, Dict DATrie, with a Double-Array Trie (DAT). The Double-Array Trie is a variant of the trie: a data structure proposed to improve space utilization while preserving the retrieval speed of the trie. In essence it is a deterministic finite automaton (DFA) in which each node represents one state of the automaton. State transitions are made according to the input variable, and the query completes when an end state is reached or no further transition is possible.
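The dictionary-generation flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it uses a plain dict-based trie instead of a Double-Array Trie (the lookup behavior is the same, the memory layout is not), Python's `sorted` stands in for the TreeSet, and the names `TrieDictionary`, `build_dictionary`, and the `(word, pos, freq)` tuple layout are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class TrieDictionary:
    """Plain dict-based trie; a stand-in for the Double-Array Trie in the text."""

    def __init__(self) -> None:
        self.root: Dict = {}

    def insert(self, word: str, attrs: dict) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = attrs  # the None key marks an end state and holds the attributes

    def lookup(self, word: str) -> Optional[dict]:
        node = self.root
        for ch in word:
            if ch not in node:
                return None  # no transition possible: the query ends
            node = node[ch]
        return node.get(None)

    def prefixes(self, text: str, start: int) -> List[str]:
        """Return every dictionary word that begins at text[start]."""
        node, found = self.root, []
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if None in node:
                found.append(text[start:end + 1])
        return found


def build_dictionary(sources: Iterable[Iterable[Tuple[str, str, int]]]) -> TrieDictionary:
    """Merge several (word, pos, freq) sources, drop blanks, insert in sorted order."""
    merged: Dict[str, dict] = {}
    for source in sources:
        for word, pos, freq in source:
            word = word.strip()
            if word:  # filter out empty entries
                merged[word] = {"pos": pos, "freq": freq}
    trie = TrieDictionary()
    for word in sorted(merged):  # sorted de-duplicated keys play the role of the TreeSet
        trie.insert(word, merged[word])
    return trie
```

The `prefixes` method is the piece a segmenter would call to enumerate candidate words at each position of a text.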
The purpose of generating the dictionary is to collect a large vocabulary; the richer the dictionary, the more accurate the word segmentation result.
Text word segmentation converts a text into words by a given algorithm and collects related statistics about the text, such as document frequency, term frequency, and total word count. The text vectors are stored in an in-memory database for use in subsequent steps.
Text word segmentation first loads the dictionary from the dictionary library into memory. Multi-pattern matching is then performed based on the previously generated dictionary DATrie to generate candidate paths, while the words are split, special characters are filtered out, and term-frequency weights and the like are added. The path costs are calculated to obtain the optimal path, matching follows that path, and the word segmentation result is generated. Here, "special characters" is a term well known in the art and covers non-text symbols such as punctuation marks and spaces.
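One common way to realize "generate candidate paths, weight them by term frequency, and pick the optimal path" is dynamic programming over all dictionary matches. The sketch below assumes this reading; the source does not specify the cost function. Here the dictionary is a plain word-to-frequency mapping rather than a DATrie, the path score is the sum of log relative frequencies, and `segment` and `max_len` are hypothetical names.

```python
import math
from typing import Dict, List, Tuple


def segment(text: str, freq: Dict[str, int], max_len: int = 4) -> List[str]:
    """Split text into the word sequence with the highest total log-frequency."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (score of the best path covering text[i:], end index of its first word)
    best: List[Tuple[float, int]] = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            # dictionary words are candidates; a single character is always allowed
            if word in freq or j == i + 1:
                score = math.log(freq.get(word, 1) / total)
                candidates.append((score + best[j][0], j))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:  # follow the optimal path from the start of the text
        j = best[i][1]
        words.append(text[i:j])
        i = j
    # filter out special characters (punctuation, spaces and other non-text symbols)
    return [w for w in words if any(ch.isalnum() for ch in w)]
```

With a toy dictionary in which "研究" and "生命" are more frequent than "研究生" and "命", the ambiguous string "研究生命起源" resolves to "研究 / 生命 / 起源" rather than "研究生 / 命 / 起源", which is the kind of disambiguation the frequency weights are meant to provide.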
After word segmentation, noun extraction and stop-word filtering are further applied to the word segmentation result: the result is matched against the noun and stop-word dictionaries to filter out vocabulary unrelated to the features of the text. Optionally, a stop-word dictionary and parts of speech are collected in advance. Nouns are generally considered the most important part of speech for distinguishing one text from another. The word segmentation result therefore needs to be screened by a noun part-of-speech filter and a stop-word filter to ensure the diversity and accuracy of the text similarity calculation.
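The noun and stop-word screening can be sketched as a single pass over part-of-speech-tagged tokens. The sketch assumes that the segmenter yields `(word, tag)` pairs and that noun tags start with "n", as in several common Chinese tag sets; both are assumptions, and `filter_nouns` is a hypothetical name.

```python
from typing import Iterable, List, Set, Tuple


def filter_nouns(tagged: Iterable[Tuple[str, str]], stop_words: Set[str]) -> List[str]:
    """Keep only noun tokens that are not in the stop-word dictionary."""
    return [
        word
        for word, tag in tagged
        if tag.startswith("n")       # assumed convention: noun tags begin with "n"
        and word not in stop_words   # drop stop words even when tagged as nouns
    ]
```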
Once the document collection has been prepared, the dictionary loaded, Chinese word segmentation performed, stop words filtered out, and noun features screened to generate a noun set, the frequency of occurrence of each term, the document frequency of each term, the word count of each document, and so on need to be counted. The text vectors are then stored in the in-memory database using a hash key data structure.
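The statistics named above (term frequency per document, document frequency per term, and document word count) can be collected in one pass. In this sketch plain Python dicts stand in for the hash structures of the in-memory database; the function name `collect_stats` and the return layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def collect_stats(
    docs: Dict[str, List[str]],
) -> Tuple[Dict[str, Dict[str, int]], Dict[str, int], Dict[str, int]]:
    """Return (term frequency per document, document frequency, document length)."""
    # one hash per document: term -> number of occurrences in that document
    tf = {doc_id: dict(Counter(terms)) for doc_id, terms in docs.items()}
    # document frequency: number of documents in which each term appears
    df: Counter = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    # word count of each document
    length = {doc_id: len(terms) for doc_id, terms in docs.items()}
    return tf, dict(df), length
```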
Information gain measures how much information the presence or absence of a feature in a text provides for determining the class to which the text belongs. Information gain calculation (IG calculation) is often used to convert a high-dimensional space into a low-dimensional one: based on the training data, the information gain of each feature term is calculated, terms with very small information gain are deleted, and the rest are sorted by information gain in descending order, thereby achieving dimensionality reduction. Specifically, IG is calculated for every word in the whole text system, the words are sorted by the amount of information they contribute to the whole system, and the top N words are selected as the reference vector, which is persisted in the in-memory database using a LIST data structure.
Calculating information gain identifies features that occur frequently in positive samples and rarely in negative samples. Information gain involves a fair amount of mathematical theory and complex entropy formulas. The embodiments of the present invention define it as the amount of information a given feature term provides to the whole classification, that is, the difference between the entropy without considering any feature and the entropy after considering the feature. The information gain formula provided by the embodiments of the present invention is as follows:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t})$$
Symbol description:
N is the total number of texts, that is, the total number of classes. $P(C_i)$ is the probability that class $C_i$ occurs, that is, the probability that text $D_i$ occurs, and equals $\frac{1}{N}$. $P(t)$ is the probability that feature T occurs, computed as the number of texts containing feature T divided by the total number of texts N, that is, $P(t) = \frac{DF_T}{N}$, where $DF_T$ is the document frequency of feature T. $P(\bar{t})$ is the probability that feature T does not occur, and equals $1 - P(t)$. $P(C_i \mid t)$ is the probability that a text contains feature T and belongs to class $C_i$. $P(C_i \mid \bar{t})$ is the probability that a text does not contain feature T and belongs to class $C_i$.
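The information gain selection can be sketched as below. Because every text is its own class, the class entropy is $\log_2 N$. The sketch additionally assumes that $P(C_i \mid t)$ is uniform over the texts that contain the feature and that $P(C_i \mid \bar{t})$ is uniform over the texts that do not; the source does not state how these conditionals are estimated, so this is one possible reading. Under it, the score depends only on the document frequency and N. The names `information_gain` and `select_features` are hypothetical.

```python
import math
from typing import Dict, List, Set


def information_gain(df_t: int, n_docs: int) -> float:
    """IG of a term with document frequency df_t when each text is its own class."""
    class_entropy = math.log2(n_docs)  # -sum P(Ci) log2 P(Ci) with P(Ci) = 1/N
    p_t = df_t / n_docs                # P(t) = DF_T / N
    p_not_t = 1.0 - p_t                # probability that the feature is absent
    absent = n_docs - df_t
    # Assumption: P(Ci|t) = 1/df_t for each text containing t, so
    # P(t) * sum P(Ci|t) log2 P(Ci|t) = -P(t) * log2(df_t)
    with_term = -p_t * math.log2(df_t) if df_t > 0 else 0.0
    # Assumption: P(Ci|not t) = 1/absent for each text not containing t
    without_term = -p_not_t * math.log2(absent) if absent > 0 else 0.0
    return class_entropy + with_term + without_term


def select_features(doc_terms: Dict[str, Set[str]], top_n: int) -> List[str]:
    """Rank terms by information gain (descending) and keep the top_n as reference vector."""
    n_docs = len(doc_terms)
    df: Dict[str, int] = {}
    for terms in doc_terms.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1
    ranked = sorted(df, key=lambda t: (-information_gain(df[t], n_docs), t))
    return ranked[:top_n]
```

Under these assumptions a term that appears in every text scores 0, and a term that splits the collection in half scores highest, which matches the intuition that ubiquitous words carry no discriminating information.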
After the information gain calculation, the word segmentation result of each text is converted into the vector space model: according to the IG results, the words of each text are filtered, and every text is expressed as an n-dimensional feature vector. That is, a text d can be expressed as an n-dimensional space vector $W_1, W_2, \ldots, W_n$, where $W_i$ is the weight of the i-th feature term in text d, as follows:
$$d_1 \rightarrow W_{11}, W_{12}, \ldots, W_{1n}$$
$$d_2 \rightarrow W_{21}, W_{22}, \ldots, W_{2n}$$
$$d_n \rightarrow W_{n1}, W_{n2}, \ldots, W_{nn}$$
Next, TF-IDF calculation is performed on the text vectors to obtain the importance of each term to its text, forming a new text matrix that is stored in the in-memory database. TF-IDF is simply TF × IDF, where TF is the term frequency and IDF is the inverse document frequency. TF represents how often a term occurs in document d. IDF assigns an "importance" weight to each word on top of the term frequency: the most common words receive the smallest weight, fairly common words (such as "Chinese") receive a smaller weight, and rarer words (such as "Bayes") receive a larger weight. The weight is inversely proportional to how common a word is. The TF-IDF formulas are as follows:
TF formula: $TF = \log(1 + f_{t,d})$
IDF formula: $IDF = \log\left(1 + \frac{N}{n_t}\right)$
where N is the total number of texts, $f_{t,d}$ is the number of times feature t occurs in text d, and $n_t$ is the number of texts in which feature t occurs.
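The vector conversion and weighting can be sketched together: each text becomes one row over the selected feature terms, weighted with TF = log(1 + f) and IDF = log(1 + N / n_t) as given in claim 7. The base of the logarithm is not stated in the source, so the natural logarithm is assumed here; `tfidf_matrix` and its argument layout are hypothetical.

```python
import math
from typing import Dict, List


def tfidf_matrix(
    tf: Dict[str, Dict[str, int]], features: List[str]
) -> Dict[str, List[float]]:
    """Map each document to a TF-IDF vector over the reference feature terms."""
    n_docs = len(tf)
    # n_t: number of documents in which each feature term occurs
    n_t = {t: sum(1 for counts in tf.values() if counts.get(t, 0) > 0) for t in features}
    matrix: Dict[str, List[float]] = {}
    for doc_id, counts in tf.items():
        row = []
        for term in features:
            f = counts.get(term, 0)
            if f == 0:
                row.append(0.0)  # term absent from this document: log(1 + 0) = 0
            else:
                # natural log is an assumption; the source only writes "log"
                row.append(math.log(1 + f) * math.log(1 + n_docs / n_t[term]))
        matrix[doc_id] = row
    return matrix
```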
Afterwards, the similarity between texts is calculated from the TF-IDF results using the cosine similarity formula, forming the document relationship matrix.
Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them. Using the space vectors described above, the similarity between a target vector and a candidate vector is calculated. The result lies in the range [0, 1]: the closer it is to 1, the more similar the two vectors are; the closer it is to 0, the less similar they are. The cosine similarity formula is as follows:
$$sim(i,j) = \frac{\sum_{k=1}^{n} w_{i,k} \times w_{j,k}}{\sqrt{\sum_{k=1}^{n} w_{i,k}^{2}} \times \sqrt{\sum_{k=1}^{n} w_{j,k}^{2}}}$$
where $w_{i,k}$ is the TF-IDF result of a word.
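A minimal sketch of the cosine similarity step and the resulting document relationship matrix follows. The nested-dict layout of the matrix and the names `cosine` and `relation_matrix` are assumptions made for illustration; an all-zero vector is given similarity 0 to avoid division by zero.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two equally sized weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # an all-zero vector is similar to nothing


def relation_matrix(vectors: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise similarity of every document to every other document."""
    ids = list(vectors)
    return {
        i: {j: cosine(vectors[i], vectors[j]) for j in ids if j != i}
        for i in ids
    }
```

Because TF-IDF weights are non-negative, every value produced this way falls in [0, 1], consistent with the range stated in the description.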
The definitions of the remaining parameters can be determined by those skilled in the art according to the technical solution of the present invention and are not enumerated in the embodiments of the present invention.
The document relationship matrix is obtained from the cosine similarity results between texts. By analyzing user behavior data, the tags the user is interested in are mined; in combination with the document relationship matrix, the candidates for the recommendation list are scored with certain weight proportions, filtered, and sorted to form the final recommendation list, which is recommended to the user.
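The scoring, filtering, and sorting step can be sketched as follows. The source does not specify the weight proportions or how interest tags are mined, so this sketch makes simple assumptions: the user's history is a mapping from browsed document to a weight (for example, recency or view count), a candidate's score is the weighted sum of its similarities to the browsed documents, and already-browsed documents are filtered out. `recommend`, `top_k`, and `min_score` are hypothetical names.

```python
from typing import Dict, List, Tuple


def recommend(
    history: Dict[str, float],
    relations: Dict[str, Dict[str, float]],
    top_k: int = 3,
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Score unseen documents by weighted similarity to the user's browsed documents."""
    scores: Dict[str, float] = {}
    for doc_id, weight in history.items():
        for other, sim in relations.get(doc_id, {}).items():
            if other in history:
                continue  # filter: do not recommend what the user has already read
            scores[other] = scores.get(other, 0.0) + weight * sim
    # keep candidates above the threshold, best score first (ties broken by id)
    ranked = sorted(
        ((doc, score) for doc, score in scores.items() if score > min_score),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:top_k]
```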
Embodiment one of the present invention is based on user behavior analysis: it builds an analysis model around the user's historical data and mines that data in depth with effective algorithms to uncover the user's needs and preferences and provide personalized recommendations, which improves the effectiveness and relevance of data recommendation and improves the user experience.
Embodiment two
The flow of the text recommendation method based on content and user behavior provided by embodiment two of the present invention is shown in Fig. 2 and specifically includes:
Step S210: obtain the original document collection from an RDBMS or from text files.
Step S211: use the segmenter to perform Chinese word segmentation on the original document collection.
Step S212: use the noun filter to screen for nouns and obtain a noun set.
Step S213: collect document frequency statistics and store them in Redis, then proceed to step S214 and step S217.
Step S214: build the inverted index and store the index result in Redis, then proceed to step S221.
Step S215: collect term frequency statistics and store them in Redis, then proceed to step S216 and step S217.
Step S216: build the document forward index, then proceed to step S219.
Step S217: perform the IG calculation.
Step S218: persist the feature words obtained from the IG calculation to Redis, then proceed to step S219.
Step S219: perform the TF-IDF calculation.
Step S220: generate the document vectors, convert them into the document vector space, and store it in Redis.
Step S221: perform the cosine similarity calculation.
Step S222: build the document relationship matrix from the cosine similarity results and store it in Redis.
Step S223: obtain the user's recent browsing records.
Step S224: perform recommendation scoring by combining the document relationship matrix with the user's recent browsing records.
Step S225: filter and sort according to the scoring results and the user's recent browsing records.
Step S226: obtain the list of recommended documents and recommend it to the user.
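The forward index of step S216 and the inverted index of step S214 can be sketched as below. The source only says that the inverted index feeds the cosine similarity step; using it to limit the comparison to document pairs that share at least one term is an assumption made here, since pairs with no common term have similarity 0 anyway. Plain dicts stand in for Redis, and `build_indexes` and `candidate_pairs` are hypothetical names.

```python
from typing import Dict, List, Tuple


def build_indexes(
    docs: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Return (forward index: doc -> terms, inverted index: term -> docs)."""
    forward = {doc_id: sorted(set(terms)) for doc_id, terms in docs.items()}
    inverted: Dict[str, List[str]] = {}
    for doc_id in sorted(forward):
        for term in forward[doc_id]:
            inverted.setdefault(term, []).append(doc_id)
    return forward, inverted


def candidate_pairs(inverted: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Document pairs that share at least one term; only these need a cosine score."""
    pairs = set()
    for doc_ids in inverted.values():
        for a in range(len(doc_ids)):
            for b in range(a + 1, len(doc_ids)):
                pairs.add((doc_ids[a], doc_ids[b]))
    return sorted(pairs)
```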
In embodiment two of the present invention, the dictionary generation flow is shown in Fig. 3, the word segmentation flow of the segmenter is shown in Fig. 4, and the text similarity calculation flow is shown in Fig. 5.
Embodiment three
The embodiments of the present invention also provide a text recommendation apparatus based on content and user behavior which, as shown in Fig. 6, includes:
a word segmentation module, configured to obtain a document collection to be analyzed and perform Chinese word segmentation on the documents in the document collection to obtain multiple terms;
an IG calculation module, configured to calculate the information gain of the terms in the document collection, sort the terms by information gain, and select multiple terms as a reference vector;
a dimensionality reduction module, configured to convert the texts in the document collection into a multidimensional vector space model according to the reference vector;
a TF-IDF calculation module, configured to perform TF-IDF calculation on the vector space model to obtain text vector matrices;
a similarity calculation module, configured to calculate the similarity between different text vector matrices to form a document relationship matrix;
a recommendation module, configured to analyze user behavior data and, in combination with the document relationship matrix, form a recommendation list and recommend it to the user.
The embodiments of the present invention provide a method and apparatus capable of personalized recommendation, which recommend information, products, and the like that interest the user in a targeted way according to the user's information needs and interests. Compared with a traditional search engine, this recommendation system studies the user's interest preferences and performs personalized calculation: the system discovers the user's points of interest and the similarity between texts, guides users to discover their own information needs, and provides them with a more effective data recommendation service.
It should be noted that the apparatus or system embodiments of the present invention can be implemented in software, or in hardware or a combination of software and hardware. At the hardware level, Fig. 6 is a schematic hardware structure diagram of an embodiment of the present invention: in addition to the CPU, memory, network interface, and non-volatile memory, the device in which the apparatus resides generally may also include other hardware, such as a forwarding chip responsible for processing messages. Taking a software implementation as an example, the apparatus, as a logical entity, is formed when the CPU of the device in which it resides reads the corresponding computer program instructions from non-volatile memory into memory and runs them.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A text recommendation method based on content and user behavior, characterized in that it includes the steps of:
Step A: obtain a document collection to be analyzed, and perform Chinese word segmentation on the documents in the document collection to obtain multiple terms;
Step B: calculate the information gain of the terms in the document collection, sort the terms by information gain, and select multiple terms as a reference vector;
Step C: according to the reference vector, convert the texts in the document collection into a multidimensional vector space model;
Step D: perform TF-IDF calculation on the vector space model to obtain text vector matrices;
Step E: calculate the similarity between different text vector matrices to form a document relationship matrix;
Step F: analyze user behavior data and, in combination with the document relationship matrix, form a recommendation list and recommend it to the user.
2. The text recommendation method based on content and user behavior according to claim 1, characterized in that, before step A, the method further includes the step of:
generating a dictionary and persisting the dictionary in a dictionary library.
3. The text recommendation method based on content and user behavior according to claim 2, characterized in that generating the dictionary includes the steps of:
obtaining multiple original dictionaries that record vocabulary information; using a TreeSet to automatically sort, load, filter, and merge the vocabulary recorded in the original dictionaries; generating a List according to the attributes of the vocabulary; and then generating the dictionary with a double-array trie and persisting it in the dictionary library.
4. The text recommendation method based on content and user behavior according to claim 3, characterized in that, in step A, performing Chinese word segmentation on the documents in the document collection includes the steps of:
loading the dictionary from the dictionary library into memory; performing multi-pattern matching based on the DATrie to generate candidate paths while splitting the vocabulary in the document and filtering out special characters; and calculating the optimal path and matching along it to generate the word segmentation result.
5. The text recommendation method based on content and user behavior according to claim 1, characterized in that, after step A and before step B, the method further includes the step of:
extracting noun terms and filtering out stop words.
6. The text recommendation method based on content and user behavior according to claim 1, characterized in that performing the information gain calculation on the terms in the document collection in step B comprises:
treating each text as a class and the terms in the text as features, and calculating the information gain according to the following formula:
$$IG(T) = -\sum_{i=1}^{n} P(C_i)\,\log_2 P(C_i) + P(t)\sum_{i=1}^{n} P(C_i \mid t)\,\log_2 P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{n} P(C_i \mid \bar{t})\,\log_2 P(C_i \mid \bar{t})$$
where $n$ is the total number of texts in the text collection, which equals the number of classes because each text is treated as a class; $P(C_i)$ is the probability that class $C_i$ occurs; $P(t)$ is the probability that feature $t$ occurs; $P(\bar{t})$ is the probability that feature $t$ does not occur; $P(C_i \mid t)$ is the probability that a text containing feature $t$ belongs to class $C_i$; and $P(C_i \mid \bar{t})$ is the probability that a text not containing feature $t$ belongs to class $C_i$.
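The information gain formula can be checked numerically with a short Python sketch. It estimates every probability by simple counting over the documents, which is an assumption, since the claim does not say how the probabilities are estimated. The names `information_gain` and `select_reference_terms` are hypothetical; the latter illustrates ranking terms by information gain to pick the reference vector.

```python
import math
from collections import Counter


def information_gain(term, docs, labels):
    """IG of `term`, for documents given as token sets with one class label each."""
    n_docs = len(docs)
    with_t = [lab for d, lab in zip(docs, labels) if term in d]
    without_t = [lab for d, lab in zip(docs, labels) if term not in d]

    def plogp_sum(class_labels):
        # sum_i P(C_i | subset) * log2 P(C_i | subset); 0 for an empty subset
        size = len(class_labels)
        return sum((c / size) * math.log2(c / size)
                   for c in Counter(class_labels).values())

    entropy = -plogp_sum(labels)
    p_t = len(with_t) / n_docs
    p_not_t = len(without_t) / n_docs
    return entropy + p_t * plogp_sum(with_t) + p_not_t * plogp_sum(without_t)


def select_reference_terms(docs, labels, k):
    """Keep the k terms with the largest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: -information_gain(t, docs, labels))
    return ranked[:k]


docs = [{"a", "b", "z"}, {"a", "c", "z"}, {"b", "c", "z"}, {"c", "d", "z"}]
labels = list(range(len(docs)))  # each text is its own class, as in the claim
print(information_gain("a", docs, labels))
print(select_reference_terms(docs, labels, k=2))
```

With each text as its own class, a term found in every document (`"z"` here) has zero gain, while a term that splits the collection in half has the largest gain.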
7. The text recommendation method based on content and user behavior according to claim 1, characterized in that performing the calculation on the vector space model in step D comprises:
calculating TF-IDF according to the following formulas:
$$TF = \log(1 + f_{t,d})$$
$$IDF = \log\left(1 + \frac{N}{n_t}\right)$$
$$TF\text{-}IDF = TF \times IDF$$
where TF is the term frequency and IDF is the inverse document frequency.
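As a worked illustration of the TF-IDF formulas, the Python sketch below builds one weight row per document over a fixed list of reference terms. It assumes the conventional meanings of the symbols, namely that f_{t,d} is the raw count of term t in document d, N is the number of documents, and n_t is the number of documents containing t. It also uses the natural logarithm, since the claim does not state a base. The function name `tf_idf_matrix` is hypothetical.

```python
import math


def tf_idf_matrix(docs, reference_terms):
    """Return one TF-IDF row per document over the reference terms."""
    n_docs = len(docs)
    # n_t: number of documents that contain term t
    doc_freq = {t: sum(1 for d in docs if t in d) for t in reference_terms}
    matrix = []
    for doc in docs:
        row = []
        for t in reference_terms:
            f_td = doc.count(t)  # raw count of t in this document
            tf = math.log(1 + f_td)
            # Terms that appear in no document get weight 0 (avoids division by zero)
            idf = math.log(1 + n_docs / doc_freq[t]) if doc_freq[t] else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return matrix


docs = [["文本", "推荐", "文本"], ["用户", "推荐"]]
terms = ["文本", "推荐", "用户"]
matrix = tf_idf_matrix(docs, terms)
```

Because the term list is restricted to the reference terms, each row has the same fixed length regardless of vocabulary size.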
8. The text recommendation method based on content and user behavior according to claim 1, characterized in that the step of calculating the similarity between different text vector matrices comprises:
obtaining the similarity of two different text vector matrices by calculating the cosine of the angle between them, the cosine similarity being calculated as follows:
$$sim(i, j) = \frac{\sum_{k=1}^{n} w_{i,k} \times w_{j,k}}{\sqrt{\sum_{k=1}^{n} w_{i,k}^{2}} \times \sqrt{\sum_{k=1}^{n} w_{j,k}^{2}}}$$
where $w_{i,k}$ denotes the TF-IDF result of the k-th term in text i.
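The cosine calculation and the resulting document relationship matrix can be sketched in a few lines of Python. The sketch uses the standard cosine definition, with square roots of the squared sums in the denominator. Returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero, and both function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def document_relationship_matrix(vectors):
    """Pairwise cosine similarity between all document vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


vectors = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 3.0]]
relations = document_relationship_matrix(vectors)
```

The matrix is symmetric, so a production version would only need to compute the upper triangle.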
9. The text recommendation method based on content and user behavior according to claim 1, characterized in that, after step A and before step B, the method further comprises the step of:
storing the text vectors in the document collection using a hash database structure.
10. A device for text recommendation based on content and user behavior, characterized by comprising:
a word segmentation module, configured to obtain a document collection to be analyzed and to perform Chinese word segmentation on the documents in the document collection to obtain multiple terms;
an IG calculation module, configured to perform information gain calculation on the terms in the document collection and, ranking by information gain, to select multiple terms as a reference vector;
a dimensionality reduction module, configured to convert the texts in the document collection into a multidimensional vector space model according to the reference vector;
a TF-IDF calculation module, configured to perform TF-IDF calculation on the vector space model to obtain text vector matrices;
a similarity calculation module, configured to calculate the similarity between different text vector matrices to form a document relationship matrix;
a recommendation module, configured to analyze user behavior data and, in combination with the document relationship matrix, to form a recommendation list and recommend it to the user.
CN201610635123.1A 2016-08-05 2016-08-05 A kind of text class based on content and user behavior recommends method and apparatus Pending CN106250526A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610635123.1A CN106250526A (en) 2016-08-05 2016-08-05 A kind of text class based on content and user behavior recommends method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610635123.1A CN106250526A (en) 2016-08-05 2016-08-05 A kind of text class based on content and user behavior recommends method and apparatus

Publications (1)

Publication Number Publication Date
CN106250526A true CN106250526A (en) 2016-12-21

Family

ID=58078061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610635123.1A Pending CN106250526A (en) 2016-08-05 2016-08-05 A kind of text class based on content and user behavior recommends method and apparatus

Country Status (1)

Country Link
CN (1) CN106250526A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578292A (en) * 2017-09-19 2018-01-12 上海财经大学 A kind of user's portrait constructing system
CN107679916A (en) * 2017-10-12 2018-02-09 北京京东尚科信息技术有限公司 For obtaining the method and device of user interest degree
CN107704512A (en) * 2017-08-31 2018-02-16 平安科技(深圳)有限公司 Financial product based on social data recommends method, electronic installation and medium
CN108228546A (en) * 2018-01-19 2018-06-29 北京中关村科金技术有限公司 A kind of text feature, device, equipment and readable storage medium storing program for executing
CN108334494A (en) * 2018-01-23 2018-07-27 阿里巴巴集团控股有限公司 A kind of construction method and device of customer relationship network
CN108572954A (en) * 2017-03-07 2018-09-25 上海颐为网络科技有限公司 A kind of approximation entry structure recommendation method and system
CN108829780A (en) * 2018-05-31 2018-11-16 北京万方数据股份有限公司 Method for text detection, calculates equipment and computer readable storage medium at device
CN109213972A (en) * 2017-07-06 2019-01-15 阿里巴巴集团控股有限公司 Determine the method, apparatus, equipment and computer storage medium of Documents Similarity
CN109978645A (en) * 2017-12-28 2019-07-05 北京京东尚科信息技术有限公司 A kind of data recommendation method and device
CN110489758A (en) * 2019-09-10 2019-11-22 深圳市和讯华谷信息技术有限公司 The values calculation method and device of application program
CN111125297A (en) * 2019-11-29 2020-05-08 中国电子科技集团公司第二十八研究所 Massive offline text real-time recommendation method based on search engine
CN111241403A (en) * 2020-01-15 2020-06-05 华南师范大学 Deep learning-based team recommendation method, system and storage medium
CN111310467A (en) * 2020-03-23 2020-06-19 应豪 Topic extraction method and system combining semantic inference in long text
CN111488138A (en) * 2020-04-10 2020-08-04 杭州顺藤网络科技有限公司 B2B recommendation engine based on Bayesian algorithm and cosine algorithm
CN112163399A (en) * 2020-10-12 2021-01-01 北京字跳网络技术有限公司 Online document pushing method and device, electronic equipment and computer readable medium
CN112348583A (en) * 2020-11-04 2021-02-09 贝壳技术有限公司 User preference generation method and generation system
CN113329344A (en) * 2021-05-19 2021-08-31 中国科学院计算技术研究所 File recommendation method for communication network
CN116561293A (en) * 2023-07-07 2023-08-08 中国石油天然气股份有限公司 Text feature-based oil and gas industry energy saving technology recommendation method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6850937B1 (en) * 1999-08-25 2005-02-01 Hitachi, Ltd. Word importance calculation method, document retrieving interface, word dictionary making method
CN1786962A (en) * 2005-12-21 2006-06-14 中国科学院计算技术研究所 Method for managing and searching dictionary with perfect even numbers group TRIE Tree
CN102081667A (en) * 2011-01-23 2011-06-01 浙江大学 Chinese text classification method based on Base64 coding
CN103116637A (en) * 2013-02-08 2013-05-22 无锡南理工科技发展有限公司 Text sentiment classification method facing Chinese Web comments
CN104615779A (en) * 2015-02-28 2015-05-13 云南大学 Method for personalized recommendation of Web text
CN105159998A (en) * 2015-09-08 2015-12-16 海南大学 Keyword calculation method based on document clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
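Silberschatz et al. (US), translated by Yang Dongqing et al.: "Database System Concepts (Original 6th Edition)", 31 March 2012 *
Yan Hongcan: "Ontology Modeling and Semantic Web Knowledge Discovery", 31 December 2015 *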

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108572954A (en) * 2017-03-07 2018-09-25 上海颐为网络科技有限公司 A kind of approximation entry structure recommendation method and system
CN109213972A (en) * 2017-07-06 2019-01-15 阿里巴巴集团控股有限公司 Determine the method, apparatus, equipment and computer storage medium of Documents Similarity
CN107704512A (en) * 2017-08-31 2018-02-16 平安科技(深圳)有限公司 Financial product based on social data recommends method, electronic installation and medium
CN107578292A (en) * 2017-09-19 2018-01-12 上海财经大学 A kind of user's portrait constructing system
CN107679916A (en) * 2017-10-12 2018-02-09 北京京东尚科信息技术有限公司 For obtaining the method and device of user interest degree
CN109978645B (en) * 2017-12-28 2022-04-12 北京京东尚科信息技术有限公司 Data recommendation method and device
CN109978645A (en) * 2017-12-28 2019-07-05 北京京东尚科信息技术有限公司 A kind of data recommendation method and device
CN108228546A (en) * 2018-01-19 2018-06-29 北京中关村科金技术有限公司 A kind of text feature, device, equipment and readable storage medium storing program for executing
CN108334494A (en) * 2018-01-23 2018-07-27 阿里巴巴集团控股有限公司 A kind of construction method and device of customer relationship network
CN108829780B (en) * 2018-05-31 2022-05-24 北京万方数据股份有限公司 Text detection method and device, computing equipment and computer readable storage medium
CN108829780A (en) * 2018-05-31 2018-11-16 北京万方数据股份有限公司 Method for text detection, calculates equipment and computer readable storage medium at device
CN110489758A (en) * 2019-09-10 2019-11-22 深圳市和讯华谷信息技术有限公司 The values calculation method and device of application program
CN110489758B (en) * 2019-09-10 2023-04-18 深圳市和讯华谷信息技术有限公司 Value view calculation method and device for application program
CN111125297A (en) * 2019-11-29 2020-05-08 中国电子科技集团公司第二十八研究所 Massive offline text real-time recommendation method based on search engine
CN111125297B (en) * 2019-11-29 2022-11-25 中国电子科技集团公司第二十八研究所 Massive offline text real-time recommendation method based on search engine
CN111241403B (en) * 2020-01-15 2023-04-18 华南师范大学 Deep learning-based team recommendation method, system and storage medium
CN111241403A (en) * 2020-01-15 2020-06-05 华南师范大学 Deep learning-based team recommendation method, system and storage medium
CN111310467A (en) * 2020-03-23 2020-06-19 应豪 Topic extraction method and system combining semantic inference in long text
CN111310467B (en) * 2020-03-23 2023-12-12 应豪 Topic extraction method and system combining semantic inference in long text
CN111488138A (en) * 2020-04-10 2020-08-04 杭州顺藤网络科技有限公司 B2B recommendation engine based on Bayesian algorithm and cosine algorithm
CN112163399A (en) * 2020-10-12 2021-01-01 北京字跳网络技术有限公司 Online document pushing method and device, electronic equipment and computer readable medium
CN112348583A (en) * 2020-11-04 2021-02-09 贝壳技术有限公司 User preference generation method and generation system
CN112348583B (en) * 2020-11-04 2022-12-06 贝壳技术有限公司 User preference generation method and generation system
CN113329344A (en) * 2021-05-19 2021-08-31 中国科学院计算技术研究所 File recommendation method for communication network
CN113329344B (en) * 2021-05-19 2022-08-30 中国科学院计算技术研究所 File recommendation method for communication network
CN116561293A (en) * 2023-07-07 2023-08-08 中国石油天然气股份有限公司 Text feature-based oil and gas industry energy saving technology recommendation method and system

Similar Documents

Publication Publication Date Title
CN106250526A (en) A kind of text class based on content and user behavior recommends method and apparatus
KR101536520B1 (en) Method and server for extracting topic and evaluating compatibility of the extracted topic
CN105224699B (en) News recommendation method and device
CN100465954C (en) Reinforced clustering of multi-type data objects for search term suggestion
US7421418B2 (en) Method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently
CN101055585B (en) System and method for clustering documents
CN105183833B (en) Microblog text recommendation method and device based on user model
CN102253982B (en) Query suggestion method based on query semantics and click-through data
CN102637170A (en) Question pushing method and system
CN107193883B (en) Data processing method and system
CN112966091B (en) Knowledge map recommendation system fusing entity information and heat
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN101673306B (en) Website information query method and system thereof
CN111506831A (en) Collaborative filtering recommendation module and method, electronic device and storage medium
Schofield et al. Identifying hate speech in social media
Dutta et al. PNRank: Unsupervised ranking of person name entities from noisy OCR text
Wei et al. Online education recommendation model based on user behavior data analysis
Nikas et al. Open domain question answering over knowledge graphs using keyword search, answer type prediction, SPARQL and pre-trained neural models
Jedrzejewski et al. Opinion mining and social networks: A promising match
Yao et al. Online deception detection refueled by real world data collection
Alam et al. Social media content categorization using supervised based machine learning methods and natural language processing in bangla language
Campbell et al. Content+ context networks for user classification in twitter
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
CN106294689A (en) A kind of method and apparatus selecting based on text category feature to carry out dimensionality reduction
CN108932247A (en) A kind of method and device optimizing text search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161221