CN105205699A - User label and hotel label matching method and device based on hotel comments - Google Patents

Publication number
CN105205699A
Authority
CN
China
Prior art keywords
hotel
user
tag
label
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510593613.5A
Other languages
Chinese (zh)
Inventor
林小俊
张猛
暴筱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhonghui Information Technology Co Ltd
Original Assignee
Beijing Zhonghui Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhonghui Information Technology Co Ltd
Priority to CN201510593613.5A
Publication of CN105205699A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method and device for matching user labels with hotel labels based on hotel comments. The method comprises: preparing a hotel-industry emotion statement template library; preparing final hotel labels for at least three hotels; obtaining from the internet at least two user comments by a specific user on the same hotel or on different hotels; comparing the emotion statements in those comments with the template library, screening out the matching emotion statements, identifying them as different dimensions, and forming the specific user's label set from all identified dimensions; calculating the weight of each user label, where the weight is higher the more frequently the label occurs in all of the specific user's comments and the less frequently it occurs in all comments by all users on all hotels; selecting the high-weight user labels as the specific user's final user labels; and recommending to the specific user the hotels whose final hotel labels have a high matching rate with the specific user's final user labels.

Description

Method and device for matching user labels with hotel labels based on hotel comments
Technical field
The present invention relates to a method for processing internet information, and in particular to a method and device for generating user portraits.
Background technology
Changing times inevitably bring social change. As the internet has entered the era of big data, it has changed and reshaped the behaviour of both enterprises and consumers. The internet's emphasis on speed has upset the established logic of business development, forcing business participants to face unprecedented change and to adapt ever faster. Enterprises now ask how to mine the potential commercial value of big data and how to apply big data technology in earnest. As big data applications have been explored and refined, personalization has become an important point of application. Compared with traditional offline approaches such as membership management, questionnaires and market basket analysis, big data for the first time lets enterprises conveniently collect much broader user feedback over the internet, providing a sufficient data basis for analysing important business information such as user behaviour and consumption habits more precisely and quickly. As the understanding of people has deepened, the concept of the "user portrait" has emerged: it abstracts the complete information profile of a user and can be regarded as the foundation on which an enterprise applies big data.
A user portrait is a virtual representation of a real user, built on a deep understanding of real data. An enterprise collects and analyses data on consumers' social attributes, habits, consumption behaviour, opinions and the like, divides consumers into types, extracts typical features from each type, and gives each type a name, a photograph, some demographic elements, scenarios and other descriptions. The result is a user portrait, the business profile of the user, which can be regarded as the basic way in which an enterprise applies big data technology. User portraits give the enterprise a sufficient information basis and help it quickly find precise user groups, user needs and other broader feedback.
Big data processing depends on computation, and a user portrait can be represented as a set of labels. A label is a symbolic representation of some user characteristic. Labelling user information gives computers a convenient way to process information about people programmatically, and even to "understand" people through algorithms and models.
A label is usually a predefined, highly condensed feature identifier, such as an age-range label (25 to 35 years old) or a region label (Beijing). Labels have two important properties: (1) they are semantic, so people can readily understand what each label means, which gives the user portrait model practical significance and lets it serve business needs such as judging user preferences; (2) they are short texts, each usually carrying a single meaning, so a label needs little further preprocessing such as text analysis, which makes it easy for machines to extract standardized information.
Specifically, a user portrait label has two parts: the label and its weight. The label characterizes content in which the user has an interest, preference or need. The weight is an index that characterizes the degree of the user's interest or preference, and possibly the degree of the user's need; it can be understood simply as a confidence.
The core work of user portraiture is to attach "labels" to users. A label is usually a manually specified, highly condensed feature identifier such as age, sex, region or user preference. Taken together, all of a user's labels essentially sketch out a three-dimensional "portrait" of that user.
Specifically, building a user portrait requires two steps: collecting data and analysing labels.
First, all data relating to the user are collected and divided into two broad classes: static data and dynamic data. Static data are the user's relatively stable information, such as sex, age, region and occupation. Dynamic data are behavioural information that changes constantly, such as pages browsed, products searched for, comments posted and contact channels.
Second, the portrait data are used to attach labels and indices to the user. A label indicates content in which the user has an interest, preference or need; an index indicates the degree of the user's interest, the strength of the need, the purchase probability and so on.
For example, Chinese patent application publication No. 104750731A discloses a method for obtaining a complete user portrait, comprising: obtaining an incomplete user portrait matrix and randomly generating a user parameter matrix P and a label parameter matrix Q; calculating the portrait error of a first part of the users and updating the user parameter matrix and the label parameter matrix, wherein the first change difference of each selected user in the first part is greater than that of the first remaining users, the first remaining users being the users other than the first part, and the first change difference being the difference between the user's first predicted value at the (r-1)-th update and at the (r-2)-th update; and, after the R-th update of the user parameter matrix P and the label parameter matrix Q, obtaining the complete user portrait matrix from the result of the matrix decomposition.
As another example, Chinese patent application publication No. 104268292A discloses a method for updating the label lexicon of a portrait system, comprising: obtaining a user's portrait data, which include the labels describing the user and the original texts posted by the user; when the ratio of the number of labels to the number of original texts is below a preset first threshold, segmenting all of the user's original texts into words to obtain multiple candidate label words and sending them to a recommender system; the recommender system computes the vector distance between each candidate label word and each word in a preset word-vector model file, adds to the label lexicon the candidates for which some vector distance exceeds a preset second threshold, and deletes the candidates for which none does.
As a further example, Chinese patent application publication No. 103577549A discloses a crowd portrait system and method based on microblog labels, comprising two modules: microblog label recommendation and label topic clustering. The first module uses a three-step label recommendation algorithm. The first step is homogeneity-based label recommendation; the second is co-occurrence label expansion; the third builds a semantic network on a Chinese knowledge graph and uses network topology features to measure the semantic similarity between labels, thereby removing semantically identical or similar labels and ensuring that the labels used to portray a user are concise.
However, none of the user portrait label techniques disclosed in the above three patent documents is applied to the hotel industry with which the present invention is concerned.
In the hotel industry, current research on and application of user portrait label analysis concentrate mainly on user attribute data and user behaviour data. User attribute data include age, sex, region and the like; user behaviour data include the user's access history, click history and consumption history on the official website or mobile application. Research and applications based on comment data are comparatively few. The main difficulty is that comment text is hard to analyse and understand: unstructured data must first be converted into structured data by techniques such as natural language processing before common user label analysis algorithms can be applied.
Providing a method for matching user labels with hotel labels based on hotel comments has therefore become an urgent problem in the industry.
Summary of the invention
The object of the present invention is to provide a method and device for matching user labels with hotel labels based on hotel comments, which model both hotels and users by labels and thereby associate hotels and users more effectively.
Common user analysis methods all rely on structured data, such as user attribute data (age, sex, region and the like) or user behaviour data (the user's access history, click history, consumption history and the like on the official website or mobile application). The present invention addresses hotel comment data, which have been little studied or applied. It not only determines whether a user's evaluation of a hotel is positive or negative, but also mines the dimensions of the evaluation and builds labels for hotels and users on that basis.
The present invention first uses a focused crawler to obtain online comment data from the major mainstream online travel agent (Online Travel Agent, OTA) comment websites. Then, for this large body of comments, a hotel-industry emotion dictionary and a domain knowledge base are compiled automatically or semi-automatically. Finally, each sentence in a comment is analysed with natural language processing techniques such as word segmentation, part-of-speech tagging and phrase-structure parsing; keywords or key sentence patterns are extracted as features on that basis, and emotion classification is performed by a maximum entropy classifier. For sentences that express emotion, the dimension is further obtained by reasoning over domain keywords and the knowledge base. Each dimension reflects an angle from which people observe, understand and describe a hotel or a user.
The present invention uses dimensions to describe in detail the points of interest of both hotels and users in the hotel industry, and takes them as the label set. User labels reflect the aspects a user cares about, and hotel labels reflect the aspects a hotel is good at. In a scenario such as recommending hotels to a user, the more similar the labels the user cares about are to the labels the hotel is good at, that is, the higher the matching degree, the more suitable the hotel is for recommendation to that user. Once the label set is available, the next step is to calculate label weights over all comments of a given user or all comments on a given hotel. Weight calculation is based mainly on the frequency with which a label occurs in the comments. Hotel labels differ from user labels in that, to reflect how good a hotel is at an aspect, the emotional polarity of the comment sentences corresponding to the label must be considered. The more positive evaluations a label receives, the better the hotel is considered to be at that aspect.
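The matching idea in the preceding paragraph can be sketched as follows. The text does not define how the matching degree is computed, so the weighted-overlap formula, the function names `match_rate` and `recommend`, and the sample labels are all assumptions made for illustration only.

```python
def match_rate(user_labels, hotel_labels):
    """Weighted overlap between a user's labels and a hotel's labels.

    Both arguments map label name -> weight. The formula is an assumed
    stand-in; the source does not specify one.
    """
    total = sum(user_labels.values())
    if total == 0:
        return 0.0
    shared = set(user_labels) & set(hotel_labels)
    return sum(user_labels[l] * hotel_labels[l] for l in shared) / total


def recommend(user_labels, hotels, top_n=3):
    """Return the names of the top_n hotels ranked by match rate."""
    ranked = sorted(hotels,
                    key=lambda name: match_rate(user_labels, hotels[name]),
                    reverse=True)
    return ranked[:top_n]
```

With a user who mostly cares about cleanliness, hotels whose strong labels include cleanliness rank first; a hotel sharing no label with the user scores zero.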
In the present invention, a dimension refers to the type of emotion expressed by a statement that evaluates some aspect of a hotel, such as the hotel's hygiene level, convenience of transport, quality of the surroundings or room size. There may be a number of dimensions: for example, dimension 1 indicates that the hygiene level is grade A; dimension 12 indicates that the convenience of transport is grade B; dimension 53 indicates that the surroundings index is grade C; and dimension 104 indicates that the room size is grade D.
In the present invention, the different attributes of words refer to the classification of words into attributes such as evaluation object words, evaluation attribute words and emotion words.
According to one aspect of the present invention, a method for matching user labels with hotel labels based on hotel comments is provided, comprising: (1) preparing a hotel-industry emotion statement template library comprising at least 100 emotion statement templates; (2) preparing final hotel labels for at least three hotels; (3) obtaining from the internet at least two user comments by a specific user on the same hotel or on different hotels; (4) comparing the emotion statements in all of the specific user's comments one by one with the at least 100 emotion statement templates, screening out the emotion statements that match, identifying the screened statements as different dimensions according to the type of emotion expressed, and forming the specific user's label set from all identified dimensions; (5) calculating the weight of each user label in the specific user's label set, wherein the weight of a user label is higher the more frequently it occurs in all of the specific user's comments and the less frequently it occurs in all comments by all users on all hotels; (6) selecting from the specific user's label set the user labels whose weight exceeds a first set threshold as the specific user's final user labels; and (7) recommending to the specific user at least the three hotels whose final hotel labels rank highest in matching rate with the specific user's final user labels.
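Steps (5) and (6) describe a weight that rises with a label's frequency in the user's own comments and falls with its frequency across all comments. A TF-IDF-style calculation has exactly this behaviour, so the sketch below uses one; the exact formula, the smoothing, the normalization by the largest weight (so that a threshold between 0 and 1 is meaningful) and the names `label_weights` and `final_labels` are assumptions, not the patent's stated formula.

```python
import math
from collections import Counter


def label_weights(own_comment_labels, all_comment_labels):
    """TF-IDF-style label weights (an assumed formula).

    own_comment_labels: list of label lists, one per comment by this user.
    all_comment_labels: list of label lists, one per comment by any user.
    """
    tf = Counter(l for labels in own_comment_labels for l in labels)
    total = sum(tf.values())
    n_all = len(all_comment_labels)
    # Number of comments (by anyone) in which each label appears.
    df = Counter(l for labels in all_comment_labels for l in set(labels))
    weights = {}
    for label, count in tf.items():
        idf = math.log((1 + n_all) / (1 + df[label])) + 1.0
        weights[label] = (count / total) * idf
    return weights


def final_labels(weights, threshold):
    """Keep labels whose weight, scaled by the largest weight, exceeds threshold."""
    top = max(weights.values(), default=0.0)
    if top == 0:
        return {}
    return {l: w / top for l, w in weights.items() if w / top > threshold}
```

The same two functions could be applied per hotel instead of per user by passing the label lists of all comments on that hotel as the first argument.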
Depending on the specific application, the final hotel labels of at least 10, at least 100 or at least 500 hotels may be prepared.
Optionally, comment data may be obtained from comment websites in advance, by another device or manually, and kept ready for use.
Optionally, the hotel-industry semantic dictionary may be compiled in advance, by another device or manually, and kept ready for use.
Optionally, the hotel-industry emotion statement template library may be compiled in advance, by another device or manually, and kept ready for use.
Optionally, the seed semantic dictionary may be compiled in advance, by another device or manually, and kept ready for use.
Optionally, preparing the final hotel labels of at least three hotels in step (2) comprises: (2.1) obtaining from the internet user comments on each of the at least three hotels, including comments by at least three users for each hotel; (2.2) comparing the emotion statements in all user comments on a given hotel one by one with the at least 100 emotion statement templates, screening out the emotion statements that match, identifying the screened statements as different dimensions according to the type of emotion expressed, and forming the given hotel's label set from all identified dimensions; (2.3) calculating the weight of each hotel label in the given hotel's label set, wherein the weight of a hotel label is higher the more frequently it occurs in all user comments on that hotel and the less frequently it occurs in all comments by all users on all hotels; (2.4) selecting from the hotel's label set the hotel labels whose weight exceeds a second set threshold as the given hotel's final hotel labels; and (2.5) repeating steps (2.2) to (2.4) until the final hotel labels of all hotels are obtained.
Optionally, preparing the hotel-industry emotion statement template library in step (1) may comprise obtaining at least 10000 hotel user comments from the internet and, according to how frequently statements occur, screening out at least 100 emotion statements from them as emotion statement templates.
Optionally, the method further comprises screening out at least 1000 common hotel-industry words from the at least 10000 hotel user comments according to how frequently words occur, in order to build a hotel-industry semantic dictionary.
Optionally, step (1) further comprises, before the hotel-industry emotion statement template library is prepared, a step of building a hotel-industry semantic dictionary, and comparing the emotion statements in all of the specific user's comments one by one with the at least 100 emotion statement templates in step (4) comprises: (4.1) segmenting a particular emotion statement into words corresponding to common hotel-industry words in the hotel-industry semantic dictionary; (4.2) comparing the particular emotion statement with each of the at least 100 emotion statement templates according to the attributes of its words, thereby determining whether it matches any of the templates; and (4.3) repeating steps (4.1) and (4.2) until all emotion statements matching the at least 100 emotion statement templates have been screened out.
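Steps (4.1) and (4.2) can be illustrated with a minimal sketch: each word of an already segmented statement is mapped to its attribute (semantic class) and the resulting sequence is compared with the templates. The tiny dictionary, the class abbreviations (`OBJ`, `ATTR`, `EMO`, `DEG`, `NEG`), the three templates and the use of English words in place of segmented Chinese are all hypothetical.

```python
# Hypothetical semantic dictionary: word -> attribute (semantic class).
SEMANTIC_DICT = {
    "room": "OBJ", "breakfast": "OBJ", "size": "ATTR",
    "very": "DEG", "not": "NEG", "clean": "EMO", "small": "EMO",
}

# Hypothetical emotion statement templates, as sequences of class names.
TEMPLATES = [
    ("OBJ", "DEG", "EMO"),
    ("OBJ", "ATTR", "DEG", "EMO"),
    ("OBJ", "NEG", "EMO"),
]


def to_classes(words):
    """Replace each known word by its class; unknown words stay as they are."""
    return tuple(SEMANTIC_DICT.get(w, w) for w in words)


def find_template(words, templates=TEMPLATES):
    """Return the first template found as a contiguous run in the statement."""
    classes = to_classes(words)
    for template in templates:
        n = len(template)
        for i in range(len(classes) - n + 1):
            if classes[i:i + n] == template:
                return template
    return None
```

A statement for which `find_template` returns `None` would simply not be screened out as an emotion statement.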
Optionally, comparing the emotion statements in all user comments on a given hotel one by one with the at least 100 emotion statement templates in step (2.2) comprises: (2.2.1) segmenting a particular emotion statement into words corresponding to common hotel-industry words in the hotel-industry semantic dictionary; (2.2.2) comparing the particular emotion statement with each of the at least 100 emotion statement templates according to the attributes of its words, thereby determining whether it matches any of the templates; and (2.2.3) repeating steps (2.2.1) and (2.2.2) until all emotion statements matching the at least 100 emotion statement templates have been screened out.
Optionally, in step (3) the user comments may be obtained from comment websites by a focused crawler.
Optionally, in step (1) the hotel-industry emotion statement template library may be obtained by extracting sentence pattern templates from user comments with a Bootstrapping method.
Optionally, the steps of preparing the hotel-industry emotion statement template library and building the hotel-industry semantic dictionary comprise: (1.1) obtaining comment data and manually compiling the words of each emotion element into a seed dictionary; (1.2) segmenting the sentences of the comment data into words, then determining the semantic class of each word and replacing the word with its semantic class label; (1.3) splitting the label-replaced comment data into sentences and generating templates from the names of the semantic classes and the specific words each class contains; (1.4) applying the templates to the label-replaced comment data to extract semantic words for each semantic class; (1.5) scoring each template according to its importance, generality and accuracy; (1.6) selecting the highest-scoring portion of the templates, calculating from the selected templates and their scores the score of each semantic word they extract, and adding the highest-scoring portion of the semantic words to the semantic dictionary; and (1.7) iterating steps (1.2) to (1.6) until the semantic words selected are no longer correct, thereby obtaining the final hotel-industry semantic dictionary and forming the hotel-industry emotion statement template library from the templates.
Optionally, step (1.1) obtains online comment data from comment websites by a focused crawler and, by manually checking a small number of comments, compiles the words of each semantic class into the seed dictionary.
Optionally, step (1.2) first segments words by dictionary-based maximum matching and then, for the parts where the segmentation is ambiguous, uses a sequence-labelling segmentation method to obtain the correct result. The sequence-labelling method converts the word segmentation problem into a character classification problem: each character is given a position class label according to its position within a word, and the segmentation of the sentence is determined from the resulting label sequence.
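Dictionary-based maximum matching can be sketched as below. The text does not say whether matching runs forward or backward, so the forward direction, the `max_len` limit and the function name `forward_max_match` are assumptions; resolving ambiguous spans by sequence labelling is not shown.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words
```

For instance, with a dictionary containing 酒店, 房间, 很 and 干净, the string 酒店房间很干净 is cut into those four words in order.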
Optionally, the position class labels comprise word-initial, word-medial, word-final and single-character word, and a conditional random field model is used to perform the sequence labelling.
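The four position classes correspond to what is commonly written as the B/M/E/S tag set (begin, middle, end, single). The sketch below only shows how a segmentation maps to such a tag sequence and back; the letters themselves and the function names are assumptions, and the conditional random field that would predict the tags is not implemented here.

```python
def words_to_tags(words):
    """Convert a segmented sentence into one position tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the segmentation from characters and their position tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words
```

A trained sequence labeller would supply the `tags` argument of `tags_to_words` for the ambiguous parts of a sentence.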
Optionally, in step (1.2) the semantic classes comprise evaluation object words, evaluation attribute words, emotion words, degree adverbs, common adverbs, negation words and insertion words.
Optionally, step (1.3) splits sentences at the three punctuation marks ".", "!" and "?", and limits templates to a minimum length of 3 words and a maximum length of 7 words.
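A minimal sketch of this step, assuming that candidate templates are simply all contiguous token runs of 3 to 7 words taken from a label-replaced sentence (the text does not say how candidates are enumerated). The regular expression accepts both the ASCII and the full-width forms of the three punctuation marks; that choice and the function names are also assumptions.

```python
import re


def split_sentences(text):
    """Split text at full stops, exclamation marks and question marks."""
    return [s for s in re.split(r"[。！？.!?]", text) if s.strip()]


def candidate_templates(tokens, min_len=3, max_len=7):
    """Enumerate contiguous token runs whose length is within the limits."""
    candidates = []
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(tuple(tokens[i:i + n]))
    return candidates
```

Counting how often each candidate occurs over the whole corpus would then give the template frequencies used for scoring.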
Optionally, when step (1.4) extracts the semantic words of each semantic class, if a comment fragment differs from a template obtained in step (1.3) by exactly one word, that word is taken as an instance word of the corresponding semantic class.
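The one-word-difference rule can be sketched as follows. The requirement that the differing template slot be a semantic class name (rather than a concrete word) is an assumption added so that the extracted word has a class to belong to; `extract_instance` and the class abbreviations are hypothetical.

```python
def extract_instance(fragment, template, class_names):
    """If fragment and template differ in exactly one position, and the
    template holds a semantic class name there, return (class, word)."""
    if len(fragment) != len(template):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(fragment, template)) if a != b]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    if template[i] not in class_names:
        return None
    return template[i], fragment[i]
```

For a fragment `("OBJ", "DEG", "spotless")` and a template `("OBJ", "DEG", "EMO")`, the word "spotless" is proposed as a new emotion word.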
Optionally, step (1.5) scores each template as follows:
1) The importance and generality score $S(pat_i)$ of a template is computed by the following formula:
[formula missing in source]
where $|pat_i|$ is the length of template $pat_i$ counted in words, $f(pat_i)$ is the frequency of template $pat_i$, and $C(pat_i)$ is the set of templates in which $pat_i$ is nested;
2) The accuracy score $P(pat_i)$ of a template is computed as:
$$P(pat_i) = \frac{\sum_{t \in SemLex,\ t \in T(pat_i)} f(t)}{\sum_{t \in T(pat_i)} f(t)},$$
where $T(pat_i)$ is the set of semantic words extracted by template $pat_i$, $f(t)$ is the frequency of semantic word $t$, and $SemLex$ is the seed semantic dictionary;
3) The sigmoid function is used to normalize $S(pat_i)$ to (0, 1), and the two scores are then combined into $F(pat_i)$:
$$F(pat_i) = \alpha \cdot \log_2 \frac{1}{1 + e^{-S(pat_i)}} + (1 - \alpha) \cdot \log_2 P(pat_i),$$
where $\alpha$ is the weight of the importance and generality score $S(pat_i)$, with values in [0, 1].
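The accuracy score P and the combined score F above translate directly into code. In this sketch the importance and generality score S is taken as an input, since its formula is not available in the text; the function names, the default `alpha=0.5` and the choice to return negative infinity when P is zero (where the logarithm is undefined) are assumptions.

```python
import math


def accuracy_score(extracted_freq, seed_lexicon):
    """P: share of extracted-word frequency that falls on words already
    in the seed semantic dictionary. extracted_freq maps word -> frequency."""
    total = sum(extracted_freq.values())
    if total == 0:
        return 0.0
    known = sum(f for t, f in extracted_freq.items() if t in seed_lexicon)
    return known / total


def log2_sigmoid(s):
    """log2(1 / (1 + e^-s)), written to avoid overflow for large |s|."""
    if s >= 0:
        return -math.log2(1.0 + math.exp(-s))
    return (s - math.log1p(math.exp(s))) / math.log(2)


def combined_score(s, p, alpha=0.5):
    """F = alpha * log2(sigmoid(S)) + (1 - alpha) * log2(P)."""
    if p <= 0:
        return float("-inf")  # log2(0) is undefined; treat as worst score
    return alpha * log2_sigmoid(s) + (1 - alpha) * math.log2(p)
```

Because both logarithms are at most zero, F is never positive; templates are compared by how close their score is to zero.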
Optionally, in step (1.6) the highest-scoring portion of the templates is the top 5% to 10% of templates by score, and the highest-scoring portion of the semantic words is the top 5% to 10% of semantic words by score.
Optionally, after step (1.7), the polarity of the emotion words in the semantic dictionary, and the polarity of collocations of emotion words with evaluation object words and evaluation attribute words, are determined manually; during manual determination, the comment fragments corresponding to the templates to which a word belongs serve as the basis for judgment.
Optionally, the present invention includes steps for sentiment analysis of comments: obtaining comment data and normalizing them; segmenting the sentences of the normalized comment data into words; performing element analysis on the segmented sentences to identify the classes of words that affect detection and analysis of the emotional orientation of the text; matching sentence pattern templates against the element-analysed comment data according to the sentence pattern template library; determining the antecedent of each anaphor in the sentences of the comment data and recovering omitted subjects; and taking sentences containing evaluation object words, evaluation attribute words or emotion words as candidate emotion sentences, and determining the polarity of each candidate emotion sentence with a maximum entropy model to obtain the emotional orientation of the sentence.
Optionally, normalization uses a rule-based method to handle misspellings in the comment text, each rule being a mapping from "a character string or word string containing wrong characters" to "the corresponding correct character string or word string". The rules are obtained in two ways: first, from existing knowledge, namely common misspellings summarized by others; second, by extracting similar characters or words according to the context of each character or word and determining the correct character string or word string by manual checking.
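Rule-based normalization of this kind amounts to a lookup table of wrong-to-right strings. The two rules below are invented examples, and applying longer wrong strings first (so that a short rule cannot break a longer one) is an assumption about ordering that the text does not address.

```python
# Hypothetical correction rules: misspelled string -> correct string.
RULES = {
    "洒店": "酒店",  # wrong first character in "hotel"
    "干静": "干净",  # wrong second character in "clean"
}


def normalize(text, rules=RULES):
    """Apply every correction rule, longest wrong string first."""
    for wrong in sorted(rules, key=len, reverse=True):
        text = text.replace(wrong, rules[wrong])
    return text
```

Text that contains none of the listed misspellings passes through unchanged.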
Optionally, word segmentation first uses dictionary-based maximum matching and then, for the parts where the segmentation is ambiguous, uses a sequence-labelling segmentation method to obtain the correct result. The sequence-labelling method converts the word segmentation problem into a character classification problem: each character is given a position class label according to its position within a word, and the segmentation of the sentence is determined from the resulting label sequence.
Optionally, the position class labels comprise word-initial, word-medial, word-final and single-character word, and a conditional random field model is used to perform the sequence labelling.
Optionally, the elements comprise evaluation object words, evaluation attribute words, emotion words, degree adverbs, common adverbs, negation words and insertion words in the comment data, as well as words relating to cities and scenic spots; after the elements in a sentence are identified, they are tagged with the corresponding class labels.
Optionally, sentence pattern templates are extracted from the comments by a Bootstrapping method, thereby building the sentence pattern template library.
Optionally, if the current sentence contains neither an evaluation object word nor an evaluation attribute word, the most recently mentioned evaluation object word or evaluation attribute word is introduced into the current sentence; if the current sentence contains only an evaluation attribute word, the evaluation object of the most recent preceding sentence that has one is introduced into the current sentence.
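The two recovery rules can be sketched over sentences that have already been reduced to their elements. The dictionary layout (`object`, `attribute`, `emotion` keys), the function name `recover_omitted` and the assumption that, when a sentence names both an object and an attribute, the attribute counts as the later mention are all illustrative choices.

```python
def recover_omitted(sentences):
    """Fill in omitted evaluation objects or attributes from earlier sentences.

    Each sentence is a dict with optional "object" and "attribute" entries.
    Returns new dicts; the input is left unchanged.
    """
    last_object = None    # most recent explicit evaluation object
    last_mention = None   # most recent explicit (kind, word) of either type
    resolved = []
    for sent in sentences:
        obj, attr = sent.get("object"), sent.get("attribute")
        new = dict(sent)
        if obj is None and attr is None:
            # Rule 1: inherit whatever was mentioned last.
            if last_mention is not None:
                kind, word = last_mention
                new[kind] = word
        elif obj is None and last_object is not None:
            # Rule 2: attribute only, so borrow the last explicit object.
            new["object"] = last_object
        if obj is not None:
            last_object = obj
            last_mention = ("object", obj)
        if attr is not None:
            last_mention = ("attribute", attr)
        resolved.append(new)
    return resolved
```

Only explicitly stated words update the trackers, so an inherited object is never propagated as if it had been mentioned afresh.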
Optionally, the maximum entropy model predicts the emotion class by building a conditional probability model and estimating the probability of each class; the emotion classes are -1, 0 and 1, denoting negative, neutral and positive respectively.
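A maximum entropy classifier assigns each class a probability proportional to the exponential of a weighted feature sum. The sketch below shows only this prediction step with hand-set weights; the feature names, the weight values and the function names are hypothetical, and training the weights from labelled comments is not shown.

```python
import math

CLASSES = (-1, 0, 1)  # negative, neutral, positive


def maxent_probs(features, weights):
    """p(class | features) = exp(sum of weights) / normalizer.

    weights maps (feature, class) -> float; missing pairs count as 0.
    """
    scores = {c: sum(weights.get((f, c), 0.0) for f in features)
              for c in CLASSES}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: v / z for c, v in exps.items()}


def classify(features, weights):
    """Return the class with the highest conditional probability."""
    probs = maxent_probs(features, weights)
    return max(probs, key=probs.get)
```

In practice the features would be the keywords and sentence patterns extracted from each candidate emotion sentence.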
According to another aspect of the present invention, a device for matching user labels with hotel labels based on hotel comments is provided, comprising: a hotel-industry emotion statement template library generation module, the library comprising at least 100 emotion statement templates; a final hotel label generation module for generating the final hotel labels of at least three hotels; a user comment acquisition module, which obtains from the internet at least two user comments by a specific user on the same hotel or on different hotels; a user label set generation module, which compares the emotion statements in all of the specific user's comments one by one with the at least 100 emotion statement templates, screens out the emotion statements that match, identifies the screened statements as different dimensions according to the type of emotion expressed, and forms the specific user's label set from all identified dimensions; a user label weight calculation module, which calculates the weight of each user label in the specific user's label set, wherein the weight of a user label is higher the more frequently it occurs in all of the specific user's comments and the less frequently it occurs in all comments by all users on all hotels; a final user label generation module, which selects from the specific user's label set the user labels whose weight exceeds a first set threshold as the specific user's final user labels; and a hotel recommendation module, which recommends to the specific user at least the three hotels whose final hotel labels rank highest in matching rate with the specific user's final user labels.
Optionally, the final hotel label generation module obtains from the internet, through the user comment acquisition module, user comments on each of the at least three hotels, including comments by at least three users for each hotel. The final hotel label generation module may further comprise: a hotel label set generation submodule, which compares the emotion statements in all user comments on a given hotel one by one with the at least 100 emotion statement templates, screens out the emotion statements that match, identifies the screened statements as different dimensions according to the type of emotion expressed, and forms the given hotel's label set from all identified dimensions; and a hotel label weight calculation submodule, which calculates the weight of each hotel label in the given hotel's label set, wherein the weight of a hotel label is higher the more frequently it occurs in all user comments on that hotel and the less frequently it occurs in all comments by all users on all hotels. The final hotel label generation module selects from the hotel's label set the hotel labels whose weight exceeds a second set threshold as the given hotel's final hotel labels.
Selectively, hotel industry emotion statement template base generation module is commented on acquisition module by user and is obtained at least 10000 hotel users comment from internet and therefrom filter out at least 100 emotion statements as emotion statement template according to the frequency height that statement occurs.
Selectively, can comprise hotel industry semantic dictionary generation module further, its frequency height occurred according to vocabulary filters out at least 1000 hotel industry common wordss in order to build hotel industry semantic dictionary from least 10000 hotel user comments.
Selectively, the first setting threshold value or the second setting threshold value can be selected arbitrarily in 0 ~ 1 scope.Such as, the first setting threshold value elects 0.5 as, and the second setting threshold value elects 0.3 as.
Optionally, the invention may use a Bootstrapping-based method to build the hotel-industry semantic dictionary and the sentence pattern template library.
Bootstrapping, i.e., self-expansion, is a semi-supervised machine learning method that can be used to extract a semantic dictionary and templates at the same time. The idea rests on the following observation: extraction templates can be used to extract new instances, and those instances can in turn be used to extract new templates. The advantage of the method is that it needs no annotated corpus, only a small number of seeds. Initial seed words are first obtained by manual intervention; the seed words are used to obtain templates, the templates are used to obtain further seed words, and the process iterates. Each round of iteration produces new labeled data: the best words are added to the corresponding semantic dictionary and the best templates are added to the template library; the model is relearned from the new labeled data, which in turn produces further new data. The cycle repeats until it converges, yielding more seed words and templates. This is the most basic Bootstrapping algorithm (or process).
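The alternation between words and templates described above can be sketched as follows. This is a minimal illustration only, not the patent's implementation: it assumes a template is simply the (left neighbor, right neighbor) context of a known word, it omits the scoring and top-5% selection described in the steps that follow, and the function name `bootstrap` and the toy sentences are hypothetical.

```python
def bootstrap(sentences, seeds, max_rounds=5):
    """Alternate between learning templates from known words and words from templates."""
    lexicon = set(seeds)
    templates = set()
    for _ in range(max_rounds):
        # Words -> templates: record the (left, right) context of every known word.
        new_templates = set()
        for s in sentences:
            for i in range(1, len(s) - 1):
                if s[i] in lexicon:
                    new_templates.add((s[i - 1], s[i + 1]))
        # Templates -> words: any word seen in a known context becomes a candidate.
        known = templates | new_templates
        new_words = set()
        for s in sentences:
            for i in range(1, len(s) - 1):
                if (s[i - 1], s[i + 1]) in known:
                    new_words.add(s[i])
        if new_templates <= templates and new_words <= lexicon:
            break  # nothing new was learned: the process has converged
        templates |= new_templates
        lexicon |= new_words
    return lexicon, templates

# Toy, pre-segmented corpus with a single seed word.
sentences = [
    ["the", "room", "is", "clean"],
    ["the", "lobby", "is", "clean"],
    ["the", "lobby", "was", "noisy"],
    ["the", "pool", "was", "noisy"],
]
lexicon, templates = bootstrap(sentences, seeds={"room"})
```

Starting from the single seed "room", the first round learns the context ("the", "is") and with it the word "lobby"; the second round learns ("the", "was") from "lobby" and with it "pool". Without a scoring step such a loop drifts quickly on real data, which is why the procedure below scores templates and words before accepting them.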
The semantic classes of the semantic dictionary include evaluation object words, evaluation attribute words, sentiment words, degree adverbs, ordinary adverbs, negation words, parenthetical words, and so on. Each semantic class contains a number of words, and a template is a sequence composed of semantic class names or concrete words.
The concrete implementation steps are as follows:
Step 1: data preparation. Online review data is collected by a focused crawler from mainstream review websites such as Ctrip.
Step 2: seed dictionary creation. A small number of reviews (e.g., 500) are inspected manually and the words of each semantic class are compiled; the resulting semantic dictionary is denoted SemLex.
Step 3: review word segmentation. Chinese word segmentation is a basic step in Chinese natural language processing. The invention combines dictionary-based segmentation with statistical segmentation: dictionary-based maximum matching is applied first, and the parts where the segmentation is ambiguous are then segmented by sequence labeling.
In dictionary-based maximum matching, given a dictionary and a Chinese character sequence to be segmented, the longest matching dictionary word is found at each position in turn; where nothing matches, the character is treated as a single-character word, and this continues until the whole sequence has been processed. Depending on the scanning direction, the method is divided into forward maximum matching (matching from left to right) and reverse maximum matching (matching from right to left). For example, for the sequence "当原子结合成分子时" ("when atoms combine into molecules"), the forward maximum matching result is "当|原子|结合|成|分子|时" (when | atom | combine | become | molecule | time), whereas the reverse maximum matching result is "当|原子|结合|成分|子时" (when | atom | combine | component | midnight hour). Clearly, neither forward nor reverse maximum matching handles segmentation ambiguity well. The two can also be combined into bidirectional maximum matching, in which the places where the forward and reverse results disagree are often places of potential ambiguity. Where there is ambiguity, the segmentation usually has to be decided from the specific context. Supervised sequence labeling can fully exploit rich contextual features, so the invention introduces sequence labeling to resolve the ambiguous cases. This method converts the word segmentation problem into a character classification problem: each character is given a position class tag according to its position within a word, namely word-initial, word-medial, word-final, or single-character word, denoted B (Begin), M (Middle), E (End), and S (Single) respectively. Given the tag sequence of the characters, the segmentation of the sentence is easy to determine: a character sequence whose tags satisfy the regular expression "S" or "B(M)*E" represents one word. To perform the sequence labeling task, the invention uses a conditional random field (CRF) model, which is widely used in natural language processing and has achieved great success. The features are: the previous character, the current character, the next character, the previous character together with the current character, and the current character together with the next character. Using these extracted features, the CRF model predicts the class tag of each character.
Both the dictionary used for maximum matching and the training corpus of the supervised conditional random field model come from 100,000 hotel reviews manually annotated for the invention.
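The dictionary-based part of Step 3 can be illustrated with a short sketch of forward and reverse maximum matching. The dictionary, the `max_len` limit, and the example string "研究生命起源" ("study the origin of life") are assumptions chosen for illustration and do not come from the patent; the point is only that a disagreement between the two directions flags a span for the sequence labeling model.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # fall back to a single character
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]

# Hypothetical toy dictionary: "research", "graduate student", "life", "origin".
dictionary = {"研究", "研究生", "生命", "起源"}
text = "研究生命起源"
fwd = forward_max_match(text, dictionary)
bwd = backward_max_match(text, dictionary)
ambiguous = fwd != bwd  # disagreement marks the sentence for CRF-based disambiguation
```

Here the forward pass greedily takes "研究生" and strands "命", while the reverse pass yields "研究|生命|起源"; because the two results differ, the sentence would be handed to the sequence labeling step.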
Step 4: semantic class label replacement. In each segmented review, every word is checked for its semantic class and replaced with the corresponding semantic class label; for example, "restaurant | 's | price | very | high" is replaced with "Obj | 's | Attr | Dgr | Sent". "Start" and "End" labels are added at the beginning and end of the review respectively, and every punctuation mark in the review other than ".", "!", and "?" is replaced with the label "Punc".
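The replacement in Step 4 amounts to a dictionary lookup per token. The sketch below assumes a tiny hand-made lexicon (`SEM_LEX`) and a small set of non-terminal punctuation marks; both are hypothetical stand-ins for the SemLex dictionary and the real punctuation inventory.

```python
# Hypothetical miniature semantic dictionary: word -> semantic class label.
SEM_LEX = {"restaurant": "Obj", "price": "Attr", "very": "Dgr", "high": "Sent"}
# Punctuation that does not end a sentence is collapsed to one label.
OTHER_PUNCT = {",", ";", ":"}

def to_labels(tokens):
    """Replace known words by their class label and wrap the review in Start/End."""
    labels = ["Start"]
    for tok in tokens:
        if tok in SEM_LEX:
            labels.append(SEM_LEX[tok])
        elif tok in OTHER_PUNCT:
            labels.append("Punc")
        else:
            labels.append(tok)  # unknown words and ".", "!", "?" are kept as they are
    labels.append("End")
    return labels

labels = to_labels(["restaurant", "'s", "price", "very", "high", "."])
```

Keeping unknown words unchanged matters for the later extraction step: those literal words are exactly the slots in which new dictionary entries can be discovered.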
Step 5: template generation. Sentences are delimited by the three punctuation marks ".", "!", and "?". With the template length limited to a minimum of 3 words and a maximum of 7 words, the label-replaced reviews are scanned to generate templates.
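One plausible reading of Step 5 is a sliding window over each sentence of the label sequence, counting every contiguous run of 3 to 7 items as a candidate template. The sketch below follows that reading; the function name and the decision to count frequencies at the same time are assumptions.

```python
from collections import Counter

SENTENCE_END = {".", "!", "?"}

def generate_templates(labels, min_len=3, max_len=7):
    """Count every contiguous label run of length min_len..max_len within a sentence."""
    counts = Counter()
    segment = []
    for lab in list(labels) + ["."]:  # trailing sentinel flushes the last sentence
        if lab in SENTENCE_END:
            for n in range(min_len, max_len + 1):
                for i in range(len(segment) - n + 1):
                    counts[tuple(segment[i:i + n])] += 1
            segment = []
        else:
            segment.append(lab)
    return counts

labels = ["Obj", "'s", "Attr", "Dgr", "Sent", ".", "Attr", "Dgr", "Sent"]
template_counts = generate_templates(labels)
```

In this toy input the run ("Attr", "Dgr", "Sent") occurs in both sentences, so it receives a frequency of 2, which is the quantity the scoring step uses as f(pat).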
Step 6: template scoring. The invention scores templates from two aspects: the importance and generality of a template are measured by its frequency, and its accuracy is measured by its hit rate in the semantic dictionary.
The importance and generality score $S(pat_i)$ of a template $pat_i$ is calculated from the following quantities: $|pat_i|$, the length of template $pat_i$ counted in words; $f(pat_i)$, the frequency of template $pat_i$; and $C(pat_i)$, the set of templates in which $pat_i$ is nested.
The accuracy score $P(pat_i)$ of a template $pat_i$ is calculated as follows:
$$P(pat_i) = \frac{\sum_{t \in SemLex,\ t \in T(pat_i)} f(t)}{\sum_{t \in T(pat_i)} f(t)}$$
where $T(pat_i)$ is the set of semantic words extracted by template $pat_i$ and $f(t)$ is the frequency of semantic word $t$.
The Sigmoid function is used to normalize $S(pat_i)$ to the interval (0, 1), and the scores from the two aspects are then merged into $F(pat_i)$:
$$F(pat_i) = \alpha \cdot \log_2 \frac{1}{1 + e^{-S(pat_i)}} + (1 - \alpha) \cdot \log_2 P(pat_i)$$
with $\alpha = 0.4$, i.e., the invention places more weight on the accuracy of a template.
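The two formulas of Step 6 that survive in the text can be evaluated directly. In the sketch below, the importance score S is taken as an input, and the function names and the example frequencies are hypothetical. Returning negative infinity when P is zero is also an assumption, made because the logarithm is undefined there.

```python
import math

def accuracy_score(extracted_freq, sem_lex):
    """P(pat): share of extracted-word frequency that is already in the dictionary."""
    total = sum(extracted_freq.values())
    hits = sum(f for word, f in extracted_freq.items() if word in sem_lex)
    return hits / total if total else 0.0

def combined_score(s, p, alpha=0.4):
    """F(pat) = alpha * log2(sigmoid(S)) + (1 - alpha) * log2(P)."""
    if p <= 0:
        return float("-inf")  # a template with no dictionary hits ranks last
    sigmoid = 1.0 / (1.0 + math.exp(-s))
    return alpha * math.log2(sigmoid) + (1 - alpha) * math.log2(p)

# Words extracted by one template, with their frequencies (toy numbers).
extracted = {"clean": 3, "dirty": 1, "zzz": 4}
p = accuracy_score(extracted, sem_lex={"clean", "dirty"})
f = combined_score(s=0.0, p=p)
```

Both logarithms are at most zero, so F is never positive; templates are ranked by how close their score is to zero. With alpha = 0.4 the accuracy term carries the larger weight of 0.6.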
Step 7: template selection. The top 5% of templates by $F(pat_i)$ score are selected.
Step 8: semantic word extraction. The selected templates are applied to the reviews obtained after semantic class label replacement. When a review fragment differs from a selected template in only one word, that word is taken as an instance word of the corresponding semantic class.
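The one-word-difference rule of Step 8 can be sketched as a position-wise comparison between a template and a labeled fragment of the same length. The requirement that the differing slot hold a semantic class label in the template is an assumption added here so that the extracted word has a class to join; the names and the example are hypothetical.

```python
def extract_candidate(template, fragment_labels, fragment_tokens, classes):
    """Return (class, word) if the fragment differs from the template in exactly one slot."""
    if len(template) != len(fragment_labels):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(template, fragment_labels)) if a != b]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    if template[i] not in classes:  # the differing slot must be a class slot
        return None
    return template[i], fragment_tokens[i]

classes = {"Obj", "Attr", "Dgr", "Sent"}
template = ("Obj", "'s", "Attr", "Dgr", "Sent")
tokens = ["restaurant", "'s", "price", "very", "steep"]
labels = ["Obj", "'s", "Attr", "Dgr", "steep"]  # "steep" is not yet in the dictionary
candidate = extract_candidate(template, labels, tokens, classes)
```

The unknown word "steep" sits where the template expects a sentiment word, so it becomes a candidate for the Sent class and is scored in the next step.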
Step 9: semantic word scoring. The score of a semantic word $t_j$ is the sum of the accuracy scores of the templates that extract it:
$$P(t_j) = \sum_{k,\ t_j \in T(pat_k)} P(pat_k)$$
Step 10: semantic dictionary expansion. The top 5% of semantic words by score are selected and added to the dictionary.
Steps 4 to 10 are iterated. Termination condition: the iteration stops when the selected semantic words are obviously incorrect.
Step 11: polarity determination. The polarity of sentiment words, and the polarity of collocations of sentiment words with evaluation object words and evaluation attribute words, are determined manually. In the manual determination, the review fragments corresponding to the template a word belongs to serve as the basis for judgment.
Results show that the invention performs well in both precision and recall, and produces a high-quality semantic dictionary and sentence pattern template library.
Optionally, the sentiment sentence templates of the invention are built, and sentences are compared and analyzed, as follows.
First, the invention collects online review data from the major mainstream review websites with a focused crawler. Then, for the large-scale review set, a semantic dictionary and a sentence pattern library are compiled semi-automatically. Finally, each sentence in a review is segmented and otherwise analyzed; on this basis, keywords or key sentence patterns are extracted as features, and sentiment classification is performed by a maximum entropy classifier. The procedure comprises the following steps:
Step 1: text normalization.
Internet review text often contains spelling errors, which we handle with a rule-based method. The rules are mappings from "a character string or word string containing wrong characters" to "the corresponding correct character string or word string". The rules are obtained in two ways: first, from existing knowledge, i.e., common spelling errors summarized by others; second, by extracting similar characters or words according to the context of each character or word, followed by manual verification. The method is simple and effective. The performance of this module depends on the number of spelling correction rules, and the rule base can be continually summarized and enriched during system operation and maintenance.
Chinese text also has the problem of full-width versus half-width punctuation; according to the full-width to half-width mapping of the symbols, punctuation marks are uniformly converted to half-width form.
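The full-width to half-width mapping mentioned above follows a fixed offset in Unicode, so it can be sketched without a lookup table for most symbols. The function name is hypothetical, and the extra mapping of the ideographic full stop "。" to "." is an assumption, since that character lies outside the full-width ASCII block.

```python
# Characters outside the full-width ASCII block that still need a half-width form.
EXTRA = {"\u3002": ".", "\u3000": " "}  # ideographic full stop, ideographic space

def to_half_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) to their half-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in EXTRA:
            out.append(EXTRA[ch])
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))  # fixed offset between the two blocks
        else:
            out.append(ch)  # Chinese characters and half-width text are untouched
    return "".join(out)

normalized = to_half_width("很好！价格？ＯＫ，不错。")
```

A spelling correction rule base of the kind described in this step could be applied in the same pass as a plain string-to-string replacement table.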
Step 2: review word segmentation.
Chinese word segmentation is a basic step in Chinese natural language processing. The invention combines dictionary-based segmentation with statistical segmentation: dictionary-based maximum matching is applied first, and the parts where the segmentation is ambiguous are then segmented by sequence labeling.
In dictionary-based maximum matching, given a dictionary and a Chinese character sequence to be segmented, the longest matching dictionary word is found at each position in turn; where nothing matches, the character is treated as a single-character word, and this continues until the whole sequence has been processed. Depending on the scanning direction, the method is divided into forward maximum matching (matching from left to right) and reverse maximum matching (matching from right to left). For example, for the sequence "当原子结合成分子时" ("when atoms combine into molecules"), the forward maximum matching result is "当|原子|结合|成|分子|时" (when | atom | combine | become | molecule | time), whereas the reverse maximum matching result is "当|原子|结合|成分|子时" (when | atom | combine | component | midnight hour). Clearly, neither forward nor reverse maximum matching handles segmentation ambiguity well. The two can also be combined into bidirectional maximum matching, in which the places where the forward and reverse results disagree are often places of potential ambiguity. Where there is ambiguity, the segmentation usually has to be decided from the specific context. Supervised sequence labeling can fully exploit rich contextual features, so the invention introduces sequence labeling to resolve the ambiguous cases. This method converts the word segmentation problem into a character classification problem: each character is given a position class tag according to its position within a word, namely word-initial, word-medial, word-final, or single-character word, denoted B (Begin), M (Middle), E (End), and S (Single) respectively. Given the tag sequence of the characters, the segmentation of the sentence is easy to determine: a character sequence whose tags satisfy the regular expression "S" or "B(M)*E" represents one word. To perform the sequence labeling task, the invention uses a conditional random field (CRF) model, which is widely used in natural language processing and has achieved great success. The features are: the previous character, the current character, the next character, the previous character together with the current character, and the current character together with the next character. Using these extracted features, the CRF model predicts the class tag of each character.
Both the dictionary used for maximum matching and the training corpus of the supervised conditional random field model come from 100,000 hotel reviews manually annotated for the invention.
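The sequence labeling side of this step can be illustrated without training a model. The sketch below shows how the five character features named above might be assembled and how a predicted B/M/E/S tag sequence is turned back into words. The CRF itself is not shown; the function names, the boundary markers `<BOS>`/`<EOS>`, and the example tags are assumptions.

```python
import re

def char_features(chars, i):
    """The five features listed above for the character at position i."""
    prev_c = chars[i - 1] if i > 0 else "<BOS>"
    next_c = chars[i + 1] if i < len(chars) - 1 else "<EOS>"
    return {
        "prev": prev_c,
        "cur": chars[i],
        "next": next_c,
        "prev+cur": prev_c + chars[i],
        "cur+next": chars[i] + next_c,
    }

def is_valid_tagging(tags):
    """Every word must be tagged as S or as B, any number of M, then E."""
    return re.fullmatch(r"(S|BM*E)*", "".join(tags)) is not None

def decode_bmes(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends in the middle of a word
        words.append(current)
    return words

# "The hotel is very clean", with tags as a trained tagger might predict them.
words = decode_bmes("酒店很干净", "BESBE")
```

In a full system the feature dictionaries would be passed to a CRF toolkit for training and prediction, and only spans where forward and reverse maximum matching disagree would need to be re-tagged.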
Step 3: element analysis.
Elements are the key factors that affect text sentiment analysis. They include the sentiment information elements mentioned above, such as the evaluation object words, evaluation attribute words, sentiment words, degree adverbs, ordinary adverbs, negation words, and parenthetical words in a review, as well as words of several other classes such as cities and scenic spots. Element analysis identifies the elements in a sentence and labels each with its corresponding class.
Step 4: sentence pattern matching.
After element analysis, a sentence takes the form of a sequence of semantic classes, i.e., a sentence pattern. A sentence pattern reflects the shared context of the words or elements in it, and therefore has some disambiguation ability. In sentence pattern matching, the existing sentence pattern library plays the key role; it reflects the sentence patterns commonly used to express sentiment in the domain. The sentence pattern library is a core resource of the invention. The invention extracts the sentence patterns from reviews by the Bootstrapping method.
Step 5: reference resolution.
Anaphora and ellipsis are common linguistic phenomena. Anaphora usually expresses coreference, i.e., two expressions refer to the same object. Anaphora has many types; we mainly address the cases in which a personal pronoun or a demonstrative pronoun is the anaphor. Ellipsis can be regarded as the case of a zero anaphor, so we treat anaphora and ellipsis together as "anaphora" in a broad sense. Reference resolution means finding the antecedent to which an anaphor corresponds, or restoring an omitted subject. If the current sentence contains neither an evaluation object word nor an evaluation attribute word, the most recently mentioned evaluation object or evaluation attribute word is introduced into the current sentence. If the current sentence contains only an evaluation attribute word, the evaluation object of the previous sentence, if there is one, is introduced into the current sentence.
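The two carry-over rules at the end of this step can be sketched as a single pass over the sentences of a review. The data layout (one dict per sentence with optional `obj` and `attr` fields) is hypothetical, and for simplicity the sketch uses the most recently mentioned object in both rules rather than strictly the object of the immediately preceding sentence.

```python
def resolve_references(sentences):
    """Fill in a missing evaluation object or attribute from earlier sentences."""
    last_obj = last_attr = None
    resolved = []
    for sent in sentences:
        obj, attr = sent.get("obj"), sent.get("attr")
        if obj is None and attr is None:
            # Neither is present: carry over the most recently mentioned ones.
            obj, attr = last_obj, last_attr
        elif obj is None:
            # Only an attribute is present: borrow the earlier object, if any.
            obj = last_obj
        resolved.append({"text": sent["text"], "obj": obj, "attr": attr})
        last_obj = obj or last_obj
        last_attr = attr or last_attr
    return resolved

review = [
    {"text": "the restaurant's price is high", "obj": "restaurant", "attr": "price"},
    {"text": "the service is slow", "attr": "service"},
    {"text": "really disappointing"},
]
resolved = resolve_references(review)
```

After resolution the second sentence is attached to the restaurant, and the third, which names neither an object nor an attribute, inherits "restaurant" and "service".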
Step 6: sentiment analysis.
Sentences containing an evaluation object word, an evaluation attribute word, or a sentiment word are taken as candidate sentiment sentences. For each candidate sentiment sentence, a maximum entropy (Maximum Entropy) model that incorporates rich contextual features is used to determine the polarity of the sentence, giving its sentiment orientation. In classification tasks, discriminative models often outperform generative models. A generative model estimates the joint probability distribution, either modeling the data directly or using Bayes' rule as an intermediate step to obtain the conditional probability. A discriminative model, by contrast, models the conditional probability directly, so that training and prediction are consistent and the classes are better separated. Among discriminative models, the maximum entropy model is widely used in natural language processing. For a classification problem in which a class y ∈ Y is to be predicted from given contextual information x ∈ X, the maximum entropy model builds a conditional probability model P(y|x) to predict the different classes y ∈ Y and estimate their probabilities. The classes are -1 (negative), 0 (neutral), and 1 (positive). The features include evaluation object words, evaluation attribute words, sentiment words, and their collocations, as well as features such as negation words and sentence patterns.
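The conditional model P(y|x) of a maximum entropy classifier has the standard log-linear form: each (feature, class) pair has a weight, and the class probabilities are a normalized exponential of the summed weights. The sketch below shows only this prediction step with hand-set, hypothetical weights; in practice the weights would be learned from annotated sentences.

```python
import math

CLASSES = (-1, 0, 1)  # negative, neutral, positive

def maxent_probs(features, weights, classes=CLASSES):
    """P(y|x) proportional to exp(sum of weights of the active (feature, y) pairs)."""
    scores = {y: sum(weights.get((f, y), 0.0) for f in features) for y in classes}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# Hypothetical weights: a sentiment word feature and a negation collocation feature.
weights = {
    ("sent=clean", 1): 2.0,
    ("neg+sent=clean", -1): 3.0,
}
plain = maxent_probs(["sent=clean"], weights)
negated = maxent_probs(["sent=clean", "neg+sent=clean"], weights)
pred_plain = max(plain, key=plain.get)
pred_negated = max(negated, key=negated.get)
```

The collocation feature illustrates why the feature set above includes combinations and negation words: "clean" alone points to a positive sentence, while the same word under negation is pulled toward the negative class.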
The beneficial effects of the invention are as follows: the solution of the invention makes effective use of hotel review data to form a user profile and, according to that profile, recommends to a specific user the hotels that best meet the user's needs. This significantly reduces the time and effort users spend searching for hotels on the Internet, and can also help hotels discover and remedy their own shortcomings and further improve and refine their distinctive features.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method of the invention for matching user tags with hotel tags based on hotel reviews.
Detailed description of the embodiments
The invention is further described below with reference to the drawing and the embodiments, but these descriptions do not limit the invention in any way. Unless otherwise stated, all scientific and technical terms used herein have the meanings commonly understood by those skilled in the art to which the invention belongs and in related technical fields.
Referring to Fig. 1, according to one non-limiting embodiment of the invention, a method for matching user tags with hotel tags based on hotel reviews is provided, which comprises the following steps.
In step S1, approximately 50,000 hotel user reviews are obtained from the Internet, and approximately 5,000 common hotel-industry words are filtered out from them, according to how frequently the words occur, in order to build the hotel-industry semantic dictionary.
In step S2, the hotel-industry sentiment sentence template library is prepared: from the approximately 50,000 hotel user reviews obtained from the Internet, approximately 500 sentiment sentences are filtered out, according to how frequently sentences occur, to serve as sentiment sentence templates.
In step S3, the final hotel tags of approximately 200 hotels are prepared, as follows. From the approximately 50,000 hotel user reviews obtained above, the user reviews of each of approximately 200 hotels are filtered out, including reviews by approximately 100 users for each hotel. The sentiment sentences of all user reviews of a specific hotel are compared one by one with the approximately 500 sentiment sentence templates, the sentiment sentences that match the templates are filtered out, the filtered sentences are identified as different dimensions according to the type of sentiment expressed, and the hotel tag set of the specific hotel is formed from all identified dimensions. For example, the hotel tag set of hotel No. 1 comprises: dimension 1 (cleanliness grade A), dimension 11 (transport convenience grade A), dimension 51 (surrounding environment index grade A), dimension 101 (room size grade A), etc.; the hotel tag set of hotel No. 2 comprises: dimension 2 (cleanliness grade B), dimension 12 (transport convenience grade B), dimension 52 (surrounding environment index grade B), dimension 102 (room size grade B), etc.; and the hotel tag set of hotel No. 3 comprises: dimension 3 (cleanliness grade C), dimension 13 (transport convenience grade C), dimension 53 (surrounding environment index grade C), dimension 103 (room size grade C), etc. The weight of each hotel tag in the hotel tag set of the specific hotel is then calculated, where a hotel tag's weight is higher the more frequently it occurs in all user reviews of the same hotel and the less frequently it occurs in all user reviews of all hotels. The hotel tags whose weight exceeds a second set threshold are selected from the hotel tag set as the final hotel tags of the specific hotel, the second set threshold being 0.4. This step is repeated until the final hotel tags of all hotels are obtained. The one-by-one comparison of the sentiment sentences of all user reviews of a specific hotel with the approximately 500 sentiment sentence templates may specifically comprise: segmenting a given sentiment sentence into several common hotel-industry words corresponding to entries in the hotel-industry semantic dictionary; comparing the sentence with each of the 500 sentiment sentence templates according to the attributes of each word in it, thereby determining whether it matches any one of the 500 templates; and repeating this process until all sentiment sentences that match the 500 templates have been filtered out.
In step S4, three user reviews by a specific user of three hotels are obtained from the Internet.
In step S5, the sentiment sentences of all the specific user's reviews are compared one by one with the approximately 500 sentiment sentence templates, the sentiment sentences that match the templates are filtered out, the filtered sentences are identified as different dimensions according to the type of sentiment expressed, and the specific user's user tag set is formed from all identified dimensions. For example, the user tag set of the specific user comprises: dimension 1 (cleanliness grade A), dimension 12 (transport convenience grade B), dimension 51 (surrounding environment index grade A), dimension 103 (room size grade C), etc. The one-by-one comparison of the sentiment sentences of all the specific user's reviews with the approximately 500 sentiment sentence templates specifically comprises: segmenting a given sentiment sentence into several common hotel-industry words corresponding to entries in the hotel-industry semantic dictionary; comparing the sentence with each of the 500 sentiment sentence templates according to the attributes of each word in it, thereby determining whether it matches any one of the 500 templates; and repeating this process until all sentiment sentences that match the 500 templates have been filtered out.
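The comparison of a segmented sentence with the template library can be sketched as follows. The sketch assumes that a template slot matches either the word itself or the word's semantic class, and that each template is associated with one dimension; the template contents, the dimension names, and the function names are all hypothetical, since the text does not spell out how a matched sentence is mapped to a dimension.

```python
def slot_matches(slot, token, label):
    """A template slot may name a concrete word or a semantic class."""
    return slot == token or slot == label

def find_dimension(tokens, labels, template_dims):
    """Return the dimension of the first template found as a contiguous run."""
    for template, dimension in template_dims.items():
        n = len(template)
        for i in range(len(tokens) - n + 1):
            window = zip(template, tokens[i:i + n], labels[i:i + n])
            if all(slot_matches(s, t, l) for s, t, l in window):
                return dimension
    return None

# Hypothetical template library: template -> dimension it expresses.
template_dims = {
    ("room", "Dgr", "clean"): "cleanliness: grade A",
    ("Obj", "Dgr", "far"): "transport convenience: grade C",
}
tokens = ["the", "room", "very", "clean"]
labels = ["the", "Obj", "Dgr", "Sent"]
dimension = find_dimension(tokens, labels, template_dims)

# A user tag set is the set of dimensions found across all of the user's sentences.
user_tag_set = {d for d in [dimension] if d is not None}
```

Sentences for which no template matches return `None` and contribute nothing to the tag set, which corresponds to the filtering described in this step.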
In step S6, the weight of each user tag in the specific user's user tag set is calculated, where a user tag's weight is higher the more frequently it occurs in all of the specific user's reviews and the less frequently it occurs in all reviews by all users of all hotels.
In step S7, the user tags whose weight exceeds a first set threshold are selected from the specific user's user tag set as the specific user's final user tags, the first set threshold being 0.6.
In step S8, the hotel whose final hotel tags have the highest matching rate with the specific user's final user tags is recommended to the specific user; for example, in this non-limiting embodiment, hotel No. 1 is recommended to this specific user.
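The text does not define the matching rate, so the sketch below makes an explicit assumption: the matching rate is the share of the user's final tags that also appear among a hotel's final tags. It further assumes, for illustration, that all the example dimensions listed in steps S3 and S5 survive the weight thresholds; the function names are hypothetical.

```python
def match_rate(user_tags, hotel_tags):
    """Assumed definition: fraction of the user's final tags shared by the hotel."""
    return len(user_tags & hotel_tags) / len(user_tags) if user_tags else 0.0

def recommend(user_tags, hotels, top_n=1):
    """Rank hotels by matching rate (ties keep their original order)."""
    ranked = sorted(hotels, key=lambda h: match_rate(user_tags, hotels[h]), reverse=True)
    return ranked[:top_n]

# Dimension numbers taken from the examples in steps S3 and S5.
hotels = {
    "hotel No. 1": {1, 11, 51, 101},
    "hotel No. 2": {2, 12, 52, 102},
    "hotel No. 3": {3, 13, 53, 103},
}
user_tags = {1, 12, 51, 103}
rates = {name: match_rate(user_tags, tags) for name, tags in hotels.items()}
best = recommend(user_tags, hotels)
```

Under this assumed definition hotel No. 1 shares two of the user's four tags and the other two hotels share one each, which agrees with the embodiment's outcome that hotel No. 1 is recommended. Passing `top_n=3` would give the "top three" variant of the recommendation.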
According to another non-limiting embodiment of the invention, a device for matching user tags with hotel tags based on hotel reviews is provided, comprising: a hotel-industry sentiment sentence template library generation module, the template library containing 1,000 sentiment sentence templates; a final hotel tag generation module, which generates the final hotel tags of 500 hotels; a user review acquisition module, which obtains from the Internet five user reviews by a specific user of different hotels; a user tag set generation module, which compares the sentiment sentences of all the specific user's reviews one by one with the 1,000 sentiment sentence templates, filters out the sentiment sentences that match the templates, identifies the filtered sentences as different dimensions according to the type of sentiment expressed, and forms the specific user's user tag set from all identified dimensions; a user tag weight calculation module, which calculates the weight of each user tag in the specific user's user tag set, where a user tag's weight is higher the more frequently it occurs in all of the specific user's reviews and the less frequently it occurs in all reviews by all users of all hotels; a final user tag generation module, which selects from the specific user's user tag set the user tags whose weight exceeds a first set threshold as the specific user's final user tags; and a hotel recommendation module, which recommends to the specific user the hotels whose final hotel tags rank in the top ten by matching rate with the specific user's final user tags.
The final hotel tag generation module obtains from the Internet, via the user review acquisition module, user reviews of each of the 500 hotels, including reviews by 200 users for each hotel. The final hotel tag generation module further comprises: a hotel tag set generation submodule, which compares the sentiment sentences of all user reviews of a specific hotel one by one with the 1,000 sentiment sentence templates, filters out the sentiment sentences that match the templates, identifies the filtered sentences as different dimensions according to the type of sentiment expressed, and forms the hotel tag set of the specific hotel from all identified dimensions; and a hotel tag weight calculation submodule, which calculates the weight of each hotel tag in the hotel tag set of the specific hotel, where a hotel tag's weight is higher the more frequently it occurs in all user reviews of the same hotel and the less frequently it occurs in all user reviews of all hotels. The final hotel tag generation module selects from the hotel tag set the hotel tags whose weight exceeds a second set threshold as the final hotel tags of the specific hotel.
The hotel-industry sentiment sentence template library generation module obtains 100,000 hotel user reviews from the Internet via the user review acquisition module and, according to how frequently sentences occur, filters out 1,000 sentiment sentences from them to serve as sentiment sentence templates.
The device of the invention further comprises a hotel-industry semantic dictionary generation module, which filters out 10,000 common hotel-industry words from the 100,000 hotel user reviews according to how frequently the words occur, in order to build the hotel-industry semantic dictionary.
The invention is described in further detail below with reference to a specific embodiment, but the embodiment should not be construed as limiting the scope of the invention.
A method for matching user tags with hotel tags based on hotel reviews comprises the following steps:
Step 1: collect online review data from mainstream review websites such as Ctrip with a focused crawler;
Step 2: filter out spam reviews, which include meaningless sentences;
Step 3: build the hotel-industry semantic dictionary and sentence pattern template library;
Step 4: perform sentiment analysis on the reviews.
Step 5: tag analysis.
For each sentence in a review that expresses sentiment, the opinion it expresses is mined and represented as a tag.
Step 6: aggregate review fragments by tag, and calculate the weights of the different tags of different users with the TF-IDF algorithm. TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for assessing how important a word is to a document, and is widely used in fields such as information retrieval and text feature selection and computation. The main idea of TF-IDF is that if a word occurs very frequently in one document and rarely in other documents, the word is considered to have good power to discriminate between classes and is suitable for characterizing that document.
TF-IDF is in fact the product of TF and IDF. TF stands for term frequency, the frequency with which a given word occurs in a document; it is a normalization of the raw word count that prevents a bias toward long documents. It is calculated as follows:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
where $tf_{i,j}$ is the term frequency of word $i$ in document $j$, $n_{i,j}$ is the number of occurrences of word $i$ in document $j$, and $\sum_k n_{k,j}$ is the total number of occurrences of all words in document $j$.
IDF stands for inverse document frequency, a measure of the general importance of a word. It is calculated as follows:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $idf_i$ is the inverse document frequency of word $i$ in the corpus, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing word $i$. If the word does not occur in the corpus, the denominator would be zero, so $|\{j : t_i \in d_j\}| + 1$ is generally used as the denominator.
With TF and IDF available, TF-IDF is calculated as follows:
$$tfidf_{i,j} = tf_{i,j} \times idf_i$$
A high frequency of a word within a particular document, combined with a low document frequency of that word across the whole document collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
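The three formulas above translate directly into code. In the sketch below each "document" is the list of tags mined from one user's reviews (the same function would apply to the tags of one hotel), the denominator uses the +1 variant mentioned above, and the natural logarithm is used since the text does not fix the base; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: {doc_id: [tag, ...]} -> {doc_id: {tag: tf-idf weight}}."""
    doc_freq = Counter()
    for tags in docs.values():
        doc_freq.update(set(tags))  # count each tag once per document
    n_docs = len(docs)
    weights = {}
    for doc_id, tags in docs.items():
        counts = Counter(tags)
        total = sum(counts.values())
        weights[doc_id] = {
            tag: (c / total) * math.log(n_docs / (doc_freq[tag] + 1))
            for tag, c in counts.items()
        }
    return weights

docs = {
    "user 1": ["clean", "clean", "quiet"],
    "user 2": ["clean", "cheap"],
    "user 3": ["cheap", "quiet"],
    "user 4": ["cheap"],
}
weights = tf_idf(docs)
```

For user 1, "clean" outweighs "quiet" because it occurs twice as often in that user's reviews while both tags are equally rare overall. Note that with the +1 denominator a tag present in all but one document gets a weight of exactly zero, and a tag present in every document gets a negative weight, so very common tags never pass a positive threshold.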
Step 7: for each hotel and each user, tags are selected according to their TF-IDF weights and the preset thresholds, yielding the final hotel tags and final user tags.
Although preferred embodiments of the invention have been described in detail here, it should be understood that the invention is not limited to the specific structures described and illustrated, and that those skilled in the art can make other modifications and variations without departing from the spirit and scope of the invention.

Claims (10)

1. A method for matching user tags with hotel tags based on hotel reviews, comprising:
(1) preparing a hotel-industry sentiment sentence template library, the hotel-industry sentiment sentence template library containing at least 100 sentiment sentence templates;
(2) preparing the final hotel tags of at least three hotels;
(3) obtaining from the Internet at least two user reviews by a specific user of the same hotel or of different hotels;
(4) comparing the sentiment sentences of all user reviews of the specific user one by one with the at least 100 sentiment sentence templates, filtering out the sentiment sentences that match the at least 100 sentiment sentence templates, identifying the filtered sentiment sentences as different dimensions according to the type of sentiment expressed, and forming the user tag set of the specific user from all identified dimensions;
(5) calculating the weight of each user tag in the user tag set of the specific user, wherein a user tag's weight is higher the more frequently it occurs in all user reviews of the specific user and the less frequently it occurs in all user reviews by all users of all hotels;
(6) selecting from the user tag set of the specific user the user tags whose weight exceeds a first set threshold as the final user tags of the specific user; and
(7) recommending to the specific user at least the hotels whose final hotel tags rank in the top three by matching rate with the final user tags of the specific user.
2. The method for matching user tags with hotel tags based on hotel reviews according to claim 1, characterized in that preparing the final hotel tags of at least three hotels in step (2) comprises:
(2.1) obtaining from the Internet user reviews of each of at least three hotels, including user reviews by at least three users for each hotel;
(2.2) comparing the sentiment sentences of all user reviews of a specific hotel one by one with the at least 100 sentiment sentence templates, filtering out the sentiment sentences that match the at least 100 sentiment sentence templates, identifying the filtered sentiment sentences as different dimensions according to the type of sentiment expressed, and forming the hotel tag set of the specific hotel from all identified dimensions;
(2.3) calculating the weight of each hotel tag in the hotel tag set of the specific hotel, wherein a hotel tag's weight is higher the more frequently it occurs in all user reviews of the same hotel and the less frequently it occurs in all user reviews of all hotels;
(2.4) selecting from the hotel tag set the hotel tags whose weight exceeds a second set threshold as the final hotel tags of the specific hotel; and
(2.5) repeating steps (2.2)-(2.4) until the final hotel tags of all hotels are obtained.
3. The method for matching user tags with hotel tags based on hotel comments according to claim 2, characterized in that step (1) further comprises, before preparing the hotel industry emotion statement template library, a step of building a hotel industry semantic dictionary, and that comparing the emotion statements of all user comments of the specific user one by one with the at least 100 emotion statement templates in step (4) comprises:
(4.1) segmenting a particular emotion statement into several common hotel industry words corresponding to the hotel industry semantic dictionary;
(4.2) comparing the particular emotion statement with the at least 100 emotion statement templates according to the attributes of each word in it, thereby determining whether it matches any one of the at least 100 emotion statement templates; and
(4.3) repeating steps (4.1) to (4.2) until all emotion statements that match the at least 100 emotion statement templates have been screened out.
4. The method for matching user tags with hotel tags based on hotel comments according to claim 3, characterized in that comparing the emotion statements of all user comments on the specific hotel one by one with the at least 100 emotion statement templates in step (2.2) comprises:
(2.2.1) segmenting a particular emotion statement into several common hotel industry words corresponding to the hotel industry semantic dictionary;
(2.2.2) comparing the particular emotion statement with the at least 100 emotion statement templates according to the attributes of each word in it, thereby determining whether it matches any one of the at least 100 emotion statement templates; and
(2.2.3) repeating steps (2.2.1) to (2.2.2) until all emotion statements that match the at least 100 emotion statement templates have been screened out.
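By way of non-limiting illustration, the segmentation and template comparison of steps (4.1) to (4.3) and (2.2.1) to (2.2.3) can be sketched as follows. The claims do not fix a data representation, so everything concrete here is an assumption: the semantic dictionary is a mapping from word to semantic class, a template is a tuple of semantic classes, segmentation is a greedy longest match against the dictionary, and the names `segment`, `matches_template` and `screen` are hypothetical.

```python
def segment(sentence, semantic_dict):
    """Split a sentence into the dictionary words it contains (greedy longest match)."""
    words, i = [], 0
    max_len = max(map(len, semantic_dict), default=1)
    while i < len(sentence):
        # Try the longest candidate starting at position i first.
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in semantic_dict:
                words.append(sentence[i:j])
                i = j
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return words


def matches_template(sentence, semantic_dict, templates):
    """Compare the semantic classes of the segmented words with every template."""
    classes = tuple(semantic_dict[word] for word in segment(sentence, semantic_dict))
    return classes in templates


def screen(sentences, semantic_dict, templates):
    """Repeat the comparison for every sentence and keep only the matching ones."""
    return [s for s in sentences if matches_template(s, semantic_dict, templates)]
```

Matching on semantic classes rather than on literal words is what lets a single template cover many differently worded comments.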
5. The method for matching user tags with hotel tags based on hotel comments according to claim 4, characterized in that, in step (3), the user comments are obtained from comment websites by a focused crawler.
6. The method for matching user tags with hotel tags based on hotel comments according to claim 5, characterized in that, in step (1), the hotel industry emotion statement template library is prepared by extracting sentence-pattern templates from user comments using a bootstrapping method.
7. The method for matching user tags with hotel tags based on hotel comments according to claim 6, characterized in that the steps of preparing the hotel industry emotion statement template library and building the hotel industry semantic dictionary comprise:
(1.1) obtaining comment data, and compiling the words of each emotion element to form a seed dictionary;
(1.2) performing word segmentation on the sentences of the comment data, then determining the semantic class of each word and replacing the word with its semantic class label;
(1.3) splitting the label-replaced comment data into clauses, and generating templates according to the name of each semantic class and the specific words each semantic class contains;
(1.4) applying the templates to the comment data after semantic class label replacement, so as to extract the semantic words of each semantic class;
(1.5) scoring each template according to its importance, generality and accuracy;
(1.6) selecting the highest-scoring portion of the templates, calculating the score of the semantic words extracted by each template according to the selected templates and their scores, and then adding the highest-scoring portion of the semantic words to the semantic dictionary; and
(1.7) iterating steps (1.2) to (1.6) until the selected semantic words are incorrect, at which point the iteration terminates, yielding the final hotel industry semantic dictionary, with the templates forming the hotel industry emotion statement template library.
8. The method for matching user tags with hotel tags based on hotel comments according to claim 7, characterized in that the highest-scoring portion of the templates in step (1.6) is the top 5% to 10% of templates by score, and the highest-scoring portion of the semantic words is the top 5% to 10% of semantic words by score.
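By way of non-limiting illustration, the bootstrapping loop of steps (1.2) to (1.7) can be sketched as follows. Several simplifications are assumptions and not part of the claims: sentences arrive already segmented into word lists, a template is a (left neighbor, semantic class, right neighbor) triple, the template score is a stand-in (known words covered times new words extracted) for the importance, generality and accuracy named in step (1.5), and the correctness check of step (1.7) is delegated to a caller-supplied `is_correct` function. The names `top_part` and `bootstrap` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def top_part(scores, fraction):
    """Return the highest-scoring fraction of the keys (at least one key)."""
    k = max(1, math.ceil(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]


def bootstrap(sentences, seed, fraction=0.1, is_correct=lambda words: True, max_rounds=10):
    lexicon = dict(seed)  # word -> semantic class label
    templates = set()
    padded = [["<s>"] + list(s) + ["</s>"] for s in sentences]
    for _ in range(max_rounds):
        classes = set(lexicon.values())
        # Step (1.2): replace known words with their semantic class label.
        labelled = [[lexicon.get(w, w) for w in s] for s in padded]
        # Step (1.3): the context around each labelled word becomes a template.
        support = Counter()
        for lab in labelled:
            for i in range(1, len(lab) - 1):
                if lab[i] in classes:
                    support[(lab[i - 1], lab[i], lab[i + 1])] += 1
        # Step (1.4): apply the templates to extract candidate semantic words.
        extracted = defaultdict(set)
        for raw, lab in zip(padded, labelled):
            for i in range(1, len(lab) - 1):
                if lab[i] in classes:
                    continue
                for cls in classes:
                    template = (lab[i - 1], cls, lab[i + 1])
                    if template in support:
                        extracted[template].add(raw[i])
        if not extracted:
            break
        # Step (1.5): stand-in score, known words covered times new words found.
        t_score = {t: support[t] * len(ws) for t, ws in extracted.items()}
        # Step (1.6): keep the best templates, then the best words they extract.
        best = top_part(t_score, fraction)
        w_score = Counter()
        for t in best:
            for word in extracted[t]:
                w_score[(word, t[1])] += t_score[t]
        new_words = top_part(w_score, fraction)
        # Step (1.7): stop once the selected words are judged incorrect.
        if not is_correct(new_words):
            break
        templates.update(best)
        lexicon.update(new_words)
    return lexicon, templates
```

The default `fraction=0.1` corresponds to the upper end of the 5% to 10% range of claim 8. In practice `is_correct` would typically be a manual review of each round's new words.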
9. A device for matching user tags with hotel tags based on hotel comments, comprising:
a hotel industry emotion statement template library generation module, the hotel industry emotion statement template library comprising at least 100 emotion statement templates;
a final hotel tag generation module, which generates final hotel tags for at least three hotels;
a user comment acquisition module, which obtains from the internet at least two user comments made by a specific user on the same hotel or on different hotels;
a user tag set generation module, which compares the emotion statements of all user comments of the specific user one by one with the at least 100 emotion statement templates, screens out the emotion statements that match the at least 100 emotion statement templates, identifies the screened-out emotion statements as different dimensions according to the emotion type they express, and forms a user tag set of the specific user from all identified dimensions;
a user tag weight calculation module, which calculates the weight of each user tag in the user tag set of the specific user, wherein the higher the frequency with which a user tag occurs in all user comments of the specific user, and the lower the frequency with which it occurs in all user comments of all users on all hotels, the higher the weight of the user tag;
a final user tag generation module, which selects, from the user tag set of the specific user, the user tags whose weight is greater than a first set threshold as the final user tags of the specific user; and
a hotel recommendation module, which recommends to the specific user at least the hotels ranked in the top three by the matching rate between their final hotel tags and the final user tags of the specific user.
10. The device for matching user tags with hotel tags based on hotel comments according to claim 9, characterized in that the final hotel tag generation module obtains from the internet, through the user comment acquisition module, user comments on each of at least three hotels, including user comments from at least three users for each hotel;
the final hotel tag generation module further comprises:
a hotel tag set generation submodule, which compares the emotion statements of all user comments on a specific hotel one by one with the at least 100 emotion statement templates, screens out the emotion statements that match the at least 100 emotion statement templates, identifies the screened-out emotion statements as different dimensions according to the emotion type they express, and forms a hotel tag set of the specific hotel from all identified dimensions; and
a hotel tag weight calculation submodule, which calculates the weight of each hotel tag in the hotel tag set of the specific hotel, wherein the higher the frequency with which a hotel tag occurs in all user comments on the same hotel, and the lower the frequency with which it occurs in all user comments on all hotels, the higher the weight of the hotel tag;
wherein the final hotel tag generation module selects, from the hotel tag set, the hotel tags whose weight is greater than a second set threshold as the final hotel tags of the specific hotel.
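By way of non-limiting illustration, the ranking performed by the hotel recommendation module can be sketched as follows. The claims do not define the matching rate, so taking it as the share of the user's final tags that also appear among a hotel's final tags is an assumption, as are the names `match_rate` and `recommend` and the representation of tags as sets.

```python
def match_rate(user_tags, hotel_tags):
    """Share of the user's final tags found among the hotel's final tags (assumed definition)."""
    if not user_tags:
        return 0.0
    return len(user_tags & hotel_tags) / len(user_tags)


def recommend(user_tags, hotels, top_n=3):
    """Return the names of the top_n hotels by matching rate.

    hotels: mapping from hotel name to the set of its final hotel tags.
    """
    ranked = sorted(
        hotels,
        key=lambda name: match_rate(user_tags, hotels[name]),
        reverse=True,
    )
    return ranked[:top_n]
```

Because at least three hotels have final hotel tags, a top-three ranking is always available. Ties keep the input order here, and a real system would likely need an explicit tie-breaking rule.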
CN201510593613.5A 2015-09-17 2015-09-17 User label and hotel label matching method and device based on hotel comments Pending CN105205699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510593613.5A CN105205699A (en) 2015-09-17 2015-09-17 User label and hotel label matching method and device based on hotel comments

Publications (1)

Publication Number Publication Date
CN105205699A true CN105205699A (en) 2015-12-30

Family

ID=54953364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510593613.5A Pending CN105205699A (en) 2015-09-17 2015-09-17 User label and hotel label matching method and device based on hotel comments

Country Status (1)

Country Link
CN (1) CN105205699A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984984A (en) * 2014-06-11 2014-08-13 张劲松 Hotel room-reservation system and realizing method thereof
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105095508A (en) * 2015-08-31 2015-11-25 北京奇艺世纪科技有限公司 Multimedia content recommendation method and multimedia content recommendation apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
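娄小丰 (Lou Xiaofeng): "基于多属性打分的酒店推荐算法研究" [Research on Hotel Recommendation Algorithms Based on Multi-Attribute Scoring], 《中国优秀硕士学位论文全文数据库》 [China Master's Theses Full-text Database] *
聂卉, 杜嘉忠 (Nie Hui, Du Jiazhong): "依存句法模板下的商品特征标签抽取研究" [Research on Product Feature Tag Extraction Using Dependency Syntax Templates], 《现代图书情报技术》 [New Technology of Library and Information Service] *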

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100088 Madian East Road, Haidian District, No. 17,, golden floor, International Building, 18

Applicant after: Beijing Zhong Hui Information Technology Limited by Share Ltd

Address before: 100088 Madian East Road, Haidian District, No. 17,, golden floor, International Building, 18

Applicant before: BEIJING ZHONGHUI INFORMATION TECHNOLOGY CO., LTD.

COR Change of bibliographic data
RJ01 Rejection of invention patent application after publication

Application publication date: 20151230

RJ01 Rejection of invention patent application after publication