WO2017092337A1 - Comment tag extraction method and apparatus - Google Patents

Comment tag extraction method and apparatus

Info

Publication number
WO2017092337A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
word
comment
group
subject
Prior art date
Application number
PCT/CN2016/089277
Other languages
English (en)
French (fr)
Inventor
康潮明
Original Assignee
乐视控股(北京)有限公司
乐视网信息技术(北京)股份有限公司
Priority date
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司 and 乐视网信息技术(北京)股份有限公司
Priority to US15/249,677 (US20170154077A1)
Publication of WO2017092337A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis

Definitions

  • the present invention relates to the field of tag extraction technologies, and in particular, to a comment tag extraction method and apparatus.
  • At present, comment tag extraction is mainly achieved through the following two schemes:
  • The first scheme relies on manually searching and sorting user comments and extracting certain words from them as the object's comment tags.
  • This kind of comment tag extraction scheme takes a long time and requires considerable human resources.
  • Moreover, because manual screening of words is usually highly subjective, the extracted comment tags often fail to reflect the characteristics of the object objectively, resulting in low accuracy of the extracted comment tags.
  • The second scheme extracts comment tags directly by text tag extraction: words are extracted from each comment based on part of speech and templates, or screened from the comments according to their frequency of occurrence, to serve as the object's comment tags.
  • Although the second scheme completes tag extraction automatically and, compared with the first scheme, saves considerable human resources and processing time, it ignores the interrelationship among the comments, so the correlation between the extracted tags and the comments is low, and the accuracy of the extracted comment tags remains low.
  • The present invention provides a comment tag extraction method and apparatus to solve the problem of low accuracy of the comment tags extracted by the existing schemes.
  • To solve the above problem, the present invention discloses a comment tag extraction method, the method comprising: performing two-tuple extraction on each comment corresponding to the current object to be processed, and combining the extracted two-tuples into a first set, wherein each two-tuple includes a subject word and a modifier; determining the words in the comments whose term frequency-inverse document frequency (TF-IDF) is greater than a first set threshold, and combining the determined words into a second set; processing the first set and the second set according to a first setting rule to generate a third set; determining the words in the comments whose topic weight value is greater than a second set threshold, and combining the determined words into a fourth set; performing intersection processing on the third set and the fourth set to obtain a fifth set; and de-duplicating the words in the fifth set and determining the words remaining after de-duplication as the comment tags of the current object to be processed.
  • To solve the above problem, the present invention also discloses a comment tag extraction apparatus, the apparatus comprising: a two-tuple extraction module, configured to perform two-tuple extraction on each comment corresponding to the current object to be processed and combine the extracted two-tuples into a first set, wherein each two-tuple includes a subject word and a modifier; a first combination module, configured to determine the words in the comments whose term frequency-inverse document frequency (TF-IDF) is greater than a first set threshold and combine the determined words into a second set; a second combination module, configured to process the first set and the second set according to a first setting rule to generate a third set; a third combination module, configured to determine the words in the comments whose topic weight value is greater than a second set threshold and combine the determined words into a fourth set; a fourth combination module, configured to perform intersection processing on the third set and the fourth set to obtain a fifth set; and a de-duplication module, configured to de-duplicate the words in the fifth set and determine the words remaining after de-duplication as the comment tags of the current object to be processed.
  • Embodiments of the present invention provide a computer program comprising computer readable code that, when executed on an electronic device, causes the electronic device to perform the comment tag extraction method described above.
  • Embodiments of the present invention provide a computer readable medium in which the above computer program is stored.
  • The comment tag extraction method and apparatus provided by the present invention construct two-tuples of words by performing lexical and grammatical analysis on each sentence of each comment. This makes effective use of the contextual relationships between the words in a comment, filters out isolated, meaningless noise words, narrows the range of words serving as candidate comment tags, and accordingly improves the accuracy of the extracted tags.
  • In addition, when screening words as candidate comment tags, the method and apparatus provided by the present invention also screen the words by their topic weight values: words whose topic weight value is less than or equal to the second set threshold are filtered out, and words closely related to the topic of the comments are retained, which further improves the accuracy of the extracted tags.
  • FIG. 1 is a flow chart showing the steps of a method for extracting comment tags according to a first embodiment of the present invention
  • FIG. 2 is a flow chart showing the steps of a method for extracting comment tags according to Embodiment 2 of the present invention
  • FIG. 3 is a flow chart showing the steps of performing comment label extraction by using the method shown in Embodiment 2 of the present invention.
  • Figure 4 is a probability map of the LDA model
  • FIG. 5 is a structural block diagram of a comment label extracting apparatus according to Embodiment 3 of the present invention.
  • FIG. 6 is a structural block diagram of a comment label extracting apparatus according to Embodiment 4 of the present invention.
  • Figure 7 shows schematically a block diagram of an electronic device for performing the method according to the invention.
  • Fig. 8 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • Referring to FIG. 1, a flow chart of the steps of a comment tag extraction method according to Embodiment 1 of the present invention is shown. The comment tag extraction method of this embodiment includes the following steps.
  • Step S102: Perform two-tuple extraction on each comment corresponding to the current object to be processed, and combine the extracted two-tuples into a first set.
  • The object to be processed may be a song, a movie, an item, or the like, and the comments corresponding to the current object to be processed are the comments about that object. For example, if comment tags need to be extracted from the many comments on a movie, the movie is the object to be processed, and all comments on the movie are the comments corresponding to the current object to be processed.
  • The two-tuple includes a subject word and a modifier, for example: <song, classic>. By analyzing the words that make up the sentences in each comment and their grammar, the two-tuples contained in each comment are obtained, and the two-tuples of all the comments are then combined into the first set.
  • Step S104: Determine the words in each comment whose TF-IDF is greater than a first set threshold, and combine the determined words into a second set.
  • the first set threshold may be set by a person skilled in the art according to the actual requirements in the specific implementation process, and is not specifically limited in the embodiment of the present invention.
  • Step S106: Process the first set and the second set according to a first setting rule to generate a third set.
  • the first setting rule can be set by a person skilled in the art according to actual needs, and is not specifically limited in the embodiment of the present invention.
  • For example, the first setting rule may be set to extract the subject words from the first set to form a subject word set, and to perform a union operation on the subject word set and the second set.
  • Alternatively, the first setting rule may be set to extract the modifiers from the first set to form a modifier set, and to perform a union operation on the modifier set and the second set.
  • As another example, the first setting rule may be set to perform a union operation directly on the first set and the second set.
  • Step S108: Determine the words in each comment whose topic weight value is greater than a second set threshold, and combine the determined words into a fourth set.
  • the second set threshold may be set by a person skilled in the art according to actual needs, which is not specifically limited in the embodiment of the present invention.
  • Step S110 Perform intersection processing on the third set and the fourth set to obtain a fifth set.
  • Taking the intersection means extracting the elements common to the two sets to form a new set. For example, if the third set contains words A and B and the fourth set contains words A and C, then taking their intersection extracts word A to form the fifth set.
  • Step S112: De-duplicate the words in the fifth set, and determine the words remaining after de-duplication as the comment tags of the current object to be processed.
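  • As an illustration only (not part of the disclosure), the flow of steps S102 to S112 can be written down with plain set operations as below; the helpers extract_two_tuples, tfidf_words, topic_weight_words and dedup are hypothetical placeholders for the operations described in this document, and the thresholds are the example values used here.

    def extract_comment_tags(comments, extract_two_tuples, tfidf_words,
                             topic_weight_words, dedup,
                             tfidf_threshold=0.75, topic_threshold=0.8):
        # Step S102: two-tuple extraction over every comment -> first set
        first_set = {pair for c in comments for pair in extract_two_tuples(c)}
        # Step S104: words whose TF-IDF exceeds the first set threshold -> second set
        second_set = tfidf_words(comments, tfidf_threshold)
        # Step S106: one possible first setting rule: take the modifiers from the
        # two-tuples and union them with the second set -> third set
        modifier_set = {modifier for _subject, modifier in first_set}
        third_set = modifier_set | second_set
        # Step S108: words whose topic weight exceeds the second set threshold -> fourth set
        fourth_set = topic_weight_words(comments, topic_threshold)
        # Step S110: intersection of the third and fourth sets -> fifth set
        fifth_set = third_set & fourth_set
        # Step S112: de-duplicate near-identical words; the remainder are the comment tags
        return dedup(fifth_set)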
  • With the comment tag extraction method provided by this embodiment of the present invention, two-tuples of words are constructed by performing lexical and grammatical analysis on each sentence of each comment, so the contextual relationships between the words in a comment can be used effectively, isolated and meaningless noise words are filtered out, the range of words serving as candidate comment tags is narrowed, and the accuracy of the extracted comment tags is accordingly improved.
  • In addition, when screening words as candidate comment tags, the method also screens the words by their topic weight values: words whose topic weight value is less than or equal to the second set threshold are filtered out, and words closely related to the topic of the comments are retained, which further improves the accuracy of the extracted comment tags.
  • Referring to FIG. 2, a flow chart of the steps of a comment tag extraction method according to Embodiment 2 of the present invention is shown. The comment tag extraction method of this embodiment specifically includes the following steps.
  • Step S202: The processing device performs two-tuple extraction on each comment corresponding to the current object to be processed, and combines the extracted two-tuples into a first set.
  • the processing device may be any device having a computing function, such as a server, a computer, or the like.
  • The two-tuple includes a subject word and a modifier.
  • Optionally, the two-tuple extraction is performed as follows: for each comment, each sentence contained in the comment is segmented into words and the part of speech of each word is determined; syntactic analysis is then performed on the parts of speech to obtain the modification relationships between the words in each sentence, and the two-tuple corresponding to each sentence is constructed according to these modification relationships. Processing every comment in this way determines all the two-tuples.
  • For example, if the current comment contains the sentence "Wang Feng's song is very classic, the lyrics are very inspirational", the two-tuples determined after the above word segmentation, part-of-speech determination and syntactic analysis are: <song, classic> and <lyrics, inspirational>.
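  • Purely as an illustrative sketch (the patent does not name a parser or a tag set), <subject word, modifier> two-tuples could be read off a dependency parse roughly as follows; the Token structure and the "nsubj"/"amod" relation names are assumptions, not part of the disclosure.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Token:
        text: str   # surface form of the word
        pos: str    # part-of-speech tag, e.g. "NOUN" or "ADJ"
        head: int   # index of the syntactic head of this token
        dep: str    # dependency relation to the head, e.g. "nsubj", "amod"

    def extract_two_tuples(tokens: List[Token]) -> List[Tuple[str, str]]:
        """Collect <subject word, modifier> pairs from one parsed sentence."""
        pairs = []
        for tok in tokens:
            # a noun that is the subject of a predicative adjective: "the song is classic"
            if tok.dep == "nsubj" and tokens[tok.head].pos == "ADJ":
                pairs.append((tok.text, tokens[tok.head].text))
            # an adjective directly modifying a noun: "inspirational lyrics"
            elif tok.pos == "ADJ" and tok.dep == "amod":
                pairs.append((tokens[tok.head].text, tok.text))
        return pairs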
  • Step S204: The processing device determines the words in each comment whose TF-IDF is greater than a first set threshold, and combines the determined words into a second set.
  • The TF-IDF of a word is the product of its TF (term frequency) and its IDF (inverse document frequency).
  • The specific way of calculating TF may be set according to actual needs, for example TF = number of occurrences of the word in a comment / total number of words in that comment, or simply TF = number of occurrences of the word in a comment.
  • The specific way of calculating IDF may likewise be set according to actual needs, for example IDF = log(total number of comments for the object to be processed / (number of comments containing the word + 1)), or IDF = log(total number of comments for the object to be processed / number of comments containing the word).
  • Optionally, the first set threshold is 0.75; of course, it is not limited to this value and may also be 0.7, 0.8, or the like. In a specific implementation, a person skilled in the art can set the first set threshold to any appropriate value according to actual needs.
  • After the TF-IDF of each word is determined, it is compared with the first set threshold to determine the words whose TF-IDF is greater than the first set threshold, and these words are grouped into the second set.
  • Step S206: The processing device extracts the modifiers or the subject words contained in the two-tuples in the first set to form a modifier set or a subject word set.
  • The first set contains a plurality of two-tuples, and each two-tuple contains one modifier and one subject word. In this step, the modifiers contained in the two-tuples are extracted, and the extracted modifiers form a modifier set.
  • For example, if the first set contains the two-tuples <song, classic> and <lyrics, inspirational>, the extracted modifiers are "classic" and "inspirational", and "classic" and "inspirational" form the modifier set.
  • Of course, the subject words contained in the two-tuples may be extracted instead, with the extracted subject words forming a subject word set.
  • Step S208: The processing device performs a union operation on the modifier set (or the subject word set) and the second set to generate a third set.
  • For example, if the modifier set contains the words A, B and C and the second set contains the words A, D and E, then the third set generated by taking their union contains the words A, B, C, D and E.
  • Step S210: The processing device determines the topic weight value of each word in each comment according to a latent Dirichlet allocation (LDA) model.
  • Through the LDA (latent Dirichlet allocation) model, the topic influence of a word in a document, that is, its topic weight value, can be calculated. For the specific calculation, reference may be made to the related art, which is not specifically limited in this embodiment.
  • Accordingly, by treating the comments as documents, the topic weight value of each word over all the comments can be determined.
  • Step S212: The processing device compares the topic weight value of each word with the second set threshold to determine the words whose topic weight value is greater than the second set threshold, and combines the determined words into a fourth set.
  • the second set threshold may be set by a person skilled in the art according to actual needs.
  • the second set threshold is set to 0.8.
  • it is not limited to this value, and can also be set to values of 0.7, 0.75, and 0.85.
  • This step may filter out words whose subject weight value is less than or equal to the second set threshold, and retain words closely related to the subject of the comment to improve the accuracy of the extracted comment tags.
  • Step S214 The processing device performs intersection processing on the third set and the fourth set to obtain a fifth set.
  • Step S216 The processing device de-duplicates the words in the fifth set, and determines the remaining words after the de-duplication as the comment tags of the current object to be processed.
  • An optional way of de-duplicating the words in the fifth set is as follows. First, the words in the fifth set are combined pairwise into word groups.
  • For example, if the fifth set contains the words A, B, C and D, then A and B, A and C, A and D, B and C, B and D, and C and D are combined, forming multiple word groups.
  • Next, for each word group, the similarity value of the two words in the current word group is determined according to the minimum edit distance and the part-of-speech similarity of the two words. Optionally, the similarity is calculated with the formula P(S,T) = α/(D(S,T)+1) + β·Sim(pos), where S and T are the two words in the word group, P(S,T) is the similarity of the two words, D(S,T) is their minimum edit distance, Sim(pos) is their part-of-speech similarity, and α and β are weight coefficients. Sim(pos) is 1 if S and T have the same part of speech and 0 otherwise; α + β = 1 and P(S,T) ∈ [0, 1].
  • When D(S,T) = 0 and Sim(pos) = 1, that is, the minimum edit distance between S and T is 0 and their parts of speech are the same, P(S,T) = 1, indicating the largest similarity between S and T. When Sim(pos) = 0, the larger D(S,T) is, the smaller P(S,T) is, indicating that S and T are less similar.
  • Optionally, α can be set to 0.6 and β to 0.4.
  • Finally, one word is deleted from each word group whose similarity value is greater than a third set threshold; if the similarity value of a word group is less than or equal to the third set threshold, no word is deleted. Processing every word group in this way completes the de-duplication of the fifth set.
  • With the comment tag extraction method provided by this embodiment of the present invention, two-tuples of words are constructed by performing lexical and grammatical analysis on each sentence of each comment, so the contextual relationships between the words in a comment can be used effectively, isolated and meaningless noise words are filtered out, the range of words serving as candidate comment tags is narrowed, and the accuracy of the extracted comment tags is accordingly improved.
  • In addition, when screening words as candidate comment tags, the method also screens the words by their topic weight values: words whose topic weight value is less than or equal to the second set threshold are filtered out, and words closely related to the topic of the comments are retained, which further improves the accuracy of the extracted comment tags.
  • The comment tag extraction method of this embodiment is described below with reference to FIG. 3, using a specific example in which a song is the object to be processed, that is, the comment tags of the song are extracted. The specific extraction flow is as follows.
  • Step S302: Acquire a comment S corresponding to the song.
  • the song corresponds to a plurality of comments, and in this step, a comment S is obtained in advance for processing.
  • Step S304: Extract the word set corresponding to the comment S by performing word segmentation and part-of-speech tagging on the sentences contained in the comment S.
  • In order to extract the structural relationships between the words in a comment, each sentence in each comment is first segmented and tagged with parts of speech.
  • Step S306: Perform dependency syntactic analysis on the comment S to determine the two-tuples corresponding to the comment S.
  • In this step, syntactic analysis is performed on each sentence to obtain the modification relationships between words, and finally the two-tuples are constructed.
  • the comment is: "Wang Feng's song is very classic, the lyrics are very inspirational.”
  • the subject words and modifiers in the sentence are obtained, and the binary group constructed by the "subject word, modifier” is extracted as Describe a tag for this song, extract the resulting two-tuple as ⁇ song, classic>, ⁇ lyrics, inspirational>.
  • Steps S302 to S306 are performed in a loop until the two-tuples in all the comments corresponding to the song have been extracted. The extracted two-tuples form a tag candidate set A, that is, the first set.
  • Step S308: Perform TF-IDF calculation on the words in all the comments corresponding to the song, and generate a candidate tag set B, that is, the second set, according to the calculation results.
  • The more often a word appears, the more important the word is to the song; in this specific example the number of occurrences of a word is obtained through the TF statistic. However, for some comments, the more times a certain word appears, the less important that word actually is to the song. Therefore, an appropriate weight coefficient is needed to measure the importance of a word: if a word is uncommon overall but appears many times in the comments, the word reflects the characteristics of the song to a certain extent, i.e., the word can serve as a candidate tag.
  • To overcome this problem, IDF is used as the weight coefficient in this specific example.
  • Specifically, multiplying a word's TF and IDF values yields its TF-IDF value. The larger a word's TF-IDF value, the more important the word is to the song.
  • In this specific example, the TF-IDF values of the words in all the comments corresponding to the song are calculated, a threshold (the first set threshold) is set, the words that cannot satisfy the requirement are filtered out, and the words satisfying the requirement constitute a candidate tag word set B, that is, the second set.
  • The first step is to calculate TF: term frequency (TF) = number of times the word appears in a comment / total number of words in that comment. Because comments differ in length, dividing by the total number of words normalizes the term frequency.
  • The second step is to calculate IDF: inverse document frequency (IDF) = log(total number of comments for the song / (number of comments containing the word + 1)). The more common a word is, the larger the denominator and the closer the IDF is to 0.
  • The third step is to calculate TF-IDF: TF-IDF = TF × IDF.
  • By repeating the above calculation, the TF-IDF of each word can be calculated.
  • In this specific example, a threshold a is set as the first set threshold. By comparing a word's TF-IDF with this threshold, it can be determined whether the word should be added to the candidate tag set B.
  • The threshold a can be set to 0.75, and the words are filtered against it: when a word's TF-IDF > a, the word is added to the candidate tag set B.
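  • A minimal sketch of the TF-IDF screening in step S308, assuming the comments are already segmented into lists of words (the function and variable names are illustrative, not from the patent):

    import math
    from collections import Counter

    def tfidf_candidates(comments, threshold=0.75):
        """comments: list of tokenized comments (each a list of words).
        Returns candidate tag set B: words whose TF-IDF exceeds the threshold."""
        n_comments = len(comments)
        # number of comments containing each word (document frequency)
        doc_freq = Counter(word for c in comments for word in set(c))
        candidates = set()
        for comment in comments:
            counts = Counter(comment)
            total = len(comment)
            for word, count in counts.items():
                tf = count / total                                  # normalized term frequency
                idf = math.log(n_comments / (doc_freq[word] + 1))   # inverse document frequency
                if tf * idf > threshold:
                    candidates.add(word)
        return candidates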
  • Step S310: All comments corresponding to the song are processed using the LDA model to determine a candidate tag set D, that is, the fourth set.
  • The LDA model was proposed by Blei et al. in 2003 and is used for document topic modeling.
  • In the LDA model, each document is represented as a mixture over K latent topics, and each topic is a multinomial distribution over W words; the probabilistic graph of the model is shown in Figure 4.
  • In the graph, φ represents the topic-word probability distribution of the LDA model, θ represents the document-topic probability distribution, α and β represent the hyperparameters of the Dirichlet prior distributions followed by θ and φ respectively, open circles represent latent variables, and solid circles represent observable variables, i.e., words.
  • In this specific example the comments of a song are being processed, so all the comments corresponding to the song are treated as the document d to be processed. T(w|d) denotes the topic influence of the word w in document d, that is, its topic weight value, where w is a word in d, and d is assumed to contain t latent topics (t = 10 in this specific example).
  • The higher the probability of a word w under a topic z, the more important the word is to that topic; and the higher the probability of topic z occurring in d, the more important z is to d, and therefore the more important w is. Based on this analysis, this specific example uses p(w|z) to denote the probability of word w under topic z and p(z|d) to denote the probability of topic z occurring in document d, and the topic influence of word w is calculated by the following formula:
  • T(w|d) = Σ_{j=1..t} p(w|z_j) · p(z_j|d)   (1)
  • Here θ denotes the "document-topic" distribution of the document and φ denotes the "topic-word" distribution of each topic. These two parameters are usually computed by Gibbs sampling, exploiting the conjugacy between the Dirichlet distribution and the multinomial distribution, from the counts in formulas (2) and (3), where N1(d, j) denotes the number of times a word in document d is assigned to topic j, N2(w, j) denotes the number of times the word w is assigned to topic j in the training corpus, and N is the total number of words in the text. Formulas (2) and (3) are used to solve formula (1), thereby calculating the topic influence of a word in the document.
  • By repeating the above calculation, the topic influence of every word in all the comments corresponding to the song can be calculated. In this specific example a threshold, namely the second set threshold, is set; by comparing a word's T(w|d) with the second set threshold, it can be determined whether the word should be added to the candidate tag set D, that is, the fourth set.
  • The second set threshold can be set to 0.8, and each word is filtered against it: when a word's T(w|d) > 0.8, the word is added to the candidate tag set D.
  • the second set threshold may be set to any appropriate value by a person skilled in the art, which is not specifically limited in this specific example.
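  • A rough sketch of step S310 using the gensim library (an assumed implementation choice; the patent does not name one) might compute T(w|d) = Σ_z p(w|z)·p(z|d) as follows; here each comment is treated as a document when fitting the model, and the threshold is the example value 0.8.

    from gensim import corpora, models

    def topic_weight_candidates(tokenized_comments, threshold=0.8, num_topics=10):
        """Keep words whose topic influence sum_z p(w|z) * p(z|d) exceeds the threshold."""
        dictionary = corpora.Dictionary(tokenized_comments)
        corpus = [dictionary.doc2bow(c) for c in tokenized_comments]
        lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
        topic_word = lda.get_topics()  # rows are p(w|z); shape (num_topics, vocabulary size)
        candidates = set()
        for bow in corpus:
            # per-document topic distribution p(z|d)
            doc_topics = dict(lda.get_document_topics(bow, minimum_probability=0.0))
            for word_id, _count in bow:
                influence = sum(doc_topics.get(z, 0.0) * topic_word[z, word_id]
                                for z in range(num_topics))
                if influence > threshold:
                    candidates.add(dictionary[word_id])
        return candidates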
  • Step S312: Perform union and intersection processing on the sets determined in step S306, step S308 and step S310.
  • Specifically, the modifiers in the tag candidate set A determined in step S306 are extracted and recorded as a set Aa; a union operation is performed on the set Aa and the candidate tag set B determined in step S308, i.e., Aa ∪ B = C, to obtain a candidate tag set C, that is, the third set. Then an intersection operation is performed on the candidate tag set C and the candidate tag set D, C ∩ D = E, to obtain a candidate tag set E, that is, the fifth set.
  • Step S314: De-duplicate the determined candidate tag set E to obtain the words finally used as comment tags.
  • In this specific example, the candidate tag set E is de-duplicated based on a word similarity that combines the minimum edit distance with the part of speech. Specifically, for any two words S and T in the candidate tag set E, the similarity of the two words is calculated by the following formula:
  • P(S,T) = α/(D(S,T)+1) + β·Sim(pos)
  • where S and T are the two words, P(S,T) is their similarity, D(S,T) is their minimum edit distance, Sim(pos) is their part-of-speech similarity (1 if S and T have the same part of speech, 0 otherwise), and α and β are weight coefficients with α + β = 1 and P(S,T) ∈ [0, 1]. When D(S,T) = 0 and Sim(pos) = 1, P(S,T) = 1, the maximum similarity; when Sim(pos) = 0, the larger D(S,T) is, the smaller P(S,T) is.
  • In this specific example, the weight coefficient α is set to 0.6 and the weight coefficient β to 0.4.
  • the similarity of any two words in the candidate tag set E is calculated by the above formula. Then, the words in the candidate tag set E are deduplicated according to the similarity.
  • When the similarity of two words in the candidate tag set E is greater than a third set threshold (for example, 0.7), the two words are considered duplicates and one of them is removed. All the words in the candidate tag set E are filtered in this way, and the set of words that finally remains constitutes the comment tags of the song.
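  • For illustration only, the de-duplication of step S314 could be sketched as below; the Levenshtein helper, the dictionary of part-of-speech tags, and the division-based reading of the similarity formula are assumptions rather than part of the disclosure.

    def edit_distance(s, t):
        """Levenshtein (minimum edit) distance between two words."""
        dp = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            prev, dp[0] = dp[0], i
            for j, ct in enumerate(t, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                         dp[j - 1] + 1,      # insertion
                                         prev + (cs != ct))  # substitution / match
        return dp[-1]

    def similarity(s, t, pos, alpha=0.6, beta=0.4):
        """P(S,T) = alpha / (D(S,T) + 1) + beta * Sim(pos); pos maps word -> POS tag."""
        sim_pos = 1.0 if pos.get(s) == pos.get(t) else 0.0
        return alpha / (edit_distance(s, t) + 1) + beta * sim_pos

    def deduplicate(candidates, pos, threshold=0.7):
        """Drop one word of every pair whose similarity exceeds the third set threshold."""
        kept = []
        for word in candidates:
            if all(similarity(word, other, pos) <= threshold for other in kept):
                kept.append(word)
        return kept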
  • Referring to FIG. 5, a structural block diagram of a comment tag extraction apparatus according to Embodiment 3 of the present invention is shown.
  • The comment tag extraction apparatus of this embodiment includes: a two-tuple extraction module 502, configured to perform two-tuple extraction on each comment corresponding to the current object to be processed and combine the extracted two-tuples into a first set, wherein each two-tuple includes a subject word and a modifier; a first combination module 504, configured to determine the words in the comments whose term frequency-inverse document frequency (TF-IDF) is greater than a first set threshold and combine the determined words into a second set; a second combination module 506, configured to process the first set and the second set according to a first setting rule to generate a third set; a third combination module 508, configured to determine the words in the comments whose topic weight value is greater than a second set threshold and combine the determined words into a fourth set; a fourth combination module 510, configured to perform intersection processing on the third set and the fourth set to obtain a fifth set; and a de-duplication module 512, configured to de-duplicate the words in the fifth set and determine the words remaining after de-duplication as the comment tags of the current object to be processed.
  • With the comment tag extraction apparatus provided by this embodiment of the present invention, two-tuples of words are constructed by performing lexical and grammatical analysis on each sentence of each comment, so the contextual relationships between the words in a comment can be used effectively, isolated and meaningless noise words are filtered out, the range of words serving as candidate comment tags is narrowed, and the accuracy of the extracted comment tags is accordingly improved.
  • In addition, when screening words as candidate comment tags, the apparatus also screens the words by their topic weight values: words whose topic weight value is less than or equal to the second set threshold are filtered out, and words closely related to the topic of the comments are retained, which further improves the accuracy of the extracted comment tags.
  • Referring to FIG. 6, a structural block diagram of a comment tag extraction apparatus according to Embodiment 4 of the present invention is shown.
  • The comment tag extraction apparatus of this embodiment further optimizes the apparatus shown in Embodiment 3. The optimized apparatus includes: a two-tuple extraction module 602, configured to perform two-tuple extraction on each comment corresponding to the current object to be processed and combine the extracted two-tuples into a first set, wherein each two-tuple includes a subject word and a modifier; a first combination module 604, configured to determine the words in the comments whose term frequency-inverse document frequency (TF-IDF) is greater than a first set threshold and combine the determined words into a second set; a second combination module 606, configured to process the first set and the second set according to a first setting rule to generate a third set; a third combination module 608, configured to determine the words in the comments whose topic weight value is greater than a second set threshold and combine the determined words into a fourth set; a fourth combination module 610, configured to perform intersection processing on the third set and the fourth set to obtain a fifth set; and a de-duplication module 612, configured to de-duplicate the words in the fifth set and determine the words remaining after de-duplication as the comment tags of the current object to be processed.
  • Optionally, when the two-tuple extraction module 602 performs two-tuple extraction on the comments corresponding to the current object to be processed: for each comment, each sentence contained in the comment is segmented into words and the part of speech of each word is determined; syntactic analysis is then performed on the parts of speech to obtain the modification relationships between the words in each sentence, and the two-tuple corresponding to each sentence is constructed according to these modification relationships.
  • Optionally, the second combination module 606 includes: a modifier extraction sub-module 6062, configured to extract the modifiers or subject words contained in the two-tuples in the first set to form a modifier set or a subject word set;
  • and a union processing sub-module 6064, configured to perform a union operation on the modifier set or subject word set and the second set to generate the third set.
  • Optionally, when the third combination module 608 determines the words in the comments whose topic weight value is greater than the second set threshold, it determines the topic weight value of each word in the comments according to a latent Dirichlet allocation model, and then compares the topic weight value of each word with the second set threshold to determine the words whose topic weight value is greater than the second set threshold.
  • Optionally, the de-duplication module 612 includes: a grouping sub-module 6122, configured to combine the words in the fifth set pairwise into word groups; a similarity calculation sub-module 6124, configured to determine, for each word group, the similarity value of the two words in the current word group according to the minimum edit distance and the part-of-speech similarity of the two words; a deletion sub-module 6126, configured to delete one word from each word group whose similarity value is greater than a third set threshold, so as to complete the de-duplication of the fifth set; and a determination sub-module 6128, configured to determine the words remaining after de-duplication as the comment tags of the current object to be processed.
  • Optionally, the similarity calculation sub-module 6124 calculates the similarity of the two words in each word group using the formula P(S,T) = α/(D(S,T)+1) + β·Sim(pos), where S and T are the two words in the word group, P(S,T) is the similarity of the two words, D(S,T) is their minimum edit distance, Sim(pos) is their part-of-speech similarity, and α and β are both weight coefficients.
  • the comment label extracting apparatus of the embodiment of the present invention is used to implement the corresponding comment label extracting method in the first embodiment and the second embodiment, and has the beneficial effects corresponding to the method embodiment, and details are not described herein again.
  • The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
  • Figure 7 illustrates an electronic device in accordance with the present invention.
  • the electronic device conventionally includes a processor 710 and a computer program product or computer readable medium in the form of a memory 720.
  • Memory 720 can be an electronic memory such as a flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • Memory 720 has a memory space 730 for program code 731 for performing any of the method steps described above.
  • storage space 730 for program code may include various program code 731 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 8.
  • The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 720 in the electronic device of FIG. 7.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 731', i.e., code readable by a processor, such as 710, that when executed by an electronic device causes the electronic device to perform various steps in the methods described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A comment tag extraction method and apparatus, wherein the method comprises: performing two-tuple extraction on each comment corresponding to the current object to be processed, and combining the extracted two-tuples into a first set (S102); determining the words in the comments whose TF-IDF is greater than a first set threshold, and combining the determined words into a second set (S104); processing the first set and the second set according to a first setting rule to generate a third set (S106); determining the words in the comments whose topic weight value is greater than a second set threshold, and combining the determined words into a fourth set (S108); performing intersection processing on the third set and the fourth set to obtain a fifth set (S110); and de-duplicating the words in the fifth set, and determining the words remaining after de-duplication as the comment tags of the current object to be processed (S112). The comment tag extraction method provided above can improve the accuracy of comment tags.

Description

评论标签提取方法和装置
本申请要求在2015年12月1日提交中国专利局、申请号为201510866792.5、发明名称为“评论标签提取方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及标签提取技术领域,特别是涉及一种评论标签提取方法和装置。
背景技术
对于一个对象(产品、商户、歌曲、电影)往往伴随着成千上万的用户评论。如何从这些冗杂的评论信息中提取出能够描述此对象的精华信息作为评论标签是当前研究的热点问题之一。以一首歌曲为例,若能通过对该歌曲的相关评论进行处理,得到能够体现该歌曲特点的精华信息作为其标签,那么,将有助于用户对该歌曲的特性的直观了解。
目前,评论标签提取主要通过以下两种方案实现:
第一种:依靠人工对用户发出的评论进行搜索、整理并提取中间的某些词语作为对象的评论标签。该种评论标签提取方案,耗时长且需要占用大量的人力资源。不仅如此,由于人工筛选词语通常带有较强的主观性,提取的评论标签往往难以以最客观的形式来体现对象的特性,导致提取的评论标签的精确度低。
第二种:直接使用文本标签的提取的方式对评论标签进行提取。具体为:基于词性和模板的直接对各条评论中的词语进行提取以确定出对象对应的评论标签;或者是,基于词语出现的频率从各条评论中筛选出词语作为对象的评论标签。
上述第二种评论标签提取方案虽能自动完成评论标签的提取,相较于第一种评论标签提取方案能够节省大量的人力资源以及处理时间,但由于该种抽取方法忽略了各条评论之间的相互关系,造成抽取的标签与各条评论之间的相关度低,最终依然会导致提取的评论标签的精确度低。
发明内容
本发明提供了一种评论标签提取方法和装置,以解决现有的评论标签提取方案所提取的评论标签精确度低的问题。
为了解决上述问题,本发明公开了一种评论标签提取方法,所述方法包括:将当前待处理对象对应的各条评论进行二元组提取,将提取出的所述二元组组合成第一集合;其中,所述二元组包括:主语词和修饰词;确定所述各条评论中词频-反转文件频率TF-IDF大于第一设定阈值的词语,将所述确定的词语组合成第二集合;按照第一设定规则对所述第一集合以及所述第二集合进行处理,生成第三集合;确定所述各条评论中主题权重值大于第二设定阈值的词语,将所述确定的主题权重值大于第二设定阈值的词语组合成第四集合;对所述第三集合以及所述第四集合进行求交集处理得到第五集合;对所述第五集合中的词语进行去重复,并将去重复后剩余的词语确定为所述当前待处理对象的评论标签。
为了解决上述问题,本发明还公开了一种评论标签提取装置,所述装置包括:二元组提取模块,用于将当前待处理对象对应的各条评论进行二元组提取,将提取出的所述二元组组合成第一集合;其中,所述二元组包括:主语词和修饰词;第一组合模块,用于确定所述各条评论中词频-反转文件频率TF-IDF大于第一设定阈值的词语,将所述确定的词语组合成第二集合;第二组合模块,用于按照第一设定规则对所述第一集合以及所述第二集合进行处理,生成第三集合;第三组合模块,用于确定所述各条评论中主题权重值大于第二设定阈值的词语,将所述确定的主题权重值大于第二设定阈值的词语组合成第四集合;第四组合模块,用于对所述第三集合以及所述第四集合进行求交集处理得到第五集合;去重复模块,用于对所述第五集合中的词语进行去重复,并将去重复后剩余的词语确定为所述当前待处理对象的评论标签。
本发明实施例提供一种计算机程序,其包括计算机可读代码,当所述计算机可读代码在电子装置上运行时,导致所述电子装置执行上述的评论标签提取方法。
本发明实施例提供一种计算机可读介质,其中存储了上述的计算机程序。
本发明提供的评论标签提取方法和装置,通过对各评论中的各句子进行词语、语法分析构建词语的二元组,能够有效的利用评论中词语的上下文之间的关系,过滤掉了独立的无意义的噪音词,缩小了作为候选评论标签的词语范围,相应地提高了标签的精确度。此外,本发明提供的评论标签提取方法和装置,在筛选作为候选评论标签的词语时,还对词语主题权重值的筛选,将主题权重值小于或等于第二设定阈值的词语过滤掉,保留与评论的主题关联密切的词语,可以进一步提高提取的标签精确度。
附图说明
为了更清楚地说明本发明或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是根据本发明实施例一的一种评论标签提取方法的步骤流程图;
图2是根据本发明实施例二的一种评论标签提取方法的步骤流程图;
图3是采用本发明实施例二中所示的方法进行评论标签提取的步骤流程图;
图4是LDA模型的概率图;
图5是根据本发明实施例三的一种评论标签提取装置的结构框图;
图6是根据本发明实施例四的一种评论标签提取装置的结构框图;
图7示意性地示出了用于执行根据本发明的方法的电子装置的框图;以及
图8示意性地示出了用于保持或者携带实现根据本发明的方法的程序代码的存储单元。
具体实施方式
为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获 得的所有其他实施例,都属于本发明保护的范围。
实施例一
参照图2,示出了本发明实施例一的一种评论标签提取方法的步骤流程图。
本发明实施例的评论标签提取方法包括以下步骤:
步骤S102:将当前待处理对象对应的各条评论进行二元组提取,将提取出的二元组组合成第一集合。
其中,待处理对象可以是歌曲、电影、物品等,当前待处理对象对应的各条评论即关于该对象的各条评论。例如:需要从一部电影的众多评论中提取出评论标签,则该部电影及为待处理对象,针对该部电影的全部评论即当前待处理对象对应的各条评论。
其中,二元组包括:主语词和修饰词,例如:二元组为<歌,经典>。通过对各条评论中构成句子的词语以及语法进行分析,得到各条评论包含的二元组,然后,将各条评论的二元组组合成第一集合。
步骤S104:确定各条评论中TF-IDF大于第一设定阈值的词语,将确定的词语组合成第二集合。
需要说明的是,评论中的词语的TF-IDF(term frequency-inverse document frequency,词频-反转文件频率或词频-反向文件频率)的确定参见相关技术即可,本发明实施例中对此不作具体限制。
第一设定阈值可以由本领域技术人员在具体实现过程中根据实际需求进行设定,本发明实施例中对此也不作具体限制。
步骤S106:按照第一设定规则对第一集合以及第二集合进行处理,生成第三集合。
第一设定规则可以由本领域技术人员根据实际需求进行设定,本发明实施例中对此不作具体限制。例如:将第一设定规则设定成从第一集合中提取各主语词组成主语词集合,将主语词集合与第二集合进行并集运算。例如:将第一设定规则设定成从第一集合中提取各修饰词组成修饰词集合,将修饰词集合与第二集合进行并集运算。再例如:将第一设定规则设定成将第一集与第二集合进行并集运算。
步骤S108:确定各条评论中主题权重值大于第二设定阈值的词语,将确定的主题权重值大于第二设定阈值的词语组合成第四集合。
其中,第二设定阈值可以由本领域技术人员根据实际需求进行设置,本发明实施例中对此不作具体限制。
步骤S110:对第三集合以及第四集合进行求交集处理得到第五集合。
求交集即将两个集合中相同的元素提取出来构成一个新的集合。例如:第三集合中包含词语A和B,第四集合中包含词语A和C,在对这两个集合求交集时,则提取出词语A组成第五集合。
步骤S112:对第五集合中的词语进行去重复,并将去重复后剩余的词语确定为当前待处理对象的评论标签。
通过本发明实施例提供的评论标签提取方法,通过对各评论中的各句子进行词语、语法分析构建词语的二元组,能够有效的利用评论中词语的上下文之间的关系,过滤掉了独立的无意义的噪音词,缩小了作为候选评论标签的词语范围,相应地提高了提取的评论标签的精确度。此外,本发明实施例提供的评论标签提取方法,在筛选作为候选评论标签的词语时还对词语主题权重值的筛选,将主题权重值小于或等于第二设定阈值的词语过滤掉,保留与评论的主题关联密切的词语,可以进一步提高提取的评论标签精确度。
实施例二
参照图2,示出了本发明实施例二的一种评论标签提取方法的步骤流程图。
本发明实施例的评论标签提取方法具体包括以下步骤:
步骤S202:处理装置将当前待处理对象对应的各条评论进行二元组提取,将提取出的二元组组合成第一集合。
其中,处理装置可以是任意具有计算功能的装置,例如:服务器、电脑等。二元组包括:主语词和修饰词。
一种可选的将当前待处理对象对应的各条评论进行二元组提取的方式如下:
针对每条评论,对该评论包含的每个句子进行分词,并确定分词后的各词语的词性;对各词语的词性进行句法分析,获取每个句子中词语之间的修 饰关系,依据所述修饰关系构建每个句子对应的二元组。采用上述提取方式对各条评论进行处理,即可确定全部二元组。
例如:当前评论包含的句子为“汪峰的歌很经典,歌词很励志”,经过上述句子分词、词性确定、句法分析后确定的二元词组为:<歌,经典>,<歌词,励志>。
步骤S204:处理装置确定各条评论中TF-IDF大于第一设定阈值的词语,将确定的各词语组合成第二集合。
词语的TF-IDF为:词语的TF(term frequenc,词频)与IDF(inverse document frequency反转文件频率)的积。
其中,TF的具体计算方式可以由本领域技术人员根据实际需求进行设置。例如:可以采用如下公式TF=词语在一条评论中出现的次数/词语所在的该条评论的总词数,来计算词语的TF。还可以采用如下公式TF=词语在一条评论中出现的次数,来确定词语的TF。
IDF的具体计算方式也可以由本领域技术人员根据实际需求进行设置。例如:可以采用如下公式IDF=log(待处理对象下评论总条数/(包含该词语的评论数+1))计算词语的IDF。还可以采用如下公式IDF=log(待处理对象下评论总条数/包含该词语的评论数)计算词语的IDF。
可选地,第一设定阈值为0.75。当然,并不限于此,第一设定阈值还可以为0.7、0.8等。在具体实现过程中,本领域技术人员可以依据实际需求将第一设定阈值设定成任意适当的值。
在确定各词语的TF-IDF后,分别将各词语的TF-IDF的与第一设定阈值进行比较即可确定TF-IDF大于第一设定阈值的词语,将这些词语组成第二集合。
步骤S206:处理装置提取第一集合中各二元组包含的修饰词或主语词,组成修饰词集合或主语词集合。
第一集合中包含多个二元组,而每个二元组中包含一个修饰词和一个主语词,本步骤中,需要提取各二元组中包含的修饰词,并将提取出的各修饰词组成修饰词集合。例如:第一集合中包含二元组<歌,经典>,<歌词,励志>,提取出的修饰词为“经典”、“励志”,则将“经典”、“励志”组成修饰词集合。当然,也可以提取各二元组中包含的主语词,并将提取出的各主 语词组成主语词集合。
步骤S208:处理装置对修饰词集合或主语词集合与第二集合进行求并集处理,生成第三集合。
例如:修饰词集合中包含词语A、B和C,第二集合中包含词语A、D和E,那么,将二者求并集后所生成的第三集合则包含词语A、B、C、D和E。
步骤S210:处理装置依据潜在狄利克雷分布模型确定各条评论中的各词语的主题权重值。
通过LDA(Latent Dirichlet Allocation,潜在狄利克雷分布)模型可以计算出一个词语在文档中的主题影响力即主题权重值。具体确定方式参见相关技术即可,本发明实施例对此不作具体限制。相应地,通过将各条评论作为文档,即可确定出词语在所有评论中的主题权重值。
步骤S212:处理装置分别将各词语的主题权重值与第二设定阈值进行比对,以确定出主题权重值大于第二设定阈值的词语,并将确定的主题权重值大于第二设定阈值的词语组合成第四集合。
需要说明的是,第二设定阈值可以由本领域技术人员根据实际需求进行设置。可选地,第二设定阈值设定为0.8。当然并不限于该值,还可以设定为0.7、0.75、0.85等值。
本步骤可将主题权重值小于或等于第二设定阈值的词语过滤掉,保留与评论的主题关联密切的词语,以提高提取的评论标签精确度。
步骤S214:处理装置对第三集合以及第四集合进行求交集处理得到第五集合。
步骤S216:处理装置对第五集合中的词语进行去重复,并将去重复后剩余的词语确定为当前待处理对象的评论标签。
一种可选的对第五集合中的词语进行去重复的方式如下:
S1:将第五集合中的各词语分别两两进行组合,组合成词语组;
例如:第五集合中包含词语A、B、C和D,那么,将A和B、A和C、A和D、B和C、B和D、C和D进行组合,组合成多个词语组。
S2:针对每个词语组,分别依据当前词语组中两个词语的最小编辑距离以及词性相似度确定当前词语组中的两个词语的相似度值。
一种可选的依据两个词语的最小编辑距离以及词性相似度确定当前词语组中的两个词语的相似度值的方式为采用如下公式进行计算:
P(S,T)=α(D(S,T)+1)+βSim(pos);
其中,S,T为词语组中的两个词语,P(S,T)表示两个词语的相似度,D(S,T)表示两个词语的最小编辑距离,Sim(pos)表示两个词语的词性相似度,α、β为权重系数。若S,T词性相同,则Sim(pos)为1,若S,T词性不同,则Sim(pos)为0,α+β=1,P(S,T)∈[0,1]。
当D(S,T)=0且Sim(pos)=1,即词语S和T的最小编辑距离为0,词性相同,则P(S,T)=1,表示S和T的相似度最大。Sim(pos)=0,D(S,T)越大,即词语S和T的最小编辑距离越大,P(S,T)越小,表示S和T的相似度越小。
可选地,可将α设置为0.6,将β设置为0.4。
S3:分别将相似度值大于第三设定阈值的词语组中的一个词语删除,以完成对第五集合的去重复。
例如:若S和T组成的词语组的相似度值大于第三设定阈值,则需从第五结合中删除S或T任意一个词语;若S和T组成的词语组的相似度值小于或等于第三设定阈值,则无需进行词语删除。采用相同的原则,将各词语组进行处理,即可完成对第五集合的去重复。
通过本发明实施例提供的评论标签提取方法,通过对各评论中的各句子进行词语、语法分析构建词语的二元组,能够有效的利用评论中词语的上下文之间的关系,过滤掉了独立的无意义的噪音词,缩小了作为候选评论标签的词语范围,相应地提高了提取的评论标签的精确度。此外,本发明实施例提供的评论标签提取方法,在筛选作为候选评论标签的词语时还对词语主题权重值的筛选,将主题权重值小于或等于第二设定阈值的词语过滤掉,保留与评论的主题关联密切的词语,可以进一步提高提取的评论标签精确度。
下面参照图3以一具体实例对本发明实施例的评论标签提取方法进行说明。
本具体实例中以一首歌曲作为待处理对象为例进行的说明,也即提取该首歌曲的评论标签。具体提取流程如下:
步骤S302:获取歌曲对应的一条评论S。
其中,歌曲对应多条评论,在本步骤中预先获取一条评论S进行处理。
步骤S304:通过对获取的该条评论S中包含的句子进行分词、以及词性标注提取出评论S对应的词语集合。
为了抽取评论中词语与词语之间的结构关系,对每条评论中的每个句子,首先进行分词、词性标注。
步骤S306:对评论S进行依存句法分析,确定评论S对应的二元组。
本步骤中,对每个句子进行句法分析,获取词语与词语之间的修饰,最后,构建二元组。例如,评论为:“汪峰的歌很经典,歌词很励志”通过依存句法分析之后,得到该句子中的主语词和修饰词,将<主语词,修饰词>构造的二元组提取出来,作为描述此歌曲的一个标签,提取得到的二元组为<歌,经典>,<歌词,励志>。
循环执行步骤S302至步骤S306至该首歌曲对应的全部评论中的二元组均提取完成。将提取的各二元组组成标签候选集合A即第一集合。
步骤S308:对歌曲对应的所有评论中的词语进行TF-IDF计算,依据计算结果生成候选标签集合即第二集合。
词语出现次数越多,则说明这个词语对该歌曲越重要,本具体实例中词语出现次数通过TF统计得到。但是对于有些评论而言,某个词出现的次越多,该词语对该歌曲反而越不重要。因此,需要找到一个适当的权重系数,来衡量该词语的重要性。如果一个词语不常见,但是它在评论中多次出现,那么该词语在一定程度上体现了该歌曲的特性,即该词语可以作为候选标签。为克服上述问题,本具体实例中使用IDF作为权重系数。
具体地,将词语的TF和IDF这两个值相乘,就得到了一个词语的TF-IDF值。词语的TF-IDF值越大,则该词预对歌曲的重要性越高。本具体实例中,计算歌曲对应的全部评论中的词语的TF-IDF值,通过设置一个阈值即第一设定阈值,筛选出一部分不能满足要求的词语,将满足要求的词语构成一个候选标签词集合B即第二集合。
针对一个词语的TF-IDF的具体计算步骤如下:
第一步,计算TF。
词频(TF)=词语在评论中出现的次数/该评论的总词数。
说明:由于每条评论的长度不一,除以评论总词数进行词频标准化。
第二步,计算IDF。
反转文件频率(IDF)=log(该歌曲对应的评论总数/(包含该词语的评论数量+1))。
如果一个词越常见,那么分母就越大,反转文件频率就越小越接近0。
第三步,计算TF-IDF。
TF-IDF=词频(TF)×反转文件频率(IDF)。
重复上述计算流程,即可计算各词语的TF-IDF。
本发明实施例中设置阈值a即第一设定阈值,通过将词语的TF-IDF与设置的阈值进行比对,即可确定该词语是否可添加值候选标签集合B中。
阈值a可以设置为0.75,通过该阈值a对各词语进行筛选。在筛选时,当词语的TF-IDF>a时,则将词语加入候选标签集合B中。
步骤S310:使用LDA模型对歌曲对应的所有评论进行处理,以确定候选标签集合D即第四集合。
LDA模型为在2003年由Blei(布莱)等提出并用于文档主题建模。在LDA模型中,每篇文档表示为含有K个隐含主题的混合分布,每个主题是在W个词语上的多项分布,该模型的概率图表示如图4所示。
其中,
Figure PCTCN2016089277-appb-000001
表示LDA模型中主题-词语的概率分布,θ表示文档-主题的概率分布,α和β分别表示θ和
Figure PCTCN2016089277-appb-000002
所服从Dirichlet先验分布的超参数,空心圆圈表示隐含变量,实心圆圈表示可观察到的变量,即词语。
在本具体实例中由于是要对歌曲的评论进行处理,因此,歌曲对应的全部评论即相当于待处理文档d,T(w|d)表示该词语在文档d中的主题影响力即主题权重值,其中,w表示d中的词语,并假设文档d包含t个隐含主题,本具体实例中t=10。词语w在一个主题z中出现的概率越大,则该词语对主题z越重要;若w对应的主题z在d中的出现概率越大,则表明主题z相对于文档d越重要,因而,w也越重要。基于以上分析,本具体实例中采用
Figure PCTCN2016089277-appb-000003
表示词语w在主题z中的概率,采用
Figure PCTCN2016089277-appb-000004
表示文档d中的主题z的出现概率,词语w的主题影响力可以通过下述公式计算得到:
Figure PCTCN2016089277-appb-000005
其中θ表示文档的“文档-主题”分布,φ表示每个主题的“主题-词语”分布,这两个参数通常利用Dirichlet即狄利克雷分布和多项式分布之间的共轭性质,通过Gibbs即吉布斯采样计算得到。计算公式如下:
Figure PCTCN2016089277-appb-000006
Figure PCTCN2016089277-appb-000007
其中,N1(d,j)表示文档d中的词被赋给主题j的次数,N2(w,j)表示在训练语料库中词语w被赋给主题j的次数,N为文本中词语总数。通过公式(2)和公式(3)即可求解公式(1),从而计算出一个词语在文档中的主题影响力。
重复采用上述公式即可计算出歌曲对应的全部评论下的全部词语的主题影响力。
本具体施例中设置一个阈值即第二设定阈值,通过将词语的T(w|d)与第二设定阈值进行比对,即可确定该词语是否可添加值候选标签集合D即第四集合中。
第二设定阈值可以设置为0.8,通过第二设定阈值即可对各词语进行筛选。在筛选时,当词语的T(w|d)>0.8时,则将词语加入候选标签集合D中。
需要说明的是,上述仅是以0.8为例进行的说明,在具体实现过程中,第二设定阈值可以由本领域技术人员设置成任意适当的值,本具体实例中对此不作具体限制。
步骤S312:将通过步骤S306、步骤S308以及步骤S310确定的各集合进行交、并集处理。
具体地,将通过步骤S306确定的标签候选集合A中的修饰词提取出来,记作为集合Aa,对集合Aa和通过步骤S308确定的候选标签集合B进行并集 运算,即Aa∪B=C,得到候选标签集合C即第三集合。然后,将候选标签集合C与候选标签集合D进行求交集运算,C∩D=E,得到候选标签集合E即第五集合。
步骤S314:对确定的候选标签集合E进行去重复,得到最终作为评论标签的词语。
本具体实例中基于最小编辑距离结合词性的词相似度对候选标签集合E进行去处理。具体地:对候选标签集合E中任意两个词语S、T,利用如下公式计算选择的这两个词语的相似度:
P(S,T)=α(D(S,T)+1)+βSim(pos)
其中,S和T表示词语组中的两个词语,P(S,T)表示两个词语的相似度,D(S,T)表示两个词语的最小编辑距离,Sim(pos)表示两个词语的词性相似度,α与β均为权重系数。如果S和T的词性相同,则为1;若不同,则为0。α+β=1,P(S,T)∈[0,1]。
当D(S,T)=0且Sim(pos)=1,即词语S和T的最小编辑距离为0,,则P(S,T)=1,表示S和T的相似度最大。当Sim(pos)=0,D(S,T)越大,即词语S和T的最小编辑距离越大,P(S,T)越小,则S和T的相似度越小。
可选地,将权重系数α设置为0.6,将权重系数β设置为0.4。
通过上述公式分别计算候选标签集合E中任意两个词语的相似度。然后,依据相似度对候选标签集合E中的词语进行去重复。
当候选标签集合E中两个词的相似度大于第三设定阈值(例如:0.7)时,则认为这两个词语重复,去掉其中一个,按照该方法筛选候选标签集合E中的所有词语,最后剩下的词语集合即为该首歌曲的评论标签。
实施例三
参照图5,示出了本发明实施例三中的一种评论标签提取装置的结构框图。
本发明实施例的评论标签提取装置包括:二元组提取模块502,用于将 当前待处理对象对应的各条评论进行二元组提取,将提取出的所述二元组组合成第一集合;其中,所述二元组包括:主语词和修饰词;第一组合模块504,用于确定所述各条评论中词频-反转文件频率TF-IDF大于第一设定阈值的词语,将所述确定的词语组合成第二集合;第二组合模块506,用于按照第一设定规则对所述第一集合以及所述第二集合进行处理,生成第三集合;第三组合模块508,用于确定所述各条评论中主题权重值大于第二设定阈值的词语,将所述确定的主题权重值大于第二设定阈值的词语组合成第四集合;第四组合模块510,用于对所述第三集合以及所述第四集合进行求交集处理得到第五集合;去重复模块512,用于对所述第五集合中的词语进行去重复,并将去重复后剩余的词语确定为所述当前待处理对象的评论标签。
通过本发明实施例提供的评论标签提取装置,通过对各评论中的各句子进行词语、语法分析构建词语的二元组,能够有效的利用评论中词语的上下文之间的关系,过滤掉了独立的无意义的噪音词,缩小作为候选评论标签的词语范围,相应地提高了提取的评论标签的精确度。此外,本发明实施例提供的评论标签提取装置,在筛选作为候选评论标签的词语时还对词语主题权重值的筛选,将主题权重值小于或等于第二设定阈值的词语过滤掉,保留与评论的主题关联密切的词语,可以进一步提高提取的评论标签精确度。
实施例四
参照图6,示出了本发明实施例四中的一种评论标签提取装置的结构框图。
本发明实施例的评论标签提取装置是对实施例三中所示的评论标签提取装置的进一步优化,优化后的评论标签提取装置包括:二元组提取模块602,用于将当前待处理对象对应的各条评论进行二元组提取,将提取出的所述二元组组合成第一集合;其中,所述二元组包括:主语词和修饰词;第一组合模块604,用于确定所述各条评论中词频-反转文件频率TF-IDF大于第一设定阈值的词语,将所述确定的词语组合成第二集合;第二组合模块606,用于按照第一设定规则对所述第一集合以及所述第二集合进行处理,生成第三集合;第三组合模块608,用于确定所述各条评论中主题权重值大于第二设定阈值的词语,将所述确定的主题权重值大于第二设定阈值的词语组合成第四 集合;第四组合模块610,用于对所述第三集合以及所述第四集合进行求交集处理得到第五集合;去重复模块612,用于对所述第五集合中的词语进行去重复,并将去重复后剩余的词语确定为所述当前待处理对象的评论标签。
可选地,所述二元组提取模块602将当前待处理对象对应的各条评论进行二元组提取时:针对每条评论,对该评论包含的每个句子进行分词,并确定分词后的各词语的词性;对所述各词语的词性进行句法分析,获取所述每个中词语之间的修饰关系,依据所述修饰关系构建所述每个句子对应的二元组。
可选地,所述第二组合模块606包括:修饰词提取子模块6062,用于提取所述第一集合中、各二元组包含的修饰词或主语词,组成修饰词集合或主语词集合;并集处理子模块6064,用于对所述修饰词集合或主语词集合与所述第二集合进行求并集处理,生成所述第三集合。
可选地,所述第三组合模块608确定所述各条评论中主题权重值大于第二设定阈值的词语时:依据潜在狄利克雷分布模型确定所述各条评论中的各词语的主题权重值;分别将各词语的主题权重值与所述第二设定阈值进行比对,以确定出主题权重值大于所述第二设定阈值的词语。
可选地,所述去重复模块612包括:分组子模块6122,用于将所述第五集合中的各词语分别两两进行组合,组合成词语组;相似度计算子模块6124,用于针对每个词语组,分别依据当前词语组中两个词语的最小编辑距离以及词性相似度确定当前词语组中的两个词语的相似度值;删除子模块6126,用于分别将相似度值大于第三设定阈值的词语组中的一个词语删除,以完成对所述第五集合的去重复;确定子模块6128,用于将去重复后剩余的词语确定为所述当前待处理对象的评论标签。
可选地,所述相似度计算子模块6124利用如下公式计算每个词语组中的两个词语的相似度:P(S,T)=α(D(S,T)+1)+βSim(pos);其中,S,T表示词语组中的两个词语,P(S,T)表示两个词语的相似度,D(S,T)表示两个词语的最小编辑距离,Sim(pos)表示两个词语的词性相似度,α与β均为权重系数。
本发明实施例的评论标签提取装置用于实现前述实施例一、二中相应的评论标签提取方法,并具有与方法实施例相应的有益效果,在此不再赘述。
本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。对于***实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。
例如,图7示出了可以实现根据本发明的电子装置。该电子装置传统上包括处理器710和以存储器720形式的计算机程序产品或者计算机可读介质。存储器720可以是诸如闪存、EEPROM(电可擦除可编程只读存储器)、EPROM、硬盘或者ROM之类的电子存储器。存储器720具有用于执行上述方法中的任何方法步骤的程序代码731的存储空间730。例如,用于程序代码的存储空间730可以包括分别用于实现上面的方法中的各种步骤的各个程序代码731。这些程序代码可以从一个或者多个计算机程序产品中读出或者写入到这一个或者多个计算机程序产品中。这些计算机程序产品包括诸如硬盘,紧致盘(CD)、存储卡或者软盘之类的程序代码载体。这样的计算机程序产品通常为如参考图8所述的便携式或者固定存储单元。该存储单元可以具有与图7的电子装置中的存储器720类似布置的存储段、存储空间等。程序代码可以例如以适当形式进行压缩。通常,存储单元包括计算机可读代码 731’,即可以由例如诸如710之类的处理器读取的代码,这些代码当由电子装置运行时,导致该电子装置执行上面所描述的方法中的各个步骤。
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。

Claims (14)

  1. 一种评论标签提取方法,其特征在于,包括:
    将当前待处理对象对应的各条评论进行二元组提取,将提取出的所述二元组组合成第一集合;其中,所述二元组包括:主语词和修饰词;
    确定所述各条评论中词频-反转文件频率TF-IDF大于第一设定阈值的词语,将所述确定的词语组合成第二集合;
    按照第一设定规则对所述第一集合以及所述第二集合进行处理,生成第三集合;
    确定所述各条评论中主题权重值大于第二设定阈值的词语,将所述确定的主题权重值大于第二设定阈值的词语组合成第四集合;
    对所述第三集合以及所述第四集合进行求交集处理得到第五集合;
    对所述第五集合中的词语进行去重复,并将去重复后剩余的词语确定为所述当前待处理对象的评论标签。
  2. 根据权利要求1所述的方法,其特征在于,所述将当前待处理对象对应的各条评论进行二元组提取的步骤包括:
    针对每条评论,对该评论包含的每个句子进行分词,并确定分词后的各词语的词性;
    对所述各词语的词性进行句法分析,获取所述每个句子中词语之间的修饰关系,依据所述修饰关系构建所述每个句子对应的二元组。
  3. 根据权利要求1所述的方法,其特征在于,所述按照第一设定规则对所述第一集合以及所述第二集合进行处理,生成第三集合的步骤包括:
    提取所述第一集合中各二元组包含的修饰词或主语词,组成修饰词集合或主语词集合;
    对所述修饰词集合或主语词集合与所述第二集合进行求并集处理,生成所述第三集合。
  4. 根据权利要求1所述的方法,其特征在于,所述确定所述各条评论中主题权重值大于第二设定阈值的词语的步骤包括:
    依据潜在狄利克雷分布模型确定所述各条评论中的各词语的主题权重值;
    分别将各词语的主题权重值与所述第二设定阈值进行比对,以确定出主题权重值大于所述第二设定阈值的词语。
  5. 根据权利要求1所述的方法,其特征在于,所述对所述第五集合中的词语进行去重复的步骤包括:
    将所述第五集合中的各词语分别两两进行组合,组合成词语组;
    针对每个词语组,分别依据当前词语组中两个词语的最小编辑距离以及词性相似度确定当前词语组中的两个词语的相似度值;
    分别将相似度值大于第三设定阈值的词语组中的一个词语删除,以完成对所述第五集合的去重复。
  6. 根据权利要求5所述的方法,其特征在于,利用如下公式计算每个词语组中的两个词语的相似度:
    P(S,T)=α(D(S,T)+1)+βSim(pos);
    其中,S,T表示词语组中的两个词语,P(S,T)表示两个词语的相似度,D(S,T)表示两个词语的最小编辑距离,Sim(pos)表示两个词语的词性相似度,α与β均为权重系数。
  7. 一种评论标签提取装置,其特征在于,包括:
    二元组提取模块,用于将当前待处理对象对应的各条评论进行二元组提取,将提取出的所述二元组组合成第一集合;其中,所述二元组包括:主语词和修饰词;
    第一组合模块,用于确定所述各条评论中词频-反转文件频率TF-IDF大于第一设定阈值的词语,将所述确定的词语组合成第二集合;
    第二组合模块,用于按照第一设定规则对所述第一集合以及所述第二集合进行处理,生成第三集合;
    第三组合模块,用于确定所述各条评论中主题权重值大于第二设定阈值的词语,将所述确定的主题权重值大于第二设定阈值的词语组合成第四集合;
    第四组合模块,用于对所述第三集合以及所述第四集合进行求交集处理得到第五集合;
    去重复模块,用于对所述第五集合中的词语进行去重复,并将去重复后剩余的词语确定为所述当前待处理对象的评论标签。
  8. 根据权利要求7所述的装置,其特征在于,所述二元组提取模块将当前待处理对象对应的各条评论进行二元组提取时:
    针对每条评论,对该评论包含的每个句子进行分词,并确定分词后的各词语的词性;对所述各词语的词性进行句法分析,获取所述每个句子中词语之间的修饰关系,依据所述修饰关系构建所述每个句子对应的二元组。
  9. 根据权利要求7所述的装置,其特征在于,所述第二组合模块包括:
    修饰词提取子模块,用于提取所述第一集合中各二元组包含的修饰词或主语词,组成修饰词集合或主语词集合;
    并集处理子模块,用于对所述修饰词集合或主语词集合与所述第二集合进行求并集处理,生成所述第三集合。
  10. 根据权利要求7所述的装置,其特征在于,所述第三组合模块确定所述各条评论中主题权重值大于第二设定阈值的词语时:
    依据潜在狄利克雷分布模型确定所述各条评论中的各词语的主题权重值;分别将各词语的主题权重值与所述第二设定阈值进行比对,以确定出主题权重值大于所述第二设定阈值的词语。
  11. 根据权利要求7所述的装置,其特征在于,所述去重复模块包括:
    分组子模块,用于将所述第五集合中的各词语分别两两进行组合,组合成词语组;
    相似度计算子模块,用于针对每个词语组,分别依据当前词语组中两个词语的最小编辑距离以及词性相似度确定当前词语组中的两个词语的相似度值;
    删除子模块,用于分别将相似度值大于第三设定阈值的词语组中的一个词语删除,以完成对所述第五集合的去重复;
    确定子模块,用于将去重复后剩余的词语确定为所述当前待处理对象的评论标签。
  12. 根据权利要求11所述的装置,其特征在于,所述相似度计算子模块利用如下公式计算每个词语组中的两个词语的相似度:
    P(S,T)=α(D(S,T)+1)+βSim(pos);
    其中,S,T表示词语组中的两个词语,P(S,T)表示两个词语的相似度,D(S,T)表示两个词语的最小编辑距离,Sim(pos)表示两个词语的词性 相似度,α与β均为权重系数。
  13. 一种计算机程序,包括计算机可读代码,当所述计算机可读代码在电子装置上运行时,导致所述电子装置执行根据权利要求1-6中的任一个所述的评论标签提取方法。
  14. 一种计算机可读介质,其中存储了如权利要求13所述的计算机程序。
PCT/CN2016/089277 2015-12-01 2016-07-07 评论标签提取方法和装置 WO2017092337A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/249,677 US20170154077A1 (en) 2015-12-01 2016-08-29 Method for comment tag extraction and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510866792.5A CN105975453A (zh) 2015-12-01 2015-12-01 评论标签提取方法和装置
CN201510866792.5 2015-12-01

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/249,677 Continuation US20170154077A1 (en) 2015-12-01 2016-08-29 Method for comment tag extraction and electronic device

Publications (1)

Publication Number Publication Date
WO2017092337A1 true WO2017092337A1 (zh) 2017-06-08

Family

ID=56988369

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/089277 WO2017092337A1 (zh) 2015-12-01 2016-07-07 评论标签提取方法和装置

Country Status (2)

Country Link
CN (1) CN105975453A (zh)
WO (1) WO2017092337A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117470A (zh) * 2017-06-22 2019-01-01 北京国双科技有限公司 一种评价文本信息的评价关系提取方法及装置
CN110110190A (zh) * 2018-02-02 2019-08-09 北京京东尚科信息技术有限公司 信息输出方法和装置
CN110826323A (zh) * 2019-10-24 2020-02-21 新华三信息安全技术有限公司 评论信息有效性检测方法及装置
CN113011182A (zh) * 2019-12-19 2021-06-22 北京多点在线科技有限公司 一种对目标对象进行标签标注的方法、装置和存储介质
US11266626B2 (en) 2015-09-09 2022-03-08 The Trustees Of Columbia University In The City Of New York Reduction of ER-MAM-localized APP-C99 and methods of treating alzheimer's disease
CN115858738A (zh) * 2023-02-27 2023-03-28 浙江浙商金控有限公司 一种企业舆情信息相似性识别方法

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729317B (zh) * 2017-10-13 2021-07-30 北京三快在线科技有限公司 评价标签的确定方法、装置及服务器
CN108920512B (zh) * 2018-05-31 2021-12-28 江苏一乙生态农业科技有限公司 一种基于游戏软件场景的推荐方法
CN109145291A (zh) * 2018-07-25 2019-01-04 广州虎牙信息科技有限公司 一种弹幕关键词筛选的方法、装置、设备及存储介质
CN109522275B (zh) * 2018-11-27 2020-11-20 掌阅科技股份有限公司 基于用户生产内容的标签挖掘方法、电子设备及存储介质
CN110188356B (zh) * 2019-05-30 2023-05-19 腾讯音乐娱乐科技(深圳)有限公司 信息处理方法及装置
CN110688832B (zh) * 2019-10-10 2023-06-09 河北省讯飞人工智能研究院 一种评论生成方法、装置、设备及存储介质
CN111079026B (zh) * 2019-11-28 2023-11-24 北京秒针人工智能科技有限公司 一种确定人物印象数据的方法、存储介质和装置
CN112184323A (zh) * 2020-10-13 2021-01-05 上海风秩科技有限公司 评价标签生成方法和装置、存储介质及电子设备
CN114491013A (zh) * 2021-12-09 2022-05-13 重庆邮电大学 一种融入句法结构信息的主题挖掘方法、存储介质及***
CN115686432B (zh) * 2022-12-30 2023-04-07 药融云数字科技(成都)有限公司 一种用于检索排序的文献评价方法、存储介质及终端

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005309498A (ja) * 2004-04-16 2005-11-04 Kddi Corp 情報抽出システム、情報抽出方法、コンピュータプログラム
US20100257440A1 (en) * 2009-04-01 2010-10-07 Meghana Kshirsagar High precision web extraction using site knowledge
CN103455562A (zh) * 2013-08-13 2013-12-18 西安建筑科技大学 一种文本倾向性分析方法及基于该方法的商品评论倾向判别器
CN104778209A (zh) * 2015-03-13 2015-07-15 国家计算机网络与信息安全管理中心 一种针对千万级规模新闻评论的观点挖掘方法
CN104951430A (zh) * 2014-03-27 2015-09-30 携程计算机技术(上海)有限公司 产品特征标签的提取方法及装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870447A (zh) * 2014-03-11 2014-06-18 北京优捷信达信息科技有限公司 一种基于隐含狄利克雷模型的关键词抽取方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005309498A (ja) * 2004-04-16 2005-11-04 Kddi Corp 情報抽出システム、情報抽出方法、コンピュータプログラム
US20100257440A1 (en) * 2009-04-01 2010-10-07 Meghana Kshirsagar High precision web extraction using site knowledge
CN103455562A (zh) * 2013-08-13 2013-12-18 西安建筑科技大学 一种文本倾向性分析方法及基于该方法的商品评论倾向判别器
CN104951430A (zh) * 2014-03-27 2015-09-30 携程计算机技术(上海)有限公司 产品特征标签的提取方法及装置
CN104778209A (zh) * 2015-03-13 2015-07-15 国家计算机网络与信息安全管理中心 一种针对千万级规模新闻评论的观点挖掘方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, PIJI ET AL.: "Extraction and Ranking of Tags for User Opinions", JOURNAL OF CHINESE INFORMATION PROCESSING, vol. 26, no. 5, 30 September 2012 (2012-09-30) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11266626B2 (en) 2015-09-09 2022-03-08 The Trustees Of Columbia University In The City Of New York Reduction of ER-MAM-localized APP-C99 and methods of treating alzheimer's disease
CN109117470A (zh) * 2017-06-22 2019-01-01 北京国双科技有限公司 一种评价文本信息的评价关系提取方法及装置
CN110110190A (zh) * 2018-02-02 2019-08-09 北京京东尚科信息技术有限公司 信息输出方法和装置
CN110826323A (zh) * 2019-10-24 2020-02-21 新华三信息安全技术有限公司 评论信息有效性检测方法及装置
CN110826323B (zh) * 2019-10-24 2023-04-25 新华三信息安全技术有限公司 评论信息有效性检测方法及装置
CN113011182A (zh) * 2019-12-19 2021-06-22 北京多点在线科技有限公司 一种对目标对象进行标签标注的方法、装置和存储介质
CN113011182B (zh) * 2019-12-19 2023-10-03 北京多点在线科技有限公司 一种对目标对象进行标签标注的方法、装置和存储介质
CN115858738A (zh) * 2023-02-27 2023-03-28 浙江浙商金控有限公司 一种企业舆情信息相似性识别方法

Also Published As

Publication number Publication date
CN105975453A (zh) 2016-09-28

Similar Documents

Publication Publication Date Title
WO2017092337A1 (zh) 评论标签提取方法和装置
US20170154077A1 (en) Method for comment tag extraction and electronic device
US9424524B2 (en) Extracting facts from unstructured text
US9880998B1 (en) Producing datasets for representing terms and objects based on automated learning from text contents
TW201638803A (zh) 文本挖掘系統和工具
CN107463548B (zh) 短语挖掘方法及装置
Avasthi et al. Techniques, applications, and issues in mining large-scale text databases
CN110019820B (zh) 一种病历中主诉与现病史症状时间一致性检测方法
CN109791632B (zh) 场景片段分类器、场景分类器以及记录介质
AU2018411565B2 (en) System and methods for generating an enhanced output of relevant content to facilitate content analysis
CN108388660A (zh) 一种改进的电商产品痛点分析方法
CN111291177A (zh) 一种信息处理方法、装置和计算机存储介质
Yalcin et al. An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding
CN108228612B (zh) 一种提取网络事件关键词以及情绪倾向的方法及装置
Ashraf et al. Audio-based multimedia event detection with DNNs and sparse sampling
Beheshti et al. Big data and cross-document coreference resolution: Current state and future opportunities
CN115795030A (zh) 文本分类方法、装置、计算机设备和存储介质
Wang et al. A deep learning-based quality assessment model of collaboratively edited documents: A case study of Wikipedia
Kunilovskaya et al. Text preprocessing and its implications in a digital humanities project
TWI234720B (en) Related document linking managing system, method and recording medium
TWI636370B (zh) Establishing chart indexing method and computer program product by text information
CN115129864A (zh) 文本分类方法、装置、计算机设备和存储介质
Camelin et al. Frnewslink: a corpus linking tv broadcast news segments and press articles
Anoop et al. A distributional semantics-based information retrieval framework for online social networks
KR102052823B1 (ko) 잠재 디리클레 할당을 이용한 토픽 모델 자동화 방법 및 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16869650; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 16869650; Country of ref document: EP; Kind code of ref document: A1)