Summary of the invention
In view of the problems in the prior art, the object of the present invention is to provide a method, device and system for obtaining elite comments from network comments. Computer programs and algorithms are used to obtain the elite comments automatically, which reduces network administration cost and improves the quality of the comments selected.
To achieve the above object, the invention provides a method for obtaining elite network comments, characterized by comprising the following steps:
S1, extracting the keywords in a comment;
S2, obtaining the value of the extracted keywords in the comment library;
S3, calculating the value of each keyword under a given theme from the number of times the keyword occurs under that theme and its value in the comment library obtained in step S2;
S4, calculating the punctuation value of the comment;
S5, calculating the similarity value of the comment;
S6, multiplying the keyword value calculated in step S3, the punctuation value obtained in step S4 and the similarity value calculated in step S5 to obtain the score of each comment;
S7, after the scores of multiple comments have been obtained, taking the comments whose scores exceed a certain threshold as elite comments.
Further, the method for obtaining elite network comments of the present invention is characterized in that step S1 specifically comprises:
S11, performing word segmentation on the comment content;
S12, after segmentation, removing stop words according to a stop-word list; the remaining words are the keywords of the comment content.
Further, the method for obtaining elite network comments of the present invention is characterized in that the value of a keyword in the comment library in step S2 is calculated by inverse document frequency (IDF).
Further, the method for obtaining elite network comments of the present invention is characterized in that, in step S4, the more closely the punctuation in a comment conforms to normal usage, the higher the value of the comment.
Further, the method for obtaining elite network comments of the present invention is characterized in that step S4 specifically comprises:
S41, collecting statistics on the distribution of punctuation marks in a large-scale corpus, normalizing the distribution of the ratio of Chinese characters to symbols over all sentences so that the highest score is 1 point, and calculating a symbol distribution score;
S42, processing the symbol distribution scores to form a distribution curve of Chinese characters and symbols;
S43, calculating the symbol factor score of the comment according to the distribution curve.
Further, the method for obtaining elite network comments of the present invention is characterized in that, in step S5, the more similar a comment is to historical comments, the lower its value.
Further, the method for obtaining elite network comments of the present invention is characterized in that a back-end management program is used to set which comments are elite comments, and these comments are displayed preferentially.
The method and system for obtaining elite network comments of the present invention use a computer program to evaluate network comments and obtain the elite comments automatically. The selection is objective, covers a large volume of comments, and reduces omissions. Comments can be sorted by score, which makes it convenient to screen comments and related information and reduces manual intervention and comment maintenance cost.
Embodiment
To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the accompanying drawings and embodiments:
The internet hosts many kinds of themes, such as forum posts, microblogs, pictures and videos. Users typically comment on these themes in a continuing stream of replies, which produces a large number of comments. Because the comments take the same form for every kind of theme, namely text, the way of obtaining elite comments can be shared across themes. In this specific embodiment, we therefore use users' comments on a video theme to describe how elite comments are obtained.
Fig. 1 is a flow chart of the method for obtaining elite network comments of the present invention.
As shown in Fig. 1, the method of the invention is implemented as follows:
S1, extracting the keywords in a comment;
A theme often has a great many comments; a video, for example, often attracts thousands of comments after it is released. To obtain the elite comments, the content of each comment must be analyzed. For each comment, word segmentation is first performed on the comment content; stop words are then removed according to a stop-word list, and the remaining words are the keywords of the comment content. These extracted keywords represent the textual features of the comment. A word in the stop-word list has little influence on the meaning of the text and can be ignored. Part of the stop-word list comes from the internet, and a small part is derived statistically; for example, if statistics over a large number of comments show that the keyword "sofa" scores very low, it can be added to the stop-word list. Other stop words include, for example, "as if", "certain", and so on.
The core idea of the keyword extraction step is to extract the backbone of the comment sentence and find the main keywords that shape the comment content. These keywords are used to compute the comment's score in the elite comment calculation.
Example: "The contrast with the article and so on is just boring."
After word segmentation: with, article, what, contrast, just, bored;
After removing stop words: article, contrast, bored.
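The stop-word filtering in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the comment is taken to be already segmented (Chinese word segmentation itself is not shown), and the function name `extract_keywords` and the three-word stop-word list are hypothetical stand-ins for the stop-word list described above.

```python
# Toy stop-word list (an assumption for illustration only; the description
# builds its list from the internet plus corpus statistics).
STOP_WORDS = {"with", "what", "just"}


def extract_keywords(tokens, stop_words=STOP_WORDS):
    """Return the tokens that survive stop-word removal, in original order.

    Empty tokens and pure-punctuation tokens are dropped as well.
    """
    keywords = []
    for token in tokens:
        token = token.strip()
        if not token or not any(ch.isalnum() for ch in token):
            continue  # skip blanks and punctuation
        if token in stop_words:
            continue  # skip words that carry little meaning
        keywords.append(token)
    return keywords


# Tokens of the example sentence after segmentation; the trailing "." is
# included only to show that punctuation tokens are discarded.
tokens = ["with", "article", "what", "contrast", "just", "bored", "."]
keywords = extract_keywords(tokens)
print(keywords)
```

A set is used for the stop-word list so that each membership check takes constant time, which matters when thousands of comments are processed per video.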
S2, obtaining the value of the extracted keywords in the comment library;
The comment library here refers to the comment data that a service provider has accumulated for all of its videos, for example all videos on Yoqoo; comment data means the comments that users post after watching a video. From the comment library, all keywords that occur in the site's comments can be counted, and the value of each keyword over the whole library can be calculated. As a concrete example, the keyword "repost" ("sofa") may occur in a large number of comments, while the keyword "article" may occur in only a few; "article" then has more influence on a comment (a higher value) than "repost". Values can therefore be assigned in the comment library according to how distinctive each keyword is. Specifically:
The value of a keyword in the comment library (Term Value) is expressed by its inverse document frequency (IDF). The principle is that, across all comment documents, the more documents a keyword occurs in, the lower its value. If a word or phrase occurs with a high term frequency (TF) in one document and rarely in other documents, it is considered to have good power to discriminate between classes and to be suitable for classification. TF-IDF is simply TF*IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency with which a term occurs in a document. The main idea of IDF is that the fewer documents contain a term t, that is, the smaller n is, the larger the IDF, which indicates that t discriminates well between classes. If m documents of a class C contain t and k documents of other classes contain t, the total number of documents containing t is n=m+k; when m is large, n is also large, so the IDF given by the IDF formula is small, which indicates that t discriminates poorly between classes.
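As an illustration, the inverse document frequency of every keyword in a small comment library can be computed as below. The description does not state which IDF variant is used, so this sketch assumes the textbook form log(N / n), where N is the number of comments and n is the number of comments containing the keyword; the function name and the four-comment library are hypothetical.

```python
import math


def inverse_document_frequency(comments_keywords):
    """Map each keyword to log(N / n).

    N is the number of comments; n is the number of comments that contain
    the keyword. Smoothing and log base are assumptions of this sketch.
    """
    total = len(comments_keywords)
    doc_count = {}
    for keywords in comments_keywords:
        for kw in set(keywords):  # count each comment once per keyword
            doc_count[kw] = doc_count.get(kw, 0) + 1
    return {kw: math.log(total / n) for kw, n in doc_count.items()}


# Hypothetical library: "repost" is common, "article" is rare.
library = [
    ["repost"],
    ["repost", "artistic skills"],
    ["repost"],
    ["article", "contrast", "bored"],
]
idf = inverse_document_frequency(library)
print(idf["repost"] < idf["article"])
```

The common keyword receives the lower value, which matches the principle that a keyword occurring in many comments says little about any one of them.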
S3, calculating the value of each keyword under a given theme from the number of times the keyword occurs under that theme and its value in the comment library obtained in step S2;
For a given video, the purpose of obtaining a keyword's value under that video is to measure the keyword's influence within the video. It can be regarded as a second-stage calculation of the keyword value, which yields the keyword's influence on one particular video.
For example, suppose the values of all keywords in the comment library have been calculated in step S2, and "Huang Haibo" and "Lin Xinru" each score 2.0 points. Within a particular video (such as the first episode of "A Beautiful Daughter-in-Law Era"), the values of these two keywords will differ. If "Huang Haibo" occurs 6 times and "Lin Xinru" occurs once, then "Huang Haibo" = 12 points and "Lin Xinru" = 2 points.
For a comment to be selected as an elite comment, it must contain multiple keywords.
The "Video Term value (keyword score)" in the elite comment formula (step S6) can be obtained by adding up the scores of the individual keywords.
Example: there are two comments.
The scores of "Huang Haibo", "Lin Xinru", "artistic skills" and "good" are 12, 2, 5 and 0.1 respectively.
C1 = Huang Haibo's artistic skills are pretty good
Keyword score: "Huang Haibo" + "artistic skills" + "good" = 12+5+0.1 = 17.1
C2 = Lin Xinru's artistic skills are pretty good
Keyword score: "Lin Xinru" + "artistic skills" + "good" = 2+5+0.1 = 7.1
The value of a keyword in a video (Video Term Value) reflects how often the keyword occurs in the comments that different users post on that video.
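The two-comment example can be reproduced with a short Python sketch. The function names `video_term_value` and `keyword_score` are hypothetical; the numbers are those given in the example, and keywords without a known value are assumed to contribute 0.

```python
def video_term_value(library_value, occurrences):
    """Value of a keyword under one video: library value times occurrences."""
    return library_value * occurrences


def keyword_score(comment_keywords, video_values):
    """Sum the per-video values of the keywords found in one comment."""
    # Assumption: a keyword with no known value contributes nothing.
    return sum(video_values.get(kw, 0.0) for kw in comment_keywords)


video_values = {
    "Huang Haibo": video_term_value(2.0, 6),  # 2.0 points, 6 occurrences
    "Lin Xinru": video_term_value(2.0, 1),    # 2.0 points, 1 occurrence
    "artistic skills": 5.0,  # taken directly from the example
    "good": 0.1,             # taken directly from the example
}
c1 = keyword_score(["Huang Haibo", "artistic skills", "good"], video_values)
c2 = keyword_score(["Lin Xinru", "artistic skills", "good"], video_values)
print(round(c1, 1), round(c2, 1))
```

The scores are rounded only for display, since floating-point addition of values such as 0.1 is not exact.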
S4, calculating the punctuation value of the comment;
The principle behind the punctuation value (Sign Value) of a comment is that the more closely the punctuation in the comment conforms to normal usage, the higher the value of the comment. It is calculated as follows:
First, collect statistics on the distribution of punctuation marks in a large-scale corpus and normalize the distribution of the ratio of Chinese characters to symbols over all sentences so that the highest score is 1 point, which gives a punctuation distribution score. Next, process these scores to form a distribution curve of Chinese characters and punctuation marks. Finally, compute the ratio of Chinese characters to symbols in the comment and obtain the comment's punctuation value score from the computed distribution.
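The normalization described above can be sketched in Python as follows. This is a minimal illustration, not the invention's implementation: the function names are hypothetical, the lookup rounds a comment's ratio to the nearest integer instead of fitting a smoothed distribution curve, and a comment with no punctuation or with an unseen ratio is given 0, a case the description does not specify. The sentence counts are those of the example that follows.

```python
def symbol_score_table(ratio_counts):
    """Normalize sentence counts so the most common ratio scores 1.0.

    ratio_counts maps a ratio (Chinese characters per punctuation mark)
    to the number of corpus sentences that have that ratio.
    """
    peak = max(ratio_counts.values())
    return {ratio: count / peak for ratio, count in ratio_counts.items()}


def symbol_score(n_chars, n_symbols, table):
    """Look up the symbol score of one comment in the normalized table."""
    if n_symbols == 0:
        return 0.0  # assumption: the description does not cover this case
    ratio = round(n_chars / n_symbols)
    return table.get(ratio, 0.0)  # assumption: unseen ratios score 0


# Sentence counts per ratio: 10:1, 11:1 and 9:1.
table = symbol_score_table({10: 300_000, 11: 200_000, 9: 250_000})
print(round(table[11], 2), round(table[9], 2))
print(symbol_score(10, 1, table))
```

A production version would likely interpolate between neighbouring ratios, which is presumably what the distribution curve is for; the table lookup keeps the sketch short.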
For example: take a Chinese corpus of 3 million sentences and count the ratio of Chinese characters to symbols in each sentence. The ratio shared by the most sentences is given a symbol score of 1 point, and sentences with other ratios are scored in proportion.
Statistics:
Sentences with a Chinese character to symbol ratio of 10:1 are the most numerous, at 300,000 sentences, so the symbol score for sentences with this ratio is set to 1 point.
There are 200,000 sentences with a ratio of 11:1, so an 11:1 sentence scores 1*(20/30) ≈ 0.67 points.
There are 250,000 sentences with a ratio of 9:1, so a 9:1 sentence scores 1*(25/30) ≈ 0.83 points.
The symbol factor score of the comment "The artistic skills of Huang Haibo are all well and good!" is calculated as follows: first compute the ratio of Chinese characters to punctuation marks in the comment, which is 10:1; the symbol factor score of the comment is therefore 1 point.
S5, calculating the similarity value of the comment;
The similarity value (Similarity Value) compares a comment on a video with the historical comments on that video. The underlying principle is that the more similar a later comment is to historical comments, the lower its value.
Text similarity is calculated with the Dice coefficient, which measures the similarity between two texts by the number of keywords they share and the weight of each keyword; here every keyword weight is 1.
The Dice coefficient formula is:
Dice(s1,s2)=2×comm(s1,s2)/(leng(s1)+leng(s2))
where comm(s1, s2) is the number of identical characters in s1 and s2, and leng(s1) and leng(s2) are the lengths of the strings s1 and s2.
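The formula can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of the description: shared items are counted as a multiset intersection (the formula does not say how repeated characters are handled), and two empty inputs are given a similarity of 0. The helper `similarity_value` follows the worked example later in the description, where the similarity score is 1 minus the similarity to the closest earlier comment. Both function names are hypothetical.

```python
from collections import Counter


def dice(s1, s2):
    """Dice coefficient: 2 * comm(s1, s2) / (leng(s1) + leng(s2)).

    s1 and s2 may be strings (compared character by character) or lists
    of keywords; every item has weight 1.
    """
    if not s1 and not s2:
        return 0.0  # assumption: avoids division by zero
    common = sum((Counter(s1) & Counter(s2)).values())
    return 2 * common / (len(s1) + len(s2))


def similarity_value(comment, history):
    """1 minus the highest Dice similarity to any earlier comment."""
    if not history:
        return 1.0  # nothing to compare against, so no penalty
    return 1.0 - max(dice(comment, old) for old in history)


# "night" and "nacht" share n, h and t: 2 * 3 / (5 + 5).
print(dice("night", "nacht"))
```

Passing keyword lists instead of raw strings gives the keyword-based reading of the measure, while passing strings gives the character-based reading of the formula.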
S6, multiplying the keyword value calculated in step S3, the punctuation value obtained in step S4 and the similarity value calculated in step S5 to obtain the score of each comment;
Specifically, the score of a comment under a video can be written as:
Comment score = Video Term value (keyword score) * Sign value (symbol factor) * Similarity value (similarity factor)
The formula can also be extended:
Comment score = keyword factor * symbol factor * similarity factor * other factor 1 * other factor 2
Other factors capture, for example, the influence on the comment score of information such as the title, the user and the video synopsis.
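A minimal Python sketch of the scoring formula, including the optional extension by further factors, is given below. The function name `comment_score` and its parameter names are hypothetical, and how the other factors would be derived from the title, user or synopsis is not specified in the description.

```python
import math


def comment_score(term_value, sign_value, similarity_value, other_factors=()):
    """Multiply the three factors and any optional extra factors.

    With no extra factors, math.prod(()) is 1, so the result reduces to
    the basic three-factor formula.
    """
    return term_value * sign_value * similarity_value * math.prod(other_factors)


# Keyword score 19, symbol factor 1, similarity factor 0.7.
print(round(comment_score(19, 1, 0.7), 1))
```

Because the factors are multiplied, a comment that scores 0 on any single factor, for example one made only of punctuation, cannot become an elite comment however strong its other factors are.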
S7, after the scores of multiple comments have been obtained, taking the comments whose scores exceed a certain threshold as elite comments.
After the scores of multiple comments have been obtained, the comments are displayed on the video playback page in descending order of score, and those whose scores exceed a certain threshold are taken as elite comments.
In addition, the elite comments can be adjusted manually: a back-end management program is used to set which comments are elite comments, and these comments are displayed preferentially.
For example, suppose the elite comments computed in the example below are C1 and C3, and C4 is manually set as an elite comment.
The elite comments are then finally displayed in the order: C4, C1, C3.
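The display order with manual intervention can be sketched in Python as follows. The function name `display_order` and the `pinned` parameter are hypothetical, and the scores are the final scores of the worked example given later in the description; placing manually set comments ahead of the computed ones follows the C4, C1, C3 result above.

```python
def display_order(scores, threshold, pinned=()):
    """Return elite comment ids: manually set ones first, then by score.

    scores maps a comment id to its score. pinned lists the ids that an
    administrator marked as elite; they keep the order in which they
    were given.
    """
    automatic = [cid for cid, score in scores.items()
                 if score > threshold and cid not in pinned]
    automatic.sort(key=lambda cid: scores[cid], reverse=True)
    return list(pinned) + automatic


scores = {"C1": 17.1, "C2": 0.0, "C3": 13.3, "C4": 4.2, "C5": 5.2}
order = display_order(scores, threshold=10, pinned=["C4"])
print(order)
```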
A concrete example of extracting elite comments for one video is given below to describe the implementation of the invention in detail, so that those skilled in the art can follow the whole process:
The value of each keyword in the comment library (step S2) is calculated in advance, before the elite comment calculation is run; it is obtained from statistics over all comment text in the comment library.
Suppose Huang Haibo = 4 points, artistic skills = 2.5 points, good = 0.1 points.
Suppose a video has 6 comments (in practice, elite comment extraction is run on at least 300 comments, and at most tens of thousands).
(user 1) C1 = Huang Haibo's artistic skills are pretty good.
(user 2) C2=.。。。。。。。。
(user 3) C3 = Huang Haibo in this TV play is a good person!
(user 4) C4 = The leading man above also has a style of his own.。。。。。。。。。。。。。。
(user 5) C5 = Huang Haibo's artistic skills are fine.
(user 1) C6=Huang Haibo
After word segmentation and keyword extraction:
C1: Huang Haibo, artistic skills, good
C2: NULL (no keywords)
C3: TV play, inside, Huang Haibo, good person
C4: above, leading man, own, style
C5: Huang Haibo, artistic skills, fine
Keyword scores:
Huang Haibo = 4*3 = 12 (keyword value * number of occurrences in the video; a keyword is counted only once per user, so the occurrence in C6 is not counted)
Artistic skills = 2.5*2 = 5
Good = 0.1*1 = 0.1
The preliminary score of each comment is then:
C1 = "Huang Haibo" + "artistic skills" + "good" = 12+5+0.1 = 17.1
C2=0
C3=19
C4=14
C5=17.2
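The rule that a keyword is counted only once per user can be sketched in Python as follows. The function name `count_distinct_users` is hypothetical. Only the three keywords whose library values are stated in the example are included, so the other keywords of C3 and C4, whose values are not given, are omitted and their preliminary scores are not recomputed here.

```python
def count_distinct_users(comments):
    """Count, for each keyword, how many distinct users mentioned it."""
    users_by_keyword = {}
    for user, keywords in comments:
        for kw in keywords:
            users_by_keyword.setdefault(kw, set()).add(user)
    return {kw: len(users) for kw, users in users_by_keyword.items()}


library_value = {"Huang Haibo": 4.0, "artistic skills": 2.5, "good": 0.1}

# (user, keywords) pairs, restricted to keywords with a known value.
comments = [
    (1, ["Huang Haibo", "artistic skills", "good"]),  # C1
    (2, []),                                          # C2
    (3, ["Huang Haibo"]),                             # C3
    (4, []),                                          # C4
    (5, ["Huang Haibo", "artistic skills"]),          # C5
    (1, ["Huang Haibo"]),                             # C6, same user as C1
]
counts = count_distinct_users(comments)
video_value = {kw: library_value[kw] * n for kw, n in counts.items()}
print(video_value)
```

Collecting users in a set per keyword means a user who repeats the same keyword across several comments, as user 1 does in C1 and C6, raises its count only once.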
Calculating the symbol scores:
C1=1, C2=0, C3=1, C4=0.3, C5=1
Calculating the similarity scores:
C1=1
C2=1
C3=0.7 (most similar to C1, with a similarity of 0.3; final similarity score: 1-0.3=0.7)
C4=1
C5=0.3
Final scores:
C1=17.1*1*1=17.1
C2=0
C3=19*1*0.7=13.3
C4=14*0.3*1=4.2
C5=17.2*1*0.3≈5.2
Final ranking:
C1>C3>C5>C4>C2
With the threshold set to 10, the top 2 comments are taken: the comments whose scores exceed 10, namely C1 and C3, are the elite comments.
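The final scoring, ranking and threshold selection of this example can be checked with a few lines of Python. The three factors per comment are copied from the worked example; the variable names are hypothetical.

```python
# (preliminary keyword score, symbol score, similarity score) per comment,
# copied from the worked example.
factors = {
    "C1": (17.1, 1.0, 1.0),
    "C2": (0.0, 0.0, 1.0),
    "C3": (19.0, 1.0, 0.7),
    "C4": (14.0, 0.3, 1.0),
    "C5": (17.2, 1.0, 0.3),
}
THRESHOLD = 10

# Final score is the product of the three factors.
final = {cid: k * s * sim for cid, (k, s, sim) in factors.items()}

# Rank by descending score, then keep the comments above the threshold.
ranking = sorted(final, key=final.get, reverse=True)
elite = [cid for cid in ranking if final[cid] > THRESHOLD]

print(ranking)
print(elite)
```

C5 illustrates the effect of the similarity factor: its preliminary score is the second highest, but its closeness to the earlier comment C1 pulls its final score below the threshold.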
The technical scheme of the present invention can be implemented in a single standalone device, which yields a physical device capable of carrying out the scheme. Fig. 2 is a block diagram of the device for obtaining elite network comments of the present invention, which specifically comprises the following modules:
A keyword extraction module, configured to extract the keywords in a comment;
A comment library keyword value acquisition module, configured to obtain the value of the extracted keywords in the comment library;
A comment keyword value calculation module, configured to calculate the value of each keyword under a given theme from the number of times the keyword occurs under that theme and its value in the comment library obtained in step S2;
A comment punctuation value calculation module, configured to calculate the punctuation value of the comment;
A comment similarity calculation module, configured to calculate the similarity value of the comment;
A comment score calculation module, configured to multiply the keyword value calculated by the comment keyword value calculation module, the punctuation value obtained by the comment punctuation value calculation module and the similarity value calculated by the comment similarity calculation module to obtain the score of each comment;
An elite comment determination module, configured to take, after the scores of multiple comments have been obtained, the comments whose scores exceed a certain threshold as elite comments.
In addition, the present invention can also be carried out cooperatively by separate devices, which yields a system capable of carrying out the scheme. Fig. 3 is a block diagram of the system for obtaining elite network comments of the present invention, which specifically comprises the following devices:
A keyword extraction device, configured to extract the keywords in a comment;
A comment library keyword value acquisition device, configured to obtain the value of the extracted keywords in the comment library;
A comment keyword value calculation device, configured to calculate the value of each keyword under a given theme from the number of times the keyword occurs under that theme and its value in the comment library obtained in step S2;
A comment punctuation value calculation device, configured to calculate the punctuation value of the comment;
A comment similarity calculation device, configured to calculate the similarity value of the comment;
A comment score calculation device, configured to multiply the keyword value calculated by the comment keyword value calculation device, the punctuation value obtained by the comment punctuation value calculation device and the similarity value calculated by the comment similarity calculation device to obtain the score of each comment;
An elite comment determination device, configured to take, after the scores of multiple comments have been obtained, the comments whose scores exceed a certain threshold as elite comments.
In summary, the method, device and system for obtaining elite network comments provided by the invention adopt a new technical scheme in which a program automatically analyzes and scores all comments under a video, producing a list of scores that ranks the comments by quality. The elite comment calculation also guards against problems such as one user flooding the comment area or multiple users posting similar content, so the resulting scores are reasonably fair. The scheme is suitable for showing the more outstanding comments in the comment area of a video playback page.
The above is a detailed description of the preferred embodiments of the present invention. Those of ordinary skill in the art should appreciate, however, that various improvements, additions and substitutions are possible within the scope and under the guidance of the spirit of the invention, for example adjusting the order of interface calls, changing message formats and contents, or implementing the scheme in a different programming language (such as C, C++ or Java). All of these fall within the scope of protection defined by the claims of the present invention.