CN101630321A - On-line article screening method based on data mining (DM) - Google Patents

On-line article screening method based on data mining (DM)

Info

Publication number
CN101630321A
Authority
CN
China
Prior art keywords
article
screening
articles
score
periodical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910042170A
Other languages
Chinese (zh)
Inventor
罗笑南
***
刘宁
文允
叶均杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN200910042170A
Publication of CN101630321A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an on-line article screening method based on data mining (DM), in particular a method that combines several techniques to identify online articles, and belongs to the field of network technology. The method comprises the following steps: (1) keyword matching; (2) checking whether the article has been published; (3) assessing the degree of content coverage; (4) similarity screening and deletion of redundant articles; (5) classification by article language; (6) extraction of relevant outstanding articles; (7) re-screening of high-quality articles; (8) deletion of junk articles; and (9) recommendation of authors of high-quality articles. The method can improve screening efficiency and save labor cost.

Description

An online article screening method based on data mining
Technical field
The invention discloses an online article screening method based on data mining, which belongs to the field of network technology.
Background
Article screening is a systematic, independent process of objectively evaluating articles in order to obtain those needed in a particular area, determining how well each one satisfies the screening criteria, and producing a result. It is mainly an examination of the conformity, validity, and suitability of an article's content (for example, whether the content is rich and whether it has practical use). In terms of how it is carried out, screening is characterized by being systematic and independent. Systematic means that every element subject to screening is covered; independent means that the screening activity is independent of the person or organization being screened, so that the screening is fair and objective.
However, existing online screening methods either rely too heavily on human involvement or are too simplistic, for example using only the click rate.
Manual article screening wastes labor and material resources and suffers from subjectivity, for example decisions skewed by personal preference or limited knowledge. When several screeners are involved, differences in their abilities lead to differences in the screening results.
Relying on click rate alone also has many problems. Click counts are strongly affected by time: older articles generally rank at the top simply because they have been available longer, while newer, better articles cannot reach the top because they appeared later, and so lose much of the attention they deserve. Good articles are lost as a result.
Data mining, also called Knowledge Discovery in Databases (KDD), is the non-trivial process of obtaining valid, novel, potentially useful, and ultimately understandable patterns from large amounts of data. Briefly, data mining extracts or "mines" knowledge from large amounts of data. Data mining is well suited to article screening. Effective use of data mining methods can greatly reduce manual work, and this patent uses several data mining methods to solve the problems of existing article screening.
Summary of the invention
The present invention overcomes the deficiencies of the prior art and proposes an online article screening method based on data mining. By combining several methods, it avoids human involvement as far as possible and so screens articles automatically. The invention can be applied to building the article collections of government websites and other relatively authoritative websites, with good results.
The invention applies data mining to content coverage and to similarity comparison. For content coverage, a content point can be considered covered when several of its keywords appear, and a function point can be considered covered when a piece of program has the corresponding inputs and outputs. For similarity comparison, partial phrase or paragraph matching is used: each type of phrase or paragraph is assigned a weight, and when the accumulated total exceeds a threshold the two articles are considered duplicates. For ranking the different types of articles, a feed-forward neural network method is used: the relevant attributes, such as time, article rank, and click rate, are continually revised according to the number of articles entered, and articles are ranked dynamically according to the weights of these attributes.
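The weighted matching rule described above can be illustrated with a minimal Python sketch. The category names, the weights, and the threshold are all hypothetical, since the patent states only that weights are assigned and that a threshold exists; the sketch simply adds a category's weight for every item two articles share.

```python
# Hypothetical per-category weights and threshold; the patent gives no values.
MATCH_WEIGHTS = {"keyword": 3, "abstract_phrase": 2, "paragraph": 5}
DUPLICATE_THRESHOLD = 7


def similarity_score(article_a, article_b, weights=MATCH_WEIGHTS):
    """Add the category weight once for every item the two articles share."""
    score = 0
    for category, weight in weights.items():
        shared = set(article_a.get(category, ())) & set(article_b.get(category, ()))
        score += weight * len(shared)
    return score


def is_duplicate(article_a, article_b, threshold=DUPLICATE_THRESHOLD):
    """Treat two articles as duplicates once the accumulated weight exceeds the threshold."""
    return similarity_score(article_a, article_b) > threshold


existing = {
    "keyword": ["data mining", "screening"],
    "abstract_phrase": ["association rule", "neural network"],
}
incoming = {
    "keyword": ["data mining", "screening"],
    "abstract_phrase": ["association rule"],
}
unrelated = {"keyword": ["weather"], "abstract_phrase": ["rainfall"]}

print(is_duplicate(existing, incoming), is_duplicate(existing, unrelated))
```

Integer weights are used here only to keep the arithmetic easy to follow; a real system would presumably tune both the weights and the threshold on its own article library.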
The main steps of the method are as follows.
The screening principles applied to all articles are:
(1) keyword matching;
(2) whether the article has been published;
(3) degree of content coverage;
(4) similarity screening and deletion of redundant articles;
(5) classification by article language;
(6) extraction of relevant outstanding articles;
(7) re-screening of high-quality articles;
(8) deletion of junk articles;
(9) recommendation of authors of high-quality articles.
The whole procedure applies these screening principles step by step, in order. Articles are first screened and deleted under principles (1) to (4), then classified under principle (5), and then refined under principles (6) and (7). Unsuitable articles already in the library are removed under principle (8). After all of these steps are finished, principle (9) is used for recommendation.
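The ordering of principles (1) to (7) can be pictured as a small pipeline. The sketch below is only an illustration under simplifying assumptions: principles (1) to (4) are treated as pass/fail predicates (in the detailed description, some of them also contribute scores), and the function names, dictionary keys, and the score cutoff of 80 are hypothetical.

```python
def screen(article, filters, classify, refine):
    """Run one article through principles (1)-(7) in order.

    filters:  list of (name, predicate) pairs standing in for principles (1)-(4)
    classify: function returning a language class, principle (5)
    refine:   function returning a quality label, principles (6)-(7)
    """
    for name, passes in filters:
        if not passes(article):
            # The article is deleted at the first principle it fails.
            return {"status": "deleted", "failed": name}
    return {
        "status": "kept",
        "language": classify(article),
        "quality": refine(article),
    }


# Hypothetical predicates standing in for the real checks.
filters = [
    ("keyword", lambda a: bool(set(a["keywords"]) & {"data mining", "screening"})),
    ("duplicate", lambda a: not a.get("duplicate", False)),
]
classify = lambda a: a.get("language", "unknown")
refine = lambda a: "high-quality" if a.get("score", 0) >= 80 else "ordinary"

kept = screen({"keywords": ["data mining"], "language": "zh", "score": 85},
              filters, classify, refine)
dropped = screen({"keywords": ["cooking"]}, filters, classify, refine)
print(kept, dropped)
```

Principles (8) and (9) act on the library as a whole rather than on a single incoming article, so they are left out of this per-article sketch.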
The modules included in the invention are:
(1) checking module
(2) core journal list store
(3) high-quality article storage area
(4) junk article storage area
(5) human intervention module
(6) in-library article search module
The checking module is the core component of the method. It extracts phrases from articles, matches partial content, and scores content; all data mining methods used in extraction, matching, and scoring are implemented in this module. The core journal list is the list required for checking, used to look up whether an article was published in a core journal. The high-quality article and junk article storage areas are where articles are classified after screening, making it easier for readers to select articles. The human intervention module solicits articles on a given topic and handles articles that the method cannot process, thereby improving screening accuracy. This module may include a user interface and the operating interface of the system that implements the method. The in-library article search module lets users conveniently search for articles or for high-quality articles, which gives the library higher reference value.
The beneficial effects of the invention are:
(1) Good articles, whether old or new, appear at the top, so that the best articles are regularly pinned to the top.
(2) Articles irrelevant to the website can be excluded more effectively, in particular the junk articles and advertisements that are now widespread on the Internet.
(3) In many respects the method outperforms manual screening: it avoids both the limits of an individual's knowledge and the errors caused by the fatigue of repetitive work.
(4) It effectively saves human resources, labor expenditure, and cost.
Description of the drawings
The invention is described in further detail below with reference to the drawings:
Fig. 1 is a flow chart of an implementation of the invention;
Fig. 2 is a module structure diagram of the invention;
Fig. 3 is a schematic diagram of the feed-forward neural network ranking method.
Detailed description
The invention is further described below with reference to the drawings.
The implementation flow of the invention is shown in Fig. 1. Its basic steps are as follows:
(1) For a new article, first determine whether it is relevant to the specific website. Phrases extracted from the keywords and the abstract are checked, and a decision attribute can be used here. If the attribute value is false (the article is unrelated to the content the website collects), the article is eliminated directly. If it is true, the article proceeds to the next screening step;
(2) Next, determine whether the article has been published in a domestic or foreign core journal. If the article has been published, look it up in a core journal table; if it appeared in a journal in this table, it can basically be accepted. If the journal is not in the core journal list, the article can still be given a corresponding score because it has been published. The core journal table needs dynamic maintenance, and a maintenance cycle of once a week is generally sufficient. The table can be downloaded from relevant websites, or computed dynamically from attributes set by the operator (citation count, click count, article rank), with the score of each journal determining which journals are core journals. The screening systems of similar websites can also share their core journal lists with one another to keep them up to date;
(3) Then score the article according to the content points it covers. This score can be adjusted dynamically according to how many content points are covered by all articles in the library. The article's score for this part is obtained by accumulating the content points it covers. Content points can be computed with association rule mining. Because the website handles a large number of articles, it is relatively easy to extract the phrases that describe a given piece of content, or to derive the function point of a piece of code from its inputs and outputs. Each time an article is reviewed, the existing association rules can be dynamically adjusted and filtered, removing old rules and keeping new, useful ones;
(4) Then measure the article's similarity to existing articles. Matching can be applied to the keywords, the abstract, or even the full text. Keyword matching is considered first; if the keywords match, full-text matching is then applied. When a certain degree of similarity is found, the decision is based on the score of the existing article: if the existing article's score is low, the older article is deleted; otherwise the newly entered article is deleted. The similarity measurement used here does not match the full text word by word and sentence by sentence. Instead, the keywords are compared first to obtain a similarity score, and then the abstracts are matched. Abstract matching uses selected phrases, which are extracted from a large body of earlier articles and classified with association rules. The more relevant phrases that match, the higher the similarity of the two articles, which gives a reasonably reliable similarity measure;
(5) Then classify the article by language. The language-bearing parts of an article can be divided into the abstract and the body, for example whether the abstract has an English version. Language classification serves the demand for articles in each language;
(6) Then check whether articles on some particular topic are especially needed at present. If so, test whether this article is of the needed type; if it is, file it and record it as an outstanding article, otherwise proceed to the next screening step. A relevant principle is needed here: some articles may score lower but be in greater demand, and such articles may be accepted at a lowered threshold. This adjustment can be made manually or by a purpose-designed system;
(7) The final step is a comprehensive review. This comprehensive screening mainly consists of human intervention on a small number of articles: extracting the best parts of some high-quality articles, directly deleting some junk articles by hand, and filing some uncertain articles. The value of this part is to ensure the correctness of the invention: human participation is required where the method cannot screen. In actual experiments, this part improved screening accuracy by 5% to 10%;
(8) Finally, run the article author score entry module, which accumulates each author's articles according to the article's author. The purpose of this part is to solicit articles from the author when appropriate, or to skip part of the screening process;
(9) The scores of the article from the parts described above are summed to give an overall evaluation of the article, and the article is graded by score from worst to best. Articles are stored by grade to avoid scrambling when specific articles need to be found. As stated earlier, the score of each part is adjusted dynamically, but this adjustment is a fairly laborious process that relies mainly on data mining methods and accumulation over time. In short, the screening mechanism's selection of articles becomes more and more accurate as time goes on.
The modules corresponding to the proposed method are shown in Fig. 2. The checking module is the core component of the method. It extracts phrases from articles, matches partial content, and scores content; all data mining methods used in extraction, matching, and scoring are implemented in this module. The core journal list is the list required for the check in step 2, used to look up whether an article was published in a core journal. The high-quality article area and the junk article storage area are where articles are classified after screening, making it easier for readers to select articles. The human intervention module, as described in steps 7 and 8, solicits articles on a given topic and handles articles that the method cannot process, thereby improving screening accuracy. This module may include a user interface and the operating interface of the system that implements the method. The in-library article search module lets users conveniently search for articles or for high-quality articles, which gives the library higher reference value.
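One way to picture the core journal list used in step 2 is a set that is rebuilt periodically from journal statistics. In the sketch below, the attribute names follow step 2 (citation count, click count, article rank), but the weights, the assumption that each attribute is normalized to a 0 to 100 scale, the `top_n` cutoff, and the point values are all hypothetical.

```python
# Hypothetical attribute weights; the description names the attributes
# but does not say how they are combined.
ATTRIBUTE_WEIGHTS = {"citations": 0.5, "clicks": 0.2, "article_rank": 0.3}


def journal_score(stats, weights=ATTRIBUTE_WEIGHTS):
    """Weighted sum of a journal's attributes (assumed normalized to 0-100)."""
    return sum(weights[name] * stats[name] for name in weights)


def rebuild_core_list(journals, top_n):
    """Recompute the core list, for example in a weekly job, as the top_n journals by score."""
    ranked = sorted(journals, key=lambda j: journal_score(journals[j]), reverse=True)
    return set(ranked[:top_n])


def publication_score(journal, core_list, core_points=30, published_points=10):
    """Step 2: full points for a core journal, partial points for any other publication."""
    if journal is None:  # the article has not been published
        return 0
    return core_points if journal in core_list else published_points


journals = {
    "J1": {"citations": 90, "clicks": 50, "article_rank": 80},
    "J2": {"citations": 20, "clicks": 90, "article_rank": 30},
    "J3": {"citations": 60, "clicks": 40, "article_rank": 70},
}
core = rebuild_core_list(journals, top_n=2)
print(sorted(core), publication_score("J2", core))
```

Because the list is an ordinary set, lists shared by the screening systems of similar websites could be merged with a set union before the next rebuild.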
The rank adjustment method in the data mining described in step 9 can be a feed-forward neural network method, as shown in Fig. 3. For example, click count, article rank, time in the library, and citation count are corrected at multiple points using the weight of each attribute. Each node has a weight (the number on each node), the node weights in each layer sum to 1, and each layer has one node fewer than the previous one, until a single node remains. The score of the last node is the article's composite score, which determines the article's final rank. This calculation and ranking are updated once every fixed period, for example every hour, to keep the ranking accurate.
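The layered weighting can be illustrated with a small Python sketch. It is one possible reading of the text, not a reconstruction of Fig. 3: each node of the next layer is assumed to be a weighted average of the previous layer's values, every weight row sums to 1, and each layer has one node fewer than the one before. The weight values are invented, and the four attributes are assumed to be normalized to the range 0 to 1 with larger values being better.

```python
def forward(values, layers):
    """Propagate attribute values through layers of weighted averages.

    Each layer is a list of weight rows. Every row sums to 1 and produces one
    node of the next layer, so a layer with n inputs has n - 1 rows.
    """
    for layer in layers:
        if len(layer) != len(values) - 1:
            raise ValueError("each layer must have one node fewer than its input")
        for row in layer:
            if len(row) != len(values) or abs(sum(row) - 1.0) > 1e-9:
                raise ValueError("each weight row must match the input size and sum to 1")
        values = [sum(w * v for w, v in zip(row, values)) for row in layer]
    return values[0]


def rank_articles(articles, layers):
    """Order article names by composite score, best first."""
    return sorted(articles, key=lambda name: forward(articles[name], layers), reverse=True)


# Hypothetical weights for a 4 -> 3 -> 2 -> 1 network.
LAYERS = [
    [[0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25], [0.1, 0.2, 0.3, 0.4]],
    [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]],
    [[0.6, 0.4]],
]

# Attribute order: clicks, article rank, time in library, citations.
articles = {
    "strong": [0.9, 0.8, 0.7, 0.9],
    "weak": [0.2, 0.3, 0.1, 0.2],
}
print(rank_articles(articles, LAYERS))
```

Since every node is a weighted average, the composite score always lies between the smallest and largest input attribute. Rerunning `rank_articles` on a schedule, for example hourly, would correspond to the periodic update described above.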

Claims (2)

1. An online article screening method based on data mining, characterized in that the main steps of the method comprise:
1) first, determining whether the article is relevant to the specific website: phrases are extracted from the keywords and the abstract and used as a decision attribute; if the article is unrelated to the content the website collects, the attribute value is false and the article is eliminated directly; if the attribute value is true, the article proceeds to the next screening step;
2) next, determining whether the article has been published in a domestic or foreign core journal: if the article has been published, a core journal table is searched, and if the article appeared in a journal in this table it is accepted; if the journal is not in the core journal list, the article is given a corresponding score because it has been published;
3) then scoring the article according to the content points it covers, the score being adjusted dynamically according to how many content points are covered by all articles in the library, the article's score for this part being obtained by accumulating the covered content points, and the content points being computed with association rule mining;
4) then auditing the article for similarity, applying matching to the keywords, the abstract, or even the full text; keyword matching is considered first, and if the keywords match, full-text matching is applied; when a certain degree of similarity is found, the decision is based on the score of the existing article: if the existing article's score is low, the older article is deleted, otherwise the newly entered article is deleted; the similarity audit used here does not match the full text word by word and sentence by sentence, but first compares the keywords to obtain a similarity score and then matches the abstracts;
5) then classifying the article by language, the language-bearing parts of an article being divided into the abstract and the body, and the language classification serving the demand for articles in each language;
6) then checking whether articles on some particular topic are especially needed at present; if so, testing whether this article is of the needed type, and if it is, filing it and recording it as an outstanding article, otherwise proceeding to the next screening step; a relevant principle is needed here: some articles score lower but are in greater demand, and such articles are accepted at a lowered threshold; this adjustment can be made either manually or by a purpose-designed system;
7) carrying out a comprehensive review as the final step, this comprehensive screening mainly consisting of human intervention on a small number of articles, including extracting the best parts of some high-quality articles, directly deleting some junk articles by hand, and filing some uncertain articles;
8) finally, running the article author score entry module, which accumulates each author's articles according to the article's author;
9) summing the scores of the article from the parts described above to give an overall evaluation of the article, grading the article by score, and storing articles by grade.
2. The online article screening method based on data mining according to claim 1, characterized in that the core journal table in step 2) requires dynamic maintenance, generally on a weekly cycle; the core journal table is downloaded from relevant websites; attributes are set by the operator, including citation count, click rate, and impact factor, and then computed dynamically, with the score of each journal determining which journals are core journals; and the screening systems of similar websites share their core journal lists with one another to keep them up to date.
CN200910042170A 2009-08-26 2009-08-26 On-line article screening method based on data mining (DM) Pending CN101630321A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910042170A CN101630321A (en) 2009-08-26 2009-08-26 On-line article screening method based on data mining (DM)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910042170A CN101630321A (en) 2009-08-26 2009-08-26 On-line article screening method based on data mining (DM)

Publications (1)

Publication Number Publication Date
CN101630321A true CN101630321A (en) 2010-01-20

Family

ID=41575429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910042170A Pending CN101630321A (en) 2009-08-26 2009-08-26 On-line article screening method based on data mining (DM)

Country Status (1)

Country Link
CN (1) CN101630321A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012119339A1 (en) * 2011-03-04 2012-09-13 中兴通讯股份有限公司 Retrieval method and apparatus
CN102682120A (en) * 2012-05-15 2012-09-19 合一网络技术(北京)有限公司 Method,device and system for acquiring essential article commented on network
CN104657505A (en) * 2015-03-13 2015-05-27 华北电力大学 Paper automatic database retrieving method based on cloud platform and mobile terminal
CN105653661A (en) * 2015-12-29 2016-06-08 云南电网有限责任公司电力科学研究院 Search result re-ranking method and device
CN106934226A (en) * 2017-03-02 2017-07-07 成都华信高科医疗器械有限责任公司 A kind of data management system and method based on electro photoluminescence terminal
CN107315807A (en) * 2017-06-26 2017-11-03 三螺旋大数据科技(昆山)有限公司 Talent recommendation method and apparatus
CN108804624A (en) * 2013-12-18 2018-11-13 国网江苏省电力有限公司常州供电分公司 The method of text gear typing and comparison
CN110020729A (en) * 2019-03-05 2019-07-16 中国联合网络通信集团有限公司 Article reviewing method and device based on artificial intelligence
CN110134785A (en) * 2019-04-15 2019-08-16 平安普惠企业管理有限公司 Management method, device, storage medium and the equipment of forum's article
CN113535952A (en) * 2021-07-13 2021-10-22 六棱镜(杭州)科技有限公司 Intelligent matching data processing method based on artificial intelligence

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012119339A1 (en) * 2011-03-04 2012-09-13 中兴通讯股份有限公司 Retrieval method and apparatus
CN102682120A (en) * 2012-05-15 2012-09-19 合一网络技术(北京)有限公司 Method,device and system for acquiring essential article commented on network
CN102682120B (en) * 2012-05-15 2015-06-03 合一网络技术(北京)有限公司 Method and device for acquiring essential article commented on network
CN108804624A (en) * 2013-12-18 2018-11-13 国网江苏省电力有限公司常州供电分公司 The method of text gear typing and comparison
CN108984593A (en) * 2013-12-18 2018-12-11 国网江苏省电力有限公司常州供电分公司 The method that multi-format text keeps off typing and compares
CN108959203A (en) * 2013-12-18 2018-12-07 国网江苏省电力有限公司常州供电分公司 A kind of method text gear typing and compared
CN104657505A (en) * 2015-03-13 2015-05-27 华北电力大学 Paper automatic database retrieving method based on cloud platform and mobile terminal
CN104657505B (en) * 2015-03-13 2017-10-10 华北电力大学 A kind of paper based on cloud platform and mobile terminal is checked and accepted automatically draws method
CN105653661A (en) * 2015-12-29 2016-06-08 云南电网有限责任公司电力科学研究院 Search result re-ranking method and device
CN106934226A (en) * 2017-03-02 2017-07-07 成都华信高科医疗器械有限责任公司 A kind of data management system and method based on electro photoluminescence terminal
CN107315807A (en) * 2017-06-26 2017-11-03 三螺旋大数据科技(昆山)有限公司 Talent recommendation method and apparatus
CN107315807B (en) * 2017-06-26 2020-08-04 三螺旋大数据科技(昆山)有限公司 Talent recommendation method and device
CN110020729A (en) * 2019-03-05 2019-07-16 中国联合网络通信集团有限公司 Article reviewing method and device based on artificial intelligence
CN110134785A (en) * 2019-04-15 2019-08-16 平安普惠企业管理有限公司 Management method, device, storage medium and the equipment of forum's article
CN113535952A (en) * 2021-07-13 2021-10-22 六棱镜(杭州)科技有限公司 Intelligent matching data processing method based on artificial intelligence
CN113535952B (en) * 2021-07-13 2024-02-09 六棱镜(杭州)科技有限公司 Intelligent matching data processing method based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN101630321A (en) On-line article screening method based on data mining (DM)
CN105574159B (en) A kind of user's portrait method for building up and user's portrait management system based on big data
CN103744928B (en) A kind of network video classification method based on history access record
CN101819573B (en) Self-adaptive network public opinion identification method
CN103049440B (en) A kind of recommendation process method of related article and disposal system
EP2560111A2 (en) Systems and methods for facilitating the gathering of open source intelligence
CN107220295A (en) A kind of people's contradiction reconciles case retrieval and mediation strategy recommends method
CA2513851A1 (en) Phrase-based generation of document descriptions
CN105653671A (en) Similar information recommendation method and system
CA2513852A1 (en) Phrase-based searching in an information retrieval system
Windsor et al. The language of autocrats: Leaders' language in natural disaster crises
CN103488635A (en) Method and device for acquiring product information
CN105354305A (en) Online-rumor identification method and apparatus
KR101599675B1 (en) Apparatus and method for predicting degree of corporation credit risk using corporation news searching technology based on big data technology
CN108984667A (en) A kind of public sentiment monitoring system
CN103546326A (en) Website traffic statistic method
CN105869100A (en) Method for fusion and prediction of multi-field monitoring data of landslides based on big data thinking
CN104834739B (en) Internet information storage system
CN102236654A (en) Web useless link filtering method based on content relevancy
CN104809252A (en) Internet data extraction system
CN102262663A (en) Method for repairing software defect reports
Zhou et al. Delineating infrastructure failure interdependencies and associated stakeholders through news mining: The case of Hong Kong’s water pipe bursts
CN107194617A (en) A kind of app software engineers soft skill categorizing system and method
CN113901308A (en) Knowledge graph-based enterprise recommendation method and recommendation device and electronic equipment
CN105760633A (en) Green architectural design method applicable to information assistance

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20100120