CN108334610A - News text classification method, device and server - Google Patents

News text classification method, device and server

Info

Publication number
CN108334610A
Authority
CN
China
Prior art keywords
news text
news
feature words
classification vocabulary
newsletter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810116106.6A
Other languages
Chinese (zh)
Inventor
任宁
晋耀红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Science and Technology (Beijing) Co., Ltd.
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201810116106.6A
Publication of CN108334610A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a news text classification method, device and server. First, a classification vocabulary is created from a known news corpus. Next, news texts are classified according to the classification vocabulary to obtain the hit category of each news text. The news texts are then segmented into words, and the relevance between each segmented word and the hit category is obtained. Finally, feature words are selected from the segmented words according to this relevance and added to the classification vocabulary. As classification proceeds, the classification vocabulary is continually updated, so that it accumulates and refines feature words during use, follows changes in news content in a timely manner, and maintains and improves its ability to classify newly emerging news texts, thereby improving the accuracy of news text classification.

Description

News text classification method, device and server
Technical Field
The present application relates to the field of natural language processing, and in particular to a news text classification method, device and server.
Background
In the field of natural language processing, the processing of text data includes text classification, text organization and text management, among other types. Text classification refers to the process of automatically determining the category of a text from its content under a given classification system.
With the development of mobile Internet technology, information sources on the Internet have become increasingly diverse, and the volume of Internet information has grown rapidly. In the news media field, with the rapid rise of streaming media and Internet self-media, news sources have become more diverse and news is produced much faster. How to collect news from the Internet effectively and classify the collected news has become a major challenge for news media. Classifying news texts has therefore become an important application of text classification.
In the prior art, news texts are mostly classified using text classification methods based on statistical algorithms. Fig. 1 is a schematic diagram of a prior-art news text classification method. When such a method is used, a large number of news texts of known categories are first labeled; the labeled news texts are then used as a training corpus to train a text classifier, giving the classifier the ability to classify unknown texts. The accuracy of prior-art text classification methods based on statistical algorithms depends on the quality and quantity of the training corpus. To improve classification accuracy, the text classifier must therefore be trained on a large training corpus; however, corpus labeling is done manually, which makes it difficult to obtain the large amount of training corpus required, and the resulting accuracy is unsatisfactory.
In addition, news is highly topical and time-sensitive, changes quickly and becomes outdated quickly, so news texts often become outdated while the corpus is still being accumulated and cannot reflect current hot news. Because the training corpus lags in timeliness during accumulation, text classification methods based on statistical algorithms lack the ability to classify newly emerging news texts, and in practice their accuracy on news texts is low.
Therefore, how to improve the accuracy of news text classification has become a technical problem that those skilled in the art urgently need to solve.
Summary of the Invention
The embodiments of the present application provide a news text classification method, device and server to solve the problems existing in the prior art.
In a first aspect, an embodiment of the present application provides a news text classification method, the method comprising:
S110: create a classification vocabulary from a known news corpus; the classification vocabulary is provided with multiple news categories, and each news category includes at least one feature word;
S120: classify news texts according to the classification vocabulary to obtain the hit category of each news text;
S130: segment the news texts into words, and obtain the relevance between each segmented word of each news text and the hit category;
S140: according to the relevance, select feature words from the segmented words of the news texts, and add the selected feature words to the classification vocabulary;
S150: repeat steps S120 to S140 until the accuracy with which the classification vocabulary classifies news texts meets a preset termination condition.
In a second aspect, an embodiment of the present application further provides a news text classification device, the device comprising:
a creating unit, configured to create a classification vocabulary from a known news corpus; the classification vocabulary is provided with multiple news categories, and each news category includes at least one feature word;
a classification unit, configured to classify news texts according to the classification vocabulary to obtain the hit category of each news text;
a computing unit, configured to segment the news texts into words and obtain the relevance between each segmented word of each news text and the hit category;
a word selection unit, configured to select feature words from the segmented words of the news texts according to the relevance, and add the selected feature words to the classification vocabulary.
In a third aspect, an embodiment of the present application further provides a server, the server comprising:
a processor and a memory;
the memory is used to store the classification vocabulary and a program executable by the processor;
the processor is configured to execute the following program steps:
S110: create a classification vocabulary from a known news corpus; the classification vocabulary is provided with multiple news categories, and each news category includes at least one feature word;
S120: classify news texts according to the classification vocabulary to obtain the hit category of each news text;
S130: segment the news texts into words, and obtain the relevance between each segmented word of each news text and the hit category;
S140: according to the relevance, select feature words from the segmented words of the news texts, and add the selected feature words to the classification vocabulary;
S150: repeat steps S120 to S140 until the accuracy with which the classification vocabulary classifies news texts meets a preset termination condition.
As the above technical solution shows, the embodiments of the present application provide a news text classification method, device and server. First, a classification vocabulary is created from a known news corpus. Next, news texts are classified according to the classification vocabulary to obtain the hit category of each news text. The news texts are then segmented into words, and the relevance between each segmented word and the hit category is obtained. Finally, feature words are selected from the segmented words according to the relevance and added to the classification vocabulary. As classification proceeds, the classification vocabulary is continually updated, so that it accumulates and refines feature words during use, follows changes in news content in a timely manner, and maintains and improves its ability to classify newly emerging news texts, thereby improving the accuracy of news text classification.
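The iteration over steps S120 to S150 can be pictured with the following minimal sketch. It is not the application's implementation: the function name `update_vocabulary`, the callables passed in (`classify`, `segment`, `relevance`, `is_good_enough`), the fixed relevance threshold and the `max_rounds` cap are all assumptions introduced here so that the loop structure can be shown without committing to any particular classifier, word segmenter or relevance measure.

```python
from typing import Callable, Dict, Iterable, List, Set

Vocabulary = Dict[str, Set[str]]  # news category -> feature words


def update_vocabulary(
    vocabulary: Vocabulary,
    texts: Iterable[str],
    classify: Callable[[str, Vocabulary], str],   # S120 (hypothetical helper)
    segment: Callable[[str], List[str]],          # S130 (hypothetical helper)
    relevance: Callable[[str, str], float],       # S130 (hypothetical helper)
    threshold: float,                             # assumed selection rule for S140
    is_good_enough: Callable[[Vocabulary], bool], # S150 termination check
    max_rounds: int = 10,                         # safety cap, an assumption
) -> Vocabulary:
    """Repeat classify -> segment -> select until the termination check passes."""
    texts = list(texts)
    for _ in range(max_rounds):
        for text in texts:
            category = classify(text, vocabulary)            # S120: hit category
            for word in segment(text):                       # S130: segmented words
                if relevance(word, category) >= threshold:   # S140: select words
                    vocabulary.setdefault(category, set()).add(word)
        if is_good_enough(vocabulary):                       # S150: stop condition
            break
    return vocabulary
```

Passing the helpers in as parameters keeps the sketch neutral about how classification, segmentation and relevance are actually computed.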
Brief Description of the Drawings
To explain the technical solutions of the present application more clearly, the drawings used in the embodiments are briefly introduced below. Evidently, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of a prior-art news text classification method;
Fig. 2 is a flowchart of a news text classification method provided by an embodiment of the present application;
Fig. 3 is a flowchart of step S110 of a news text classification method provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of the classification system of a classification vocabulary provided by an embodiment of the present application;
Fig. 5 is a flowchart of step S110 of another news text classification method provided by an embodiment of the present application;
Fig. 6 is a flowchart of step S120 of a news text classification method provided by an embodiment of the present application;
Fig. 7 is a flowchart of step S122 of a news text classification method provided by an embodiment of the present application;
Fig. 8 is a flowchart of step S130 of a news text classification method provided by an embodiment of the present application;
Fig. 9 is a flowchart of step S140 of a news text classification method provided by an embodiment of the present application;
Fig. 10 is a structural block diagram of a news text classification device provided by an embodiment of the present application;
Fig. 11 is a structural block diagram of a server provided by an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
Embodiment 1
An embodiment of the present application provides a news text classification method. Fig. 2 is a flowchart of the method. As shown in Fig. 2, the news text classification method provided by the embodiment of the present application includes the following steps:
Step S110: create a classification vocabulary from a known news corpus; the classification vocabulary is provided with multiple news categories, and each news category includes at least one feature word.
The classification vocabulary is used to classify news texts of unknown category. The feature words of each news category in the classification vocabulary are taken from the known news corpus of the corresponding category. In news text classification, feature words work as follows: when feature words appear in a news text of unknown category, the text tends to belong to the news category of those feature words; the more distinct feature words appear and the more often they occur, the stronger this tendency.
Fig. 3 is a flowchart of step S110 of a news text classification method provided by an embodiment of the present application. In an optional embodiment, step S110 may include the following steps:
Step S111: set the news categories of the classification vocabulary.
The classification vocabulary is created from a small known news corpus. The news categories it contains are set according to the news categories of the known news corpus, forming the classification system of the classification vocabulary.
The known news corpus may come from multiple sources so as to cover different fields and different news emphases, which makes the news categories in the classification vocabulary more comprehensive. For example, the present application obtains the news corpus from multiple news websites. When the classification vocabulary is created, its news categories may be set with reference to how these news websites classify news. For example, following the classification commonly used by news websites, the classification vocabulary may be given news categories such as current politics, international, military, finance, society, education, culture and entertainment.
In practice, news categories sometimes need to be subdivided. For this purpose, news categories in the classification vocabulary may be organized into multiple levels that reflect the parent-child relationships between them. For example, with "current politics" as the parent category, subcategories such as "senior leadership activities", "personnel appointments", "anti-corruption", "Party building" and "current affairs commentary" may be set under it.
In addition, besides setting news categories with reference to how news websites classify news, temporary categories may be added as needed. The purpose of temporary categories is as follows: news hot spots break out and replace one another over time, and to follow the development of hot news, news media need to collect the latest hot news from massive amounts of news text in a timely manner. To meet this need, temporary categories may be set according to changes in hot news.
For example, as the Spring Festival approaches, Spring Festival news topics surge and become hot news. News media then need to classify Spring Festival news separately in order to collect related topics and leads. To meet this need, temporary Spring Festival categories may be set in the classification vocabulary, for example a parent category "Spring Festival" with subcategories such as "Spring Festival travel rush", "New Year customs", "Spring Festival Gala" and "returning home". After the Spring Festival, when the related hot news subsides, the temporary Spring Festival categories can be deleted from the classification vocabulary, which reduces the amount of computation during classification and improves classification efficiency.
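A multi-level category system with removable temporary categories could be held in a structure like the one sketched below. The class name `CategoryTree` and its methods are hypothetical and chosen only for illustration; the application does not prescribe a data structure.

```python
from typing import Dict, List, Optional, Set


class CategoryTree:
    """Hypothetical container for parent/child news categories."""

    def __init__(self) -> None:
        self.parent: Dict[str, Optional[str]] = {}  # category -> parent (None = top level)
        self.temporary: Set[str] = set()            # categories flagged as temporary

    def add(self, name: str, parent: Optional[str] = None,
            temporary: bool = False) -> None:
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent category: {parent}")
        self.parent[name] = parent
        if temporary:
            self.temporary.add(name)

    def children(self, name: str) -> List[str]:
        return [c for c, p in self.parent.items() if p == name]

    def remove(self, name: str) -> None:
        """Delete a category together with all of its subcategories."""
        for child in self.children(name):
            self.remove(child)
        self.parent.pop(name, None)
        self.temporary.discard(name)


tree = CategoryTree()
tree.add("current politics")
tree.add("Spring Festival", temporary=True)
tree.add("Spring Festival travel rush", parent="Spring Festival", temporary=True)
tree.remove("Spring Festival")  # after the holiday, drop the temporary branch
```

Removing the parent also removes its subcategories, which mirrors deleting the whole temporary branch once the hot topic has subsided.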
By way of example, Fig. 4 is a schematic diagram of the classification system of a classification vocabulary provided by an embodiment of the present application.
Step S112: obtain feature words from the known news corpus.
Feature words reflect the category tendency of a news text. For example, when feature words such as "important speech", "state visit", "inspection and research", "National People's Congress" and "State Council" appear in a news text, the text probably belongs to the "current politics" category. When feature words such as "limit-up", "stock" and "asset restructuring" appear in a news text, the text probably belongs to the "finance" category.
Step S113: add the feature words to the classification vocabulary according to the news category of the known news corpus from which they were obtained.
For example, the feature words obtained from the news corpus of the "senior leadership activities" category include: summit meeting, official visit, work symposium, leader's speech, state visit, inspection and research, and so on. These feature words are therefore added to the classification vocabulary as feature words of the "senior leadership activities" category.
In an optional embodiment, the classification vocabulary also includes regular expressions. News texts contain feature sentences that carry specific content or are expressed in a specific sentence pattern, and feature sentences also reflect the category tendency of a news text. For example, when the feature sentence "the National People's Congress holds a meeting" appears in a news text, the text probably belongs to the "current politics" category; when the feature sentence "XXX wins first place in XXX" appears, the text probably belongs to the "sports" category. The regular expressions summarize the sentence patterns of feature sentences. In news text classification they work as follows: when a news text of unknown category contains sentences that a regular expression matches, the text tends to belong to the category of that regular expression; the more sentences the regular expression matches, the stronger this tendency.
Fig. 5 is a flowchart of step S110 of another news text classification method provided by an embodiment of the present application. As shown in Fig. 5, when the classification vocabulary includes regular expressions, step S110 may further include the following steps after step S111:
Step S114: obtain feature sentences from the known news corpus.
For example, a real-estate news text contains the feature sentence: "City XX's new-home transaction volume fell 3 percent month-on-month this month".
Both feature words and feature sentences reflect the category tendency of a news text, and combining them during classification improves accuracy. For example, the feature sentence above contains feature words such as "new home", "transaction volume" and "month-on-month decline". "New home" reflects a real-estate tendency, whereas "transaction volume" and "month-on-month decline" also appear frequently in financial news and therefore reflect a finance tendency more strongly. When "new home", "transaction volume" and "month-on-month decline" all appear in a news text to be classified, they can interfere with classification, and the text may be wrongly assigned to the finance category. If the news category is instead judged from the feature sentence "City XX's new-home transaction volume fell 3 percent month-on-month this month", this interference is avoided and classification accuracy improves.
Step S115: convert the feature sentences into regular expressions.
A regular expression is a character string that describes text with a specific syntactic pattern. In text classification, a regular expression can be used to match, in unknown text, all passages that follow a given pattern. The present application extracts the syntactic pattern of each feature sentence in the form of a regular expression. During news text classification, the regular expressions are matched against unknown news texts to find phrases or sentences with the specific pattern, which provides a basis for classification.
For example, a known news corpus contains the feature sentence "Leader XX presided over a symposium and delivered an important speech", where XX stands for a person's name. A regular expression that matches the pattern of this feature sentence could be: presides over.{0,4}meeting.{0,6}speech (the quantifier bounds limit the number of characters allowed between the key terms).
Step S116: add the regular expressions to the classification vocabulary according to the news category of the known news corpus from which the feature sentences were obtained.
For example, the regular expressions obtained from the news corpus of the "senior leadership activities" category include: Leader XX.*attends.*meeting; presides over.{0,4}meeting.{0,6}speech; National People's Congress.{0,3}meeting.{0,6}closes; and so on. These regular expressions are therefore added to the classification vocabulary as regular expressions of the "senior leadership activities" category.
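To make the role of such patterns concrete, the sketch below matches a sentence against a per-category pattern list using Python's standard `re` module. The English pattern, its wider character bounds and the names `PATTERNS` and `matched_categories` are assumptions made for this illustration; they are an English analogue of the kind of expression described above, not expressions taken from the application.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical per-category patterns. The bounds are widened because English
# words are longer than the short key terms the example expressions were
# written for.
PATTERNS: Dict[str, List[Pattern[str]]] = {
    "senior leadership activities": [
        re.compile(r"presided over.{0,20}symposium.{0,30}speech"),
    ],
}


def matched_categories(text: str) -> List[str]:
    """Return every category that has at least one pattern matching the text."""
    return [
        category
        for category, patterns in PATTERNS.items()
        if any(p.search(text) for p in patterns)
    ]


sentence = "The chairman presided over a symposium and delivered an important speech."
print(matched_categories(sentence))
```

Bounded gaps such as `.{0,20}` keep the key terms close together, so the pattern captures a sentence shape rather than any text that merely contains the three terms somewhere.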
In an optional embodiment, each feature word and regular expression may carry a category label that indicates the news category to which it belongs.
For example, feature words, regular expressions and their category labels may be stored in the classification vocabulary in the following form:
As the classification vocabulary above shows, it may include three columns: the rightmost column holds the feature word or regular expression; the leftmost column holds a check box, which can be ticked to modify a single feature word or regular expression, or ticked in bulk to modify feature words or regular expressions in batches; and the middle column holds the news category of the feature word or regular expression.
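One way to picture a labeled entry and a batch modification is sketched below. The record type `VocabularyEntry`, its field names and the `relabel` helper are hypothetical; they only illustrate pairing each feature word or regular expression with a category label and changing the label of a selected set of entries in one pass.

```python
from dataclasses import dataclass, replace
from typing import List, Set


@dataclass(frozen=True)
class VocabularyEntry:
    category: str           # category label (the middle column)
    pattern: str            # feature word or regular expression (the rightmost column)
    is_regex: bool = False  # assumed flag distinguishing the two kinds of entry


def relabel(entries: List[VocabularyEntry], selected: Set[str],
            new_category: str) -> List[VocabularyEntry]:
    """Batch-modify the category of every entry whose pattern is selected."""
    return [
        replace(e, category=new_category) if e.pattern in selected else e
        for e in entries
    ]


entries = [
    VocabularyEntry("finance", "stock"),
    VocabularyEntry("finance", "official visit"),  # mislabeled on purpose
]
entries_fixed = relabel(entries, {"official visit"}, "senior leadership activities")
```

The `selected` set plays the part of the ticked check boxes: one element corresponds to a single modification, several elements to a batch modification.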
The present application creates a classification vocabulary containing feature words and regular expressions from a small known news corpus, with multiple news categories set according to the actual needs of news text classification. The classification vocabulary created in step S110 does not depend on a large accumulated body of news text, so it is quick to create, timely, and provides a basic news text classification capability.
Step S120: classify news texts according to the classification vocabulary to obtain the hit category of each news text.
In step S120, the number of feature words in a news text and their news categories are obtained according to the classification vocabulary, and the hit category of the news text is determined from them.
Fig. 6 is a flowchart of step S120 of a news text classification method provided by an embodiment of the present application.
In an optional embodiment, as shown in Fig. 6, step S120 may include the following steps:
Step S121: obtain all feature words contained in the news text according to the classification vocabulary.
For example, a news text is scanned against the classification vocabulary, and the feature words in the news text are obtained as follows:
2018 postgraduate entrance exam countdown: is the postgraduate title still "valuable"?
Guangming Daily: the cooling of "postgraduate entrance exam fever" is a warning that the quality of postgraduate education urgently needs improvement
Chinanews client, Beijing, December 13 (Leng Haoyang): The preliminary examination of the 2018 national master's degree entrance examination will be held from December 23 to 25. In recent years, more and more undergraduates have treated the postgraduate entrance exam as a "way out after graduation", and the expansion of enrollment has led the public to worry about the quality of postgraduate training. Why do more and more people choose the postgraduate entrance exam? Is the huge postgraduate population causing "the devaluation of advanced degrees"?
On December 7, at the new campus of Zhengzhou University, the postgraduate entrance exam entered its countdown, and a reporter visited the "postgraduate exam crowd" at colleges and universities. The picture shows a library at night, where students preparing for the postgraduate entrance exam make up a sizeable proportion. (Remainder omitted.)
As can be seen, two kinds of feature words appear in the news text of the above example. One kind is education feature words, including "postgraduate entrance exam", "postgraduate", "master's", "enrollment", "examination", "preliminary examination", "undergraduate", "graduation", "advanced degree" and "student". The other kind is finance feature words, including "devaluation".
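Counting feature-word occurrences per category, as in the example above, might look like the following sketch. The function name `count_hits` and the small vocabulary are assumptions for illustration. Plain substring counting is used here, which suits unsegmented text but can over-count in English when one feature word is contained in another word.

```python
from collections import Counter
from typing import Dict, Set


def count_hits(text: str, vocabulary: Dict[str, Set[str]]) -> Counter:
    """Count how often the feature words of each category occur in the text."""
    hits: Counter = Counter()
    for category, words in vocabulary.items():
        for word in words:
            hits[category] += text.count(word)  # non-overlapping substring count
    return hits


vocabulary = {
    "education": {"postgraduate", "enrollment"},
    "finance": {"devaluation"},
}
text = "postgraduate enrollment grows; postgraduate quality and degree devaluation"
hits = count_hits(text, vocabulary)
```

In this toy input the education words occur three times and the finance word once, which is the kind of imbalance that the later matching-degree step turns into a category decision.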
Step S122: according to the frequency with which the feature words of each news category appear in the news text, obtain the matching degree between the news text and each news category.
In general, the more frequently the feature words of a news category appear in a news text, the more likely the text is to belong to that category. The present application defines a method for calculating the matching degree that converts the frequency with which the feature words of a news category appear in a news text into the matching degree between the text and that category.
Fig. 7 is a flowchart of step S122 of a news text classification method provided by an embodiment of the present application. In an optional embodiment, step S122 may include the following steps:
Step S1221: parse the news structure of the news text; the news structure includes five parts: title, lead, body, conclusion and background.
A news text consists of five parts: title, lead, body, conclusion and background. The title sums up the essentials and states the theme of the news, so of the five parts it reflects the news category most clearly. The lead is the first paragraph or first sentence of the news; it concisely presents the core content and also clearly reflects the news category. The body is the main part of the news text and usually corresponds to its main content; it develops the theme with ample facts and expands on and explains the lead. The background is the social and natural environment in which the news occurred. The conclusion is the ending of the news text; it summarizes the content of the news or gives information about the reporter and editor. The background and conclusion are sometimes implicit in the body.
In step S1221, the news text is parsed into title, lead, body, conclusion and background, and the content of each part is obtained from the news text.
For example, parsing a news text gives the following result:
【Title】2018 postgraduate entrance exam countdown: is the postgraduate title still "valuable"?
Guangming Daily: the cooling of "postgraduate entrance exam fever" is a warning that the quality of postgraduate education urgently needs improvement
【Lead】Chinanews client, Beijing, December 13 (Leng Haoyang): The preliminary examination of the 2018 national master's degree entrance examination will be held from December 23 to 25. In recent years, more and more undergraduates have treated the postgraduate entrance exam as a "way out after graduation", and the expansion of enrollment has led the public to worry about the quality of postgraduate training. Why do more and more people choose the postgraduate entrance exam? Is the huge postgraduate population causing "the devaluation of advanced degrees"?
【Main body】On December 7, at the new campus of Zhengzhou University, the postgraduate entrance exam entered its countdown, and a reporter visited the "postgraduate exam crowd" at colleges and universities. The picture shows a library at night, where students preparing for the postgraduate entrance exam make up a sizeable proportion. (Remainder omitted.)
Step S1222, obtain the term weight function of each part of the newsletter archive (that is, the weight assigned to feature words found in that part).
In a newsletter archive, feature words located in different parts differ in how strongly they identify the news category; therefore, the application sets a different term weight function for each part of the newsletter archive.
Feature words that appear in the title identify the news category most clearly. For example, when the title of a news item contains the feature word "financing", the news is likely to be finance and economic news; when the title of a news item contains the feature word "Chinese Premier League", the news is likely to be sport news. Because feature words in the title identify the news category most clearly, the term weight function of the title part can be set to the highest value, for example 10.
Feature words that appear in the lead identify the news category less strongly than those in the title, but more strongly than those in the main body. Therefore, the term weight function of the lead part should be lower than that of the title part, for example 2.
The term weight function of the other parts may be set to 1.
Step S1223, calculate the matching degree according to the term weight functions and the frequency with which the feature words of each news category appear in each part of the newsletter archive.
In the application, the matching degree is calculated with the following formula:
P = p1 × C1 + p2 × C2 + ... + pn × Cn
Here P is the matching degree between the newsletter archive and a given news category, p1~pn are the term weight functions of the parts of the newsletter archive, and C1~Cn are the numbers of feature words of that news category found in the respective parts of the newsletter archive.
Illustratively, in the newsletter archive shown above, the term weight function of the title part is p1 = 10, that of the lead part is p2 = 2, and the term weight functions of the main body, background and conclusion parts are merged into p3 = 1.
Educational feature words appear C1 = 4 times in the title part, C2 = 13 times in the lead part, and C3 = 6 times in the main body, background and conclusion parts. Therefore, the matching degree between this newsletter archive and the educational category is P = 10 × 4 + 2 × 13 + 1 × 6 = 72.
Similarly, the matching degree between this newsletter archive and the finance and economic category is P = 2 × 1 = 2.
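The weighted sum in this worked example can be checked with a few lines of Python. This is a minimal sketch, not the application's implementation: the function name `matching_degree` and the part labels `title`, `lead` and `other` are illustrative assumptions, and only the weights (10, 2, 1) and the counts (4, 13, 6 and 1) come from the example.

```python
# Weights from the worked example: title 10, lead 2, all other parts 1.
# The part labels are hypothetical names chosen for this sketch.
PART_WEIGHTS = {"title": 10, "lead": 2, "other": 1}


def matching_degree(counts, weights=PART_WEIGHTS):
    """Return P = p1*C1 + ... + pn*Cn for one news category.

    counts maps a part label to the number of that category's
    feature words found in the part.
    """
    return sum(weights[part] * count for part, count in counts.items())


education = matching_degree({"title": 4, "lead": 13, "other": 6})
finance = matching_degree({"lead": 1})
print(education, finance)  # 72 2
```

Parts that contain no feature word of a category can simply be left out of `counts`, since they contribute zero to the sum.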
Step S123, take the news category corresponding to the highest matching degree as the hit classification.
Illustratively, the newsletter archive shown above has the highest matching degree with the educational category; therefore, educational is the hit classification of the above newsletter archive.
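Choosing the hit classification amounts to taking the category with the largest matching degree. The sketch below assumes the matching degrees are held in a dictionary keyed by category name; the name `hit_classification` is hypothetical. The application does not say how ties are broken, so this sketch simply keeps the first category with the highest value.

```python
def hit_classification(matching_degrees):
    """Return the news category with the highest matching degree."""
    return max(matching_degrees, key=matching_degrees.get)


# Matching degrees from the worked example.
degrees = {"educational": 72, "finance and economic": 2}
print(hit_classification(degrees))  # educational
```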
The application classifies newsletter archives according to the classed thesaurus. Because feature words in different parts of a newsletter archive identify the news category with different strength, a different term weight function is set for each part. The matching degree is then calculated from the frequency with which the feature words of each category appear in each part and from the term weight functions of the parts, so that the matching degree accurately reflects the correlation between the newsletter archive and the news category. Finally, the news category corresponding to the highest matching degree is taken as the hit classification. Classifying newsletter archives according to the classed thesaurus in this way can improve the accuracy of newsletter archive classification.
Step S130, segment the newsletter archive, and obtain the degree of correlation between each participle of the newsletter archive and the hit classification.
A newsletter archive also contains words that are not yet included in the classed thesaurus but can still help identify its category. To find these words in the newsletter archive and add them to the classed thesaurus, thereby enriching the vocabulary of the classed thesaurus and improving its classification accuracy, the application in step S130 first segments the newsletter archive and obtains the degree of correlation between each participle (segmented word) of the newsletter archive and the hit classification.
Fig. 8 is a flow chart of step S130 of a newsletter archive sorting technique provided by the embodiments of the present application. In an optional embodiment, step S130 may comprise the following steps:
Step S131, perform word segmentation (cutting word processing) on the newsletter archive according to a preset cutting word rule to obtain the participles of the newsletter archive.
In the application, a machine-learning-based Chinese word segmentation method can be used to segment the newsletter archive.
Illustratively, a cutting word result for a newsletter archive is shown below:
12/month/7/day/,/Zhengzhou/university/new/school district/,/prepare for the postgraduate qualifying examination/entrance/countdown/,/reporter/visit/colleges and universities/"/prepare for the postgraduate qualifying examination race/"/./figure/be/night//library/inner/,/prepare for the postgraduate qualifying examination/student/account for//quite big/ratio/.
Step S132, remove the stop words contained in the participles of the newsletter archive.
In information retrieval, to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after natural language data (or text) is processed; these characters or words are called stop words (or out-of-set words). Any kind of word can be chosen as a stop word; which words are used as stop words needs to be determined according to the given purpose. In this application, stop words can be English characters, digits, mathematical characters, punctuation marks, Chinese characters that are used extremely frequently but carry no concrete meaning, and so on.
The application can create a stop word list from preset stop words, then search the participles of the newsletter archive against the stop word list and remove the stop words that are found.
Illustratively, the result obtained after removing stop words from the cutting word result shown above is:
Zhengzhou/university/new/school district/prepare for the postgraduate qualifying examination/entrance/countdown/reporter/visit/colleges and universities/prepare for the postgraduate qualifying examination race/night/library/prepare for the postgraduate qualifying examination/student/ratio/
Removing stop words reduces the number of participles, which reduces the amount of computation when the degree of correlation is calculated and improves efficiency. In addition, it should be noted that in step S132, besides removing stop words, feature words already present in the classed thesaurus can also be removed from the participles of the newsletter archive, to further reduce the amount of computation when the degree of correlation is calculated and improve efficiency.
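Step S132 can be sketched as a simple filter over the participles. The function name `remove_stop_words` and the contents of `STOP_WORDS` are assumptions made for illustration; the application only says that a stop word list is built from preset stop words and that feature words already in the classed thesaurus may optionally be dropped as well.

```python
def remove_stop_words(participles, stop_words, known_feature_words=frozenset()):
    """Drop stop words and, optionally, feature words already in the thesaurus."""
    return [p for p in participles
            if p not in stop_words and p not in known_feature_words]


# Hypothetical stop word list covering the digits, date words and punctuation
# that disappear between the two example results.
STOP_WORDS = {"12", "month", "7", "day", ","}

participles = ["12", "month", "7", "day", ",", "Zhengzhou", "university",
               "new", "school district", ","]
kept = remove_stop_words(participles, STOP_WORDS)
print(kept)  # ['Zhengzhou', 'university', 'new', 'school district']

# Optionally also skip words the classed thesaurus already knows.
kept_new = remove_stop_words(participles, STOP_WORDS, {"university"})
print(kept_new)  # ['Zhengzhou', 'new', 'school district']
```

Using sets for the stop word list and the known feature words keeps each membership test cheap, which matters when many newsletter archives are processed.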
Step S133, calculate the TF-IDF value of each participle of the newsletter archive relative to the hit classification, and use the TF-IDF value as the degree of correlation.
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency. TF-IDF is a statistical method for assessing how important a word is to a corpus. The weight of a word increases in proportion to the number of times it appears in the corpus, but decreases in inverse proportion to the frequency with which it appears in other corpora.
In the application, TF is the number of times a participle appears in the newsletter archives of the hit classification. IDF is the reciprocal of the number of times the participle appears in the newsletter archives of all news categories.
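Under the definitions just given, TF multiplied by IDF is the share of a participle's occurrences that fall in the hit classification, a value between 0 and 1. The sketch below follows that definition literally; it is not the logarithmic IDF found in common TF-IDF implementations. The function name, the dictionary layout and the toy data are all assumptions.

```python
def degree_of_correlation(participle, participles_by_category, hit):
    """TF x IDF as defined in the text: the count in the hit classification
    divided by the count across all news categories."""
    tf = participles_by_category[hit].count(participle)
    total = sum(words.count(participle)
                for words in participles_by_category.values())
    return tf / total if total else 0.0


# Hypothetical toy data: every participle seen so far, grouped by category.
corpus = {
    "educational": ["library", "student", "library", "school district"],
    "finance and economic": ["library", "stock"],
}
print(degree_of_correlation("school district", corpus, "educational"))  # 1.0
print(degree_of_correlation("stock", corpus, "educational"))  # 0.0
```

A participle such as "library", which appears twice under the educational category and once elsewhere, scores two thirds here.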
Step S140, select feature words from the participles of the newsletter archive according to the degree of correlation, and add the selected feature words to the classed thesaurus.
Fig. 9 is a flow chart of step S140 of a newsletter archive sorting technique provided by the embodiments of the present application. In an optional embodiment, step S140 includes the following steps:
Step S141, sort the participles of the newsletter archive according to the degree of correlation.
Illustratively, the participles of the newsletter archive shown above are sorted according to the degree of correlation.
Step S142, according to the result of the participle sorting, choose the participles whose degree of correlation is higher than the first preset value as the feature words.
The application can set a first preset value for selecting feature words; the first preset value can be a reasonable value obtained from experience or through repeated calculation. Illustratively, the first preset value in the application is 0.75, so that "school district" and "library" are chosen as feature words.
Step S143, add the feature words to the classed thesaurus.
Illustratively, the feature words "school district" and "library" are added to the classed thesaurus as educational feature words. In subsequent newsletter archive classification, the classed thesaurus is able to find the feature words "school district" and "library" in a newsletter archive, so its ability to classify newsletter archives is improved.
The application sorts the participles of the newsletter archive by degree of correlation and sets a first preset value for selecting feature words; once a reasonable first preset value is determined, feature words are selected from the sorted participles. The number of feature words selected and the selection threshold can be adjusted by changing the first preset value, which in turn affects the precision with which the classed thesaurus classifies newsletter archives.
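Steps S141 to S143 can be sketched as a sort, a threshold filter and a set update. The degrees of correlation below are hypothetical, since the text does not list the actual values; only the first preset value 0.75 and the two selected words come from the example. The function name and the dictionary layout of the thesaurus are also assumptions.

```python
def select_feature_words(correlation_by_participle, first_preset_value=0.75):
    """Sort participles by degree of correlation (S141) and keep those whose
    value is higher than the first preset value (S142)."""
    ranked = sorted(correlation_by_participle.items(),
                    key=lambda item: item[1], reverse=True)
    return [word for word, value in ranked if value > first_preset_value]


# Hypothetical degrees of correlation chosen for illustration only.
correlations = {"library": 0.8, "reporter": 0.3, "school district": 0.9,
                "night": 0.2}
thesaurus = {"educational": {"postgraduate", "university"}}

selected = select_feature_words(correlations)
thesaurus["educational"].update(selected)  # S143
print(selected)  # ['school district', 'library']
```

Raising `first_preset_value` admits fewer, more specific words; lowering it grows the thesaurus faster at the risk of adding words that are not characteristic of the category.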
Step S150, repeat steps S120-S140 until the accuracy with which the classed thesaurus classifies newsletter archives meets a preset termination condition.
While classifying newsletter archives, the application can repeat steps S120-S140 on different newsletter archives, so that the classed thesaurus continually accumulates and refines its feature words in use. The feature words in the classed thesaurus can thus follow changes in the content of newsletter archives in a timely manner, maintaining and continually improving the ability of the classed thesaurus to classify newly appearing newsletter archives. Therefore, the method provided by the application, which updates the classed thesaurus while classifying newsletter archives, can improve the accuracy of newsletter archive classification.
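The control flow of step S150 can be sketched as a loop over batches of newsletter archives that stops once a measured accuracy reaches a target. The application does not specify the termination condition, so the accuracy target, the callables `run_s120_to_s140` and `measure_accuracy`, and the toy stand-ins below are all assumptions made so that the sketch runs.

```python
def update_until_accurate(thesaurus, batches, run_s120_to_s140,
                          measure_accuracy, target_accuracy):
    """Repeat S120-S140 on successive batches until the measured accuracy
    reaches the target (S150). Returns the number of rounds performed."""
    rounds = 0
    for batch in batches:
        run_s120_to_s140(thesaurus, batch)
        rounds += 1
        if measure_accuracy(thesaurus) >= target_accuracy:
            break
    return rounds


# Toy stand-ins: real versions would classify the texts, score participles
# and evaluate accuracy on labelled news.
def toy_round(thesaurus, batch):
    thesaurus["educational"].update(batch)


def toy_accuracy(thesaurus):
    return len(thesaurus["educational"]) / 4


thesaurus = {"educational": {"university"}}
batches = [["library"], ["school district"], ["student"]]
rounds = update_until_accurate(thesaurus, batches, toy_round, toy_accuracy, 0.75)
print(rounds)  # 2
```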
In an optional embodiment, the feature words include positive feature words and opposite feature words; the term weight function of a positive feature word is a positive value, and that of an opposite feature word is a negative value.
Because the term weight function of a positive feature word is positive, the appearance of a positive feature word in a newsletter archive indicates that the newsletter archive tends to belong to the news category corresponding to that positive feature word. Because the term weight function of an opposite feature word is negative, the appearance of an opposite feature word in a newsletter archive indicates that the newsletter archive should not be classified into the news category corresponding to that opposite feature word.
Positive and opposite feature words are especially useful when newsletter archives are classified at a finer granularity. For example, suppose actual demand requires sport-category newsletter archives to be further classified into subclasses such as football, basketball, tennis, table tennis and diving. Then, under the "football" subclass, feature words related to basketball (such as "block", "three-pointer" and "hack"), feature words related to tennis (such as "breaking" and "deciding game"), and feature words related to table tennis and diving can be used as opposite feature words of the "football" subclass. These opposite feature words are given negative term weight functions with large absolute values, to reduce the matching degree between newsletter archives containing them and the "football" subclass.
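The effect of opposite feature words can be illustrated with signed weights. The text does not say how these weights combine with the per-part weights, so this sketch ignores the parts and simply sums one signed weight per occurrence; the function name, the positive words "goal" and "offside", and the values +1 and -5 are hypothetical.

```python
def signed_match(participles, word_weights):
    """Sum the signed weight of every feature word occurrence."""
    return sum(word_weights.get(p, 0) for p in participles)


# Hypothetical weights for the "football" subclass: positive feature words
# count +1, opposite feature words borrowed from basketball count -5.
FOOTBALL = {"goal": 1, "offside": 1, "block": -5, "three-pointer": -5}

football_text = ["goal", "offside", "goal"]
basketball_text = ["three-pointer", "block", "goal"]
print(signed_match(football_text, FOOTBALL))    # 3
print(signed_match(basketball_text, FOOTBALL))  # -9
```

Because the negative weights have a larger absolute value than the positive ones, a single shared word such as "goal" is not enough to pull a basketball report into the "football" subclass.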
From the above technical solution it can be seen that the embodiment of the present application provides a newsletter archive sorting technique. First, a classed thesaurus is created according to known news language material; then, newsletter archives are classified according to the classed thesaurus to obtain the hit classification of each newsletter archive; then, the newsletter archive is segmented, and the degree of correlation between each participle of the newsletter archive and the hit classification is obtained; finally, feature words are selected from the participles of the newsletter archive according to the degree of correlation, and the selected feature words are added to the classed thesaurus. As classification proceeds, the application continually updates the classed thesaurus during the classification of newsletter archives, so that the classed thesaurus continually accumulates and refines its feature words in use, follows changes in the content of newsletter archives in a timely manner, and maintains and continually improves its ability to classify newly appearing newsletter archives, thereby improving the accuracy of newsletter archive classification.
Embodiment two
The embodiment of the present application provides a newsletter archive sorter. Figure 10 is a structural block diagram of a newsletter archive sorter provided by the embodiments of the present application; as shown in Figure 10, the device includes:
Creating unit 210, for creating a classed thesaurus according to known news language material; the classed thesaurus is preset with multiple news categories, and each news category contains at least one feature word;
Classification unit 220, for classifying a newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive;
Computing unit 230, for segmenting the newsletter archive and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification;
Word selection unit 240, for selecting the feature words from the participles of the newsletter archive according to the degree of correlation, and adding the selected feature words to the classed thesaurus.
From the above technical solution it can be seen that the embodiment of the present application provides a newsletter archive sorter. The device creates a classed thesaurus according to known news language material; then classifies newsletter archives according to the classed thesaurus to obtain the hit classification of each newsletter archive; then segments the newsletter archive and obtains the degree of correlation between each participle of the newsletter archive and the hit classification; finally, according to the degree of correlation, it selects feature words from the participles of the newsletter archive and adds the selected feature words to the classed thesaurus. As classification proceeds, the application continually updates the classed thesaurus during the classification of newsletter archives, so that the classed thesaurus continually accumulates and refines its feature words in use, follows changes in the content of newsletter archives in a timely manner, and maintains and continually improves its ability to classify newly appearing newsletter archives, thereby improving the accuracy of newsletter archive classification.
Embodiment three
The embodiment of the present application provides a server. Figure 11 is a structural block diagram of a server provided by the embodiments of the present application; as shown in Figure 11, the server includes:
Processor 310 and memory 320;
The memory 320 is used to store the classed thesaurus and a program executable by the processor 310;
The processor 310 is configured to execute the following program steps:
S110, create a classed thesaurus according to known news language material; the classed thesaurus is preset with multiple news categories, and each news category contains at least one feature word;
S120, classify a newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive;
S130, segment the newsletter archive, and obtain the degree of correlation between each participle of the newsletter archive and the hit classification;
S140, select the feature words from the participles of the newsletter archive according to the degree of correlation, and add the selected feature words to the classed thesaurus;
S150, repeat steps S120-S140 until the accuracy with which the classed thesaurus classifies newsletter archives meets a preset termination condition.
From the above technical solution it can be seen that the embodiment of the present application provides a server. The server creates a classed thesaurus according to known news language material; then classifies newsletter archives according to the classed thesaurus to obtain the hit classification of each newsletter archive; then segments the newsletter archive and obtains the degree of correlation between each participle of the newsletter archive and the hit classification; finally, according to the degree of correlation, it selects feature words from the participles of the newsletter archive and adds the selected feature words to the classed thesaurus. As classification proceeds, the application continually updates the classed thesaurus during the classification of newsletter archives, so that the classed thesaurus continually accumulates and refines its feature words in use, follows changes in the content of newsletter archives in a timely manner, and maintains and continually improves its ability to classify newly appearing newsletter archives, thereby improving the accuracy of newsletter archive classification.
The application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.
The application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.
It should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, without necessarily requiring or implying any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device.
Those skilled in the art will readily conceive of other embodiments of the application after considering the specification and practicing the application disclosed herein. This application is intended to cover any variations, uses or adaptations of the application that follow its general principles and include common knowledge or conventional techniques in the art that are not disclosed by the application. The description and examples are to be considered illustrative only, and the true scope and spirit of the application are indicated by the following claims.
It should be understood that the application is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.

Claims (10)

1. A newsletter archive sorting technique, characterized by comprising:
S110, creating a classed thesaurus according to known news language material; the classed thesaurus is provided with multiple news categories, and each news category contains at least one feature word;
S120, classifying a newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive;
S130, segmenting the newsletter archive, and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification;
S140, selecting the feature words from the participles of the newsletter archive according to the degree of correlation, and adding the selected feature words to the classed thesaurus;
S150, repeating steps S120-S140 until the accuracy with which the classed thesaurus classifies newsletter archives meets a preset termination condition.
2. The method according to claim 1, characterized in that the step of creating a classed thesaurus according to known news language material, the classed thesaurus being provided with multiple news categories and each news category containing at least one feature word, comprises:
Setting the news categories of the classed thesaurus;
Obtaining the feature words from the known news language material;
Adding the feature words to the classed thesaurus according to the news category of the known news language material to which the feature words belong.
3. The method according to claim 2, characterized in that the classed thesaurus also contains regular expressions, and after the step of setting the news categories of the classed thesaurus, the method further comprises:
Obtaining characteristic sentences from the known news language material;
Converting the characteristic sentences into regular expressions;
Adding the regular expressions to the classed thesaurus according to the news category of the known news language material from which the characteristic sentences originate.
4. The method according to claim 1, characterized in that the step of classifying a newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive comprises:
Obtaining, according to the classed thesaurus, all feature words contained in the newsletter archive;
Obtaining the matching degree between the newsletter archive and each news category according to the frequency with which the feature words of each news category appear in the newsletter archive;
Taking the news category corresponding to the highest matching degree as the hit classification.
5. The method according to claim 4, characterized in that the step of obtaining the matching degree between the newsletter archive and each news category according to the frequency with which the feature words of each news category appear in the newsletter archive comprises:
Parsing the news structure of the newsletter archive; the news structure includes five parts: title, lead, main body, conclusion and background;
Obtaining the term weight function of each part of the newsletter archive;
Calculating the matching degree according to the term weight functions and the frequency with which the feature words of each news category appear in each part of the newsletter archive;
Wherein the matching degree is calculated with the following formula:
P = p1 × C1 + p2 × C2 + ... + pn × Cn
Wherein P is the matching degree between the newsletter archive and a given news category, p1~pn are the term weight functions of the parts of the newsletter archive, and C1~Cn are the numbers of feature words of that news category found in the respective parts of the newsletter archive.
6. The method according to claim 1, characterized in that the step of segmenting the newsletter archive and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification comprises:
Performing word segmentation (cutting word processing) on the newsletter archive according to a preset cutting word rule to obtain the participles of the newsletter archive;
Removing the stop words contained in the participles of the newsletter archive;
Calculating the TF-IDF value of each participle of the newsletter archive relative to the hit classification, and using the TF-IDF value as the degree of correlation.
7. The method according to claim 1, characterized in that the step of selecting the feature words from the participles of the newsletter archive according to the degree of correlation and adding the selected feature words to the classed thesaurus comprises:
Sorting the participles of the newsletter archive according to the degree of correlation;
Choosing, according to the result of the participle sorting, the participles whose degree of correlation is higher than the first preset value as the feature words;
Adding the feature words to the classed thesaurus.
8. The method according to claim 5, characterized in that
The feature words include positive feature words and opposite feature words; the term weight function of a positive feature word is a positive value, and the term weight function of an opposite feature word is a negative value.
9. A newsletter archive sorter, characterized by comprising:
A creating unit, for creating a classed thesaurus according to known news language material; the classed thesaurus is provided with multiple news categories, and each news category contains at least one feature word;
A classification unit, for classifying a newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive;
A computing unit, for segmenting the newsletter archive and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification;
A word selection unit, for selecting the feature words from the participles of the newsletter archive according to the degree of correlation, and adding the selected feature words to the classed thesaurus.
10. A server, characterized by comprising:
A processor and a memory;
The memory is used to store the classed thesaurus and a program executable by the processor;
The processor is configured to execute the following program steps:
S110, creating a classed thesaurus according to known news language material; the classed thesaurus is provided with multiple news categories, and each news category contains at least one feature word;
S120, classifying a newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive;
S130, segmenting the newsletter archive, and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification;
S140, selecting the feature words from the participles of the newsletter archive according to the degree of correlation, and adding the selected feature words to the classed thesaurus;
S150, repeating steps S120-S140 until the accuracy with which the classed thesaurus classifies newsletter archives meets a preset termination condition.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810116106.6A CN108334610A (en) 2018-02-06 2018-02-06 A kind of newsletter archive sorting technique, device and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810116106.6A CN108334610A (en) 2018-02-06 2018-02-06 A kind of newsletter archive sorting technique, device and server

Publications (1)

Publication Number Publication Date
CN108334610A true CN108334610A (en) 2018-07-27

Family

ID=62928268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810116106.6A Pending CN108334610A (en) 2018-02-06 2018-02-06 A kind of newsletter archive sorting technique, device and server

Country Status (1)

Country Link
CN (1) CN108334610A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657137A (en) * 2018-11-26 2019-04-19 平安科技(深圳)有限公司 Public sentiment news category model building method, device, computer equipment and storage medium
CN109684472A (en) * 2018-12-20 2019-04-26 深圳价值在线信息科技股份有限公司 A kind of trade classification method and system of security information
CN110209329A (en) * 2019-05-23 2019-09-06 厦门美柚信息科技有限公司 Show method, apparatus, equipment and the storage medium of content of pages
CN110888978A (en) * 2018-09-06 2020-03-17 北京京东金融科技控股有限公司 Article clustering method and device, electronic equipment and storage medium
CN110941718A (en) * 2019-11-27 2020-03-31 广州快决测信息科技有限公司 Method and system for automatically identifying text category through text content
CN111324735A (en) * 2020-02-20 2020-06-23 湖南芒果听见科技有限公司 Method and terminal for automatically classifying hourly essentials
CN111506727A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Text content category acquisition method and device, computer equipment and storage medium
CN111753197A (en) * 2020-06-18 2020-10-09 达而观信息科技(上海)有限公司 News element extraction method and device, computer equipment and storage medium
CN111782601A (en) * 2020-06-08 2020-10-16 北京海泰方圆科技股份有限公司 Electronic file processing method and device, electronic equipment and machine readable medium
CN112114728A (en) * 2020-09-18 2020-12-22 北京搜狗科技发展有限公司 Input method and device and electronic equipment
CN113239197A (en) * 2021-05-12 2021-08-10 首都师范大学 Method, device and computer storage medium for classifying sentences based on TF-IDF algorithm
CN113505228A (en) * 2021-07-22 2021-10-15 上海弘玑信息技术有限公司 Multi-dimensional text data classification method, training method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008126A (en) * 2014-03-31 2014-08-27 北京奇虎科技有限公司 Method and device for segmentation on basis of webpage content classification
CN104035968A (en) * 2014-05-20 2014-09-10 微梦创科网络科技(中国)有限公司 Method and device for constructing training corpus set based on social network
CN104361010A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Automatic classification method for correcting news classification
CN104899215A (en) * 2014-03-06 2015-09-09 北京搜狗科技发展有限公司 Data processing method, recommendation source information organization, information recommendation method and information recommendation device
KR20170034206A (en) * 2015-09-18 2017-03-28 아주대학교산학협력단 Apparatus and Method for Topic Category Classification of Social Media Text based on Cross-Media Analysis
CN107045524A (en) * 2016-12-30 2017-08-15 中央民族大学 A kind of method and system of network text public sentiment classification

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899215A (en) * 2014-03-06 2015-09-09 北京搜狗科技发展有限公司 Data processing method, recommendation source information organization, information recommendation method and information recommendation device
CN104008126A (en) * 2014-03-31 2014-08-27 北京奇虎科技有限公司 Method and device for segmentation on basis of webpage content classification
CN104035968A (en) * 2014-05-20 2014-09-10 微梦创科网络科技(中国)有限公司 Method and device for constructing training corpus set based on social network
CN104361010A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Automatic classification method for correcting news classification
KR20170034206A (en) * 2015-09-18 2017-03-28 아주대학교산학협력단 Apparatus and Method for Topic Category Classification of Social Media Text based on Cross-Media Analysis
CN107045524A (en) * 2016-12-30 2017-08-15 中央民族大学 A kind of method and system of network text public sentiment classification

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110888978A (en) * 2018-09-06 2020-03-17 北京京东金融科技控股有限公司 Article clustering method and device, electronic equipment and storage medium
CN109657137B (en) * 2018-11-26 2024-05-31 平安科技(深圳)有限公司 Public opinion news classification model construction method, device, computer equipment and storage medium
CN109657137A (en) * 2018-11-26 2019-04-19 平安科技(深圳)有限公司 Public sentiment news category model building method, device, computer equipment and storage medium
CN109684472A (en) * 2018-12-20 2019-04-26 深圳价值在线信息科技股份有限公司 A kind of trade classification method and system of security information
CN110209329A (en) * 2019-05-23 2019-09-06 厦门美柚信息科技有限公司 Show method, apparatus, equipment and the storage medium of content of pages
CN110941718A (en) * 2019-11-27 2020-03-31 广州快决测信息科技有限公司 Method and system for automatically identifying text category through text content
CN111324735A (en) * 2020-02-20 2020-06-23 湖南芒果听见科技有限公司 Method and terminal for automatically classifying hourly essentials
CN111506727B (en) * 2020-04-16 2023-10-03 腾讯科技(深圳)有限公司 Text content category acquisition method, apparatus, computer device and storage medium
CN111506727A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Text content category acquisition method and device, computer equipment and storage medium
CN111782601A (en) * 2020-06-08 2020-10-16 北京海泰方圆科技股份有限公司 Electronic file processing method and device, electronic equipment and machine readable medium
CN111753197B (en) * 2020-06-18 2024-04-05 达观数据有限公司 News element extraction method, device, computer equipment and storage medium
CN111753197A (en) * 2020-06-18 2020-10-09 达而观信息科技(上海)有限公司 News element extraction method and device, computer equipment and storage medium
CN112114728B (en) * 2020-09-18 2022-02-15 北京搜狗科技发展有限公司 Input method and device and electronic equipment
CN112114728A (en) * 2020-09-18 2020-12-22 北京搜狗科技发展有限公司 Input method and device and electronic equipment
CN113239197A (en) * 2021-05-12 2021-08-10 首都师范大学 Method, device and computer storage medium for classifying sentences based on TF-IDF algorithm
CN113505228A (en) * 2021-07-22 2021-10-15 上海弘玑信息技术有限公司 Multi-dimensional text data classification method, training method and device

Similar Documents

Publication Publication Date Title
CN108334610A (en) News text classification method, device and server
CN106649818B (en) Application search intention identification method and device, application search method and server
CN108090048B (en) College evaluation system based on multivariate data analysis
CN103399891B (en) Method for automatic recommendation of network content, device and system
US8250067B2 (en) Adding dominant media elements to search results
CN105095187A (en) Search intention identification method and device
EP2192500A2 (en) System and method for providing robust topic identification in social indexes
CN106339502A (en) Modeling recommendation method based on user behavior data fragmentation cluster
CN103744981A (en) System for automatic classification analysis for website based on website content
CN105930411A (en) Classifier training method, classifier and sentiment classification system
Shimada et al. Analyzing tourism information on twitter for a local city
CN104915446A (en) Automatic extracting method and system of event evolving relationship based on news
CN107169086B (en) Text classification method
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN107958014B (en) Search engine
Hayes Using tags and clustering to identify topic-relevant blogs
CN106126605B (en) Short text classification method based on user portrait
CN112214991B (en) Microblog text standing detection method based on multi-feature fusion weighting
CN105653701A (en) Model generating method and device as well as word weighting method and device
Noel et al. Applicability of Latent Dirichlet Allocation to multi-disk search
CN109815401A (en) A kind of name disambiguation method applied to Web people search
CN110866102A (en) Search processing method
CN114330329A (en) Service content searching method and device, electronic equipment and storage medium
Li et al. Improving relevance judgment of web search results with image excerpts
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190906

Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant after: China Science and Technology (Beijing) Co., Ltd.

Address before: Room 601, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing 100089

Applicant before: Beijing Shenzhou Taiyue Software Co., Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20180727
