CN108334610A - News text classification method, device and server - Google Patents
- Publication number
- CN108334610A (application CN201810116106.6A)
- Authority
- CN
- China
- Prior art keywords
- newsletter archive
- news
- feature words
- classed thesaurus
- newsletter
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present application provide a news text classification method, device and server. First, a classification vocabulary is created from a known news corpus. Next, news texts are classified according to the classification vocabulary to obtain the hit category of each news text. The news texts are then segmented into words, and the relevance between each segmented word and the hit category is obtained. Finally, feature words are selected from the segmented words according to the relevance and added to the classification vocabulary. As classification proceeds, the classification vocabulary is continuously updated, so that it keeps accumulating and refining feature words in use, promptly follows shifts in news content, and maintains and steadily improves its ability to classify newly emerging news texts, thereby improving the accuracy of news text classification.
Description
Technical field
The present application relates to the field of natural language processing, and in particular to a news text classification method, device and server.
Background
In the field of natural language processing, the processing of text data includes text classification, text organization, text management and other types. Text classification is the process of automatically determining the category of a text from its content under a given classification system.
With the development of mobile Internet technology, information sources on the Internet have become increasingly broad, and the volume of Internet information has grown rapidly. In the news media field, with the rapid rise of streaming media and Internet self-media, news sources have become more diverse and news is produced much faster. How to collect news effectively from the Internet and classify the collected news has become a major challenge for news media. Classifying news texts has therefore become an important application of text classification.
In the prior art, news texts are mostly classified with text classification methods based on statistical algorithms. Fig. 1 is a schematic diagram of a prior-art news text classification method. When such a method is used, a large number of news texts of known category are first labeled; the labeled news texts are then used as a training corpus to train a text classifier, giving the classifier the ability to classify unknown texts. In these prior-art methods, the accuracy of text classification depends on the quality and quantity of the training corpus. Improving accuracy therefore requires training the text classifier on a large training corpus, but corpus labeling is done manually and can hardly supply the large amount of training data required, so the accuracy is unsatisfactory.
In addition, news is highly topical and time-sensitive, changes quickly and becomes outdated quickly. News texts often expire while the corpus is still being accumulated and can no longer reflect current hot news. Text classification methods based on statistical algorithms therefore lag in timeliness because of the time needed to accumulate the training corpus, and lack the ability to classify newly emerging news texts, so in practice their accuracy on news texts is low.
Therefore, how to improve the accuracy of news text classification has become a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
The embodiments of the present application provide a news text classification method, device and server to solve the problems existing in the prior art.
In a first aspect, an embodiment of the present application provides a news text classification method, the method comprising:
S110: creating a classification vocabulary according to a known news corpus, where the classification vocabulary is provided with multiple news categories and each news category includes at least one feature word;
S120: classifying news texts according to the classification vocabulary to obtain the hit category of each news text;
S130: segmenting the news texts into words, and obtaining the relevance between each segmented word of the news texts and the hit category;
S140: selecting feature words from the segmented words of the news texts according to the relevance, and adding the selected feature words to the classification vocabulary;
S150: repeating steps S120 to S140 until the accuracy with which the classification vocabulary classifies news texts meets a preset termination condition.
In a second aspect, an embodiment of the present application further provides a news text classification device, the device comprising:
a creating unit, configured to create a classification vocabulary according to a known news corpus, where the classification vocabulary is provided with multiple news categories and each news category includes at least one feature word;
a classification unit, configured to classify news texts according to the classification vocabulary to obtain the hit category of each news text;
a computing unit, configured to segment the news texts into words and obtain the relevance between each segmented word of the news texts and the hit category;
a word selection unit, configured to select feature words from the segmented words of the news texts according to the relevance and add the selected feature words to the classification vocabulary.
In a third aspect, an embodiment of the present application further provides a server, the server comprising:
a processor and a memory;
the memory is used to store the classification vocabulary and a program executable by the processor;
the processor is configured to execute the following program steps:
S110: creating a classification vocabulary according to a known news corpus, where the classification vocabulary is provided with multiple news categories and each news category includes at least one feature word;
S120: classifying news texts according to the classification vocabulary to obtain the hit category of each news text;
S130: segmenting the news texts into words, and obtaining the relevance between each segmented word of the news texts and the hit category;
S140: selecting feature words from the segmented words of the news texts according to the relevance, and adding the selected feature words to the classification vocabulary;
S150: repeating steps S120 to S140 until the accuracy with which the classification vocabulary classifies news texts meets a preset termination condition.
As can be seen from the above technical solutions, the embodiments of the present application provide a news text classification method, device and server. First, a classification vocabulary is created from a known news corpus. Next, news texts are classified according to the classification vocabulary to obtain the hit category of each news text. The news texts are then segmented into words, and the relevance between each segmented word and the hit category is obtained. Finally, feature words are selected from the segmented words according to the relevance and added to the classification vocabulary. As classification proceeds, the classification vocabulary is continuously updated during the classification process, so that it keeps accumulating and refining feature words in use, promptly follows shifts in news content, and maintains and steadily improves its ability to classify newly emerging news texts, thereby improving the accuracy of news text classification.
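The iteration S110 to S150 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the whitespace tokenizer, the relevance measure (the share of a word's occurrences that fall in texts of the hit category), the thresholds `min_relevance` and `min_count`, and the fixed number of rounds that stands in for the accuracy-based termination condition are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

def classify(text, vocabulary):
    """S120: return the category whose feature words occur most often, or None."""
    words = text.split()  # assumption: whitespace split stands in for word segmentation
    scores = {cat: sum(words.count(w) for w in feats) for cat, feats in vocabulary.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def update_vocabulary(texts, vocabulary, min_relevance=0.8, min_count=2):
    """S130-S140: add words that are strongly associated with a hit category."""
    total = Counter()
    per_category = defaultdict(Counter)
    for text in texts:
        hit = classify(text, vocabulary)
        words = text.split()
        total.update(words)
        if hit is not None:
            per_category[hit].update(words)
    known = set().union(*vocabulary.values())
    for cat, counts in per_category.items():
        for word, n in counts.items():
            # assumed relevance: fraction of the word's occurrences seen in this category
            if word not in known and n >= min_count and n / total[word] >= min_relevance:
                vocabulary[cat].add(word)
    return vocabulary

def run(texts, vocabulary, rounds=3):
    """S150: repeat S120-S140; a fixed round count replaces the accuracy check here."""
    for _ in range(rounds):
        update_vocabulary(texts, vocabulary)
    return vocabulary

# S110: a small seed vocabulary built from known news (toy data)
vocabulary = {"education": {"exam"}, "finance": {"stock"}}
texts = ["exam enrollment rises", "exam enrollment opens",
         "stock dividend paid", "stock dividend cut"]
run(texts, vocabulary)
```

In this toy run, "enrollment" and "dividend" are learned as new feature words because they occur only in texts that hit one category, while words seen only once are ignored.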
Description of the drawings
To describe the technical solutions of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, those of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of a prior-art news text classification method;
Fig. 2 is a flow chart of a news text classification method provided by an embodiment of the present application;
Fig. 3 is a flow chart of step S110 of a news text classification method provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of the classification system of a classification vocabulary provided by an embodiment of the present application;
Fig. 5 is a flow chart of step S110 of another news text classification method provided by an embodiment of the present application;
Fig. 6 is a flow chart of step S120 of a news text classification method provided by an embodiment of the present application;
Fig. 7 is a flow chart of step S122 of a news text classification method provided by an embodiment of the present application;
Fig. 8 is a flow chart of step S130 of a news text classification method provided by an embodiment of the present application;
Fig. 9 is a flow chart of step S140 of a news text classification method provided by an embodiment of the present application;
Fig. 10 is a structural block diagram of a news text classification device provided by an embodiment of the present application;
Fig. 11 is a structural block diagram of a server provided by an embodiment of the present application.
Detailed description
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Embodiment one
An embodiment of the present application provides a news text classification method. Fig. 2 is a flow chart of the method. As shown in Fig. 2, the news text classification method provided by the embodiment of the present application includes the following steps:
Step S110: creating a classification vocabulary according to a known news corpus, where the classification vocabulary is provided with multiple news categories and each news category includes at least one feature word.
The classification vocabulary is used to classify news texts of unknown category. The feature words of each news category in the classification vocabulary are taken from the known news corpus of that category. Feature words work in news text classification as follows: when a feature word appears in a news text of unknown category, the text has a tendency to belong to the news category of that feature word; the more distinct feature words appear and the more often they appear, the stronger this tendency.
Fig. 3 is a flow chart of step S110 of a news text classification method provided by an embodiment of the present application. In an optional embodiment, step S110 may include the following steps:
Step S111: setting the news categories of the classification vocabulary.
The classification vocabulary is created from a small known news corpus. The news categories it contains are set according to the news categories of the known news corpus, forming the classification system of the classification vocabulary.
The known news corpus of the present application may come from multiple sources, so as to cover different fields and different news emphases and make the news categories in the created classification vocabulary more comprehensive. For example, the present application obtains the news corpus from multiple news websites. When the classification vocabulary is created, its news categories can be set with reference to how these news websites classify news. For example, following the classification scheme common to multiple news websites, the classification vocabulary of the present application may be given news categories such as current politics, international, military, finance, society, education, culture and entertainment.
In practical news text classification, news categories sometimes need to be subdivided. For this purpose, the classification vocabulary of the present application may arrange news categories in multiple levels to express the subordination between them. For example, with "current politics" as the parent category, subcategories such as "senior leadership activities", "personnel changes", "anti-corruption", "Party building" and "commentary" can be set under it.
In addition, while setting the news categories of the classification vocabulary with reference to how multiple news websites classify news, the present application may also add temporary categories as needed. The point of temporary categories is this: over time, news hot spots break out and replace one another, and to track the development of hot news promptly, news media need to collect the latest hot news from massive volumes of news text in time. To meet this need, the present application can set temporary categories according to changes in hot news.
For example, as the Spring Festival approaches, news topics about the Spring Festival surge and become hot news. To collect news topics and leads about the Spring Festival, news media then need to classify Spring Festival news separately. To meet this need, temporary categories about the Spring Festival can be set in the classification vocabulary of the present application, for example a parent category "Spring Festival" with subcategories such as "Spring Festival travel rush", "New Year customs", "Spring Festival Gala" and "returning home" under it. After the Spring Festival, hot news about it subsides, and the temporary categories about the Spring Festival can be deleted from the classification vocabulary, thereby reducing the amount of computation in news text classification and improving classification efficiency.
Illustratively, Fig. 4 is a schematic diagram of the classification system of a classification vocabulary provided by an embodiment of the present application.
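A multi-level category system with temporary categories, as described above, could be held in a structure like the following. This is a minimal sketch: the class `CategoryTree` and its methods are hypothetical names introduced for illustration, and the category names are English renderings of the examples in the text.

```python
class CategoryTree:
    """News categories with parent and child levels (hypothetical helper)."""

    def __init__(self):
        self.children = {}      # parent category -> list of subcategories
        self.temporary = set()  # parent categories marked as temporary

    def add(self, parent, subcategories=(), temporary=False):
        self.children.setdefault(parent, []).extend(subcategories)
        if temporary:
            self.temporary.add(parent)

    def remove_temporary(self, parent):
        """Delete a temporary category and its subcategories once the topic has cooled."""
        if parent in self.temporary:
            self.temporary.discard(parent)
            del self.children[parent]

    def all_categories(self):
        # flatten parents and subcategories in insertion order
        return [c for parent, subs in self.children.items() for c in (parent, *subs)]

tree = CategoryTree()
tree.add("current politics", ["Party building", "commentary"])
tree.add("Spring Festival", ["New Year customs", "Spring Festival Gala"], temporary=True)
tree.remove_temporary("Spring Festival")  # after the holiday, drop the temporary category
```

Only categories flagged as temporary can be removed this way, so the permanent categories taken from news websites are not deleted by mistake.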
Step S112: obtaining the feature words from the known news corpus.
Feature words reflect the category tendency of a news text. For example, when feature words such as "important speech", "state visit", "inspection and research", "National People's Congress" and "State Council" appear in a news text, the category of the text is probably "current politics". When feature words such as "limit-up", "stock" and "asset restructuring" appear in a news text, the category of the text is probably "finance".
Step S113: adding the feature words to the classification vocabulary according to the news category of the known news corpus to which the feature words belong.
Illustratively, from the news corpus whose news category is "senior leadership activities", the present application obtains feature words such as: summit meeting, official visit, work symposium, leader's speech, state visit, inspection and research, and so on. These feature words are therefore added to the classification vocabulary as feature words of the "senior leadership activities" category.
In an optional embodiment, the classification vocabulary also includes regular expressions. News texts contain feature sentences that carry specific content or are expressed in a specific sentence pattern, and feature sentences reflect the category tendency of a news text. For example, when the feature sentence "the National People's Congress holds a meeting" appears in a news text, the category of the text is probably "current politics"; when the feature sentence "XXX wins first place in XXX" appears, the category of the text is probably "sports". The regular expressions of the present application generalize the sentence patterns of feature sentences. Regular expressions therefore work in news text classification as follows: when a news text of unknown category contains sentences that a regular expression can match, the text has a tendency to belong to the category of that regular expression; the more sentences the regular expression matches, the stronger this tendency.
Fig. 5 is a flow chart of step S110 of another news text classification method provided by an embodiment of the present application. As shown in Fig. 5, when the classification vocabulary includes regular expressions, step S110 may further include the following steps after step S111:
Step S114: obtaining feature sentences from the known news corpus.
Illustratively, a real-estate news text contains the feature sentence: "New home transaction volume in XX city fell 3 percent month on month this month".
Both feature words and feature sentences reflect the category tendency of a news text. Combining them when classifying news texts can improve classification accuracy. For example, the feature sentence above contains feature words such as "new home", "transaction volume" and "month-on-month decline". "New home" reflects a real-estate tendency, whereas "transaction volume" and "month-on-month decline" also appear frequently in finance news and therefore reflect a finance tendency more strongly. When "new home", "transaction volume" and "month-on-month decline" appear together in a news text to be classified, they can interfere with classification and may cause the text to be wrongly assigned to the finance category. If the news category is instead judged from the feature sentence "New home transaction volume in XX city fell 3 percent month on month this month", this interference is avoided and the accuracy of news text classification improves.
Step S115: converting the feature sentences into regular expressions.
A regular expression describes, as a character string, text that follows a specific syntactic rule. In text classification, a regular expression can be used to match, in unknown text, a whole family of passages that satisfy that rule. The present application extracts the syntactic rules of feature sentences in the form of regular expressions. During news text classification, the regular expressions are then matched against unknown news texts to find phrases or sentences with specific syntactic rules, which provide a basis for classifying the news texts.
Illustratively, a known news corpus contains the feature sentence "Leader XX presided over a symposium and delivered an important speech", where XX stands for a person's name. A regular expression that satisfies the syntactic rule of this feature sentence can then be: presides over.{0,4}meeting.{0,6}speech.
Step S116: adding the regular expressions to the classification vocabulary according to the news category of the known news corpus from which the feature sentences come.
Illustratively, from the news corpus whose news category is "senior leadership activities", the present application obtains regular expressions such as: XX leader.*attends.*meeting, presides over.{0,4}meeting.{0,6}speech, National People's Congress.{0,3}meeting.{0,6}closes, and so on. These regular expressions are therefore added to the classification vocabulary as regular expressions of the "senior leadership activities" category.
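Matching such regular expressions against a news text can be sketched with Python's `re` module. This is a minimal sketch: the English patterns are illustrative renderings of the expressions above (the bounded gaps were written for Chinese characters, so the English sample sentence is kept short), and `REGEX_VOCABULARY` and `regex_hits` are hypothetical names.

```python
import re

# category -> compiled patterns (illustrative English renderings, an assumption)
REGEX_VOCABULARY = {
    "senior leadership activities": [
        re.compile(r"leader.*attends.*meeting"),
        re.compile(r"presides over.{0,4}meeting.{0,6}speech"),
    ],
}

def regex_hits(text, regex_vocabulary):
    """Count, per category, how many patterns match somewhere in the text."""
    return {
        category: sum(1 for pattern in patterns if pattern.search(text))
        for category, patterns in regex_vocabulary.items()
    }

hits = regex_hits("Leader XX presides over a meeting; speech follows", REGEX_VOCABULARY)
```

A category with more matching patterns indicates a stronger tendency toward that category; how these counts are combined with feature word counts is left open here.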
In an optional embodiment, each feature word and regular expression may be given a category label, which indicates the news category to which the feature word or regular expression belongs.
Illustratively, feature words and regular expressions and their corresponding category labels may exist in the classification vocabulary in the following form:
As can be seen from the above classification vocabulary, it can contain three columns. The rightmost column holds the feature word or regular expression. The leftmost column is a check box: selecting one check box allows a single feature word or regular expression to be modified, and selecting check boxes in batches allows feature words or regular expressions to be modified in bulk. The middle column holds the news category of the feature word or regular expression.
The present application creates a classification vocabulary containing feature words and regular expressions from a small known news corpus, and sets multiple news categories in it according to the actual needs of news text classification. The classification vocabulary created in step S110 does not depend on a large accumulated body of news texts; it is quick to create, highly timely, and provides a basic news text classification capability.
Step S120: classifying news texts according to the classification vocabulary to obtain the hit category of each news text.
In step S120, the number of feature words in a news text and the news categories of those feature words are obtained according to the classification vocabulary, and the hit category of the news text is determined from the number of feature words and their news categories.
Fig. 6 is a flow chart of step S120 of a news text classification method provided by an embodiment of the present application.
In an optional embodiment, as shown in Fig. 6, step S120 may include the following steps:
Step S121: obtaining all feature words contained in the news text according to the classification vocabulary.
Illustratively, traversing a news text according to the classification vocabulary yields the following feature words (shown in square brackets):
2018 [postgraduate entrance exam] countdown: is the [postgraduate] title still "valuable"?
Guangming Daily: as "[postgraduate entrance exam] fever" cools, be alert that the quality of [postgraduate] education urgently needs improvement
Www.chinanews.com client, Beijing, December 13 (cold sky sun): The 2018 national [master's] [postgraduate] [enrollment] [examination] [preliminary examination] will be held from December 23 to 25. In recent years, more and more [undergraduates] have treated the [postgraduate entrance exam] as a "[graduation] outlet", and the expansion of [enrollment] has led public opinion to worry about the quality of [postgraduate] training. Why do more and more people choose the [postgraduate entrance exam]? Is the huge [postgraduate] population causing "[well-educated] [devaluation]"?
On December 7, at the new campus of Zhengzhou [University], the [postgraduate entrance exam] entered its countdown, and a reporter visited the "[postgraduate entrance exam] crowd" at [colleges and universities]. The picture shows a library at night, where [postgraduate entrance exam] [students] make up a sizeable share. (The rest is omitted.)
As can be seen, two kinds of feature words appear in the news text of the above example. One kind is education feature words, including "postgraduate entrance exam", "postgraduate", "master's", "enrollment", "examination", "preliminary examination", "undergraduate", "graduation", "well-educated" and "student". The other kind is finance feature words, including "devaluation".
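Step S121 can be sketched as a lookup of every feature word of every category in the text. This is a minimal sketch that assumes plain substring matching (reasonable for Chinese text, which has no word spaces, but an assumption nonetheless); `find_feature_words` and the toy vocabulary are hypothetical.

```python
from collections import Counter

def find_feature_words(text, vocabulary):
    """Return, per category, a Counter of the feature words found in the text."""
    found = {}
    for category, feature_words in vocabulary.items():
        counts = Counter({word: text.count(word) for word in feature_words})
        found[category] = +counts  # unary plus drops words with a zero count
    return found

vocabulary = {
    "education": {"postgraduate", "undergraduate", "enrollment"},
    "finance": {"devaluation"},
}
text = ("More undergraduate students sit the postgraduate exam as enrollment grows; "
        "some fear degree devaluation.")
found = find_feature_words(text, vocabulary)
```

The per-category counters are the raw material for the matching degree computed in step S122.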
Step S122: obtaining the matching degree between the news text and each news category according to the frequency with which the feature words of each news category occur in the news text.
In general, the more often the feature words of a news category occur in a news text, the more likely the text belongs to that category. The present application can define a way of calculating the matching degree mathematically, converting the frequency with which the feature words of a news category occur in a news text into the matching degree between the text and that category.
Fig. 7 is a flow chart of step S122 of a news text classification method provided by an embodiment of the present application. In an optional embodiment, step S122 may include the following steps:
Step S1221: parsing the news structure of the news text; the news structure includes five parts: title, lead, main body, conclusion and background.
A news text consists of five parts: title, lead, main body, conclusion and background. The title captures the essentials and states the theme of the news, so of the five parts it reflects the news category most clearly. The lead is the first paragraph or first sentence of the news; it concisely reveals the core content of the news and also reflects the news category clearly. The main body is the trunk of the news text and usually corresponds to its body content; it uses sufficient facts to express the theme and further expands and explains the lead. The background is the social and natural environment in which the news occurs. The conclusion is the closing part of the news text; it summarizes the content of the news or gives information about the reporters and editors who gathered and wrote it. The background and conclusion are sometimes implicit in the main body.
In step S1221, the present application parses the content of the news text into the five parts of title, lead, main body, conclusion and background, obtaining the content corresponding to each part from the news text.
Illustratively, parsing a news text gives the following result (feature words are shown in square brackets):
【Title】2018 [postgraduate entrance exam] countdown: is the [postgraduate] title still "valuable"?
Guangming Daily: as "[postgraduate entrance exam] fever" cools, be alert that the quality of [postgraduate] education urgently needs improvement
【Lead】Www.chinanews.com client, Beijing, December 13 (cold sky sun): The 2018 national [master's] [postgraduate] [enrollment] [examination] [preliminary examination] will be held from December 23 to 25. In recent years, more and more [undergraduates] have treated the [postgraduate entrance exam] as a "[graduation] outlet", and the expansion of [enrollment] has led public opinion to worry about the quality of [postgraduate] training. Why do more and more people choose the [postgraduate entrance exam]? Is the huge [postgraduate] population causing "[well-educated] [devaluation]"?
【Main body】On December 7, at the new campus of Zhengzhou [University], the [postgraduate entrance exam] entered its countdown, and a reporter visited the "[postgraduate entrance exam] crowd" at [colleges and universities]. The picture shows a library at night, where [postgraduate entrance exam] [students] make up a sizeable share. (The rest is omitted.)
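The text does not say how the news structure is parsed. One simple possibility, shown here purely as an assumption, is a layout heuristic: blocks are separated by blank lines, the first block is the title, the second is the lead, and everything else is treated as main body (with background and conclusion folded into it, since they may be implicit in the main body). The function name `parse_structure` is hypothetical.

```python
def parse_structure(raw_text):
    """Split a news text into title, lead and body (heuristic, assumed layout)."""
    blocks = [block.strip() for block in raw_text.split("\n\n") if block.strip()]
    return {
        "title": blocks[0] if blocks else "",
        "lead": blocks[1] if len(blocks) > 1 else "",
        "body": "\n\n".join(blocks[2:]),  # background and conclusion are folded in here
    }

parts = parse_structure(
    "Exam countdown\n\n"
    "The exam starts on December 23.\n\n"
    "Reporters visited a campus.\n\n"
    "Libraries were full."
)
```

A production parser would more likely rely on the markup of the source page, which usually tags the headline separately from the article body.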
Step S1222: obtaining the feature word weight of each part of the news text.
Feature words located in different parts of a news text contribute differently to identifying the news category. The present application therefore sets a different feature word weight for each part of the news text.
Feature words in the title contribute most clearly to identifying the news category. For example, when the title of a news item contains the feature word "financing", the item is very likely finance news; when the title of a news item contains the feature word "Chinese Super League", the item is very likely sports news. Because feature words in the title identify the news category most clearly, the feature word weight of the title can be set to the highest value, for example 10.
The Feature Words occurred in lead will be weaker than the feature occurred in title for the recognition reaction of news category
Word, but it is better than the Feature Words occurred in main body.Therefore, the characteristic value weight of lead part should be less than the Feature Words of title division
Weight, for example, being set as 2.
The characteristic value weight of other parts may be configured as 1.
Step S1223, calculate the matching degree according to the frequency with which the Feature Words of each news category occur in each part of the newsletter archive and the term weight function of each part.
In the application, the matching degree is calculated using the following formula:
P = p1 × C1 + p2 × C2 + ... + pn × Cn
where P is the matching degree between the newsletter archive and a given news category, p1 to pn are the term weight functions of the parts of the newsletter archive, and C1 to Cn are the numbers of Feature Words of that news category found in the corresponding parts.
Illustratively, for the newsletter archive shown above, the term weight function of the title part is p1 = 10, that of the lead part is p2 = 2, and the term weight functions of the main body, background, and conclusion parts are merged into p3 = 1.
The educational Feature Words occur C1 = 4 times in the title part, C2 = 13 times in the lead part, and C3 = 6 times in the main body, background, and conclusion parts. Therefore, the matching degree between the newsletter archive and the educational category is P = 10 × 4 + 2 × 13 + 1 × 6 = 72.
Similarly, the matching degree between the newsletter archive and the finance and economic category is P = 2 × 1 = 2.
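The arithmetic of the worked example can be checked with a few lines of Python. This is a minimal sketch: the dictionary keys and the function name `matching_degree` are illustrative assumptions, while the weights and counts are the ones quoted above.

```python
# Part weights from the worked example: title 10, lead 2, all other parts 1.
PART_WEIGHTS = {"title": 10, "lead": 2, "other": 1}

def matching_degree(counts, weights=PART_WEIGHTS):
    """P = p1*C1 + p2*C2 + ... + pn*Cn over the parts of one news text."""
    return sum(weights[part] * count for part, count in counts.items())

# Feature Word counts per part, as given in the example.
education_counts = {"title": 4, "lead": 13, "other": 6}
finance_counts = {"lead": 1}

print(matching_degree(education_counts))  # 72
print(matching_degree(finance_counts))    # 2
```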
Step S123, take the news category corresponding to the highest matching degree as the hit classification.
Illustratively, the newsletter archive shown above has the highest matching degree with the educational category; therefore, educational is the hit classification of that newsletter archive.
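Putting the counting and the selection together, a hedged sketch of the classification step might look as follows. The thesaurus contents, token lists, and the name `classify` are hypothetical; only the 10/2/1 weights and the "highest matching degree wins" rule come from the text.

```python
def classify(parts_tokens, thesaurus, weights):
    """Return (hit category, all scores) for one news text.

    parts_tokens: part name -> list of tokens found in that part
    thesaurus:    category  -> set of Feature Words
    weights:      part name -> weight of that part
    """
    scores = {}
    for category, words in thesaurus.items():
        scores[category] = sum(
            weights[part] * sum(1 for token in tokens if token in words)
            for part, tokens in parts_tokens.items()
        )
    # max() keeps the first category on a tie; a real system may need a tie rule.
    hit = max(scores, key=scores.get)
    return hit, scores

thesaurus = {
    "education": {"postgraduate", "exam", "university"},
    "finance": {"financing", "devaluation"},
}
parts_tokens = {
    "title": ["exam", "countdown", "postgraduate"],
    "lead": ["exam", "devaluation"],
    "main": ["university", "library"],
}
weights = {"title": 10, "lead": 2, "main": 1}

hit, scores = classify(parts_tokens, thesaurus, weights)
print(hit, scores)
```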
The application classifies the newsletter archive according to the classed thesaurus. Because Feature Words in different parts of the newsletter archive differ in how strongly they identify the news category, a different term weight function is set for each part. The matching degree is then calculated from the frequency with which the Feature Words of each category occur in each part and from the term weight function of each part, so that the matching degree accurately reflects the correlation between the newsletter archive and the news category. Finally, the news category corresponding to the highest matching degree is taken as the hit classification. Classifying newsletter archives according to the classed thesaurus in this way can improve the accuracy of newsletter archive classification.
Step S130, segment the newsletter archive, and obtain the degree of correlation between each participle of the newsletter archive and the hit classification.
A newsletter archive may contain words that are not included in the classed thesaurus but that can still help identify the classification of the newsletter archive. These words should be found in the newsletter archive and added to the classed thesaurus, so as to enrich the vocabulary of the classed thesaurus and improve its classification accuracy. To this end, in step S130 the application first segments the newsletter archive and obtains the degree of correlation between each participle of the newsletter archive and the hit classification.
Fig. 8 is a flow chart of step S130 of a newsletter archive sorting technique provided by the embodiments of the present application. In an optional embodiment, step S130 may comprise the following steps:
Step S131, perform cutting word processing on the newsletter archive according to a preset cutting word rule to obtain the participles of the newsletter archive.
In the application, a Chinese word segmentation method based on machine learning can be used to perform cutting word processing on the newsletter archive.
Illustratively, a cutting word result for the newsletter archive is shown below:
12/month/7/day/,/Zhengzhou/university/new/school district/,/prepare for the postgraduate qualifying examination/entrance/countdown/,/reporter/visit/colleges and universities/"/prepare for the postgraduate qualifying examination race/"/./figure/be/night//library/inner/,/prepare for the postgraduate qualifying examination/student/account for//quite big/ratio/.
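The application names a machine-learning-based segmenter but does not specify one. As a much simpler stand-in that produces the same kind of slash-separated output, the sketch below uses dictionary-based forward maximum matching, which is a different technique from the one in the application. The vocabulary and the Chinese sentence are hypothetical (the sentence is an assumed back-translation of the example).

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

vocab = {"郑州", "大学", "校区", "考研", "进入", "倒计时"}
tokens = forward_max_match("郑州大学新校区考研进入倒计时", vocab)
print("/".join(tokens))
```

Characters that are not covered by any vocabulary entry (here "新") come out as single-character tokens, which is the usual fallback for this method.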
Step S132, remove the stop words contained in the participles of the newsletter archive.
In information retrieval, to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after natural language data (or text) is processed. These characters or words are called stop words (or out-of-collection words). Any kind of word can be selected as a stop word; which words are used as stop words needs to be determined according to the given purpose. In the application, stop words can be English characters, numbers, mathematical characters, punctuation marks, and Chinese characters or words that are used extremely frequently but carry no concrete meaning (such as:), etc.
The application can create a stop word list from preset stop words, then search the participles of the newsletter archive against the stop word list and remove the stop words found.
Illustratively, the result obtained after removing stop words from the cutting word result shown above is:
Zhengzhou/university/new/school district/prepare for the postgraduate qualifying examination/entrance/countdown/reporter/visit/colleges and universities/prepare for the postgraduate qualifying examination race/night/library/prepare for the postgraduate qualifying examination/student/ratio/
Removing stop words reduces the number of participles, which reduces the amount of calculation when the degree of correlation is computed and improves efficiency. In addition, it should be noted that in step S132, besides removing stop words, Feature Words that already exist in the classed thesaurus can also be removed from the participles of the newsletter archive, to further reduce the amount of calculation when the degree of correlation is computed and improve efficiency.
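A minimal sketch of this filtering step is given below. The stop word list, the token sample, and the function name `remove_stop_words` are assumptions made for illustration. Because the tokens here are English translations, the rule about English characters being stop words is not applied; only listed stop words, pure numbers and punctuation, and already-known Feature Words are dropped.

```python
def remove_stop_words(tokens, stop_words, known_feature_words=frozenset()):
    """Drop stop words, number/punctuation-only tokens, and known Feature Words."""
    kept = []
    for token in tokens:
        token = token.strip()
        if not token or token in stop_words or token in known_feature_words:
            continue
        # Tokens without any letter are numbers, punctuation, or symbols.
        if not any(ch.isalpha() for ch in token):
            continue
        kept.append(token)
    return kept

segmented = "12/month/7/day/,/Zhengzhou/university/new/school district/,/reporter/visit".split("/")
stop_words = {"month", "day"}  # hypothetical stop word list

print(remove_stop_words(segmented, stop_words))
print(remove_stop_words(segmented, stop_words, known_feature_words={"university"}))
```

The second call shows the optional extra filter described above: words that are already Feature Words in the classed thesaurus are skipped so that no degree of correlation is computed for them.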
Step S133, calculate the TF-IDF value of each participle of the newsletter archive relative to the hit classification, and take the TF-IDF value as the degree of correlation.
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF means term frequency, and IDF means inverse document frequency. TF-IDF is a statistical method for assessing how important a word is to one document in a corpus. The weight of a word increases in proportion to the number of times it occurs in the document, but decreases as the frequency with which it occurs across the corpus increases.
In the application, TF refers to the number of times a participle appears in newsletter archives of the hit classification. IDF refers to the reciprocal of the number of times the participle appears in the newsletter archives of all news categories.
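Read literally, these definitions make the degree of correlation the share of a word's occurrences that fall in the hit classification, a value between 0 and 1. The sketch below follows that literal reading, which is an interpretation of the translated text and differs from the usual log-scaled IDF; the counts and the name `relevance` are hypothetical.

```python
from collections import Counter

def relevance(token, hit_category, category_counts):
    """TF * IDF with TF = count in the hit category, IDF = 1 / count in all categories."""
    tf = category_counts[hit_category][token]
    total = sum(counter[token] for counter in category_counts.values())
    return tf / total if total else 0.0

# Hypothetical occurrence counts of each word in the news texts of each category.
category_counts = {
    "education": Counter({"library": 9, "student": 6}),
    "finance": Counter({"library": 1, "student": 6}),
}

print(relevance("library", "education", category_counts))
print(relevance("student", "education", category_counts))
```

A word that occurs almost only in the hit classification scores close to 1, while a word spread evenly over all categories scores low, which matches the intent of using the value to pick category-specific words.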
Step S140, select Feature Words from the participles of the newsletter archive according to the degree of correlation, and add the selected Feature Words to the classed thesaurus.
Fig. 9 is a flow chart of step S140 of a newsletter archive sorting technique provided by the embodiments of the present application. In an optional embodiment, step S140 comprises the following steps:
Step S141, sort the participles of the newsletter archive according to the degree of correlation.
Illustratively, according to the degree of correlation, the participle ranking result for the newsletter archive shown above is:
Step S142, according to the participle ranking result, choose the participles whose degree of correlation is higher than a first preset value as the Feature Words.
The application can set a first preset value for selecting Feature Words; the first preset value can be a reasonable value obtained from experience or through repeated calculation. Illustratively, the first preset value in the application is 0.75, so that "school district" and "library" are chosen as Feature Words.
Step S143, add the Feature Words to the classed thesaurus.
Illustratively, the Feature Words "school district" and "library" are added to the classed thesaurus as educational Feature Words. In subsequent newsletter archive classification, the classed thesaurus is then able to find the Feature Words "school district" and "library" in a newsletter archive, so that the classification capacity of the classed thesaurus for newsletter archives is improved.
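Steps S141 to S143 can be sketched as a sort, a threshold filter, and a set update. The degree-of-correlation values below are invented for illustration (the application's ranking table is not reproduced here); only the threshold 0.75 and the two selected words come from the text, and `update_thesaurus` is a hypothetical name.

```python
def update_thesaurus(thesaurus, hit_category, relevance_by_token, first_preset_value=0.75):
    """Rank tokens by relevance, keep those above the threshold, add them to the category."""
    ranked = sorted(relevance_by_token.items(), key=lambda item: item[1], reverse=True)
    selected = [token for token, score in ranked if score > first_preset_value]
    thesaurus.setdefault(hit_category, set()).update(selected)
    return selected

thesaurus = {"education": {"postgraduate", "university"}}
# Hypothetical degree-of-correlation values for four tokens.
scores = {"night": 0.40, "library": 0.80, "reporter": 0.20, "school district": 0.90}

selected = update_thesaurus(thesaurus, "education", scores)
print(selected)
print(sorted(thesaurus["education"]))
```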
The application sorts the participles of the newsletter archive according to the degree of correlation, sets a first preset value for selecting Feature Words, and, by determining a reasonable value of the first preset value, selects Feature Words from the participle ranking result. The number of selected Feature Words and the selection threshold can be adjusted by changing the first preset value, thereby influencing the precision with which the classed thesaurus classifies newsletter archives.
Step S150, repeat steps S120-S140 until the accuracy rate with which the classed thesaurus classifies newsletter archives meets a preset termination condition.
During newsletter archive classification, the application can repeat steps S120-S140 continually using different newsletter archives, so that the classed thesaurus continually accumulates and improves its Feature Words in use. The Feature Words in the classed thesaurus can therefore follow the content change trend of newsletter archives in time, and the classification capacity of the classed thesaurus for emerging newsletter archives is maintained and continuously improved. Therefore, the method provided by the application, which updates the classed thesaurus while newsletter archives are being classified, can improve the accuracy of newsletter archive classification.
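The control flow of step S150 is a loop with an accuracy check. The application does not say how accuracy is measured or what the termination condition is, so the sketch below takes both as caller-supplied functions; `refine_thesaurus`, the toy callbacks, and the 0.75 target are all assumptions for illustration.

```python
def refine_thesaurus(thesaurus, batches, run_steps, accuracy_of, target_accuracy):
    """Repeat the update steps batch by batch until the accuracy target is met.

    Returns the number of rounds used, or None if the batches ran out first.
    """
    for round_number, batch in enumerate(batches, start=1):
        run_steps(thesaurus, batch)  # stands in for steps S120-S140
        if accuracy_of(thesaurus) >= target_accuracy:
            return round_number
    return None

# Toy stand-ins so the sketch runs; a real system would classify, score, and select here.
def toy_steps(thesaurus, batch):
    thesaurus["education"].update(batch)

def toy_accuracy(thesaurus):
    return len(thesaurus["education"]) / 4

thesaurus = {"education": {"exam"}}
batches = [["campus"], ["library"], ["tuition"]]
rounds = refine_thesaurus(thesaurus, batches, toy_steps, toy_accuracy, target_accuracy=0.75)
print(rounds, sorted(thesaurus["education"]))
```

Returning `None` when the batches are exhausted makes it explicit that the termination condition was never reached, rather than silently stopping.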
In an optional embodiment, the Feature Words include positive Feature Words and opposite Feature Words; the term weight function of a positive Feature Word is a positive value, and the term weight function of an opposite Feature Word is a negative value.
Because the term weight function of a positive Feature Word is positive, a positive Feature Word appearing in a newsletter archive indicates that the newsletter archive tends to be classified into the news category corresponding to that positive Feature Word. Because the term weight function of an opposite Feature Word is negative, an opposite Feature Word appearing in a newsletter archive indicates that the newsletter archive should not be classified into the news category corresponding to that opposite Feature Word.
Positive and opposite Feature Words are particularly useful when newsletter archives are classified at a fine granularity. For example, an actual requirement may call for newsletter archives of the sport category to be further classified into subclasses such as football, basketball, tennis, table tennis, and diving. In that case, under the "football" subclass, Feature Words related to basketball such as "block", "three-point ball", and "hack", Feature Words related to tennis such as "breaking" and "deciding game", and Feature Words related to table tennis and diving can be used as opposite Feature Words of the "football" subclass. These opposite Feature Words can be given negative term weight functions with large absolute values, so as to reduce the matching degree between newsletter archives containing these opposite Feature Words and the "football" subclass.
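The effect of negative weights can be illustrated with a per-word weight table. This is a sketch under assumptions: the positive words "goal" and "offside", the weight values, and the function name are hypothetical, and part weights are left out to keep the example short.

```python
# Hypothetical weights for the "football" subclass: positive words add, opposite words subtract.
FOOTBALL_WEIGHTS = {
    "goal": 1, "offside": 1,            # positive Feature Words (assumed)
    "block": -5, "three-point ball": -5,  # basketball words as opposite Feature Words
    "deciding game": -5,                # tennis word as opposite Feature Word
}

def subclass_matching_degree(tokens, word_weights):
    """Sum the signed weight of every token; unknown tokens contribute 0."""
    return sum(word_weights.get(token, 0) for token in tokens)

football_story = ["goal", "offside", "goal"]
basketball_story = ["goal", "block", "three-point ball"]

print(subclass_matching_degree(football_story, FOOTBALL_WEIGHTS))
print(subclass_matching_degree(basketball_story, FOOTBALL_WEIGHTS))
```

Even though the basketball text shares the word "goal" with the football vocabulary, the large negative weights push its matching degree for the "football" subclass below zero.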
As can be seen from the above technical scheme, the embodiment of the present application provides a newsletter archive sorting technique. First, a classed thesaurus is created according to a known news corpus. Then, the newsletter archive is classified according to the classed thesaurus to obtain the hit classification of the newsletter archive. Next, the newsletter archive is segmented, and the degree of correlation between each participle of the newsletter archive and the hit classification is obtained. Finally, Feature Words are selected from the participles of the newsletter archive according to the degree of correlation, and the selected Feature Words are added to the classed thesaurus. As newsletter archive classification proceeds, the application continually updates the classed thesaurus during the classification process, so that the classed thesaurus continually accumulates and improves its Feature Words in use, follows the content change trend of newsletter archives in time, and maintains and continuously improves its classification capacity for emerging newsletter archives, thereby improving the accuracy of newsletter archive classification.
Embodiment two
The embodiment of the present application provides a newsletter archive sorter. Figure 10 is a structure diagram of a newsletter archive sorter provided by the embodiments of the present application. As shown in Figure 10, the device includes:
Creating unit 210, for creating a classed thesaurus according to a known news corpus, wherein the classed thesaurus presets multiple news categories and each news category includes at least one Feature Word;
Classification unit 220, for classifying a newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive;
Computing unit 230, for segmenting the newsletter archive and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification;
Word selection unit 240, for selecting Feature Words from the participles of the newsletter archive according to the degree of correlation and adding the selected Feature Words to the classed thesaurus.
As can be seen from the above technical scheme, the embodiment of the present application provides a newsletter archive sorter. The device creates a classed thesaurus according to a known news corpus; then classifies the newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive; then segments the newsletter archive and obtains the degree of correlation between each participle of the newsletter archive and the hit classification; and finally selects Feature Words from the participles of the newsletter archive according to the degree of correlation and adds the selected Feature Words to the classed thesaurus. As newsletter archive classification proceeds, the application continually updates the classed thesaurus during the classification process, so that the classed thesaurus continually accumulates and improves its Feature Words in use, follows the content change trend of newsletter archives in time, and maintains and continuously improves its classification capacity for emerging newsletter archives, thereby improving the accuracy of newsletter archive classification.
Embodiment three
The embodiment of the present application provides a server. Figure 11 is a structure diagram of a server provided by the embodiments of the present application. As shown in Figure 11, the server includes:
Processor 310 and memory 320;
The memory 320 is used to store the classed thesaurus and the program executable by the processor 310;
The processor 310 is configured to execute the following program steps:
S110, create a classed thesaurus according to a known news corpus, wherein the classed thesaurus presets multiple news categories and each news category includes at least one Feature Word;
S120, classify a newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive;
S130, segment the newsletter archive, and obtain the degree of correlation between each participle of the newsletter archive and the hit classification;
S140, select Feature Words from the participles of the newsletter archive according to the degree of correlation, and add the selected Feature Words to the classed thesaurus;
S150, repeat steps S120-S140 until the accuracy rate with which the classed thesaurus classifies newsletter archives meets a preset termination condition.
As can be seen from the above technical scheme, the embodiment of the present application provides a server. The server creates a classed thesaurus according to a known news corpus; then classifies the newsletter archive according to the classed thesaurus to obtain the hit classification of the newsletter archive; then segments the newsletter archive and obtains the degree of correlation between each participle of the newsletter archive and the hit classification; and finally selects Feature Words from the participles of the newsletter archive according to the degree of correlation and adds the selected Feature Words to the classed thesaurus. As newsletter archive classification proceeds, the application continually updates the classed thesaurus during the classification process, so that the classed thesaurus continually accumulates and improves its Feature Words in use, follows the content change trend of newsletter archives in time, and maintains and continuously improves its classification capacity for emerging newsletter archives, thereby improving the accuracy of newsletter archive classification.
The application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The application can also be practiced in distributed computing environments, in which tasks are executed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
It should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device.
Those skilled in the art will readily conceive of other embodiments of the application after considering the specification and practicing the application disclosed herein. The application is intended to cover any variations, uses, or adaptations of the application that follow its general principles and include common knowledge or conventional techniques in the art that are not disclosed in the application. The description and examples are to be considered illustrative only, and the true scope and spirit of the application are indicated by the following claims.
It should be understood that the application is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.
Claims (10)
1. A newsletter archive sorting technique, characterized by comprising:
S110, creating a classed thesaurus according to a known news corpus, wherein the classed thesaurus is provided with multiple news categories and each news category includes at least one Feature Word;
S120, classifying a newsletter archive according to the classed thesaurus to obtain a hit classification of the newsletter archive;
S130, segmenting the newsletter archive, and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification;
S140, selecting Feature Words from the participles of the newsletter archive according to the degree of correlation, and adding the selected Feature Words to the classed thesaurus;
S150, repeating steps S120-S140 until the accuracy rate with which the classed thesaurus classifies newsletter archives meets a preset termination condition.
2. The method according to claim 1, characterized in that the step of creating a classed thesaurus according to a known news corpus, wherein the classed thesaurus is provided with multiple news categories and each news category includes at least one Feature Word, comprises:
setting the news categories of the classed thesaurus;
obtaining the Feature Words from the known news corpus;
adding the Feature Words to the classed thesaurus according to the news category of the known news corpus to which the Feature Words belong.
3. The method according to claim 2, characterized in that the classed thesaurus also includes regular expressions, and after the step of setting the news categories of the classed thesaurus, the method further comprises:
obtaining characteristic sentences from the known news corpus;
converting the characteristic sentences into regular expressions;
adding the regular expressions to the classed thesaurus according to the news category of the known news corpus from which the characteristic sentences originate.
4. The method according to claim 1, characterized in that the step of classifying a newsletter archive according to the classed thesaurus to obtain a hit classification of the newsletter archive comprises:
obtaining, according to the classed thesaurus, all Feature Words contained in the newsletter archive;
obtaining the matching degree between the newsletter archive and each news category according to the frequency with which the Feature Words of each news category occur in the newsletter archive;
taking the news category corresponding to the highest matching degree as the hit classification.
5. The method according to claim 4, characterized in that the step of obtaining the matching degree between the newsletter archive and each news category according to the frequency with which the Feature Words of each news category occur in the newsletter archive comprises:
parsing the structure of a news story of the newsletter archive, wherein the structure of a news story includes five parts: title, lead, main body, conclusion, and background;
obtaining the term weight function of each part of the newsletter archive;
calculating the matching degree according to the frequency with which the Feature Words of each news category occur in each part of the newsletter archive and the term weight function of each part;
wherein the matching degree is calculated using the following formula:
P = p1 × C1 + p2 × C2 + ... + pn × Cn
where P is the matching degree between the newsletter archive and a given news category, p1 to pn are the term weight functions of the parts of the newsletter archive, and C1 to Cn are the numbers of Feature Words of that news category in the corresponding parts of the newsletter archive.
6. The method according to claim 1, characterized in that the step of segmenting the newsletter archive and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification comprises:
performing cutting word processing on the newsletter archive according to a preset cutting word rule to obtain the participles of the newsletter archive;
removing the stop words contained in the participles of the newsletter archive;
calculating the TF-IDF value of each participle of the newsletter archive relative to the hit classification, and taking the TF-IDF value as the degree of correlation.
7. The method according to claim 1, characterized in that the step of selecting Feature Words from the participles of the newsletter archive according to the degree of correlation and adding the selected Feature Words to the classed thesaurus comprises:
sorting the participles of the newsletter archive according to the degree of correlation;
choosing, according to the participle ranking result, the participles whose degree of correlation is higher than a first preset value as the Feature Words;
adding the Feature Words to the classed thesaurus.
8. The method according to claim 5, characterized in that
the Feature Words include positive Feature Words and opposite Feature Words; the term weight function of the positive Feature Words is a positive value, and the term weight function of the opposite Feature Words is a negative value.
9. A newsletter archive sorter, characterized by comprising:
a creating unit, for creating a classed thesaurus according to a known news corpus, wherein the classed thesaurus is provided with multiple news categories and each news category includes at least one Feature Word;
a classification unit, for classifying a newsletter archive according to the classed thesaurus to obtain a hit classification of the newsletter archive;
a computing unit, for segmenting the newsletter archive and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification;
a word selection unit, for selecting Feature Words from the participles of the newsletter archive according to the degree of correlation and adding the selected Feature Words to the classed thesaurus.
10. A server, characterized by comprising:
a processor and a memory;
the memory is used to store a classed thesaurus and a program executable by the processor;
the processor is configured to execute the following program steps:
S110, creating a classed thesaurus according to a known news corpus, wherein the classed thesaurus is provided with multiple news categories and each news category includes at least one Feature Word;
S120, classifying a newsletter archive according to the classed thesaurus to obtain a hit classification of the newsletter archive;
S130, segmenting the newsletter archive, and obtaining the degree of correlation between each participle of the newsletter archive and the hit classification;
S140, selecting Feature Words from the participles of the newsletter archive according to the degree of correlation, and adding the selected Feature Words to the classed thesaurus;
S150, repeating steps S120-S140 until the accuracy rate with which the classed thesaurus classifies newsletter archives meets a preset termination condition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810116106.6A CN108334610A (en) | 2018-02-06 | 2018-02-06 | A kind of newsletter archive sorting technique, device and server |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108334610A true CN108334610A (en) | 2018-07-27 |
Family
ID=62928268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810116106.6A Pending CN108334610A (en) | 2018-02-06 | 2018-02-06 | A kind of newsletter archive sorting technique, device and server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108334610A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109657137A (en) * | 2018-11-26 | 2019-04-19 | 平安科技(深圳)有限公司 | Public sentiment news category model building method, device, computer equipment and storage medium |
CN109684472A (en) * | 2018-12-20 | 2019-04-26 | 深圳价值在线信息科技股份有限公司 | A kind of trade classification method and system of security information |
CN110209329A (en) * | 2019-05-23 | 2019-09-06 | 厦门美柚信息科技有限公司 | Show method, apparatus, equipment and the storage medium of content of pages |
CN110888978A (en) * | 2018-09-06 | 2020-03-17 | 北京京东金融科技控股有限公司 | Article clustering method and device, electronic equipment and storage medium |
CN110941718A (en) * | 2019-11-27 | 2020-03-31 | 广州快决测信息科技有限公司 | Method and system for automatically identifying text category through text content |
CN111324735A (en) * | 2020-02-20 | 2020-06-23 | 湖南芒果听见科技有限公司 | Method and terminal for automatically classifying hourly essentials |
CN111506727A (en) * | 2020-04-16 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Text content category acquisition method and device, computer equipment and storage medium |
CN111753197A (en) * | 2020-06-18 | 2020-10-09 | 达而观信息科技(上海)有限公司 | News element extraction method and device, computer equipment and storage medium |
CN111782601A (en) * | 2020-06-08 | 2020-10-16 | 北京海泰方圆科技股份有限公司 | Electronic file processing method and device, electronic equipment and machine readable medium |
CN112114728A (en) * | 2020-09-18 | 2020-12-22 | 北京搜狗科技发展有限公司 | Input method and device and electronic equipment |
CN113239197A (en) * | 2021-05-12 | 2021-08-10 | 首都师范大学 | Method, device and computer storage medium for classifying sentences based on TF-IDF algorithm |
CN113505228A (en) * | 2021-07-22 | 2021-10-15 | 上海弘玑信息技术有限公司 | Multi-dimensional text data classification method, training method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104008126A (en) * | 2014-03-31 | 2014-08-27 | 北京奇虎科技有限公司 | Method and device for segmentation on basis of webpage content classification |
CN104035968A (en) * | 2014-05-20 | 2014-09-10 | 微梦创科网络科技(中国)有限公司 | Method and device for constructing training corpus set based on social network |
CN104361010A (en) * | 2014-10-11 | 2015-02-18 | 北京中搜网络技术股份有限公司 | Automatic classification method for correcting news classification |
CN104899215A (en) * | 2014-03-06 | 2015-09-09 | 北京搜狗科技发展有限公司 | Data processing method, recommendation source information organization, information recommendation method and information recommendation device |
KR20170034206A (en) * | 2015-09-18 | 2017-03-28 | 아주대학교산학협력단 | Apparatus and Method for Topic Category Classification of Social Media Text based on Cross-Media Analysis |
CN107045524A (en) * | 2016-12-30 | 2017-08-15 | 中央民族大学 | A kind of method and system of network text public sentiment classification |
2018-02-06: CN application CN201810116106.6A (CN108334610A), status Pending
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110888978A (en) * | 2018-09-06 | 2020-03-17 | 北京京东金融科技控股有限公司 | Article clustering method and device, electronic equipment and storage medium |
CN109657137B (en) * | 2018-11-26 | 2024-05-31 | 平安科技(深圳)有限公司 | Public opinion news classification model construction method, device, computer equipment and storage medium |
CN109657137A (en) * | 2018-11-26 | 2019-04-19 | 平安科技(深圳)有限公司 | Public sentiment news category model building method, device, computer equipment and storage medium |
CN109684472A (en) * | 2018-12-20 | 2019-04-26 | 深圳价值在线信息科技股份有限公司 | A kind of trade classification method and system of security information |
CN110209329A (en) * | 2019-05-23 | 2019-09-06 | 厦门美柚信息科技有限公司 | Show method, apparatus, equipment and the storage medium of content of pages |
CN110941718A (en) * | 2019-11-27 | 2020-03-31 | 广州快决测信息科技有限公司 | Method and system for automatically identifying text category through text content |
CN111324735A (en) * | 2020-02-20 | 2020-06-23 | 湖南芒果听见科技有限公司 | Method and terminal for automatically classifying hourly essentials |
CN111506727B (en) * | 2020-04-16 | 2023-10-03 | 腾讯科技(深圳)有限公司 | Text content category acquisition method, apparatus, computer device and storage medium |
CN111506727A (en) * | 2020-04-16 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Text content category acquisition method and device, computer equipment and storage medium |
CN111782601A (en) * | 2020-06-08 | 2020-10-16 | 北京海泰方圆科技股份有限公司 | Electronic file processing method and device, electronic equipment and machine readable medium |
CN111753197B (en) * | 2020-06-18 | 2024-04-05 | 达观数据有限公司 | News element extraction method, device, computer equipment and storage medium |
CN111753197A (en) * | 2020-06-18 | 2020-10-09 | 达而观信息科技(上海)有限公司 | News element extraction method and device, computer equipment and storage medium |
CN112114728B (en) * | 2020-09-18 | 2022-02-15 | 北京搜狗科技发展有限公司 | Input method and device and electronic equipment |
CN112114728A (en) * | 2020-09-18 | 2020-12-22 | 北京搜狗科技发展有限公司 | Input method and device and electronic equipment |
CN113239197A (en) * | 2021-05-12 | 2021-08-10 | 首都师范大学 | Method, device and computer storage medium for classifying sentences based on TF-IDF algorithm |
CN113505228A (en) * | 2021-07-22 | 2021-10-15 | 上海弘玑信息技术有限公司 | Multi-dimensional text data classification method, training method and device |
Similar Documents
Publication | Title |
---|---|
CN108334610A (en) | A kind of newsletter archive sorting technique, device and server |
CN106649818B (en) | Application search intention identification method and device, application search method and server |
CN108090048B (en) | College evaluation system based on multivariate data analysis |
CN103399891B (en) | Method for automatic recommendation of network content, device and system |
US8250067B2 (en) | Adding dominant media elements to search results |
CN105095187A (en) | Search intention identification method and device |
EP2192500A2 (en) | System and method for providing robust topic identification in social indexes |
CN106339502A (en) | Modeling recommendation method based on user behavior data fragmentation cluster |
CN103744981A (en) | System for automatic classification analysis for website based on website content |
CN105930411A (en) | Classifier training method, classifier and sentiment classification system |
Shimada et al. | Analyzing tourism information on twitter for a local city |
CN104915446A (en) | Automatic extracting method and system of event evolving relationship based on news |
CN107169086B (en) | Text classification method |
CN110134792B (en) | Text recognition method and device, electronic equipment and storage medium |
CN107958014B (en) | Search engine |
Hayes | Using tags and clustering to identify topic-relevant blogs |
CN106126605B (en) | Short text classification method based on user portrait |
CN112214991B (en) | Microblog text standing detection method based on multi-feature fusion weighting |
CN105653701A (en) | Model generating method and device as well as word weighting method and device |
Noel et al. | Applicability of Latent Dirichlet Allocation to multi-disk search |
CN109815401A (en) | A kind of name disambiguation method applied to Web people search |
CN110866102A (en) | Search processing method |
CN114330329A (en) | Service content searching method and device, electronic equipment and storage medium |
Li et al. | Improving relevance judgment of web search results with image excerpts |
CN112579729A (en) | Training method and device for document quality evaluation model, electronic equipment and medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 2019-09-06; Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing; Applicant after: China Science and Technology (Beijing) Co., Ltd.; Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601; Applicant before: Beijing Shenzhou Taiyue Software Co., Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2018-07-27 |