CN109740152A - Method, apparatus, storage medium, and computer device for determining a text category - Google Patents
Method, apparatus, storage medium, and computer device for determining a text category
- Publication number: CN109740152A
- Application number: CN201811592736.7A
- Authority: CN (China)
- Prior art keywords: category, candidate, keyword, text, correlation
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application relates to a method, an apparatus, a computer-readable storage medium, and a computer device for determining a text category. The method comprises: extracting keywords of a text to be processed, and determining a weight of each keyword; obtaining semantic description information corresponding to each keyword; determining, according to each piece of semantic description information, a first relevance between each keyword and each candidate category; determining, according to the weight of each keyword and each first relevance, a second relevance between the text to be processed and each candidate category; and determining, according to each second relevance, the category to which the text to be processed belongs from among the candidate categories. The solution provided by this application saves labor costs and removes the dependence of the quality of the determined category on the quality of manual labeling.
Description
Technical field
This application relates to the field of computer technology, and in particular to a method, an apparatus, a computer-readable storage medium, and a computer device for determining a text category.
Background
Text category labeling refers to labeling a text with one or more categories in a category system. It is widely used in many business scenarios such as advertising, recommendation, and search. Determining the category to which a text belongs is an important step in text category labeling.
In the conventional approach to determining a text category, the categories of a number of texts are first labeled manually to obtain training samples; a machine learning model such as a neural network is then trained on the training samples to obtain a mapping model; and the text to be processed is input into the mapping model, which determines its category. However, obtaining training samples by manual labeling consumes a large amount of labor. Moreover, because the mapping model is trained on manually labeled training samples, the quality of the category determined for the text to be processed depends heavily on the quality of the manual labeling.
Summary of the invention
In view of this, to address the technical problems that the conventional approach consumes a large amount of labor and that the quality of the category determined for a text to be processed depends heavily on the quality of manual labeling, it is necessary to provide a method, an apparatus, a computer-readable storage medium, and a computer device for determining a text category.
A method for determining a text category, comprising:
extracting keywords of a text to be processed, and determining a weight of each keyword;
obtaining semantic description information corresponding to each keyword;
determining, according to each piece of semantic description information, a first relevance between each keyword and each candidate category;
determining, according to the weight of each keyword and each first relevance, a second relevance between the text to be processed and each candidate category; and
determining, according to each second relevance, the category to which the text to be processed belongs from among the candidate categories.
An apparatus for determining a text category, comprising:
a keyword processing module, configured to extract keywords of a text to be processed and determine a weight of each keyword;
a semantic description information obtaining module, configured to obtain semantic description information corresponding to each keyword;
a first relevance determining module, configured to determine, according to each piece of semantic description information, a first relevance between each keyword and each candidate category;
a second relevance determining module, configured to determine, according to the weight of each keyword and each first relevance, a second relevance between the text to be processed and each candidate category; and
a text category determining module, configured to determine, according to each second relevance, the category to which the text to be processed belongs from among the candidate categories.
A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of the method for determining a text category described above.
A computer device, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method for determining a text category described above.
With the above method, apparatus, computer-readable storage medium, and computer device for determining a text category, keywords of the text to be processed are extracted and a weight of each keyword is obtained; semantic description information corresponding to each keyword is then obtained; a first relevance between each keyword and each candidate category is determined according to each piece of semantic description information; a second relevance between the text to be processed and each candidate category is determined according to the weight of each keyword and each first relevance; and the category to which the text to be processed belongs is determined from among the candidate categories according to each second relevance. In this way, even when no text with a known category is available, the computer device can determine the category of any text fully automatically. This eliminates the manual category-labeling step, saves labor costs, and removes the dependence of the quality of the determined category on the quality of manual labeling.
Brief description of the drawings
Fig. 1 is a diagram of the application environment of the method for determining a text category in one embodiment;
Fig. 2 is a flowchart of the method for determining a text category in one embodiment;
Fig. 3 is a schematic diagram of the process of determining the first relevance between keywords and candidate categories in one embodiment;
Fig. 4 is a schematic diagram of the process of determining the second relevance between a text and candidate categories in one embodiment;
Fig. 5 is a schematic diagram of an interface for displaying and querying the category labeling results of texts in one embodiment;
Fig. 6 is a flowchart of a manner of determining the first proportion threshold in one embodiment;
Fig. 7 is a schematic diagram of the process of determining the number of remaining words while determining the first proportion threshold in one embodiment;
Fig. 8 is a schematic diagram of the process of determining the first relevance between keywords and candidate categories in one embodiment;
Fig. 9 is a schematic diagram of the process of determining the first relevance between keywords and candidate categories in one embodiment;
Fig. 10 is a schematic diagram of an interface for manually entering association knowledge in one embodiment;
Fig. 11 is a schematic diagram of an interface for manually entering category priority information in one embodiment;
Fig. 12 is a flowchart of the method for determining a text category in one embodiment;
Fig. 13 is a structural block diagram of the apparatus for determining a text category in one embodiment;
Fig. 14 is a structural block diagram of a computer device in one embodiment;
Fig. 15 is a structural block diagram of a computer device in one embodiment.
Detailed description
To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the application and are not intended to limit it.
It should be noted that terms such as "first" and "second" used in this application distinguish similar objects by name, but the objects themselves are not limited by these terms. It should be understood that these terms may be interchanged where appropriate without departing from the scope of this application. For example, a "first segmented word" could be described as a "second segmented word", and similarly a "second segmented word" could be described as a "first segmented word".
The method for determining a text category provided by the embodiments of this application can be applied in the application environment shown in Fig. 1. The application environment may involve a terminal 110 and a server 120, which are connected through a network.
Specifically, the terminal 110 obtains a text to be processed and sends it to the server 120. The server 120 extracts keywords of the text to be processed and obtains a weight of each keyword, then obtains semantic description information corresponding to each keyword, determines a first relevance between each keyword and each candidate category according to each piece of semantic description information, determines a second relevance between the text to be processed and each candidate category according to the weight of each keyword and each first relevance, and determines the category to which the text to be processed belongs from among the candidate categories according to each second relevance.
In other application environments, only the server 120 may be involved, without the terminal 110. In that case, the server 120 independently performs the whole series of steps from obtaining the text to be processed to determining its category from among the candidate categories. Alternatively, only the terminal 110 may be involved, without the server 120, in which case the terminal 110 independently performs the same series of steps.
The terminal 110 may include at least one of a mobile phone, a tablet computer, a laptop computer, a desktop computer, a personal digital assistant, a wearable device, and the like, but is not limited to these. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in Fig. 2, a method for determining a text category is provided. The method is described as applied to a computer device (for example, the terminal 110 or the server 120 in Fig. 1). The method may include the following steps S202 to S210.
S202: extract keywords of the text to be processed, and obtain a weight of each keyword.
The text to be processed is a text whose category needs to be determined. It may be a short text, that is, a text of short length, for example no more than 160 characters. Common short texts include microblog posts, article titles, opinion comments, SMS messages, and literature abstracts, but are not limited to these. The text to be processed may also be a long text, that is, a text longer than a short text.
A keyword is a representative word in the text to be processed and can be used to characterize the main idea of the text. Specifically, keyword extraction may be performed on the text to be processed to obtain its keywords. Keyword extraction may be implemented in any applicable manner, such as the TextRank algorithm, the RAKE algorithm, or a topic-model algorithm, which is not specifically limited here.
The weight of a keyword can be used to characterize the importance of the keyword to the text to be processed. The weight may be determined according to the TF-IDF value of the keyword, where the TF-IDF value is the term frequency (TF) of the keyword in the text to be processed multiplied by the inverse document frequency (IDF) of the keyword.
The term frequency of a keyword in the text to be processed is the number of times the keyword occurs in the text. The inverse document frequency of a keyword can be determined from the number of all objects in a target corpus and the number of those objects that contain the keyword.
In one embodiment, the target corpus may be the corpus corresponding to a web search service. Accordingly, the number of objects in the target corpus that contain the keyword may be the total number of search results obtained by searching for the keyword through the web search service, and the number of all objects in the target corpus may be set to a predetermined value, for example 1+100000000.
It should be noted that after the web search service is called to search for a keyword, the number of search results for the keyword and the semantic description information corresponding to the keyword can be obtained together and associated with each other. Accordingly, when the semantic description information corresponding to a keyword is obtained, the number of search results for that keyword is obtained along with it and can be used directly when calculating the inverse document frequency of the keyword, without a separate call to the web search service.
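The TF-IDF calculation described above can be sketched as follows. This is a minimal illustration and not the formula of this application: the logarithmic form `log(total / (hits + 1))` is an assumed, commonly used variant, and the names `TOTAL_OBJECTS`, `inverse_document_frequency`, and `tf_idf` are hypothetical. Only the definition of term frequency as an occurrence count and the example corpus size of 1+100000000 come from the text.

```python
import math
from typing import List

# One possible predetermined corpus size mentioned in the text.
TOTAL_OBJECTS = 1 + 100_000_000


def inverse_document_frequency(hit_count: int, total_objects: int = TOTAL_OBJECTS) -> float:
    """Assumed IDF variant: log(total / (hits + 1)).

    hit_count stands for the number of search results returned for the keyword.
    """
    return math.log(total_objects / (hit_count + 1))


def tf_idf(keyword: str, words: List[str], hit_count: int) -> float:
    """TF is the number of occurrences of the keyword in the segmented text."""
    term_frequency = words.count(keyword)
    return term_frequency * inverse_document_frequency(hit_count)
```

Under this assumed form, a keyword with few search results receives a larger IDF than a keyword with many, so rarer keywords end up with higher TF-IDF values.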
In one embodiment, determining the weight of each keyword according to the TF-IDF values of the keywords of the text to be processed may be implemented as follows: the computer device determines an original weight of each keyword according to its TF-IDF value, and then normalizes the original weights to obtain the weight of each keyword. Normalizing the original weight of a keyword may consist of dividing it by the sum of the original weights of all keywords of the text to be processed. The original weight of a keyword may be the TF-IDF value of the keyword itself, or the product of the TF-IDF value of the keyword and the word length of the keyword.
For example, assume the text to be processed is "trap that banker widens land". The extracted keywords and their weights may be "banker: 0.3; big land: 0.5; trap: 0.2", where keywords are separated by ";", the value after ":" is the weight of the keyword, and the weights of the keywords of the text to be processed sum to 1.
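The normalization step can be sketched as below. The function name `normalize_weights` and the flag `use_word_length` are hypothetical; the sketch only assumes what the text states, namely that the original weight is either the TF-IDF value or its product with the word length, and that each original weight is divided by the sum of all original weights.

```python
from typing import Dict


def normalize_weights(tf_idf_values: Dict[str, float],
                      use_word_length: bool = False) -> Dict[str, float]:
    """Turn per-keyword TF-IDF values into weights that sum to 1."""
    # Original weight: TF-IDF itself, or TF-IDF multiplied by the word length.
    original = {
        keyword: (value * len(keyword) if use_word_length else value)
        for keyword, value in tf_idf_values.items()
    }
    total = sum(original.values())
    if total == 0:
        # Degenerate case (an assumption of this sketch): no usable weights.
        return {keyword: 0.0 for keyword in original}
    return {keyword: weight / total for keyword, weight in original.items()}
```

With TF-IDF values in the ratio 3 : 5 : 2, this yields weights of 0.3, 0.5, and 0.2, matching the proportions in the example above.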
S204: obtain semantic description information corresponding to each keyword.
Semantic description information is information that helps in understanding the meaning expressed by a keyword. Its data format may be a text file.
In one embodiment, the semantic description information of a keyword may be determined from information compiled by relevant personnel to describe the keyword (hereinafter referred to as expert description information); the relevant personnel may be experts in the relevant field. Specifically, experts may compile expert description information for each candidate keyword, and an expert knowledge base is then built from the candidate keywords, the pieces of expert description information, and the matching relationships between them. Accordingly, when the semantic description information of a keyword is needed, the candidate keyword corresponding to the keyword is looked up in the expert knowledge base, and the semantic description information of the keyword may include the expert description information matched with the candidate keyword that is found.
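A lookup against an expert knowledge base could be sketched as follows. The dictionary contents, the name `EXPERT_KNOWLEDGE_BASE`, and the exact-match lookup are all assumptions made for illustration; the text does not specify how a keyword is matched to a candidate keyword or how the knowledge base is stored.

```python
from typing import Dict, Optional

# Hypothetical knowledge base: candidate keyword -> expert description information.
EXPERT_KNOWLEDGE_BASE: Dict[str, str] = {
    "trap": "A situation set up to mislead someone into a loss.",
    "banker": "A market participant holding enough capital to move prices.",
}


def get_semantic_description(keyword: str,
                             knowledge_base: Dict[str, str] = EXPERT_KNOWLEDGE_BASE
                             ) -> Optional[str]:
    """Return the expert description matched with the keyword, if any.

    Exact string matching is an assumption of this sketch.
    """
    return knowledge_base.get(keyword)
```

A keyword with no matching candidate keyword returns `None`, leaving the caller free to fall back to another source of semantic description information.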
S206: determine, according to each piece of semantic description information, a first relevance between each keyword and each candidate category.
The first relevance between a keyword and a candidate category is a metric of the degree of matching between the keyword and the candidate category. Its value range may be [0, +1]. The larger the first relevance, the higher the degree of matching between the keyword and the candidate category; conversely, the smaller the first relevance, the lower the degree of matching.
In this embodiment, there is more than one candidate category, and the computer device determines the first relevance between each keyword and each candidate category according to the semantic description information corresponding to that keyword.
For example, as shown in Fig. 3, keyword extraction on the text to be processed LT1 yields three keywords, Kw1, Kw2, and Kw3, which correspond to semantic description information Sd1, Sd2, and Sd3 respectively, and there are three candidate categories, C1, C2, and C3.
Accordingly, the computer device determines, according to the semantic description information Sd1, the first relevance between keyword Kw1 and each of the candidate categories C1, C2, and C3. Likewise, it determines, according to Sd2, the first relevance between keyword Kw2 and each of C1, C2, and C3, and, according to Sd3, the first relevance between keyword Kw3 and each of C1, C2, and C3.
In one embodiment, the computer device may determine the first relevance between each keyword and each candidate category according to the semantic description information of the keyword and the category description information of the candidate category. For example, the computer device determines the first relevance between keyword Kw1 and candidate category C1 according to the semantic description information Sd1 and the category description information of C1. The category description information of a candidate category is information that reflects the characteristics of that candidate category.
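As one way to picture a first relevance (degree of correlation) computed from a semantic description and a category description, the sketch below uses cosine similarity over word counts. This technique is an assumption chosen only because it yields values in [0, 1]; the text at this point does not state which measure is used, and the function name `first_relevance` is hypothetical.

```python
import math
from collections import Counter


def first_relevance(semantic_description: str, category_description: str) -> float:
    """Cosine similarity of bag-of-words counts (an assumed stand-in measure)."""
    a = Counter(semantic_description.lower().split())
    b = Counter(category_description.lower().split())
    # Dot product over the words the two descriptions share.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Descriptions with no words in common score 0, and identical descriptions score approximately 1, which matches the stated value range of the first relevance.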
S208: determine, according to the weight of each keyword and each first relevance, a second relevance between the text to be processed and each candidate category.
The second relevance between the text to be processed and a candidate category is a metric of the degree of matching between the text and the candidate category. Its value range may be [0, +1]. The larger the second relevance, the higher the degree of matching between the text to be processed and the candidate category; conversely, the smaller the second relevance, the lower the degree of matching.
In this embodiment, for each candidate category, the computer device computes a weighted sum of the first relevances between the keywords of the text to be processed and that candidate category, using the weights of the keywords, to obtain the second relevance between the text to be processed and that candidate category.
Continuing the preceding example, as shown in Fig. 4, the computer device computes a weighted sum from the weight of keyword Kw1 and the first relevance between Kw1 and candidate category C1, the weight of keyword Kw2 and the first relevance between Kw2 and C1, and the weight of keyword Kw3 and the first relevance between Kw3 and C1, to obtain the second relevance between the text to be processed LT1 and C1. In the same way, it computes the second relevance between LT1 and candidate category C2 from the weights of Kw1, Kw2, and Kw3 and their first relevances with C2, and the second relevance between LT1 and candidate category C3 from the weights of Kw1, Kw2, and Kw3 and their first relevances with C3.
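The weighted summation in S208 can be sketched as follows. The function name `second_relevance` and the nested-dictionary layout are hypothetical choices for illustration; the computation itself is the weighted sum described above.

```python
from typing import Dict


def second_relevance(weights: Dict[str, float],
                     first: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each category, sum weight(keyword) * first_relevance(keyword, category).

    weights maps keyword -> weight; first maps keyword -> {category: first relevance}.
    """
    scores: Dict[str, float] = {}
    for keyword, weight in weights.items():
        for category, relevance in first.get(keyword, {}).items():
            scores[category] = scores.get(category, 0.0) + weight * relevance
    return scores
```

Because the keyword weights sum to 1 and each first relevance lies in [0, 1], each resulting second relevance also lies in [0, 1].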
S210: determine, according to each second relevance, the category to which the text to be processed belongs from among the candidate categories.
The category to which the text to be processed belongs may include those candidate categories whose second relevance with the text to be processed satisfies a relevance screening condition. The relevance screening condition can be set according to actual needs.
Specifically, the relevance screening condition may include: the second relevance with the text to be processed is equal to or greater than a relevance threshold, where the relevance threshold is predetermined according to actual needs. The relevance screening condition may also include: the second relevance with the text to be processed is among a predetermined number of the largest second relevances. That is, the second relevances are sorted in descending order, and the category to which the text to be processed belongs may include the candidate categories corresponding to the predetermined number of second relevances at the front of the ranking. The predetermined number can be set to any positive integer according to actual needs.
It should be noted that a category system (a system formed by dividing a target domain in a standardized manner) can be preset, and the category system includes more than one candidate category. Accordingly, the category to which the text to be processed belongs can be determined, according to each second relevance, from the candidate categories included in the category system.
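The two screening conditions described above (a relevance threshold and a predetermined number of top-ranked categories) can be sketched as follows. The function name `select_categories` is hypothetical, and allowing both conditions to be applied together is a design choice of this sketch rather than something the text prescribes.

```python
from typing import Dict, List, Optional


def select_categories(second: Dict[str, float],
                      threshold: Optional[float] = None,
                      top_k: Optional[int] = None) -> List[str]:
    """Pick categories by second relevance (degree of correlation)."""
    # Sort categories by second relevance, largest first.
    ranked = sorted(second, key=lambda category: second[category], reverse=True)
    if threshold is not None:
        ranked = [c for c in ranked if second[c] >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked
```

For example, calling `select_categories(scores, threshold=0.4)` keeps every category at or above 0.4, while `select_categories(scores, top_k=1)` keeps only the best-matching category.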
In one embodiment, after step S210, the method may further include the following step: performing category labeling on the text to be processed according to the category to which it belongs. Category labeling may specifically consist of outputting a category labeling result for the text to be processed. The category labeling result may include the text to be processed, the category to which it belongs, and the relevance between the text and that category. This relevance may be determined from the second relevance between the text to be processed and the category to which it belongs; for example, it may be that second relevance itself.
Accordingly, in practical application scenarios, an automatic labeling system can be built on the method for determining a text category, and the system can be used to label texts with categories. The automatic labeling system can also provide display and query services for the category labeling results of texts. The display and query interface may be as shown in Fig. 5. The user can click the "previous page" and "next page" controls in the interface to browse, page by page, the texts whose category labeling is complete. The user can also enter a text or a text ID in the input box 500 and click the "query" control to look up the category ID and category name of the category to which the text belongs, together with its relevance to the text. The user can further enter a category name or category ID in the input box 500 to look up the texts under the corresponding category.
It should be noted that in the conventional approach, the categories of a number of texts are first labeled manually to obtain training samples, a machine learning model such as a neural network is trained on the training samples to obtain a mapping model, and the text to be processed is input into the mapping model, which determines its category. Because of the following defects, this approach cannot meet the need to determine the categories of massive amounts of text in practical business scenarios such as advertising, recommendation, and search.
(1) Manually labeling training samples consumes a large amount of labor. Specifically, the number of training samples that must be labeled manually grows linearly with the number of candidate categories in the category system. For example, assuming that supporting one candidate category requires manually labeling 10,000 training samples, supporting a category system with 1,000 candidate categories requires manually labeling 10,000,000 training samples, which consumes enormous manpower and material resources.
(2) The quality of the category determined for the text to be processed depends heavily on the quality of manual labeling. To determine categories with high quality through a mapping model, the manual labels of the training samples must be highly accurate, and the proportions of the candidate categories among the training samples must be consistent with those in the true overall population. However, training samples produced by manual labeling (especially large-scale manual labeling) can hardly meet these requirements. In practice, the quality of manually labeled training samples is generally poor, so the conventional approach cannot determine the category of the text to be processed with high quality.
(3) New knowledge cannot be learned automatically, and the mapping model cannot be updated automatically. The conventional approach learns the knowledge of mapping texts to categories from manually labeled training samples to obtain the mapping model. Without new training samples, it can neither learn new knowledge automatically nor update the mapping model automatically. However, new words and trending words keep emerging on the Internet. In the conventional approach, unless manual labeling is carried out again, the mapping model cannot understand a text to be processed that contains these new words and trending words, and therefore cannot accurately determine the category to which the text belongs.
(4) Introducing manually accumulated professional knowledge into the process of determining the category of the text to be processed is not supported. In common practical business scenarios, relevant personnel have accumulated a great deal of professional knowledge that is useful for determining the category of a text, such as which keywords are related to which categories and which of the candidate categories in the category system should be given priority. Introducing this professional knowledge into the process of determining the category of the text to be processed can improve the quality of the determined category. In the conventional approach, however, the text to be processed is input into the mapping model, and the mapping model outputs a result characterizing the category to which the text belongs. The intermediate processing logic is abstract and hard for people to understand, so manual intervention is impossible, and introducing manually accumulated professional knowledge into the process is not supported.
In the method for determining a text category provided by the embodiments of this application, keywords of the text to be processed are extracted and a weight of each keyword is obtained; semantic description information corresponding to each keyword is then obtained; a first relevance between each keyword and each candidate category is determined according to each piece of semantic description information; a second relevance between the text to be processed and each candidate category is determined according to the weight of each keyword and each first relevance; and the category to which the text to be processed belongs is determined from among the candidate categories according to each second relevance. In this way, even when no text with a known category is available, the computer device can determine the category of any text fully automatically. This eliminates the manual category-labeling step, saves labor costs, and removes the dependence of the quality of the determined category on the quality of manual labeling. In addition, throughout the process from obtaining the text to be processed to determining its category, the intermediate logic is understandable to people, which makes it possible to introduce manually accumulated professional knowledge into the process of determining the category of the text to be processed.
In one embodiment, the step of extracting the keywords of the text to be processed may include the following steps: performing word segmentation processing on the text to be processed to obtain multiple first participles (segmented words) of the text to be processed; rejecting, from the first participles, those that belong to the goal filtering dictionary, to obtain one or more second participles, the second participles being the first participles that remain after the rejection; and obtaining the keywords of the text to be processed according to the second participles.
Word segmentation processing is used to split the text to be processed into several words. It can be implemented with any feasible word segmentation approach, such as conditional random field (Conditional Random Field, CRF) segmentation, Jieba segmentation, NLPIR segmentation, LTP (Language Technology Platform) segmentation, or THULAC (THU Lexical Analyzer for Chinese) segmentation.
Conditional random field segmentation segments the text to be processed according to conditional random field theory, taking into account both the frequency with which words occur in the text to be processed and the contextual relationships between words. This approach performs well on ambiguous words and new words.
It should be noted that the following optimization strategy may be used when performing word segmentation processing on the text to be processed: the text content inside book title marks (《》) may be taken as a whole as one participle, without being split. For example, if the text to be processed is "The metaphorical meaning of the tiger in 《Life of Pi》", then "Life of Pi" is taken as a whole as one participle, rather than being split into words such as "Young Pi" and "Fantastic Drift" (the components of the Chinese title).
When the text to be processed is a short text, at least one of the following two optimization strategies may also be used. First, the text content inside brackets of a predetermined form may be taken as a whole as one participle, without being split, where the brackets of a predetermined form include at least one of parentheses, square brackets and curly braces. Second, the text content before the first colon near the head of the text to be processed may be taken as a whole as one participle, without being split. For example, if the text to be processed is "LPL match report: WE comes back from one game down to beat TOP 2:1", then "LPL match report" is taken as a whole as one participle.
Taking the text content in the above situations as a whole as one participle, without splitting it, effectively preserves the actual meaning of that text content and helps improve the accuracy of the determined keywords.
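The optimization strategies above can be sketched as a thin wrapper around any base segmenter. This is a minimal illustration, not the implementation of the application: the names `protected_spans` and `segment`, the regular expressions, and the choice to drop a protected span that overlaps an earlier one are all assumptions made for the example.

```python
import re

# Hypothetical patterns for text that should stay whole as one participle.
BOOK_TITLE = re.compile(r"《([^》]*)》")
BRACKETED = re.compile(r"\(([^)]*)\)|\[([^\]]*)\]|\{([^}]*)\}")
FIRST_COLON = re.compile(r"[:：]")


def protected_spans(text, short_text=False):
    """Return sorted (start, end, content) spans that must not be split."""
    spans = [(m.start(), m.end(), m.group(1)) for m in BOOK_TITLE.finditer(text)]
    if short_text:
        for m in BRACKETED.finditer(text):
            inner = next(g for g in m.groups() if g is not None)
            spans.append((m.start(), m.end(), inner))
        colon = FIRST_COLON.search(text)
        if colon:
            # Everything before the first colon is kept as one participle.
            spans.append((0, colon.end(), text[:colon.start()]))
    kept, last_end = [], 0
    for start, end, content in sorted(spans):
        if start >= last_end:  # assumption: drop spans overlapping an earlier one
            kept.append((start, end, content))
            last_end = end
    return kept


def segment(text, base_segmenter, short_text=False):
    """Segment text, passing only the unprotected parts to base_segmenter."""
    words, pos = [], 0
    for start, end, content in protected_spans(text, short_text):
        words.extend(base_segmenter(text[pos:start]))
        words.append(content)
        pos = end
    words.extend(base_segmenter(text[pos:]))
    return [w for w in words if w.strip()]
```

Here `base_segmenter` stands for whichever word segmentation approach is in use; for a quick check with whitespace-separated text, `str.split` can be passed in its place.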
A filtering dictionary is a database used to record filter words. Filtering dictionaries may correspond one-to-one with data sources. The data source to which a text belongs is the origin of the text. For example, the topic texts of the public accounts that a user follows may be assigned to one data source, the title texts of the articles that the user has read to another data source, the description texts of the commodities that the user has bought to another data source, the description texts of the videos that the user has watched to another data source, and so on.
The filter words recorded in a filtering dictionary are the words that need to be rejected from the first participles of the text to be processed. They may include at least one of the following: words that have no actual meaning, and words that are common in the texts covered by the corresponding data source. It can be understood that a word that is common in the texts covered by a data source can hardly characterize the features of any single text in that data source, so it is difficult to distinguish between the texts covered by the data source on the basis of that word. The word can therefore be recorded in the filtering dictionary corresponding to the data source as a filter word of that data source.
The goal filtering dictionary is the filtering dictionary corresponding to the target data source, and the target data source is the data source to which the text to be processed belongs. In a specific implementation, the computer equipment may first determine the data source to which the text to be processed belongs (i.e. the target data source), and then determine the filtering dictionary corresponding to the target data source (i.e. the goal filtering dictionary).
In this embodiment, the computer equipment may perform word segmentation processing on the text to be processed to obtain the first participles of the text to be processed, then reject, from the first participles, those that are identical to a filter word recorded in the goal filtering dictionary, and then obtain the keywords of the text to be processed according to the first participles that remain after the rejection (i.e. the second participles).
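The rejection step can be illustrated with a short sketch. The contents of `FILTERING_DICTIONARIES`, the data source names, and the function name `second_participles` are hypothetical and serve only to show one filtering dictionary being selected per data source.

```python
# Hypothetical filtering dictionaries, one per data source.
FILTERING_DICTIONARIES = {
    "sports news": {"match", "the"},
    "video descriptions": {"video", "watch"},
}


def second_participles(first_participles, target_data_source):
    """Reject first participles found in the goal filtering dictionary."""
    # The goal filtering dictionary is the one kept for the target data source.
    goal_dictionary = FILTERING_DICTIONARIES.get(target_data_source, set())
    return [w for w in first_participles if w not in goal_dictionary]
```

Keeping each dictionary as a set makes the membership test for every first participle a constant-time lookup.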
In one embodiment, the goal filtering dictionary may be constructed by the following steps: performing word segmentation processing on the texts belonging to the target data source to obtain multiple third participles; determining the first ratio corresponding to each third participle; and constructing the goal filtering dictionary according to the third participles whose first ratio exceeds the first proportion threshold.
The first ratio corresponding to a third participle may be the proportion of the total number of texts of the target data source accounted for by the texts in the target data source that contain the third participle. Suppose the target data source "sports news" covers 10 texts, 6 of which contain the third participle "match"; then the first ratio corresponding to the third participle "match" is 6/10 = 60%.
The first proportion threshold is the standard used to measure whether a third participle is a word common in the texts covered by the target data source. If the first ratio corresponding to a third participle exceeds the first proportion threshold, the third participle is a word common in the texts covered by the target data source and should be used as a filter word of the target data source. If the first ratio corresponding to a third participle does not exceed the first proportion threshold, the third participle is not a word common in the texts covered by the target data source and should not be used as a filter word of the target data source. The first proportion threshold may be determined in any applicable manner, for example set manually according to actual needs. Continuing the previous example, if the first proportion threshold is set to 20%, the first ratio of 60% corresponding to the third participle "match" exceeds 20%, so "match" is a word common in the texts covered by the target data source "sports news" and should be used as a filter word of that data source.
In this embodiment, the target data source covers several texts, and the computer equipment may perform word segmentation processing on each text covered by the target data source, thereby obtaining several third participles. For each third participle, the computer equipment determines the proportion of the total number of texts of the target data source accounted for by the texts that contain the third participle, so as to obtain the first ratio corresponding to each third participle. The third participles whose first ratio exceeds the first proportion threshold are then selected from all the third participles, and the goal filtering dictionary is constructed from the selected third participles. Accordingly, the constructed goal filtering dictionary records the third participles whose first ratio exceeds the first proportion threshold.
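The construction described above can be sketched as follows. This is a minimal illustration that assumes the texts have already been segmented into lists of third participles; the function names `first_ratios` and `build_goal_filtering_dictionary` are introduced here for the example only.

```python
from collections import Counter


def first_ratios(segmented_texts):
    """Map each third participle to the share of texts that contain it."""
    text_counts = Counter()
    for words in segmented_texts:
        text_counts.update(set(words))  # count each text at most once per word
    total = len(segmented_texts)
    return {word: count / total for word, count in text_counts.items()}


def build_goal_filtering_dictionary(segmented_texts, first_threshold):
    """Keep the third participles whose first ratio exceeds the threshold."""
    ratios = first_ratios(segmented_texts)
    return {word for word, ratio in ratios.items() if ratio > first_threshold}
```

With 10 texts of which 6 contain "match" and a first proportion threshold of 0.2, "match" has a first ratio of 0.6 and is recorded as a filter word, which matches the "sports news" example.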
In one embodiment, the words that have no actual meaning for the target data source may also be determined manually on the basis of experience, and the goal filtering dictionary is then constructed from these words. Accordingly, the constructed goal filtering dictionary records the manually determined words that have no actual meaning for the target data source.
In one embodiment, the goal filtering dictionary may also be constructed jointly from the manually determined words that have no actual meaning for the target data source and the third participles whose first ratio exceeds the first proportion threshold. That is, the constructed goal filtering dictionary records both the manually determined words that have no actual meaning for the target data source and the third participles whose first ratio exceeds the first proportion threshold.
It should be noted that constructing the filtering dictionary corresponding to a data source may be preparatory work completed in advance. Specifically, the filtering dictionary corresponding to each data source may be constructed in advance. After the text to be processed is obtained, the data source to which it belongs (i.e. the target data source) is determined, and when the filtering dictionary corresponding to the target data source (i.e. the goal filtering dictionary) is needed, it is looked up directly among the filtering dictionaries constructed in advance, without having to be constructed on the spot. In addition, the filtering dictionary corresponding to each data source may be updated periodically.
In one embodiment, in addition to the manual setting described above, the first proportion threshold may also be determined using the following steps, as shown in Fig. 6: S602, determining the fourth participles from the third participles according to the current proportion threshold; S604, determining the remaining word number corresponding to each text belonging to the target data source; S606, determining the second ratio, which is the proportion of the total number of texts of the target data source accounted for by the texts whose remaining word number is equal to or greater than the word number threshold; S608, when the second ratio does not exceed the second proportion threshold, determining the current proportion threshold as the first proportion threshold; S610, when the second ratio exceeds the second proportion threshold, updating the current proportion threshold according to the downward adjustment value, and returning to the step of determining the fourth participles from the third participles according to the current proportion threshold.
The fourth participles may include the third participles whose first ratio is equal to or greater than the current proportion threshold. Specifically, among the third participles obtained by performing word segmentation processing on the texts belonging to the target data source, those whose first ratio is equal to or greater than the current proportion threshold are selected as the fourth participles.
The remaining word number corresponding to a text may be the number of third participles that remain after the fourth participles are rejected from the third participles of the text.
For example, as shown in Fig. 7, suppose the texts belonging to the target data source are texts CT1, CT2 and CT3. Word segmentation processing on CT1 yields third participles Pc3-1, Pc3-2, Pc3-3, Pc3-4 and Pc3-5; word segmentation processing on CT2 yields third participles Pc3-1, Pc3-2, Pc3-6, Pc3-7 and Pc3-8; and word segmentation processing on CT3 yields third participles Pc3-6, Pc3-7, Pc3-8, Pc3-9 and Pc3-10. The third participles obtained from CT1, CT2 and CT3 are therefore Pc3-1 through Pc3-10, 10 third participles in total.
Suppose that, among third participles Pc3-1 through Pc3-10, those whose first ratio is equal to or greater than the current proportion threshold N1 (i.e. the fourth participles) are Pc3-1, Pc3-2, Pc3-3, Pc3-4 and Pc3-9. Then, for the third participles of text CT1 (Pc3-1, Pc3-2, Pc3-3, Pc3-4 and Pc3-5), the third participle remaining after the fourth participles are rejected is Pc3-5; that is, for the current proportion threshold N1, the remaining word number corresponding to text CT1 is 1. For the third participles of text CT2 (Pc3-1, Pc3-2, Pc3-6, Pc3-7 and Pc3-8), the third participles remaining after the fourth participles are rejected are Pc3-6, Pc3-7 and Pc3-8; that is, for the current proportion threshold N1, the remaining word number corresponding to text CT2 is 3. For the third participles of text CT3 (Pc3-6, Pc3-7, Pc3-8, Pc3-9 and Pc3-10), the third participles remaining after the fourth participles are rejected are Pc3-6, Pc3-7, Pc3-8 and Pc3-10; that is, for the current proportion threshold N1, the remaining word number corresponding to text CT3 is 4.
The second ratio is the proportion of the total number of texts of the target data source accounted for by the texts whose remaining word number is equal to or greater than the word number threshold. The word number threshold may be determined according to actual needs. Continuing the previous example, the texts belonging to the target data source are texts CT1, CT2 and CT3, 3 texts in total. Assuming the word number threshold is 3, the texts whose remaining word number is equal to or greater than 3 are CT2 and CT3, so the second ratio is 2/3 (about 66.7%).
Can the second proportion threshold value be used as the first proportion threshold value for measuring current proportion threshold value.After determining the second ratio,
Judge that the second ratio whether more than the second proportion threshold value, if not exceeded, current proportion threshold value is determined as the first proportion threshold value, is tied
Beam determines the process of the first proportion threshold value;If being more than, show that current proportion threshold value can not be as the first proportion threshold value, then under
It adjusts numerical value to update current proportion threshold value, i.e., subtracts downward numerical value on the basis of current proportion threshold value, work as further according to updated
Preceding proportion threshold value re-executes the step of determining the 4th participle from each third participle and its subsequent step.Second proportion threshold value
It can determine according to actual needs, for example can be set to 90%.
In addition, each data source may correspond to one first proportion threshold. In the process of determining the first proportion threshold corresponding to a data source, when the current proportion threshold is determined for the first time, the initial proportion threshold is taken as the current proportion threshold. The initial proportion threshold may be predetermined according to actual needs; for example, it may be set to 100%.
It should be noted that, in contrast to manually setting the first proportion threshold, this embodiment determines, according to the current proportion threshold, the second ratio of texts whose remaining word number is equal to or greater than the word number threshold among the texts belonging to the target data source. When the second ratio exceeds the second proportion threshold, the current proportion threshold is lowered and the second ratio is determined again, until the second ratio no longer exceeds the second proportion threshold, at which point the current proportion threshold is determined as the first proportion threshold. In this way, the first proportion threshold is determined automatically by the computer equipment, and the accuracy of the determined first proportion threshold is improved.
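Steps S602 to S610 can be sketched as a simple loop. This is an illustration under stated assumptions rather than the implementation of the application: thresholds are expressed as integer percentages to avoid floating-point drift, the default downward adjustment value of 10 percentage points is arbitrary, and the loop stops once the current proportion threshold reaches 0, a safeguard that the description does not specify. `FIG7_TEXTS` reuses the participles of the Fig. 7 example.

```python
from collections import Counter

# Third participles of texts CT1, CT2 and CT3 from the Fig. 7 example.
FIG7_TEXTS = [
    ["Pc3-1", "Pc3-2", "Pc3-3", "Pc3-4", "Pc3-5"],
    ["Pc3-1", "Pc3-2", "Pc3-6", "Pc3-7", "Pc3-8"],
    ["Pc3-6", "Pc3-7", "Pc3-8", "Pc3-9", "Pc3-10"],
]


def remaining_word_numbers(segmented_texts, fourth_participles):
    """S604: count the third participles left in each text after rejection."""
    return [sum(1 for w in words if w not in fourth_participles)
            for words in segmented_texts]


def choose_first_threshold(segmented_texts, word_number_threshold,
                           second_threshold_pct=90, initial_pct=100, step_pct=10):
    """Return the first proportion threshold as an integer percentage."""
    total = len(segmented_texts)
    text_counts = Counter()
    for words in segmented_texts:
        text_counts.update(set(words))
    current_pct = initial_pct
    while True:
        # S602: fourth participles have a first ratio >= the current threshold.
        fourth = {w for w, c in text_counts.items() if c * 100 >= current_pct * total}
        remaining = remaining_word_numbers(segmented_texts, fourth)
        # S606: number of texts that still keep enough words.
        kept = sum(1 for n in remaining if n >= word_number_threshold)
        # S608: accept once the second ratio does not exceed the second threshold
        # (or once the threshold cannot be lowered further, an added safeguard).
        if kept * 100 <= second_threshold_pct * total or current_pct <= 0:
            return current_pct
        # S610: lower the current threshold by the downward adjustment value.
        current_pct = max(current_pct - step_pct, 0)
```

Calling `remaining_word_numbers` with the fourth participles assumed in the example (Pc3-1, Pc3-2, Pc3-3, Pc3-4 and Pc3-9) gives remaining word numbers of 1, 3 and 4 for CT1, CT2 and CT3, as in the text.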
In one embodiment, the step of obtaining the keywords of the text to be processed according to the second participles may include the following steps: permuting and combining the second participles to obtain fifth participles; determining sixth participles from the fifth participles; determining seventh participles from the sixth participles; and obtaining the keywords of the text to be processed according to the seventh participles.
A fifth participle comprises at least two consecutively adjacent second participles. Specifically, after the first participles belonging to the goal filtering dictionary are rejected from the first participles, the remaining first participles (i.e. the second participles) are permuted and combined according to a preset permutation and combination rule to obtain all combined words that comprise at least two consecutively adjacent second participles. These combined words are the fifth participles.
For example, suppose the first participles remaining after the first participles belonging to the goal filtering dictionary are rejected are "China", "People" and "Liberation Army". After permutation and combination according to the preset rule, a total of 3 combined words comprising at least two consecutively adjacent second participles are obtained: "Chinese People", "People's Liberation Army" and "Chinese People's Liberation Army". These 3 combined words are the 3 fifth participles. It should be noted that, because "China" and "Liberation Army" are not consecutively adjacent, "China Liberation Army" is not a fifth participle.
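One way to enumerate the fifth participles is to take every contiguous run of at least two second participles. The sketch below assumes that "consecutively adjacent" refers to adjacency in the ordered list of second participles and that a combined word is formed by joining the run with a `joiner` string (empty for Chinese text); both points, and the function name, are assumptions.

```python
def fifth_participles(second_participles, joiner=""):
    """Join every run of at least two adjacent second participles."""
    n = len(second_participles)
    return [joiner.join(second_participles[i:j])
            for i in range(n)
            for j in range(i + 2, n + 1)]
```

For three second participles this yields exactly three combined words: the two adjacent pairs and the full run. The first and last participles are never combined on their own because they are not adjacent.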
A sixth participle may be a fifth participle that belongs to the existing entries. The existing entries may include encyclopedia entries, which are entries that can be found through an encyclopedia search service. For example, encyclopedia entries may include the entries included in Baidu Baike and the entries included in Wikipedia. Specifically, the computer equipment may judge whether each fifth participle belongs to the existing entries, and then take the fifth participles that belong to the existing entries as the sixth participles.
A seventh participle is a sixth participle that is not contained in any of the sixth participles other than itself. Here, if participle B contains the full content of participle A, then participle A is contained in participle B; if participle B contains only part of the content of participle A and not its full content, then participle A is not contained in participle B (participle A and participle B are any two different participles, and "A" and "B" are used only to tell them apart).
For example, suppose the sixth participles are "Chinese People", "People's Liberation Army" and "Chinese People's Liberation Army". Because "Chinese People's Liberation Army" contains the full content of "Chinese People", "Chinese People" is contained in "Chinese People's Liberation Army" and is therefore not a seventh participle. Because "Chinese People's Liberation Army" also contains the full content of "People's Liberation Army", "People's Liberation Army" is contained in "Chinese People's Liberation Army" and is likewise not a seventh participle. Only "Chinese People's Liberation Army" is not contained in any sixth participle other than itself (it is contained neither in "Chinese People" nor in "People's Liberation Army"), so the seventh participle determined from these 3 sixth participles is "Chinese People's Liberation Army".
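The selection of sixth and seventh participles can be sketched as two filters. The sketch assumes that the existing entries are available as a local set (in practice they would come from an encyclopedia search service) and that "contained in" can be tested as a substring relation; the function names are hypothetical.

```python
def sixth_participles(fifth, existing_entries):
    """Keep the fifth participles that are existing entries."""
    return [word for word in fifth if word in existing_entries]


def seventh_participles(sixth):
    """Keep the sixth participles not contained in any other sixth participle."""
    return [a for a in sixth
            if not any(a != b and a in b for b in sixth)]
```

Applied to the three sixth participles of the example, only the longest combined word survives, since each of the other two is a substring of it.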
In one embodiment, obtaining the keywords of the text to be processed according to the seventh participles may specifically be taking the seventh participles as the keywords of the text to be processed.
In one embodiment, the step of obtaining the semantic description information corresponding to each keyword, i.e. step S204, may include the following steps: obtaining the web search message corresponding to each keyword; and obtaining the semantic description information corresponding to each keyword according to the web search message corresponding to that keyword.
The web search message corresponding to a keyword can be obtained by searching for the keyword through a network search service. The network search service may be a service that searches for information over the Internet, and may include at least one of a web page search service and an encyclopedia search service. Examples of web page search services include the Baidu web page search service and the Google web page search service. Examples of encyclopedia search services include the Baidu Baike search service and the Wikipedia search service, but the services are not limited to these.
In one embodiment, the web search message may include the target search results obtained by searching for the keyword through the network search service. The target search results may include the several search results with the highest degree of correlation among all the search results obtained (the specific number can be set according to actual needs). For example, the search results obtained by searching for a keyword through a network search service are generally sorted by degree of correlation from high to low, so that the degree of correlation decreases from front to back; accordingly, the first 50 search results may be taken as the target search results.
In one embodiment, obtaining the semantic description information corresponding to a keyword according to the web search message corresponding to the keyword may specifically mean that the semantic description information corresponding to the keyword includes the web search message corresponding to the keyword.
In another embodiment, in combination with the description above, the semantic description information corresponding to a keyword may also be determined jointly from the expert description information corresponding to the keyword and the web search message corresponding to the keyword. Specifically, the semantic description information corresponding to the keyword may include both the web search message corresponding to the keyword and the expert description information corresponding to the keyword.
In one embodiment, the step of obtaining the web search message corresponding to each keyword may include the following step: calling the network search service to search for each keyword separately, to obtain the web search message corresponding to each keyword.
In this embodiment, each time the web search message corresponding to a keyword needs to be obtained, the network search service is called on the spot to search for the keyword, so as to obtain the web search message corresponding to the keyword.
In one embodiment, the step of obtaining the web search message corresponding to each keyword may include the following steps: looking up, in the local information library, the candidate keyword corresponding to each keyword, where the local information library records the matching relationships between candidate keywords and candidate search information, and the candidate search information is obtained by searching for the corresponding candidate keyword through the network search service; when a candidate keyword corresponding to a keyword is found, obtaining the web search message corresponding to the keyword according to the candidate search information matched with the candidate keyword found; and when no candidate keyword corresponding to a keyword is found, calling the network search service to search for the keyword, to obtain the web search message corresponding to the keyword.
In this embodiment, the network search service may be called in advance to search for each candidate keyword separately, and the target search results found for each candidate keyword are taken as the corresponding candidate search information. A database is then generated that records the candidate keywords, the candidate search information, and the matching relationships between the candidate keywords and the candidate search information. The content of the database is then stored on the computer equipment to obtain the local information library.
When the web search message corresponding to a keyword subsequently needs to be obtained, the computer equipment can look up the candidate keyword corresponding to the keyword directly in the local information library. If the candidate keyword corresponding to the keyword is found in the local information library, the network search service has already been called in advance to search for the keyword. In this case there is no need to call the network search service again; the candidate search information matched with the candidate keyword corresponding to the keyword is used directly as the web search message corresponding to the keyword.
Conversely, if no candidate keyword corresponding to the keyword is found in the local information library, the network search service has not been called in advance to search for the keyword, and the local information library therefore stores no candidate search information that could serve as the web search message corresponding to the keyword. In this case, the computer equipment may call the network search service on the spot to search for the keyword, and the target search results found for the keyword are the web search message corresponding to the keyword. Further, the keyword and the target search results found for it may be taken as a newly added candidate keyword and newly added candidate search information, and written to both the local information library and the database that records the candidate keywords, the candidate search information, and the matching relationships between them.
It should be noted that calling the network search service to search for a keyword only when no corresponding candidate keyword is found in the local information library can greatly improve the efficiency of determining the classification to which the text to be processed belongs. Moreover, in practical applications the number of keywords that frequently occur in texts is relatively limited. Once the local information library has accumulated candidate keywords on the order of ten million, the external network search service seldom needs to be called to obtain the web search message corresponding to a keyword, so that the task of determining the classifications of a massive number of texts can be completed extremely efficiently.
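The lookup-then-fallback logic described above can be sketched as follows. This is a minimal illustration that models the local information library as an in-memory dictionary and the network search service as a callable passed in by the caller; both are assumptions, as is the function name `get_web_search_message`.

```python
def get_web_search_message(keyword, local_information_library, search_service):
    """Return cached candidate search information, searching only on a miss."""
    info = local_information_library.get(keyword)
    if info is None:
        # No matching candidate keyword: call the network search service now.
        info = search_service(keyword)
        # Record the keyword as a newly added candidate keyword.
        local_information_library[keyword] = info
    return info
```

Because a missed keyword is written back to the library, a second request for the same keyword is served locally without another call to the external service.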
In addition, the database that records the candidate keywords, the candidate search information, and the matching relationships between them may be updated at regular intervals. For example, every predetermined number of days, the network search service is called again to search for each candidate keyword in the database, so as to update the candidate search information corresponding to each candidate keyword.
In one embodiment, the step of obtaining the semantic description information corresponding to each keyword according to the web search message corresponding to that keyword may include the following step: performing data cleansing on the web search message corresponding to each keyword, to obtain the semantic description information corresponding to each keyword.
Data cleansing removes, from the web search message corresponding to a keyword, the information that is irrelevant to the keyword itself. Accordingly, the semantic description information corresponding to the keyword may include the information that remains after the irrelevant information is removed. The information irrelevant to the keyword itself may include dates, website names, video playback information, music information, common web addresses and the like, but is not limited to these.
For example, if the web search message of a keyword is "2018, Douban score 8.5, comparable to 《Life of Pi》! _Sohu Entertainment_sohu.com [play online]", data cleansing removes the following information: "2018", "_Sohu Entertainment", "_sohu.com" and "[play online]".
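Data cleansing of this kind can be sketched with a few regular expressions. The patterns in `NOISE_PATTERNS` are hypothetical and cover only the shapes that appear in the example (bracketed tags, underscore-prefixed site names, and bare four-digit years); a real deployment would need patterns tuned to its own search results.

```python
import re

# Hypothetical noise patterns, applied in order.
NOISE_PATTERNS = [
    re.compile(r"\[[^\]]*\]"),                   # bracketed tags such as "[play online]"
    re.compile(r"_[^_\s][^_]*"),                 # "_site name" segments in page titles
    re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)"),  # bare four-digit years
]


def clean_web_search_message(message):
    """Remove information that is irrelevant to the keyword itself."""
    for pattern in NOISE_PATTERNS:
        message = pattern.sub(" ", message)
    # Collapse the whitespace left behind and trim stray separators.
    return re.sub(r"\s+", " ", message).strip(" ,")
```

Applied to the example message, the function leaves "Douban score 8.5, comparable to 《Life of Pi》!" as the semantic description information.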
In one embodiment, the step of determining, according to each piece of semantic description information, the first degree of correlation between each keyword and each candidate classification, i.e. step S206, may include the following steps: determining the third degree of correlation between each piece of semantic description information and each candidate classification according to the semantic description information and the category name of each candidate classification; and determining the first degree of correlation between each keyword and each candidate classification according to each third degree of correlation.
The category name of a candidate classification is the name of that candidate classification. A category name may include only a single level, such as "mobile phone app". A category name may also include more than one level, in which case a predetermined connector may be used to separate the levels. For example, "mobile phone app-game-moba" is a category name that includes 3 levels, with the connector "-" separating the levels.
The third degree of correlation between a piece of semantic description information and a candidate classification is determined according to the semantic description information and the category name of the candidate classification, and can be used as a measure of the degree of matching between the semantic description information and the candidate classification. The value range of the third degree of correlation may be [0, +1]. The larger the third degree of correlation, the higher the degree of matching between the semantic description information and the candidate classification, judged on the basis of the semantic description information and the category name of the candidate classification. Conversely, the smaller the third degree of correlation, the lower the degree of matching between the semantic description information and the candidate classification.
As described above, for each keyword, the computer equipment may determine the first degree of correlation between the keyword and each candidate classification according to the semantic description information corresponding to the keyword and the classification description information of each candidate classification. In this embodiment, the classification description information of a candidate classification may include the category name of the candidate classification. Accordingly, for each keyword, the computer equipment determines the third degree of correlation between the semantic description information corresponding to the keyword and each candidate classification according to that semantic description information and the category name of each candidate classification, and then determines the first degree of correlation between the keyword and each candidate classification according to these third degrees of correlation.
For example, as shown in Fig. 8, the keywords of the text to be processed LT1 are keyword Kw1, keyword Kw2 and keyword Kw3. Keyword Kw1 corresponds to semantic description information Sd1, keyword Kw2 corresponds to semantic description information Sd2, and keyword Kw3 corresponds to semantic description information Sd3. The candidate classifications are candidate classification C1, candidate classification C2 and candidate classification C3.
Accordingly, the computer equipment determines the third degree of correlation between semantic description information Sd1 and candidate classification C1 according to semantic description information Sd1 and the category name of candidate classification C1, and then determines the first degree of correlation between keyword Kw1 and candidate classification C1 according to that third degree of correlation.
The computer equipment determines the third degree of correlation between semantic description information Sd1 and candidate classification C2 according to semantic description information Sd1 and the category name of candidate classification C2, and then determines the first degree of correlation between keyword Kw1 and candidate classification C2 according to that third degree of correlation.
The computer equipment determines the third degree of correlation between semantic description information Sd1 and candidate classification C3 according to semantic description information Sd1 and the category name of candidate classification C3, and then determines the first degree of correlation between keyword Kw1 and candidate classification C3 according to that third degree of correlation.
By analogy, the computer equipment determines the first degrees of correlation of keyword Kw2 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively, and the first degrees of correlation of keyword Kw3 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively.
In one embodiment, the third degree of correlation between the semantic description information corresponding to a keyword and a candidate classification is used directly as the first degree of correlation between the keyword and the candidate classification. For example, the third degree of correlation between semantic description information Sd1 and candidate classification C1 is the first degree of correlation between keyword Kw1 and candidate classification C1.
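The keyword-by-classification computation described above can be sketched as a nested loop. This is only an illustration under stated assumptions: the function name `first_correlations`, the dictionary layout and the toy scorer `toy_third_correlation` are hypothetical and are not part of the described method.

```python
from typing import Callable, Dict


def first_correlations(
    descriptions: Dict[str, str],
    category_names: Dict[str, str],
    third_correlation: Callable[[str, str], float],
) -> Dict[str, Dict[str, float]]:
    """Return first[keyword][classification].

    In this embodiment the first degree of correlation is taken to be the
    third degree of correlation between the keyword's semantic description
    information and the category name of the candidate classification.
    """
    return {
        keyword: {
            classification: third_correlation(description, name)
            for classification, name in category_names.items()
        }
        for keyword, description in descriptions.items()
    }


# Hypothetical stand-in scorer: 1.0 if the category name occurs in the
# description, else 0.0. A real scorer would follow the steps in the text.
def toy_third_correlation(description: str, name: str) -> float:
    return 1.0 if name in description else 0.0


# Made-up descriptions and category names for Kw1..Kw3 and C1..C3.
descriptions = {"Kw1": "a moba game", "Kw2": "a news site", "Kw3": "a chess game"}
category_names = {"C1": "moba", "C2": "news", "C3": "chess"}
first = first_correlations(descriptions, category_names, toy_third_correlation)
```

Passing the scorer in as a parameter keeps the loop independent of how the third degree of correlation is actually computed.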
In one embodiment, the step of determining the third degree of correlation of each piece of semantic description information with each candidate classification, according to each piece of semantic description information and the category name of each candidate classification, may include the following steps: determining, according to each piece of semantic description information and the category name of each candidate classification, the shared words of each piece of semantic description information and the category name of each candidate classification; determining, from those shared words, the target shared words of each piece of semantic description information and the category name of each candidate classification; determining the third ratio of the total word length of the target shared words of each piece of semantic description information and the category name of each candidate classification to the total word length of the category name of that candidate classification; and determining the third degree of correlation of each piece of semantic description information with the category name of each candidate classification according to each third ratio, the first word frequency of each target shared word in the corresponding semantic description information, and the first inverse document frequency of each target shared word.
The shared word of semantic description information and the category name of candidate classification, is the semantic description information and candidate's classification
Category name in the participle that jointly comprises.For example, semantic description information is that " " king's honor " is by Tencent's development of games and to transport
Moba class mobile phone games of the capable a operation on android, ios platform ", the category name of candidate classification is " mobile phone
App- game-moba ", the shared word of the two is " mobile phone ", " hand ", " machine ", " game ", " trip ", " play ", " moba ", " m ",
" o ", " b ", " a ", " mo ", " ob ", " ba " etc..
In the present embodiment, for each semantic description information, computer equipment determines the semantic description information respectively
With the shared word of the category name of each candidate classification.For example, sharing 3 semantic description information: semantic description information Sd1,
Semantic description information Sd2 and semantic description information Sd3 shares 3 candidate classifications: candidate classification C1, candidate classification C2 and candidate
Classification C3, it is determined that the shared word of semantic description information Sd1 and candidate classification C1, semantic description information Sd1 and candidate classification C2
Shared word and the shared word of semantic description information Sd1 and candidate classification C3 similarly determine semantic description information Sd2 points
Not with the shared word and semantic description information Sd3 of candidate classification C1, candidate classification C2 and candidate classification C3 respectively with candidate
The shared word of classification C1, candidate classification C2 and candidate classification C3.
A target shared word of semantic description information and the category name of a candidate classification is a shared word that is not contained in any other shared word of that semantic description information and category name. Similar to the definition of the seventh participle above, if shared word D includes the entire content of shared word C, shared word C is contained in shared word D; if shared word D includes only part of the content of shared word C rather than its entire content, shared word C is not contained in shared word D (shared word C and shared word D are any two different shared words, and the labels "C" and "D" serve only to tell them apart).
For example, suppose the shared words of the semantic description information and the category name of the candidate classification are "mobile phone" (手机), "hand" (手) and "machine" (机). Among these 3 shared words, "mobile phone" includes the entire content of "hand", so "hand" is contained in "mobile phone" and is not a target shared word. "Mobile phone" also includes the entire content of "machine", so "machine" is likewise contained in "mobile phone" and is not a target shared word. Only "mobile phone" is not contained in any shared word other than itself (it is contained neither in "hand" nor in "machine"), so the target shared word determined from these 3 shared words is "mobile phone".
In the present embodiment, for each piece of semantic description information, the computer equipment determines, from the shared words of that semantic description information and the category name of each candidate classification, the target shared words of the semantic description information and the category name of each candidate classification.
Continuing the preceding example, the computer equipment determines the target shared words of semantic description information Sd1 and candidate classification C1 from the shared words of Sd1 and C1, determines the target shared words of Sd1 and candidate classification C2 from the shared words of Sd1 and C2, and determines the target shared words of Sd1 and candidate classification C3 from the shared words of Sd1 and C3.
From the shared words of semantic description information Sd2 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively, the computer equipment determines the target shared words of Sd2 with C1, C2 and C3 respectively.
From the shared words of semantic description information Sd3 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively, the computer equipment determines the target shared words of Sd3 with C1, C2 and C3 respectively.
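The selection of target shared words can be sketched as a containment filter over the set of shared words. The function name `target_shared_words` is hypothetical, and containment is implemented as a plain substring test, which matches the "hand" and "mobile phone" example but is otherwise an assumption.

```python
def target_shared_words(shared: set) -> set:
    """Keep only the shared words that are not contained in any other
    shared word of the same set."""
    return {
        word
        for word in shared
        if not any(word != other and word in other for other in shared)
    }


# "mobile phone" (手机), "hand" (手) and "machine" (机) from the example.
shared = {"手机", "手", "机"}
targets = target_shared_words(shared)
```

Here `targets` holds only 手机, since 手 and 机 are both substrings of it.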
Third ratio is that total word length of the shared word of target of the category name of semantic description information and candidate classification accounts for the time
The ratio for selecting total word of the category name of classification long.For example, the target designation of candidate classification is " mobile phone app- game-moba ",
Assuming that it is " mobile phone ", " game " and " moba " that the target of semantic description information and the category name of candidate classification, which shares word, then should
Total word length of the shared word of the target of the category name of semantic description information and the candidate classification is that 8 (" mobile phone " is 2, and " game " is
2, " moba " is 4, is added up to 8), and total word length of the category name of candidate's classification is 11 (due to being long, 3 "-" that calculate total word
In connector is not counted in, the total length of " mobile phone app game moba " is that 11), therefore third ratio is
In the present embodiment, for each piece of semantic description information, the computer equipment determines the third ratio of the total word length of the target shared words of that semantic description information and the category name of each candidate classification to the total word length of the category name of that candidate classification.
Continuing the preceding example, the computer equipment determines the third ratio of the total word length of the target shared words of semantic description information Sd1 and candidate classification C1 to the total word length of the category name of candidate classification C1, the third ratio of the total word length of the target shared words of Sd1 and candidate classification C2 to the total word length of the category name of candidate classification C2, and the third ratio of the total word length of the target shared words of Sd1 and candidate classification C3 to the total word length of the category name of candidate classification C3.
The computer equipment determines the third ratios of the total word length of the target shared words of semantic description information Sd2 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively to the total word length of the category name of candidate classification C1, candidate classification C2 and candidate classification C3 respectively.
The computer equipment determines the third ratios of the total word length of the target shared words of semantic description information Sd3 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively to the total word length of the category name of candidate classification C1, candidate classification C2 and candidate classification C3 respectively.
The first word frequency, in the semantic description information, of a target shared word of the semantic description information and the category name of a candidate classification is the number of times the target shared word occurs in the semantic description information. For example, if the semantic description information is "Honor of Kings (《王者荣耀》) is a moba mobile game developed and operated by Tencent Games that runs on the android and ios platforms" and the target shared words of the semantic description information and the category name of the candidate classification are "mobile phone", "game" and "moba", then the first word frequency of each of these 3 target shared words in the semantic description information is 1.
Similar to the definition of the inverse document frequency of a keyword above, the first inverse document frequency of a target shared word of the semantic description information and the category name of the candidate classification may be determined from the number of all objects in the target corpus and the number of those objects that contain the target shared word.
In one embodiment, the target corpus may be the corpus of a network search service. Accordingly, the number of objects in the target corpus that contain the target shared word may be the total number of search results obtained by calling the network search service to search for the target shared word. The number of all objects in the target corpus may be set to a predetermined value, for example 1+100000000.
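The exact expression for the inverse document frequency is not reproduced in this text, so the sketch below assumes a common smoothed form, log(total / (1 + count)). It is not the document's own formula. The name `inverse_document_frequency` is hypothetical, and obtaining the search result count from a network search service is outside the sketch.

```python
import math

TOTAL_OBJECTS = 1 + 100_000_000  # predetermined corpus size given in the text


def inverse_document_frequency(result_count: int, total: int = TOTAL_OBJECTS) -> float:
    """Assumed smoothed IDF: log(total / (1 + result_count)).

    result_count stands for the number of search results reported for the
    word by the search service.
    """
    if result_count < 0:
        raise ValueError("result_count must be non-negative")
    return math.log(total / (1 + result_count))


rare = inverse_document_frequency(1_000)         # few results, larger value
common = inverse_document_frequency(50_000_000)  # many results, smaller value
```

Under this assumed form, a word that matches few objects receives a larger value than a word that matches many.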
In the present embodiment, for each piece of semantic description information, the computer equipment determines the third degree of correlation of that semantic description information with the category name of each candidate classification according to: the third ratio of the total word length of the target shared words of the semantic description information and the category name of each candidate classification to the total word length of the category name of that candidate classification; the first word frequency, in the semantic description information, of each target shared word of the semantic description information and the category name of each candidate classification; and the first inverse document frequency of each of those target shared words.
In one embodiment, for any piece of semantic description information and any candidate classification, the third degree of correlation between the semantic description information and the candidate classification is calculated from the following quantities: N, the total number of target shared words of the semantic description information and the category name of the candidate classification, where N is an integer greater than or equal to 1; Rb, the third ratio of the total word length of the N target shared words to the total word length of the category name of the candidate classification; TF1_i, the first word frequency of the i-th of the N target shared words in the semantic description information; and IDF1_i, the first inverse document frequency of the i-th of the N target shared words.
In another embodiment, for any piece of semantic description information and any candidate classification, the third degree of correlation between the semantic description information and the candidate classification may instead be calculated with the additional quantity L_i, the word length of the i-th of the N target shared words.
It should be noted that when the calculated third degree of correlation between the semantic description information and the candidate classification is greater than 1, it may be set to 1.
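Because the formula itself is not reproduced in this text, the sketch below assumes one plausible combination of the named quantities: Rb multiplied by the sum of TF1_i × IDF1_i over the N target shared words, capped at 1 as noted above. The combination, the function name `third_correlation` and the sample numbers are all assumptions for illustration.

```python
def third_correlation(ratio: float, term_stats) -> float:
    """Assumed combination: ratio * sum(tf * idf), capped at 1.

    ratio is the third ratio Rb; term_stats is an iterable of
    (first_word_frequency, first_inverse_document_frequency) pairs,
    one pair per target shared word.
    """
    score = ratio * sum(tf * idf for tf, idf in term_stats)
    return min(score, 1.0)


# Illustrative numbers only: three target shared words, each occurring once.
score = third_correlation(8 / 11, [(1, 0.2), (1, 0.1), (1, 0.3)])
# A raw score above 1 is set to 1.
capped = third_correlation(8 / 11, [(1, 5.0), (1, 7.5)])
```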
In one embodiment, after the shared words of the semantic description information and the category name are determined, the shared words may be stored as a set, forming a shared word set; after the target shared words of the semantic description information and the category name are determined, the target shared words may likewise be stored as a set, forming a target shared word set.
In one embodiment, for semantic description information and category names that contain uppercase English characters, the uppercase English characters may be converted to lowercase before the shared words of the semantic description information and the category name are determined, so as to unify the data format.
In one embodiment, the method for determining the text classification may further include the following step: determining the fourth degree of correlation of each piece of semantic description information with each candidate classification according to each piece of semantic description information, the predetermined associated words of each candidate classification, and the predetermined correlation coefficients between those predetermined associated words and the corresponding candidate classifications. Accordingly, the step of determining the first degree of correlation of each keyword with each candidate classification according to each third degree of correlation may include the following step: determining the first degree of correlation of each keyword with each candidate classification according to each third degree of correlation and each fourth degree of correlation.
A predetermined associated word of a candidate classification is a manually determined word that has a correlation with the candidate classification. The correlation coefficient between a predetermined associated word of a candidate classification and the candidate classification characterizes how the predetermined associated word and the candidate classification are correlated. Both the predetermined associated words of a candidate classification and their correlation coefficients with the candidate classification may be determined manually in advance, based on experience accumulated in actual business.
The value range of the correlation coefficient is [-1, +1]. When the correlation coefficient between a predetermined associated word of a candidate classification and the candidate classification is positive, the predetermined associated word is positively correlated with the candidate classification; the larger the correlation coefficient, the higher the degree of positive correlation, and the smaller the correlation coefficient, the lower the degree of positive correlation. When the correlation coefficient is negative, the predetermined associated word is negatively correlated with the candidate classification; the larger the correlation coefficient (the closer to 0), the lower the degree of negative correlation, and the smaller the correlation coefficient, the higher the degree of negative correlation.
The fourth degree of correlation between semantic description information and a candidate classification is determined according to the semantic description information, the predetermined associated words of the candidate classification, and the correlation coefficients between those predetermined associated words and the candidate classification, and serves as a metric of the matching degree between the semantic description information and the candidate classification. The value range of the fourth degree of correlation may be [0, +1]. The larger the fourth degree of correlation, the higher the matching degree between the semantic description information and the candidate classification, judged by the semantic description information, the predetermined associated words of the candidate classification and their correlation coefficients with the candidate classification; conversely, the smaller the fourth degree of correlation, the lower that matching degree.
As described above, for each keyword, the computer equipment may determine the first degree of correlation of the keyword with each candidate classification according to the semantic description information corresponding to the keyword and the classification description information of each candidate classification. In the present embodiment, the classification description information of a candidate classification may include the predetermined associated words of the candidate classification and the predetermined correlation coefficients between those predetermined associated words and the candidate classification. For each keyword, the computer equipment determines the fourth degree of correlation between the semantic description information corresponding to the keyword and each candidate classification according to that semantic description information, the predetermined associated words of each candidate classification, and the correlation coefficients between those predetermined associated words and their corresponding candidate classifications. It then determines the first degree of correlation of the keyword with each candidate classification according to the third degree of correlation and the fourth degree of correlation between the semantic description information corresponding to the keyword and each candidate classification.
For example, the keywords of the text to be processed LT1 are keyword Kw1, keyword Kw2 and keyword Kw3. Keyword Kw1 corresponds to semantic description information Sd1, keyword Kw2 corresponds to semantic description information Sd2, and keyword Kw3 corresponds to semantic description information Sd3. The candidate classifications are candidate classification C1, candidate classification C2 and candidate classification C3.
Accordingly, as shown in Figure 9, the computer equipment determines the third degree of correlation between semantic description information Sd1 and candidate classification C1 according to semantic description information Sd1 and the category name of candidate classification C1; determines the fourth degree of correlation between Sd1 and C1 according to Sd1, the predetermined associated words of candidate classification C1, and the correlation coefficients between those predetermined associated words and candidate classification C1; and then determines the first degree of correlation between keyword Kw1 and candidate classification C1 according to both the third degree of correlation and the fourth degree of correlation between Sd1 and C1.
The computer equipment determines the third degree of correlation between semantic description information Sd1 and candidate classification C2 according to semantic description information Sd1 and the category name of candidate classification C2; determines the fourth degree of correlation between Sd1 and C2 according to Sd1, the predetermined associated words of candidate classification C2, and the correlation coefficients between those predetermined associated words and candidate classification C2; and then determines the first degree of correlation between keyword Kw1 and candidate classification C2 according to both the third degree of correlation and the fourth degree of correlation between Sd1 and C2.
The computer equipment determines the third degree of correlation between semantic description information Sd1 and candidate classification C3 according to semantic description information Sd1 and the category name of candidate classification C3; determines the fourth degree of correlation between Sd1 and C3 according to Sd1, the predetermined associated words of candidate classification C3, and the correlation coefficients between those predetermined associated words and candidate classification C3; and then determines the first degree of correlation between keyword Kw1 and candidate classification C3 according to both the third degree of correlation and the fourth degree of correlation between Sd1 and C3.
By analogy, the computer equipment determines the first degrees of correlation of keyword Kw2 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively, and the first degrees of correlation of keyword Kw3 with candidate classification C1, candidate classification C2 and candidate classification C3 respectively.
Specifically, for any keyword and any candidate classification, the third degree of correlation and the fourth degree of correlation between the semantic description information corresponding to the keyword and the candidate classification may be summed to obtain the first degree of correlation between the keyword and the candidate classification. For example, summing the third degree of correlation and the fourth degree of correlation between semantic description information Sd1 and candidate classification C1 gives the first degree of correlation between keyword Kw1 and candidate classification C1.
Alternatively, weights may be set for the third degree of correlation and the fourth degree of correlation respectively. The third degree of correlation between the semantic description information corresponding to the keyword and the candidate classification, the weight of the third degree of correlation, the fourth degree of correlation between that semantic description information and the candidate classification, and the weight of the fourth degree of correlation are then combined in a weighted sum to obtain the first degree of correlation between the keyword and the candidate classification. For example, a weighted sum of the third degree of correlation between semantic description information Sd1 and candidate classification C1 (with its weight) and the fourth degree of correlation between Sd1 and C1 (with its weight) gives the first degree of correlation between keyword Kw1 and candidate classification C1.
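Both combinations, the plain sum and the weighted sum, can be expressed with one small function. The name `first_correlation`, the default weights and the sample values are illustrative assumptions; the text does not specify concrete weights.

```python
def first_correlation(
    third: float,
    fourth: float,
    third_weight: float = 1.0,
    fourth_weight: float = 1.0,
) -> float:
    """Weighted sum of the third and fourth degrees of correlation.

    With both weights left at 1.0 this reduces to the plain sum.
    """
    return third_weight * third + fourth_weight * fourth


# Made-up degrees of correlation for one keyword and one classification.
plain = first_correlation(0.6, 0.3)
weighted = first_correlation(0.6, 0.3, third_weight=0.7, fourth_weight=0.3)
```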
In a specific implementation, an association knowledge base may be constructed in advance, containing a number of manually determined pieces of association knowledge. In this case, the computer equipment may obtain the predetermined associated words of each candidate classification and their predetermined correlation coefficients with the corresponding candidate classifications from the association knowledge base, and then determine the fourth degree of correlation of each piece of semantic description information with each candidate classification according to each piece of semantic description information, the predetermined associated words of each candidate classification and those predetermined correlation coefficients.
In one embodiment, the data format of a piece of association knowledge may be "classification identifier of the candidate classification (such as a classification ID), category name of the candidate classification, predetermined associated word of the candidate classification, correlation coefficient between the predetermined associated word and the candidate classification".
Three example pieces of association knowledge are given below:
1, mobile phone app-game-moba, Honor of Kings, 0.8
1, mobile phone app-game-moba, ranking, 0.2
1, mobile phone app-game-moba, League of Legends, -0.9
Here, "Honor of Kings" (王者荣耀) is an instance of "mobile phone app-game-moba", so its degree of positive correlation with this candidate classification is very high, and the correlation coefficient of the two may be manually set to 0.8. "Ranking" is only weakly related to "mobile phone app-game-moba", so the correlation coefficient of the two may be manually set to 0.2. "League of Legends" (英雄联盟), although related to "game-moba", is not a "mobile phone app". It is therefore a potentially confusable word for the candidate classification "mobile phone app-game-moba", its degree of negative correlation with the candidate classification is very high, and the correlation coefficient of the two may be manually set to -0.9.
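The four-field record format can be parsed as sketched below. This sketch assumes that fields are separated by commas and that no field itself contains a comma; the class `AssociationKnowledge` and the function `parse_association_knowledge` are hypothetical names, and the game titles are given by their official English names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationKnowledge:
    classification_id: str
    category_name: str
    associated_word: str
    coefficient: float  # correlation coefficient, expected in [-1, +1]


def parse_association_knowledge(line: str) -> AssociationKnowledge:
    """Parse 'id, category name, associated word, coefficient'."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}")
    classification_id, category_name, associated_word, raw = fields
    coefficient = float(raw)
    if not -1.0 <= coefficient <= 1.0:
        raise ValueError("correlation coefficient must lie in [-1, +1]")
    return AssociationKnowledge(
        classification_id, category_name, associated_word, coefficient
    )


records = [
    parse_association_knowledge(line)
    for line in (
        "1, mobile phone app-game-moba, Honor of Kings, 0.8",
        "1, mobile phone app-game-moba, ranking, 0.2",
        "1, mobile phone app-game-moba, League of Legends, -0.9",
    )
]
```

Rejecting coefficients outside [-1, +1] at parse time keeps manually entered records within the value range stated for the correlation coefficient.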
In other embodiments, the first degree of correlation of each keyword with each candidate classification may also be determined solely according to the fourth degree of correlation of each piece of semantic description information with each candidate classification, without considering the third degree of correlation of each piece of semantic description information with each candidate classification.
It should be noted that taking the manually determined predetermined associated words and their predetermined correlation coefficients with the corresponding candidate classifications as factors in determining the first degree of correlation between a keyword and a candidate classification allows manual intervention in the automatic labeling process. Relevant business personnel can thus improve the quality of the classification labeling results based on experience accumulated in actual business scenarios, achieving industrial-grade human controllability and easy manual optimization.
In addition, in a practical application scenario, the automatic labeling system described above may also provide an association knowledge entry service. The association knowledge entry interface is shown in Figure 10: the user may click control 1001, enter the association knowledge the user has determined in association knowledge input box 1002, and then click control 1003 to complete manual entry of that association knowledge. The user may also click control 1004 to modify or delete association knowledge that has been entered.
In one embodiment, the step of determining the fourth degree of correlation of each piece of semantic description information with each candidate classification, according to each piece of semantic description information, the predetermined associated words of each candidate classification, and the predetermined correlation coefficients between those predetermined associated words and the corresponding candidate classifications, may include the following step: determining the fourth degree of correlation of each piece of semantic description information with each candidate classification according to the second word frequency of each predetermined associated word of each candidate classification in each piece of semantic description information, the second inverse document frequency of each predetermined associated word of each candidate classification, and the predetermined correlation coefficient between each predetermined associated word and its corresponding candidate classification.
The second word frequency of a predetermined classification conjunctive word of a candidate classification in a semantic description information is the number of times that word occurs in the semantic description information. For example, if the classification conjunctive word is "Honor of Kings" and the semantic description information is ""Honor of Kings" is a MOBA mobile game developed and operated by Tencent Games that runs on the Android and iOS platforms", the second word frequency of that predetermined classification conjunctive word in that semantic description information is 1.
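Similar to the inverse document frequency of a keyword described above, the second inverse document frequency of a predetermined classification conjunctive word of a candidate classification may be determined from the total number of objects in a target corpus and the number of those objects that contain the predetermined classification conjunctive word.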
In one embodiment, the target corpus may be the corpus corresponding to a network search service. Accordingly, the number of objects in the target corpus that contain the predetermined classification conjunctive word may be the total number of search results obtained by searching for the predetermined classification conjunctive word through the network search service, and the total number of objects in the target corpus may be set to a predetermined value, for example 1+100000000.
In the present embodiment, for each semantic description information, the computer equipment determines the fourth degree of correlation between that semantic description information and each candidate classification according to the second word frequency of each predetermined classification conjunctive word of each candidate classification in that semantic description information, the second inverse document frequency of each predetermined classification conjunctive word of each candidate classification, and the predetermined correlation coefficient between each predetermined classification conjunctive word and its corresponding candidate classification.
In one embodiment, for any semantic description information and any candidate classification, the fourth degree of correlation between the semantic description information and the candidate classification may be expressed in terms of the following quantities: M denotes the total number of predetermined classification conjunctive words of the candidate classification, where M is an integer equal to or greater than 1; TF2j denotes the second word frequency, in the semantic description information, of the j-th of the M classification conjunctive words; IDF2j denotes the second inverse document frequency of the j-th of the M classification conjunctive words; and Coej denotes the correlation coefficient between the j-th of the M classification conjunctive words and the candidate classification.
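Assuming the three per-word quantities defined above are multiplied together and summed over the M classification conjunctive words, the fourth degree of correlation could be written as follows. The symbol R_4 is introduced here only for illustration, and the exact form of the expression is an assumption based on the symbol definitions, not a confirmed formula of the application.

```latex
R_4 = \sum_{j=1}^{M} TF2_j \cdot IDF2_j \cdot Coe_j
```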
In another embodiment, for any semantic description information and any candidate classification, the fourth degree of correlation between the semantic description information and the candidate classification may additionally depend on the word length of the j-th of the M classification conjunctive words.
It should be noted that when the calculated value of the fourth degree of correlation between a semantic description information and a candidate classification is greater than 1, it may be set to 1.
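The computation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the smoothed logarithmic form of the inverse document frequency, the product-and-sum combination, and the names `second_idf` and `fourth_relevance` are all hypothetical and are not taken from the application; only the inputs (second word frequency, second inverse document frequency, correlation coefficient) and the cap at 1 come from the text.

```python
import math


def second_idf(containing_count, corpus_size=1 + 100_000_000):
    """Second inverse document frequency of one classification conjunctive word.

    ASSUMPTION: the smoothed form log(N / (1 + n)) is used for illustration;
    the text only names N (corpus size) and n (objects containing the word).
    """
    return math.log(corpus_size / (1 + containing_count))


def fourth_relevance(description, conjunctive_words):
    """Fourth degree of correlation between one description and one classification.

    conjunctive_words: list of (word, containing_count, coefficient) tuples,
    where coefficient is the manually determined correlation coefficient.
    ASSUMPTION: the per-word terms are multiplied and summed.
    """
    score = 0.0
    for word, containing_count, coefficient in conjunctive_words:
        tf2 = description.count(word)  # second word frequency: occurrences in the description
        score += tf2 * second_idf(containing_count) * coefficient
    return min(score, 1.0)  # values greater than 1 are set to 1


description = '"Honor of Kings" is a MOBA mobile game developed by Tencent Games'
print(fourth_relevance(description, [("Honor of Kings", 1000, 0.05)]))
```

Because the manually chosen coefficient scales every term, business personnel could tune a single conjunctive word's influence without touching the rest of the pipeline.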
The approach of determining the semantic description information corresponding to each keyword from the web search message corresponding to each keyword of the text to be processed, determining the second degree of correlation between the text to be processed and each candidate classification from at least one of the third degree of correlation (determined according to each semantic description information and the category name of each candidate classification) and the fourth degree of correlation (determined according to each semantic description information, the predetermined classification conjunctive words of each candidate classification, and the predetermined correlation coefficients between those words and the corresponding candidate classifications), and then determining the classification to which the text belongs according to those second degrees of correlation, rests on a basic assumption: when the keywords extracted from a text are searched through a network search service, if the name or the classification conjunctive words of some candidate classification appear frequently in the search results obtained, the text is closely related to that candidate classification. This basic assumption can be analyzed as follows. The purpose of a network search service is to provide the content and descriptions most relevant to the search input. For example, the keyword "Husky" is closely related to the candidate classification "pet-dog"; when "Husky" is searched through a network search service, the two words "pet" and "dog" appear frequently in the search results obtained. The basic assumption therefore holds in practical applications.
In one embodiment, the determination method of text classification may further include the following step: obtaining the priority factor of each candidate classification. Accordingly, the step of determining, according to each second degree of correlation, the classification to which the text to be processed belongs from the candidate classifications, i.e. step S210, may include the following steps: determining the fifth degree of correlation between the text to be processed and each candidate classification according to each second degree of correlation and the priority factor of each candidate classification; and determining the classification to which the text to be processed belongs from the candidate classifications according to each fifth degree of correlation.
The priority factor of a candidate classification may be determined manually, and can be used to characterize the degree of priority that relevant personnel assign to that candidate classification in actual business.
In the present embodiment, for each candidate classification, the computer equipment determines the fifth degree of correlation of the candidate classification according to the second degree of correlation between the text to be processed and the candidate classification and the priority factor of the candidate classification. Specifically, the second degree of correlation between the text to be processed and the candidate classification may be multiplied by the priority factor of the candidate classification to obtain the fifth degree of correlation of the candidate classification.
In addition, when the classification annotation result corresponding to the text to be processed is output, the degree of correlation between the text and the classification to which it belongs, as reported in that annotation result, may be the fifth degree of correlation between the text and that classification.
It should be noted that the priority factor of each candidate classification is used to correct the second degree of correlation between the text to be processed and each candidate classification, yielding the fifth degree of correlation between the text and each candidate classification, and the classification to which the text belongs is then determined from the candidate classifications according to the fifth degrees of correlation. In this way, the candidate classifications that need to be given priority can be set flexibly.
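The priority correction can be illustrated with a short Python sketch. The multiplication of the second degree of correlation by the priority factor follows the text; selecting the candidate classification with the largest fifth degree of correlation, the default factor of 1.0 for classifications without an entry, and the name `pick_classification` are assumptions made for illustration.

```python
def pick_classification(second_relevance, priority_factor, default_priority=1.0):
    """Return (classification, fifth degree of correlation, all fifth degrees).

    second_relevance: dict mapping candidate classification -> second degree of correlation.
    priority_factor:  dict mapping candidate classification -> manually set priority factor.
    ASSUMPTIONS: classifications missing from priority_factor use default_priority,
    and the classification with the largest fifth degree of correlation is chosen.
    """
    fifth = {
        name: relevance * priority_factor.get(name, default_priority)
        for name, relevance in second_relevance.items()
    }
    best = max(fifth, key=fifth.get)
    return best, fifth[best], fifth


best, score, _ = pick_classification(
    {"pet-dog": 0.6, "games": 0.5},
    {"games": 1.5},  # hypothetical priority factor favouring "games"
)
print(best, score)
```

In this example the priority factor lets a classification with a lower second degree of correlation win, which is the kind of flexible prioritisation the paragraph above describes.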
In addition, in a practical application scenario, the automatic marking system described above may also provide a classification priority information entry service. The classification priority information entry interface is shown in Figure 11: the user can click control 1101, input the classification priority information determined by the user (such as the classification ID, category name and priority factor) in the classification priority information input box 1102, and then click control 1103 to complete the manual entry of that classification priority information. The user can also click control 1104 to modify or delete classification priority information that has already been entered.
In one embodiment, as shown in Figure 12, a determination method of text classification is provided. The method may be executed by computer equipment, and may specifically include the following steps S1202 to S1224.
S1202, extract the keywords of the text to be processed, and determine the weight of each keyword.
S1204, look up the candidate keyword corresponding to each keyword in a local information library; the local information library records the matching relationship between candidate keywords and candidate search information, and the candidate search information is obtained by searching for the corresponding candidate keyword through a network search service.
S1206, when a candidate keyword corresponding to a keyword is found, obtain the web search message corresponding to the keyword according to the candidate search information matched by the candidate keyword found.
S1208, when no candidate keyword corresponding to a keyword is found, call the network search service to search for the keyword, and obtain the web search message corresponding to the keyword.
S1210, perform data cleansing on the web search message corresponding to each keyword, and obtain the semantic description information corresponding to each keyword.
S1212, determine the third degree of correlation between each semantic description information and each candidate classification according to each semantic description information and the category name of each candidate classification.
S1214, determine the fourth degree of correlation between each semantic description information and each candidate classification according to each semantic description information, the predetermined classification conjunctive words of each candidate classification, and the predetermined correlation coefficients between those words and the corresponding candidate classifications; the predetermined classification conjunctive words of a candidate classification, and the correlation coefficients between those words and the candidate classification, are determined manually.
S1216, determine the first degree of correlation between each keyword and each candidate classification according to each third degree of correlation and each fourth degree of correlation.
S1218, determine the second degree of correlation between the text to be processed and each candidate classification according to the weight of each keyword and each first degree of correlation.
S1220, obtain the priority factor of each candidate classification, and determine the fifth degree of correlation between the text to be processed and each candidate classification according to each second degree of correlation and the priority factor of each candidate classification.
S1222, determine the classification to which the text to be processed belongs from the candidate classifications according to each fifth degree of correlation.
S1224, output the classification annotation result corresponding to the text to be processed; the classification annotation result includes the text to be processed, the classification to which the text belongs, and the fifth degree of correlation between the text and the classification to which it belongs.
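Steps S1204 to S1208 amount to a lookup in the local information library with a fallback to the network search service. The following Python sketch shows one way this could look. The dictionary used as the local information library, the `search` callable standing in for the network search service, and the function name `get_web_search_message` are hypothetical; writing new results back into the library is an extra assumption that the steps above do not state.

```python
from typing import Callable, Dict, List


def get_web_search_message(
    keyword: str,
    local_library: Dict[str, List[str]],
    search: Callable[[str], List[str]],
    write_back: bool = True,
) -> List[str]:
    """Return the web search message for a keyword (S1204 to S1208).

    local_library maps candidate keywords to their candidate search information.
    search is a placeholder for the network search service.
    """
    if keyword in local_library:        # S1204 / S1206: candidate keyword found
        return local_library[keyword]
    results = search(keyword)           # S1208: call the network search service
    if write_back:                      # ASSUMPTION: cache the result for later lookups
        local_library[keyword] = results
    return results


def fake_search(keyword: str) -> List[str]:
    """Stand-in for a real search service, used only for this example."""
    return [f"result about {keyword}"]


library = {"Husky": ["Husky is a pet dog breed"]}
print(get_web_search_message("Husky", library, fake_search))
print(get_web_search_message("MOBA", library, fake_search))
```

Serving repeated keywords from the local information library avoids calling the network search service more than once per keyword.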
It should be noted that the specific limitations on each technical feature in the present embodiment may be the same as the limitations on the corresponding technical features described above, and are not repeated here.
It should be understood that, under reasonable conditions, although the steps in the flowcharts of the above embodiments are shown in the sequence indicated by the arrows, these steps are not necessarily executed in that sequence. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in each flowchart may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment, but may be executed at different times; nor are they necessarily executed one after another, but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In addition, the performance of an automatic marking system that uses the determination method of text classification provided by the embodiments of the present application is illustrated below. The automatic marking system was applied to headline labeling against a classification system comprising 800 candidate classifications organized into 4 levels. The labeling accuracy is 91.5% for first-level classifications, 83.1% for second-level classifications, 78.4% for third-level classifications, and 74.1% for fourth-level classifications. Here, labeling accuracy (Accuracy) = number of texts correctly labeled with the corresponding candidate classification / total number of texts.
In addition, the automatic marking system does not require any manpower for labeling training samples during the labeling process. With the traditional approach, by contrast, 8,000,000 training samples (800 candidate classifications, with 10,000 samples labeled for each) would have to be labeled manually. This requires a great deal of manpower and material resources, and the quality of such a large number of manually labeled training samples is difficult to guarantee. In other words, the traditional approach is not practically usable for labeling against a complex classification system comprising many candidate classifications.
In one embodiment, as shown in Figure 13, a determining device 1300 of text classification is provided. The device 1300 may include the following modules 1302 to 1310.
Keyword processing module 1302, for extracting the keywords of the text to be processed, and determining the weight of each keyword;
Semantic description data obtaining module 1304, for obtaining the semantic description information corresponding to each keyword;
First degree of correlation determining module 1306, for determining the first degree of correlation between each keyword and each candidate classification according to each semantic description information;
Second degree of correlation determining module 1308, for determining the second degree of correlation between the text to be processed and each candidate classification according to the weight of each keyword and each first degree of correlation;
Text classification determining module 1310, for determining the classification to which the text to be processed belongs from the candidate classifications according to each second degree of correlation.
The above determining device of text classification extracts the keywords of the text to be processed and obtains the weight of each keyword, then obtains the semantic description information corresponding to each keyword, determines the first degree of correlation between each keyword and each candidate classification according to each semantic description information, determines the second degree of correlation between the text to be processed and each candidate classification according to the weight of each keyword and each first degree of correlation, and finally determines the classification to which the text to be processed belongs from the candidate classifications according to each second degree of correlation. In this way, without any text whose classification is already known, the computer equipment can automatically determine the classification of any text throughout the whole process. This removes the step of manually labeling classifications, saves human cost, and eliminates the dependence of the quality of the determined classification on the quality of manual labeling. In addition, the intermediate logic from obtaining the text to be processed to determining its classification can be understood by people, which makes it possible to introduce manually accumulated professional knowledge into the process of determining the classification to which the text belongs.
In one embodiment, the keyword processing module 1302 may include the following units: a first participle acquiring unit, for performing word segmentation on the text to be processed to obtain multiple first participles of the text; a second participle acquiring unit, for rejecting from the first participles those that belong to a goal filtering dictionary, to obtain one or more second participles, the second participles being the first participles remaining after the rejection; and a keyword acquiring unit, for obtaining the keywords of the text to be processed according to the second participles. Here, the goal filtering dictionary includes a filtering dictionary corresponding to a target data source, and the target data source includes the data source to which the text to be processed belongs.
In one embodiment, the determining device 1300 of text classification may further include the following modules: a third participle obtaining module, for performing word segmentation on each text belonging to the target data source to obtain multiple third participles; a first ratio determining module, for determining the first ratio corresponding to each third participle, where the first ratio corresponding to a third participle is the proportion of the number of texts in the target data source that contain the third participle to the total number of texts in the target data source; and a goal filtering dictionary constructing module, for constructing the goal filtering dictionary from the third participles whose first ratio exceeds a first proportion threshold.
In one embodiment, the determining device 1300 of text classification may further include a first proportion threshold determining module, for: determining fourth participles from the third participles according to a current proportion threshold, the fourth participles being the third participles whose first ratio is equal to or greater than the current proportion threshold; determining the remaining word number corresponding to each text belonging to the target data source, where the remaining word number corresponding to a text is the number of third participles of that text that remain after the fourth participles are rejected; determining the second ratio, namely the proportion of the number of texts whose remaining word number is equal to or greater than a word number threshold to the total number of texts belonging to the target data source; when the second ratio is less than a second proportion threshold, determining the current proportion threshold as the first proportion threshold; and when the second ratio exceeds the second proportion threshold, updating the current proportion threshold by a lowering value and returning to the step of determining the fourth participles from the third participles according to the current proportion threshold.
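The iterative search for the first proportion threshold described above can be sketched in Python. This is an illustrative sketch, not the implementation of the application: the function names, the representation of each text as a list of already segmented words, counting the remaining words per occurrence, treating a second ratio equal to the second proportion threshold as "keep lowering", and stopping when the threshold reaches zero are all assumptions. A non-empty list of texts is assumed.

```python
from collections import Counter


def first_ratios(texts):
    """First ratio of each word: share of texts in the data source containing it."""
    doc_freq = Counter()
    for words in texts:
        doc_freq.update(set(words))  # count each word once per text
    return {word: count / len(texts) for word, count in doc_freq.items()}


def find_first_threshold(texts, start, step, word_count_threshold, second_ratio_threshold):
    """Lower the current proportion threshold until the second ratio drops below its threshold.

    Returns the first proportion threshold, or None if none is found above zero
    (the stopping rule at zero is an assumption).
    """
    ratios = first_ratios(texts)
    current = start
    while current > 0:
        # Fourth participles: words whose first ratio is >= the current threshold.
        fourth = {word for word, ratio in ratios.items() if ratio >= current}
        # Texts that still have at least word_count_threshold words after rejection.
        long_texts = sum(
            1
            for words in texts
            if sum(1 for w in words if w not in fourth) >= word_count_threshold
        )
        second_ratio = long_texts / len(texts)
        if second_ratio < second_ratio_threshold:
            return current
        current -= step  # lower the threshold and repeat
    return None


texts = [["the", "cat"], ["the", "dog"], ["the", "bird"], ["the", "cat"]]
print(find_first_threshold(texts, start=1.0, step=0.25,
                           word_count_threshold=1, second_ratio_threshold=0.5))
```

Lowering the threshold enlarges the set of rejected words, so the second ratio can only stay the same or decrease from one iteration to the next.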
In one embodiment, the keyword acquiring unit may include the following subunits: a fifth participle obtaining subunit, for performing permutation and combination on the second participles to obtain fifth participles, each fifth participle comprising at least two consecutive adjacent second participles; a sixth participle obtaining subunit, for determining sixth participles from the fifth participles, the sixth participles being the fifth participles that belong to existing entries; a seventh participle obtaining subunit, for determining seventh participles from the sixth participles, a seventh participle being a sixth participle that is not contained in any other sixth participle; and a keyword obtaining subunit, for obtaining the keywords of the text to be processed according to the seventh participles.
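The three filtering stages performed by these subunits can be sketched in Python as follows. The sketch assumes that "contained in" means substring containment, that the set of existing entries is available as a plain Python set, and that adjacent participles are concatenated with a configurable joiner (empty by default, a space in the English example); the function name `candidate_keywords` is hypothetical.

```python
def candidate_keywords(second_participles, known_entries, joiner=""):
    """Return the seventh participles derived from a list of second participles.

    known_entries: set of existing entries (for example, titles from an encyclopedia);
    how such a set is obtained is outside this sketch.
    """
    n = len(second_participles)
    # Fifth participles: every run of at least two consecutive adjacent second participles.
    fifth = {
        joiner.join(second_participles[i:j])
        for i in range(n)
        for j in range(i + 2, n + 1)
    }
    # Sixth participles: fifth participles that belong to existing entries.
    sixth = {word for word in fifth if word in known_entries}
    # Seventh participles: sixth participles not contained in any other sixth participle.
    seventh = {
        word
        for word in sixth
        if not any(word != other and word in other for other in sixth)
    }
    return sorted(seventh)


entries = {"machine learning", "machine learning model", "learning model"}
print(candidate_keywords(["machine", "learning", "model"], entries, joiner=" "))
```

Keeping only the longest matching entries avoids reporting both a phrase and its fragments as separate keywords.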
In one embodiment, the semantic description data obtaining module 1304 may include the following units: a web search message acquiring unit, for obtaining the web search message corresponding to each keyword, where the web search message corresponding to a keyword is obtained by searching for the keyword through a network search service; and a semantic description information acquiring unit, for obtaining the semantic description information corresponding to each keyword according to the web search message corresponding to that keyword.
In one embodiment, the web search message acquiring unit may include the following subunits: a candidate keyword searching subunit, for looking up the candidate keyword corresponding to each keyword in a local information library, where the local information library records the matching relationship between candidate keywords and candidate search information, and the candidate search information is obtained by searching for the corresponding candidate keyword through the network search service; a web search message reading subunit, for obtaining, when a candidate keyword corresponding to a keyword is found, the web search message corresponding to the keyword according to the candidate search information matched by the candidate keyword found; and a web search message searching subunit, for calling, when no candidate keyword corresponding to a keyword is found, the network search service to search for the keyword and obtain the web search message corresponding to the keyword.
In one embodiment, the determining device 1300 of text classification may further include the following module: a priority factor obtaining module, for obtaining the priority factor of each candidate classification. Accordingly, the text classification determining module 1310 may include the following units: a fifth degree of correlation determination unit, for determining the fifth degree of correlation between the text to be processed and each candidate classification according to each second degree of correlation and the priority factor of each candidate classification; and a text classification determination unit, for determining the classification to which the text to be processed belongs from the candidate classifications according to each fifth degree of correlation.
In one embodiment, the first degree of correlation determining module 1306 may include the following units: a third degree of correlation determination unit, for determining the third degree of correlation between each semantic description information and each candidate classification according to each semantic description information and the category name of each candidate classification; and a first degree of correlation determination unit, for determining the first degree of correlation between each keyword and each candidate classification according to each third degree of correlation.
In one embodiment, the third degree of correlation determination unit may include the following subunits: a shared word determination subunit, for determining the shared words between each semantic description information and the category name of each candidate classification according to each semantic description information and the category name of each candidate classification; a target shared word determination subunit, for determining, from those shared words, the target shared words between each semantic description information and the category name of each candidate classification; a third ratio determination subunit, for determining the third ratio, namely the proportion of the total word length of the target shared words between each semantic description information and the category name of each candidate classification to the total word length of that category name; and a third degree of correlation determination subunit, for determining the third degree of correlation between each semantic description information and the category name of each candidate classification according to each third ratio, the first word frequency of each target shared word in the corresponding semantic description information, and the first inverse document frequency of each target shared word. Here, a target shared word between a semantic description information and the category name of a candidate classification is a shared word that is not contained in any other shared word between that semantic description information and that category name.
In one embodiment, the determining device 1300 of text classification may further include the following module: a fourth degree of correlation determining module, for determining the fourth degree of correlation between each semantic description information and each candidate classification according to each semantic description information, the predetermined classification conjunctive words of each candidate classification, and the predetermined correlation coefficients between those words and the corresponding candidate classifications. Accordingly, the first degree of correlation determination unit is used for determining the first degree of correlation between each keyword and each candidate classification according to each third degree of correlation and each fourth degree of correlation. Here, the predetermined classification conjunctive words of a candidate classification, and the correlation coefficients between those words and the candidate classification, are determined manually.
In one embodiment, the fourth degree of correlation determining module is used for determining the fourth degree of correlation between each semantic description information and each candidate classification according to the second word frequency of each predetermined classification conjunctive word of each candidate classification in each semantic description information, the second inverse document frequency of each predetermined classification conjunctive word of each candidate classification, and the predetermined correlation coefficient between each predetermined classification conjunctive word and its corresponding candidate classification.
It should be noted that, for the specific limitations on the determining device 1300 of text classification, reference may be made to the limitations on the determination method of text classification above, which are not repeated here. Each module in the above determining device 1300 of text classification may be implemented wholly or partly by software, hardware, or a combination thereof. Each module may be embedded in, or independent of, a processor in the computer equipment in hardware form, or may be stored in a memory in the computer equipment in software form, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer equipment is provided, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to execute the steps of the determination method of text classification provided by any embodiment of the present application.
Specifically, the computer equipment may be the server 120 in Fig. 1. As shown in Figure 14, the computer equipment includes a processor, a memory, and a network interface connected by a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface is used to communicate with an external terminal through a network connection. When executed by the processor, the computer program implements the determination method of text classification provided by any embodiment of the present application.
Alternatively, the computer equipment may be the terminal 110 in Fig. 1. As shown in Figure 15, the computer equipment includes a processor, a memory, a network interface, an input unit, and a display screen connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer equipment stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the determination method of text classification provided by any embodiment of the present application. A computer program may also be stored in the internal memory and, when executed by the processor, causes the processor to execute the determination method of text classification provided by any embodiment of the present application. The display screen of the computer equipment may be a liquid crystal display or an electronic ink display. The input unit of the computer equipment may be a touch layer covering the display screen, a key, trackball or trackpad arranged on the housing of the computer equipment, or an external keyboard, trackpad or mouse.
Those skilled in the art will understand that the structures shown in Figure 14 and Figure 15 are only block diagrams of the parts of the structure related to the scheme of the present application, and do not limit the computer equipment to which the scheme is applied. A specific computer equipment may include more or fewer components than shown in the figures, may combine certain components, or may have a different arrangement of components.
In one embodiment, the determining device 1300 of text classification provided by the embodiments of the present application may be implemented in the form of a computer program, and the computer program can run on computer equipment as shown in Figure 14 or Figure 15. The memory of the computer equipment can store the program modules that make up the determining device 1300 of text classification, for example the keyword processing module 1302, the semantic description data obtaining module 1304 and the first degree of correlation determining module 1306 shown in Figure 13. The computer program formed by these program modules causes the processor to execute the steps of the determination method of text classification of the embodiments of the present application described in this specification. For example, the computer equipment shown in Figure 14 or Figure 15 can execute step S202 through the keyword processing module 1302 in the determining device 1300 of text classification shown in Figure 13, execute step S204 through the semantic description data obtaining module 1304, and so on.
Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be completed by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Accordingly, in one embodiment, a computer-readable storage medium is provided. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of the text category determination method provided by any embodiment of this application.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (15)
1. A method for determining a text category, comprising:
extracting keywords of a text to be processed, and determining a weight of each keyword;
obtaining semantic description information corresponding to each keyword;
determining, according to each piece of semantic description information, a first degree of correlation between each keyword and each candidate category;
determining, according to the weight of each keyword and each first degree of correlation, a second degree of correlation between the text to be processed and each candidate category;
determining, according to each second degree of correlation, the category to which the text to be processed belongs from among the candidate categories.
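The last two steps of claim 1 can be illustrated with a minimal Python sketch. It assumes that the second degree of correlation is a weight-weighted sum of the first degrees of correlation and that the highest-scoring candidate is selected; the claim fixes neither choice, and the names `second_correlations` and `pick_category` are hypothetical.

```python
from typing import Dict, Mapping


def second_correlations(
    keyword_weights: Mapping[str, float],
    first_corr: Mapping[str, Mapping[str, float]],
) -> Dict[str, float]:
    """Aggregate keyword-level scores into text-level scores per category.

    Assumption: a weighted sum is used; the claim does not state the formula.
    """
    totals: Dict[str, float] = {}
    for keyword, weight in keyword_weights.items():
        for category, score in first_corr.get(keyword, {}).items():
            totals[category] = totals.get(category, 0.0) + weight * score
    return totals


def pick_category(second_corr: Mapping[str, float]) -> str:
    """Return the candidate category with the highest text-level score."""
    if not second_corr:
        raise ValueError("no candidate categories were scored")
    return max(second_corr, key=lambda category: second_corr[category])


weights = {"goal": 0.6, "stock": 0.4}
first = {
    "goal": {"sports": 0.9, "finance": 0.1},
    "stock": {"sports": 0.0, "finance": 0.8},
}
scores = second_correlations(weights, first)
best = pick_category(scores)
```

A keyword with a large weight therefore pulls the text toward the categories that keyword correlates with.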
2. The method according to claim 1, wherein extracting the keywords of the text to be processed comprises:
performing word segmentation on the text to be processed to obtain a plurality of first segmented words of the text to be processed;
removing, from the first segmented words, those that belong to a target filter dictionary, to obtain one or more second segmented words, the second segmented words comprising the first segmented words remaining after the removal;
obtaining the keywords of the text to be processed according to the second segmented words;
wherein the target filter dictionary comprises a filter dictionary corresponding to a target data source, and the target data source comprises the data source to which the text to be processed belongs.
3. The method according to claim 2, wherein the target filter dictionary is constructed by:
performing word segmentation on each text belonging to the target data source to obtain a plurality of third segmented words;
determining a first ratio corresponding to each third segmented word, the first ratio corresponding to a third segmented word being the ratio of the number of texts in the target data source that contain the third segmented word to the total number of texts in the target data source;
constructing the target filter dictionary from the third segmented words whose first ratio exceeds a first ratio threshold.
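The filtering of claims 2 and 3 can be sketched as follows, assuming the texts of the target data source have already been segmented into word lists. The function names are hypothetical and the segmentation step itself is out of scope.

```python
from collections import Counter
from typing import List, Sequence, Set


def build_filter_dictionary(
    tokenized_texts: Sequence[Sequence[str]], first_ratio_threshold: float
) -> Set[str]:
    """Collect words whose document ratio exceeds the first ratio threshold."""
    if not tokenized_texts:
        return set()
    doc_count: Counter = Counter()
    for tokens in tokenized_texts:
        doc_count.update(set(tokens))  # count each word once per text
    total = len(tokenized_texts)
    return {w for w, n in doc_count.items() if n / total > first_ratio_threshold}


def remove_filtered(tokens: Sequence[str], filter_dictionary: Set[str]) -> List[str]:
    """Drop the words of one text that belong to the filter dictionary."""
    return [t for t in tokens if t not in filter_dictionary]


corpus = [
    ["watch", "video", "cat"],
    ["watch", "video", "dog"],
    ["watch", "news"],
    ["video", "goal"],
]
dictionary = build_filter_dictionary(corpus, first_ratio_threshold=0.5)
remaining = remove_filtered(["watch", "cat", "video"], dictionary)
```

Because the dictionary is built per data source, words that are uninformative in one source (for example "watch" in a video site) can still be kept for another.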
4. The method according to claim 3, wherein the first ratio threshold is determined by:
determining fourth segmented words from the third segmented words according to a current ratio threshold, the fourth segmented words comprising the third segmented words whose first ratio is equal to or greater than the current ratio threshold;
determining a remaining word count corresponding to each text belonging to the target data source, the remaining word count corresponding to a text being the number of third segmented words of the text that remain after the fourth segmented words are removed;
determining a second ratio of the number of texts whose remaining word count is equal to or greater than a word count threshold to the total number of texts belonging to the target data source;
when the second ratio is less than a second ratio threshold, determining the current ratio threshold as the first ratio threshold;
when the second ratio exceeds the second ratio threshold, updating the current ratio threshold according to a decrement value, and returning to the step of determining the fourth segmented words from the third segmented words according to the current ratio threshold.
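The iterative search of claim 4 can be sketched as a loop. Several points are assumptions rather than part of the claim: a second ratio exactly equal to the second ratio threshold stops the search, the loop also stops once the threshold cannot be lowered further, remaining words are counted with repetition, and the function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def find_first_ratio_threshold(
    tokenized_texts: Sequence[Sequence[str]],
    start_threshold: float,
    step: float,
    word_count_threshold: int,
    second_ratio_threshold: float,
) -> float:
    """Lower the ratio threshold until few enough texts keep many words."""
    total = len(tokenized_texts)
    if total == 0 or step <= 0:
        raise ValueError("need at least one text and a positive step")

    doc_count: Counter = Counter()
    for tokens in tokenized_texts:
        doc_count.update(set(tokens))
    first_ratio = {word: n / total for word, n in doc_count.items()}

    current = start_threshold
    while True:
        # Words that would be filtered at the current threshold.
        fourth = {w for w, r in first_ratio.items() if r >= current}
        texts_still_long = sum(
            1
            for tokens in tokenized_texts
            if sum(1 for t in tokens if t not in fourth) >= word_count_threshold
        )
        second_ratio = texts_still_long / total
        # Assumption: equality stops the search; so does an exhausted threshold.
        if second_ratio <= second_ratio_threshold or current - step <= 0:
            return current
        current -= step


corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"], ["a", "b"]]
threshold = find_first_ratio_threshold(
    corpus,
    start_threshold=0.9,
    step=0.2,
    word_count_threshold=2,
    second_ratio_threshold=0.3,
)
```

In this toy corpus the first pass filters only "a" and leaves half of the texts with two or more words, so the threshold is lowered once; the second pass also filters "b" and the search stops.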
5. The method according to claim 2, wherein obtaining the keywords of the text to be processed according to the second segmented words comprises:
combining the second segmented words to obtain fifth segmented words, each fifth segmented word comprising at least two consecutive adjacent second segmented words;
determining sixth segmented words from the fifth segmented words, the sixth segmented words comprising the fifth segmented words that belong to existing entries;
determining seventh segmented words from the sixth segmented words, a seventh segmented word being a sixth segmented word that is not contained in any sixth segmented word other than itself;
obtaining the keywords of the text to be processed according to the seventh segmented words.
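The combination and selection steps of claim 5 can be sketched with token tuples. The sketch assumes that "existing entries" are supplied as a set of token tuples (for example, drawn from an encyclopedia title list), that combinations are capped at a hypothetical `max_len`, and that containment means contiguous sub-sequence containment.

```python
from typing import List, Sequence, Set, Tuple

Phrase = Tuple[str, ...]


def adjacent_combinations(tokens: Sequence[str], max_len: int = 4) -> List[Phrase]:
    """List every run of 2..max_len consecutive tokens."""
    phrases: List[Phrase] = []
    for start in range(len(tokens)):
        for end in range(start + 2, min(len(tokens), start + max_len) + 1):
            phrases.append(tuple(tokens[start:end]))
    return phrases


def contains(outer: Phrase, inner: Phrase) -> bool:
    """True if inner occurs as a contiguous run inside outer."""
    width = len(inner)
    return any(outer[i:i + width] == inner for i in range(len(outer) - width + 1))


def maximal_entry_phrases(
    tokens: Sequence[str], known_entries: Set[Phrase], max_len: int = 4
) -> List[Phrase]:
    """Keep combinations that are known entries and not inside a longer one."""
    unique = dict.fromkeys(adjacent_combinations(tokens, max_len))
    entries = [p for p in unique if p in known_entries]
    return [
        p for p in entries if not any(q != p and contains(q, p) for q in entries)
    ]


tokens = ["new", "york", "city", "marathon", "results"]
known = {("new", "york"), ("new", "york", "city"), ("city", "marathon")}
phrases = maximal_entry_phrases(tokens, known)
```

Here "new york" is dropped because it sits inside the longer entry "new york city", while "city marathon" is kept because no other entry contains it.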
6. The method according to claim 1, wherein obtaining the semantic description information corresponding to each keyword comprises:
obtaining network search information corresponding to each keyword, the network search information corresponding to a keyword being obtained by searching for the keyword through a network search service;
obtaining the semantic description information corresponding to each keyword according to the network search information corresponding to that keyword.
7. The method according to claim 6, wherein obtaining the network search information corresponding to each keyword comprises:
looking up, in a local information base, a candidate keyword corresponding to each keyword, the local information base recording matching relationships between candidate keywords and candidate search information, the candidate search information being obtained by searching for the corresponding candidate keyword through the network search service;
when a candidate keyword corresponding to the keyword is found, obtaining the network search information corresponding to the keyword according to the candidate search information matched by the found candidate keyword;
when no candidate keyword corresponding to the keyword is found, calling the network search service to search for the keyword to obtain the network search information corresponding to the keyword.
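The lookup order of claim 7 resembles a read-through cache. In the sketch below the local information base is an in-memory dictionary and the network search service is any callable; both are stand-ins. Writing fresh results back to the local base is an assumption that the claim does not state, and `SearchInfoStore` is a hypothetical name.

```python
from typing import Callable, Dict, List

SearchFn = Callable[[str], List[str]]


class SearchInfoStore:
    """Look up search information locally before calling the search service."""

    def __init__(self, search_service: SearchFn) -> None:
        self._search_service = search_service
        self._local: Dict[str, List[str]] = {}

    def get(self, keyword: str) -> List[str]:
        if keyword in self._local:
            # Candidate keyword found: reuse the stored search information.
            return self._local[keyword]
        results = self._search_service(keyword)
        # Assumption: store the result so later lookups avoid the service.
        self._local[keyword] = results
        return results


calls: List[str] = []


def fake_search(keyword: str) -> List[str]:
    """Stand-in for a real search service; records each call."""
    calls.append(keyword)
    return [keyword + " summary"]


store = SearchInfoStore(fake_search)
first_lookup = store.get("marathon")
second_lookup = store.get("marathon")
```

The second lookup is served from the local base, so the search service is called only once for the keyword.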
8. The method according to claim 1, further comprising:
obtaining a priority factor of each candidate category;
wherein determining, according to each second degree of correlation, the category to which the text to be processed belongs from among the candidate categories comprises:
determining, according to each second degree of correlation and the priority factor of each candidate category, a fifth degree of correlation between the text to be processed and each candidate category;
determining, according to each fifth degree of correlation, the category to which the text to be processed belongs from among the candidate categories.
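One simple reading of claim 8 is that the priority factor rescales each second degree of correlation. The sketch below assumes a multiplicative factor with a default of 1.0 for categories without an explicit factor; the claim does not specify how the two values are combined, and `fifth_correlations` is a hypothetical name.

```python
from typing import Dict, Mapping


def fifth_correlations(
    second_corr: Mapping[str, float],
    priority: Mapping[str, float],
    default_priority: float = 1.0,
) -> Dict[str, float]:
    """Scale each text-level score by its category's priority factor.

    Assumption: multiplication is used; other combinations are possible.
    """
    return {
        category: score * priority.get(category, default_priority)
        for category, score in second_corr.items()
    }


second = {"sports": 0.50, "finance": 0.45}
adjusted = fifth_correlations(second, {"finance": 1.2})
chosen = max(adjusted, key=lambda category: adjusted[category])
```

With these numbers the priority factor moves "finance" ahead of "sports" even though its second degree of correlation is lower.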
9. The method according to any one of claims 1 to 8, wherein determining, according to each piece of semantic description information, the first degree of correlation between each keyword and each candidate category comprises:
determining, according to each piece of semantic description information and the category name of each candidate category, a third degree of correlation between each piece of semantic description information and each candidate category;
determining, according to each third degree of correlation, the first degree of correlation between each keyword and each candidate category.
10. The method according to claim 9, wherein determining, according to each piece of semantic description information and the category name of each candidate category, the third degree of correlation between each piece of semantic description information and each candidate category comprises:
determining, according to each piece of semantic description information and the category name of each candidate category, the shared words between each piece of semantic description information and the category name of each candidate category;
determining, from those shared words, the target shared words between each piece of semantic description information and the category name of each candidate category;
determining a third ratio of the total word length of the target shared words between each piece of semantic description information and the category name of each candidate category to the total word length of that category name;
determining the third degree of correlation between each piece of semantic description information and the category name of each candidate category according to each third ratio, a first word frequency of each target shared word in the corresponding semantic description information, and a first inverse document frequency of each target shared word;
wherein a target shared word between a piece of semantic description information and the category name of a candidate category is a shared word that is not contained in any other shared word between that semantic description information and that category name.
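The quantities named in claim 10 can be sketched as follows. The claim lists the inputs (third ratio, word frequency, inverse document frequency) but not how they are combined, so the product of the third ratio with the summed TF-IDF of the target shared words is an assumption. Measuring the category name length in non-space characters is also an assumption, and the function names are hypothetical.

```python
from typing import Mapping, Sequence, Set


def target_shared_words(
    description_words: Sequence[str], name_words: Sequence[str]
) -> Set[str]:
    """Shared words that are not contained in another shared word."""
    shared = set(description_words) & set(name_words)
    return {w for w in shared if not any(w != o and w in o for o in shared)}


def third_correlation(
    description_words: Sequence[str],
    category_name: str,
    name_words: Sequence[str],
    idf: Mapping[str, float],
) -> float:
    """Score one description against one category name (assumed formula)."""
    targets = target_shared_words(description_words, name_words)
    name_length = len(category_name.replace(" ", ""))
    if not targets or not description_words or name_length == 0:
        return 0.0
    third_ratio = sum(len(w) for w in targets) / name_length
    total = len(description_words)
    tf_idf = sum(
        description_words.count(w) / total * idf.get(w, 0.0) for w in targets
    )
    return third_ratio * tf_idf


nested = target_shared_words(["ball", "football", "news"], ["ball", "football"])
score = third_correlation(
    ["football", "match", "football", "league", "table"],
    "football match",
    ["football", "match", "football match"],
    {"football": 1.0, "match": 2.0},
)
```

In the second call both "football" and "match" are target shared words and together cover the whole category name, so the third ratio is 1.0 and the score equals the summed TF-IDF.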
11. The method according to claim 9, further comprising:
determining a fourth degree of correlation between each piece of semantic description information and each candidate category according to each piece of semantic description information, predetermined category-associated words of each candidate category, and predetermined correlation coefficients between the predetermined category-associated words of each candidate category and the corresponding candidate category;
wherein determining, according to each third degree of correlation, the first degree of correlation between each keyword and each candidate category comprises:
determining, according to each third degree of correlation and each fourth degree of correlation, the first degree of correlation between each keyword and each candidate category.
12. The method according to claim 8, wherein determining the fourth degree of correlation between each piece of semantic description information and each candidate category according to each piece of semantic description information, the predetermined category-associated words of each candidate category, and the predetermined correlation coefficients between the predetermined category-associated words of each candidate category and the corresponding candidate category comprises:
determining the fourth degree of correlation between each piece of semantic description information and each candidate category according to a second word frequency of each predetermined category-associated word of each candidate category in each piece of semantic description information, a second inverse document frequency of each predetermined category-associated word of each candidate category, and the predetermined correlation coefficient between each predetermined category-associated word of each candidate category and the corresponding candidate category.
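The fourth degree of correlation of claim 12 can be sketched as a coefficient-weighted TF-IDF sum over a category's predetermined associated words. The summation is an assumption, since the claim lists only the inputs, and `fourth_correlation` is a hypothetical name.

```python
from collections import Counter
from typing import Mapping, Sequence


def fourth_correlation(
    description_words: Sequence[str],
    associated_words: Mapping[str, float],
    idf: Mapping[str, float],
) -> float:
    """Score a description against one category's associated words.

    associated_words maps each associated word to its predetermined
    correlation coefficient with the category.
    Assumption: the terms tf * idf * coefficient are summed.
    """
    if not description_words:
        return 0.0
    total = len(description_words)
    counts = Counter(description_words)
    return sum(
        counts[word] / total * idf.get(word, 0.0) * coefficient
        for word, coefficient in associated_words.items()
    )


description = ["goal", "league", "goal", "coach"]
sports_words = {"goal": 0.9, "league": 0.5, "stock": 0.8}
idf_values = {"goal": 2.0, "league": 1.0, "stock": 3.0}
sports_score = fourth_correlation(description, sports_words, idf_values)
```

Associated words let a description match a category even when it shares no word with the category name itself.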
13. An apparatus for determining a text category, comprising:
a keyword processing module, configured to extract keywords of a text to be processed and determine a weight of each keyword;
a semantic description information obtaining module, configured to obtain semantic description information corresponding to each keyword;
a first degree of correlation determining module, configured to determine, according to each piece of semantic description information, a first degree of correlation between each keyword and each candidate category;
a second degree of correlation determining module, configured to determine, according to the weight of each keyword and each first degree of correlation, a second degree of correlation between the text to be processed and each candidate category;
a text category determining module, configured to determine, according to each second degree of correlation, the category to which the text to be processed belongs from among the candidate categories.
14. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 12.
15. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811592736.7A CN109740152B (en) | 2018-12-25 | 2018-12-25 | Text category determination method and device, storage medium and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109740152A (en) | 2019-05-10 |
CN109740152B (en) | 2023-02-17 |
Family
ID=66361194
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811592736.7A Active CN109740152B (en) | 2018-12-25 | 2018-12-25 | Text category determination method and device, storage medium and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109740152B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462347A (en) * | 2014-12-04 | 2015-03-25 | 北京国双科技有限公司 | Keyword classifying method and device |
WO2015149533A1 (en) * | 2014-03-31 | 2015-10-08 | 北京奇虎科技有限公司 | Method and device for word segmentation processing on basis of webpage content classification |
CN105005589A (en) * | 2015-06-26 | 2015-10-28 | 腾讯科技(深圳)有限公司 | Text classification method and text classification device |
CN106095845A (en) * | 2016-06-02 | 2016-11-09 | 腾讯科技(深圳)有限公司 | File classification method and device |
WO2018153265A1 (en) * | 2017-02-23 | 2018-08-30 | 腾讯科技(深圳)有限公司 | Keyword extraction method, computer device, and storage medium |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110147499A (en) * | 2019-05-21 | 2019-08-20 | 智者四海(北京)技术有限公司 | Label method, recommended method and recording medium |
CN110781345A (en) * | 2019-10-31 | 2020-02-11 | 北京达佳互联信息技术有限公司 | Video description generation model acquisition method, video description generation method and device |
CN111325037A (en) * | 2020-03-05 | 2020-06-23 | 苏宁云计算有限公司 | Text intention recognition method and device, computer equipment and storage medium |
CN112000768A (en) * | 2020-07-31 | 2020-11-27 | 恒大智慧科技有限公司 | Device and method for determining correlation degree and computer equipment |
CN112148684A (en) * | 2020-09-24 | 2020-12-29 | 成都知道创宇信息技术有限公司 | Document preview implementation method and device and electronic equipment |
CN112148684B (en) * | 2020-09-24 | 2023-10-13 | 成都知道创宇信息技术有限公司 | Document preview implementation method and device and electronic equipment |
CN112597772A (en) * | 2020-12-31 | 2021-04-02 | 讯飞智元信息科技有限公司 | Hotspot information determination method, computer equipment and device |
CN114817700A (en) * | 2021-01-29 | 2022-07-29 | 腾讯科技(深圳)有限公司 | Text keyword determination method and device, storage medium and electronic equipment |
CN113610559A (en) * | 2021-07-13 | 2021-11-05 | 广东丸美生物技术股份有限公司 | Method and device for evaluating cosmetics |
CN115708085A (en) * | 2021-08-09 | 2023-02-21 | 腾讯科技(深圳)有限公司 | Business processing method, neural network model training method, device, equipment and medium |
CN113988157A (en) * | 2021-09-30 | 2022-01-28 | 北京百度网讯科技有限公司 | Semantic retrieval network training method and device, electronic equipment and storage medium |
CN113988157B (en) * | 2021-09-30 | 2023-10-13 | 北京百度网讯科技有限公司 | Semantic retrieval network training method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109740152B (en) | 2023-02-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||