CN101645083B - Acquisition system and method of text field based on concept symbols - Google Patents

Acquisition system and method of text field based on concept symbols

Info

Publication number
CN101645083B
Authority
CN
China
Prior art keywords
field
statement
concept
word
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009100770180A
Other languages
Chinese (zh)
Other versions
CN101645083A (en)
Inventor
韦向峰
黄曾阳
张全
缪建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS
Priority to CN2009100770180A
Publication of CN101645083A
Application granted
Publication of CN101645083B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a system and a method for acquiring the field of a text based on concept symbols. The system comprises a field concept symbol set for expressing word concepts and field categories, a word knowledge base for storing words and their concept symbols, a word segmentation processor, a statement semantic analyzer and a field arbiter. The method comprises the following steps: (1) segmenting an input text into paragraphs, statements and words; (2) carrying out semantic analysis on the statements to obtain the concept categories and semantic chunks of the statements; (3) obtaining the activation words in the statements according to the field concept symbol set and the semantic concept symbols in the word knowledge base; (4) comprehensively scoring the field concept symbols of the activation words and taking the field concept symbol with the highest score as the field of the statement; (5) merging the statements in each paragraph according to their field concept symbols to obtain sentence groups and their fields; and (6) obtaining the field of the text according to the title of the text and the frequency of occurrence and the position of the sentence groups in the text.

Description

A system and method for acquiring the field of a text based on concept symbols
Technical field
The present invention relates to the use of computer science and technology to process the spoken and written language information in text, and in particular to a system and method for acquiring the field of a text based on concept symbols.
Background art
Text classification is the technique of using a computer to assign a text to one or more field categories according to certain rules, knowledge and steps. The conventional approach represents each text as a feature vector; two texts are placed in the same category when the "angle" between their feature vectors is below a certain threshold. Words are generally chosen as the text features that make up the feature vector. The feature vector is most often built with the TF*IDF method or the TF*IWF method derived from it; TF*IDF takes the product of a word's frequency of occurrence in a document and the inverse of its frequency of occurrence in the document collection as the value of the corresponding feature word in the vector. The k-nearest-neighbor method, the Bayesian method, support vector machines, neural networks, decision trees and similar text classification methods are all statistical methods built on the vector space model of text. They require a large pre-classified text set for parameter training before classification, and only after training can a new text be assigned to one of the predefined categories. Chinese patent document (publication number CN100353361) discloses a method and apparatus for a new feature vector weight for text classification, which introduces the n-th roots of DBV and TF on the basis of the TF*IWF method. In experiments that selected different numbers of feature words per category by word frequency (50, 100, 200, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000), its experimental system performed best with 3500 words.
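As a point of reference for the prior-art approach described above, the following is a minimal Python sketch of TF*IDF weighting and of comparing two feature vectors by the cosine of their angle. The logarithmic form of the inverse document frequency is a common variant and an assumption here, and the function names tfidf_vectors and cosine are illustrative only; they are not taken from the cited patent.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of word lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for d in docs for w in set(d))
    return [
        {w: (c / len(d)) * math.log(n / df[w]) for w, c in Counter(d).items()}
        for d in docs
    ]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (1.0 means same direction)."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["match", "dive", "dive"], ["match", "dive"], ["match", "vote"]]
vecs = tfidf_vectors(docs)
```

A word that occurs in every document ("match" above) receives weight zero, which is why such vectors separate texts only after a sufficiently large collection has been gathered.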
Because text classification requires the set of field categories and the classification criteria to be known in advance, it is difficult to apply when the categories are uncertain or a training text set is hard to obtain. Text clustering emerged to address this. A typical text clustering method is the K-Means algorithm: K texts are first chosen arbitrarily from the text set as cluster centers, and every other text is assigned to the cluster whose center is nearest to it in feature vector "distance"; the mean of the feature vectors of all texts in each of the K clusters then becomes the new cluster center, all texts are re-clustered by their distance to the new centers, and the computation iterates until the evaluation function converges. However, the field categories produced by automatic clustering are very coarse, and because the clustering is not guided toward particular kinds of fields, its results are hard to fit to practical needs. Moreover, the same clustering method may work well on one text set and poorly on another; that is, text clustering has shortcomings in both practicality and stability.
In summary, statistical text classification needs a large pre-classified training corpus, which is often hard to provide at classification time. Text clustering overcomes this drawback, but its results are difficult to reconcile with the actual classification requirements.
Summary of the invention
To overcome the above problems in the prior art, the present invention provides a system and method for acquiring the field of a text based on concept symbols. The classification criteria are configurable and the classification method is rule-based, so the basic field category of a text can be obtained without a training corpus, the classification categories can be customized to actual needs, and the approach can also be used for automatic text clustering.
To achieve this, the acquisition system for the field of a text based on concept symbols provided by the invention, as shown in Fig. 1, comprises:
A field concept symbol set, used to express word concepts and field categories and to supply the field arbiter with the field concept symbols it needs.
A word knowledge base, used to store words and their concept symbols and to supply the word segmentation processor and the statement semantic analyzer with the words and semantic concept symbols they need.
A word segmentation processor, used to split the input text into paragraphs, statements and words and pass them to the statement semantic analyzer.
A statement semantic analyzer, used to perform semantic analysis on each statement and obtain the concept category of the statement and the semantic chunks that make it up, including the role, boundary and internal composition of each semantic chunk.
A field arbiter, used to obtain the activation words in a statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base; to score the field concept symbols of the activation words comprehensively according to the semantic chunk type of each activation word in the statement, the relations between field concept symbols, the frequency of occurrence and the position of occurrence, and take the highest-scoring field concept symbol as the field of the statement; then to merge the statements in a paragraph according to their field concept symbols, obtaining sentence groups and their fields; and finally to obtain the field of the input text according to the title of the input text and the frequency of occurrence and position of the sentence groups in the input text.
Wherein, the role types of the semantic chunks are: feature semantic chunk E, actor semantic chunk A, object semantic chunk B and content semantic chunk C. The feature semantic chunk type E is further divided into two kinds: a) the global feature semantic chunk Eg, which is a feature semantic chunk E at the first level of the statement; b) the local feature semantic chunk El, which is the feature semantic chunk E of a nested statement S' when a semantic chunk contains the nested statement S'.
Wherein, the field concept symbol set comprises the following upper-level node symbols:
Upper-level field concept node    Field expressed
71, 72                            Psychological activity and mental state
8                                 Human thinking activity
a, b                              Professional and pursuit activities (second type of labor)
d                                 Theoretical activity
q6                                First type of labor
q7                                Leisure (non-professional) activity
q8                                Faith activity
6m (m = 0~5)                      Instinctive activity
3228α (α = 8~b)                   Disaster
503, 50α (α = 8~b)                State
together with the more specific field concept node symbols obtained by extending these upper-level nodes downward.
Wherein, the field arbiter determines the field of a statement S as follows. First, the type of the semantic chunk in which each activation word sits is obtained from the result of the sentence category analysis (SCA). Then the field of statement S is determined by examining the semantic chunk types in the order global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > (object semantic chunk B or actor semantic chunk A). When a semantic chunk type contains several activation words (W_1, W_2, ..., W_n), whose corresponding field concept symbols are (D_1, D_2, ..., D_n), the score of each field concept symbol in the statement is computed with the following formula:
S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n;
where Rel(i) is the relation score of the i-th field concept symbol D_i with respect to the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the statement; Fre(i) reflects the frequency of occurrence of D_i in statement S and is larger the higher the frequency; Pos(i) reflects the position of occurrence of D_i in statement S and is larger the later the position. The field concept symbol D_i with the highest score S(D_i) is taken as the field of statement S.
The acquisition method for the field of a text based on concept symbols provided by the invention, as shown in Fig. 2, comprises the following steps:
(1) Paragraph, statement and word segmentation: the word segmentation processor splits the input text into paragraphs, statements and words.
An input text is treated as a character string T in the computer. Text T is split into paragraphs P at the carriage return and line feed characters in T. Each paragraph P is split into statements S at characters such as the full stop, question mark, exclamation mark and semicolon.
A statement S consists of Chinese characters and other characters. Let A, B and C be Chinese characters occurring in statement S. If "AB" is a word in the word knowledge base, "ABC" is segmented as "AB/C"; likewise, if "BC" is a word in the dictionary, "ABC" is segmented as "A/BC". If both "AB" and "BC" are words in the dictionary, the left-cut principle gives "A/BC"; if "ABC" is itself a word in the dictionary, the longest-word principle gives "/ABC/". Statement S is thus segmented into words W, and word segmentation is complete.
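Step (1) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the four "ABC" examples above are all reproduced by backward maximum matching (scan from the right, prefer the longest dictionary word), so that technique is used here as an assumption. The function names, the max_len parameter and the ASCII punctuation are hypothetical additions.

```python
import re

def split_paragraphs(text):
    """Split text T into paragraphs P at carriage returns and line feeds."""
    return [p.strip() for p in re.split(r"[\r\n]+", text) if p.strip()]

def split_statements(paragraph):
    """Split paragraph P into statements S after 。？！； (and ASCII ?!;)."""
    parts = re.findall(r"[^。？！；?!;]+[。？！；?!;]?", paragraph)
    return [s.strip() for s in parts if s.strip()]

def segment(statement, lexicon, max_len=4):
    """Backward maximum matching: assumed reading of the cut principles."""
    words = []
    end = len(statement)
    while end > 0:
        # Try the longest candidate ending at `end` first; a single
        # character is always accepted so the loop makes progress.
        for size in range(min(max_len, end), 0, -1):
            cand = statement[end - size:end]
            if size == 1 or cand in lexicon:
                words.append(cand)
                end -= size
                break
    words.reverse()
    return words
```

With a dictionary containing both "AB" and "BC", segment("ABC", {"AB", "BC"}) yields ["A", "BC"], matching the left-cut example; adding "ABC" to the dictionary yields the single word ["ABC"].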
(2) Statement semantic analysis: the statement semantic analyzer performs semantic analysis on each statement and obtains the concept category of the statement and the semantic chunks that make it up, including the role, boundary and internal composition of each semantic chunk.
For each statement S, the analysis yields its semantic category (sentence category) code SCode, its format code SFomat, its sentence category expression SExpression, and the kind, extent and specific name in the sentence category expression of each semantic chunk making up the statement, and so on. In particular, the type of each semantic chunk is determined as E (feature semantic chunk), A (actor semantic chunk), B (object semantic chunk) or C (content semantic chunk). The feature semantic chunk type E is further divided into two kinds: Eg (global feature semantic chunk), which is a feature semantic chunk E at the first level of the statement; and El (local feature semantic chunk), which is the feature semantic chunk E of a nested statement S' when a semantic chunk contains the nested statement S'.
(3) Obtaining the activation words: the field arbiter obtains the activation words in a statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base.
An activation word is a word in statement S that carries a field concept symbol. The word knowledge base records: word form, tone, number of senses, sense number, concept category, word frequency and context, semantic knowledge, sentence category code, format conversion, S, K, CA, CT. The semantic knowledge is expressed in concept primitive symbols, and the field symbols are a subset of the concept primitive symbol system, so the concept symbols of a word may contain field concept symbol information. Not every concept primitive node in the system is used to describe fields; the upper-level nodes of the field-related concepts are: 71, 72 (psychological activity and mental state); 8 (human thinking activity); a, b (professional and pursuit activities (second type of labor)); d (theoretical activity); q6 (first type of labor); q7 (leisure (non-professional) activity); q8 (faith activity); 6m (m = 0~5) (instinctive activity); 3228α (α = 8~b) (disaster); 503, 50α (α = 8~b) (state). These upper-level nodes can be extended downward to obtain more specific field concept node symbols. For example, a (professional activity) extends downward to: a1 (politics), a2 (economy), a3 (culture), a4 (military), a5 (law), a6 (science and technology), a7 (education), a8 (health and protection); and a1 (politics) can in turn be extended downward to: a11 (regime activity), a113 (change of top leader (national or local government)), a113b (election).
The field concept symbols and the concept symbols of the semantic knowledge in the word knowledge base use the same concept primitive symbol system. When an upper-level field concept node, or a node derived from it, occurs in the concept symbols of the semantic knowledge of a word W, word W is an activation word. A field concept symbol expresses a field at a certain level or of a certain type, and all field concept symbols carried by the activation words in statement S are taken as the candidate fields of statement S.
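The activation-word test of step (3) can be illustrated with a short Python sketch. It rests on two assumptions that the text does not state outright: that a derived node is written by appending characters to its parent symbol (as in a, a1, a11, a113, a113b), so derivation can be checked as a string prefix, and that the ranges "α = 8~b" denote 8, 9, a, b. The dictionary layout of the knowledge base and the symbol "zz1" are hypothetical.

```python
# Upper-level field nodes from the list above, expanded into prefixes.
UPPER_NODES = tuple(
    ["71", "72", "8", "a", "b", "d", "q6", "q7", "q8", "503"]
    + [f"6{m}" for m in range(6)]        # 6m, m = 0~5
    + [f"3228{x}" for x in "89ab"]       # 3228α, α = 8~b (assumed 8, 9, a, b)
    + [f"50{x}" for x in "89ab"]         # 50α, α = 8~b (assumed 8, 9, a, b)
)

def field_symbols(concept_symbols):
    """Keep the concept symbols that are an upper node or derived from one."""
    return [s for s in concept_symbols if s.startswith(UPPER_NODES)]

def activation_words(words, knowledge_base):
    """Return (position, word, field symbols) for every activation word."""
    result = []
    for pos, word in enumerate(words):
        fields = field_symbols(knowledge_base.get(word, []))
        if fields:
            result.append((pos, word, fields))
    return result

# Hypothetical knowledge base; a339 and q734 appear in the embodiment,
# "zz1" is an invented non-field symbol.
kb = {"Malaysia": ["zz1"], "diving": ["a339"], "semifinals": ["q734"]}
hits = activation_words(["Malaysia", "diving", "semifinals"], kb)
```

The field symbols collected in hits are the candidate fields of the statement; choosing among them is the job of step (4).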
(4) Determining the field of a statement: the field arbiter scores the field concept symbols of the activation words comprehensively according to the semantic chunk type of each activation word in the statement, the relations between field concept symbols, the frequency of occurrence and the position of occurrence, and takes the highest-scoring field concept symbol as the field of the statement.
In step (4), the field of a statement comes from the field concept symbols of its activation words. When statement S contains several activation words, its field is determined as follows. First, the type of the semantic chunk in which each activation word sits is obtained from the result of the sentence category analysis (SCA). Then the field of statement S is determined by examining the semantic chunk types in the order global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > object semantic chunk B or actor semantic chunk A. That is, if Eg contains an activation word W, the field concept symbol of W is taken as the field of the statement; if Eg contains no activation word, El is examined; if El contains none, C is examined; and if C contains none, B or A is examined.
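The chunk-type priority just described amounts to taking the activation words from the first non-empty level. The sketch below shows this in Python; the tuple layout (word, chunk type, field symbol) and the chunk type assigned to "Athens" are assumptions made for illustration.

```python
# Priority levels: Eg > El > C > (B or A).
PRIORITY = [("Eg",), ("El",), ("C",), ("B", "A")]

def candidates_by_priority(activations):
    """activations: list of (word, chunk_type, field_symbol) tuples.

    Returns the activation words of the highest-priority chunk type
    that contains at least one activation word.
    """
    for level in PRIORITY:
        hits = [a for a in activations if a[1] in level]
        if hits:
            return hits
    return []

# Hypothetical analysis result: nothing in Eg or El, two words in C.
acts = [("Athens", "B", "a219"), ("diving", "C", "a339"), ("semifinals", "C", "q734")]
chosen = candidates_by_priority(acts)
```

When the selected level holds more than one activation word, as in this example, the scoring formula is used to pick one field.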
When a semantic chunk type contains several activation words (W_1, W_2, ..., W_n), whose corresponding field concept symbols are (D_1, D_2, ..., D_n), the score of each field concept symbol in the statement is computed with the formula S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n. In this formula, Rel(i) is the relation score of the i-th field concept symbol D_i with respect to the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the statement; Fre(i) reflects the frequency of occurrence of D_i in statement S and is larger the higher the frequency; Pos(i) reflects the position of occurrence of D_i in statement S and is larger the later the position. The field concept symbol D_i with the highest score S(D_i) is taken as the field of statement S.
The score Rel(i) comes from the relation between field concept symbols D_i and D_j. When D_i is a concept extension of D_j, the score of D_i is increased by 1; when D_i is strongly correlated with D_j, the score of D_i is increased by 1. If, after S(D_i) has been computed, D_i is the field of the statement but D_i is preceded by a modifying negation concept, then D_i' (its opposite field concept symbol) is taken as the field of the statement. If, after S(D_i) has been computed, D_i is the field of the statement, the Rel(j) + Fre(j) score of D_j equals that of D_i, and D_i and D_j are child nodes of the same concept node, then the field concept symbol of their common parent node is taken as the field of the statement.
If an activation word W_i (1 ≤ i ≤ n) carries several field concept symbols (D_i1, D_i2, ..., D_im), the field score must be computed for each of these m symbols; however, when Rel is computed, the relation between two field concept symbols D_ij (1 ≤ j ≤ m) and D_ik (k ≠ j, 1 ≤ k ≤ m) of the same word need not be considered. If the final scores S(D_ij) and S(D_ik) are still equal, the field concept symbol listed first in the word knowledge base is taken as the field of statement S.
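The scoring formula of step (4) leaves the exact form of Fre and Pos open, so the Python sketch below fills them in with assumptions: Fre is the raw count of the symbol, Pos is the normalized position of its last occurrence (so that position mainly breaks ties), and "concept extension" is tested as a string prefix. The negation rule, the common-parent rule and the multi-symbol rule are not modeled. All names are hypothetical.

```python
def is_extension(child, parent):
    """Assumed test: child symbol is written by extending parent symbol."""
    return child != parent and child.startswith(parent)

def score_fields(fields, strongly_related=frozenset()):
    """fields: field symbols of the activation words, in order of appearance.

    strongly_related: set of frozenset pairs judged strongly correlated.
    Returns {symbol: Rel + Fre + Pos}.
    """
    scores = {}
    for i, d in enumerate(fields):
        others = set(fields) - {d}
        rel = sum(1 for o in others if is_extension(d, o))
        rel += sum(1 for o in others if frozenset((d, o)) in strongly_related)
        fre = fields.count(d)            # assumed: raw frequency
        pos = (i + 1) / len(fields)      # assumed: later position, larger value
        scores[d] = rel + fre + pos      # last occurrence overwrites earlier ones
    return scores

def statement_field(fields, strongly_related=frozenset()):
    """Return the highest-scoring field symbol, or None if there is none."""
    scores = score_fields(fields, strongly_related)
    return max(scores, key=scores.get) if scores else None
```

For the two symbols a339 and q734 in that order, the relation and frequency terms are equal and the later position decides, so statement_field(["a339", "q734"]) returns "q734".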
(5) Determining sentence groups and their fields: the field arbiter merges the statements in a paragraph according to their field concept symbols, obtaining sentence groups and their fields.
A sentence group consists of consecutive statements describing the same central topic. The central topic of a sentence group is the topic or field expressed by identical or similar field concept symbols. The smallest sentence group is one statement, and the largest is one paragraph. In step (5), for the statements (S_1, S_2, ..., S_n) in a paragraph P_i of text T, the sentence group to which each statement belongs is determined by the following steps, as shown in Fig. 3:
(5a) Take the first statement S_1 as sentence group G_1, and take the field D_1 of S_1 as the field D_G1 of G_1;
(5b) Let S_1 be the current statement S_i and G_1 the current sentence group G_j; go to (5g);
(5c) If the field D_i of S_i is a symbol extension of the field D_(i-1) of S_(i-1), then statement S_i is added to G_j and the field of G_j is changed to D_i; go to (5g);
(5d) If the field D_(i-1) of S_(i-1) is a symbol extension of the field D_i of S_i, then statement S_i is added to G_j; go to (5g);
(5e) If the field D_i of the current statement S_i is identical to the field D_(i-1) of the previous statement S_(i-1), then statement S_i is added to G_j; go to (5g);
(5f) Take the statement S_(i+1) following S_i as a new sentence group G_(j+1), whose field D_G(j+1) is the field D_(i+1) of statement S_(i+1);
(5g) If the current statement S_i is the last statement S_n, go to (5n);
(5k) If the field of S_i is empty and S_i is S_1, then statement S_2 is added to G_1, the field of G_1 is changed to D_2, and S_2 becomes the current statement S_i; go to (5c);
(5l) If the field of S_i is empty and S_i is not S_1, then statement S_i is added to G_j; go to (5g);
(5m) If the field of S_i is not empty, then S_(i+1) becomes the current statement S_i; go to (5c);
(5n) For all sentence groups G_j obtained, adjacent sentence groups with the same field are merged into one sentence group, where 1 ≤ j ≤ m and 1 ≤ m ≤ n.
Through the above steps and the merging operation, a paragraph is divided into several sentence groups, and their fields are determined from the fields of their statements, which accomplishes both the division of the paragraph into sentence groups and the determination of each sentence group's field.
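The grouping of step (5) can be approximated by a single pass over the statement fields. The Python sketch below is a simplified interpretation rather than a transcription of the flow in Fig. 3: it compares each statement with the current field of the open sentence group instead of with the previous statement, treats an empty string as "no field", and assumes that a symbol extension is a string prefix relation. The function names are hypothetical.

```python
def is_extension(child, parent):
    """Assumed test: child symbol is written by extending parent symbol."""
    return child != parent and child.startswith(parent)

def group_statements(fields):
    """fields: field symbol of each statement in a paragraph ("" if none).

    Returns a list of [group_field, [statement indices]].
    """
    groups = []
    for idx, field in enumerate(fields):
        if not groups:
            groups.append([field, [idx]])          # first statement opens G_1
            continue
        current = groups[-1]
        if field == "":
            current[1].append(idx)                 # no field: stay in the group
        elif current[0] == "":
            current[0] = field                     # group adopts first real field
            current[1].append(idx)
        elif field == current[0] or is_extension(current[0], field):
            current[1].append(idx)                 # same or more general field
        elif is_extension(field, current[0]):
            current[0] = field                     # more specific field wins
            current[1].append(idx)
        else:
            groups.append([field, [idx]])          # unrelated field: new group
    # Final pass: merge adjacent groups whose fields are identical.
    merged = []
    for field, members in groups:
        if merged and merged[-1][0] == field:
            merged[-1][1].extend(members)
        else:
            merged.append([field, list(members)])
    return merged

result = group_statements(["q73", "q734", "", "a1", "a1", "q734"])
```

In this example the first three statements form one sentence group with field q734, the next two form a group with field a1, and the last statement opens a third group.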
(6) Determining the field of the text: the field arbiter obtains the field of the input text according to the title of the text and the frequency of occurrence and position of the sentence groups in the input text.
Step (6) further comprises: if the input text has a title, the field of the title is taken as the field of the input text. If the title paragraph P_1 contains only one sentence group, the field of that sentence group is the field of the text; if P_1 contains several sentence groups, the fields of the first and the last sentence group in P_1 are together taken as the field of the text.
If the text has no title, the fields of all sentence groups in the text are candidates for the field of the text. The fields of the n sentence groups in text T, in order of appearance, are denoted D = (D_G1, D_G2, ..., D_Gn). Proceeding from D_G1 to D_Gn, the following steps are carried out, as shown in Fig. 4:
(6a) Let D_G1 be D_Gi; count the number C_Gi of fields in D whose field concept symbol is identical to that of D_Gi, and store D_Gi and C_Gi in table HTab;
(6b) If D_Gi is D_Gn, go to (6f);
(6c) Let D_G(i+1) be D_Gi;
(6d) If the field concept symbol of D_Gi has already been stored in table HTab, go to (6c);
(6e) Count the number C_Gi of fields in D whose field concept symbol is identical to that of D_Gi, and store D_Gi and C_Gi in table HTab; go to (6b);
(6f) Table HTab = ((D_G1, C_G1), ..., (D_Gm, C_Gm)) is obtained, where 1 ≤ m ≤ n;
(6g) The elements (D_Gj, C_Gj), 1 ≤ j ≤ m, of table HTab are sorted by C_Gj in descending order, giving the new table HTab' = ((D_G1', C_G1'), ..., (D_Gm', C_Gm')).
The field concept symbol of the first element in the new table is taken as the field of text T. When text T has a title, the above steps are not needed to obtain the field of text T.
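Step (6) can be sketched in Python as follows. The sketch assumes that the sentence group fields of the title paragraph and of the body are already available as lists in order of appearance; the function name and argument layout are hypothetical. Ties in frequency are resolved in favor of the field that appears first, which is one reading of the ordering produced by table HTab.

```python
def text_field(title_group_fields, body_group_fields):
    """Return the field(s) of a text as a list of field symbols.

    title_group_fields: sentence group fields of the title paragraph
                        (empty list if the text has no title).
    body_group_fields:  sentence group fields of the whole text, in order.
    """
    if title_group_fields:
        first, last = title_group_fields[0], title_group_fields[-1]
        # One group: its field; several groups: first and last together.
        return [first] if first == last else [first, last]

    # No title: build HTab as {field: count}; dicts keep insertion order.
    htab = {}
    for field in body_group_fields:
        htab[field] = htab.get(field, 0) + 1
    # Stable sort by descending count, so ties keep order of first appearance.
    ranked = sorted(htab.items(), key=lambda item: -item[1])
    return [ranked[0][0]] if ranked else []
```

For example, text_field([], ["a1", "q734", "a1", "q734", "q73"]) returns ["a1"], because a1 and q734 both occur twice and a1 appears first.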
The advantages of the invention are:
1. When the text field acquisition system and method provided by the invention are used for text classification, no large pre-classified training corpus is needed; only the field concept symbols relevant to the classification categories need to be determined.
2. The field concept symbols used by the system and method are hierarchical, so they can accommodate a wide variety of categories at the same level as well as cross-level classification into specific, fine-grained categories.
3. The system and method rely mainly on semantic analysis and determine the field category of a text at the level of concepts, while also bringing in frequency, a statistical feature, which makes the acquisition of the text field more accurate and better suited to processing text at large scale.
4. The sentence group fields proposed by the system and method can be used for text classification, and also for text cluster analysis and text topic analysis.
Brief description of the drawings
Fig. 1 is a structural diagram of the text field acquisition system of the invention;
Fig. 2 is a flowchart of the text field acquisition method of the invention;
Fig. 3 is a flowchart of the method of the invention for determining sentence groups and their fields;
Fig. 4 is a flowchart of the text field acquisition method of the invention when the text has no title.
Embodiment
The invention is described in detail below with reference to a specific embodiment and the accompanying drawings.
First, 11 news reports on competitions at the 2004 Athens Olympic Games were downloaded from the Internet, comprising 60 paragraphs and 6501 Chinese characters in total.
Second, following the design principles and design symbols in "fundamental theorem in language concept space and mathematical physics expression" (Maritime Press, July 2004), the concept symbols of the q73 (competition) field were refined, yielding a concept symbol set for the competition field. The words about competitions and their semantic knowledge in the word knowledge base were enriched at the same time.
Third, the word segmentation processor is used to split a text into paragraphs, statements and words. Take the following text as an example: Title: Malaysia's "little standard-bearer" misses the diving semifinals by one place
www.xinhuanet.com, Athens, August 27: In the men's 10 m platform diving competition of the Olympic Games held on the afternoon of the 27th local time, Brian-Nickerson from Malaysia ranked 19th in the preliminary contest and failed to be promoted to the semifinals. According to the rules, of the 33 players in the preliminary contest, the 18 with the best results are promoted to the semifinals.
After processing by the word segmentation processor, the following result is obtained:
[Title :] [Malaysia] [" little standard-bearer "] [one poor] [not advancing] [diving] [semifinals]
[www.xinhuanet.com] [Athens] [August 27] electricity is in [Olympic Games] [man] [ten meters] [diving tower] [diving] of [locality] [time] [27 days] [afternoon] [holding] [match]
[from] Brian-Nickerson's [preliminary contest] [achievement] [rank] [19] of [Malaysia]
[failing] [promotion] [semifinals]
[according to] [rule]
In [33] [player] of [preliminary contest]
[player] [promotion] [semifinals] of [18] before [achievement] comes
Fourth, the statement semantic analyzer is used to analyze the statements, and the field arbiter is then used to obtain the activation words and analyze the sentence groups and their fields. After the sentence group fields are merged, the following result is obtained:
//DOM:(q734)
Title: the difference that [Malaysia] [" little standard-bearer "] is is not advanced [diving (a339)] [semifinals (q734)]
Www.xinhuanet.com's [Athens (a219)] August 27 is in [match (q73)] of [Olympic Games (a339i)] [man] ten meters [diving tower (a339ing)] [divings (a339)] of 27 days [locality] [time] [afternoon] [holding (a02)]; [from] Brian-Nickerson [preliminary contest (q734)] [achievement (a0099b)] [rank (q730e25d0 [n])] the 19 of [Malaysia], [failing] [being promoted to (a01ad0ne25)] [semifinals (q734)].[according to] [rule (a009a9)], in 33 [player (q730)] of [preliminary contest (q734)], [achievement (a0099b)] comes preceding 18 [player (q730)] [being promoted to (a01ad0ne25)] [semifinals (q734)].
In the text, the first statement is "Title: Malaysia's 'little standard-bearer' misses the diving semifinals by one place", and its semantic analysis result is "Title: Malaysia 'little standard-bearer' (SB) || misses by one place (S0) || diving semifinals (SC)". Because the global feature semantic chunk Eg (i.e. S0) carries no field concept symbol information, the field of the statement is chosen from the content semantic chunk C (i.e. SC), which does carry field information. In the SC semantic chunk, both "diving" and "semifinals" carry field concept symbol information. Their relation scores and frequency scores are the same, but the position score of "semifinals" is greater than that of "diving", so the field of the statement is "q734". The first paragraph consists of just this one statement, so the whole paragraph is one sentence group, and the field of the sentence group is "q734". Because the first paragraph is the title of the text, the field of the text is "q734".
In this way, starting from the field concept symbols of the activation words and analyzing the type of semantic chunk in which each activation word sits in the statement, the positions and frequencies of the words, and so on, the field of each statement and of each sentence group can be obtained, and finally the field of the text.

Claims (8)

1. the system that obtains based on the text field of concept symbols is characterized in that, the said system that obtains comprises:
A field concept symbol set, used to express word concepts and field categories and to provide the field arbiter with the field concept symbols it requires;
A word knowledge base, used to store words and their concept symbols and to provide the word segmentation processor and the statement semantic analyzer with the words and semantic concept symbols they require;
A word segmentation processor, used to segment the input text into paragraphs, statements, and words and to pass them to the statement semantic analyzer;
A statement semantic analyzer, used to perform semantic analysis on each statement to obtain the concept category of the statement and the semantic chunks that constitute it, including the role, boundaries, and internal composition of each semantic chunk;
A field arbiter, used to: obtain the activating words in a statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base; comprehensively score the field concept symbols of the activating words according to the type of semantic chunk in which each activating word occurs, the relations between field concept symbols, and the frequency and position of occurrence within the statement, and take the highest-scoring field concept symbol as the field of the statement; then merge the statements in each paragraph according to their field concept symbols to obtain statement groups and their fields; and finally obtain the field of the input text according to the title of the input text and the frequency and position of occurrence of the statement groups in the input text. If the text has a title, the field of the title is taken as the field of the text; if the text has no title, the statement-group field that has the highest frequency of occurrence and appears earliest in the text is taken as the field of the text;
Wherein the field concept symbols in the field concept symbol set and the semantic concept symbols in the word knowledge base are based on the same system of concept primitive symbols; the field concept symbol set comprises 10 classes of upper-level node symbols and the more specific field concept symbols extending downward from those upper-level nodes;
An activating word is defined as follows: when an upper-level node of a field concept symbol, or a node derived from it, occurs in the concept symbols of a word W, the word W is an activating word.
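The activating-word rule can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a "derived node" is any symbol whose string extends an upper-level node symbol (prefix matching). The node subset, the toy knowledge base entries, and the function names are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Sequence

# Illustrative subset of the upper-level field nodes listed in claim 3.
UPPER_NODES = ("71", "72", "8", "a", "b", "d", "q6", "q7", "q8")


def field_symbols(concept_symbols: Sequence[str],
                  upper_nodes: Sequence[str] = UPPER_NODES) -> List[str]:
    """Return the concept symbols that equal an upper-level node or extend it.

    Assumption: "derived from" is modeled as string-prefix extension.
    """
    return [s for s in concept_symbols
            if any(s.startswith(u) for u in upper_nodes)]


def activating_words(words: Sequence[str],
                     knowledge_base: Dict[str, List[str]],
                     upper_nodes: Sequence[str] = UPPER_NODES) -> Dict[str, List[str]]:
    """Map each activating word to the field concept symbols it carries."""
    result: Dict[str, List[str]] = {}
    for w in words:
        hits = field_symbols(knowledge_base.get(w, []), upper_nodes)
        if hits:  # a word is activating only if it carries a field symbol
            result[w] = hits
    return result


# Hypothetical knowledge base: word -> concept symbols.
kb = {"election": ["a12"], "table": ["pw1"], "pray": ["q81"]}
found = activating_words(["election", "table", "pray", "unknown"], kb)
```

Words missing from the knowledge base are simply skipped here; how the actual system treats unknown words is not stated in the claim.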
2. The text field acquisition system according to claim 1, characterized in that the role types of the semantic chunks are: feature semantic chunk E, actor semantic chunk A, object semantic chunk B, and content semantic chunk C; the feature semantic chunk type E is divided into two kinds: a) the global feature semantic chunk Eg, which is a feature semantic chunk E at the first level of the statement; b) the local feature semantic chunk El, which is the feature semantic chunk E of a nested statement S' when a semantic chunk contains the nested statement S'.
3. The text field acquisition system according to claim 1, characterized in that the field concept symbol set comprises the following upper-level node symbols:
Field concept upper-level node | Field expressed
71, 72 | Psychological activity and mental state
8 | Human thinking activities
a, b | Professional and pursuit activities
d | Theoretical activity
q6 | Labor of the first kind
q7 | Leisure (non-professional) activity
q8 | Faith activity
6m (m = 0~5) | Instinctive activity
3228α (α = 8~b) | Calamity
503, 50α (α = 8~b) | State
And the more specific field concept node symbols extending downward from those upper-level nodes.
4. The text field acquisition system according to claim 1, characterized in that the field arbiter determines the field of a statement S as follows: first, the type of the semantic chunk in which each activating word is located is obtained from the result of the statement semantic analysis; then, the field of statement S is determined in the semantic chunk type order "global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > object semantic chunk B or actor semantic chunk A"; when semantic chunks of the same type contain several activating words W_1, W_2, ..., W_n, and the field concept symbols corresponding to these activating words are D_1, D_2, ..., D_n respectively, the score of each field concept symbol in the statement is calculated by the following formula:
S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n;
Wherein Rel(i) denotes the relation score of the i-th field concept symbol D_i with the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the statement; Fre(i) denotes the frequency of occurrence of D_i in statement S, with a higher frequency giving a larger value; Pos(i) denotes the position of occurrence of D_i in statement S, with a later position giving a larger value; the field concept symbol D_i with the highest score S(D_i) is taken as the field of statement S.
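A minimal Python sketch of this scoring step follows. The claim fixes only the sum S(D_i) = Rel(i) + Fre(i) + Pos(i) and the chunk-type priority; the concrete definitions used below are assumptions for illustration: Rel counts the other candidate symbols that are prefix-related to D_i, Fre is the raw occurrence count, and Pos is the last occurrence position normalized by statement length. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Priority order from the claim; B and A share the lowest level.
CHUNK_PRIORITY = ["Eg", "El", "C", "B", "A"]


def chunk_rank(chunk_type: str) -> int:
    """Smaller rank means higher priority; 'B' and 'A' both map to rank 3."""
    return min(CHUNK_PRIORITY.index(chunk_type), 3)


def statement_field(activations: List[Tuple[str, str, int]],
                    statement_len: int) -> Optional[str]:
    """Pick the field of a statement.

    activations: (field_symbol, chunk_type, word_position) per activating word.
    """
    if not activations:
        return None
    # Only activating words in the highest-priority chunk type compete.
    best = min(chunk_rank(c) for _, c, _ in activations)
    cands = [(d, p) for d, c, p in activations if chunk_rank(c) == best]

    freq = Counter(d for d, _ in cands)      # Fre(i): assumed raw count
    last_pos: Dict[str, int] = {}
    for d, p in cands:
        last_pos[d] = max(p, last_pos.get(d, -1))
    symbols = list(freq)

    def rel(d: str) -> int:
        # Rel(i): assumed count of other symbols that extend d or are extended by d.
        return sum(1 for o in symbols
                   if o != d and (o.startswith(d) or d.startswith(o)))

    def score(d: str) -> float:
        pos = (last_pos[d] + 1) / statement_len  # Pos(i): later position scores higher
        return rel(d) + freq[d] + pos

    return max(symbols, key=score)


field = statement_field([("a1", "C", 0), ("q7", "Eg", 2), ("a2", "Eg", 5)], 8)
```

In this example the symbol in chunk C is ignored because Eg chunks are present, and "a2" wins over "q7" only through its later position.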
5. A method for acquiring the field of a text based on concept symbols, comprising the following steps:
(1) Paragraph, statement, and word segmentation: the word segmentation processor segments the input text into paragraphs, statements, and words;
(2) Statement semantic analysis: the statement semantic analyzer performs semantic analysis on each statement to obtain the concept category of the statement and the semantic chunks that constitute it, including the role, boundaries, and internal composition of each semantic chunk;
(3) Obtaining activating words: the field arbiter obtains the activating words in the statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base; an activating word is defined as follows: when an upper-level node of a field concept symbol, or a node derived from it, occurs in the concept symbols of a word W, the word W is an activating word;
(4) Statement field determination: the field arbiter comprehensively scores the field concept symbols of the activating words according to the type of semantic chunk in which each activating word occurs, the relations between field concept symbols, and the frequency and position of occurrence within the statement, and takes the highest-scoring field concept symbol as the field of the statement;
Wherein the field concept symbols are provided to the statement field determination step as required; the words and their concept symbols are provided to the segmentation step and the statement semantic analysis step as required;
(5) Statement group and field determination: the field arbiter merges the statements in each paragraph according to their field concept symbols to obtain statement groups and their fields;
(6) Text field determination: the field arbiter obtains the field of the input text according to the text title and the frequency and position of occurrence of the statement groups in the input text; if the input text has a title, the field of the title is taken as the field of the input text; if the input text has no title, the statement-group field that has the highest frequency of occurrence and appears earliest in the input text is taken as the candidate for the field of the input text;
Wherein the field concept symbols in the field concept symbol set and the semantic concept symbols in the word knowledge base are based on the same system of concept primitive symbols; the field concept symbol set comprises 10 classes of upper-level node symbols and the more specific field concept symbols extending downward from those upper-level nodes.
6. The method for acquiring the field of a text according to claim 5, characterized in that step (4) determines the field of a statement S as follows: first, the type of the semantic chunk in which each activating word is located is obtained from the result of the statement semantic analysis; then, the field of statement S is determined in the semantic chunk type order "global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > object semantic chunk B or actor semantic chunk A"; when semantic chunks of the same type contain several activating words W_1, W_2, ..., W_n, and the field concept symbols corresponding to these activating words are D_1, D_2, ..., D_n respectively, the score of each field concept symbol in the statement is calculated by the following formula:
S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n;
Wherein Rel(i) denotes the relation score of the i-th field concept symbol D_i with the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the statement; Fre(i) denotes the frequency of occurrence of D_i in statement S, with a higher frequency giving a larger value; Pos(i) denotes the position of occurrence of D_i in statement S, with a later position giving a larger value; the field concept symbol D_i with the highest score S(D_i) is taken as the field of statement S.
7. The method for acquiring the field of a text according to claim 5, characterized in that, in step (5), for the statements S_1, S_2, ..., S_n in a paragraph P_i of text T, the statement group to which each statement belongs is determined by the following steps:
(5a) Take the first statement S_1 as statement group G_1, and take the field D_1 of S_1 as the field D_{G1} of statement group G_1;
(5b) Set S_1 as the current statement S_i and G_1 as the current statement group G_j; go to (5g);
(5c) If the field D_i of S_i is a symbol extension of the field D_{i-1} of S_{i-1}, then statement S_i is assigned to G_j and the field of G_j is changed to D_i; go to (5g);
(5d) If the field D_{i-1} of S_{i-1} is a symbol extension of the field D_i of S_i, then statement S_i is assigned to G_j; go to (5g);
(5e) If the field D_i of the current statement S_i is identical to the field D_{i-1} of the previous statement S_{i-1}, then statement S_i is assigned to G_j; go to (5g);
(5f) Take the statement S_{i+1} following S_i as a new statement group G_{j+1}, whose field D_{G(j+1)} is the field D_{i+1} of statement S_{i+1};
(5g) If the current statement S_i is the last statement S_n, go to (5n);
(5k) If the field of S_i is empty and S_i is S_1, then statement S_2 is assigned to G_1, the field of G_1 is changed to D_2, and S_2 becomes the current statement S_i; go to (5c);
(5l) If the field of S_i is empty and S_i is not S_1, then statement S_i is assigned to G_j; go to (5g);
(5m) If the field of S_i is not empty, then S_{i+1} becomes the current statement S_i; go to (5c);
(5n) Among all the statement groups G_j obtained, adjacent statement groups with identical fields are merged into one statement group, where 1 ≤ j ≤ m and 1 ≤ m ≤ n.
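The grouping procedure can be sketched in Python as a single linear pass. This is a simplified reading of steps (5a) to (5n), not a literal transcription of the jump structure: it assumes that "symbol extension" means string-prefix extension, that a statement with an empty field (None) joins the current group, and that each statement is compared with the most recent non-empty field in the current group. Function names and the example symbols are hypothetical.

```python
from typing import List, Optional, Tuple

Group = Tuple[List[int], Optional[str]]  # (statement indices, group field)


def extends(child: str, parent: str) -> bool:
    """Assumed meaning of 'symbol extension': child strictly extends parent."""
    return child != parent and child.startswith(parent)


def merge_adjacent(groups) -> List[Group]:
    """Step (5n): merge neighbouring groups whose fields are identical."""
    merged: List[Group] = []
    for indices, field in groups:
        if merged and merged[-1][1] == field:
            merged[-1][0].extend(indices)
        else:
            merged.append((list(indices), field))
    return merged


def group_statements(fields: List[Optional[str]]) -> List[Group]:
    """Group the statements of one paragraph by the fields of the statements."""
    groups: List[list] = []          # each entry: [indices, field]
    prev: Optional[str] = None       # latest non-empty field in current group
    for i, d in enumerate(fields):
        if not groups:               # (5a) first statement opens G_1
            groups.append([[i], d])
            prev = d
            continue
        cur = groups[-1]
        if d is None:                # (5l) empty field: stay in current group
            cur[0].append(i)
        elif prev is None:           # generalizes (5k): group adopts first field
            cur[0].append(i)
            cur[1] = d
            prev = d
        elif extends(d, prev):       # (5c) more specific field: update group
            cur[0].append(i)
            cur[1] = d
            prev = d
        elif extends(prev, d) or d == prev:  # (5d), (5e)
            cur[0].append(i)
            prev = d
        else:                        # (5f) unrelated field opens a new group
            groups.append([[i], d])
            prev = d
    return merge_adjacent(groups)


result = group_statements([None, "a1", "a12", "a1", "q7", None, "q7", "d2"])
```

Under these assumptions the merge in step (5n) rarely changes anything, since a statement whose field matches the current group is absorbed earlier; it is kept as a separate function to mirror the claim.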
8. The method for acquiring the field of a text according to claim 5, characterized in that, if the text has no title, the fields of the n statement groups in text T are denoted, in order of statement group appearance, D = (D_{G1}, D_{G2}, ..., D_{Gn}), and the text field is obtained by operating from D_{G1} to D_{Gn} according to the following steps:
(6a) Take D_{G1} as D_{Gi}; count the number C_{Gi} of fields in D whose field concept symbol is identical to that of D_{Gi}, and store D_{Gi} and C_{Gi} in the table HTab;
(6b) If D_{Gi} is D_{Gn}, go to (6f);
(6c) Take D_{G(i+1)} as D_{Gi};
(6d) If the field concept symbol of D_{Gi} has already been stored in the table HTab, go to (6c);
(6e) Count the number C_{Gi} of fields in D whose field concept symbol is identical to that of D_{Gi}, and store D_{Gi} and C_{Gi} in the table HTab; go to (6b);
(6f) Obtain the table HTab = ((D_{G1}, C_{G1}), ..., (D_{Gm}, C_{Gm})), where 1 ≤ m ≤ n;
(6g) Sort the elements (D_{Gj}, C_{Gj}), 1 ≤ j ≤ m, of the table HTab in descending order of C_{Gj} to obtain a new table HTab' = ((D_{G1}', C_{G1}'), ..., (D_{Gm}', C_{Gm}')); the field concept symbol of the first element of this new table is taken as the field of text T.
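Steps (6a) to (6g), together with the title rule of claim 5, amount to a frequency table followed by a descending sort. A minimal Python sketch is given below; it assumes that ties are resolved in favor of the field that appears first (a stable sort over a table built in order of first appearance) and that statement groups without a field are skipped. The function name and example symbols are hypothetical.

```python
from typing import Dict, List, Optional


def text_field(group_fields: List[Optional[str]],
               title_field: Optional[str] = None) -> Optional[str]:
    """Return the field of a text from its statement-group fields."""
    if title_field is not None:
        # Title rule: the field of the title is the field of the text.
        return title_field

    # (6a)-(6f): HTab maps each distinct field to its count. A dict keeps
    # insertion order, i.e. the order in which fields first appear in D.
    htab: Dict[str, int] = {}
    for d in group_fields:
        if d is None:          # assumption: groups without a field are ignored
            continue
        htab[d] = htab.get(d, 0) + 1
    if not htab:
        return None

    # (6g): sort by count, largest first. sorted() is stable even with
    # reverse=True, so equal counts keep their first-appearance order.
    ranked = sorted(htab.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0]


winner = text_field(["a12", "q7", "a12", "d2"])
```

Counting each field once in a single pass gives the same table as the per-element counting in (6a) to (6e), which skips symbols already stored in HTab.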
CN2009100770180A 2009-01-16 2009-01-16 Acquisition system and method of text field based on concept symbols Expired - Fee Related CN101645083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100770180A CN101645083B (en) 2009-01-16 2009-01-16 Acquisition system and method of text field based on concept symbols

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100770180A CN101645083B (en) 2009-01-16 2009-01-16 Acquisition system and method of text field based on concept symbols

Publications (2)

Publication Number Publication Date
CN101645083A CN101645083A (en) 2010-02-10
CN101645083B true CN101645083B (en) 2012-07-04

Family

ID=41656971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100770180A Expired - Fee Related CN101645083B (en) 2009-01-16 2009-01-16 Acquisition system and method of text field based on concept symbols

Country Status (1)

Country Link
CN (1) CN101645083B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937462B (en) * 2010-09-03 2016-08-24 中国科学院声学研究所 Literature review automatic searching method and system
CN104281566A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Semantic text description method and semantic text description system
US11775850B2 (en) 2016-01-27 2023-10-03 Microsoft Technology Licensing, Llc Artificial intelligence engine having various algorithms to build different concepts contained within a same AI model
US10803401B2 (en) * 2016-01-27 2020-10-13 Microsoft Technology Licensing, Llc Artificial intelligence engine having multiple independent processes on a cloud based platform configured to scale
US11868896B2 (en) 2016-01-27 2024-01-09 Microsoft Technology Licensing, Llc Interface for working with simulations on premises
US11120299B2 (en) 2016-01-27 2021-09-14 Microsoft Technology Licensing, Llc Installation and operation of different processes of an AI engine adapted to different configurations of hardware located on-premises and in hybrid environments
US11841789B2 (en) 2016-01-27 2023-12-12 Microsoft Technology Licensing, Llc Visual aids for debugging
CN106250398B (en) * 2016-07-19 2020-03-27 北京京东尚科信息技术有限公司 Method and device for classifying and judging complaint content of complaint event
CN106294186A (en) * 2016-08-30 2017-01-04 深圳市悲画软件自动化技术有限公司 Intelligence software automated testing method
CN108153734A (en) * 2017-12-26 2018-06-12 北京嘉和美康信息技术有限公司 A kind of text handling method and device
CN110413989B (en) * 2019-06-19 2020-11-20 北京邮电大学 Text field determination method and system based on field semantic relation graph
CN112699237B (en) * 2020-12-24 2021-10-15 百度在线网络技术(北京)有限公司 Label determination method, device and storage medium
CN117875908A (en) * 2024-03-08 2024-04-12 蒲惠智造科技股份有限公司 Work order processing method and system based on enterprise management software SAAS

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001344256A (en) * 2000-06-01 2001-12-14 Matsushita Electric Ind Co Ltd Word class automatic determination device, example sentence retrieval device, medium, and information aggregate
JP2002259371A (en) * 2001-03-02 2002-09-13 Nippon Telegr & Teleph Corp <Ntt> Method and device for summarizing document, document summarizing program and recording medium recording program
CN101067808A (en) * 2007-05-24 2007-11-07 上海大学 Text key word extracting method
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN101281530A (en) * 2008-05-20 2008-10-08 上海大学 Key word hierarchy clustering method based on conception deriving tree

Also Published As

Publication number Publication date
CN101645083A (en) 2010-02-10

Similar Documents

Publication Publication Date Title
CN101645083B (en) Acquisition system and method of text field based on concept symbols
CN106570179B (en) A kind of kernel entity recognition methods and device towards evaluation property text
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
US20210056571A1 (en) Determining of summary of user-generated content and recommendation of user-generated content
CN101470732B (en) Auxiliary word stock generation method and apparatus
CN109829166B (en) People and host customer opinion mining method based on character-level convolutional neural network
CN105550269A (en) Product comment analyzing method and system with learning supervising function
Alemi et al. Text segmentation based on semantic word embeddings
CN106021410A (en) Source code annotation quality evaluation method based on machine learning
CN101751455B (en) Method for automatically generating title by adopting artificial intelligence technology
CN106202372A (en) A kind of method of network text information emotional semantic classification
CN104778209A (en) Opinion mining method for ten-million-scale news comments
CN101127042A (en) Sensibility classification method based on language model
CN110879831A (en) Chinese medicine sentence word segmentation method based on entity recognition technology
CN106599032A (en) Text event extraction method in combination of sparse coding and structural perceptron
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN112051986B (en) Code search recommendation device and method based on open source knowledge
CN109145260A (en) A kind of text information extraction method
CN110134799B (en) BM25 algorithm-based text corpus construction and optimization method
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN108763348A (en) A kind of classification improved method of extension short text word feature vector
CN110674296B (en) Information abstract extraction method and system based on key words
CN113360647B (en) 5G mobile service complaint source-tracing analysis method based on clustering
CN102360436B (en) Identification method for on-line handwritten Tibetan characters based on components
CN108038099A (en) Low frequency keyword recognition method based on term clustering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120704

Termination date: 20160116

EXPY Termination of patent right or utility model