CN101645083B

CN101645083B - Acquisition system and method of text field based on concept symbols

Info

Publication number: CN101645083B
Application number: CN2009100770180A
Authority: CN
Inventors: 韦向峰; 黄曾阳; 张全; 缪建明
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2009-01-16
Filing date: 2009-01-16
Publication date: 2012-07-04
Anticipated expiration: 2029-01-16
Also published as: CN101645083A

Abstract

The invention discloses a system and a method for acquiring the field of a text based on concept symbols. The system comprises a concept symbol set for expressing word concepts and field categories, a word knowledge base for storing words and their concept symbols, a word segmentation processor, a statement semantic analyzer and a field arbiter. The method comprises the following steps: (1) segmenting an input text into paragraphs, statements and words; (2) carrying out semantic analysis on the statements to obtain the concept categories and semantic blocks of the statements; (3) obtaining the activating words in the statements according to the field concept symbol set and the semantic concept symbols in the word knowledge base; (4) comprehensively scoring the field concept symbols of the activating words and taking the field concept symbol with the highest score as the field of the statement; (5) merging the statements in each paragraph according to their field concept symbols to obtain statement groups and their fields; and (6) obtaining the field of the text according to the title of the text and the frequency of occurrence and position of the statement groups in the text.

Description

System and method for acquiring the field of a text based on concept symbols

Technical field

The present invention relates to the processing of written and spoken language information in text by means of computer science and technology, and in particular to a system and method for acquiring the field of a text based on concept symbols.

Background technology

Text classification is the method and process of using a computer to assign a text to one or more field categories according to certain rules, knowledge and steps. The conventional approach represents each text as a feature vector; when the "angle" between the feature vectors of two texts is smaller than a given threshold, the two texts are assigned to the same category. Words are generally chosen as the text features that make up the feature vector. The feature vector is usually built with the TF*IDF method or with the TF*IWF method derived from it. In TF*IDF, the value of a feature word in the feature vector is the product of the word's frequency of occurrence in the document and the inverse of its frequency of occurrence in the document collection. The k-nearest-neighbor method, the Bayes method, support vector machines, neural networks, decision trees and similar text classification methods are all statistical methods built on the vector space model of text. Before classification they require a large set of pre-classified texts for parameter training; only after training can a new text be assigned to one of the predefined categories. Chinese patent document (publication number CN100353361) discloses a method and apparatus for a new feature-vector weight for text classification, which introduces the n-th root of DBV and TF on the basis of the TF*IWF method. In its experiments, feature words were selected by word frequency for each classification field, with feature-word counts of 50, 100, 200, 500, 1000, 1500, 2000, 2500, 3000, 3500 and 4000, and the experimental system was found to perform better with 3500 words.

Because text classification requires the set of field categories and the classification criteria to be known in advance, it is difficult to apply when the categories are uncertain or a training text set is hard to obtain. Text clustering emerged for this reason. A typical representative of the commonly used text clustering methods is the K-Means algorithm: K texts are first chosen arbitrarily from the text set as cluster centers, and each remaining text is assigned to the cluster whose center is nearest in feature-vector "distance"; the mean of the feature vectors of all texts in each of the K clusters then becomes the new cluster center, all texts are reassigned according to their distance to the new centers, and the computation iterates until the evaluation function converges. However, the field categories produced by automatic clustering are very coarse, and because the process lacks guidance about the different kinds of fields, its results are hard to fit to practical needs. Moreover, the same clustering method may work well on one text set and very poorly on another; that is, text clustering has shortcomings in both practicality and stability.

In summary, statistical text classification methods need a large pre-classified training corpus, which is often difficult to provide at classification time. Text clustering overcomes this shortcoming, but its results are hard to reconcile with the actual classification requirements.

Summary of the invention

To overcome the above problems of the prior art, the invention provides a system and method for acquiring the field of a text based on concept symbols. The classification criteria are configurable and the classification method is rule-based; the basic field classification of a text can be obtained without a training corpus, the classification categories can be customized to actual needs, and the result can also be used for automatic text clustering.

To achieve the above object, the system for acquiring the field of a text based on concept symbols provided by the invention, as shown in Fig. 1, comprises:

A field concept symbol set, which expresses word concepts and field categories and supplies the field arbiter with the field concept symbols it requires.

A word knowledge base, which stores words and their concept symbols and supplies the word segmentation processor and the statement semantic analyzer with the words and semantic concept symbols they require.

A word segmentation processor, which splits the input text into paragraphs, statements and words and passes them to the statement semantic analyzer.

A statement semantic analyzer, which carries out semantic analysis on each statement to obtain the concept category of the statement and the semantic chunks that make it up, including the role, boundaries and internal composition of each semantic chunk.

A field arbiter, which obtains the activation words in a statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base; comprehensively scores the field concept symbols of the activation words according to the semantic chunk type, the relations between field concept symbols, the frequency of occurrence and the position of occurrence of the activation words in the statement, and takes the field concept symbol with the highest score as the field of the statement; then merges the statements in each paragraph according to their field concept symbols to obtain sentence groups and their fields; and finally obtains the field of the input text according to the title of the input text and the frequency of occurrence and position of the sentence groups in the input text.

Wherein, the role types of the semantic chunks are: feature semantic chunk E, actor semantic chunk A, object semantic chunk B and content semantic chunk C. The feature semantic chunk type E is divided into two kinds: a) the global feature semantic chunk Eg, which is the feature semantic chunk E at the first level of the statement; b) the local feature semantic chunk El, which is the feature semantic chunk E of a nested statement S' when a statement S' is nested inside a semantic chunk.

Wherein, the field concept symbol set comprises the following upper-level node symbols:

"71, 72" express psychological activity and mental state; "8" expresses thinking activity; "a, b" express professional activities and pursuits (the second type of work); "d" expresses theoretical activity; "q6" expresses the first type of work; "q7" expresses extra-professional activity; "q8" expresses faith activity; "6m" expresses instinctive activity, where m = 0 to 5; "3228α" expresses calamity, where α = 8 to b; "503, 50α" express state, where α = 8 to b.

Field concept upper-level node	Field expressed
71, 72	Psychological activity and mental state
8	Thinking activity
a, b	Professional activities and pursuits (second type of work)
d	Theoretical activity
q6	First type of work
q7	Extra-professional activity
q8	Faith activity
6m (m = 0 to 5)	Instinctive activity
3228α (α = 8 to b)	Calamity
503, 50α (α = 8 to b)	State

Each upper-level node extends downward to more specific field concept node symbols.

Wherein, the field arbiter determines the field of a statement S as follows. First, the type of the semantic chunk in which each activation word lies is obtained from the result of the sentence category analysis (SCA). Then the field of statement S is determined by examining the semantic chunk types in the order global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > (object semantic chunk B or actor semantic chunk A). When several activation words (W_1, W_2, ..., W_n) lie in semantic chunks of the same type, with corresponding field concept symbols (D_1, D_2, ..., D_n), the score of each field concept symbol in the statement is computed with the following formula:

S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n;

Here Rel(i) is the relation score of the i-th field concept symbol D_i with respect to the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the statement; Fre(i) reflects the frequency of occurrence of D_i in statement S, and the higher the frequency the larger its value; Pos(i) reflects the position at which D_i occurs in statement S, and the later the position the larger its value. The field concept symbol D_i with the highest score S(D_i) is taken as the field of statement S.

The method for acquiring the field of a text based on concept symbols provided by the invention, as shown in Fig. 2, comprises the following steps:

(1) Paragraph, statement and word segmentation: the word segmentation processor splits the input text into paragraphs, statements and words.

An input text is treated in the computer as a character string T. Using the carriage return and line feed characters in T as cut points, the text T is split into several paragraphs P. Using characters such as the full stop, question mark, exclamation mark and semicolon in a paragraph P as cut points, P is split into several statements S.

A statement S consists of Chinese characters and other characters. Let A, B and C be Chinese characters occurring in S. If "AB" is a word in the word knowledge base, "ABC" is segmented as "AB/C"; likewise, if "BC" is a word in the dictionary, "ABC" is segmented as "A/BC". If both "AB" and "BC" are words in the dictionary, "ABC" is segmented as "A/BC" according to the left-cut principle; if "ABC" itself is a word in the dictionary, it is segmented as "/ABC/" according to the longest-word principle. In this way statement S is segmented into several words W, and word segmentation is complete.
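Step (1) can be illustrated with a minimal Python sketch. It is not the patented segmenter: the function names, the `max_len` limit and the exact punctuation set are assumptions made for illustration. A backward maximum-matching pass is used because it happens to reproduce all four "ABC" cases stated above; whether the word segmentation processor actually works this way is not stated in the text.

```python
import re

def split_paragraphs(text):
    # Paragraph cut points: carriage return and line feed characters.
    return [p.strip() for p in re.split(r"[\r\n]+", text) if p.strip()]

def split_statements(paragraph):
    # Statement cut points: full stop, question mark, exclamation mark, semicolon.
    # Assumption: full-width forms plus ASCII ?, ! and ; are accepted.
    parts = re.split(r"(?<=[。？！；?!;])", paragraph)
    return [s.strip() for s in parts if s.strip()]

def segment_words(statement, lexicon, max_len=4):
    # Backward maximum matching (an assumed strategy): scan from the end of the
    # statement and always take the longest lexicon word ending at the cursor.
    words = []
    end = len(statement)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = statement[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words
```

With the lexicon `{"AB", "BC"}` the call `segment_words("ABC", {"AB", "BC"})` yields `["A", "BC"]`, and adding `"ABC"` to the lexicon yields `["ABC"]`, matching the left-cut and longest-word cases described in the text.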

(2) Statement semantic analysis: the statement semantic analyzer carries out semantic analysis on each statement to obtain the concept category of the statement and the semantic chunks that make it up, including the role, boundaries and internal composition of each semantic chunk.

For each statement S, the analysis yields its semantic category (sentence category) code SCode, its format code SFomat, its sentence category expression SExpression, and the kind, extent and specific name in the sentence category expression of each semantic chunk that makes up the statement. In particular, the type of each semantic chunk is determined as E (feature semantic chunk), A (actor semantic chunk), B (object semantic chunk) or C (content semantic chunk). The feature semantic chunk type E is further divided into two kinds: Eg (global feature semantic chunk), which is the feature semantic chunk E at the first level of the statement, and El (local feature semantic chunk), which is the feature semantic chunk E of a nested statement S' when a statement S' is nested inside a semantic chunk.

(3) Obtaining the activation words: the field arbiter obtains the activation words in a statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base.

An activation word is a word in statement S that carries a field concept symbol. Each entry of the word knowledge base comprises: word form, tone, number of senses, sense number, concept category, word frequency and context, semantic knowledge, sentence category code, format conversion, S, K, CA and CT. The semantic knowledge is expressed in the symbols of the concept primitive system, and the field symbols are a subset of the concept primitive symbol system, so the concept symbols of a word may carry field concept symbol information. Not every concept primitive node is used to describe fields. The upper-level nodes of the field-related concepts are: 71, 72 (psychological activity and mental state); 8 (thinking activity); a, b (professional activities and pursuits (second type of work)); d (theoretical activity); q6 (first type of work); q7 (extra-professional activity); q8 (faith activity); 6m (m = 0 to 5) (instinctive activity); 3228α (α = 8 to b) (calamity); 503, 50α (α = 8 to b) (state). These upper-level nodes extend downward to more specific field concept node symbols. For example, a (professional activity) extends downward to: a1 (politics), a2 (economy), a3 (culture), a4 (military affairs), a5 (law), a6 (science and technology), a7 (education), a8 (health and protection); and a1 (politics) extends downward in turn to: a11 (regime activity), a113 (change of top leadership (national or local government)), a113b (election).

The field concept symbols and the concept symbols of the semantic knowledge in the word knowledge base use the same concept primitive symbol system. When an upper-level field node, or a node derived from one, occurs in the concept symbols of the semantic knowledge of a word W, W is an activation word. A field concept symbol expresses a field at a certain level or of a certain type; all field concept symbols carried by the activation words in statement S are taken as the candidate fields of S.
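The activation test can be sketched as a prefix check, under the assumption that a derived node is written by appending characters to its upper-level node (as in a, a1, a11, a113, a113b). The names `FIELD_UPPER_NODES`, `field_symbols` and `activation_words`, the toy knowledge base layout, and the reading of the ranges "m = 0 to 5" and "α = 8 to b" as the characters 0..5 and 8, 9, a, b are assumptions for illustration only.

```python
# Upper-level field nodes from the table, expanded to concrete prefixes.
FIELD_UPPER_NODES = tuple(
    ["71", "72", "8", "a", "b", "d", "q6", "q7", "q8", "503"]
    + ["6" + m for m in "012345"]      # 6m, m = 0..5 (assumed digit range)
    + ["3228" + x for x in "89ab"]     # 3228α, α = 8..b (assumed 8, 9, a, b)
    + ["50" + x for x in "89ab"]       # 50α, α = 8..b (assumed 8, 9, a, b)
)

def field_symbols(concept_symbols):
    # Keep the symbols that equal an upper-level field node or extend one.
    return [s for s in concept_symbols if s.startswith(FIELD_UPPER_NODES)]

def activation_words(words, knowledge_base):
    # knowledge_base: hypothetical mapping word -> list of concept symbols.
    result = []
    for position, word in enumerate(words):
        fields = field_symbols(knowledge_base.get(word, []))
        if fields:
            result.append((position, word, fields))
    return result
```

Real concept symbols in the word knowledge base may be compound expressions rather than single strings; a production implementation would need to parse them before applying a check of this kind.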

(4) Determining the field of a statement: the field arbiter comprehensively scores the field concept symbols of the activation words according to the semantic chunk type, the relations between field concept symbols, the frequency of occurrence and the position of occurrence of the activation words in the statement, and takes the field concept symbol with the highest score as the field of the statement.

In step (4), the field of a statement derives from the field concept symbols of its activation words. When statement S contains several activation words, its field is determined as follows. First, the type of the semantic chunk in which each activation word lies is obtained from the result of the sentence category analysis (SCA). Then the field of statement S is determined by examining the semantic chunk types in the order global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > object semantic chunk B or actor semantic chunk A. That is, if Eg contains an activation word W, the field concept symbol of W is taken as the field of the statement; if Eg contains no activation word, the field is taken from El; if El contains none, from C; and if C contains none, from B or A.
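The priority order over semantic chunk types can be sketched as follows. The tuple layout `(word, chunk_type, field_symbol)` and the function name are assumptions; the chunk labels Eg, El, C, B and A are those used in the text.

```python
# Tiers are examined in order; B and A share the lowest tier.
CHUNK_PRIORITY = [("Eg",), ("El",), ("C",), ("B", "A")]

def candidates_by_priority(activations):
    """Return the activation words of the highest-priority chunk type present.

    activations: list of (word, chunk_type, field_symbol) tuples.
    """
    for tier in CHUNK_PRIORITY:
        hits = [item for item in activations if item[1] in tier]
        if hits:
            return hits
    return []
```

If the returned list holds a single activation word, its field concept symbol is the field of the statement; if it holds several, the scoring formula is applied to them.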

When semantic chunks of the same type contain several activation words (W_1, W_2, ..., W_n), with corresponding field concept symbols (D_1, D_2, ..., D_n), the score of each field concept symbol in the statement is computed with the formula S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n. In this formula, Rel(i) is the relation score of the i-th field concept symbol D_i with respect to the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the statement; Fre(i) reflects the frequency of occurrence of D_i in statement S, and the higher the frequency the larger its value; Pos(i) reflects the position at which D_i occurs in statement S, and the later the position the larger its value. The field concept symbol D_i with the highest score S(D_i) is taken as the field of statement S.

The value of Rel(i) comes from the relation between the field concept symbols D_i and D_j. When D_i is a concept extension of D_j, 1 is added to the score of D_i; when D_i is strongly related to D_j, 1 is added to the score of D_i. If, after S(D_i) has been computed, D_i is the field of the statement but D_i is preceded by a negating concept that modifies it, then D_i' (the opposite field concept symbol) is taken as the field of the statement. If, after S(D_i) has been computed, D_i is the field of the statement, the Rel + Fre score of D_j equals that of D_i, and D_i and D_j are child nodes of the same concept node, then the field concept symbol of their common parent node is taken as the field of the statement.

If one activation word W_i (1 ≤ i ≤ n) carries several field concept symbols (D_i1, D_i2, ..., D_im), the field score must be computed for each of these m symbols, except that the relations between D_ij (1 ≤ j ≤ m) and D_ik (k ≠ j, 1 ≤ k ≤ m) need not be considered when computing Rel. If the final scores S(D_ij) and S(D_ik) are still equal, the field concept symbol that comes first in the word knowledge base is taken as the field of statement S.
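A rough sketch of the scoring formula follows. The text does not give numeric definitions of Fre and Pos, so the choices below (Fre as a raw count, Pos as the normalized word position) are assumptions, as are the function names and the `strongly_related` parameter; "concept extension" is approximated by a string-prefix test. The negation rule, the common-parent rule and the tie-break on knowledge base order are not covered.

```python
def is_extension(child, parent):
    # Assumption: a derived node is written by appending to its parent symbol.
    return child != parent and child.startswith(parent)

def score_fields(candidates, statement_length, strongly_related=frozenset()):
    """Score field symbols of activation words that share one chunk type.

    candidates: list of (word_position, field_symbol), positions 0-based.
    strongly_related: hypothetical set of frozenset({d1, d2}) pairs.
    """
    symbols = [d for _, d in candidates]
    distinct = set(symbols)
    scores = {}
    for position, d in candidates:
        others = distinct - {d}
        rel = sum(1 for o in others if is_extension(d, o))
        rel += sum(1 for o in others if frozenset((d, o)) in strongly_related)
        fre = symbols.count(d)                       # assumed: raw frequency
        pos = (position + 1) / statement_length      # assumed: later is larger
        scores[d] = max(scores.get(d, 0.0), rel + fre + pos)
    return scores

def statement_field(candidates, statement_length, strongly_related=frozenset()):
    scores = score_fields(candidates, statement_length, strongly_related)
    return max(scores, key=scores.get)
```

With two candidates that have equal relation and frequency scores, such as a339 at word position 4 and q734 at word position 5 of a six-word statement, the later symbol q734 scores higher under these assumptions.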

(5) Determining the sentence groups and their fields: the field arbiter merges the statements in each paragraph according to their field concept symbols to obtain sentence groups and their fields.

A sentence group consists of consecutive statements that describe the same central topic. The central topic of a sentence group is the topic or field expressed by identical or similar field concept symbols. The smallest sentence group is a single statement and the largest is a whole paragraph. In step (5), for the statements (S_1, S_2, ..., S_n) of a paragraph P_i of text T, the sentence group to which each statement belongs is determined by the following steps, as shown in Fig. 3:

(5a) Take the first statement S_1 as sentence group G_1, and take the field D_1 of S_1 as the field D_G1 of G_1;

(5b) Take S_1 as the current statement S_i and G_1 as the current sentence group G_j, and go to (5g);

(5c) If the field D_i of S_i is a symbol extension of the field D_(i-1) of S_(i-1), then statement S_i is added to G_j, the field of G_j is changed to D_i, and go to (5g);

(5d) If the field D_(i-1) of S_(i-1) is a symbol extension of the field D_i of S_i, then statement S_i is added to G_j, and go to (5g);

(5e) If the field D_i of the current statement S_i is identical to the field D_(i-1) of the previous statement S_(i-1), then statement S_i is added to G_j, and go to (5g);

(5f) Take the statement S_(i+1) following S_i as a new sentence group G_(j+1), whose field D_G(j+1) is the field D_(i+1) of statement S_(i+1);

(5g) If the current statement S_i is the last statement S_n, go to (5n);

(5k) If the field of S_i is empty and S_i is S_1, then statement S_2 is added to G_1, the field of G_1 is changed to D_2, S_2 becomes the current statement S_i, and go to (5c);

(5l) If the field of S_i is empty and S_i is not S_1, then statement S_i is added to G_j, and go to (5g);

(5m) If the field of S_i is not empty, then S_(i+1) becomes the current statement S_i, and go to (5c);

(5n) For all sentence groups G_j obtained, adjacent sentence groups with identical fields are merged into one sentence group, where 1 ≤ j ≤ m and 1 ≤ m ≤ n.

Through the above steps and the merging operation, a paragraph is divided into several sentence groups whose fields are determined from the fields of their statements, which accomplishes both the division of the paragraph into sentence groups and the determination of their fields.
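One possible reading of steps (5a) to (5n) is sketched below as a single loop instead of the jump-based flow of Fig. 3. It is an interpretation, not the authoritative flow: "symbol extension" is approximated by a string-prefix test, each statement is compared with the most recent statement that has a non-empty field, and the function and key names are assumptions.

```python
def is_extension(child, parent):
    # Assumption: a derived node is written by appending to its parent symbol.
    return child != parent and child.startswith(parent)

def group_statements(fields):
    """Group the statements of one paragraph by field.

    fields: one field symbol per statement, or None when the field is empty.
    Returns a list of {"members": [statement indexes], "field": symbol}.
    """
    groups = []
    prev = None  # field of the most recent statement that had one
    for i, d in enumerate(fields):
        if not groups:
            groups.append({"members": [i], "field": d})           # (5a)
        elif d is None:
            groups[-1]["members"].append(i)                       # (5l)
        elif prev is None:
            groups[-1]["members"].append(i)                       # (5k)
            groups[-1]["field"] = d
        elif is_extension(d, prev):
            groups[-1]["members"].append(i)                       # (5c)
            groups[-1]["field"] = d
        elif is_extension(prev, d) or d == prev:
            groups[-1]["members"].append(i)                       # (5d), (5e)
        else:
            groups.append({"members": [i], "field": d})           # (5f)
        if d is not None:
            prev = d
    merged = []                                                   # (5n)
    for g in groups:
        if merged and merged[-1]["field"] == g["field"]:
            merged[-1]["members"].extend(g["members"])
        else:
            merged.append(g)
    return merged
```

For the field sequence q734, q73, q734, empty, a1, a11 this reading produces two sentence groups: statements 0 to 3 with field q734, and statements 4 and 5 with field a11.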

(6) Determining the field of the text: the field arbiter obtains the field of the input text according to the title of the text and the frequency of occurrence and position of the sentence groups in the input text.

Step (6) further comprises: if the input text has a title, the field of the title is taken as the field of the input text. If the title paragraph P_1 contains only one sentence group, the field of that sentence group is the field of the text; if P_1 contains several sentence groups, the fields of the first and the last sentence group in P_1 are jointly taken as the field of the text.

If the text has no title, the fields of all sentence groups in the text are candidate fields of the text. The fields of the n sentence groups of text T, in order of appearance, are written D = (D_G1, D_G2, ..., D_Gn), and the following steps are carried out from D_G1 to D_Gn, as shown in Fig. 4:

(6a) Take D_G1 as D_Gi, count the number C_Gi of fields in D whose field concept symbol is identical to D_Gi, and store D_Gi and C_Gi in the table HTab;

(6b) If D_Gi is D_Gn, go to (6f);

(6c) Take D_G(i+1) as D_Gi;

(6d) If the field concept symbol of D_Gi has already been stored in the table HTab, go to (6c);

(6e) Count the number C_Gi of fields in D whose field concept symbol is identical to D_Gi, store D_Gi and C_Gi in the table HTab, and go to (6b);

(6f) Obtain the table HTab = ((D_G1, C_G1), ..., (D_Gm, C_Gm)), where 1 ≤ m ≤ n;

(6g) Sort the elements (D_Gj, C_Gj), 1 ≤ j ≤ m, of the table HTab in descending order of C_Gj to obtain the new table HTab' = ((D_G1', C_G1'), ..., (D_Gm', C_Gm')).

The field concept symbol of the first element of the new table is taken as the field of text T. When text T has no title, its field is obtained with the above steps.
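Step (6) can be sketched as a frequency table kept in order of first appearance and then sorted with a stable sort, so that among equally frequent fields the one appearing first in the text wins. The function name, its parameters and the list return type are assumptions for illustration.

```python
def text_field(group_fields, title_group_fields=None):
    """Return the field(s) of a text as a list of field concept symbols.

    group_fields: fields of all sentence groups in order of appearance.
    title_group_fields: fields of the sentence groups of the title paragraph,
    or None when the text has no title.
    """
    if title_group_fields:
        # One group: its field. Several groups: first and last jointly.
        chosen = [title_group_fields[0], title_group_fields[-1]]
        return list(dict.fromkeys(chosen))  # drop a duplicate, keep order
    counts = {}  # plays the role of HTab; dicts keep insertion order
    for d in group_fields:
        if d is None:
            continue
        counts[d] = counts.get(d, 0) + 1
    # sorted() is stable, so ties keep their order of first appearance.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [ranked[0][0]] if ranked else []
```

For example, `text_field(["q734", "a339", "q734", "a1"])` returns `["q734"]`, and a title paragraph with the single group field q734 gives the same result regardless of the body.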

The advantages of the invention are:

1. When used for text classification, the text field acquisition system and method provided by the invention do not need a large pre-classified training corpus; only the field concept symbols relevant to the classification categories need to be determined.

2. The field concept symbols of the system and method are hierarchical, so they can handle a broad variety of categories at the same level as well as cross-level classification into fine-grained categories.

3. The system and method mainly use semantic analysis and descend into the concept hierarchy to determine the field category of a text, while also introducing the statistical feature of frequency, which makes the acquisition of the text field more accurate and better suited to processing large-scale text.

4. The sentence group fields proposed by the system and method can be used for text classification, and also for cluster analysis and topic analysis of text.

Description of drawings

Fig. 1 is a structural diagram of the text field acquisition system of the invention;

Fig. 2 is a flow chart of the text field acquisition method of the invention;

Fig. 3 is a flow chart of the method of the invention for determining sentence groups and their fields;

Fig. 4 is a flow chart of the method of the invention for acquiring the text field when the text has no title.

Embodiment

The invention is described in detail below with reference to a specific embodiment and the accompanying drawings.

First, 11 news reports on competitions at the Athens 2004 Olympic Games were downloaded from the Internet, totaling 60 paragraphs and 6501 Chinese characters.

Second, following the design principles and design symbols in "Fundamental Theorems of the Language Concept Space and Mathematical-Physical Expressions" (Ocean Press, July 2004), the concept symbols of the field q73 (competition) were refined, yielding a concept symbol set for the competition field. The words related to competition in the word knowledge base, and their semantic knowledge, were enriched at the same time.

Third, the word segmentation processor is used to split a text into paragraphs, statements and words. Take the following text as an example. Title: Malaysia's "little standard-bearer" misses the diving semifinals by one place

www.xinhuanet.com, Athens, August 27: In the men's 10 m platform diving competition of the Olympic Games held on the afternoon of the 27th local time, Brian-Nickerson of Malaysia ranked 19th in the preliminary and failed to advance to the semifinals. According to the rules, of the 33 competitors in the preliminary, those whose results place them in the top 18 advance to the semifinals.

After processing by the word segmentation processor, the result is as follows:

[Title:] [Malaysia] ["little standard-bearer"] [one place short] [not advancing] [diving] [semifinals]

[www.xinhuanet.com] [Athens] [August 27] (dispatch) in the [Olympic Games] [men's] [ten-meter] [platform] [diving] [competition] [held] on the [afternoon] of the [27th], [local] [time]

[from] [Malaysia], Brian-Nickerson's [preliminary] [result] [ranked] [19th]

[failing] [to advance to] the [semifinals]

[according to] the [rules]

among the [33] [competitors] in the [preliminary]

the [competitors] whose [results] place them in the top [18] [advance to] the [semifinals]

Fourth, the statement semantic analyzer analyzes the statements, and the field arbiter then obtains the activation words and analyzes the sentence groups and their fields. After the sentence group fields are merged, the following result is obtained:

//DOM：(q734)

Title: [Malaysia] ["little standard-bearer"] misses by one place the [diving (a339)] [semifinals (q734)]

www.xinhuanet.com, [Athens (a219)], August 27: In the [Olympic Games (a339i)] [men's] ten-meter [platform (a339ing)] [diving (a339)] [competition (q73)] [held (a02)] on the [afternoon] of the 27th, [local] [time], Brian-Nickerson [from] [Malaysia] [ranked (q730e25d0 [n])] 19th in the [preliminary (q734)] [results (a0099b)] and [failed] [to advance to (a01ad0ne25)] the [semifinals (q734)]. [According to] the [rules (a009a9)], of the 33 [competitors (q730)] in the [preliminary (q734)], the [competitors (q730)] whose [results (a0099b)] place them in the top 18 [advance to (a01ad0ne25)] the [semifinals (q734)].

In this text, the first statement is "Title: Malaysia's 'little standard-bearer' misses the diving semifinals by one place", and its semantic analysis result is "Title: Malaysia's 'little standard-bearer' (SB) || misses by one place (S0) || diving semifinals (SC)". Because the global feature semantic chunk Eg (here S0) carries no field concept symbol information, the field of the statement is chosen from the content semantic chunk C (here SC), which does carry field information. Both "diving" and "semifinals" in the SC chunk carry field concept symbol information. In the score computation their relation scores and frequency scores are equal, but the position score of "semifinals" is higher than that of "diving", so the field of the statement is "q734". The first paragraph consists of this single statement, so the whole paragraph is one sentence group, and the field of the sentence group is "q734". Because the first paragraph is the title of the text, the field of the text is also "q734".

In this way, from the field concept symbols of the activation words, and by analyzing the type of semantic chunk in which each activation word lies, the position of the word, its frequency and so on, the field of each statement and of each sentence group is obtained, and finally the field of the text.

Claims

1. A system for acquiring the field of a text based on concept symbols, characterized in that the system comprises:

A field concept symbol set, which expresses word concepts and field categories and supplies the field arbiter with the field concept symbols it requires;

A word knowledge base, which stores words and their concept symbols and supplies the word segmentation processor and the statement semantic analyzer with the words and semantic concept symbols they require;

A word segmentation processor, which splits the input text into paragraphs, statements and words and passes them to the statement semantic analyzer;

A statement semantic analyzer, which carries out semantic analysis on each statement to obtain the concept category of the statement and the semantic chunks that make it up, including the role, boundaries and internal composition of each semantic chunk;

A field arbiter, which obtains the activation words in a statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base; comprehensively scores the field concept symbols of the activation words according to the semantic chunk type, the relations between field concept symbols, the frequency of occurrence and the position of occurrence of the activation words in the statement, and takes the field concept symbol with the highest score as the field of the statement; then merges the statements in each paragraph according to their field concept symbols to obtain sentence groups and their fields; and finally obtains the field of the input text according to the title of the input text and the frequency of occurrence and position of the sentence groups in the input text; if the text has a title, the field of the title is taken as the field of the text; if the text has no title, the sentence group field with the highest frequency of occurrence, with the one appearing first in the text taking precedence, is taken as the field of the text;

Wherein, the field concept symbols in the field concept symbol set and the semantic concept symbols in the word knowledge base are based on the same concept primitive symbol system; the field concept symbol set comprises 10 classes of upper-level node symbols and the specific field concept symbols obtained by extending the upper-level nodes downward;

An activation word is defined as follows: when an upper-level node of the field concept symbols, or a node derived from one, occurs in the concept symbols of a word W, W is an activation word.

2. The text field acquisition system according to claim 1, characterized in that the role types of the semantic chunks are: feature semantic chunk E, actor semantic chunk A, object semantic chunk B and content semantic chunk C; the feature semantic chunk type E is divided into two kinds: a) the global feature semantic chunk Eg, which is the feature semantic chunk E at the first level of the statement; b) the local feature semantic chunk El, which is the feature semantic chunk E of a nested statement S' when a statement S' is nested inside a semantic chunk.

3. the system that obtains of text field according to claim 1 is characterized in that, said field concept glossary of symbols comprises following upper level node symbol:

Field concept upper-level node    Field represented
71, 72                            Psychological activity and state of mind
8                                 Human thinking activities
a, b                              Professional and pursuit activities
d                                 Theoretical activities
q6                                First-category work
q7                                Extra-professional activities
q8                                Faith activities
6m (m = 0~5)                      Instinctive activities
3228α (α = 8~b)                   Calamities
503, 50α (α = 8~b)                State
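The table above can be read as a set of patterns against which a word's concept symbols are tested. The following is a minimal sketch, not the patented implementation: it assumes that a node "occurs" in a concept symbol when the symbol begins with that node, and that the range 8~b covers 8, 9, a and b. The names UPPER_NODE_PATTERNS, matched_upper_node and is_activating_word, and the sample symbols in the usage note, are hypothetical.

```python
import re

# Upper-level nodes from the table, written as regular expressions.
# Assumptions: prefix matching, and "8~b" meaning the values 8, 9, a, b.
UPPER_NODE_PATTERNS = {
    "71, 72": r"7[12]",
    "8": r"8",
    "a, b": r"[ab]",
    "d": r"d",
    "q6": r"q6",
    "q7": r"q7",
    "q8": r"q8",
    "6m": r"6[0-5]",
    "3228α": r"3228[89ab]",
    "503, 50α": r"50[389ab]",
}


def matched_upper_node(concept_symbol):
    """Return the upper-level node that the symbol starts with, or None."""
    for node, pattern in UPPER_NODE_PATTERNS.items():
        if re.match(pattern, concept_symbol):  # re.match anchors at the start
            return node
    return None


def is_activating_word(concept_symbols):
    """A word is an activating word if any of its concept symbols matches a node."""
    return any(matched_upper_node(s) is not None for s in concept_symbols)
```

For example, under these assumptions a word whose concept symbols are ["j1", "q71"] would be an activating word, because "q71" derives from the upper-level node q7.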

4. The text field acquisition system according to claim 1, characterized in that the field arbiter determines the field of a statement S as follows: first, the type of the semantic block in which each activating word lies is obtained from the result of the statement semantic analysis; then, the field of the statement S is determined in the semantic block type priority order "global feature semantic block Eg > local feature semantic block El > content semantic block C > object semantic block B or actor semantic block A"; when there are several activating words W_1, W_2, ..., W_n in semantic blocks of the same type, and the field concept symbols corresponding to these activating words are D_1, D_2, ..., D_n respectively, the score of each field concept symbol in the statement is calculated by the following formula:

S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n;

where Rel(i) is the relationship score of the i-th field concept symbol D_i with respect to the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the statement; Fre(i) reflects the frequency of occurrence of the i-th field concept symbol D_i in the statement S, and the higher the frequency, the larger its value; Pos(i) reflects the position at which the i-th field concept symbol D_i occurs in the statement S, and the later the position, the larger its value; the field concept symbol D_i with the highest score S(D_i) is taken as the field of the statement S.
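The priority order and the score S(D_i) = Rel(i) + Fre(i) + Pos(i) can be illustrated with a short sketch. The claim only states that Fre grows with frequency and Pos grows with later position, so the concrete definitions below (relative frequency, normalized last position, and a caller-supplied Rel table defaulting to 0) are assumptions chosen for illustration, as are the names BLOCK_PRIORITY and statement_field.

```python
from collections import Counter

# Priority of semantic block types: Eg > El > C > (B or A).
BLOCK_PRIORITY = {"Eg": 0, "El": 1, "C": 2, "B": 3, "A": 3}


def statement_field(activations, rel=None):
    """Pick the field of a statement.

    activations: list of (field_symbol, block_type, position), one per
                 activating word; position is the word index in the statement.
    rel:         hypothetical mapping field_symbol -> relationship score Rel.
    """
    if not activations:
        return None
    rel = rel or {}

    # Keep only the activating words in the highest-priority block type present.
    best_rank = min(BLOCK_PRIORITY[block] for _, block, _ in activations)
    candidates = [(d, pos) for d, block, pos in activations
                  if BLOCK_PRIORITY[block] == best_rank]

    # Fre (assumed): share of the statement's activating words carrying this symbol.
    freq = Counter(d for d, _, _ in activations)
    # Pos (assumed): last position of the symbol, normalized to [0, 1).
    span = max(pos for _, _, pos in activations) + 1
    last_pos = {}
    for d, pos in candidates:
        last_pos[d] = max(pos, last_pos.get(d, pos))

    scores = {d: rel.get(d, 0.0) + freq[d] / len(activations) + last_pos[d] / span
              for d in last_pos}
    return max(scores, key=scores.get)
```

With these assumed definitions, a symbol in a global feature block always outranks one in a content or object block, and the additive score only breaks ties among symbols within the same block type.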

5. A method for acquiring the text field based on concept symbols, comprising the following steps:

(1) Paragraph, statement and word segmentation: the word segmentation processor segments the input text into paragraphs, statements and words;

(2) Statement semantic analysis: the statement semantic analyzer performs semantic analysis on each statement to obtain the concept category of the statement and the semantic blocks that constitute it, including the role, boundaries and internal composition of each semantic block;

(3) Obtaining activating words: the field arbiter obtains the activating words in the statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base; an activating word is defined as follows: when an upper-level node of a field concept symbol, or a node derived from it, occurs in the concept symbols of a word W, the word W is an activating word;

(4) Statement field discrimination: the field arbiter comprehensively scores the field concept symbols of the activating words according to the type of semantic block in which each activating word lies, the relationships between field concept symbols, their frequency of occurrence and their position of occurrence in the statement, and takes the highest-scoring field concept symbol as the field of the statement;

Wherein the field concept symbols are used to supply the statement field discrimination step with the field concept symbols it requires; the words and their concept symbols are used to supply the segmentation step and the statement semantic analysis step with the words and concept symbols they require;

(5) Statement group and field discrimination: the field arbiter merges the statements in a paragraph according to their field concept symbols, obtaining statement groups and their fields;

(6) Text field discrimination: the field arbiter obtains the field of the input text according to the text title and the frequency of occurrence and position of the statement groups in the input text; if the input text has a title, the field of the title is taken as the field of the input text; if the input text has no title, the statement-group field with the highest frequency of occurrence that appears earliest in the input text is taken as the candidate field for the field of the input text;

Wherein the field concept symbols in the field concept symbol set and the semantic concept symbols in the word knowledge base are based on the same system of concept primitive symbols; the field concept symbol set comprises 10 types of upper-level node symbols and the specific field concept symbols that extend downward from those upper-level nodes.

6. The text field acquisition method according to claim 5, characterized in that step (4) determines the field of a statement S as follows: first, the type of the semantic block in which each activating word lies is obtained from the result of the statement semantic analysis; then, the field of the statement S is determined in the semantic block type priority order "global feature semantic block Eg > local feature semantic block El > content semantic block C > object semantic block B or actor semantic block A"; when there are several activating words W_1, W_2, ..., W_n in semantic blocks of the same type, and the field concept symbols corresponding to these activating words are D_1, D_2, ..., D_n respectively, the score of each field concept symbol in the statement is calculated by the following formula:

S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n;

7. The text field acquisition method according to claim 5, characterized in that, in step (5), for the statements S_1, S_2, ..., S_n in a paragraph P_i of the text T, the statement group to which each statement belongs is determined by the following steps:

(5g) if the current statement S_i is the last statement S_n, go to (5n);
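Only step (5g) of this procedure appears here, so the following does not reproduce the claimed steps. It is a minimal sketch of merging statements by field under one simplifying assumption: a statement joins the current statement group whenever its field concept symbol equals that of the preceding statement. The function name merge_statement_groups is hypothetical.

```python
from itertools import groupby


def merge_statement_groups(statement_fields):
    """Group adjacent statements of one paragraph that share a field.

    statement_fields: field concept symbol of each statement, in order.
    Returns a list of (field, [statement indices]) pairs, one per group.
    """
    groups = []
    indexed = enumerate(statement_fields)
    # groupby starts a new run each time the field changes (assumed merge rule).
    for field, run in groupby(indexed, key=lambda pair: pair[1]):
        groups.append((field, [index for index, _ in run]))
    return groups
```

For a paragraph whose statements have the fields a, a, q7, a, this sketch yields three statement groups: statements 0 and 1 in field a, statement 2 in field q7, and statement 3 in field a.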

8. The text field acquisition method according to claim 5, characterized in that, if the text has no title, the fields of the n statement groups in the text T are recorded, in the order in which the statement groups appear, as D = (D_G1, D_G2, ..., D_Gn), and the text field is obtained by performing the following steps from D_G1 to D_Gn:

(6b) if D_Gi is D_Gn, go to (6f);

(6c) take D_G(i+1) as D_Gi;

(6g) the elements (D_Gj, C_Gj), 1 ≤ j ≤ m, of the table HTab are sorted by C_Gj in descending order, giving a new table HTab′ = ((D_G1′, C_G1′), ..., (D_Gm′, C_Gm′)); the field concept symbol of the first element of this new table is taken as the field of the text T.
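The final sorting step (6g) can be sketched as follows. The intermediate steps that build HTab are not all shown in this claim, so the counting loop below is an assumption standing in for them: it records each distinct statement-group field D_Gj with its count C_Gj in order of first appearance. The function name text_field and its parameters are hypothetical.

```python
def text_field(group_fields, title_field=None):
    """Return the field of a text.

    group_fields: fields of the statement groups, in order of appearance.
    title_field:  field of the title, or None when the text has no title.
    """
    # A title, when present, decides the field directly.
    if title_field is not None:
        return title_field

    # Build HTab as (D_Gj, C_Gj) pairs; dict keeps first-appearance order.
    counts = {}
    for field in group_fields:
        counts[field] = counts.get(field, 0) + 1

    # Step (6g): sort by count, largest first. Python's sort is stable, so
    # among equal counts the field that appeared first stays first.
    htab_sorted = sorted(counts.items(), key=lambda entry: entry[1], reverse=True)
    return htab_sorted[0][0] if htab_sorted else None
```

Relying on sort stability is a design choice that matches the tie-breaking rule stated in the independent claims, where the most frequent statement-group field that appears earliest is chosen.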