CN104731768B

CN104731768B - Event location extraction method for Chinese news text

Info

Publication number: CN104731768B
Application number: CN201510097822.0A
Authority: CN
Inventors: 何绯娟; 孙霞; 缪相林
Original assignee: Xian Jiaotong University City College
Current assignee: Xian Jiaotong University City College
Priority date: 2015-03-05
Filing date: 2015-03-05
Publication date: 2017-10-20
Anticipated expiration: 2035-03-05
Also published as: CN104731768A

Abstract

An event location extraction method for Chinese news text. First, the Chinese news text T is segmented with the ICTCLAS Chinese word segmentation tool, and the words whose part of speech is organization name, location noun, or place name are selected to form the candidate event location set. Next, for each word in the candidate set, a three-dimensional feature vector is built from a context feature, a position feature, and a topological feature. Finally, using these feature vectors, a Random Forest classifier performs binary classification of each candidate word as event location or non-event location, thereby extracting the event locations. The invention combines several kinds of features available in news text and can therefore go beyond place name recognition to identify where a news event actually occurred.

Description

Event location extraction method for Chinese news text

Technical field

The present invention relates to text mining, natural language processing, and public opinion analysis within computer science and technology, and in particular to an event location extraction method for Chinese news text.

Background

News text contains words and phrases such as organization names, location nouns, and place names, but these are not necessarily where the event occurred. For example, the news text "On June 19, 2012, during the G20 summit held in Los Cabos, Mexico, Argentine President Cristina handed British Prime Minister Cameron a letter concerning sovereignty over the Malvinas Islands" contains the three place names "Mexico", "Los Cabos", and "Malvinas Islands", yet "Malvinas Islands" is not where the event took place. How to identify the event location among organization names, location nouns, and place names is therefore an open problem in event extraction.

An existing patent related to event location extraction from Chinese news text is "A place name recognition method and device", publication number CN103186524A. It discloses a method and device for recognizing place names. The method segments the character string to be recognized to obtain candidate words, looks up the category of each candidate word in an address name library, and then traverses the candidate words. If the category of the current candidate word is the first category, the word is added to the candidate place name set as a place name. If its category is the second category, the current candidate word is combined with the candidate words associated with it in the address name library to form a compound word, which is added to the candidate place name set as a place name. That patent, however, can only recognize place names in text; it cannot identify event locations.

Summary of the invention

To overcome the shortcomings of the prior art described above, the object of the invention is to provide an event location extraction method for Chinese news text. The method extracts a context feature, a position feature, and a topological feature from the news text to form a feature vector, and uses a Random Forest classifier to identify event locations among the organization names, location nouns, and place names obtained by word segmentation. It can thus go beyond place name recognition to identify where a news event occurred.

To achieve the above object, the technical solution of the invention is as follows:

An event location extraction method for Chinese news text, comprising the following steps:

Step 1: Candidate event location extraction

(1) First, segment the Chinese news text T with the ICTCLAS Chinese word segmentation tool, producing a sequence of two-tuples S_T = (w₁, p₁), (w₂, p₂), ..., (w_i, p_i), ..., (w_n, p_n), where n is the number of words produced by segmentation, n > 0, w_i is a word produced by ICTCLAS, and p_i is the part of speech of w_i;

(2) From S_T, select in order every two-tuple that satisfies one of p_i = "ni", p_i = "nl", or p_i = "ns", which indicate that the corresponding w_i is an organization name, a location noun, or a place name, respectively. The w_i of all selected two-tuples form the set W_T = {w′₁, w′₂, ..., w′_j, ..., w′_m}, which serves as the set of candidate event locations; m is the number of words in W_T, m > 0;

Step 2: Feature vector construction

For each w′_j in W_T, three features are computed: the context feature of w′_j in news text T, the position feature of w′_j in T, and the topological feature of w′_j in T. They are calculated as follows:

Feature 1: context feature c_j of w′_j in news text T

The context feature of w′_j in news text T is the weight of the regular expression that w′_j matches, denoted c_j;

(1) If w′_j matches one regular expression in the following table within news text T, say the k-th, then c_j = a_k. Here a_k is the proportion of words matching the k-th regular expression that are event locations, computed as a_k = |L_k| / |S_k|, where S_k is the set of all words in the labeled data that match the k-th regular expression, L_k is the subset of S_k whose words are event locations in the labeled data, and the labeled data is a collection of news texts in which event locations have been annotated manually;

(2) If w′_j matches several regular expressions in the following table within news text T, say the k₁-th, k₂-th, ..., k_l-th (l > 1), then

(3) If w′_j matches none of the regular expressions in the following table within news text T, then c_j = 0.

Feature 2: position feature p_j of w′_j in news text T

p_j = loc(w′_j, T), where loc(w′_j, T) is the position at which word w′_j first occurs in news text T, i.e. the number of characters between the start of T and the first occurrence of w′_j.

Feature 3: topological feature t_j of w′_j in news text T

Initialize an empty set E. For every two-tuple (w′_j, w′_i) ∈ W_T × W_T, perform the following two steps:

STEP1: Concatenate the string "http://www.baike.com/wiki/" with w′_j to form a URL and download the corresponding page. If the page cannot be downloaded, do not process the two-tuple (w′_j, w′_i);

STEP2: Match the page file against the regular expression

/<a[^>]*href=["']([^'">]*)['"][^>]*>(.*)<\/a>/ig

to obtain all anchor texts. If an anchor text contains w′_i, assign E ∪ {(w′_j, w′_i)} to E;

After these two steps have been performed for every two-tuple (w′_j, w′_i) ∈ W_T × W_T, the result is a directed graph G with W_T as its node set and E as its set of directed edges;

For each w′_j ∈ W_T, compute its clustering coefficient C(w′_j); the clustering coefficient measures how tightly nodes are connected to one another;

In the formula above, G_Δ(w′_j) is the number of closed triplets in G that contain w′_j, and G_∧(w′_j) is the number of open triplets in G that contain w′_j. A closed triplet is any three nodes in the graph that are pairwise connected; an open triplet is three nodes in the graph connected by two edges;

Let the topological feature t_j of w′_j in news text T be C(w′_j), i.e. t_j = C(w′_j);

Using the context, position, and topological features of w′_j in news text T, build the three-dimensional feature vector (c_j, p_j, t_j) of w′_j.

Step 3: Event location recognition

Select 100 to 200 feature vectors and manually assign each a class label (event location or non-event location) to form a training data set. Train a classifier with the Random Forest algorithm, and use it to classify each w′_j in W_T as event location or non-event location, thereby extracting the event locations.

The invention builds a classifier from the context, position, and topological features of words in news text and thereby extracts event locations automatically.

Brief description of the drawings

Fig. 1 shows the event location extraction process for Chinese news text.

Fig. 2 is an example of a directed graph used to compute the topological feature t_j.

Embodiment

The invention will now be described in detail with reference to the accompanying drawings.

With reference to the drawings, the method is carried out in three steps: candidate event location extraction, feature vector construction, and event location recognition. Each is described in detail below:

Step 1: Candidate event location extraction

a) First, segment the Chinese news text T with the ICTCLAS Chinese word segmentation tool, producing a sequence of two-tuples S_T = (w₁, p₁), (w₂, p₂), ..., (w_i, p_i), ..., (w_n, p_n), where n is the number of words produced by segmentation, n > 0, w_i is a word produced by ICTCLAS, and p_i is the part of speech of w_i;

b) From S_T, select in order every two-tuple that satisfies one of p_i = "ni", p_i = "nl", or p_i = "ns", which indicate that the corresponding w_i is an organization name, a location noun, or a place name, respectively. The w_i of all selected two-tuples form the set W_T = {w′₁, w′₂, ..., w′_j, ..., w′_m}, which serves as the set of candidate event locations.

Taking the news text "On June 19, 2012, during the G20 summit held in Los Cabos, Mexico, Argentine President Cristina handed British Prime Minister Cameron a letter concerning sovereignty over the Malvinas Islands" as an example, the steps above generate the set {Mexico, Los Cabos, Argentina, Britain, Malvinas Islands}.
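The selection in Step 1 can be sketched in Python as follows. This is a minimal illustration that assumes the segmenter output is already available as (word, tag) pairs; no ICTCLAS call is made. The function name `candidate_locations`, the English stand-in words, and every tag other than "ni", "nl", and "ns" are hypothetical. Dropping repeated words is also an assumption, made because W_T is described as a set.

```python
# Part-of-speech tags named in the method: organization name, location noun, place name.
LOCATION_TAGS = {"ni", "nl", "ns"}

def candidate_locations(tagged):
    """Return the candidate set W_T, in order of first appearance, from (word, tag) pairs."""
    seen = set()
    candidates = []
    for word, tag in tagged:
        if tag in LOCATION_TAGS and word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates

# Hypothetical segmenter output for a fragment of the example sentence;
# tags other than ni/nl/ns are placeholders, not real ICTCLAS output.
tagged_example = [
    ("Mexico", "ns"), ("Los Cabos", "ns"), ("hold", "v"), ("G20", "x"),
    ("summit", "n"), ("Argentina", "ns"), ("president", "n"),
    ("Britain", "ns"), ("Malvinas Islands", "ns"), ("sovereignty", "n"),
]
print(candidate_locations(tagged_example))
```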

Step 2: Feature vector construction

For each w′_j in W_T, three features are computed: the context feature of w′_j in news text T, the position feature of w′_j in T, and the topological feature of w′_j in T. They are calculated as follows:

Feature 1: context feature c_j of w′_j in news text T

The context feature of w′_j in news text T is the weight of the regular expression that w′_j matches, denoted c_j:

a) If w′_j matches one regular expression in the following table within news text T, say the k-th, then c_j = a_k. Here a_k is the proportion of words matching the k-th regular expression that are event locations, computed as a_k = |L_k| / |S_k|, where S_k is the set of all words in the labeled data that match the k-th regular expression, L_k is the subset of S_k whose words are event locations in the labeled data, and the labeled data is a collection of news texts in which event locations have been annotated manually;

b) If w′_j matches several regular expressions in the following table within news text T, say the k₁-th, k₂-th, ..., k_l-th (l > 1), then

c) If w′_j matches none of the regular expressions in the following table within news text T, then c_j = 0.

Taking the news text "On June 19, 2012, during the G20 summit held in Los Cabos, Mexico, Argentine President Cristina handed British Prime Minister Cameron a letter concerning sovereignty over the Malvinas Islands" as an example, "Mexico" in the set {Mexico, Los Cabos, Argentina, Britain, Malvinas Islands} matches the first regular expression "^ W+ $", so its c_j is 0.64.
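The weight a_k = |L_k| / |S_k| can be estimated from labeled data along the following lines. This is a sketch under stated assumptions: the pattern shown is hypothetical (it is not taken from the patent's table), each labeled item is represented as a context string paired with a boolean label, and repeated occurrences are counted individually. The function name `pattern_weights` is illustrative. The sketch covers only the per-pattern weights, not the rule for combining several matching patterns.

```python
import re

def pattern_weights(patterns, labeled_items):
    """Compute a_k = |L_k| / |S_k| for each regular expression.

    patterns      -- list of regex strings
    labeled_items -- list of (context_string, is_event_location) pairs
    """
    weights = []
    for pattern in patterns:
        regex = re.compile(pattern)
        # S_k: labels of everything the pattern matches.
        matched = [is_loc for text, is_loc in labeled_items if regex.search(text)]
        # L_k: the matched items labeled as event locations (True counts as 1).
        weights.append(sum(matched) / len(matched) if matched else 0.0)
    return weights

patterns = [r"held in \w+", r"no such context"]  # hypothetical patterns
labeled = [
    ("held in Mexico", True),
    ("held in Paris", True),
    ("held in reserve", False),
    ("letter to Britain", False),
]
weights = pattern_weights(patterns, labeled)
print(weights)
```

A pattern that matches nothing in the labeled data is given weight 0.0 here, which is a design choice of the sketch rather than something the text specifies.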

Feature 2: position feature p_j of w′_j in news text T

Statistical analysis of news texts with manually annotated event locations shows that the earlier an organization name, location noun, or place name appears, the more likely it is to be the event location. A position feature is therefore introduced and defined as p_j = loc(w′_j, T), where loc(w′_j, T) is the position at which word w′_j first occurs in news text T, i.e. the number of characters between the start of T and the first occurrence of w′_j.

Again taking the news text "On June 19, 2012, during the G20 summit held in Los Cabos, Mexico, Argentine President Cristina handed British Prime Minister Cameron a letter concerning sovereignty over the Malvinas Islands" as an example, "Mexico" first occurs at position 12 in the text, so its p_j is 12.
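The position feature amounts to a character offset, which Python's `str.find` returns directly. In the sketch below the Chinese string is an assumed back-translation of the opening of the example sentence, not a quotation from the source, and `first_position` is an illustrative name. Under that assumption the offset of "墨西哥" (Mexico) agrees with the value 12 given above.

```python
def first_position(word, text):
    """Return loc(word, text): the number of characters before the first occurrence.

    Returns -1 if the word does not occur, which should not happen for
    candidates that were extracted from the same text.
    """
    return text.find(word)

# Assumed Chinese opening of the example sentence (back-translated for illustration).
text = "2012年6月19日，在墨西哥洛斯卡沃斯举行的G20峰会期间"
print(first_position("墨西哥", text))  # 12
```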

Feature 3: topological feature t_j of w′_j in news text T

Statistical analysis of news texts with manually annotated event locations shows that when an event occurs in several places, those event locations are strongly correlated. Whether two places are correlated is therefore described by whether each appears on the other's page in the "Interactive Encyclopedia" (www.baike.com).

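Initialize an empty set E. For every two-tuple (w′_j, w′_i) ∈ W_T × W_T, perform the following two steps:

STEP1: Concatenate the string "http://www.baike.com/wiki/" with w′_j to form a URL and download the corresponding page. If the page cannot be downloaded, do not process the two-tuple (w′_j, w′_i);

STEP2: Match the page file against the regular expression

/<a[^>]*href=["']([^'">]*)['"][^>]*>(.*)<\/a>/ig

to obtain all anchor texts. If an anchor text contains w′_i, assign E ∪ {(w′_j, w′_i)} to E.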

Taking the news text "On June 19, 2012, during the G20 summit held in Los Cabos, Mexico, Argentine President Cristina handed British Prime Minister Cameron a letter concerning sovereignty over the Malvinas Islands" as an example, the directed graph shown in Fig. 2 is generated.
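The construction of the edge set E can be sketched as below. To stay self-contained, the sketch performs no download; it assumes the pages have already been fetched into a dictionary keyed by candidate word, with missing keys standing for failed downloads. The names `ANCHOR_RE` and `build_edges` are hypothetical. Two choices differ from a literal reading of the text and are assumptions: the capture group is non-greedy so that several links on one line are separated, and pairs of a word with itself are skipped.

```python
import re

# Python rendering of the anchor-matching pattern, with a non-greedy text group.
ANCHOR_RE = re.compile(r"""<a[^>]*href=["']([^'">]*)["'][^>]*>(.*?)</a>""", re.IGNORECASE)

def build_edges(candidates, pages):
    """Return the directed edge set E as a set of (source, target) pairs.

    candidates -- list of candidate words (W_T)
    pages      -- dict mapping a candidate to the HTML of its encyclopedia page
    """
    edges = set()
    for source in candidates:
        html = pages.get(source)
        if html is None:
            continue  # page unavailable: pairs starting at this word are not processed
        anchor_texts = [text for _href, text in ANCHOR_RE.findall(html)]
        for target in candidates:
            if target != source and any(target in text for text in anchor_texts):
                edges.add((source, target))
    return edges

# Made-up page fragments for illustration only.
example_pages = {
    "Mexico": '<p><a href="/wiki/Los_Cabos">Los Cabos</a>, <a href="/wiki/US">United States</a></p>',
    "Los Cabos": "<a href='/wiki/Mexico'>Mexico</a>",
}
example_edges = build_edges(["Mexico", "Los Cabos", "Britain"], example_pages)
print(sorted(example_edges))
```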

For each w′_j ∈ W_T, compute its clustering coefficient C(w′_j). The clustering coefficient measures how tightly nodes are connected to one another; here it measures the degree of correlation among the candidate event locations;

In the formula above, G_Δ(w′_j) is the number of closed triplets in G that contain w′_j, and G_∧(w′_j) is the number of open triplets in G that contain w′_j. A closed triplet is any three nodes in the graph that are pairwise connected; an open triplet is three nodes in the graph connected by two edges;

Let the topological feature t_j of w′_j in news text T be C(w′_j), i.e. t_j = C(w′_j).

According to Fig. 2, the topological feature of "Mexico" is 0.6.
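The two triplet counts that enter the clustering coefficient can be computed as in the sketch below. It rests on assumptions that the text does not settle: edge direction is ignored when deciding whether two nodes are connected, and a triplet "contains" a node whenever that node is one of its three members. The sketch stops at the counts and does not combine them into C(w′_j). The name `triplet_counts` and the example graph are hypothetical and are not Fig. 2.

```python
from itertools import combinations

def triplet_counts(node, edges):
    """Count (closed, open) triplets containing `node`, ignoring edge direction."""
    neighbors = {}
    for a, b in edges:
        if a != b:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    own = neighbors.get(node, set())
    others = [n for n in neighbors if n != node]
    closed = open_ = 0
    for u, v in combinations(others, 2):
        # Number of connected pairs among the three nodes node, u, v.
        links = (u in own) + (v in own) + (v in neighbors[u])
        if links == 3:
            closed += 1   # all three pairs connected
        elif links == 2:
            open_ += 1    # exactly two pairs connected
    return closed, open_

# Hypothetical graph: a triangle A-B-C plus a pendant node D attached to A.
example_graph = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")}
print(triplet_counts("A", example_graph))
```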

Using the context, position, and topological features of w′_j in news text T, build the three-dimensional feature vector (c_j, p_j, t_j) of w′_j.

For the news text "On June 19, 2012, during the G20 summit held in Los Cabos, Mexico, Argentine President Cristina handed British Prime Minister Cameron a letter concerning sovereignty over the Malvinas Islands", the feature vector of "Mexico" in this text is therefore (0.64, 12, 0.6).

Step 3: Event location recognition

Select 100 to 200 feature vectors and manually assign each a class label (event location or non-event location) to form a training data set.

Using the context, position, and topological features of w′_j in news text T, train a classification model with the Random Forest algorithm. The model classifies the words in W_T as event location or non-event location, thereby extracting the event locations.
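Step 3 could be realized with an off-the-shelf Random Forest implementation. The sketch below uses scikit-learn's `RandomForestClassifier`, which is an assumption since the text names no library. The eight training vectors and their labels are made up purely for illustration (the method calls for 100 to 200 manually labeled vectors), and the hyperparameters are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up training vectors (c_j, p_j, t_j); label 1 = event location, 0 = not.
X_train = [
    [0.64, 12, 0.60], [0.70, 5, 0.50], [0.55, 20, 0.70], [0.80, 8, 0.40],
    [0.10, 150, 0.00], [0.00, 90, 0.10], [0.20, 200, 0.05], [0.05, 120, 0.00],
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

# A fixed random_state keeps the sketch reproducible.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Classify the feature vector derived for "Mexico" in the worked example.
prediction = clf.predict([[0.64, 12, 0.6]])
print(prediction[0])
```

In practice each candidate in W_T would be converted to its feature vector and passed to `predict`; the candidates predicted as class 1 form the extracted event locations.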

Claims

1. An event location extraction method for Chinese news text, characterized by comprising the following steps:

Step 1: Candidate event location extraction

(1) First, segment the Chinese news text T with the ICTCLAS Chinese word segmentation tool, producing a sequence of two-tuples S_T = (w₁, p₁), (w₂, p₂), ..., (w_i, p_i), ..., (w_n, p_n), where n is the number of words produced by segmentation, n > 0, w_i is a word produced by ICTCLAS, and p_i is the part of speech of w_i;

(2) From S_T, select in order every two-tuple that satisfies one of p_i = "ni", p_i = "nl", or p_i = "ns", which indicate that the corresponding w_i is an organization name, a location noun, or a place name, respectively; the w_i of all selected two-tuples form the set W_T = {w′₁, w′₂, ..., w′_j, ..., w′_m}, which serves as the set of candidate event locations, where m is the number of words in W_T, m > 0;

Step 2: Feature vector construction

For each w′_j in W_T, three features are computed: the context feature of w′_j in news text T; the position feature of w′_j in news text T; and the topological feature of w′_j in news text T;

Step 3: Event location recognition

Select 100 to 200 feature vectors and manually assign each a class label (event location or non-event location) to form a training data set; train a classifier with the Random Forest algorithm, and use it to classify each w′_j in W_T as event location or non-event location, thereby extracting the event locations;

wherein the context feature of w′_j in news text T, the position feature of w′_j in news text T, and the topological feature of w′_j in news text T described in Step 2 are calculated as follows:

Feature 1: context feature c_j of w′_j in news text T

(1) If w′_j matches one regular expression in the following table within news text T, say the k-th, then c_j = a_k; a_k is the proportion of words matching the k-th regular expression that are event locations, computed as a_k = |L_k| / |S_k|, where S_k is the set of all words in the labeled data that match the k-th regular expression, L_k is the subset of S_k whose words are event locations in the labeled data, and the labeled data is a collection of news texts in which event locations have been annotated manually;

(2) If w′_j matches several regular expressions in the following table within news text T, say the k₁-th, k₂-th, ..., k_l-th (l > 1), then

(3) If w′_j matches none of the regular expressions in the following table within news text T, then c_j = 0;

Feature 2: position feature p_j of w′_j in news text T

p_j = loc(w′_j, T), where loc(w′_j, T) is the position at which word w′_j first occurs in news text T, i.e. the number of characters between the start of T and the first occurrence of w′_j;

Feature 3: topological feature t_j of w′_j in news text T

STEP1: Concatenate the string "http://www.baike.com/wiki/" with w′_j to form a URL and download the corresponding page; if the page cannot be downloaded, do not process the two-tuple (w′_j, w′_i);

STEP2: Match the page file against the regular expression

/<a[^>]*href=["']([^'">]*)['"][^>]*>(.*)<\/a>/ig

to obtain all anchor texts; if an anchor text contains w′_i, assign E ∪ {(w′_j, w′_i)} to E;

After these two steps have been performed for every two-tuple (w′_j, w′_i) ∈ W_T × W_T, the result is a directed graph G with W_T as its node set and E as its set of directed edges;

For each w′_j ∈ W_T, compute its clustering coefficient C(w′_j); the clustering coefficient measures how tightly nodes are connected to one another;

In the formula above, G_Δ(w′_j) is the number of closed triplets in G that contain w′_j, and G_∧(w′_j) is the number of open triplets in G that contain w′_j; a closed triplet is any three nodes in the graph that are pairwise connected, and an open triplet is three nodes in the graph connected by two edges;

Using the context, position, and topological features of w′_j in news text T, build the three-dimensional feature vector (c_j, p_j, t_j) of w′_j.