CN108776656A

CN108776656A - Food safety event entity extraction method based on conditional random fields

Info

Publication number: CN108776656A
Application number: CN201810569813.0A
Authority: CN
Inventors: 王东波; 朱子赫; 叶文豪; 吴毅; 王玥雯
Original assignee: Nanjing Agricultural University
Current assignee: Nanjing Agricultural University
Priority date: 2018-06-05
Filing date: 2018-06-05
Publication date: 2018-11-09

Abstract

The invention discloses a method for extracting food safety event entities using multi-feature knowledge, comprising the following steps: defining the entities of food safety events; compiling statistics on the internal and external features of those entities; establishing a machine learning model; selecting and processing the corpus; and selecting features and formulating feature templates. The advantages of the invention are as follows. It serves as a basic resource for building a food safety event knowledge base and for mining food safety countermeasures. It can automatically extract food names and the specific factors that cause food safety events to occur. In building the extraction model, a large amount of feature knowledge is incorporated into the conditional random field model, and the model is built on a large-scale annotated corpus of food safety events.

Description

Food safety event entity extraction method based on conditional random fields

Technical field

The present invention relates to the technical field of food safety, and more particularly to a method for extracting food safety event entities using multi-feature knowledge.

Background art

To address the widely watched problem of food safety events, the Central Rural Economic Work Conference held on 23-24 December 2013 explicitly proposed the measure of "establishing a nationally unified agricultural product and food safety information traceability platform as soon as possible". The foundation of such a platform is the identification of the principal entities in food safety events, and the extraction of these entities becomes all the more important when public opinion on food safety has to be handled. In view of this, the present invention builds on a constructed food safety event corpus and a conditional random field machine learning model, and uses multi-feature knowledge about food safety event entities to carry out extraction experiments on those entities. On the one hand this provides basic knowledge anchors for building a food safety event knowledge base; on the other hand it lays a foundation for mining, analysing and summarizing strategies for responding to food safety events.

Research on food safety events has concentrated mainly on cases, policy and emergency handling. Representative work includes the following. Wu Heng, a postgraduate at Fudan University, together with 34 online volunteers, created the "Throw It Out the Window" website [1], which collects events related to food safety and organizes them into a database. That database supplied a certain number of texts for the food safety event corpus built in the present invention and is the basis of the corpus construction.

Research on food safety events has also been carried out from a management perspective. Representative work includes the following. Zhang Mujie et al. [2], based on two typical cases, analysed the harm caused by non-disclosure of information when managing emergency events and discussed the common reasons for non-disclosure. Their approach of selecting typical cases gave the present invention a methodological reference for choosing corpus texts. Ma Ying et al. [3] constructed an epidemic model of event risk perception in the food industry and, taking the "salt panic-buying event" triggered by the earthquake in Japan as an example, analysed and tested the model numerically. That study gave the present invention a reference for annotating the names of food safety events.

The above studies provide the present invention with macro-level methodological and strategic guidance, and also give a concrete basis for determining the entities of food safety events.

In entity extraction, recent research mainly uses machine learning methods to extract entities from unstructured text. Representative work is as follows. Using a neural network strategy, Chen Yu et al. [4] applied a Deep Belief Nets model to extract relations between entities; that study gave the present invention methodological guidance for determining the number of features. Extracting entities with the help of semantic knowledge is another currently popular strategy: Shao Fa et al. [5] started from the problem of polysemy and, using a disambiguation strategy, extracted entities with HowNet resources and a Bayesian classification method. Although recognizing entities from the angle of disambiguation is well founded, the overall performance of this method on large-scale corpora remains to be verified. For the sharply increasing volume of electronic medical text, Xu Hua et al. [6] applied a rule-based method to a word-segmented, part-of-speech-tagged medical corpus to extract entities from medical text, with overall performance above 80%. Although rule-based methods adapt to some extent to corpora with particular characteristics, the rules implicit among the vocabulary of a specific corpus are not fully explored, so the coverage of the formulated rules is relatively poor. This is one of the main reasons the present invention chooses the conditional random field model for food safety event entity extraction.

Information extraction research related to food safety events currently concentrates on vocabulary-level knowledge extraction from food complaint texts. Representative work includes Wei Xiuzhuo's [7] extraction of sensitive vocabulary from food complaint texts and Gao Rui's [8] ontology-based extraction of hazard information from food complaint texts. Compared with entity extraction, vocabulary-level extraction is relatively simple, mainly because the items are shorter and their internal composition is simpler. The conditional random field is widely applied as a sequence labelling model for extracting terms and entities. Representative work is as follows: Li Lishuang et al. [9] extracted automobile terms with a simple feature template; Wang Wenlong et al. [10] extracted entities from project application documents on the basis of word-combination feature templates; and Liu Kai et al. [11] built an entity extraction model for traditional Chinese medicine electronic medical records using feature knowledge of Chinese medicine vocabulary. These CRF-based term and entity extraction methods use only simple feature knowledge of the entity itself and do not draw on the context of the extracted object. The present invention constructs complex feature templates when recognizing food safety event entities, which to some extent compensates for this deficiency of existing recognition methods.

The existing research has two shortcomings. First, in building CRF-based models it either uses no additional features or only relatively simple ones, so the overall performance of the resulting models leaves room for improvement. Second, the models are trained mostly on small-scale corpora, whereas the present invention is built on a large-scale manually annotated corpus, which gives the model strong transferability and adaptability.

Bibliography

[1] Throw It Out the Window [EB/OL]. [2014-02-18]. http://www.zccw.info/;

[2] Zhang Mujie, Shen Jianhua. Thoughts on information disclosure in handling food and drug safety incidents [J]. Shanghai Food and Drug Administration Information Research, 2012 (2): 45-49;

[3] Ma Ying, Zhang Yuanyuan, Song Wenguang. Research on an epidemic model of event risk perception in the food industry [J]. Science Research Management, 2013, 34 (9): 123-130;

[4] Chen Yu, Zheng Dequan, Zhao Tiejun. Chinese named entity relation extraction based on Deep Belief Nets [J]. Journal of Software, 2012, 23 (10): 2572-2585;

[5] Shao Fa, Huang Yinge, Zhou Lanjiang, et al. Chinese entity relation extraction based on entity disambiguation [J]. Journal of Shandong University: Engineering Edition, 2014, 44 (6): 32-37;

[6] Xu Hua, Liu Maofu, Jiang Li, et al. Illness and bacterium entity extraction based on language rules [J]. Wuhan University Journal (Edition), 2015, 61 (2): 51-55;

[7] Wei Xiuzhuo. Research on sensitive vocabulary extraction from food complaint texts [D]. Changchun: Northeast Normal University, 2015;

[8] Gao Rui. Research on ontology-based hazard information extraction from food complaint texts [D]. Changchun: Northeast Normal University, 2011.

Summary of the invention

In view of the defects of the prior art, the present invention provides a method for extracting food safety event entities using multi-feature knowledge, which can effectively solve the above problems of the prior art.

To achieve the above object, the technical solution adopted by the present invention is as follows:

A method for extracting food safety event entities using multi-feature knowledge, comprising the following steps:

S1: Food safety event entity definition and feature statistics;

S11: Entity definition;

A food safety event corpus is constructed by collecting, annotating and organizing food safety events;

S12: Statistics on the internal and external features of food safety event entities;

All food safety events are selected, and the food names in them and the specific factors that cause the food safety events to occur are annotated; on the basis of the annotated corpus, the internal and external features of the "food name" and "material elements" entities are counted.

Internal features include entity length and quantity:

Entity length is obtained in order to gauge the complexity of the entities to be extracted and to determine the number of labels in the conditional random field label set;

The distribution of specific entities is counted in order to understand the concrete content of the entities and to support the statistics on their left and right boundary features.

External features of entities

The left and right boundaries of "food name" and "material elements" in the food safety event corpus are counted; the statistics are of important value for the subsequent construction of the "food name" and "material elements" extraction model.

The boundary range of "food name" and "material elements" is limited to a clause ending in ".！？". The left boundary of "food name" and "material elements" lies before the start mark of the first entity, so the range examined runs from the start of the sentence to the first mark and is denoted β. The range from the last mark to the end of the sentence is denoted α. The formula for selecting the left boundary words of "food name" and "material elements" is shown in formula (1).

Here, f(W_left_outside) denotes the frequency with which W occurs within the range β, and f(W_left) denotes the frequency with which W occurs within β and inside "food name" and "material elements". Using formula (1) together with the food safety event corpus, the empirical threshold of P is set to 0.8; that is, when P ≥ 0.8, W is a candidate left boundary word of "food name" and "material elements". Combined with introspection based on linguistic knowledge, 7 left boundary words are finally determined: ", with and be, food, it is exceeded, in".

Formula (2) is used to select the right boundary words of "food name" and "material elements".

Here, f(W_right_outside) denotes the frequency with which W occurs within the range α, and f(W_right) denotes the frequency with which W occurs within α and inside "food name" and "material elements". The threshold of P for right boundary words is also set to 0.8. Based on introspection from linguistic knowledge, combined with P values greater than or equal to 0.8, 10 right boundary words are finally determined: ", use, product, have, plant and Be, surpass, in, production".
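Formulas (1) and (2) themselves are not reproduced in this text. A form consistent with the definitions of the four frequencies, offered as an assumption rather than as the original formulas, is the ratio of the outside frequency to the combined frequency:

```latex
\begin{align}
P(W_{\mathrm{left}}) &= \frac{f(W_{\mathrm{left\_outside}})}{f(W_{\mathrm{left}})} \tag{1} \\
P(W_{\mathrm{right}}) &= \frac{f(W_{\mathrm{right\_outside}})}{f(W_{\mathrm{right}})} \tag{2}
\end{align}
```

Under this reading, a value of P close to 1 means that W almost always occurs next to an entity rather than inside one, which is what the 0.8 threshold selects for.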

S2: Model overview and feature determination

S21: Machine learning model

Let x = {x1, x2, ..., xn-1, xn} denote the observed input data sequence, for example the words of the corpus after segmentation, and let y = {y1, y2, ..., yn-1, yn} denote the state sequence drawn from a finite state set, where each state corresponds to a label. Given the input sequence x, the conditional probability of the state sequence y for a linear-chain CRF with parameters λ = {λ1, λ2, ..., λn-1, λn} is given by formula (3) and formula (4).

Here, Z_x is the normalization factor, the sum of the scores of all possible state sequences, which ensures that the conditional probabilities of all possible state sequences sum to 1. The feature function has a unified form and is usually a binary indicator function; λ_j is the weight of the corresponding feature function, obtained by training the model on the training data.
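Formulas (3) and (4) are not reproduced in this text. For reference, the standard linear-chain CRF definition that matches the symbols Z_x and λ_j used above is given below; the feature-function symbol f_j and the position index i are notational assumptions:

```latex
\begin{align}
p(y \mid x, \lambda) &= \frac{1}{Z_x} \exp\Bigl(\sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i)\Bigr) \tag{3} \\
Z_x &= \sum_{y'} \exp\Bigl(\sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y'_{i-1}, y'_i, x, i)\Bigr) \tag{4}
\end{align}
```

The sum in (4) runs over every possible state sequence y', which is why dividing by Z_x makes the conditional probabilities sum to 1.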

S22: Selection and processing of the corpus

Specifically, the "food name" and "material elements" entities are marked in the corpus in the form "【】";

Based on the feature statistics of "food name" and "material elements", formula (5) is used in determining the number of CRF labels for "food name" and "material elements".

Here, L denotes the weighted average length of "food name" and "material elements" for i ≤ k; N_i denotes the number of occurrences in the selected corpus of "food name" and "material elements" of length i; k and j denote the lengths of the longest and shortest "food name" and "material elements" in the corpus, respectively; and N denotes the total number of "food name" and "material elements" in the corpus.
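Formula (5) is not reproduced in this text. One form consistent with the definitions of L, N_i, j, k and N, given here as an assumption, is the count-weighted mean of the entity lengths:

```latex
L = \frac{\sum_{i=j}^{k} i \cdot N_i}{N} \tag{5}
```

A small L indicates that most entities span only a few words, so a compact label set with one begin, one middle and one end label is enough to cover them.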

Based on formula (5), together with the basic condition of the corpus and the corresponding experimental results, a label set with 5 positions is adopted in building the "food name" and "material elements" recognition model. The label set is denoted R, specifically R = {B, C, E, S, A}: B marks the initial word of a "food name" or "material elements" entity, C a middle word, E the final word, S a word outside "food name" and "material elements", and A a single character or word that forms a "food name" or "material elements" entity on its own. If the length of a "food name" or "material elements" entity exceeds 3, the additional words are also marked with C.

S23: Selection of features and formulation of feature templates

Features consist of two parts, atomic features and compound features.

The selected atomic features are 6 features: the word itself, part of speech, word length, whether the word is an entity word, whether it is a left boundary, and whether it is a right boundary;

Compound features characterize the complex linguistic properties of "food name" and "material elements" entities through combinations of atomic features.

Further, the collection process of food safety events in S11 is as follows: the collection targets mainly include food safety events on the internet and food safety events in print media;

On the network, food safety events are collected automatically, mainly by means of event-topic-oriented vertical search engine technology. The collection range includes news portals, forums and blogs. The heterogeneous data collected are cleaned, converted and tallied before being saved to a database, while paper-based event cases are collected by manual entry and proofreading.

Annotation of food safety events: mainly word segmentation and part-of-speech tagging of the food safety events;

Organization of food safety events: mainly classifying and labelling the food safety events, with the specific category labels based on the Food Safety Law of the People's Republic of China.

Further, in S23 the feature window sizes of the 6 selected features are 7, 3, 5, 5, 5 and 5 respectively. The range of a 7-window is {-3, -2, -1, 0, 1, 2, 3}, the range of a 5-window is {-2, -1, 0, 1, 2}, and the range of a 3-window is {-1, 0, 1}. Among these features, in terms of their contribution to the extraction performance for "food name" and "material elements", part of speech and the word itself are the most important, followed by the left and right boundary words and the entity word, and finally the length of "food name" and "material elements".

Compared with the prior art, the advantages of the present invention are as follows. It serves as a basic resource for building a food safety event knowledge base and for mining food safety countermeasures. It can automatically extract food names and the specific factors that cause food safety events to occur. In building the extraction model, a large amount of feature knowledge is incorporated into the conditional random field model, and the model is built on a large-scale annotated corpus of food safety events; these two points are the important innovations of the present invention.

Description of the drawings

Fig. 1 is a schematic diagram of the topological structure of the linear-chain CRFs model;

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawing and examples.

S1: Food safety event entity definition and feature statistics

S11: Overview of the food safety event corpus and entity definition

By collecting, annotating and organizing food safety events, the present invention constructed a food safety event corpus covering 2005-2015. The collection process is as follows. The collection targets mainly include food safety events on the internet and food safety events in print media. On the network, food safety events are collected automatically, mainly by means of event-topic-oriented vertical search engine technology; the collection range includes news portals, forums and blogs. The heterogeneous data collected are cleaned, converted and tallied before being saved to a database, while paper-based event cases are collected by manual entry and proofreading.

Annotation of food safety events mainly consists of word segmentation and part-of-speech tagging. Food safety names of greater length are tagged with the part of speech of the next higher level. Because vocabulary in food safety events tends to be longer than in general corpora, such items are treated as a single word during segmentation and tagged accordingly. Organization of food safety events mainly consists of classifying and labelling the events, with the specific category labels based on the Food Safety Law of the People's Republic of China. After this processing, the constructed food safety event corpus reaches the scale of 15 million characters and 6.87 million words, and consists of 2800 food safety events.

The entities referred to in the present invention are mainly the food names involved in food safety events and the specific factors that cause those events to occur. For example, food names include "milk powder, soy sauce, rice, milk", and specific factors include "additive, formaldehyde, benzoyl peroxide, trans-fatty acid". The main task of the present invention is to build a machine learning model that automatically extracts food names and the specific factors that cause food safety events to occur. A sample of the corpus used for training and testing the conditional random field model is as follows:

Enterprise/n or/c people/n /u "/w illegal/vn behaviors/n "/w in/f ,/w includes/v "/w productions/v personations/vn Registration/vn trade marks/n /u is bottled/"/w is in/p productions/vn processing/vn dumpling wrapper/n ,/wn Wantons skin/n mistakes by b water/n "/w In journey/n/f additions/v is toxic/vi is harmful/a substances/n【Borax/n】"/w is in/p productions/vn processing/vn by "/w【Omasum/ nr】,/wn【Squid/n】,/wn【Ox】Tripe/ng etc./v food/n /u processes/n in/f additions/v is toxic/vi is harmful/a objects Matter/n【Hydrogen peroxide/n】With/p hydrogen【Sodium oxide molybdena/n】"/w submits/v falsenesses/a materials/n acquirements/v food and drink/n services/vn to "/w License/vn "/w "/w distorts/v food/n productions/vn dates/n simultaneously/d sale/v "/w etc./u./wj

S12: Statistics on the internal and external features of entities

2800 food safety events were selected, and the food names in them and the specific factors that cause the food safety events to occur were annotated manually. On the basis of the annotated corpus, the internal and external features of the "food name" and "material elements" entities were counted.

(1) Internal features

Word length

Obtaining the length of entities helps, on the one hand, to gauge the complexity of the entities to be extracted and, on the other hand, to determine the number of labels in the conditional random field label set. The distribution of food safety event entity lengths is shown in Table 1.

Table 1 Length distribution of food safety event entities

Entity length	Quantity (count)	Entity length	Quantity (count)
2	48 036	13	13
3	23 499	9	9
4	6 878	10	7
1	6 594	12	5
5	1 383	14	2
6	394	15	1
7	182	11	1
8	37	20	1

As can be seen from Table 1, entity lengths are concentrated between 1 and 5: entities of length 1-5 account for 99.25% of the total, entities of length 2 and 3 for 82.18%, entities of length 2 for 55.19%, and entities of length 3 for 27.00%. Entities of length 2 thus make up more than half of all entities, so entities of length 2 and 3 are the main targets of extraction, for example "milk powder", "milk", "pork", "additive" and "gutter oil". Entities longer than 8 are mostly nouns with adjectival modifiers or complex proper nouns, such as "sodium cyclohexylsulfamate".
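The percentages quoted above can be recomputed directly from Table 1. The following minimal sketch does so and also computes a count-weighted average entity length; the function names are illustrative and not part of the original method, and only the counts are taken from the table.

```python
# Entity length -> number of entities, copied from Table 1.
LENGTH_COUNTS = {
    1: 6594, 2: 48036, 3: 23499, 4: 6878, 5: 1383, 6: 394, 7: 182, 8: 37,
    9: 9, 10: 7, 11: 1, 12: 5, 13: 13, 14: 2, 15: 1, 20: 1,
}


def share(counts, lengths):
    """Fraction of all entities whose length is one of `lengths`."""
    total = sum(counts.values())
    return sum(counts.get(n, 0) for n in lengths) / total


def weighted_average_length(counts):
    """Sum of length * count over all lengths, divided by the total count."""
    total = sum(counts.values())
    return sum(length * n for length, n in counts.items()) / total


print("total entities:", sum(LENGTH_COUNTS.values()))
print("lengths 1-5 (%):", round(100 * share(LENGTH_COUNTS, range(1, 6)), 2))
print("lengths 2-3 (%):", round(100 * share(LENGTH_COUNTS, (2, 3)), 2))
print("average length:", round(weighted_average_length(LENGTH_COUNTS), 2))
```

The counts sum to the 87 042 entities reported for the corpus, and the weighted average length comes out between 2 and 3 words, in line with the observation that entities of length 2 and 3 dominate.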

Distribution of specific entities

Counting the distribution of specific entities not only gives a concrete sense of the content of the entities but also supports the statistics on their left and right boundary features. The distribution of some food safety event entities is shown in Table 2.

Table 2 Distribution of specific food safety event entities

Entity	Quantity (count)	Entity	Quantity (count)
Additive	2 243	Rice	899
Milk powder	1 661	Milk	810
Gutter oil	1 178	Medicine bag	733
Soy sauce	1 078	Total plate count	377
Wine	1 006	Nitrite	352
Pork	943	Trans-fatty acid	95
Formaldehyde	904	Benzoyl peroxide	90

Table 2 lists only part of the entity data: the top 10 entities by count and the highest-count entities of length 4-6 (3 193 entities and 87 042 occurrences in total). Because most of the top 10 are entities of length 2, no further entities of that kind are added to the table. The total number of entities counted is 87 042, of which the top 10 account for 13.16% and the top 5 for 8.23%; the second-ranked entity, milk powder, accounts for 1.91% and the first-ranked entity, additive, for 2.58%.

(2) External features of entities

In the corpora of different food safety events, the left and right boundaries of "food name" and "material elements" differ considerably. The left and right boundaries of "food name" and "material elements" in the food safety event corpus were therefore counted separately; the statistics are of important value for the subsequent construction of the "food name" and "material elements" extraction model.

The boundary range of "food name" and "material elements" is limited to a clause ending in ".！？". The left boundary of "food name" and "material elements" can never extend past the first mark, that is, the start mark of the first "food name" or "material elements" entity, so the range examined is limited to the span from the start of the sentence to the first mark, denoted β. Likewise, the right boundary feature words of "food name" and "material elements" can never extend past the last mark, so the range examined is limited to the span from the last mark to the end of the sentence, denoted α. The formula for selecting the left boundary words of "food name" and "material elements" is shown in formula (1).

Here, f(W_left_outside) denotes the frequency with which W occurs within the range β, and f(W_left) denotes the frequency with which W occurs within β and inside "food name" and "material elements". Using formula (1) together with the food safety event corpus, the empirical threshold of P is set to 0.8; that is, when P ≥ 0.8, W is a candidate left boundary word of "food name" and "material elements". Combined with introspection based on linguistic knowledge, 7 left boundary words are finally determined: ", with and be, food, it is exceeded, in".
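The left boundary selection described above can be sketched in a few lines. The sketch assumes that P is the ratio of the frequency of W in β to its frequency in β plus inside entities, which is one reading of the definitions above; the input format, the function name and the toy clauses are hypothetical.

```python
from collections import Counter


def left_boundary_candidates(clauses, threshold=0.8):
    """Return {word: P} for words whose ratio P meets the threshold.

    `clauses` is a list of clauses; each clause is a list of
    (word, in_entity) pairs, where in_entity marks annotated entity words.
    """
    outside = Counter()  # occurrences of W in beta (before the first entity mark)
    inside = Counter()   # occurrences of W inside an annotated entity
    for clause in clauses:
        flags = [in_entity for _, in_entity in clause]
        if True not in flags:
            continue  # no entity in this clause, so beta is undefined
        first = flags.index(True)
        for word, _ in clause[:first]:
            outside[word] += 1
        for word, in_entity in clause:
            if in_entity:
                inside[word] += 1
    scores = {w: outside[w] / (outside[w] + inside[w]) for w in outside}
    return {w: p for w, p in scores.items() if p >= threshold}


# Toy clauses with English stand-ins for segmented words (illustrative only).
TOY_CLAUSES = [
    [("add", False), ("borax", True)],
    [("add", False), ("milk", True), ("powder", True)],
    [("milk", False), ("in", False), ("milk", True)],
]
print(left_boundary_candidates(TOY_CLAUSES))
```

In the toy data, "milk" appears once in β but twice inside entities, so it falls below the threshold, while "add" and "in" never occur inside an entity and are kept. The right boundary words can be scored the same way by scanning the α range after the last mark.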

Similarly, formula (2) is used to select the right boundary words of "food name" and "material elements".

S2: Model overview and feature determination

S21: Machine learning model

The conditional random field is a well-suited model for sequence labelling problems. It is an undirected graphical model that, given a set of observation sequences to be labelled, computes the joint conditional probability distribution of the label sequence over the whole observation sequence. For given node input values it computes the conditional probability of the node output values, and the training objective is to maximize this conditional probability. The most common CRFs model has a first-order chain structure, i.e. a linear-chain structure, whose topology is shown in Fig. 1.

Let x = {x1, x2, ..., xn-1, xn} denote the observed input data sequence, which in the present invention is the sequence of words in the corpus after segmentation, and let y = {y1, y2, ..., yn-1, yn} denote the state sequence drawn from a finite state set, where each state corresponds to a label. Given the input sequence x, the conditional probability of the state sequence y for a linear-chain CRF with parameters λ = {λ1, λ2, ..., λn-1, λn} is given by formula (3) and formula (4).

Here, Z_x is the normalization factor, the sum of the scores of all possible state sequences, which ensures that the conditional probabilities of all possible state sequences sum to 1. The feature function has a unified form and is usually a binary indicator function; λ_j is the weight of the corresponding feature function, obtained by training the model on the training data.

The maximum entropy model (ME) is based on the principle of maximum entropy as described by McCallum et al. (McCallum A, Freitag D, Pereira F. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML 2000, 2000: 591-598). The principle states that when information about a probability distribution is uncertain, the least biased approach is to treat the distribution as evenly as possible and make no subjective assumptions: among the distributions that satisfy the given constraints from the training data, the one with maximum entropy is the required distribution. The maximum entropy model is widely used in fields such as artificial intelligence and natural language processing, but because it suffers from label bias it produces more misrecognized and unrecognized cases, and in some situations its performance is inferior to models such as CRF.
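To make the role of the normalization factor Z_x in the linear-chain CRF concrete, the following minimal sketch computes the conditional probability of a label sequence by brute-force enumeration. It assumes the standard linear-chain form with binary state and transition features; the three-label set, the feature keys and the weights are made up for illustration and are not taken from the trained model.

```python
import itertools
import math

LABELS = ["B", "E", "S"]  # reduced label set, for illustration only

# Hypothetical feature weights (lambda_j); every feature is a binary indicator.
WEIGHTS = {
    ("emit", "trans-", "B"): 2.0,
    ("emit", "fatty acid", "E"): 2.0,
    ("emit", "problem", "S"): 1.0,
    ("trans", "B", "E"): 1.5,
}


def score(x, y, weights):
    """Unnormalized log-score: sum of the weights of the features that fire."""
    total = 0.0
    prev = "<start>"
    for word, label in zip(x, y):
        total += weights.get(("emit", word, label), 0.0)   # state feature
        total += weights.get(("trans", prev, label), 0.0)  # transition feature
        prev = label
    return total


def conditional_probability(x, y, weights, labels=LABELS):
    """p(y | x) = exp(score(x, y)) / Z_x, with Z_x summed over all label sequences."""
    z_x = sum(
        math.exp(score(x, candidate, weights))
        for candidate in itertools.product(labels, repeat=len(x))
    )
    return math.exp(score(x, y, weights)) / z_x


x = ["trans-", "fatty acid", "problem"]
print(conditional_probability(x, ["B", "E", "S"], WEIGHTS))
```

Enumerating all label sequences is exponential in the sentence length and only feasible for toy inputs; practical CRF implementations compute Z_x with the forward algorithm instead.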

Specifically, the "food name" and "material elements" entities are marked in the corpus in the form "【】", for example:

"【/ wky milk/n】/ wky ,/wd 30/m is remaining/and m/q law enforcements/vn personnel/n comes the/west the v/streets b long/n agricultural trades Market/n ", verification/v【/ wky is trans-/b aliphatic acid/n】/ wky /u/ problems/n./wd

Based on the feature statistics of "food name" and "material elements" and on the definition of the conditional random field model, the present invention mainly refers to formula (5) in determining the number of CRF labels for "food name" and "material elements".

Here, L denotes the weighted average length of "food name" and "material elements" for i ≤ k; N_i denotes the number of occurrences in the selected corpus of "food name" and "material elements" of length i; k and j denote the lengths of the longest and shortest "food name" and "material elements" in the corpus, respectively; and N denotes the total number of "food name" and "material elements" in the corpus. Based on formula (5), together with the basic condition of the corpus and the corresponding experimental results, a label set with 5 positions is adopted in building the "food name" and "material elements" recognition model. The label set is denoted R, specifically R = {B, C, E, S, A}: B marks the initial word of a "food name" or "material elements" entity, C a middle word, E the final word, S a word outside "food name" and "material elements", and A a single character or word that forms a "food name" or "material elements" entity on its own. If the length of a "food name" or "material elements" entity exceeds 3, the additional words are also marked with C.
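The label set R can be applied mechanically once the entity spans are known. The following minimal sketch assigns B, C, E, S and A labels to a segmented sentence; the function names and the span representation (word indices, end exclusive) are assumptions made for illustration.

```python
def entity_labels(length):
    """Labels for one entity of `length` words under R = {B, C, E, S, A}."""
    if length < 1:
        raise ValueError("an entity has at least one word")
    if length == 1:
        return ["A"]  # a single word forms the entity on its own
    # First word B, last word E, every word in between C.
    return ["B"] + ["C"] * (length - 2) + ["E"]


def label_sequence(words, entity_spans):
    """Label a segmented sentence; words outside every span receive S."""
    labels = ["S"] * len(words)
    for start, end in entity_spans:
        labels[start:end] = entity_labels(end - start)
    return labels


words = ["related", "trans-", "fatty acid", "problem"]
print(label_sequence(words, [(1, 3)]))
```

For the two-word entity in this example the sketch gives B and E to "trans-" and "fatty acid" and S to the surrounding words, which agrees with the labels shown for the same words in Table 3.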

The present invention uses a Java program that, based on the "【】" marks of "food name" and "material elements" in the corpus and on the selected features and formulated feature templates, automatically labels the whole corpus. The specific labelling is shown in Table 3.

Table 3 " food name " and " material elements " training corpus and testing material mark sample

Word	Part of speech	Word length	Is entity word	Is left boundary	Is right boundary	Label
Regarding	p	2	N	N	N	S
Trans-	b	2	Y	N	N	B
Aliphatic acid	n	3	Y	N	N	E
Problem	n	1	N	N	N	S
,	wd	1	N	N	N	S
Zhejiang Province	ns	3	N	N	N	S
Jinhua	ns	3	N	N	N	S
Public security bureau	n	3	N	N	N	S
Jiangnan	ns	2	N	N	N	S
Branch office	n	2	N	N	N	S
Received	v	2	N	N	N	S
The masses	n	2	N	N	N	S
Report	vn	2	N	N	N	S
Stated	v	1	N	N	N	S
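The automatic labeling that produces rows like those in Table 3 can be sketched as follows. The patent states that a Java program was used; this Python version is only an illustration. The function name `label_tokens`, the input format (a list of word and part-of-speech pairs with "【" and "】" as separate tokens) and the Chinese sample tokens (back-translated guesses) are all assumptions. The boundary word sets are left empty here because the actual word lists are not legible in this text.

```python
from typing import FrozenSet, List, Tuple

OPEN, CLOSE = "【", "】"  # entity markers used in the annotated corpus


def label_tokens(
    tokens: List[Tuple[str, str]],
    left_words: FrozenSet[str] = frozenset(),   # hypothetical left boundary word list
    right_words: FrozenSet[str] = frozenset(),  # hypothetical right boundary word list
) -> List[list]:
    """Turn bracket-marked (word, pos) tokens into 7-column training rows."""
    rows: List[list] = []
    inside, start = False, 0
    for word, pos in tokens:
        if word == OPEN:
            inside, start = True, len(rows)
            continue
        if word == CLOSE:
            span = rows[start:]  # rows of the entity that just ended
            n = len(span)
            tags = ["A"] if n == 1 else ["B"] + ["C"] * (n - 2) + ["E"]
            for row, tag in zip(span, tags):
                row[-1] = tag  # overwrite the provisional S label
            inside = False
            continue
        rows.append([
            word, pos, len(word),
            "Y" if inside else "N",
            "Y" if word in left_words else "N",
            "Y" if word in right_words else "N",
            "S",  # default label for words outside entities
        ])
    return rows


# Hypothetical input: "verify 【trans- fatty acid】 (particle) problem".
SAMPLE = [("核查", "v"), ("【", "wky"), ("反式", "b"), ("脂肪酸", "n"),
          ("】", "wky"), ("的", "u"), ("问题", "n")]

if __name__ == "__main__":
    for row in label_tokens(SAMPLE):
        print("\t".join(str(cell) for cell in row))
```

Word length is taken as the number of characters in the segmented word, which matches the lengths 2 and 3 shown for the two entity words in Table 3.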

S23: Feature selection and feature template formulation

In a machine learning model based on conditional random fields, feature selection is of critical importance: the quality of the selected features directly affects the performance of the CRF model. Features consist of two parts, atomic features and compound features. The present invention uses six atomic features: the word itself, part of speech, word length, whether the word is an entity word, whether it is a left boundary word, and whether it is a right boundary word. Compound features combine atomic features to characterize the complex linguistic properties of "food name" and "material elements" entities. The window sizes for the six features are 7, 3, 5, 5, 5 and 5, respectively. A window of size 7 covers the offsets {-3, -2, -1, 0, 1, 2, 3}, a window of size 5 covers {-2, -1, 0, 1, 2}, and a window of size 3 covers {-1, 0, 1}. In terms of their contribution to extraction performance for "food name" and "material elements", part of speech and the word itself are the most important features, followed by the left and right boundary words and the entity words, and finally the length of "food name" and "material elements".
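The window sizes can be made concrete with a short Python sketch that collects the atomic features around one position. The names `WINDOWS`, `offsets` and `atomic_features`, the dictionary keys and the `<PAD>` placeholder for positions outside the sentence are hypothetical. The patent does not say which CRF toolkit or template syntax was used, and compound features are not covered here.

```python
from typing import Dict, List

# Window size per atomic feature, in the order listed in the text:
# word, part of speech, word length, entity word, left boundary, right boundary.
WINDOWS = {"word": 7, "pos": 3, "length": 5, "is_entity": 5, "left": 5, "right": 5}


def offsets(size: int) -> List[int]:
    """Symmetric offsets for an odd window size, e.g. 5 -> [-2, -1, 0, 1, 2]."""
    half = size // 2
    return list(range(-half, half + 1))


def atomic_features(rows: List[Dict[str, object]], i: int) -> Dict[str, object]:
    """Collect every windowed atomic feature for the token at position i."""
    feats: Dict[str, object] = {}
    for name, size in WINDOWS.items():
        for off in offsets(size):
            j = i + off
            # Positions outside the sentence get a padding value (an assumption).
            feats[f"{name}[{off}]"] = rows[j][name] if 0 <= j < len(rows) else "<PAD>"
    return feats


if __name__ == "__main__":
    rows = [
        {"word": "trans-", "pos": "b", "length": 2,
         "is_entity": "Y", "left": "N", "right": "N"},
        {"word": "aliphatic acid", "pos": "n", "length": 3,
         "is_entity": "Y", "left": "N", "right": "N"},
    ]
    feats = atomic_features(rows, 0)
    print(len(feats))        # 7 + 3 + 5 + 5 + 5 + 5 feature slots
    print(feats["word[1]"])  # the word one position to the right
```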

Entity extraction experiment

The performance of the extraction model is measured mainly with three metrics: precision, recall and F value (F-measure). A conditional random field model and a maximum entropy model were each used to extract "food name" and "material elements" from the corpus annotated above. In the experiments, the performance of the models was tested by cross-validation: the 2800 corpus documents were divided into a training corpus and a test corpus at a ratio of 9:1. The test results are shown in Table 4 and Table 5, and Table 6 compares the training and test time of the two models under the same software and hardware conditions.
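A 9:1 split repeated over ten tests corresponds to ten-fold cross-validation, which can be sketched as below. The function name `ten_fold_splits` is hypothetical, and the use of contiguous, unshuffled folds is an assumption, since the text does not say how the documents were assigned to folds.

```python
from typing import Iterator, List, Tuple


def ten_fold_splits(n_docs: int = 2800, n_folds: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_ids, test_ids) pairs; each fold serves once as the test set."""
    fold = n_docs // n_folds
    for k in range(n_folds):
        test = range(k * fold, (k + 1) * fold)
        train = [d for d in range(n_docs) if d not in test]
        yield train, list(test)


if __name__ == "__main__":
    for number, (train, test) in enumerate(ten_fold_splits(), start=1):
        # With 2800 documents each test uses 2520 for training and 280 for testing.
        print(number, len(train), len(test))
```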

Table 4. Extraction performance for "food name" and "material elements" based on the conditional random field model

Table 5. Extraction performance for "food name" and "material elements" based on the maximum entropy model

Test No.	Precision	Recall	F value
1	72.55%	62.50%	67.15%
2	73.72%	61.89%	67.29%
3	81.90%	65.19%	72.60%
4	84.10%	59.97%	70.01%
5	81.67%	62.49%	70.80%
6	86.52%	63.70%	73.38%
7	81.66%	65.74%	72.84%
8	72.71%	67.10%	69.79%
9	74.72%	63.37%	68.58%
10	80.88%	65.40%	72.32%
Mean value	79.04%	63.74%	70.48%
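The text does not state how the F value is computed. The rows of Table 5 are consistent with the usual balanced F-measure, F = 2PR / (P + R), which the following sketch checks on two rows. The function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall taken from tests 1 and 6 in Table 5, in percent.
    print(round(f_measure(72.55, 62.50), 2))  # 67.15, as listed for test 1
    print(round(f_measure(86.52, 63.70), 2))  # 73.38, as listed for test 6
```

Because the formula is scale-invariant, it gives the same result whether precision and recall are passed as percentages or as fractions.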

Table 6. Comparison of training and test time for the conditional random field and maximum entropy models

Table 4 and Table 5 show that the "food name" and "material elements" recognition model based on conditional random fields performs better than the one based on the maximum entropy model. The F value of the conditional random field model is at least 90.06% and at most 91.94%, with an average of 90.88%; the F value of the maximum entropy model reaches at most 73.38% and averages only 70.48%. Table 6 shows that, in terms of training and test time, the maximum entropy model is better than the conditional random field model: one round of training and testing takes about 100 seconds for the maximum entropy model, whereas the conditional random field model needs about 50,000 seconds. Because the follow-up research is more concerned with the performance of "food name" and "material elements" recognition than with training time, the present invention selects the conditional random field model for recognizing "food name" and "material elements".

A simple analysis of the "food name" and "material elements" entities recognized by the conditional random field model shows that recognition errors occur mostly on long entities, such as "Vibrio parahaemolyticus", "village Qiao Jiashangao steamed bun", "for animals plus selenium humic acid Sodium", "being polluted by bacillus cereus" and "Wang Shi propolis soft capsules". These entities either contain several place names and adjectives that are hard to recognize, such as "Qiao Jia" and "grid are high", or combine a personal name with a noun, such as "Wang Shi" and "propolis". The complex constituents of these entities lower the precision and recall of entity recognition in the food safety event corpus.

Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand how the invention is implemented, and that the scope of protection of the invention is not limited to these particular statements and embodiments. Those of ordinary skill in the art may, on the basis of the technical teachings disclosed herein, make various other specific variations and combinations that do not depart from the essence of the invention, and such variations and combinations remain within the scope of protection of the invention.

Claims

1. A food safety event entity extraction method using multi-feature knowledge, characterized by comprising the following steps:

S1: Food safety event entity definition and feature statistics;

S11: Entity definition;

All food safety events are selected, and the food names in them and the material elements that caused the food safety events to occur are annotated; on the basis of the annotated corpus, the internal and external features of the "food name" and "material elements" entities are counted;

The internal features include entity length and quantity:

Entity length is obtained in order to gauge the complexity of the entities to be extracted and to determine the number of labels in the conditional random field label set;

The distribution of specific entities is counted in order to establish the specific content of the entities and to count the left and right boundary features of specific entities;

The external features of the entities:

The left and right boundaries of "food name" and "material elements" in the food safety event corpus are counted; the statistical results are of significant value for subsequently building the "food name" and "material elements" extraction model;

The boundary range of "food name" and "material elements" is limited to a clause ending in one of the punctuation marks ".！？"; the left boundary of "food name" and "material elements" is the start marker, and the range from the beginning of the sentence to the first marker is denoted β; the range from the last marker to the end of the sentence is denoted α; the left boundary words of "food name" and "material elements" are selected using the calculation shown in formula (1);

Here, f(W_left_outside) denotes the frequency with which W occurs within the range β, and f(W_left) denotes the frequency with which W occurs within β and inside "food name" and "material elements"; using formula (1) together with the food safety event corpus, the empirical threshold of P is set to 0.8, that is, when P ≥ 0.8, W may become a left boundary word of "food name" and "material elements"; then, combined with manual introspection based on linguistic knowledge, 7 left boundary words are finally determined: ", with and be, food, it is exceeded, in";

The right boundary words of "food name" and "material elements" are selected using formula (2);

Here, f(W_right_outside) denotes the frequency with which W occurs within the range α, and f(W_right) denotes the frequency with which W occurs within α and inside "food name" and "material elements"; the threshold of P for right boundary words is likewise set to 0.8; through introspection based on linguistic knowledge, combined with P values greater than or equal to 0.8, 10 right boundary words are finally determined: ", use, product, have, plant and be, surpass, In, production";

S2: Model establishment and feature determination

S21: Establishing the machine learning model

Let x = {x1, x2, ..., xn-1, xn} denote the observed input data sequence, for example the words of the corpus after segmentation; y = {y1, y2, ..., yn-1, yn} denotes the finite state set, where each state corresponds to a label; given the input sequence x, the conditional probability of the state sequence y of a linear-chain CRF with parameters λ = {λ1, λ2, ..., λn-1, λn} is given by formula (3) and formula (4);

Here, Z_x is the normalization factor, which sums the scores of all possible state sequences and ensures that the conditional probabilities of all possible state sequences sum to 1; the feature function takes a unified form and is usually a binary indicator function; λ_j is the weight of the corresponding feature function, obtained by training the model on the training data;

Based on the feature statistics for "food name" and "material elements", formula (5) is referred to when determining the number of CRF labels for "food name" and "material elements";

Here, L denotes the weighted average length of "food name" and "material elements" entities for i ≤ k; N_i denotes the number of "food name" and "material elements" entities of length i in the selected corpus; k and j denote the lengths of the longest and the shortest "food name" and "material elements" entities in the corpus, respectively; N denotes the total number of "food name" and "material elements" entities in the corpus;

Based on formula (5), together with the basic characteristics of the corpus and the corresponding experimental results, a label set with five positions is adopted for building the "food name" and "material elements" recognition model; the label set is denoted R, specifically R = {B, C, E, S, A}, where B marks the first word of a "food name" or "material elements" entity, C marks a middle word, E marks the last word, S marks a word outside any "food name" or "material elements" entity, and A marks a single word or character that by itself constitutes a "food name" or "material elements" entity; if the length of an entity exceeds 3, the additional words are labeled C;

S23: Feature selection and feature template formulation;

Features consist of two parts, atomic features and compound features;

The atomic features selected are six: the word itself, part of speech, word length, whether the word is an entity word, whether it is a left boundary word, and whether it is a right boundary word;

Compound features combine atomic features to characterize the complex linguistic properties of "food name" and "material elements" entities.

2. The method according to claim 1, characterized in that the collection process for food safety events in S11 is as follows: the collection targets mainly comprise food safety events on the Internet and food safety events in print media;

Food safety events on the network are collected automatically, mainly by event-topic-oriented vertical search engine technology; the collection range includes news portals, forums and blogs, and the heterogeneous data collected are saved to a database after the corresponding data cleaning, conversion and statistics; event cases on paper are collected by manual entry and proofreading;

Organization of food safety events: the food safety events are mainly classified and labeled, with the specific category labels based on the Food Safety Law of the People's Republic of China.

3. The method according to claim 1, characterized in that the window sizes for the six features in S23 are 7, 3, 5, 5, 5 and 5, respectively; a window of size 7 covers the offsets {-3, -2, -1, 0, 1, 2, 3}, a window of size 5 covers {-2, -1, 0, 1, 2}, and a window of size 3 covers {-1, 0, 1}; among the above features, in terms of their contribution to extraction performance for "food name" and "material elements", part of speech and the word itself are the most important features, followed by the left and right boundary words and the entity words, and finally the length of "food name" and "material elements".