CN108776656A - Food safety affair entity abstracting method based on condition random field - Google Patents

Info

Publication number
CN108776656A
CN108776656A CN201810569813.0A CN201810569813A CN108776656A CN 108776656 A CN108776656 A CN 108776656A CN 201810569813 A CN201810569813 A CN 201810569813A CN 108776656 A CN108776656 A CN 108776656A
Authority
CN
China
Prior art keywords
food
material elements
entity
word
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810569813.0A
Other languages
Chinese (zh)
Inventor
王东波
朱子赫
叶文豪
吴毅
王玥雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Agricultural University
Original Assignee
Nanjing Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Agricultural University
Priority to CN201810569813.0A
Publication of CN108776656A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for extracting food safety event entities using multi-feature knowledge, comprising the following steps: defining the entities of food safety events; compiling statistics on the internal and external features of those entities; establishing the machine learning model; selecting and processing the corpus; and selecting features and formulating feature templates. The advantages of the invention are that it provides a basic resource for building a food safety event knowledge base and for mining food safety countermeasures, and that it can automatically extract food names and the specific factors ("material elements") that cause food safety events. In building the extraction model, a large amount of feature knowledge is incorporated into the conditional random field model, and the model is trained on a large-scale annotated corpus of food safety events.

Description

Food safety affair entity abstracting method based on condition random field
Technical Field
The present invention relates to the technical field of food safety, and in particular to a method for extracting food safety event entities using multi-feature knowledge.
Background
To address the widely watched problem of food safety events, the Central Rural Economic Work Conference of 23 to 24 December 2013 explicitly proposed the concrete measure of "establishing a nationally unified information traceability platform for agricultural products and food safety as soon as possible". The basis for building such a platform is identifying the principal entities in food safety events, and entity extraction becomes especially important when handling public opinion on food safety. In response, the present invention builds a food safety event corpus and combines it with the conditional random field machine learning model, using multi-feature knowledge of food safety event entities to carry out extraction experiments on those entities. On the one hand this provides basic knowledge anchor points for building a food safety event knowledge base; on the other hand it lays a foundation for mining, analyzing, and summarizing strategies for responding to food safety events.
Research on food safety events has concentrated mainly on cases, policy, and emergency handling. Representative work includes the following. Wu Heng, a postgraduate at Fudan University, together with 34 network volunteers, created the "Throw It Out the Window" website [1], which collects events related to food safety and builds a database from them. That database supplied a certain number of texts for the food safety event corpus built in the present invention and is the basis of the corpus construction.
Research on food safety events has also been carried out from a management perspective. Representative work includes the following. Zhang Mujie et al. [2], based on two typical cases, analyzed the harm caused by withholding information during incident management and examined the common reasons for non-disclosure. Its method of selecting typical cases served as a methodological reference for choosing the corpus texts of the present invention. Ma Ying et al. [3] constructed an epidemic model of risk perception for food industry events and carried out numerical analysis and validation of the model on the "salt panic buying event" triggered by the earthquake in Japan. That study provided a reference for annotating food safety event names in the present invention.
The studies above provide macro-level guidance on method and strategy for the present invention, and also a concrete basis for determining the entities of food safety events.
In entity extraction, recent research mainly uses machine learning to extract entities from unstructured text. Representative work is as follows. Following a neural network strategy, Chen Yu et al. [4] attempted to use Deep Belief Nets to extract relations between entities. That study offered methodological guidance for determining the number of features in the present invention. Extracting entities with semantic knowledge is another currently popular strategy: Shao Fa et al. [5] started from the problem of polysemy and used a disambiguation strategy, drawing on HowNet resources and Bayesian classification, to extract entities. Although approaching entity recognition from disambiguation is scientifically sound, the overall performance of the method on large-scale corpora remains to be verified. For the rapidly growing volume of electronic medical text, Xu Hua et al. [6] applied rule-based methods to a word-segmented, part-of-speech-tagged medical corpus to extract entities from medical text, with overall performance above 80%. Rule-based methods adapt reasonably well to a corpus with particular characteristics, but because the rules implicit in the vocabulary of a specific corpus are not explored thoroughly, the coverage of the resulting rules can be relatively poor. This is one of the main reasons the present invention chooses the conditional random field model for food safety event entity extraction.
Information extraction research related to food safety events currently concentrates on vocabulary-level knowledge extraction from food complaint texts. Representative work includes Wei Xiuzhuo's [7] extraction of sensitive vocabulary from food complaint texts and Gao Rui's [8] ontology-based extraction of hazard information from food complaint texts. Compared with entity extraction, vocabulary-level extraction is relatively simple, chiefly because the words are shorter and their internal composition is simpler.
The conditional random field, as a sequential machine learning model, is widely applied to extracting terms and entities. Representative examples: Li Lishuang et al. [9] extracted automobile terms with a simple feature template; Wang Wenlong et al. [10] extracted entities from project application documents on the basis of word-combination feature templates; and Liu Kai et al. [11] built an entity extraction model for traditional Chinese medicine electronic medical records using feature knowledge of Chinese medicine vocabulary. These conditional-random-field-based methods use only simple feature knowledge of the entity itself and do not involve information from the context of the extracted object. The present invention constructs complex feature templates for recognizing food safety event entities, which to some extent makes up for the deficiencies of existing recognition methods.
The existing research above has two shortcomings. First, when building models based on conditional random fields, it either uses no corresponding features or uses relatively simple ones, so the overall performance of the resulting models leaves room for improvement. Second, existing models are essentially trained on small-scale corpora, whereas the present invention is built on a large-scale manually annotated corpus and therefore has strong model transferability and adaptability.
References
[1] Throw It Out the Window [EB/OL]. [2014-02-18]. http://www.zccw.info/;
[2] Zhang Mujie, Shen Jianhua. Thoughts on information disclosure in handling food and drug safety incidents [J]. Shanghai Food and Drug Administration Information Research, 2012 (2): 45-49;
[3] Ma Ying, Zhang Yuanyuan, Song Wenguang. Research on an epidemic model of risk perception for food industry events [J]. Science Research Management, 2013, 34 (9): 123-130;
[4] Chen Yu, Zheng Dequan, Zhao Tiejun. Chinese named entity relation extraction based on Deep Belief Nets [J]. Journal of Software, 2012, 23 (10): 2572-2585;
[5] Shao Fa, Huang Yinge, Zhou Lanjiang, et al. Chinese entity relation extraction based on entity disambiguation [J]. Journal of Shandong University (Engineering Science), 2014, 44 (6): 32-37;
[6] Xu Hua, Liu Maofu, Jiang Li, et al. Extraction of pathogenic bacteria entities based on linguistic rules [J]. Journal of Wuhan University (Edition), 2015, 61 (2): 51-55;
[7] Wei Xiuzhuo. Research on sensitive vocabulary extraction from food complaint texts [D]. Changchun: Northeast Normal University, 2015;
[8] Gao Rui. Research on ontology-based hazard information extraction from food complaint texts [D]. Changchun: Northeast Normal University, 2011.
Summary of the Invention
In view of the defects of the prior art, the present invention provides a method for extracting food safety event entities using multi-feature knowledge, which can effectively solve the problems of the prior art described above.
To achieve the above object, the technical solution adopted by the present invention is as follows:
A method for extracting food safety event entities using multi-feature knowledge, comprising the following steps:
S1: Food safety event entity definition and feature statistics;
S11: Entity definition;
A food safety event corpus is built by collecting, annotating, and organizing food safety events;
S12: Internal and external feature statistics of food safety event entities;
All food safety events are selected, and the food names in them and the specific factors that caused the events are annotated. On the basis of the annotated corpus, the internal and external features of the "food name" and "material elements" entities are counted.
The internal features comprise entity length and entity count:
Entity length is obtained in order to gauge the complexity of the entities to be extracted and to determine the number of labels in the conditional random field label set;
The distribution of specific entities is counted in order to capture the concrete content of the entities and to support the statistics on their left and right boundary features.
External features of the entities
The left and right boundaries of "food name" and "material elements" in the food safety event corpus are counted; the results are of significant value for the subsequent construction of the "food name" and "material elements" extraction model.
The boundary range of "food name" and "material elements" is limited to a clause ending in ".", "!" or "?". Left-boundary words precede the start mark of "food name" and "material elements", so they are sought in the span from the beginning of the sentence to the first mark, denoted β. The span from the last mark to the end of the sentence is denoted α. The left-boundary words of "food name" and "material elements" are selected using formula (1).
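The formula itself does not survive in this text. A form consistent with the definitions given for it, offered as a reconstruction rather than the patent's exact notation, would be:

```latex
P(W) = \frac{f(W_{\text{left\_outside}})}{f(W_{\text{left}})} \tag{1}
```

Under this reading, a value of P close to 1 means that W occurs almost always before the first mark and almost never inside an entity.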
Here f(W_left_outside) is the frequency with which word W occurs within β, and f(W_left) is the frequency with which W occurs within β or inside a "food name" or "material elements" entity. Applying formula (1) to the food safety event corpus, the empirical threshold for P is set to 0.8: when P ≥ 0.8, W is a candidate left-boundary word of "food name" and "material elements". Combined with manual linguistic judgment, 7 left-boundary words were finally determined (given here as English glosses): ",", "with", "and", "is", "food", "exceeds the standard", "in".
Formula (2) is used to select the right-boundary words of "food name" and "material elements".
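Formula (2) is likewise missing from this text. By symmetry with the left-boundary case, and again only as a reconstruction from the definitions given for it, it would read:

```latex
P(W) = \frac{f(W_{\text{right\_outside}})}{f(W_{\text{right}})} \tag{2}
```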
Here f(W_right_outside) is the frequency with which W occurs within α, and f(W_right) is the frequency with which W occurs within α or inside a "food name" or "material elements" entity. The threshold for the right-boundary P is likewise set to 0.8. Based on linguistic judgment combined with P values greater than or equal to 0.8, 10 right-boundary words were finally determined (English glosses): ",", "use", "product", "have", "plant", "and", "is", "exceed", "in", "production".
S2: Model overview and feature determination
S21: Machine learning model
Let x = {x1, x2, ..., xn-1, xn} denote the observed input sequence, for example the words of the corpus after segmentation, and let y = {y1, y2, ..., yn-1, yn} denote the state sequence over a finite state set, where each state corresponds to a label. Given the input sequence x, the conditional probability of the state sequence y under a linear-chain CRF with parameters λ = {λ1, λ2, ..., λn-1, λn} is given by formulas (3) and (4).
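Formulas (3) and (4) are not reproduced in this text. The standard linear-chain CRF definition, which matches the symbols described here, is shown below; the exact indexing used in the original filing is an assumption.

```latex
\begin{align}
p(y \mid x, \lambda) &= \frac{1}{Z_x} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j\, f_j(y_{i-1}, y_i, x, i) \Big) \tag{3} \\
Z_x &= \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j\, f_j(y'_{i-1}, y'_i, x, i) \Big) \tag{4}
\end{align}
```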
Here Z_x is the normalization factor, the sum of the scores of all possible state sequences, which ensures that the conditional probabilities of all possible state sequences sum to 1. f_j is a feature function of unified form, usually a binary indicator function, and λ_j is the weight of the corresponding feature function, obtained by training the model on the training data.
S22: Corpus selection and processing
In the corpus, each "food name" and "material elements" entity is marked in the form "【】";
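As an illustration of this marking convention, the following minimal Python sketch reads one annotated line into tokens. It assumes whitespace-separated `word/pos` tokens with entities wrapped in 【】; the function name `parse_marked` and the sample line are hypothetical and are not taken from the patent.

```python
import re

# Either a whole 【...】 span or a single whitespace-delimited token.
CHUNK = re.compile(r"【([^】]*)】|(\S+)")

def parse_marked(line):
    """Split a 'word/pos' line into (word, pos, in_entity) triples."""
    triples = []
    for match in CHUNK.finditer(line):
        in_entity = match.group(1) is not None
        chunk = match.group(1) if in_entity else match.group(2)
        for token in chunk.split():
            word, sep, pos = token.rpartition("/")
            if not sep:  # token without a part-of-speech suffix
                word, pos = token, ""
            triples.append((word, pos, in_entity))
    return triples

# Hypothetical sample line using English glosses in place of Chinese words.
sample = "add/v 【borax/n】 to/p 【benzoyl/n peroxide/n】 ./w"
for triple in parse_marked(sample):
    print(triple)
```

The sketch relies on brackets being separated from neighbouring tokens by whitespace; a line where a bracket touches a token directly would need extra handling.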
Based on the feature statistics for "food name" and "material elements", the number of CRF labels for "food name" and "material elements" is determined using formula (5).
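Formula (5) is not reproduced in this text. A weighted average consistent with the symbol definitions given for it, offered only as a reconstruction, would be:

```latex
L = \frac{1}{N} \sum_{i=j}^{k} i \, N_i \tag{5}
```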
Here L is the weighted average length of "food name" and "material elements" for i ≤ k; N_i is the number of occurrences in the selected corpus of "food name" and "material elements" of length i; k and j are the longest and shortest lengths of "food name" and "material elements" in the corpus, respectively; and N is the total number of "food name" and "material elements" in the corpus.
Based on formula (5), the basic characteristics of the corpus, and the corresponding experimental results, a 5-position label set is used in building the "food name" and "material elements" recognition model. The label set is denoted R, specifically R = {B, C, E, S, A}: B marks the first word of a "food name" or "material elements" entity, C a middle word, E the final word, S a word outside any "food name" or "material elements" entity, and A a single character or word that by itself constitutes a "food name" or "material elements" entity. If the length of a "food name" or "material elements" entity exceeds 3, the additional words are also labeled C.
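The label set R can be illustrated with a short Python sketch. It assumes the sentence has already been split into segments, each flagged as entity or non-entity; the function name `tag_segments` and the example words are hypothetical.

```python
def tag_segments(segments):
    """Assign B/C/E/S/A labels to a list of (words, is_entity) segments."""
    tags = []
    for words, is_entity in segments:
        n = len(words)
        if not is_entity:
            tags += ["S"] * n                        # outside any entity
        elif n == 1:
            tags.append("A")                         # single-word entity
        else:
            tags += ["B"] + ["C"] * (n - 2) + ["E"]  # begin, middle(s), end
    return tags

# Hypothetical example: two entities, one of two words and one of one word.
segments = [
    (["add"], False),
    (["benzoyl", "peroxide"], True),
    (["to"], False),
    (["flour"], True),
]
print(tag_segments(segments))
```

Entities longer than three units simply receive additional C labels between B and E, as the text describes.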
S23: Feature selection and feature template formulation
Features consist of two parts: atomic features and compound features.
Six atomic features are selected: the word itself, part of speech, word length, whether the word is an entity word, whether it is a left-boundary word, and whether it is a right-boundary word;
Compound features combine atomic features to characterize the complex linguistic properties of "food name" and "material elements" entities.
Further, the collection process for food safety events in S11 is as follows. The collection targets mainly comprise food safety events on the internet and food safety events in print media;
Events on the internet are collected automatically, mainly by event-topic-oriented vertical search engine technology; the collection scope covers news portals, forums, and blogs. The heterogeneous data collected are cleaned, converted, and tallied, then saved to a database, while paper-based event cases are collected by manual entry and proofreading.
Annotation of food safety events: mainly word segmentation and part-of-speech tagging of the food safety events;
Organization of food safety events: mainly classification of the food safety events, with the specific categories based on the Food Safety Law of the People's Republic of China.
Further, in S23 the feature window sizes for the 6 features are 7, 3, 5, 5, 5, and 5 respectively. The range of a 7-window is {-3, -2, -1, 0, 1, 2, 3}, that of a 5-window is {-2, -1, 0, 1, 2}, and that of a 3-window is {-1, 0, 1}. Among these features, in terms of their contribution to "food name" and "material elements" extraction performance, part of speech and the word itself are the most important, followed by the left- and right-boundary words and the entity word, and finally the length of "food name" and "material elements".
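The window sizes above can be turned into atomic feature template lines as in the sketch below. The text does not name a CRF toolkit; the `U..:%x[row,col]` notation follows the CRF++ template format and is an assumption, as is the column order (0 word, 1 part of speech, 2 word length, 3 entity word, 4 left boundary, 5 right boundary) and the function name `unigram_templates`.

```python
# Half-width of the window for each feature column (7 -> 3, 5 -> 2, 3 -> 1).
# Assumed column order: word, POS, length, entity word, left/right boundary.
WINDOWS = {0: 3, 1: 1, 2: 2, 3: 2, 4: 2, 5: 2}

def unigram_templates(windows):
    """Generate one atomic template line per (offset, column) pair."""
    lines = []
    for col, half in sorted(windows.items()):
        for offset in range(-half, half + 1):
            lines.append(f"U{len(lines):02d}:%x[{offset},{col}]")
    return lines

templates = unigram_templates(WINDOWS)
print(len(templates))
print("\n".join(templates[:10]))
```

Compound features would be written by joining several such references in one line, for example the word and its part of speech at the current position; which combinations the patent actually uses is not stated in this text.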
Compared with the prior art, the advantages of the present invention are as follows. It provides a basic resource for building a food safety event knowledge base and for mining food safety countermeasures. It can automatically extract food names and the specific factors that cause food safety events. In building the extraction model, a large amount of feature knowledge is incorporated into the conditional random field model, and the model is trained on a large-scale annotated food safety event corpus; these two points are the main innovations of the present invention.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the topology of the linear-chain CRF model.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawing and embodiments.
S1: Food safety event entity definition and feature statistics
S11: Overview of the food safety event corpus and entity definition
By collecting, annotating, and organizing food safety events, the present invention built a food safety event corpus covering 2005-2015. The collection process is as follows. The collection targets mainly comprise food safety events on the internet and food safety events in print media. Events on the internet are collected automatically, mainly by event-topic-oriented vertical search engine technology; the collection scope covers news portals, forums, and blogs. The heterogeneous data collected are cleaned, converted, and tallied, then saved to a database, while paper-based event cases are collected by manual entry and proofreading.
Annotation of food safety events mainly consists of word segmentation and part-of-speech tagging. Longer food safety names are tagged with the part of speech of the higher-level category: because vocabulary in food safety events tends to be longer than in general corpora, such expressions are treated as a single word during segmentation and then part-of-speech tagged. Organization of food safety events mainly consists of classifying the events, with the specific categories based on the Food Safety Law of the People's Republic of China. After this processing, the food safety event corpus reaches about 15 million characters and 6.87 million words, comprising 2800 food safety events in total.
The entities referred to in the present invention are mainly the food names involved in food safety events and the specific factors that cause the events. Examples of food names are "milk powder", "soy sauce", "rice", and "milk"; examples of specific factors ("material elements") are "additive", "formaldehyde", "benzoyl peroxide", and "trans-fatty acid". The main task of the present invention is to build a machine learning model that automatically extracts food names and the specific factors causing food safety events. A sample of the corpus used for training and testing the conditional random field model is as follows:
Enterprise/n or/c people/n /u "/w illegal/vn behaviors/n "/w in/f ,/w includes/v "/w productions/v personations/vn Registration/vn trade marks/n /u is bottled/"/w is in/p productions/vn processing/vn dumpling wrapper/n ,/wn Wantons skin/n mistakes by b water/n "/w In journey/n/f additions/v is toxic/vi is harmful/a substances/n【Borax/n】"/w is in/p productions/vn processing/vn by "/w【Omasum/ nr】,/wn【Squid/n】,/wn【Ox】Tripe/ng etc./v food/n /u processes/n in/f additions/v is toxic/vi is harmful/a objects Matter/n【Hydrogen peroxide/n】With/p hydrogen【Sodium oxide molybdena/n】"/w submits/v falsenesses/a materials/n acquirements/v food and drink/n services/vn to "/w License/vn "/w "/w distorts/v food/n productions/vn dates/n simultaneously/d sale/v "/w etc./u./wj
S12: Internal and external feature statistics of entities
The 2800 food safety events were selected, and the food names in them and the specific factors that caused the events were annotated manually. On the basis of the annotated corpus, the internal and external features of the "food name" and "material elements" entities were counted.
(1) Internal features
Word length
Obtaining entity length helps gauge the complexity of the entities to be extracted, and also helps determine the number of labels in the conditional random field label set. The length distribution of food safety event entities is shown in Table 1.
Table 1. Length distribution of food safety event entities
Entity length   Count    Entity length   Count
2               48 036   13              13
3               23 499   9               9
4               6 878    10              7
1               6 594    12              5
5               1 383    14              2
6               394      15              1
7               182      11              1
8               37       20              1
As Table 1 shows, entity lengths fall mainly between 1 and 5: entities of length 1-5 account for 99.25% of the total, entities of length 2 and 3 for 82.18%, entities of length 2 for 55.19%, and entities of length 3 for 27.00%. Entities of length 2 thus make up more than half of all entities, so entities of length 2 and 3 are the main extraction targets, for example "milk powder", "milk", "pork", "additive", and "gutter oil" (lengths refer to the original Chinese terms). Entities longer than 8 are mostly nouns with adjectival modifiers or complex proper nouns, such as "sodium cyclohexylsulfamate".
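The percentages quoted from Table 1 can be re-derived with a few lines of Python. The counts are copied from the table; the helper names `share` and `weighted_mean_length` are hypothetical, and the weighted mean is computed under the assumption that formula (5) is a plain count-weighted average of length.

```python
# Entity length -> count, copied from Table 1.
LENGTH_COUNTS = {
    2: 48036, 3: 23499, 4: 6878, 1: 6594, 5: 1383, 6: 394, 7: 182, 8: 37,
    13: 13, 9: 9, 10: 7, 12: 5, 14: 2, 15: 1, 11: 1, 20: 1,
}
TOTAL = sum(LENGTH_COUNTS.values())

def share(lengths):
    """Percentage of all entities whose length is in `lengths`."""
    return round(100 * sum(LENGTH_COUNTS[n] for n in lengths) / TOTAL, 2)

def weighted_mean_length():
    """Count-weighted average entity length (assumed reading of formula (5))."""
    return sum(n * c for n, c in LENGTH_COUNTS.items()) / TOTAL

print(TOTAL)
print(share(range(1, 6)), share([2, 3]), share([2]), share([3]))
print(round(weighted_mean_length(), 3))
```

The total and the four shares agree with the figures stated in the text.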
Distribution of specific entities
Counting the distribution of specific entities not only gives an intuitive picture of the concrete content of the entities but also supports the statistics on their left and right feature knowledge. The distribution of some food safety event entities is shown in Table 2.
Table 2. Distribution of specific food safety event entities
Entity         Count   Entity              Count
Additive       2 243   Rice                899
Milk powder    1 661   Milk                810
Gutter oil     1 178   Medicine bag        733
Soy sauce      1 078   Total plate count   377
Wine           1 006   Nitrite             352
Pork           943     Trans-fatty acid    95
Formaldehyde   904     Benzoyl peroxide    90
Table 2 shows only part of the entity data: the 10 highest-ranked entities and the highest-ranked entities of length 4-6 (the data comprise 3 193 entries and 87 042 occurrences in total). Because most of the top 10 are entities of length 2, no further entities of that length are added to the table. The total number of entities counted is 87 042, of which the top 10 account for 13.16%, the top 5 for 8.23%, the second-ranked item, milk powder, for 1.91%, and the first-ranked item, additive, for 2.58%.
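The shares quoted for Table 2 can be checked the same way. The counts below are the ten highest-ranked entries of the table in descending order, and the total of 87 042 is the figure stated in the text; the names `TOP10` and `pct` are hypothetical.

```python
# Ten highest-ranked entities from Table 2, in descending order of count.
TOP10 = [
    ("additive", 2243), ("milk powder", 1661), ("gutter oil", 1178),
    ("soy sauce", 1078), ("wine", 1006), ("pork", 943), ("formaldehyde", 904),
    ("rice", 899), ("milk", 810), ("medicine bag", 733),
]
TOTAL_ENTITIES = 87042  # total stated in the text

def pct(count):
    """Share of all entity occurrences, in percent, rounded to 2 places."""
    return round(100 * count / TOTAL_ENTITIES, 2)

top10_share = pct(sum(c for _, c in TOP10))
top5_share = pct(sum(c for _, c in TOP10[:5]))
print(top10_share, top5_share, pct(TOP10[1][1]), pct(TOP10[0][1]))
```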
(2) External features of entities
The left and right boundaries of "food name" and "material elements" differ considerably across the corpora of different food safety events. The left and right boundaries of "food name" and "material elements" in the food safety event corpus were therefore counted separately; the results are of significant value for the subsequent construction of the "food name" and "material elements" extraction model.
The boundary range of "food name" and "material elements" is limited to a clause ending in ".", "!" or "?". A left-boundary word can never extend past the first mark, that is, the start mark of "food name" and "material elements", so the search range is limited to the span from the beginning of the sentence to the first mark, denoted β. Likewise, a right-boundary feature word can never precede the last mark of "food name" and "material elements", so the search range is limited to the span from the last mark to the end of the sentence, denoted α. The left-boundary words of "food name" and "material elements" are selected using formula (1).
Here, f(W_left_outside) denotes the frequency with which word W occurs within range β, and f(W_left) denotes the frequency with which W occurs within β and inside "food name" and "material elements". Applying formula (1) to the food safety event corpus, the empirical threshold of P is set to 0.8: when P ≥ 0.8, W is a likely left boundary word of "food name" and "material elements". Combined with manual introspection based on linguistic knowledge, 7 left boundary words were finally determined: ", with and be, food, it is exceeded, in ".
Similarly, formula (2) is used to select the right boundary words of "food name" and "material elements".
Here, f(W_right_outside) denotes the frequency with which W occurs within range α, and f(W_right) denotes the frequency with which W occurs within α and inside "food name" and "material elements". The threshold of P for right boundary words is also set to 0.8. Based on introspection from linguistic knowledge, combined with P values greater than or equal to 0.8, 10 right boundary words were finally determined: ", use, product, have, plant and Be, surpass, in, production ".
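The threshold-based selection of boundary words can be sketched in Python. This is a minimal sketch, not the patent's implementation: formulas (1) and (2) are not shown in the text, so the ratio P = f_outside / (f_outside + f_inside) used here is an assumption inferred from the definitions of f(W_left_outside) and f(W_left). The function name `select_boundary_words` and the toy tokens are hypothetical.

```python
from collections import Counter


def select_boundary_words(context_tokens, entity_tokens, threshold=0.8):
    """Return {word: P} for words whose share of occurrences outside entities
    reaches the threshold.

    context_tokens: tokens seen in the context range (beta or alpha).
    entity_tokens:  tokens seen inside annotated entities.
    Assumed ratio:  P = f_outside / (f_outside + f_inside).
    """
    f_outside = Counter(context_tokens)
    f_inside = Counter(entity_tokens)
    candidates = {}
    for word, outside_count in f_outside.items():
        total = outside_count + f_inside[word]  # Counter returns 0 if unseen
        p = outside_count / total
        if p >= threshold:
            candidates[word] = p
    return candidates


# Toy data (hypothetical): "in" occurs 9 times before entities, once inside.
beta_tokens = ["in"] * 9 + ["milk"] * 2
inside_tokens = ["in"] + ["milk"] * 8
left_candidates = select_boundary_words(beta_tokens, inside_tokens)
```

In this toy run only "in" passes the 0.8 threshold. As the text notes, the candidates would still be reviewed manually before the final word list is fixed.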
S2: Model overview and feature determination
S21: Machine learning model
The conditional random field (CRF) is a model well suited to sequence labeling problems. It is an undirected graphical model that, given an observation sequence to be labeled, computes the joint conditional probability distribution of the state labels over the entire observation sequence. For given input values at specified nodes, the conditional probability of the output values at those nodes can be computed, and the training objective is to maximize this conditional probability. The most common CRF model is the first-order chain structure, i.e. the linear chain, whose topology is shown in Figure 1.
Let x = {x1, x2, ..., xn-1, xn} denote the observed input data sequence, which in the present invention is the words obtained after segmenting the corpus; y = {y1, y2, ..., yn-1, yn} denotes the states drawn from a finite state set, where each state corresponds to a label. Given an input sequence x, the conditional probability of the state sequence y for a linear-chain CRF with parameters λ = {λ1, λ2, ..., λn-1, λn} is shown in formula (3) and formula (4).
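For reference, the linear-chain CRF is usually written in the following textbook form. It is offered here as an assumption about what formulas (3) and (4) express, using the symbols Z_x, f_j and λ_j that the description refers to:

```latex
% Conditional probability of a label sequence y given the observations x
p_\lambda(y \mid x) = \frac{1}{Z_x}
  \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, x, i) \Big)

% Normalization factor: the same score summed over every possible sequence y'
Z_x = \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y'_{i-1}, y'_i, x, i) \Big)
```

Dividing by Z_x is what makes the conditional probabilities of all possible state sequences sum to 1.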
Here, Z_x is the normalization factor; it sums the scores of all possible state sequences and ensures that the conditional probabilities of all possible state sequences sum to 1. f_j is a feature function in a unified form, usually a binary indicator function; λ_j is the weight of the corresponding feature function, obtained by training the model on the training data.
The maximum entropy model (ME) is based on the principle of maximum entropy as described by McCallum et al. (McCallum A, Freitag D, Pereira F. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML 2000, 2000: 591-598). The principle states that when information about a probability distribution is uncertain, the least biased approach is to treat the distribution as evenly as possible and make no subjective assumptions: among the distributions that satisfy the given constraints derived from the training data, the one with maximum entropy is the required distribution. The maximum entropy model is widely used in fields such as artificial intelligence and natural language processing, but because it suffers from label bias, it produces many misrecognized and unrecognized cases, so that in some situations it performs worse than models such as CRF.
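The role of the normalization factor in a linear-chain CRF can be illustrated with a brute-force Python sketch. It assumes the standard linear-chain form with binary observation and transition features; the tag subset, the weights and the two toy words are hypothetical and chosen by hand rather than learned from the corpus.

```python
import itertools
import math

TAGS = ["B", "E", "S"]  # small subset of labels, for illustration only


def score(words, tags, weights):
    """Sum of weighted binary feature functions over all positions."""
    total = 0.0
    previous = "<START>"
    for word, tag in zip(words, tags):
        total += weights.get(("obs", word, tag), 0.0)       # observation feature
        total += weights.get(("edge", previous, tag), 0.0)  # transition feature
        previous = tag
    return total


def conditional_probability(words, tags, weights):
    """p(tags | words), with Z_x computed by enumerating every tag sequence."""
    z_x = sum(
        math.exp(score(words, candidate, weights))
        for candidate in itertools.product(TAGS, repeat=len(words))
    )
    return math.exp(score(words, tags, weights)) / z_x


# Hypothetical weights, set by hand instead of being trained.
weights = {
    ("obs", "trans", "B"): 2.0,
    ("obs", "fatty-acid", "E"): 2.0,
    ("edge", "B", "E"): 1.5,
}
words = ["trans", "fatty-acid"]
probabilities = {
    candidate: conditional_probability(words, candidate, weights)
    for candidate in itertools.product(TAGS, repeat=len(words))
}
best = max(probabilities, key=probabilities.get)
```

Enumerating every sequence is exponential in sentence length, so it is only workable for toy inputs; practical CRF toolkits compute the same quantity with dynamic programming.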
S22: Corpus selection and corpus processing
Specifically, the "food name" and "material elements" entities are marked in the corpus in the form "【】", for example:
"【/ wky milk/n】/ wky ,/wd 30/m is remaining/and m/q law enforcements/vn personnel/n comes the/west the v/streets b long/n agricultural trades Market/n ", verification/v【/ wky is trans-/b aliphatic acid/n】/ wky /u/ problems/n./wd
Based on the feature statistics for "food name" and "material elements" and on the definition of the conditional random field model, the present invention mainly refers to formula (5) when determining the number of CRF labels for "food name" and "material elements".
Here, L denotes the weighted average length of "food name" and "material elements" for i ≤ k; N_i denotes the number of occurrences in the selected corpus of "food name" and "material elements" of length i; k and j denote the lengths of the longest and shortest "food name" and "material elements" in the corpus, respectively; and N denotes the total number of "food name" and "material elements" in the corpus. Based on formula (5), together with the characteristics of the corpus and the corresponding experimental results, a tag set with 5 positions was chosen for building the "food name" and "material elements" recognition model. The tag set is denoted R, specifically R = {B, C, E, S, A}: B marks the initial word of "food name" and "material elements", C a middle word, E the final word, S a word outside "food name" and "material elements", and A a single word or character that alone constitutes a "food name" or "material elements" entity. If the length of "food name" and "material elements" exceeds 3, the additional words are also tagged C.
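The weighted average length can be sketched as follows. Formula (5) itself is not shown in the text, so this sketch assumes the usual weighted mean, L = (sum over i from j to k of i × N_i) / N, which matches the symbol definitions above. The counts in `length_counts` are hypothetical, not corpus statistics.

```python
def weighted_average_length(length_counts):
    """Weighted average entity length.

    length_counts maps an entity length i to N_i, the number of entities
    of that length. Assumed formula: L = sum(i * N_i) / N.
    """
    total_entities = sum(length_counts.values())  # N
    weighted_sum = sum(i * n_i for i, n_i in length_counts.items())
    return weighted_sum / total_entities


# Hypothetical distribution: shortest length j = 1, longest length k = 4.
length_counts = {1: 40, 2: 30, 3: 20, 4: 10}
average_length = weighted_average_length(length_counts)
```

An average length well above 1 suggests that tags for initial, middle and final words are all needed, which is consistent with the 5-tag set R chosen in the text.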
In the present invention, a Java program was written that uses the "【】" markers of "food name" and "material elements" in the corpus, together with the selected features and the formulated feature templates, to label the entire corpus automatically. A labeling sample is shown in Table 3.
Table 3 Labeling sample from the "food name" and "material elements" training and test corpora
Word | Part of speech | Word length | Is entity word | Is left boundary | Is right boundary | Label
It is related | p | 2 | N | N | N | S
It is trans- | b | 2 | Y | N | N | B
Aliphatic acid | n | 3 | Y | N | N | E
Problem | n | 1 | N | N | N | S
, | wd | 1 | N | N | N | S
Zhejiang Province | ns | 3 | N | N | N | S
Jinhua | ns | 3 | N | N | N | S
Public security bureau | n | 3 | N | N | N | S
Jiangnan | ns | 2 | N | N | N | S
Branch office | n | 2 | N | N | N | S
It is connected to | v | 2 | N | N | N | S
The masses | n | 2 | N | N | N | S
Report | vn | 2 | N | N | N | S
Claim | v | 1 | N | N | N | S
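The conversion from "【】" markers to the labels in the last column of Table 3 can be sketched in Python. The description mentions a Java program for this step but does not show it, so the function `tag_tokens` and its token-list input format are hypothetical; only the tag set R = {B, C, E, S, A} comes from the text.

```python
def tag_tokens(tokens, open_mark="【", close_mark="】"):
    """Convert a bracket-annotated token list into (word, tag) pairs
    using the tag set R = {B, C, E, S, A}."""
    tagged = []
    entity = None  # None while outside an entity, else the words collected so far
    for token in tokens:
        if token == open_mark:
            entity = []
        elif token == close_mark:
            if entity is None:
                raise ValueError("closing marker without an opening marker")
            if len(entity) == 1:
                tagged.append((entity[0], "A"))                 # single-word entity
            elif len(entity) > 1:
                tagged.append((entity[0], "B"))                 # initial word
                tagged.extend((w, "C") for w in entity[1:-1])   # middle words
                tagged.append((entity[-1], "E"))                # final word
            entity = None
        elif entity is not None:
            entity.append(token)
        else:
            tagged.append((token, "S"))                         # outside any entity
    if entity is not None:
        raise ValueError("opening marker without a closing marker")
    return tagged


# Toy sentence modelled on the Table 3 sample, plus a one-word entity.
tokens = ["verify", "【", "trans", "fatty-acid", "】", "problem", "【", "milk", "】"]
tagged = tag_tokens(tokens)
```

For the two-word entity this yields B followed by E, as in the Table 3 rows for the trans fatty acid example, while the one-word entity receives A.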
S23: Feature selection and feature template formulation
In machine learning models based on conditional random fields, feature selection is critically important: the quality of the selected features directly affects the performance of the CRF model. Features consist of two parts, atomic features and compound features. The atomic features chosen in the present invention are 6 in number: the word itself, part of speech, word length, whether the word is an entity word, whether it is a left boundary word, and whether it is a right boundary word. Compound features combine atomic features to characterize the complex linguistic properties of "food name" and "material elements" entities. The window sizes of the 6 features are 7, 3, 5, 5, 5 and 5, respectively. A window of size 7 covers the offsets {-3, -2, -1, 0, 1, 2, 3}, a window of size 5 covers {-2, -1, 0, 1, 2}, and a window of size 3 covers {-1, 0, 1}. In terms of their contribution to the extraction performance for "food name" and "material elements", part of speech and the word itself are the most important features, followed by the left and right boundary words and entity words, and finally the length of "food name" and "material elements".
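The windowed atomic features can be sketched in Python as follows. This is a sketch under stated assumptions: the mapping of window sizes to features follows the order in which the text lists them (word itself 7, part of speech 3, the remaining four 5 each), and the dictionary keys, the helper `row` and the sample rows are hypothetical.

```python
# Window size per atomic feature, in the order listed in the text (assumed mapping).
WINDOWS = {
    "word": 7,
    "pos": 3,
    "length": 5,
    "is_entity_word": 5,
    "is_left_boundary": 5,
    "is_right_boundary": 5,
}


def window_offsets(size):
    """Offsets for an odd window size, e.g. 3 -> [-1, 0, 1]."""
    half = size // 2
    return list(range(-half, half + 1))


def atomic_features(rows, position):
    """Collect the atomic features visible from `position`.

    rows is a list of dicts with the keys used in WINDOWS; offsets that
    fall outside the sentence are skipped.
    """
    features = {}
    for name, size in WINDOWS.items():
        for offset in window_offsets(size):
            index = position + offset
            if 0 <= index < len(rows):
                features[f"{name}[{offset}]"] = rows[index][name]
    return features


def row(word, pos, length, is_entity):
    """Build one token row; boundary flags are 'N' as in the Table 3 sample."""
    return {"word": word, "pos": pos, "length": length,
            "is_entity_word": is_entity,
            "is_left_boundary": "N", "is_right_boundary": "N"}


rows = [row("related", "p", 2, "N"), row("trans", "b", 2, "Y"),
        row("fatty-acid", "n", 3, "Y"), row("problem", "n", 1, "N")]
features = atomic_features(rows, position=1)
```

Compound features could then be formed by joining several of these values into one string, for example the part of speech at offset 0 with the entity-word flag at offset 0; which combinations the feature templates actually use is not stated in the text.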
Entity extraction experiment
The performance of the extraction models is evaluated mainly with three metrics: precision, recall and F value (F-measure). Using the corpus annotated above, the conditional random field model and the maximum entropy model were each used to extract "food name" and "material elements". In the experiments, the performance of the constructed models was tested by cross-validation: the 2800 corpus documents were divided into a training corpus and a test corpus at a ratio of 9:1. The test results are shown in Table 4 and Table 5, and Table 6 compares the training and test time of the two models under the same software and hardware conditions.
Table 4 Extraction performance for "food name" and "material elements" based on the conditional random field model
Table 5 Extraction performance for "food name" and "material elements" based on the maximum entropy model
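The 9:1 cross-validation split can be sketched in Python. The text does not say how documents were assigned to folds, so the striding scheme below and the function name `ten_fold_splits` are assumptions; integers stand in for the 2800 corpus documents.

```python
def ten_fold_splits(documents, folds=10):
    """Yield (train, test) lists; each fold serves exactly once as the test set."""
    for k in range(folds):
        test = documents[k::folds]  # every 10th document, starting at index k
        train = [doc for i, doc in enumerate(documents) if i % folds != k]
        yield train, test


documents = list(range(2800))  # stand-ins for the 2800 corpus documents
splits = list(ten_fold_splits(documents))
```

Each of the ten rounds then trains on 2520 documents and tests on the remaining 280, which corresponds to the ten numbered tests reported in the results tables.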
Test No. | Precision | Recall | F value
1 | 72.55% | 62.50% | 67.15%
2 | 73.72% | 61.89% | 67.29%
3 | 81.90% | 65.19% | 72.60%
4 | 84.10% | 59.97% | 70.01%
5 | 81.67% | 62.49% | 70.80%
6 | 86.52% | 63.70% | 73.38%
7 | 81.66% | 65.74% | 72.84%
8 | 72.71% | 67.10% | 69.79%
9 | 74.72% | 63.37% | 68.58%
10 | 80.88% | 65.40% | 72.32%
Mean | 79.04% | 63.74% | 70.48%
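The F values in Table 5 are consistent with the balanced F-measure, the harmonic mean of precision and recall. The following Python sketch checks this against the first row; the function name `f_measure` is hypothetical, and the formula F = 2PR / (P + R) is the standard definition rather than one quoted from the text.

```python
def f_measure(precision, recall):
    """Balanced F value: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test 1 of Table 5: precision 72.55%, recall 62.50%.
f_test_1 = round(f_measure(72.55, 62.50), 2)

# The "Mean" row averages the ten per-test F values.
f_values = [67.15, 67.29, 72.60, 70.01, 70.80, 73.38, 72.84, 69.79, 68.58, 72.32]
mean_f = round(sum(f_values) / len(f_values), 2)
```

This reproduces the 67.15% reported for Test 1 and the mean F value of 70.48%, which shows that the mean row is an arithmetic mean over the ten tests rather than an F value computed from the mean precision and recall.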
Table 6 Comparison of training and test time for the conditional random field and maximum entropy models
Tables 4 and 5 show that the "food name" and "material elements" recognition model based on the conditional random field clearly outperforms the one based on the maximum entropy model. The F value of the conditional random field model ranges from a minimum of 90.06% to a maximum of 91.94%, with an average of 90.88%, whereas the F value of the maximum entropy model peaks at only 73.38% and averages only 70.48%. Table 6 shows that, in terms of training and test time, the maximum entropy model is better than the conditional random field model: one round of training and testing takes about 100 seconds for the former, while the latter needs about 50,000 seconds. Because subsequent research is more concerned with the recognition performance for "food name" and "material elements" than with the length of training time, the present invention selects the conditional random field model to recognize "food name" and "material elements". A simple analysis of the "food name" and "material elements" recognized by the conditional random field model shows that most recognition errors involve long entities, such as "Vibrio parahaemolyticus bacterium", "village Qiao Jiashangao steamed bun", "sodium humate with added selenium for veterinary use", "being polluted by bacillus cereus" and "Wang Shi propolis soft capsules". These entities either contain place names and adjectives that are difficult to recognize, such as "Qiao Jia" and "grid are high", or combine a personal name with a noun, such as "Wang Shi" and "propolis". The complex constituents of these entities affect the precision and recall of entity recognition in the food safety event corpus.
Those of ordinary skill in the art will understand that the embodiments described herein are intended to help the reader understand how the invention is implemented, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Based on the technical teachings disclosed by the invention, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.

Claims (3)

1. A food safety event entity extraction method based on multi-feature knowledge, characterized by comprising the following steps:
S1: Food safety event entity definition and feature statistics;
S11: Entity definition;
A food safety event corpus is constructed on the basis of collecting, annotating and organizing food safety events;
S12: Statistics of the internal and external features of food safety event entities;
All food safety events are selected, and the food names in them and the material elements that caused the food safety events are annotated; on the basis of the annotated corpus, the internal and external features of the "food name" and "material elements" entities are counted;
The internal features include entity length and quantity:
Entity length is obtained in order to gauge the complexity of the entities to be extracted and to determine the size of the conditional random field tag set;
The distribution of specific entities is counted in order to understand the specific content of the entities and to count the left and right boundary features of specific entities;
The external features of entities:
The left and right boundaries of "food name" and "material elements" in the food safety event corpus are counted; the statistics are of significant value for the subsequent construction of the "food name" and "material elements" extraction model;
The boundary range of "food name" and "material elements" is limited to a clause ending in ".!?"; the left boundary of "food name" and "material elements" is the start marker, and the range from the start of the sentence to the first marker is denoted β; the range from the last marker to the end of the sentence is denoted α; the formula used to select the left boundary words of "food name" and "material elements" is shown in formula (1);
Here, f(W_left_outside) denotes the frequency with which W occurs within range β, and f(W_left) denotes the frequency with which W occurs within β and inside "food name" and "material elements"; applying formula (1) to the food safety event corpus, the empirical threshold of P is set to 0.8, i.e. when P ≥ 0.8, W is a likely left boundary word of "food name" and "material elements"; combined with manual introspection based on linguistic knowledge, 7 left boundary words are finally determined: ", with and be, food, it is exceeded, in ";
Formula (2) is used to select the right boundary words of "food name" and "material elements";
Here, f(W_right_outside) denotes the frequency with which W occurs within range α, and f(W_right) denotes the frequency with which W occurs within α and inside "food name" and "material elements"; the threshold of P for right boundary words is also set to 0.8; based on introspection from linguistic knowledge, combined with P values greater than or equal to 0.8, 10 right boundary words are finally determined: ", use, product, have, plant and be, surpass, In, production ";
S2: Model establishment and feature determination
S21: Establishing the machine learning model
Let x = {x1, x2, ..., xn-1, xn} denote the observed input data sequence, for example the words obtained after segmenting the corpus; y = {y1, y2, ..., yn-1, yn} denotes the states drawn from a finite state set, where each state corresponds to a label; given an input sequence x, the conditional probability of the state sequence y for a linear-chain CRF with parameters λ = {λ1, λ2, ..., λn-1, λn} is shown in formula (3) and formula (4);
Here, Z_x is the normalization factor; it sums the scores of all possible state sequences and ensures that the conditional probabilities of all possible state sequences sum to 1; f_j is a feature function in a unified form, usually a binary indicator function; λ_j is the weight of the corresponding feature function, obtained by training the model on the training data;
S22: Corpus selection and corpus processing
Specifically, the "food name" and "material elements" entities are marked in the corpus in the form "【】";
Based on the feature statistics for "food name" and "material elements", formula (5) is referred to when determining the number of CRF labels for "food name" and "material elements";
Here, L denotes the weighted average length of "food name" and "material elements" for i ≤ k; N_i denotes the number of occurrences in the selected corpus of "food name" and "material elements" of length i; k and j denote the lengths of the longest and shortest "food name" and "material elements" in the corpus, respectively; and N denotes the total number of "food name" and "material elements" in the corpus;
Based on formula (5), together with the characteristics of the corpus and the corresponding experimental results, a tag set with 5 positions is used in building the "food name" and "material elements" recognition model; the tag set is denoted R, specifically R = {B, C, E, S, A}, where B marks the initial word of "food name" and "material elements", C a middle word, E the final word, S a word outside "food name" and "material elements", and A a single word or character that alone constitutes a "food name" or "material elements" entity; if the length of "food name" and "material elements" exceeds 3, the additional words are also tagged C;
S23: Feature selection and feature template formulation;
Features consist of two parts, atomic features and compound features;
The atomic features selected are 6 in number: the word itself, part of speech, word length, whether the word is an entity word, whether it is a left boundary word, and whether it is a right boundary word;
Compound features combine atomic features to characterize the complex linguistic properties of "food name" and "material elements" entities.
2. The method according to claim 1, characterized in that the collection process of food safety events in S11 is as follows: the collection targets of food safety events mainly include food safety events on the Internet and food safety events in print media;
Food safety events on the network are collected automatically, mainly by means of event-topic-oriented vertical search engine technology; the collection scope includes news portals, forums and blogs; the heterogeneous data collected are stored in a database after the corresponding data cleaning, conversion and statistics, while paper-based event cases are collected by manual entry and proofreading;
Annotation of food safety events: mainly word segmentation and part-of-speech tagging of the food safety events;
Organization of food safety events: mainly category labeling of the food safety events, with the specific category labels based on the Food Safety Law of the People's Republic of China.
3. The method according to claim 1, characterized in that: in S23, the window sizes of the 6 features are 7, 3, 5, 5, 5 and 5, respectively; a window of size 7 covers the offsets {-3, -2, -1, 0, 1, 2, 3}, a window of size 5 covers {-2, -1, 0, 1, 2}, and a window of size 3 covers {-1, 0, 1}; among the above features, in terms of their contribution to the extraction performance for "food name" and "material elements", part of speech and the word itself are the most important features, followed by the left and right boundary words and entity words, and finally the length of "food name" and "material elements".
CN201810569813.0A 2018-06-05 2018-06-05 Food safety affair entity abstracting method based on condition random field Pending CN108776656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810569813.0A CN108776656A (en) 2018-06-05 2018-06-05 Food safety affair entity abstracting method based on condition random field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810569813.0A CN108776656A (en) 2018-06-05 2018-06-05 Food safety affair entity abstracting method based on condition random field

Publications (1)

Publication Number Publication Date
CN108776656A true CN108776656A (en) 2018-11-09

Family

ID=64024557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810569813.0A Pending CN108776656A (en) 2018-06-05 2018-06-05 Food safety affair entity abstracting method based on condition random field

Country Status (1)

Country Link
CN (1) CN108776656A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766541A (en) * 2018-12-12 2019-05-17 咪咕文化科技有限公司 Marketing strategy identification method, server and computer storage medium
CN111259106A (en) * 2019-12-31 2020-06-09 贵州大学 Relation extraction method combining neural network and feature calculation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110035210A1 (en) * 2009-08-10 2011-02-10 Benjamin Rosenfeld Conditional random fields (crf)-based relation extraction system
CN103268339A (en) * 2013-05-17 2013-08-28 中国科学院计算技术研究所 Recognition method and system of named entities in microblog messages
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
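吴毅 (Wu Yi): "基于复杂特征知识的食品安全事件多类型命名实体抽取研究" [Research on multi-type named entity extraction for food safety events based on complex feature knowledge], 《中国优秀硕士学位论文全文数据库工程科技I辑》 [China Master's Theses Full-text Database, Engineering Science and Technology I] *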

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766541A (en) * 2018-12-12 2019-05-17 咪咕文化科技有限公司 Marketing strategy identification method, server and computer storage medium
CN109766541B (en) * 2018-12-12 2023-08-18 咪咕文化科技有限公司 Marketing strategy identification method, server and computer storage medium
CN111259106A (en) * 2019-12-31 2020-06-09 贵州大学 Relation extraction method combining neural network and feature calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181109