CN110609896A - Military scenario text event information extraction method and device based on secondary decoding

Info

Publication number
CN110609896A
Authority
CN
China
Prior art keywords
event
word
extraction
sequence
trigger word
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910653282.8A
Other languages
Chinese (zh)
Other versions
CN110609896B (en)
Inventor
刘乾
杨若鹏
蒋序平
卢稳新
鲁云军
鲁义威
战立莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN201910653282.8A
Publication of CN110609896A
Application granted
Publication of CN110609896B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a military scenario text event information extraction method and device based on secondary decoding. The method comprises the following steps: 1. preprocessing, namely constructing a professional dictionary and performing sentence segmentation and word segmentation to obtain a data set represented as word sequences; 2. corpus labeling, namely defining the structured semantics of 9 types of events in military scenario text, formulating a corpus labeling method and rules, and manually labeling the corpus to obtain a training set and a test set; 3. model training, namely encoding the training set with a machine learning model to obtain event extraction parameters; 4. information extraction, namely inputting the test set, performing a first decoding with a decoding algorithm to obtain an event trigger word extraction sequence, and then performing a second decoding that adaptively selects the event element extraction parameters according to the trigger word extraction result to obtain an event element extraction sequence. The invention solves the problem of mismatch between event trigger words and event elements in single-pass decoding extraction methods and improves the accuracy of event information extraction.

Description

Military scenario text event information extraction method and device based on secondary decoding
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a military scenario text event information extraction method and device based on secondary decoding.
Background
A military scenario text is a descriptive text that sets out assumptions and projections about the intentions of the two opposing sides, the situation, and the course of the battle. Event information extraction is an important branch of natural language processing. Event extraction technology can extract key information of interest from large volumes of natural language text, including the event type, event subject, time of occurrence, place of occurrence, and event object. It can therefore help military personnel accurately obtain key information of interest from large amounts of irregular, loosely structured military scenario text.
Traditional event extraction relies mainly on either pattern matching or machine learning. Pattern matching constructs extraction patterns; it performs well within a specific domain but has poor portability and flexibility, because the patterns must be rebuilt for each new domain at considerable cost in time and labor. Machine learning is applicable across domains and is more portable and flexible, but it requires a large labeled corpus and places high demands on that corpus: if the corpus is too small or the labeling quality is low, extraction accuracy and precision both decline.
Because military scenario text is highly specialized, applying pattern matching to it consumes a great deal of time and labor and cannot meet practical requirements. Machine learning methods depend on large-scale corpora for training, but the available military scenario text corpus is limited in scale and suffers from serious data sparseness.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a military scenario text event information extraction method and device based on secondary decoding. It addresses the problem that event trigger words and event elements are mismatched when a machine learning model performs single-pass decoding on a limited corpus whose scale is unbalanced across event types, and it can improve the accuracy of event extraction.
In order to achieve the purpose, the invention adopts the following technical scheme:
A military scenario text event information extraction method based on secondary decoding, the method comprising the following steps:
A. Preprocessing, which performs text preprocessing on the input military scenario text corpus: a professional dictionary is constructed from the input corpus on the basis of the built-in dictionary of an open source word segmenter, and the corpus is then segmented into sentences and words in sequence to generate a data set expressed as word sequences. It specifically comprises the following steps:
A1, Chinese sentence segmentation, which divides the military scenario text corpus into sentences at Chinese sentence-ending punctuation marks to form a sentence set;
A2, professional dictionary construction, which constructs the professional dictionary on the basis of the built-in dictionary of the open source word segmenter;
A3, Chinese word segmentation, which segments each sentence in the sentence set with the open source word segmenter using the professional dictionary to obtain a word set, and lists the words line by line to generate a word sequence;
the open source word segmenter includes, but is not limited to, jieba, Hanlp, CoreNLP, and thulac.
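Steps A1 to A3 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it uses a simple forward maximum matching segmenter as a stand-in for an open source segmenter such as jieba, and the dictionary entries, the sample corpus, and the function names `split_sentences` and `segment` are all hypothetical.

```python
import re

# Hypothetical professional dictionary; a real system would merge such
# entries into the segmenter's built-in dictionary.
DOMAIN_DICT = {"红军", "装甲旅", "发起", "进攻", "蓝军", "阵地"}

def split_sentences(text):
    """A1: cut the text after each Chinese sentence-ending punctuation mark."""
    return re.findall(r"[^。！？；]+[。！？；]?", text)

def segment(sentence, dictionary, max_len=4):
    """A3: forward maximum matching, longest dictionary word first."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

corpus = "红军装甲旅发起进攻。蓝军阵地"
sentences = split_sentences(corpus)
word_sequences = [segment(s, DOMAIN_DICT) for s in sentences]
```

Without the dictionary entry for "装甲旅", the segmenter would fall back to single characters, which is the kind of error the professional dictionary is meant to prevent.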
B. Corpus labeling, which defines the structured semantics of 9 types of events in military scenario text, formulates a corpus labeling method and rules, labels each word in the word sequence data set line by line with the corresponding event trigger word label or event element label to generate an event trigger word labeling sequence and event element labeling sequences, and constructs an event trigger word extraction training set, event element extraction training sets, and an event information extraction test set. It specifically comprises the following steps:
B1, event structured semantic definition, which defines 9 types of military scenario text events according to military operation concepts and determines the structured semantics of each type, including its event trigger words and event element information;
B2, labeling method and rule formulation, which formulates the corpus labeling method and rules and defines the trigger word labels and element labels for each of the 9 event types;
B3, manual corpus labeling, which manually labels the word sequences line by line to generate 1 event trigger word labeling sequence and 9 event element labeling sequences;
B4, data set construction, which constructs 1 event trigger word extraction training set from the word sequence and the event trigger word labeling sequence, 9 event element extraction training sets from the word sequence and the 9 event element labeling sequences, and an event information extraction test set from the word sequence, the event trigger word labeling sequence, and the event element labeling sequences;
the 9 types of military scenario text events are attack events, defense events, command events, deployment events, maneuvering events, blocking events, coordination events, reconnaissance events, and guarantee events.
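The data set construction in step B4 might look like the sketch below. The label names (`T-attack`, `SUBJECT`, `OBJECT`, `O`) are illustrative assumptions, since the text does not list the actual tag set, and routing each sentence to the element training set of its own event type is likewise an assumption about how the 9 sets are populated.

```python
EVENT_TYPES = ["attack", "defense", "command", "deployment", "maneuver",
               "blocking", "coordination", "reconnaissance", "guarantee"]

# One manually labeled sentence as (word, trigger label, element label).
# All label names here are hypothetical.
annotated = [
    [("红军", "O", "SUBJECT"), ("发起", "O", "O"),
     ("进攻", "T-attack", "O"), ("蓝军", "O", "OBJECT")],
]

def sentence_event_type(sentence):
    """Return the event type named by the first trigger label, if any."""
    for _, trigger_label, _ in sentence:
        if trigger_label.startswith("T-"):
            return trigger_label[2:]
    return None

def build_training_sets(sentences):
    """Build 1 trigger word training set and 9 element training sets."""
    trigger_set = []
    element_sets = {event_type: [] for event_type in EVENT_TYPES}
    for sentence in sentences:
        words = [w for w, _, _ in sentence]
        trigger_set.append((words, [t for _, t, _ in sentence]))
        event_type = sentence_event_type(sentence)
        if event_type in element_sets:
            element_sets[event_type].append((words, [e for _, _, e in sentence]))
    return trigger_set, element_sets

trigger_set, element_sets = build_training_sets(annotated)
```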
C. Model training, which converts the event trigger word extraction training set and the 9 event element extraction training sets into digital signals and encodes them with a machine learning model to generate the event trigger word extraction parameters and 9 sets of event element extraction parameters. It specifically comprises the following steps:
C1, signal conversion, which converts the event trigger word extraction training set and the event element extraction training sets into digital signals;
C2, parameter generation, which inputs the digital signals into the machine learning model for encoding to obtain the event trigger word identification parameters and 9 sets of event element identification parameters. The two kinds of parameters are generated in essentially the same way; taking the event trigger word identification parameters as an example, the process is as follows:
C2.1, generating the number of states and the number of observations: let the trigger word label sequence in the event trigger word extraction training set be $Q = \{q_1, q_2, \ldots, q_T\}$, where $q_t$ denotes the trigger word label at time $t$ ($1 \le t \le T$); take it as the state sequence, count the distinct trigger word labels, and take their number $N$ as the number of states. Let the word sequence be $Y = \{y_1, y_2, \ldots, y_T\}$, where $y_t$ denotes the word at time $t$ ($1 \le t \le T$); take it as the observation sequence, count the distinct words, and take their number $M$ as the number of observations;
C2.2, generating the initial state probability distribution $\pi$, wherein the generation formula is as follows:
$$\pi_i = \frac{S(q_i)}{\sum_{j \in [1,N]} S(q_j)}$$
in the formula, $\pi_i$ denotes the probability that the initial state is $q_i$, $S(q_i)$ denotes the number of sequences in the event trigger word extraction training set whose initial trigger word label is $q_i$, and $\sum_{j \in [1,N]} S(q_j)$ denotes the total number of sequences;
C2.3, generating the state transition probability matrix $A$, wherein the generation formula is as follows:
$$a_{ij} = \frac{S(q_i, q_j)}{\sum_{l \in [1,N]} S(q_i, q_l)}$$
in the formula, $S(q_i, q_j)$ denotes the number of times the event trigger word label at time $t$ is $q_j$ and the label at time $t-1$ is $q_i$, and $\sum_{l \in [1,N]} S(q_i, q_l)$ denotes the total number of times the label at time $t-1$ is $q_i$ while the label at time $t$ is any state;
C2.4, generating the observation probability matrix $B$, wherein the generation formula is as follows:
$$b_{ik} = \frac{S(q_i, y_k)}{\sum_{l \in [1,M]} S(q_i, y_l)}$$
in the formula, $S(q_i, y_k)$ denotes the number of times the event trigger word label at time $t$ is $q_i$ and the word is $y_k$, and $\sum_{l \in [1,M]} S(q_i, y_l)$ denotes the total number of times the label at time $t$ is $q_i$ and the word $y_l$ is of any type;
C2.5, generating the parameters: the event trigger word identification parameter is $\lambda = (N, M, A, B, \pi)$, where $A = \{a_{ij}\}$, $B = \{b_{ik}\}$, and $\pi = \{\pi_i\}$;
The machine learning model includes, but is not limited to, a hidden Markov model (HMM), conditional random field (CRF), maximum-entropy Markov model (MEMM), or naive Bayes (NB) model.
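For the HMM case, the counting described in steps C2.1 to C2.5 can be sketched as below. The two training sequences and their labels are made up for illustration, the function name `estimate_hmm` is hypothetical, and no smoothing is applied, so a state that never occurs as a predecessor simply receives an all-zero transition row.

```python
from collections import Counter

def estimate_hmm(sequences):
    """Estimate (states, vocab, pi, A, B) by relative frequency counts.

    sequences: list of (words, labels) pairs of equal length.
    """
    init, trans, emit = Counter(), Counter(), Counter()
    for words, labels in sequences:
        init[labels[0]] += 1                      # S(q_i): initial labels
        for prev, cur in zip(labels, labels[1:]):
            trans[(prev, cur)] += 1               # S(q_i, q_j)
        for label, word in zip(labels, words):
            emit[(label, word)] += 1              # S(q_i, y_k)
    states = sorted({l for _, ls in sequences for l in ls})   # N states
    vocab = sorted({w for ws, _ in sequences for w in ws})    # M observations
    total = sum(init.values())
    pi = {s: init[s] / total for s in states}
    A, B = {}, {}
    for s in states:
        out = sum(trans[(s, t)] for t in states)
        seen = sum(emit[(s, w)] for w in vocab)
        A[s] = {t: (trans[(s, t)] / out if out else 0.0) for t in states}
        B[s] = {w: (emit[(s, w)] / seen if seen else 0.0) for w in vocab}
    return states, vocab, pi, A, B

# Hypothetical labeled sequences for the trigger word model.
train = [
    (["红军", "进攻", "蓝军"], ["O", "T-attack", "O"]),
    (["蓝军", "防御"], ["O", "T-defense"]),
]
states, vocab, pi, A, B = estimate_hmm(train)
```

The same function would be run once on the trigger word training set and once on each of the 9 element training sets, giving 10 parameter sets in total.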
D. Information extraction, which decodes the event information extraction test set with a decoding algorithm on the basis of the event trigger word extraction parameters and the 9 sets of event element extraction parameters to obtain the event trigger word extraction sequence and the event element extraction sequence, and merges the two sequences to complete event information extraction. It specifically comprises the following steps:
D1, first decoding, which takes the event trigger word identification parameter $\lambda = (N, M, A, B, \pi)$ generated in step C2 and the observation sequence $Y = \{y_1, y_2, \ldots, y_T\}$ of the event information extraction test set as the input of the decoding algorithm, performs the first decoding, computes for the observation sequence $Y$ and any possible state sequence $Q = \{q_1, q_2, \ldots, q_T\}$ the mapping probability $P(y_1 \to q_1, y_2 \to q_2, \ldots, y_T \to q_T \mid \lambda)$, and outputs the state sequence $Q_{max}$ whose mapping probability is the maximum $P_{max}$, which is the event trigger word extraction sequence;
D2, event type determination, which scans the event trigger word extraction sequence line by line and judges the event trigger word label type of each sentence to obtain the event type of each sentence;
D3, second decoding, which calls the 1 set of event element identification parameters corresponding to the event type of each sentence and decodes the sentence a second time with the decoding algorithm in the manner of step D1 to generate the maximum-probability event element extraction sequence for the sentence;
D4, scanning the event information extraction test set sentence by sentence and repeating step D3 until every sentence in the test set has been scanned, generating the event element extraction sequence corresponding to the test set;
D5, extraction result merging, which merges the event trigger word extraction sequence and the event element extraction sequence corresponding to the event information extraction test set into an event information extraction sequence, completing the event information extraction process;
the decoding algorithm includes, but is not limited to, the Viterbi, Dijkstra, and Forward-Backward algorithms.
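The two decoding passes of steps D1 to D3 can be illustrated with a small Viterbi decoder. This is a sketch under stated assumptions: the toy parameters and label names are invented, each model is passed as a tuple `(states, pi, A, B)`, and the `floor` value used for unseen words and transitions is an assumed smoothing choice that the text does not specify.

```python
def viterbi(words, states, pi, A, B, floor=1e-6):
    """Return the most probable state sequence for the observed words."""
    def emit(state, word):
        return max(B[state].get(word, 0.0), floor)
    score = {s: max(pi[s], floor) * emit(s, words[0]) for s in states}
    back = []
    for word in words[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] * max(A[p][s], floor))
            score[s] = prev[best] * max(A[best][s], floor) * emit(s, word)
            pointers[s] = best
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):          # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return path[::-1]

def two_pass_decode(words, trigger_model, element_models):
    """D1: decode triggers; D2: pick event type; D3: decode elements."""
    trigger_tags = viterbi(words, *trigger_model)
    event_type = next((t[2:] for t in trigger_tags if t.startswith("T-")), None)
    if event_type not in element_models:
        return trigger_tags, ["O"] * len(words)
    return trigger_tags, viterbi(words, *element_models[event_type])

# Toy parameters (hypothetical values, not learned from real data).
trigger_model = (
    ["O", "T-attack"],
    {"O": 0.9, "T-attack": 0.1},
    {"O": {"O": 0.5, "T-attack": 0.5}, "T-attack": {"O": 0.9, "T-attack": 0.1}},
    {"O": {"红军": 0.5, "蓝军": 0.5}, "T-attack": {"进攻": 1.0}},
)
row = {"O": 0.8, "SUBJECT": 0.1, "OBJECT": 0.1}
element_models = {"attack": (
    ["O", "SUBJECT", "OBJECT"],
    {"O": 0.1, "SUBJECT": 0.8, "OBJECT": 0.1},
    {"O": {"O": 0.2, "SUBJECT": 0.1, "OBJECT": 0.7}, "SUBJECT": row, "OBJECT": row},
    {"O": {"进攻": 1.0},
     "SUBJECT": {"红军": 0.5, "蓝军": 0.5},
     "OBJECT": {"红军": 0.5, "蓝军": 0.5}},
)}

trigger_tags, element_tags = two_pass_decode(
    ["红军", "进攻", "蓝军"], trigger_model, element_models)
```

Because the second pass only ever uses the element model selected by the first pass, the element labels cannot come from a different event type than the trigger word, which is the mismatch the method is meant to avoid.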
The military scenario text event information extraction method based on secondary decoding adopted by the invention has the following advantages:
1. good portability and flexibility: patterns do not need to be rebuilt when moving across domains;
2. a large labeled corpus is not required: the demands on the corpus are low, which saves considerable time and labor;
3. event trigger words and event elements are well matched, which reduces the mismatch rate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of an embodiment of the military scenario text event information extraction method based on secondary decoding according to the invention;
FIG. 2 is a block diagram of the structure of the military scenario text event information extraction device of the invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, which shows a schematic flow chart of an embodiment of the military scenario text event information extraction method based on secondary decoding of the present invention, the method specifically includes the following steps:
A: preprocessing: inputting the military scenario text corpus, constructing a professional dictionary on the basis of the built-in dictionary of an open source word segmenter such as jieba or Hanlp, and segmenting the corpus into sentences and words in sequence to generate a data set expressed as word sequences;
A1: dividing the military scenario text corpus into sentences at Chinese sentence-ending punctuation marks to form a sentence set;
A2: constructing a professional dictionary on the basis of the built-in dictionary of a word segmenter such as jieba or Hanlp; the professional dictionary has the same format as the built-in dictionary and is listed line by line, each line containing a word and its word frequency separated by a space; performing word segmentation with the professional dictionary improves the accuracy of military scenario text word segmentation;
A3: segmenting each sentence in the sentence set with an open source word segmenter such as jieba or Hanlp using the professional dictionary to obtain a word set, and listing the words line by line to generate a word sequence;
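The dictionary format described in step A2 (one entry per line, word and frequency separated by a space) can be read and merged as in the sketch below. The entries and the function name `parse_dictionary` are hypothetical, and treating a missing frequency as 1 and summing the frequencies of words that appear in both dictionaries are assumptions, not rules stated in the text.

```python
def parse_dictionary(lines):
    """Parse 'word frequency' lines into a {word: frequency} mapping."""
    entries = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        word = fields[0]
        has_freq = len(fields) > 1 and fields[1].isdigit()
        freq = int(fields[1]) if has_freq else 1   # assumed default
        entries[word] = entries.get(word, 0) + freq
    return entries

# Hypothetical built-in entries and domain-specific additions.
base = parse_dictionary(["进攻 300", "阵地 120"])
domain = parse_dictionary(["装甲旅 50", "进攻 20", "", "合成营"])

# Merge the domain entries into a copy of the built-in dictionary.
professional = dict(base)
for word, freq in domain.items():
    professional[word] = professional.get(word, 0) + freq
```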
B: corpus labeling: defining the structured semantics of 9 types of events in military scenario text, formulating a corpus labeling method and rules, labeling each word in the word sequence data set line by line with the corresponding event trigger word label or event element label to generate an event trigger word labeling sequence and event element labeling sequences, and constructing an event trigger word extraction training set, event element extraction training sets, and an event information extraction test set;
B1: defining 9 types of military scenario text events on the basis of military operation concepts, namely attack, defense, command, deployment, maneuvering, blocking, coordination, reconnaissance, and guarantee events, and determining the structured semantics of each type, including its event trigger words and event element information;
B2: formulating the corpus labeling method and rules, and defining the trigger word labels and element labels for each of the 9 event types;
B3: manually labeling the word sequences line by line with the corresponding labels to generate 1 event trigger word labeling sequence and 9 event element labeling sequences;
B4: constructing 1 event trigger word extraction training set from the word sequence and the event trigger word labeling sequence, 9 event element extraction training sets from the word sequence and the 9 event element labeling sequences, and an event information extraction test set from the word sequence, the event trigger word labeling sequence, and the event element labeling sequences;
C: model training: encoding the event trigger word extraction training set and the 9 event element extraction training sets with a machine learning model such as an HMM (hidden Markov model) or CRF (conditional random field) to generate the event trigger word extraction parameters $\lambda_1$ and the 9 sets of event element extraction parameters $\lambda_2$;
C1: converting the event trigger word extraction training set and the event element extraction training sets into digital signals;
C2: parameter generation: inputting the digital signals into the machine learning model for encoding to obtain the event trigger word identification parameters and 9 sets of event element identification parameters. The two kinds of parameters are generated in essentially the same way; taking the event trigger word identification parameters as an example, the process is as follows:
C2.1, generating the number of states and the number of observations: let the trigger word label sequence in the event trigger word extraction training set be $Q = \{q_1, q_2, \ldots, q_T\}$, where $q_t$ denotes the trigger word label at time $t$ ($1 \le t \le T$); take it as the state sequence, count the distinct trigger word labels, and take their number $N$ as the number of states. Let the word sequence be $Y = \{y_1, y_2, \ldots, y_T\}$, where $y_t$ denotes the word at time $t$ ($1 \le t \le T$); take it as the observation sequence, count the distinct words, and take their number $M$ as the number of observations;
C2.2, generating the initial state probability distribution $\pi$, wherein the generation formula is as follows:
$$\pi_i = \frac{S(q_i)}{\sum_{j \in [1,N]} S(q_j)}$$
in the formula, $\pi_i$ denotes the probability that the initial state is $q_i$, $S(q_i)$ denotes the number of sequences in the event trigger word extraction training set whose initial trigger word label is $q_i$, and $\sum_{j \in [1,N]} S(q_j)$ denotes the total number of sequences;
C2.3, generating the state transition probability matrix $A$, wherein the generation formula is as follows:
$$a_{ij} = \frac{S(q_i, q_j)}{\sum_{l \in [1,N]} S(q_i, q_l)}$$
in the formula, $S(q_i, q_j)$ denotes the number of times the event trigger word label at time $t$ is $q_j$ and the label at time $t-1$ is $q_i$, and $\sum_{l \in [1,N]} S(q_i, q_l)$ denotes the total number of times the label at time $t-1$ is $q_i$ while the label at time $t$ is any state;
C2.4, generating the observation probability matrix $B$, wherein the generation formula is as follows:
$$b_{ik} = \frac{S(q_i, y_k)}{\sum_{l \in [1,M]} S(q_i, y_l)}$$
in the formula, $S(q_i, y_k)$ denotes the number of times the event trigger word label at time $t$ is $q_i$ and the word is $y_k$, and $\sum_{l \in [1,M]} S(q_i, y_l)$ denotes the total number of times the label at time $t$ is $q_i$ and the word $y_l$ is of any type;
C2.5, generating the parameters: the event trigger word identification parameter is $\lambda = (N, M, A, B, \pi)$, where $A = \{a_{ij}\}$, $B = \{b_{ik}\}$, and $\pi = \{\pi_i\}$;
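The text does not say what form the "digital signals" of step C1 take. One plausible reading, shown here purely as an assumption, is that words and labels are mapped to integer indices so that $A$ and $B$ can be stored as $N \times N$ and $N \times M$ arrays. The helper `build_index` and the sample data are hypothetical.

```python
def build_index(items):
    """Assign consecutive integer ids in order of first appearance."""
    index = {}
    for item in items:
        index.setdefault(item, len(index))
    return index

# Hypothetical word sequence and trigger word labels.
words = ["红军", "进攻", "蓝军", "进攻"]
labels = ["O", "T-attack", "O", "T-attack"]

word_to_id = build_index(words)
label_to_id = build_index(labels)

encoded_words = [word_to_id[w] for w in words]      # observation sequence Y
encoded_labels = [label_to_id[l] for l in labels]   # state sequence Q
M, N = len(word_to_id), len(label_to_id)            # observation and state counts
```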
D: information extraction: on the basis of the event trigger word extraction parameters $\lambda_1$ and the 9 sets of event element extraction parameters $\lambda_2$, decoding the event information extraction test set with the Viterbi algorithm or a similar algorithm to obtain the event trigger word extraction sequence and the event element extraction sequence, and finally merging the two sequences to complete the event information extraction process;
D1: on the basis of the event trigger word identification parameters $\lambda_1$, performing the first decoding on the event information extraction test set with an algorithm such as Viterbi or Dijkstra to obtain the event trigger word extraction sequence corresponding to the test set, completing the event trigger word extraction process;
D2: on the basis of the event trigger word extraction sequence, scanning line by line and judging the event trigger word label type of each sentence to obtain the event type of the sentence;
D3: after the event type of a sentence has been determined, calling the 1 set of event element identification parameters $\lambda_2$ corresponding to that event type and performing the second decoding on the sentence with an algorithm such as Viterbi or Dijkstra to generate the event element extraction sequence corresponding to the sentence;
D4: scanning the event information extraction test set sentence by sentence and repeating step D3 until every sentence in the test set has been scanned, generating the event element extraction sequence corresponding to the test set and completing the event element extraction process;
D5: merging the event trigger word extraction sequence and the event element extraction sequence corresponding to the event information extraction test set into an event information extraction sequence, completing the event information extraction process.
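The merging in step D5 can be pictured as combining the two label sequences for a sentence into one structured record. The record layout, the label names, and the function name `merge_extraction` below are illustrative assumptions; the text only states that the two sequences are merged into an event information extraction sequence.

```python
def merge_extraction(words, trigger_tags, element_tags):
    """Combine trigger and element labels into one event record."""
    record = {"event_type": None, "trigger": None, "elements": {}}
    for word, trigger_tag, element_tag in zip(words, trigger_tags, element_tags):
        # The first trigger label found determines the sentence's event type.
        if trigger_tag.startswith("T-") and record["trigger"] is None:
            record["event_type"] = trigger_tag[2:]
            record["trigger"] = word
        if element_tag != "O":
            record["elements"].setdefault(element_tag, []).append(word)
    return record

# Hypothetical decoding output for one sentence.
event = merge_extraction(
    ["红军", "进攻", "蓝军"],
    ["O", "T-attack", "O"],
    ["SUBJECT", "O", "OBJECT"],
)
```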
Referring to fig. 2, a structural diagram of the military scenario text event information extraction device based on secondary decoding according to the present invention is shown, and specifically includes the following structural components:
the preprocessing module 100 is configured to perform text preprocessing on an input military scenario text corpus, construct a professional dictionary based on the input military scenario text corpus by relying on a dictionary provided with an open-source word segmenter, perform sentence segmentation and word segmentation on the military scenario text corpus in sequence, and generate a data set expressed in a word sequence form, which specifically includes:
a Chinese sentence dividing unit 101, configured to divide a military scenario text corpus into sentences according to Chinese punctuation sentence-breaking symbols, so as to form a sentence set;
the professional dictionary building unit 102 builds a professional dictionary based on the own dictionary of the open source word segmentation devices such as jieba and Hanlp;
the Chinese word segmentation unit 103 is configured to segment each sentence in the sentence set by using an open source word segmentation device such as jieba based on a professional dictionary to obtain a word set, and display the word set line by line to generate a word sequence.
The corpus tagging module 200 is configured to define structured semantics of 9 types of events in a military scenario text, formulate a corpus tagging method and rules, tag each word in a word sequence data set with a corresponding event trigger word tag or event element tag line by line, generate an event trigger word tagging sequence and an event element tagging sequence, and construct an event trigger word extraction training set, an event element extraction training set, and an event information extraction test set, which specifically include:
the event structured semantic definition unit 201 is used for defining 9 types of military scenario text event types including an attack event, a defense event, a command event, a deployment event, a maneuvering event, a blocking event, a cooperative event, a reconnaissance event and a guarantee event according to a military operation concept, and determining the structured semantics of each type of event, including event trigger words and event element information;
a labeling rule method establishing unit 202, configured to establish a corpus labeling method and a rule, and respectively define a trigger word labeling tag and an element labeling tag of a 9-class event;
the corpus manual labeling unit 203 labels the word sequences line by adopting a manual labeling mode to generate 1 event trigger word labeling sequence and 9 event element labeling sequences;
the data set constructing unit 204 is configured to construct 1 event trigger word extraction training set based on the word sequence and the event trigger word tagging sequence, 9 event element extraction training sets based on the word sequence and the 9 event element tagging sequences, and an event information extraction test set based on the word sequence, the event trigger word tagging sequence, and the event element tagging sequence.
The model training module 300 is configured to convert the event-triggered word extraction training set and the event element extraction training set into digital signals, encode the digital signals by using a machine learning model based on the event-triggered word extraction training set and the 9 event element extraction training sets, and generate event-triggered word extraction parameters and 9 event element extraction parameters, which specifically include:
a signal conversion unit 301, configured to convert the event trigger word extraction training set and the event element extraction training set into digital signals;
the parameter generating unit 302 is configured to input the digital signals into a machine learning model such as an HMM and encode them to obtain the event trigger word identification parameters and 9 sets of event element identification parameters.
The information extraction module 400 is configured to decode the event information extraction test set with a decoding algorithm on the basis of the event trigger word extraction parameters and the 9 sets of event element extraction parameters to obtain an event trigger word extraction sequence and an event element extraction sequence, and to merge the two sequences to complete the event information extraction, and specifically includes:
a first decoding unit 401, configured to perform the first decoding on the event information extraction test set with a decoding algorithm on the basis of the event trigger word identification parameters, to obtain the event trigger word extraction sequence corresponding to the event information extraction test set;
an event type determining unit 402, which scans line by line and judges the event trigger word tag type of each sentence based on the event trigger word extraction sequence to obtain the event type of each sentence;
a second decoding unit 403, configured to call 1 event element identification parameter corresponding to the event type of each sentence, perform second decoding on the sentence by using a Viterbi decoding algorithm, generate an event element extraction sequence corresponding to the sentence, scan the event information extraction test set sentence by sentence until all sentences in the event information extraction test set are completely scanned, and generate an event element extraction sequence corresponding to the event information extraction test set;
and an extraction result merging unit 404, configured to merge the event trigger word extraction sequence and the event element extraction sequence corresponding to the event information extraction test set to form an event information extraction sequence, and complete the event information extraction process.

Claims (6)

1. A military scenario text event information extraction method based on secondary decoding is characterized by comprising the following steps:
A. preprocessing: performing text preprocessing on the input military scenario text corpus, constructing a professional dictionary from the input military scenario text corpus on the basis of the built-in dictionary of an open source word segmenter, and segmenting the military scenario text corpus into sentences and words in sequence to generate a data set expressed as word sequences, which specifically comprises:
a1, Chinese clause: dividing the military scenario text corpus into sentences at Chinese sentence-ending punctuation marks to form a sentence set;
a2, constructing a professional dictionary: constructing a professional dictionary based on a self-contained dictionary of the open source word segmentation device;
a3, Chinese word segmentation: dividing each sentence in the sentence set into words by using an open source word divider based on a professional dictionary to obtain a word set, and displaying the word set line by line to generate a word sequence;
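The preprocessing step can be illustrated with a minimal sketch. To stay self-contained it does not call a real open source word segmenter; instead it uses forward maximum matching against a small dictionary as a stand-in. The punctuation set, the dictionary entries and the function names are all assumptions made for the example.

```python
import re
from typing import List, Set

SENTENCE_END = "。！？；"  # assumed set of sentence-ending Chinese punctuation marks

def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation mark."""
    pattern = f"[^{SENTENCE_END}]+[{SENTENCE_END}]?"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]

def segment(sentence: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:  # fall back to a single character
                words.append(piece)
                i += size
                break
    return words

# Stand-in for the segmenter's built-in dictionary, extended with domain terms.
builtin_dictionary = {"进攻", "防御", "高地"}
professional_dictionary = builtin_dictionary | {"红军", "蓝军"}

sentences = split_sentences("红军进攻高地。蓝军防御。")
word_sequences = [segment(s, professional_dictionary) for s in sentences]
```

With a real segmenter, the professional dictionary would typically be loaded as a user dictionary rather than matched by hand.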
B. Corpus labeling: defining the structured semantics of 9 types of events in military scenario text, formulating a corpus tagging method and rules, tagging each word in the word sequence data set line by line with the corresponding event trigger word label or event element label to generate an event trigger word tagging sequence and event element tagging sequences, and constructing an event trigger word extraction training set, event element extraction training sets and an event information extraction test set, specifically comprising:
B1, event structured semantic definition: defining 9 types of military scenario text events according to the concepts of military operations, and determining the structured semantics of each type of event, including the event trigger words and the event element information;
B2, formulating the tagging method and rules: formulating a corpus tagging method and rules, and defining the trigger word labels and the element labels of the 9 types of events respectively;
B3, manual corpus labeling: labeling the word sequences line by line with the corresponding labels by manual annotation to generate 1 event trigger word tagging sequence and 9 event element tagging sequences;
B4, data set construction: constructing 1 event trigger word extraction training set based on the word sequence and the event trigger word tagging sequence, 9 event element extraction training sets based on the word sequence and the 9 event element tagging sequences, and an event information extraction test set based on the word sequence, the event trigger word tagging sequence and the event element tagging sequences;
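The data set construction step can be pictured with the following sketch, which builds one trigger word training set and one element training set per event type from manually annotated sentences. The record layout, the tag names and the assumption that each sentence belongs to exactly one event type are illustrative choices, not details given in the source.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (word, tag) pairs for one sentence

def build_training_sets(annotated: List[dict]):
    """Split annotated sentences into 1 trigger set and per-event-type element sets."""
    trigger_set: List[TaggedSentence] = []
    element_sets: Dict[str, List[TaggedSentence]] = defaultdict(list)
    for sent in annotated:
        words = sent["words"]
        if not (len(words) == len(sent["trigger_tags"]) == len(sent["element_tags"])):
            raise ValueError("every word needs exactly one trigger tag and one element tag")
        trigger_set.append(list(zip(words, sent["trigger_tags"])))
        element_sets[sent["event_type"]].append(list(zip(words, sent["element_tags"])))
    return trigger_set, dict(element_sets)

# Hypothetical annotated corpus with two of the nine event types.
annotated = [
    {"words": ["红军", "进攻", "高地"], "event_type": "ATTACK",
     "trigger_tags": ["O", "T-ATTACK", "O"],
     "element_tags": ["E-ATTACKER", "O", "E-TARGET"]},
    {"words": ["蓝军", "防御", "阵地"], "event_type": "DEFENSE",
     "trigger_tags": ["O", "T-DEFENSE", "O"],
     "element_tags": ["E-DEFENDER", "O", "E-POSITION"]},
]
trigger_set, element_sets = build_training_sets(annotated)
```

With a full corpus covering all nine event types, `element_sets` would hold nine entries, one per type.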
C. Model training: converting the event trigger word extraction training set and the event element extraction training sets into digital signals, and encoding the digital signals with a machine learning model, based on the event trigger word extraction training set and the 9 event element extraction training sets, to generate event trigger word extraction parameters and 9 event element extraction parameters, specifically comprising:
C1, signal conversion: converting the event trigger word extraction training set and the event element extraction training sets into digital signals;
C2, parameter generation: inputting the digital signals into the machine learning model for encoding to obtain an event trigger word identification parameter and 9 event element identification parameters, wherein the event trigger word identification parameter and the event element identification parameters are generated by the same process, and the generation process of the event trigger word identification parameter is as follows:
C2.1, number of states and number of observations: let the trigger word label sequence in the event trigger word extraction training set be $Q = \{q_1, q_2, \ldots, q_T\}$, where $q_t$ denotes the trigger word label at time $t$ ($1 \le t \le T$); taking it as the state sequence, count the trigger word label types and take the number of types $N$ as the number of states; let the word sequence be $Y = \{y_1, y_2, \ldots, y_T\}$, where $y_t$ denotes the word at time $t$ ($1 \le t \le T$); taking it as the observation sequence, count the word types and take the number of types $M$ as the number of observations;
C2.2, generating the initial state probability distribution $\pi$: the generation formula is $\pi_i = \dfrac{S(q_i)}{\sum_{j \in [1,N]} S(q_j)}$,
where $\pi_i$ denotes the probability that the initial state is $q_i$, $S(q_i)$ denotes the number of sequences in the event trigger word extraction training set whose trigger word label in the initial state is $q_i$, and $\sum_{j \in [1,N]} S(q_j)$ denotes the total number of sequences;
C2.3, generating the state transition probability matrix $A$: the generation formula is $a_{ij} = \dfrac{S(q_i, q_j)}{\sum_{I \in [1,N]} S(q_i, q_I)}$,
where $S(q_i, q_j)$ denotes the number of times the event trigger word label at time $t$ is $q_j$ while the event trigger word label at time $t-1$ is $q_i$, and $\sum_{I \in [1,N]} S(q_i, q_I)$ denotes the total number of times the event trigger word label at time $t-1$ is $q_i$ while the label at time $t$ is in any state;
C2.4, generating the observation probability matrix $B$: the generation formula is $b_{ik} = \dfrac{S(q_i, y_k)}{\sum_{I \in [1,M]} S(q_i, y_I)}$,
where $S(q_i, y_k)$ denotes the number of times the event trigger word label at time $t$ is $q_i$ and the word is $y_k$, and $\sum_{I \in [1,M]} S(q_i, y_I)$ denotes the total number of times the event trigger word label at time $t$ is $q_i$ and the word $y_I$ is of any type;
C2.5, parameter generation: the event trigger word identification parameter is $\lambda = (N, M, A, B, \pi)$, where $A = \{a_{ij}\}$, $B = \{b_{ik}\}$, $\pi = \{\pi_i\}$;
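The count-based estimates described in step C2 can be sketched for the HMM case as follows. The function name, the dictionary-based representation of A, B and π, and the toy training data are assumptions for illustration; no smoothing is applied, and a state that never has a successor simply receives an all-zero transition row.

```python
from collections import Counter
from typing import List, Tuple

def estimate_hmm(sequences: List[List[Tuple[str, str]]]):
    """Estimate (states, vocab, A, B, pi) by relative frequency from (word, label) pairs."""
    init, trans, emit = Counter(), Counter(), Counter()
    for seq in sequences:
        if not seq:
            continue
        init[seq[0][1]] += 1                      # S(q_i): label of the first word
        for word, label in seq:
            emit[(label, word)] += 1              # S(q_i, y_k)
        for (_, prev), (_, nxt) in zip(seq, seq[1:]):
            trans[(prev, nxt)] += 1               # S(q_i, q_j)
    states = sorted({label for seq in sequences for _, label in seq})
    vocab = sorted({word for seq in sequences for word, _ in seq})
    total = sum(init.values())
    pi = {q: init[q] / total for q in states}
    A, B = {}, {}
    for qi in states:
        row = sum(trans[(qi, qj)] for qj in states)
        A[qi] = {qj: (trans[(qi, qj)] / row if row else 0.0) for qj in states}
        row = sum(emit[(qi, w)] for w in vocab)
        B[qi] = {w: emit[(qi, w)] / row for w in vocab}
    return states, vocab, A, B, pi

# Toy trigger word training set with hypothetical labels.
training = [
    [("红军", "O"), ("进攻", "T-ATTACK"), ("高地", "O")],
    [("蓝军", "O"), ("防御", "T-DEFENSE")],
]
states, vocab, A, B, pi = estimate_hmm(training)
```

Here `len(states)` plays the role of N and `len(vocab)` the role of M. The same function could be applied unchanged to each of the nine element training sets.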
D. Information extraction: based on the event trigger word extraction parameters and the 9 event element extraction parameters, decoding the event information extraction test set by using a decoding algorithm to obtain an event trigger word extraction sequence and an event element extraction sequence, and combining the event trigger word extraction sequence and the event element extraction sequence to complete the event information extraction, specifically comprising:
D1, first decoding: taking the event trigger word identification parameter $\lambda = (N, M, A, B, \pi)$ generated in step C2 and the observation sequence $Y = \{y_1, y_2, \ldots, y_T\}$ of the event information extraction test set as the inputs of the decoding algorithm model, and performing the first decoding with the decoding algorithm: calculating the mapping probability $P(y_1 \to q_1, y_2 \to q_2, \ldots, y_T \to q_T \mid \lambda)$ between the observation sequence $Y$ and an arbitrary candidate state sequence $Q = (q_1, q_2, \ldots, q_T)$ formed by the search, and outputting the state sequence $Q_{\max}$ at which the mapping probability reaches its maximum $P_{\max}$, which is the event trigger word extraction sequence;
D2, event type determination: based on the event trigger word extraction sequence, scanning line by line and determining the event trigger word label type of each sentence to obtain the event type of each sentence;
D3, second decoding: calling the one event element identification parameter corresponding to the event type of each sentence, and performing second decoding on the sentence with the decoding algorithm in the manner of step D1 to generate the maximum-probability event element extraction sequence corresponding to the sentence;
D4, scanning the event information extraction test set sentence by sentence: repeating step D3 until all sentences of the event information extraction test set have been scanned, and generating the event element extraction sequence corresponding to the event information extraction test set;
D5, merging extraction results: combining the event trigger word extraction sequence and the event element extraction sequence corresponding to the event information extraction test set to form an event information extraction sequence, completing the event information extraction process.
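The two decoding passes of step D can be sketched with a small Viterbi decoder. All parameter values below are hand-written toy numbers, the tag names are hypothetical, and the probability floor for unseen words and transitions is an added assumption (the source does not describe smoothing).

```python
from typing import Dict, List, Optional, Tuple

FLOOR = 1e-6  # assumed floor for unseen emissions and transitions

def viterbi(obs: List[str], p: Dict[str, dict]) -> List[str]:
    """Return the most probable state sequence for obs under p = {"pi", "A", "B"}."""
    if not obs:
        return []
    states = list(p["pi"])
    emit = lambda q, w: p["B"][q].get(w, FLOOR)
    move = lambda r, q: max(p["A"][r].get(q, 0.0), FLOOR)
    delta = {q: max(p["pi"][q], FLOOR) * emit(q, obs[0]) for q in states}
    back = []
    for w in obs[1:]:
        ptr = {q: max(states, key=lambda r: delta[r] * move(r, q)) for q in states}
        delta = {q: delta[ptr[q]] * move(ptr[q], q) * emit(q, w) for q in states}
        back.append(ptr)
    path = [max(states, key=lambda q: delta[q])]
    for ptr in reversed(back):          # follow back-pointers from the best final state
        path.append(ptr[path[-1]])
    return path[::-1]

def two_pass_decode(obs, trigger_params, element_params_by_type) -> Tuple[Optional[str], List[str], List[str]]:
    """First pass finds trigger words; second pass uses the matching element parameters."""
    trigger_tags = viterbi(obs, trigger_params)
    event_type = next((t.split("-", 1)[-1] for t in trigger_tags if t != "O"), None)
    params = element_params_by_type.get(event_type)
    element_tags = viterbi(obs, params) if params else ["O"] * len(obs)
    return event_type, trigger_tags, element_tags

trigger_params = {
    "pi": {"O": 0.9, "T-ATTACK": 0.1},
    "A": {"O": {"O": 0.5, "T-ATTACK": 0.5}, "T-ATTACK": {"O": 0.9, "T-ATTACK": 0.1}},
    "B": {"O": {"红军": 0.5, "高地": 0.5}, "T-ATTACK": {"进攻": 1.0}},
}
attack_params = {
    "pi": {"E-ATTACKER": 0.8, "O": 0.1, "E-TARGET": 0.1},
    "A": {"E-ATTACKER": {"O": 0.8, "E-ATTACKER": 0.1, "E-TARGET": 0.1},
          "O": {"E-TARGET": 0.6, "O": 0.3, "E-ATTACKER": 0.1},
          "E-TARGET": {"O": 0.8, "E-TARGET": 0.1, "E-ATTACKER": 0.1}},
    "B": {"E-ATTACKER": {"红军": 1.0}, "O": {"进攻": 1.0}, "E-TARGET": {"高地": 1.0}},
}
event_type, trigger_tags, element_tags = two_pass_decode(
    ["红军", "进攻", "高地"], trigger_params, {"ATTACK": attack_params})
```

Selecting a single element parameter set per sentence keeps the second pass restricted to the labels of one event type, which is the point of decoding twice. For long sentences, summing log-probabilities instead of multiplying probabilities would avoid numerical underflow.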
2. The secondary decoding-based military scenario text event information extraction method of claim 1, wherein the open source word segmenter includes jieba, Hanlp, CoreNLP, thulac.
3. The secondary decoding-based military scenario text event information extraction method of claim 1, wherein the 9 types of military scenario text events include attack events, defense events, command events, deployment events, maneuver events, lockout events, collaboration events, reconnaissance events, and security events.
4. The secondary decoding-based military scenario text event information extraction method of claim 1, wherein the machine learning model comprises HMM, CRF, MEMM, NB.
5. The secondary decoding-based military scenario text event information extraction method of claim 1, wherein the decoding algorithm comprises Viterbi, Dijkstra, Forward-Backward.
6. A military scenario text event information extraction device based on secondary decoding, the device comprising:
the preprocessing module 100: performing text preprocessing on the input military scenario text corpus: on the basis of the input corpus, constructing a military domain dictionary from the built-in dictionary of an open source word segmenter, and then performing sentence splitting and word segmentation on the corpus in sequence to generate a data set represented in word sequence form, specifically comprising:
the Chinese sentence splitting unit 101: dividing the military scenario text corpus into sentences according to Chinese punctuation marks to form a sentence set;
the professional dictionary constructing unit 102: constructing a professional dictionary based on the built-in dictionary of the open source word segmenter;
the Chinese word segmentation unit 103: dividing each sentence in the sentence set into words by using the open source word segmenter to obtain a word set, and displaying the word set line by line to generate a word sequence;
corpus tagging module 200: defining structured semantics of 9 types of events in a military scenario text, formulating a corpus tagging method and rules, tagging each word in a word sequence data set with a corresponding event trigger word tag or event element tag line by line, generating an event trigger word tagging sequence and an event element tagging sequence, and constructing an event trigger word extraction training set, an event element extraction training set and an event information extraction test set, which specifically comprise:
the event structured semantic definition unit 201: defining 9 classes of military scenario text event types according to the concept of military operation, and determining the structural semantics of each class of events;
the annotation rule method formulation unit 202: formulating a corpus tagging method and a rule, and respectively defining a trigger word tagging label and an element tagging label of a 9-class event;
the corpus manual labeling unit 203: labeling the word sequences line by line with the corresponding labels by manual annotation to generate 1 event trigger word tagging sequence and 9 event element tagging sequences;
the data set construction unit 204: constructing 1 event trigger word extraction training set based on the word sequence and the event trigger word tagging sequence, 9 event element extraction training sets based on the word sequence and the 9 event element tagging sequences, and an event information extraction test set based on the word sequence, the event trigger word tagging sequence and the event element tagging sequences;
the model training module 300: converting the event trigger word extraction training set and the event element extraction training set into digital signals, coding the digital signals by using a machine learning model based on the event trigger word extraction training set and the 9 event element extraction training sets, and generating event trigger word extraction parameters and 9 event element extraction parameters, wherein the method specifically comprises the following steps:
the signal conversion unit 301: converting the event trigger word extraction training set and the event element extraction training set into digital signals;
parameter generation unit 302: inputting the digital signals into a machine learning model for coding to obtain event trigger word identification parameters and 9 event element identification parameters;
the information extraction module 400: based on the event trigger word extraction parameters and the 9 event element extraction parameters, decoding the event information extraction test set by using a decoding algorithm to obtain an event trigger word extraction sequence and an event element extraction sequence, and combining the event trigger word extraction sequence and the event element extraction sequence to complete the event information extraction, specifically comprising:
the first decoding unit 401: based on the event trigger word recognition parameters, performing first decoding on the event information extraction test set by using a decoding algorithm to obtain an event trigger word extraction sequence corresponding to the event information extraction test set;
the event type determining unit 402: based on the event trigger word extraction sequence, scanning line by line and determining the event trigger word label type of each sentence to obtain the event type of each sentence;
the second decoding unit 403: calling the one event element identification parameter corresponding to the event type of each sentence, performing second decoding on the sentence by using a decoding algorithm to generate the event element extraction sequence corresponding to the sentence, and scanning the event information extraction test set sentence by sentence until all of its sentences have been scanned, thereby generating the event element extraction sequence corresponding to the event information extraction test set;
the extraction result merging unit 404: combining the event trigger word extraction sequence and the event element extraction sequence corresponding to the event information extraction test set to form an event information extraction sequence, completing the event information extraction process.
CN201910653282.8A 2019-07-19 2019-07-19 Military scenario text event information extraction method and device based on secondary decoding Active CN110609896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910653282.8A CN110609896B (en) 2019-07-19 2019-07-19 Military scenario text event information extraction method and device based on secondary decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910653282.8A CN110609896B (en) 2019-07-19 2019-07-19 Military scenario text event information extraction method and device based on secondary decoding

Publications (2)

Publication Number Publication Date
CN110609896A true CN110609896A (en) 2019-12-24
CN110609896B CN110609896B (en) 2022-03-22

Family

ID=68889683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910653282.8A Active CN110609896B (en) 2019-07-19 2019-07-19 Military scenario text event information extraction method and device based on secondary decoding

Country Status (1)

Country Link
CN (1) CN110609896B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN111368551A (en) * 2020-02-14 2020-07-03 京东数字科技控股有限公司 Method and device for determining event subject
CN111475617A (en) * 2020-03-30 2020-07-31 招商局金融科技有限公司 Event body extraction method and device and storage medium
CN112612871A (en) * 2020-12-17 2021-04-06 浙江大学 Multi-event detection method based on sequence generation model
CN113051887A (en) * 2019-12-26 2021-06-29 深圳市北科瑞声科技股份有限公司 Method, system and device for extracting announcement information elements
CN113111649A (en) * 2021-04-13 2021-07-13 科大讯飞股份有限公司 Event extraction method, system and equipment
CN113806481A (en) * 2021-09-17 2021-12-17 中国人民解放军国防科技大学 Operation event extraction method oriented to encyclopedic data
CN114707517A (en) * 2022-04-01 2022-07-05 中国人民解放军国防科技大学 Target tracking method based on open source data event extraction

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060085477A1 (en) * 2004-10-01 2006-04-20 Ricoh Company, Ltd. Techniques for retrieving documents using an image capture device
CN102693219A (en) * 2012-06-05 2012-09-26 苏州大学 Method and system for extracting Chinese event
US20130144600A1 (en) * 2009-03-18 2013-06-06 Microsoft Corporation Adaptive pattern learning for bilingual data mining
CN105260361A (en) * 2015-10-28 2016-01-20 南京邮电大学 Trigger word tagging system and method for biomedical events
CN106599032A (en) * 2016-10-27 2017-04-26 浙江大学 Text event extraction method in combination of sparse coding and structural perceptron
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method
CN107229610A (en) * 2017-03-17 2017-10-03 咪咕数字传媒有限公司 The analysis method and device of a kind of affection data
CN109325228A (en) * 2018-09-19 2019-02-12 苏州大学 English event trigger word abstracting method and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060085477A1 (en) * 2004-10-01 2006-04-20 Ricoh Company, Ltd. Techniques for retrieving documents using an image capture device
US20130144600A1 (en) * 2009-03-18 2013-06-06 Microsoft Corporation Adaptive pattern learning for bilingual data mining
CN102693219A (en) * 2012-06-05 2012-09-26 苏州大学 Method and system for extracting Chinese event
CN105260361A (en) * 2015-10-28 2016-01-20 南京邮电大学 Trigger word tagging system and method for biomedical events
CN106599032A (en) * 2016-10-27 2017-04-26 浙江大学 Text event extraction method in combination of sparse coding and structural perceptron
CN107229610A (en) * 2017-03-17 2017-10-03 咪咕数字传媒有限公司 The analysis method and device of a kind of affection data
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method
CN109325228A (en) * 2018-09-19 2019-02-12 苏州大学 English event trigger word abstracting method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
PEI-FENG LI ET AL.: "Using compositional semantics and discourse consistency to improve Chinese trigger identification", 《INFORMATION PROCESSING AND MANAGEMENT》 *
WEI WANG ET AL.: "Chinese News Event 5W1H Elements Extraction using Semantic Role Labeling", 《2010 THIRD INTERNATIONAL SYMPOSIUM ON INFORMATION PROCESSING》 *
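WANG XUEFENG ET AL.: "Military named entity recognition method based on deep learning", 《JOURNAL OF ACADEMY OF ARMORED FORCE ENGINEERING》 *
HE RUIFANG ET AL.: "Joint model for Chinese event extraction based on multi-task learning", 《JOURNAL OF SOFTWARE》 *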

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051887A (en) * 2019-12-26 2021-06-29 深圳市北科瑞声科技股份有限公司 Method, system and device for extracting announcement information elements
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN111368551A (en) * 2020-02-14 2020-07-03 京东数字科技控股有限公司 Method and device for determining event subject
CN111368551B (en) * 2020-02-14 2023-12-05 京东科技控股股份有限公司 Method and device for determining event main body
CN111475617B (en) * 2020-03-30 2023-04-18 招商局金融科技有限公司 Event body extraction method and device and storage medium
CN111475617A (en) * 2020-03-30 2020-07-31 招商局金融科技有限公司 Event body extraction method and device and storage medium
CN112612871B (en) * 2020-12-17 2023-09-15 浙江大学 Multi-event detection method based on sequence generation model
CN112612871A (en) * 2020-12-17 2021-04-06 浙江大学 Multi-event detection method based on sequence generation model
CN113111649A (en) * 2021-04-13 2021-07-13 科大讯飞股份有限公司 Event extraction method, system and equipment
CN113111649B (en) * 2021-04-13 2024-02-20 科大讯飞股份有限公司 Event extraction method, system and equipment
CN113806481A (en) * 2021-09-17 2021-12-17 中国人民解放军国防科技大学 Operation event extraction method oriented to encyclopedic data
CN114707517A (en) * 2022-04-01 2022-07-05 中国人民解放军国防科技大学 Target tracking method based on open source data event extraction
CN114707517B (en) * 2022-04-01 2024-05-03 中国人民解放军国防科技大学 Target tracking method based on open source data event extraction

Also Published As

Publication number Publication date
CN110609896B (en) 2022-03-22

Similar Documents

Publication Publication Date Title
CN110609896B (en) Military scenario text event information extraction method and device based on secondary decoding
CN108628823B (en) Named entity recognition method combining attention mechanism and multi-task collaborative training
CN110083710B (en) Word definition generation method based on cyclic neural network and latent variable structure
US8069027B2 (en) Word alignment apparatus, method, and program product, and example sentence bilingual dictionary
CN110750959A (en) Text information processing method, model training method and related device
CN111950296B (en) Comment target emotion analysis based on BERT fine tuning model
CN110909736B (en) Image description method based on long-term and short-term memory model and target detection algorithm
CN110598203A (en) Military imagination document entity information extraction method and device combined with dictionary
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN110851599A (en) Automatic scoring method and teaching and assisting system for Chinese composition
CN110134946A (en) A kind of machine reading understanding method for complex data
CN110929520B (en) Unnamed entity object extraction method and device, electronic equipment and storage medium
CN110175246A (en) A method of extracting notional word from video caption
CN111695338A (en) Interview content refining method, device, equipment and medium based on artificial intelligence
CN113255331A (en) Text error correction method, device and storage medium
Khuman et al. Grey relational analysis and natural language processing to: grey language processing
CN117290515A (en) Training method of text annotation model, method and device for generating text graph
CN116932736A (en) Patent recommendation method based on combination of user requirements and inverted list
Wang et al. Chinese to Braille translation based on Braille word segmentation using statistical model
CN114880994B (en) Text style conversion method and device from direct white text to irony text
CN109960782A (en) A kind of Tibetan language segmenting method and device based on deep neural network
CN111814433B (en) Uygur language entity identification method and device and electronic equipment
CN115357710A (en) Training method and device for table description text generation model and electronic equipment
CN113191135A (en) Multi-category emotion extraction method fusing facial characters
Teng et al. End-to-End Model Based on Bidirectional LSTM and CTC for Online Handwritten Mongolian Word Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant