CN108509423A - A named entity extraction method for bid-award web pages based on a second-order HMM - Google Patents

A named entity extraction method for bid-award web pages based on a second-order HMM

Info

Publication number
CN108509423A
Authority
CN
China
Prior art keywords
bid
acceptance
entity
candidate
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810300522.1A
Other languages
Chinese (zh)
Inventor
陈羽中
林剑
郭昆
张伟智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN201810300522.1A
Publication of CN108509423A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a named entity extraction method for bid-award web pages based on a second-order HMM, comprising the following steps: converting the HTML code of a bid-award web page into normalized text data and recording the title of each web page; performing word segmentation and part-of-speech tagging on the normalized text data; based on a second-order HMM, identifying the context of named entities in the bid-award data and adding the recognition results to candidate named entity sets; and, based on second-order HMMs and rules, identifying the named entities in the candidate named entity sets. The invention can accurately extract project bid-award information from tendering websites.

Description

A named entity extraction method for bid-award web pages based on a second-order HMM
Technical field
The present invention relates to the technical field of named entity recognition, and in particular to a named entity extraction method for bid-award web pages based on a second-order HMM.
Background
Named entity recognition is a fundamental task in natural language processing. Its purpose is to identify named entities such as person names, place names, and organization names in a corpus. Because the number of such entities keeps growing, they usually cannot be listed exhaustively in a dictionary, and each type is formed according to its own regularities. Their recognition is therefore usually handled separately from lexical processing tasks such as Chinese word segmentation, as a task called named entity recognition.
As a fundamental task in natural language processing, named entity recognition has attracted the close attention of a growing number of experts and scholars, who have proposed a number of optimized algorithms and models. One proposal is a named entity recognition algorithm based on a cascaded HMM, which first recognizes person names and place names and then uses them as features for higher-level organization name recognition. Another is a Chinese named entity recognition algorithm based on conditional random fields, which found that words, boundaries, parts of speech, and entity dictionaries work well as features. Another is a bootstrapping-based method that expands a seed vocabulary to address the shortage of manually labeled data. Another is a named entity recognition algorithm based on a BLSTM neural network, which no longer depends directly on hand-crafted features and domain knowledge but uses context-based word vectors and character-based word vectors: the former express the contextual information of a named entity, and the latter express the prefix, suffix, and domain information of the words that make it up. Another is a named entity recognition algorithm based on the BLSTM-CRF model: because the labels of the words in a sentence are not independent during sequence labeling, the label of the current word is predicted from the label information of the preceding words together with the information of the word itself, and a CRF layer replaces the softmax output layer to produce the final prediction for each word. Another is a deep neural network model based on a stacked autoencoder classifier, which solves the problem of converting a Chinese text sequence into model input vectors and gives vectorized forward and backward propagation formulas that are convenient for engineering implementation.
Most current named entity recognition algorithms recognize only person names, place names, and organization names, and do not subdivide them further.
Summary of the invention
The purpose of the present invention is to provide a named entity extraction method for bid-award web pages based on a second-order HMM, for identifying six kinds of named entity in the project bid-award detail pages of tendering websites: the tendering organization, the winning organization, the location of the tendering organization, the winning bid amount, the tendering contact person, and the tendered project name.
To achieve the above object, the technical solution of the present invention is a named entity extraction method for bid-award web pages based on a second-order HMM, which comprises the following steps: Step A: convert the HTML code of the bid-award web page into normalized text data, and record the title of each web page; Step B: perform word segmentation and part-of-speech tagging on the normalized text data obtained in step A; Step C: based on a second-order HMM, identify the context of named entities in the bid-award data and add the recognition results to the candidate named entity sets; Step D: based on second-order HMMs and rules, identify the named entities in the candidate named entity sets.
In an embodiment of the present invention, step A, in which the HTML code of the bid-award web page is converted into normalized text data and the title of each web page is recorded, comprises the following steps: Step A1: find all <table> tags in the HTML code of the bid-award web page, and perform step A2 on each <table> tag; Step A2: transpose the rows and columns of the text inside the <table> tag, and insert the resulting HTML code into the original HTML code before the </table> tag that closes this <table> tag; Step A3: strip the tags from the HTML code processed in step A2, converting the tags <div>, <p>, <table>, and <tr> to line breaks and all other tags to space characters.
Further, step A2 comprises the following steps: Step A21: count the number n of <tr> tags nested in the <table> tag and the number m of <td> tags nested in a <tr> tag, create a two-dimensional array T of size (n+2) × m, initialize every element of the first row of T to <tr> and every element of the last row to </tr>, then scan the HTML code inside the <table> tag in <tr>-then-<td> order and save the HTML code contained in each <td> tag into T; Step A22: scan the array T generated in step A21 column by column and concatenate its elements into a character string S; Step A23: insert the string S generated in step A22 before the </table> tag that closes this <table> tag.
In an embodiment of the present invention, in step C the context of named entities in the bid-award data is identified with a second-order HMM and the recognition results are added to the candidate named entity sets. The purpose of context identification is to use the semantic information carried by the context of a named entity to judge the type of that entity. Step C comprises the following steps:
Step C1: initialize the parameters of the second-order HMM. The second-order HMM $\lambda = (A, B, \pi)$ consists of the state transition probability matrix $A$, the observation probability matrix $B$, and the initial state probability vector $\pi$. $Q = \{q_1, q_2, \ldots, q_N\}$ is the set of all possible states, where $N$ is the number of possible states. $V = \{v_1, v_2, \ldots, v_M\}$ is the set of all possible observations, that is, the set of words obtained from all samples in the training set after the word segmentation of step B, where $M$ is the number of words.
Here $A = [a_{ijk}]_{N \times N \times N}$ with $a_{ijk} = P(i_{t+1} = q_k \mid i_t = q_j, i_{t-1} = q_i)$, $i, j, k = 1, 2, \ldots, N$, which is the probability of moving to state $q_k$ at time $t+1$ given that the model is in state $q_j$ at time $t$ and in state $q_i$ at time $t-1$.
$B = [b_{ijk}]_{N \times N \times M}$ with $b_{ijk} = P(o_{t+1} = v_k \mid i_t = q_j, i_{t-1} = q_i)$, $i, j = 1, 2, \ldots, N$, $k = 1, 2, \ldots, M$, which is the probability of generating observation $v_k$ at time $t+1$ given that the model is in state $q_j$ at time $t$ and in state $q_i$ at time $t-1$.
$\pi = (\pi_1, \pi_2, \ldots, \pi_N)$ with $\pi_i = P(i_1 = q_i)$, $i = 1, 2, \ldots, N$, which is the probability of being in state $q_i$ at the initial time $t = 1$. Step C2: using the training set, obtain the state transition probability matrix and the observation probability matrix of the model by maximum likelihood estimation; Step C3: decode the hidden states of the training set with the Viterbi algorithm to obtain the optimal state sequence; Step C4: according to the state sequence obtained in step C3, add the words of the observation sequence whose state is marked as a candidate named entity to the corresponding candidate named entity set.
Further, the possible states in $Q = \{q_1, q_2, \ldots, q_N\}$ are: LA, LB, LC, LD, LE, LF, RA, RB, RC, RD, RE, RF, MA, MB, MC, MD, ME, MF, Z. The states LA to LF denote the preceding context of, respectively, the tendering organization, the winning organization, the tendering organization address, the winning bid amount, the tendering contact person, and the project name. The states RA to RF denote the following context of the same six entity types in the same order. The states MA to MF denote, respectively, the candidate tendering organization, candidate winning organization, candidate tendering organization address, candidate winning bid amount, candidate tendering contact person, and candidate project name. Z denotes any other word. Because the context of a named entity may span several words, the states that denote named entity context are marked in BMES fashion: in a run of consecutive identical states, the first state is given the suffix _B, the last state the suffix _E, and the states in between the suffix _M; a state that does not belong to such a run is given the suffix _S.
Further, word sequences whose states are marked MA+ are added to the candidate tendering organization named entity set A, those marked MB+ to the candidate winning organization named entity set B, those marked MC+ to the candidate tendering organization address named entity set C, those marked ME+ to the candidate tendering contact person named entity set E, and those marked MF+ to the candidate project name named entity set F. For the candidate winning bid amount named entity set D, the sub-sequences of states that match the regular expression
[LD_S | LD_B+LD_M*+LD_E] MD+ [RD_S | RD_B+RD_M*+RD_E]
are extracted from the state sequence and added, together with the corresponding observation sub-sequences and in order, to set D. In the regular expression, * means that the corresponding state occurs zero or more times and + means that it occurs one or more times.
In an embodiment of the present invention, step D comprises the following steps: Step D1: identify the tendering organization and the winning organization with a second-order HMM; Step D2: identify the tendering contact person with a second-order HMM; Step D3: identify the winning bid amount, the project name, and the location of the tendering organization by rules.
In an embodiment of the present invention, step D1, in which the candidate tendering organizations and candidate winning organizations are identified with a second-order HMM, comprises the following steps: Step D11: build a second-order HMM for the composition patterns of organization names, in which the observation set is the set of words obtained by segmenting the tendering organizations and winning organizations in the training data; Step D12: compute the parameters of the second-order HMM from the observation sequences and state sequences in the training data; Step D13: for the candidate tendering organization named entity set A and the candidate winning organization named entity set B obtained in step C, find the optimal state sequence with the Viterbi algorithm; Step D14: apply pattern-string matching with the Aho-Corasick algorithm to the optimal state sequence obtained in step D13; if matching succeeds, the candidate is considered an organization name named entity, and if it fails, it is not.
Further, step D2, in which the tendering contact person is identified with a second-order HMM, comprises the following steps:
Step D21: build a second-order HMM for the composition patterns of person names, in which the observation set is the set of words obtained by segmenting the tendering contact persons in the training data; Step D22: compute the parameters of the second-order HMM from the observation sequences and state sequences in the training data; Step D23: for the candidate tendering contact person named entity set E obtained in step C, find the optimal state sequence with the Viterbi algorithm; Step D24: apply pattern-string matching with the Aho-Corasick algorithm to the optimal state sequence obtained in step D23; if matching succeeds, the candidate is considered a tendering contact person named entity, and if it fails, it is not.
Further, in step D3 the winning bid amount, the project name, and the location of the tendering organization are identified by rules, as follows. Winning bid amount: the candidates in set D generated in step C are checked with regular expressions. The observations that correspond to the MD* part of the state sequence are tested for being Arabic numerals or Chinese numerals. If they are not, the candidate is not considered a winning bid amount and is discarded. If they are, the same observations are searched for a unit, namely hundred, thousand, ten thousand, or hundred million (each of which the original text lists in two written forms), US dollar, or Japanese yen, and if a unit is present the amount is converted accordingly. Tendered project name: when the candidate project name named entity set F obtained in step C is not empty, the named entity in the set is used as the tendered project name; when set F is empty, the title of the web page is used as the tendered project name.
Further, the tendering organization address is identified as follows: when the candidate tendering organization address named entity set C obtained in step C is not empty, the parts of the set whose part-of-speech tag is place name are selected; when set C is empty, the parts of the tendering organization produced in step D1 whose part-of-speech tag is place name are used.
Compared with the prior art, the beneficial effect of the present invention is that it can accurately extract project bid-award information from tendering websites.
Description of the drawings
Fig. 1 is a flow chart of the implementation of the method of the present invention.
Detailed description
The present invention is further explained below with reference to the drawing and specific embodiments.
Fig. 1 is a schematic flow chart of the named entity extraction method for bid-award web pages based on a second-order HMM according to the present invention. As shown in Fig. 1, the method comprises the following steps:
Step A: convert the HTML code of the bid-award web page into normalized text data, and record the title of each web page;
Step A comprises the following steps:
Step A1: find all <table> tags in the HTML code of the bid-award web page, and perform step A2 on each <table> tag;
Step A2: transpose the rows and columns of the text inside the <table> tag, and insert the resulting HTML code into the original HTML code before the </table> tag that closes this <table> tag;
Step A2 comprises the following steps:
Step A21: count the number n of <tr> tags nested in the <table> tag and the number m of <td> tags nested in a <tr> tag, create a two-dimensional array T of size (n+2) × m, initialize every element of the first row of T to <tr> and every element of the last row to </tr>, then scan the HTML code inside the <table> tag in <tr>-then-<td> order and save the HTML code contained in each <td> tag into T;
Step A22: scan the array T generated in step A21 column by column and concatenate its elements into a character string S;
Step A23: insert the string S generated in step A22 before the </table> tag that closes this <table> tag;
Specifically, the reason why step A2 transposes the rows and columns of a table is as follows, taking Table 1 as an example:
Table 1
HTML code is
If the text is extracted directly, by stripping the tags without transposing rows and columns, the result is:
"Winning supplier name Contact address Winning bid amount (yuan)
Harbin Lihe Hovercraft Co., Ltd. No. 8, Floor 8, Block A, Zhonghao Wall Street, No. 209 Changjiang Road, Nangang concentration zone, Harbin Economic Development Zone 2,066,600.00"
The visual position relationship of the table is lost: each header no longer sits next to its value. After transposing rows and columns and then stripping the tags, the text becomes:
"Winning supplier name Harbin Lihe Hovercraft Co., Ltd.
Contact address No. 8, Floor 8, Block A, Zhonghao Wall Street, No. 209 Changjiang Road, Nangang concentration zone, Harbin Economic Development Zone
Winning bid amount (yuan) 2,066,600.00"
This preserves the original visual position relationship of the table, that is, the information about which text is the context of which value.
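The transposition of steps A21 and A22 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the class `TableCells` and the function `transpose_table` are hypothetical names, the sketch keeps only the cell text instead of rebuilding HTML, it also accepts <th> cells (an assumption; the text mentions only <td>), and it does not handle nested tables.

```python
from html.parser import HTMLParser
from itertools import zip_longest


class TableCells(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def transpose_table(table_html):
    """Return the cells of one table as the rows of the transposed table."""
    parser = TableCells()
    parser.feed(table_html)
    # zip_longest pads ragged rows with "" so that no cell is dropped.
    return [list(col) for col in zip_longest(*parser.rows, fillvalue="")]


html = (
    "<table>"
    "<tr><td>Winning supplier</td><td>Contact address</td><td>Amount (yuan)</td></tr>"
    "<tr><td>Supplier X</td><td>Address Y</td><td>2,066,600.00</td></tr>"
    "</table>"
)
for row in transpose_table(html):
    print(" ".join(row))   # each header is now followed by its own value
```

Reading the transposed rows line by line puts every header directly before its value, which is the context relationship the later HMM steps rely on.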
Step A3: strip the tags from the HTML code processed in step A2, converting the tags <div>, <p>, <table>, and <tr> to line breaks and all other tags to space characters;
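A minimal Python sketch of the tag stripping in step A3 is shown below. The function name `strip_tags` is hypothetical, treating closing tags the same way as opening tags is an assumption, and the final whitespace clean-up is an added convenience that the text does not describe.

```python
import re

# Tags that become line breaks in step A3; every other tag becomes a space.
BLOCK_TAGS = re.compile(r"</?(?:div|p|table|tr)\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Replace block-level tags with newlines and all other tags with spaces."""
    text = BLOCK_TAGS.sub("\n", html)
    text = ANY_TAG.sub(" ", text)
    # Assumed clean-up: collapse runs of spaces and drop empty lines.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


print(strip_tags("<div>Project</div><table><tr><td>A</td><td>B</td></tr></table>"))
```

A regular expression is enough for this illustration; real pages with scripts, comments, or attributes containing ">" would need a proper HTML parser.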
Step B: perform word segmentation and part-of-speech tagging on the normalized text data obtained in step A;
Step C: based on a second-order HMM, identify the context of named entities in the bid-award data and add the recognition results to the candidate named entity sets;
The purpose of step C is to use the semantic information carried by the context of a named entity to judge the type of that entity. It comprises the following steps:
Step C1: initialize the parameters of the second-order HMM;
Specifically, the second-order HMM $\lambda = (A, B, \pi)$ consists of the state transition probability matrix $A$, the observation probability matrix $B$, and the initial state probability vector $\pi$. $Q = \{q_1, q_2, \ldots, q_N\}$ is the set of all possible states, where $N$ is the number of possible states. $V = \{v_1, v_2, \ldots, v_M\}$ is the set of all possible observations, that is, the set of words obtained from all samples in the training set after the word segmentation of step B, where $M$ is the number of words.
Here $A = [a_{ijk}]_{N \times N \times N}$, with
$a_{ijk} = P(i_{t+1} = q_k \mid i_t = q_j, i_{t-1} = q_i)$, $i, j, k = 1, 2, \ldots, N$, which is the probability of moving to state $q_k$ at time $t+1$ given that the model is in state $q_j$ at time $t$ and in state $q_i$ at time $t-1$;
$B = [b_{ijk}]_{N \times N \times M}$, with
$b_{ijk} = P(o_{t+1} = v_k \mid i_t = q_j, i_{t-1} = q_i)$, $i, j = 1, 2, \ldots, N$, $k = 1, 2, \ldots, M$, which is the probability of generating observation $v_k$ at time $t+1$ given that the model is in state $q_j$ at time $t$ and in state $q_i$ at time $t-1$;
$\pi = (\pi_1, \pi_2, \ldots, \pi_N)$, with
$\pi_i = P(i_1 = q_i)$, $i = 1, 2, \ldots, N$, which is the probability of being in state $q_i$ at the initial time $t = 1$.
The possible states contained in $Q = \{q_1, q_2, \ldots, q_N\}$ are listed in Table 2:
Table 2
Specifically, because the context of a named entity may span several words, the states that denote named entity context are marked in BMES fashion: in a run of consecutive identical states, the first state is given the suffix _B, the last state the suffix _E, and the states in between the suffix _M; a state that does not belong to such a run is given the suffix _S.
Take the preceding context of the tendering organization as an example. When the word sequence that expresses the preceding-context feature of the named entity consists of a single word, that word is marked LA_S. When it consists of two words, the first is marked LA_B and the last LA_E. When it consists of more than two words, the first is marked LA_B, the last LA_E, and the words in between LA_M.
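The BMES marking rule can be sketched as follows. The function name `add_bmes` is hypothetical, and the sketch assumes that context states are exactly those whose names start with L or R, which matches the state names used in this description.

```python
from itertools import groupby


def add_bmes(states, context_prefixes=("L", "R")):
    """Append _B/_M/_E/_S suffixes to runs of identical context states."""
    tagged = []
    for state, run in groupby(states):
        n = len(list(run))
        if not state.startswith(context_prefixes):
            tagged.extend([state] * n)      # candidate (M*) and Z states stay as they are
        elif n == 1:
            tagged.append(state + "_S")     # a single context word
        else:
            tagged.extend([state + "_B"] + [state + "_M"] * (n - 2) + [state + "_E"])
    return tagged


print(add_bmes(["Z", "LA", "LA", "LA", "MA", "MA", "RA", "Z"]))
# ['Z', 'LA_B', 'LA_M', 'LA_E', 'MA', 'MA', 'RA_S', 'Z']
```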
Step C2: using the training set, obtain the state transition probability matrix and the observation probability matrix of the model by maximum likelihood estimation;
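Maximum likelihood estimation for a second-order HMM amounts to counting and normalizing. The sketch below follows the indexing of the definitions above, in which both the next state and the next observation are conditioned on the two preceding states. The function name `estimate_second_order_hmm` and the dictionary-based representation are assumptions, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_second_order_hmm(tagged_sentences):
    """Estimate (pi, A, B) from sequences of (word, state) pairs by relative frequency."""
    init = Counter()
    trans = defaultdict(Counter)   # (q_i, q_j) -> counts of the next state q_k
    emit = defaultdict(Counter)    # (q_i, q_j) -> counts of the next observation v_k
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        states = [s for _, s in sent]
        if states:
            init[states[0]] += 1
        for t in range(1, len(states) - 1):
            ctx = (states[t - 1], states[t])
            trans[ctx][states[t + 1]] += 1
            emit[ctx][words[t + 1]] += 1

    def normalise(table):
        return {ctx: {key: c / sum(cnt.values()) for key, c in cnt.items()}
                for ctx, cnt in table.items()}

    total = sum(init.values())
    pi = {s: c / total for s, c in init.items()}
    return pi, normalise(trans), normalise(emit)


data = [
    [("a", "X"), ("b", "Y"), ("c", "X")],
    [("a", "X"), ("b", "Y"), ("d", "Z")],
]
pi, A, B = estimate_second_order_hmm(data)
print(A[("X", "Y")])   # {'X': 0.5, 'Z': 0.5}
```

In practice unseen state pairs and words would need smoothing, which this description does not specify.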
Step C3: decode the hidden states of the training set with the Viterbi algorithm to obtain the optimal state sequence;
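A second-order Viterbi decoder can be written as ordinary Viterbi over pairs of states. The sketch below is an illustration under stated assumptions: `viterbi2` and its three callback parameters are hypothetical names, scores are log-probabilities, the score of the first two states is supplied by the caller because the text does not define that boundary case, and the emission term uses the indexing of the definitions above (the observation at a position depends on the two preceding states).

```python
import math


def viterbi2(obs, states, log_start, log_trans, log_emit):
    """Best state sequence under a second-order HMM (needs len(obs) >= 2).

    log_start(j, k):    log-score of states j, k at the first two positions
    log_trans(i, j, k): log a_ijk
    log_emit(i, j, o):  log-probability of the next observation o given states i, j
    """
    if len(obs) < 2:
        raise ValueError("this sketch needs at least two observations")
    # delta[(j, k)] is the best log-score of any path ending in states j, k.
    delta = {(j, k): log_start(j, k) for j in states for k in states}
    backptrs = []
    for t in range(2, len(obs)):
        new_delta, ptr = {}, {}
        for j in states:
            for k in states:
                scores = {i: delta[(i, j)] + log_trans(i, j, k) + log_emit(i, j, obs[t])
                          for i in states}
                best_i = max(scores, key=scores.get)
                new_delta[(j, k)] = scores[best_i]
                ptr[(j, k)] = best_i
        delta = new_delta
        backptrs.append(ptr)
    # Backtrack from the best final pair of states.
    path = list(max(delta, key=delta.get))
    for ptr in reversed(backptrs):
        path.insert(0, ptr[(path[0], path[1])])
    return path


NEG_INF = float("-inf")
path = viterbi2(
    ["x", "y", "x"],
    ["A", "B"],
    lambda j, k: 0.0 if (j, k) == ("A", "B") else NEG_INF,   # toy start scores
    lambda i, j, k: math.log(0.9 if k != j else 0.1),        # toy model: states tend to alternate
    lambda i, j, o: 0.0,                                     # toy model: uninformative emissions
)
print(path)   # ['A', 'B', 'A']
```

The running time is proportional to the sequence length times the cube of the number of states, since every pair of states is extended by every possible predecessor.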
Step C4: according to the state sequence obtained in step C3, add the words of the observation sequence whose state is marked as a candidate named entity to the corresponding candidate named entity set.
Specifically, word sequences whose states are marked MA+ are added to the candidate tendering organization named entity set A, those marked MB+ to the candidate winning organization named entity set B, those marked MC+ to the candidate tendering organization address named entity set C, those marked ME+ to the candidate tendering contact person named entity set E, and those marked MF+ to the candidate project name named entity set F.
Specifically, for the candidate winning bid amount named entity set D, the sub-sequences of states that match the regular expression [LD_S | LD_B+LD_M*+LD_E] MD+ [RD_S | RD_B+RD_M*+RD_E] are extracted from the state sequence and added, together with the corresponding observation sub-sequences and in order, to set D. In the regular expression, * means that the corresponding state occurs zero or more times and + means that it occurs one or more times.
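One way to apply such a pattern to a state sequence is to join the state names into a string and use an ordinary regular expression, as in the sketch below. The names `AMOUNT_PATTERN` and `extract_amount_spans` are hypothetical, and the translation of the pattern into Python syntax, with the + between different states read as concatenation, is an interpretation of the expression above.

```python
import re

# Preceding context, then one or more MD states, then following context.
AMOUNT_PATTERN = re.compile(
    r"(?<!\S)(?:LD_S|LD_B(?: LD_M)* LD_E)(?: MD)+ (?:RD_S|RD_B(?: RD_M)* RD_E)(?!\S)"
)


def extract_amount_spans(words, states):
    """Return (words, states) sub-sequences whose states match the amount pattern."""
    joined = " ".join(states)
    spans = []
    for m in AMOUNT_PATTERN.finditer(joined):
        start = joined.count(" ", 0, m.start())       # index of the first matched token
        end = start + m.group(0).count(" ") + 1       # one past the last matched token
        spans.append((words[start:end], states[start:end]))
    return spans


words = ["notice", "amount:", "2,066,600.00", "yuan", "(RMB)", "end"]
states = ["Z", "LD_S", "MD", "MD", "RD_S", "Z"]
print(extract_amount_spans(words, states))
```

The lookarounds `(?<!\S)` and `(?!\S)` keep a match from starting or ending in the middle of a state name.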
Step D: based on second-order HMMs and rules, identify the named entities in the candidate named entity sets.
Step D comprises the following steps:
Step D1: identify the tendering organization and the winning organization with a second-order HMM;
Step D1 comprises the following steps:
Step D11: build a second-order HMM for the composition patterns of organization names;
Specifically, for tendering organization named entities and winning organization named entities, the observation set of the second-order HMM is the set of words obtained by segmenting the tendering organizations and winning organizations in the training data, and the state set is given in Table 3:
Table 3
State    Meaning
OA       Simple place-name prefix
OB       Simple organization-name prefix
OC       General prefix
OD       Special prefix
OE       Organization-name feature word
OF       Other (not part of an organization name)
Step D12: compute the parameters of the second-order HMM from the observation sequences and state sequences in the training data;
Step D13: for the candidate tendering organization named entity set A and the candidate winning organization named entity set B obtained in step C, find the optimal state sequence with the Viterbi algorithm;
Step D14: apply pattern-string matching with the Aho-Corasick algorithm to the optimal state sequence obtained in step D13; if matching succeeds, the candidate is considered an organization name named entity, and if it fails, it is not.
Step D2: identify the tendering contact person with a second-order HMM;
Step D2 comprises the following steps:
Step D21: build a second-order HMM for the composition patterns of person names;
Here the observation set of the second-order HMM is the set of words obtained by segmenting the tendering contact persons in the training data, and the state set is given in Table 4:
Table 4
State    Meaning
PA       Surname of a Chinese person name
PB       First character of a two-character given name
PC       Last character of a two-character given name
PD       Single-character given name
PE       Person-name feature word
PF       Conjunction
PZ       Other (not part of a person name)
Step D22: compute the parameters of the second-order HMM from the observation sequences and state sequences in the training data;
Step D23: for the candidate tendering contact person named entity set E obtained in step C, find the optimal state sequence with the Viterbi algorithm;
Step D24: apply pattern-string matching with the Aho-Corasick algorithm to the optimal state sequence obtained in step D23; if matching succeeds, the candidate is considered a tendering contact person named entity, and if it fails, it is not.
Step D3: identify the winning bid amount, the project name, and the location of the tendering organization by rules;
Specifically, the rules used in step D3 are as follows:
Winning bid amount: the candidates in set D generated in step C are checked with regular expressions. The observations that correspond to the MD* part of the state sequence are tested for being Arabic numerals or Chinese numerals. If they are not, the candidate is not considered a winning bid amount and is discarded. If they are, the same observations are searched for a unit, namely hundred, thousand, ten thousand, or hundred million (each of which the original text lists in two written forms), US dollar, or Japanese yen, and if a unit is present the amount is converted accordingly.
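The amount rule can be sketched as below for Arabic numerals only. Everything specific in the sketch is an assumption made for illustration: the function name `normalise_amount`, the Chinese unit characters and their multipliers, the currency codes, and the default of CNY when no currency word is found. Chinese-numeral amounts and exchange-rate conversion are left out.

```python
import re
from decimal import Decimal

# Assumed unit characters (two written forms each) and their multipliers.
UNIT_MULTIPLIERS = {
    "百": 100, "佰": 100,
    "千": 1_000, "仟": 1_000,
    "万": 10_000, "萬": 10_000,
    "亿": 100_000_000, "億": 100_000_000,
}
CURRENCIES = {"美元": "USD", "日元": "JPY"}   # assumed currency words
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def normalise_amount(text):
    """Return (value, currency) for a candidate amount, or None if it is not numeric."""
    m = NUMBER.search(text)
    if m is None:
        return None                        # not a number: discard the candidate
    value = Decimal(m.group(0).replace(",", ""))
    rest = text[m.end():].lstrip()
    for unit, factor in UNIT_MULTIPLIERS.items():
        if rest.startswith(unit):
            value *= factor                # unit conversion, e.g. 万 means x 10,000
            break
    currency = next((code for word, code in CURRENCIES.items() if word in text), "CNY")
    return value, currency


print(normalise_amount("206.66万元"))
print(normalise_amount("2,066,600.00"))
```

`Decimal` is used instead of `float` so that amounts such as 206.66 are not distorted by binary rounding.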
Tendered project name: when the candidate project name named entity set F obtained in step C is not empty, the named entity in the set is used as the tendered project name; when set F is empty, the title of the web page is used as the tendered project name.
Tendering organization address: when the candidate tendering organization address named entity set C obtained in step C is not empty, the parts of the set whose part-of-speech tag is place name are selected; when set C is empty, the parts of the tendering organization produced in step D1 whose part-of-speech tag is place name are used.
The named entity extraction method for bid-award web pages based on a second-order HMM of the present invention can be used to identify six kinds of named entity in the project bid-award detail pages of tendering websites: the tendering organization, the winning organization, the location of the tendering organization, the winning bid amount, the tendering contact person, and the tendered project name.
The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention whose resulting function does not go beyond the scope of that technical solution falls within the scope of protection of the present invention.

Claims (11)

1. a kind of acceptance of the bid webpage based on second order HMM names entity abstracting method, it is characterised in that:Include the following steps:
Step A:The HTML code for webpage of getting the bid is converted to the text data of standardization, and records the corresponding mark of each webpage Topic;
Step B:Participle and part-of-speech tagging are carried out to the text data after the obtained standardization of step A;
Step C:Based on second order HMM model, the context identification of entity is named to acceptance of the bid data and recognition result is added It is named in entity sets to candidate;
Step D:Based on second order HMM model and rule, the name entity in entity sets is named to be identified to candidate.
2. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 1, characterized in that: in step A, converting the HTML code of the winning-bid web page into normalized text data and recording the title corresponding to each web page specifically comprises the following steps:
Step A1: find all <table> tags in the HTML code of the winning-bid web page, and perform step A2 on each <table> tag;
Step A2: transpose the rows and columns of the text in the <table> tag, and add the HTML code obtained from the conversion to the original HTML code before the </table> tag that corresponds to this <table> tag;
Step A3: process the tags in the HTML code produced by step A2: convert the <div>, <p>, <table>, and <tr> tags to newline characters, and convert all other tags to space characters.
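Step A3 can be illustrated with a minimal Python sketch. The function name `tags_to_text` and the two regular expressions are hypothetical, and the sketch assumes that both the opening and the closing forms of the four block tags become newlines, which the claim does not state explicitly.

```python
import re

# Assumption: opening and closing forms of <div>, <p>, <table>, <tr> are treated alike.
BLOCK_TAG = re.compile(r"</?(?:div|p|table|tr)\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def tags_to_text(html):
    """Replace block-level tags with newlines and every other tag with a space."""
    text = BLOCK_TAG.sub("\n", html)   # <div>, <p>, <table>, <tr> -> newline
    return ANY_TAG.sub(" ", text)      # any remaining tag -> space


print(repr(tags_to_text("<div>Winner:<b>ACME</b></div>")))
```

A regular expression is enough here because the step only needs tag boundaries, not the document tree; malformed markup would call for a real HTML parser.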
3. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 2, characterized in that: in step A2, transposing the rows and columns of the text in the <table> tag and adding the HTML code obtained from the conversion to the original HTML code before the </table> tag that corresponds to this <table> tag specifically comprises the following steps:
Step A21: count the number n of <tr> tags nested in the <table> tag and the number m of <td> tags nested in each <tr> tag; create a two-dimensional array T of size (n+2) × m, and initialize the elements of the first row of T to <tr> and the elements of the last row to </tr>; then scan the HTML code inside the <table> tag, <tr> first and <td> second, and save the HTML code contained in each <td> tag into T;
Step A22: scan the two-dimensional array T generated in step A21 column by column, and concatenate the elements of T to generate a character string S;
Step A23: add the character string S generated in step A22 before the </table> tag that corresponds to this <table> tag.
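Steps A21 and A22 can be sketched as follows. The sketch assumes the table has already been parsed into `rows`, a list of n rows of m cell strings each (a rectangular table). The function name `transpose_table` is hypothetical. Re-wrapping each cell in `<td>` is also an assumption, made so that step A3 later separates the cells with spaces.

```python
def transpose_table(rows):
    """Build the (n + 2) x m array T and read it column by column (steps A21, A22)."""
    n = len(rows)
    m = len(rows[0]) if rows else 0
    T = [[""] * m for _ in range(n + 2)]
    for j in range(m):
        T[0][j] = "<tr>"        # first row of T opens a new table row
        T[n + 1][j] = "</tr>"   # last row of T closes it
    for i, row in enumerate(rows):
        for j in range(m):
            # Assumption: cells are re-wrapped in <td> so they stay separable.
            T[i + 1][j] = "<td>" + row[j] + "</td>"
    # Column-by-column scan: each column of T becomes one row of the new table.
    return "".join(T[i][j] for j in range(m) for i in range(n + 2))


print(transpose_table([["Item", "Amount"], ["Road", "100"]]))
```

Reading T by column is what performs the transposition: the j-th column of the original table becomes the j-th `<tr>` of the string S.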
4. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 1, characterized in that: in step C, the contexts of named entities in the winning-bid data are recognized based on a second-order HMM and the recognition results are added to the candidate named entity sets; the purpose of recognizing the context of a named entity is to use the semantic information carried by that context to judge the type of the named entity; step C specifically comprises the following steps:
Step C1: initialize the parameters of the second-order HMM;
The second-order HMM $\lambda = (A, B, \pi)$ consists of the state transition probability matrix $A$, the observation probability matrix $B$, and the initial state probability vector $\pi$. $Q = \{q_1, q_2, \ldots, q_N\}$ is the set of all possible states, where $N$ is the number of possible states; $V = \{v_1, v_2, \ldots, v_M\}$ is the set of all possible observations, i.e. the set of words obtained after all samples in the training set have been segmented in step B, where $M$ is the number of words. Here $A = [a_{ijk}]_{N \times N \times N}$ with $a_{ijk} = P(i_{t+1} = q_k \mid i_t = q_j, i_{t-1} = q_i)$, $i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, N$; $k = 1, 2, \ldots, N$, which is the probability of moving to state $q_k$ at time $t+1$ given that the model is in state $q_j$ at time $t$ and in state $q_i$ at time $t-1$;
$B = [b_{ijk}]_{N \times N \times M}$ with $b_{ijk} = P(o_{t+1} = v_k \mid i_t = q_j, i_{t-1} = q_i)$, $i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, N$; $k = 1, 2, \ldots, M$, which is the probability of generating observation $v_k$ at time $t+1$ given that the model is in state $q_j$ at time $t$ and in state $q_i$ at time $t-1$;
$\pi = (\pi_1, \pi_2, \ldots, \pi_N)$ with $\pi_i = P(i_1 = q_i)$, $i = 1, 2, \ldots, N$, which is the probability of being in state $q_i$ at the initial time $t = 1$;
Step C2: using the training set, obtain the state transition probability matrix and the observation probability matrix of the model by maximum likelihood estimation;
Step C3: decode the hidden states of the training set with the Viterbi algorithm to obtain the optimal state sequence;
Step C4: according to the state sequence obtained in step C3, add the words of the corresponding observation sequence whose state is labelled as a candidate named entity to the corresponding candidate named entity set.
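Steps C2 and C3 can be sketched in Python under several assumptions. Maximum likelihood estimation is done by relative frequency. A hypothetical padding state `START` replaces the initial vector π. Each emission is conditioned on the previous and the current state, which is a common second-order convention and may differ from the exact indexing in the claim. Unseen events get a small floor probability. None of the names below come from the patent.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical padding state, not part of the patent's state set


def train(tagged_sentences):
    """Relative-frequency (maximum likelihood) estimates from [(word, state), ...] lists."""
    trans = defaultdict(lambda: defaultdict(int))  # (s[t-2], s[t-1]) -> s[t] -> count
    emit = defaultdict(lambda: defaultdict(int))   # (s[t-1], s[t]) -> word -> count
    for sentence in tagged_sentences:
        prev2, prev1 = START, START
        for word, state in sentence:
            trans[(prev2, prev1)][state] += 1
            emit[(prev1, state)][word] += 1
            prev2, prev1 = prev1, state

    def normalise(table):
        return {ctx: {key: c / sum(counts.values()) for key, c in counts.items()}
                for ctx, counts in table.items()}

    return normalise(trans), normalise(emit)


def viterbi(words, states, A, B, floor=1e-8):
    """Second-order Viterbi over state pairs, in log space."""
    def logp(table, ctx, key):
        return math.log(table.get(ctx, {}).get(key, floor))

    # delta maps a state pair (s[t-1], s[t]) to (best log score, best path so far).
    delta = {(START, START): (0.0, [])}
    for word in words:
        new_delta = {}
        for (p2, p1), (score, path) in delta.items():
            for s in states:
                cand = score + logp(A, (p2, p1), s) + logp(B, (p1, s), word)
                if (p1, s) not in new_delta or cand > new_delta[(p1, s)][0]:
                    new_delta[(p1, s)] = (cand, path + [s])
        delta = new_delta
    return max(delta.values(), key=lambda item: item[0])[1]


corpus = [[("tenderer", "LA"), ("Fuzhou", "MA"), ("University", "MA"), ("address", "RA")]]
A, B = train(corpus)
print(viterbi(["tenderer", "Fuzhou", "University", "address"], ["LA", "MA", "RA"], A, B))
```

Tracking the best path per state pair rather than per single state is what makes the decoder second-order: the next transition depends only on the last two states, so one survivor per pair is sufficient.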
5. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 4, characterized in that: the possible states contained in $Q = \{q_1, q_2, \ldots, q_N\}$ are: LA, LB, LC, LD, LE, LF, RA, RB, RC, RD, RE, RF, MA, MB, MC, MD, ME, MF, Z. These states respectively denote: the preceding context of the tendering organization, of the winning-bid organization, of the tendering organization address, of the winning-bid amount, of the tendering contact person, and of the project name; the following context of the tendering organization, of the winning-bid organization, of the tendering organization address, of the winning-bid amount, of the tendering contact person, and of the project name; the candidate tendering organization, candidate winning-bid organization, candidate tendering organization address, candidate winning-bid amount, candidate tendering contact person, and candidate project name; and other. Because the context of a named entity may comprise several words, the states that denote named entity contexts are marked in BMES fashion: in a run of consecutive identical states, the first state is marked with the suffix _B, the last state with the suffix _E, and the states in between with the suffix _M; a state that does not belong to such a run is marked with the suffix _S.
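The BMES marking of context states can be sketched as below. The function name `add_bmes` is hypothetical, and the sketch assumes that context states are exactly those whose label starts with L or R, so the candidate states MA to MF and the state Z are left unsuffixed.

```python
from itertools import groupby


def add_bmes(states):
    """Suffix runs of identical context states with _B/_M/_E, single ones with _S."""
    marked = []
    for state, run in groupby(states):
        length = len(list(run))
        if not state.startswith(("L", "R")):
            # Assumption: candidate states (M*) and Z carry no BMES suffix.
            marked.extend([state] * length)
        elif length == 1:
            marked.append(state + "_S")
        else:
            marked.extend([state + "_B"] + [state + "_M"] * (length - 2) + [state + "_E"])
    return marked


print(add_bmes(["LA", "LA", "LA", "MA", "MA", "RA", "Z"]))
```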
6. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 4, characterized in that: the word sequences whose states are labelled MA+ are added to the candidate tendering organization named entity set A; the word sequences labelled MB+ are added to the candidate winning-bid organization named entity set B; the word sequences labelled MC+ are added to the candidate tendering organization address named entity set C; the word sequences labelled ME+ are added to the candidate tendering contact person named entity set E; and the word sequences labelled MF+ are added to the candidate project name named entity set F;
For the candidate winning-bid amount named entity set D, the sub-sequences of states that match the regular expression [LD_S | LD_B+LD_M*+LD_E] MD+ [RD_S | RD_B+RD_M*+RD_E] are extracted from the state sequence, and each is added, together with its corresponding observation sub-sequence, to the candidate winning-bid amount named entity set D; in the regular expression, * means that the corresponding state occurs zero or more times, and + means that the corresponding state occurs one or more times.
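The extraction for set D can be sketched with Python's `re` module by joining the state labels into one space-terminated string. The sketch reads the expression as "(LD_S, or LD_B followed by any number of LD_M and one LD_E), then one or more MD, then the mirrored right context". That reading, the name `extract_amount_spans`, and the encoding are all assumptions.

```python
import re

# Every state label is followed by a single space so tokens cannot run together.
PATTERN = re.compile(
    r"(?<!\S)"                                 # start on a token boundary
    r"(?:LD_S |LD_B (?:LD_M )*LD_E )"          # left context of the amount
    r"(?:MD )+"                                # one or more candidate amount states
    r"(?:RD_S |RD_B (?:RD_M )*RD_E )"          # right context of the amount
)


def extract_amount_spans(states, words):
    """Return (state sub-sequence, word sub-sequence) pairs matching the pattern."""
    joined = "".join(state + " " for state in states)
    spans = []
    for match in PATTERN.finditer(joined):
        start = joined[:match.start()].count(" ")  # tokens before the match
        end = joined[:match.end()].count(" ")      # tokens up to the match end
        spans.append((states[start:end], words[start:end]))
    return spans


states = ["Z", "LD_B", "LD_E", "MD", "MD", "RD_S", "Z"]
words = ["x", "winning", "amount", "1200", "万", "yuan", "y"]
print(extract_amount_spans(states, words))
```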
7. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 1, characterized in that step D comprises the following steps:
Step D1: recognize the tendering organization and the winning-bid organization with a second-order HMM;
Step D2: recognize the tendering contact person with a second-order HMM;
Step D3: recognize the winning-bid amount, the project name, and the location of the tendering organization by rules.
8. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 7, characterized in that: in step D1, the candidate tendering organizations and candidate winning-bid organizations are recognized with a second-order HMM, which specifically comprises the following steps:
Step D11: build a second-order HMM for the composition patterns of organization names;
For the tendering organization and winning-bid organization named entities, the observation set of the second-order HMM is the set of words obtained by segmenting the tendering organizations and winning-bid organizations in the training data;
Step D12: compute the parameters of the second-order HMM from the observation sequences and state sequences in the training data;
Step D13: for the candidate tendering organization named entity set A and the candidate winning-bid organization named entity set B obtained in step C, find the optimal state sequence with the Viterbi algorithm;
Step D14: perform pattern matching on the optimal state sequence obtained in step D13 using the Aho-Corasick algorithm; if the match succeeds, the candidate is considered an organization-name named entity; if the match fails, it is not.
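The pattern matching of step D14 can be illustrated with a compact Aho-Corasick automaton over sequences of state labels. The role labels `LOC`, `KEY`, and `SUFFIX` and the two patterns are purely hypothetical, since the source does not list the composition patterns. The sketch also assumes that any occurrence of a pattern inside the decoded sequence counts as a successful match.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of label sequences."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        node = 0
        for symbol in pattern:
            if symbol not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][symbol] = len(goto) - 1
            node = goto[node][symbol]
        out[node].append(tuple(pattern))
    queue = deque(goto[0].values())          # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for symbol, child in goto[node].items():
            f = fail[node]
            while f and symbol not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(symbol, 0)
            out[child] = out[child] + out[fail[child]]  # inherit shorter matches
            queue.append(child)
    return goto, fail, out


def find_all(sequence, automaton):
    """Return (start index, pattern) for every pattern occurrence in the sequence."""
    goto, fail, out = automaton
    node, hits = 0, []
    for pos, symbol in enumerate(sequence):
        while node and symbol not in goto[node]:
            node = fail[node]
        node = goto[node].get(symbol, 0)
        for pattern in out[node]:
            hits.append((pos - len(pattern) + 1, pattern))
    return hits


# Hypothetical composition patterns for organization names.
automaton = build_automaton([("LOC", "KEY", "SUFFIX"), ("KEY", "SUFFIX")])
print(find_all(["LOC", "KEY", "SUFFIX"], automaton))
```

The automaton scans each decoded sequence once regardless of how many patterns are registered, which is the usual reason to prefer it over testing the patterns one at a time.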
9. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 7, characterized in that: in step D2, the tendering contact person is recognized with a second-order HMM, specifically as follows:
Step D21: build a second-order HMM for the composition patterns of personal names;
The observation set of the second-order HMM is the set of words obtained by segmenting the tendering contact persons in the training data;
Step D22: compute the parameters of the second-order HMM from the observation sequences and state sequences in the training data;
Step D23: for the candidate tendering contact person named entity set E obtained in step C, find the optimal state sequence with the Viterbi algorithm;
Step D24: perform pattern matching on the optimal state sequence obtained in step D23 using the Aho-Corasick algorithm; if the match succeeds, the candidate is considered a tendering contact person named entity; if the match fails, it is not.
10. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 7, characterized in that: in step D3, the winning-bid amount, the project name, and the location of the tendering organization are recognized by rules, specifically as follows: the candidate winning-bid amounts are checked with regular expressions, i.e. for set D generated in step C, it is determined whether the observation sequence corresponding to the part of the state sequence labelled MD* consists of Arabic numerals or Chinese numerals; if not, the candidate is not considered a winning-bid amount and is discarded; if so, the observations corresponding to the MD* part of the state sequence are searched for a unit, namely "hundred", "thousand", "ten thousand", or "hundred million" (each in two written variants), "dollar", or "yen", and if one is present, unit conversion is performed;
The tender project name is recognized as follows: when the candidate tender project name named entity set F obtained in step C is not empty, the named entities in the set are used as the tender project name; when set F is empty, the title of the web page is used as the tender project name.
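The amount rule can be sketched as follows. The Chinese unit characters in `UNIT_MULTIPLIER` are an assumption about the original spellings, which the source gives only in translated form. Parsing of Chinese numerals and currency conversion are left out because the source gives no details for them, and the name `normalise_amount` is hypothetical.

```python
import re
from decimal import Decimal

# Assumed original spellings of the units; the source lists them only in translation.
UNIT_MULTIPLIER = {
    "百": 100, "佰": 100,
    "千": 1000, "仟": 1000,
    "万": 10**4, "萬": 10**4,
    "亿": 10**8, "億": 10**8,
}
NUMBER = re.compile(r"[0-9]+(?:,[0-9]{3})*(?:\.[0-9]+)?")


def normalise_amount(tokens):
    """Return the amount as a Decimal, or None when no Arabic numeral is found."""
    text = "".join(tokens)
    match = NUMBER.search(text)
    if match is None:
        return None                      # candidate is discarded
    value = Decimal(match.group().replace(",", ""))
    for char in text[match.end():]:
        if char in UNIT_MULTIPLIER:      # apply each unit that follows the number
            value *= UNIT_MULTIPLIER[char]
    return value


print(normalise_amount(["1200", "万", "元"]))
```

`Decimal` is used instead of `float` so that monetary values with fractional parts are not rounded by binary floating point.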
11. The winning-bid web page named entity extraction method based on a second-order HMM according to claim 10, characterized in that: the tendering organization address is recognized as follows: when the candidate tendering organization address named entity set C obtained in step C is not empty, the parts of the set whose part-of-speech tag is place name are selected; when set C is empty, the parts of the tendering organization generated in step D1 whose part-of-speech tag is place name are used.
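The fallback rule for the address can be sketched in a few lines. The sketch assumes that words arrive as (word, part-of-speech tag) pairs and that place names carry the tag `ns`, as in some common Chinese tag sets. The source does not name the tag set, and `pick_address` is a hypothetical helper.

```python
def pick_address(candidate_set_c, organisation_tokens, place_tag="ns"):
    """Prefer place-name words from set C; fall back to the recognized organization."""
    source = candidate_set_c if candidate_set_c else organisation_tokens
    return [word for word, tag in source if tag == place_tag]


# Set C is empty here, so the place name is taken from the organization name.
print(pick_address([], [("福州", "ns"), ("大学", "n")]))
```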

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810300522.1A CN108509423A (en) 2018-04-04 2018-04-04 A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810300522.1A CN108509423A (en) 2018-04-04 2018-04-04 A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM

Publications (1)

Publication Number Publication Date
CN108509423A true CN108509423A (en) 2018-09-07

Family

ID=63380702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810300522.1A Pending CN108509423A (en) 2018-04-04 2018-04-04 A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM

Country Status (1)

Country Link
CN (1) CN108509423A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6052682A (en) * 1997-05-02 2000-04-18 Bbn Corporation Method of and apparatus for recognizing and labeling instances of name classes in textual environments
CN102314417A (en) * 2011-09-22 2012-01-11 西安电子科技大学 Method for identifying Web named entity based on statistical model
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
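冯博琴 et al.: "C++ Programming, 4.2.3 Referencing two-dimensional array elements", China Railway Publishing House *
周顺先: "Text information extraction based on a second-order hidden Markov model", Acta Electronica Sinica *
李鹏飞: "Named entity recognition with HMM", 《HTTPS://WWW.LOOKFOR404.COM/用隐马尔可夫模型HMM做命名实体识别-NER系列二/》 *
李鹏飞: "Named entity recognition with rules", 《HTTPS://WWW.LOOKFOR404.COM/用规则做命名实体识别-NER系列(一)/》 *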

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408825A (en) * 2018-11-06 2019-03-01 杭州费尔斯通科技有限公司 A kind of acceptance of the bid data extraction method based on name Entity recognition
CN109753660A (en) * 2019-01-07 2019-05-14 福州大学 A kind of acceptance of the bid webpage name entity abstracting method based on LSTM
CN109753660B (en) * 2019-01-07 2023-06-13 福州大学 LSTM-based winning bid web page named entity extraction method
CN109492230A (en) * 2019-01-11 2019-03-19 浙江大学城市学院 A method of insurance contract key message is extracted based on textview field convolutional neural networks interested
CN109492230B (en) * 2019-01-11 2022-12-20 浙江大学城市学院 Method for extracting insurance contract key information based on interested text field convolutional neural network
CN109933692A (en) * 2019-04-01 2019-06-25 北京百度网讯科技有限公司 Establish the method and apparatus of mapping relations, the method and apparatus of information recommendation
CN109933692B (en) * 2019-04-01 2022-04-08 北京百度网讯科技有限公司 Method and device for establishing mapping relation and method and device for recommending information
CN110688841A (en) * 2019-09-30 2020-01-14 广州准星信息科技有限公司 Mechanism name identification method, mechanism name identification device, mechanism name identification equipment and storage medium
CN114492426A (en) * 2021-12-30 2022-05-13 北京百度网讯科技有限公司 Sub-word segmentation method, model training method, device and electronic equipment
CN116304060A (en) * 2023-05-16 2023-06-23 北京拓普丰联信息科技股份有限公司 Method and device for constructing universal word stock based on clustering and electronic equipment
CN116304060B (en) * 2023-05-16 2023-08-25 北京拓普丰联信息科技股份有限公司 Method and device for constructing universal word stock based on clustering and electronic equipment

Similar Documents

Publication Publication Date Title
CN108509423A (en) A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN106202010B (en) Method and apparatus based on deep neural network building Law Text syntax tree
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN111160031A (en) Social media named entity identification method based on affix perception
CN109635280A (en) A kind of event extraction method based on mark
CN111291188B (en) Intelligent information extraction method and system
US20070055662A1 (en) Method and apparatus for learning, recognizing and generalizing sequences
CN110263325A (en) Chinese automatic word-cut
CN111159485B (en) Tail entity linking method, device, server and storage medium
Tran et al. Understanding what the users say in chatbots: A case study for the Vietnamese language
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN112711948A (en) Named entity recognition method and device for Chinese sentences
CN111159407A (en) Method, apparatus, device and medium for training entity recognition and relation classification model
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium
CN114428850B (en) Text retrieval matching method and system
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN111260428A (en) Commodity recommendation method and device
CN112069312A (en) Text classification method based on entity recognition and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180907