CN107291700A - Entity word recognition method and device - Google Patents

Entity word recognition method and device

Publication number
CN107291700A
Authority
CN
China
Prior art keywords
entity word
dictionary
instance
field
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710580718.6A
Other languages
Chinese (zh)
Inventor
晋彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Special Road Mdt Infotech Ltd
Original Assignee
Guangzhou Special Road Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Special Road Mdt Infotech Ltd
Priority to CN201710580718.6A
Publication of CN107291700A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses an entity word recognition method comprising the steps of: collecting structured data, and generating corpora for several domains after preliminarily filtering and simplifying the structured data; training on the corpus of each domain to generate a first entity dictionary for the corresponding domain; and verifying the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary. The method effectively addresses the low efficiency and high cost of entity word recognition in the prior art: new words are mined without manual collection, which reduces labor cost, and entity words are recognized and the dictionary is updated automatically.

Description

Entity word recognition method and device
Technical field
The present invention relates to the field of computers, and in particular to an entity word recognition method and device.
Background
With the rapid development of science, technology and the Internet, computer and network technologies have penetrated every aspect of people's work and life. People increasingly use computers to obtain the information they need, for example through information retrieval, computer-aided translation and automatic question answering. The database of a computer server stores entity words such as product names, models, company names and brand names. If a sentence that a user enters through a client contains an entity word stored in the database, the corresponding result (for example a translation result, a question-answering result or a retrieval result) can be looked up directly in the server's database and fed back to the client. In this way the server can quickly return results for existing entity words, which improves the response speed of the system. This approach also ensures the accuracy of the returned data and the effectiveness of data transmission, and it avoids the user repeatedly sending retrieval, translation and similar requests through the client, which reduces the amount of data the server transmits to the client.
Entity words in a typical server database are mostly obtained by manual collection. As technology develops, new entity words are continually produced, particularly in certain specialized domains, and manual collection often cannot update the entity words in the database in time. When a user then sends a retrieval, translation or similar request from the client to the server, the server cannot respond quickly and accurately, which reduces response speed. When users cannot obtain accurate or desired results, they tend to keep sending new requests, which increases both the load on the server and the amount of data it transmits. In addition, mining new entity words by manual collection requires a large amount of work and increases labor cost.
Summary of the invention
The purpose of the embodiments of the present invention is to provide an entity word recognition method and device that effectively address the low efficiency and high cost of entity word recognition in the prior art.
To achieve the above purpose, an embodiment of the present invention provides an entity word recognition method comprising the steps of:
collecting structured data, and generating corpora for several domains after preliminarily filtering and simplifying the structured data;
training on the corpus of each domain to generate a first entity dictionary for the corresponding domain;
verifying the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary.
Compared with the prior art, the entity word recognition method disclosed by the invention collects structured data and generates corpora for several domains after preliminarily filtering and simplifying the structured data; trains on the corpus of each domain to generate a first entity dictionary for the corresponding domain; verifies the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary; and recognizes entity words according to the second entity dictionary. This effectively addresses the low efficiency and high cost of entity word recognition in the prior art, and allows entity words to be recognized and the dictionary to be updated automatically.
As an improvement of the above scheme, the categories of the entity words include person names, place names, companies and brands.
As an improvement of the above scheme, recognizing an entity word includes recognizing the category, weight and domain of the entity word.
As an improvement of the above scheme, recognizing entity words according to the second entity dictionary is specifically:
recognizing the entity words by a linear mapping technique according to the second entity dictionary.
As an improvement of the above scheme, generating corpora for several domains after preliminarily filtering and simplifying the structured data is specifically:
generating the corpora for several domains after preliminarily filtering and simplifying the structured data by big data ETL technology.
As an improvement of the above scheme, verifying the first entity dictionary of each domain against a large number of articles to generate the second entity dictionary is specifically:
according to the first entity dictionary of each domain, training the co-occurrence rate between entity words on a large number of articles by means of a conditional random field, thereby generating the second entity dictionary.
As an improvement of the above scheme, the method further includes, after recognizing entity words according to the second entity dictionary, the step of:
performing a secondary verification of the recognized entity words by a part-of-speech and semantic engine.
An embodiment of the present invention further provides an entity word recognition device, including:
a collection module, for collecting structured data and generating corpora for several domains after preliminarily filtering and simplifying the structured data;
a first entity dictionary generation module, for training on the corpus of each domain to generate a first entity dictionary for the corresponding domain;
a recognition module, for verifying the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary.
Compared with the prior art, the entity word recognition device disclosed by the invention collects structured data through the collection module and generates corpora for several domains after preliminarily filtering and simplifying the structured data; the first entity dictionary generation module then trains on the corpus of each domain to generate a first entity dictionary for the corresponding domain; a second entity dictionary generation module then verifies the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary; and entity words are recognized according to the second entity dictionary. This effectively addresses the low efficiency and high cost of entity word recognition in the prior art, and allows entity words to be recognized and the dictionary to be updated automatically.
As an improvement of the above scheme, the categories of the entity words include person names, place names, companies and brands.
As an improvement of the above scheme, recognizing an entity word includes recognizing the category, weight and domain of the entity word.
Brief description of the drawings
Fig. 1 is a schematic flowchart of an entity word recognition method provided by Embodiment 1 of the present invention.
Fig. 2 is a schematic flowchart of an entity word recognition method provided by Embodiment 2 of the present invention.
Fig. 3 is a schematic structural diagram of an entity word recognition device provided by Embodiment 3 of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work fall within the scope of protection of the invention.
Referring to Fig. 1, which is a schematic flowchart of an entity word recognition method provided by Embodiment 1 of the present invention, the method includes the steps:
S1: collect structured data, and generate corpora for several domains after preliminarily filtering and simplifying the structured data;
S2: train on the corpus of each domain to generate a first entity dictionary for the corresponding domain;
S3: verify the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary, and recognize entity words according to the second entity dictionary.
In step S3, recognizing an entity word includes recognizing its category, weight and domain.
In a specific implementation, structured data is collected and corpora for several domains are generated after the structured data is preliminarily filtered and simplified; the corpus of each domain is trained on to generate a first entity dictionary for the corresponding domain; the first entity dictionary of each domain is verified against a large number of articles to generate a second entity dictionary; and entity words are recognized according to the second entity dictionary. This effectively addresses the low efficiency and high cost of entity word recognition in the prior art: new words are mined without manual collection, which reduces labor cost, and entity words are recognized and the dictionary is updated automatically.
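Steps S1 to S3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record fields `domain` and `text`, the frequency threshold used as a stand-in for "training", and whitespace tokenization are all hypothetical choices (Chinese text would need a word segmenter instead).

```python
from collections import Counter, defaultdict

def build_corpora(records):
    """S1 (assumed form): filter incomplete records, normalise text, group by domain."""
    corpora = defaultdict(list)
    for rec in records:
        text = (rec.get("text") or "").strip()
        domain = rec.get("domain")
        if text and domain:                      # preliminary filtering
            corpora[domain].append(text.lower()) # simplification: normalise case
    return dict(corpora)

def train_first_dictionary(corpus, min_count=2):
    """S2 stand-in: keep tokens seen at least min_count times in the domain corpus."""
    counts = Counter(tok for line in corpus for tok in line.split())
    return {tok for tok, n in counts.items() if n >= min_count}

def verify_dictionary(first_dict, articles):
    """S3, part 1: keep only candidates that also occur in the articles."""
    seen = {tok for art in articles for tok in art.lower().split()}
    return first_dict & seen

def recognize(text, second_dicts):
    """S3, part 2: return (token, domain) for tokens found in a second dictionary."""
    return [(tok, dom)
            for tok in text.lower().split()
            for dom, words in sorted(second_dicts.items())
            if tok in words]
```

The two-stage design mirrors the text: the first dictionary is a broad candidate list per domain, and the second keeps only candidates confirmed by independent articles.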
It should be understood that the categories of the entity words include person names, place names, companies and brands.
Preferably, recognizing entity words according to the second entity dictionary in step S3 is specifically:
recognizing the entity words by a linear mapping technique according to the second entity dictionary.
Each attribute category of a term has a corresponding self-updating dictionary, so the attribute category of a term to be recognized can be determined by analyzing how it matches the words in that dictionary in terms of correlation (association rules and similarity discrimination). Recognition by the linear mapping technique therefore reduces dependence on the dictionary and gives better recognition of newly emerging words.
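The source does not define the linear mapping technique further, so the following only illustrates the similarity-matching idea described above: a term that is not in any dictionary is assigned the category whose dictionary contains the most similar word. Character-bigram Jaccard similarity, the threshold, and the function names are assumptions made for this sketch.

```python
def char_bigrams(word):
    """Set of character bigrams; falls back to the word itself if it is too short."""
    return {word[i:i + 2] for i in range(len(word) - 1)} or {word}

def similarity(a, b):
    """Jaccard similarity between the character-bigram sets of two words."""
    ga, gb = char_bigrams(a), char_bigrams(b)
    return len(ga & gb) / len(ga | gb)

def classify(term, category_dicts, threshold=0.3):
    """Return the best-matching category, or None if no score reaches the threshold."""
    best_cat, best_score = None, 0.0
    for cat, words in category_dicts.items():
        score = max((similarity(term, w) for w in words), default=0.0)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat if best_score >= threshold else None
```

Because the decision rests on similarity to known entries rather than exact lookup, an unseen company name that shares a suffix with known company names can still be categorized.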
Preferably, generating corpora for several domains after preliminarily filtering and simplifying the structured data in step S1 is specifically:
generating the corpora for several domains after preliminarily filtering and simplifying the structured data by big data ETL technology.
ETL is the abbreviation of Extract-Transform-Load. It describes the process of extracting data from a source, transforming it, and loading it into a destination. ETL is an important part of building a data warehouse: the user extracts the required data from the data source, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
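The extract, transform and load stages can be illustrated with a small sketch. The CSV source, the column names `domain` and `text`, and the SQLite table `corpus` are assumptions chosen to keep the example self-contained; a production pipeline would use whatever sources and warehouse model the deployment defines.

```python
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: drop rows with an empty field and collapse runs of whitespace."""
    cleaned = []
    for row in rows:
        domain = (row.get("domain") or "").strip()
        text = " ".join((row.get("text") or "").split())
        if domain and text:
            cleaned.append((domain, text))
    return cleaned

def load(pairs, conn):
    """Load: write cleaned rows into a predefined table (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS corpus (domain TEXT, text TEXT)")
    conn.executemany("INSERT INTO corpus VALUES (?, ?)", pairs)
    conn.commit()
```

Calling `load(transform(extract(source)), sqlite3.connect(":memory:"))` runs the three stages in order; grouping the loaded rows by `domain` then yields one corpus per domain.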
Preferably, verifying the first entity dictionary of each domain against a large number of articles to generate the second entity dictionary in step S3 is specifically:
according to the first entity dictionary of each domain, training the co-occurrence rate between entity words on a large number of articles by means of a conditional random field, thereby generating the second entity dictionary.
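To make the notion of a co-occurrence rate concrete, the sketch below counts how often pairs of candidate words appear in the same article and keeps the candidates whose rate reaches a threshold. It deliberately does not implement a conditional random field; plain document-level counting is swapped in as a simpler stand-in, and the threshold `min_rate` and substring matching are assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_rates(candidates, articles):
    """Fraction of articles in which each pair of candidates appears together."""
    pair_counts = Counter()
    for art in articles:
        present = sorted(w for w in candidates if w in art)  # substring match
        pair_counts.update(combinations(present, 2))
    n = len(articles)
    return {pair: c / n for pair, c in pair_counts.items()} if n else {}

def second_dictionary(candidates, articles, min_rate=0.5):
    """Keep candidates that co-occur with another candidate at a rate >= min_rate."""
    kept = set()
    for (a, b), rate in cooccurrence_rates(candidates, articles).items():
        if rate >= min_rate:
            kept.update((a, b))
    return kept
```

A candidate that never appears alongside other candidates of its domain is dropped, which is one plausible reading of how the articles "verify" the first dictionary.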
Referring to Fig. 2, which is a schematic flowchart of an entity word recognition method provided by Embodiment 2 of the present invention, the method further includes, on the basis of Embodiment 1, the step:
S4: perform a secondary verification of the recognized entity words by a part-of-speech and semantic engine.
The secondary verification in this step is performed by recognizing parts of speech and analyzing semantics.
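The part-of-speech half of this check might look like the sketch below. The source does not describe the engine, so the lookup table `pos_lookup`, the tag names in `NOUN_TAGS`, and the rule "an entity word must be tagged as some kind of noun" are all assumptions; the semantic analysis is not modelled here.

```python
# Assumed tag set: general noun, person name, place name, organisation, other proper noun.
NOUN_TAGS = {"n", "nr", "ns", "nt", "nz"}

def secondary_check(recognized, pos_lookup):
    """Split recognized words into those confirmed as nouns and those rejected."""
    confirmed, rejected = [], []
    for word in recognized:
        tag = pos_lookup.get(word)  # None when the tagger has no entry for the word
        (confirmed if tag in NOUN_TAGS else rejected).append(word)
    return confirmed, rejected
```

Words rejected here could be removed from the recognition result or queued for review, depending on how strict the deployment needs to be.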
Referring to Fig. 3, which is a schematic structural diagram of an entity word recognition device provided by Embodiment 3 of the present invention, the device includes:
a collection module 101, for collecting structured data and generating corpora for several domains after preliminarily filtering and simplifying the structured data;
a first entity dictionary generation module 102, for training on the corpus of each domain to generate a first entity dictionary for the corresponding domain;
a recognition module 103, for verifying the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary.
In a specific implementation, the collection module first collects structured data and generates corpora for several domains after preliminarily filtering and simplifying the structured data; the first entity dictionary generation module then trains on the corpus of each domain to generate a first entity dictionary for the corresponding domain; a second entity dictionary generation module then verifies the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary; and entity words are recognized according to the second entity dictionary. This effectively addresses the low efficiency and high cost of entity word recognition in the prior art, and allows entity words to be recognized and the dictionary to be updated automatically.
In a preferred embodiment, the categories of the entity words include person names, place names, companies and brands.
In a preferred embodiment, the recognition of entity words by the recognition module includes recognizing the category, weight and domain of the entity word.
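One way to represent a recognition result that carries category, weight and domain is a small record type keyed by the entity word. The field names, the meaning of `weight` as a score between 0 and 1, and the helper `lookup_entities` are hypothetical; the source only states that these three attributes are recognized.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityRecord:
    word: str
    category: str   # e.g. "person", "place", "company", "brand"
    weight: float   # assumed: relative importance or confidence, 0..1
    domain: str     # domain whose second entity dictionary contains the word

def lookup_entities(tokens, dictionary):
    """Return the record of every token found in the second entity dictionary."""
    return [dictionary[t] for t in tokens if t in dictionary]
```

Keeping all three attributes on one immutable record makes it straightforward for a downstream module to filter results by domain or rank them by weight.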
In summary, the embodiments of the present invention provide an entity word recognition method and device that collect structured data and generate corpora for several domains after preliminarily filtering and simplifying the structured data; train on the corpus of each domain to generate a first entity dictionary for the corresponding domain; verify the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary; and recognize entity words according to the second entity dictionary. This effectively addresses the low efficiency and high cost of entity word recognition in the prior art, and allows entity words to be recognized and the dictionary to be updated automatically.
The above is the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principles of the invention, and these improvements and refinements are also regarded as falling within the scope of protection of the invention.

Claims (10)

1. An entity word recognition method, characterized by comprising the steps of:
collecting structured data, and generating corpora for several domains after preliminarily filtering and simplifying the structured data;
training on the corpus of each domain to generate a first entity dictionary for the corresponding domain;
verifying the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary.
2. The entity word recognition method of claim 1, characterized in that the categories of the entity words include person names, place names, companies and brands.
3. The entity word recognition method of claim 1, characterized in that recognizing an entity word includes recognizing the category, weight and domain of the entity word.
4. The entity word recognition method of claim 1, characterized in that recognizing entity words according to the second entity dictionary is specifically:
recognizing the entity words by a linear mapping technique according to the second entity dictionary.
5. The entity word recognition method of claim 1, characterized in that generating corpora for several domains after preliminarily filtering and simplifying the structured data is specifically:
generating the corpora for several domains after preliminarily filtering and simplifying the structured data by big data ETL technology.
6. The entity word recognition method of claim 1, characterized in that verifying the first entity dictionary of each domain against a large number of articles to generate the second entity dictionary is specifically:
according to the first entity dictionary of each domain, training the co-occurrence rate between entity words on a large number of articles by means of a conditional random field, thereby generating the second entity dictionary.
7. The entity word recognition method of claim 1, characterized in that the method further includes, after recognizing entity words according to the second entity dictionary, the step of:
performing a secondary verification of the recognized entity words by a part-of-speech and semantic engine.
8. An entity word recognition device, characterized by comprising:
a collection module, for collecting structured data and generating corpora for several domains after preliminarily filtering and simplifying the structured data;
a first entity dictionary generation module, for training on the corpus of each domain to generate a first entity dictionary for the corresponding domain;
a recognition module, for verifying the first entity dictionary of each domain against a large number of articles to generate a second entity dictionary, and recognizing entity words according to the second entity dictionary.
9. The entity word recognition device of claim 8, characterized in that the categories of the entity words include person names, place names, companies and brands.
10. The entity word recognition device of claim 8, characterized in that recognizing an entity word includes recognizing the category, weight and domain of the entity word.
CN201710580718.6A 2017-07-17 2017-07-17 Entity word recognition method and device Pending CN107291700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710580718.6A CN107291700A (en) 2017-07-17 2017-07-17 Entity word recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710580718.6A CN107291700A (en) 2017-07-17 2017-07-17 Entity word recognition method and device

Publications (1)

Publication Number Publication Date
CN107291700A true CN107291700A (en) 2017-10-24

Family

ID=60101558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710580718.6A Pending CN107291700A (en) 2017-07-17 2017-07-17 Entity word recognition method and device

Country Status (1)

Country Link
CN (1) CN107291700A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108350A (en) * 2017-11-29 2018-06-01 北京小米移动软件有限公司 Name word recognition method and device
CN108595430A (en) * 2018-04-26 2018-09-28 携程旅游网络技术(上海)有限公司 Boat becomes information extracting method and system
CN109189900A (en) * 2018-08-03 2019-01-11 北京捷易迅信息技术有限公司 A kind of entity abstracting method for BOT system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477518A (en) * 2009-01-09 2009-07-08 昆明理工大学 Tour field named entity recognition method based on condition random field
US20090254334A1 (en) * 2002-01-29 2009-10-08 International Business Machines Corporation Translation method, translation output method and storage medium, program, and computer used therewith
CN103268339A (en) * 2013-05-17 2013-08-28 中国科学院计算技术研究所 Recognition method and system of named entities in microblog messages
CN106528863A (en) * 2016-11-29 2017-03-22 中国国防科技信息中心 Training and technology of CRF recognizer and method for extracting attribute name relation pairs of CRF recognizer
CN106649272A (en) * 2016-12-23 2017-05-10 东北大学 Named entity recognizing method based on mixed model
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090254334A1 (en) * 2002-01-29 2009-10-08 International Business Machines Corporation Translation method, translation output method and storage medium, program, and computer used therewith
CN101477518A (en) * 2009-01-09 2009-07-08 昆明理工大学 Tour field named entity recognition method based on condition random field
CN103268339A (en) * 2013-05-17 2013-08-28 中国科学院计算技术研究所 Recognition method and system of named entities in microblog messages
CN106528863A (en) * 2016-11-29 2017-03-22 中国国防科技信息中心 Training and technology of CRF recognizer and method for extracting attribute name relation pairs of CRF recognizer
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN106649272A (en) * 2016-12-23 2017-05-10 东北大学 Named entity recognizing method based on mixed model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈蕾: "基于语义与语境的专利信息查询扩展的研究", 《中国优秀硕士学位论文全文数据库_信息科技辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108350A (en) * 2017-11-29 2018-06-01 北京小米移动软件有限公司 Name word recognition method and device
CN108108350B (en) * 2017-11-29 2021-09-14 北京小米移动软件有限公司 Noun recognition method and device
CN108595430A (en) * 2018-04-26 2018-09-28 携程旅游网络技术(上海)有限公司 Boat becomes information extracting method and system
CN109189900A (en) * 2018-08-03 2019-01-11 北京捷易迅信息技术有限公司 A kind of entity abstracting method for BOT system

Similar Documents

Publication Publication Date Title
CN105468605B (en) Entity information map generation method and device
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN105653590B (en) A kind of method that Chinese literature author duplication of name disambiguates
Su et al. Automatic detection and interpretation of nominal metaphor based on the theory of meaning
CN109960786A (en) Chinese Measurement of word similarity based on convergence strategy
Yang Research and realization of internet public opinion analysis based on improved TF-IDF algorithm
CN109271626A (en) Text semantic analysis method
CN107609052A (en) A kind of generation method and device of the domain knowledge collection of illustrative plates based on semantic triangle
CN102314519A (en) Information searching method based on public security domain knowledge ontology model
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
CN107967290A (en) A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data
CN107291700A (en) Entity word recognition method and device
CN111597349B (en) Rail transit standard entity relation automatic completion method based on artificial intelligence
CN106777048A (en) Enterprise-quality credit data acquisition methods and system
CN107480197A (en) Entity word recognition method and device
CN108959366B (en) Open question-answering method
Kang et al. A short texts matching method using shallow features and deep features
Mohnot et al. Hybrid approach for Part of Speech Tagger for Hindi language
Al-Qawasmeh et al. Arabic named entity disambiguation using linked open data
Nguyen et al. A vietnamese question answering system
Yao et al. An automatic semantic extraction method for web data interchange
CN109685590A (en) A kind of system and method for intelligent medicine purchase
Zeng et al. Construction of scenic spot knowledge graph based on ontology
Chen Natural language processing in web data mining
Saleh et al. Semantic kernels for semantic parsing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171024