WO2019228466A1 - Method, apparatus, device, and storage medium for named entity recognition - Google Patents

Method, apparatus, device, and storage medium for named entity recognition

Info

Publication number
WO2019228466A1
Authority
WO
WIPO (PCT)
Prior art keywords
new
new field
word
field
entity
Prior art date
Application number
PCT/CN2019/089325
Other languages
English (en)
French (fr)
Inventor
温海娇
陈虹
牛国扬
董修岗
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2019228466A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • This disclosure relates to, but is not limited to, the fields of natural language processing, semantic analysis and understanding, and artificial intelligence.
  • NER: Named Entity Recognition
  • NLP: Natural Language Processing
  • A method for named entity recognition includes: performing entity recognition on new-domain text data to obtain new-domain seed entity words; labeling the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data; training a named entity recognition model with the labeled new-domain text data to obtain a named entity recognition model suited to the new domain; and using the model suited to the new domain to identify entity words in other text data of the new domain.
  • An apparatus for named entity recognition includes: an entity recognition module configured to perform entity recognition on new-domain text data to obtain new-domain seed entity words; a text marking module configured to label the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data; a model training module configured to train a named entity recognition model with the labeled new-domain text data to obtain a named entity recognition model suited to the new domain; and a model application module configured to identify entity words in other text data of the new domain using the model suited to the new domain.
  • A device for named entity recognition provided according to an embodiment of the present disclosure includes a processor and a memory coupled to the processor. The memory stores a computer program, and when the computer program is executed by the processor, the processor executes the method of named entity recognition according to the present disclosure.
  • A computer storage medium provided according to an embodiment of the present disclosure stores a computer program. When the computer program is executed by a processor, the processor executes the method of named entity recognition according to the present disclosure.
  • FIG. 1 is a flowchart of a method for named entity recognition according to an embodiment of the present disclosure.
  • FIG. 2 is a block diagram of an apparatus for named entity recognition according to an embodiment of the present disclosure.
  • FIG. 3 is a block diagram of a device for named entity recognition according to an embodiment of the present disclosure.
  • FIG. 4 is an architecture diagram of an entity recognition system according to an embodiment of the present disclosure.
  • FIG. 5 is an architecture diagram of an automated entity recognition system according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart of mining seed entity words by new word discovery according to an embodiment of the present disclosure.
  • FIG. 7 is a flowchart of a new word discovery algorithm according to an embodiment of the present disclosure.
  • FIG. 8 is a flowchart of mining seed entity words by sentence pattern mining according to an embodiment of the present disclosure.
  • FIG. 9 is a flowchart of a sentence pattern mining algorithm according to an embodiment of the present disclosure.
  • FIG. 10 is an exemplary structure diagram of a domain concept map.
  • FIG. 11 is a flowchart of automatic corpus marking according to an embodiment of the present disclosure.
  • FIG. 12 is a flowchart of semi-automated entity recognition with only a new domain according to an embodiment of the present disclosure.
  • FIG. 13 is a flowchart of semi-automated entity recognition combining an existing domain and a new domain according to an embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a method for named entity recognition according to an embodiment of the present disclosure.
  • As shown in FIG. 1, the method includes steps S101 to S104.
  • In step S101, entity recognition is performed on the new-domain text data to obtain new-domain seed entity words.
  • In the context of the present disclosure, a "new domain" is a domain in which entity words have not yet been mined or have been mined insufficiently. In such a domain, labeled corpora are absent or scarce.
  • As one way of implementing step S101, the new-domain text data can be split into new-domain single sentences. New words in each new-domain single sentence are then determined according to the allowed length of the new-domain seed entity words, and the new words are filtered according to their correlation with the new domain to obtain the new-domain seed entity words. That is, an algorithm for mining seed entity words (for example, the Nagao algorithm) determines the new-domain seed entity words from the discovered new words, so that entity recognition on the new-domain text data yields the new-domain seed entity words.
  • As an embodiment, determining the new words in each new-domain single sentence according to the allowed length of the new-domain seed entity words includes: for each new-domain single sentence, counting the phrases in it that satisfy the allowed length of the new-domain seed entity words, and then filtering the phrases according to the features of each phrase to obtain the new words.
  • The allowed length of a new-domain seed entity word can be any length less than or equal to a preset maximum length of a new-domain entity.
  • The features of each phrase may include word frequency, part of speech, and the like.
  • Filtering the phrases may include filtering by features and empirical thresholds. For example, a phrase whose word frequency is greater than a known empirical word frequency is taken as a new word. As another example, a new word is determined from a known empirical word frequency and an average feature value computed from the value and weight of each feature.
  • As an embodiment, filtering the new words according to their correlation with the new domain includes: using a domain concept map that characterizes the correlation between domains, together with the term frequency-inverse document frequency (TF-IDF) algorithm, to determine a correlation score that characterizes the correlation between each new word and the new domain; and filtering the new words according to the correlation score and an empirical threshold, so that new words whose correlation score is higher than the empirical threshold are taken as the new-domain seed entity words.
  • Determining the correlation score may include: obtaining the correlation weights between the new domain and other domains from the domain concept map; using the TF-IDF algorithm to determine a probability score that characterizes the importance of the new word to the new domain; and then using the correlation weights and the probability score to determine the correlation score between the new word and the new domain.
  • The domain concept map organizes and characterizes relationships between domains, such as superordinate-subordinate relationships. It is a graphical representation of the relationships between domains.
  • The TF-IDF algorithm is an existing algorithm. The embodiments of the present disclosure use it to filter out words that are common across domains while retaining words that are important in the new domain as new-domain seed entity words. That is, a new word found in a new-domain single sentence may also appear in the text data of other domains. Within a text data set composed of new-domain text data and other-domain text data, the algorithm can determine how important the new word is to the new-domain text data in that set, which enables new word filtering.
  • The other domains may be any domains different from the superordinate domain of the new domain. For example, if the superordinate domain is finance, the other domains may be operators, technology, and so on.
  • The above way of implementing step S101 is applicable to a scenario in which only new-domain text data is available.
  • As another way of implementing step S101, the new-domain text data and existing-domain text data can be split into new-domain single sentences and existing-domain single sentences, respectively. Sentence templates are then generated from the existing-domain single sentences, and the new-domain single sentences are matched against the sentence templates to determine the new-domain seed entity words in the new-domain single sentences. That is, an algorithm for mining seed entity words uses sentence templates to mine the seed entity words of the new domain, so that entity recognition on the new-domain text data yields the new-domain seed entity words.
  • As an embodiment, the sentence templates include a first sentence template and a second sentence template, and generating sentence templates from the existing-domain single sentences includes: replacing the existing-domain entity words in each existing-domain single sentence with a preset entity word mining symbol to obtain the first sentence template, and replacing words or phrases in the first sentence template with synonyms or synonymous phrases to obtain the second sentence template. The first sentence template and the second sentence template are both seed sentence templates.
  • As another embodiment, the sentence templates may further include a third sentence template, which is derived from the seed sentence templates, for example by a self-expansion technique (such as the Bootstrapping algorithm).
  • When the existing domain and the new domain are similar (for example, both belong to the same superordinate domain), sentence templates better suited to the new domain can be generated. Therefore, when preparing the existing-domain corpus, the domain concept map can be used to identify domains similar to the new domain, and a corpus from those similar domains can then be prepared. For example, when generating sentence templates for mining the seed entity words of "CCB" (the abbreviation of China Construction Bank), the text data of other banks may be selected.
  • The above way of implementing step S101 is applicable to a scenario that lacks sufficient new-domain text data or labeled text from other domains.
  • As yet another way of implementing step S101, the two implementations above may be combined, which may include: splitting the new-domain text data and the existing-domain text data into new-domain single sentences and existing-domain single sentences, respectively; determining the new words in each new-domain single sentence according to the allowed length of the new-domain seed entity words; filtering the new words according to their correlation with the new domain to obtain filtered new-domain seed entity words; generating sentence templates from the existing-domain single sentences; matching the new-domain single sentences against the sentence templates to obtain matched new-domain seed entity words; and merging the filtered new-domain seed entity words with the matched new-domain seed entity words to obtain the new-domain seed entity words. That is, an algorithm for mining seed entity words determines the new-domain seed entity words from the discovered new words and also uses sentence templates to mine seed entity words of the new domain, so that entity recognition on the new-domain text data yields the new-domain seed entity words.
  • The new-domain seed entity words of this embodiment are typical entity words of the new domain and serve as the initial condition for discovering other entity words in the new domain. In other words, the seed entity words can be used to expand the set of new-domain entity words.
  • In step S102, the new-domain text data is labeled according to the new-domain seed entity words to obtain labeled new-domain text data.
  • As one way of implementing step S102, each new-domain single sentence may be segmented character by character to obtain the characters that make up the sentence. Each character is then labeled according to its position within a new-domain seed entity word contained in the sentence, and once all new-domain single sentences have been labeled, the labeled new-domain text data is obtained.
  • When recognizing named entities in Chinese, the character may be a Chinese character. When recognizing named entities in another language, the character may be the smallest unit that forms a single sentence in that language, for example a word in English.
  • In step S103, the named entity recognition model is trained with the labeled new-domain text data to obtain a named entity recognition model suited to the new domain.
  • In step S104, entity words in other text data of the new domain are identified using the named entity recognition model suited to the new domain.
  • As an embodiment, the named entity recognition model may be a general-purpose NER model based on a deep learning framework that combines bidirectional Long Short-Term Memory (LSTM) with a Conditional Random Field (CRF).
  • As another implementation, after step S101 the seed entity words may also be sent to a user interface so that a user can verify them manually.
  • Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium. The present disclosure may also provide a computer storage medium on which a computer program is stored. When the computer program is executed by a processor, the processor may execute the method of named entity recognition according to the embodiments of the present disclosure.
  • The storage medium may include, but is not limited to, ROM/RAM, a magnetic disk, an optical disk, and a USB flash drive.
  • FIG. 2 is a block diagram of an apparatus for named entity recognition according to an embodiment of the present disclosure.
  • As shown in FIG. 2, the apparatus includes an entity recognition module 201, a text marking module 202, a model training module 203, and a model application module 204.
  • The entity recognition module 201 is configured to perform entity recognition on the new-domain text data to obtain new-domain seed entity words; that is, it can implement step S101 of FIG. 1.
  • The text marking module 202 is configured to label the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data; that is, it can implement step S102 of FIG. 1.
  • The model training module 203 is configured to train a named entity recognition model with the labeled new-domain text data to obtain a named entity recognition model suited to the new domain; that is, it can implement step S103 of FIG. 1.
  • The model application module 204 is configured to identify entity words in other text data of the new domain using the named entity recognition model suited to the new domain; that is, it can implement step S104 of FIG. 1.
  • Each of the above modules may be further configured to carry out the specific implementations of the steps described with reference to FIG. 1. For brevity, they are not repeated here.
  • FIG. 3 is a block diagram of a device for named entity recognition according to an embodiment of the present disclosure.
  • As shown in FIG. 3, the device includes a processor and a memory coupled to the processor. A computer program is stored in the memory, and when the computer program is executed by the processor, the processor may execute the method of named entity recognition according to the embodiments of the present disclosure.
  • FIG. 4 is an architecture diagram of an entity recognition system according to an embodiment of the present disclosure.
  • As shown in FIG. 4, the system implements a general Chinese entity recognition algorithm and addresses the problems of heavy data labeling workload and difficult domain migration.
  • The system mainly includes four modules: mining seed entity words (equivalent to the entity recognition module 201 in FIG. 2), automatic corpus marking (equivalent to the text marking module 202 in FIG. 2), offline training of the NER model (equivalent to the model training module 203 in FIG. 2), and online use of the NER model (equivalent to the model application module 204 in FIG. 2).
  • Each major module is described in detail below.
  • Mining seed entity words: this core module addresses domain migration and the lack of labeled corpora for entity recognition. It can include two sub-modules: mining seed entity words by new word discovery, and mining seed entity words by sentence pattern mining.
  • The new word discovery method suits entity recognition scenarios in which only a new-domain corpus is available. For example, entity recognition is required in the telecommunications domain, but there is no corpus in telecommunications or other domains. The sentence pattern mining method suits entity recognition scenarios that combine an existing domain with a new domain. For example, if a certain entity thesaurus exists for the "CCB" domain, sentence patterns can be used to quickly mine entity words in the "BOC" or "telecommunications" domain.
  • Automatic corpus marking: after the seed entity words are mined, the system automatically labels the new-domain corpus, which avoids tedious manual labeling and provides data for NER model training. This is another core module.
  • Offline training of the NER model: this module uses a deep learning framework to train the NER model with bidirectional LSTM + CRF. It is a necessary module for improving the generalization ability of the entity recognition system.
  • Online use of the NER model: this module is a necessary module of the system but not a core module, and it can follow the conventional process for using an NER model.
  • FIG. 5 is an architecture diagram of an automated entity recognition system according to an embodiment of the present disclosure.
  • The system can be applied to a variety of products, such as smart call centers, smart set-top boxes, and intelligent knowledge bases, to improve their accuracy and reduce manual workload.
  • FIG. 6 is a flowchart of mining seed entity words by new word discovery according to an embodiment of the present disclosure.
  • This process is applicable to an entity recognition scenario in which only a new-domain corpus is available. For example, entity recognition is required in the telecommunications domain, but there is no corpus in telecommunications or other domains.
  • Mining domain seed entity words with the new word discovery algorithm allows rapid application in products or in the subsequent entity recognition model training process.
  • The new domain only needs to provide the corresponding corpus; no re-labeling is required.
  • The text data in the domain can be Frequently Asked Questions (FAQ) question-answer pairs, or chapters and text corpora.
  • An input message containing the input parameters is generated and passed in through the interface.
  • The input message can be a JavaScript Object Notation (JSON) message, and the specific format is as follows:
  • In step 301, a new-domain corpus is obtained.
  • The new-domain corpus (that is, the new-domain text data) is extracted from the input message (for example, a JSON message).
  • In step 303, new words are mined.
  • FIG. 7 shows the flow of step 303, which includes: obtaining sentence-segmented text (that is, clauses or single sentences) (step 401); counting all phrases in each segmented text that satisfy the length limit (that is, the maximum length of a domain entity) (step 402); computing the features of each phrase (step 403); and filtering by thresholds on those features to obtain the final new words (step 404). That is, phrase combinations that satisfy the length limit are counted first. Features of each phrase, such as mutual information, left and right information entropy, word frequency, and part of speech, are then computed to obtain candidate new words. Finally, the candidates are filtered by empirical thresholds to obtain the final new words.
  • In step 304, the new words are filtered.
  • FIG. 10 is an exemplary structure diagram of a domain concept map.
  • TF-IDF: Term Frequency-Inverse Document Frequency
  • In step 305, the seed entity words are output.
  • Zxner_domain is the domain of the entity
  • zxner_result is the result of entity recognition. Its data format is an array, which includes entity words and scores corresponding to the words.
  • FIG. 8 is a flowchart of mining seed entity words by sentence pattern mining according to an embodiment of the present disclosure.
  • This process suits entity recognition scenarios that combine an existing domain with a new domain. For example, if a certain entity thesaurus exists for the "CCB" domain, sentence patterns can be used to quickly mine entity words in other domains such as "BOC" or "Telecom". Mining domain entity words with sentence patterns allows rapid application in products or in the subsequent entity recognition model training process. Extending to a new scenario only requires the corresponding corpus; no re-labeling is required.
  • The text data in the new domain can be FAQ question-answer pairs, or chapters and text corpora.
  • The entity thesaurus and text data of the existing domain need to be prepared for sentence pattern mining.
  • The input message may be a JSON message, and the specific format is as follows:
  • In step 501, an existing-domain corpus, including entity words, is obtained.
  • In step 503, sentence patterns are mined.
  • FIG. 9 shows the flow of step 503, which includes: obtaining sentence-segmented text (that is, clauses or single sentences) (step 601); replacing the entity words in the segmented text with [E] (that is, the preset entity word mining symbol) (step 602); replacing other words or phrases in the segmented text with synonyms or synonymous phrases to obtain seed sentence templates (step 603); and using the Bootstrapping algorithm to obtain sentence templates (step 604).
  • In step 504, the sentence templates are stored.
  • The structure of a stored sentence template is as follows:
  • In step 505, a new-domain corpus is obtained.
  • In step 507, the sentence patterns are matched.
  • The sentence templates are ranked by the correlation between the new domain and the existing domain and matched against the new-domain corpus, and possible entity words are extracted.
  • Domain correlation draws on two parts of the domain concept map (see FIG. 10). The first is the superordinate-subordinate relationship between industries: domains in such a relationship are the most closely related, as with "CCB" and "Bank" in FIG. 10. The second is the similarity of sentence structure between industries: in FIG. 10, for example, the "finance" domain and the "operator" domain have similar sentence structures (similarity 0.75).
  • In step 508, the seed entity words are output.
  • Zxner_domain is the domain of the entity
  • zxner_result is the result of entity recognition. Its data form is an array, which includes entity words and templates corresponding to the words.
  • FIG. 11 is a flowchart of automatic corpus marking according to an embodiment of the present disclosure.
  • The new-domain corpus can be marked automatically (that is, the new-domain text data is labeled) according to the seed entity words, which reduces manual labeling work.
  • The automatic corpus marking process can be applied to the embodiments of the present disclosure and also to other sequence labeling systems, such as a word segmentation system.
  • The domain of the corpus, the entity words, and the corresponding corpus are used as input parameters, and an input message containing them is generated and passed in through the interface.
  • The input message may be a JSON message, and the specific format is as follows:
  • In step 801, a new-domain corpus is obtained.
  • The new-domain corpus, including the domain text data and the seed entity words, is extracted from the input message.
  • In step 803, each single sentence is segmented character by character.
  • In step 804, the position of each character within a seed entity word is determined: the start position is marked B, a middle position is marked I, the end position is marked E, and a character outside any entity word is marked O.
  • Zxner_domain is the domain of the entity
  • zxner_result is the corpus tagging result. Its data format is an array, including a single sentence corpus and the tagging result of each word.
  • FIG. 12 is a flowchart of semi-automated entity recognition with only a new domain according to an embodiment of the present disclosure. This embodiment describes an application scenario in which entities are recognized in a new domain only.
  • In step 901, a new-domain corpus is obtained.
  • In step 902, seed entity words are mined by new word discovery.
  • In step 903, a manual check is performed, and the entity words are determined to be: Qian Yuan, Qian Yuan overflowing, and wealth management products.
  • In step 904, the corpus is marked automatically. The general BIEO scheme can be used to label the corpus.
  • The marking results are as follows:
  • The model is trained.
  • FIG. 13 is a flowchart of semi-automated entity recognition combining an existing domain and a new domain according to an embodiment of the present disclosure. This embodiment describes an application scenario of entity recognition that combines an existing domain with a new domain.
  • In step 1001, a new-domain corpus and an existing-domain corpus are obtained.
  • Seed entity words are mined through new word discovery and sentence pattern mining.
  • Sentence pattern 1: What is the handling method of [E]
  • The new domain belongs to the "finance" domain, which is highly correlated with the "telecommunications" domain, so the sentence templates can be matched, with the following result:
  • In step 1003, a manual check is performed, and the entity words are determined to be: credit card, debit card.
  • The corpus is marked automatically. The general BIEO scheme can be used to label the corpus.
  • The marking results are as follows:
  • word | mark | word | mark | word | mark
    letter | B | borrow | B | letter | B
    use | I | Remember | I | use | I
    card | E | card | E | card | E
    dense | O | of | O | How | O
    code | O | do | O | What | O
    repair | O | Reason | O | do | O
  • The model is trained.
  • In the embodiments of the present disclosure, a functional module for mining seed entity words (the entity recognition module 201 in FIG. 2) is added. Techniques such as new word discovery, keyword extraction, sentence pattern mining, and the domain concept map are used to automatically mine a set of seed entity words in the new domain; the corpus is then marked automatically, and a deep-learning bidirectional LSTM + CRF is used to train the NER model. This can reduce the data labeling workload, lower the requirements for model migration training, and improve generality across domains. The approach suits a variety of scenarios, including (but not limited to) voice assistants, intelligent customer service, intelligent knowledge bases, and other applications involving artificial intelligence.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

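The present disclosure provides a method, apparatus, device, and storage medium for named entity recognition. The method includes: performing entity recognition on new-domain text data to obtain new-domain seed entity words; labeling the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data; training a named entity recognition model with the labeled new-domain text data to obtain a named entity recognition model suited to the new domain; and using the named entity recognition model suited to the new domain to identify entity words in other text data of the new domain.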

Description

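Method, apparatus, device, and storage medium for named entity recognition
Technical Field
The present disclosure relates to, but is not limited to, the technical fields of natural language processing, semantic analysis and understanding, and artificial intelligence.
Background
Named Entity Recognition (NER) is a basic branch of Natural Language Processing (NLP) and one of the key technologies in information extraction. It is used to identify proper nouns within a given domain, for example "***" and "debit card" in the banking domain.
Summary
According to an embodiment of the present disclosure, a method for named entity recognition is provided, including: performing entity recognition on new-domain text data to obtain new-domain seed entity words; labeling the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data; training a named entity recognition model with the labeled new-domain text data to obtain a named entity recognition model suited to the new domain; and using the named entity recognition model suited to the new domain to identify entity words in other text data of the new domain.
An apparatus for named entity recognition provided according to an embodiment of the present disclosure includes: an entity recognition module configured to perform entity recognition on new-domain text data to obtain new-domain seed entity words; a text marking module configured to label the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data; a model training module configured to train a named entity recognition model with the labeled new-domain text data to obtain a named entity recognition model suited to the new domain; and a model application module configured to identify entity words in other text data of the new domain using the named entity recognition model suited to the new domain.
A device for named entity recognition provided according to an embodiment of the present disclosure includes a processor and a memory coupled to the processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor executes the method of named entity recognition according to the present disclosure.
A computer storage medium provided according to an embodiment of the present disclosure has a computer program stored on it, and when the computer program is executed by a processor, the processor executes the method of named entity recognition according to the present disclosure.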
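Brief Description of the Drawings
FIG. 1 is a flowchart of a method for named entity recognition according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of an apparatus for named entity recognition according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of a device for named entity recognition according to an embodiment of the present disclosure;
FIG. 4 is an architecture diagram of an entity recognition system according to an embodiment of the present disclosure;
FIG. 5 is an architecture diagram of an automated entity recognition system according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of mining seed entity words by new word discovery according to an embodiment of the present disclosure;
FIG. 7 is a flowchart of a new word discovery algorithm according to an embodiment of the present disclosure;
FIG. 8 is a flowchart of mining seed entity words by sentence pattern mining according to an embodiment of the present disclosure;
FIG. 9 is a flowchart of a sentence pattern mining algorithm according to an embodiment of the present disclosure;
FIG. 10 is an exemplary structure diagram of a domain concept map;
FIG. 11 is a flowchart of automatic corpus marking according to an embodiment of the present disclosure;
FIG. 12 is a flowchart of semi-automated entity recognition with only a new domain according to an embodiment of the present disclosure; and
FIG. 13 is a flowchart of semi-automated entity recognition combining an existing domain and a new domain according to an embodiment of the present disclosure.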
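Detailed Description
Preferred embodiments of the present disclosure are described in detail below with reference to the drawings. It should be understood that the preferred embodiments described below serve only to illustrate and explain the present disclosure and do not limit it.
Related techniques fall into three categories. The first is dictionary- and rule-based methods, which depend on the construction of dictionaries and rules and are severely limited when handling new words and new domains. The second is statistical methods, which depend on manual feature selection and require considerable labor and time. The third is deep learning methods, which reduce the workload of manual feature selection. Statistical and deep learning methods perform well on NER tasks, but they have two shortcomings in practice: they need large amounts of labeled data, which means heavy manual work; and the models transfer poorly across domains, so a large data set has to be re-labeled whenever the domain changes.
FIG. 1 is a flowchart of a method for named entity recognition according to an embodiment of the present disclosure.
As shown in FIG. 1, the method includes steps S101 to S104.
In step S101, entity recognition is performed on new-domain text data to obtain new-domain seed entity words.
In the context of the present disclosure, a "new domain" is a domain in which entity words have not yet been mined or have been mined insufficiently. In such a domain, labeled corpora are absent or scarce.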
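As one way of implementing step S101, the new-domain text data can be split into new-domain single sentences. New words in each new-domain single sentence are then determined according to the allowed length of the new-domain seed entity words, and the new words are filtered according to their correlation with the new domain to obtain the new-domain seed entity words. That is, an algorithm for mining seed entity words (for example, the Nagao algorithm) determines the new-domain seed entity words from the discovered new words, so that entity recognition on the new-domain text data yields the new-domain seed entity words.
As an embodiment, determining the new words in each new-domain single sentence according to the allowed length of the new-domain seed entity words includes: for each new-domain single sentence, counting the phrases in it that satisfy the allowed length of the new-domain seed entity words, and then filtering the phrases according to the features of each phrase to obtain the new words.
The allowed length of a new-domain seed entity word can be any length less than or equal to a preset maximum length of a new-domain entity. The features of each phrase may include word frequency, part of speech, and the like.
Filtering the phrases may include filtering by features and empirical thresholds. For example, a phrase whose word frequency is greater than a known empirical word frequency is taken as a new word. As another example, a new word is determined from a known empirical word frequency and an average feature value computed from the value and weight of each feature.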
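The counting and frequency-threshold steps above can be sketched as follows. This is a minimal illustration, not the Nagao algorithm named in the text: it enumerates character n-grams by brute force, and the function names, the minimum length of 2, and the sample sentences are assumptions made for the example.

```python
from collections import Counter

def candidate_phrases(sentences, max_len):
    """Count every character n-gram of length 2..max_len in each single sentence."""
    counts = Counter()
    for sentence in sentences:
        for n in range(2, max_len + 1):  # assumption: single characters are skipped
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts

def filter_by_frequency(counts, empirical_freq):
    """Keep phrases whose frequency is greater than the empirical threshold."""
    return {phrase: c for phrase, c in counts.items() if c > empirical_freq}

# Hypothetical new-domain single sentences.
sentences = ["借记卡怎么办理", "借记卡密码修改", "借记卡挂失"]
counts = candidate_phrases(sentences, max_len=3)
new_words = filter_by_frequency(counts, empirical_freq=2)
print(sorted(new_words))
```

With frequency alone, fragments of a longer word such as "借记" and "记卡" pass the threshold together with "借记卡", which suggests why further phrase features are weighed before a candidate is accepted as a new word.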
作为一种实施例,根据所述新词与所述新领域的相关性,对所 述新词进行过滤的步骤包括:利用表征领域间相关性的领域概念图谱和词频-逆向文档频率算法,确定表征所述新词与所述新领域的相关性的相关分数,并根据所述相关分数和经验阈值,对所述新词进行过滤,得到所述相关分数高于所述经验阈值的新词作为所述新领域种子实体词。
确定相关分数的步骤可以包括:按照所述领域概念图谱,获取所述新领域与其它领域的相关性权重,利用所述词频-逆向文档频率算法,确定表征所述新词对所述新领域的重要程度的概率分数,然后利用所述相关性权重和所述概率分数,确定所述新词与所述新领域的相关分数。
领域概念图谱可以组织和表征领域之间的关系,例如上下位关系,是领域间关系的图形化表征。
词频-逆向文档频率算法是现有算法,本公开实施例使用该算法过滤掉各领域常见的词语,而保留新领域内重要的词语作为新领域种子实体词。即,每个新领域单句中的新词也可能出现在其他领域文本数据中。在新领域文本数据和其他领域文本数据组成的文本数据集中,通过该算法能够确定所述新词对所述文本数据集中的新领域文本数据的重要程度,进而实现新词过滤。
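The other domains may be any domain different from the parent (superordinate) domain of the new domain; for example, if the parent domain is finance, the other domains may be telecom operators, technology, and so on.
The above way of implementing step S101 is suited to scenarios in which only new-domain text data is available.
As another way of implementing step S101, the new-domain text data and text data of an existing domain may be split into new-domain single sentences and existing-domain single sentences, respectively; sentence templates are then generated from the existing-domain single sentences, and the new-domain seed entity words in the new-domain single sentences are determined by matching the new-domain single sentences against the sentence templates. In other words, a seed-entity-word mining algorithm uses sentence templates to mine the seed entity words of the new domain, thereby performing entity recognition on the new-domain text data.
In one embodiment, the sentence templates include first sentence templates and second sentence templates, and the step of generating sentence templates from the existing-domain single sentences includes: replacing each existing-domain entity word present in an existing-domain single sentence with a preset entity-mining symbol to obtain a first sentence template, and replacing words or phrases in the first sentence template with synonyms or synonymous phrases to obtain a second sentence template. Both the first and the second sentence templates are seed sentence templates.
In another embodiment, the sentence templates may further include third sentence templates, which are derived from the seed sentence templates; the derivation may be implemented, for example, by a self-expansion technique such as the Bootstrapping algorithm.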
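The generation of first and second sentence templates can be sketched as below. This is an illustrative sketch only: the function name, the synonym table, and the sample clauses are assumptions, "[E]" is used as the entity-mining symbol as in the Fig. 9 description, and the Bootstrapping-based derivation of third templates is not covered.

```python
def seed_templates(clauses, entity_words, synonyms, slot="[E]"):
    """Build first templates (entity -> slot) and second templates (synonym swaps)."""
    templates = set()
    # Replace longer entity words first so a nested shorter entity cannot split them.
    ordered = sorted(entity_words, key=len, reverse=True)
    for clause in clauses:
        first = clause
        for word in ordered:
            first = first.replace(word, slot)
        if slot not in first:
            continue  # no known entity word in this clause, so no template
        templates.add(first)
        for phrase, alternatives in synonyms.items():
            if phrase in first:
                for alt in alternatives:
                    templates.add(first.replace(phrase, alt))
    return templates

existing_clauses = ["校园套餐的办理方式是什么", "天翼领航A8套餐怎么办理"]
entities = ["天翼领航A8套餐", "校园套餐"]
synonyms = {"办理": ["开通"]}  # hypothetical synonym table
templates = seed_templates(existing_clauses, entities, synonyms)
```

On this input the function yields four templates: the two first templates and their "开通" variants.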
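When the existing domain is close to the new domain (for example, both belong to the same parent domain), sentence templates better suited to the new domain can be generated. Therefore, when preparing the corpus of an existing domain, the domain concept graph may be used to identify domains close to the new domain, and the corpora of those close domains are then prepared. For example, when generating sentence templates for mining seed entity words of "建行" (short for China Construction Bank), text data of other banks may be chosen to generate the templates.
The above way of implementing step S101 is suited to scenarios lacking sufficient new-domain text data or labeled text from other domains.
As yet another way of implementing step S101, the two ways above may be combined, that is, the step may include: splitting the new-domain text data and text data of an existing domain into new-domain single sentences and existing-domain single sentences, respectively; determining new words in each new-domain single sentence according to the permitted length of a new-domain seed entity word; filtering the new words according to their relevance to the new domain to obtain filtered new-domain seed entity words; generating sentence templates from the existing-domain single sentences; matching the new-domain single sentences against the sentence templates to obtain matched new-domain seed entity words; and merging the filtered new-domain seed entity words with the matched new-domain seed entity words to obtain the new-domain seed entity words. In other words, a seed-entity-word mining algorithm both determines seed entity words from discovered new words and mines seed entity words with sentence templates, thereby performing entity recognition on the new-domain text data.
The new-domain seed entity words of this embodiment are typical entity words of the new domain and are the starting condition for discovering other entity words of the new domain; that is, the seed entity words can be used to expand the set of new-domain entity words.
In step S102, the new-domain text data is labeled according to the new-domain seed entity words to obtain labeled new-domain text data.
As one way of implementing step S102, each new-domain single sentence may be segmented character by character to obtain the characters that make up the sentence; each character of the sentence is then labeled according to its position within a new-domain seed entity word contained in the sentence, and after all new-domain single sentences have been labeled, the labeled new-domain text data is obtained.
When recognizing Chinese named entities, the character may be a Chinese character; when recognizing named entities in another language, the character may be the smallest unit that makes up a single sentence in that language, for example, a word in English.
In step S103, a named entity recognition model is trained using the labeled new-domain text data to obtain a named entity recognition model suited to the new domain.
In step S104, the named entity recognition model suited to the new domain is used to recognize entity words in other text data of the new domain.
In one embodiment, the named entity recognition model may be a general-purpose NER model based on the deep-learning architecture of bidirectional Long Short-Term Memory (LSTM) plus Conditional Random Field (CRF).
In another implementation, after step S101 the seed entity words may also be sent to a user interface so that a user can manually verify them.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium. The present disclosure may further provide a computer storage medium having a computer program stored thereon; when the computer program is executed by a processor, the processor may perform the method for named entity recognition according to the embodiments of the present disclosure. The storage medium may include, without limitation, ROM/RAM, a magnetic disk, an optical disc, or a USB flash drive.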
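Fig. 2 is a block diagram of an apparatus for named entity recognition according to an embodiment of the present disclosure.
As shown in Fig. 2, the apparatus includes an entity recognition module 201, a text labeling module 202, a model training module 203, and a model application module 204.
The entity recognition module 201 is configured to perform entity recognition on new-domain text data to obtain new-domain seed entity words; that is, it can implement step S101 of Fig. 1.
The text labeling module 202 is configured to label the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data; that is, it can implement step S102 of Fig. 1.
The model training module 203 is configured to train a named entity recognition model using the labeled new-domain text data to obtain a named entity recognition model suited to the new domain; that is, it can implement step S103 of Fig. 1.
The model application module 204 is configured to use the named entity recognition model suited to the new domain to recognize entity words in other text data of the new domain; that is, it can implement step S104 of Fig. 1.
In addition, each of the above modules may be further configured to carry out the specific implementations of the steps described with reference to Fig. 1. For brevity, they are not repeated here.
Fig. 3 is a block diagram of a device for named entity recognition according to an embodiment of the present disclosure.
As shown in Fig. 3, the device includes a processor and a memory coupled to the processor. A computer program is stored in the memory. When the computer program is executed by the processor, the processor may perform the method for named entity recognition according to the embodiments of the present disclosure.
Fig. 4 is an architecture diagram of an entity recognition system according to an embodiment of the present disclosure. The system implements a general-purpose Chinese entity recognition algorithm and addresses the problems of heavy data-labeling workload and difficult domain transfer.
As shown in Fig. 4, the system mainly includes four modules: seed entity word mining (corresponding to the entity recognition module 201 of Fig. 2), automatic corpus labeling (corresponding to the text labeling module 202 of Fig. 2), offline NER model training (corresponding to the model training module 203 of Fig. 2), and online NER model use (corresponding to the model application module 204 of Fig. 2). Each main module is described in detail below.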
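1. Seed entity word mining
This module effectively addresses domain transfer and the lack of standard corpora for entity recognition, and is a core module. It may include two sub-modules: seed entity word mining by new-word discovery, and seed entity word mining by sentence-pattern mining.
The new-word discovery method is suited to entity recognition scenarios in which only a new-domain corpus is available, for example, when entity recognition is needed for the telecom domain but labeled corpora for telecom or other domains are lacking. The sentence-pattern mining method is suited to entity recognition scenarios that combine an existing domain with a new domain, for example, when the "建行" (China Construction Bank) domain already has an entity lexicon of some size, so that sentence patterns can be used to quickly mine entity words for domains such as "中行" (Bank of China) or "电信" (telecom).
2. Automatic corpus labeling
After the seed entity words have been mined, the system automatically labels the new-domain corpus, avoiding tedious manual labeling and providing data for NER model training. This is the other core module.
3. Offline NER model training
This module trains the NER model with the deep-learning architecture of bidirectional LSTM plus CRF to improve the generalization ability of the entity recognition system, and is a required module.
4. Online NER model use
This module is a required module of the system but not a core module; it may follow the conventional procedure for using an NER model.
To aid understanding by those skilled in the art, the present disclosure is further described below with reference to Figs. 5 to 13. The following description is not intended to limit the scope of protection of the present disclosure.
Fig. 5 is an architecture diagram of an automated entity recognition system according to an embodiment of the present disclosure.
To further improve the accuracy of the system, manual verification may be introduced before automatic corpus labeling, as shown in Fig. 5. The system can be applied to a variety of devices, such as intelligent call centers, smart set-top boxes, and intelligent knowledge bases, to improve device accuracy and reduce manual work.
Fig. 6 is a flowchart of mining seed entity words by new-word discovery according to an embodiment of the present disclosure.
As shown in Fig. 6, this flow is suited to entity recognition scenarios in which only a new-domain corpus is available, for example, when entity recognition is needed for the telecom domain but labeled corpora for telecom or other domains are lacking. Domain seed entity words mined with the new-word discovery algorithm can be quickly put to use in products or in the subsequent training of the entity recognition model. A new domain only needs to supply the corresponding corpus; no re-labeling is required.
The text data of a domain may be Frequently Asked Question (FAQ) question-answer pairs, or passages and text corpora.
The prepared text data, the domain information, and the maximum entity length for the domain are used as input parameters; an input message containing these parameters is generated and supplied through an interface. For example, the input message may be a JavaScript Object Notation (JSON) message in the following format: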
Figure PCTCN2019089325-appb-000001
Figure PCTCN2019089325-appb-000002
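The steps of Fig. 6 are described in detail below.
In step 301, the new-domain corpus is obtained.
The new-domain corpus (that is, the text data of the new domain) is extracted from the input message (for example, a JSON message) and stored in a cache.
In step 302, clauses are split.
The text data is split into single sentences at punctuation marks and at stop words and stop phrases.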
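The clause-splitting step can be sketched as a single regular-expression split. This is a minimal sketch under assumptions: the punctuation set and the stop-phrase list are illustrative choices, and stop phrases are treated simply as additional delimiters.

```python
import re

# Assumed delimiter set: common full-width and half-width punctuation plus whitespace.
PUNCTUATION = r"[,。?!;:、,.?!;:\s]+"

def split_clauses(text, stop_phrases=()):
    """Split text into single sentences at punctuation and at stop phrases."""
    # Longer stop phrases go first so they win over their own prefixes.
    parts = [PUNCTUATION] + [re.escape(p) for p in sorted(stop_phrases, key=len, reverse=True)]
    pattern = "|".join(parts)
    return [clause for clause in re.split(pattern, text) if clause]

clauses = split_clauses("乾元满溢是理财产品吗?乾元满溢收益怎么样?")
debit = split_clauses("我想知道借记卡的办理方式是什么?", stop_phrases=["我想知道"])
```

Treating "我想知道" (I would like to know) as a stop phrase leaves only the part of the sentence that can carry an entity word.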
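In step 303, new words are mined.
Fig. 7 shows the flow of step 303, which includes: obtaining the split text (that is, the clauses or single sentences) (step 401); collecting, in each piece of split text, all phrases that satisfy the length limit (that is, the maximum entity length for the domain) (step 402); computing the features of each phrase (step 403); and filtering by thresholds on those features to obtain the final new words (step 404). In other words, the phrase combinations that satisfy the length limit are collected first. The features of each phrase, such as mutual information, left and right information entropy, term frequency, and part of speech, are then computed to obtain candidate new words. Finally, the candidates are filtered against empirical thresholds to obtain the final new words.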
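Of the phrase features listed for step 403, left and right information entropy is perhaps the least self-explanatory, so a small sketch follows. It computes the Shannon entropy of the characters adjacent to a phrase; the function name and the choice to count a clause boundary as a single neighbour symbol are assumptions made for illustration.

```python
import math
from collections import Counter

def branch_entropy(phrase, clauses, side):
    """Entropy (in nats) of the characters found on the given side of the phrase."""
    neighbours = Counter()
    for clause in clauses:
        start = clause.find(phrase)
        while start != -1:
            idx = start - 1 if side == "left" else start + len(phrase)
            # A clause boundary is counted as one special neighbour symbol.
            neighbours[clause[idx] if 0 <= idx < len(clause) else "<edge>"] += 1
            start = clause.find(phrase, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())

clauses = ["乾元满溢是理财产品吗", "乾元满溢收益怎么样", "乾元满溢是稳健的理财产品吗"]
right_full = branch_entropy("乾元满溢", clauses, "right")
right_part = branch_entropy("乾元", clauses, "right")
```

A phrase that is always followed by the same character, like "乾元" here, has zero right entropy, which suggests it is a fragment of a longer unit; a higher entropy suggests a free-standing word.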
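In step 304, the new words are filtered.
Whether each new word is related to other industries is determined, and the new words are filtered accordingly. The relevance between a new word and other industries may be obtained with the help of the domain concept graph, a keyword-extraction algorithm, and the like. The industry to which the current domain belongs may be determined from the hypernym-hyponym relationships in the concept graph. Fig. 10 is a schematic diagram of an exemplary structure of a domain concept graph. The Term Frequency-Inverse Document Frequency (TF-IDF) keyword-extraction algorithm determines how important each new word is in the current domain, and the TF-IDF value is used as the probability score of the entity word. The entity-word result is finally obtained by applying an empirical threshold.
In step 305, the seed entity words are output.
A JSON message of the domain entity words may be generated to make the information easy to collect. The message format is as follows: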
Figure PCTCN2019089325-appb-000003
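"zxner_domain" is the domain to which the entities belong, and "zxner_result" is the entity recognition result, an array containing the entity words and the score corresponding to each word.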
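The message format itself appears only as a figure, so the following is a guess at its shape rather than a transcription. Only the keys "zxner_domain" and "zxner_result" come from the text; the inner keys "word" and "score" and the function name are hypothetical.

```python
import json

def build_seed_message(domain, scored_words):
    """Serialize seed entity words and their scores into a JSON message."""
    message = {
        "zxner_domain": domain,
        # Inner key names are assumptions; the source only says each item
        # carries an entity word and its score.
        "zxner_result": [{"word": w, "score": s} for w, s in scored_words],
    }
    # ensure_ascii=False keeps Chinese characters readable in the output.
    return json.dumps(message, ensure_ascii=False)

msg = build_seed_message("建行", [("乾元满溢", 2.12), ("理财产品", 2.08)])
```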
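Fig. 8 is a flowchart of mining seed entity words by sentence-pattern mining according to an embodiment of the present disclosure.
As shown in Fig. 8, this flow is suited to entity recognition scenarios that combine an existing domain with a new domain, for example, when the "建行" (China Construction Bank) domain already has an entity lexicon of some size, so that sentence patterns can be used to quickly mine entity words for domains such as "中行" (Bank of China) or "电信" (telecom). Domain entity words mined with sentence structures can be quickly put to use in products or in the subsequent training of the entity recognition model. Extending to a new scenario only requires supplying the corresponding corpus; no re-labeling is required.
The text data of the new domain may be FAQ question-answer pairs, or passages and text corpora. In addition, the entity lexicon and text data of an existing domain must be prepared for sentence-pattern mining.
The prepared new-domain text data, new-domain information, maximum new-domain entity length, existing-domain text data, existing-domain information, and existing-domain entity words are used as input parameters; an input message containing these parameters is generated and supplied through an interface. For example, the input message may be a JSON message in the following format: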
Figure PCTCN2019089325-appb-000004
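The steps of Fig. 8 are described in detail below.
In step 501, the existing-domain corpus, including entity words, is obtained.
The existing-domain text data and entity words are extracted from the input message.
In step 502, clauses are split.
The existing-domain text data is split into single sentences at punctuation marks and at stop words and stop phrases.
In step 503, sentence patterns are mined.
Fig. 9 shows the flow of step 503, which includes: obtaining the split text (that is, the clauses or single sentences) (step 601); replacing the entity words in the split text with [E] (that is, the preset entity-mining symbol) (step 602); replacing the words or phrases other than the entity words in the split text with synonyms or synonymous phrases to obtain seed sentence templates (step 603); and using the Bootstrapping algorithm to obtain sentence templates (step 604).
That is, the entity words in the single sentences obtained in step 502 are first replaced with [E]; synonyms or synonymous phrases are then substituted to form seed sentence templates; and finally the Bootstrapping algorithm is used to mine further sentence templates.
In step 504, the sentence templates are stored.
For example, the stored sentence templates have the following structure:
Template (pattern)                      Domain (domain)
[E]怎么办理 (how to apply for [E])      建行 (China Construction Bank)
In step 505, the new-domain corpus is obtained.
The new-domain text data is extracted from the input message.
In step 506, clauses are split.
The new-domain text data is split into single sentences at punctuation marks and at stop words and stop phrases.
In step 507, sentence patterns are matched.
The existing domains are ranked by their relevance to the new domain, the sentence templates are matched, and possible entity words are extracted.
Domain relevance depends on two kinds of content in the domain concept graph (see Fig. 10). The first is the hypernym-hyponym relationship between industries: domains in such a relationship have the highest relevance; for example, as shown in Fig. 10, "建行" (China Construction Bank) and "银行" (bank) are in a hypernym-hyponym relationship. The second is the similarity of sentence structure between industries; for example, as shown in Fig. 10, the "金融" (finance) domain and the "运营商" (telecom operator) domain have similar sentence structures (similarity 0.75).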
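Step 507 can be sketched as turning each template into a regular expression and trying the templates of the most relevant domain first. This is a sketch under assumptions: the function names, the single-slot restriction, the `max_len` bound, and the relevance values are illustrative and not taken from the disclosure.

```python
import re

def template_to_regex(template, slot="[E]", max_len=8):
    """Compile a one-slot template into a regex that captures the slot content."""
    if slot not in template:
        raise ValueError("template has no entity slot")
    head, _, tail = template.partition(slot)
    return re.compile(re.escape(head) + "(.{1,%d}?)" % max_len + re.escape(tail))

def match_templates(clauses, templates_by_domain, domain_relevance, max_len=8):
    """Return (entity, template) pairs, trying more relevant domains first."""
    ordered = sorted(templates_by_domain,
                     key=lambda d: domain_relevance.get(d, 0.0), reverse=True)
    patterns = [(t, template_to_regex(t, max_len=max_len))
                for d in ordered for t in templates_by_domain[d]]
    found = []
    for clause in clauses:
        for template, pattern in patterns:
            m = pattern.fullmatch(clause)
            if m:
                found.append((m.group(1), template))
                break  # the first (most relevant) matching template wins
    return found

templates = {"电信": ["[E]的办理方式是什么", "[E]怎么办理"]}
relevance = {"电信": 0.75}  # hypothetical value read from the concept graph
hits = match_templates(["借记卡的办理方式是什么", "借记卡密码修改"], templates, relevance)
```

Requiring a full match of the clause keeps the extraction conservative; a clause that fits no template contributes no candidate.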
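In step 508, the seed entity words are output.
A JSON message of the domain entity words is generated to make the information easy to collect. The message format is as follows: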
Figure PCTCN2019089325-appb-000005
Figure PCTCN2019089325-appb-000006
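"zxner_domain" is the domain to which the entities belong, and "zxner_result" is the entity recognition result, an array containing the entity words and the template corresponding to each word.
Fig. 11 is a flowchart of automatic corpus labeling according to an embodiment of the present disclosure. With this flow, the new-domain corpus can be labeled automatically from the seed entity words (that is, the new-domain text data is labeled), which reduces manual labeling work.
Besides the embodiments of the present disclosure, the automatic corpus labeling flow can also be applied to other sequence-labeling systems, such as word segmentation systems. When it is applied to another system, the domain of the corpus, the entity words, and the corresponding corpus are used as input parameters; an input message containing these parameters is generated and supplied through an interface. For example, the input message may be a JSON message in the following format: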
Figure PCTCN2019089325-appb-000007
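The steps of Fig. 11 are described in detail below.
In step 801, the new-domain corpus is obtained.
The new-domain corpus, including the domain text data and the seed entity words, is extracted from the input message.
In step 802, clauses are split.
The new-domain text data is split into single sentences at punctuation marks and at stop words and stop phrases.
In step 803, the text is segmented character by character.
All clauses are segmented into individual characters, which reduces the effect of word-segmentation errors on the results of the system.
In step 804, the position of each character within a seed entity word is determined: the start position is labeled B, a middle position I, and the end position E, and a character outside any entity word is labeled O.
A JSON message of the labeled corpus is generated to make the information easy to collect. The message format is as follows: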
Figure PCTCN2019089325-appb-000008
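"zxner_domain" is the domain to which the entities belong, and "zxner_result" is the corpus labeling result, an array containing each single-sentence corpus item and the label of each of its characters.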
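The character-level labeling rule of step 804 can be sketched as follows. This is an illustrative sketch: the function name is hypothetical, longer entity words are given priority over nested shorter ones by assumption, and a single-character entity is labeled B because the description defines no separate label for that case.

```python
def tag_clause(clause, seed_entities):
    """Label each character with B, I, E, or O based on seed entity positions."""
    tags = ["O"] * len(clause)
    # Longer entities first, so that "乾元" does not overwrite part of "乾元满溢".
    for entity in sorted(set(seed_entities), key=len, reverse=True):
        if not entity:
            continue
        start = clause.find(entity)
        while start != -1:
            end = start + len(entity)
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end - 1):
                    tags[i] = "I"
                if len(entity) > 1:
                    tags[end - 1] = "E"
            start = clause.find(entity, start + 1)
    return list(zip(clause, tags))

labeled = tag_clause("乾元满溢是理财产品吗", ["乾元", "乾元满溢", "理财产品"])
tag_string = "".join(tag for _, tag in labeled)
```

For this clause the resulting label sequence is B I I E O B I I E O, one label per character.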
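Fig. 12 is a flowchart of semi-automated entity recognition with only a new domain according to an embodiment of the present disclosure. This embodiment illustrates the application scenario of entity recognition with only a new domain.
In step 901, the new-domain corpus is obtained.
An input message is received, and the new-domain corpus is extracted from it, including the new-domain text data "乾元满溢是理财产品吗?乾元满溢收益怎么样?乾元满溢是稳健的理财产品吗?" (Is 乾元满溢 a wealth-management product? What is the return on 乾元满溢? Is 乾元满溢 a stable wealth-management product?) and the new-domain information "建设银行" (China Construction Bank).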
Figure PCTCN2019089325-appb-000009
Figure PCTCN2019089325-appb-000010
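In step 902, seed entity words are mined by new-word discovery.
1) Clause-splitting result
Clause 1: 乾元满溢是理财产品吗
Clause 2: 乾元满溢收益怎么样
Clause 3: 乾元满溢是稳健的理财产品吗
2) Mining new words
New words: 乾元, 乾元满溢, 理财产品 (wealth-management product), 收益 (return)
3) Filtering new words
The domain concept graph (see Fig. 10) is used to determine that the current data belongs to the "金融" (finance) domain, and TF-IDF is used to compute the relevance of the new words with respect to other domains such as "运营商" (telecom operator) and "科技" (technology). The higher the score, the more relevant the word is to the "建行" (China Construction Bank) domain and the less relevant it is to other domains. The computed results are:
乾元: 2.34
乾元满溢: 2.12
理财产品: 2.08
收益: 1.83
4) Mining seed entity words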
Figure PCTCN2019089325-appb-000011
Figure PCTCN2019089325-appb-000012
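In step 903, manual verification determines the entity words to be: 乾元, 乾元满溢, 理财产品.
In step 904, the corpus is labeled automatically. The common BIEO scheme may be used. The labeling result is as follows:
Char  Tag    Char  Tag    Char  Tag
乾    B      乾    B      乾    B
元    I      元    I      元    I
满    I      满    I      满    I
溢    E      溢    E      溢    E
是    O      收    O      是    O
理    B      益    O      稳    O
财    I      怎    O      健    O
产    I      么    O      的    O
品    E      样    O      理    B
吗    O                   财    I
                          产    I
                          品    E
                          吗    O
In step 905, the model is trained.
A commonly used training approach may be adopted and is not described in detail here.
Fig. 13 is a flowchart of semi-automated entity recognition combining an existing domain and a new domain according to an embodiment of the present disclosure. This embodiment illustrates the application scenario of entity recognition that combines an existing domain with a new domain.
In step 1001, the new-domain corpus and the existing-domain corpus are obtained.
An input message is received, and the following are extracted from it: the new-domain text data "***密码修改?我想知道借记卡的办理方式是什么?***怎么办理?" (*** password change? I would like to know how to apply for a debit card. How do I apply for ***?); the new-domain information "建行" (China Construction Bank); the existing-domain information "电信" (telecom); the existing-domain entity words "天翼领航A8套餐,***套餐,校园套餐" (天翼领航A8 plan, *** plan, campus plan); and the existing-domain text data "我想知道***套餐的办理方式是什么?天翼领航A8套餐怎么办理?" (I would like to know how to apply for the *** plan. How do I apply for the 天翼领航A8 plan?).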
Figure PCTCN2019089325-appb-000013
Figure PCTCN2019089325-appb-000014
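In step 1002, seed entity words are mined by new-word discovery and sentence-pattern mining.
1) Clause-splitting result for the new-domain corpus
Clause 1: ***密码修改
Clause 2: 借记卡的办理方式是什么
Clause 3: ***怎么办理
2) Clause-splitting result for the existing-domain corpus
Clause 1: ***套餐的办理方式是什么
Clause 2: 天翼领航A8套餐怎么办理
3) Mining new words
New word: ***
4) Mining sentence patterns
Pattern 1: [E]的办理方式是什么 (what is the way to apply for [E])
Pattern 2: [E]怎么办理 (how to apply for [E])
Pattern 3: [E]的开通方式是什么 (what is the way to activate [E])
Pattern 4: [E]怎么开通 (how to activate [E])
5) Filtering new words
The domain concept graph (see Fig. 10) is used to determine that the current data belongs to the "金融" (finance) domain, and TF-IDF is used to compute the relevance of the new words with respect to other domains such as "运营商" (telecom operator) and "科技" (technology). The higher the score, the more relevant the word is to the "建行" (China Construction Bank) domain and the less relevant it is to other domains. The computed result is:
***: 2.39
6) Matching sentence patterns
According to the domain concept graph (see Fig. 10), the new domain belongs to the "金融" (finance) domain, which is fairly highly relevant to the "电信" (telecom) domain, so the sentence templates can be matched, giving the result:
[E]的办理方式是什么    借记卡 (debit card)
[E]怎么办理            ***
7) Mining seed entity words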
Figure PCTCN2019089325-appb-000015
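In step 1003, manual verification determines the entity words to be: ***, 借记卡 (debit card).
In step 1004, the corpus is labeled automatically. The common BIEO scheme may be used. The labeling result is as follows:
Char  Tag    Char  Tag    Char  Tag
*     B      借    B      *     B
*     I      记    I      *     I
*     E      卡    E      *     E
密    O      的    O      怎    O
码    O      办    O      么    O
修    O      理    O      办    O
改    O      方    O      理    O
             式    O
             是    O
             什    O
             么    O
In step 1005, the model is trained.
A commonly used training approach may be adopted and is not described in detail here.
According to the embodiments of the present disclosure, a functional module for mining seed entity words is added (the entity recognition module 201 of Fig. 2). Using techniques such as new-word discovery, keyword extraction, sentence-pattern mining, and the domain concept graph, a set of seed entity words is mined automatically in the new domain and the corpus is labeled automatically; an NER model is then trained by deep learning with bidirectional LSTM plus CRF. This reduces the data-labeling workload, lowers the requirements for transfer training of the model, and improves the cross-domain generality of the algorithm, making it suitable for many scenarios, including without limitation voice assistants, intelligent customer service, intelligent knowledge bases, and other applications involving Artificial Intelligence (AI).
Although the present disclosure has been described in detail above, it is not limited thereto, and those skilled in the art may make various modifications in accordance with the principles of the present disclosure. Accordingly, all modifications made in accordance with the principles of the present disclosure should be understood as falling within its scope of protection.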

Claims (16)

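  1. A method for named entity recognition, comprising:
    performing entity recognition on new-domain text data to obtain new-domain seed entity words;
    labeling the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data;
    training a named entity recognition model using the labeled new-domain text data to obtain a named entity recognition model suited to the new domain; and
    using the named entity recognition model suited to the new domain to recognize entity words in other text data of the new domain.
  2. The method according to claim 1, wherein the step of performing entity recognition on new-domain text data to obtain new-domain seed entity words comprises:
    splitting the new-domain text data into new-domain single sentences;
    determining new words in each new-domain single sentence according to a permitted length of the new-domain seed entity words; and
    filtering the new words according to relevance of the new words to the new domain to obtain the new-domain seed entity words.
  3. The method according to claim 1, wherein the step of performing entity recognition on new-domain text data to obtain new-domain seed entity words comprises:
    splitting the new-domain text data and text data of an existing domain into new-domain single sentences and existing-domain single sentences, respectively;
    generating sentence templates using the existing-domain single sentences; and
    determining the new-domain seed entity words in the new-domain single sentences by matching the new-domain single sentences against the sentence templates.
  4. The method according to claim 1, wherein the step of performing entity recognition on new-domain text data to obtain new-domain seed entity words comprises:
    splitting the new-domain text data and text data of an existing domain into new-domain single sentences and existing-domain single sentences, respectively;
    determining new words in each new-domain single sentence according to a permitted length of the new-domain seed entity words;
    filtering the new words according to relevance of the new words to the new domain to obtain filtered new-domain seed entity words;
    generating sentence templates using the existing-domain single sentences;
    obtaining matched new-domain seed entity words by matching the new-domain single sentences against the sentence templates; and
    merging the filtered new-domain seed entity words and the matched new-domain seed entity words to obtain the new-domain seed entity words.
  5. The method according to claim 2 or 4, wherein the step of filtering the new words according to relevance of the new words to the new domain comprises:
    determining a relevance score representing the relevance of the new words to the new domain, using a domain concept graph representing inter-domain relevance and a term frequency-inverse document frequency algorithm; and
    filtering the new words according to the relevance score and an empirical threshold, and taking the new words whose relevance score is higher than the empirical threshold as the new-domain seed entity words.
  6. The method according to claim 5, wherein the step of determining a relevance score representing the relevance of the new words to the new domain, using a domain concept graph representing inter-domain relevance and a term frequency-inverse document frequency algorithm, comprises:
    obtaining relevance weights between the new domain and other domains according to the domain concept graph;
    determining, using the term frequency-inverse document frequency algorithm, a probability score representing the importance of the new words to the new domain; and
    determining the relevance score of the new words with respect to the new domain using the relevance weights and the probability score.
  7. The method according to any one of claims 2 to 4, wherein the step of labeling the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data comprises:
    segmenting each new-domain single sentence character by character to obtain the characters that make up the new-domain single sentence;
    labeling each character of the new-domain single sentence according to the position of the character within a new-domain seed entity word contained in the new-domain single sentence; and
    obtaining the labeled new-domain text data after all new-domain single sentences have been labeled.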
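  8. An apparatus for named entity recognition, comprising:
    an entity recognition module configured to perform entity recognition on new-domain text data to obtain new-domain seed entity words;
    a text labeling module configured to label the new-domain text data according to the new-domain seed entity words to obtain labeled new-domain text data;
    a model training module configured to train a named entity recognition model using the labeled new-domain text data to obtain a named entity recognition model suited to the new domain; and
    a model application module configured to use the named entity recognition model suited to the new domain to recognize entity words in other text data of the new domain.
  9. The apparatus according to claim 8, wherein the entity recognition module is configured to:
    split the new-domain text data into new-domain single sentences;
    determine new words in each new-domain single sentence according to a permitted length of the new-domain seed entity words; and
    filter the new words according to relevance of the new words to the new domain to obtain the new-domain seed entity words.
  10. The apparatus according to claim 8, wherein the entity recognition module is configured to:
    split the new-domain text data and text data of an existing domain into new-domain single sentences and existing-domain single sentences, respectively;
    generate sentence templates using the existing-domain single sentences; and
    determine the new-domain seed entity words in the new-domain single sentences by matching the new-domain single sentences against the sentence templates.
  11. The apparatus according to claim 8, wherein the entity recognition module is configured to:
    split the new-domain text data and text data of an existing domain into new-domain single sentences and existing-domain single sentences, respectively;
    determine new words in each new-domain single sentence according to a permitted length of the new-domain seed entity words;
    filter the new words according to relevance of the new words to the new domain to obtain filtered new-domain seed entity words;
    generate sentence templates using the existing-domain single sentences;
    obtain matched new-domain seed entity words by matching the new-domain single sentences against the sentence templates; and
    merge the filtered new-domain seed entity words and the matched new-domain seed entity words to obtain the new-domain seed entity words.
  12. The apparatus according to claim 9 or 11, wherein the entity recognition module is further configured to:
    determine a relevance score representing the relevance of the new words to the new domain, using a domain concept graph representing inter-domain relevance and a term frequency-inverse document frequency algorithm; and
    filter the new words according to the relevance score and an empirical threshold, and take the new words whose relevance score is higher than the empirical threshold as the new-domain seed entity words.
  13. The apparatus according to claim 12, wherein the entity recognition module is further configured to:
    obtain relevance weights between the new domain and other domains according to the domain concept graph;
    determine, using the term frequency-inverse document frequency algorithm, a probability score representing the importance of the new words to the new domain; and
    determine the relevance score of the new words with respect to the new domain using the relevance weights and the probability score.
  14. The apparatus according to any one of claims 9 to 11, wherein the text labeling module is configured to:
    segment each new-domain single sentence character by character to obtain the characters that make up the new-domain single sentence;
    label each character of the new-domain single sentence according to the position of the character within a new-domain seed entity word contained in the new-domain single sentence; and
    obtain the labeled new-domain text data after all new-domain single sentences have been labeled.
  15. A device for named entity recognition, comprising a processor and a memory coupled to the processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor performs the method for named entity recognition according to any one of claims 1 to 7.
  16. A computer storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the processor performs the method for named entity recognition according to any one of claims 1 to 7.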
PCT/CN2019/089325 2018-06-01 2019-05-30 命名实体识别的方法、装置、设备及存储介质 WO2019228466A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810556103.4 2018-06-01
CN201810556103.4A CN110555206A (zh) 2018-06-01 2018-06-01 一种命名实体识别方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2019228466A1 true WO2019228466A1 (zh) 2019-12-05

Family

ID=68698713

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089325 WO2019228466A1 (zh) 2018-06-01 2019-05-30 命名实体识别的方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN110555206A (zh)
WO (1) WO2019228466A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553158A (zh) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 一种基于BiLSTM-CRF模型的电力调度领域命名实体识别方法及***
CN111967266A (zh) * 2020-09-09 2020-11-20 中国人民解放军国防科技大学 中文命名实体识别模型及其构建方法和应用
CN113127503A (zh) * 2021-03-18 2021-07-16 中国科学院国家空间科学中心 一种面向航天情报的自动信息提取方法及***

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062216B (zh) * 2019-12-18 2021-11-23 腾讯科技(深圳)有限公司 命名实体识别方法、装置、终端及可读介质
CN111178076B (zh) * 2019-12-19 2023-08-08 成都欧珀通信科技有限公司 命名实体识别与链接方法、装置、设备及可读存储介质
CN110969021A (zh) * 2019-12-23 2020-04-07 竹间智能科技(上海)有限公司 单轮对话中的命名实体识别方法、装置、设备及介质
CN111241839B (zh) * 2020-01-16 2022-04-05 腾讯科技(深圳)有限公司 实体识别方法、装置、计算机可读存储介质和计算机设备
CN111597813A (zh) * 2020-05-21 2020-08-28 上海创蓝文化传播有限公司 一种基于命名实体识别提取短信文本摘要的方法及装置
CN111832291B (zh) * 2020-06-02 2024-01-09 北京百度网讯科技有限公司 实体识别模型的生成方法、装置、电子设备及存储介质
CN114282586A (zh) * 2020-09-27 2022-04-05 中兴通讯股份有限公司 一种数据标注方法、***和电子设备
CN113887227B (zh) * 2021-09-15 2023-05-02 北京三快在线科技有限公司 一种模型训练与实体识别方法及装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170060835A1 (en) * 2015-08-27 2017-03-02 Xerox Corporation Document-specific gazetteers for named entity recognition
CN107133220A (zh) * 2017-06-07 2017-09-05 东南大学 一种地理学科领域命名实体识别方法
CN107908614A (zh) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 一种基于Bi‑LSTM的命名实体识别方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553158A (zh) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 一种基于BiLSTM-CRF模型的电力调度领域命名实体识别方法及***
CN111967266A (zh) * 2020-09-09 2020-11-20 中国人民解放军国防科技大学 中文命名实体识别模型及其构建方法和应用
CN111967266B (zh) * 2020-09-09 2024-01-26 中国人民解放军国防科技大学 中文命名实体识别***、模型构建方法和应用及相关设备
CN113127503A (zh) * 2021-03-18 2021-07-16 中国科学院国家空间科学中心 一种面向航天情报的自动信息提取方法及***

Also Published As

Publication number Publication date
CN110555206A (zh) 2019-12-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19811613

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.04.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19811613

Country of ref document: EP

Kind code of ref document: A1