WO2021043085A1 - Method and apparatus for recognizing named entity, computer device, and storage medium - Google Patents


Info

Publication number
WO2021043085A1
Authority
WO
WIPO (PCT)
Prior art keywords
named entity
text
recognized
sample data
data set
Prior art date
Application number
PCT/CN2020/112303
Other languages
French (fr)
Chinese (zh)
Inventor
张师琲
霍晓燕
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021043085A1 publication Critical patent/WO2021043085A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application belongs to the field of artificial intelligence technology, and in particular relates to a named entity recognition method, device, computer equipment and storage medium.
  • CRF: Conditional Random Field
  • RNN: Recurrent Neural Network
  • LSTM: Long Short-Term Memory
  • the present application provides a named entity recognition method with high recognition accuracy, to solve the problem of the low accuracy of prior-art named entity recognition.
  • this application provides a named entity identification method, which includes the following steps:
  • the initial sample data set contains a plurality of training texts and a named entity labeling result corresponding to each training text;
  • the target named entity annotation result is added to the initial sample data set, so that when the number of training texts in the initial sample data set reaches a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
  • this application also provides a named entity recognition device, including:
  • the initial sample data set acquisition module is used to acquire an initial sample data set from a sample database, and the initial sample data set contains multiple training texts and a named entity annotation result corresponding to each training text;
  • the judging module is used to judge whether the number of training texts in the initial sample data set reaches a preset threshold;
  • the first model training module is configured to train a preset named entity recognition model according to the initial sample data set when the result of the judgment module is yes;
  • the first text receiving module is configured to receive the first text to be recognized and preprocess the first text to be recognized;
  • the first model processing module is configured to use the trained named entity recognition model to process the preprocessed first to-be-recognized text to obtain a named entity automatic labeling result of the first to-be-recognized text;
  • the first comparison module is used to compare whether the automatic named entity labeling result is the same as the pre-obtained manual named entity labeling result; if they are the same, the automatic labeling result is used as the target named entity labeling result of the first text to be recognized; if they are not the same, a first manual review notification is output, and a first target named entity labeling result responding to the first manual review notification is received;
  • the second comparison module is configured to compare whether the automatic labeling result of the named entity is the same as the received labeling result of the first target named entity;
  • the first sample adding module is configured to, when the automatic named entity labeling result is not the same as the received first target named entity labeling result, add the first to-be-recognized text and the first target named entity labeling result corresponding to the first to-be-recognized text into the initial sample data set, so that when the number of training texts in the initial sample data set reaches a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
  • this application also provides a computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the following steps of the named entity recognition method when executing the computer program:
  • the initial sample data set contains a plurality of training texts and a named entity labeling result corresponding to each training text;
  • the target named entity annotation result is added to the initial sample data set, so that when the number of training texts in the initial sample data set reaches a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps of the named entity recognition method are realized:
  • the initial sample data set contains a plurality of training texts and a named entity labeling result corresponding to each training text;
  • the target named entity annotation result is added to the initial sample data set, so that when the number of training texts in the initial sample data set reaches a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
  • after this application performs named entity recognition on the first to-be-recognized text, it compares whether the automatic named entity labeling result is the same as the pre-obtained manual labeling result; if they are not the same, a first manual review notification is output and a first target named entity labeling result responding to it is received. The first to-be-recognized text and its first target named entity labeling result are then added to the initial sample data set, so that when the number of training texts in the initial sample data set reaches a preset number, the named entity recognition model is retrained on that data set, which improves the accuracy of the model and thereby the accuracy of named entity recognition.
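The comparison-and-collection workflow summarized above can be sketched as follows. This is a minimal illustration, not part of the patent; the function names are hypothetical, and `request_review` stands in for outputting the manual review notification and receiving the reviewed (target) result.

```python
def check_and_collect(text, auto_result, manual_result, request_review, dataset):
    """Compare the automatic labeling result with the manual one; on mismatch,
    request a review, and collect the text for retraining when the reviewed
    target result still differs from the automatic result."""
    if auto_result == manual_result:
        # Automatic result is accepted as the target labeling result.
        return auto_result
    # Mismatch: trigger the manual review and use its result as the target.
    target = request_review(text)
    if auto_result != target:
        # Model was wrong: keep the corrected sample for future retraining.
        dataset.append((text, target))
    return target
```

Once `dataset` has grown past the preset number of training texts, the model would be retrained on it, as described above.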
  • FIG. 1 is a flowchart of an embodiment of a named entity identification method according to this application.
  • Figure 2 is a schematic diagram of the named entity recognition model in this application.
  • FIG. 3 is a structural block diagram of an embodiment of a named entity recognition device according to this application.
  • FIG. 4 is a hardware architecture diagram of an embodiment of the computer device of this application.
  • This embodiment provides a named entity recognition method, which is mainly suitable for natural language processing. As shown in FIG. 1, the method includes the following steps:
  • the training text is a text in .doc or .docx format.
  • the training text can include different types of named entities such as time, person name, location, organization name, company name, country name, economic vocabulary, transaction type, economic quality indicator, and product name.
  • the named entities of different categories in each training text have been preset to different font styles, such as different font colors.
  • step S1 specifically includes the following process: first, obtain an initial sample data set from the sample database, where the initial sample data set contains multiple training texts and the named entities of different categories in each training text have been preset to different font styles; then, according to the font style (such as the font color attribute) of each word in each training text, obtain the named entity labeling result corresponding to each training text.
  • for example, with the font color attribute as the font style: the person-name font is set to red, the time font to yellow, the place font to blue, the organization-name font to green, and non-named entities to black. Accordingly, words in red font are marked with the person named entity identification label PERS, words in yellow font with the time named entity identification label TIME, words in blue font with the place named entity identification label LOC, words in green font with the organization-name named entity identification label ORGE, and words in black font with the non-named-entity label O; the other categories will not be listed here.
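The color-to-label correspondence above can be sketched as follows. This is a minimal illustration rather than the patent's implementation; the dictionary and function names are hypothetical, while the label strings (PERS, TIME, LOC, ORGE, O) follow the examples in the text.

```python
# Hypothetical sketch: derive NER training labels from per-word font colors,
# following the color scheme described above (red=person, yellow=time,
# blue=place, green=organization, black=non-entity).
COLOR_TO_LABEL = {
    "red": "PERS",     # person name
    "yellow": "TIME",  # time expression
    "blue": "LOC",     # place name
    "green": "ORGE",   # organization name
    "black": "O",      # non-named entity
}

def labels_from_colors(words_with_colors):
    """Map (word, font_color) pairs to (word, entity_label) pairs."""
    return [(w, COLOR_TO_LABEL.get(c, "O")) for w, c in words_with_colors]
```

Unknown colors fall back to the non-entity label O here; that fallback is an assumption, not something the patent specifies.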
  • step S2 Determine whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, perform step S3.
  • the named entity recognition model includes a BERT layer and a CRF layer, that is, the named entity recognition model of this embodiment is constructed by splicing a layer of CRF model on the basis of the BERT model.
  • the BERT model is a natural language processing model released by Google; its framework is shown in Figure 2. It has bidirectional Transformer encoders (i.e., the two layers of Trm in the figure). Through the processing of the bidirectional Transformer encoders, it can fully consider the relationships between contextual words, making the named entity labeling result more accurate. As shown in Figure 2, tok1, tok2, ..., tokN represent the input sequence of the training text, and E1, E2, ..., EN represent the vectors corresponding to tok1, tok2, ..., tokN; each vector is input into the forward-layer Transformer encoders. The output of each Transformer encoder in the forward layer is used as the input of each Transformer encoder in the backward layer, and the output of each Transformer encoder in the backward layer is normalized by the softmax function to obtain the probability matrices T1, T2, ..., TN of each word over the named entity categories.
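The softmax normalization mentioned above turns one token's raw encoder scores into a probability distribution over the named entity categories. A minimal standard-library sketch (the function name is generic; real implementations operate on whole tensors):

```python
import math

def softmax(logits):
    """Normalize one token's per-category scores into probabilities."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

Applying `softmax` to each token's score vector yields the probability matrices T1, ..., TN described in the text.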
  • the CRF model is a discriminative probability model and a kind of random field; it is often used to label or analyze sequence data, such as natural language text sequences.
  • Let T = (T1, T2, ..., Ti, ..., TN) denote the input feature sequence, where N is the length of the input. Conditioned on the known sequence X, the CRF model finds the label sequence [y1, ..., yN] with the largest probability P(y1, ..., yN), and thereby predicts the label of each word to obtain the named entity recognition result.
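Finding the label sequence with the largest probability, as described above, is commonly done with the Viterbi algorithm. The following is an illustrative standard-library sketch, not the patent's implementation: function and parameter names are hypothetical, and the plain scores stand in for the log-potentials a trained CRF layer would supply.

```python
def viterbi(emissions, transitions, labels):
    """Find the label sequence maximizing summed emission + transition scores.

    emissions:   list of {label: score} dicts, one per word (the T_i above)
    transitions: {(prev_label, label): score} as learned by a CRF layer
    """
    # best[i][y] = (score of the best path ending in label y at word i, backpointer)
    best = [{y: (emissions[0].get(y, 0.0), None) for y in labels}]
    for emit in emissions[1:]:
        row = {}
        for y in labels:
            prev, score = max(
                ((p, best[-1][p][0] + transitions.get((p, y), 0.0)) for p in labels),
                key=lambda t: t[1],
            )
            row[y] = (score + emit.get(y, 0.0), prev)
        best.append(row)
    # Trace back from the best final label.
    y = max(best[-1], key=lambda l: best[-1][l][0])
    path = [y]
    for row in reversed(best[1:]):
        y = row[y][1]
        path.append(y)
    return list(reversed(path))
```

Because the transition scores couple adjacent labels, the decoded sequence can differ from simply taking each token's highest-scoring label in isolation.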
  • step S3 is specifically implemented by the following steps: first, the initial sample data set is divided into a training set, a verification set, and a test set; then the named entity recognition model is trained according to the training set; when the training is completed, the accuracy of the trained named entity recognition model is verified according to the verification set; when the verification is passed, the verified named entity recognition model is tested according to the test set; if the test is successful, the training ends.
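The split into training, verification, and test sets can be sketched as follows. The 80/10/10 ratio and the function name are assumptions for illustration; the patent does not specify a ratio.

```python
import random

def split_dataset(samples, train=0.8, val=0.1, seed=42):
    """Shuffle labeled samples and split them into training/verification/test sets."""
    data = list(samples)
    random.Random(seed).shuffle(data)     # seeded for a reproducible split
    n_train = int(len(data) * train)
    n_val = int(len(data) * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])
```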
  • the process of training the named entity recognition model according to the training set is as follows: input the sample data in the training set into the BERT layer, then input the output of the BERT layer into the CRF layer, and iteratively train the parameters of the BERT layer and the CRF layer.
  • S5 Use the trained named entity recognition model to process the preprocessed first to-be-recognized text, to obtain the automatic named entity labeling result of the first to-be-recognized text. This includes the following steps:
  • the embedding representation vector is spliced and combined into a total vector corresponding to each word or identifier.
  • the total vector E = [0.05, 0.82, 0.03, 0.05, 0, 0, 0, 0, 0, 0, 1, 2, 3] corresponding to the word.
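The "splicing" of embedding representation vectors into a total vector can be read as concatenation, which matches the shape of the example vector above. This reading is an assumption (standard BERT sums rather than concatenates its embeddings), and the function name is hypothetical.

```python
def concat_embeddings(token_emb, segment_emb, position_emb):
    """Splice the per-word embedding component vectors into one total vector E."""
    return list(token_emb) + list(segment_emb) + list(position_emb)
```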
  • each Transformer encoder in the forward layer receives the total vector corresponding to each word or identifier in the input sequence, and the output of each Transformer encoder in the forward layer is used as the input of each Transformer encoder in the backward layer. The outputs of the Transformer encoders in the backward layer are normalized by the softmax function, and the probability matrix over the named entity categories for each word in the input sequence is obtained as the text feature sequence corresponding to the first text to be recognized.
  • step S2 when it is determined that the number of training texts in the initial sample data set does not reach the preset threshold, the following operations are performed:
  • S21 Perform scramble processing N times on the sentence order of the training texts in the initial sample data set to generate N different new sample data sets, where N is a positive integer. It can be understood that after the sentences of a training text are scrambled, a new training text is obtained; after the sentences of all training texts in the initial sample data set are scrambled, a new sample data set is obtained; scrambling randomly N times yields N different new sample data sets.
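Step S21's N-fold sentence scrambling can be sketched as follows for a single training text (hypothetical function name; each variant contains the same sentences in a different order):

```python
import random

def scramble_sentences(text_sentences, n, seed=0):
    """Generate n new training texts by shuffling the sentence order of one text."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        s = list(text_sentences)   # copy so the original order is untouched
        rng.shuffle(s)
        variants.append(s)
    return variants
```

Applying this to every training text in the initial sample data set, n times, produces the N new sample data sets described above.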
  • S23 Receive the second text to be recognized, and preprocess the second text to be recognized.
  • the preprocessing process of the second text to be recognized is the same as the preprocessing process of the first text to be recognized, so it will not be repeated here.
  • use the N+1 named entity recognition models obtained from the aforementioned N+1 trainings to process the preprocessed second to-be-recognized text respectively. It can be understood that processing the second to-be-recognized text with one named entity recognition model yields one automatic named entity labeling result, so processing it with the N+1 named entity recognition models yields N+1 automatic named entity labeling results corresponding to the second to-be-recognized text.
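The subsequent comparison of the N+1 automatic labeling results (accept when all models agree, otherwise trigger a manual review, per the third comparison module described below) can be sketched as follows. The function name is hypothetical, and returning `None` stands in for outputting the second manual review notification.

```python
def resolve_annotations(results):
    """If all N+1 models produce the same labeling, accept it as the target
    result; otherwise signal that a manual review is required."""
    first = results[0]
    if all(r == first for r in results[1:]):
        return first   # target named entity labeling result
    return None        # stand-in for outputting the manual review notification
```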
  • step S26 Add the second to-be-recognized text and the second target named entity labeling result corresponding to it into the initial sample data set, until the number of training texts in the initial sample data set reaches the preset threshold; then perform step S3 to improve the accuracy of the model and thereby further improve the accuracy of named entity recognition.
  • This embodiment provides a named entity recognition device 10, as shown in FIG. 3, including:
  • the initial sample data set acquisition module 101 is configured to acquire an initial sample data set from a sample database, and the initial sample data set contains multiple training texts and a named entity annotation result corresponding to each training text;
  • the judging module 102 is used to judge whether the number of training texts in the initial sample data set reaches a preset threshold;
  • the first model training module 103 is configured to train the preset named entity recognition model according to the initial sample data set when the result of the judgment module is yes;
  • the first text receiving module 104 is configured to receive the first text to be recognized and preprocess the first text to be recognized;
  • the first model processing module 105 is configured to use the trained named entity recognition model to process the preprocessed first to-be-recognized text to obtain a named entity automatic labeling result of the first to-be-recognized text;
  • the first comparison module 106 is used to compare whether the automatic named entity labeling result is the same as the pre-obtained manual named entity labeling result; if they are the same, the automatic labeling result is used as the target named entity labeling result of the first text to be recognized; if they are not the same, a first manual review notification is output, and a first target named entity labeling result responding to the first manual review notification is received;
  • the second comparison module 107 is configured to compare whether the automatic labeling result of the named entity is the same as the received labeling result of the first target named entity;
  • the first sample adding module 108 is configured to, when the automatic named entity labeling result is different from the received first target named entity labeling result, add the first to-be-recognized text and the first target named entity labeling result corresponding to it into the initial sample data set, so that when the number of training texts in the initial sample data set reaches a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
  • the named entity recognition device 10 further includes:
  • the new sample data set acquisition module 109 is used to perform scramble processing N times on the sentence order of the training texts in the initial sample data set when the judgment result of the judgment module is no, to generate N different new sample data sets, where N is a positive integer;
  • the second model training module 110 is configured to train the preset named entity recognition model according to the initial sample data set, and to train the preset named entity recognition model according to each of the N different new sample data sets, to obtain N+1 trained named entity recognition models;
  • the second text receiving module 111 is configured to receive the second text to be recognized and preprocess the second text to be recognized;
  • the second model processing module 112 is configured to use the N+1 trained named entity recognition models to respectively process the preprocessed second to-be-recognized text, to obtain N+1 automatic named entity labeling results corresponding to the second to-be-recognized text;
  • the third comparison module 113 is used to compare whether the N+1 named entity labeling results are the same; if they are the same, the shared automatic labeling result is used as the target named entity labeling result of the second text to be recognized; if they are not the same, a second manual review notification is output, and a second target named entity labeling result responding to the second manual review notification is received;
  • the second sample adding module 114 is configured to add the second to-be-recognized text and the second target named entity annotation result corresponding to the second to-be-recognized text into the initial sample data set until the number of training texts in the initial sample data set reaches a preset threshold.
  • the first model training module is specifically used for:
  • the verified named entity recognition model is tested according to the test set. If the test is successful, the training ends.
  • the named entity recognition model includes a BERT layer and a CRF layer.
  • the first model processing module is specifically used for:
  • the CRF layer is used to process the text feature sequence of the first text to be recognized, and the result of automatically marking named entities of the first text to be recognized is obtained.
  • the initial sample data set acquisition module is specifically used for:
  • the initial sample data set contains multiple training texts, and different types of named entities in each training text have been preset with different font styles;
  • the labeled result of the named entity corresponding to each training text is obtained.
  • the preprocessing is text serialization processing.
  • This application also provides a computer device capable of executing programs, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server, or a server cluster composed of multiple servers), and so on.
  • the computer device 20 in this embodiment at least includes, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus, as shown in FIG. 4. It should be pointed out that FIG. 4 only shows the computer device 20 with the components 21-22, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
  • the memory 21 (ie, readable storage medium) includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), Read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or a memory of the computer device 20.
  • the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
  • the memory 21 may also include both an internal storage unit of the computer device 20 and an external storage device thereof.
  • the memory 21 is generally used to store the operating system and various application software installed in the computer device 20, such as the program code of the named entity recognition device 10 in the second embodiment.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 22 is generally used to control the overall operation of the computer device 20.
  • the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the named entity recognition device 10, so as to implement the following steps of the named entity recognition method of the first embodiment:
  • the initial sample data set contains a plurality of training texts and a named entity labeling result corresponding to each training text;
  • the target named entity annotation result is added to the initial sample data set, so that when the number of training texts in the initial sample data set reaches a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
  • This application also provides a computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, server, app store, etc., on which a computer program is stored; when the program is executed by a processor, the corresponding function is realized. The computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium in this embodiment is used to store the named entity recognition device 10, and when executed by a processor, it implements the following steps of the named entity recognition method in the first embodiment:
  • the initial sample data set contains a plurality of training texts and a named entity labeling result corresponding to each training text;
  • the target named entity annotation result is added to the initial sample data set, so that when the number of training texts in the initial sample data set reaches a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

A method for recognizing a named entity. The method comprises: acquiring an initial sample data set, and if the number of training texts in the initial sample data set reaches a preset threshold, training a named entity recognition model according to the initial sample data set; processing a first text to be recognized by using the named entity recognition model obtained from training, so as to obtain a named entity automatic labeling result; comparing the named entity automatic labeling result with a named entity manual labeling result to determine whether the two are the same, if so, taking the named entity automatic labeling result as a target named entity labeling result, otherwise, outputting a first manual checking notification, and receiving a first target named entity labeling result in response to the first manual checking notification; and comparing the named entity automatic labeling result with the first target named entity labeling result to determine whether the two are the same, and if the two are different, adding the first text to be recognized to the initial sample data set. The method can improve the accuracy of recognition of a named entity.

Description

Named entity recognition method, device, computer equipment and storage medium
This application claims priority to Chinese patent application No. CN201910832541.3, entitled "Named Entity Recognition Method, Device, Computer Equipment and Storage Medium", filed on September 4, 2019, the entire content of which is incorporated herein by reference.
Technical Field
This application belongs to the field of artificial intelligence technology, and in particular relates to a named entity recognition method, device, computer equipment and storage medium.
Background
The recognition of named entities (such as times, person names, place names, organization names, and domain-specific vocabulary) is an important part of natural language understanding, and is often used in natural language processing scenarios such as information extraction and entity linking. In the prior art, a CRF (Conditional Random Field) model, an RNN (Recurrent Neural Network), or an LSTM (Long Short-Term Memory) + CRF model is generally used to perform named entity recognition on the text to be recognized.
Technical Problem
The inventors found that no matter whether the CRF model, the RNN, or the LSTM+CRF model is used for named entity recognition, the accuracy is not high.
Technical Solution
In view of the above-mentioned shortcomings of the prior art, the present application provides a named entity recognition method with high recognition accuracy, to solve the problem of the low accuracy of prior-art named entity recognition.
To achieve the above objective, this application provides a named entity recognition method, which includes the following steps:
obtaining an initial sample data set from a sample database, where the initial sample data set contains a plurality of training texts and a named entity labeling result corresponding to each training text;
determining whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, performing the following operations:
training a preset named entity recognition model according to the initial sample data set;
receiving a first text to be recognized, and preprocessing the first text to be recognized;
using the trained named entity recognition model to process the preprocessed first text to be recognized, to obtain an automatic named entity labeling result of the first text to be recognized;
comparing whether the automatic named entity labeling result is the same as a pre-obtained manual named entity labeling result; if they are the same, using the automatic labeling result as the target named entity labeling result of the first text to be recognized; if they are not the same, outputting a first manual review notification and receiving a first target named entity labeling result responding to the first manual review notification;
comparing whether the automatic named entity labeling result is the same as the received first target named entity labeling result; if they are not the same, adding the first text to be recognized and the corresponding first target named entity labeling result into the initial sample data set, so that when the number of training texts in the initial sample data set reaches a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
为了实现上述目的,本申请还提供一种命名实体识别装置,包括:In order to achieve the above objective, this application also provides a named entity recognition device, including:
初始样本数据集获取模块,用于从样本数据库中获取初始样本数据集,该初始样本数据集中包含多个训练文本以及各训练文本对应的命名实体标注结果;The initial sample data set acquisition module is used to acquire an initial sample data set from a sample database, and the initial sample data set contains multiple training texts and a named entity annotation result corresponding to each training text;
判断模块,用于判断所述初始样本数据集中的训练文本数量是否达到预设阈值:The judging module is used to judge whether the number of training texts in the initial sample data set reaches a preset threshold:
第一模型训练模块,用于在所述判断模块的结果为是时,根据所述初始样本数据集对预设的命名实体识别模型进行训练;The first model training module is configured to train a preset named entity recognition model according to the initial sample data set when the result of the judgment module is yes;
第一文本接收模块,用于接收第一待识别文本,并对所述第一待识别文本进行预处理;The first text receiving module is configured to receive the first text to be recognized and preprocess the first text to be recognized;
第一模型处理模块,用于利用训练得到的命名实体识别模型对预处理后的第一待识别文本进行处理,得到所述第一待识别文本的命名实体自动标注结果;The first model processing module is configured to use the trained named entity recognition model to process the preprocessed first to-be-recognized text to obtain a named entity automatic labeling result of the first to-be-recognized text;
第一比对模块,用于比对所述命名实体自动标注结果与预先获得的命名实体人工标注结果是否相同,若相同,则将所述命名实体自动标注结果作为所述第一待识别文本的目标命名实体标注结果,若不相同,则输出第一人工审核通知,并接收响应所述第一人工审核通知的第一目标命名实体标注结果;The first comparison module is configured to compare whether the automatic named entity annotation result is the same as the pre-obtained manual named entity annotation result; if they are the same, use the automatic annotation result as the target named entity annotation result of the first text to be recognized; if they are not the same, output a first manual review notification and receive a first target named entity annotation result in response to the first manual review notification;
第二比对模块,用于比对所述命名实体自动标注结果与接收到的所述第一目标命名实体标注结果是否相同;The second comparison module is configured to compare whether the automatic labeling result of the named entity is the same as the received labeling result of the first target named entity;
第一样本增加模块,用于在所述命名实体自动标注结果与接收到的第一目标命名实体标注结果不相同时,将所述第一待识别文本及所述第一待识别文本对应的第一目标命名实体标注结果加入所述初始样本数据集中,以便在所述初始样本数据集中的训练文本达到预设数量时,根据训练文本达到预设数量的初始样本数据集对命名实体识别模型进行重新训练。The first sample adding module is configured to, when the automatic named entity annotation result is not the same as the received first target named entity annotation result, add the first text to be recognized and its corresponding first target named entity annotation result to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
为了实现上述目的,本申请还提供一种计算机设备,包括存储器、处理器以及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现命名实体识别方法的以下步骤:In order to achieve the above objective, this application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the following steps of the named entity recognition method:
从样本数据库中获取初始样本数据集,该初始样本数据集中包含多个训练文本以及各训练文本对应的命名实体标注结果;Obtain an initial sample data set from the sample database, the initial sample data set contains a plurality of training texts and a named entity labeling result corresponding to each training text;
判断所述初始样本数据集中的训练文本数量是否达到预设阈值,若是,则执行如下操作:Determine whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, perform the following operations:
根据所述初始样本数据集对预设的命名实体识别模型进行训练;Training a preset named entity recognition model according to the initial sample data set;
接收第一待识别文本,并对所述第一待识别文本进行预处理;Receiving the first text to be recognized, and preprocessing the first text to be recognized;
利用训练得到的命名实体识别模型对预处理后的第一待识别文本进行处理,得到所述第一待识别文本的命名实体自动标注结果;Using the trained named entity recognition model to process the preprocessed first to-be-recognized text to obtain a named entity automatic labeling result of the first to-be-recognized text;
比对所述命名实体自动标注结果与预先获得的命名实体人工标注结果是否相同,若相同,则将所述命名实体自动标注结果作为所述第一待识别文本的目标命名实体标注结果,若不相同,则输出第一人工审核通知,并接收响应所述第一人工审核通知的第一目标命名实体标注结果;Compare whether the automatic named entity annotation result is the same as the pre-obtained manual named entity annotation result; if they are the same, use the automatic annotation result as the target named entity annotation result of the first text to be recognized; if they are not the same, output a first manual review notification, and receive a first target named entity annotation result in response to the first manual review notification;
比对所述命名实体自动标注结果与接收到的所述第一目标命名实体标注结果是否相同,若不相同,则将所述第一待识别文本及所述第一待识别文本对应的第一目标命名实体标注结果加入所述初始样本数据集中,以便在所述初始样本数据集中的训练文本达到预设数量时,根据训练文本达到预设数量的初始样本数据集对命名实体识别模型进行重新训练。Compare whether the automatic named entity annotation result is the same as the received first target named entity annotation result; if they are not the same, add the first text to be recognized and its corresponding first target named entity annotation result to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
为了实现上述目的,本申请还提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现命名实体识别方法的以下步骤:In order to achieve the above objective, this application further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the following steps of the named entity recognition method:
从样本数据库中获取初始样本数据集,该初始样本数据集中包含多个训练文本以及各训练文本对应的命名实体标注结果;Obtain an initial sample data set from the sample database, the initial sample data set contains a plurality of training texts and a named entity labeling result corresponding to each training text;
判断所述初始样本数据集中的训练文本数量是否达到预设阈值,若是,则执行如下操作:Determine whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, perform the following operations:
根据所述初始样本数据集对预设的命名实体识别模型进行训练;Training a preset named entity recognition model according to the initial sample data set;
接收第一待识别文本,并对所述第一待识别文本进行预处理;Receiving the first text to be recognized, and preprocessing the first text to be recognized;
利用训练得到的命名实体识别模型对预处理后的第一待识别文本进行处理,得到所述第一待识别文本的命名实体自动标注结果;Using the trained named entity recognition model to process the preprocessed first to-be-recognized text to obtain a named entity automatic labeling result of the first to-be-recognized text;
比对所述命名实体自动标注结果与预先获得的命名实体人工标注结果是否相同,若相同,则将所述命名实体自动标注结果作为所述第一待识别文本的目标命名实体标注结果,若不相同,则输出第一人工审核通知,并接收响应所述第一人工审核通知的第一目标命名实体标注结果;Compare whether the automatic named entity annotation result is the same as the pre-obtained manual named entity annotation result; if they are the same, use the automatic annotation result as the target named entity annotation result of the first text to be recognized; if they are not the same, output a first manual review notification, and receive a first target named entity annotation result in response to the first manual review notification;
比对所述命名实体自动标注结果与接收到的所述第一目标命名实体标注结果是否相同,若不相同,则将所述第一待识别文本及所述第一待识别文本对应的第一目标命名实体标注结果加入所述初始样本数据集中,以便在所述初始样本数据集中的训练文本达到预设数量时,根据训练文本达到预设数量的初始样本数据集对命名实体识别模型进行重新训练。Compare whether the automatic named entity annotation result is the same as the received first target named entity annotation result; if they are not the same, add the first text to be recognized and its corresponding first target named entity annotation result to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
有益效果Beneficial effect
本申请对第一待识别文本进行命名实体识别后,比对命名实体自动标注结果与预先获得的命名实体人工标注结果是否相同,若不相同,则输出第一人工审核通知,并接收响应所述第一人工审核通知的第一目标命名实体标注结果;若所述命名实体自动标注结果与第一目标命名实体标注结果不相同,则将所述第一待识别文本及所述第一待识别文本对应的第一目标命名实体标注结果加入所述初始样本数据集中,以在所述初始样本数据集中的训练文本达到预设数量时,根据训练文本达到预设数量的初始样本数据集对命名实体识别模型进行重新训练,从而提高了模型的准确度,进而提高命名实体识别的准确率。After performing named entity recognition on the first text to be recognized, this application compares whether the automatic named entity annotation result is the same as the pre-obtained manual named entity annotation result; if they are not the same, a first manual review notification is output, and a first target named entity annotation result in response to the first manual review notification is received. If the automatic named entity annotation result differs from the first target named entity annotation result, the first text to be recognized and its corresponding first target named entity annotation result are added to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on that data set, which improves the accuracy of the model and thereby the accuracy of named entity recognition.
附图说明Description of the drawings
图1为本申请一种命名实体识别方法的一个实施例的流程图;FIG. 1 is a flowchart of an embodiment of a named entity identification method according to this application;
图2为本申请中命名实体识别模型的原理图;Figure 2 is a schematic diagram of the named entity recognition model in this application;
图3为本申请一种命名实体识别装置的一个实施例的结构框图;FIG. 3 is a structural block diagram of an embodiment of a named entity recognition device according to this application;
图4为本申请计算机设备的一个实施例的硬件架构图。FIG. 4 is a hardware architecture diagram of an embodiment of the computer device of this application.
本发明的实施方式Embodiments of the present invention
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅用以解释本申请,并不用于限定本申请。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。In order to make the purpose, technical solutions, and advantages of this application clearer, the following further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
实施例一Example one
本实施例提供一种命名实体识别方法,主要适用于自然语言处理,如图1所示,该方法包括以下步骤:This embodiment provides a named entity recognition method, which is mainly suitable for natural language processing. As shown in FIG. 1, the method includes the following steps:
S1,从样本数据库中获取初始样本数据集,该初始样本数据集中包含多个训练文本以及各训练文本对应的命名实体标注结果。在本实施例中,训练文本为.doc或.docx格式的文本,训练文本中可包含时间、人名、地点、组织机构名称、公司名称、国家名称、经济词汇、交易类型、经济质量指标、产品名称等各种不同类别的命名实体。其中,各训练文本中不同类别的命名实体已预先设置为不同的字体样式,如设置为不同的字体颜色。在此情况下,步骤S1具体包括如下过程:首先,从样本数据库中获取初始样本数据集,初始样本数据集包含多个训练文本,各训练文本中不同类别的命名实体已预先设置为不同的字体样式;而后,根据各训练文本中每个词的字体样式(如字体颜色属性),获取各训练文本对应的命名实体标注结果。例如,假设预先通过人工将训练文本中的人名字体设置为红色,时间字体设置为黄色,地点字体设置为蓝色、组织机构名称设置为绿色,非命名实体设置为黑色,则将训练文本中红色字体的词标注为人名命名实体识别标签PERS,黄色字体的词标注为时间命名实体识别标签TIME,蓝色字体的词标注为地点命名实体识别标签LOC,绿色字体的词标注为组织机构名称命名实体识别标签ORGE,黑色字体的词标注为非命名实体标签O,在此不一一列举。S1. Obtain an initial sample data set from a sample database, where the initial sample data set contains multiple training texts and the named entity annotation result corresponding to each training text. In this embodiment, a training text is a text in .doc or .docx format and may contain named entities of various categories such as time, person name, location, organization name, company name, country name, economic term, transaction type, economic quality indicator, and product name. Named entities of different categories in each training text have been preset to different font styles, such as different font colors. In this case, step S1 specifically includes the following process: first, obtain an initial sample data set from the sample database, where the initial sample data set contains multiple training texts in which named entities of different categories have been preset to different font styles; then, according to the font style (such as the font color attribute) of each word in each training text, obtain the named entity annotation result corresponding to that training text.
For example, suppose that person names in the training text have been manually set to red, time expressions to yellow, locations to blue, organization names to green, and non-entities to black. Then words in red font are labeled with the person-name entity tag PERS, words in yellow font with the time entity tag TIME, words in blue font with the location entity tag LOC, words in green font with the organization-name entity tag ORGE, and words in black font with the non-entity tag O; other categories are not listed one by one here.
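The font-color-based labeling described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the words of a training text have already been extracted as (word, color) pairs (e.g. from a .docx file, extraction omitted), and the color-to-tag mapping follows the example in the text. The function name `label_by_font_color` is hypothetical.

```python
# Color-to-label mapping taken from the example above.
COLOR_TO_LABEL = {
    "red": "PERS",     # person names
    "yellow": "TIME",  # time expressions
    "blue": "LOC",     # locations
    "green": "ORGE",   # organization names
    "black": "O",      # non-entities
}

def label_by_font_color(colored_words):
    """Map each (word, font_color) pair to a (word, NER label) pair.

    Unknown colors fall back to the non-entity label O.
    """
    return [(w, COLOR_TO_LABEL.get(c, "O")) for w, c in colored_words]

# Example: a three-word snippet with manually set font colors.
labeled = label_by_font_color([("小明", "red"), ("在", "black"), ("北京大学", "green")])
```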
S2,判断初始样本数据集中的训练文本数量是否达到预设阈值,若是,执行步骤S3。S2: Determine whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, perform step S3.
S3,根据初始样本数据集对命名实体识别模型进行训练。在本实施例中,如图2所示,命名实体识别模型包含BERT层和CRF层,即,本实施例的命名实体识别模型是通过在BERT模型的基础上再拼接一层CRF模型而构成。S3, training the named entity recognition model according to the initial sample data set. In this embodiment, as shown in FIG. 2, the named entity recognition model includes a BERT layer and a CRF layer, that is, the named entity recognition model of this embodiment is constructed by splicing a layer of CRF model on the basis of the BERT model.
BERT模型是由Google公司发布的自然语言处理模型,其框架如图2所示,具有双向Transformer编码器(即图中的双层Trm),通过双向Transformer编码器的处理,能充分考虑上下文词与词之间的关系,使得命名实体标注结果更加准确。如图2所示,tok1、tok2、…、tokN表示训练文本的输入序列,E1、E2、…、EN表示tok1、tok2、…、tokN分别对应的向量,各向量分别输入前向层Transformer中的每一个Transformer编码器,将前向层Transformer中的每一个Transformer编码器的输出作为后向层Transformer中的每一个Transformer编码器的输入,将后向层Transformer中的各Transformer编码器输出的结果通过softmax函数做归一化处理,得到每个词对应命名实体类别的概率矩阵T1、T2、…、TN。The BERT model is a natural language processing model released by Google. Its framework, shown in Figure 2, uses bidirectional Transformer encoders (the two layers of Trm in the figure); through the bidirectional Transformer encoders, the relationships between words in context are fully taken into account, making the named entity annotation results more accurate. As shown in Figure 2, tok1, tok2, ..., tokN denote the input sequence of a training text, and E1, E2, ..., EN denote the vectors corresponding to tok1, tok2, ..., tokN respectively. Each vector is fed into every Transformer encoder in the forward Transformer layer; the output of each Transformer encoder in the forward layer serves as the input of each Transformer encoder in the backward layer; and the outputs of the Transformer encoders in the backward layer are normalized by the softmax function to obtain the probability matrices T1, T2, ..., TN of the named entity category for each word.
CRF模型是一种判别式概率模型,是随机场的一种,常用于标注或分析序列资料,如自然语言文字序列。对于输入的长度为N的序列T=[T1、T2、…、Ti、…、TN],假设标签的标注结果为[y1, …, yN],则CRF模型将在已知序列X的条件下,找出使得[y1, …, yN]的概率P(y1, …, yN)最大的序列[Y1, …, YN],然后预测每个词的标签,即得到命名实体识别结果。The CRF model is a discriminative probabilistic model, a kind of random field, commonly used to label or analyze sequence data such as natural language text sequences. For an input sequence T = [T1, T2, ..., Ti, ..., TN] of length N, assuming the label sequence is [y1, ..., yN], the CRF model finds, conditioned on the known sequence X, the sequence [Y1, ..., YN] that maximizes the probability P(y1, ..., yN) of [y1, ..., yN], and then predicts the label of each word, thereby obtaining the named entity recognition result.
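The maximization over label sequences described above is typically solved with Viterbi dynamic programming. The sketch below is an illustrative, simplified decoder, not the patent's implementation: emission scores stand in for the BERT layer's per-word probability matrices, transition scores stand in for the CRF layer's learned parameters, and the function name is hypothetical.

```python
def viterbi_decode(emissions, transitions, labels):
    """Find the highest-scoring label sequence for one sentence.

    emissions:   list of dicts; emissions[i][y] is the score of label y at
                 position i (here standing in for the BERT layer's output)
    transitions: dict; transitions[(prev, y)] is the transition score from
                 label prev to label y (missing pairs score 0)
    labels:      the label inventory, e.g. ["PERS", "TIME", "LOC", "ORGE", "O"]
    """
    # score[y] = best score of any label path ending in y at the current position
    score = {y: emissions[0][y] for y in labels}
    backpointers = []
    for i in range(1, len(emissions)):
        new_score, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: score[p] + transitions.get((p, y), 0.0))
            ptr[y] = prev
            new_score[y] = score[prev] + transitions.get((prev, y), 0.0) + emissions[i][y]
        score = new_score
        backpointers.append(ptr)
    # Backtrack from the best final label to recover the full path.
    best = max(labels, key=lambda y: score[y])
    path = [best]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With an empty transition table this reduces to picking the best label per position; non-zero transition scores let the label of one word influence its neighbors, which is the benefit the CRF layer adds on top of per-word probabilities.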
在本实施例中,步骤S3具体通过如下步骤实现:首先,将初始样本数据集划分为训练集、验证集和测试集;而后,根据训练集对命名实体识别模型进行训练;当训练完成后,根据验证集对经过训练的命名实体识别模型的准确率等性能进行验证;当验证通过后,根据测试集对经过验证的命名实体识别模型进行测试,若测试成功,训练结束。其中,根据训练集对命名实体识别模型进行训练的过程如下:将训练集中的样本数据输入到BERT层,再将BERT层的输出结果输入到CRF层,以对BERT层和CRF层的训练参数进行迭代训练。In this embodiment, step S3 is specifically implemented through the following steps: first, divide the initial sample data set into a training set, a validation set, and a test set; then train the named entity recognition model on the training set; when training is complete, verify the accuracy and other performance of the trained model on the validation set; after verification passes, test the verified model on the test set, and if the test succeeds, training ends. The process of training the named entity recognition model on the training set is as follows: input the sample data in the training set into the BERT layer, then feed the output of the BERT layer into the CRF layer, so as to iteratively train the parameters of the BERT layer and the CRF layer.
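The three-way split described above can be sketched as follows. The 8:1:1 ratio and the seeded shuffle are assumptions for illustration; the text does not specify split proportions, and `split_dataset` is a hypothetical helper name.

```python
import random

def split_dataset(samples, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Shuffle labeled samples and split them into train/validation/test sets.

    samples: a sequence of (training_text, annotation_result) pairs.
    The 80/10/10 default ratios are an illustrative assumption.
    """
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # seeded for reproducibility
    n_train = int(len(samples) * train_ratio)
    n_val = int(len(samples) * val_ratio)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```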
S4,接收第一待识别文本,并对第一待识别文本进行预处理,此处的预处理具体是指文本序列化处理。具体来说,首先对第一待识别文本中的语句进行词处理,并在语句的前面加上开始标志符CLS,在两个语句之间加上分隔标志符SEP。例如,假设第一待识别文本为“小明喜欢看NBA”,则对应的输入序列为“[CLS]、小明、喜欢、看、NBA”。S4. Receive the first text to be recognized and preprocess it; here, preprocessing specifically refers to text serialization. Specifically, word segmentation is first performed on the sentences in the first text to be recognized, the start marker CLS is added before a sentence, and the separator marker SEP is added between two sentences. For example, if the first text to be recognized is "小明喜欢看NBA", the corresponding input sequence is "[CLS]、小明、喜欢、看、NBA".
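The serialization step above can be sketched as follows. This minimal illustration assumes word segmentation has already been done (segmentation itself is a separate task), and `serialize_text` is a hypothetical name.

```python
def serialize_text(segmented_sentences):
    """Build a BERT-style input sequence from pre-segmented sentences.

    Prepends the start marker [CLS] and inserts the separator [SEP]
    between adjacent sentences, as described in step S4.
    """
    seq = ["[CLS]"]
    for i, sentence in enumerate(segmented_sentences):
        if i > 0:
            seq.append("[SEP]")
        seq.extend(sentence)
    return seq

# The example from the text: "小明喜欢看NBA", already segmented.
sequence = serialize_text([["小明", "喜欢", "看", "NBA"]])
# sequence == ["[CLS]", "小明", "喜欢", "看", "NBA"]
```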
S5,利用训练得到的命名实体识别模型对预处理后的第一待识别文本进行处理,得到第一待识别文本的命名实体自动标注结果。具体包括以下步骤:S5: Use the trained named entity recognition model to process the preprocessed first to-be-recognized text to obtain a named entity automatic labeling result of the first to-be-recognized text. It includes the following steps:
S51,利用BERT层对第一待识别文本对应的输入序列进行处理,得到待识别文本对应的文本特征序列。具体处理过程如下:S51: Use the BERT layer to process the input sequence corresponding to the first text to be recognized to obtain the text feature sequence corresponding to the text to be recognized. The specific process is as follows:
首先,对待识别文本对应的输入序列中每个词或标志符([CLS]、[SEP])进行词编码、对每个词或标志符所在的段落进行段落编码,对每个词或标志符在相应语句中的位置进行位置编码,从而得到每个词或标志符对应的词嵌入表征向量、段落嵌入表征向量和位置嵌入表征向量,并将对应的词嵌入表征向量、段落嵌入表征向量和位置嵌入表征向量拼接组合成各词或标志符对应的总向量。例如,某词对应的词嵌入表征向量为Etoken=[0.05,0.82,0.03,0.05]、段落嵌入表征向量为Esegment=[0,0,0,0]、位置嵌入表征向量为Eposition=[0,1,2,3],则该词对应的总向量E=[0.05,0.82,0.03,0.05,0,0,0,0,0,1,2,3]。First, perform word encoding on each word or marker ([CLS], [SEP]) in the input sequence corresponding to the text to be recognized, perform paragraph encoding on the paragraph in which each word or marker is located, and perform position encoding on the position of each word or marker in its sentence, thereby obtaining the word embedding vector, paragraph embedding vector, and position embedding vector corresponding to each word or marker; the corresponding word, paragraph, and position embedding vectors are then spliced into a total vector for each word or marker. For example, if the word embedding vector of a word is Etoken=[0.05,0.82,0.03,0.05], its paragraph embedding vector is Esegment=[0,0,0,0], and its position embedding vector is Eposition=[0,1,2,3], then the total vector corresponding to the word is E=[0.05,0.82,0.03,0.05,0,0,0,0,0,1,2,3].
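The splicing step can be shown directly with the numbers from the example above. Note this follows the concatenation described in the text; the standard BERT model instead sums the three embeddings element-wise, so this sketch reproduces the patent's description rather than the usual BERT convention.

```python
def total_vector(e_token, e_segment, e_position):
    """Splice the word, paragraph, and position embedding vectors into
    one total vector, as described in the text (list concatenation)."""
    return e_token + e_segment + e_position

# Numbers taken verbatim from the example in the text.
E = total_vector([0.05, 0.82, 0.03, 0.05],  # Etoken
                 [0, 0, 0, 0],              # Esegment
                 [0, 1, 2, 3])              # Eposition
# E == [0.05, 0.82, 0.03, 0.05, 0, 0, 0, 0, 0, 1, 2, 3]
```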
而后,将输入序列中每个词或标志符对应的总向量输入前向层Transformer中的每一个Transformer编码器,将前向层Transformer中的每一个Transformer编码器的输出结果作为后向层Transformer中的每一个Transformer编码器的输入,将后向层Transformer中的各Transformer编码器的输出结果通过softmax函数做归一化处理,得到输入序列中每个词对应命名实体类别的概率矩阵,作为第一待识别文本对应的文本特征序列。Then, the total vector corresponding to each word or marker in the input sequence is fed into every Transformer encoder in the forward Transformer layer; the output of each Transformer encoder in the forward layer serves as the input of each Transformer encoder in the backward layer; and the outputs of the Transformer encoders in the backward layer are normalized by the softmax function to obtain the probability matrix of the named entity category for each word in the input sequence, which serves as the text feature sequence corresponding to the first text to be recognized.
S52,利用CRF层对第一待识别文本的文本特征序列进行处理,以预测第一待识别文本中各词的命名实体标签,得到第一待识别文本的命名实体自动标注结果。例如,针对待识别文本“小明在北京大学的图书馆学习”,标注结果将如下表1所示:S52: Use the CRF layer to process the text feature sequence of the first text to be recognized to predict the named entity tag of each word in the first text to be recognized, and obtain a named entity automatic tagging result of the first text to be recognized. For example, for the to-be-recognized text "Xiao Ming studied in the library of Peking University", the labeling results will be shown in Table 1 below:
表1 Table 1

| 命名实体识别结果 Named entity recognition result | PERS | O | ORG | O | LOC | O |
| 待识别文本 Text to be recognized | 小明 (Xiao Ming) | 在 (at) | 北京大学 (Peking University) | 的 (of) | 图书馆 (library) | 学习 (studies) |
S6,比对命名实体自动标注结果与预先获得的命名实体人工标注结果是否相同,若相同,认为第一待识别文本的命名实体自动标注结果是准确的,则将命名实体自动标注结果作为第一待识别文本的目标命名实体标注结果;若不相同,则认为第一待识别文本的命名实体自动标注结果可能是错误的,则输出第一人工审核通知,工作人员接收到通知后进行审核,并返回第一待识别文本的目标命名实体标注结果,记为第一目标命名实体标注结果,从而可以接收到响应第一人工审核通知的第一目标命名实体标注结果。S6. Compare whether the automatic named entity annotation result is the same as the pre-obtained manual named entity annotation result. If they are the same, the automatic annotation result of the first text to be recognized is considered accurate and is taken as the target named entity annotation result of the first text to be recognized. If they are not the same, the automatic annotation result of the first text to be recognized may be wrong, so a first manual review notification is output; after receiving the notification, a reviewer performs the review and returns the target named entity annotation result of the first text to be recognized, recorded as the first target named entity annotation result, so that the first target named entity annotation result in response to the first manual review notification can be received.
S7,比对第一待识别文本的命名实体自动标注结果与接收到的第一目标命名实体标注结果是否相同,若相同,流程结束,若不相同,认为第一待识别文本的命名实体自动标注结果是错误的,则将第一待识别文本及第一待识别文本对应的第一目标命名实体标注结果加入初始样本数据集中,以在初始样本数据集中的训练文本达到预设数量时,根据训练文本达到预设数量的初始样本数据集对命名实体识别模型进行重新训练,从而提高了模型的准确度。S7. Compare whether the automatic named entity annotation result of the first text to be recognized is the same as the received first target named entity annotation result. If they are the same, the process ends; if they are not the same, the automatic annotation result of the first text to be recognized is considered wrong, and the first text to be recognized together with its corresponding first target named entity annotation result is added to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on that data set, thereby improving the accuracy of the model.
回到步骤S2,当判断得到初始样本数据集中的训练文本数量未达到预设阈值时,则执行以下操作:Returning to step S2, when it is determined that the number of training texts in the initial sample data set does not reach the preset threshold, the following operations are performed:
S21,对初始样本数据集中的训练文本的语句顺序进行N次打乱处理,生成N个不同的新样本数据集,其中N取正整数。可以理解,一个训练文本的语句打乱后,可以得到一个新的训练文本,初始样本数据集中的所有训练文本的语句打乱后,即可得到一个新的样本数据集,随机打乱N次可得到N个不同的新样本数据集。S21. Shuffle the sentence order of the training texts in the initial sample data set N times to generate N different new sample data sets, where N is a positive integer. It can be understood that shuffling the sentences of one training text yields a new training text; shuffling the sentences of all training texts in the initial sample data set yields a new sample data set; and randomly shuffling N times yields N different new sample data sets.
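Step S21 can be sketched as follows. This is an illustrative helper (the name and the seeded random generator are assumptions): each training text is represented as a list of sentences, each sentence carrying its own labels so annotations stay aligned after shuffling.

```python
import random

def make_shuffled_datasets(dataset, n, seed=0):
    """Generate n new sample data sets by shuffling the sentence order of
    every training text in the initial data set.

    dataset: list of training texts; each text is a list of labeled
    sentences. Each of the n rounds re-shuffles every text independently.
    """
    rng = random.Random(seed)
    return [[rng.sample(text, len(text)) for text in dataset]
            for _ in range(n)]
```

Note that shuffling only reorders sentences; no sentence is added, dropped, or altered, so each new data set contains exactly the same labeled material as the original.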
S22,根据初始样本数据集对前述预设的命名实体识别模型进行训练,并根据所述N个不同的新样本数据集分别对前述预设的命名实体识别模型进行训练,从而得到N+1个训练后的命名实体识别模型。可以理解,根据初始样本数据集可以训练得到一个命名实体识别模型,根据一个新样本数据集也可以训练得到一个命名实体识别模型,则根据初始样本数据集和N个新样本数据集可以训练得到N+1个命名实体识别模型。S22. Train the aforementioned preset named entity recognition model on the initial sample data set, and train the preset named entity recognition model separately on each of the N different new sample data sets, thereby obtaining N+1 trained named entity recognition models. It can be understood that one named entity recognition model can be trained on the initial sample data set and one on each new sample data set, so the initial sample data set and the N new sample data sets together yield N+1 named entity recognition models.
S23,接收第二待识别文本,并对第二待识别文本进行预处理。其中,对第二待识别文本的预处理过程与对第一待识别文本的预处理过程是相同的,故在此不再赘述。S23: Receive the second text to be recognized, and preprocess the second text to be recognized. Among them, the preprocessing process of the second text to be recognized is the same as the preprocessing process of the first text to be recognized, so it will not be repeated here.
S24,利用前述训练得到的N+1个命名实体识别模型分别对预处理后的第二待识别文本进行处理。可以理解,利用一个命名实体识别模型对第二待识别文本进行处理,可以得到一个命名实体自动标注结果,则利用N+1个命名实体识别模型对第二待识别文本进行处理,可以得到第二待识别文本对应的N+1个命名实体自动标注结果。S24. Use the aforementioned N+1 trained named entity recognition models to separately process the preprocessed second text to be recognized. It can be understood that processing the second text to be recognized with one named entity recognition model yields one automatic named entity annotation result, so processing it with N+1 named entity recognition models yields N+1 automatic named entity annotation results corresponding to the second text to be recognized.
S25,比对前述N+1个命名实体标注结果是否相同,若相同,则认为相同的命名实体自动标注结果是正确的,将相同的命名实体自动标注结果作为第二待识别文本的目标命名实体标注结果,若不相同,则输出第二人工审核通知,工作人员接收到通知后进行审核,并返回第二待识别文本的目标命名实体标注结果,记为第二目标命名实体标注结果,从而可以接收到响应第二人工审核通知的第二目标命名实体标注结果;S25. Compare whether the aforementioned N+1 automatic named entity annotation results are the same. If they are the same, the shared automatic annotation result is considered correct and is taken as the target named entity annotation result of the second text to be recognized. If they are not the same, a second manual review notification is output; after receiving the notification, a reviewer performs the review and returns the target named entity annotation result of the second text to be recognized, recorded as the second target named entity annotation result, so that the second target named entity annotation result in response to the second manual review notification can be received;
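The unanimity check in step S25 can be sketched as follows; `check_ensemble_agreement` is a hypothetical helper name, and each result is represented as a list of per-word labels.

```python
def check_ensemble_agreement(results):
    """Check whether the N+1 automatic annotation results are all identical.

    Returns (True, shared_result) when every model produced the same
    labeling, so it can be used as the target annotation; returns
    (False, None) when any model disagrees, signaling that a manual
    review notification should be issued instead.
    """
    first = results[0]
    if all(r == first for r in results[1:]):
        return True, first
    return False, None
```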
S26,将第二待识别文本及第二待识别文本对应的第二目标命名实体标注结果加入初始样本数据集中,直到初始样本数据集中的训练文本数量达到预设阈值,而后执行步骤S3,从而提高模型的准确度,进而提高命名实体识别的准确性。S26. Add the second text to be recognized and its corresponding second target named entity annotation result to the initial sample data set until the number of training texts in the initial sample data set reaches the preset threshold, and then perform step S3, thereby improving the accuracy of the model and, in turn, the accuracy of named entity recognition.
需要说明的是,对于本实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本申请所必须的。It should be noted that, for simplicity of description, this embodiment is expressed as a series of action combinations; however, those skilled in the art should know that this application is not limited by the described order of actions, because according to this application, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by this application.
实施例二Example two
本实施例提供一种命名实体识别装置10,如图3所示,包括:This embodiment provides a named entity recognition device 10, as shown in FIG. 3, including:
初始样本数据集获取模块101,用于从样本数据库中获取初始样本数据集,该初始样本数据集中包含多个训练文本以及各训练文本对应的命名实体标注结果;The initial sample data set acquisition module 101 is configured to acquire an initial sample data set from a sample database, and the initial sample data set contains multiple training texts and a named entity annotation result corresponding to each training text;
判断模块102,用于判断初始样本数据集中的训练文本数量是否达到预设阈值:The judging module 102 is used to judge whether the number of training texts in the initial sample data set reaches a preset threshold:
第一模型训练模块103,用于在判断模块的结果为是时,根据初始样本数据集对预设的命名实体识别模型进行训练;The first model training module 103 is configured to train the preset named entity recognition model according to the initial sample data set when the result of the judgment module is yes;
第一文本接收模块104,用于接收第一待识别文本,并对第一待识别文本进行预处理;The first text receiving module 104 is configured to receive the first text to be recognized and preprocess the first text to be recognized;
第一模型处理模块105,用于利用训练得到的命名实体识别模型对预处理后的第一待识别文本进行处理,得到所述第一待识别文本的命名实体自动标注结果;The first model processing module 105 is configured to use the trained named entity recognition model to process the preprocessed first to-be-recognized text to obtain a named entity automatic labeling result of the first to-be-recognized text;
第一比对模块106,用于比对命名实体自动标注结果与预先获得的命名实体人工标注结果是否相同,若相同,则将命名实体自动标注结果作为第一待识别文本的目标命名实体标注结果,若不相同,则输出第一人工审核通知,并接收响应第一人工审核通知的第一目标命名实体标注结果;The first comparison module 106 is configured to compare whether the automatic named entity annotation result is the same as the pre-obtained manual named entity annotation result; if they are the same, use the automatic annotation result as the target named entity annotation result of the first text to be recognized; if they are not the same, output a first manual review notification and receive a first target named entity annotation result in response to the first manual review notification;
第二比对模块107,用于比对命名实体自动标注结果与接收到的第一目标命名实体标注结果是否相同;The second comparison module 107 is configured to compare whether the automatic labeling result of the named entity is the same as the received labeling result of the first target named entity;
第一样本增加模块108,用于在命名实体自动标注结果与接收到的第一目标命名实体标注结果不相同时,将第一待识别文本及第一待识别文本对应的第一目标命名实体标注结果加入初始样本数据集中,以便在初始样本数据集中的训练文本达到预设数量时,根据训练文本达到预设数量的初始样本数据集对命名实体识别模型进行重新训练。The first sample adding module 108 is configured to, when the automatic named entity annotation result is not the same as the received first target named entity annotation result, add the first text to be recognized and its corresponding first target named entity annotation result to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
在本申请一个实施例中,命名实体识别装置10还包括:In an embodiment of the present application, the named entity recognition device 10 further includes:
新样本数据集获取模块109,用于在判断模块的判断结果为否时,对初始样本数据集中的训练文本的语句顺序进行N次打乱处理,生成N个不同的新样本数据集,其中N取正整数;The new sample data set acquisition module 109 is configured to, when the judgment result of the judgment module is no, shuffle the sentence order of the training texts in the initial sample data set N times to generate N different new sample data sets, where N is a positive integer;
第二模型训练模块110,根据初始样本数据集对所述预设的命名实体识别模型进行训练,并根据N个不同的新样本数据集分别对所述预设的命名实体识别模型进行训练,得到N+1个训练后的命名实体识别模型;The second model training module 110 trains the preset named entity recognition model according to the initial sample data set, and trains the preset named entity recognition model respectively according to N different new sample data sets, to obtain N+1 trained named entity recognition models;
第二文本接收模块111,用于接收第二待识别文本,并对第二待识别文本进行预处理;The second text receiving module 111 is configured to receive the second text to be recognized and preprocess the second text to be recognized;
第二模型处理模块112,用于利用所述N+1个训练后的命名实体识别模型分别对预处理后的第二待识别文本进行处理,得到第二待识别文本对应的N+1个命名实体自动标注结果;The second model processing module 112 is configured to use the N+1 trained named entity recognition models to respectively process the preprocessed second to-be-recognized text to obtain N+1 names corresponding to the second to-be-recognized text Entity automatically annotates the result;
第三比对模块113,用于比对N+1个命名实体标注结果是否相同,若相同,则将相同的命名实体自动标注结果作为第二待识别文本的目标命名实体标注结果,若不相同,则输出第二人工审核通知,并接收响应第二人工审核通知的第二目标命名实体标注结果;The third comparison module 113 is used to compare whether the labeling results of N+1 named entities are the same, if they are the same, the automatic labeling result of the same named entity is used as the target named entity labeling result of the second text to be recognized, if they are not the same , Output the second manual review notification, and receive the second target named entity marking result in response to the second manual review notification;
第二样本增加模块114,用于将第二待识别文本及第二待识别文本对应的第二目标命名实体标注结果加入初始样本数据集中,直到初始样本数据集中的训练文本数量达到预设阈值。The second sample adding module 114 is configured to add the second to-be-recognized text and the second target named entity annotation result corresponding to the second to-be-recognized text into the initial sample data set until the number of training texts in the initial sample data set reaches a preset threshold.
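The shuffling and unanimity check performed by modules 109 and 113 can be sketched in plain Python; all function and variable names below are illustrative, not taken from the application:

```python
import random

def make_shuffled_datasets(sentences, n, seed=0):
    """Generate N new sample data sets by shuffling the sentence order
    of the initial training texts (the role of module 109)."""
    rng = random.Random(seed)
    datasets = []
    for _ in range(n):
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        datasets.append(shuffled)
    return datasets

def resolve_by_unanimity(results):
    """Compare the N+1 annotation results (the role of module 113): if all
    models agree, accept the shared result; otherwise flag the text for a
    second manual review."""
    first = results[0]
    if all(r == first for r in results[1:]):
        return first, False   # accepted automatically
    return None, True         # second manual review notification needed
```

A text flagged for review would then, together with its reviewed annotation, be added to the initial sample data set by module 114 until the preset threshold is reached.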
In an embodiment of the present application, the first model training module is specifically configured to:
divide the initial sample data set into a training set, a validation set, and a test set;
train the named entity recognition model on the training set;
validate the trained named entity recognition model on the validation set;
test the validated named entity recognition model on the test set, and if the test succeeds, end the training.
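A minimal sketch of the splitting step above, with assumed 80/10/10 ratios (the application does not specify the proportions):

```python
import random

def split_dataset(samples, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Divide the initial sample data set into training, validation, and
    test partitions; the ratios are illustrative assumptions."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```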
In an embodiment of the present application, the named entity recognition model includes a BERT layer and a CRF layer.
In an embodiment of the present application, the first model processing module is specifically configured to:
process the input sequence corresponding to the first text to be recognized with the BERT layer, obtaining a text feature sequence corresponding to the first text to be recognized;
process the text feature sequence of the first text to be recognized with the CRF layer, obtaining the automatic named entity annotation result of the first text to be recognized.
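The CRF layer's decoding step (choosing the highest-scoring tag path over the feature sequence emitted by the BERT layer) can be illustrated with a plain-Python Viterbi decoder; the emission and transition scores here are toy stand-ins for the learned model parameters:

```python
def viterbi_decode(emissions, transitions):
    """emissions: one {tag: score} dict per token (a stand-in for the text
    feature sequence produced by the BERT layer); transitions:
    {(prev_tag, tag): score} as learned by the CRF layer.
    Returns the highest-scoring tag sequence."""
    tags = list(emissions[0])
    score = {t: emissions[0][t] for t in tags}
    backpointers = []
    for emit in emissions[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = score[best_prev] + transitions[(best_prev, t)] + emit[t]
            ptr[t] = best_prev
        score = new_score
        backpointers.append(ptr)
    # backtrack from the best final tag to recover the full path
    best_last = max(tags, key=lambda t: score[t])
    path = [best_last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

In the full model the transition table is what lets the CRF layer forbid invalid tag sequences (e.g. an `I-` tag with no preceding `B-` tag) rather than labeling each token independently.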
In an embodiment of the present application, the initial sample data set acquisition module is specifically configured to:
obtain the initial sample data set from the sample database, the initial sample data set containing a plurality of training texts in which named entities of different categories have been preset to different font styles;
obtain the named entity annotation result corresponding to each training text according to the font style of each word in the training text.
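This font-style-to-label extraction can be sketched as follows; the style names and the style-to-category mapping are hypothetical, since the application does not name concrete styles:

```python
def labels_from_styles(styled_words, style_to_category):
    """styled_words: list of (word, font_style) pairs from one training text.
    A word whose style maps to an entity category receives that label;
    every other word is tagged 'O' (outside any entity)."""
    return [(word, style_to_category.get(style, "O"))
            for word, style in styled_words]
```

For example, with an assumed mapping such as `{"bold": "PER", "italic": "ORG"}`, bold words become person entities and italic words become organization entities.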
In an embodiment of the present application, the preprocessing is text serialization processing.
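Text serialization here can be read as mapping each character of the text to be recognized onto an integer index for the model's input sequence; a minimal sketch, with an assumed reserved index for unknown characters:

```python
def serialize_text(text, vocab, unk_token="<unk>"):
    """Convert raw text into the integer input sequence expected by the
    recognition model; characters missing from the vocabulary fall back
    to the reserved unknown index (an illustrative convention)."""
    return [vocab.get(ch, vocab[unk_token]) for ch in text]
```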
Those skilled in the art should also be aware that the embodiments described in the specification are all preferred embodiments, and the modules involved are not necessarily required by the present application.
Embodiment Three
The present application further provides a computer device capable of executing programs, such as a smartphone, tablet computer, notebook computer, desktop computer, rack server, blade server, tower server, or cabinet server (including a standalone server or a server cluster composed of multiple servers). The computer device 20 of this embodiment includes at least, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus, as shown in FIG. 4. It should be noted that FIG. 4 shows only the computer device 20 with components 21-22; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
In this embodiment, the memory 21 (i.e., a readable storage medium) includes flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or internal memory of the computer device 20. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 20. Of course, the memory 21 may also include both an internal storage unit of the computer device 20 and an external storage device thereof. In this embodiment, the memory 21 is generally used to store the operating system and various application software installed on the computer device 20, for example the program code of the named entity recognition apparatus 10 of Embodiment Two. In addition, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is used to run the program code or process the data stored in the memory 21, for example to run the named entity recognition apparatus 10, so as to implement the following steps of the named entity recognition method of Embodiment One:
obtaining an initial sample data set from a sample database, the initial sample data set containing a plurality of training texts and a named entity annotation result corresponding to each training text;
determining whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, performing the following operations:
training a preset named entity recognition model on the initial sample data set;
receiving a first text to be recognized, and preprocessing the first text to be recognized;
processing the preprocessed first text to be recognized with the trained named entity recognition model to obtain an automatic named entity annotation result of the first text to be recognized;
comparing whether the automatic named entity annotation result is the same as a pre-obtained manual named entity annotation result; if they are the same, using the automatic named entity annotation result as the target named entity annotation result of the first text to be recognized; if they are not the same, outputting a first manual review notification and receiving a first target named entity annotation result responding to the first manual review notification;
comparing whether the automatic named entity annotation result is the same as the received first target named entity annotation result; if they are not the same, adding the first text to be recognized and its corresponding first target named entity annotation result to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
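The two comparison steps and the sample-growth condition above can be sketched as follows (names are illustrative; `request_review` stands for whatever channel delivers the first manual review notification and returns the reviewed result):

```python
def resolve_annotation(auto_result, manual_result, request_review):
    """First comparison: check the automatic annotation against the
    pre-obtained manual one. Agreement -> accept automatically;
    disagreement -> issue a manual review notification and take the
    reviewed result as the target annotation."""
    if auto_result == manual_result:
        return auto_result
    return request_review()

def maybe_add_sample(text, auto_result, target_result, dataset,
                     preset_count, retrain):
    """Second comparison: if the reviewed target result still differs from
    the automatic one, the text is a useful new training sample; retrain
    once the data set reaches the preset number of training texts."""
    if target_result != auto_result:
        dataset.append((text, target_result))
        if len(dataset) >= preset_count:
            retrain(dataset)
```

The effect is a feedback loop: only texts the model got wrong are added back, so each retraining concentrates on the model's current failure cases.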
Embodiment Four
The present application further provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an app store, and the like, on which a computer program is stored; when the program is executed by a processor, the corresponding functions are implemented. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium of this embodiment is used to store the named entity recognition apparatus 10, which, when executed by a processor, implements the following steps of the named entity recognition method of Embodiment One:
obtaining an initial sample data set from a sample database, the initial sample data set containing a plurality of training texts and a named entity annotation result corresponding to each training text;
determining whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, performing the following operations:
training a preset named entity recognition model on the initial sample data set;
receiving a first text to be recognized, and preprocessing the first text to be recognized;
processing the preprocessed first text to be recognized with the trained named entity recognition model to obtain an automatic named entity annotation result of the first text to be recognized;
comparing whether the automatic named entity annotation result is the same as a pre-obtained manual named entity annotation result; if they are the same, using the automatic named entity annotation result as the target named entity annotation result of the first text to be recognized; if they are not the same, outputting a first manual review notification and receiving a first target named entity annotation result responding to the first manual review notification;
comparing whether the automatic named entity annotation result is the same as the received first target named entity annotation result; if they are not the same, adding the first text to be recognized and its corresponding first target named entity annotation result to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.

Claims (20)

  1. A named entity recognition method, wherein the method comprises the following steps:
    obtaining an initial sample data set from a sample database, the initial sample data set containing a plurality of training texts and a named entity annotation result corresponding to each training text;
    determining whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, performing the following operations:
    training a preset named entity recognition model on the initial sample data set;
    receiving a first text to be recognized, and preprocessing the first text to be recognized;
    processing the preprocessed first text to be recognized with the trained named entity recognition model to obtain an automatic named entity annotation result of the first text to be recognized;
    comparing whether the automatic named entity annotation result is the same as a pre-obtained manual named entity annotation result; if they are the same, using the automatic named entity annotation result as the target named entity annotation result of the first text to be recognized; if they are not the same, outputting a first manual review notification and receiving a first target named entity annotation result responding to the first manual review notification;
    comparing whether the automatic named entity annotation result is the same as the received first target named entity annotation result; if they are not the same, adding the first text to be recognized and its corresponding first target named entity annotation result to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
  2. The named entity recognition method according to claim 1, wherein, when it is determined that the number of training texts in the initial sample data set has not reached the preset threshold, the following operations are performed:
    shuffling the sentence order of the training texts in the initial sample data set N times to generate N different new sample data sets, where N is a positive integer;
    training the preset named entity recognition model on the initial sample data set, and training the preset named entity recognition model on each of the N different new sample data sets, to obtain N+1 trained named entity recognition models;
    receiving a second text to be recognized, and preprocessing the second text to be recognized;
    processing the preprocessed second text to be recognized with each of the N+1 trained named entity recognition models to obtain N+1 automatic named entity annotation results corresponding to the second text to be recognized;
    comparing whether the N+1 named entity annotation results are the same; if they are the same, using the shared automatic named entity annotation result as the target named entity annotation result of the second text to be recognized; if they are not the same, outputting a second manual review notification and receiving a second target named entity annotation result responding to the second manual review notification;
    adding the second text to be recognized and its corresponding second target named entity annotation result to the initial sample data set until the number of training texts in the initial sample data set reaches the preset threshold.
  3. The named entity recognition method according to claim 1, wherein the step of training the named entity recognition model on the initial sample data set comprises:
    dividing the initial sample data set into a training set, a validation set, and a test set;
    training the named entity recognition model on the training set;
    validating the trained named entity recognition model on the validation set;
    testing the validated named entity recognition model on the test set, and if the test succeeds, ending the training.
  4. The named entity recognition method according to claim 1, wherein the named entity recognition model includes a BERT layer and a CRF layer.
  5. The named entity recognition method according to claim 4, wherein the step of processing the preprocessed first text to be recognized with the trained named entity recognition model to obtain the automatic named entity annotation result of the first text to be recognized comprises:
    processing the input sequence corresponding to the first text to be recognized with the BERT layer to obtain a text feature sequence of the first text to be recognized;
    processing the text feature sequence of the first text to be recognized with the CRF layer to obtain the automatic named entity annotation result of the first text to be recognized.
  6. The named entity recognition method according to claim 1, wherein the step of obtaining the initial sample data set from the sample database specifically comprises:
    obtaining the initial sample data set from the sample database, the initial sample data set containing a plurality of training texts in which named entities of different categories have been preset to different font styles;
    obtaining the named entity annotation result corresponding to each training text according to the font style of each word in the training text.
  7. The named entity recognition method according to claim 1, wherein the step of preprocessing the first text to be recognized comprises:
    performing text serialization processing on the first text to be recognized.
  8. A named entity recognition apparatus, wherein the apparatus comprises:
    an initial sample data set acquisition module, configured to obtain an initial sample data set from a sample database, the initial sample data set containing a plurality of training texts and a named entity annotation result corresponding to each training text;
    a determination module, configured to determine whether the number of training texts in the initial sample data set reaches a preset threshold;
    a first model training module, configured to train a preset named entity recognition model on the initial sample data set when the result of the determination module is yes;
    a first text receiving module, configured to receive a first text to be recognized and preprocess the first text to be recognized;
    a first model processing module, configured to process the preprocessed first text to be recognized with the trained named entity recognition model to obtain an automatic named entity annotation result of the first text to be recognized;
    a first comparison module, configured to compare whether the automatic named entity annotation result is the same as a pre-obtained manual named entity annotation result; if they are the same, use the automatic named entity annotation result as the target named entity annotation result of the first text to be recognized; if they are not the same, output a first manual review notification and receive a first target named entity annotation result responding to the first manual review notification;
    a second comparison module, configured to compare whether the automatic named entity annotation result is the same as the received first target named entity annotation result;
    a first sample adding module, configured to, when the automatic named entity annotation result differs from the received first target named entity annotation result, add the first text to be recognized and its corresponding first target named entity annotation result to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
  9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps of a named entity recognition method:
    obtaining an initial sample data set from a sample database, the initial sample data set containing a plurality of training texts and a named entity annotation result corresponding to each training text;
    determining whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, performing the following operations:
    training a preset named entity recognition model on the initial sample data set;
    receiving a first text to be recognized, and preprocessing the first text to be recognized;
    processing the preprocessed first text to be recognized with the trained named entity recognition model to obtain an automatic named entity annotation result of the first text to be recognized;
    comparing whether the automatic named entity annotation result is the same as a pre-obtained manual named entity annotation result; if they are the same, using the automatic named entity annotation result as the target named entity annotation result of the first text to be recognized; if they are not the same, outputting a first manual review notification and receiving a first target named entity annotation result responding to the first manual review notification;
    comparing whether the automatic named entity annotation result is the same as the received first target named entity annotation result; if they are not the same, adding the first text to be recognized and its corresponding first target named entity annotation result to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained on the initial sample data set whose training texts have reached the preset number.
  10. The computer device according to claim 9, wherein, when it is determined that the number of training texts in the initial sample data set has not reached the preset threshold, the following operations are performed:
    shuffling the sentence order of the training texts in the initial sample data set N times to generate N different new sample data sets, where N is a positive integer;
    training the preset named entity recognition model on the initial sample data set, and training the preset named entity recognition model on each of the N different new sample data sets, to obtain N+1 trained named entity recognition models;
    receiving a second text to be recognized, and preprocessing the second text to be recognized;
    processing the preprocessed second text to be recognized with each of the N+1 trained named entity recognition models to obtain N+1 automatic named entity annotation results corresponding to the second text to be recognized;
    comparing whether the N+1 named entity annotation results are the same; if they are the same, using the shared automatic named entity annotation result as the target named entity annotation result of the second text to be recognized; if they are not the same, outputting a second manual review notification and receiving a second target named entity annotation result responding to the second manual review notification;
    adding the second text to be recognized and its corresponding second target named entity annotation result to the initial sample data set until the number of training texts in the initial sample data set reaches the preset threshold.
  11. The computer device according to claim 9, wherein the step of training the named entity recognition model on the initial sample data set comprises:
    dividing the initial sample data set into a training set, a validation set, and a test set;
    training the named entity recognition model on the training set;
    validating the trained named entity recognition model on the validation set;
    testing the validated named entity recognition model on the test set, and if the test succeeds, ending the training.
  12. The computer device according to claim 9, wherein the named entity recognition model includes a BERT layer and a CRF layer.
  13. The computer device according to claim 12, wherein the step of processing the preprocessed first text to be recognized with the trained named entity recognition model to obtain the automatic named entity annotation result of the first text to be recognized comprises:
    processing the input sequence corresponding to the first text to be recognized with the BERT layer to obtain a text feature sequence of the first text to be recognized;
    processing the text feature sequence of the first text to be recognized with the CRF layer to obtain the automatic named entity annotation result of the first text to be recognized.
  14. The computer device according to claim 9, wherein the step of obtaining the initial sample data set from the sample database specifically comprises:
    obtaining the initial sample data set from the sample database, the initial sample data set containing a plurality of training texts in which named entities of different categories have been preset to different font styles;
    obtaining the named entity annotation result corresponding to each training text according to the font style of each word in the training text.
  15. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the following steps of a named entity recognition method:
    obtaining an initial sample data set from a sample database, the initial sample data set containing a plurality of training texts and a named entity labeling result corresponding to each training text;
    determining whether the number of training texts in the initial sample data set reaches a preset threshold, and if so, performing the following operations:
    training a preset named entity recognition model according to the initial sample data set;
    receiving a first text to be recognized, and preprocessing the first text to be recognized;
    processing the preprocessed first text to be recognized by using the trained named entity recognition model, to obtain a named entity automatic labeling result of the first text to be recognized;
    comparing whether the named entity automatic labeling result is identical to a previously obtained named entity manual labeling result; if identical, taking the named entity automatic labeling result as a target named entity labeling result of the first text to be recognized; if not identical, outputting a first manual review notification, and receiving a first target named entity labeling result in response to the first manual review notification;
    comparing whether the named entity automatic labeling result is identical to the received first target named entity labeling result; if not identical, adding the first text to be recognized and the first target named entity labeling result corresponding to the first text to be recognized to the initial sample data set, so that when the training texts in the initial sample data set reach a preset number, the named entity recognition model is retrained according to the initial sample data set whose training texts have reached the preset number.
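The comparison-and-retraining loop recited in claim 15 can be sketched as follows. This is an illustrative sketch only, not part of the claims; the function names, the BIO tags, and the review callback are all invented for the example:

```python
# Hypothetical sketch of the labeling-agreement loop of claim 15.
# All names here are invented; the patent describes the logic, not this API.

def resolve_labels(text, auto_result, manual_result, request_review):
    """Return the target labeling for `text`, asking for human review on mismatch."""
    if auto_result == manual_result:
        return auto_result           # automatic and manual results agree: accept
    return request_review(text)      # mismatch: output a review notification

def maybe_grow_dataset(text, auto_result, target_result, dataset, threshold):
    """Add corrected samples to the pool; report whether retraining is due."""
    if auto_result != target_result:              # the model was wrong here
        dataset.append((text, target_result))     # keep the corrected sample
    return len(dataset) >= threshold              # retrain once the pool is full

# Usage (invented example data)
dataset = []
auto = ["B-ORG", "O"]                  # model's automatic labeling
manual = ["B-ORG", "I-ORG"]            # previously obtained manual labeling
target = resolve_labels("Ping An Tech", auto, manual,
                        request_review=lambda text: ["B-ORG", "I-ORG"])
retrain = maybe_grow_dataset("Ping An Tech", auto, target, dataset, threshold=1)
```

The key design point of the claim is that only texts where the model disagreed with the final (human-confirmed) labeling are fed back into the sample pool.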
  16. The computer-readable storage medium according to claim 15, wherein, when it is determined that the number of training texts in the initial sample data set does not reach the preset threshold, the following operations are performed:
    shuffling the sentence order of the training texts in the initial sample data set N times to generate N different new sample data sets, where N is a positive integer;
    training the preset named entity recognition model according to the initial sample data set, and training the preset named entity recognition model according to each of the N different new sample data sets, to obtain N+1 trained named entity recognition models;
    receiving a second text to be recognized, and preprocessing the second text to be recognized;
    processing the preprocessed second text to be recognized by using each of the N+1 trained named entity recognition models, to obtain N+1 named entity automatic labeling results corresponding to the second text to be recognized;
    comparing whether the N+1 named entity automatic labeling results are identical; if identical, taking the identical named entity automatic labeling result as a target named entity labeling result of the second text to be recognized; if not identical, outputting a second manual review notification, and receiving a second target named entity labeling result in response to the second manual review notification;
    adding the second text to be recognized and the second target named entity labeling result corresponding to the second text to be recognized to the initial sample data set, until the number of training texts in the initial sample data set reaches the preset threshold.
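The shuffle-and-agree strategy of claim 16 (N shuffled copies of the data, N+1 models, accept only unanimous outputs) can be sketched like this; the helper names and the fixed seed are assumptions for the example, not terms from the application:

```python
import random

def make_shuffled_datasets(training_texts, n, seed=0):
    """Shuffle the sentence order N times; return the original plus N new datasets."""
    rng = random.Random(seed)          # fixed seed only to keep the sketch reproducible
    datasets = [list(training_texts)]  # the original ordering
    for _ in range(n):
        shuffled = list(training_texts)
        rng.shuffle(shuffled)
        datasets.append(shuffled)
    return datasets                    # N + 1 datasets in total

def agree(results):
    """True when all N+1 model outputs for a text are identical, per claim 16."""
    return all(r == results[0] for r in results[1:])

# Usage (invented data): 2 shuffles -> 3 datasets -> would train 3 models
datasets = make_shuffled_datasets(["sent 1", "sent 2", "sent 3"], n=2)
```

When the models disagree, the text goes to manual review and the confirmed labeling is added to the sample pool, so the small initial dataset grows from exactly the cases the ensemble finds hard.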
  17. The computer-readable storage medium according to claim 15, wherein the step of training the named entity recognition model according to the initial sample data set comprises:
    dividing the initial sample data set into a training set, a validation set, and a test set;
    training the named entity recognition model according to the training set;
    validating the trained named entity recognition model according to the validation set;
    testing the validated named entity recognition model according to the test set, and ending the training if the test succeeds.
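A minimal sketch of the three-way split in claim 17; the 80/10/10 ratio is an assumption for illustration, as the application does not specify the proportions:

```python
def split_dataset(samples, train_frac=0.8, val_frac=0.1):
    """Divide a sample list into training, validation, and test sets.

    The fractions are assumed values; the remainder after the training and
    validation slices becomes the test set, so no sample is lost.
    """
    n = len(samples)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

# Usage: 10 samples -> 8 / 1 / 1
train, val, test = split_dataset(list(range(10)))
```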
  18. The computer-readable storage medium according to claim 15, wherein the named entity recognition model comprises a BERT layer and a CRF layer.
  19. The computer-readable storage medium according to claim 18, wherein the step of processing the preprocessed first text to be recognized by using the trained named entity recognition model to obtain the named entity automatic labeling result of the first text to be recognized comprises:
    processing the input sequence corresponding to the first text to be recognized by using the BERT layer, to obtain a text feature sequence of the first text to be recognized;
    processing the text feature sequence of the first text to be recognized by using the CRF layer, to obtain the named entity automatic labeling result of the first text to be recognized.
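The CRF decoding step in claim 19 is commonly implemented with Viterbi decoding over per-token tag scores. A minimal pure-Python sketch, assuming emission scores have already been produced from the BERT feature sequence (the scores and transition values below are invented toy numbers, not from the application):

```python
def viterbi_decode(emissions, transitions, tags):
    """CRF decoding: return the highest-scoring tag path.

    `emissions[i][t]` scores tag t at token i (from the BERT features);
    `transitions[p][t]` scores moving from tag p to tag t.
    """
    n_tags = len(tags)
    score = list(emissions[0])   # best path score ending in each tag, token 0
    back = []                    # back-pointers for path recovery
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for t in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda p: score[p] + transitions[p][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + emit[t])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # follow the back-pointers from the best final tag
    best = max(range(n_tags), key=lambda t: score[t])
    path = [best]
    for pointers in reversed(back):
        best = pointers[best]
        path.append(best)
    return [tags[t] for t in reversed(path)]

# Usage: the transition scores let the CRF forbid invalid sequences like O -> I-ORG
tags = ["O", "B-ORG", "I-ORG"]
emissions = [[0.1, 2.0, 0.0],   # token 1: strongly B-ORG
             [0.5, 0.0, 1.5]]   # token 2: I-ORG, if a valid transition leads there
transitions = [[0, 0, -5],      # from O: O -> I-ORG heavily penalized
               [0, 0, 1],       # from B-ORG: continuing the entity is encouraged
               [0, 0, 1]]       # from I-ORG: continuing the entity is encouraged
path = viterbi_decode(emissions, transitions, tags)
```

This transition structure is the practical reason for putting a CRF on top of BERT: the per-token classifier alone cannot enforce that an I- tag only follows a matching B- or I- tag.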
  20. The computer-readable storage medium according to claim 15, wherein the step of obtaining the initial sample data set from the sample database comprises:
    obtaining the initial sample data set from the sample database, wherein the initial sample data set contains a plurality of training texts, and named entities of different categories in each training text have been preset to different font styles;
    obtaining, according to the font style of each word in each training text, the named entity labeling result corresponding to each training text.
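The font-style labeling of claims 14 and 20 can be sketched as a mapping from per-word styles to BIO entity labels. The style-to-category mapping and all data below are invented for illustration; the application does not specify which styles encode which categories:

```python
# Hypothetical mapping: annotators pre-set ORG mentions in bold, PER in italic.
STYLE_TO_CATEGORY = {"bold": "ORG", "italic": "PER"}

def labels_from_styles(words_with_styles):
    """Turn (word, font_style) pairs into BIO named-entity labels."""
    labels, prev_cat = [], None
    for _, style in words_with_styles:
        cat = STYLE_TO_CATEGORY.get(style)
        if cat is None:
            labels.append("O")            # unstyled word: outside any entity
        elif cat == prev_cat:
            labels.append("I-" + cat)     # same style continues: inside the entity
        else:
            labels.append("B-" + cat)     # style change: a new entity begins
        prev_cat = cat
    return labels

# Usage (invented example)
words = [("Ping", "bold"), ("An", "bold"), ("said", "regular"), ("Zhang", "italic")]
```

The appeal of this scheme is that annotators only apply formatting in an ordinary editor, and the labeling results are recovered mechanically from the styles.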
PCT/CN2020/112303 2019-09-04 2020-08-29 Method and apparatus for recognizing named entity, computer device, and storage medium WO2021043085A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910832541.3 2019-09-04
CN201910832541.3A CN110704633B (en) Named entity recognition method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021043085A1 true WO2021043085A1 (en) 2021-03-11

Family

ID=69194309

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112303 WO2021043085A1 (en) 2019-09-04 2020-08-29 Method and apparatus for recognizing named entity, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110704633B (en)
WO (1) WO2021043085A1 (en)


Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704633B (en) * 2019-09-04 2023-07-21 平安科技(深圳)有限公司 Named entity recognition method, named entity recognition device, named entity recognition computer equipment and named entity recognition storage medium
CN111259134B (en) * 2020-01-19 2023-08-08 出门问问信息科技有限公司 Entity identification method, equipment and computer readable storage medium
CN111444718A (en) * 2020-03-12 2020-07-24 泰康保险集团股份有限公司 Insurance product demand document processing method and device and electronic equipment
CN111414950B (en) * 2020-03-13 2023-08-18 天津美腾科技股份有限公司 Ore picture labeling method and system based on labeling person professional management
CN111597813A (en) * 2020-05-21 2020-08-28 上海创蓝文化传播有限公司 Method and device for extracting text abstract of short message based on named entity identification
CN111522958A (en) * 2020-05-28 2020-08-11 泰康保险集团股份有限公司 Text classification method and device
CN111738004B (en) * 2020-06-16 2023-10-27 中国科学院计算技术研究所 Named entity recognition model training method and named entity recognition method
CN111797629B (en) * 2020-06-23 2022-07-29 平安医疗健康管理股份有限公司 Method and device for processing medical text data, computer equipment and storage medium
CN111881296A (en) * 2020-07-31 2020-11-03 深圳市万物云科技有限公司 Work order processing method based on community scene and related components
CN112257441B (en) * 2020-09-15 2024-04-05 浙江大学 Named entity recognition enhancement method based on counterfactual generation
CN112487817A (en) * 2020-12-14 2021-03-12 北京明略软件***有限公司 Named entity recognition model training method, sample labeling method, device and equipment
CN112633002A (en) * 2020-12-29 2021-04-09 上海明略人工智能(集团)有限公司 Sample labeling method, model training method, named entity recognition method and device
CN112765985B (en) * 2021-01-13 2023-10-27 中国科学技术信息研究所 Named entity identification method for patent embodiments in specific fields
CN112686047B (en) * 2021-01-21 2024-03-29 北京云上曲率科技有限公司 Sensitive text recognition method, device and system based on named entity recognition
CN112818691A (en) * 2021-02-01 2021-05-18 北京金山数字娱乐科技有限公司 Named entity recognition model training method and device
CN113064992A (en) * 2021-03-22 2021-07-02 平安银行股份有限公司 Complaint work order structured processing method, device, equipment and storage medium
CN112906349A (en) * 2021-03-30 2021-06-04 苏州大学 Data annotation method, system, equipment and readable storage medium
CN113807096A (en) * 2021-04-09 2021-12-17 京东科技控股股份有限公司 Text data processing method and device, computer equipment and storage medium
CN113221576B (en) * 2021-06-01 2023-01-13 复旦大学 Named entity identification method based on sequence-to-sequence architecture
CN113449632B (en) * 2021-06-28 2023-04-07 重庆长安汽车股份有限公司 Vision and radar perception algorithm optimization method and system based on fusion perception and automobile
CN113779065A (en) * 2021-08-23 2021-12-10 深圳价值在线信息科技股份有限公司 Verification method and device for data comparison, terminal equipment and medium
CN114912455B (en) * 2022-07-12 2022-09-30 共道网络科技有限公司 Named entity identification method and device
CN115640808B (en) * 2022-12-05 2023-03-21 苏州浪潮智能科技有限公司 Text labeling method and device, electronic equipment and readable storage medium
CN117077679B (en) * 2023-10-16 2024-03-12 之江实验室 Named entity recognition method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187642A1 (en) * 2002-03-29 2003-10-02 International Business Machines Corporation System and method for the automatic discovery of salient segments in speech transcripts
CN109101481A (en) * 2018-06-25 2018-12-28 北京奇艺世纪科技有限公司 A kind of name entity recognition method, device and electronic equipment
CN109885825A (en) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Name entity recognition method, device and computer equipment based on attention mechanism
CN110704633A (en) * 2019-09-04 2020-01-17 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033950A (en) * 2010-12-23 2011-04-27 哈尔滨工业大学 Construction method and identification method of automatic electronic product named entity identification system
CN109241520B (en) * 2018-07-18 2023-05-23 五邑大学 Sentence trunk analysis method and system based on multi-layer error feedback neural network for word segmentation and named entity recognition
CN109145303B (en) * 2018-09-06 2023-04-18 腾讯科技(深圳)有限公司 Named entity recognition method, device, medium and equipment
CN109543181B (en) * 2018-11-09 2023-01-31 中译语通科技股份有限公司 Named entity model and system based on combination of active learning and deep learning

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906375B (en) * 2021-03-24 2024-05-14 平安科技(深圳)有限公司 Text data labeling method, device, equipment and storage medium
CN112906375A (en) * 2021-03-24 2021-06-04 平安科技(深圳)有限公司 Text data labeling method, device, equipment and storage medium
CN113723102A (en) * 2021-06-30 2021-11-30 平安国际智慧城市科技股份有限公司 Named entity recognition method and device, electronic equipment and storage medium
CN113723102B (en) * 2021-06-30 2024-04-26 平安国际智慧城市科技股份有限公司 Named entity recognition method, named entity recognition device, electronic equipment and storage medium
CN113849597A (en) * 2021-08-31 2021-12-28 艾迪恩(山东)科技有限公司 Illegal advertising word detection method based on named entity recognition
CN113849597B (en) * 2021-08-31 2024-04-30 艾迪恩(山东)科技有限公司 Illegal advertisement word detection method based on named entity recognition
CN113762132A (en) * 2021-09-01 2021-12-07 国网浙江省电力有限公司金华供电公司 Automatic classification and automatic naming system for unmanned aerial vehicle inspection image
CN113836927A (en) * 2021-09-27 2021-12-24 平安科技(深圳)有限公司 Training method, device and equipment for named entity recognition model and storage medium
CN113838524A (en) * 2021-09-27 2021-12-24 电子科技大学长三角研究院(衢州) S-nitrosylation site prediction method, model training method and storage medium
CN113838524B (en) * 2021-09-27 2024-04-26 电子科技大学长三角研究院(衢州) S-nitrosylation site prediction method, model training method and storage medium
CN113836927B (en) * 2021-09-27 2023-09-29 平安科技(深圳)有限公司 Named entity recognition model training method, device, equipment and storage medium
CN114048744A (en) * 2021-10-28 2022-02-15 盐城金堤科技有限公司 Entity extraction-based job record generation method, device and equipment
CN114492383A (en) * 2021-12-20 2022-05-13 北京邮电大学 Entity name identification method and device for digital currency transaction address
CN117010390A (en) * 2023-07-04 2023-11-07 北大荒信息有限公司 Company entity identification method, device, equipment and medium based on bidding information
CN117034864A (en) * 2023-09-07 2023-11-10 广州市新谷电子科技有限公司 Visual labeling method, visual labeling device, computer equipment and storage medium
CN117034864B (en) * 2023-09-07 2024-05-10 广州市新谷电子科技有限公司 Visual labeling method, visual labeling device, computer equipment and storage medium
CN117252202B (en) * 2023-11-20 2024-03-19 江西风向标智能科技有限公司 Construction method, identification method and system for named entities in high school mathematics topics
CN117252202A (en) * 2023-11-20 2023-12-19 江西风向标智能科技有限公司 Construction method, identification method and system for named entities in high school mathematics topics
CN117875319A (en) * 2023-12-29 2024-04-12 汉王科技股份有限公司 Medical field labeling data acquisition method and device and electronic equipment
CN117610574B (en) * 2024-01-23 2024-04-26 广东省人民医院 Named entity recognition method and device based on cross-domain transfer learning
CN117610574A (en) * 2024-01-23 2024-02-27 广东省人民医院 Named entity recognition method and device based on cross-domain transfer learning

Also Published As

Publication number Publication date
CN110704633A (en) 2020-01-17
CN110704633B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
WO2021043085A1 (en) Method and apparatus for recognizing named entity, computer device, and storage medium
TWI621077B (en) Character recognition method and server for claim documents
CN109472033B (en) Method and system for extracting entity relationship in text, storage medium and electronic equipment
US11449537B2 (en) Detecting affective characteristics of text with gated convolutional encoder-decoder framework
CN109446885B (en) Text-based component identification method, system, device and storage medium
US10628403B2 (en) Annotation system for extracting attributes from electronic data structures
CN113051356B (en) Open relation extraction method and device, electronic equipment and storage medium
CN111191275A (en) Sensitive data identification method, system and device
CN111613212A (en) Speech recognition method, system, electronic device and storage medium
CN111723870B (en) Artificial intelligence-based data set acquisition method, apparatus, device and medium
CN113704429A (en) Semi-supervised learning-based intention identification method, device, equipment and medium
WO2021031505A1 (en) Method and apparatus for detecting errors during audio annotation, computer device and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN111506595B (en) Data query method, system and related equipment
CN112052305A (en) Information extraction method and device, computer equipment and readable storage medium
WO2019095899A1 (en) Material annotation method and apparatus, terminal, and computer readable storage medium
CN114036950A (en) Medical text named entity recognition method and system
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN113807973A (en) Text error correction method and device, electronic equipment and computer readable storage medium
CN112906361A (en) Text data labeling method and device, electronic equipment and storage medium
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN110705211A (en) Text key content marking method and device, computer equipment and storage medium
CN112418813B (en) AEO qualification intelligent rating management system and method based on intelligent analysis and identification and storage medium
US20240112236A1 (en) Information processing device, information processing method, and computer-readable storage medium storing program
CN111126056B (en) Method and device for identifying trigger words

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20861355

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20861355

Country of ref document: EP

Kind code of ref document: A1