WO2022142123A1 - Training method and apparatus for named entity model, device, and medium - Google Patents

Training method and apparatus for named entity model, device, and medium

Info

Publication number
WO2022142123A1
WO2022142123A1 PCT/CN2021/097545 CN2021097545W WO2022142123A1 WO 2022142123 A1 WO2022142123 A1 WO 2022142123A1 CN 2021097545 W CN2021097545 W CN 2021097545W WO 2022142123 A1 WO2022142123 A1 WO 2022142123A1
Authority
WO
WIPO (PCT)
Prior art keywords
training samples
target
incompletely
labeled
estimated
Prior art date
Application number
PCT/CN2021/097545
Other languages
French (fr)
Chinese (zh)
Inventor
阮鸿涛
郑立颖
胡沛弦
徐亮
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2022142123A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • The estimation rule requires that an estimated label sequence both agree with the labeled entity information and assign an estimated label to every unlabeled part.
  • The adaptive loss function is a loss function that can be adjusted as training progresses, so that attention is not spread across a large number of label sequences during training.
  • S31: Obtain one incompletely labeled training sample from the plurality of incompletely labeled training samples as the target incompletely labeled training sample.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, a method for training a named entity model is implemented, including the steps of: acquiring a plurality of incompletely labeled training samples, each comprising text sample data and an incompletely labeled label sequence; determining estimated label sequences for each incompletely labeled training sample by using a preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, where the preset estimation rule requires that an estimated label sequence both agree with the labeled entity information and assign an estimated label to every unlabeled part; and acquiring a preliminarily trained named entity model and training a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of them, to obtain the target named entity model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and discloses a training method and apparatus for a named entity model, a device, and a medium. The method comprises: acquiring a plurality of incompletely labeled training samples, each comprising text sample data and an incompletely labeled label sequence; determining estimated label sequences for each incompletely labeled training sample by means of a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples; and acquiring a preliminarily trained named entity model, and training a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of them, so as to obtain a target named entity model. Dependence on labeling quality is thereby reduced, and the adaptive loss function avoids spreading attention across a large number of label sequences during training.

Description

Training method, apparatus, device, and medium for a named entity model
This application claims priority to Chinese patent application No. 2020116266180, filed with the China Patent Office on December 31, 2020 and entitled "Training method, apparatus, device, and medium for a named entity model", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of artificial intelligence, and in particular to a training method, apparatus, device, and medium for a named entity model.
Background
Training a named entity recognition model currently relies on a large amount of fully labeled data, but large volumes of high-quality, fully labeled data are extremely expensive and difficult to obtain. To address this, the named entity data produced by annotators is in many cases incompletely labeled, that is, only some of the entities are labeled, and the named entity recognition model is then trained on the incompletely labeled data. The inventors realized that content not labeled as an entity in incompletely labeled data may take any label, while named entities in text are generally sparse, so the number of possible label sequences grows exponentially with the length of the unlabeled text. As a result, existing approaches that train a named entity recognition model on incompletely labeled data spread attention across a large number of label sequences, which makes it difficult for the model to find the true label sequence.
Technical problem
The present application aims to solve the technical problem that, when the prior art trains a named entity recognition model on incompletely labeled data, attention is spread across a large number of label sequences, which makes it difficult for the model to find the true label sequence.
Technical solution
The main purpose of the present application is to provide a training method, apparatus, device, and medium for a named entity model, aiming to solve the technical problem that, when the prior art trains a named entity recognition model on incompletely labeled data, attention is spread across a large number of label sequences, which makes it difficult for the model to find the true label sequence.
To achieve the above purpose, the present application proposes a training method for a named entity model, the method comprising:
acquiring a plurality of incompletely labeled training samples, each incompletely labeled training sample comprising text sample data and an incompletely labeled label sequence;
determining estimated label sequences for each incompletely labeled training sample by using a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, where the preset estimation rule requires that an estimated label sequence both agree with the labeled entity information and assign an estimated label to every unlabeled part;
acquiring a preliminarily trained named entity model, and training a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain a target named entity model.
The present application also proposes a training apparatus for a named entity model, the apparatus comprising:
a training sample acquisition module, configured to acquire a plurality of incompletely labeled training samples, each incompletely labeled training sample comprising text sample data and an incompletely labeled label sequence;
an estimated label sequence set determination module, configured to determine estimated label sequences for each incompletely labeled training sample by using a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, where the preset estimation rule requires that an estimated label sequence both agree with the labeled entity information and assign an estimated label to every unlabeled part;
a model training module, configured to acquire a preliminarily trained named entity model, and to train a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain a target named entity model.
The present application also proposes a computer device comprising a memory and a processor, the memory storing a computer program, where the processor implements the following method steps when executing the computer program:
acquiring a plurality of incompletely labeled training samples, each incompletely labeled training sample comprising text sample data and an incompletely labeled label sequence;
determining estimated label sequences for each incompletely labeled training sample by using a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, where the preset estimation rule requires that an estimated label sequence both agree with the labeled entity information and assign an estimated label to every unlabeled part;
acquiring a preliminarily trained named entity model, and training a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain a target named entity model.
The present application also proposes a computer-readable storage medium on which a computer program is stored, where the following method steps are implemented when the computer program is executed by a processor:
acquiring a plurality of incompletely labeled training samples, each incompletely labeled training sample comprising text sample data and an incompletely labeled label sequence;
determining estimated label sequences for each incompletely labeled training sample by using a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, where the preset estimation rule requires that an estimated label sequence both agree with the labeled entity information and assign an estimated label to every unlabeled part;
acquiring a preliminarily trained named entity model, and training a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain a target named entity model.
Beneficial effects
In the training method, apparatus, device, and medium for a named entity model of the present application, estimated label sequences are first determined for each incompletely labeled training sample by using a preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples. The preset estimation rule requires that an estimated label sequence both agree with the labeled entity information and assign an estimated label to every unlabeled part. The named entity model to be trained is then trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of them, to obtain the target named entity model. Using incompletely labeled training samples reduces the dependence on labeling quality, and the adaptive loss function avoids spreading attention across a large number of label sequences during training, so that a model trained on a plurality of incompletely labeled training samples can also achieve good results.
Description of drawings
FIG. 1 is a schematic flowchart of a training method for a named entity model according to an embodiment of the present application;
FIG. 2 is a schematic structural block diagram of a training apparatus for a named entity model according to an embodiment of the present application;
FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
The realization of the purpose, the functional features, and the advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Embodiments of the present invention
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.
To solve the technical problem that, when the prior art trains a named entity recognition model on incompletely labeled data, attention is spread across a large number of label sequences and the model has difficulty finding the true label sequence, the present application proposes a training method for a named entity model, which is applied in the technical field of artificial intelligence. By using incompletely labeled training samples, the training method reduces the dependence on labeling quality, and by using an adaptive loss function it avoids spreading attention across a large number of label sequences during training, so that a model trained on a plurality of incompletely labeled training samples can also achieve good results.
Referring to FIG. 1, an embodiment of the present application provides a training method for a named entity model, the method comprising:
S1: acquiring a plurality of incompletely labeled training samples, each incompletely labeled training sample comprising text sample data and an incompletely labeled label sequence;
S2: determining estimated label sequences for each incompletely labeled training sample by using a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, where the preset estimation rule requires that an estimated label sequence both agree with the labeled entity information and assign an estimated label to every unlabeled part;
S3: acquiring a preliminarily trained named entity model, and training a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain a target named entity model.
In this embodiment, estimated label sequences are first determined for each incompletely labeled training sample by using a preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples. The preset estimation rule requires that an estimated label sequence both agree with the labeled entity information and assign an estimated label to every unlabeled part. The named entity model to be trained is then trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of them, to obtain the target named entity model. Using incompletely labeled training samples reduces the dependence on labeling quality, and the adaptive loss function avoids spreading attention across a large number of label sequences during training, so that a model trained on a plurality of incompletely labeled training samples can also achieve good results.
For S1, the plurality of incompletely labeled training samples may be obtained from a database, input by a user, or sent by a third-party application system.
The text sample data comprises a plurality of characters.
In each incompletely labeled training sample, the incompletely labeled label sequence is the result of incompletely labeling the entities in the text sample data.
For example, the text sample data of an incompletely labeled training sample is $x = (x_1, x_2, \ldots, x_n)$, where each $x_i$ ($i = 1, 2, \ldots, n$) denotes one character in the text sample data. The incompletely labeled label sequence corresponding to $x$ is $y_u = (-, y_2, -, \ldots, y_i, \ldots, -)$, where $y_i$ denotes the label that the annotator assigned to character $x_i$, and "-" denotes an unlabeled position, meaning that the corresponding character in the text sample data may or may not belong to an entity. This example is not limiting.
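As an illustration only, an incompletely labeled training sample of this kind could be represented as follows. This is a minimal sketch, not the applicant's data format: the class name `IncompleteSample`, the use of `None` for the "-" placeholder, the BIO-style tags, and the example sentence are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IncompleteSample:
    """Hypothetical container for one incompletely labeled training sample."""
    chars: List[str]             # x = (x_1, ..., x_n), one character per position
    labels: List[Optional[str]]  # y_u; None stands in for the "-" placeholder

    def __post_init__(self) -> None:
        # Every character needs exactly one slot in the label sequence.
        if len(self.chars) != len(self.labels):
            raise ValueError("chars and labels must have the same length")

    def labeled_positions(self) -> List[int]:
        """Zero-based positions that the annotator labeled."""
        return [i for i, y in enumerate(self.labels) if y is not None]

    def unlabeled_positions(self) -> List[int]:
        """Zero-based positions left as "-" (entity or non-entity unknown)."""
        return [i for i, y in enumerate(self.labels) if y is None]


# Made-up example: only the first two characters were labeled as an entity.
sample = IncompleteSample(
    chars=["平", "安", "在", "深", "圳"],
    labels=["B-ORG", "I-ORG", None, None, None],
)
```

Keeping the unlabeled positions explicit, rather than defaulting them to a non-entity tag, preserves the distinction the text draws between "not an entity" and "not labeled".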
For S2, all possible estimated label sequences are determined for each incompletely labeled training sample. Each estimated label sequence contains the labeled entity information from the incompletely labeled label sequence of the corresponding training sample, and each estimated label sequence is fully labeled, that is, it consists of the possible labels for the unlabeled positions together with the labeled entity information from the incompletely labeled label sequence.
For example, the incompletely labeled label sequence of an incompletely labeled training sample $x$ is $y_u = (-, y_2, -, \ldots, y_i, \ldots, -)$. One estimated label sequence that agrees with the labeled entity information in $y_u$ and assigns an estimated label to every unlabeled part is $y_c = (y_{c1}, y_2, y_{c3}, \ldots, y_i, \ldots, y_{cn})$, where $y_{cj}$ denotes one possible label at an unlabeled position $j$ (for example, $y_{c1}$ at position 1). The set of all estimated label sequences $y_c$ that agree with the labeled entity information in $y_u$ and assign an estimated label to every unlabeled part is denoted $C(y_u)$; $C(y_u)$ is the estimated label sequence set corresponding to the incompletely labeled training sample $x$. This example is not limiting.
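The set $C(y_u)$ can be written out explicitly, which also shows why it grows exponentially. The notation below is an assumption introduced for illustration: $\mathcal{L}$ is the label vocabulary, $U$ is the set of unlabeled positions in $y_u$, and every label in $\mathcal{L}$ is assumed to be allowed at every unlabeled position.

```latex
% Estimated label sequence set: all full labelings that keep the annotated labels.
C(y_u) = \bigl\{\, y_c \in \mathcal{L}^{n} \;:\; (y_c)_i = y_i
         \ \text{for every labeled position } i \,\bigr\}

% One free choice from \mathcal{L} per unlabeled position, hence:
\lvert C(y_u) \rvert = \lvert \mathcal{L} \rvert^{\lvert U \rvert}

% Illustrative numbers (not from the application):
% |\mathcal{L}| = 5 labels and |U| = 10 unlabeled characters give
\lvert C(y_u) \rvert = 5^{10} = 9\,765\,625
```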
For S3, the preliminarily trained named entity model may be obtained from a database, input by a user, or sent by a third-party application system. The named entity model to be trained is trained by using the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, with the adaptive loss function used as the loss function during training; the named entity model to be trained, once training has finished, is taken as the target named entity model.
The preliminarily trained named entity model is a model obtained by training a pre-trained model and a conditional random field model on a small number of fully labeled training samples.
The named entity model to be trained comprises a pre-trained model and a conditional random field model. The pre-trained model is a model obtained by training based on the BERT network.
The adaptive loss function is a loss function that can be adjusted as training progresses, so that attention is not spread across a large number of label sequences during training.
In one embodiment, the step of determining estimated label sequences for each incompletely labeled training sample by using the preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, comprises:
S21: obtaining one incompletely labeled training sample from the plurality of incompletely labeled training samples as a target incompletely labeled training sample;
S22: extracting the labeled entity information from the incompletely labeled label sequence of the target incompletely labeled training sample, to obtain the labeled entity information corresponding to the target incompletely labeled training sample;
S23: using the labeled entity information corresponding to the target incompletely labeled training sample to find the unlabeled characters in the text sample data of the target incompletely labeled training sample, to obtain the unlabeled text data corresponding to the target incompletely labeled training sample;
S24: estimating all possible labels for each character in the unlabeled text data corresponding to the target incompletely labeled training sample, to obtain an estimated label set for each character of the unlabeled text data;
S25: combining the estimated label sets of the characters of the unlabeled text data with the labeled entity information corresponding to the target incompletely labeled training sample into all possible label sequences, to obtain the estimated label sequence set corresponding to the target incompletely labeled training sample;
S26: repeating the step of obtaining one incompletely labeled training sample from the plurality of incompletely labeled training samples as the target incompletely labeled training sample, until the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples has been determined.
This embodiment determines the estimated label sequences, which provides the data basis for subsequent model training.
对于S21,从所述多个不完全标注的训练样本中获取一个所述不完全标注的训练样本,将获取的所述不完全标注的训练样本作为目标不完全标注的训练样本。For S21, obtain one of the incompletely labeled training samples from the plurality of incompletely labeled training samples, and use the obtained incompletely labeled training sample as the target incompletely labeled training sample.
对于S22,从所述目标不完全标注的训练样本对应的所述不完全标注的标签序列中提取出所有的已标注实体的信息,将提取得到的已标注实体的信息作为所述目标不完全标注的训练样本对应的已标注实体信息。For S22, extract the information of all the labeled entities from the incompletely labeled label sequence corresponding to the incompletely labeled training samples of the target, and use the extracted information of the labeled entities as the incompletely labeled target The annotated entity information corresponding to the training samples of .
已标注实体信息包括:被标注为实体的文字在文本样本数据中的位置数据。The marked entity information includes: position data of the text marked as an entity in the text sample data.
对于S23,采用所述目标不完全标注的训练样本对应的所述已标注实体信息从所述目标不完全标注的训练样本的所述文本样本数据中找出未标注的文字,将找出的未标注的文字作为所述目标不完全标注的训练样本对应的未标注文本数据。For S23, use the marked entity information corresponding to the training sample with the target incompletely marked to find out the unmarked text from the text sample data of the training sample with the partial mark of the target, and use the found unmarked text to The marked text is used as the unmarked text data corresponding to the training samples that are not fully marked by the target.
未标注文本数据包括:在文本样本数据中的位置数据、未标注的文字,在未标注文本数据中每个在文本样本数据中的位置数据对应一个未标注的文字。The unlabeled text data includes: position data and unlabeled characters in the text sample data. In the unlabeled text data, each position data in the text sample data corresponds to an unlabeled character.
对于S24,分别对所述目标不完全标注的训练样本对应的所述未标注文本数据中每个文字(也就是未标注的文字)进行所有可能的标签预估,将一个文字(也就是未标注的文字)对应的所有可能的标签预估结果作为一个预估标签集合。For S24, perform all possible label estimations on each character (that is, the unlabeled text) in the unlabeled text data corresponding to the training samples that are not completely labeled by the target, and assign a character (that is, the unlabeled text) to All possible label prediction results corresponding to the text) are used as a set of estimated labels.
For S25, one estimated label is selected from each of the estimated label sets corresponding to the characters of the unlabeled text data of the target incompletely labeled training sample, and the selected estimated labels are taken as one possible set of estimated labels to be combined; the labeled entity information corresponding to the target incompletely labeled training sample is then combined, in order of position data, with each of the multiple possible sets of estimated labels to be combined, to obtain the estimated label sequence set corresponding to the target incompletely labeled training sample.
For S26, steps S21 to S26 are repeated until the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples has been determined.
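Steps S22 to S25 can be sketched as follows. This is only an illustrative sketch, not the applicant's implementation: the function name `candidate_label_sequences`, the use of `None` to mark an unlabeled character, and the two-label set are assumptions made for the example.

```python
from itertools import product


def candidate_label_sequences(partial_labels, label_set):
    """Enumerate every label sequence consistent with a partial annotation.

    partial_labels: one entry per character; a label string where the character
        is labeled, or None where it is unlabeled (hypothetical encoding).
    label_set: all labels an unlabeled character may take.
    """
    # S22/S23: the positions without a label are the unlabeled text data.
    unlabeled_positions = [i for i, lab in enumerate(partial_labels) if lab is None]

    sequences = []
    # S24/S25: choose one label per unlabeled position, in every possible way,
    # and merge each choice with the existing labels in position order.
    for choice in product(label_set, repeat=len(unlabeled_positions)):
        seq = list(partial_labels)
        for pos, lab in zip(unlabeled_positions, choice):
            seq[pos] = lab
        sequences.append(tuple(seq))
    return sequences


# Hypothetical sample: characters 0 and 3 are labeled, 1 and 2 are not.
partial = ["ENT", None, None, "ENT"]
candidates = candidate_label_sequences(partial, ["ENT", "O"])
```

With n unlabeled characters and m labels the set contains m^n sequences, so exhaustive enumeration of this kind is practical only for short texts or small label sets.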
In one embodiment, the above step of obtaining an initially trained named entity model and training the named entity model to be trained by using an adaptive loss function, the initially trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence sets respectively corresponding to the plurality of incompletely labeled training samples, to obtain a target named entity model, includes:
S31: obtaining one incompletely labeled training sample from the plurality of incompletely labeled training samples as the target incompletely labeled training sample;
S32: using the initially trained named entity model to calculate a probability distribution for each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the to-be-analyzed probability distribution data corresponding to each estimated label sequence in that set;
S33: using the named entity model to be trained to calculate a conditional probability for each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the to-be-analyzed conditional probability data corresponding to each estimated label sequence in that set;
S34: using the named entity model to be trained to resolve the most likely label sequences among all the estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the most likely label sequence set corresponding to the target incompletely labeled training sample;
S35: inputting into the adaptive loss function the to-be-analyzed probability distribution data corresponding to each estimated label sequence in the estimated label sequence set of the target incompletely labeled training sample, the to-be-analyzed conditional probability data corresponding to each estimated label sequence in that set, and the most likely label sequence set corresponding to the target incompletely labeled training sample, to obtain the loss value of the named entity model to be trained; and updating the parameters of the named entity model to be trained according to the loss value, the updated named entity model to be trained being used the next time the to-be-analyzed conditional probability data and the most likely label sequence set are calculated;
S36: repeating the above method steps until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determining the named entity model to be trained whose loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the target named entity model.
This embodiment uses the adaptive loss function to avoid spreading attention across a large number of label sequences during training, so that a model trained on multiple incompletely labeled training samples can still achieve good results.
For S31, one incompletely labeled training sample is obtained from the plurality of incompletely labeled training samples and taken as the target incompletely labeled training sample.
For S32, each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample is input into the initially trained named entity model for probability distribution prediction, to obtain the to-be-analyzed probability distribution data corresponding to each estimated label sequence in that set. In other words, the number of items of to-be-analyzed probability distribution data equals the number of estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample.
For S33, each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample is input into the named entity model to be trained, and the conditional probability output by the conditional random field model of the named entity model to be trained is obtained, yielding the to-be-analyzed conditional probability data corresponding to each estimated label sequence in that set. In other words, the number of items of to-be-analyzed conditional probability data equals the number of estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample.
For S34, the state transition matrix of the conditional random field model of the current named entity model to be trained and the output of the pre-trained model of the current named entity model to be trained are used to resolve the most likely label sequences, and all the most likely label sequences so obtained are taken as the most likely label sequence set corresponding to the target incompletely labeled training sample.
For S35, the method of updating the parameters of the named entity model to be trained according to the loss value may be selected from the prior art and is not described here.
For S36, steps S31 to S36 are repeated until the loss value reaches the first convergence condition or the number of iterations reaches the second convergence condition.
The first convergence condition means that the magnitudes of the losses from two adjacent calculations satisfy the Lipschitz condition (Lipschitz continuity condition).
The number of iterations referred to in the second convergence condition is the number of times the named entity model to be trained has been used to calculate the to-be-analyzed conditional probability data and the most likely label sequence set; that is, each time the to-be-analyzed conditional probability data and the most likely label sequence set are calculated, the number of iterations increases by 1.
In one embodiment, the above step of using the initially trained named entity model to calculate a probability distribution for each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the to-be-analyzed probability distribution data corresponding to each estimated label sequence in that set, includes:
S321: based on the forward-backward algorithm and the initially trained named entity model, calculating the marginal probability of each label for each character of the text sample data of the target incompletely labeled training sample, to obtain the marginal probability data of each label for each character of that text sample data;
S322: for each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, multiplying together the marginal probability data that the estimated label sequence selects for each character from the marginal probability data of each label for each character of the text sample data, to obtain the to-be-analyzed probability distribution data corresponding to each estimated label sequence in that set.
This embodiment calculates the to-be-analyzed probability distribution data corresponding to each estimated label sequence, providing a data basis for subsequent model training.
For S321, the text sample data of the target incompletely labeled training sample is input into the initially trained named entity model, and the forward-backward algorithm is used to calculate the marginal probability (marginal distribution) of each label for each character of that text sample data. In other words, the number of marginal probabilities for each character equals the total number of labels. For example, if the labels are entity and non-entity, the total number of labels is 2; this example is not limiting.
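A minimal sketch of the forward-backward computation in S321 for a linear-chain conditional random field is given below. The inputs `emissions` and `transitions` stand in for whatever per-character scores and state transition scores the initially trained model produces; their names and shapes are assumptions made for the example, and the scores are exponentiated directly for readability rather than handled in log space.

```python
import math


def forward_backward_marginals(emissions, transitions):
    """Per-position label marginals P(y_t = j | x) for a linear-chain CRF.

    emissions[t][j]: score of label j at position t (hypothetical model output).
    transitions[i][j]: score of moving from label i to label j.
    """
    T, L = len(emissions), len(emissions[0])
    psi = [[math.exp(s) for s in row] for row in emissions]
    phi = [[math.exp(s) for s in row] for row in transitions]

    # Forward pass: alpha[t][j] sums over all prefixes ending in label j at t.
    alpha = [[0.0] * L for _ in range(T)]
    alpha[0] = psi[0][:]
    for t in range(1, T):
        for j in range(L):
            alpha[t][j] = psi[t][j] * sum(
                alpha[t - 1][i] * phi[i][j] for i in range(L)
            )

    # Backward pass: beta[t][i] sums over all suffixes following label i at t.
    beta = [[1.0] * L for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(L):
            beta[t][i] = sum(
                phi[i][j] * psi[t + 1][j] * beta[t + 1][j] for j in range(L)
            )

    z = sum(alpha[T - 1])  # partition function
    return [[alpha[t][j] * beta[t][j] / z for j in range(L)] for t in range(T)]


# Hypothetical scores for a 3-character text and 2 labels.
marginals = forward_backward_marginals(
    [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]],
    [[0.2, -0.1], [0.0, 0.3]],
)
```

Each row of `marginals` holds one probability per label, which matches the statement that the number of marginal probabilities per character equals the total number of labels. For long sequences a log-space variant would be the safer choice numerically.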
For S322, one estimated label sequence is extracted from the estimated label sequence set corresponding to the target incompletely labeled training sample as the target estimated label sequence; each label of the target estimated label sequence is looked up in turn in the marginal probability data of each label for each character of the text sample data of the target incompletely labeled training sample, the marginal probability data found are multiplied together, and the product is taken as the to-be-analyzed probability distribution data corresponding to the target estimated label sequence; the step of extracting one estimated label sequence as the target estimated label sequence is repeated until the to-be-analyzed probability distribution data corresponding to each estimated label sequence in the set has been determined.
For example, suppose the target estimated label sequence of the target incompletely labeled training sample has 10 estimated labels and the estimated label at its second position is "entity". The marginal probability data for the label "entity" of the second character of the text sample data (the character corresponding to the second position of the target estimated label sequence) is then taken as the marginal probability data for the estimated label at the second position. The marginal probability data for the estimated labels at all 10 positions are then multiplied together (a product of 10 marginal probabilities), and the product is taken as the to-be-analyzed probability distribution data corresponding to the target estimated label sequence; this example is not limiting.
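The lookup-and-multiply step of S322 can be sketched in a few lines. The marginal table, the label names, and the helper name `sequence_score` are hypothetical values chosen for illustration only.

```python
import math


def sequence_score(label_sequence, marginals, label_index):
    """Multiply the marginal probability of each label at its own position."""
    return math.prod(
        marginals[t][label_index[label]] for t, label in enumerate(label_sequence)
    )


# Hypothetical marginals for a 3-character text: one row per character,
# one column per label in the order given by label_index.
label_index = {"ENT": 0, "O": 1}
marginals = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]

# Picks 0.9 for "ENT" at position 0, 0.6 for "O" at 1, and 0.8 for "O" at 2.
q = sequence_score(("ENT", "O", "O"), marginals, label_index)
```

Because the score is a product of per-position marginals, it treats positions as independent; it is a weighting over candidate sequences rather than the joint probability of a sequence under the model.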
In one embodiment, the above step of using the named entity model to be trained to resolve the most likely label sequences among all the estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the most likely label sequence set corresponding to the target incompletely labeled training sample, includes:
S341: inputting the text sample data of the target incompletely labeled training sample into the named entity model to be trained for calculation, and obtaining the probability prediction result, corresponding to the target incompletely labeled training sample, output by the pre-trained model of the named entity model to be trained;
S342: using the k-best Viterbi decoding algorithm to decode the most likely label sequences according to the state transition matrix of the conditional random field model of the named entity model to be trained and the probability prediction result of the target incompletely labeled training sample, to obtain the most likely label sequence set corresponding to the target incompletely labeled training sample.
This embodiment uses the k-best Viterbi decoding algorithm to decode the most likely label sequences, providing a data basis for subsequent model training.
For S341, the text sample data of the target incompletely labeled training sample is input into the named entity model to be trained for calculation, and the probability output by the pre-trained model of the named entity model to be trained is taken as the probability prediction result corresponding to the target incompletely labeled training sample.
For S342, the state transition matrix of the conditional random field model is extracted from the current named entity model to be trained; the k-best Viterbi decoding algorithm is then applied to the extracted state transition matrix and the probability prediction result of the target incompletely labeled training sample to decode the most likely label sequences, to obtain the most likely label sequence set corresponding to the target incompletely labeled training sample.
The method of decoding the most likely label sequences using the k-best Viterbi decoding algorithm may be selected from the prior art and is not described here.
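Since the decoding method is left to the prior art, the following is only one possible sketch of k-best Viterbi decoding, not the implementation used in this application. It assumes additive (log-domain) scores; the names `emissions` and `transitions` are hypothetical stand-ins for the probability prediction result and the state transition matrix.

```python
import heapq


def k_best_viterbi(emissions, transitions, k):
    """Return the k highest-scoring (score, path) pairs, best first.

    emissions[t][j]: score of label j at position t (hypothetical model output).
    transitions[i][j]: score of moving from label i to label j.
    A path is a tuple of label indices, one per position.
    """
    n_labels = len(emissions[0])
    # beams[j] holds up to k (score, path) pairs for partial paths ending in j.
    beams = [[(emissions[0][j], (j,))] for j in range(n_labels)]
    for t in range(1, len(emissions)):
        new_beams = []
        for j in range(n_labels):
            # Extend every kept path from every previous label into label j.
            extended = [
                (score + transitions[i][j] + emissions[t][j], path + (j,))
                for i in range(n_labels)
                for score, path in beams[i]
            ]
            new_beams.append(heapq.nlargest(k, extended))
        beams = new_beams
    finals = [item for beam in beams for item in beam]
    return heapq.nlargest(k, finals)


# Hypothetical scores for a 3-character text and 2 labels.
em = [[1.0, 0.0], [0.2, 0.9], [0.5, 0.4]]
tr = [[0.1, -0.2], [0.0, 0.3]]
top2 = k_best_viterbi(em, tr, k=2)
```

Keeping the k best partial paths per label at each position is sufficient to recover the k best complete paths, because any complete path in the global top k must have a prefix that is among the k best prefixes ending in the same label.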
In one embodiment, the adaptive loss function L(w,x) is calculated as:
$$L(w,x) = (1-\lambda)\,L_1(w,x) + \lambda\,L_2(w,x)$$
Figure PCTCN2021097545-appb-000001
Figure PCTCN2021097545-appb-000002
where q(y′|x) is the to-be-analyzed probability distribution data corresponding to each estimated label sequence in the estimated label sequence set of the target incompletely labeled training sample, p_w(y′|x) is the to-be-analyzed conditional probability data corresponding to each estimated label sequence in that set, C(y_u) is the estimated label sequence set corresponding to the target incompletely labeled training sample, K_w(x) is the most likely label sequence set corresponding to the target incompletely labeled training sample, log() is the logarithmic function, and λ is an adaptive parameter that increases gradually from 0 to 1.
This embodiment uses the adaptive loss function to avoid spreading attention across a large number of label sequences during training, so that a model trained on multiple incompletely labeled training samples can still achieve good results.
In the early stage of training, the model can be trained using the to-be-analyzed probability distribution data corresponding to all the estimated label sequences together with the label information; at this stage the weight of L_1(w,x) in the adaptive loss function is relatively large and the weight of L_2(w,x) is relatively small. As training proceeds, λ gradually adjusts the loss function to increase the weight of the most likely estimated label sequences; the weight of L_1(w,x) then becomes relatively small and the weight of L_2(w,x) relatively large, which makes it easier for the model to capture the true label sequence, so that a model trained on multiple incompletely labeled training samples can still achieve good results.
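The weighting behavior described here follows directly from the convex combination L(w,x) = (1-λ)L_1(w,x) + λL_2(w,x). The sketch below shows only that combination: the two loss terms are passed in as plain numbers because their definitions are given in formula images not reproduced in this text, and the function name and sample values are hypothetical.

```python
def adaptive_loss(l1, l2, lam):
    """Blend two loss terms: L = (1 - lam) * L1 + lam * L2."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * l1 + lam * l2


# Hypothetical loss values, used only to show how the weights shift.
early = adaptive_loss(l1=2.0, l2=5.0, lam=0.0)  # only L1 contributes
mid = adaptive_loss(l1=2.0, l2=5.0, lam=0.5)    # both terms weighted equally
late = adaptive_loss(l1=2.0, l2=5.0, lam=1.0)   # only L2 contributes
```

At λ = 0 the loss reduces to L_1 and at λ = 1 it reduces to L_2, which is the shift from the whole candidate set toward the most likely label sequences that the paragraph above describes.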
In one embodiment, the adaptive parameter λ is calculated as:
Figure PCTCN2021097545-appb-000003
where exp[] is the exponential function with the natural constant e as its base, b is the number of training steps reached when the target incompletely labeled training sample is used to train the named entity model to be trained, B is the preset total number of training steps, and γ is a constant that controls the growth rate of λ.
This embodiment uses the number of training steps, the total number of training steps, and the constant controlling the growth rate of λ to make the adaptive parameter increase gradually from 0 to 1, so that the adaptive parameter is adjusted from small to large according to training progress.
Referring to FIG. 2, the present application proposes a training apparatus for a named entity model, the apparatus including:
a training sample acquisition module 100, configured to acquire a plurality of incompletely labeled training samples, each incompletely labeled training sample including text sample data and an incompletely labeled label sequence;
an estimated label sequence set determination module 200, configured to determine estimated label sequences for each of the incompletely labeled training samples by using a preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, the preset estimation rule requiring both that the labeled entity information remains consistent and that every unlabeled part receives an estimated label;
a model training module 300, configured to obtain an initially trained named entity model, and to train a named entity model to be trained by using an adaptive loss function, the initially trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence sets respectively corresponding to the plurality of incompletely labeled training samples, to obtain a target named entity model.
This embodiment first determines estimated label sequences for each incompletely labeled training sample by using a preset estimation rule, obtaining the estimated label sequence set corresponding to each of the incompletely labeled training samples; the preset estimation rule requires both that the labeled entity information remains consistent and that every unlabeled part receives an estimated label. The named entity model to be trained is then trained by using the adaptive loss function, the initially trained named entity model, the incompletely labeled training samples, and their respective estimated label sequence sets, to obtain the target named entity model. Using incompletely labeled training samples reduces the dependence on annotation quality, and the adaptive loss function avoids spreading attention across a large number of label sequences during training, so that a model trained on multiple incompletely labeled training samples can still achieve good results.
Referring to FIG. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data such as that used by the training method for the named entity model. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a training method for a named entity model. The training method for the named entity model includes: acquiring a plurality of incompletely labeled training samples, each incompletely labeled training sample including text sample data and an incompletely labeled label sequence; determining estimated label sequences for each of the incompletely labeled training samples by using a preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, the preset estimation rule requiring both that the labeled entity information remains consistent and that every unlabeled part receives an estimated label; and obtaining an initially trained named entity model, and training a named entity model to be trained by using an adaptive loss function, the initially trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence sets respectively corresponding to the plurality of incompletely labeled training samples, to obtain a target named entity model.
This embodiment first determines estimated label sequences for each incompletely labeled training sample by using a preset estimation rule, obtaining the estimated label sequence set corresponding to each of the incompletely labeled training samples; the preset estimation rule requires both that the labeled entity information remains consistent and that every unlabeled part receives an estimated label. The named entity model to be trained is then trained by using the adaptive loss function, the initially trained named entity model, the incompletely labeled training samples, and their respective estimated label sequence sets, to obtain the target named entity model. Using incompletely labeled training samples reduces the dependence on annotation quality, and the adaptive loss function avoids spreading attention across a large number of label sequences during training, so that a model trained on multiple incompletely labeled training samples can still achieve good results.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements a training method for a named entity model, including the steps of: acquiring a plurality of incompletely labeled training samples, each incompletely labeled training sample including text sample data and an incompletely labeled label sequence; determining estimated label sequences for each of the incompletely labeled training samples by using a preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, the preset estimation rule requiring both that the labeled entity information remains consistent and that every unlabeled part receives an estimated label; and obtaining an initially trained named entity model, and training a named entity model to be trained by using an adaptive loss function, the initially trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence sets respectively corresponding to the plurality of incompletely labeled training samples, to obtain a target named entity model.
The training method for the named entity model executed as above first determines estimated label sequences for each incompletely labeled training sample by using a preset estimation rule, obtaining the estimated label sequence set corresponding to each of the incompletely labeled training samples; the preset estimation rule requires both that the labeled entity information remains consistent and that every unlabeled part receives an estimated label. The named entity model to be trained is then trained by using the adaptive loss function, the initially trained named entity model, the incompletely labeled training samples, and their respective estimated label sequence sets, to obtain the target named entity model. Using incompletely labeled training samples reduces the dependence on annotation quality, and the adaptive loss function avoids spreading attention across a large number of label sequences during training, so that a model trained on multiple incompletely labeled training samples can still achieve good results.
The computer-readable storage medium may be non-volatile or volatile.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium provided in the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, herein, the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that comprises the element.
The above are merely preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A training method for a named entity model, wherein the method comprises:
    acquiring a plurality of incompletely labeled training samples, each incompletely labeled training sample comprising text sample data and an incompletely labeled label sequence;
    determining estimated label sequences for each of the incompletely labeled training samples according to a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, wherein the preset estimation rule requires both that the labeled entity information remain consistent and that every unlabeled part receive an estimated label; and
    acquiring a preliminarily trained named entity model, and training a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain a target named entity model.
  2. The training method for a named entity model according to claim 1, wherein the step of determining estimated label sequences for each of the incompletely labeled training samples according to the preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, comprises:
    acquiring one incompletely labeled training sample from the plurality of incompletely labeled training samples as a target incompletely labeled training sample;
    extracting labeled entity information from the incompletely labeled label sequence of the target incompletely labeled training sample, to obtain the labeled entity information corresponding to the target incompletely labeled training sample;
    using the labeled entity information corresponding to the target incompletely labeled training sample to find the unlabeled characters in the text sample data of the target incompletely labeled training sample, to obtain unlabeled text data corresponding to the target incompletely labeled training sample;
    estimating all possible labels for each character in the unlabeled text data corresponding to the target incompletely labeled training sample, to obtain an estimated label set for each character of the unlabeled text data;
    combining the estimated label set of each character of the unlabeled text data with the labeled entity information corresponding to the target incompletely labeled training sample into all possible label sequences, to obtain the estimated label sequence set corresponding to the target incompletely labeled training sample; and
    repeating the step of acquiring one incompletely labeled training sample from the plurality of incompletely labeled training samples as the target incompletely labeled training sample, until the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples has been determined.
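The enumeration described in claim 2 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `candidate_label_sequences`, the use of `None` to mark an unlabeled character, and the example label names are all assumptions made for the sketch.

```python
from itertools import product


def candidate_label_sequences(partial_labels, label_set):
    """Enumerate every full label sequence consistent with a partial one.

    partial_labels: list in which a labeled position holds its label and an
                    unlabeled position holds None (hypothetical encoding).
    label_set:      labels that an unlabeled position may take.
    """
    # A labeled position keeps exactly one option, so the labeled entity
    # information stays consistent; an unlabeled position gets every label.
    options = [[lab] if lab is not None else list(label_set)
               for lab in partial_labels]
    return [list(seq) for seq in product(*options)]


# Two characters are already labeled as a person entity, two are unlabeled.
partial = ["B-PER", "I-PER", None, None]
candidates = candidate_label_sequences(partial, ["O", "B-LOC", "I-LOC"])
```

With n unlabeled characters and m possible labels the set holds m^n sequences, which is why the later claims weight and prune these sequences rather than treating them all equally.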
  3. The training method for a named entity model according to claim 1, wherein the step of acquiring the preliminarily trained named entity model, and training the named entity model to be trained by using the adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain the target named entity model, comprises:
    acquiring one incompletely labeled training sample from the plurality of incompletely labeled training samples as a target incompletely labeled training sample;
    using the preliminarily trained named entity model to perform a probability distribution calculation on each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain probability distribution data to be analyzed for each estimated label sequence in that set;
    using the named entity model to be trained to perform a conditional probability calculation on each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain conditional probability data to be analyzed for each estimated label sequence in that set;
    using the named entity model to be trained to resolve the most probable label sequences among all the estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain a most probable label sequence set corresponding to the target incompletely labeled training sample;
    inputting the probability distribution data to be analyzed for each estimated label sequence, the conditional probability data to be analyzed for each estimated label sequence, and the most probable label sequence set corresponding to the target incompletely labeled training sample into the adaptive loss function to calculate a loss value of the named entity model to be trained, and updating the parameters of the named entity model to be trained according to the loss value, the updated named entity model to be trained being used in the next calculation of the conditional probability data to be analyzed and the most probable label sequence set; and
    repeating the above method steps until the loss value satisfies a first convergence condition or the number of iterations satisfies a second convergence condition, and determining the named entity model to be trained whose loss value satisfies the first convergence condition or whose number of iterations satisfies the second convergence condition as the target named entity model.
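The stopping logic of the last step of claim 3 can be sketched as below. The claim does not define the two convergence conditions in this passage, so the sketch assumes the first is a loss threshold and the second is a cap on the iteration count; `step_fn`, `loss_threshold`, and `max_iterations` are hypothetical names.

```python
def train_until_converged(step_fn, loss_threshold, max_iterations):
    """Repeat one training step until either stopping condition holds.

    step_fn:        hypothetical callable that performs one parameter update
                    of the model being trained and returns the loss value.
    loss_threshold: assumed form of the first convergence condition.
    max_iterations: assumed form of the second convergence condition.
    """
    iteration, loss = 0, float("inf")
    while iteration < max_iterations:
        loss = step_fn()
        iteration += 1
        if loss <= loss_threshold:
            break  # the loss-based condition was met first
    return iteration, loss


# Stand-in for real training: a fixed sequence of loss values.
losses = iter([3.0, 2.0, 0.5, 0.1])
result = train_until_converged(lambda: next(losses), 1.0, 10)
```

In a real loop `step_fn` would recompute the conditional probabilities and the most probable label sequences with the freshly updated parameters, as the claim requires.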
  4. The training method for a named entity model according to claim 3, wherein the step of using the preliminarily trained named entity model to perform the probability distribution calculation on each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the probability distribution data to be analyzed for each estimated label sequence in that set, comprises:
    calculating, based on the forward-backward algorithm and the preliminarily trained named entity model, the marginal probability of each label for each character of the text sample data of the target incompletely labeled training sample, to obtain marginal probability data of each label for each character of the text sample data; and
    for each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, multiplying together the marginal probability data that the characters of the text sample data have for the labels of that estimated label sequence, to obtain the probability distribution data to be analyzed for each estimated label sequence in that set.
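The multiplication step of claim 4 can be illustrated with a short sketch. It assumes the per-character marginal probabilities have already been produced by the forward-backward pass of the preliminarily trained model; the function name, the dictionary layout, and the numbers are hypothetical.

```python
import math


def sequence_weight(marginals, label_sequence):
    """Multiply the per-character marginal probabilities along one sequence.

    marginals:      marginals[i][label] is the marginal probability of
                    `label` at character i (assumed to be precomputed).
    label_sequence: one estimated label sequence, one label per character.
    """
    return math.prod(marginals[i][lab] for i, lab in enumerate(label_sequence))


# Made-up marginals for a two-character text.
marginals = [{"B": 0.7, "O": 0.3}, {"I": 0.6, "O": 0.4}]
q = {seq: sequence_weight(marginals, seq) for seq in [("B", "I"), ("B", "O")]}
```

Here the sequence ("B", "I") receives 0.7 × 0.6 and ("B", "O") receives 0.7 × 0.4, so sequences that the preliminary model finds more plausible get larger weights.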
  5. The training method for a named entity model according to claim 3, wherein the step of using the named entity model to be trained to resolve the most probable label sequences among all the estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the most probable label sequence set corresponding to the target incompletely labeled training sample, comprises:
    inputting the text sample data of the target incompletely labeled training sample into the named entity model to be trained for calculation, and obtaining the probability prediction result for the target incompletely labeled training sample output by the pre-trained model; and
    decoding the most probable label sequences by using the k-best Viterbi decoding algorithm according to the state transition matrix of the conditional random field model of the named entity model to be trained and the probability prediction result of the target incompletely labeled training sample, to obtain the most probable label sequence set corresponding to the target incompletely labeled training sample.
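A generic k-best Viterbi decoder can be sketched as follows. It is a textbook-style illustration rather than the decoder of the application: it assumes additive scores (for example log-space), labels encoded as integer indices, and it does not restrict the search to the estimated label sequence set. The names `emissions` and `transitions` are hypothetical stand-ins for the probability prediction result and the state transition matrix.

```python
import heapq


def k_best_viterbi(emissions, transitions, k):
    """Return the k highest-scoring label-index sequences with their scores.

    emissions:   emissions[t][j] is the score of label j at position t.
    transitions: transitions[i][j] is the score of moving from label i to j.
    """
    num_labels = len(emissions[0])
    # beams[j] holds up to k (score, path) pairs for paths ending in label j.
    beams = [[(emissions[0][j], [j])] for j in range(num_labels)]
    for t in range(1, len(emissions)):
        new_beams = []
        for j in range(num_labels):
            # Extend every kept path by label j, then keep the k best.
            extended = [
                (score + transitions[i][j] + emissions[t][j], path + [j])
                for i in range(num_labels)
                for score, path in beams[i]
            ]
            new_beams.append(heapq.nlargest(k, extended, key=lambda sp: sp[0]))
        beams = new_beams
    finals = [sp for beam in beams for sp in beam]
    return heapq.nlargest(k, finals, key=lambda sp: sp[0])


emissions = [[1.0, 0.0], [0.0, 2.0]]
transitions = [[0.5, 0.0], [0.0, 0.0]]
top2 = k_best_viterbi(emissions, transitions, 2)
```

Keeping k partial paths per label at each position is what distinguishes this from ordinary Viterbi decoding, which keeps only one; the cost grows roughly linearly in k.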
  6. The training method for a named entity model according to claim 3, wherein the adaptive loss function L(w,x) is calculated as:
    $L(w,x) = (1-\lambda)\,L_1(w,x) + \lambda\,L_2(w,x)$
    Figure PCTCN2021097545-appb-100001
    Figure PCTCN2021097545-appb-100002
    where $q(y'|x)$ is the probability distribution data to be analyzed for each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, $p_w(y'|x)$ is the conditional probability data to be analyzed for each estimated label sequence in that set, $C(y_u)$ is the estimated label sequence set corresponding to the target incompletely labeled training sample, $K_w(x)$ is the most probable label sequence set corresponding to the target incompletely labeled training sample, log() is the logarithm function, and $\lambda$ is an adaptive parameter that increases gradually from 0 to 1.
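The weighting in claim 6 can be illustrated with a small sketch of the blend alone. The two component losses are defined in formula images that are not reproduced in this text, so they are passed in here as plain numbers; the function name and the example values are hypothetical.

```python
def adaptive_loss(l1, l2, lam):
    """Blend two loss terms as (1 - lam) * l1 + lam * l2, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * l1 + lam * l2


# As lam moves from 0 to 1 the loss shifts entirely from l1 to l2.
start = adaptive_loss(2.0, 4.0, 0.0)
middle = adaptive_loss(2.0, 4.0, 0.5)
end = adaptive_loss(2.0, 4.0, 1.0)
```

Because λ starts at 0, early training is driven only by the first term, and the second term takes over as λ approaches 1.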
  7. The training method for a named entity model according to claim 6, wherein the adaptive parameter λ is calculated as:
    Figure PCTCN2021097545-appb-100003
    where exp[] is the exponential function with the natural constant e as its base, b is the training step at which the target incompletely labeled training sample is used to train the named entity model to be trained, B is the preset total number of training steps, and γ is a constant that controls how fast λ grows.
  8. A training apparatus for a named entity model, wherein the apparatus comprises:
    a training sample acquisition module, configured to acquire a plurality of incompletely labeled training samples, each incompletely labeled training sample comprising text sample data and an incompletely labeled label sequence;
    an estimated label sequence set determination module, configured to determine estimated label sequences for each of the incompletely labeled training samples according to a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, wherein the preset estimation rule requires both that the labeled entity information remain consistent and that every unlabeled part receive an estimated label; and
    a model training module, configured to acquire a preliminarily trained named entity model, and to train a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain a target named entity model.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following method steps:
    acquiring a plurality of incompletely labeled training samples, each incompletely labeled training sample comprising text sample data and an incompletely labeled label sequence;
    determining estimated label sequences for each of the incompletely labeled training samples according to a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, wherein the preset estimation rule requires both that the labeled entity information remain consistent and that every unlabeled part receive an estimated label; and
    acquiring a preliminarily trained named entity model, and training a named entity model to be trained by using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain a target named entity model.
  10. The computer device according to claim 9, wherein the step of determining estimated label sequences for each of the incompletely labeled training samples according to the preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, comprises:
    acquiring one incompletely labeled training sample from the plurality of incompletely labeled training samples as a target incompletely labeled training sample;
    extracting labeled entity information from the incompletely labeled label sequence of the target incompletely labeled training sample, to obtain the labeled entity information corresponding to the target incompletely labeled training sample;
    using the labeled entity information corresponding to the target incompletely labeled training sample to find the unlabeled characters in the text sample data of the target incompletely labeled training sample, to obtain unlabeled text data corresponding to the target incompletely labeled training sample;
    estimating all possible labels for each character in the unlabeled text data corresponding to the target incompletely labeled training sample, to obtain an estimated label set for each character of the unlabeled text data;
    combining the estimated label set of each character of the unlabeled text data with the labeled entity information corresponding to the target incompletely labeled training sample into all possible label sequences, to obtain the estimated label sequence set corresponding to the target incompletely labeled training sample; and
    repeating the step of acquiring one incompletely labeled training sample from the plurality of incompletely labeled training samples as the target incompletely labeled training sample, until the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples has been determined.
  11. The computer device according to claim 9, wherein the step of acquiring the preliminarily trained named entity model, and training the named entity model to be trained by using the adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, to obtain the target named entity model, comprises:
    acquiring one incompletely labeled training sample from the plurality of incompletely labeled training samples as a target incompletely labeled training sample;
    using the preliminarily trained named entity model to perform a probability distribution calculation on each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain probability distribution data to be analyzed for each estimated label sequence in that set;
    using the named entity model to be trained to perform a conditional probability calculation on each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain conditional probability data to be analyzed for each estimated label sequence in that set;
    using the named entity model to be trained to resolve the most probable label sequences among all the estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain a most probable label sequence set corresponding to the target incompletely labeled training sample;
    inputting the probability distribution data to be analyzed for each estimated label sequence, the conditional probability data to be analyzed for each estimated label sequence, and the most probable label sequence set corresponding to the target incompletely labeled training sample into the adaptive loss function to calculate a loss value of the named entity model to be trained, and updating the parameters of the named entity model to be trained according to the loss value, the updated named entity model to be trained being used in the next calculation of the conditional probability data to be analyzed and the most probable label sequence set; and
    repeating the above method steps until the loss value satisfies a first convergence condition or the number of iterations satisfies a second convergence condition, and determining the named entity model to be trained whose loss value satisfies the first convergence condition or whose number of iterations satisfies the second convergence condition as the target named entity model.
  12. The computer device according to claim 11, wherein the step of using the preliminarily trained named entity model to perform the probability distribution calculation on each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the probability distribution data to be analyzed for each estimated label sequence in that set, comprises:
    calculating, based on the forward-backward algorithm and the preliminarily trained named entity model, the marginal probability of each label for each character of the text sample data of the target incompletely labeled training sample, to obtain marginal probability data of each label for each character of the text sample data; and
    for each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, multiplying together the marginal probability data that the characters of the text sample data have for the labels of that estimated label sequence, to obtain the probability distribution data to be analyzed for each estimated label sequence in that set.
  13. The computer device according to claim 11, wherein the step of using the named entity model to be trained to resolve the most probable label sequences among all the estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the most probable label sequence set corresponding to the target incompletely labeled training sample, comprises:
    inputting the text sample data of the target incompletely labeled training sample into the named entity model to be trained for calculation, and obtaining the probability prediction result for the target incompletely labeled training sample output by the pre-trained model; and
    decoding the most probable label sequences by using the k-best Viterbi decoding algorithm according to the state transition matrix of the conditional random field model of the named entity model to be trained and the probability prediction result of the target incompletely labeled training sample, to obtain the most probable label sequence set corresponding to the target incompletely labeled training sample.
  14. The computer device according to claim 11, wherein the adaptive loss function L(w,x) is calculated as:
    $L(w,x) = (1-\lambda)\,L_1(w,x) + \lambda\,L_2(w,x)$
    Figure PCTCN2021097545-appb-100004
    Figure PCTCN2021097545-appb-100005
    where $q(y'|x)$ is the probability distribution data to be analyzed for each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, $p_w(y'|x)$ is the conditional probability data to be analyzed for each estimated label sequence in that set, $C(y_u)$ is the estimated label sequence set corresponding to the target incompletely labeled training sample, $K_w(x)$ is the most probable label sequence set corresponding to the target incompletely labeled training sample, log() is the logarithm function, and $\lambda$ is an adaptive parameter that increases gradually from 0 to 1.
  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following method steps:
    obtaining a plurality of incompletely labeled training samples, each incompletely labeled training sample comprising text sample data and an incompletely labeled label sequence;
    determining estimated label sequences for each incompletely labeled training sample using a preset estimation rule, to obtain an estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, wherein the preset estimation rule requires that an estimated label sequence both be consistent with the labeled entity information and assign an estimated label to every unlabeled part;
    obtaining a preliminarily trained named entity model, and training a named entity model to be trained using an adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence sets corresponding to the plurality of incompletely labeled training samples, to obtain a target named entity model.
  16. The computer-readable storage medium according to claim 15, wherein the step of determining estimated label sequences for each incompletely labeled training sample using the preset estimation rule, to obtain the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples, comprises:
    obtaining one incompletely labeled training sample from the plurality of incompletely labeled training samples as a target incompletely labeled training sample;
    extracting labeled entity information from the incompletely labeled label sequence of the target incompletely labeled training sample, to obtain the labeled entity information corresponding to the target incompletely labeled training sample;
    using the labeled entity information corresponding to the target incompletely labeled training sample to find the unlabeled characters in the text sample data of the target incompletely labeled training sample, to obtain unlabeled text data corresponding to the target incompletely labeled training sample;
    estimating all possible labels for each character in the unlabeled text data corresponding to the target incompletely labeled training sample, to obtain an estimated label set corresponding to each character of the unlabeled text data;
    combining the estimated label sets corresponding to the characters of the unlabeled text data with the labeled entity information corresponding to the target incompletely labeled training sample into all possible label sequences, to obtain the estimated label sequence set corresponding to the target incompletely labeled training sample;
    repeating the step of obtaining one incompletely labeled training sample from the plurality of incompletely labeled training samples as the target incompletely labeled training sample, until the estimated label sequence set corresponding to each of the plurality of incompletely labeled training samples has been determined.
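The enumeration in claim 16 can be illustrated with a minimal Python sketch. It assumes the incompletely labeled label sequence is given as a list with `None` at every unlabeled character, and that every label in a label set is allowed at each unlabeled position; the function name and this input encoding are illustrative assumptions, not taken from the application.

```python
from itertools import product


def candidate_label_sequences(partial_labels, label_set):
    """Enumerate every full labeling that agrees with the known labels.

    partial_labels: one entry per character; None marks an unlabeled one.
    label_set: all labels an unlabeled character may take (assumed input).
    """
    # A labeled character has exactly one option; an unlabeled one has all.
    options = [
        [label] if label is not None else list(label_set)
        for label in partial_labels
    ]
    # The Cartesian product yields all combinations consistent with the labels.
    return [list(seq) for seq in product(*options)]


sequences = candidate_label_sequences(["B-PER", "I-PER", None, None], ["O", "B-LOC"])
```

With two unlabeled characters and two possible labels the set holds four sequences. The count grows exponentially with the number of unlabeled characters, so a practical system would likely need a more compact representation than an explicit list.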
  17. The computer-readable storage medium according to claim 15, wherein the step of obtaining the preliminarily trained named entity model, and training the named entity model to be trained using the adaptive loss function, the preliminarily trained named entity model, the plurality of incompletely labeled training samples, and the estimated label sequence sets corresponding to the plurality of incompletely labeled training samples, to obtain the target named entity model, comprises:
    obtaining one incompletely labeled training sample from the plurality of incompletely labeled training samples as a target incompletely labeled training sample;
    using the preliminarily trained named entity model to perform a probability distribution calculation on each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain to-be-analyzed probability distribution data corresponding to each estimated label sequence in that set;
    using the named entity model to be trained to perform a conditional probability calculation on each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain to-be-analyzed conditional probability data corresponding to each estimated label sequence in that set;
    using the named entity model to be trained to resolve the most likely label sequences among all the estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain a most likely label sequence set corresponding to the target incompletely labeled training sample;
    inputting the to-be-analyzed probability distribution data and the to-be-analyzed conditional probability data corresponding to each estimated label sequence in the estimated label sequence set of the target incompletely labeled training sample, together with the most likely label sequence set corresponding to the target incompletely labeled training sample, into the adaptive loss function for calculation, to obtain a loss value of the named entity model to be trained, and updating the parameters of the named entity model to be trained according to the loss value, wherein the updated named entity model to be trained is used for the next calculation of the to-be-analyzed conditional probability data and the most likely label sequence set;
    repeating the above method steps until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determining the named entity model to be trained whose loss value reaches the first convergence condition, or whose number of iterations reaches the second convergence condition, as the target named entity model.
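The stopping logic at the end of claim 17 can be sketched as a small Python loop. The sketch assumes the first convergence condition is a loss threshold and the second is a maximum iteration count; the claim does not define either condition more precisely, and `step_fn` is a hypothetical callback standing in for one pass of loss calculation and parameter update.

```python
def train_until_converged(step_fn, loss_threshold, max_iterations):
    """Run training steps until the loss or the iteration budget stops them.

    step_fn(iteration) -> loss is a hypothetical callback that computes the
    loss, updates the model parameters, and returns the loss value.
    """
    loss = float("inf")
    iterations_run = 0
    # Stop on whichever condition is met first (both are assumptions here).
    while iterations_run < max_iterations and loss > loss_threshold:
        iterations_run += 1
        loss = step_fn(iterations_run)
    return iterations_run, loss


# Toy stand-in for a real training step: the loss decays as 1 / iteration.
result = train_until_converged(lambda i: 1.0 / i, loss_threshold=0.25, max_iterations=10)
```

In the toy run the loss threshold is reached on the fourth iteration, before the iteration budget is used up.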
  18. The computer-readable storage medium according to claim 17, wherein the step of using the preliminarily trained named entity model to perform a probability distribution calculation on each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the to-be-analyzed probability distribution data corresponding to each estimated label sequence in that set, comprises:
    based on the forward-backward algorithm and the preliminarily trained named entity model, calculating the marginal probability of each label for each character of the text sample data of the target incompletely labeled training sample, to obtain marginal probability data of each label for each character of the text sample data;
    for each estimated label sequence in the estimated label sequence set corresponding to the target incompletely labeled training sample, multiplying together the marginal probability data of the label that the estimated label sequence assigns to each character of the text sample data, to obtain the to-be-analyzed probability distribution data corresponding to each estimated label sequence in that set.
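The two steps of claim 18 can be sketched in pure Python as follows. The sketch assumes a linear-chain model whose per-character and transition scores are already available as non-negative potentials (for example, exponentiated scores); the function names and input layout are illustrative assumptions rather than the application's implementation.

```python
import math


def token_marginals(emissions, transitions):
    """Forward-backward over a linear chain with non-negative potentials.

    emissions[t][j]   : potential of label j at position t
    transitions[i][j] : potential of moving from label i to label j
    Returns marginals[t][j] = P(label at position t is j | x).
    """
    n, m = len(emissions), len(emissions[0])
    alpha = [[0.0] * m for _ in range(n)]
    beta = [[1.0] * m for _ in range(n)]
    alpha[0] = list(emissions[0])
    # Forward pass: total potential of all prefixes ending in label j at t.
    for t in range(1, n):
        for j in range(m):
            alpha[t][j] = emissions[t][j] * sum(
                alpha[t - 1][i] * transitions[i][j] for i in range(m))
    # Backward pass: total potential of all suffixes following label i at t.
    for t in range(n - 2, -1, -1):
        for i in range(m):
            beta[t][i] = sum(
                transitions[i][j] * emissions[t + 1][j] * beta[t + 1][j]
                for j in range(m))
    z = sum(alpha[n - 1])  # partition function
    return [[alpha[t][j] * beta[t][j] / z for j in range(m)] for t in range(n)]


def sequence_score(marginals, labels):
    """Multiply the marginal of the chosen label at every position."""
    return math.prod(marginals[t][label] for t, label in enumerate(labels))


marginals = token_marginals([[1.0, 2.0], [3.0, 1.0]], [[1.0, 0.5], [0.5, 1.0]])
score = sequence_score(marginals, (0, 0))
```

The product of per-character marginals is not renormalized over the estimated label sequence set in this sketch, because the claim describes only the multiplication.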
  19. The computer-readable storage medium according to claim 17, wherein the step of using the named entity model to be trained to resolve the most likely label sequences among all the estimated label sequences in the estimated label sequence set corresponding to the target incompletely labeled training sample, to obtain the most likely label sequence set corresponding to the target incompletely labeled training sample, comprises:
    inputting the text sample data of the target incompletely labeled training sample into the named entity model to be trained for calculation, and obtaining the probability prediction result, corresponding to the target incompletely labeled training sample, that is output by the pre-trained model for the target incompletely labeled training sample;
    using a k-best Viterbi decoding algorithm to decode the most likely label sequences according to the state transition matrix of the conditional random field model of the named entity model to be trained and the probability prediction result of the target incompletely labeled training sample, to obtain the most likely label sequence set corresponding to the target incompletely labeled training sample.
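A k-best Viterbi decoder of the kind named in claim 19 can be sketched in pure Python. The sketch assumes the probability prediction result is given as per-position log-scores (`emissions`) and the state transition matrix as log-scores (`transitions`); these names, and the use of additive scores, are assumptions made for illustration.

```python
import heapq


def k_best_viterbi(emissions, transitions, k):
    """Return the k highest-scoring label sequences as (score, labels) pairs.

    emissions[t][j]   : log-score of label j at position t
    transitions[i][j] : log-score of moving from label i to label j
    """
    num_labels = len(emissions[0])
    # beams[j] holds up to k (score, path) hypotheses that end in label j.
    beams = [[(emissions[0][j], (j,))] for j in range(num_labels)]
    for t in range(1, len(emissions)):
        new_beams = []
        for j in range(num_labels):
            # Extend every surviving hypothesis with label j at position t.
            candidates = (
                (score + transitions[i][j] + emissions[t][j], path + (j,))
                for i in range(num_labels)
                for score, path in beams[i]
            )
            new_beams.append(heapq.nlargest(k, candidates))
        beams = new_beams
    # Merge the per-label beams and keep the overall k best.
    finals = (hyp for beam in beams for hyp in beam)
    return heapq.nlargest(k, finals)


top2 = k_best_viterbi([[1, 5], [3, 2], [0, 4]], [[2, -1], [-3, 1]], k=2)
```

Keeping k hypotheses per label at each position is enough to recover the exact k best full sequences, since any top-k sequence must have a top-k prefix among those ending in the same label.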
  20. The computer-readable storage medium according to claim 17, wherein the calculation formula L(w,x) of the adaptive loss function is:
    L(w,x) = (1-λ)L_1(w,x) + λL_2(w,x)
    Figure PCTCN2021097545-appb-100006
    Figure PCTCN2021097545-appb-100007
    where q(y′|x) is the to-be-analyzed probability distribution data corresponding to each estimated label sequence in the estimated label sequence set of the target incompletely labeled training sample, p_w(y′|x) is the to-be-analyzed conditional probability data corresponding to each estimated label sequence in that set, C(y_u) is the estimated label sequence set corresponding to the target incompletely labeled training sample, K_w(x) is the most likely label sequence set corresponding to the target incompletely labeled training sample, log() is the logarithmic function, and λ is an adaptive parameter that gradually increases from 0 to 1.
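The weighting in the adaptive loss can be illustrated with a short Python sketch. Because the two component losses are given only as figures here, the sketch takes their values as inputs rather than computing them. The linear ramp for λ is an assumption: the claim states only that λ increases gradually from 0 to 1, and the function names are illustrative.

```python
def adaptive_lambda(step, total_steps):
    """Hypothetical linear ramp: 0 at the first step, 1 at the last."""
    if total_steps <= 1:
        return 1.0
    return min(1.0, max(0.0, step / (total_steps - 1)))


def adaptive_loss(l1, l2, lam):
    """Combine the two component losses: (1 - lam) * L_1 + lam * L_2."""
    return (1.0 - lam) * l1 + lam * l2


# Early in training the first component dominates, later the second one.
early = adaptive_loss(2.0, 4.0, adaptive_lambda(0, 5))
late = adaptive_loss(2.0, 4.0, adaptive_lambda(4, 5))
```

Under this schedule the loss equals the first component at the first step and the second component at the last step, with a linear blend in between.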
PCT/CN2021/097545 2020-12-31 2021-05-31 Training method and apparatus for named entity model, device, and medium WO2022142123A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011626618.0A CN112766485B (en) 2020-12-31 2020-12-31 Named entity model training method, device, equipment and medium
CN202011626618.0 2020-12-31

Publications (1)

Publication Number Publication Date
WO2022142123A1 (en) 2022-07-07

Family

ID=75698970

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097545 WO2022142123A1 (en) 2020-12-31 2021-05-31 Training method and apparatus for named entity model, device, and medium

Country Status (2)

Country Link
CN (1) CN112766485B (en)
WO (1) WO2022142123A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251650A (en) * 2023-11-20 2023-12-19 之江实验室 Geographic hotspot center identification method, device, computer equipment and storage medium

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN112766485B (en) * 2020-12-31 2023-10-24 平安科技(深圳)有限公司 Named entity model training method, device, equipment and medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN110032649A (en) * 2019-04-12 2019-07-19 北京科技大学 Relation extraction method and device between a kind of entity of TCM Document
CN110287480A (en) * 2019-05-27 2019-09-27 广州多益网络股份有限公司 A kind of name entity recognition method, device, storage medium and terminal device
US20200050662A1 (en) * 2018-08-09 2020-02-13 Oracle International Corporation System And Method To Generate A Labeled Dataset For Training An Entity Detection System
CN111382572A (en) * 2020-03-03 2020-07-07 北京香侬慧语科技有限责任公司 Named entity identification method, device, equipment and medium
CN112766485A (en) * 2020-12-31 2021-05-07 平安科技(深圳)有限公司 Training method, device, equipment and medium for named entity model

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN109741347B (en) * 2018-12-30 2021-03-16 北京工业大学 Iterative learning image segmentation method based on convolutional neural network
CN110348017B (en) * 2019-07-15 2022-12-23 苏州大学 Text entity detection method, system and related components
CN111222393A (en) * 2019-10-12 2020-06-02 浙江大学 Self-learning neural network-based method for detecting signet ring cells in pathological section
CN110851597A (en) * 2019-10-28 2020-02-28 青岛聚好联科技有限公司 Method and device for sentence annotation based on similar entity replacement
CN111062215B (en) * 2019-12-10 2024-02-13 金蝶软件(中国)有限公司 Named entity recognition method and device based on semi-supervised learning training
CN111553164A (en) * 2020-04-29 2020-08-18 平安科技(深圳)有限公司 Training method and device for named entity recognition model and computer equipment
CN111985239B (en) * 2020-07-31 2024-04-26 杭州远传新业科技股份有限公司 Entity identification method, entity identification device, electronic equipment and storage medium

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN117251650A (en) * 2023-11-20 2023-12-19 之江实验室 Geographic hotspot center identification method, device, computer equipment and storage medium
CN117251650B (en) * 2023-11-20 2024-02-06 之江实验室 Geographic hotspot center identification method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112766485A (en) 2021-05-07
CN112766485B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN110457675B (en) Predictive model training method and device, storage medium and computer equipment
WO2022142123A1 (en) Training method and apparatus for named entity model, device, and medium
CN110019471B (en) Generating text from structured data
WO2021218024A1 (en) Method and apparatus for training named entity recognition model, and computer device
CN110717034A (en) Ontology construction method and device
WO2022142043A1 (en) Course recommendation method and apparatus, device, and storage medium
CN110688853B (en) Sequence labeling method and device, computer equipment and storage medium
US20230244704A1 (en) Sequenced data processing method and device, and text processing method and device
US20230076658A1 (en) Method, apparatus, computer device and storage medium for decoding speech data
CN111666401A (en) Official document recommendation method and device based on graph structure, computer equipment and medium
WO2022142122A1 (en) Method and apparatus for training entity recognition model, and device and storage medium
WO2020215694A1 (en) Chinese word segmentation method and apparatus based on deep learning, and storage medium and computer device
CN110175273B (en) Text processing method and device, computer readable storage medium and computer equipment
CN112861518B (en) Text error correction method and device, storage medium and electronic device
WO2020073532A1 (en) Method and device for identifying conversation state of customer service robot, electronic device, and computer-readable storage medium
CN113011191A (en) Knowledge joint extraction model training method
CA3117833A1 (en) Regularization of recurrent machine-learned architectures
CN115409111A (en) Training method of named entity recognition model and named entity recognition method
CN113268564B (en) Method, device, equipment and storage medium for generating similar problems
CN112765985A (en) Named entity identification method for specific field patent embodiment
JP4328362B2 (en) Language analysis model learning apparatus, language analysis model learning method, language analysis model learning program, and recording medium thereof
CN113139368B (en) Text editing method and system
CN111444710B (en) Word segmentation method and word segmentation device
JP7121819B2 (en) Image processing method and apparatus, electronic device, computer-readable storage medium, and computer program
CN108073704B (en) LIWC vocabulary extension method

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21912884; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21912884; Country of ref document: EP; Kind code of ref document: A1)