WO2022077891A1 - Multi-labeled data-based dependency and syntactic parsing model training method and apparatus - Google Patents

Multi-labeled data-based dependency and syntactic parsing model training method and apparatus

Info

Publication number
WO2022077891A1
Authority
WO
WIPO (PCT)
Prior art keywords
label
dependency
arc
score
loss
Prior art date
Application number
PCT/CN2021/088601
Other languages
French (fr)
Chinese (zh)
Inventor
李正华
周明月
赵煜
张民
Original Assignee
苏州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州大学
Publication of WO2022077891A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a method, apparatus, device, and readable storage medium for training a dependency parsing model based on multi-labeled data.
  • given an input sentence, the goal of dependency parsing is to capture the modification and collocation relationships between the words within the sentence, to describe the syntactic and semantic structure of the sentence, and to construct a dependency syntax tree.
  • the result obtained by voting may also be a completely wrong answer, which completely discards possible correct information and affects the training effect.
  • although the weighted voting method can also be used, it still cannot solve the problem of blindly trusting a few annotators when the number of annotators is small.
  • the purpose of this application is to provide a method, apparatus, device, and readable storage medium for training a dependency parsing model based on multi-labeled data, so as to solve the problem that current approaches to training a dependency parsing model with multi-labeled data essentially discard part of the labeled data and use only one annotation for model training, failing to fully utilize the effective information in the multi-labeled data and resulting in poor model performance. The specific solutions are as follows:
  • the present application provides a method for training a dependency parsing model based on multi-labeled data, including:
  • the labeling results include arcs and dependency labels, and each labeling result is from a different user;
  • the model parameters of the dependency parsing model are adjusted, so as to realize the training of the dependency parsing model.
  • calculating the loss values of the arc score and the label score relative to the multiple labeling results including:
  • the loss values of the arc score and the label score relative to the various annotation results are calculated.
  • the weight value is set for each of the multiple annotation results, including:
  • Arc weight values and/or label weight values are respectively set for various annotation results in the multiple annotation results.
  • calculating the loss values of the arc score and the label score relative to the multiple labeling results including:
  • according to the arc loss function, calculating the loss value of the arc score relative to the arcs in the multiple labeling results, to obtain a first loss value;
  • according to the label loss function, calculating the loss value of the label score relative to the dependency labels in the multiple labeling results, to obtain a second loss value;
  • loss values of the arc score and the label score with respect to various labeling results are determined.
  • the loss value of the label score relative to the dependency label in the multiple labeling results is calculated to obtain the second loss value, including:
  • according to the label loss function, the loss value of the label score relative to the dependency label in the target labeling result is calculated to obtain the second loss value, wherein the target labeling result is the labeling result, among the multiple labeling results, whose arc is equal to the target arc; the target arc is an arc determined according to a target strategy, and the target strategy includes: arc score prediction, majority voting, weighted voting, and random selection.
  • the dependency parsing model includes: an input layer, an encoding layer, a first MLP layer, a first scoring layer, a second MLP layer, and a second scoring layer;
  • the first MLP layer is used to determine, according to the output of the encoding layer, the representation vector of the current word as a core word and the representation vector of the current word as a modifier, and the first scoring layer is used to determine the arc score according to the output of the first MLP layer;
  • the second MLP layer is configured to determine, according to the output of the encoding layer, a representation vector containing dependency label information when the current word is used as a core word and a representation vector containing dependency label information when the current word is used as a modifier, and the second scoring layer is used to determine the label score based on the output of the second MLP layer.
  • the coding layer of the dependency parsing model includes multiple layers of BiLSTM.
  • the present application provides an apparatus for training a dependency parsing model based on multi-labeled data, including:
  • Training sample acquisition module: used to acquire a word sequence and multiple labeling results of the word sequence, wherein the labeling results include arcs and dependency labels, and each labeling result comes from a different user;
  • Input and output module: used to input the word sequence into the dependency parsing model to obtain arc scores and label scores;
  • Loss calculation module: used to calculate the loss values of the arc score and the label score relative to the multiple labeling results according to the target loss function;
  • Iterative module: configured to adjust the model parameters of the dependency parsing model through iterative training aimed at minimizing the loss value, so as to realize the training of the dependency parsing model.
  • the present application provides a device for training a dependency parsing model based on multi-labeled data, including:
  • Memory: used to store a computer program;
  • Processor: used to execute the computer program to implement the above-mentioned method for training a dependency parsing model based on multi-labeled data.
  • the present application provides a readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, is used to implement the above-mentioned method for training a dependency parsing model based on multi-labeled data.
  • a method for training a dependency parsing model based on multi-labeled data includes: obtaining a word sequence and multiple labeling results of the word sequence; inputting the word sequence into the dependency parsing model to obtain arc scores and label scores; calculating, according to the objective loss function, the loss values of the arc score and the label score relative to the multiple labeling results; and, through iterative training aimed at minimizing the loss value, adjusting the model parameters of the dependency parsing model to realize its training.
  • this method can calculate the loss value of the model's output relative to all the labeling results according to the target loss function and complete the iterative training of the model accordingly, thereby making full use of the effective information in all the labeled data and improving the performance of the model.
  • the present application also provides a multi-labeled data-based dependency parsing model training apparatus, device, and readable storage medium, whose technical effects correspond to those of the above method and are not repeated here.
  • FIG. 1 is an implementation flowchart of Embodiment 1 of a method for training a dependency parsing model based on multi-labeled data provided by the present application;
  • FIG. 2 is a detailed flowchart of S103 in Embodiment 1 of a method for training a multi-labeled data-based dependency parsing model provided by the present application;
  • FIG. 3 is a schematic diagram of the model architecture of Embodiment 2 of a method for training a multi-labeled data-based dependency parsing model provided by the present application;
  • FIG. 4 is a schematic diagram of a single-labeling result in Embodiment 2 of a method for training a multi-labeled data-based dependency parsing model provided by the present application;
  • FIG. 5 shows the data storage format of a single-labeling result in Embodiment 2 of a method for training a multi-labeled data-based dependency parsing model provided by the present application;
  • FIG. 6 is a schematic diagram of a multi-labeling result of Embodiment 2 of a method for training a dependency parsing model based on multi-labeled data provided by the present application;
  • FIG. 7 shows the data storage format of multi-labeling results in Embodiment 2 of a method for training a dependency parsing model based on multi-labeled data provided by the present application;
  • FIG. 8 is a functional block diagram of an embodiment of an apparatus for training a dependency parsing model based on multi-labeled data provided by the present application.
  • the core of this application is to provide a method, device, device and readable storage medium for training a dependency syntax analysis model based on multi-labeled data, which can make full use of the effective information in all the labelled data and improve the dependency syntax analysis capability of the model.
  • Embodiment 1 of a method for training a multi-labeled data-based dependency parsing model provided by the present application will be introduced below. Referring to FIG. 1 , Embodiment 1 includes:
  • the labeling result includes an arc and a dependency label
  • the above word sequence refers to a sequence obtained by segmenting a sentence.
  • each type of annotation result comes from a different user. Assuming that a sentence is labeled by K users, K kinds of labeling results will be generated, and each labeling result is a dependency syntax tree of the sentence.
  • a dependency syntax tree is used to describe the dependencies between words.
  • a dependency contains three elements: a modifier, a core word, and a dependency type, which means that the modifier modifies the core word with a certain dependency type.
  • the labeling result includes the following two pieces of information: arc (core word) and dependency label.
  • the dependency parsing model is used to predict the core word and dependency label of each word according to the word sequence. Specifically, the model outputs an arc score and a dependency label score, from which the actually predicted arc and dependency label can be determined.
  • This embodiment does not limit which neural network is selected as the dependency syntax analysis model, as long as it can predict the dependency relationship according to the word sequence.
  • a feasible solution is provided here, and the Biaffine Parser model is selected as the dependency syntax analysis model in this embodiment.
  • the loss value between the actual prediction result and the labeling result can be directly calculated. Since this application uses a variety of annotation results, when calculating the loss value, it is necessary to calculate the loss value of the actual prediction result relative to all the annotation results. Specifically, the loss value between the actual prediction result relative to each labeling result can be calculated separately, and then accumulated.
  • this embodiment provides a method for training a dependency parsing model based on multi-labeled data, which can calculate the loss value of the model's output relative to all the labeling results according to the target loss function and complete the iterative training of the model accordingly, achieving the purpose of fully utilizing the valid information in all the labeled data and improving the model's dependency parsing ability.
  • a weight value may be assigned to the annotation results of different users, so as to distinguish the annotation capabilities of different users. For example, for the annotation results given by experts, a relatively high weight value can be given; for the annotation results given by ordinary users, a lower weight value can be given.
  • in order to distinguish the labeling abilities of different users, weight values are respectively set for the various labeling results, and the process of S103 above is then modified as: according to the target loss function and the weight values of the various labeling results, calculate the loss values of the arc score and the label score relative to the various annotation results.
  • the labeling result contains two pieces of information: arc and dependency label
  • the arc weight value and the label weight value can be set separately. It is even possible to distinguish the users' labeling abilities along only one of the dimensions and leave the other dimension undistinguished.
  • the above-mentioned weight setting process is specifically as follows: for each of the various annotation results in the multiple annotation results, the arc weight value and/or the label weight value are respectively set.
  • when the arc weight value and the label weight value are set separately, the arc weight value may differ from the label weight value.
  • the calculation can be performed from the two dimensions of the arc and the dependency label.
  • the above S103 includes:
  • the difference calculation may not be performed against the relation type labels in all of the labeling results, but only against the relation type labels in some of the labeling results.
  • the partial annotation results here refer to the annotation results selected from all the annotation results according to a certain strategy.
  • the strategy here can specifically be majority voting, weighted voting, arc score prediction, random selection, etc.
  • according to the label loss function, the loss value of the label score relative to the dependency label in the target labeling result is calculated to obtain the second loss value, wherein the target labeling result is the labeling result, among the multiple labeling results, whose arc is equal to the target arc; the target arc is an arc determined according to a target strategy, and the target strategy includes: arc score prediction, majority voting, weighted voting, and random selection.
  • the arc score prediction refers to: according to the arc score output by the dependency syntax analysis model, select the arc with the largest score as the target arc;
  • Majority voting refers to: adopting the majority voting method to select the arc with the most occurrences in the multi-label results as the target arc;
  • Weighted voting refers to: adopting the weighted majority voting method to select the target arc in combination with the weight of each labeling result and the number of times each labeling result appears in the multiple labeling results;
  • Random selection refers to randomly selecting an arc from the multiple labeling results as a target arc.
  • the second embodiment of a method for training a multi-labeled data-based dependency parsing model provided by the present application will be described in detail below.
  • the second embodiment provides a detailed description of the training process based on the foregoing introduction and taking practical applications as an example.
  • the Biaffine Parser model is used, as shown in FIG. 3 .
  • the dependency syntax analysis model includes: an input layer, an encoding layer, a first MLP layer, a first scoring layer, a second MLP layer and a second scoring layer;
  • the coding layer includes multiple layers of BiLSTM;
  • the first MLP layer is used to determine, according to the output of the encoding layer, the representation vector of the current word as a core word and the representation vector of the current word as a modifier, and the first scoring layer is used to determine the arc score according to the output of the first MLP layer;
  • the second MLP layer is configured to determine, according to the output of the encoding layer, a representation vector containing dependency label information when the current word is used as a core word and a representation vector containing dependency label information when the current word is used as a modifier, and the second scoring layer is used to determine the label score from the output of the second MLP layer.
  • w_0 is an auxiliary root node inserted at the beginning of the sentence.
  • the input layer maps each word w_i to a vector x_i, where x_i is the concatenation of the word embedding vector and the character embedding (Char-LSTM) vector, namely x_i = emb(w_i) ⊕ CharLSTM(w_i).
  • the encoding layer is a multi-layer BiLSTM; the concatenated outputs of the two directions of one BiLSTM layer form the input of the next layer.
  • the MLP representation layer takes the output h i of the encoding layer as input, and uses four independent MLPs to obtain four low-dimensional representation vectors containing corresponding information respectively.
  • r_i^{arc-head} is the representation vector when w_i is the core word, and r_i^{arc-dep} is the representation vector when w_i is used as a modifier;
  • r_i^{label-head} is the representation vector containing predicted dependency-label information when w_i is used as the core word, and r_i^{label-dep} is the representation vector containing predicted dependency-label information when w_i is used as a modifier.
  • the biaffine score layer calculates the scores of all dependencies through biaffine.
  • the scores of dependencies are divided into two parts, the arc score and the dependency label score, where the arc score is score^{arc}(i, j) = [r_i^{arc-dep}; 1]^T W_b r_j^{arc-head}.
  • score^{arc}(i, j) represents the score of the dependency arc with j as the core word and i as the modifier.
  • the matrix W b is the biaffine parameter.
  • the overall loss of the model includes two parts: the arc loss and the label loss. The arc loss is the part of the overall loss function that represents the difference between the distribution of predicted arcs and the distribution of the true arcs; the label loss is the part that represents the difference between the distribution of predicted labels and the true labels.
  • the original Biaffine attention parser uses cross-entropy as the loss function, and calculates the local loss separately for each word.
  • the original arc loss function is the cross-entropy of each word's annotated core word under a softmax over all candidate core words: L^{arc}(i) = -log( exp(score^{arc}(i, h_i)) / Σ_j exp(score^{arc}(i, j)) ), where h_i is the annotated core word of w_i.
  • the original loss function of the model is modified to make full use of all the answers of the multi-label data.
  • a sentence is annotated by K annotators, resulting in multi-labeled data.
  • the final syntactic analysis model is obtained, which can decode and analyze any input sentence to obtain the syntactic tree result.
  • once the syntactic information of the data is obtained, it can be used to extract long-distance information to meet the needs of other natural language tasks.
  • weight values can be set for various annotation results. For example, using the consistency rate of an annotator with other annotators to measure his annotation ability, the higher the consistency rate, the higher the weight.
  • s(a_k) is the total number of words tagged by tagger a_k;
  • w(a_k) is the number of those words whose tags are consistent with the answers given by the other taggers; then w(a_k)/s(a_k) is the consistency rate of tagger a_k.
  • the weight is the normalized consistency rate; that is, the weight of tagger a_k is w(a_k)/s(a_k) normalized over all taggers.
  • a dependency syntax tree is shown in FIG. 4, where $0 represents a pseudo node, and the word it points to is the root node of the sentence.
  • a dependency arc consists of three elements (w_i, w_j, r), where w_i is called the core word, w_j is called the modifier, and r is the relation type, meaning that w_j modifies w_i with the syntactic role r.
  • the dependency arc is used as an example, and the relationship type is omitted.
  • FIG. 4 is a graphical representation of dependency syntax data, and the corresponding CoNLL data storage format is shown in FIG. 5, in which the second column is the word form and the seventh column is the standard answer for the corresponding core-word sequence.
  • this application allows multiple annotators to annotate the same sentence according to the annotation guidelines, thereby obtaining multiple annotation results; each sentence thus has multiple syntax trees as annotated answers.
  • Figure 6 is an example of two people's annotation. The top of the sentence is the annotation result of one person, and the bottom of the sentence is the annotation result of another person.
  • this application has modified the CoNLL format, so that the data format is also adapted to the multi-label format, as shown in FIG. 7 .
  • the first 10 columns are consistent with the CoNLL format;
  • the 11th and 12th columns are the identifier of the first annotator and that annotator's core-word sequence answer;
  • the 14th and 15th columns are, respectively, the identifier of the second annotator and that annotator's core-word sequence answer.
  • a multi-label data-based dependency syntax analysis model training device provided by the embodiments of the present application will be introduced below.
  • the apparatus described below and the method for training a dependency parsing model described above may be referred to in correspondence with each other.
  • the apparatus for training a dependency parsing model based on multi-labeled data in this embodiment includes:
  • Training sample acquisition module 801: used to acquire a word sequence and multiple labeling results of the word sequence, wherein the labeling results include arcs and dependency labels, and each labeling result comes from a different user;
  • Input and output module 802: used to input the word sequence into a dependency parsing model to obtain arc scores and label scores;
  • Loss calculation module 803: used to calculate the loss values of the arc score and the label score relative to the multiple labeling results according to the target loss function;
  • Iterative module 804: used to adjust the model parameters of the dependency parsing model through iterative training aimed at minimizing the loss value, so as to realize the training of the dependency parsing model.
  • the apparatus for training a dependency parsing model based on multi-labeled data in this embodiment is used to implement the aforementioned method for training a dependency parsing model based on multi-labeled data, so for its specific implementation reference may be made to the method embodiments above; for example, the training sample acquisition module 801, the input and output module 802, the loss calculation module 803, and the iterative module 804 are respectively used to implement steps S101, S102, S103, and S104 of the above method, and their specific implementations are not described here again.
  • since the apparatus in this embodiment is used to implement the aforementioned training method, its functions correspond to those of the method and are not repeated here.
  • the present application also provides a multi-labeled data-based dependency parsing model training device, including:
  • Memory: used to store a computer program;
  • Processor: used to execute the computer program to implement the method for training a dependency parsing model based on multi-labeled data as described above.
  • the present application provides a readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, is used to implement the above-mentioned method for training a dependency parsing model based on multi-labeled data.
  • a software module can be located in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

A multi-labeled data-based dependency and syntactic parsing model training method and apparatus, a device and a readable storage medium, the method comprising: obtaining a word sequence and multiple labeling results; inputting the word sequence into a dependency and syntactic parsing model to obtain an arc score and a label score; according to an objective loss function, calculating loss values of the arc score and the label score relative to the multiple labeling results; adjusting model parameters of the dependency and syntactic parsing model by means of iterative training by taking the minimization of the loss values as the objective so as to implement model training. It can be seen that the described method can calculate the loss values of an output result of the model relative to all labeling results according to the objective loss function, and complete the iterative training of the model on the basis of the foregoing, thus achieving the objective of making full use of valid information in all labeled data and improving the dependency and syntactic parsing abilities of the model.

Description

A Method and Device for Training a Dependency Parsing Model Based on Multi-Labeled Data

This application claims priority to the Chinese patent application filed with the China Patent Office on October 13, 2020, with application number 202011089840.1 and invention title "A method and device for training a dependency parsing model based on multi-labeled data", the entire contents of which are incorporated herein by reference.
Technical Field

The present application relates to the field of computer technologies, and in particular to a method, apparatus, device, and readable storage medium for training a dependency parsing model based on multi-labeled data.
Background

The goal of dependency parsing is, given an input sentence, to capture the modification and collocation relationships between the words within the sentence, to describe the syntactic and semantic structure of the sentence, and to construct a dependency syntax tree.

In recent years, with the rapid development of deep learning in natural language processing, the accuracy of dependency parsing has improved significantly. However, when processing text that differs from the training data, its accuracy drops sharply. A straightforward solution to this problem is to annotate domain-specific syntactic data. However, most dependency treebanks are constructed over long periods by a small number of linguistic experts, which is time-consuming, labor-intensive, and costly, and cannot meet current needs.

Inspired by crowdsourcing, quickly constructing a multi-labeled dependency treebank from the annotations of a large number of non-expert annotators is a feasible approach. However, compared with expert annotation, the annotation quality of this approach is relatively low and its inconsistency is high. There are currently two solutions: one is to select a single annotation from the multiple annotations by majority voting; the other is to simply discard inconsistent annotations or review them manually.

With majority voting, the voted result may still be a completely wrong answer, which discards possibly correct information and harms training; moreover, the fewer the annotators, the less reliable the voting result. Although weighted voting can also be used, it still cannot avoid blindly trusting a few annotators when the number of annotators is small.

Simply discarding inconsistent sentences improves the reliability of the dataset, but if the original dataset has a high inconsistency rate, this approach greatly reduces the size of the dataset and causes waste. Manual review can greatly improve the quality of the dataset, but it is very time-consuming, labor-intensive, and costly.

In summary, although majority voting and simply discarding inconsistent data can yield a dataset directly usable for training a dependency parsing model, both approaches waste data: they discard part of the information in the dataset and fail to fully exploit the effective information in the multi-labeled data, resulting in poor model performance.

It can be seen that how to make full use of multi-labeled data to train a dependency parsing model and improve model performance is a problem urgently awaiting a solution by those skilled in the art.
Summary of the Invention

The purpose of this application is to provide a method, apparatus, device, and readable storage medium for training a dependency parsing model based on multi-labeled data, so as to solve the problem that current approaches to training a dependency parsing model with multi-labeled data essentially discard part of the labeled data and use only one annotation for model training, failing to fully utilize the effective information in the multi-labeled data and resulting in poor model performance. The specific solutions are as follows:
In a first aspect, the present application provides a method for training a dependency parsing model based on multi-labeled data, including:

acquiring a word sequence and multiple labeling results of the word sequence, wherein, for each modifier in the word sequence, each labeling result includes an arc and a dependency label, and each labeling result comes from a different user;

inputting the word sequence into a dependency parsing model to obtain an arc score and a label score;

calculating, according to an objective loss function, loss values of the arc score and the label score relative to the multiple labeling results;

adjusting, through iterative training aimed at minimizing the loss values, the model parameters of the dependency parsing model, so as to realize the training of the dependency parsing model.
Preferably, the calculating, according to the objective loss function, the loss values of the arc score and the label score relative to the multiple labeling results includes:

setting, according to the labeling abilities of different users, a weight value for each of the multiple labeling results;

calculating, according to the objective loss function and the weight values of the labeling results, the loss values of the arc score and the label score relative to the multiple labeling results.

Preferably, the setting a weight value for each of the multiple labeling results includes:

setting, for each of the multiple labeling results, an arc weight value and/or a label weight value.

Preferably, the calculating, according to the objective loss function, the loss values of the arc score and the label score relative to the multiple labeling results includes:

calculating, according to an arc loss function, the loss value of the arc score relative to the arcs in the multiple labeling results, to obtain a first loss value;

calculating, according to a label loss function, the loss value of the label score relative to the dependency labels in the multiple labeling results, to obtain a second loss value;

determining, according to the first loss value and the second loss value, the loss values of the arc score and the label score relative to the multiple labeling results.

Preferably, the calculating, according to the label loss function, the loss value of the label score relative to the dependency labels in the multiple labeling results to obtain the second loss value includes:

calculating, according to the label loss function, the loss value of the label score relative to the dependency label in a target labeling result, to obtain the second loss value, wherein the target labeling result is the labeling result, among the multiple labeling results, whose arc equals a target arc; the target arc is an arc determined according to a target strategy, and the target strategy includes: arc score prediction, majority voting, weighted voting, and random selection.
Preferably, the dependency parsing model includes: an input layer, an encoding layer, a first MLP layer, a first scoring layer, a second MLP layer, and a second scoring layer;

wherein the first MLP layer is used to determine, according to the output of the encoding layer, the representation vector of the current word as a core word and the representation vector of the current word as a modifier, and the first scoring layer is used to determine the arc score according to the output of the first MLP layer;

the second MLP layer is used to determine, according to the output of the encoding layer, a representation vector containing dependency label information when the current word is used as a core word and a representation vector containing dependency label information when the current word is used as a modifier, and the second scoring layer is used to determine the label score according to the output of the second MLP layer.

Preferably, the encoding layer of the dependency parsing model includes multiple layers of BiLSTM.
In a second aspect, the present application provides an apparatus for training a dependency parsing model based on multi-labeled data, including:

a training sample acquisition module, used to acquire a word sequence and multiple labeling results of the word sequence, wherein, for each modifier in the word sequence, the labeling results include arcs and dependency labels, and each labeling result comes from a different user;

an input and output module, used to input the word sequence into the dependency parsing model to obtain arc scores and label scores;

a loss calculation module, used to calculate, according to the objective loss function, the loss values of the arc score and the label score relative to the multiple labeling results;

an iterative module, used to adjust the model parameters of the dependency parsing model through iterative training aimed at minimizing the loss values, so as to realize the training of the dependency parsing model.
In a third aspect, the present application provides a device for training a dependency parsing model based on multi-labeled data, including:

a memory, used to store a computer program;

a processor, used to execute the computer program to implement the method for training a dependency parsing model based on multi-labeled data as described above.

In a fourth aspect, the present application provides a readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, is used to implement the method for training a dependency parsing model based on multi-labeled data as described above.
The method for training a dependency parsing model based on multi-labeled data provided by this application includes: acquiring a word sequence and multiple labeling results of the word sequence; inputting the word sequence into the dependency parsing model to obtain arc scores and label scores; calculating, according to the objective loss function, the loss values of the arc score and the label score relative to the multiple labeling results; and adjusting, through iterative training aimed at minimizing the loss values, the model parameters of the dependency parsing model to realize its training. It can be seen that this method can calculate the loss of the model's output relative to all the labeling results according to the objective loss function and complete the iterative training of the model accordingly, thereby fully utilizing the effective information in all the labeled data and improving the model's dependency parsing ability.

In addition, this application also provides an apparatus, a device, and a readable storage medium for training a dependency parsing model based on multi-labeled data, whose technical effects correspond to those of the above method and are not repeated here.
Description of Drawings

To explain the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is an implementation flowchart of Embodiment 1 of a method for training a dependency parsing model based on multi-labeled data provided by the present application;

FIG. 2 is a detailed flowchart of S103 in Embodiment 1 of the method for training a dependency parsing model based on multi-labeled data provided by the present application;

FIG. 3 is a schematic diagram of the model architecture in Embodiment 2 of the method for training a dependency parsing model based on multi-labeled data provided by the present application;

FIG. 4 is a schematic diagram of a single-labeling result in Embodiment 2 of the method for training a dependency parsing model based on multi-labeled data provided by the present application;

FIG. 5 shows the data storage format of a single-labeling result in Embodiment 2 of the method for training a dependency parsing model based on multi-labeled data provided by the present application;

FIG. 6 is a schematic diagram of a multi-labeling result in Embodiment 2 of the method for training a dependency parsing model based on multi-labeled data provided by the present application;

FIG. 7 shows the data storage format of multi-labeling results in Embodiment 2 of the method for training a dependency parsing model based on multi-labeled data provided by the present application;

FIG. 8 is a functional block diagram of an embodiment of an apparatus for training a dependency parsing model based on multi-labeled data provided by the present application.
Detailed Description

The core of this application is to provide a method, apparatus, device, and readable storage medium for training a dependency parsing model based on multi-labeled data, which can make full use of the effective information in all the labeled data and improve the dependency parsing ability of the model.

To enable those skilled in the art to better understand the solution of the present application, the present application is further described in detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.

Embodiment 1 of the method for training a dependency parsing model based on multi-labeled data provided by the present application is introduced below. Referring to FIG. 1, Embodiment 1 includes:
S101. Acquire a word sequence and multiple labeling results of the word sequence, where, for each modifier in the word sequence, each labeling result includes an arc and a dependency label.

The word sequence refers to the sequence obtained by segmenting a sentence into words. Among the multiple labeling results (two or more) acquired in this embodiment, each labeling result comes from a different user. Assuming a sentence is labeled by K users, K labeling results are produced, each of which is a dependency syntax tree of the sentence.

A dependency syntax tree describes the dependency relationships between words. A dependency contains three elements: a modifier, a core word, and a dependency type, meaning that the modifier modifies the core word with a certain dependency type.

In this embodiment, for each modifier in the word sequence, the labeling result contains two pieces of information: the arc (core word) and the dependency label.
S102. Input the word sequence into the dependency parsing model to obtain an arc score and a label score.

In this embodiment, the dependency parsing model is used to predict the core word and the dependency label of each word from the word sequence. Specifically, the model outputs an arc score and a dependency label score, from which the actually predicted arc and dependency label can be determined.

This embodiment does not limit which neural network is selected as the dependency parsing model, as long as it can predict dependency relationships from a word sequence. One feasible choice, adopted in this embodiment, is the Biaffine Parser model.
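To make the relationship between the two score tensors and the actual prediction concrete, the following is a minimal sketch of greedy decoding, assuming an (n+1)×(n+1) arc-score matrix and an (n+1)×(n+1)×L label-score tensor with position 0 reserved for the root; the shapes and names are illustrative assumptions, not the patent's notation:

```python
import numpy as np

def greedy_decode(arc_scores: np.ndarray, label_scores: np.ndarray):
    """arc_scores[i, j]: score of the arc with w_j as core word of modifier w_i.
    label_scores[i, j, l]: score of dependency label l on that arc.
    Position 0 is the auxiliary root node w_0."""
    n = arc_scores.shape[0] - 1
    heads, labels = [], []
    for i in range(1, n + 1):                  # every word except the pseudo root
        head = int(arc_scores[i].argmax())     # predicted core word of w_i
        labels.append(int(label_scores[i, head].argmax()))
        heads.append(head)
    return heads, labels

# toy usage: a 3-word sentence with 4 candidate dependency labels
rng = np.random.default_rng(0)
print(greedy_decode(rng.normal(size=(4, 4)), rng.normal(size=(4, 4, 4))))
```

Greedy argmax does not guarantee a well-formed tree; in practice the arc scores are usually passed to a maximum-spanning-tree decoder to obtain the final dependency tree.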
S103. Calculate, according to the objective loss function, the loss values of the arc score and the label score relative to the multiple labeling results.

In general, when only one labeling result is used as the standard, the loss between the actual prediction and that labeling result can be computed directly. Since this application uses multiple labeling results, the loss of the actual prediction must be computed relative to all of them. Specifically, the loss between the actual prediction and each labeling result can be computed separately and then accumulated.

S104. Adjust, through iterative training aimed at minimizing the loss value, the model parameters of the dependency parsing model, so as to realize its training.

This embodiment provides a method for training a dependency parsing model based on multi-labeled data that calculates the loss of the model's output relative to all the labeling results according to the objective loss function and completes the iterative training of the model accordingly, thereby fully utilizing the effective information in all the labeled data and improving the model's dependency parsing ability.
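As a concrete reading of S101 through S104, the following PyTorch-style sketch shows one training iteration that accumulates the loss over all K labeling results; the model interface, tensor shapes, and the use of cross-entropy are illustrative assumptions rather than the patent's exact formulation:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, words, annotations):
    """One iteration of S102-S104 for a single sentence.

    words:       word-index tensor of shape (n+1,), position 0 = root
    annotations: list of K labeling results; each is a pair (heads, labels)
                 of tensors of shape (n,), one entry per modifier w_1..w_n
    """
    arc_scores, label_scores = model(words)    # S102: (n+1, n+1) and (n+1, n+1, L)
    mod = torch.arange(1, arc_scores.size(0))  # modifier positions

    loss = arc_scores.new_zeros(())
    for heads, labels in annotations:          # S103: sum the loss over all K results
        loss = loss + F.cross_entropy(arc_scores[mod], heads)            # arc loss
        loss = loss + F.cross_entropy(label_scores[mod, heads], labels)  # label loss

    optimizer.zero_grad()                      # S104: adjust parameters to minimize
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here the label loss of each annotation is computed on that annotation's own arcs; as described below, it can instead be restricted to a single target arc per word.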
As a preferred implementation, on the basis of Embodiment 1, weight values may be assigned to the labeling results of different users to distinguish their labeling abilities. For example, annotations given by experts can be assigned relatively high weights, while annotations given by ordinary users can be assigned lower weights.

Specifically, to distinguish the labeling abilities of different users, a weight value is set for each of the multiple labeling results, and the process of S103 is then modified as: calculating, according to the objective loss function and the weight values of the labeling results, the loss values of the arc score and the label score relative to the multiple labeling results.

On this basis, considering that a labeling result contains two pieces of information, the arc and the dependency label, users' labeling abilities can also be distinguished along these two dimensions separately by setting an arc weight value and a label weight value respectively. It is even possible to distinguish labeling ability along only one of the dimensions and leave the other undistinguished.

In this case, the weight setting process is specifically: setting, for each of the multiple labeling results, an arc weight value and/or a label weight value. When both are set, the arc weight value may differ from the label weight value.

In summary, taking Table 1 as an example, when setting weights for word i, this embodiment provides the following four weight-setting modes to suit different scenarios:
Table 1 (the table is rendered as an image in the original publication and is not reproduced here)
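The table itself is not reproduced, but the way per-annotator arc and label weights would enter the accumulated loss can be sketched as follows; the names continue the previous sketch, the weight vectors are assumptions, and setting all weights to 1 recovers the unweighted sum:

```python
import torch
import torch.nn.functional as F

def weighted_loss(arc_scores, label_scores, annotations, arc_w, label_w):
    """Weighted variant of the accumulated loss in the previous sketch.

    arc_w[k] / label_w[k]: arc and label weight values of annotator k,
    reflecting that user's labeling ability."""
    mod = torch.arange(1, arc_scores.size(0))
    loss = arc_scores.new_zeros(())
    for k, (heads, labels) in enumerate(annotations):
        loss = loss + arc_w[k] * F.cross_entropy(arc_scores[mod], heads)
        loss = loss + label_w[k] * F.cross_entropy(label_scores[mod, heads], labels)
    return loss
```

Leaving arc_w (or label_w) all equal corresponds to distinguishing annotators along only one of the two dimensions.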
Specifically, when calculating the loss of the actual prediction (the labeling result output by the model) relative to all the labeling results, the calculation can be performed along the two dimensions of arcs and dependency labels. In this case, as shown in FIG. 2, S103 includes:

S201. Calculate, according to the arc loss function, the loss value of the arc score relative to the arcs in the multiple labeling results, to obtain a first loss value.

S202. Calculate, according to the label loss function, the loss value of the label score relative to the dependency labels in the multiple labeling results, to obtain a second loss value.

S203. Determine, according to the first loss value and the second loss value, the loss values of the arc score and the label score relative to the multiple labeling results.

On this basis, as a preferred implementation, when calculating the label loss, the difference need not be computed against the relation type labels in all the labeling results, but only against those in some of them. The partial labeling results here are the labeling results selected from all the labeling results according to a certain strategy, which may specifically be majority voting, weighted voting, arc score prediction, random selection, and so on. In this case, as shown in FIG. 3, S202 is specifically:

calculating, according to the label loss function, the loss value of the label score relative to the dependency label in a target labeling result, to obtain the second loss value, wherein the target labeling result is the labeling result, among the multiple labeling results, whose arc equals a target arc; the target arc is an arc determined according to a target strategy, and the target strategy includes: arc score prediction, majority voting, weighted voting, and random selection.
Arc score prediction means selecting, according to the arc scores output by the dependency parsing model, the arc with the highest score as the target arc.

Majority voting means selecting, by the majority voting method, the arc that appears most often in the multiple labeling results as the target arc.

Weighted voting means selecting the target arc by the weighted majority voting method, combining the weight of each labeling result with the number of times each arc appears in the multiple labeling results.

Random selection means randomly selecting an arc from the multiple labeling results as the target arc.
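A compact sketch of the four target-arc strategies for a single modifier follows; this is an illustration under assumed inputs, and tie-breaking is arbitrary:

```python
import random
from collections import Counter

def select_target_arc(arc_scores_row, annotated_heads, weights, strategy):
    """Choose the target arc (head position) for one modifier.

    arc_scores_row:  model arc scores of this word over all candidate heads
    annotated_heads: the K heads given to this word by the K annotators
    weights:         per-annotator weights (used only by weighted voting)
    """
    if strategy == "arc_score":        # arc score prediction: highest-scoring arc
        return max(range(len(arc_scores_row)), key=lambda j: arc_scores_row[j])
    if strategy == "majority":         # majority voting: most frequent annotated arc
        return Counter(annotated_heads).most_common(1)[0][0]
    if strategy == "weighted":         # weighted voting: votes scaled by weights
        votes = Counter()
        for head, w in zip(annotated_heads, weights):
            votes[head] += w
        return votes.most_common(1)[0][0]
    if strategy == "random":           # random selection among annotated arcs
        return random.choice(annotated_heads)
    raise ValueError(f"unknown strategy: {strategy}")

print(select_target_arc([0.1, 2.0, 0.3], [1, 2, 1], [0.5, 0.3, 0.2], "majority"))
```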
下面开始详细介绍本申请提供的一种基于多标注数据的依存句法分析模型训练方法实施例二,实施例二基于前述介绍以实际应用为例对训练过程进行了详尽的说明。The second embodiment of a method for training a multi-labeled data-based dependency parsing model provided by the present application will be described in detail below. The second embodiment provides a detailed description of the training process based on the foregoing introduction and taking practical applications as an example.
This embodiment uses the Biaffine Parser model, as shown in FIG. 3. The dependency parsing model includes an input layer, an encoding layer, a first MLP layer, a first scoring layer, a second MLP layer, and a second scoring layer.
The encoding layer consists of multiple BiLSTM layers.
The first MLP layer determines, from the output of the encoding layer, a representation vector of the current word as a core word and a representation vector of the current word as a modifier; the first scoring layer determines the arc scores from the output of the first MLP layer.
The second MLP layer determines, from the output of the encoding layer, a representation vector containing dependency-label information when the current word serves as a core word and a representation vector containing dependency-label information when the current word serves as a modifier; the second scoring layer determines the label scores from the output of the second MLP layer. A skeleton of this layer stack is sketched below.
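The skeleton is illustrative only: the class name, layer sizes, number of BiLSTM layers, and the omission of the Char-LSTM input are simplifying assumptions, not details fixed by the application.

    import torch.nn as nn

    class BiaffineParserSketch(nn.Module):
        """Input layer -> BiLSTM encoder -> two MLP layers feeding two scorers."""

        def __init__(self, n_words, n_labels, d_emb=100, d_lstm=400,
                     d_arc=500, d_label=100):
            super().__init__()
            self.embed = nn.Embedding(n_words, d_emb)            # input layer
            self.encoder = nn.LSTM(d_emb, d_lstm, num_layers=3,  # encoding layer
                                   bidirectional=True, batch_first=True)
            d_h = 2 * d_lstm
            # first MLP layer: current word as core word / as modifier
            self.mlp_arc_h = nn.Sequential(nn.Linear(d_h, d_arc), nn.ReLU())
            self.mlp_arc_d = nn.Sequential(nn.Linear(d_h, d_arc), nn.ReLU())
            # second MLP layer: the same two views, with dependency-label information
            self.mlp_label_h = nn.Sequential(nn.Linear(d_h, d_label), nn.ReLU())
            self.mlp_label_d = nn.Sequential(nn.Linear(d_h, d_label), nn.ReLU())

        def forward(self, words):
            x = self.embed(words)       # (B, n, d_emb)
            h, _ = self.encoder(x)      # (B, n, 2 * d_lstm)
            # the two biaffine scoring layers consume these four vectors
            return (self.mlp_arc_h(h), self.mlp_arc_d(h),
                    self.mlp_label_h(h), self.mlp_label_d(h))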
For a sentence S = w_0 w_1 w_2 w_3 ... w_n, w_0 is an auxiliary root node inserted at the beginning of the sentence. The input layer maps each word w_i to a vector x_i, where x_i is the concatenation of a word embedding vector and a character embedding (Char-LSTM) vector, namely:
x_i = e_i^word ⊕ e_i^char
The encoding layer is a multi-layer BiLSTM; the concatenation of the two directional outputs of one BiLSTM layer is the input to the next layer.
The MLP representation layer then takes the output h_i of the encoding layer as input and uses four independent MLPs to obtain four low-dimensional representation vectors r_i^{arc-h}, r_i^{arc-d}, r_i^{label-h}, and r_i^{label-d}, each containing the corresponding information, as follows:
r_i^{arc-h} = MLP^{arc-h}(h_i)
r_i^{arc-d} = MLP^{arc-d}(h_i)
r_i^{label-h} = MLP^{label-h}(h_i)
r_i^{label-d} = MLP^{label-d}(h_i)
where r_i^{arc-h} is the representation vector of w_i as a core word, r_i^{arc-d} is the representation vector of w_i as a modifier, r_i^{label-h} is the representation vector of w_i as a core word containing predicted dependency-label information, and r_i^{label-d} is the representation vector of w_i as a modifier containing predicted dependency-label information.
The biaffine scoring layer then computes the scores of all dependencies via biaffine attention. The score of a dependency has two parts, an arc score and a dependency-label score. The arc score is:
score^arc(i, j) = [r_i^{arc-d} ⊕ 1]^T W_b r_j^{arc-h}
where score^arc(i, j) denotes the score of the dependency arc in which j serves as the core word and i as the modifier, and the matrix W_b is the biaffine parameter.
The dependency-label score is:
score^label(i, j, l) = (r_i^{label-d})^T U^(l) r_j^{label-h} + (w^(l))^T (r_i^{label-d} ⊕ r_j^{label-h}) + b
where U^(l) and w^(l) are the biaffine parameters and b is the bias. Both scorers are sketched in code below.
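In the sketch, both scorers are written with einsum. It assumes unbatched (single-sentence) tensors and realizes the concatenation term by splitting w^(l) into a modifier half and a core-word half; these are implementation choices of the sketch, not details taken from the application.

    import torch

    def arc_score_matrix(r_arc_d, r_arc_h, W_b):
        """score^arc(i, j) for all pairs: [r_i ⊕ 1]^T W_b r_j.

        r_arc_d, r_arc_h: (n, d); W_b: (d + 1, d); returns (n, n).
        """
        ones = torch.ones(r_arc_d.size(0), 1)
        r_d = torch.cat([r_arc_d, ones], dim=-1)   # append the bias dimension
        return r_d @ W_b @ r_arc_h.t()

    def label_score_tensor(r_label_d, r_label_h, U, w, b):
        """score^label(i, j, l) for all pairs and labels.

        U: (T, d, d); w: (T, 2d); b: (T,); returns (n, n, T).
        """
        d = r_label_d.size(1)
        bilinear = torch.einsum('id,tde,je->ijt', r_label_d, U, r_label_h)
        lin_d = torch.einsum('id,td->it', r_label_d, w[:, :d])[:, None, :]
        lin_h = torch.einsum('jd,td->jt', r_label_h, w[:, d:])[None, :, :]
        return bilinear + lin_d + lin_h + b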
The overall loss of the model includes two parts, an arc loss and a label loss. The arc loss is the part of the overall loss function expressing the difference between the distribution of predicted arcs and the distribution of true arcs; the label loss is the part expressing the difference between the distribution of predicted labels and the true labels.
The original Biaffine attention parser uses cross-entropy as the loss function, computing a local loss for each word separately. The original arc loss function is:
L^arc(i, h_i) = -log( exp(score^arc(i, h_i)) / Σ_{j=0..n} exp(score^arc(i, j)) )
where h_i denotes the single gold-standard core word annotated for w_i.
In this embodiment, to adapt the model to multi-labeled data, the original loss function of the model is modified to make full use of all the answers in the multi-labeled data. Suppose a sentence is annotated by K annotators, producing multi-labeled data. For the i-th word, the K core words correspondingly annotated by the K annotators are written as a list H = [h_1, h_2, ..., h_K]; the arc loss of this word is then:
L_i^arc = -Σ_{k=1..K} log( exp(score^arc(i, h_k)) / Σ_{j=0..n} exp(score^arc(i, j)) )
Suppose the label set is L = {l_1, l_2, ..., l_T}. For a dependency arc in which modifier i modifies core word j with dependency type l, the original label loss is:
L^label(i, j, l) = -log( exp(score^label(i, j, l)) / Σ_{t=1..T} exp(score^label(i, j, l_t)) )
Suppose the K dependency labels correspondingly annotated by the K annotators are written as Y = [y_1, y_2, ..., y_K]. The loss of each answer pair (h_k, y_k) is computed with the above loss functions and the losses are summed, giving the final overall loss function:
L = Σ_{k=1..K} [ L^arc(i, h_k) + L^label(i, h_k, y_k) ]
The overall loss is minimized over the training iterations, shrinking these differences and thereby reaching the optimal result. A sketch of this multi-annotation loss follows.
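The sketch mirrors the equations above for a single sentence; the tensor shapes and function name are assumptions.

    import torch

    def multi_annotation_loss(arc_scores, label_scores, H, Y):
        """L = sum over k of [L^arc(i, h_k) + L^label(i, h_k, y_k)], over all words i.

        arc_scores: (n, n); label_scores: (n, n, T); H, Y: (K, n) integer tensors
        holding each annotator's core-word indices and label ids.
        """
        n = arc_scores.size(0)
        log_p_arc = arc_scores.log_softmax(dim=-1)       # over candidate core words j
        log_p_label = label_scores.log_softmax(dim=-1)   # over the label set L
        idx = torch.arange(n)
        loss = arc_scores.new_zeros(())
        for heads, labels in zip(H, Y):                  # one pass per annotator k
            loss = loss - log_p_arc[idx, heads].sum()            # arc term
            loss = loss - log_p_label[idx, heads, labels].sum()  # label term
        return loss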
After iterative training, the final syntactic parsing model is obtained, which can decode any input sentence into a syntax-tree result. Once the syntactic information of the data is obtained, it can be used to extract long-distance information for other natural language tasks.
On this basis, weight values can be set for the multiple annotation results. For example, an annotator's annotation ability can be measured by his agreement rate with the other annotators: the higher the agreement rate, the higher the weight.
If there are K annotators {a_1, a_2, ..., a_K}, s(a_k) is the number of words annotated by annotator a_k, and w(a_k) is the number of words, among those annotated by a_k, whose answers agree with those given by the other annotators, then w(a_k)/s(a_k) is the agreement rate of annotator a_k. The weight is the normalized agreement rate; that is, the weight of annotator a_k is computed as:
weight(a_k) = ( w(a_k) / s(a_k) ) / Σ_{k'=1..K} ( w(a_{k'}) / s(a_{k'}) )
The arc loss function of the i-th word is accordingly modified to:
L_i^arc = -Σ_{k=1..K} weight(a_k) · log( exp(score^arc(i, h_k)) / Σ_{j=0..n} exp(score^arc(i, j)) )
Here the dependency-type loss is not weighted a second time, so the final loss function is as follows (a code sketch follows the equation):
L = Σ_{k=1..K} [ weight(a_k) · L^arc(i, h_k) + L^label(i, h_k, y_k) ]
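Continuing the earlier sketch, the agreement-rate weights and the weighted loss might look as follows; the list-based interface is an assumption.

    import torch

    def annotator_weights(s, w):
        """Normalized agreement rates: s[k] = number of words annotated by a_k,
        w[k] = number of those words where a_k agrees with the other annotators."""
        rates = [w_k / s_k for w_k, s_k in zip(w, s)]
        total = sum(rates)
        return [r / total for r in rates]

    def weighted_multi_annotation_loss(arc_scores, label_scores, H, Y, weights):
        """As multi_annotation_loss above, but annotator k's arc term is scaled
        by weight(a_k); the label term is left unweighted, as in the text."""
        n = arc_scores.size(0)
        log_p_arc = arc_scores.log_softmax(dim=-1)
        log_p_label = label_scores.log_softmax(dim=-1)
        idx = torch.arange(n)
        loss = arc_scores.new_zeros(())
        for w_k, heads, labels in zip(weights, H, Y):
            loss = loss - w_k * log_p_arc[idx, heads].sum()
            loss = loss - log_p_label[idx, heads, labels].sum()
        return loss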
The above describes the loss-function computation of this embodiment; other computation methods may be adopted in practical applications, and this should not be construed as limiting the present application.
An example dependency syntax tree is shown in FIG. 4, where $_0 denotes a pseudo node; the word it points to is the root node of the sentence. A dependency arc consists of three elements, (w_i, w_j, r), where w_i is called the core word, w_j the modifier, and r the relation type, indicating that w_j modifies w_i in the syntactic role r. The dependency arc here is used as an example, with the relation type omitted.
Existing models use a gold-standard treebank, in which each sentence has only one standard answer, as shown in FIG. 4. FIG. 4 is a graphical representation of dependency-syntax data; the corresponding CoNLL storage format is shown in FIG. 5, where the second column is the word itself and the seventh column is the standard answer for the corresponding core-word sequence.
In the present application, multiple annotators annotate the same sentence according to the annotation guidelines, thereby obtaining multiple annotation results. Each sentence therefore has multiple annotated syntax-tree answers; FIG. 6 is an example annotated by two people, with one person's annotation above the sentence and the other person's below it. Correspondingly, the present application modifies the CoNLL format so that the data format also accommodates multiple annotations, as shown in FIG. 7. The first 10 columns are identical to the CoNLL format; columns 11 and 12 are the first annotator's identifier and core-word-sequence answer, and columns 14 and 15 are the second annotator's identifier and core-word-sequence answer. A reading sketch for this extended format follows.
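In the sketch below, the 0-based column indices follow the column numbers stated above; the tab separator, function name, and returned structure are assumptions of the sketch.

    def read_multi_annotated_conll(path):
        """Parse the extended CoNLL file: columns 1-10 as in CoNLL; columns
        11-12 and 14-15 hold each annotator's identifier and core-word answer."""
        sentences, rows = [], []
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.rstrip('\n')
                if not line:                 # a blank line ends a sentence
                    if rows:
                        sentences.append(rows)
                    rows = []
                    continue
                cols = line.split('\t')
                rows.append({
                    'form': cols[1],                     # the word itself
                    'annotators': (cols[10], cols[13]),  # columns 11 and 14
                    'heads': (cols[11], cols[14]),       # columns 12 and 15
                })
        if rows:
            sentences.append(rows)
        return sentences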
The data input format and loss function of the Biaffine Parser base model are modified according to the scheme given in the present application, after which the multi-labeled data can be used directly for training. After iterative training, a syntactic parsing model is obtained that can produce a syntax-tree result for any input sentence.
An apparatus for training a dependency parsing model based on multi-labeled data provided by an embodiment of the present application is introduced below; the apparatus described below and the method described above may be referred to in correspondence with each other.
As shown in FIG. 8, the apparatus for training a dependency parsing model based on multi-labeled data of this embodiment includes:
a training sample acquisition module 801, configured to acquire a word sequence and multiple annotation results of the word sequence, wherein for each modifier in the word sequence the annotation results include an arc and a dependency label, and each annotation result comes from a different user;
an input/output module 802, configured to input the word sequence into the dependency parsing model to obtain arc scores and label scores;
a loss calculation module 803, configured to calculate, according to a target loss function, the loss values of the arc scores and the label scores relative to the multiple annotation results;
an iteration module 804, configured to adjust, through iterative training and with the aim of minimizing the loss values, the model parameters of the dependency parsing model, so as to train the dependency parsing model.
The apparatus of this embodiment is used to implement the aforementioned method for training a dependency parsing model based on multi-labeled data, so its specific implementation can be found in the method embodiments above; for example, the training sample acquisition module 801, the input/output module 802, the loss calculation module 803, and the iteration module 804 implement steps S101, S102, S103, and S104 of the method, respectively. For details, reference may be made to the descriptions of the corresponding embodiments, which are not repeated here.
In addition, since the apparatus of this embodiment implements the aforementioned training method, its function corresponds to that of the method and is not repeated here.
Furthermore, the present application provides a device for training a dependency parsing model based on multi-labeled data, including:
a memory, configured to store a computer program;
a processor, configured to execute the computer program to implement the method for training a dependency parsing model based on multi-labeled data described above.
Finally, the present application provides a readable storage medium storing a computer program which, when executed by a processor, implements the method for training a dependency parsing model based on multi-labeled data described above.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed therein, its description is relatively brief; for relevant details, see the description of the method.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The solution provided by the present application has been introduced in detail above. Specific examples are used herein to illustrate its principles and implementations, and the description of the above embodiments is intended only to help understand the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may, based on the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

  1. A method for training a dependency parsing model based on multi-labeled data, characterized by comprising:
    acquiring a word sequence and multiple annotation results of the word sequence, wherein for each modifier in the word sequence the annotation results comprise an arc and a dependency label, and each annotation result comes from a different user;
    inputting the word sequence into a dependency parsing model to obtain arc scores and label scores;
    calculating, according to a target loss function, loss values of the arc scores and the label scores relative to the multiple annotation results;
    adjusting, through iterative training and with the aim of minimizing the loss values, model parameters of the dependency parsing model, so as to train the dependency parsing model.
  2. The method according to claim 1, wherein the calculating, according to a target loss function, loss values of the arc scores and the label scores relative to the multiple annotation results comprises:
    setting a weight value for each of the multiple annotation results according to the annotation abilities of the different users;
    calculating the loss values of the arc scores and the label scores relative to the multiple annotation results according to the target loss function and the weight values of the annotation results.
  3. The method according to claim 2, wherein the setting a weight value for each of the multiple annotation results comprises:
    setting an arc weight value and/or a label weight value for each of the multiple annotation results.
  4. The method according to claim 1, wherein the calculating, according to a target loss function, loss values of the arc scores and the label scores relative to the multiple annotation results comprises:
    calculating, according to an arc loss function, a loss value of the arc scores relative to the arcs in the multiple annotation results to obtain a first loss value;
    calculating, according to a label loss function, a loss value of the label scores relative to the dependency labels in the multiple annotation results to obtain a second loss value;
    determining, according to the first loss value and the second loss value, the loss values of the arc scores and the label scores relative to the multiple annotation results.
  5. The method according to claim 4, wherein the calculating, according to a label loss function, a loss value of the label scores relative to the dependency labels in the multiple annotation results to obtain a second loss value comprises:
    calculating, according to the label loss function, a loss value of the label scores relative to the dependency labels in a target annotation result to obtain the second loss value, wherein the target annotation result is an annotation result, among the multiple annotation results, whose arc equals a target arc, the target arc is an arc determined according to a target strategy, and the target strategy comprises: arc-score prediction, majority voting, weighted voting, and random selection.
  6. The method according to claim 1, wherein the dependency parsing model comprises an input layer, an encoding layer, a first MLP layer, a first scoring layer, a second MLP layer, and a second scoring layer;
    wherein the first MLP layer is configured to determine, from the output of the encoding layer, a representation vector of the current word as a core word and a representation vector of the current word as a modifier, and the first scoring layer is configured to determine the arc scores from the output of the first MLP layer;
    the second MLP layer is configured to determine, from the output of the encoding layer, a representation vector containing dependency-label information when the current word serves as a core word and a representation vector containing dependency-label information when the current word serves as a modifier, and the second scoring layer is configured to determine the label scores from the output of the second MLP layer.
  7. The method according to claim 6, wherein the encoding layer of the dependency parsing model comprises a multi-layer BiLSTM.
  8. An apparatus for training a dependency parsing model based on multi-labeled data, characterized by comprising:
    a training sample acquisition module, configured to acquire a word sequence and multiple annotation results of the word sequence, wherein for each modifier in the word sequence the annotation results comprise an arc and a dependency label, and each annotation result comes from a different user;
    an input/output module, configured to input the word sequence into a dependency parsing model to obtain arc scores and label scores;
    a loss calculation module, configured to calculate, according to a target loss function, loss values of the arc scores and the label scores relative to the multiple annotation results;
    an iteration module, configured to adjust, through iterative training and with the aim of minimizing the loss values, model parameters of the dependency parsing model, so as to train the dependency parsing model.
  9. A device for training a dependency parsing model based on multi-labeled data, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to execute the computer program to implement the method for training a dependency parsing model based on multi-labeled data according to any one of claims 1-7.
  10. A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and the computer program, when executed by a processor, implements the method for training a dependency parsing model based on multi-labeled data according to any one of claims 1-7.
PCT/CN2021/088601 2020-10-13 2021-04-21 Multi-labeled data-based dependency and syntactic parsing model training method and apparatus WO2022077891A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011089840.1A CN112232024A (en) 2020-10-13 2020-10-13 Dependency syntax analysis model training method and device based on multi-labeled data
CN202011089840.1 2020-10-13

Publications (1)

Publication Number Publication Date
WO2022077891A1 true WO2022077891A1 (en) 2022-04-21

Family

ID=74112424

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088601 WO2022077891A1 (en) 2020-10-13 2021-04-21 Multi-labeled data-based dependency and syntactic parsing model training method and apparatus

Country Status (2)

Country Link
CN (1) CN112232024A (en)
WO (1) WO2022077891A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115062611A (en) * 2022-05-23 2022-09-16 广东外语外贸大学 Training method, device, equipment and storage medium of grammar error correction model
CN115391608A (en) * 2022-08-23 2022-11-25 哈尔滨工业大学 Automatic labeling conversion method for graph-to-graph structure
CN117436446A (en) * 2023-12-21 2024-01-23 江西农业大学 Weak supervision-based agricultural social sales service user evaluation data analysis method

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232024A (en) * 2020-10-13 2021-01-15 苏州大学 Dependency syntax analysis model training method and device based on multi-labeled data
CN113901791B (en) * 2021-09-15 2022-09-23 昆明理工大学 Enhanced dependency syntax analysis method for fusing multi-strategy data under low-resource condition
CN114611487B (en) * 2022-03-10 2022-12-13 昆明理工大学 Unsupervised Thai dependency syntax analysis method based on dynamic word embedding alignment
CN114611463B (en) * 2022-05-10 2022-09-13 天津大学 Dependency analysis-oriented crowdsourcing labeling method and device
CN116306663B (en) * 2022-12-27 2024-01-02 华润数字科技有限公司 Semantic role labeling method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965821A (en) * 2015-07-17 2015-10-07 苏州大学张家港工业技术研究院 Data annotation method and apparatus
CN110444261A (en) * 2019-07-11 2019-11-12 新华三大数据技术有限公司 Sequence labelling network training method, electronic health record processing method and relevant apparatus
CN110472229A (en) * 2019-07-11 2019-11-19 新华三大数据技术有限公司 Sequence labelling model training method, electronic health record processing method and relevant apparatus
CN112232024A (en) * 2020-10-13 2021-01-15 苏州大学 Dependency syntax analysis model training method and device based on multi-labeled data

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462066B (en) * 2014-12-24 2017-10-03 北京百度网讯科技有限公司 Semantic character labeling method and device
US10002129B1 (en) * 2017-02-15 2018-06-19 Wipro Limited System and method for extracting information from unstructured text
CN107168945B (en) * 2017-04-13 2020-07-14 广东工业大学 Bidirectional cyclic neural network fine-grained opinion mining method integrating multiple features
CN108172246A (en) * 2017-12-29 2018-06-15 北京淳中科技股份有限公司 The Collaborative Tagging method and apparatus of more tagging equipments
CN108647254B (en) * 2018-04-23 2021-06-22 苏州大学 Automatic tree library conversion method and system based on pattern embedding
CN108628829B (en) * 2018-04-23 2022-03-15 苏州大学 Automatic tree bank transformation method and system based on tree-shaped cyclic neural network
CN108776820A (en) * 2018-06-07 2018-11-09 中国矿业大学 It is a kind of to utilize the improved random forest integrated approach of width neural network
CN110795934B (en) * 2019-10-31 2023-09-19 北京金山数字娱乐科技有限公司 Sentence analysis model training method and device and sentence analysis method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965821A (en) * 2015-07-17 2015-10-07 苏州大学张家港工业技术研究院 Data annotation method and apparatus
CN110444261A (en) * 2019-07-11 2019-11-12 新华三大数据技术有限公司 Sequence labelling network training method, electronic health record processing method and relevant apparatus
CN110472229A (en) * 2019-07-11 2019-11-19 新华三大数据技术有限公司 Sequence labelling model training method, electronic health record processing method and relevant apparatus
CN112232024A (en) * 2020-10-13 2021-01-15 苏州大学 Dependency syntax analysis model training method and device based on multi-labeled data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Computer vision - ECCV 2020 : 16th European conference, Glasgow, UK, August 23-28, 2020 : proceedings", 2 October 2020, SPRINGER INTERNATIONAL PUBLISHING, Cham, ISBN: 978-3-030-58594-5, article ZHAO YU; ZHOU MINGYUE; LI ZHENGHUA; ZHANG MIN: "Dependency Parsing with Noisy Multi-annotation Data", pages: 120 - 131, XP047565281, DOI: 10.1007/978-3-030-60457-8_10 *
LIU SHIYI: "Dependency Parsing Research Model Based on Deep Learning", INFORMATION SCIENCE AND TECHNOLOGY, CHINESE MASTER’S THESES FULL-TEXT DATABASE, 15 August 2019 (2019-08-15), XP055922616 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115062611A (en) * 2022-05-23 2022-09-16 广东外语外贸大学 Training method, device, equipment and storage medium of grammar error correction model
CN115062611B (en) * 2022-05-23 2023-05-05 广东外语外贸大学 Training method, device, equipment and storage medium of grammar error correction model
CN115391608A (en) * 2022-08-23 2022-11-25 哈尔滨工业大学 Automatic labeling conversion method for graph-to-graph structure
CN117436446A (en) * 2023-12-21 2024-01-23 江西农业大学 Weak supervision-based agricultural social sales service user evaluation data analysis method
CN117436446B (en) * 2023-12-21 2024-03-22 江西农业大学 Weak supervision-based agricultural social sales service user evaluation data analysis method

Also Published As

Publication number Publication date
CN112232024A (en) 2021-01-15

Similar Documents

Publication Publication Date Title
WO2022077891A1 (en) Multi-labeled data-based dependency and syntactic parsing model training method and apparatus
Tan et al. Phrase-based image caption generator with hierarchical LSTM network
AU2004232276B2 (en) Method for sentence structure analysis based on mobile configuration concept and method for natural language search using of it
CN111666758B (en) Chinese word segmentation method, training device and computer readable storage medium
CN110008309B (en) Phrase mining method and device
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
US9141601B2 (en) Learning device, determination device, learning method, determination method, and computer program product
WO2020207179A1 (en) Method for extracting concept word from video caption
CN109062904B (en) Logic predicate extraction method and device
CN110119510B (en) Relationship extraction method and device based on transfer dependency relationship and structure auxiliary word
WO2023231331A1 (en) Knowledge extraction method, system and device, and storage medium
CN115238690A (en) Military field composite named entity identification method based on BERT
CN109948144A (en) A method of the Teachers ' Talk Intelligent treatment based on classroom instruction situation
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
WO2023169301A1 (en) Text processing method and apparatus, and electronic device
CN117056543A (en) Multi-mode patent retrieval method based on images
CN115526176A (en) Text recognition method and device, electronic equipment and storage medium
He et al. [Retracted] Application of Grammar Error Detection Method for English Composition Based on Machine Learning
CN114611463B (en) Dependency analysis-oriented crowdsourcing labeling method and device
CN111737422B (en) Entity linking method and device, electronic equipment and storage medium
CN112181389B (en) Method, system and computer equipment for generating API (application program interface) marks of course fragments
JP4933118B2 (en) Sentence extraction device and program
WO2024131111A1 (en) Intelligent writing method and apparatus, device, and nonvolatile readable storage medium
Moradshahi Internationalization of Task-Oriented Dialogue Systems
JP4622272B2 (en) Language processing apparatus, language processing method and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21878945

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21878945

Country of ref document: EP

Kind code of ref document: A1