CN114091468A - Reference resolution model training method and device and electronic equipment - Google Patents

Reference resolution model training method and device and electronic equipment

Info

Publication number: CN114091468A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202111258623.5A
Other languages: Chinese (zh)
Inventors: 李晨, 阳任科
Current assignee: Beijing QIYI Century Science and Technology Co Ltd (the listed assignee may be inaccurate)
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202111258623.5A
Publication of CN114091468A

Classifications

    • G06F 40/30: Handling natural language data; Semantic analysis
    • G06F 16/335: Information retrieval; Querying; Filtering based on additional data, e.g. user or group profiles
    • G06F 40/166: Text processing; Editing, e.g. inserting or deleting
    • G06F 40/274: Natural language analysis; Converting codes to words; Guess-ahead of partial word inputs
    • G06N 20/00: Machine learning


Abstract

An embodiment of the invention provides a training method and apparatus for a reference resolution model, and an electronic device. The method comprises the following steps: screening, in a preset corpus pool, corpora that meet a target condition, where a corpus meeting the target condition contains at least a first candidate noun, a second candidate noun and a target noun, the first candidate noun being the same as the target noun and the second candidate noun being different from the target noun; replacing the target noun in each such corpus with a preset identifier, the first candidate noun with a first preset noun, and the second candidate noun with a second preset noun to obtain a target corpus; generating annotation information corresponding to the target corpus; and training the reference resolution model according to the target corpus and its annotation information. In this way, a large number of target corpora for training the reference resolution model are constructed automatically, avoiding manual annotation and making the whole training process time- and labor-saving.

Description

Reference resolution model training method and device and electronic equipment
Technical Field
The present invention relates to the field of natural language processing, and in particular to a training method and apparatus for a reference resolution model, and an electronic device.
Background
In the field of natural language processing, a machine is required to perform semantic analysis and semantic understanding of natural language in order to predict the noun to which a pronoun refers. The process of determining the noun to which a pronoun refers is known as reference resolution; that is, reference resolution predicts, within a sentence, which noun a pronoun refers to.
Generally, to ensure the prediction accuracy of a reference resolution model, a large amount of annotated corpora is required for its training. The annotation information marks the correct noun and the incorrect noun for each pronoun. The quantity of corpora and the quality of their annotation are important aspects of the whole training process.
However, annotating a corpus requires accurately understanding its semantics, so at present corpora can only be annotated manually. Given that training a reference resolution model requires a large amount of annotated corpora, the whole training process is currently time-consuming and labor-intensive.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a training method and apparatus for a reference resolution model, and an electronic device, so as to solve the problem in the prior art that training a reference resolution model is time-consuming and labor-intensive.
In a first aspect of the embodiments of the present invention, there is provided a training method for a reference resolution model, the method including:
screening, in a preset corpus pool, corpora that meet a target condition; wherein a corpus meeting the target condition contains at least a first candidate noun, a second candidate noun and a target noun, the target noun is located after the first candidate noun and the second candidate noun in the corpus, the first candidate noun is the same as the target noun, and the second candidate noun is different from the target noun;
replacing the target noun in the corpus meeting the target condition with a preset identifier, replacing the first candidate noun with a first preset noun, and replacing the second candidate noun with a second preset noun to obtain a target corpus, wherein the first preset noun and the second preset noun are two different nouns in a target word bank containing a preset number of nouns;
generating tagging information corresponding to the target corpus according to the first preset noun and the second preset noun;
and training a reference resolution model according to the target linguistic data and the marking information corresponding to the target linguistic data.
Optionally, replacing the target noun in the corpus meeting the target condition with a preset identifier, replacing the first candidate noun with a first preset noun, and replacing the second candidate noun with a second preset noun to obtain a target corpus, including:
respectively aiming at each corpus meeting the target condition, randomly selecting a noun in the target word library as a first preset noun, and randomly selecting a noun different from the first preset noun again as a second preset noun;
and respectively aiming at each corpus which meets the target condition, adopting a first preset noun to replace the first candidate noun, adopting a second preset noun to replace the second candidate noun and adopting a preset identifier to replace the target noun to generate the target corpus.
Optionally, in a case that the noun, the first candidate noun, and the second candidate noun in the target thesaurus are all names of people, the randomly selecting a noun in the target thesaurus as a first preset noun, and randomly selecting a noun different from the first preset noun again as a second preset noun includes:
according to the literary work to which the corpus conforming to the target condition belongs, determining that the gender corresponding to the first candidate noun is a first gender, and the gender corresponding to the second candidate noun is a second gender;
randomly selecting a noun from nouns corresponding to the first gender in the target word bank as a first preset noun;
randomly selecting a noun different from the first preset noun from the nouns corresponding to the second gender in the target word stock as a second preset noun.
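As an illustrative sketch of the gender-constrained selection above (the lexicon contents, grouping structure and function name are hypothetical, not taken from the patent):

```python
import random

# Hypothetical target word library: preset person names grouped by gender.
TARGET_WORD_LIBRARY = {
    "male": ["Xiaoming", "Xiaogang", "Xiaoqiang"],
    "female": ["Xiaohong", "Xiaomei", "Xiaofang"],
}

def pick_preset_nouns(first_gender, second_gender, rng=random):
    """Randomly pick a first preset noun matching `first_gender`, then a
    second preset noun matching `second_gender` that differs from the first."""
    first = rng.choice(TARGET_WORD_LIBRARY[first_gender])
    pool = [n for n in TARGET_WORD_LIBRARY[second_gender] if n != first]
    second = rng.choice(pool)
    return first, second
```

For example, `pick_preset_nouns("male", "female")` might return `("Xiaoming", "Xiaohong")`; filtering the second pool against the first pick keeps the two preset nouns distinct even when both genders are the same.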
Optionally, the screening, in the preset corpus pool, the corpuses meeting the target condition includes:
determining nouns contained in each corpus in the preset corpus pool based on named entity recognition;
and screening the linguistic data at least comprising the first candidate noun, the second candidate noun and the target noun according to the noun contained in each linguistic data.
Optionally, the generating, according to the first preset noun and the second preset noun, tagging information corresponding to the target corpus includes:
forming the first preset noun and the second preset noun into a candidate noun set;
and recording target information of the preset identification referring to a first preset noun in the candidate noun set.
Optionally, the training the reference resolution model according to the target corpus and the label information corresponding to the target corpus includes:
adding a first label identification to the first preset noun in the target corpus, and adding a second label identification to the second preset noun to obtain an intermediate corpus;
inputting the intermediate corpus into the reference resolution model, and generating a first semantic vector indicating the first preset noun, a second semantic vector indicating a second preset noun and a target semantic vector according to the first labeled identifier and the second labeled identifier, wherein the target semantic vector is determined according to context information of a preset identifier in the intermediate corpus;
determining a prediction result according to the similarity between the first semantic vector and the target semantic vector and the similarity between the second semantic vector and the target semantic vector, wherein the prediction result comprises that the preset identifier refers to the first preset noun or the preset identifier refers to the second preset noun;
and adjusting corresponding parameters in the reference resolution model according to the prediction result and the target information.
Optionally, the determining a prediction result according to the similarity between the first semantic vector and the target semantic vector and the similarity between the second semantic vector and the target semantic vector includes:
splicing the first semantic vector and the second semantic vector with the target semantic vector respectively to obtain a first spliced vector and a second spliced vector;
inputting the first splicing vector and the second splicing vector into a similarity calculation model respectively to obtain a first calculation value and a second calculation value;
determining a probability value that the preset identifier refers to a first preset noun in the candidate noun set according to the first calculated value and the second calculated value;
and determining the probability value of the first preset noun in the candidate noun set referred by the preset identification as the prediction result.
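A minimal sketch of this scoring step follows. In the patent the similarity calculation model is a trained network; here it is stubbed as a caller-supplied function, and all names are illustrative:

```python
import math

def predict_mask_referent(first_vec, second_vec, target_vec, score_fn):
    """Concatenate each candidate's semantic vector with the target
    (mask-context) vector, score both concatenations with the similarity
    model, and softmax the two scores into the probability that the
    preset identifier refers to the first preset noun."""
    s1 = score_fn(first_vec + target_vec)   # first concatenated vector
    s2 = score_fn(second_vec + target_vec)  # second concatenated vector
    e1, e2 = math.exp(s1), math.exp(s2)
    return e1 / (e1 + e2)                   # P(mask refers to first noun)
```

With `score_fn=sum` as a stand-in similarity model, `predict_mask_referent([1.0], [0.0], [0.5], sum)` yields about 0.731, i.e. the first preset noun would be predicted as the referent.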
In a second aspect of the present invention, there is also provided a training apparatus for a reference resolution model, the apparatus including:
a corpus module, configured to screen, in a preset corpus pool, corpora that meet a target condition; wherein a corpus meeting the target condition contains at least a first candidate noun, a second candidate noun and a target noun, the target noun is located after the first candidate noun and the second candidate noun in the corpus, the first candidate noun is the same as the target noun, and the second candidate noun is different from the target noun;
the processing module is used for replacing the target noun in the corpus meeting the target condition with a preset identifier, replacing the first candidate noun with a first preset noun, and replacing the second candidate noun with a second preset noun to obtain a target corpus, wherein the first preset noun and the second preset noun are two different nouns in a target word bank containing a preset number of nouns;
the labeling module is used for generating labeling information corresponding to the target corpus according to the first preset noun and the second preset noun;
and the training module is used for training the reference resolution model according to the target linguistic data and the marking information corresponding to the target linguistic data.
Optionally, the processing module includes:
a random selection unit, configured to randomly select a noun in the target lexicon as a first preset noun and randomly select a noun different from the first preset noun as a second preset noun again for each corpus meeting the target condition;
and a replacing unit, configured to, for each corpus meeting the target condition, replace the first candidate noun with the first preset noun, the second candidate noun with the second preset noun, and the target noun with the preset identifier, so as to generate the target corpus.
Optionally, in a case where the noun in the target thesaurus, the first candidate noun and the second candidate noun are all names of people, the random selection unit includes:
the gender subunit is used for determining the gender corresponding to the first candidate noun as a first gender and the gender corresponding to the second candidate noun as a second gender according to the literary work to which the corpus meeting the target condition belongs;
the first random selection subunit is used for randomly selecting a noun from nouns corresponding to the first gender in the target word stock as a first preset noun;
and the second random selection subunit is used for randomly selecting a noun different from the first preset noun from the nouns corresponding to the second gender in the target word bank as a second preset noun.
Optionally, the corpus module includes:
a recognition unit, configured to determine, based on named entity recognition, the nouns contained in each corpus in the preset corpus pool;
and a screening unit, configured to screen out, according to the nouns contained in each corpus, the corpora in which at least the first candidate noun, the second candidate noun and the target noun exist.
Optionally, the labeling module includes:
a first labeling unit, configured to combine the first preset noun and the second preset noun into a candidate noun set;
and the second labeling unit is used for recording target information of the first preset noun in the candidate noun set referred by the preset identification.
Optionally, the training module comprises:
the intermediate processing unit is used for adding a first label identifier to the first preset noun in the target corpus and adding a second label identifier to the second preset noun to obtain an intermediate corpus;
a semantic vector unit, configured to input the intermediate corpus into the reference resolution model, and generate a first semantic vector indicating the first preset noun, a second semantic vector indicating the second preset noun, and a target semantic vector according to the first labeled identifier and the second labeled identifier, where the target semantic vector is a semantic vector determined according to context information of a preset identifier in the intermediate corpus;
a similarity unit, configured to determine a prediction result according to the similarity between the first semantic vector and the target semantic vector and the similarity between the second semantic vector and the target semantic vector, where the prediction result includes that the preset identifier refers to the first preset noun or that the preset identifier refers to the second preset noun;
and the adjusting unit is used for adjusting corresponding parameters in the reference resolution model according to the prediction result and the target information.
Optionally, the similarity unit is specifically configured to splice the first semantic vector and the second semantic vector with the target semantic vector to obtain a first spliced vector and a second spliced vector; inputting the first splicing vector and the second splicing vector into a similarity calculation model respectively to obtain a first calculation value and a second calculation value; determining a probability value that the preset identifier refers to a first preset noun in the candidate noun set according to the first calculated value and the second calculated value; and determining the probability value of the first preset noun in the candidate noun set referred by the preset identification as the prediction result.
In a third aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the steps of the training method of the reference resolution model when executing the program stored on the memory.
In a fourth aspect of the present invention, there is also provided a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the training method for a reference resolution model according to any one of the first aspect.
In a fifth aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the above training method for a reference resolution model.
Compared with the prior art, the invention has the following advantages:
according to the training method of the reference digestion model, the linguistic data meeting the target condition are screened in the preset linguistic data pool; the corpus meeting the target condition at least comprises a first candidate noun, a second candidate noun and a target noun, wherein the target noun is positioned behind the first candidate noun and the second candidate noun in the corpus, the first candidate noun is the same as the target noun, and the second candidate noun is different from the target noun. Through richly presetting the corpus pool, a large amount of corpora meeting the target condition can be quickly screened out. Meanwhile, target conditions are set based on the corpora used for training the reference resolution model, so that the selected corpora can be processed into the corpora used for training the reference resolution model. And replacing the target nouns in the corpus meeting the target conditions with preset marks, replacing the first candidate nouns with first preset nouns, and replacing the second candidate nouns with second preset nouns to obtain the target corpus, wherein the first preset nouns and the second preset nouns are two different nouns in a target word library containing a preset number of nouns. The preset identification in the target corpus can be regarded as a pronoun in the target corpus, so that a large number of corpora which can be used for training the reference resolution model can be obtained. Meanwhile, the first candidate nouns and the second candidate nouns are replaced, the first candidate nouns and the second candidate nouns can be prevented from being leaked to the reference resolution model, all the linguistic data after noun replacement can be realized as the linguistic data related to less nouns by controlling the number of the nouns in the target word bank, and the homogenization of the nouns in the linguistic data is realized. 
Annotation information corresponding to the target corpus is then generated according to the first preset noun and the second preset noun. Based on the objective fact that identical nouns refer to the same entity, the target corpus is annotated without any semantic analysis or understanding, i.e. in a fully non-manual way. Finally, the reference resolution model is trained according to the target corpora and their corresponding annotation information. In the embodiment of the invention, a large number of target corpora for training the reference resolution model are constructed automatically and annotated based on the objective fact that identical nouns refer to the same entity, so that no manual participation is needed, manual annotation is avoided, and the whole training process is time- and labor-saving. Meanwhile, because the first preset noun and the second preset noun are selected from the target word library, controlling the number of nouns in the library homogenizes the nouns in the corpora and prevents the reference resolution model from developing a preference for particular nouns in the screened corpora.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flowchart illustrating steps of a training method for a reference resolution model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an application of a training method for a reference resolution model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for determining a prediction result according to an embodiment of the present invention;
fig. 4 is a block diagram of a training apparatus for a reference resolution model according to an embodiment of the present invention;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present invention, it should be understood that the sequence numbers of the following processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a reference resolution model training method, including:
step 101: and screening the linguistic data meeting the target condition in a preset linguistic data pool.
It should be noted that a large number of articles, novels, scripts and other literary works can be acquired through the network to serve as the preset corpus pool. Preferably, corpora matching the business requirements of the scenario in which the reference resolution model will be used can be collected as the preset corpus pool. For example, when the reference resolution model is applied to novel analysis, a large number of novels may be used as the preset corpus pool; to avoid poor generalization caused by differences in authors' writing styles, novels of different types by different authors may be selected. Specifically, 20 different genres may be chosen with 20 books per genre, giving 400 novels as the preset corpus pool.
It should be understood that not every corpus can be used to train the reference resolution model. A training corpus must contain a pronoun preceded by several different nouns, among them both the noun to which the pronoun correctly refers and nouns to which it does not. The target condition is set to screen, from the preset corpus pool, corpora that can be processed into training corpora for the reference resolution model. Since a pronoun and the noun it refers to can both be expressed by the same noun, the corpus before processing need not contain a pronoun at all. Based on this, a corpus meets the target condition if at least a first candidate noun, a second candidate noun and a target noun exist in it, the target noun is located after the first candidate noun and the second candidate noun, the first candidate noun is the same as the target noun, and the second candidate noun is different from the target noun. For example, consider the corpus: "Zhangsan bought an apple in the shop, but the apple was bad". It contains, in order, the four nouns "Zhangsan", "shop", "apple" and "apple"; the last noun, "apple", is the same as the earlier "apple" and different from the earlier "shop", so the corpus meets the target condition. As another example, the corpus "Zhangsan knows that Zhangsi originally hates Wangwu, so how could he let Wangwu leave like this" contains, in order, the four nouns "Zhangsan", "Zhangsi", "Wangwu" and "Wangwu"; the last noun, "Wangwu", is the same as the earlier "Wangwu" and different from the earlier "Zhangsan", so this corpus also meets the target condition.
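Assuming the nouns in each corpus have already been extracted in order (e.g. by named entity recognition, as in the optional embodiments), the target condition itself can be sketched as follows (function name illustrative):

```python
def meets_target_condition(nouns):
    """Return True if the ordered noun sequence of a corpus satisfies the
    target condition: the last noun (the target noun) is preceded both by
    an identical noun (a first candidate) and by a different noun (a
    second candidate)."""
    if len(nouns) < 3:
        return False
    target, preceding = nouns[-1], nouns[:-1]
    return target in preceding and any(n != target for n in preceding)
```

For the first example, `meets_target_condition(["Zhangsan", "shop", "apple", "apple"])` holds, while a corpus whose preceding nouns are all identical to the target, such as `["apple", "apple", "apple"]`, is rejected because no second candidate exists.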
Step 102: and replacing the target nouns in the corpus meeting the target conditions with preset identifications, replacing the first candidate nouns with first preset nouns, and replacing the second candidate nouns with second preset nouns to obtain the target corpus.
It should be noted that training the reference resolution model requires a large amount of corpora, so a large number of corpora meeting the target condition are obtained by screening, and a target corpus is generated from each of them. Replacing the target noun with a preset identifier is, in effect, masking the target noun. Similarly, replacing the first candidate noun with the first preset noun masks the first candidate noun, and replacing the second candidate noun with the second preset noun masks the second candidate noun. Masking the target noun and both candidate nouns prevents them from being leaked to the reference resolution model. Taking the target noun as an example, it is masked with an arbitrary preset mark, which can be any one or more of digits, letters and special symbols. Specifically, following the masked language model (MLM) convention, the target noun in each corpus meeting the target condition may be replaced with [MASK]. The first preset noun and the second preset noun used when masking the candidate nouns may be two different nouns in a target word library containing a preset number of nouns. For example, if the target word library contains two different nouns, one of them is selected as the first preset noun to replace the first candidate noun, and the other is selected as the second preset noun to replace the second candidate noun. Of course, the number of nouns in the target word library is not limited to two and may be larger.
Take a corpus meeting the target condition: "Zhangsan knows that Zhangsi originally hates Wangwu, so how could he let Wangwu leave like this". The target corpus generated from it is: "Xiaoming knows that Xiaohong originally hates Xiaoli, so how could he let [MASK] leave like this".
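For the simplified two-candidate case, the replacement can be sketched as below (the patent's full procedure also replaces any additional nouns; function and variable names are illustrative):

```python
def build_target_corpus(corpus, first_candidate, second_candidate,
                        first_preset, second_preset, mask="[MASK]"):
    """Mask the target noun (the last occurrence of the first candidate
    noun) with the preset identifier, then replace the remaining candidate
    occurrences with the preset nouns from the target word library."""
    head, _, tail = corpus.rpartition(first_candidate)
    masked = head + mask + tail
    masked = masked.replace(first_candidate, first_preset)
    return masked.replace(second_candidate, second_preset)
```

For instance, `build_target_corpus("Zhangsan knows Wangwu, so he let Wangwu leave", "Wangwu", "Zhangsan", "Xiaoli", "Xiaoming")` returns "Xiaoming knows Xiaoli, so he let [MASK] leave": the last "Wangwu" becomes the mask, the earlier one becomes the first preset noun, and "Zhangsan" becomes the second preset noun.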
Step 103: and generating the labeling information corresponding to the target corpus according to the first preset noun and the second preset noun.
It should be noted that there are multiple target corpora, and for each of them annotation information is generated according to the first preset noun and the second preset noun. This annotation information plays the same role as the manually produced labels conventionally used to train a reference resolution model. From it, it can be determined that [MASK] referring to the first preset noun (standing in for the first candidate noun) is correct, and that [MASK] referring to the second preset noun (standing in for the second candidate noun) is incorrect. Treating [MASK] as the pronoun in the training corpus, the correct and incorrect references are determined purely from the relationship between the target noun replaced by [MASK] and the first and second candidate nouns, so no semantic analysis or semantic understanding is required.
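The annotation step thus amounts to recording, without any semantic analysis, which preset noun the mask stands for. The structure below is a hypothetical representation, not a format defined by the patent:

```python
def make_annotation(first_preset, second_preset):
    """Annotation information for one target corpus: the candidate noun set
    and the target information that [MASK] refers to the first preset noun,
    which is correct by construction (same noun, same entity)."""
    return {
        "candidate_set": {first_preset, second_preset},
        "refers_to": first_preset,  # correct referent of [MASK]
    }
```

With illustrative preset nouns, `make_annotation("Xiaoli", "Xiaohong")` records that [MASK] refers to "Xiaoli" out of the candidate set {"Xiaoli", "Xiaohong"}.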
Step 104: and training the reference resolution model according to the target linguistic data and the marking information corresponding to the target linguistic data.
It should be noted that the reference resolution model is a network model designed in advance to perform reference resolution. An existing reference resolution model may also be trained here to improve its accuracy.
In the embodiment of the invention, corpora meeting the target condition are screened from a preset corpus pool. A corpus meeting the target condition contains at least a first candidate noun, a second candidate noun and a target noun, where the target noun is located after the first candidate noun and the second candidate noun in the corpus, the first candidate noun is the same as the target noun, and the second candidate noun is different from the target noun. By enriching the preset corpus pool, a large number of corpora meeting the target condition can be screened out quickly. Meanwhile, because the target condition is set based on the requirements of corpora used to train the reference resolution model, the selected corpora can be processed into training corpora for the model. The target noun in each corpus meeting the target condition is replaced with a preset identifier, the first candidate noun is replaced with a first preset noun, and the second candidate noun is replaced with a second preset noun, yielding the target corpus; the first preset noun and the second preset noun are two different nouns in a target word library containing a preset number of nouns. The preset identifier in the target corpus can be regarded as a pronoun, so a large number of corpora usable for training the reference resolution model are obtained. Replacing the first and second candidate nouns also prevents them from being leaked to the reference resolution model, and by controlling the number of nouns in the target word library, all the corpora after noun replacement involve only a small set of nouns, achieving homogenization of the nouns in the corpora.
Labeling information corresponding to the target corpus is then generated according to the first preset noun and the second preset noun. Based on the objective fact that identical nouns refer to the same entity, the target corpus is labeled without semantic analysis or semantic understanding; the labeling is therefore done without manual work. Finally, the reference resolution model is trained on the target corpora and their corresponding labeling information. In the embodiment of the invention, a large number of target corpora for training the reference resolution model are constructed automatically, and the target corpora are labeled based on the objective fact that identical nouns refer to the same entity, so no manual participation is needed, the process of manually labeling corpora is avoided, and the whole training process saves time and labor. Meanwhile, because the first preset noun and the second preset noun are selected from the target word library, the nouns in the corpora can be homogenized by controlling the number of nouns in the target word library, preventing the reference resolution model from developing a preference for particular nouns that happen to appear in the screened corpora.
Optionally, the first candidate noun, the second candidate noun, the target noun, the first predetermined noun, and the second predetermined noun all correspond to the same entity type, where the entity type includes: a person name or a place name.
It should be noted that the reference resolution model is used to determine which noun a pronoun in a corpus refers to. The noun may correspond to any entity type. It can be understood that, to improve the accuracy with which the reference resolution model resolves pronouns to nouns of a certain entity type, corpora containing nouns of that entity type can be used for training. For example, a personal pronoun refers only to a person name; to improve the accuracy of resolving such pronouns, corpora containing different person names are used for training, i.e., the first candidate noun, the second candidate noun, the target noun, the first preset noun and the second preset noun are all different person names. Preferably, the first preset noun and the second preset noun are different preset names; for example, the first preset noun may be "Zhang San" and the second preset noun "Li Si". Preferably, different common names may be selected as the first and second preset nouns, where a common name is a well-known name or a name shared by many people. For example, the ten most frequently shared names may be determined from a census, and any two of them selected as the first preset noun and the second preset noun respectively.
Referring to fig. 2, a schematic diagram of an actual application of the training method for the reference resolution model provided in the embodiment of the present invention includes:
step 201: obtaining unlabeled dialogue corpora, i.e., obtaining corpora from multiple novels of different styles and constructing the preset corpus pool.
Step 202: and generating the self-supervision corpus. Namely, the linguistic data meeting the target condition are screened from the preset linguistic data pool, and the linguistic data meeting the target condition are used as the self-supervision linguistic data. The process of screening corpus meeting the target condition is similar to the above step 101, and in the embodiment of the present invention, the first candidate noun, the second candidate noun and the target noun are all human names.
Step 203: homogenizing the name information. For the self-supervised corpus, different common names are used to replace the first candidate noun and the second candidate noun, and the preset identifier is used to replace the target noun. The self-supervised corpus after name homogenization is then labeled to generate the labeling information.
Step 204: vector extraction. The vector representation of the preset identifier, the vector representation of the common name correctly referred to by the preset identifier, and the vector representation of the common name incorrectly referred to by the preset identifier are extracted from the self-supervised corpus.
Step 205: model training. The reference resolution model is trained on the extracted vector representations, and the training effect is supervised with a loss function.
It can be understood that, in the case that the entity type includes a place name, in order to improve the accuracy of correctly referring to the place name by the recognition pronouns of the reference resolution model, the corpus including different place names is used for training, that is, the first candidate noun, the second candidate noun, the target noun, the first preset noun and the second preset noun are respectively different place names. Of course, the entity types may also include: organization name, proper noun, etc.
In the embodiment of the invention, the first candidate noun, the second candidate noun, the target noun, the first preset noun and the second preset noun all correspond to the same entity type, so that the accuracy of identifying pronouns by referring to the resolution model to correctly refer to nouns under the same entity type can be improved.
Optionally, replacing the target noun in the corpus meeting the target condition with a preset identifier, replacing the first candidate noun with a first preset noun, and replacing the second candidate noun with a second preset noun to obtain a target corpus, including:
and respectively aiming at each corpus which meets the target condition, randomly selecting a noun in the target word library as a first preset noun, and randomly selecting a noun which is different from the first preset noun again as a second preset noun.
It should be noted that the first preset noun and the second preset noun selected for different corpora meeting the target condition may be the same or different. Because the two preset nouns are selected randomly and independently, the number of corpora of the first type and the number of corpora of the second type are, in theory, nearly or exactly equal, where the first preset noun in a first-type corpus equals the second preset noun in a second-type corpus and vice versa. Taking a target word library containing Zhang San and Li Si as an example: the probability of first drawing Zhang San and then drawing Li Si equals the probability of first drawing Li Si and then drawing Zhang San, so the case in which Zhang San is the first preset noun and Li Si the second is as likely as the case in which Li Si is the first preset noun and Zhang San the second. In first-type corpora, Zhang San is the first preset noun and Li Si the second; in second-type corpora, Li Si is the first preset noun and Zhang San the second.
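A minimal sketch of this twice-random draw; the function name `pick_preset_nouns` and the two-name library are illustrative assumptions, not from the patent:

```python
import random

def pick_preset_nouns(target_lexicon, rng=random):
    """Draw the first preset noun at random, then draw a different noun
    from the remaining ones as the second preset noun."""
    first = rng.choice(target_lexicon)
    second = rng.choice([n for n in target_lexicon if n != first])
    return first, second

# With a two-noun library, ("Zhang San", "Li Si") and ("Li Si", "Zhang San")
# are equally likely, so the two corpus types stay balanced in expectation.
pair = pick_preset_nouns(["Zhang San", "Li Si"])
```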
And, for each corpus meeting the target condition, the first candidate noun is replaced with the first preset noun, the second candidate noun with the second preset noun, and the target noun with the preset identifier, to generate the target corpus.
It should be noted that each corpus meeting the target condition corresponds to the first preset noun and the second preset noun randomly selected for it, and during noun replacement each corpus is processed with its own pair. For example, if the first preset noun randomly selected for a first corpus is Zhang San and the second preset noun is Li Si, then Zhang San replaces the first candidate noun in that corpus and Li Si replaces the second candidate noun.
In the embodiment of the invention, for each corpus meeting the target condition, the first preset noun and the second preset noun are drawn randomly from the target word library, which ensures that, among the obtained target corpora containing the two referred nouns, the number of target corpora whose correct referent is the first preset noun and the number whose correct referent is the second preset noun are nearly or exactly equal.
Optionally, the randomly selecting a noun in the target thesaurus as a first preset noun, and randomly selecting a noun different from the first preset noun again as a second preset noun includes:
and according to the literary work to which the corpus conforming to the target condition belongs, determining that the sex corresponding to the first candidate noun is the first sex, and the sex corresponding to the second candidate noun is the second sex.
It should be noted that the corpora in the preset corpus pool are derived from literary works, so the corpora meeting the target condition are also derived from literary works. The literary work to which each corpus belongs can be marked in the preset corpus pool so that it can be determined conveniently; for example, every corpus derived from literary work M is marked "corpus M". Once the literary work is determined, the genders corresponding to the first and second candidate nouns can be determined from an analysis of that work; the gender here is the gender of the character in the work whose name is the first or second candidate noun. For example, if the characters in a novel include Zhang San (male) and Li Xi (female), and a corpus meeting the target condition comes from that novel with Zhang San as the first candidate noun and Li Xi as the second, then the first gender is male and the second gender is female. It is understood that the first and second genders may be the same or different; "first" and "second" merely distinguish the corresponding candidate nouns.
Randomly selecting a noun from the nouns corresponding to the first gender in the target word bank as a first preset noun.
Randomly selecting a noun different from the first preset noun from the nouns corresponding to the second gender in the target word stock as a second preset noun.
It should be noted that the target word library containing a preset number of names is a preset name library, which may store at least four names. Each name in the preset name library is marked with a gender, and each gender corresponds to at least two names. For example, the preset name library may contain: Xiao Ming (male); Xiao Li (male); Xiao Hong (female); Xiao Hua (female). When the first gender and the second gender are both male, one of Xiao Ming and Xiao Li is randomly selected as the first preset noun; suppose Xiao Li is selected. A noun is then randomly drawn again; if the drawn noun is the same as Xiao Li, it is redrawn until a different noun is obtained. If Xiao Ming is drawn the second time, then Xiao Li is the first preset noun and Xiao Ming the second preset noun. If the first gender is male and the second female, one of Xiao Ming and Xiao Li is randomly selected as the first preset noun and one of Xiao Hong and Xiao Hua as the second preset noun. Preferably, the names in the preset name library may be common names; for example, the ten most frequently shared male names may be determined from a census and at least two of them placed in the library, and similarly the ten most frequently shared female names may be determined and at least two of them placed in the library.
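The gender-constrained draw described above can be sketched as follows; the library contents and the function name `pick_by_gender` are hypothetical illustrations, not from the patent:

```python
import random

# Hypothetical preset name library; each name carries a gender tag and
# each gender has at least two names, as the text requires.
NAME_LIBRARY = {
    "Xiao Ming": "male", "Xiao Li": "male",
    "Xiao Hong": "female", "Xiao Hua": "female",
}

def pick_by_gender(first_gender, second_gender,
                   library=NAME_LIBRARY, rng=random):
    """First preset noun matches the first gender; the second preset noun
    matches the second gender and differs from the first."""
    first = rng.choice([n for n, g in library.items() if g == first_gender])
    second = rng.choice([n for n, g in library.items()
                         if g == second_gender and n != first])
    return first, second
```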
In the embodiment of the invention, the first preset noun replacing the first candidate noun corresponds to the same gender as that candidate noun, and likewise for the second preset noun and the second candidate noun, so the target corpus retains the gender information of the corpus meeting the target condition, which can further improve the accuracy of the reference resolution model.
Optionally, the step 101: the method for screening the corpora meeting the target condition in the preset corpus pool may include:
and determining nouns contained in each corpus in the preset corpus pool based on named entity recognition.
According to the nouns contained in each corpus, the corpus with at least a first candidate noun, a second candidate noun and a target noun is screened.
It should be noted that the noun referred to by a pronoun in reference resolution may correspond to any entity type, i.e., a person name, place name, building name, phrase, etc. Based on the business requirements when the reference resolution model is used, the model can be trained to analyze only one category, such as person names, place names or building names. For example, when the trained model analyzes person names, it is applicable only to predicting the person name referred to by a pronoun in a sentence; correspondingly, during the screening of corpora meeting the target condition, all the nouns considered are person names. The person names in each corpus in the preset corpus pool are then determined by named entity recognition. A corpus such as "Zhang San bought an apple in the shop, but the apple was bad" contains only one person name, "Zhang San", and so does not meet the target condition. Another corpus, "Zhang San knows that Zhang Si originally hates Wang Wu; how can he let Wang Wu leave like this?", contains four person-name mentions: "Zhang San", "Zhang Si", "Wang Wu" and "Wang Wu"; the fourth is the same as the third and different from the first and second, so this corpus meets the target condition. Here, the target condition may be set as: each corpus must contain two or more different names; one name must appear at least twice; and the repeated name must come after all the different names in the sentence.
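Under one simple reading of that rule, the screening check can be sketched as below; the named entity recognizer itself is assumed external and not shown, and `meets_target_condition` is an illustrative helper, not a name from the patent:

```python
def meets_target_condition(names):
    """`names` is the ordered list of person names recognized in one corpus
    (named entity recognition is assumed to have been run already)."""
    for i, name in enumerate(names):
        if name in names[:i]:
            # `name` is the repeated (target) noun; it must come after every
            # distinct name, and there must be at least two distinct names.
            return set(names[:i]) == set(names) and len(set(names)) >= 2
    return False  # no repeated name at all

meets_target_condition(["Zhang San", "Zhang Si", "Wang Wu", "Wang Wu"])  # True
meets_target_condition(["Zhang San"])                                    # False
```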
In the embodiment of the invention, nouns in the corpus can be identified based on named entity identification, so that whether the corpus meets the target condition or not is judged according to the identified nouns, and the corpus meeting the target condition is automatically screened out in the preset corpus without manual participation.
Optionally, the step 103: generating tagging information corresponding to the target corpus according to the first preset noun and the second preset noun, which may include:
the first predetermined noun and the second predetermined noun are formed into a candidate noun set.
It should be noted that the candidate noun set contains one correct noun and one incorrect noun, and may be represented as a pair; for example, the candidate noun set may be [Wang Wu, Zhang San]. It can be understood that a target corpus may contain two or more incorrectly referred nouns. For example, from the corpus "Zhang San knows that Zhang Si originally hates Wang Wu; how can he let Wang Wu leave like this?", the generated target corpus is "Xiao Ming knows that Xiao Hong originally hates Xiao Li; how can he let [MASK] leave like this?". The nouns incorrectly referred to by the preset identifier [MASK] in this target corpus are Xiao Ming and Xiao Hong, and the noun correctly referred to is Xiao Li. In this case the candidate noun set may be [Xiao Ming, Xiao Li] or [Xiao Hong, Xiao Li].
Preferably, a target number of target corpora can be generated for each selected corpus meeting the target condition, where the target number equals the number of nouns in that corpus that differ from the target noun. For example, the corpus "Zhang San knows that Zhang Si originally hates Wang Wu; how can he let Wang Wu leave like this?" contains two nouns, "Zhang San" and "Zhang Si", that differ from the target noun "Wang Wu", so two target corpora can be generated. The first target corpus is "Xiao Ming knows that Xiao Hong originally hates Xiao Li; how can he let [MASK] leave like this?", with candidate noun set [Xiao Ming, Xiao Li]. The second target corpus is the same sentence with candidate noun set [Xiao Hong, Xiao Li]. Although the sentences in the two target corpora are identical, their candidate noun sets differ, so they can be regarded as two different target corpora.
Target information indicating that the preset identifier refers to the first preset noun in the candidate noun set is recorded.
It should be noted that the first preset noun correctly referred to by the preset identifier may be recorded together with the corresponding candidate noun set, for example on a separate line below the candidate noun set.
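A minimal sketch of the annotation record produced by step 103; the record layout and the helper name `make_annotation` are illustrative assumptions, not part of the patent:

```python
def make_annotation(first_preset, second_preset):
    """Candidate noun set plus the target information: the first preset
    noun is always the correct referent of [MASK], because it replaced
    the candidate noun that was identical to the target noun."""
    return {"candidates": [first_preset, second_preset],
            "correct": first_preset}

make_annotation("Xiao Li", "Xiao Hong")
# {'candidates': ['Xiao Li', 'Xiao Hong'], 'correct': 'Xiao Li'}
```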
In the embodiment of the present invention, because the target noun is the same as the first candidate noun, the first preset noun that replaced the first candidate noun is taken as the noun correctly referred to by the preset identifier that replaced the target noun. Because the target noun differs from the second candidate noun, the second preset noun that replaced the second candidate noun is taken as the noun incorrectly referred to by that preset identifier. Correct and incorrect references of the preset identifier are thus judged by comparing nouns, without any semantic analysis.
Optionally, the step 104: training the reference resolution model according to the target corpus and the labeling information corresponding to the target corpus, which may include:
and adding a first label identification to a first preset noun in the target corpus, and adding a second label identification to a second preset noun to obtain an intermediate corpus.
It should be noted that, when training the reference resolution model with the target corpora, the vector representation of the noun correctly referred to and the vector representation of the noun incorrectly referred to by the preset identifier must be extracted from each target corpus. These vector representations can be obtained with a labeled-position notation, so the target corpus is first processed into an intermediate corpus. The first label identifier consists of two parts, one immediately before and one immediately after the first preset noun; similarly, the second label identifier consists of two parts, one immediately before and one immediately after the second preset noun. The first and second label identifiers may be any identifiers composed of numbers and/or letters. For example, if the target corpus is "Xiao Hong originally hates Xiao Li; how can [MASK] leave like this?", the intermediate corpus is "[E1start]Xiao Hong[E1end] originally hates [E2start]Xiao Li[E2end]; how can [MASK] leave like this?", where [E1start] and [E1end] are the first label identifier and [E2start] and [E2end] the second label identifier. Of course, when extracting the vector representations of the correctly and incorrectly referred nouns from the target corpus, the word vectors (tokens) of the nouns can also be extracted directly by a BERT (Bidirectional Encoder Representations from Transformers) module in the reference resolution model.
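The marker-insertion step can be sketched as below; the helper name `add_entity_markers` and the spacing of the markers are illustrative assumptions (this naive string replacement also assumes each preset noun occurs exactly once):

```python
def add_entity_markers(target_corpus, first_preset, second_preset):
    """Wrap the two preset nouns in label identifiers so their positions
    (and thus their vector representations) can be located later."""
    text = target_corpus.replace(first_preset,
                                 f"[E1start]{first_preset}[E1end]")
    text = text.replace(second_preset,
                        f"[E2start]{second_preset}[E2end]")
    return text

add_entity_markers("Xiao Hong hates Xiao Li; how can [MASK] leave",
                   "Xiao Hong", "Xiao Li")
# "[E1start]Xiao Hong[E1end] hates [E2start]Xiao Li[E2end]; how can [MASK] leave"
```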
Inputting the intermediate corpus into a reference resolution model, and generating a first semantic vector indicating a first preset noun, a second semantic vector indicating a second preset noun and a target semantic vector according to the first labeled identifier and the second labeled identifier, wherein the target semantic vector is determined according to context information of the preset identifier in the intermediate corpus.
It should be noted that, for the intermediate corpus, the vectors at the positions of [E1start] and [E1end] can be extracted as the vector representation of the first preset noun, and the vectors at [E2start] and [E2end] as the vector representation of the second preset noun. Specifically, the first semantic vector is the word vector of the first label identifier in the intermediate corpus, used as the semantic vector of the first preset noun. Similarly, the second semantic vector is the word vector of the second label identifier, used as the semantic vector of the second preset noun. The target semantic vector is determined from the word vectors of the context of the preset identifier in the target corpus.
And determining a prediction result according to the similarity between the first semantic vector and the target semantic vector and the similarity between the second semantic vector and the target semantic vector.
In this step, the prediction result is either that the preset identifier refers to the first preset noun or that it refers to the second preset noun. Specifically, when the similarity between the first semantic vector and the target semantic vector is higher than the similarity between the second semantic vector and the target semantic vector, the prediction result is that the preset identifier refers to the first preset noun; otherwise, the prediction result is that it refers to the second preset noun. Preferably, the probability that the preset identifier refers to the first preset noun may be used as the prediction result: when this probability exceeds 50%, the preset identifier is considered to refer to the first preset noun, otherwise to the second preset noun.
And adjusting corresponding parameters in the reference resolution model according to the prediction result and the target information.
In this step, the reference resolution model is trained once with each target corpus, and the corresponding parameters in the model are adjusted after each training step so that the results become more accurate; this is the usual process of optimizing model parameters with a loss function, where the prediction result is computed by the reference resolution model and the target information is regarded as the known true result. To raise the accuracy of the model's predictions, training continues over a large number of different target corpora, and after each step the corresponding parameters are adjusted according to the prediction result and the true result. Training may stop when the prediction accuracy of the reference resolution model is sufficiently high. Specifically, a cross-entropy loss function is used to optimize the model parameters, and optimization ends when the cross entropy is minimal. The cross-entropy loss function is L = -[y·log(y') + (1-y)·log(1-y')], where L denotes the cross-entropy value, y the true label and y' the predicted result.
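A minimal sketch of this single-example binary cross-entropy, written to match the formula above (the clamping constant `eps` is a numerical-stability detail added here, not from the patent):

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -[y*log(y') + (1-y)*log(1-y')], where y is the true label (0/1)
    and y' the predicted probability that [MASK] refers to the first
    preset noun."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

cross_entropy(1, 0.9)   # small loss: confident and correct
cross_entropy(1, 0.1)   # large loss: confident but wrong
```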
In the embodiment of the invention, the target corpus is processed by using a labeling position labeling method to obtain an intermediate corpus. And determining a prediction result based on the similarity among a first semantic vector indicating a first preset noun, a second semantic vector indicating a second preset noun and the target semantic vector in the intermediate corpus, and adjusting corresponding parameters in the reference resolution model according to the prediction result and target information representing a real result to realize the training of the reference resolution model.
Optionally, determining a prediction result according to the similarity between the first semantic vector and the target semantic vector and the similarity between the second semantic vector and the target semantic vector, where the determining includes:
and respectively splicing the first semantic vector and the second semantic vector with the target semantic vector to obtain a first spliced vector and a second spliced vector.
And respectively inputting the first splicing vector and the second splicing vector into a similarity calculation model to obtain a first calculation value and a second calculation value.
It should be noted that the first calculated value characterizes the similarity of the first semantic vector to the target semantic vector. The second calculated value characterizes a similarity of the second semantic vector and the target semantic vector.
Determining a probability value of a first preset noun in the candidate noun set referred by the preset identifier according to the first calculated value and the second calculated value;
and determining the probability value of the first preset noun in the candidate noun set referred by the preset identifier as a prediction result.
It should be noted that the first calculated value is greater than the second calculated value, indicating that the similarity between the first semantic vector and the target semantic vector is higher than the similarity between the second semantic vector and the target semantic vector. The first calculated value is less than the second calculated value, indicating that the similarity between the first semantic vector and the target semantic vector is lower than the similarity between the second semantic vector and the target semantic vector. According to the numerical relationship between the first calculated value and the second calculated value, the probability value that the target semantic vector is the first semantic vector, that is, the probability value that the preset identifier refers to the first preset noun in the candidate noun set, can be determined.
As shown in fig. 3, a schematic diagram of determining the prediction result from the similarities with the target semantic vector: the three vectors of the semantic vector layer are the first semantic vector, the second semantic vector and the target semantic vector. The first semantic vector and the target semantic vector are input into the first vector splicing module of the vector splicing layer to obtain the first spliced vector of the above embodiment, and the second semantic vector and the target semantic vector are input into the second vector splicing module to obtain the second spliced vector. The first spliced vector is input into the first similarity calculation module of the score prediction layer to obtain the first calculated value, and the second spliced vector into the second similarity calculation module to obtain the second calculated value. The two calculated values are input into a Softmax layer, where a Softmax operation produces the prediction result of the above embodiment; the prediction result is then passed to the result layer and output for subsequent processing.
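The splice-score-softmax pipeline can be sketched with NumPy as below; the dimension `DIM` and the weight vectors `W1`/`W2` are random stand-ins for the trained parameters of the two similarity-calculation modules, so this is an illustration of the data flow, not the patented model:

```python
import numpy as np

DIM = 8
rng = np.random.default_rng(0)
W1 = rng.normal(size=2 * DIM)  # stand-in weights, first similarity module
W2 = rng.normal(size=2 * DIM)  # stand-in weights, second similarity module

def predict(first_vec, second_vec, target_vec):
    """Splice each semantic vector with the target vector, score the two
    spliced vectors, and softmax the scores into the prediction result:
    the probability that [MASK] refers to the first preset noun."""
    s1 = W1 @ np.concatenate([first_vec, target_vec])   # first calculated value
    s2 = W2 @ np.concatenate([second_vec, target_vec])  # second calculated value
    scores = np.array([s1, s2])
    e = np.exp(scores - scores.max())                   # numerically stable softmax
    return float(e[0] / e.sum())
```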
In the embodiment of the invention, two different vectors are spliced, and the similarity of the two original vectors is computed from the single spliced vector, which facilitates computing similarities among multiple vectors. From the similarity between the first semantic vector and the target semantic vector and the similarity between the second semantic vector and the target semantic vector, the probability that the preset identifier correctly refers to the first preset noun in the candidate noun set is obtained, facilitating the subsequent adjustment of the corresponding parameters in the reference resolution model.
The above describes a training method of a reference resolution model provided in an embodiment of the present invention, and a training apparatus of a reference resolution model provided in an embodiment of the present invention is described below with reference to the accompanying drawings.
Referring to fig. 4, an embodiment of the present invention further provides a training apparatus for a reference resolution model, where the apparatus includes:
a corpus module 41, configured to screen corpora meeting the target condition from a preset corpus pool, where a corpus meeting the target condition contains at least a first candidate noun, a second candidate noun and a target noun; the target noun is located after the first candidate noun and the second candidate noun in the corpus, the first candidate noun is the same as the target noun, and the second candidate noun is different from the target noun;
the processing module 42 is configured to replace the target noun in the corpus that meets the target condition with a preset identifier, replace the first candidate noun with a first preset noun, and replace the second candidate noun with a second preset noun to obtain a target corpus, where the first preset noun and the second preset noun are two different nouns in a target word bank containing a preset number of nouns;
a labeling module 43, configured to generate labeling information corresponding to the target corpus according to the first preset noun and the second preset noun;
and the training module 44 is configured to train the reference resolution model according to the target corpus and the labeling information corresponding to the target corpus.
Optionally, the first candidate noun, the second candidate noun, the target noun, the first predetermined noun, and the second predetermined noun all correspond to the same entity type, where the entity type includes: a person name or a place name.
Optionally, the processing module 42 includes:
a random selection unit, configured to, for each corpus meeting the target condition, randomly select a noun from the target word bank as the first preset noun and then randomly select a different noun as the second preset noun;
and a replacing unit, configured to, for each corpus meeting the target condition, replace the first candidate noun with the first preset noun, replace the second candidate noun with the second preset noun, and replace the target noun with the preset identifier to generate the target corpus.
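The selection and replacement steps above can be sketched as follows, under two assumptions not fixed by the patent: the corpus is a plain string, and the preset identifier is a hypothetical `[MASK]` token.

```python
import random

def pick_preset_nouns(target_word_bank, rng=random):
    """Randomly pick two distinct nouns from the target word bank as the
    first and second preset nouns."""
    first, second = rng.sample(target_word_bank, 2)
    return first, second

def build_target_corpus(corpus, first_cand, second_cand, target_noun,
                        first_preset, second_preset, identifier="[MASK]"):
    """Replace the target noun with the preset identifier, then replace the
    candidate nouns with the preset nouns. Since the target noun follows
    both candidates, its last occurrence is replaced (scanning from the
    right)."""
    idx = corpus.rfind(target_noun)
    corpus = corpus[:idx] + identifier + corpus[idx + len(target_noun):]
    corpus = corpus.replace(first_cand, first_preset)
    corpus = corpus.replace(second_cand, second_preset)
    return corpus
```

For example, `build_target_corpus("Alice met Bob. Later Alice smiled.", "Alice", "Bob", "Alice", "NameA", "NameB")` yields `"NameA met NameB. Later [MASK] smiled."`, where the names are illustrative entries of a hypothetical word bank.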
Optionally, in a case where the nouns in the target word bank, the first candidate noun and the second candidate noun are all names of people, the random selection unit includes:
a gender subunit, configured to determine, according to the literary work to which the corpus meeting the target condition belongs, that the gender corresponding to the first candidate noun is a first gender and the gender corresponding to the second candidate noun is a second gender;
a first random selection subunit, configured to randomly select a noun from the nouns corresponding to the first gender in the target word bank as the first preset noun;
and a second random selection subunit, configured to randomly select, from the nouns corresponding to the second gender in the target word bank, a noun different from the first preset noun as the second preset noun.
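The gender-matched selection can be sketched as follows; the word bank is assumed here to be a dict keyed by gender, which is an illustrative data layout rather than one specified by the patent.

```python
import random

def pick_gender_matched_nouns(word_bank_by_gender, first_gender,
                              second_gender, rng=random):
    """Pick the first preset noun from names matching the first candidate's
    gender, and a different noun from names matching the second candidate's
    gender, so replacements stay consistent with the literary work."""
    first = rng.choice(word_bank_by_gender[first_gender])
    pool = [n for n in word_bank_by_gender[second_gender] if n != first]
    second = rng.choice(pool)
    return first, second
```

Filtering the second pool by `n != first` enforces the "different from the first preset noun" requirement even when the two genders share a name list.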
Optionally, the corpus module 41 includes:
a recognition unit, configured to determine, based on named entity recognition, the nouns contained in each corpus in the preset corpus pool;
and a filtering unit, configured to screen out, according to the nouns contained in each corpus, the corpora that contain at least a first candidate noun, a second candidate noun and a target noun.
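The screening condition can be sketched as a predicate over the output of named entity recognition, assuming NER has already reduced each corpus to an ordered list of entity mentions (an illustrative interface, not one the patent prescribes):

```python
def meets_target_condition(nouns):
    """nouns: entity mentions from named entity recognition, in reading
    order. A corpus qualifies if some mention (the target noun) repeats an
    earlier mention (the first candidate noun) and a different mention
    (the second candidate noun) also appears before the target."""
    for i, target in enumerate(nouns):
        earlier = nouns[:i]
        if target in earlier and any(n != target for n in earlier):
            return True
    return False
```

For instance, a corpus with mentions `["Alice", "Bob", "Alice"]` qualifies (the second "Alice" is the target noun), whereas `["Alice", "Alice"]` does not, because no second candidate noun precedes the repeat.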
Optionally, the labeling module 43 includes:
a first labeling unit, configured to combine the first preset noun and the second preset noun into a candidate noun set;
and a second labeling unit, configured to record target information indicating that the preset identifier refers to the first preset noun in the candidate noun set.
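The labeling information can be sketched as a small record; the field names below are illustrative, not taken from the patent.

```python
def make_labeling_info(first_preset, second_preset):
    """Candidate noun set plus the target information: the preset
    identifier refers to the first preset noun, because the first
    candidate noun it replaced was identical to the target noun."""
    return {
        "candidate_nouns": [first_preset, second_preset],
        "referent": first_preset,  # target information
    }
```

This is the objective-fact supervision signal: no human annotation is needed because the correct referent is known by construction.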
Optionally, training module 44, comprising:
an intermediate processing unit, configured to add a first labeling identifier to the first preset noun in the target corpus and a second labeling identifier to the second preset noun, to obtain an intermediate corpus;
a semantic vector unit, configured to input the intermediate corpus into the reference resolution model and generate, according to the first labeling identifier and the second labeling identifier, a first semantic vector indicating the first preset noun, a second semantic vector indicating the second preset noun, and a target semantic vector, where the target semantic vector is determined according to context information of the preset identifier in the intermediate corpus;
a similarity unit, configured to determine a prediction result according to the similarity between the first semantic vector and the target semantic vector and the similarity between the second semantic vector and the target semantic vector, where the prediction result indicates that the preset identifier refers to the first preset noun or that the preset identifier refers to the second preset noun;
and the adjusting unit is used for adjusting corresponding parameters in the reference resolution model according to the prediction result and the target information.
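The intermediate-corpus construction can be sketched as follows, using hypothetical `<n1>`/`<n2>` marker tokens as the first and second labeling identifiers (the patent does not fix concrete marker strings):

```python
def add_labeling_identifiers(target_corpus, first_preset, second_preset,
                             first_mark=("<n1>", "</n1>"),
                             second_mark=("<n2>", "</n2>")):
    """Wrap the two preset nouns in marker tokens so that the model can
    locate them when building the first and second semantic vectors."""
    out = target_corpus.replace(
        first_preset, first_mark[0] + first_preset + first_mark[1])
    out = out.replace(
        second_preset, second_mark[0] + second_preset + second_mark[1])
    return out
```

For example, `add_labeling_identifiers("NameA met NameB.", "NameA", "NameB")` yields `"<n1>NameA</n1> met <n2>NameB</n2>."`; every occurrence of each preset noun is wrapped, which suffices for this sketch.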
Optionally, the similarity unit is specifically configured to: splice the first semantic vector and the second semantic vector each with the target semantic vector to obtain a first spliced vector and a second spliced vector; input the first spliced vector and the second spliced vector into a similarity calculation model respectively to obtain a first calculated value and a second calculated value; determine, according to the first calculated value and the second calculated value, a probability value that the preset identifier refers to the first preset noun in the candidate noun set; and determine this probability value as the prediction result.
The training apparatus for the reference resolution model provided by the embodiment of the invention can implement each process implemented by the training method for the reference resolution model in the method embodiment of fig. 1; details are not repeated here to avoid redundancy.
In the embodiment of the invention, a large number of target corpora for training the reference resolution model are constructed automatically, and the target corpora are labeled based on the objective fact that identical nouns refer to the same entity. No manual participation is needed, the process of manually labeling corpora is avoided, and the whole training process of the reference resolution model saves time and labor. Meanwhile, the first preset noun and the second preset noun are selected from the target word bank; by controlling the number of nouns in the target word bank, the nouns in the corpora can be homogenized, which prevents the reference resolution model from developing a preference for particular nouns in the screened corpora.
The embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 communicate with one another through the communication bus 504;
a memory 503 for storing a computer program;
the processor 501, when executing the program stored in the memory 503, implements the following steps:
screening corpora meeting a target condition from a preset corpus pool; wherein a corpus meeting the target condition contains at least a first candidate noun, a second candidate noun and a target noun, the target noun is located after the first candidate noun and the second candidate noun in the corpus, the first candidate noun is the same as the target noun, and the second candidate noun is different from the target noun;
replacing the target noun in the corpus meeting the target condition with a preset identifier, replacing the first candidate noun with a first preset noun, and replacing the second candidate noun with a second preset noun to obtain a target corpus, wherein the first preset noun and the second preset noun are two different nouns in a target word bank containing a preset number of nouns;
generating labeling information corresponding to the target corpus according to the first preset noun and the second preset noun;
and training the reference resolution model according to the target corpus and the labeling information corresponding to the target corpus.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, for example, at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which has instructions stored therein, and when the instructions are executed on a computer, the instructions cause the computer to execute the training method for the reference resolution model in any of the above embodiments.
In a further embodiment provided by the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the training method for the reference resolution model described in the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method of training a reference resolution model, the method comprising:
screening corpora meeting a target condition from a preset corpus pool; wherein a corpus meeting the target condition contains at least a first candidate noun, a second candidate noun and a target noun, the target noun is located after the first candidate noun and the second candidate noun in the corpus, the first candidate noun is the same as the target noun, and the second candidate noun is different from the target noun;
replacing the target noun in the corpus meeting the target condition with a preset identifier, replacing the first candidate noun with a first preset noun, and replacing the second candidate noun with a second preset noun to obtain a target corpus, wherein the first preset noun and the second preset noun are two different nouns in a target word bank containing a preset number of nouns;
generating tagging information corresponding to the target corpus according to the first preset noun and the second preset noun;
and training a reference resolution model according to the target corpus and the labeling information corresponding to the target corpus.
2. The method according to claim 1, wherein replacing the target noun in the corpus meeting the target condition with a preset identifier, replacing the first candidate noun with a first preset noun, and replacing the second candidate noun with a second preset noun to obtain a target corpus, comprises:
for each corpus meeting the target condition, randomly selecting a noun from the target word bank as a first preset noun, and then randomly selecting a noun different from the first preset noun as a second preset noun;
and, for each corpus meeting the target condition, replacing the first candidate noun with the first preset noun, replacing the second candidate noun with the second preset noun, and replacing the target noun with the preset identifier to generate the target corpus.
3. The method according to claim 2, wherein, in a case where the nouns in the target word bank, the first candidate noun and the second candidate noun are all names of people, the randomly selecting a noun from the target word bank as a first preset noun and randomly selecting a noun different from the first preset noun as a second preset noun comprises:
according to the literary work to which the corpus conforming to the target condition belongs, determining that the gender corresponding to the first candidate noun is a first gender, and the gender corresponding to the second candidate noun is a second gender;
randomly selecting a noun from the nouns corresponding to the first gender in the target word bank as the first preset noun;
and randomly selecting, from the nouns corresponding to the second gender in the target word bank, a noun different from the first preset noun as the second preset noun.
4. The method according to claim 1, wherein the screening of corpora meeting the target condition from the preset corpus pool comprises:
determining nouns contained in each corpus in the preset corpus pool based on named entity recognition;
and screening out, according to the nouns contained in each corpus, the corpora that contain at least the first candidate noun, the second candidate noun and the target noun.
5. The method according to claim 1, wherein the generating of the labeling information corresponding to the target corpus according to the first preset noun and the second preset noun comprises:
forming the first preset noun and the second preset noun into a candidate noun set;
and recording target information of the preset identification referring to a first preset noun in the candidate noun set.
6. The method according to claim 5, wherein the training of the reference resolution model according to the target corpus and the labeling information corresponding to the target corpus comprises:
adding a first label identification to the first preset noun in the target corpus, and adding a second label identification to the second preset noun to obtain an intermediate corpus;
inputting the intermediate corpus into the reference resolution model, and generating a first semantic vector indicating the first preset noun, a second semantic vector indicating a second preset noun and a target semantic vector according to the first labeled identifier and the second labeled identifier, wherein the target semantic vector is determined according to context information of a preset identifier in the intermediate corpus;
determining a prediction result according to the similarity between the first semantic vector and the target semantic vector and the similarity between the second semantic vector and the target semantic vector, wherein the prediction result comprises that the preset identifier refers to the first preset noun or the preset identifier refers to the second preset noun;
and adjusting corresponding parameters in the reference resolution model according to the prediction result and the target information.
7. The method according to claim 6, wherein the determining the prediction result according to the similarity between the first semantic vector and the target semantic vector and the similarity between the second semantic vector and the target semantic vector comprises:
splicing the first semantic vector and the second semantic vector with the target semantic vector respectively to obtain a first spliced vector and a second spliced vector;
inputting the first splicing vector and the second splicing vector into a similarity calculation model respectively to obtain a first calculation value and a second calculation value;
determining, according to the first calculated value and the second calculated value, a probability value that the preset identifier refers to the first preset noun in the candidate noun set;
and determining the probability value that the preset identifier refers to the first preset noun in the candidate noun set as the prediction result.
8. A training apparatus for a reference resolution model, the apparatus comprising:
the corpus module is used for screening corpora meeting target conditions in a preset corpus pool; wherein, there are at least a first candidate noun, a second candidate noun and a target noun in the corpus that meets the target condition, the target noun is located after the first candidate noun and the second candidate noun in the corpus, the first candidate noun is the same as the target noun, the second candidate noun is different from the target noun;
the processing module is used for replacing the target noun in the corpus meeting the target condition with a preset identifier, replacing the first candidate noun with a first preset noun, and replacing the second candidate noun with a second preset noun to obtain a target corpus, wherein the first preset noun and the second preset noun are two different nouns in a target word bank containing a preset number of nouns;
the labeling module is used for generating labeling information corresponding to the target corpus according to the first preset noun and the second preset noun;
and a training module, configured to train the reference resolution model according to the target corpus and the labeling information corresponding to the target corpus.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the training method for a reference resolution model according to any one of claims 1 to 7 when executing a program stored in a memory.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the training method for a reference resolution model according to any one of claims 1 to 7.
CN202111258623.5A 2021-10-27 2021-10-27 Reference resolution model training method and device and electronic equipment Pending CN114091468A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111258623.5A CN114091468A (en) 2021-10-27 2021-10-27 Reference resolution model training method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN114091468A true CN114091468A (en) 2022-02-25

Family

ID=80297921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111258623.5A Pending CN114091468A (en) 2021-10-27 2021-10-27 Reference resolution model training method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114091468A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134944A (en) * 2019-04-08 2019-08-16 国家计算机网络与信息安全管理中心 A kind of reference resolution method based on intensified learning
CN110362824A (en) * 2019-06-24 2019-10-22 广州多益网络股份有限公司 A kind of method, apparatus of automatic error-correcting, terminal device and storage medium
CN112001190A (en) * 2020-07-20 2020-11-27 北京百度网讯科技有限公司 Training method, device and equipment of natural language processing model and storage medium
CN112749547A (en) * 2019-10-30 2021-05-04 激发认知有限公司 Generation of text classifier training data


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG YANG; WANG HOUFENG: "A Survey of Approaches to Coreference Resolution", Journal of Chinese Information Processing, no. 01, 15 January 2015 (2015-01-15), pages 1 - 12 *

Similar Documents

Publication Publication Date Title
CN108287858B (en) Semantic extraction method and device for natural language
WO2022105122A1 (en) Answer generation method and apparatus based on artificial intelligence, and computer device and medium
CN106570180B (en) Voice search method and device based on artificial intelligence
CN108573707B (en) Method, device, equipment and medium for processing voice recognition result
CN111814482B (en) Text key data extraction method and system and computer equipment
CN113158695A (en) Semantic auditing method and system for multi-language mixed text
JP2020190970A (en) Document processing device, method therefor, and program
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN116150621A (en) Training method, device and equipment for text model
CN111046627B (en) Chinese character display method and system
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN113434631A (en) Emotion analysis method and device based on event, computer equipment and storage medium
RU2546064C1 (en) Distributed system and method of language translation
WO2024007810A1 (en) Coding method and apparatus based on medical diseases and medicines
CN109857746B (en) Automatic updating method and device for bilingual word stock and electronic equipment
US8666987B2 (en) Apparatus and method for processing documents to extract expressions and descriptions
CN111368547A (en) Entity identification method, device, equipment and storage medium based on semantic analysis
CN114091468A (en) Reference resolution model training method and device and electronic equipment
CN114896382A (en) Artificial intelligent question-answering model generation method, question-answering method, device and storage medium
CN114298048A (en) Named entity identification method and device
CN113011162B (en) Reference digestion method, device, electronic equipment and medium
CN115169328A (en) High-accuracy Chinese spelling check method, system and medium
CN114091467A (en) Reference resolution model training method and device and electronic equipment
CN111339756B (en) Text error detection method and device
CN113012685A (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination