CN112069821A

CN112069821A - Named entity extraction method and device, electronic equipment and storage medium

Info

Publication number: CN112069821A
Application number: CN202010949598.4A
Authority: CN
Inventors: 张鹏涛; 景艳山
Original assignee: Beijing Minglue Zhaohui Technology Co Ltd
Current assignee: Beijing Minglue Zhaohui Technology Co Ltd
Priority date: 2020-09-10
Filing date: 2020-09-10
Publication date: 2020-12-11

Abstract

The application provides a named entity extraction method and device, electronic equipment, and a storage medium. The extraction method comprises the following steps: inputting a target text into a pre-trained coding model, and acquiring a first text matrix, output by the coding model, corresponding to the target text; determining a first head pointer set and a first tail pointer set corresponding to the target text based on the first text matrix, wherein the first head pointer set comprises first characteristic values of the first characters of target word segments meeting preset categories in the target text, and the first tail pointer set comprises second characteristic values of the tail characters of those target word segments; and extracting named entities from the target text according to the first characteristic values in the first head pointer set and the second characteristic values in the first tail pointer set, and determining the category corresponding to each named entity. The method and the device can identify updated (newly appearing) named entities, have strong generalization capability, and identify named entities efficiently.

Description

Named entity extraction method and device, electronic equipment and storage medium

Technical Field

The application relates to the technical field of computer information, in particular to a named entity extraction method and device, electronic equipment and a storage medium.

Background

Named entity recognition is a basic task in the field of natural language processing. It aims to recognize the named entities in a text and to classify them, and its quality directly determines the processing precision of downstream natural language processing tasks.

In practice, named entities are usually recognized as follows: staff collect the named entities that exist in each field and compile them into a named entity dictionary; named entities are then extracted from the text to be processed based on the dictionary, and their categories are determined.

However, this approach can only recognize named entities that have already been recorded in the dictionary. It cannot recognize updated (newly appearing) named entities, and its generalization capability is weak.

Disclosure of Invention

In view of this, an object of the embodiments of the present application is to provide a named entity extraction method and device, an electronic device, and a storage medium that extract named entities from a target text based on a head pointer set and a tail pointer set corresponding to the target text. They can identify updated named entities, have strong generalization capability, and identify named entities efficiently.

In a first aspect, an embodiment of the present application provides an extraction method of a named entity, where the extraction method includes:

inputting a target text into a pre-trained coding model, and acquiring a first text matrix corresponding to the target text output by the coding model;

determining a first head pointer set and a first tail pointer set corresponding to the target text based on the first text matrix corresponding to the target text; the first head pointer set comprises first characteristic values of the first characters of target word segments meeting preset categories in the target text; the first tail pointer set comprises second characteristic values of the tail characters of the target word segments meeting the preset categories in the target text;

and extracting a named entity from the target text according to the first characteristic value in the first head pointer set and the second characteristic value in the first tail pointer set, and determining a category corresponding to the named entity.

In a possible implementation, the determining, based on the first text matrix corresponding to the target text, a first set of head pointers and a first set of tail pointers corresponding to the target text includes:

performing matrix transformation on a first text matrix corresponding to the target text based on a pre-trained standard matrix to obtain a first probability that each word in the target text belongs to the preset category;

for each word in the target text, determining whether the word is a first word or a last word of a target word segmentation meeting a preset category according to a first probability that the word belongs to the preset category, a first probability that other words in the target text belong to the preset category and a position relation between the other words and the word;

and generating a first head pointer set and a first tail pointer set corresponding to the target text according to the judgment result of whether the character is the first character or the tail character of the target word segmentation meeting the preset category and the characteristic value corresponding to each judgment result.

In a possible implementation manner, the extracting a named entity from the target text according to the first feature value in the first head pointer set and the second feature value in the first tail pointer set, and determining a category corresponding to the named entity includes:

for each first characteristic value in the first head pointer set, selecting a second characteristic value, of which the category is consistent with that of the first characteristic value and the position relation with the first characteristic value meets a first preset condition, from the first tail pointer set;

and determining the participle corresponding to the first characteristic value and the selected second characteristic value as the named entity in the target text, and determining the category of the first characteristic value as the category of the named entity.

In one possible embodiment, the coding model and the standard matrix are trained as follows:

constructing first training data, wherein the first training data comprises a plurality of first sample texts and a second head pointer set and a second tail pointer set corresponding to each first sample text;

inputting a first sample text in the first training data into a coding model, and acquiring a second text matrix corresponding to the first sample text output by the coding model;

performing matrix transformation on a second text matrix corresponding to the first sample text based on a standard matrix to obtain a third head pointer set and a third tail pointer set corresponding to the first sample text;

determining a loss value corresponding to the first sample text according to the second head pointer set and the second tail pointer set corresponding to the first sample text, and the third head pointer set and the third tail pointer set corresponding to the first sample text;

and adjusting the coding model and the standard matrix based on the loss value until the loss value corresponding to any one of the first sample texts is smaller than a preset threshold value, so as to obtain the pre-trained coding model and the pre-trained standard matrix.
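The training steps above do not name a specific loss function. As a minimal sketch, assuming the predicted (third) pointer sets are available as per-position probability distributions over characteristic values, one possible loss is the mean negative log-likelihood of the labelled (second) pointer sets. The function name `pointer_loss` and the choice of this loss are illustrative assumptions, not taken from the application.

```python
import math

def pointer_loss(gold_head, gold_tail, pred_head, pred_tail, eps=1e-12):
    """Mean negative log-likelihood of the labelled characteristic values.

    gold_*: lists of integer characteristic values (0 = no pointer here).
    pred_*: per-position probability distributions over characteristic values,
            i.e. pred_head[i][v] is the predicted probability of value v at i.
    The loss choice is an assumption; the source only says "a loss value".
    """
    total, count = 0.0, 0
    for gold, pred in ((gold_head, pred_head), (gold_tail, pred_tail)):
        for value, dist in zip(gold, pred):
            total -= math.log(max(dist[value], eps))  # eps guards against log(0)
            count += 1
    return total / count

# Two-character toy sample with a single category (values 0 and 1).
gold_head, gold_tail = [1, 0], [0, 1]
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = pointer_loss(gold_head, gold_tail, uniform, uniform)  # equals log(2)
```

A loss of this kind falls toward zero as the predicted sets approach the labelled sets, which is compatible with stopping once the loss drops below a preset threshold.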

In one possible embodiment, the first training data is constructed by:

constructing second training data, wherein the second training data comprises a plurality of second sample texts;

inputting a second sample text in the second training data into a coding model, and acquiring a third text matrix corresponding to the second sample text output by the coding model;

performing matrix transformation on a third text matrix corresponding to the second sample text based on the standard matrix to obtain a second probability that each word in the second sample text belongs to the preset category;

determining a fourth head pointer set and a fourth tail pointer set corresponding to the second sample text based on a second probability that each word in the second sample text belongs to the preset category;

if the second probability that each word in the second sample text belongs to the preset category meets a second preset condition, determining the second sample text as the first sample text, and determining a fourth head pointer set and a fourth tail pointer set corresponding to the second sample text as the second head pointer set and the second tail pointer set corresponding to the first sample text, respectively.
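The selection step above can be sketched as a filter over candidate second sample texts. The content of the "second preset condition" is not given here, so the sketch assumes a hypothetical one: every probability must be either clearly low or clearly high. The names `is_confident` and `select_first_samples` and the thresholds 0.1 and 0.9 are illustrative assumptions.

```python
def is_confident(prob_rows, low=0.1, high=0.9):
    """Hypothetical 'second preset condition': every second probability is
    either at most `low` or at least `high` (no uncertain predictions)."""
    return all(p <= low or p >= high for row in prob_rows for p in row)

def select_first_samples(candidates):
    """candidates: list of (text, prob_rows, head_set, tail_set) tuples built
    from second sample texts. Returns (text, head_set, tail_set) triples that
    are promoted to first training data."""
    return [(text, head, tail)
            for text, probs, head, tail in candidates
            if is_confident(probs)]

candidates = [
    ("ab", [[0.95], [0.05]], [1, 0], [1, 0]),  # confident: kept
    ("cd", [[0.60], [0.40]], [0, 0], [0, 0]),  # uncertain: dropped
]
first_training_data = select_first_samples(candidates)
```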

In a second aspect, an embodiment of the present application provides an apparatus for extracting a named entity, where the apparatus includes:

the first acquisition module is used for inputting a target text into a pre-trained coding model and acquiring a first text matrix corresponding to the target text output by the coding model;

a first determining module, configured to determine, based on the first text matrix corresponding to the target text, a first head pointer set and a first tail pointer set corresponding to the target text; the first head pointer set comprises first characteristic values of the first characters of target word segments meeting preset categories in the target text; the first tail pointer set comprises second characteristic values of the tail characters of the target word segments meeting the preset categories in the target text;

and the extraction module is used for extracting a named entity from the target text according to the first characteristic value in the first head pointer set and the second characteristic value in the first tail pointer set, and determining a category corresponding to the named entity.

In a possible implementation, the first determining module, when determining the first set of head pointers and the first set of tail pointers corresponding to the target text based on the first text matrix corresponding to the target text, includes:

In a possible implementation manner, the extracting module, when extracting a named entity from the target text according to the first feature value in the first head pointer set and the second feature value in the first tail pointer set, and determining a category corresponding to the named entity, includes:

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor, the processor and the memory communicate with each other through the bus when the electronic device runs, and the processor executes the machine-readable instructions to execute the steps of the named entity extracting method according to any one of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the named entity extraction method according to any one of the first aspect.

According to the named entity extraction method and device, the electronic equipment, and the storage medium provided by the embodiments of the present application, a target text is input into a pre-trained coding model, and a first text matrix corresponding to the target text output by the coding model is obtained; a first head pointer set and a first tail pointer set corresponding to the target text are determined based on the first text matrix, where the first head pointer set comprises first characteristic values of the first characters of target word segments meeting preset categories in the target text, and the first tail pointer set comprises second characteristic values of the tail characters of those target word segments; and named entities are extracted from the target text according to the first characteristic values in the first head pointer set and the second characteristic values in the first tail pointer set, and the category corresponding to each named entity is determined. Because named entities are extracted based on the head pointer set and the tail pointer set corresponding to the target text, updated named entities can be identified, the generalization capability is strong, and named entities are identified efficiently.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

FIG. 1 is a flow chart illustrating an extraction method of a named entity according to an embodiment of the present disclosure;

FIG. 2 is a flow chart illustrating another method for extracting named entities according to an embodiment of the present disclosure;

FIG. 3 is a flow chart illustrating another method for extracting named entities according to an embodiment of the present disclosure;

FIG. 4 is a flow chart illustrating another method for extracting named entities according to an embodiment of the present disclosure;

FIG. 5 is a flow chart illustrating another method for extracting named entities according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram illustrating an apparatus for extracting a named entity according to an embodiment of the present application;

FIG. 7 shows a schematic diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

Named entity recognition is a basic task in the field of natural language processing. It aims to recognize the named entities in a text and to classify them, and its quality directly determines the processing precision of downstream natural language processing tasks. In practice, named entities are usually recognized as follows: staff collect the named entities that exist in each field and compile them into a named entity dictionary; named entities are then extracted from the text to be processed based on the dictionary, and their categories are determined. However, this approach can only recognize named entities that have already been recorded in the dictionary. It cannot recognize updated (newly appearing) named entities, and its generalization capability is weak.

Based on the above problems, in the named entity extraction method and device, the electronic equipment, and the storage medium provided in the embodiments of the present application, a target text is input into a pre-trained coding model, and a first text matrix corresponding to the target text output by the coding model is obtained; a first head pointer set and a first tail pointer set corresponding to the target text are determined based on the first text matrix, where the first head pointer set comprises first characteristic values of the first characters of target word segments meeting preset categories in the target text, and the first tail pointer set comprises second characteristic values of the tail characters of those target word segments; and named entities are extracted from the target text according to the first characteristic values in the first head pointer set and the second characteristic values in the first tail pointer set, and the category corresponding to each named entity is determined. Because named entities are extracted based on the head pointer set and the tail pointer set corresponding to the target text, updated named entities can be identified, the generalization capability is strong, and named entities are identified efficiently.

The drawbacks described above were identified by the inventors through practice and careful study. Therefore, both the discovery of these problems and the solutions proposed below should be regarded as contributions made by the inventors in the course of the present application.

The technical solutions in the present application will be described clearly and completely with reference to the drawings in the present application, and it should be understood that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

For the convenience of understanding of the present embodiment, a detailed description will be first given of an extraction method of a named entity disclosed in the embodiments of the present application.

Referring to FIG. 1, which is a flowchart of an extraction method of a named entity provided in an embodiment of the present application, the extraction method includes the following steps:

S101, inputting a target text into a pre-trained coding model, and acquiring a first text matrix corresponding to the target text output by the coding model.

In the embodiment of the present application, the target text is a text containing named entities. Named entities are the names of people, organizations, and places, as well as all other entities identified by names; in a broader sense, entities also include numbers, dates, currencies, addresses, and the like.

Optionally, the target text is a text related to 3C products, for example, an introduction text or an evaluation text of a 3C product. 3C refers to the combination of Computer, Communication, and Consumer Electronics, and 3C products specifically include hardware devices such as computers, tablet computers, mobile phones, digital cameras, Walkman players, electronic dictionaries, video players, and digital audio players.

The target text is composed of a plurality of characters. A vector corresponding to each character in the target text is determined based on a preset character-to-vector dictionary, the vector of each character is input into the coding model, and the first text matrix of the target text output by the coding model is obtained. The first text matrix represents the text semantics of the target text.

Optionally, the coding model is a RoBERTa model, and the first text matrix output by the RoBERTa model is a B × N × 768 matrix. B is the batch size (batch_size), that is, the number of samples used in each batch when training with the gradient descent method; B may be 1, 32, 64, or another number of target texts. N is the target length of each target text, and N is the same for all target texts. N is determined as follows: the actual length of each target text, namely the number of characters in the target text, is acquired, and either the maximum of the actual lengths is taken as N, or N is selected from the actual lengths such that N is greater than the actual lengths of 99% of the target texts. 768 is the output dimension of the RoBERTa model.
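The two ways of choosing N can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `choose_target_length` and `pad_or_truncate`, the `[PAD]` filler, and the truncation of over-long texts are hypothetical, and the 99% rule is read here as "N is at least as long as 99% of the texts". The sample strings are made up for the example.

```python
import math

def choose_target_length(lengths, coverage=1.0):
    """Pick a common length N from the actual text lengths.

    coverage=1.0  -> N is the maximum actual length.
    coverage=0.99 -> N is the smallest actual length that is at least as
                     long as 99% of the texts.
    """
    if not lengths:
        raise ValueError("need at least one text length")
    ordered = sorted(lengths)
    k = max(1, math.ceil(coverage * len(ordered)))  # texts that must fit
    return ordered[k - 1]

def pad_or_truncate(chars, n, pad="[PAD]"):
    """Bring one text (a list of characters) to exactly n positions.
    Padding token and truncation are assumptions, not from the source."""
    return chars[:n] + [pad] * max(0, n - len(chars))

texts = ["我喜欢华为手机的屏幕", "屏幕很好"]  # illustrative inputs
n = choose_target_length([len(t) for t in texts])
batch = [pad_or_truncate(list(t), n) for t in texts]  # B = 2 rows of N items
```

With the maximum rule no text is cut; with a coverage below 1.0 the few longest texts are truncated in exchange for a shorter N.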

S102, determining a first head pointer set and a first tail pointer set corresponding to the target text based on the first text matrix corresponding to the target text; the first head pointer set comprises first characteristic values of the first characters of target word segments meeting preset categories in the target text; the first tail pointer set comprises second characteristic values of the tail characters of the target word segments meeting the preset categories in the target text.

In the embodiment of the application, matrix transformation is performed on the first text matrix corresponding to each target text to obtain the first head pointer set and the first tail pointer set corresponding to the target text. The target text comprises a plurality of word segments, and different word segments correspond to different categories, such as a brand class, an attribute class, and a meaningless class. Each target text corresponds to one first head pointer set and one first tail pointer set, and each character in the target text corresponds to one characteristic value in each of the two sets. In the first head pointer set, the characteristic value corresponding to the first character of a target word segment of a preset category is a first characteristic value; in the first tail pointer set, the characteristic value corresponding to the tail character of a target word segment of a preset category is a second characteristic value. The first characteristic values corresponding to the first characters of target word segments of different categories are different, the second characteristic values corresponding to the tail characters of target word segments of different categories are different, and for target word segments of the same category the first characteristic value of the first character is the same as the second characteristic value of the tail character. For example, the first characteristic value of the first character (and the second characteristic value of the tail character) of a brand-class target word segment is 1; the first characteristic value of the first character (and the second characteristic value of the tail character) of an attribute-class target word segment is 2; the characteristic value of every character of a meaningless-class word segment is 0; and the characteristic value of each middle character (a character other than the first character and the tail character) of a target word segment of a preset category (the brand class or the attribute class) is 0.

For example, take target text 1, "我喜欢华为手机的屏幕" ("I like the screen of Huawei mobile phones"), which consists of ten characters, with the brand class and the attribute class as the preset categories. Target word segment 1, "华为" ("Huawei"), belongs to the brand class and occupies the fourth and fifth positions; target word segment 2, "屏幕" ("screen"), belongs to the attribute class and occupies the ninth and tenth positions. The first and second characteristic values corresponding to the brand class are both 1, and those corresponding to the attribute class are both 2. The first head pointer set corresponding to target text 1 is therefore (0,0,0,1,0,0,0,0,2,0), and the first tail pointer set is (0,0,0,0,1,0,0,0,0,2).
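The encoding in this example can be reproduced with a short sketch. It assumes that target text 1 is the ten-character string "我喜欢华为手机的屏幕" (a reconstruction from the translated example) and that segments are given as 0-based inclusive character spans; `build_pointer_sets` and `CHARACTERISTIC_VALUE` are hypothetical names.

```python
# Characteristic values from the example: 1 = brand class, 2 = attribute class.
# Every other position keeps the value 0.
CHARACTERISTIC_VALUE = {"brand": 1, "attribute": 2}

def build_pointer_sets(text, segments):
    """segments: list of (start, end, category), 0-based inclusive indices."""
    head = [0] * len(text)
    tail = [0] * len(text)
    for start, end, category in segments:
        value = CHARACTERISTIC_VALUE[category]
        head[start] = value  # first character of the target word segment
        tail[end] = value    # tail character of the target word segment
    return head, tail

text = "我喜欢华为手机的屏幕"  # assumed original of target text 1
head, tail = build_pointer_sets(text, [(3, 4, "brand"), (8, 9, "attribute")])
```

Middle characters of a segment, and characters outside any segment, stay at 0 in both sets, which matches the description above.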

S103, according to the first characteristic value in the first head pointer set and the second characteristic value in the first tail pointer set, a named entity is extracted from the target text, and a category corresponding to the named entity is determined.

In the embodiment of the application, named entities are extracted from the target text according to the first characteristic values, in the first head pointer set, of the first characters of target word segments meeting preset categories, and the second characteristic values, in the first tail pointer set, of the tail characters of those target word segments. For each named entity, the category of the named entity is the category to which its first characteristic value and second characteristic value both correspond; for example, if the first and second characteristic values corresponding to a named entity are both 1, the named entity belongs to the brand class.

According to the method for extracting the named entities, the named entities are extracted from the target text based on the head pointer set and the tail pointer set corresponding to the target text, the updated named entities can be identified, the data generalization capability is strong, and the named entity identification efficiency is high.

Further, as shown in fig. 2, in the method for extracting a named entity provided in the embodiment of the present application, the determining, based on the first text matrix corresponding to the target text, a first head pointer set and a first tail pointer set corresponding to the target text includes:

S201, performing matrix transformation on a first text matrix corresponding to the target text based on a pre-trained standard matrix to obtain a first probability that each word in the target text belongs to the preset category.

In the embodiment of the application, the first text matrix corresponding to the target text is multiplied by the pre-trained standard matrix to obtain a target matrix. The number of columns of the standard matrix equals the number of preset categories. For example, if the standard matrix has 8 columns (and, for the multiplication to be defined, 768 rows) and the first text matrix of the target text is a 1 × 10 × 768 matrix (the target text includes 10 characters), the target matrix obtained by the multiplication is a 1 × 10 × 8 matrix. Each row of the target matrix corresponds to one character of the target text, each column corresponds to one preset category, and each entry is the probability that the character of that row belongs to the preset category of that column. Assuming the third row of the target matrix is (0.1, 0.2, 0.1, 0.9, ...), the probability that the third character of the target text belongs to the first preset category is 0.1, the probability that it belongs to the second preset category is 0.2, the probability that it belongs to the third preset category is 0.1, and the probability that it belongs to the fourth preset category is 0.9.
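The shape arithmetic of this step can be illustrated with a toy-sized sketch for a single text (B = 1), using a hidden size of 3 instead of 768 and 2 categories instead of 8. The text only states that the matrices are multiplied; the sigmoid used here to map scores into probabilities is an assumption, as is the name `project_to_categories`.

```python
import math

def project_to_categories(text_matrix, standard_matrix):
    """Multiply an N x H text matrix by an H x C standard matrix.

    Returns an N x C matrix whose entry [i][c] is read as the probability
    that character i belongs to preset category c.
    """
    hidden = len(standard_matrix)
    n_categories = len(standard_matrix[0])
    probs = []
    for row in text_matrix:
        if len(row) != hidden:
            raise ValueError("hidden size mismatch")
        scores = [sum(row[h] * standard_matrix[h][c] for h in range(hidden))
                  for c in range(n_categories)]
        # Sigmoid squashing is an assumption, not stated in the source.
        probs.append([1.0 / (1.0 + math.exp(-s)) for s in scores])
    return probs

text_matrix = [[1.0, 0.0, -1.0],   # character 1 (H = 3 for illustration)
               [0.0, 2.0, 0.0]]    # character 2
standard_matrix = [[0.5, -0.5],
                   [1.0, 0.0],
                   [0.5, 0.5]]     # H x C with C = 2 categories
target_matrix = project_to_categories(text_matrix, standard_matrix)
```

The result has one row per character and one column per preset category, the same layout as the 10 × 8 slice described above.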

S202, aiming at each word in the target text, determining whether the word is a first word or a last word of a target word segmentation meeting a preset category according to the first probability that the word belongs to the preset category, the first probability that other words in the target text belong to the preset category and the position relation between the other words and the word.

In this embodiment of the application, for each preset category, if the first probabilities that a plurality of consecutive characters in the target text belong to the preset category are all greater than a preset threshold, the consecutive characters are determined as a target word segment of the preset category; the first of the consecutive characters is determined as the first character of the target word segment, and the last of them is determined as its tail character.

For example, the second column of the target matrix is (0.1, 0.2, 0.1, 0.2, 0.1, 0.8, 0.9), and the preset category corresponding to the second column is the attribute class. The first probabilities at the ninth and tenth positions are both greater than the preset threshold 0.7, so for target text 1, "我喜欢华为手机的屏幕" ("I like the screen of Huawei mobile phones"), the character "屏" at the ninth position is the first character of a target word segment of the attribute class, and the character "幕" at the tenth position is its tail character.
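The thresholding rule can be sketched as a scan for maximal runs of consecutive positions above the threshold. The function name `find_segments` is hypothetical, the ten probabilities below are illustrative values chosen so that only the ninth and tenth positions exceed 0.7, and treating a run of length one as a segment is an assumption.

```python
def find_segments(column, threshold=0.7):
    """Return (head_index, tail_index) pairs, 0-based, one per maximal run of
    consecutive positions whose probability is greater than the threshold."""
    segments, start = [], None
    for i, p in enumerate(column):
        if p > threshold and start is None:
            start = i                        # a run begins: first character
        elif p <= threshold and start is not None:
            segments.append((start, i - 1))  # the run ended one position back
            start = None
    if start is not None:                    # run reaches the end of the text
        segments.append((start, len(column) - 1))
    return segments

# Illustrative attribute-class column for a ten-character text.
attribute_column = [0.1, 0.2, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 0.8, 0.9]
segments = find_segments(attribute_column)
```

Here the only run covers indices 8 and 9, that is, the ninth and tenth characters.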

S203, generating a first head pointer set and a first tail pointer set corresponding to the target text according to the judgment result of whether the character is the first character or the tail character of the target word segmentation meeting the preset category and the characteristic value corresponding to each judgment result.

In this embodiment of the application, for each character, if the character is the first character of a target word segment meeting a preset category, the characteristic value corresponding to the character in the first head pointer set is set to the first characteristic value matching that preset category; if the character is the tail character of a target word segment meeting a preset category, the characteristic value corresponding to the character in the first tail pointer set is set to the second characteristic value matching that preset category. For example, in target text 1, "我喜欢华为手机的屏幕" ("I like the screen of Huawei mobile phones"), "华" is the first character and "为" is the tail character of target word segment 1, "华为" ("Huawei"), and "屏" is the first character and "幕" is the tail character of target word segment 2, "屏幕" ("screen"). The first characteristic value (and the second characteristic value) corresponding to the brand class is 1, that corresponding to the attribute class is 2, and the characteristic values of all characters other than the first and tail characters of target word segments are 0. The first head pointer set corresponding to target text 1 is therefore (0,0,0,1,0,0,0,0,2,0), and the first tail pointer set is (0,0,0,0,1,0,0,0,0,2).
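Steps S202 and S203 together can be sketched as one pass over the probability matrix: a position is a first character when it is above the threshold and its left neighbour is not, and a tail character when it is above the threshold and its right neighbour is not. The function name `pointer_sets_from_probs`, the two-category layout, and the probability values are illustrative assumptions.

```python
def pointer_sets_from_probs(prob_rows, characteristic_values, threshold=0.7):
    """prob_rows[i][c]: first probability of character i for category c.
    characteristic_values[c]: value written for category c (e.g. 1, 2)."""
    n = len(prob_rows)
    head, tail = [0] * n, [0] * n
    for c, value in enumerate(characteristic_values):
        above = [row[c] > threshold for row in prob_rows]
        for i in range(n):
            if not above[i]:
                continue
            if i == 0 or not above[i - 1]:
                head[i] = value  # run starts here: first character
            if i == n - 1 or not above[i + 1]:
                tail[i] = value  # run ends here: tail character
    return head, tail

# Column 0 = brand class (value 1), column 1 = attribute class (value 2).
rows = ([[0.1, 0.1]] * 3 + [[0.9, 0.1], [0.8, 0.2]]
        + [[0.1, 0.1]] * 3 + [[0.1, 0.8], [0.2, 0.9]])
head, tail = pointer_sets_from_probs(rows, [1, 2])
```

With these illustrative probabilities the result reproduces the pointer sets of target text 1.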

Further, referring to fig. 3, in the method for extracting a named entity provided in the embodiment of the present application, the extracting a named entity from the target text according to the first feature value in the first head pointer set and the second feature value in the first tail pointer set, and determining a category corresponding to the named entity includes:

S301, for each first feature value in the first head pointer set, selecting from the first tail pointer set a second feature value whose category is consistent with that of the first feature value and whose positional relation to the first feature value meets a first preset condition.

In the embodiment of the application, for each first feature value in the first head pointer set corresponding to the target text, a second feature value is selected from the first tail pointer set corresponding to the target text, where the selected second feature value has the same numerical value as the first feature value and is the one closest in position to the first feature value.

For example, the first head pointer set corresponding to target text 2 is (0,0,0,1,0,0,0,0,1,0), and the corresponding first tail pointer set is (0,0,0,0,1,0,0,0,0,1). The first head pointer set includes two first feature values "1", at the fourth position and the ninth position respectively; the first tail pointer set includes two second feature values "1", at the fifth position and the tenth position respectively. For the first feature value "1" at the fourth position, two second feature values in the first tail pointer set have a numerical value consistent with it; of these, the second feature value "1" at the fifth position is the closest, so it is matched with the first feature value "1" at the fourth position. Likewise, the second feature value "1" at the tenth position is closest to the first feature value "1" at the ninth position, so these two are matched.

S302, determining the word segment delimited by the first feature value and the selected second feature value as the named entity in the target text, and determining the category of the first feature value as the category of the named entity.

In the embodiment of the application, for each first feature value, a second feature value matched with the first feature value is selected. The character in the target text corresponding to the first feature value is the first character of a named entity, and the character in the target text corresponding to the second feature value is the last character of that named entity. The named entity is then extracted from the target text, and its category is determined according to the category of the first feature value and the category of the second feature value corresponding to the named entity.

For example, target text 2 is "I like Huawei mobile phones and Xiaomi". The second feature value "1" at the fifth position is matched with the first feature value "1" at the fourth position, so the named entity "Huawei" is extracted from target text 2; the second feature value "1" at the tenth position is matched with the first feature value "1" at the ninth position, so the named entity "Xiaomi" is extracted from target text 2. Both "Huawei" and "Xiaomi" are determined to be named entities of the brand category according to the first feature value (or the second feature value) "1".
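As a non-limiting illustration, steps S301 and S302 can be sketched as follows using the pointer sets of target text 2. The function name `extract_entities` is hypothetical. The Chinese string is an assumed reconstruction of target text 2 and is not quoted from the application. The sketch further assumes that the "closest" second feature value is searched for at or after the position of the first feature value, which is one possible form of the first preset condition.

```python
from typing import List, Tuple


def extract_entities(
    text: str, head: List[int], tail: List[int]
) -> List[Tuple[str, int]]:
    """Pair each head pointer with the nearest matching tail pointer.

    Returns (entity, category value) tuples. Assumption: the matching tail
    lies at or after the head position.
    """
    entities = []
    for start, value in enumerate(head):
        if value == 0:
            continue  # not the first character of any target word segment
        for end in range(start, len(tail)):
            if tail[end] == value:
                # The characters from start to end (inclusive) form the entity.
                entities.append((text[start:end + 1], value))
                break
    return entities


# Assumed reconstruction of target text 2 (ten characters).
text_2 = "我喜欢华为手机和小米"
head_2 = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
tail_2 = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

print(extract_entities(text_2, head_2, tail_2))
# [('华为', 1), ('小米', 1)]
```

Here the category value 1 stands for the brand category, so both extracted entities are brand-category named entities.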

Further, referring to fig. 4, in the method for extracting a named entity provided in the embodiment of the present application, the coding model and the standard matrix are trained in the following manner:

S401, constructing first training data, wherein the first training data comprises a plurality of first sample texts and a second head pointer set and a second tail pointer set corresponding to each first sample text.

In the embodiment of the present application, there are two types of training data, namely first training data and second training data. A first sample text in the first training data is a labeled sample text, that is, a second head pointer set and a second tail pointer set corresponding to the first sample text exist; a second sample text in the second training data is an unlabeled sample text.

S402, inputting the first sample text in the first training data into a coding model, and acquiring a second text matrix corresponding to the first sample text output by the coding model.

In the embodiment of the application, a first sample text is composed of a plurality of words. A word vector corresponding to each word in the first sample text is determined based on a preset dictionary that maps words to word vectors, the word vector of each word in the first sample text is input into the coding model, and a second text matrix of the first sample text output by the coding model is obtained, wherein the second text matrix is used for representing the text semantics of the first sample text. Optionally, the coding model is a RoBERTa model. Here, the coding model may be an initialized coding model, a coding model in training, or a trained coding model; the coding model changes dynamically in the process of training it with the first sample texts.

S403, performing matrix transformation on the second text matrix corresponding to the first sample text based on the standard matrix to obtain a third head pointer set and a third tail pointer set corresponding to the first sample text.

In the embodiment of the application, matrix transformation is performed on the second text matrix corresponding to the first sample text based on the standard matrix, that is, the second text matrix corresponding to the first sample text is multiplied by the standard matrix, so that a third probability that each word in the first sample text belongs to a preset category is obtained. Here, the standard matrix may be an initialized standard matrix, a standard matrix in training, or a trained standard matrix; the standard matrix changes dynamically in the process of training it with the first sample texts.
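As a non-limiting illustration, the matrix transformation can be sketched as follows in plain Python. The names `matmul` and `category_probabilities` and all numerical values are hypothetical. The application only states that the text matrix is multiplied by the standard matrix; applying a sigmoid function to the product so that each entry lies between 0 and 1 is an assumption of this sketch.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x d) matrix by a (d x k) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def category_probabilities(text_matrix: Matrix, standard_matrix: Matrix) -> Matrix:
    """Return an (n x k) matrix: entry [i][c] is the probability that
    word i belongs to the preset category of column c.

    Assumption: a sigmoid maps each product entry to a probability.
    """
    logits = matmul(text_matrix, standard_matrix)
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in logits]


# Hypothetical text matrix: 3 words, 2 semantic dimensions each.
text_matrix = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
# Hypothetical standard matrix: 2 dimensions x 2 preset categories.
standard_matrix = [[1.0, -1.0], [-1.0, 1.0]]

probs = category_probabilities(text_matrix, standard_matrix)
for row in probs:
    print([round(p, 3) for p in row])
```

Each row of the result corresponds to one word and each column to one preset category, which is the layout the thresholding step operates on.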

For each preset category, if the third probabilities that the continuous characters in the first sample text belong to the preset category are all larger than a preset threshold, determining the continuous characters as target participles of the preset category, determining the first characters in the continuous characters as the first characters of the target participles, and determining the last characters in the continuous characters as the last characters of the target participles.

For each character, if the character is the first character of a target word segment meeting a preset category, the feature value corresponding to the character in the third head pointer set is determined as the first feature value matching the preset category; if the character is the last character of a target word segment meeting the preset category, the feature value corresponding to the character in the third tail pointer set is determined as the second feature value matching the preset category. The third head pointer set and the third tail pointer set corresponding to the first sample text are thereby obtained.

S404, determining a loss value corresponding to the first sample text according to the second head pointer set and the second tail pointer set corresponding to the first sample text and the third head pointer set and the third tail pointer set corresponding to the first sample text.

In the embodiment of the application, a first loss value corresponding to the first sample text is determined based on the second head pointer set and the third head pointer set corresponding to the first sample text; a second loss value corresponding to the first sample text is determined based on the second tail pointer set and the third tail pointer set corresponding to the first sample text; and the loss value corresponding to the first sample text is determined based on the first loss value and the second loss value. Optionally, a cross-entropy function is used to determine the loss value corresponding to the first sample text.
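As a non-limiting illustration, the loss computation can be sketched as follows for a single preset category. The names `binary_cross_entropy` and `pointer_loss` are hypothetical. Two assumptions are made that the application does not state: the cross-entropy is evaluated on the per-word head and tail probabilities underlying the third pointer sets, and the first and second loss values are combined by simple addition.

```python
import math
from typing import List


def binary_cross_entropy(
    labels: List[int], probs: List[float], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def pointer_loss(
    gold_head: List[int],
    gold_tail: List[int],
    head_probs: List[float],
    tail_probs: List[float],
) -> float:
    """Loss of one sample text (assumption: first loss plus second loss)."""
    first_loss = binary_cross_entropy(gold_head, head_probs)   # head pointer sets
    second_loss = binary_cross_entropy(gold_tail, tail_probs)  # tail pointer sets
    return first_loss + second_loss


# Hypothetical three-word sample: the entity starts at word 2 and ends at word 3.
gold_head = [0, 1, 0]
gold_tail = [0, 0, 1]

good = pointer_loss(gold_head, gold_tail, [0.01, 0.99, 0.01], [0.01, 0.01, 0.99])
bad = pointer_loss(gold_head, gold_tail, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(good < bad)  # True
```

A prediction close to the labeled pointer sets yields a small loss, while an uninformative prediction yields a larger one, which is the signal used to adjust the coding model and the standard matrix.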

S405, adjusting the coding model and the standard matrix based on the loss value until the loss value corresponding to each of the first sample texts is smaller than a preset threshold value, and obtaining the pre-trained coding model and the pre-trained standard matrix.

In this embodiment of the application, a loss value corresponding to each first sample text in the first training data is obtained. If the loss value corresponding to any first sample text in the first training data is greater than the preset threshold, the parameters of the coding model and the standard matrix are adjusted. Once the loss value corresponding to every first sample text in the first training data is smaller than the preset threshold, the adjustment of the parameters of the coding model and the standard matrix stops, and the trained coding model and the trained standard matrix are obtained.

Further, as shown in fig. 5, in the method for extracting a named entity provided in the embodiment of the present application, not only the second head pointer set and the second tail pointer set corresponding to each first sample text may be manually marked to construct the first training data, but also the first training data may be constructed in the following manner:

S501, constructing second training data, wherein the second training data comprises a plurality of second sample texts.

In the embodiment of the application, the second sample text in the second training data is a sample text without a label, after the trained coding model and standard matrix are obtained based on the first sample text with the label, the second sample text in the second training data is obtained, the second sample text is converted into the first sample text with the label, and the trained coding model and standard matrix are trained again based on the converted first sample text.

S502, inputting a second sample text in the second training data into a coding model, and acquiring a third text matrix corresponding to the second sample text output by the coding model.

In the embodiment of the application, the second sample text is composed of a plurality of words. A word vector corresponding to each word in the second sample text is determined based on a preset dictionary that maps words to word vectors, the word vector of each word in the second sample text is input into the coding model, and a third text matrix of the second sample text output by the coding model is obtained, wherein the third text matrix is used for representing the text semantics of the second sample text, and the coding model is the coding model trained in steps S401 to S405. Optionally, the coding model is a RoBERTa model.

S503, based on the standard matrix, performing matrix transformation on a third text matrix corresponding to the second sample text to obtain a second probability that each word in the second sample text belongs to the preset category.

In the embodiment of the application, matrix transformation is performed on the third text matrix corresponding to the second sample text based on the standard matrix, that is, the third text matrix corresponding to the second sample text is multiplied by the standard matrix, so that a second probability that each word in the second sample text belongs to the preset category is obtained. Here, the standard matrix is the standard matrix trained in steps S401 to S405.

S504, determining a fourth head pointer set and a fourth tail pointer set corresponding to the second sample text based on a second probability that each word in the second sample text belongs to the preset category.

In this embodiment of the application, for each preset category, if second probabilities that a plurality of continuous characters in a second sample text belong to the preset category are all greater than a preset threshold, the plurality of continuous characters are determined as target participles of the preset category, first characters in the plurality of continuous characters are determined as first characters of the target participles, and last characters in the plurality of continuous characters are determined as last characters of the target participles.

For each character, if the character is the first character of a target word segment meeting the preset category, the feature value corresponding to the character in the fourth head pointer set is determined as the first feature value matching the preset category; if the character is the last character of a target word segment meeting the preset category, the feature value corresponding to the character in the fourth tail pointer set is determined as the second feature value matching the preset category. The fourth head pointer set and the fourth tail pointer set corresponding to the second sample text are thereby obtained.

S505, if the second probability that each word in the second sample text belongs to the preset category meets a second preset condition, determining the second sample text as a first sample text, and determining the fourth head pointer set and the fourth tail pointer set corresponding to the second sample text as the second head pointer set and the second tail pointer set corresponding to that first sample text, respectively.

In the embodiment of the application, a second probability that each word in a second sample text belongs to the preset category is obtained, if a word with the second probability being greater than a preset threshold exists in the second sample text, which indicates that the second sample text includes a named entity, the second sample text is used as a first sample text in first training data, a fourth head pointer set corresponding to the second sample text is marked as a second head pointer set corresponding to the first sample text, and a fourth tail pointer set corresponding to the second sample text is marked as a second tail pointer set corresponding to the first sample text.
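As a non-limiting illustration, the conversion of unlabeled second sample texts into labeled first sample texts can be sketched as follows. The function name `select_pseudo_labeled`, the tuple layout of each sample, and the example data are hypothetical. The second preset condition is taken here to be the one described above, namely that at least one word has a second probability greater than the preset threshold.

```python
from typing import List, Tuple

# (text, per-word category probabilities, fourth head set, fourth tail set)
Sample = Tuple[str, List[List[float]], List[int], List[int]]
# (text, second head set, second tail set)
Labeled = Tuple[str, List[int], List[int]]


def select_pseudo_labeled(
    samples: List[Sample], threshold: float = 0.7
) -> List[Labeled]:
    """Keep the second sample texts that contain at least one word whose
    second probability exceeds the threshold, and reuse their fourth head
    and tail pointer sets as labels."""
    selected = []
    for text, probs, head, tail in samples:
        if any(p > threshold for row in probs for p in row):
            selected.append((text, head, tail))
    return selected


# Hypothetical second training data with a single preset category.
second_training_data = [
    ("abcd", [[0.1], [0.9], [0.8], [0.2]], [0, 1, 0, 0], [0, 0, 1, 0]),
    ("efgh", [[0.1], [0.2], [0.3], [0.2]], [0, 0, 0, 0], [0, 0, 0, 0]),
]

new_first_samples = select_pseudo_labeled(second_training_data)
print(new_first_samples)  # [('abcd', [0, 1, 0, 0], [0, 0, 1, 0])]
```

The selected samples would then be appended to the first training data, and the coding model and the standard matrix trained again on the enlarged set.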

According to the embodiment of the application, a self-training mode is adopted, unlabeled second training data are effectively utilized, the number of first sample texts in the first training data is increased, and the coding model and the standard matrix obtained through training are high in stability and accuracy.

Based on the same inventive concept, the embodiment of the present application further provides a device for extracting a named entity corresponding to the method for extracting a named entity, and since the principle of solving the problem of the device in the embodiment of the present application is similar to that of the method for extracting a named entity in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and repeated details are not repeated.

Referring to fig. 6, fig. 6 is a schematic structural diagram of an apparatus for extracting a named entity according to an embodiment of the present application, where the apparatus includes:

a first obtaining module 601, configured to input a target text into a pre-trained coding model, and obtain a first text matrix corresponding to the target text output by the coding model;

a first determining module 602, configured to determine, based on a first text matrix corresponding to the target text, a first head pointer set and a first tail pointer set corresponding to the target text; the first head pointer set comprises first feature values of first characters of target word segments meeting a preset category in the target text; the first tail pointer set comprises second feature values of last characters of target word segments meeting the preset category in the target text;

an extracting module 603, configured to extract a named entity from the target text according to the first feature value in the first head pointer set and the second feature value in the first tail pointer set, and determine a category corresponding to the named entity.

In a possible implementation, the first determining module 602, when determining the first set of head pointers and the first set of tail pointers corresponding to the target text based on the first text matrix corresponding to the target text, includes:

In a possible implementation manner, the extracting module 603, when extracting a named entity from the target text according to the first feature value in the first head pointer set and the second feature value in the first tail pointer set, and determining a category corresponding to the named entity, includes:

In a possible implementation manner, the named entity extracting apparatus further includes:

the first construction module is used for constructing first training data, wherein the first training data comprises a plurality of first sample texts and a second head pointer set and a second tail pointer set corresponding to each first sample text;

the second obtaining module is used for inputting the first sample text in the first training data into the coding model and obtaining a second text matrix corresponding to the first sample text output by the coding model;

a first matrix transformation module, configured to perform matrix transformation on a second text matrix corresponding to the first sample text based on a standard matrix to obtain a third head pointer set and a third tail pointer set corresponding to the first sample text;

a second determining module, configured to determine a loss value corresponding to the first sample text according to a second head pointer set and a second tail pointer set corresponding to the first sample text, and a third head pointer set and a third tail pointer set corresponding to the first sample text;

the adjusting module is used for adjusting the coding model and the standard matrix based on the loss value until the loss value corresponding to each of the first sample texts is smaller than a preset threshold value, so as to obtain the pre-trained coding model and the pre-trained standard matrix;

the second construction module is used for constructing second training data, and the second training data comprises a plurality of second sample texts;

a third obtaining module, configured to input a second sample text in the second training data into a coding model, and obtain a third text matrix corresponding to the second sample text output by the coding model;

the second matrix transformation module is used for performing matrix transformation on a third text matrix corresponding to the second sample text based on the standard matrix to obtain a second probability that each word in the second sample text belongs to the preset category;

a third determining module, configured to determine, based on a second probability that each word in the second sample text belongs to the preset category, a fourth head pointer set and a fourth tail pointer set corresponding to the second sample text;

a fourth determining module, configured to determine the second sample text as the first sample text if a second probability that each word in the second sample text belongs to the preset category satisfies a second preset condition, and determine a fourth head pointer set and a fourth tail pointer set corresponding to the second sample text as the second head pointer set and the second tail pointer set corresponding to the first sample text, respectively.

The named entity extraction device provided by the embodiment of the application can extract named entities from the target text based on the head pointer set and the tail pointer set corresponding to the target text, and can identify updated named entities; the generalization capability on data is strong, and the efficiency of named entity identification is high.

Referring to fig. 7, fig. 7 is an electronic device 700 provided in an embodiment of the present application, where the electronic device 700 includes: a processor 701, a memory 702 and a bus, wherein the memory 702 stores machine-readable instructions executable by the processor 701, when the electronic device is operated, the processor 701 and the memory 702 communicate with each other through the bus, and the processor 701 executes the machine-readable instructions to execute the steps of the named entity extracting method.

Specifically, the memory 702 and the processor 701 can be general-purpose memory and processor, which are not limited in particular, and the named entity extracting method can be performed when the processor 701 runs a computer program stored in the memory 702.

Corresponding to the named entity extraction method, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to perform the steps of the named entity extraction method.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and there may be other divisions in actual implementation, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or modules through some communication interfaces, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, or the part thereof that in essence contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for extracting named entities, the method comprising:

2. The method for extracting named entities according to claim 1, wherein the determining a first head pointer set and a first tail pointer set corresponding to the target text based on the first text matrix corresponding to the target text comprises:

3. The method according to claim 1, wherein the extracting named entities from the target text according to the first feature value in the first head pointer set and the second feature value in the first tail pointer set and determining the category corresponding to the named entities comprises:

4. The method of claim 2, wherein the coding model and the standard matrix are trained by:

5. The method of claim 4, wherein the first training data is constructed by:

6. An apparatus for extracting a named entity, the apparatus comprising:

7. The apparatus according to claim 6, wherein the first determining module, when determining the first head pointer set and the first tail pointer set corresponding to the target text based on the first text matrix corresponding to the target text, comprises:

8. The apparatus according to claim 6, wherein the extracting module, when extracting a named entity from the target text according to the first feature value in the first head pointer set and the second feature value in the first tail pointer set, and determining a category corresponding to the named entity, comprises:

9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the named entity extraction method according to any one of claims 1 to 5.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the method for named entity extraction according to any one of claims 1 to 5.