CN114330346A - Text entity identification method and device, equipment, medium and product thereof - Google Patents

Text entity identification method and device, equipment, medium and product thereof

Info

Publication number
CN114330346A
CN114330346A (application CN202111628410.7A)
Authority
CN
China
Prior art keywords
model
student
vector
loss value
teacher
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111628410.7A
Other languages
Chinese (zh)
Inventor
吴智东 (Wu Zhidong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huaduo Network Technology Co Ltd
Original Assignee
Guangzhou Huaduo Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huaduo Network Technology Co Ltd filed Critical Guangzhou Huaduo Network Technology Co Ltd
Priority to CN202111628410.7A priority Critical patent/CN114330346A/en
Publication of CN114330346A publication Critical patent/CN114330346A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a text entity identification method and a device, equipment, medium and product thereof, comprising the following steps: reading text information to be identified; inputting the text information into a preset student model, wherein the student model is trained to a convergence state through knowledge distillation based on a preset teacher model, and the model scale of the student model is smaller than that of the teacher model; reading the head pointer vector and the tail pointer vector output by the student model, and performing entity extraction on the text information according to an entity attribute mapping table corresponding to the head pointer vector and the tail pointer vector to generate entity information and entity attribute information; and generating an identification result of the text information according to the entity information and the entity attribute information. Because the model scale of the student model is smaller than that of the teacher model, using the student model for entity recognition lowers the requirements on the deployment environment and improves adaptability to it.

Description

Text entity identification method and device, equipment, medium and product thereof
Technical Field
The present application relates to the field of text processing, and in particular, to a text entity identification method, apparatus, computer device, and storage medium.
Background
Entity recognition is a basic task in the natural language processing field: it identifies entity information in text, such as entity boundaries and entity types, for example names, times and locations. It is widely applied in various fields, such as brand name recognition and commodity attribute recognition in the e-commerce field.
The inventor of the present application found in research that the text entity recognition models in the prior art are large pre-trained models. Although training an entity recognition model from a large pre-trained model yields higher accuracy than a small model, deploying the large model in a production environment faces severe performance problems: the large pre-trained model requires longer training time and a larger running environment, which obstructs the deployment and application of the entity recognition model.
Disclosure of Invention
The application provides a text entity recognition method, a text entity recognition device, a computer device and a storage medium, which can accelerate training and reduce model weight.
In order to solve the above technical problem, the embodiment of the present application adopts a technical solution that: a text entity recognition method is provided, which comprises the following steps:
reading text information to be identified;
inputting the text information into a preset student model, wherein the student model is trained to a convergence state through knowledge distillation based on a preset teacher model, is used for extracting a head pointer vector and a tail pointer vector of the text information, and has a model scale smaller than that of the teacher model;
reading the head pointer vector and the tail pointer vector output by the student model, and performing entity extraction on the text information according to an entity attribute mapping table corresponding to the head pointer vector and the tail pointer vector to generate entity information and entity attribute information;
and generating an identification result of the text information according to the entity information and the entity attribute information.
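The four steps above can be sketched end-to-end as follows; this is a minimal illustration, and `student_model` and `decode_entities` are hypothetical stand-ins for the trained student model and the mapping-table lookup, not names taken from the patent.

```python
# Hedged sketch of the recognition pipeline; the callables passed in are
# placeholders for the trained student model and the decoding step.
def recognize(text, student_model, decode_entities):
    head, tail = student_model(text)              # S1200: pointer vectors
    entities = decode_entities(text, head, tail)  # S1300: mapping-table lookup
    return {"text": text, "entities": entities}   # S1400: recognition result

# Toy stand-ins: a model that predicts no entities, and a no-op decoder.
result = recognize("summer coat",
                   lambda t: ([0] * len(t), [0] * len(t)),
                   lambda t, h, tl: [])
```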
Optionally, before reading the text information to be recognized, the method includes:
collecting a sample text to be processed;
performing word segmentation processing on the sample text to generate a sample entity;
generating a labeled head pointer vector and a labeled tail pointer vector of the sample text according to the positions of the first character and the last character of the sample entity and the mapping value of the sample entity in the entity attribute mapping table;
constructing a training sample according to the labeled head pointer vector, the labeled tail pointer vector and the sample text;
and performing model training on the student model and/or the teacher model according to the training samples.
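The sample-labeling steps above can be sketched as follows, under the assumption that the entity attribute mapping table maps each attribute name to a positive integer id and that 0 marks "no entity boundary"; the function and variable names are illustrative, not the patent's.

```python
# Hypothetical sketch of building labeled head/tail pointer vectors from
# a segmented sample text and its sample entities.
def build_pointer_labels(text, entities, attr_map):
    """entities: list of (surface, attribute) pairs found in `text`.
    Returns (head, tail) label vectors with one slot per character:
    0 means no entity boundary, otherwise the attribute's mapping value."""
    head = [0] * len(text)
    tail = [0] * len(text)
    for surface, attr in entities:
        start = text.find(surface)
        if start == -1:
            continue  # entity not present in this sample text
        end = start + len(surface) - 1
        head[start] = attr_map[attr]  # position of the first character
        tail[end] = attr_map[attr]    # position of the last character
    return head, tail

attr_map = {"brand": 1, "season": 2, "function": 3}  # illustrative table
text = "AcmeBrand summer sun-proof coat"
head, tail = build_pointer_labels(
    text, [("AcmeBrand", "brand"), ("summer", "season")], attr_map)
```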
Optionally, the training samples are used for model training of the teacher model, and the model training of the student model and/or the teacher model according to the training samples includes:
inputting the sample text into an initial teacher model of the teacher model, wherein the initial teacher model is the teacher model in a non-converged state;
reading a first head pointer vector and a first tail pointer vector output by the initial teacher model;
calculating a first loss value between the labeled head pointer vector and the first head pointer vector, and a second loss value between the labeled tail pointer vector and the first tail pointer vector;
and performing back-propagation correction on the model parameters of the initial teacher model according to the first loss value and the second loss value.
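The teacher model's supervised losses can be illustrated with a per-position cross-entropy, assuming each pointer head outputs a class distribution per character; the head-pointer and tail-pointer losses would each be computed this way and summed for back-propagation. The exact loss form is an assumption, not stated in the claim.

```python
import math

# Illustrative per-position cross-entropy between predicted pointer
# distributions and the labeled pointer vector; names are assumptions.
def pointer_cross_entropy(pred_probs, labels):
    """pred_probs: list of per-position probability distributions over
    attribute classes; labels: gold class index per position."""
    total = 0.0
    for probs, gold in zip(pred_probs, labels):
        total += -math.log(probs[gold])  # negative log-likelihood
    return total / len(labels)

# Two character positions, three classes (0 = non-entity).
pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
gold = [0, 1]
loss = pointer_cross_entropy(pred, gold)  # head-pointer loss, say
```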
Optionally, the model training of the student model and/or teacher model according to the training samples comprises:
inputting the training sample into the teacher model and an initial student model of the student model respectively, wherein the initial student model is the student model in a non-convergence state;
reading the teacher feature vector output by the teacher model and the student feature vector output by the initial student model;
calculating a distillation loss between the teacher feature vector and the student feature vector.
Optionally, the teacher feature vector comprises: a teacher feature encoding vector, a second head pointer vector and a second tail pointer vector, and the student feature vector comprises: a student feature encoding vector, a third head pointer vector and a third tail pointer vector; the calculating of the distillation loss between the teacher feature vector and the student feature vector comprises:
respectively calculating a first cross entropy loss value between the third head pointer vector and the labeled head pointer vector, and a second cross entropy loss value between the third tail pointer vector and the labeled tail pointer vector;
respectively calculating a first divergence loss value between the second head pointer vector and the third head pointer vector, and a second divergence loss value between the second tail pointer vector and the third tail pointer vector;
calculating a third divergence loss value between the teacher feature encoding vector and the student feature encoding vectors;
generating the distillation loss from the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value, and the third divergence loss value.
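A sketch of composing the distillation loss from the five terms above, using KL divergence for the "divergence loss" terms; the KL form and the weighting scheme are assumptions, since the claim fixes only which terms enter the loss, not how they are combined.

```python
import math

# KL divergence as a stand-in for the claim's "divergence loss".
def kl_divergence(p, q):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(ce_head, ce_tail, t_head, s_head, t_tail, s_tail,
                      t_feat, s_feat, alpha=0.5):
    """ce_*: cross-entropy losses vs. the labeled pointer vectors;
    t_*/s_*: teacher/student pointer and feature-encoding distributions.
    alpha is an assumed interpolation weight."""
    kl_head = kl_divergence(t_head, s_head)  # first divergence loss
    kl_tail = kl_divergence(t_tail, s_tail)  # second divergence loss
    kl_feat = kl_divergence(t_feat, s_feat)  # third divergence loss
    return alpha * (ce_head + ce_tail) + (1 - alpha) * (kl_head + kl_tail + kl_feat)
```

When the student's distributions exactly match the teacher's, the divergence terms vanish and only the supervised cross-entropy part remains.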
Optionally, after calculating the distillation loss between the teacher feature vector and the student feature vector, the method further comprises:
constructing a negative sample of the training sample;
inputting the training sample into the teacher model and the initial student model to generate a first combined feature set;
inputting the training sample into the initial student model a second time, and combining it with the student model features in the first combined feature set to generate a second combined feature set;
inputting the negative sample into the initial student model, and combining it with the student model features in the first combined feature set to generate a third combined feature set;
calculating a first contrastive loss value between the first combined feature set and the third combined feature set, and a second contrastive loss value between the second combined feature set and the third combined feature set.
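The contrastive objective can be sketched with an InfoNCE-style loss over feature vectors: an anchor is pulled toward the positive features (the same sample re-encoded) and pushed away from the negative-sample features. The specific loss form and the temperature are assumptions; the claim only specifies which feature sets enter each loss value.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def contrastive_loss(anchor, positive, negative, temperature=0.1):
    """InfoNCE-style: small when anchor~positive, large when anchor~negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = math.exp(cosine(anchor, negative) / temperature)
    return -math.log(pos / (pos + neg))

# Anchor close to the positive and orthogonal to the negative: low loss.
loss = contrastive_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
```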
Optionally, after the calculating of the first contrastive loss value between the first combined feature set and the third combined feature set, and the second contrastive loss value between the second combined feature set and the third combined feature set, the method further comprises:
performing a weighting operation on the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value, the third divergence loss value, the first contrastive loss value and the second contrastive loss value according to preset parameter thresholds to generate a total loss value;
and performing back-propagation correction on the model parameters of the initial student model according to the total loss value.
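The weighting step can be illustrated as a plain weighted sum of the seven loss terms; the weight values below are placeholders for the preset parameter thresholds, which the patent does not disclose.

```python
# Weighted combination of the seven losses into one total loss, which
# would then drive back-propagation; names and weights are assumptions.
def total_loss(losses, weights):
    """losses/weights: dicts keyed by loss-term name."""
    return sum(weights[name] * value for name, value in losses.items())

losses = {"ce_head": 0.30, "ce_tail": 0.25, "kl_head": 0.10,
          "kl_tail": 0.08, "kl_feat": 0.12, "ctr_1": 0.40, "ctr_2": 0.35}
weights = {"ce_head": 1.0, "ce_tail": 1.0, "kl_head": 0.5,
           "kl_tail": 0.5, "kl_feat": 0.5, "ctr_1": 0.2, "ctr_2": 0.2}
total = total_loss(losses, weights)
```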
In order to solve the foregoing technical problem, an embodiment of the present application further provides a text entity recognition apparatus, including:
the reading module is used for reading text information to be identified;
the extraction module is used for inputting the text information into a preset student model, wherein the student model is trained to be in a convergence state through knowledge distillation based on a preset teacher model, is used for extracting a head pointer vector and a tail pointer vector of the text information, and the model scale of the student model is smaller than that of the teacher model;
the processing module is used for reading the head pointer vector and the tail pointer vector output by the student model, and performing entity extraction on the text information according to an entity attribute mapping table corresponding to the head pointer vector and the tail pointer vector to generate entity information and entity attribute information;
and the execution module is used for generating the identification result of the text information according to the entity information and the entity attribute information.
Optionally, the text entity recognition apparatus further includes:
the first acquisition submodule is used for acquiring a sample text to be processed;
the first word segmentation sub-module is used for carrying out word segmentation on the sample text to generate a sample entity;
the first mapping submodule is used for generating a labeled head pointer vector and a labeled tail pointer vector of the sample text according to the positions of the first character and the last character of the sample entity and the mapping value of the sample entity in the entity attribute mapping table;
the first processing submodule is used for constructing a training sample according to the labeled head pointer vector, the labeled tail pointer vector and the sample text;
and the first execution submodule is used for performing model training on the student model and/or the teacher model according to the training samples.
Optionally, the training samples are used for model training of the teacher model, and the text entity recognition apparatus further includes:
a first input submodule for inputting the sample text into an initial teacher model of the teacher model, wherein the initial teacher model is the teacher model in a non-converged state;
the first reading submodule is used for reading a first head pointer vector and a first tail pointer vector output by the initial teacher model;
the first calculation submodule is used for calculating a first loss value between the labeled head pointer vector and the first head pointer vector, and a second loss value between the labeled tail pointer vector and the first tail pointer vector;
and the first correction submodule is used for performing back-propagation correction on the model parameters of the initial teacher model according to the first loss value and the second loss value.
Optionally, the text entity recognition apparatus further includes:
a second input submodule, configured to input the training sample into the teacher model and an initial student model of the student model respectively, where the initial student model is the student model in a non-convergence state;
the second reading submodule is used for reading the teacher feature vector output by the teacher model and the student feature vector output by the initial student model;
a second calculation submodule for calculating a distillation loss between the teacher feature vector and the student feature vector.
Optionally, the teacher feature vector comprises: a teacher feature encoding vector, a second head pointer vector and a second tail pointer vector, and the student feature vector comprises: a student feature encoding vector, a third head pointer vector and a third tail pointer vector; the text entity recognition apparatus further includes:
a third calculation submodule, configured to calculate a first cross entropy loss value between the third head pointer vector and the labeled head pointer vector, and a second cross entropy loss value between the third tail pointer vector and the labeled tail pointer vector, respectively;
a fourth calculation submodule, configured to calculate a first divergence loss value between the second head pointer vector and the third head pointer vector, and a second divergence loss value between the second tail pointer vector and the third tail pointer vector, respectively;
a fifth calculating sub-module, configured to calculate a third divergence loss value between the teacher feature encoding vector and the student feature encoding vector;
a second processing submodule for generating the distillation loss from the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value, and the third divergence loss value.
Optionally, the text entity recognition apparatus further includes:
the first construction submodule is used for constructing a negative sample of the training sample;
the third input submodule is used for inputting the training sample into the teacher model and the initial student model to generate a first combined feature set;
the fourth input submodule is used for inputting the training sample into the initial student model a second time, and combining it with the student model features in the first combined feature set to generate a second combined feature set;
the third processing submodule is used for inputting the negative sample into the initial student model, and combining it with the student model features in the first combined feature set to generate a third combined feature set;
and a sixth calculation submodule, configured to calculate a first contrastive loss value between the first combined feature set and the third combined feature set, and a second contrastive loss value between the second combined feature set and the third combined feature set.
Optionally, the text entity recognition apparatus further includes:
the fourth processing submodule is used for performing a weighting operation on the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value, the third divergence loss value, the first contrastive loss value and the second contrastive loss value according to preset parameter thresholds to generate a total loss value;
and the second execution submodule is used for performing back-propagation correction on the model parameters of the initial student model according to the total loss value.
In order to solve the above technical problem, an embodiment of the present invention further provides a computer device, including a memory and a processor, where the memory stores computer-readable instructions, and the computer-readable instructions, when executed by the processor, cause the processor to perform the steps of the text entity identification method.
In order to solve the above technical problem, an embodiment of the present invention further provides a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the text entity identification method.
In order to fulfill another object of the present application, a computer program product is provided, comprising computer programs/instructions which, when executed by a processor, implement the steps of the text entity identification method described in any of the embodiments of the present application.
The beneficial effect of this application is: when text entity recognition is performed, a preset teacher model trains the student model that participates in the entity recognition. Because the teacher model has already been trained to convergence when the student model is trained, the student model can be rapidly trained to a convergence state through knowledge distillation and contrastive training. Because the model scale of the student model is smaller than that of the teacher model, using the student model for entity recognition lowers the requirements on the deployment environment and improves adaptability to it. In addition, extracting entity information with the head-and-tail double-pointer method improves the accuracy of entity information extraction.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic basic flow chart of a text entity identification method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of generating training samples according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating the training of a teacher model according to an embodiment of the present application;
FIG. 4 is a schematic flow diagram of the generation of distillation losses according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a calculation of distillation loss according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of comparative training according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating the process of performing the back pass correction on the initial student model according to an embodiment of the present application;
FIG. 8 is a diagram illustrating a basic structure of a text entity recognition apparatus according to an embodiment of the present application;
fig. 9 is a block diagram of a basic structure of a computer device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be understood by those skilled in the art, a "terminal" as used herein includes devices that are wireless signal receivers having only receive capability without transmit capability, as well as devices having receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device with or without a multi-line display; a PCS (Personal Communications Service) terminal, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "terminal" may be portable, transportable, installed in a vehicle (aeronautical, maritime and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "terminal" used herein may also be a communication terminal, a web-enabled terminal, or a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, and may also be a smart TV, a set-top box, etc.
The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the performance of a personal computer: a hardware device having the necessary components disclosed by the von Neumann principle, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device and an output device. A computer program is stored in the memory, and the central processing unit calls the program stored in the external memory into the internal memory to run, executes the instructions in the program, and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed to a server and implemented by a client remotely invoking the online service interface provided by that server, or may be deployed and run directly on the client.
Unless specified in clear text, the neural network model referred to or possibly referred to in the application can be deployed in a remote server and used for remote call at a client, and can also be deployed in a client with qualified equipment capability for direct call.
Various data referred to in the present application may be stored in a server remotely or in a local terminal device unless specified in the clear text, as long as the data is suitable for being called by the technical solution of the present application.
Those skilled in the art will appreciate that although the various methods of the present application are described based on the same concept so as to be common to each other, they may be performed independently unless otherwise specified. Likewise, each embodiment disclosed in the present application is proposed based on the same inventive concept; therefore, concepts expressed identically, and concepts whose expressions differ but have been appropriately changed only for convenience, should be understood equally.
The embodiments to be disclosed herein can be flexibly constructed by cross-linking related technical features of the embodiments unless the mutual exclusion relationship between the related technical features is stated in the clear text, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or solve the deficiencies of the prior art. Those skilled in the art will appreciate variations therefrom.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a basic flow of a text entity recognition method according to an exemplary embodiment of the present application. As shown in fig. 1, a text entity recognition method includes:
s1100, reading text information to be identified;
the text entity recognition method according to the present embodiment is used to recognize entity attribute information in text information, and therefore, the text information may be text information in which product attribute information is described.
The commodity in the present embodiment means any commodity capable of being traded, including (without limitation): food, clothing, medicine, toys, appliances, virtual goods or gifts.
The text information can include attribute information of one or more commodities, wherein the attribute information of the commodity can include (without limitation): the product name, brand name, material, function, style, or shape of the product, or the term or performance standard of the product.
S1200, inputting the text information into a preset student model, wherein the student model is trained to a convergence state through knowledge distillation based on a preset teacher model and is used for extracting a head pointer vector and a tail pointer vector of the text information, and the model scale of the student model is smaller than that of the teacher model;
The text information is input into a preset student model, wherein the student model is trained to a convergence state through knowledge distillation and contrastive learning based on a preset teacher model, is used for extracting a head pointer vector and a tail pointer vector of the text information, and has a model scale smaller than that of the teacher model.
The teacher model in this embodiment is trained to a convergence state by a convolutional neural network model, a deep convolutional neural network model, a cyclic neural network model, a neural network model of a coding-decoding structure type, or a variation model of any one of the above models, in a supervised training or semi-supervised training manner. The teacher model trained to the converged state can extract the head pointer vector and the tail pointer vector in the input text information.
The student model in this embodiment is a Transformer model that only includes the encoder component of the Transformer; specifically, the student model includes n layers of cascaded encoders, where n is less than or equal to 12. Since the student model does not include the decoding module of the Transformer model, the model scale of the student model is smaller than that of the teacher model.
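The encoder-only student can be sketched structurally as n ≤ 12 cascaded encoder layers followed by two pointer heads; the toy layers below are stand-ins for real Transformer encoder blocks, and all names are illustrative.

```python
# Structural sketch only: each "encoder layer" is an opaque function
# here; a real implementation would use Transformer encoder blocks.
class StudentModel:
    """Encoder-only student: n <= 12 cascaded encoder layers followed
    by two heads producing the head/tail pointer vectors."""
    def __init__(self, layers, head_proj, tail_proj):
        assert len(layers) <= 12  # cascade depth constraint from the text
        self.layers = layers
        self.head_proj = head_proj
        self.tail_proj = tail_proj

    def forward(self, x):
        for layer in self.layers:   # pass features through the cascade
            x = layer(x)
        return self.head_proj(x), self.tail_proj(x)

# Toy stand-ins: each "encoder" doubles values, pointer heads threshold.
model = StudentModel([lambda v: [2 * t for t in v]] * 3,
                     head_proj=lambda v: [t > 4 for t in v],
                     tail_proj=lambda v: [t > 8 for t in v])
head, tail = model.forward([1.0, 0.5])
```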
Because the student model lacks the decoding module of the Transformer model, it does not on its own satisfy the conditions required for feedback correction, and therefore needs to be trained in combination with the teacher model. Specifically, the teacher model trains the student model through knowledge distillation and contrastive training.
Knowledge distillation calculates the loss between the output results of the student model and the teacher model during the training of the student model. Because the teacher model has been trained to convergence in advance, the loss value between the two output results can be used to perform gradient computation on the student model through back-propagation and to correct the weight parameters of the student model so that it tends toward convergence.
Further, while the student model is driven toward convergence by knowledge distillation, a positive sample and a negative sample are constructed for training. Loss values computed between the positive-sample features output by the teacher model and the student model, and between the positive-sample and negative-sample features output by the student model, further constrain the student model toward convergence. This improves the robustness and accuracy of the model.
The head pointer vector refers to a vector set formed by head characters of the entity information in the text information extracted by the student model, and the tail pointer vector is a vector set formed by tail characters of the entity information in the text information extracted by the student model.
S1300, reading the head pointer vector and the tail pointer vector output by the student model, and performing entity extraction on the text information according to an entity attribute mapping table corresponding to the head pointer vector and the tail pointer vector to generate entity information and entity attribute information;
The head pointer vector and the tail pointer vector output by the student model are read. Because the head pointer vector determines the position of the first character of each piece of entity information in the text information, and the tail pointer vector determines the position of the last character, combining the two positions determines each piece of entity information in the text information.
Furthermore, in the training process of the student model, the head pointer vector and the tail pointer vector extracted by the student model include entity attribute information corresponding to the entity information, so that the extracted head pointer vector and the extracted tail pointer vector can be obtained by searching the entity attribute information corresponding to the entity information through a preset entity attribute mapping table.
And S1400, generating an identification result of the text information according to the entity information and the entity attribute information.
The identification result of the text information is obtained from the one-to-one correspondence between each piece of entity information and its entity attribute information. For example, the text information is: "X-brand 2021 summer UV-proof ice-silk loose comfortable wind coat" (where "X-brand" stands in for a brand name). The result extracted by the student model is: ("brand", "X-brand"), ("applicable season", "summer"), ("function", "UV-proof"), ("fabric", "ice silk"), ("type", "loose"), ("style", "wind coat").
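The head/tail double-pointer decoding described above can be sketched as follows. This is a minimal illustration: the English sentence, attribute names and mapping values are hypothetical stand-ins, not the patent's (Chinese) data, and a real model — not hand-written lists — would emit the pointer vectors.

```python
# Entity decoding from head/tail pointer vectors; the attribute mapping is hypothetical.
ATTR_MAP = {1: "brand", 2: "type", 3: "style", 4: "fabric",
            5: "applicable season", 6: "function"}

def decode_entities(text, head, tail):
    """Pair each non-zero head position with the next tail position carrying
    the same mapping value, then look the value up in the attribute table."""
    entities = []
    for i, h in enumerate(head):
        if h == 0:
            continue
        for j in range(i, len(tail)):
            if tail[j] == h:  # the same mapping value closes the entity span
                entities.append((ATTR_MAP[h], text[i:j + 1]))
                break
    return entities

text = "ACME summer coat"
head = [1, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0]
tail = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 3]
print(decode_entities(text, head, tail))
# [('brand', 'ACME'), ('applicable season', 'summer'), ('style', 'coat')]
```

Because head and tail positions sharing a value delimit one span, overlapping or adjacent entities of different attributes can be recovered without exhaustive keyword lists.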
In this embodiment, when text information entity recognition is carried out, a preset teacher model is used to train the student model that performs the entity recognition. Because the teacher model has already been trained to convergence when the student model is trained, the student model can be rapidly trained to a convergence state through knowledge distillation and contrastive training. And because the model scale of the student model is smaller than that of the teacher model, using the student model for entity recognition lowers the requirement on the deployment environment and improves adaptability to it. In addition, extracting entity information by the head-tail double-pointer method improves the accuracy of entity information extraction.
In some embodiments, before a teacher model or a student model is trained, training samples for training are prepared. Referring to fig. 2, fig. 2 is a schematic flow chart of generating training samples according to the present embodiment.
As shown in fig. 2, S1100 previously includes:
s1101, collecting a sample text to be processed;
Before model training, sample texts for training need to be collected. The sample texts can be obtained by request from a professional text supply server. In some embodiments, commodity data and the sample texts corresponding to the commodity data can be crawled from an online e-commerce platform by web-crawler technology, and the raw crawled sample texts are cleaned, filtered, and stored in a database according to a pre-designed database specification.
S1102, performing word segmentation processing on the sample text to generate a sample entity;
Word segmentation is performed on the collected sample text. The segmentation mode can be: standard segmentation, NLP segmentation, index segmentation, N-shortest-path segmentation, or fast dictionary-based segmentation. After word segmentation, the words that represent commodity attributes, such as nouns, adjectives and quantifiers, are defined as sample entities. In this way, one or more sample entities can be generated from each sample text.
S1103, generating a label head pointer vector and a label tail pointer vector of the sample text according to the positions of the appearance of the first character and the tail character of the sample entity and the mapping value of the sample entity in the entity attribute mapping table;
In this embodiment, a corresponding mapping value is set for the attribute of each sample entity, and an entity attribute mapping table is constructed, such as the one shown in Table 1, in which the numbers 1 to 6 map the entity attributes of clothing commodities. The mapping table in this embodiment is not limited to the contents listed in Table 1; the contents of the entity attribute mapping table can be customized according to the specific application scenario.
TABLE 1
Entity attribute      Mapping value
Brand                 1
Type                  2
Style                 3
Fabric                4
Applicable season     5
Function              6
According to the position where each sample entity appears in the sample text and the entity attribute mapping table, the first character of each sample entity is mapped into the label head pointer vector, and the last character is mapped into the label tail pointer vector. For example, for "X-brand 2021 summer UV-proof ice-silk loose comfortable wind coat", the sample entities are ("brand", "X-brand"), ("applicable season", "summer"), ("function", "UV-proof"), ("fabric", "ice silk"), ("type", "loose"), ("style", "wind coat"). According to the positions of the sample entities in the sample text and the mapping values of Table 1, the converted label head pointer vector is [1,0,0,0,0,0,5,0,6,0,0,0,4,0,2,0,0,0,3,0], and similarly the converted label tail pointer vector is [0,1,0,0,0,0,0,5,0,0,0,6,0,4,0,2,0,0,0,3]. In these two vectors, the "1" in the head vector and the adjacent "1" in the tail vector indicate that the first and second characters in the sample text represent the brand of the commodity, which can be obtained by a mapping query. By analogy, each sample entity can be determined from the positions of the same number in the label head pointer vector and the label tail pointer vector, and its attribute information is determined through the entity attribute mapping table. This mapping process is the same as the method of determining the recognition result from the head pointer vector and the tail pointer vector in S1300-S1400.
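The labelling step S1103 can be sketched as below, assuming an English sample text and the Table-1-style attribute-to-value mapping; both are hypothetical illustrations rather than the patent's data.

```python
# Building label head/tail pointer vectors for a sample text (S1103 sketch);
# the attribute-to-value mapping is hypothetical.
ATTR_MAP = {"brand": 1, "type": 2, "style": 3, "fabric": 4,
            "applicable season": 5, "function": 6}

def label_vectors(text, entities):
    """Mark each entity's first character in `head` and last character in
    `tail` with the entity attribute's mapping value."""
    head = [0] * len(text)
    tail = [0] * len(text)
    for attr, span in entities:
        start = text.find(span)  # position where the sample entity appears
        if start >= 0:
            head[start] = ATTR_MAP[attr]
            tail[start + len(span) - 1] = ATTR_MAP[attr]
    return head, tail

head, tail = label_vectors("ACME coat", [("brand", "ACME"), ("style", "coat")])
print(head)  # [1, 0, 0, 0, 0, 3, 0, 0, 0]
print(tail)  # [0, 0, 0, 1, 0, 0, 0, 0, 3]
```

The two vectors produced here are exactly the annotation data that, together with the sample text, constitute one training sample in S1104.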
S1104, constructing a training sample according to the label head pointer vector, the label tail pointer vector and the sample text;
and obtaining a marking head pointer vector and a marking tail pointer vector of the sample text, and taking the marking head pointer vector and the marking tail pointer vector as marking data of the sample text. Therefore, the labeling process of the sample text is completed, and the first labeling pointer vector, the last labeling pointer vector and the sample text jointly construct a training sample.
And S1105, performing model training on the student model and/or the teacher model according to the training samples.
The training sample constructed in the present embodiment is only one of several training samples in the subsequent training set, and in the construction process of the training set, each training sample is generated through the processes of S1101-S1105.
The training samples obtained above can be used to train the student model, to train the teacher model, or to train the teacher model and the student model together.
In this embodiment, the training samples extract the commodity attributes of the sample text by the head-tail double-pointer method. The traditional keyword method can only enumerate attributes exhaustively, which is a clear technical limitation given the rapid growth of e-commerce commodity data, and its accuracy on newly added commodity data is low. Template-based methods can improve accuracy, but can only mine results that match the designed patterns, so their recall rate is low. By using the head-tail double-pointer method, this scheme improves the confidence of commodity attribute extraction while improving both accuracy and recall.
In some embodiments, when training the student model, it is first necessary to train the teacher model so that the teacher model is in a converged state. Referring to fig. 3, fig. 3 is a flowchart illustrating a teacher model training process according to the present embodiment.
S1105 as shown in fig. 3 includes:
s1111, inputting the sample text into an initial teacher model of the teacher model, wherein the initial teacher model is in a non-convergence state of the teacher model;
in this embodiment, the training samples are used to train the teacher model. Model parameters of a model used to train a teacher model are initialized, and the initialized model is called an initial teacher model, and the initial teacher model is in a discrete state, namely a non-convergent state.
The teacher model in this embodiment can be a convolutional neural network model, a deep convolutional neural network model, a recurrent neural network model, or a variant of any of these models, trained to convergence through supervised training.
When training starts, sample texts in training samples are input into an initial teacher model for feature extraction.
S1112, reading a first head pointer vector and a first tail pointer vector output by the initial teacher model;
The initial teacher model extracts feature vectors from the input sample text to obtain two groups of feature vectors: a first head pointer vector and a first tail pointer vector.
S1113, calculating a first loss value between the label head pointer vector and the first head pointer vector, and a second loss value between the label tail pointer vector and the first tail pointer vector;
A teacher loss function is set in the initial teacher model and is used to calculate the first loss value between the label head pointer vector and the first head pointer vector, and the second loss value between the label tail pointer vector and the first tail pointer vector.
The teacher loss function in this embodiment includes (without limitation): one of, or a composite function combining two or more of, a 0-1 loss function, an absolute-value loss function, a log loss function, a square loss function, and an exponential loss function.
And S1114, performing back-propagation correction on the model parameters of the initial teacher model according to the first loss value and the second loss value.
After the first loss value and the second loss value are calculated, back-propagation is performed on the initial teacher model through the AdamW algorithm using the first loss value and the second loss value. The weight parameters in the initial teacher model are corrected by calculating the gradient values of the teacher loss function, so that the first loss value and the second loss value gradually fall below a preset teacher loss threshold. Only when the first loss value and the second loss value are smaller than the preset teacher loss threshold has the teacher model finished training on the training sample.
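The AdamW update used for the back-propagation correction can be sketched for a single scalar weight. This is a minimal pure-Python illustration of the algorithm's update rule; the hyper-parameter values are common defaults, not values specified by the patent.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: Adam moment estimates plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction at step t
    v_hat = v / (1 - b2 ** t)
    w -= lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

w, m, v = 0.5, 0.0, 0.0                   # one weight of the initial teacher model
w, m, v = adamw_step(w, grad=0.2, m=m, v=v, t=1)
print(round(w, 6))  # 0.498995
```

The decoupled weight-decay term (`wd * w`) is what distinguishes AdamW from plain Adam with L2 regularization.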
It should be noted that, in the process of training the initial teacher model to converge into the teacher model, a large number of training samples are required for training, and S1111 to S1114 only show the process of one training sample in one training turn. And the training of the model is carried out in a loop iteration mode, so that the same training sample needs to go through one or more rounds of S1111-S1114 in the training process.
When the training round of the initial teacher model reaches the preset training number or the accuracy of the initial teacher model reaches the preset accuracy requirement, the initial teacher model is trained to be in a convergence state to become the teacher model.
In some embodiments, after the teacher model is successfully trained, the initial student model is trained by the teacher model, and the first step in training the initial student model is to calculate the distillation loss between the teacher model and the initial student model. Referring to fig. 4, fig. 4 is a schematic flow chart illustrating the distillation loss generated in the present embodiment.
As shown in fig. 4, S1114 thereafter includes:
s1121, inputting the training samples into initial student models of the teacher model and the student models respectively, wherein the initial student models are non-convergence states of the student models;
After the teacher model has been trained to the convergence state, it is used to assist in training the student model. The student model in this embodiment is a Transformer model that only includes the encoder component of the Transformer; specifically, the student model includes n layers of cascaded encoders, where n is less than or equal to 12.
Before a student model is trained, model parameters for training the student model need to be initialized, the model with the initialized parameters is called an initial student model, and the initial student model is in a discrete state, namely in a non-convergence state.
When training begins, sample texts in training samples are respectively input into a teacher model and an initial student model. And respectively carrying out feature extraction and classification calculation on the sample text by the teacher model and the initial student model.
S1122, reading the teacher feature vector output by the teacher model and the student feature vector output by the initial student model;
inputting the sample text into a teacher model, extracting the feature vectors in the sample text by the teacher model through convolution operation to generate teacher feature coding vectors, and then performing confidence calculation on the teacher feature coding vectors through a full-connection layer or a classification layer to generate a second head pointer vector and a second tail pointer vector. The teacher feature encoding vector, the second head pointer vector and the second tail pointer vector are collectively referred to as a teacher feature vector.
Inputting the sample text into an initial student model, extracting feature vectors in the sample text through convolution operation and self-attention calculation by the initial student model to generate student feature coding vectors, and then obtaining a third head pointer vector and a third tail pointer vector through feature mapping. The student feature code vector, the third head pointer vector and the third tail pointer vector are collectively called student feature vectors.
And S1123, calculating distillation loss between the teacher characteristic vector and the student characteristic vector.
The distillation loss between the teacher feature vector and the student feature vector is calculated through the set loss functions. The distillation loss covers: the loss value between the teacher feature encoding vector and the student feature encoding vector, the loss between the second head pointer vector and the third head pointer vector, the loss between the second tail pointer vector and the third tail pointer vector, the loss between the third head pointer vector and the label head pointer vector, and the loss between the third tail pointer vector and the label tail pointer vector.
Referring to fig. 5, fig. 5 is a schematic flow chart of calculating distillation loss according to the present embodiment. As shown in fig. 5, S1123 includes:
s1131, respectively calculating a first cross entropy loss value between the third head pointer vector and the label head pointer vector, and a second cross entropy loss value between the third tail pointer vector and the label tail pointer vector;
the cross entropy can be used as a loss function in a neural network, p represents the distribution of the real labels, q is the distribution of the prediction labels of the trained model, and the similarity of p and q can be measured through the cross entropy loss function.
In this embodiment, a cross entropy loss function is used to calculate a first cross entropy loss value between the third head pointer vector and the labeled head pointer vector, and a second cross entropy loss value between the third tail pointer vector and the labeled tail pointer vector.
For example: inputting the training text into the teacher model yields the teacher feature encoding vector V_t, the second head pointer vector Y_{t,s} and the second tail pointer vector Y_{t,e}.
Inputting the training sample into the student model S yields the student feature encoding vector V_s, the head pointer vector output Y_{s,s} and the tail pointer vector output Y_{s,e}.
Suppose the labels of the label head pointer vector and the label tail pointer vector corresponding to training sample X are Y_s and Y_e respectively.
The calculation formula of the first cross entropy loss value and the second cross entropy loss value is:
Loss_{ce,s} = CrossEntropy(Y_{s,s}, Y_s)
Loss_{ce,e} = CrossEntropy(Y_{s,e}, Y_e)
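The cross entropy term can be sketched as a mean negative log-likelihood over text positions. The probability values below are illustrative; a real student model would produce them from its pointer-vector outputs.

```python
import math

def cross_entropy(pred_probs, labels):
    """Mean negative log-likelihood of the true class at each text position."""
    return -sum(math.log(p[y]) for p, y in zip(pred_probs, labels)) / len(labels)

# Two positions; class 0 = non-entity, class 1 = brand head (hypothetical).
pred = [[0.1, 0.9], [0.8, 0.2]]  # student's per-position class probabilities
labels = [1, 0]                  # label head pointer values
print(round(cross_entropy(pred, labels), 4))  # 0.1643
```

The closer the predicted probability of the labelled class is to 1 at every position, the smaller this loss becomes.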
s1132, respectively calculating a first divergence loss value between the second head pointer vector and the third head pointer vector and a second divergence loss value between the second tail pointer vector and the third tail pointer vector;
KL divergence (Kullback-Leibler divergence), also known as relative entropy, is a method to describe the difference between two probability distributions P and Q.
In the present embodiment, a first divergence loss value between the second head pointer vector and the third head pointer vector and a second divergence loss value between the second tail pointer vector and the third tail pointer vector are calculated by using the KL divergence function.
The first divergence loss value and the second divergence loss value are calculated as follows:
Loss_{kl,s} = KLdivergence(Y_{s,s}, Y_{t,s})
Loss_{kl,e} = KLdivergence(Y_{s,e}, Y_{t,e})
s1133, calculating a third divergence loss value between the teacher feature encoding vector and the student feature encoding vector;
calculating a third divergence loss value between the teacher feature encoding vector and the student feature encoding vector by using the KL divergence function.
The third divergence loss value is calculated as follows:
Loss_{kl,v} = KLdivergence(V_s, V_t)
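The KL divergence used in S1132 and S1133 can be sketched for discrete distributions. The probability values are illustrative; in the scheme they would come from the teacher's and student's pointer-vector or feature-vector distributions.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) between two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

student = [0.6, 0.3, 0.1]  # e.g. student's per-class pointer probabilities
teacher = [0.7, 0.2, 0.1]  # teacher's probabilities for the same position
print(round(kl_divergence(student, teacher), 4))  # 0.0291
```

KL divergence is zero only when the two distributions coincide, so minimizing it pulls the student's distribution toward the teacher's.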
s1134, generating the distillation loss according to the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value and the third divergence loss value.
After the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value and the third divergence loss value are obtained through calculation, the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value and the third divergence loss value are collectively called as the distillation loss of the student model.
It should be noted that in this embodiment, the operations in S1131 to S1133 have no explicit precedence relationship, and the operation sequence can be adjusted accordingly according to actual requirements.
In some embodiments, after calculating the loss of distillation, the initial student model is trained by positive and negative samples. Referring to fig. 6, fig. 6 is a schematic diagram illustrating a comparison training process according to the present embodiment.
As shown in fig. 6, after S1134, the method includes:
s1141, constructing a negative sample of the training sample;
In this embodiment, the annotation result in each training sample constructed in S1101-S1105 matches the actual content of its sample text, so the training sample is used as a positive sample. When a negative sample is constructed, the training text is used as the text of the negative sample, but its annotation is replaced with the annotation of a different training text, so that the annotation paired with the negative sample is deliberately wrong and inconsistent with the text. In this way, a negative sample is constructed.
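The negative-sample construction above can be sketched as annotation swapping between samples; the sample data below is hypothetical, and cycling to the "next" sample's annotation is just one simple way to guarantee a mismatch.

```python
# Negative-sample construction (S1141 sketch): keep each sample's text but
# pair it with another sample's annotation; the data is hypothetical.
def build_negatives(samples):
    """samples: list of (text, annotation) positives; each negative reuses
    the text with the next sample's (wrong) annotation."""
    n = len(samples)
    return [(samples[i][0], samples[(i + 1) % n][1]) for i in range(n)]

positives = [("ACME coat", ["brand", "style"]),
             ("ice-silk shirt", ["fabric", "style"])]
negatives = build_negatives(positives)
print(negatives[0])  # ('ACME coat', ['fabric', 'style'])
```

In practice one would also verify that the swapped annotation really differs from the original, so no accidental positive slips into the negative set.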
S1142, inputting the training samples into the teacher model and the initial student model to generate a first joint feature set;
and inputting the training sample serving as a positive sample into the teacher model and the initial student model, and combining the output of the teacher model and the output of the initial student model to generate a first combined feature set. The first joint feature set comprises a head pointer vector and a tail pointer vector output by the teacher model and a head pointer vector and a tail pointer vector output by the initial student model.
S1143, inputting the training sample into the initial student model for the second time, and generating a second combined feature set by combining the student model features in the first combined feature set;
and inputting the training samples into the initial student model for the second time, wherein the student model is composed of a transform encoder, and a random pooling mechanism is arranged in the transform encoder, so that the initial student model has different output results aiming at the same input. Therefore, in order to constrain the randomness of the student model output and make the positive sample output consistent, secondary input is required to be performed on the training sample.
And inputting the training sample into the initial student model for the second time to obtain a secondary output result output by the initial student model, and combining the result output by the initial student model in the first combined feature set and the secondary output result output by the initial student model to generate a second combined feature set. The second combined feature set comprises a head pointer vector and a tail pointer vector which are output by the initial student model for the training sample for the first time, and a head pointer vector and a tail pointer vector which are output for the second time.
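The fact that two passes over the same input disagree can be illustrated with a toy dropout layer. This is only an analogy for the stochastic pooling mechanism described above, not the patent's encoder.

```python
import random

def forward_with_dropout(features, rng, p=0.5):
    """Toy encoder pass: dropout zeroes random features, so repeated
    passes over the same input generally disagree."""
    return [0.0 if rng.random() < p else x / (1 - p) for x in features]

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
out1 = forward_with_dropout(x, rng)  # first pass of the training sample
out2 = forward_with_dropout(x, rng)  # second pass of the same sample
print(out1 != out2)  # True for this seed: the two passes disagree
```

Treating the two passes as a positive pair, as in S1143, penalizes exactly this disagreement and makes the student's representations more stable.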
S1144, inputting the negative sample into the initial student model, and generating a third combined feature set by combining the student model features in the first combined feature set;
and inputting the negative sample into the initial student model to obtain the negative sample characteristics output by the initial student model, and combining the output result of the training sample output by the initial student model in the first combined characteristic set and the negative sample characteristics to generate a third combined characteristic set. The third combined feature set comprises: the initial student model outputs a head pointer vector and a tail pointer vector for the training samples, and the initial student model outputs a head pointer vector and a tail pointer vector for the negative samples.
S1145, calculating a first comparison loss value between the first combined feature set and the third combined feature set, and a second comparison loss value between the second combined feature set and the third combined feature set.
After the first combined feature set, the second combined feature set and the third combined feature set are obtained, the first comparison loss value between the first combined feature set and the third combined feature set, and the second comparison loss value between the second combined feature set and the third combined feature set, are calculated through a composite loss function built from logarithmic loss terms.
For example, let the training sample be X_i, its output through the teacher model T be V_{t,i}, and its output through the initial student model S be V_{s,i}.
Denote the negative sample corresponding to training sample X_i as X_j, where j ∈ {1, 2, ..., i-1, i+1, ..., m} and m is the number of training samples. The feature set of the negative samples output by the teacher model T is V_{t,j}, and the feature set output by the initial student model S is V_{s,j}.
Construct the positive sample pair A_1: {V_{t,i}, V_{s,i}} and the negative sample pairs B_1: {(V_{s,i}, V_{s,j}) | j ∈ {1, 2, ..., i-1, i+1, ..., m}}. The student model is composed of Transformer encoders carrying a stochastic pooling mechanism, that is, when the same X_i is input into the student model S twice, the two results differ. Here, the feature encoding vector obtained from the second input is denoted V'_{s,i}, and another positive sample pair A_2: {V_{s,i}, V'_{s,i}} is constructed.
The calculation formula of the first comparison loss value and the second comparison loss value is as follows:
Loss_{cl,1} = -log( exp(sim(V_{t,i}, V_{s,i}) / τ) / (exp(sim(V_{t,i}, V_{s,i}) / τ) + Σ_{j≠i} exp(sim(V_{s,i}, V_{s,j}) / τ)) )
Loss_{cl,2} = -log( exp(sim(V_{s,i}, V'_{s,i}) / τ) / (exp(sim(V_{s,i}, V'_{s,i}) / τ) + Σ_{j≠i} exp(sim(V_{s,i}, V_{s,j}) / τ)) )
where sim(·,·) denotes the similarity between two vectors and τ is a temperature hyper-parameter.
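A common concrete form of such a contrastive loss is InfoNCE. The sketch below assumes cosine similarity and a temperature of 0.1, neither of which is fixed by the description above, and the feature vectors are hypothetical two-dimensional stand-ins.

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive loss: pull the positive pair together in similarity
    space while pushing the negative pairs apart."""
    def sim(a, b):  # cosine similarity
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    pos = math.exp(sim(anchor, positive) / tau)
    neg = sum(math.exp(sim(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]       # student output V_s,i (hypothetical)
positive = [0.9, 0.1]     # teacher output V_t,i (hypothetical)
negatives = [[0.0, 1.0]]  # student outputs of negative samples
print(round(info_nce(anchor, positive, negatives), 6))
```

The loss shrinks as the anchor-positive similarity grows relative to the anchor-negative similarities, which is exactly the "pull positives together, push negatives apart" behaviour described above.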
In this embodiment, through the contrastive-learning training mode, the student model learns a comparison task in addition to the entity recognition task, and the extra semantic information improves the expressiveness of the student model. In the model training stage, two positive sample pairs (the outputs of the teacher model and the initial student model, and the two outputs of the initial student model) and the negative sample pairs (within the same training pass, the student-model output of the current training sample paired with the student/teacher-model outputs of the negative samples) are constructed, and the student model is trained by contrastive learning. During training, the semantic-space distance between positive samples is shortened while the distance between negative samples is lengthened, keeping the representations uniformly distributed on the hypersphere of the parameter space.
In some embodiments, after the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value, the third divergence loss value, the first comparison loss value and the second comparison loss value are calculated, the student model needs to be corrected according to the loss values. Referring to fig. 7, fig. 7 is a schematic flow chart illustrating the feedback correction of the initial student model according to the present embodiment.
As shown in fig. 7, after S1145, the method includes:
s1151, performing weighted operation on the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value, the third divergence loss value, the first comparison loss value and the second comparison loss value according to a preset parameter threshold value to generate a total loss value;
After the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value, the third divergence loss value, the first comparison loss value and the second comparison loss value are calculated, the total loss value is obtained by a weighted calculation over these seven loss values.
The total loss value is calculated as follows:
Loss = α_1·Loss_{ce,s} + α_2·Loss_{ce,e} + α_3·Loss_{kl,s} + α_4·Loss_{kl,e} + α_5·Loss_{kl,v} + α_6·Loss_{cl,1} + α_7·Loss_{cl,2}
where each α_i is a hyper-parameter set before training.
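The weighted total loss can be sketched directly; the α values below are hypothetical placeholders, since the patent leaves them as hyper-parameters.

```python
# Weighted total loss of S1151; the alpha weights are hypothetical.
ALPHAS = [1.0, 1.0, 0.5, 0.5, 0.5, 0.1, 0.1]  # α_1 .. α_7

def total_loss(losses, alphas=ALPHAS):
    """losses in formula order: ce_s, ce_e, kl_s, kl_e, kl_v, cl_1, cl_2."""
    return sum(a * l for a, l in zip(alphas, losses))

print(total_loss([0.2, 0.3, 0.1, 0.1, 0.4, 1.5, 1.2]))  # 1.07 (up to float rounding)
```

Tuning the α_i trades off how strongly the hard labels, the distillation signals and the contrastive signals each steer the student model.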
And S1152, carrying out back transmission correction on the model parameters of the initial student model according to the total loss value.
After the total loss value is calculated, back-propagation is performed on the initial student model through the AdamW algorithm using the total loss value. The weight parameters in the initial student model are corrected by calculating the gradient values of the student loss function, so that the total loss value gradually falls below a preset student loss threshold. Only when the total loss value is smaller than the preset student loss threshold has the student model finished training on the training sample.
It should be noted that, in the process of training the initial student model to converge into the student model, a large number of training samples are required for training, and S1121-S1152 only show the process of one training sample in one training turn. And the training of the model is carried out in a loop iteration mode, so that the same training sample needs to go through one or more rounds of S1121-S1152 in the training process.
Further, the loss value calculation of the initial student model during training is not limited to the total-loss calculation above. In some embodiments, the loss value of the initial student model is the accumulated distillation loss obtained in S1134, that is, the weighted sum of the first cross entropy loss value, the first divergence loss value, the second cross entropy loss value, the second divergence loss value and the third divergence loss value. In still other embodiments, the loss value of the initial student model is the comparison loss, that is, the weighted sum of the first comparison loss value and the second comparison loss value.
Referring to fig. 8, fig. 8 is a schematic diagram of a basic structure of the text entity recognition device according to the embodiment.
As shown in fig. 8, a text entity recognition apparatus includes: a read module 1100, an extract module 1200, a process module 1300, and an execute module 1400. The reading module 1100 is configured to read text information to be recognized; the extraction module 1200 is configured to input the text information into a preset student model, where the student model is trained to a convergence state through knowledge distillation based on a preset teacher model, and is configured to extract a head pointer vector and a tail pointer vector of the text information, and a model scale of the student model is smaller than a model scale of the teacher model; the processing module 1300 is configured to read the head pointer vector and the tail pointer vector output by the student model, and perform entity extraction on the text information according to an entity attribute mapping table corresponding to the head pointer vector and the tail pointer vector to generate entity information and entity attribute information; the execution module 1400 is configured to generate an identification result of the text information according to the entity information and the entity attribute information.
When the text entity recognition device carries out text information entity recognition, a preset teacher model is used to train the student model that performs the entity recognition. Because the teacher model has already been trained to convergence when the student model is trained, the student model can be rapidly trained to a convergence state through knowledge distillation and contrastive training. And because the model scale of the student model is smaller than that of the teacher model, using the student model for entity recognition lowers the requirement on the deployment environment and improves adaptability to it. In addition, extracting entity information by the head-tail double-pointer method improves the accuracy of entity information extraction.
Optionally, the text entity recognition apparatus further includes: the device comprises a first acquisition submodule, a first word segmentation submodule, a first mapping submodule, a first processing submodule and a first execution submodule. The first acquisition submodule is used for acquiring a sample text to be processed; the first word segmentation sub-module is used for carrying out word segmentation on the sample text to generate a sample entity; the first mapping submodule is used for generating a marking head pointer vector and a marking tail pointer vector of the sample text according to the positions of the first character and the tail character of the sample entity and the mapping value of the sample entity in the entity attribute mapping table; the first processing submodule is used for constructing a training sample according to the first labeling pointer vector, the last labeling pointer vector and the sample text; and the first execution submodule is used for performing model training on the student model and/or the teacher model according to the training samples.
Optionally, the training sample is used for model training of the teacher model, and the text entity recognition apparatus further includes: a first input submodule, a first reading submodule, a first calculation submodule and a first correction submodule. The first input submodule is used for inputting the sample text into an initial teacher model, the initial teacher model being the teacher model in a non-convergence state; the first reading submodule is used for reading a first head pointer vector and a first tail pointer vector output by the initial teacher model; the first calculation submodule is used for calculating a first loss value between the labeled head pointer vector and the first head pointer vector, and a second loss value between the labeled tail pointer vector and the first tail pointer vector; and the first correction submodule is used for performing back-propagation correction on the model parameters of the initial teacher model according to the first loss value and the second loss value.
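A minimal numeric sketch of the teacher-model loss computation follows, assuming the head and tail pointer outputs are per-position class probabilities and the two losses are standard cross-entropy against the labeled vectors. The disclosure does not fix the loss form; all probability values and labels here are hypothetical.

```python
import math

def cross_entropy(pred_probs, labels):
    """Mean cross-entropy between per-position probability rows and
    integer labels (0 = no boundary, k = attribute mapping value)."""
    return -sum(math.log(row[y]) for row, y in zip(pred_probs, labels)) / len(labels)

# Hypothetical teacher outputs for a 3-character text with 3 classes
head_probs = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
tail_probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1]]
head_labels, tail_labels = [1, 0, 0], [0, 2, 1]

loss1 = cross_entropy(head_probs, head_labels)   # first loss value (head pointer)
loss2 = cross_entropy(tail_probs, tail_labels)   # second loss value (tail pointer)
teacher_loss = loss1 + loss2                     # drives back-propagation correction
```

In a real training loop both terms would be computed by an autograd framework so that their gradients can be propagated back through the initial teacher model's parameters.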
Optionally, the text entity recognition apparatus further includes: a second input submodule, a second reading submodule and a second calculation submodule. The second input submodule is used for inputting the training sample into the teacher model and into an initial student model respectively, the initial student model being the student model in a non-convergence state; the second reading submodule is used for reading the teacher feature vector output by the teacher model and the student feature vector output by the initial student model; and the second calculation submodule is used for calculating the distillation loss between the teacher feature vector and the student feature vector.
Optionally, the teacher feature vector includes a teacher feature encoding vector, a second head pointer vector and a second tail pointer vector, and the student feature vector includes a student feature encoding vector, a third head pointer vector and a third tail pointer vector; the text entity recognition apparatus further includes: a third calculation submodule, a fourth calculation submodule, a fifth calculation submodule and a second processing submodule. The third calculation submodule is used for calculating a first cross-entropy loss value between the third head pointer vector and the labeled head pointer vector, and a second cross-entropy loss value between the third tail pointer vector and the labeled tail pointer vector; the fourth calculation submodule is used for calculating a first divergence loss value between the second head pointer vector and the third head pointer vector, and a second divergence loss value between the second tail pointer vector and the third tail pointer vector; the fifth calculation submodule is used for calculating a third divergence loss value between the teacher feature encoding vector and the student feature encoding vector; and the second processing submodule is used for generating the distillation loss according to the first cross-entropy loss value, the first divergence loss value, the second cross-entropy loss value, the second divergence loss value and the third divergence loss value.
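The distillation loss combines two cross-entropy terms (student output vs. labels) with three divergence terms (student vs. teacher). A simplified per-position sketch using KL divergence is shown below; the disclosure says only "divergence loss", so the choice of KL, the hypothetical labels, and all distribution values are assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical per-position distributions (teacher vs. student)
teacher_head = [0.1, 0.8, 0.1]; student_head = [0.2, 0.6, 0.2]
teacher_tail = [0.7, 0.2, 0.1]; student_tail = [0.5, 0.3, 0.2]
teacher_enc  = [0.3, 0.4, 0.3]; student_enc  = [0.25, 0.5, 0.25]

ce_head = -math.log(student_head[1])   # first cross-entropy loss (label class 1)
ce_tail = -math.log(student_tail[0])   # second cross-entropy loss (label class 0)
kl_head = kl_divergence(teacher_head, student_head)   # first divergence loss
kl_tail = kl_divergence(teacher_tail, student_tail)   # second divergence loss
kl_enc  = kl_divergence(teacher_enc, student_enc)     # third divergence loss

distill_loss = ce_head + ce_tail + kl_head + kl_tail + kl_enc
```

The cross-entropy terms anchor the student to the ground-truth labels, while the divergence terms transfer the teacher's softer output distribution — the usual division of labor in knowledge distillation.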
Optionally, the text entity recognition apparatus further includes: a first construction submodule, a third input submodule, a fourth input submodule, a third processing submodule and a sixth calculation submodule. The first construction submodule is used for constructing a negative sample of the training sample; the third input submodule is used for inputting the training sample into the teacher model and the initial student model to generate a first joint feature set; the fourth input submodule is used for inputting the training sample into the initial student model a second time and combining the result with the student model features in the first joint feature set to generate a second joint feature set; the third processing submodule is used for inputting the negative sample into the initial student model and combining the result with the student model features in the first joint feature set to generate a third joint feature set; and the sixth calculation submodule is used for calculating a first contrastive loss value between the first joint feature set and the third joint feature set, and a second contrastive loss value between the second joint feature set and the third joint feature set.
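The contrastive step — pushing the two positive joint feature sets away from the negative one — can be illustrated with a toy cosine-similarity loss. This is a drastic simplification: the disclosure does not specify the contrastive loss form, and the feature vectors below are hypothetical stand-ins for the joint feature sets.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(positive_set, negative_set):
    """Toy contrastive loss: high when the positive joint feature set is
    still similar to the negative one, low once they are pulled apart."""
    return max(0.0, cosine(positive_set, negative_set))

first_set  = [0.9, 0.1, 0.4]    # teacher + student features (positive)
second_set = [0.8, 0.2, 0.5]    # student (2nd pass) + student features (positive)
third_set  = [0.1, 0.9, 0.2]    # negative sample + student features

loss1 = contrastive_loss(first_set, third_set)    # first contrastive loss value
loss2 = contrastive_loss(second_set, third_set)   # second contrastive loss value
```

A production system would more likely use an InfoNCE-style objective over a batch of negatives, but the structure — two positive views contrasted against one negative view — is the same.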
Optionally, the text entity recognition apparatus further includes: a fourth processing submodule and a second execution submodule. The fourth processing submodule is used for performing a weighted operation on the first cross-entropy loss value, the first divergence loss value, the second cross-entropy loss value, the second divergence loss value, the third divergence loss value, the first contrastive loss value and the second contrastive loss value according to a preset parameter threshold to generate a total loss value; and the second execution submodule is used for performing back-propagation correction on the model parameters of the initial student model according to the total loss value.
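The weighted combination of the seven loss terms can be sketched as below. All loss values and weights are hypothetical, and the "preset parameter threshold" of the disclosure is modeled simply as one fixed weight per term.

```python
# Hypothetical loss values (e.g. from earlier in the training step)
losses = {"ce_head": 0.36, "kl_head": 0.05, "ce_tail": 0.31,
          "kl_tail": 0.07, "kl_enc": 0.02, "ctr1": 0.40, "ctr2": 0.35}

# Hypothetical per-term weights standing in for the preset parameter threshold
weights = {"ce_head": 1.0, "kl_head": 0.5, "ce_tail": 1.0,
           "kl_tail": 0.5, "kl_enc": 0.5, "ctr1": 0.2, "ctr2": 0.2}

# Weighted total loss; this single scalar drives back-propagation
# correction of the initial student model's parameters.
total_loss = sum(weights[k] * losses[k] for k in losses)
```

Keeping the weights as explicit hyperparameters makes it easy to rebalance the supervised, distillation, and contrastive signals without touching the training loop.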
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Fig. 9 is a block diagram of the basic structure of the computer device according to the present embodiment.
As shown in fig. 9, the computer device includes a processor, a non-volatile storage medium, a memory and a network interface connected through a system bus. The non-volatile storage medium stores an operating system, a database and computer-readable instructions; the database may store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a text entity recognition method. The processor provides computing and control capability and supports the operation of the whole computer device. The memory may store computer-readable instructions which, when executed by the processor, cause the processor to perform the text entity recognition method. The network interface is used for connecting and communicating with a terminal. Those skilled in the art will appreciate that the structure shown in fig. 9 is merely a block diagram of part of the structure related to the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of the reading module 1100, the extraction module 1200, the processing module 1300 and the execution module 1400 in fig. 8, and the memory stores the program codes and various data required for executing these modules. The network interface is used for data transmission to and from a user terminal or a server. The memory also stores the program codes and data required for executing all the submodules in the text entity recognition device, and the server can call these program codes and data to execute the functions of all the submodules.
When the computer device performs entity recognition on text information, a preset teacher model is used to train the student model that performs the entity recognition. Because the teacher model has already been trained to convergence before the student model is trained, the student model can be rapidly trained to a convergent state by means of knowledge distillation and contrastive training. Since the model scale of the student model is smaller than that of the teacher model, performing entity recognition with the student model lowers the requirements on the deployment environment and improves adaptability to it. In addition, extracting entity information with the head-tail double-pointer method improves the accuracy of entity extraction.
The present application further provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text entity recognition method in any of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
Those skilled in the art will appreciate that the various operations, methods, and steps in the flows, acts, or solutions discussed in this application can be interchanged, modified, combined, or deleted. Further, other steps, measures, or schemes in the various operations, methods, or flows discussed in this application, including those known in the prior art, can likewise be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A method for text entity recognition, comprising:
reading text information to be identified;
inputting the text information into a preset student model, wherein the student model is trained to a convergence state through knowledge distillation based on a preset teacher model and is used for extracting a head pointer vector and a tail pointer vector of the text information, and the model scale of the student model is smaller than that of the teacher model;
reading the head pointer vector and the tail pointer vector output by the student model, and performing entity extraction on the text information according to an entity attribute mapping table corresponding to the head pointer vector and the tail pointer vector to generate entity information and entity attribute information;
and generating an identification result of the text information according to the entity information and the entity attribute information.
2. The text entity recognition method of claim 1, wherein before reading the text information to be recognized, the method further comprises:
collecting a sample text to be processed;
performing word segmentation processing on the sample text to generate a sample entity;
generating a labeled head pointer vector and a labeled tail pointer vector of the sample text according to the positions of the first and last characters of each sample entity and the mapping value of the sample entity in the entity attribute mapping table;
constructing a training sample according to the labeled head pointer vector, the labeled tail pointer vector and the sample text;
and performing model training on the student model and/or the teacher model according to the training samples.
3. The text entity recognition method of claim 2, wherein the training samples are used for model training of the teacher model, and the model training of the student model and/or the teacher model according to the training samples comprises:
inputting the sample text into an initial teacher model, wherein the initial teacher model is the teacher model in a non-convergence state;
reading a first head pointer vector and a first tail pointer vector output by the initial teacher model;
calculating a first loss value between the labeled head pointer vector and the first head pointer vector, and a second loss value between the labeled tail pointer vector and the first tail pointer vector;
and performing back-propagation correction on the model parameters of the initial teacher model according to the first loss value and the second loss value.
4. The text entity recognition method of claim 2, wherein the model training of the student model and/or the teacher model according to the training samples comprises:
inputting the training samples into the teacher model and into an initial student model respectively, wherein the initial student model is the student model in a non-convergence state;
reading the teacher feature vector output by the teacher model and the student feature vector output by the initial student model;
calculating a distillation loss between the teacher feature vector and the student feature vector.
5. The text entity recognition method of claim 4, wherein the teacher feature vector comprises a teacher feature encoding vector, a second head pointer vector and a second tail pointer vector, and the student feature vector comprises a student feature encoding vector, a third head pointer vector and a third tail pointer vector; and the calculating of the distillation loss between the teacher feature vector and the student feature vector comprises:
respectively calculating a first cross-entropy loss value between the third head pointer vector and the labeled head pointer vector, and a second cross-entropy loss value between the third tail pointer vector and the labeled tail pointer vector;
respectively calculating a first divergence loss value between the second head pointer vector and the third head pointer vector, and a second divergence loss value between the second tail pointer vector and the third tail pointer vector;
calculating a third divergence loss value between the teacher feature encoding vector and the student feature encoding vector;
and generating the distillation loss according to the first cross-entropy loss value, the first divergence loss value, the second cross-entropy loss value, the second divergence loss value and the third divergence loss value.
6. The text entity recognition method of claim 5, wherein after calculating the distillation loss between the teacher feature vector and the student feature vector, the method further comprises:
constructing a negative sample of the training sample;
inputting the training samples into the teacher model and the initial student model to generate a first joint feature set;
inputting the training sample into the initial student model a second time, and combining the result with the student model features in the first joint feature set to generate a second joint feature set;
inputting the negative sample into the initial student model, and combining the result with the student model features in the first joint feature set to generate a third joint feature set;
calculating a first contrastive loss value between the first joint feature set and the third joint feature set, and a second contrastive loss value between the second joint feature set and the third joint feature set.
7. The text entity recognition method of claim 6, wherein after calculating the first contrastive loss value between the first joint feature set and the third joint feature set and the second contrastive loss value between the second joint feature set and the third joint feature set, the method further comprises:
performing a weighted operation on the first cross-entropy loss value, the first divergence loss value, the second cross-entropy loss value, the second divergence loss value, the third divergence loss value, the first contrastive loss value and the second contrastive loss value according to a preset parameter threshold to generate a total loss value;
and performing back-propagation correction on the model parameters of the initial student model according to the total loss value.
8. A text entity recognition apparatus, comprising:
the reading module is used for reading text information to be identified;
the extraction module is used for inputting the text information into a preset student model, wherein the student model is trained to a convergence state through knowledge distillation based on a preset teacher model, is used for extracting a head pointer vector and a tail pointer vector of the text information, and has a model scale smaller than that of the teacher model;
the processing module is used for reading the head pointer vector and the tail pointer vector output by the student model, and performing entity extraction on the text information according to an entity attribute mapping table corresponding to the head pointer vector and the tail pointer vector to generate entity information and entity attribute information;
and the execution module is used for generating the identification result of the text information according to the entity information and the entity attribute information.
9. A computer storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text entity recognition method of any one of claims 1 to 7.
10. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 7.
CN202111628410.7A 2021-12-28 2021-12-28 Text entity identification method and device, equipment, medium and product thereof Pending CN114330346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111628410.7A CN114330346A (en) 2021-12-28 2021-12-28 Text entity identification method and device, equipment, medium and product thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111628410.7A CN114330346A (en) 2021-12-28 2021-12-28 Text entity identification method and device, equipment, medium and product thereof

Publications (1)

Publication Number Publication Date
CN114330346A true CN114330346A (en) 2022-04-12

Family

ID=81014270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111628410.7A Pending CN114330346A (en) 2021-12-28 2021-12-28 Text entity identification method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN114330346A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640809A (en) * 2022-12-26 2023-01-24 湖南师范大学 Document level relation extraction method based on forward guided knowledge distillation
CN115640809B (en) * 2022-12-26 2023-03-28 湖南师范大学 Document level relation extraction method based on forward guided knowledge distillation

Similar Documents

Publication Publication Date Title
CN111538908B (en) Search ranking method and device, computer equipment and storage medium
CN112329465B (en) Named entity recognition method, named entity recognition device and computer readable storage medium
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
CN113656581B (en) Text classification and model training method, device, equipment and storage medium
CN110413999A (en) Entity relation extraction method, model training method and relevant apparatus
US20230244704A1 (en) Sequenced data processing method and device, and text processing method and device
CN112800239B (en) Training method of intention recognition model, and intention recognition method and device
CN113806552B (en) Information extraction method and device, electronic equipment and storage medium
WO2024099037A1 (en) Data processing method and apparatus, entity linking method and apparatus, and computer device
CN115731425A (en) Commodity classification method, commodity classification device, commodity classification equipment and commodity classification medium
CN106503659A (en) Action identification method based on sparse coding tensor resolution
CN116703531B (en) Article data processing method, apparatus, computer device and storage medium
CN113962224A (en) Named entity recognition method and device, equipment, medium and product thereof
CN111027681B (en) Time sequence data processing model training method, data processing method, device and storage medium
CN112131261A (en) Community query method and device based on community network and computer equipment
CN112035689A (en) Zero sample image hash retrieval method based on vision-to-semantic network
Xu et al. Idhashgan: deep hashing with generative adversarial nets for incomplete data retrieval
CN114330346A (en) Text entity identification method and device, equipment, medium and product thereof
CN114861671A (en) Model training method and device, computer equipment and storage medium
CN113326701A (en) Nested entity recognition method and device, computer equipment and storage medium
CN113111971A (en) Intelligent processing method and device for classification model, electronic equipment and medium
WO2023168818A1 (en) Method and apparatus for determining similarity between video and text, electronic device, and storage medium
CN112487231B (en) Automatic image labeling method based on double-image regularization constraint and dictionary learning
CN115936805A (en) Commodity recommendation method, commodity recommendation device, commodity recommendation equipment and commodity recommendation medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination