CN114281933A - Text processing method and device, computer equipment and storage medium

Text processing method and device, computer equipment and storage medium

Info

Publication number
CN114281933A
Authority
CN
China
Prior art keywords
feature
coding
word
text
determining
Prior art date
Legal status
Pending
Application number
CN202111081005.8A
Other languages
Chinese (zh)
Inventor
欧子菁
赵瑞辉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111081005.8A
Publication of CN114281933A
Legal status: Pending

Classifications

  • Information Retrieval, Db Structures And Fs Structures Therefor

Abstract

The embodiment of the application discloses a text processing method and apparatus, a computer device, and a storage medium, belonging to the field of computer technology. The method comprises the following steps: acquiring a first feature corresponding to each word in a first text and a second feature corresponding to the first text; calling a feature coding model to encode each first feature and the second feature respectively, so as to obtain a first coding feature corresponding to each first feature and a second coding feature corresponding to the second feature; training the feature coding model based on a first associated feature between each first coding feature and the second coding feature; and calling the trained feature coding model to encode the features of any text. According to the method provided by the embodiment of the application, the feature coding model is trained based on the associated features between each first coding feature and the second coding feature, so the accuracy of the feature coding model can be improved, and with it the accuracy of the coding features the model produces.

Description

Text processing method and device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a text processing method and device, computer equipment and a storage medium.
Background
Text is an important object of study in machine learning and natural language processing. The coding features of a text can be widely applied in fields such as text recognition and text search, and are therefore of significant research interest.
In the related art, a word frequency feature of the text is obtained, the word frequency feature representing the number of occurrences of each word in the text, and the coding feature corresponding to the text is obtained by encoding the word frequency feature. However, this method only considers the number of occurrences of each word in the text, so the obtained coding features are not accurate enough.
Disclosure of Invention
The embodiment of the application provides a text processing method and device, computer equipment and a storage medium, which can improve the accuracy of coding characteristics. The technical scheme is as follows:
in one aspect, a text processing method is provided, and the method includes:
acquiring a first feature corresponding to each word in a first text and a second feature corresponding to the first text, wherein the first feature corresponding to the word represents the semantic meaning of the word in the first text, and the second feature is determined based on the first feature corresponding to each word;
calling a feature coding model, and coding each first feature and each second feature respectively to obtain a first coding feature corresponding to each first feature and a second coding feature corresponding to each second feature;
training the feature coding model based on a first correlation feature between each of the first coding features and the second coding features, the first correlation feature representing a degree of correlation between the first coding feature and the second coding feature;
and calling the trained feature coding model to code the features of any text.
In another aspect, there is provided a text processing apparatus, the apparatus including:
the feature acquisition module is used for acquiring a first feature corresponding to each word in a first text and a second feature corresponding to the first text, wherein the first feature corresponding to the word represents the semantic meaning of the word in the first text, and the second feature is determined based on the first feature corresponding to each word;
the first coding module is used for calling a feature coding model, and coding each first feature and each second feature respectively to obtain a first coding feature corresponding to each first feature and a second coding feature corresponding to each second feature;
a model training module for training the feature coding model based on a first correlation feature between each of the first coding features and the second coding features, the first correlation feature representing a degree of correlation between the first coding feature and the second coding feature;
and the second coding module is used for calling the trained feature coding model and coding the features of any text.
Optionally, the model training module includes:
a loss value determining unit, configured to determine a loss value based on a first correlation feature corresponding to each of the first encoding features, where the loss value is inversely correlated with the first correlation feature;
and the model training unit is used for training the feature coding model based on the loss value.
Optionally, the apparatus further comprises an association determining module configured to:
calling a discrimination model, discriminating the first coding feature and the second coding feature to obtain a discrimination result, wherein the discrimination result represents the possibility that a word corresponding to the first coding feature belongs to a text corresponding to the second coding feature;
and determining the discrimination result as a first associated feature corresponding to the first coding feature.
Optionally, the model training module includes:
a loss value determining unit, configured to determine a loss value based on a first correlation feature corresponding to each of the first encoding features, where the loss value is inversely correlated with the first correlation feature;
and the model training unit is used for training the feature coding model and the discrimination model based on the loss value.
Optionally, the apparatus further comprises:
the feature acquisition module is further configured to acquire a third feature corresponding to a word in a second text, where the third feature corresponding to the word represents a semantic meaning of the word in the second text, and the second text is different from the first text;
the first coding module is further configured to call the feature coding model, and code the third feature to obtain a third coding feature corresponding to the third feature;
an association determination module for determining a second association feature between the third encoding feature and the second encoding feature, the second association feature representing a degree of association between the third encoding feature and the second encoding feature;
the loss value determination unit is configured to:
determining the loss value based on each of the first correlation characteristic and the second correlation characteristic, the loss value being negatively correlated with the first correlation characteristic and the loss value being positively correlated with the second correlation characteristic.
Optionally, the first text includes words at a plurality of positions, and the loss value determining unit is configured to:
determining a loss component corresponding to each position based on a first associated feature and a second associated feature corresponding to each position, wherein the loss component is positively correlated with the first associated feature, and the loss component is negatively correlated with the second associated feature, wherein the first associated feature corresponding to the position refers to a first associated feature corresponding to a word located at the position in the first text, and the second associated feature corresponding to the position refers to a second associated feature corresponding to a word located at the position in the second text;
and carrying out fusion processing on the loss component corresponding to each position to obtain the loss value, wherein the loss value is in negative correlation with the loss component.
Optionally, the apparatus further comprises:
the feature acquisition module is further used for determining a first text feature corresponding to the first text, wherein the first text feature represents the semantic meaning of the first text;
the first coding module is further used for calling the feature coding model and coding the first text feature to obtain a fourth coding feature;
an association determination module for determining a third association feature between the fourth encoding feature and the second encoding feature, the third association feature representing a degree of association between the fourth encoding feature and the second encoding feature;
the loss value determination unit is configured to:
determining the loss value based on each of the first and third associated features, the loss value being inversely related to the first and third associated features.
Optionally, the apparatus further comprises:
the feature acquisition module is further configured to determine a second text feature corresponding to a second text, where the second text feature represents a semantic meaning of the second text, and the second text is different from the first text;
the first coding module is further used for calling the feature coding model and coding the second text feature to obtain a fifth coding feature;
an association determining module for determining a fourth association feature between the fifth coding feature and the second coding feature, the fourth association feature representing a degree of association between the fifth coding feature and the second coding feature;
the loss value determination unit is configured to:
determining the loss value based on each of the first correlation feature, the third correlation feature, and the fourth correlation feature, the loss value being negatively correlated with the first correlation feature and the third correlation feature, the loss value being positively correlated with the fourth correlation feature.
Optionally, the feature obtaining module includes:
the word characteristic determining unit is used for determining word characteristics corresponding to each word in the first text;
a first feature determining unit, configured to determine a first feature corresponding to each word based on a word feature corresponding to each word and a word feature corresponding to at least one word after each word, respectively.
Optionally, the first feature determining unit is configured to:
determining a plurality of target quantities, wherein the target quantities are different and less than the quantity of the words in the first text;
for each target quantity, determining a word group corresponding to the word, and determining a first sub-feature corresponding to the word based on a word feature corresponding to each word in the word group, wherein the word group comprises the word and a subsequent word of the word, the subsequent word of the word refers to a word after the word, and the quantity of the subsequent word in the word group is not greater than the target quantity;
and fusing the plurality of first sub-characteristics corresponding to the words to obtain the first characteristics corresponding to the words.
Optionally, the first feature determining unit is configured to:
determining the word and the target number of words after the word as a word group corresponding to the word under the condition that the total number of subsequent words of the word is not less than the target number;
determining the word and each word after the word as a word group corresponding to the word if the total number of subsequent words of the word is less than the target number.
Optionally, the first feature determining unit is configured to:
performing convolution processing on the word characteristics corresponding to each word in the word group respectively to obtain convolution characteristics corresponding to each word;
and determining the sum of the plurality of convolution characteristics as a first sub-characteristic corresponding to the word.
Optionally, the feature obtaining module includes:
the second feature determination unit is used for performing mean value pooling on the first features corresponding to a plurality of words in the first text to obtain the second feature; or,
the second feature determining unit is configured to determine a median of the first feature corresponding to the plurality of words in the first text as the second feature.
Optionally, the second encoding module is configured to:
determining word characteristics corresponding to each word in the target text;
determining fourth characteristics corresponding to each word based on word characteristics corresponding to each word in the target text and word characteristics corresponding to at least one word after each word;
determining a fifth feature corresponding to the target text based on the fourth feature corresponding to each word in the target text;
and calling the trained feature coding model, and coding the fifth feature to obtain a sixth coding feature corresponding to the fifth feature.
Optionally, the apparatus further comprises:
the feature query module is used for acquiring candidate coding features corresponding to a plurality of candidate texts, wherein the candidate coding feature corresponding to each candidate text is obtained by calling the trained feature coding model for coding;
a similarity determining module, configured to determine a similarity between each candidate encoding feature and the sixth encoding feature respectively;
and the text determining module is used for determining the candidate text corresponding to the candidate coding features with the similarity greater than the target threshold as the text similar to the target text.
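A minimal sketch of this retrieval flow follows. Cosine similarity is an assumed concrete choice of similarity measure, and the function name and signature are illustrative; for binary hash codes, a Hamming-distance-based similarity would fit the description equally well.

```python
import torch
import torch.nn.functional as F

def find_similar_texts(candidate_codes, query_code, threshold, texts):
    """Return candidate texts whose coding features are similar to the query.

    candidate_codes: (N, d) tensor of candidate coding features.
    query_code:      (d,) tensor, the sixth coding feature of the target text.
    threshold:       target threshold on the similarity.
    """
    sims = F.cosine_similarity(candidate_codes, query_code.unsqueeze(0), dim=1)
    return [texts[i] for i in range(len(texts)) if sims[i] > threshold]
```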
In another aspect, a computer device is provided, which includes a processor and a memory, in which at least one computer program is stored, the at least one computer program being loaded and executed by the processor to implement the operations performed in the text processing method according to the above aspect.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement the operations performed in the text processing method according to the above aspect.
In another aspect, a computer program product is provided, which comprises a computer program that is loaded and executed by a processor to perform the operations performed in the text processing method according to the above aspect.
According to the method, the apparatus, the computer device, and the storage medium provided by the embodiment of the application, the feature coding model is used to encode the first feature and the second feature respectively, obtaining the first coding feature and the second coding feature. Because the first feature carries the semantics of some of the words in the text and the second feature carries the semantics of every word in the same text, the degree of association between the first feature and the second feature is high; if the feature coding model is sufficiently accurate, the degree of association between the resulting first coding feature and second coding feature is also high. Therefore, training the feature coding model based on the associated features between each first coding feature and the second coding feature can improve the accuracy of the feature coding model, and in turn the accuracy of the coding features obtained by the feature coding model.
Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required for the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings can be obtained from these drawings by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
Fig. 2 is a flowchart of a text processing method according to an embodiment of the present application.
Fig. 3 is a flowchart of a text processing method according to an embodiment of the present application.
Fig. 4 is a flowchart of a model training method according to an embodiment of the present application.
Fig. 5 is a flowchart of a text processing method according to an embodiment of the present application.
Fig. 6 is a schematic diagram of a text processing method according to an embodiment of the present application.
Fig. 7 is a flowchart of a text search method according to an embodiment of the present application.
Fig. 8 is a schematic diagram of a text search method according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of another text processing apparatus according to an embodiment of the present application.
Fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
It will be understood that the terms "first," "second," and the like as used herein may be used herein to describe various concepts, which are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, a first text may be referred to as a second text, and similarly, a second text may be referred to as a first text, without departing from the scope of the present application.
As used herein, "at least one" refers to one or more; for example, at least one text may be any integer number of texts greater than or equal to one, such as one text, two texts, or three texts. "A plurality of" refers to two or more; for example, a plurality of texts may be any integer number of texts greater than or equal to two, such as two texts or three texts. "Each" refers to each one of at least one; for example, each text refers to each text in a plurality of texts, and if the plurality of texts is 3 texts, each text refers to each one of the 3 texts.
Artificial Intelligence (AI) is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline that involves a wide range of fields, covering both hardware-level and software-level technologies. Its basic technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, and intelligent transportation.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The text processing method provided by the embodiment of the present application will be described below based on an artificial intelligence technique and a natural language processing technique.
The text processing method provided by the embodiment of the application can be used in computer equipment. Optionally, the computer device is a terminal or a server. Optionally, the server is an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. Optionally, the terminal is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
In one possible implementation, the computer program according to the embodiments of the present application may be deployed and executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed at multiple sites and interconnected by a communication network; multiple computer devices distributed at multiple sites and interconnected by a communication network can form a blockchain system.
In one possible implementation manner, the computer device for training the feature coding model in the embodiment of the present application is a node in a blockchain system, and the node can store the trained feature coding model in the blockchain, and then the node or nodes corresponding to other devices in the blockchain can encode features of any text based on the feature coding model.
Fig. 1 is a schematic diagram of an implementation environment provided in an embodiment of the present application, and referring to fig. 1, the implementation environment includes: a terminal 101 and a server 102. The terminal 101 and the server 102 are connected via a wireless or wired network. Optionally, the server 102 is configured to train a feature coding model, which is used for coding features of any text, by using the method provided in the embodiment of the present application. The server 102 sends the trained feature coding model to the terminal 101, and the terminal 101 can call the feature coding model to code the feature of any text to obtain the coding feature, wherein the coding feature can be applied to various fields such as text recognition or text retrieval.
In a possible implementation manner, an application client provided by the server runs in the terminal 101, and the server 102 stores the trained feature coding model in the application client, and the application client has a text processing function. The terminal 101 calls a feature coding model based on the application client, codes the features of any text, and obtains coding features.
It should be noted that fig. 1 only illustrates an example in which the server 102 trains the feature coding model and transmits the feature coding model to the terminal 101, and in another embodiment, the terminal 101 may also directly train the feature coding model.
Fig. 2 is a flowchart of a text processing method according to an embodiment of the present application. The execution subject of the embodiment of the application is computer equipment, and referring to fig. 2, the method includes:
201. the computer device obtains a first feature corresponding to each word in the first text and a second feature corresponding to the first text.
The first text may be any type of text. The first feature corresponding to a word represents the semantics of the word in the first text, and the second feature is determined based on the first feature corresponding to each word, so that the second feature can represent the semantics of each word in the first text. The first feature and the second feature are features in a continuous space.
202. And calling the feature coding model by the computer equipment, and coding each first feature and each second feature respectively to obtain a first coding feature corresponding to each first feature and a second coding feature corresponding to each second feature.
The feature coding model is used for coding features of the text. And after the computer equipment obtains each first characteristic and each second characteristic, calling a characteristic coding model, and coding each first characteristic respectively to obtain a first coding characteristic corresponding to each first characteristic. And calling the feature coding model by the computer equipment, and coding the second feature to obtain a second coding feature corresponding to the second feature.
203. The computer device trains a feature coding model based on first associated features between each first coding feature and the second coding feature.
The computer device obtains a first association characteristic between each first coding characteristic and each second coding characteristic, and the first association characteristic corresponding to the first coding characteristic can represent the association degree between the first coding characteristic and the second coding characteristic.
The computer device trains the feature coding model based on the first associated features. Because the first coding feature and the second coding feature are obtained by respectively coding the first feature and the second feature by the feature coding model, and the first feature and the second feature can represent the semantics of words in the same text, the association degree of the first feature and the second feature is higher, the association degree of the first coding feature and the second coding feature is higher, the coding capability of the feature coding model is higher, and the feature coding model is more accurate. Since the first associated feature represents the degree of association between the first coding feature and the second coding feature, the computer device may train the feature coding model using the first associated feature, so that the degree of association between the coding features obtained by the feature coding model is higher and higher, thereby improving the accuracy of the feature coding model.
204. And calling the trained feature coding model by the computer equipment to code the features of any text.
And training the feature coding model by the computer equipment to obtain the trained feature coding model. The computer equipment can call the trained feature coding model to code the features of any text to obtain the corresponding coding features.
It should be noted that, in order to train the feature coding model, the computer device first obtains a plurality of texts as a sample data set, a process of training the feature coding model based on the plurality of texts includes a plurality of iterations, and training is performed based on at least one text in each iteration. The steps 201-204 in the embodiment of the present application are only described by taking the processing of one text in one iteration process as an example.
According to the method provided by the embodiment of the application, the feature coding model is used to encode the first feature and the second feature respectively, obtaining the first coding feature and the second coding feature. Because the first feature carries the semantics of some of the words in the text and the second feature carries the semantics of every word in the same text, the degree of association between the first feature and the second feature is high; if the feature coding model is sufficiently accurate, the degree of association between the resulting first coding feature and second coding feature is also high. Training the feature coding model based on the associated features between each first coding feature and the second coding feature therefore improves the accuracy of the feature coding model, and in turn the accuracy of the coding features it produces.
Fig. 3 is a flowchart of a text processing method according to an embodiment of the present application. The execution subject of the embodiment of the application is computer equipment, and referring to fig. 3, the method includes:
301. the computer device obtains a first feature corresponding to each word in the first text and a second feature corresponding to the first text.
The first feature corresponding to a word represents the semantics of the word in the first text, and the second feature is determined based on the first feature corresponding to each word. Since the first feature can characterize the semantics of a part of words in the first text and the second feature can characterize the semantics of each word in the first text, it can be understood that the first feature is a "local" feature of the first text and the second feature is a "global" feature of the first text. Each word in the first text corresponds to a first feature, the first text corresponds to a second feature, and the first feature and the second feature are features in a continuous space. Optionally, the first feature and the second feature are feature vectors or feature matrices, and the like, which is not limited in this embodiment of the application.
In one possible implementation manner, the computer device acquires a first text, determines a word feature corresponding to each word in the first text, and determines a first feature corresponding to each word based on the word feature corresponding to the word and the word features corresponding to at least one word after it. The word feature corresponding to a word represents the semantics of that word; for example, the word feature is a BERT (Bidirectional Encoder Representations from Transformers) feature, or a word2vec (word-to-vector) feature, and the like.
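As an illustration, per-word features of the BERT kind can be obtained from a pretrained encoder, as in the following sketch. The Hugging Face transformers library and the bert-base-chinese checkpoint are assumptions, not named by the patent, and such tokenizers operate on subword tokens that would need to be aligned to words.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # checkpoint is an assumption
encoder = AutoModel.from_pretrained("bert-base-chinese")

def token_features(text: str) -> torch.Tensor:
    """Return one feature vector per token of the input text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # shape (1, T, 768)
    return hidden.squeeze(0)                          # shape (T, 768)
```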
The words in the first text are arranged according to the word order, and the arranged words form a complete sentence. Since the word features corresponding to a word can only represent the semantics of a single word, it cannot represent the semantics between contexts in the text. Therefore, for each word, the computer device determines the word feature corresponding to the word and the word feature corresponding to the word after the word to determine the first feature corresponding to the word, and the semantics of the word itself and the semantics around the word are fused in the first feature, so that the first feature can represent the context relationship between the word and the surrounding words, and the information content included in the first feature is improved.
Optionally, the computer device determines a plurality of target numbers, the plurality of target numbers being different and less than the number of words in the first text. For each target quantity, determining a word group corresponding to the word, and determining a first sub-characteristic corresponding to the word based on the word characteristic corresponding to each word in the word group, wherein the word group comprises the word and a subsequent word of the word, the subsequent word of the word refers to the word after the word, and the quantity of the subsequent word in the word group is not greater than the target quantity. And the computer equipment fuses the plurality of first sub-characteristics corresponding to the words to obtain the first characteristics corresponding to the words. For example, the computer device splices the plurality of first sub-features to obtain the first feature.
In order to further improve the information content of the first feature, the computer device may merge a plurality of first sub-features corresponding to the words into the first feature. In order to extract the first sub-features with different granularities, different first sub-features can be determined according to the word features corresponding to different numbers of words. Therefore, the computer device first obtains a plurality of different target numbers, which are optionally preset by the computer device.
For each target quantity, the computer device determines at least one subsequent word after the word and not greater than the target quantity, constructs the word and the determined subsequent word into a word group corresponding to the word, determines word characteristics corresponding to the word in the word group and word characteristics corresponding to the subsequent word, and determines a first sub-characteristic corresponding to the word based on the word characteristics corresponding to each word in the word group. The computer device sequentially traverses each target number in the plurality of target numbers according to the above manner, so as to obtain a first sub-feature corresponding to each target number, that is, a plurality of first sub-features corresponding to the word.
Optionally, for each target quantity, the computer device determines a word group to which the word corresponds, including the following two cases: and under the condition that the total number of the subsequent words of the word is not less than the target number, determining the word and the target number of words after the word as a word group corresponding to the word. And in the case that the total number of the subsequent words of the word is less than the target number, determining the word and each word after the word as the word group corresponding to the word.
The computer device adds the word to the word group, acquires the first word after it and adds that word to the word group, then acquires the second word after it and adds that word to the word group, and so on, until the target number of words after the word have been added. In other words, when the total number of subsequent words of the word is not less than the target number, the computer device can acquire the target number of subsequent words in order and add them to the word group. If the computer device has already acquired the last word in the first text and added it to the word group without reaching the target number, that is, when the total number of subsequent words of the word is less than the target number, the computer device cannot acquire the target number of subsequent words, so it adds each acquired subsequent word to the word group.
Or the computer device may determine the total number of subsequent words of the word first, and determine whether the total number of subsequent words is smaller than the target number, if the total number is not smaller than the target number, directly acquire the target number of subsequent words after the word, and determine the word and the acquired target number of subsequent words as the word group corresponding to the word. And if the total number is less than the target number, directly acquiring each subsequent word after the word, and determining the word and each acquired subsequent word as a word group corresponding to the word.
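Both cases above reduce to simple truncation, as in this sketch (the function name and signature are illustrative):

```python
def word_group(words, i, target_num):
    # The group contains the word at position i plus at most target_num
    # subsequent words; Python slicing truncates automatically when fewer
    # than target_num words remain, which covers both cases above.
    return words[i : i + 1 + target_num]

# Example: target number 4, but only one subsequent word remains.
assert word_group(["a", "b", "c"], 1, 4) == ["b", "c"]
```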
For example, the plurality of target numbers are 0, 2, and 4, respectively. The computer device determines a first sub-feature corresponding to the word based on the word feature corresponding to the word, determines a first sub-feature corresponding to the word based on the word feature corresponding to the word and the word features corresponding to 2 words after the word, determines a first sub-feature corresponding to the word based on the word features corresponding to the word and the word features corresponding to 4 words after the word, and finally obtains 3 first sub-features corresponding to the word.
Optionally, the computer device determines a first sub-feature corresponding to the word based on the word feature corresponding to each word in the word group, including: and performing convolution processing on the word characteristics corresponding to each word in the word group to obtain convolution characteristics corresponding to each word, and determining the sum of the convolution characteristics as a first sub-characteristic corresponding to the word. Optionally, the computer device processes the sum of the convolution features using an activation function, and determines the processed feature as the first sub-feature. For example, the activation function is a ReLU (Rectified Linear Unit).
The computer equipment can perform convolution processing on the word characteristics corresponding to the first word in the word group to obtain first convolution characteristics, and the computer equipment continues to perform convolution processing on the word characteristics corresponding to the second word in the word group to obtain second convolution characteristics until the computer equipment obtains the convolution characteristics corresponding to the last word in the word group, so that the convolution characteristics corresponding to each word in the word group are obtained. Or the computer device performs convolution processing on the word characteristics corresponding to each word in the word group in parallel to obtain the convolution characteristics corresponding to each word.
And if the convolution features corresponding to the words are feature vectors, the computer equipment adds the convolution features to obtain the sum of the convolution features, and determines the sum of the convolution features as the first sub-feature corresponding to the words.
In another possible implementation manner, the computer device acquiring a second feature corresponding to the first text includes: and the computer equipment performs mean value pooling on the first features corresponding to the words in the first text to obtain second features. Or the computer equipment determines the median of the first characteristic corresponding to a plurality of words in the first text as the second characteristic. In addition, the computer device may determine the second feature in other manners, for example, concatenate the first features corresponding to each word to obtain the second feature corresponding to the first text.
For ease of understanding, the following formulas are used to illustrate the process by which the computer device obtains the first feature and the second feature. First, the features of the first text are represented as $X = \{e_1, e_2, \ldots, e_T\}$, where $e_i$ denotes the word feature corresponding to the $i$-th word in the first text, $T$ denotes the number of words in the first text, and $i$ is a positive integer not greater than $T$. The computer device inputs $e_i$ into a convolutional neural network whose convolution kernel is denoted as $W$, where $K$ is the number of convolution kernels and $n$ denotes the convolution step size. In order to capture features with different granularities, convolution is performed with different convolution step sizes; the convolution step size can also be understood as the sliding window of the convolution.

The computer device uses equation (1) to determine the first sub-feature corresponding to each convolution step size:

$$h_i^{(n)} = \mathrm{ReLU}(W * e_{i:i+n}) \tag{1}$$

where $h_i^{(n)}$ denotes the first sub-feature corresponding to the $i$-th word, $n$ denotes the convolution step size, which is also the target number, $*$ denotes the convolution operation, and $e_{i:i+n}$ denotes the $i$-th word together with the $n$ words after it; $W * e_{i:i+n}$ means that the convolution kernel $W$ is used to perform convolution processing on the $i$-th word and the $n$ words after it. $\mathrm{ReLU}(\cdot)$ denotes the activation function. The set of first sub-features is $h = \{h_1, h_2, \ldots, h_T\}$.

The computer device uses equation (2) to determine the first feature $v_i$ corresponding to the $i$-th word:

$$v_i = \mathrm{MLP}\big(\mathrm{CONCAT}(\{h_i^{(n)} \mid n \in N\})\big) \tag{2}$$

where $N$ denotes the set of convolution step sizes, $\mathrm{CONCAT}(\cdot)$ denotes the concatenation function, and $\mathrm{MLP}(\cdot)$ denotes a multi-layer perceptron.

The computer device uses equation (3) to determine the second feature corresponding to the first text:

$$H = \mathrm{READOUT}(v_1, v_2, \ldots, v_T) \tag{3}$$

where $H$ denotes the second feature, and the READOUT function processes a plurality of parameters to obtain one parameter, for example by performing mean pooling over the time dimension.
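As a concrete illustration, the following is a minimal PyTorch sketch of equations (1)-(3). It is a sketch under stated assumptions rather than the patent's implementation: the feature dimension and the set of step sizes (0, 2, 4, matching the earlier example) are illustrative, and zero-padding at the end of the sequence stands in for the shorter word groups formed near the end of the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalFeatures(nn.Module):
    """Multi-granularity convolution over word features (eq. 1), per-word
    concatenation + MLP (eq. 2), and mean-pool readout (eq. 3)."""

    def __init__(self, dim=768, step_sizes=(0, 2, 4)):  # illustrative values
        super().__init__()
        self.step_sizes = step_sizes
        # One convolution per step size n; its kernel covers a word and its n successors.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=n + 1) for n in step_sizes
        )
        self.mlp = nn.Sequential(
            nn.Linear(dim * len(step_sizes), dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):              # x: (batch, T, dim) word features e_1..e_T
        x = x.transpose(1, 2)          # Conv1d expects (batch, dim, T)
        subs = []
        for n, conv in zip(self.step_sizes, self.convs):
            padded = F.pad(x, (0, n))                  # zero-pad the sequence end
            subs.append(torch.relu(conv(padded)))      # eq. (1): h_i^(n), (batch, dim, T)
        v = self.mlp(torch.cat(subs, dim=1).transpose(1, 2))  # eq. (2): first features
        H = v.mean(dim=1)                                     # eq. (3): mean-pool readout
        return v, H                    # (batch, T, dim) and (batch, dim)
```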
302. And calling the feature coding model by the computer equipment, and coding each first feature and each second feature respectively to obtain a first coding feature corresponding to each first feature and a second coding feature corresponding to each second feature.
The feature coding model is used for coding features of the text. And after the computer equipment obtains each first characteristic and each second characteristic, calling a characteristic coding model, and coding each first characteristic respectively to obtain a first coding characteristic corresponding to each first characteristic. And calling the feature coding model by the computer equipment, and coding the second feature to obtain a second coding feature corresponding to the second feature.
The first coding features can represent the semantics of one word and at least one word after the word, and the second coding features can represent the semantics of each word in the first text, so that the feature coding model is trained by using the first coding features and the second coding features subsequently, the dependency relationship between the word sequence and the context in the text is considered, the feature coding model can learn the context information between the words in the text, and the accuracy of feature coding is improved.
In a possible implementation manner, the first feature and the second feature obtained in step 301 are formed by feature values, for example, feature vectors or feature matrices. The first feature and the second feature are features in a continuous space, or the feature values in the first feature and the second feature are values in a continuous space, which is a space consisting of a range of values having continuity, for example, the continuous space is 0 to 1. The computer device maps the first text to at least one eigenvalue in the numerical range according to the semantics of the first text, so as to obtain the features in the continuous space. The encoding in the embodiment of the present application refers to mapping features in a continuous space to features in a discrete space, so that the first encoding feature and the second encoding feature are features in a discrete space, or feature values in the first encoding feature and the second encoding feature are values in a discrete space, the discrete space refers to a space formed by discrete values, for example, the discrete space is 0 and 1, and the computer device maps the feature values in the first feature and the second feature to values in the discrete space, so as to obtain features in the discrete space.
Since the first feature and the second feature are features in a continuous space, if the first feature and the second feature are directly applied to the fields of text recognition or text search, the processing overhead is relatively high, and therefore the first feature and the second feature are encoded in the embodiment of the application to obtain the first encoding feature and the second encoding feature in the discrete space. By converting the features of the continuous space into the features of the discrete space, the storage cost and the processing speed of the data can be reduced.
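If the discrete space is, for example, {0, 1}, the continuous-to-discrete mapping can be as simple as thresholding. The following is only a hedged sketch: the patent does not specify the binarization rule, so the threshold-at-zero choice here is an assumption.

```python
import torch

def to_discrete(features: torch.Tensor) -> torch.Tensor:
    # Map features in a continuous space (e.g. values in [0, 1] or all reals)
    # to the discrete space {0, 1} by thresholding at zero.
    # The threshold rule is an assumption, not specified by the patent.
    return (features > 0).to(torch.int8)
```

A hard threshold is not differentiable, so training would typically use a smooth relaxation (for example tanh) and binarize only at inference time; the patent text does not detail this step.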
In a possible implementation manner, the feature coding model is a Transformer model, an LSTM (Long Short-Term Memory) model, a convolutional neural network (CNN) model for text, and the like; the type of the feature coding model is not limited in the embodiment of the present application.
In a possible implementation manner, the feature coding model is used for performing hash coding, and the first coding feature and the second coding feature obtained by the feature coding model are hash features.
303. The computer device determines a first association feature between each first coding feature and the second coding feature.
After obtaining the first coding features corresponding to each word and the second coding features corresponding to the first text, the computer device determines first association features between each first coding feature and the second coding features, wherein the first association features corresponding to the first coding features can represent association degrees between the first coding features and the second coding features. Optionally, the first correlation characteristic is a characteristic in a numerical form, the greater the first correlation characteristic is, the greater the degree of correlation between the first coding characteristic and the second coding characteristic is, and the smaller the first correlation characteristic is, the smaller the degree of correlation between the first coding characteristic and the second coding characteristic is.
In a possible implementation, the first associated feature is the mutual information between the first coding feature and the second coding feature. Mutual information is an information measure in information theory; it can be regarded as the amount of information one random variable contains about another random variable, or the reduction in the uncertainty of one random variable due to another random variable being known.
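For reference, the standard definition of the mutual information between two discrete random variables $X$ and $Y$ is:

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$

The patent does not give a formula for how the mutual information is estimated here; in practice it is typically approximated, for example by a trained discriminator as described below.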
In one possible implementation manner, the computer device invokes the discrimination model to discriminate the first coding feature and the second coding feature to obtain a discrimination result, where the discrimination result indicates a possibility that a word corresponding to the first coding feature belongs to a text corresponding to the second coding feature. The computer equipment determines the judgment result as a first associated characteristic corresponding to the first coding characteristic.
In the embodiment of the present application, the first coding feature can represent semantics of a part of words in the first text, and the second coding feature can represent semantics of each word in the first text, it may be understood that the first coding feature is a "local coding feature" corresponding to the first text, and the second coding feature is a "global coding feature" corresponding to the first text, and a degree of association between the first coding feature and the second coding feature may be understood as a possibility that the first coding feature and the second coding feature correspond to the same text, that is, a possibility that a word corresponding to the first coding feature belongs to a text corresponding to the second coding feature. The higher the degree of association between the first coding feature and the second coding feature, the higher the probability that the first coding feature and the second coding feature correspond to the same text, and the lower the degree of association between the first coding feature and the second coding feature, the lower the probability that the first coding feature and the second coding feature correspond to the same text. Therefore, the computer device may determine the discrimination result of the first coding feature and the second coding feature as a first association feature between the first coding feature and the second coding feature.
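One common way to realize such a discrimination model is a bilinear scorer followed by a sigmoid, as in the sketch below. The bilinear form is an assumption for illustration; the patent only states that the discrimination model outputs the likelihood that the word corresponding to the first coding feature belongs to the text corresponding to the second coding feature.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores how likely the word behind a local (first) coding feature
    belongs to the text behind a global (second) coding feature."""

    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, local_code, global_code):
        # Output in (0, 1): the discrimination result used as the associated feature.
        return torch.sigmoid(self.bilinear(local_code, global_code))
```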
304. The computer device obtains a third feature corresponding to a word in the second text.
Wherein the second text is different from the first text, and the third feature corresponding to the word in the second text represents the semantic meaning of the word in the second text. This third feature, which is the same as the first feature in the above-described step, can be understood as a "local feature" of the second text.
305. And calling the feature coding model by the computer equipment, and coding the third feature to obtain a third coding feature corresponding to the third feature.
306. The computer device determines a second association feature between the third encoding feature and the second encoding feature.
Wherein the second correlation characteristic represents a degree of correlation between the third encoding characteristic and the second encoding characteristic.
The process of determining the second associated feature in steps 304-306 is similar to the process of determining the first associated feature in steps 301-303 described above, and is not repeated here.
307. The computer device determines a first text feature corresponding to the first text.
The computer equipment extracts the features of the first text to obtain a first text feature corresponding to the first text, wherein the first text feature represents the semantics of the first text.
The first feature, the second feature and the third feature are obtained based on the semantics of the words, the first feature, the second feature and the third feature are features of word dimensions, the first text feature is obtained based on the semantics of the text, and the first text feature is a feature of text dimensions.
308. And the computer equipment calls the feature coding model to code the first text feature to obtain a fourth coding feature.
309. The computer device determines a third correlation feature between the fourth encoding feature and the second encoding feature.
Wherein the third associated feature represents a degree of association between the fourth coding feature and the second coding feature.
The process of determining the third associated feature in steps 308-309 is similar to the process of determining the first associated feature in steps 301-303 described above, and is not repeated here.
310. The computer device determines a second text feature corresponding to the second text.
And the computer equipment extracts the features of the second text to obtain second text features corresponding to the second text, wherein the second text features represent the semantics of the second text. The second text feature may be understood as a text-level feature of the second text, in the same way as the first text feature in the above step.
311. And the computer equipment calls the feature coding model to code the second text feature to obtain a fifth coding feature.
312. The computer device determines a fourth correlation feature between the fifth encoding feature and the second encoding feature.
Wherein the fourth correlation characteristic represents a degree of correlation between the fifth encoding characteristic and the second encoding characteristic.
The process of determining the fourth associated feature in steps 311-312 is similar to the process of determining the first associated feature in steps 301-303 described above, and is not repeated here.
313. The computer device determines a loss value based on each of the first associated feature, the second associated feature, the third associated feature, and the fourth associated feature, and trains the feature coding model based on the loss value.
The loss value is negatively correlated with the first associated feature and the third associated feature, and positively correlated with the second associated feature and the fourth associated feature. The computer device trains the feature coding model based on the loss value so that the loss value gradually converges. That is, the larger the first and third associated features are, the smaller the loss value is, and the smaller they are, the larger the loss value is; the larger the second and fourth associated features are, the larger the loss value is, and the smaller they are, the smaller the loss value is. The smaller the loss value, the more accurate the feature coding model; the larger the loss value, the less accurate the feature coding model.
Because the first coding feature and the second coding feature are obtained by respectively coding the first feature and the second feature by the feature coding model, a word corresponding to the first feature belongs to a text corresponding to the second feature, and the association degree of the first feature and the second feature is higher, the higher the association degree of the first coding feature and the second coding feature is, the higher the coding capability of the feature coding model is, namely, the more accurate the feature coding model is. Since the first associated feature represents a degree of association between the first coding feature and the second coding feature, the computer device may train the feature coding model using the first associated feature as a positive sample of a word dimension to make the first associated feature larger and larger, thereby improving accuracy of the feature coding model.
The third coding feature and the second coding feature are obtained by the feature coding model coding the third feature and the second feature respectively. Since the word corresponding to the third feature does not belong to the text corresponding to the second feature, the third feature and the second feature are weakly associated; therefore, the weaker the association between the third coding feature and the second coding feature, the stronger the coding capability of the feature coding model, and the more accurate the model. Since the second associated feature represents the degree of association between the third coding feature and the second coding feature, the computer device can train the feature coding model using the second associated feature as a negative sample of the word dimension, so that the second associated feature becomes smaller and smaller, thereby improving the accuracy of the feature coding model.
The first associated feature and the second associated feature belong to associated features of word dimensions, and in order to enable the feature coding model to learn richer semantic information, the computer device further trains the feature coding model by using a third associated feature and a fourth associated feature of text dimensions.
The fourth coding feature and the second coding feature are obtained by the feature coding model coding the first text feature and the second feature respectively. Since the text corresponding to the first text feature and the text corresponding to the second feature are the same text, the first text feature and the second feature are strongly associated; therefore, the stronger the association between the fourth coding feature and the second coding feature, the stronger the coding capability of the feature coding model, and the more accurate the model. Since the third associated feature represents the degree of association between the fourth coding feature and the second coding feature, the computer device can train the feature coding model using the third associated feature as a positive sample of the text dimension, so that the third associated feature becomes larger and larger, thereby improving the accuracy of the feature coding model.
The fifth coding feature and the second coding feature are obtained by the feature coding model coding the second text feature and the second feature respectively. Since the text corresponding to the second text feature and the text corresponding to the second feature are not the same text, the second text feature and the second feature are weakly associated; therefore, the weaker the association between the fifth coding feature and the second coding feature, the stronger the coding capability of the feature coding model, and the more accurate the model. Since the fourth associated feature represents the degree of association between the fifth coding feature and the second coding feature, the computer device can train the feature coding model using the fourth associated feature as a negative sample of the text dimension, so that the fourth associated feature becomes smaller and smaller, thereby improving the accuracy of the feature coding model.
In the related technology, the word features corresponding to the words in a text are used directly to train the feature coding model. However, the feature coding task focuses more on the category information in the coding features, that is, texts of the same category are expected to have similar coding features; the coding features obtained with the related technology therefore contain a large amount of redundant information, and the performance of the feature coding model is low. In the embodiment of the application, the degree of association between the local features (the first coding features) and the global feature (the second coding feature) is maximized, so that the coded global feature keeps as much semantic information related to the local features as possible while discarding the redundant detail inside each local feature, which improves the coding quality of the feature coding model.
In one possible implementation, the first text includes words at a plurality of positions. The computer device determines a loss component corresponding to each position based on the first associated feature and the second associated feature corresponding to that position; the loss component is positively correlated with the first associated feature and negatively correlated with the second associated feature. The first associated feature corresponding to a position is the first associated feature corresponding to the word at that position in the first text, and the second associated feature corresponding to a position is the second associated feature corresponding to the word at that position in the second text. The loss components corresponding to the positions are fused to obtain the loss value, which is negatively correlated with the loss components.

Both the first text and the second text include words at a plurality of positions. For each position, the computer device obtains the first associated feature corresponding to the word at that position in the first text and the second associated feature corresponding to the word at that position in the second text; the first associated feature is the positive sample corresponding to the position and the second associated feature is the negative sample. Based on the positive and negative samples, the computer device determines the loss component for the position: the larger the first associated feature, the larger the loss component, and the larger the second associated feature, the smaller the loss component. The loss component can be understood as the accuracy of the coding features corresponding to the position; the larger the loss component, the more accurate the feature coding model, and the smaller the loss component, the less accurate the model.

Since the first text corresponds to a plurality of positions and each position corresponds to one loss component, a plurality of loss components is obtained. The computer device fuses these loss components to obtain the loss value; for example, it adds the loss components and then negates the sum. The loss value is negatively correlated with the loss components, that is, the larger the loss components, the smaller the loss value, and the smaller the loss components, the larger the loss value.
In another possible implementation manner, as described in step 303 above, the computer device invokes the discriminant model to obtain the first associated feature, the second associated feature, the third associated feature, and the fourth associated feature, and then the computer device trains the feature coding model and the discriminant model based on the loss value. That is, the computer device trains the feature coding model based on the loss value, and also trains the discriminant model based on the loss value, so as to improve the accuracy of the discriminant model.
In another possible implementation, the feature coding model is used for hash coding. The first feature, the second feature, the third feature, the first text feature, and the second text feature in the above steps are features in a continuous space, while the first through fifth coding features are hash features in a discrete space. The hash features obey a Bernoulli distribution; taking the first coding feature and the second coding feature as an example, they obey the following distributions:

b_i ~ Bernoulli(σ(h_i));

B ~ Bernoulli(σ(H));

where b_i denotes the first coding feature corresponding to the i-th word, B denotes the second coding feature, h_i denotes the first feature, H denotes the second feature, σ(·) is an activation function, and Bernoulli(·) denotes the Bernoulli distribution. In the embodiment of the application, a discrete gradient estimation algorithm may be used to train the feature coding model, and the hash features output by the trained feature coding model obey the above distributions.
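Purely as an illustration (not part of the patent embodiment), the following Python sketch shows one way such Bernoulli-distributed hash codes can be sampled while keeping the model trainable; the straight-through estimator used here is one example of a discrete gradient estimation algorithm, and the dimensions and function name are assumptions:

import torch

def bernoulli_hash(h: torch.Tensor) -> torch.Tensor:
    # p = sigma(h): Bernoulli probabilities for each bit of the hash code
    p = torch.sigmoid(h)
    # Discrete sample b ~ Bernoulli(p), with values in {0, 1}
    b_hard = torch.bernoulli(p)
    # Straight-through estimator: the forward pass emits the hard 0/1 code,
    # while the backward pass routes gradients through the probabilities p
    return b_hard + p - p.detach()

# Example: hash the first features of a 10-word text (feature dim 128 assumed)
b = bernoulli_hash(torch.randn(10, 128))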
In the embodiment of the present application, the feature coding model is trained by maximizing the mutual information between each first coding feature and the second coding feature, so the training objective of the feature coding model may be expressed by the following formula (4):

θ = argmax_θ (1/T) · Σ_{i=1}^{T} I(b_i; B)    (4)

where θ denotes the parameters of the feature coding model, b_i denotes the first coding feature corresponding to the i-th word, B denotes the second coding feature, and I(b_i; B) denotes the mutual information between the first coding feature and the second coding feature; this mutual information represents the degree of association between the two coding features and is the first associated feature in the above steps. T denotes the number of words in the first text, and formula (4) solves for the value of the parameter θ at which the average mutual information reaches its maximum.
However, since the true distribution of the hash features is unknown, I(b_i; B) in the above formula (4) cannot be computed directly. Therefore, the embodiment of the application estimates the mutual information between the first coding feature and the second coding feature in another way, that is, it uses another quantity to represent the degree of association between the first coding feature and the second coding feature.
For ease of understanding, the following formulas illustrate the process by which the computer device determines the loss value used to train the feature coding model. In one possible implementation, the computer device uses the following formula (5) to determine the loss component of the word dimension for the i-th word:

Î_i = E_P[ -softplus( -D_φ(b_i, B) ) ] - E_P̃[ softplus( D_φ(b̃_i, B) ) ]    (5)

where Î_i denotes the loss component corresponding to the i-th position, b_i denotes the first coding feature corresponding to the i-th word in the first text, B denotes the second coding feature corresponding to the first text, and b̃_i denotes the third coding feature corresponding to the i-th word in the second text; b_i is the positive sample and b̃_i is the negative sample. softplus is the activation function defined as softplus(x) = log(1 + e^x), E_P[·] denotes the mathematical expectation, and D_φ denotes the discriminant model, defined as D_φ(b_i, B) = σ(b_i^T W B + b), where φ = {W, b} are the parameters of the discriminant model and σ(·) is an activation function.
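To make formula (5) concrete, the following sketch (an illustrative assumption, not the patent's reference implementation) implements the bilinear discriminant model D_φ(b_i, B) = σ(b_i^T W B + b) and the word-dimension mutual-information estimate; the tensor shapes and the approximation of the expectations by averaging over positions are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    # Bilinear discriminant model D_phi(codes, B) = sigma(codes^T W B + b)
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, codes: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        # codes: (T, dim) word-level hash codes; B: (dim,) text-level code
        return torch.sigmoid(codes @ self.W @ B + self.b)

def mi_estimate(D: Discriminator, pos: torch.Tensor, neg: torch.Tensor,
                B: torch.Tensor) -> torch.Tensor:
    # Formula (5): large when positive pairs score high and negative pairs
    # score low; the expectations are approximated by per-position averages
    pos_term = -F.softplus(-D(pos, B)).mean()
    neg_term = F.softplus(D(neg, B)).mean()
    return pos_term - neg_term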
However, formula (5) only involves coding features of the word dimension, so its loss component covers only the word dimension. To enable the feature coding model to learn richer semantic information, the embodiment of the application also considers a loss component of the text dimension. In one possible implementation, the computer device uses the following formula (6) to determine the loss component of the text dimension corresponding to the first text:

Î_text = E_P[ -softplus( -D_φ(E, B) ) ] - E_P̃[ softplus( D_φ(Ẽ, B) ) ]    (6)

where Î_text denotes the loss component corresponding to the first text, E denotes the fourth coding feature corresponding to the first text, B denotes the second coding feature corresponding to the first text, and Ẽ denotes the fifth coding feature corresponding to the second text; E is the positive sample and Ẽ is the negative sample. softplus is the activation function defined as softplus(x) = log(1 + e^x), E_P[·] denotes the mathematical expectation, and D_φ denotes the discriminant model, defined as D_φ(E, B) = σ(E^T W B + b), where φ = {W, b} are the parameters of the discriminant model and σ(·) is an activation function.
Formula (5) above provides the loss component of the word dimension and formula (6) provides the loss component of the text dimension; the computer device then determines the loss value using the following formula (7):

L(θ, φ) = -( (1/T) · Σ_{i=1}^{T} Î_i + β · Î_text )    (7)

where L(θ, φ) denotes the loss value, θ denotes the parameters of the feature coding model, φ denotes the parameters of the discriminant model, Î_i denotes the loss component corresponding to the i-th position, T denotes the number of positions corresponding to the first text, β is a weight parameter, and Î_text denotes the loss component corresponding to the first text.
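Combining the two dimensions as in formula (7), a minimal sketch of the loss computation might be as follows; it reuses mi_estimate from the sketch above, and the default value of beta is an assumption:

def total_loss(D, b_words, b_words_neg, E_text, E_text_neg, B, beta=1.0):
    # Word-dimension term: formula (5), averaged over the T positions
    word_term = mi_estimate(D, b_words, b_words_neg, B)
    # Text-dimension term: formula (6) has the same form as formula (5),
    # with the fourth/fifth coding features in place of word-level codes
    text_term = mi_estimate(D, E_text.unsqueeze(0), E_text_neg.unsqueeze(0), B)
    # Formula (7): negate so that minimizing the loss maximizes both terms
    return -(word_term + beta * text_term)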
Fig. 4 is a flowchart of a model training method provided in an embodiment of the present application. As shown in fig. 4, the first text includes word A1, word A2, and word A3, and the second text includes word B1, word B2, and word B3. The computer device calls the feature coding model to obtain first coding feature 1, first coding feature 2, and first coding feature 3 corresponding to the words in the first text, as well as the second coding feature of the word dimension and the fourth coding feature of the text dimension corresponding to the first text. It also calls the feature coding model to obtain third coding feature 4, third coding feature 5, and third coding feature 6 corresponding to the words in the second text, as well as the sixth coding feature of the word dimension and the fifth coding feature of the text dimension corresponding to the second text.
The computer device treats the first text as the positive sample and the second text as the negative sample: each first coding feature, each third coding feature, the fourth coding feature, and the fifth coding feature are paired with the second coding feature as input pairs and fed into the discriminant model for discrimination, yielding the mutual information between each of these coding features and the second coding feature; the feature coding model and the discriminant model are trained by maximizing the mutual information of the positive samples and minimizing that of the negative samples.

The computer device then treats the second text as the positive sample and the first text as the negative sample: each first coding feature, each third coding feature, the fourth coding feature, and the fifth coding feature are paired with the sixth coding feature as input pairs and fed into the discriminant model for discrimination, yielding the mutual information between each of these coding features and the sixth coding feature; the feature coding model and the discriminant model are again trained by maximizing the mutual information of the positive samples and minimizing that of the negative samples.
The goal of the training process is to enable the discriminant model to produce correct discrimination results from the coding features corresponding to the first text and the second text. As shown in fig. 4, "+" indicates that a word belongs to the text and "-" indicates that it does not; the correct discrimination results are that words A1, A2, and A3 belong to the first text, words B1, B2, and B3 do not belong to the first text, words A1, A2, and A3 do not belong to the second text, and words B1, B2, and B3 belong to the second text.
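The symmetric procedure of fig. 4 can be illustrated with a toy training step. Everything below is an assumption made for illustration (random stand-in features, a linear encoder in place of the feature coding model, and the helper functions from the sketches above), not the patent's implementation:

import torch
import torch.nn as nn

dim = 128
encoder = nn.Linear(dim, dim)   # stand-in for the feature coding model
D = Discriminator(dim)
opt = torch.optim.Adam(list(encoder.parameters()) + list(D.parameters()), lr=1e-3)

def encode(x: torch.Tensor) -> torch.Tensor:
    return bernoulli_hash(encoder(x))   # continuous feature -> hash code

for step in range(100):
    h1, h2 = torch.randn(8, dim), torch.randn(8, dim)  # word features, texts 1 and 2
    t1, t2 = torch.randn(dim), torch.randn(dim)        # text-level features
    b1, b2 = encode(h1), encode(h2)                    # first / third coding features
    B1, B2 = encode(h1.mean(0)), encode(h2.mean(0))    # second / sixth coding features
    E1, E2 = encode(t1), encode(t2)                    # fourth / fifth coding features
    # First text as positive sample (anchor B1), then the roles swapped (anchor B2)
    loss = total_loss(D, b1, b2, E1, E2, B1) + total_loss(D, b2, b1, E2, E1, B2)
    opt.zero_grad()
    loss.backward()
    opt.step()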
It should be noted that the embodiment of the present application describes training the model based on the first, second, third, and fourth associated features only as an example. In another embodiment, the computer device may skip steps 304-306: it determines a loss value based on each first associated feature, the third associated feature, and the fourth associated feature, and trains the feature coding model based on that loss value, which is negatively correlated with the first and third associated features and positively correlated with the fourth associated feature.

Or, in another embodiment, the computer device may skip steps 307-312; that is, it only needs to determine the first and second associated features, determine a loss value based on them, and train the feature coding model based on that loss value, which is negatively correlated with the first associated feature and positively correlated with the second associated feature.

Or, in another embodiment, the computer device may skip steps 304-306 and steps 310-312; that is, it only needs to determine the first and third associated features, determine a loss value based on them, and train the feature coding model based on that loss value, which is negatively correlated with the first and third associated features.

Or, in another embodiment, the computer device may skip steps 304-312; that is, it only needs to determine the first associated features and train the feature coding model based on the first associated feature between each first coding feature and the second coding feature. In one possible implementation, the computer device determines a loss value based on the first associated feature corresponding to each first coding feature and trains the feature coding model based on that loss value, which is negatively correlated with the first associated features.
It should be noted that, in order to train the feature coding model, the computer device first obtains a plurality of texts as a sample data set. Training the feature coding model on these texts involves a plurality of iterations, each of which uses two different texts. Steps 301-313 of the embodiment of the present application are described by taking the first text and the second text processed in one iteration as an example.
314. The computer device calls the trained feature coding model to code the features of any text.

After training the feature coding model, the computer device obtains the trained feature coding model and can call it to code the features of any text to obtain the corresponding coding features. This process is described in detail in the embodiment of fig. 5 below and is not repeated here.
In the method provided by the embodiment of the application, the feature coding model performs feature coding on the first features and the second feature respectively to obtain the first coding features and the second coding feature. Because a first feature contains the semantics of part of the words in the text while the second feature contains the semantics of every word in the same text, the degree of association between the first features and the second feature is high; if the feature coding model is sufficiently accurate, the degree of association between the resulting first coding features and second coding feature is also high. Training the feature coding model based on the associated features between each first coding feature and the second coding feature therefore improves the accuracy of the feature coding model and, in turn, the accuracy of the coding features it produces.
Moreover, the first feature corresponding to a word is determined from the word feature of the word itself and the word features of the words after it, so the first feature fuses the semantics of the word with the semantics around it. The first feature can therefore represent the context between the word and its surrounding words, which increases the amount of information it contains.

The first feature and the second feature in the continuous space are coded into the first coding feature and the second coding feature in the discrete space. Converting features of a continuous space into features of a discrete space reduces the storage cost and speeds up data processing.

The first and second associated features belong to the word dimension, while the third and fourth associated features belong to the sentence dimension; training the feature coding model with all four associated features enables it to learn richer semantic information, improving its coding capability and accuracy.

The first and third associated features serve as positive samples, while the second and fourth associated features serve as negative samples; training the feature coding model with all four associated features enables it to learn from both the positive-sample and negative-sample perspectives, further improving its coding capability and accuracy.
Fig. 5 is a flowchart of a text processing method according to an embodiment of the present application. The feature coding model in this embodiment is obtained by training with the method of the embodiment of fig. 3, and the execution subject is a computer device. Referring to fig. 5, the method includes:
501. The computer device determines the word feature corresponding to each word in the target text.
Wherein, the target text is any text needing to be processed.
502. The computer device determines a fourth feature corresponding to each word based on the word feature corresponding to each word in the target text and the word feature corresponding to at least one word after each word, respectively.
503. The computer device determines a fifth feature corresponding to the target text based on the fourth feature corresponding to each word in the target text.

504. The computer device calls the trained feature coding model to code the fifth feature to obtain a sixth coding feature corresponding to the fifth feature.
The process of determining the sixth coding feature in steps 501-504 is the same as the process of determining the second coding feature in steps 301-302 and is not repeated here.
Fig. 6 is a schematic diagram of a text processing method provided in an embodiment of the present application. As shown in fig. 6, the computer device performs convolution processing on the word features corresponding to the words in the target text 601 to obtain the fourth feature 602 corresponding to each word, and fuses the fourth features 602 to obtain the fifth feature 603 corresponding to the target text 601. The computer device inputs the fifth feature 603 into the feature coding model, which processes it with an activation function and a sampling function and outputs the sixth coding feature 604. The sixth coding feature 604 is a hash feature; as shown in fig. 6, it is a feature vector consisting of 0s and 1s.
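The pipeline of steps 501-504 might be sketched as follows; the kernel size, zero padding, and mean pooling are assumptions chosen for a runnable example (the embodiment elsewhere also allows, for instance, a median instead of the mean):

import torch
import torch.nn.functional as F

def encode_target_text(word_feats: torch.Tensor, conv_w: torch.Tensor) -> torch.Tensor:
    # word_feats: (T, dim). Convolve each word with the word after it
    # (kernel size 2, right padding keeps the length T): fourth features
    x = word_feats.t().unsqueeze(0)                 # (1, dim, T)
    fourth = F.conv1d(F.pad(x, (0, 1)), conv_w)     # (1, dim, T)
    # Mean-pool the fourth features into the fifth feature of the text
    fifth = fourth.mean(dim=2).squeeze(0)           # (dim,)
    # Activation + sampling: the sixth coding feature, a vector of 0s and 1s
    return torch.bernoulli(torch.sigmoid(fifth))

# conv_w: (out_dim, in_dim, kernel) weights, e.g. for dim 128 and kernel 2
code = encode_target_text(torch.randn(10, 128), torch.randn(128, 128, 2))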
505. The computer device obtains candidate encoding features corresponding to the plurality of candidate texts.
When the sixth coding feature obtained in the embodiment of the application is applied to a text search scenario, the computer device obtains the sixth coding feature corresponding to the target text, then obtains the candidate coding features corresponding to a plurality of candidate texts, and searches the candidate texts for texts similar to the target text according to the candidate coding features. The candidate coding feature corresponding to each candidate text is obtained by calling the trained feature coding model to perform coding. Both the candidate coding features and the sixth coding feature are coding features of the word dimension and belong to the global features of a text; the process of determining a candidate coding feature is the same as that of the sixth coding feature and is not repeated here.
In a possible implementation, the computer device includes a text database that stores the candidate texts, and the computer device determines the candidate coding feature corresponding to each candidate text in real time in a manner similar to the determination of the sixth coding feature in steps 501-504. Alternatively, the computer device determines in advance the candidate coding feature corresponding to each candidate text in the text database and stores each candidate coding feature together with its candidate text; in that case, in this step 505, the computer device directly obtains the candidate coding features from the text database.
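As an illustration of the storage benefit of hash features (the names and the NumPy storage format are assumptions), precomputed candidate coding features can be packed into compact byte arrays before being stored in the text database:

import numpy as np

def pack_codes(codes: np.ndarray) -> np.ndarray:
    # Pack {0,1} hash codes of shape (N, dim) into bytes of shape (N, dim/8)
    return np.packbits(codes.astype(np.uint8), axis=1)

# e.g. 100,000 candidates with 128-bit codes occupy only ~1.6 MB once packed
packed_db = pack_codes(np.random.randint(0, 2, size=(100_000, 128), dtype=np.uint8))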
506. The computer device determines a similarity between each candidate encoding feature and the sixth encoding feature, respectively.
To determine whether each candidate text is similar to the target text, the computer device determines the similarity between each candidate coding feature and the sixth coding feature; the similarity between a candidate coding feature and the sixth coding feature represents the similarity between the corresponding candidate text and the target text.

In one possible implementation, the computer device determines the Hamming distance between the candidate coding feature and the sixth coding feature and uses it as the similarity between them. Alternatively, the computer device may determine the similarity in other ways; the embodiment of the application does not limit how the similarity is determined.
507. The computer device determines the candidate texts corresponding to the candidate coding features whose similarity is greater than the target threshold as texts similar to the target text.

After obtaining the similarity corresponding to each candidate coding feature, the computer device selects the candidate coding features whose similarity is greater than the target threshold and determines the corresponding candidate texts to be texts similar to the target text.

Optionally, the target threshold is preset by the computer device. If the similarity corresponding to a candidate coding feature is not greater than the target threshold, the corresponding candidate text is considered not similar to the target text.
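Steps 506-507 might then be sketched as follows, under the assumption (consistent with the description above) that a smaller Hamming distance corresponds to a higher similarity; packed_db comes from the previous sketch:

import numpy as np

def hamming_search(query: np.ndarray, packed_db: np.ndarray, max_dist: int) -> np.ndarray:
    # Pack the query code, XOR it against every candidate, and count
    # the differing bits per candidate (the Hamming distance)
    q = np.packbits(query.astype(np.uint8))
    dists = np.unpackbits(packed_db ^ q, axis=1).sum(axis=1)
    # Keep candidates similar enough to the target (distance below threshold)
    return np.flatnonzero(dists < max_dist)

similar = hamming_search(np.random.randint(0, 2, 128), packed_db, max_dist=20)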
It should be noted that the embodiment of the present application is described by taking the search for texts similar to a target text as an example. In one possible implementation, the computer device performs steps 501-507 of the embodiment of the present application in response to a search request for the target text. Alternatively, in another embodiment, the sixth coding feature is not used for text searching, and the computer device does not perform steps 505-507.
To verify the accuracy of the feature coding model trained in the embodiment of the present application, the model was tested on data set 1 and data set 2. The specific test procedure is as follows: for a text, the feature coding model is called to code its features and obtain the coding feature; similarity is then computed between this coding feature and the coding features of the other texts in the text database; the 1000 texts with the highest similarity are retrieved, and the number of the 1000 retrieved texts that belong to the same category as the query text is determined manually. This search accuracy serves as the evaluation index of the feature coding model. The results are shown in Table 1.
TABLE 1
[Table 1 is provided as an image in the original publication; it reports the search accuracy of related techniques 1 to 6 (encoding TFIDF and BERT features) and of the method of the present application on data set 1 and data set 2.]
Related techniques 1 to 6 are semantic hashing methods based on generative models; the accuracy of encoding TFIDF (Term Frequency-Inverse Document Frequency) features and BERT features with each of these methods is shown in Table 1, as is the accuracy of the method of the present application, which encodes the word-dimension "global features". As Table 1 shows, using BERT features as input performs worse than using TFIDF features, because BERT features contain a large amount of redundant information and the generative semantic hashing methods require that the hash features fully reconstruct the BERT features, which introduces redundant noise and degrades performance. The method provided by the embodiment of the application achieves higher accuracy than the related-technology methods, which demonstrates the effectiveness of maximizing the mutual information between local features and global features.
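The test protocol described above can be sketched as a precision-at-1000 computation over unpacked codes (function and variable names are assumptions):

import numpy as np

def precision_at_k(query_code, query_label, db_codes, db_labels, k=1000):
    # Hamming distance from the query code to every database code
    dists = (db_codes != query_code).sum(axis=1)
    # Retrieve the k most similar texts and report the fraction that
    # share the query's category (the manually judged accuracy in Table 1)
    top_k = np.argsort(dists)[:k]
    return float((db_labels[top_k] == query_label).mean())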
The above embodiments can be applied to any scene that requires text processing. For example, in a text search scenario, other texts similar to any text can be searched by using the method provided by the above embodiment. Fig. 7 is a flowchart of a text search method provided in an embodiment of the present application, and referring to fig. 7, the method includes:
701. A computer device displays a text search interface.

702. The computer device acquires the input target text based on the text search interface.

703. In response to a search request for the target text, the computer device determines the sixth coding feature corresponding to the target text using the method provided by the embodiment of fig. 5.

704. The computer device acquires the candidate coding feature corresponding to each candidate text in the text database and determines the similarity between each candidate coding feature and the sixth coding feature.

705. The computer device determines the candidate text corresponding to the candidate coding feature with the highest similarity and displays it in the text search interface.
The text can be any type of text. Taking a text search scenario in the medical field as an example, medical text search is an important research problem in the information retrieval field. In a low-resource application scenario, the computing power and storage space of the computer device are limited, so directly computing similarity on raw text incurs a large overhead. With the text processing method provided by the embodiment of the application, medical texts can be converted into coding features and similarity can be computed on those features, which reduces the storage cost and speeds up search in low-resource scenarios.
Fig. 8 is a schematic diagram of a text search method provided in an embodiment of the present application. The text search interface 801 in fig. 8 includes a search box; a patient inputs a question text 802 in the search box of the text search interface 801 and performs a search operation on it. In response, the computer device searches for the answer text 803 with the highest similarity to the question text 802 using the text search method provided in the embodiment of the present application and displays the answer text 803 on the text search interface 801. Besides the patient question-and-answer scenario, the computer device may also store doctors' historical diagnosis plans in a case database so that other doctors can search them for the current case. The text search method provided by the embodiment of the application can also be applied to medical insurance question-and-answer scenarios and the like.
Fig. 9 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application. Referring to fig. 9, the apparatus includes:
a feature obtaining module 901, configured to obtain a first feature corresponding to each word in the first text and a second feature corresponding to the first text, where the first feature corresponding to the word represents a semantic meaning of the word in the first text, and the second feature is determined based on the first feature corresponding to each word;
a first encoding module 902, configured to invoke a feature encoding model, and encode each first feature and each second feature respectively to obtain a first encoding feature corresponding to each first feature and a second encoding feature corresponding to each second feature;
a model training module 903, configured to train a feature coding model based on a first association feature between each first coding feature and the second coding feature, where the first association feature represents a degree of association between the first coding feature and the second coding feature;
and a second encoding module 904, configured to invoke the trained feature coding model to encode features of any text.
With the text processing apparatus provided by the embodiment of the application, the feature coding model performs feature coding on the first features and the second feature respectively to obtain the first coding features and the second coding feature. Because a first feature contains the semantics of part of the words in the text while the second feature contains the semantics of every word in the same text, the degree of association between the first features and the second feature is high; if the feature coding model is sufficiently accurate, the degree of association between the resulting first coding features and second coding feature is also high. Training the feature coding model based on the associated features between each first coding feature and the second coding feature therefore improves the accuracy of the feature coding model and, in turn, the accuracy of the coding features it produces.
Optionally, referring to fig. 10, the model training module 903 comprises:
a loss value determining unit 913 configured to determine a loss value based on the first associated feature corresponding to each first encoding feature, where the loss value is negatively correlated with the first associated feature;
a model training unit 923 configured to train the feature coding model based on the loss value.
Optionally, referring to fig. 10, the apparatus further comprises an association determining module 905 configured to:
calling a discrimination model, discriminating the first coding feature and the second coding feature to obtain a discrimination result, wherein the discrimination result represents the possibility that the word corresponding to the first coding feature belongs to the text corresponding to the second coding feature;
and determining the judgment result as a first associated feature corresponding to the first coding feature.
Optionally, referring to fig. 10, the model training module comprises:
a loss value determining unit 913 configured to determine a loss value based on the first correlation characteristic corresponding to each of the first encoding characteristics, where the loss value is inversely correlated with the first correlation characteristic;
and a model training unit 923 for training the feature coding model and the discrimination model based on the loss value.
Optionally, referring to fig. 10, the apparatus further comprises:
the feature obtaining module 901 is further configured to obtain a third feature corresponding to a word in the second text, where the third feature corresponding to the word represents a semantic meaning of the word in the second text, and the second text is different from the first text;
the first encoding module 902 is further configured to invoke a feature encoding model, encode the third feature, and obtain a third encoding feature corresponding to the third feature;
an association determining module 905 configured to determine a second association feature between the third encoding feature and the second encoding feature, where the second association feature represents a degree of association between the third encoding feature and the second encoding feature;
a loss value determination unit 913 for:
and determining a loss value based on each first correlation characteristic and each second correlation characteristic, wherein the loss value is negatively correlated with the first correlation characteristic, and the loss value is positively correlated with the second correlation characteristic.
Alternatively, referring to fig. 10, the first text includes words at a plurality of positions, and the loss value determination unit 913 is configured to:
determining a loss component corresponding to each position based on a first associated feature and a second associated feature corresponding to each position, wherein the loss component is positively correlated with the first associated feature, and the loss component is negatively correlated with the second associated feature, the first associated feature corresponding to the position refers to a first associated feature corresponding to a word at the position in the first text, and the second associated feature corresponding to the position refers to a second associated feature corresponding to a word at the position in the second text;
and carrying out fusion processing on the loss component corresponding to each position to obtain a loss value, wherein the loss value is in negative correlation with the loss component.
Optionally, referring to fig. 10, the apparatus further comprises:
the feature obtaining module 901 is further configured to determine a first text feature corresponding to the first text, where the first text feature represents a semantic meaning of the first text;
the first encoding module 902 is further configured to invoke a feature encoding model, encode the first text feature, and obtain a fourth encoding feature;
an association determining module 905, configured to determine a third associated feature between the fourth encoding feature and the second encoding feature, where the third associated feature represents a degree of association between the fourth encoding feature and the second encoding feature;
a loss value determination unit 913 for:
based on each first associated feature and the third associated feature, a loss value is determined, the loss value being inversely correlated with the first associated feature and the third associated feature.
Optionally, referring to fig. 10, the apparatus further comprises:
the feature obtaining module 901 is further configured to determine a second text feature corresponding to a second text, where the second text feature represents a semantic meaning of the second text, and the second text is different from the first text;
the first encoding module 902 is further configured to invoke a feature encoding model, encode the second text feature, and obtain a fifth encoding feature;
an association determining module 905, configured to determine a fourth association feature between the fifth encoding feature and the second encoding feature, where the fourth association feature represents a degree of association between the fifth encoding feature and the second encoding feature;
a loss value determination unit 913 for:
and determining a loss value based on each first correlation characteristic, each third correlation characteristic and each fourth correlation characteristic, wherein the loss value is negatively correlated with the first correlation characteristic and the third correlation characteristic, and the loss value is positively correlated with the fourth correlation characteristic.
Optionally, referring to fig. 10, the feature obtaining module 901 includes:
a word feature determining unit 911, configured to determine a word feature corresponding to each word in the first text;
a first feature determining unit 921, configured to determine a first feature corresponding to each word based on the word feature corresponding to each word and the word feature corresponding to at least one word after each word, respectively.
Alternatively, referring to fig. 10, a first feature determination unit 921 for:
determining a plurality of target numbers, where the target numbers are different from each other and smaller than the number of words in the first text;
for each target number, determining a word group corresponding to the word, and determining a first sub-feature corresponding to the word based on a word feature corresponding to each word in the word group, wherein the word group comprises the word and a subsequent word of the word, the subsequent word of the word refers to the word after the word, and the number of the subsequent word in the word group is not more than the target number;
and carrying out fusion processing on the plurality of first sub-characteristics corresponding to the words to obtain the first characteristics corresponding to the words.
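A hedged sketch of this first-feature determination with multiple target numbers follows; zero-padding the trailing words (so that words with fewer successors still form a window) and fusing the sub-features by summation are assumptions:

import torch
import torch.nn.functional as F

def first_features(word_feats: torch.Tensor, convs: dict) -> torch.Tensor:
    # word_feats: (T, dim); convs maps each target number n to conv weights
    # of shape (dim, dim, n + 1), a window of the word plus n successors
    x = word_feats.t().unsqueeze(0)          # (1, dim, T)
    subs = []
    for n, w in convs.items():
        # Right-pad by n, then convolve over the word axis to get the
        # first sub-feature of each word for this target number
        subs.append(F.conv1d(F.pad(x, (0, n)), w))
    # Fuse the per-target-number sub-features into the first features
    return torch.stack(subs).sum(dim=0).squeeze(0).t()   # (T, dim)

convs = {1: torch.randn(64, 64, 2), 2: torch.randn(64, 64, 3)}
feats = first_features(torch.randn(10, 64), convs)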
Alternatively, referring to fig. 10, a first feature determination unit 921 for:
determining the words and the target number of words behind the words as word groups corresponding to the words under the condition that the total number of the subsequent words of the words is not less than the target number;
and under the condition that the total number of the subsequent words of the words is less than the target number, determining the words and each word after the words as the word group corresponding to the words.
Alternatively, referring to fig. 10, a first feature determination unit 921 for:
performing convolution processing on the word characteristics corresponding to each word in the word group respectively to obtain convolution characteristics corresponding to each word;
and determining the sum of the plurality of convolution characteristics as a first sub-characteristic corresponding to the word.
Optionally, referring to fig. 10, the feature obtaining module 901 includes:
the second feature determining unit 931 is configured to perform mean pooling on first features corresponding to a plurality of words in the first text to obtain second features; alternatively, the first and second electrodes may be,
the second feature determining unit 931 is configured to determine a median of the first feature corresponding to the plurality of words in the first text as the second feature.
Optionally, referring to fig. 10, a second encoding module 904 for:
determining word characteristics corresponding to each word in the target text;
determining fourth characteristics corresponding to each word based on the word characteristics corresponding to each word in the target text and the word characteristics corresponding to at least one word behind each word;
determining a fifth feature corresponding to the target text based on the fourth feature corresponding to each word in the target text;
and calling the trained feature coding model, and coding the fifth feature to obtain a sixth coding feature corresponding to the fifth feature.
Optionally, referring to fig. 10, the apparatus further comprises:
the feature query module 906 is configured to obtain candidate coding features corresponding to multiple candidate texts, where the candidate coding feature corresponding to each candidate text is obtained by calling a trained coding model to perform coding;
a similarity determining module 907 for determining the similarity between each candidate encoding feature and the sixth encoding feature respectively;
and a text determining module 908, configured to determine a candidate text corresponding to the candidate encoding feature with the similarity greater than the target threshold as a text similar to the target text.
It should be noted that: in the text processing apparatus provided in the above embodiment, when processing a text, only the division of the above functional modules is taken as an example, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the above described functions. In addition, the text processing apparatus and the text processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, and the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the operations executed in the text processing method of the foregoing embodiment.
Optionally, the computer device is provided as a terminal. Fig. 11 shows a schematic structural diagram of a terminal 1100 according to an exemplary embodiment of the present application.
The terminal 1100 includes: a processor 1101 and a memory 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 1102 is used to store at least one computer program to be loaded and executed by processor 1101 to implement the text processing methods provided by the method embodiments of this application.
In some embodiments, the terminal 1100 further comprises: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Optionally, the peripheral device comprises: at least one of radio frequency circuitry 1104, a display screen 1105, and a camera assembly 1106.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102 and the peripheral device interface 1103 may be implemented on separate chips or circuit boards, which is not limited by this embodiment.
The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1104 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electric signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1104 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over the surface of the display screen 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1105 may be one, disposed on a front panel of terminal 1100; in other embodiments, the display screens 1105 can be at least two, respectively disposed on different surfaces of the terminal 1100 or in a folded design; in other embodiments, display 1105 can be a flexible display disposed on a curved surface or on a folded surface of terminal 1100. Even further, the display screen 1105 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display screen 1105 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
Camera assembly 1106 is used to capture images or video. Optionally, camera assembly 1106 includes a front camera and a rear camera. The front camera is disposed on the front panel of the terminal 1100, and the rear camera is disposed on the rear surface of the terminal 1100. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1106 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Those skilled in the art will appreciate that the configuration shown in fig. 11 does not constitute a limitation of terminal 1100, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
Optionally, the computer device is provided as a server. Fig. 12 is a schematic structural diagram of a server 1200 according to an embodiment of the present application, where the server 1200 may generate a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 1201 and one or more memories 1202, where the memory 1202 stores at least one computer program, and the at least one computer program is loaded and executed by the processors 1201 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, where at least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor to implement the operations executed in the text processing method of the foregoing embodiment.
Embodiments of the present application further provide a computer program product, which includes a computer program that is loaded and executed by a processor to implement the operations performed in the text processing method according to the above aspect. In some embodiments, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device or on multiple computer devices located at one site, or may be executed on multiple computer devices distributed at multiple sites and interconnected by a communication network, and the multiple computer devices distributed at the multiple sites and interconnected by the communication network may constitute a block chain system.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an alternative embodiment of the present application and should not be construed as limiting the present application, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (19)

1. A method of text processing, the method comprising:
acquiring a first feature corresponding to each word in a first text and a second feature corresponding to the first text, wherein the first feature corresponding to the word represents the semantic meaning of the word in the first text, and the second feature is determined based on the first feature corresponding to each word;
calling a feature coding model, and coding each first feature and each second feature respectively to obtain a first coding feature corresponding to each first feature and a second coding feature corresponding to each second feature;
training the feature coding model based on a first correlation feature between each of the first coding features and the second coding features, the first correlation feature representing a degree of correlation between the first coding feature and the second coding feature;
and calling the trained feature coding model to code the features of any text.
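By way of a non-limiting illustration, the training flow recited in claim 1 might be sketched as follows in PyTorch; the encoder architecture, the feature dimensions, and the dot-product scoring of the first associated features are assumptions of this sketch, not elements of the claim.
```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Hypothetical feature coding model: one shared MLP encodes both
    the per-word first features and the text-level second feature."""
    def __init__(self, dim_in=128, dim_out=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_out), nn.ReLU(),
            nn.Linear(dim_out, dim_out))

    def forward(self, x):
        return self.net(x)

encoder = FeatureEncoder()
first_feats = torch.randn(10, 128)       # one first feature per word
second_feat = first_feats.mean(dim=0)    # second feature derived from the first features

first_codes = encoder(first_feats)       # first coding features, (10, 64)
second_code = encoder(second_feat)       # second coding feature, (64,)

# First associated features: degree of correlation between each first
# coding feature and the second coding feature (dot product assumed).
first_assoc = first_codes @ second_code  # (10,)
```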
2. The method of claim 1, wherein training the feature coding model based on a first correlation feature between each of the first coding features and the second coding features comprises:
determining a loss value based on a first associated feature corresponding to each of the first encoding features, the loss value being inversely related to the first associated feature;
training the feature coding model based on the loss values.
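Claim 2 fixes only the sign relationship between the loss value and the first associated features. One consistent choice, assumed here rather than required by the claim, is a negative mean log-sigmoid:
```python
import torch.nn.functional as F

def loss_from_assoc(first_assoc):
    # Inversely related to the first associated features: the loss
    # falls as each word code correlates more with the text code.
    return -F.logsigmoid(first_assoc).mean()
```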
3. The method of claim 1, wherein before training the feature coding model based on the first correlation feature between each of the first coding features and the second coding features, the method further comprises:
calling a discrimination model to discriminate the first coding feature against the second coding feature to obtain a discrimination result, wherein the discrimination result represents the possibility that the word corresponding to the first coding feature belongs to the text corresponding to the second coding feature;
and determining the discrimination result as a first associated feature corresponding to the first coding feature.
4. The method of claim 3, wherein training the feature coding model based on the first associated feature between each of the first coding features and the second coding features comprises:
determining a loss value based on a first associated feature corresponding to each of the first encoding features, the loss value being inversely related to the first associated feature;
training the feature coding model and the discriminant model based on the loss values.
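Claims 3 and 4 route the associated feature through a discrimination model trained jointly with the feature coding model. A bilinear scorer, familiar from Deep InfoMax-style objectives, is one plausible reading; the sketch below assumes the `encoder` from the earlier illustration and is not the patent's specified network.
```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores the possibility that the word behind each word code
    belongs to the text behind the text code (the discrimination result)."""
    def __init__(self, dim=64):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, word_codes, text_code):
        # Broadcast the single text code across all word codes.
        text_code = text_code.expand_as(word_codes)
        return self.bilinear(word_codes, text_code).squeeze(-1)

disc = Discriminator()
# Claim 4: the feature coding model and the discrimination model are
# trained jointly, here via one optimizer over both parameter sets.
params = list(encoder.parameters()) + list(disc.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
```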
5. The method of claim 2, further comprising:
acquiring a third feature corresponding to each word in a second text, wherein the third feature corresponding to the word represents the semantics of the word in the second text, and the second text is different from the first text;
calling the feature coding model, and coding the third feature to obtain a third coding feature corresponding to the third feature;
determining a second correlation characteristic between the third coding characteristic and the second coding characteristic, the second correlation characteristic representing a degree of correlation between the third coding characteristic and the second coding characteristic;
said determining a loss value based on the first associated feature corresponding to each of said first encoding features comprises:
determining the loss value based on each of the first correlation characteristic and the second correlation characteristic, the loss value being negatively correlated with the first correlation characteristic and the loss value being positively correlated with the second correlation characteristic.
6. The method of claim 5, wherein the first text comprises words at a plurality of locations, and wherein determining the loss value based on each of the first associated feature and the second associated feature comprises:
determining a loss component corresponding to each position based on a first associated feature and a second associated feature corresponding to each position, wherein the loss component is positively correlated with the first associated feature, and the loss component is negatively correlated with the second associated feature, wherein the first associated feature corresponding to the position refers to a first associated feature corresponding to a word located at the position in the first text, and the second associated feature corresponding to the position refers to a second associated feature corresponding to a word located at the position in the second text;
and carrying out fusion processing on the loss component corresponding to each position to obtain the loss value, wherein the loss value is in negative correlation with the loss component.
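Claims 5 and 6 read as a per-position contrastive pairing: the word at each position of the first text is a positive for the second coding feature, and the word at the same position of the second text is a negative. A minimal sketch consistent with the recited sign constraints (function and argument names are assumptions):
```python
import torch.nn.functional as F

def position_loss(first_assoc, second_assoc):
    # first_assoc[i]:  word at position i of the first text vs. the
    #                  second coding feature -> should score high.
    # second_assoc[i]: word at position i of the second text vs. the
    #                  same second coding feature -> should score low.
    components = F.logsigmoid(first_assoc) + F.logsigmoid(-second_assoc)
    # Each loss component is positively correlated with the first
    # associated feature and negatively with the second; the loss value
    # fuses (means) the components and negates, so it is negatively
    # correlated with the components, as claim 6 recites.
    return -components.mean()
```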
7. The method of claim 2, further comprising:
determining a first text feature corresponding to the first text, wherein the first text feature represents the semantic meaning of the first text;
calling the feature coding model, and coding the first text feature to obtain a fourth coding feature;
determining a third correlation feature between the fourth encoding feature and the second encoding feature, the third correlation feature representing a degree of correlation between the fourth encoding feature and the second encoding feature;
said determining a loss value based on the first associated feature corresponding to each of said first encoding features comprises:
determining the loss value based on each of the first and third associated features, the loss value being inversely related to the first and third associated features.
8. The method of claim 7, further comprising:
determining a second text feature corresponding to a second text, wherein the second text feature represents the semantic meaning of the second text, and the second text is different from the first text;
calling the feature coding model, and coding the second text feature to obtain a fifth coding feature;
determining a fourth correlation feature between the fifth coding feature and the second coding feature, the fourth correlation feature representing a degree of correlation between the fifth coding feature and the second coding feature;
said determining said loss value based on each of said first associated feature and said third associated feature comprises:
determining the loss value based on each of the first correlation feature, the third correlation feature, and the fourth correlation feature, the loss value being negatively correlated with the first correlation feature and the third correlation feature, the loss value being positively correlated with the fourth correlation feature.
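Claims 7 and 8 add text-level (global) terms on top of the word-level ones: the first text's own text feature acts as a further positive, a second text's feature as a further negative. One way to fold all four associated features into a single loss value, under the same assumed scoring:
```python
import torch.nn.functional as F

def full_loss(first_assoc, second_assoc, third_assoc, fourth_assoc):
    # Word-level terms (claims 5/6): per-position positives and negatives.
    local = -(F.logsigmoid(first_assoc) + F.logsigmoid(-second_assoc)).mean()
    # Text-level terms (claims 7/8): the third associated feature pulls
    # the loss down, the fourth pushes it up.
    glob = -(F.logsigmoid(third_assoc) + F.logsigmoid(-fourth_assoc))
    return local + glob
```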
9. The method of any one of claims 1-8, wherein obtaining a first feature corresponding to each word in the first text comprises:
determining a word feature corresponding to each word in the first text;
and determining, for each word, the first feature corresponding to the word based on the word feature corresponding to the word and the word feature corresponding to at least one word after the word.
10. The method of claim 9, wherein determining the first feature corresponding to each of the words based on the word feature corresponding to each of the words and the word feature corresponding to at least one word after each of the words, respectively, comprises:
determining a plurality of target quantities, wherein the target quantities are different from one another and each is less than the number of words in the first text;
for each target quantity, determining a word group corresponding to the word, and determining a first sub-feature corresponding to the word based on the word feature corresponding to each word in the word group, wherein the word group comprises the word and subsequent words of the word, a subsequent word of the word refers to a word after the word, and the number of subsequent words in the word group is not greater than the target quantity;
and fusing the plurality of first sub-characteristics corresponding to the words to obtain the first characteristics corresponding to the words.
11. The method of claim 10, wherein the determining the word group to which the word corresponds comprises:
determining the word and the target quantity of words after the word as the word group corresponding to the word if the total number of subsequent words of the word is not less than the target quantity;
determining the word and each word after the word as the word group corresponding to the word if the total number of subsequent words of the word is less than the target quantity.
12. The method of claim 10, wherein determining the first sub-feature corresponding to the word based on the word feature corresponding to each word in the group of words comprises:
performing convolution processing on the word feature corresponding to each word in the word group respectively to obtain a convolution feature corresponding to each word;
and determining the sum of the plurality of convolution characteristics as a first sub-characteristic corresponding to the word.
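Claims 10 to 12 together describe a windowed, n-gram-like construction: for each target quantity n, each word is grouped with at most n subsequent words, each group member is convolved, the convolution features are summed into a first sub-feature, and the sub-features across the different n are fused. A compact sketch under assumed choices (a linear map standing in for the unspecified convolution, a mean for the unspecified fusion):
```python
import torch
import torch.nn as nn

def first_features(word_feats, target_quantities=(1, 2, 3)):
    # word_feats: (num_words, dim) word features of one text; each
    # target quantity must be smaller than num_words (claim 10).
    num_words, dim = word_feats.shape
    conv = nn.Linear(dim, dim)         # stand-in for the per-word convolution
    conv_feats = conv(word_feats)      # one convolution feature per word (claim 12)
    per_n = []
    for n in target_quantities:
        rows = []
        for i in range(num_words):
            # Word group: the word plus at most n subsequent words,
            # truncated at the end of the text (claim 11).
            group = conv_feats[i:min(i + n + 1, num_words)]
            rows.append(group.sum(dim=0))   # first sub-feature
        per_n.append(torch.stack(rows))
    # Fuse the sub-features across target quantities (mean assumed).
    return torch.stack(per_n).mean(dim=0)
```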
13. The method according to any one of claims 1-8, wherein obtaining the second feature corresponding to the first text comprises:
performing mean value pooling on the first features corresponding to a plurality of words in the first text to obtain the second feature; or,
determining the median of the first features corresponding to the plurality of words in the first text as the second feature.
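Both reductions of claim 13 are one-liners; with `first_feats` holding the per-word first features as in the earlier sketch:
```python
second_mean   = first_feats.mean(dim=0)            # mean value pooling
second_median = first_feats.median(dim=0).values   # element-wise median
```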
14. The method according to any one of claims 1-8, wherein said invoking the trained feature coding model to code the feature of any text comprises:
determining a word feature corresponding to each word in the target text;
determining a fourth feature corresponding to each word based on the word feature corresponding to each word in the target text and the word feature corresponding to at least one word after each word;
determining a fifth feature corresponding to the target text based on the fourth feature corresponding to each word in the target text;
and calling the trained feature coding model, and coding the fifth feature to obtain a sixth coding feature corresponding to the fifth feature.
15. The method according to claim 14, wherein after the calling the trained feature coding model and coding the fifth feature to obtain a sixth coding feature corresponding to the fifth feature, the method further comprises:
acquiring candidate coding features corresponding to a plurality of candidate texts, wherein the candidate coding feature corresponding to each candidate text is obtained by calling the trained feature coding model for coding;
respectively determining the similarity between each candidate coding feature and the sixth coding feature;
and determining the candidate text corresponding to the candidate coding features with the similarity greater than the target threshold as the text similar to the target text.
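Claims 14 and 15 apply the trained model to retrieval: encode the target text, compare its sixth coding feature against pre-encoded candidate coding features, and keep the candidates above the target threshold. A sketch assuming cosine similarity as the similarity measure:
```python
import torch
import torch.nn.functional as F

def similar_texts(sixth_code, candidate_codes, threshold=0.8):
    # sixth_code: (dim,) coding feature of the target text.
    # candidate_codes: (num_candidates, dim), pre-encoded candidates.
    sims = F.cosine_similarity(candidate_codes, sixth_code.unsqueeze(0), dim=1)
    # Candidates whose similarity exceeds the target threshold count
    # as texts similar to the target text.
    return (sims > threshold).nonzero(as_tuple=True)[0]
```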
16. A text processing apparatus, characterized in that the apparatus comprises:
the feature acquisition module is used for acquiring a first feature corresponding to each word in a first text and a second feature corresponding to the first text, wherein the first feature corresponding to the word represents the semantic meaning of the word in the first text, and the second feature is determined based on the first feature corresponding to each word;
the first coding module is used for calling a feature coding model, and coding each first feature and each second feature respectively to obtain a first coding feature corresponding to each first feature and a second coding feature corresponding to each second feature;
a model training module for training the feature coding model based on a first correlation feature between each of the first coding features and the second coding features, the first correlation feature representing a degree of correlation between the first coding feature and the second coding feature;
and the second coding module is used for calling the trained feature coding model and coding the features of any text.
17. A computer device, characterized in that the computer device comprises a processor and a memory, in which at least one computer program is stored, which is loaded and executed by the processor to implement the operations performed by the text processing method according to any of claims 1 to 15.
18. A computer-readable storage medium, having stored therein at least one computer program, which is loaded and executed by a processor, to perform operations performed by a text processing method according to any one of claims 1 to 15.
19. A computer program product comprising a computer program, wherein the computer program is loaded and executed by a processor to perform the operations performed by the text processing method of any of claims 1 to 15.
CN202111081005.8A 2021-09-15 2021-09-15 Text processing method and device, computer equipment and storage medium Pending CN114281933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111081005.8A CN114281933A (en) 2021-09-15 2021-09-15 Text processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111081005.8A CN114281933A (en) 2021-09-15 2021-09-15 Text processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114281933A 2022-04-05

Family

ID=80868586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111081005.8A Pending CN114281933A (en) 2021-09-15 2021-09-15 Text processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114281933A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117972138A (en) * 2024-04-02 2024-05-03 腾讯科技(深圳)有限公司 Training method and device for pre-training model and computer equipment

Similar Documents

Publication Publication Date Title
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN111598168B (en) Image classification method, device, computer equipment and medium
CN112419326B (en) Image segmentation data processing method, device, equipment and storage medium
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN113011387B (en) Network training and human face living body detection method, device, equipment and storage medium
CN110781413A (en) Interest point determining method and device, storage medium and electronic equipment
CN114282013A (en) Data processing method, device and storage medium
WO2021169366A1 (en) Data enhancement method and apparatus
CN113806487A (en) Semantic search method, device, equipment and storage medium based on neural network
CN113569607A (en) Motion recognition method, motion recognition device, motion recognition equipment and storage medium
CN113569042A (en) Text information classification method and device, computer equipment and storage medium
KR20230048614A (en) Systems, methods, and apparatus for image classification with domain invariant regularization
CN114298997B (en) Fake picture detection method, fake picture detection device and storage medium
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN114677350A (en) Connection point extraction method and device, computer equipment and storage medium
CN114281933A (en) Text processing method and device, computer equipment and storage medium
CN114282543A (en) Text data processing method and device, computer equipment and storage medium
CN115952317A (en) Video processing method, device, equipment, medium and program product
CN112417260B (en) Localized recommendation method, device and storage medium
CN115862794A (en) Medical record text generation method and device, computer equipment and storage medium
CN113762037A (en) Image recognition method, device, equipment and storage medium
CN112765377A (en) Time slot positioning in media streams
CN117173731B (en) Model training method, image processing method and related device
CN117851967A (en) Multi-mode information fusion method, device, equipment and storage medium
CN116758365A (en) Video processing method, machine learning model training method, related device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination