CN113887215A - Text similarity calculation method and device, electronic equipment and storage medium

Info

Publication number
CN113887215A
Authority
CN
China
Prior art keywords
text
word
model
segment
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111210677.4A
Other languages
Chinese (zh)
Inventor
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111210677.4A priority Critical patent/CN113887215A/en
Publication of CN113887215A publication Critical patent/CN113887215A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a text similarity calculation method and device, an electronic device and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring an original text to be calculated; performing word segmentation on the original text with a pre-trained text word segmentation model to obtain a plurality of text word segments; performing position recognition on each text word segment with a pre-trained target word library model to obtain the target position of each text word segment; encoding each text word segment according to its target position to obtain text word segment vectors; inputting the text word segment vectors into a pre-trained comparison model so that each text word segment vector is matrix-multiplied with a reference word embedding matrix in the comparison model to obtain target word embedding vectors; and performing similarity calculation on the target word embedding vectors to obtain a similarity value between every two text word segments. With the method and device, the correlation between text word segments can be determined more accurately.

Description

Text similarity calculation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text similarity calculation method and apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, many businesses require natural language processing based on computer technology, such as search engines, intelligent services, and the like. Natural language processing usually involves calculating text similarity, but existing calculation approaches often have low accuracy. Therefore, how to provide a calculation method that improves the accuracy of text similarity calculation has become an urgent technical problem.
Disclosure of Invention
The embodiments of the application mainly aim to provide a text similarity calculation method and apparatus, an electronic device, and a storage medium, so as to improve the accuracy of text similarity calculation.
In order to achieve the above object, a first aspect of an embodiment of the present application provides a text similarity calculation method, including:
acquiring an original text to be calculated;
performing word segmentation processing on the original text by using a pre-trained text word segmentation model to obtain a plurality of text word segments;
carrying out position identification on each text word segment by using a pre-trained target word bank model to obtain a target position of each text word segment;
coding each text word segment according to the target position to obtain a text word segment vector;
inputting the text word segment vector into a pre-trained comparison model so as to perform matrix multiplication on the text word segment vector and a reference word embedding matrix in the comparison model to obtain a target word embedding vector;
and performing similarity calculation on the target word embedding vectors to obtain a similarity value between every two text word segments.
In some embodiments, the step of performing location recognition on each text word by using a pre-trained target word library model to obtain a target location of each text word includes:
extracting elements of each text word segment by using a preset function in the target word library model to obtain an element value of each text word segment;
and carrying out position identification on the text word segment according to the element value to obtain the target position of the text word segment.
In some embodiments, before the step of performing location recognition on each text word by using a pre-trained target lexicon model to obtain a target location of each text word, the method further includes pre-training the target lexicon model, specifically including:
acquiring reference text data;
performing word segmentation processing on the reference text data by using an initial word segmentation model to obtain reference word segment data;
dividing the reference word segment data into a training set, a test set and a verification set according to a preset proportion;
training the initial model by using the training set to obtain a current word stock model;
and verifying the current word stock model by using the test set and the verification set to obtain the target word stock model.
In some embodiments, the step of encoding each text segment according to the target position to obtain a text segment vector includes:
according to the target position, performing normalization processing on each text word segment to obtain a standard word segment;
and carrying out one-hot coding on the standard word segment to obtain a text word segment vector.
In some embodiments, the step of inputting the text word segment vector into a pre-trained contrast model to perform matrix multiplication on the text word segment vector and a reference word embedding matrix in the contrast model to obtain a target word embedding vector includes:
inputting the text word segment vectors into the comparison model so as to enable the text word segment vectors to be subjected to matrix multiplication with the reference word embedding matrix to obtain a plurality of basic word embedding vectors;
and mapping the basic word embedded vector to obtain a target word embedded vector.
In some embodiments, before the step of inputting the text word segment vector into a pre-trained contrast model to perform matrix multiplication on the text word segment vector and a reference word embedding matrix in the contrast model to obtain a target word embedding vector, the method further includes training the contrast model, specifically including:
acquiring sample data;
performing data enhancement processing on the sample data to obtain a positive example pair;
inputting the positive example pair to the comparative learning model;
calculating a first similarity of the positive example pair and a second similarity of the negative example pair through a loss function of the comparison learning model;
and optimizing a loss function of the comparison learning model according to the first similarity and the second similarity so as to update the comparison learning model.
In some embodiments, the step of performing similarity calculation on the target word embedding vectors to obtain a similarity value between each two text word segments includes:
and performing similarity calculation on the target word embedding vectors by using a cosine similarity calculation method to obtain a similarity value between every two text word segments.
To achieve the above object, a second aspect of an embodiment of the present application proposes a text similarity calculation apparatus, including:
the text acquisition module is used for acquiring an original text to be calculated;
the word segmentation module is used for carrying out word segmentation processing on the original text by utilizing a pre-trained text word segmentation model to obtain a plurality of text word segments;
the position identification module is used for carrying out position identification on each text word segment by utilizing a pre-trained target word bank model to obtain a target position of each text word segment;
the encoding module is used for encoding each text word segment according to the target position to obtain a text word segment vector;
the comparison module is used for inputting the text word segment vector into a pre-trained comparison model so as to enable the text word segment vector to be subjected to matrix multiplication with a reference word embedding matrix in the comparison model to obtain a target word embedding vector;
and the similarity calculation module is used for performing similarity calculation on the target word embedding vectors to obtain a similarity value between every two text word segments.
In order to achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, which includes a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the method of the first aspect.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium for a computer-readable storage, the computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the method of the first aspect.
According to the text similarity calculation method, the text similarity calculation device, the electronic equipment and the storage medium, the original text to be calculated is obtained, and the pre-trained text word segmentation model is used for carrying out word segmentation on the original text to obtain a plurality of text word segments, so that fragmentation processing on the original text can be realized, and the required text word segments can be extracted more conveniently; and then, position recognition is carried out on each text word segment by utilizing a pre-trained target word bank model to obtain the target position of each text word segment, and the position recognition can be carried out on each text word segment more accurately. Therefore, each text word segment can be further encoded according to the target position to obtain a text word segment vector, the text word segment vector is input into a pre-trained comparison model, the text word segment vector is subjected to matrix multiplication with a reference word embedding matrix in the comparison model to obtain a target word embedding vector, finally, similarity calculation is carried out on a plurality of target word embedding vectors to obtain a similarity value between every two text word segments, the problem of uniform distribution of the word segment vectors can be effectively solved through the comparison model, and therefore the accuracy of the similarity calculation is improved, and the method can more accurately determine the correlation between the text word segments.
Drawings
Fig. 1 is a flowchart of a text similarity calculation method provided in an embodiment of the present application;
fig. 2 is a flowchart of step S103 in fig. 1;
fig. 3 is another flowchart of a text similarity calculation method provided in an embodiment of the present application;
FIG. 4 is a flowchart of step S104 in FIG. 1;
fig. 5 is a flowchart of step S105 in fig. 1;
FIG. 6 is another flowchart of a text similarity calculation method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a text similarity calculation apparatus provided in an embodiment of the present application;
fig. 8 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
First, several terms referred to in the present application are explained:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computers to process, understand and use human languages (such as Chinese, English, etc.), and belongs to a branch of artificial intelligence; it is a cross discipline between computer science and linguistics, also commonly called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in technical fields such as machine translation, character recognition of handwriting and print, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining, and involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and other areas related to language processing.
Information Extraction (NER): and extracting the fact information of entities, relations, events and the like of specified types from the natural language text, and forming a text processing technology for outputting structured data. Information extraction is a technique for extracting specific information from text data. The text data is composed of specific units, such as sentences, paragraphs and chapters, and the text information is composed of small specific units, such as words, phrases, sentences and paragraphs or combinations of these specific units. The extraction of noun phrases, names of people, names of places, etc. in the text data is text information extraction, and of course, the information extracted by the text information extraction technology can be various types of information.
Entity: refers to something that is distinguishable and exists independently. Such as a person, a city, a plant, etc., a commodity, etc. All things in the world are composed of specific things, which are referred to as entities. The entity is the most basic element in the knowledge graph, and different relationships exist among different entities.
Concept: a collection of entities of a certain class.
Semantic class (concept): a collection of entities with the same characteristics, such as countries, nationalities, books, computers, etc. Concepts refer primarily to collections, categories, types of objects, categories of things, such as people, geographies, and the like.
Self-supervision learning: the self-supervision learning mainly utilizes a secondary task (pretext) to mine self supervision information from large-scale unsupervised data, and the network is trained by the constructed supervision information, so that valuable characteristics of downstream tasks can be learned. That is, the supervised information of the self-supervised learning is not labeled manually, but the algorithm automatically constructs the supervised information in large-scale unsupervised data to perform the supervised learning or training.
Contrastive Learning: contrastive learning is a kind of self-supervised learning that does not need to rely on manually labeled class label information and directly uses the data itself as supervision information. Contrastive learning is an approach that teaches a deep learning model which things are similar and which are different; using contrastive learning, a machine learning model can be trained to distinguish between similar and dissimilar images. Self-supervised learning in the image field falls into two types: generative self-supervised learning and discriminative self-supervised learning, and contrastive learning is a typical discriminative self-supervised learning. The core idea of contrastive learning is: by automatically constructing similar instances and dissimilar instances, i.e., positive samples and negative samples, the model learns to compare them in a feature space so that the distances between similar instances are reduced while the distances between dissimilar instances are increased and their differences are enlarged; the model representations obtained through this learning process can then be used for downstream tasks and fine-tuned on a small labeled data set, thereby realizing an unsupervised model learning process. The guiding principle of contrastive learning is: by automatically constructing similar and dissimilar instances, a model is obtained through learning such that similar instances are relatively close in the projection space while dissimilar instances are relatively far apart.
Embedding: embedding is a vector representation, meaning that an object is represented by a low-dimensional vector; the object can be a word, a commodity, a movie, etc. The embedding vector has the property that objects whose vectors are close in distance have similar meanings; for example, Embedding(Avengers) and Embedding(Iron Man) are very close, while Embedding(Avengers) and Embedding(dinner) are far apart. The essence of embedding is a mapping from a semantic space to a vector space that preserves, as far as possible, the relations of the original samples in the semantic space; for example, two words with similar semantics are also relatively close in the vector space. Embedding can encode an object with a low-dimensional vector while retaining its meaning. It is commonly applied in machine learning: when building a machine learning model, objects are encoded into low-dimensional dense vectors and then passed to a DNN, which improves efficiency.
Batch: the batch size is a hyper-parameter that defines the number of samples to be processed before the internal model parameters are updated, i.e., it controls the number of training samples used between updates of the model's internal parameters. A training data set can be divided into one or more batches. When all training samples are used to create one batch, the learning algorithm is called batch gradient descent; when the batch is a single sample, the learning algorithm is called stochastic gradient descent; when the batch size is larger than one sample and smaller than the size of the training data set, the learning algorithm is called mini-batch gradient descent. In short, the batch size is the number of samples processed before the model is updated.
Data enhancement: data enhancement (data augmentation) is mainly used to prevent overfitting and to optimize a data set when the data set is small. Through data enhancement, the amount of training data can be increased, the generalization ability of the model is improved, and adding noisy data improves the robustness of the model. Data enhancement can be divided into two categories: offline enhancement and online enhancement. Offline enhancement processes the data set directly, so the amount of data becomes the enhancement factor times the size of the original data set; it is often used when the data set is very small. Online enhancement is mainly applied to batch data after a batch is obtained, for example with rotations, translations, flips and other corresponding changes; because some data sets cannot accept a linear-level increase, online enhancement is often used for larger data sets, and many machine learning frameworks already support online enhancement and can use the GPU to accelerate the computation.
dropout (discard): dropout is a technique for preventing model overfitting, which means that in the training process of a deep learning network, for a neural network unit, the neural network unit is temporarily discarded from the network according to a certain probability, so that the model can be made more robust because it does not rely too much on some local features (because the local features are likely to be discarded).
Mask: masking is a common operation in deep learning. Simply put, a mask is equivalent to overlaying a cover on the original tensor to hide or select specific elements, so it is often used to construct tensor filters. The ReLU activation function (which keeps or zeroes values according to whether the input is positive or negative) and the dropout mechanism (which zeroes values according to a probability) can both be understood as generalized mask operations.
Encoding (encoder): converting an input sequence into a fixed-length vector. Decoding (decoder): converting the previously generated fixed-length vector into an output sequence. The input sequence can be text, speech, images or video; the output sequence can be text or images.
Back propagation: the general principle of back propagation is as follows: training set data is fed into the input layer of a neural network, passes through the hidden layers, and finally reaches the output layer, which outputs a result; because the output of the neural network differs from the actual result, the error between the estimated value and the actual value is calculated and propagated backwards from the output layer through the hidden layers until it reaches the input layer; during back propagation, the values of the parameters are adjusted according to the error; this process is iterated until convergence.
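As an illustration of the loop described above, the following minimal Python sketch performs back propagation for a single linear layer with plain gradient descent; the network, data and learning rate are hypothetical and not part of the patent.

```python
# A minimal sketch of the back-propagation loop: forward pass, error,
# gradient propagated back to the parameters, parameter update, iterate.
import numpy as np

x = np.array([[0.5, 1.0]])          # training input
y = np.array([[1.0]])               # actual value
w = np.random.randn(2, 1) * 0.1     # parameters of a single linear layer

for _ in range(100):                # iterate until (approximate) convergence
    y_hat = x @ w                   # forward pass: estimated value
    error = y_hat - y               # error between estimate and actual value
    grad = x.T @ error              # propagate the error back to the parameters
    w -= 0.1 * grad                 # adjust parameter values according to the error

print(y_hat)                        # close to the actual value after training
```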
With the development of computer technology, many businesses require natural language processing based on computer technology, such as search engines, intelligent services, and the like. Natural language processing usually involves calculating text similarity, but existing calculation approaches often have low accuracy. Therefore, how to provide a text similarity calculation method that improves the accuracy of similarity calculation has become an urgent technical problem.
Based on this, the embodiment of the application provides a text similarity calculation method, a text similarity calculation device, an electronic device and a storage medium, which can improve the accuracy of similarity calculation.
The text similarity calculation method, the text similarity calculation device, the electronic device, and the storage medium provided in the embodiments of the present application are specifically described in the following embodiments, and first, the text similarity calculation method in the embodiments of the present application is described.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the application provides a text similarity calculation method, and relates to the technical field of artificial intelligence. The text similarity calculation method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server side can be configured into an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and cloud servers for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (content delivery network) and big data and artificial intelligence platforms; the software may be an application or the like that implements the text similarity calculation method, but is not limited to the above form.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an alternative flowchart of a text similarity calculation method provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S106.
Step S101, obtaining an original text to be calculated;
step S102, carrying out word segmentation processing on an original text by using a pre-trained text word segmentation model to obtain a plurality of text word segments;
step S103, carrying out position identification on each text word segment by using a pre-trained target word bank model to obtain a target position of each text word segment;
step S104, coding each text word segment according to the target position to obtain a text word segment vector;
step S105, inputting the text word segment vector into a pre-trained comparison model so as to perform matrix multiplication on the text word segment vector and a reference word embedding matrix in the comparison model to obtain a target word embedding vector;
and step S106, performing similarity calculation on the target word embedding vectors to obtain a similarity value between every two text word segments.
In step S101 of some embodiments, the original text to be calculated may be obtained by writing a web crawler and performing targeted crawling after a data source is set. It should be noted that the original text is natural language text.
In step S102 of some embodiments, the pre-trained text word segmentation model may include a Jieba word segmenter; other word segmentation tools may also be used. Taking the Jieba word segmenter as the text word segmentation model for illustration, step S102 specifically includes:
and performing word segmentation processing on the original text by using a pre-trained Jieba word segmentation device to obtain text word segments.
Specifically, when the Jieba word segmenter performs word segmentation, a directed acyclic graph corresponding to the original text is generated by consulting the dictionary in the Jieba word segmenter; a shortest path on the directed acyclic graph is then found according to a preset selection mode and the dictionary, and the original text is cut according to this shortest path, or the original text is cut directly, to obtain text word segments.
Further, for text word segments that are not in the dictionary, new word discovery may be performed using an HMM (hidden Markov model). Specifically, the position B, M, E or S of each character in the text word segment is taken as the hidden state and the character itself as the observed state, where B/M/E/S indicate that the character appears at the beginning of a word, inside a word, at the end of a word, or forms a word by itself. Dictionary files are used to store the emission probability matrix, the initial probability vector, and the transition probability matrix. The most probable hidden state sequence is solved with the Viterbi algorithm, from which the text word segments are obtained.
In step S102 of other embodiments, part-of-speech tagging is also performed on the text word segments, that is, the text word segments are tagged according to preset part-of-speech categories to obtain text word segments containing part-of-speech category tags, where the preset part-of-speech categories include nouns, verbs, modifiers, adjectives, and the like.
The step S102 in the above embodiment can implement word segmentation processing on the original text, so that it is more convenient to extract the required text word segments.
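As an illustration of the word segmentation step described above, the following sketch uses the open-source Jieba segmenter in Python; the sample text and the use of jieba.posseg for part-of-speech tagging are assumptions for the example, not the patent's exact implementation.

```python
# Illustrative sketch of the word segmentation step (S102) with jieba;
# the sample text is hypothetical.
import jieba
import jieba.posseg as pseg  # part-of-speech tagging variant

original_text = "自然语言处理用于计算文本相似度"  # hypothetical original text

# Plain segmentation: returns the text word segments; HMM enables new-word discovery
text_segments = list(jieba.cut(original_text, HMM=True))
print(text_segments)

# Segmentation with part-of-speech tags, as described above
tagged_segments = [(word, flag) for word, flag in pseg.cut(original_text)]
print(tagged_segments)
```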
Referring to fig. 2, in some embodiments, step S103 may include, but is not limited to, step S201 to step S202:
step S201, extracting elements of each text word segment by using a preset function in the target word library model to obtain an element value of each text word segment;
and S202, identifying the position of the text word segment according to the element value to obtain the target position of the text word segment.
Specifically, the target word library model may be a coding model comprising at least one encoder. In addition, in order to extract the element values of the text word segments, an index function is preset in the target word library model. Since the index function can return the value of an element in a table or array, in step S201 the element value of each text word segment is extracted by the index function in array form. The element value of a text word segment comprises the index values of its row number and column number, so a text word segment at a specified position can be obtained by looking up its row number and column number through the index function. In step S202, the row number and column number of each text word segment are looked up through the index function, and each text word segment in the original text is traversed to generate a position sequence table of the text word segments; the position sequence table reflects the correspondence between a text word segment and its row and column numbers (element values). The target position of each text word segment is determined according to its element value, so position recognition can be performed accurately on every text word segment.
It should be noted that other functions may also be preset in the target lexicon model to extract the element values of the text word segments, and are not limited to index functions.
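A minimal sketch of one way the position sequence table described above could be built is shown below; the word-library layout, the index_lookup helper and the sample segments are hypothetical stand-ins for the preset index function.

```python
# Illustrative sketch of position recognition (S103): building a position
# sequence table that maps each text word segment to its (row, column)
# element value in a word-library table. The table layout is hypothetical.
word_library = [
    ["文本", "相似", "计算"],   # row 0 of the lexicon table
    ["自然", "语言", "处理"],   # row 1
]

def index_lookup(segment):
    """Hypothetical stand-in for the preset index function: return the
    (row, column) element value of a segment, or None if absent."""
    for row, words in enumerate(word_library):
        if segment in words:
            return (row, words.index(segment))
    return None

text_segments = ["自然", "语言", "文本"]
position_table = {seg: index_lookup(seg) for seg in text_segments}
print(position_table)  # {'自然': (1, 0), '语言': (1, 1), '文本': (0, 0)}
```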
Referring to fig. 3, in some embodiments, before step S103, the method further includes pre-training the target lexicon model, which may specifically include, but is not limited to, step S301 to step S305:
step S301, acquiring reference text data;
step S302, performing word segmentation processing on the reference text data by using an initial word segmentation model to obtain reference word segment data;
step S303, dividing the reference word segment data into a training set, a test set and a verification set according to a preset proportion;
step S304, training the initial model by using a training set to obtain a current word stock model;
and S305, verifying the current word stock model by using the test set and the verification set to obtain a target word stock model.
Specifically, in step S301, the reference text data may be obtained by a web crawler or the like; in step S302, the initial word segmentation model may be a Jieba word segmenter, and the Jieba word segmenter is used to perform word segmentation processing on the reference text data to obtain reference word segment data. Specifically, the process of performing the word segmentation on the reference text data is substantially the same as the process of performing the word segmentation on the original text in step S102, and is not described herein again.
Further, step S303 may be performed to divide the reference field data into a training set, a verification set, and a test set according to a preset ratio. It should be noted that the preset proportion may be set according to actual requirements, for example, the reference word segment data are divided into a training set, a verification set and a test set according to a proportion of 7:2:1, and the reference word segment data of the three data sets are labeled to obtain the reference word segment data containing the label.
Specifically, after step S303 is executed, step S304 may be executed to input reference field data of the training set into an initial model, and perform model training on the initial model to obtain a current lexicon model, where the initial model includes an encoder.
Finally, step S305 is executed: the reference word segment data of the verification set is input into the current lexicon model, the accuracy convergence on the verification set is observed, whether the current lexicon model is overfitting is monitored, and the model parameters of the current lexicon model are adjusted. Further, step S305 includes: verifying the current lexicon model by using the test set and the verification set to obtain an optimal lexicon model, and optimizing the optimal lexicon model to obtain the target lexicon model. In step S305, the comprehensive index MAP of the lexicon models at different times is compared, the current lexicon model with the highest MAP index is selected as the optimal lexicon model, and the optimal lexicon model is further optimized to obtain the target lexicon model; the target lexicon model is used to perform position recognition on each text word segment to obtain its target position, which can improve recognition accuracy.
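The following sketch illustrates the 7:2:1 split of the reference word segment data described above; the helper name, random seed and list-based data are assumptions for illustration.

```python
# Illustrative sketch of step S303: dividing reference word segment data into
# a training set, a test set and a verification set by a preset proportion.
import random

def split_dataset(reference_segments, ratios=(0.7, 0.2, 0.1), seed=42):
    data = list(reference_segments)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * ratios[0])
    n_test = int(n * ratios[1])
    train_set = data[:n_train]
    test_set = data[n_train:n_train + n_test]
    validation_set = data[n_train + n_test:]
    return train_set, test_set, validation_set

train_set, test_set, validation_set = split_dataset(["词段%d" % i for i in range(100)])
print(len(train_set), len(test_set), len(validation_set))  # 70 20 10
```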
Referring to fig. 4, in some embodiments, step S104 may include, but is not limited to, step S401 to step S402:
step S401, according to the target position, normalization processing is carried out on each text word segment to obtain a standard word segment;
and S402, carrying out one-hot coding on the standard word segments to obtain text word segment vectors.
Specifically, the target position is an index position of the text word segment. In step S401 of some embodiments, each text word segment is extracted from the original text according to its index position, and each text word segment is linearly scaled to [ -1,1], or each text word segment is scaled to mean 0 and variance 1, so as to implement normalization processing on each text word segment to obtain a standard word segment.
It should be noted that one-hot encoding is also called one-bit effective encoding. It uses an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is active at any time.
In step S402 of some embodiments, each standard word segment is expressed in vector form through one-hot encoding, resulting in a plurality of text word segment vectors. For example, assuming that an original text consists of 3 text word segments, the index positions of these 3 text word segments can be obtained through the foregoing steps. One-hot encoding represents each text word segment with a vector of length V, where V is the number of dictionary words corresponding to the text word segments in the target word library model. The vector marks the index position of the text word segment in the original text as 1 and all other positions as 0; if a sentence consists of 3 text word segments, its representation contains three 1s, whose positions correspond to the index positions of the text word segments.
Through the step S104 in the above embodiment, each text segment can be encoded more conveniently according to the target position to obtain a text segment vector, so as to obtain a target word embedded vector through the text segment vector.
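As an illustration of steps S401 and S402, the sketch below one-hot encodes three text word segments; the dictionary size V and the index positions are hypothetical.

```python
# Illustrative sketch of one-hot encoding (S402): one vector of length V per
# text word segment, with a single 1 at the segment's index position.
import numpy as np

V = 8                          # hypothetical dictionary size of the target word library
segment_indices = [2, 5, 7]    # hypothetical index positions of three segments

one_hot_vectors = np.zeros((len(segment_indices), V), dtype=np.float32)
for row, idx in enumerate(segment_indices):
    one_hot_vectors[row, idx] = 1.0
print(one_hot_vectors)
```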
Referring to fig. 5, in some embodiments, step S105 may further include, but is not limited to, step S501 to step S502:
step S501, inputting the text word segment vector into a comparison model so as to enable the text word segment vector to be subjected to matrix multiplication with a reference word embedding matrix to obtain a plurality of basic word embedding vectors;
step S502, mapping the basic word embedded vector to obtain a target word embedded vector.
Specifically, step S501 is executed after the comparison model has been trained, so the values of the reference word embedding matrix in the comparison model are completely fixed, as are the other model parameters of the comparison model. Therefore, when the text word segment vectors are input into the comparison model, each text word segment vector can be matrix-multiplied by the fixed reference word embedding matrix to obtain a basic word embedding vector.
Further, step S502 may be executed to perform mapping processing on the basic word embedding vectors by using the fixed MLP network in the comparison model to obtain the target word embedding vectors. The MLP network comprises a linear layer, a ReLU activation function, and another linear layer.
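A minimal sketch of step S105 under the above description is given below: one-hot segment vectors are multiplied by a fixed reference word embedding matrix and the result is mapped with a linear-ReLU-linear MLP. The dimensions and randomly initialized weights are illustrative assumptions, not the trained model.

```python
# Illustrative sketch of S105: matrix multiplication with the reference word
# embedding matrix, then mapping through a small MLP.
import torch
import torch.nn as nn

V, d = 8, 16                               # hypothetical vocabulary size / embedding dim
reference_embedding = torch.randn(V, d)    # fixed after contrastive training
mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

one_hot_vectors = torch.eye(V)[[2, 5, 7]]  # three hypothetical segment vectors
base_embeddings = one_hot_vectors @ reference_embedding   # basic word embedding vectors
target_embeddings = mlp(base_embeddings)                  # target word embedding vectors
print(target_embeddings.shape)             # torch.Size([3, 16])
```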
Referring to fig. 6, in some embodiments, before step S105, the method further includes training a comparison model, which may specifically include, but is not limited to, steps S601 to S605:
step S601, sample data is obtained;
step S602, performing data enhancement processing on the sample data to obtain a true case pair;
step S603, inputting the positive example pair into a comparison learning model;
step S604, calculating a first similarity of a positive example pair and a second similarity of a negative example pair by comparing loss functions of the learning models;
step S605, optimizing the loss function of the comparative learning model according to the first similarity and the second similarity to update the comparative learning model.
Specifically, first, sample data is mapped to an embedding space, and vector representation is performed on the sample data, so that initial embedded data (i.e., initial embedding data) can be obtained, where the initial embedded data includes positive sample data and negative sample data.
In step S602 of some embodiments, data enhancement processing is performed on the initial embedded data through a dropout mask mechanism; the dropout mask mechanism replaces traditional data enhancement methods, that is, the two vectors obtained by feeding the same sample data twice into an encoder with dropout are used as a positive example pair for contrastive learning, which works well in practice. It can be understood that the dropout mask, as a kind of random network mechanism, applies a mask to the model parameters W and plays a role in preventing overfitting.
In a batch, data obtained through data enhancement processing (i.e. a first vector and a second vector) is a positive example pair, and other data which is not subjected to data enhancement is a negative example pair. In this embodiment of the present application, a positive example pair may be obtained by performing data enhancement processing on a part of initial embedded data in one batch, and another part of the initial embedded data may be used as a negative example pair.
In some embodiments, the positive case pairs are generated by randomly sampling the dropout mask.
In some specific application scenarios, in the contrastive learning stage, a typical in-batch contrastive learning method is adopted and data enhancement processing is performed inside the batch, that is, the complete initial embedding data that has been obtained is subjected to data enhancement processing so that the two samples of a positive example pair (the first sample data and the second sample data) differ. In the embodiment of the application, dropout is taken directly as the data enhancement, that is, a positive example pair is generated by randomly sampling dropout masks; more specifically, the same sample is fed twice into the same encoder with dropout (that is, the same first sample data and second sample data are respectively input into the dropout encoder for data enhancement processing), so that two different representation vectors x (the first vector) and x' (the second vector) are obtained, and the first vector and the second vector are taken as a positive example pair &lt;x, x'&gt;. In practice, the sentence vectors of &lt;x, x'&gt; are not identical, but because the input sentences are identical, the semantics of the final sentence vectors are expected to be the same, so the model is made to pull them closer together as a positive example pair.
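The following sketch illustrates the dropout-based construction of a positive example pair described above, feeding the same sample twice through an encoder that keeps dropout active; the tiny encoder and dimensions are placeholders, not the patent's actual network.

```python
# Illustrative sketch of S602: two forward passes of the same sample through
# an encoder with dropout yield two different vectors that form a positive pair.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(16, 16)
)
encoder.train()                    # keep dropout active so the two passes differ

sample = torch.randn(1, 16)        # hypothetical embedded sample
x_first = encoder(sample)          # first vector x
x_second = encoder(sample)         # second vector x', differs because the dropout masks differ
positive_pair = (x_first, x_second)
```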
In step S604 of some embodiments, the first similarity and the second similarity are both cosine similarities, and optimizing the loss function of the contrastive learning model according to the first similarity and the second similarity may include, but is not limited to:
maximizing the first similarity towards a first value and minimizing the second similarity towards a second value to optimize the loss function, where the first similarity is the numerator of the loss function, the first similarity and the second similarity together form the denominator of the loss function, the first value is 1, and the second value is 0. In the loss function, the numerator is the first similarity corresponding to the positive example pair, the denominator is the first similarity plus the second similarities of all the negative example pairs, and the resulting fraction is wrapped in -log(); minimizing the loss function can therefore be achieved by maximizing the numerator and minimizing the denominator. In this embodiment, the loss function (info loss) is minimized by maximizing the numerator and minimizing the denominator, i.e., maximizing the first similarity of the positive example pair and minimizing the second similarity of the negative example pairs, thereby optimizing the loss function. More specifically, the loss function is shown in equation (1):
L_N = -log( exp(f(x)^T · f(x^+)) / ( exp(f(x)^T · f(x^+)) + Σ_{j=1}^{N-1} exp(f(x)^T · f(x_j^-)) ) )    Equation (1)
where f(x)^T is the transpose of f(x), f(x) is the original sample, f(x^+) is a positive example sample, and f(x_j^-) is a single negative example sample over which the sum accumulates all negative examples, so the denominator term comprises one positive example sample and N-1 negative example samples;
the loss function represents the loss of sample N. In the loss function, the numerator is the similarity of the positive example pair and the denominator contains the similarity of the positive example pair and of all the negative example pairs; the resulting fraction is wrapped in -log(), so the loss function can be minimized by maximizing the numerator and minimizing the denominator.
Note that the similarity of the positive example pair (the first similarity) and the similarity of the negative example pair (the second similarity) satisfy the condition:
Score(f(x), f(x^+)) >> Score(f(x), f(x^-))    Equation (2)
As can be seen from the above formula, the method needs to satisfy: the similarity of the positive example pair is far greater than the similarity of the negative example pair, where x^+ refers to data similar to x, i.e., positive sample pair data, and x^- refers to data dissimilar to x, i.e., negative sample pair data; f(x^+) is a positive example sample and f(x^-) is a negative example sample.
Further, the preset metric functions are:
Score(f(x), f(x^+)) = f(x)^T · f(x^+)    Equation (3)
Score(f(x), f(x^-)) = f(x)^T · f(x^-)    Equation (4)
where Score is a metric function used to evaluate the similarity between two features; here the preset metric function uses the dot product as the score function.
Specifically, in step S605 of some embodiments, optimizing the loss function of the contrastive learning model according to the first similarity and the second similarity may include, but is not limited to:
and performing back propagation according to the loss function, and updating the loss parameters of the loss function so as to optimize the loss function.
In the embodiment of the application, back propagation is performed according to the loss function, so that the contrastive learning model is updated by optimizing the loss function and the internal parameters (i.e., the loss parameters) of the contrastive learning model are updated. It is to be understood that a conventional back propagation principle may be applied here, and the embodiments of the present application are not limited in this respect.
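As an illustration of equation (1) and the back-propagation update described above, the sketch below computes a contrastive loss whose numerator is the positive-pair score and whose denominator accumulates the positive pair and the in-batch negatives; the temperature scaling, tensor shapes and random data are added assumptions rather than the patent's exact formulation.

```python
# Illustrative sketch of the contrastive (info) loss of equation (1) and the
# back-propagation step that updates the model parameters.
import torch
import torch.nn.functional as F

def info_loss(anchor, positive, negatives, tau=0.05):
    # anchor, positive: (d,) ; negatives: (N-1, d)
    candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)  # (N, d)
    scores = candidates @ anchor / tau                                 # dot-product scores
    # -log( exp(score_pos) / sum_j exp(score_j) ) == cross-entropy with label 0
    return F.cross_entropy(scores.unsqueeze(0), torch.zeros(1, dtype=torch.long))

anchor = torch.randn(16, requires_grad=True)        # f(x)
positive = anchor + 0.01 * torch.randn(16)          # f(x^+), close to the anchor
negatives = torch.randn(7, 16)                      # N-1 in-batch negatives

loss = info_loss(anchor, positive, negatives)
loss.backward()   # back propagation: gradients then drive the parameter update
```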
In some embodiments, step S106 may include, but is not limited to including, the steps of:
and performing similarity calculation on the target word embedding vectors by using a cosine similarity calculation method to obtain a similarity value between every two text word segments.
Specifically, when calculating the similarity value between every two text segments, assuming that the target embedding vector of one text segment is u and the target embedding vector of the other text segment is v, the similarity value of the two text segments is calculated according to the formula of the cosine similarity algorithm (as shown in formula 5).
cos(u, v) = (u · v) / (||u|| × ||v||)    Equation (5)
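A minimal sketch of the cosine similarity of equation (5) between two target word embedding vectors follows; the vectors are hypothetical.

```python
# Illustrative sketch of step S106: cosine similarity between two target word
# embedding vectors u and v.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([0.2, 0.7, 0.1])
v = np.array([0.3, 0.6, 0.2])
print(cosine_similarity(u, v))
```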
According to the method, the original text to be calculated is obtained, and the original text is subjected to word segmentation processing by utilizing the pre-trained text word segmentation model to obtain a plurality of text word segments, so that the original text can be segmented, and the required text word segments can be extracted more conveniently; and then, position recognition is carried out on each text word segment by utilizing a pre-trained target word bank model to obtain the target position of each text word segment, and the position recognition can be carried out on each text word segment more accurately. Therefore, each text word segment can be further encoded according to the target position to obtain a text word segment vector, the text word segment vector is input into a pre-trained comparison model, the text word segment vector is subjected to matrix multiplication with a reference word embedding matrix in the comparison model to obtain a target word embedding vector, finally, text similarity calculation is carried out on a plurality of target word embedding vectors to obtain a similarity value between every two text word segments, the problem of uniform distribution of the word segment vectors can be effectively solved through the comparison model, and therefore the accuracy of similarity calculation is improved, and the method can more accurately determine the correlation between the text word segments.
Referring to fig. 7, an embodiment of the present application further provides a text similarity calculation apparatus, which can implement the text similarity calculation method, and the apparatus includes:
a text obtaining module 701, configured to obtain an original text to be calculated;
a word segmentation module 702, configured to perform word segmentation processing on an original text by using a pre-trained text word segmentation model to obtain a plurality of text word segments;
the position recognition module 703 is configured to perform position recognition on each text word by using a pre-trained target word library model to obtain a target position of each text word;
the encoding module 704 is configured to perform encoding processing on each text segment according to the target position to obtain a text segment vector;
the comparison module 705 is configured to input the text word segment vector into a pre-trained comparison model, so that the text word segment vector and a reference word embedding matrix in the comparison model are subjected to matrix multiplication to obtain a target word embedding vector;
and the similarity calculation module 706 is configured to perform similarity calculation on the multiple target word embedding vectors to obtain a similarity value between each two text word segments.
The specific implementation of the text similarity calculation apparatus is basically the same as the specific implementation of the text similarity calculation method, and is not described herein again.
An embodiment of the present application further provides an electronic device, where the electronic device includes: the text similarity calculation method comprises a memory, a processor, a program stored on the memory and capable of running on the processor, and a data bus for realizing connection communication between the processor and the memory, wherein the program realizes the text similarity calculation method when being executed by the processor. The electronic equipment can be any intelligent terminal including a tablet computer, a vehicle-mounted computer and the like.
Referring to fig. 8, fig. 8 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 801 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present application;
the memory 802 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 802 may store an operating system and other application programs; when the technical solution provided by the embodiments of the present application is implemented by software or firmware, the relevant program codes are stored in the memory 802 and called by the processor 801 to execute the text similarity calculation method of the embodiments of the present application;
an input/output interface 803 for realizing information input and output;
the communication interface 804 is used for realizing communication interaction between the device and other devices, and can realize communication in a wired manner (such as USB, network cable, and the like) or in a wireless manner (such as mobile network, WIFI, bluetooth, and the like); and
a bus 805 that transfers information between the various components of the device (e.g., the processor 801, the memory 802, the input/output interface 803, and the communication interface 804);
wherein the processor 801, the memory 802, the input/output interface 803 and the communication interface 804 are communicatively connected to each other within the device via a bus 805.
Embodiments of the present application also provide a computer-readable storage medium, where one or more programs are stored in the computer-readable storage medium, and the one or more programs are executable by one or more processors to implement the above text similarity calculation method.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described above are intended to illustrate the technical solutions of the embodiments of the present application more clearly and do not constitute a limitation on the technical solutions provided in the embodiments of the present application; it is obvious to those skilled in the art that, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-6 are not intended to limit the embodiments of the present application and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described apparatus embodiments are merely illustrative, and the units illustrated as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A text similarity calculation method, characterized in that the method comprises:
acquiring an original text to be calculated;
performing word segmentation processing on the original text by using a pre-trained text word segmentation model to obtain a plurality of text word segments;
carrying out position identification on each text word segment by using a pre-trained target word bank model to obtain a target position of each text word segment;
coding each text word segment according to the target position to obtain a text word segment vector;
inputting the text word segment vector into a pre-trained comparison model so as to perform matrix multiplication on the text word segment vector and a reference word embedding matrix in the comparison model to obtain a target word embedding vector;
and performing similarity calculation on the target word embedding vectors to obtain a similarity value between every two text word segments.
2. The text similarity calculation method according to claim 1, wherein the step of performing position recognition on each text word segment by using a pre-trained target word bank model to obtain a target position of each text word segment comprises:
extracting elements of each text word segment by using a preset function in the target word bank model to obtain an element value of each text word segment;
and carrying out position identification on the text word segment according to the element value to obtain the target position of the text word segment.
3. The text similarity calculation method according to claim 1, wherein before the step of performing position recognition on each text word segment by using a pre-trained target word bank model to obtain a target position of each text word segment, the method further comprises pre-training the target word bank model, and specifically comprises:
acquiring reference text data;
performing word segmentation processing on the reference text data by using an initial word segmentation model to obtain reference word segment data;
dividing the reference word segment data into a training set, a test set and a verification set according to a preset proportion;
training the initial model by using the training set to obtain a current word bank model;
and verifying the current word bank model by using the test set and the verification set to obtain the target word bank model.
4. The text similarity calculation method according to claim 1, wherein the step of encoding each text word segment according to the target position to obtain a text word segment vector comprises:
according to the target position, performing normalization processing on each text word segment to obtain a standard word segment;
and carrying out one-hot coding on the standard word segment to obtain a text word segment vector.
5. The text similarity calculation method according to claim 1, wherein the step of inputting the text word segment vector into a pre-trained comparison model to perform matrix multiplication on the text word segment vector and a reference word embedding matrix in the comparison model to obtain a target word embedding vector comprises:
inputting the text word segment vectors into the comparison model so as to enable the text word segment vectors to be subjected to matrix multiplication with the reference word embedding matrix to obtain a plurality of basic word embedding vectors;
and mapping the basic word embedded vector to obtain a target word embedded vector.
6. The text similarity calculation method according to claim 1, wherein before the step of inputting the text word segment vector into a pre-trained comparison model to perform matrix multiplication on the text word segment vector and a reference word embedding matrix in the comparison model to obtain a target word embedding vector, the method further comprises training the comparison model, specifically comprising:
acquiring sample data;
performing data enhancement processing on the sample data to obtain a positive example pair;
inputting the positive example pair into the comparison learning model;
calculating a first similarity of the positive example pair and a second similarity of the negative example pair through a loss function of the comparison learning model;
and optimizing a loss function of the comparison learning model according to the first similarity and the second similarity so as to update the comparison learning model.
7. The text similarity calculation method according to any one of claims 1 to 6, wherein the step of performing similarity calculation on the target word embedding vectors to obtain a similarity value between every two text word segments comprises:
and performing similarity calculation on the target word embedding vectors by using a cosine similarity calculation method to obtain a similarity value between every two text word segments.
8. A text similarity calculation apparatus, characterized in that the apparatus comprises:
the text acquisition module is used for acquiring an original text to be calculated;
the word segmentation module is used for carrying out word segmentation processing on the original text by utilizing a pre-trained text word segmentation model to obtain a plurality of text word segments;
the position identification module is used for carrying out position identification on each text word segment by utilizing a pre-trained target word bank model to obtain a target position of each text word segment;
the encoding module is used for encoding each text word segment according to the target position to obtain a text word segment vector;
the comparison module is used for inputting the text word segment vector into a pre-trained comparison model so as to enable the text word segment vector to be subjected to matrix multiplication with a reference word embedding matrix in the comparison model to obtain a target word embedding vector;
and the similarity calculation module is used for performing similarity calculation on the target word embedding vectors to obtain a similarity value between every two text word segments.
9. An electronic device, characterized in that the electronic device comprises a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for realizing connection communication between the processor and the memory, the program, when executed by the processor, realizing the steps of the text similarity calculation method according to any one of claims 1 to 7.
10. A computer-readable storage medium for computer-readable storage, characterized in that the computer-readable storage medium stores one or more programs which are executable by one or more processors to implement the steps of the text similarity calculation method according to any one of claims 1 to 7.
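To make the data preparation of claim 3 concrete, the following Python sketch divides reference word segment data into a training set, a test set and a verification set. The 8:1:1 ratio and the shuffling are illustrative assumptions only, since the claim requires no more than "a preset proportion".

import random

def split_reference_segments(reference_segments, ratios=(0.8, 0.1, 0.1), seed=0):
    # divide the reference word segment data according to a preset proportion
    data = list(reference_segments)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * ratios[0])
    n_test = int(len(data) * ratios[1])
    training_set = data[:n_train]
    test_set = data[n_train:n_train + n_test]
    verification_set = data[n_train + n_test:]
    return training_set, test_set, verification_set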
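Claim 6 trains the comparison model with a positive example pair and a loss that contrasts a first similarity (of the positive pair) against a second similarity (of negative pairs). One common way to realize such an objective is an InfoNCE-style contrastive loss; the use of in-batch negatives and the temperature value in the sketch below are assumptions of this illustration, not requirements of the claim.

import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    first_similarity = cos(anchor, positive)                    # similarity of the positive example pair
    second_similarities = [cos(anchor, n) for n in negatives]   # similarities of the negative example pairs
    logits = np.array([first_similarity] + second_similarities) / temperature
    # cross-entropy against the positive example; minimizing this loss updates the comparison model
    return float(np.log(np.exp(logits).sum()) - logits[0])

Minimizing such a loss pulls the positive pair together and pushes negative pairs apart, which is one way to counteract the uneven distribution of word segment vectors discussed in the description.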
CN202111210677.4A 2021-10-18 2021-10-18 Text similarity calculation method and device, electronic equipment and storage medium Pending CN113887215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111210677.4A CN113887215A (en) 2021-10-18 2021-10-18 Text similarity calculation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111210677.4A CN113887215A (en) 2021-10-18 2021-10-18 Text similarity calculation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113887215A true CN113887215A (en) 2022-01-04

Family

ID=79003445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111210677.4A Pending CN113887215A (en) 2021-10-18 2021-10-18 Text similarity calculation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113887215A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023134088A1 (en) * 2022-01-11 2023-07-20 平安科技(深圳)有限公司 Video summary generation method and apparatus, electronic device, and storage medium
CN114359810A (en) * 2022-01-11 2022-04-15 平安科技(深圳)有限公司 Video abstract generation method and device, electronic equipment and storage medium
CN114359810B (en) * 2022-01-11 2024-06-28 平安科技(深圳)有限公司 Video abstract generation method and device, electronic equipment and storage medium
WO2023134084A1 (en) * 2022-01-11 2023-07-20 平安科技(深圳)有限公司 Multi-label identification method and apparatus, electronic device, and storage medium
CN114358210A (en) * 2022-01-14 2022-04-15 平安科技(深圳)有限公司 Text similarity calculation method and device, computer equipment and storage medium
WO2023134549A1 (en) * 2022-01-14 2023-07-20 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN114154514A (en) * 2022-02-07 2022-03-08 北京三子健康科技有限公司 Traditional Chinese medicine syndrome type identification method and system
CN114648631A (en) * 2022-03-22 2022-06-21 平安科技(深圳)有限公司 Image description generation method and device, electronic equipment and storage medium
CN114528015A (en) * 2022-04-24 2022-05-24 湖南泛联新安信息科技有限公司 Method for analyzing homology of binary executable file, computer device and storage medium
CN114880452A (en) * 2022-05-25 2022-08-09 重庆大学 Text retrieval method based on multi-view contrast learning
CN115169321B (en) * 2022-09-06 2022-12-23 北京国电通网络技术有限公司 Logistics content text checking method and device, electronic equipment and computer medium
CN115169321A (en) * 2022-09-06 2022-10-11 北京国电通网络技术有限公司 Logistics content text checking method and device, electronic equipment and computer medium
CN115357690A (en) * 2022-10-19 2022-11-18 有米科技股份有限公司 Text repetition removing method and device based on text mode self-supervision

Similar Documents

Publication Publication Date Title
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
CN113887215A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN114358007A (en) Multi-label identification method and device, electronic equipment and storage medium
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN115641834A (en) Voice synthesis method and device, electronic equipment and storage medium
CN114722069A (en) Language conversion method and device, electronic equipment and storage medium
CN114626097A (en) Desensitization method, desensitization device, electronic apparatus, and storage medium
CN113849661A (en) Entity embedded data extraction method and device, electronic equipment and storage medium
CN114240552A (en) Product recommendation method, device, equipment and medium based on deep clustering algorithm
CN114358201A (en) Text-based emotion classification method and device, computer equipment and storage medium
KR20220076419A (en) Method for utilizing deep learning based semantic role analysis
CN114064894A (en) Text processing method and device, electronic equipment and storage medium
CN114519356A (en) Target word detection method and device, electronic equipment and storage medium
CN115510232A (en) Text sentence classification method and classification device, electronic equipment and storage medium
CN114358020A (en) Disease part identification method and device, electronic device and storage medium
CN114637847A (en) Model training method, text classification method and device, equipment and medium
CN114613462A (en) Medical data processing method and device, electronic equipment and storage medium
CN114490949A (en) Document retrieval method, device, equipment and medium based on BM25 algorithm
CN116719999A (en) Text similarity detection method and device, electronic equipment and storage medium
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
CN114492437B (en) Keyword recognition method and device, electronic equipment and storage medium
CN114936274A (en) Model training method, dialogue generating device, dialogue training equipment and storage medium
CN114998041A (en) Method and device for training claim settlement prediction model, electronic equipment and storage medium
CN114090778A (en) Retrieval method and device based on knowledge anchor point, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40062553
Country of ref document: HK