CN111382563A - Text relevance determining method and device - Google Patents


Info

Publication number
CN111382563A
CN111382563A
Authority
CN
China
Prior art keywords
text
vector
word
vectors
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010201255.XA
Other languages
Chinese (zh)
Other versions
CN111382563B (en)
Inventor
王皓
周宇超
康斌
高雪峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010201255.XA
Publication of CN111382563A
Application granted
Publication of CN111382563B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for determining text relevance. The method comprises: obtaining at least two text vector models; vector-encoding a first text and a second text through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text; determining the similarity of the first text and the second text based on the first text vector and the second text vector obtained by each text vector model, to obtain at least two similarities; and determining the relevance of the first text and the second text according to the obtained at least two similarities. With the method and the device, the relevance of two texts can be determined more accurately.

Description

Text relevance determining method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for determining text relevance.
Background
Artificial intelligence (AI) refers to theories, methods, techniques and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Natural Language Processing (NLP) is an important direction in the field of artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Text processing is one of the important technologies included in natural language processing technology.
Text is a common medium for conveying information. In application scenarios such as recommending articles of interest to a user, analyzing the quality of article comments, and ranking article comments, the relevance between two texts needs to be analyzed in order to improve the user experience.
In the related art, commonly used relevance determination methods include: computing similarity from the Term Frequency-Inverse Document Frequency (TF-IDF) vectors of the texts; computing similarity from the topic distribution vectors of the texts; and computing similarity from a vector representation built from the word embedding of each word in the text. However, the vectors used in these methods cannot accurately characterize the text, so the relevance determined for two texts is not accurate enough.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining text relevance, which can more accurately determine the relevance of two texts.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a method for determining text relevance, which comprises the following steps:
obtaining at least two text vector models;
respectively carrying out vector coding on a first text and a second text through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text;
determining the similarity of the first text and the second text based on a first text vector and a second text vector obtained by each text vector model respectively to obtain at least two similarities;
and determining the relevance of the first text and the second text according to the obtained at least two similarities.
In the foregoing solution, the determining the similarity between the first text and the second text based on the first text vector and the second text vector obtained by each text vector model includes:
respectively obtaining cosine values of included angles between the first text vector and the second text vector obtained by each text vector model;
and taking the cosine value of the included angle as the similarity of the first text and the second text.
An embodiment of the present invention provides a device for determining text relevance, including:
the acquisition module is used for acquiring at least two text vector models;
the encoding module is used for respectively carrying out vector encoding on the first text and the second text through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text;
a first determining module, configured to determine similarity between the first text and the second text based on a first text vector and a second text vector obtained by each text vector model, respectively, so as to obtain at least two similarities;
and the second determining module is used for determining the correlation between the first text and the second text according to the obtained at least two similarities.
In the above scheme, the at least two obtained text vector models include: a first text vector model and a second text vector model,
the encoding module is further configured to process, through the first text vector model, context word vectors of words in the first text and the second text and paragraph vectors of paragraphs where the words are located, respectively, to obtain word feature vectors of the words in the first text and word feature vectors of the words in the second text;
generating a first text vector for representing the first text and a second text vector for representing the second text according to the word feature vector of each word in the first text and the word feature vector of each word in the second text respectively;
and respectively coding the word sequence of the first text and the word sequence of the second text by using a coder in the second text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text.
In the above scheme, the encoding module is further configured to perform vector encoding on the target article through each text vector model, so as to obtain a first text vector for representing the target article;
carrying out vector coding on the comments corresponding to the target article through each text vector model to obtain a second text vector for representing the comments;
the second determining module is further configured to determine a relevance between the target article and the comments according to the obtained at least two similarities, to perform priority ranking on the at least two comments corresponding to the target article according to the relevance to obtain a priority order, and to present the at least two comments corresponding to the target article according to the priority order.
In the foregoing solution, the encoding module is further configured to perform the following operations on each word in the first text and the second text, respectively:
obtaining word vectors of all words in the paragraph where the words are located to obtain a word vector set of the paragraph;
dividing the set of word vectors into a first subset and a second subset;
taking the word vectors in the first subset as context word vectors of the words, and generating paragraph vectors of the paragraphs according to the word vectors in the second subset;
and generating a word feature vector of the word according to the context word vector of the word and the paragraph vector of the paragraph.
In the above scheme, the encoding module is further configured to obtain a word frequency of each word in the first text and the second text in the corresponding text;
determining the modulus length of the word feature vector of each word according to the word frequency;
and respectively obtaining the word feature vectors of the words in the first text and the word feature vectors of the words in the second text according to the context word vectors of the words in the first text and the second text, the paragraph vectors of the paragraphs where the words are located, and the modulus lengths of the word feature vectors of the words.
In the above scheme, the encoding module is further configured to split the first text and the second text respectively by taking a sentence as a unit to obtain at least two sentences included in the first text and at least two sentences included in the second text;
respectively coding the obtained word sequence corresponding to each sentence through a coder in the second text vector model to obtain a sentence vector for representing each sentence;
combining sentence vectors of at least two sentences included in the first text to obtain a first text vector for representing the first text,
and combining sentence vectors of at least two sentences included in the second text to obtain a second text vector for representing the second text.
In the foregoing solution, the second text vector model further includes a decoder, and the apparatus further includes:
the first training module is used for acquiring a word sequence of a sample text;
coding the word sequence of the sample text through the coder to obtain a text vector of the sample text;
decoding the text vector of the sample text through the decoder to obtain an output word sequence;
and acquiring the difference between the output word sequence and the word sequence of the sample text, and updating the model parameters of the second text vector model based on the difference.
In the above solution, the second text vector model further includes a decoder and a variational layer, and the apparatus further includes:
the second training module is used for acquiring a word sequence of the sample text;
coding the word sequence of the sample text through the coder to obtain a text vector of the sample text;
generating, through the variational layer, a normal distribution corresponding to the text vector of the sample text;
sampling the normal distribution to obtain a sampling vector;
decoding the sampling vector through the decoder to obtain an output word sequence;
and acquiring the difference between the output word sequence and the word sequence of the sample text, and updating the model parameters of the second text vector model based on the difference.
In the above scheme, the first determining module is further configured to obtain cosine values of included angles between the first text vector and the second text vector obtained by each text vector model;
and taking the cosine value of the included angle as the similarity of the first text and the second text.
In the foregoing solution, the second determining module is further configured to determine an average value of the obtained at least two similarities;
and characterizing the relevance of the first text and the second text through the average value.
An embodiment of the present invention provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the method provided by the embodiment of the invention when executing the executable instructions stored in the memory.
Embodiments of the present invention provide a storage medium storing executable instructions for causing a processor to execute the method provided by the embodiments of the present invention.
The embodiment of the invention has the following beneficial effects:
respectively carrying out vector coding on a first text and a second text through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text; determining the similarity of the first text and the second text based on a first text vector and a second text vector obtained by each text vector model respectively to obtain at least two similarities; determining the relevance of the first text and the second text according to the obtained at least two similarities; therefore, the text is coded through at least two text vector models, and text vectors used for representing different characteristics of the text can be obtained, so that the loss of information is reduced, and the accuracy of the determined correlation is improved.
Drawings
FIG. 1 is a schematic diagram of an interface for article reviews provided by an embodiment of the invention;
FIG. 2 is an alternative architectural diagram of a text relevance determination system 100 provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 4 is an alternative flow chart of a method for determining text relevance according to an embodiment of the present invention;
FIG. 5 is an alternative structural diagram of a first text vector model provided by an embodiment of the invention;
FIG. 6 is an alternative structural diagram of a second text vector model provided by an embodiment of the present invention;
FIG. 7 is an alternative structural diagram of a second text vector model provided by an embodiment of the invention;
FIG. 8 is an alternative flow chart of a method for determining text relevance according to an embodiment of the present invention;
FIG. 9 is an alternative structural diagram of a seq2seq model provided by an embodiment of the present invention;
FIG. 10 is an alternative structural diagram of the apparatus for determining text relevance according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Reference to the terms "first/second/third" merely distinguishes similar objects and does not denote a particular ordering of the objects; it is understood that "first/second/third" may be interchanged in a particular order or sequence where permitted, so that the embodiments of the invention described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.
1) Relevance: the degree of association between two texts, i.e., the extent to which the two texts share a topic or are causally related.
2) Paragraph: a basic unit of an article, marked visually by a line break. Dividing an article into paragraphs gives readers a clearer visual impression, makes the article easier to read and understand, and helps the author organize what is to be expressed.
3) Modulus length: the length of a vector; for an n-dimensional vector (x₁, x₂, …, xₙ), the modulus length is √(x₁² + x₂² + … + xₙ²).
In application scenarios such as comment quality analysis and comment ranking, determining text relevance is very important. FIG. 1 is a schematic interface diagram of article comments provided in an embodiment of the present invention. Referring to FIG. 1, in a news application (App) or a web portal there are usually multiple comments under an article. If the comments are sorted only by factors such as time and number of likes, comments that are irrelevant to the article and of relatively low quality may appear at the top of the comment area, so that the user rates the quality of the App or the portal's comment area relatively low, which affects the use experience. If the relevance between each comment and the article can be obtained and taken into account when ranking the comments, the user experience can be improved.
To implement the determination of text relevance, the following methods are provided in the related art:
1) text relevance determining method based on TF-IDF
The TF-IDF vector of a text can be obtained by calculating, for each word in the text, its term frequency (TF) in the current text and its document frequency (DF) across the whole corpus. Cosine similarity is then computed between the TF-IDF vectors of the two input texts to obtain their relevance.
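For illustration, this baseline can be sketched in a few lines of Python. This is a minimal sketch under assumptions not stated in the patent (scikit-learn as the library, a toy two-text corpus), not the patented method itself:

```python
# Sketch of related-art method 1): TF-IDF vectors + cosine similarity.
# Library choice (scikit-learn) and the corpus are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the marathon race was a complete success",
          "warmly celebrate the successful marathon"]

tfidf = TfidfVectorizer().fit_transform(corpus)      # one TF-IDF vector per text
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]  # relevance of the two texts
print(f"TF-IDF cosine similarity: {score:.3f}")
```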
2) Text relevance determination method based on Latent Dirichlet Allocation (LDA)
A model is trained on a large-scale corpus, and the trained model is used to compute the topics present in each text and their distribution, yielding a topic distribution vector. Cosine similarity is then computed between the vectors of the two texts to obtain their relevance.
3) Text relevance determining method based on word embedding
The word embedding of each word in the text can be obtained through the word2vec method commonly used in deep learning. By summing the word embeddings of the words in a text, a vector representation of the text can be obtained. Cosine similarity is then computed between the vectors of the two texts to obtain their relevance.
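A minimal sketch of this word-embedding baseline follows; the toy embedding table stands in for a trained word2vec model and is an assumption for illustration:

```python
# Sketch of related-art method 3): sum word embeddings, compare by cosine.
import numpy as np

embeddings = {"marathon": np.array([0.9, 0.1]),   # toy word2vec table (assumed)
              "success":  np.array([0.2, 0.8]),
              "race":     np.array([0.8, 0.2])}

def text_vector(words):
    return np.sum([embeddings[w] for w in words], axis=0)  # summation loses per-word semantics

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(text_vector(["marathon", "success"]),
             text_vector(["race", "success"])))
```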
4) Text relevance determining method based on deep semantic matching model
A deep semantic matching model is trained with pre-labeled pairs of original and related texts, plus some irrelevant texts, as input, yielding a deep learning model that computes relevance scores for document pairs.
The inventor finds that the method has the following problems in the process of implementing the embodiment of the invention:
1. Method 1) relies too heavily on literal surface similarity and has difficulty capturing deep semantic similarity, so the accuracy of the relevance it determines is low;
2. In method 2), the LDA topic model is an unsupervised model that can only attempt to fit the topic distribution of a text, so the accuracy of the determined relevance is low;
3. In method 3), word embeddings express word senses, and a large amount of the words' semantic information is lost during summation, causing feature loss and in turn low accuracy of the determined relevance;
4. Method 4) requires a large amount of labeled corpus for pre-training, and because text relevance is usually hard to judge by a firm standard, the quality of the labeled data is not high, so the accuracy of the determined relevance is low. In addition, a deep semantic matching model must cross-compute the features of the two texts and cannot compute the vector representation of a single text in advance, so more computing resources are needed online and the service maintenance cost is relatively high.
Based on this, the method for determining text relevance according to the embodiments of the present invention is proposed to solve at least the above problems in the related art, and is described below.
Referring to fig. 2, fig. 2 is an alternative architecture diagram of the text relevance determination system 100 according to an embodiment of the present invention, in order to support an exemplary application, terminals (exemplary terminal 400-1 and terminal 400-2 are shown) are connected to the server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of both networks. The terminal is provided with an application client, such as a news application client.
The terminal (e.g., 400-1) sends second text (e.g., comments) for the first text (e.g., news content) to the server.
A server 200 for obtaining at least two text vector models; respectively carrying out vector coding on the first text and the second text through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text; determining the similarity of the first text and the second text based on the first text vector and the second text vector obtained by the text vector models respectively to obtain at least two similarities; determining the correlation between the first text and the second text according to the obtained at least two similarities, and returning to the terminal;
and the terminal (such as 400-1) is used for presenting the second text in a presentation mode corresponding to the relevance according to the relevance of the first text and the second text.
In practical application, the server may be a server configured independently to support various services, or may be configured as a server cluster; the terminal may be a smartphone, a tablet, a laptop, or any other type of user terminal, and may also be a wearable computing device, a Personal Digital Assistant (PDA), a desktop computer, a cellular phone, a media player, a navigation device, a game console, a television, or a combination of any two or more of these or other data processing devices.
Next, an electronic device implementing the text relevance determination method according to the embodiment of the present invention will be described. Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device shown in fig. 3 includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the electronic device are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in FIG. 3.
The processor 410 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 450 described in embodiments of the invention is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for communicating to other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present invention may be implemented in software, and fig. 3 illustrates the text relevance determination apparatus 455 stored in the memory 450, which may be software in the form of programs and plug-ins, and includes the following software modules: an obtaining module 4551, an encoding module 4552, a first determining module 4553 and a second determining module 4554, which are logical and thus may be arbitrarily combined or further divided according to the functions implemented.
The functions of the respective modules will be explained below.
In other embodiments, the device for determining text relevance provided by the embodiments of the present invention may be implemented in hardware. As an example, it may be a processor in the form of a hardware decoding processor programmed to execute the method for determining text relevance provided by the embodiments of the present invention; for example, the processor in the form of a hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The text relevance determination method of the present invention will be described below in connection with an exemplary application in which it is implemented by a server provided by an embodiment of the present invention.
Referring to fig. 4, fig. 4 is an optional flowchart of a text relevance determination method according to an embodiment of the present invention, and the text relevance determination method according to the present invention will be described with reference to the steps shown in fig. 4.
Step 401: the server obtains at least two text vector models.
Here, the at least two text vector models may be obtained by training on the spot; alternatively, they may be trained in advance and stored in the server, and fetched directly from the server when the relevance needs to be determined.
The invention stores at least two pre-trained models in the server, thereby avoiding training the models online when determining text relevance, which reduces the computing resources required and improves the efficiency of determining text relevance.
Step 402: and respectively carrying out vector coding on the first text and the second text through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text.
Here, the at least two text vector models are used to encode a text into a text vector that can represent it. Therefore, the at least two text vector models may include any model that can produce a text vector representing a text, such as a word-to-vector (word2vec) model, a document-to-vector (doc2vec) model, or a sequence-to-sequence (Seq2Seq) model.
In practical implementation, the number of the obtained first text vectors and the number of the obtained second text vectors are the same as the number of the obtained text vector models, that is, one first text vector and one second text vector can be generated through each text vector model.
In this case, the text is encoded by different text vector models according to different text features, that is, the obtained plurality of first text vectors may be used to represent different features of the first text, and the obtained plurality of second text vectors may be used to represent different features of the second text, so that accuracy in subsequently determining the correlation between the first text and the second text can be improved.
In some embodiments, the at least two text vector models comprise a first text vector model and a second text vector model; correspondingly, the first text vector for representing the first text and the second text vector for representing the second text can be obtained in the following way:
respectively processing context word vectors of each word in the first text and the second text and paragraph vectors of a paragraph where the word is located through a first text vector model to obtain word feature vectors of each word in the first text and word feature vectors of each word in the second text; generating a first text vector for representing the first text and a second text vector for representing the second text according to the word feature vector of each word in the first text and the word feature vector of each word in the second text respectively; and respectively coding the word sequence of the first text and the word sequence of the second text by a coder in a second text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text.
Here, the context word vector of a word includes a plurality of word vectors, that is, word vectors of words before the word and word vectors of words after the word, and in actual implementation, the number of the context word vectors may be preset, for example, word vectors of three words before the word and word vectors of three words after the word may be obtained as the context word vectors; the number of context word vectors may also be dynamically adjusted.
According to the invention, when the word feature vector of a word is obtained through the first text vector model, the context word vectors and the paragraph vector of the word are introduced, so that the obtained word feature vector represents the word better, because both the neighboring words and more distant context are taken into account.
It should be noted that, in practical implementation, the paragraph vector of the paragraph where the word is located may be replaced by the sentence vector of the sentence where the word is located, or may be replaced by the text vector of the text where the word is located.
In some embodiments, for each word in the first text and the second text, the server may concatenate the context word vectors and the paragraph vector of the word to obtain the word feature vector of the word; alternatively, the server may average the several word vectors included in the context word vectors together with the paragraph vector, and use the average vector as the word feature vector of the word.
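The two composition options just described can be sketched as follows; the dimensions and random contents are illustrative assumptions:

```python
# Sketch of composing a word feature vector from context word vectors and
# the paragraph vector, by concatenation or by averaging.
import numpy as np

rng = np.random.default_rng(0)
context_vecs = [rng.random(4) for _ in range(3)]  # word vectors of surrounding words
paragraph_vec = rng.random(4)                     # vector of the enclosing paragraph

feat_concat = np.concatenate(context_vecs + [paragraph_vec])  # option 1: concatenate
feat_mean = np.mean(context_vecs + [paragraph_vec], axis=0)   # option 2: average
print(feat_concat.shape, feat_mean.shape)                     # (16,) and (4,)
```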
According to the method and the device, the characteristics of each word in the text are obtained through the first text vector model, and the characteristics of the word sequence in the text are obtained through the second text vector model, so that the characteristics of the first text and the second text can be well represented.
In some embodiments, the word feature vector of each word in the first text and the word feature vector of each word in the second text may be obtained by:
for each word in the first text and the second text, the following operations are performed: obtaining word vectors of all words in a paragraph where the words are located to obtain a word vector set of the paragraph; dividing a word vector set into a first subset and a second subset; taking the word vectors in the first subset as context word vectors of the words, and generating paragraph vectors of paragraphs according to the word vectors in the second subset; and generating a word feature vector of the word according to the word vectors of the upper and lower words of the word and the paragraph vector of the paragraph.
In practical implementation, the word vectors of several words before and after a word are used as the context word vectors of the word, and the paragraph vector is then generated from the word vectors of the remaining words in the paragraph, i.e., those other than the words immediately before and after the word.
Exemplarily, fig. 5 is an optional structural schematic diagram of a first text vector model provided in an embodiment of the present invention, and referring to fig. 5, when a word feature vector of a word needs to be obtained, word vectors of a word before the word and two words after the word are used as context word vectors of the word, a paragraph vector of a paragraph is generated according to word vectors of other words except for a plurality of words before and after the word in the word, and then a plurality of word vectors included in the context word vectors are spliced/averaged with the paragraph vector to obtain the word feature vector of the word.
For example, for the paragraph "Warmly celebrate the complete success of the Zhongshan Marathon", when the word feature vector of "celebrate" is needed, the word vectors of "warmly", "Zhongshan" and "marathon" are used as context word vectors, the paragraph vector is generated from the word vectors of "race" and "success", and the several word vectors included in the context word vectors are concatenated/averaged with the paragraph vector to obtain the word feature vector of "celebrate".
In some embodiments, the paragraph vector may be generated from a word vector of keywords in the paragraph by obtaining keywords in the paragraph in which the words are located.
It should be noted that, when a paragraph is replaced by a sentence or text, the manner of obtaining a sentence vector or a text vector is the same as the manner of obtaining a paragraph vector.
In some embodiments, the context word vector of the sample word and the paragraph vector of the paragraph in which the sample word is located are input into a first text vector model, the predicted word vector is output through the first text vector model, a difference between the predicted word vector and the sample word vector is obtained, and the model parameter of the first text vector model is updated according to the difference between the predicted word vector and the sample word vector.
In actual implementation, the value of the objective function is determined according to the predicted word vector and the sample word vector, the value of the objective function is reversely propagated in the first text vector model, and the model parameters of each layer are updated in the process of propagation.
In practical implementation, when the value of the objective function exceeds a threshold, the value is back-propagated in the first text vector model, and the model parameters of each layer are updated during propagation until convergence. In this way, training of the first text vector model is achieved.
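As a sketch of this training step, the following PyTorch fragment predicts a target word from averaged context and paragraph vectors and back-propagates the loss; the architecture details (embedding size, additive combination) are assumptions, since the patent does not fix them:

```python
# Sketch of training the first text vector model: predict the target word
# from its context word vectors and the paragraph vector, then update the
# parameters from the objective function.
import torch
import torch.nn as nn

class ContextParagraphModel(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, context_ids, paragraph_ids):
        ctx = self.embed(context_ids).mean(dim=1)    # average the context word vectors
        par = self.embed(paragraph_ids).mean(dim=1)  # paragraph vector from its words
        return self.out(ctx + par)                   # predicted word distribution

model = ContextParagraphModel(vocab_size=1000)
context = torch.randint(0, 1000, (1, 4))     # surrounding words (toy ids)
paragraph = torch.randint(0, 1000, (1, 12))  # remaining words of the paragraph
target = torch.tensor([42])                  # the word to be predicted
loss = nn.CrossEntropyLoss()(model(context, paragraph), target)
loss.backward()                              # propagate the objective and update layers
```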
In some embodiments, the word feature vector of each word in the first text and in the second text may be obtained by: acquiring the word frequency of each word in the corresponding text; determining the modulus length of the word feature vector of each word according to the word frequency; and processing the context word vectors of each word, the paragraph vector of the paragraph where the word is located, and the modulus length of the word feature vector of the word, to obtain the word feature vectors of the words in the first text and in the second text.
In practical implementation, the weight of a word can be adjusted through a dynamic weighting technique. In general, low-frequency words have a larger influence on sentence semantics, while high-frequency words, stop words and the like have a smaller influence on paragraph semantics; therefore, the lower a word's frequency, the longer the modulus length of its word feature vector should be, so that the text vector subsequently generated from the word feature vectors can highlight the key information in a paragraph.
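The frequency-based modulus adjustment can be sketched as follows; the inverse-frequency form is an assumed example, as the patent does not fix a specific weighting formula:

```python
# Sketch of dynamic weighting: rarer words get word feature vectors with a
# longer modulus length, so they dominate the summed text vector.
import numpy as np

def weighted_feature(direction_vec, term_freq):
    unit = direction_vec / np.linalg.norm(direction_vec)  # direction from context/paragraph
    return unit / term_freq                               # lower frequency -> longer modulus

common = weighted_feature(np.array([1.0, 1.0]), term_freq=0.05)
rare = weighted_feature(np.array([1.0, 1.0]), term_freq=0.001)
print(np.linalg.norm(common), np.linalg.norm(rare))       # the rare word's vector is longer
```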
In some embodiments, a first text vector for characterizing the first text may be obtained by summing or concatenating word feature vectors of respective words in the first text; and summing or splicing the word feature vectors of all the words in the second text to obtain a second text vector for representing the second text.
In some embodiments, a first text vector for characterizing a first text and a second text vector for characterizing a second text may be obtained by: splitting the first text and the second text respectively by taking sentences as units to obtain at least two sentences included in the first text and at least two sentences included in the second text; respectively coding the obtained word sequence corresponding to each sentence through a coder in a second text vector model to obtain a sentence vector for representing each sentence; the sentence vectors of at least two sentences included in the first text are combined to obtain a first text vector for representing the first text, and the sentence vectors of at least two sentences included in the second text are combined to obtain a second text vector for representing the second text.
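A sketch of this sentence-level pipeline is given below; `encode_sentence` is a stand-in for the trained encoder of the second text vector model, and averaging is one assumed way of combining sentence vectors:

```python
# Sketch: split a text into sentences, encode each sentence's word sequence,
# and combine the sentence vectors into one text vector.
import numpy as np

def encode_sentence(words):
    # placeholder for the recurrent encoder; deterministic toy vectors
    rng = np.random.default_rng(abs(hash(tuple(words))) % (2**32))
    return rng.random(8)

def text_vector(text):
    sentences = [s.split() for s in text.split(". ") if s]
    sent_vecs = [encode_sentence(s) for s in sentences]
    return np.mean(sent_vecs, axis=0)  # combine sentence vectors into the text vector
```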
In actual implementation, compared with the method of directly coding the word sequence of the whole text, the method of coding the word sequence of each sentence in the text through the coder in the second text vector model can reduce the occupation of computing resources and is beneficial to improving the operation efficiency.
In some embodiments, the second text vector model further comprises a decoder, and the server may train the second text vector model by: acquiring a word sequence of a sample text; coding the word sequence of the sample text through a coder to obtain a text vector of the sample text; decoding the text vector of the sample text through a decoder to obtain an output word sequence; and acquiring the difference between the output word sequence and the word sequence of the sample text, and updating the model parameters of the second text vector model based on the difference.
Here, the training target of the second text vector model is that the output sequence is the same as the input sequence, that is, the output word sequence is the same as the word sequence of the sample text.
In practical implementation, two recurrent neural networks are used, one for encoding and one for decoding, serving as the encoder and the decoder respectively: the encoder analyzes the input sequence, and the decoder generates the output sequence.
It should be noted that a traditional recurrent neural network structure may be chosen, or a long short-term memory (LSTM) model, a gated recurrent unit (GRU), or the like may be used; the two recurrent neural networks are trained together. After training is completed, only the encoder part is used to obtain the text vector of a text.
Referring to FIG. 6, FIG. 6 is an optional structural diagram of the second text vector model according to an embodiment of the present invention. The second text vector model includes two recurrent neural networks: the word sequence of a sample text is input into one recurrent neural network (the encoder), which encodes it to obtain the text vector of the sample text; the other recurrent neural network (the decoder) then decodes the text vector to obtain an output word sequence.
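A minimal PyTorch sketch of this encoder-decoder is shown below; the single-layer GRU, the dimensions, and teacher forcing with the input sequence are assumptions for illustration:

```python
# Sketch of the second text vector model of FIG. 6: one GRU encodes the word
# sequence into a text vector, another GRU decodes it back into a word
# sequence; the training target is to reproduce the input sequence.
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        emb = self.embed(tokens)
        _, text_vec = self.encoder(emb)           # text_vec: (1, batch, hid_dim)
        dec_out, _ = self.decoder(emb, text_vec)  # decode from the text vector
        return self.out(dec_out), text_vec.squeeze(0)

model = Seq2SeqAutoencoder(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 7))           # two toy word sequences
logits, text_vectors = model(tokens)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), tokens.reshape(-1))
loss.backward()                                   # update parameters from the difference
```

After training, only `model.encoder` (and the embedding) would be kept to produce text vectors.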
In some embodiments, a variational layer may be added between the encoder and the decoder; that is, the second text vector model further includes a decoder and a variational layer. Correspondingly, the server may train the second text vector model in the following way:
acquiring the word sequence of a sample text; encoding the word sequence of the sample text through the encoder to obtain a text vector of the sample text; generating, through the variational layer, a normal distribution corresponding to the text vector of the sample text; sampling the normal distribution to obtain a sampling vector; decoding the sampling vector through the decoder to obtain an output word sequence; and acquiring the difference between the output word sequence and the word sequence of the sample text, and updating the model parameters of the second text vector model based on the difference.
In practical implementation, a variational layer can be added between the encoder and the decoder, following the related technique in the variational autoencoder (VAE), to increase the randomness of the vectors, avoid overfitting, and improve the reliability of the trained second text vector model.
Referring to FIG. 7, FIG. 7 is an optional structural diagram of the second text vector model according to an embodiment of the present invention. The second text vector model includes an encoder, a variational layer and a decoder. After the encoder produces the text vector of a sample text, the variational layer generates a normal distribution for that text vector, represented by its two parameters, a mean vector and a variance vector; the distribution is then sampled to obtain a sampling vector, which is input to the decoder.
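The variational layer can be sketched with the standard reparameterization trick; the layer sizes are assumptions:

```python
# Sketch of the variational layer of FIG. 7: map the encoder output to a
# mean vector and a (log-)variance vector, then draw a sampling vector.
import torch
import torch.nn as nn

class VariationalLayer(nn.Module):
    def __init__(self, hid_dim=128, z_dim=32):
        super().__init__()
        self.to_mean = nn.Linear(hid_dim, z_dim)
        self.to_logvar = nn.Linear(hid_dim, z_dim)

    def forward(self, text_vec):
        mean, logvar = self.to_mean(text_vec), self.to_logvar(text_vec)
        eps = torch.randn_like(mean)                 # sample from N(0, I)
        return mean + eps * torch.exp(0.5 * logvar)  # sampling vector for the decoder
```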
Step 403: and determining the similarity between the first text and the second text based on the first text vector and the second text vector obtained by the text vector models respectively to obtain at least two similarities.
Here, a similarity is determined for each text vector model to obtain the similarity corresponding to each text vector model. Wherein the similarity between the first text and the second text is represented by determining the similarity between the first text vector and the second text vector.
Taking at least two text vector models including a first text vector model and a second text vector model as an example, calculating the similarity of a first text vector and a second text vector obtained by the first text vector model to obtain a first similarity; and calculating the similarity of the first text vector and the second text vector obtained by the second text vector model to obtain a second similarity.
In some embodiments, the server may determine the similarity of the first text to the second text by: respectively obtaining cosine values of included angles between a first text vector and a second text vector obtained by each text vector model; and taking the cosine value of the included angle as the similarity of the first text and the second text.
In some embodiments, the server may further obtain the Pearson correlation coefficient of the first text vector and the second text vector obtained by each text vector model and use it as the similarity between the first text and the second text; or the server may obtain the Euclidean distance between the first text vector and the second text vector obtained by each text vector model and use it as the similarity between the first text and the second text.
In some embodiments, the server may further perform stitching on the first text vectors obtained by the text vector models to obtain first stitched vectors; splicing the second text vectors obtained by the text vector models to obtain second spliced vectors; then the similarity of the two stitching vectors is calculated.
Step 404: and determining the relevance of the first text and the second text according to the obtained at least two similarities.
Here, the higher the similarity, the higher the relevance of the first text to the second text is, and thus, the relevance of the first text to the second text may be characterized by the similarity.
In some embodiments, the server may determine the relevance of the first text to the second text by: determining an average value of the obtained at least two similarities; the relevance of the first text to the second text is characterized by an average value.
In some embodiments, when one of the obtained at least two similarities deviates greatly from the others, that similarity may be eliminated before averaging, so that the relevance between the first text and the second text is determined more accurately.
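Steps 403 and 404 can be sketched together as follows; the outlier-dropping rule (farthest from the median) is one assumed concrete form of the elimination described above:

```python
# Sketch: one cosine similarity per text vector model, then the average
# (optionally after dropping an outlier) characterizes the relevance.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def relevance(first_vecs, second_vecs, drop_outlier=False):
    sims = [cosine(a, b) for a, b in zip(first_vecs, second_vecs)]  # step 403
    if drop_outlier and len(sims) > 2:
        med = float(np.median(sims))
        sims.remove(max(sims, key=lambda s: abs(s - med)))  # assumed outlier rule
    return float(np.mean(sims))                             # step 404
```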
In some embodiments, the method may be applied to ranking article comments, where the first text is a target article and the second text is a comment corresponding to the target article. Vector-encoding the first text and the second text through each text vector model to obtain the first text vector and the second text vector then includes: vector-encoding the target article through each text vector model to obtain a first text vector for representing the target article; and vector-encoding the comment corresponding to the target article through each text vector model to obtain a second text vector for representing the comment. Determining the relevance of the first text and the second text according to the obtained at least two similarities includes: determining the relevance between the target article and each comment according to the obtained at least two similarities, ranking the at least two comments corresponding to the target article by relevance to obtain a priority order, and presenting the at least two comments corresponding to the target article in that priority order.
In actual implementation, the relevance between each comment and the target article can be acquired, so that the comments corresponding to the target article are ranked by relevance, with higher relevance given higher priority, and the at least two comments corresponding to the target article are then presented in priority order. In this way, the user sees the comments most relevant to the article first, which improves the user experience.
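The ranking itself reduces to a sort by the combined relevance score; `text_relevance` below stands in for the full pipeline described above:

```python
# Sketch of ranking the comments of a target article by relevance.
def rank_comments(article, comments, text_relevance):
    return sorted(comments, key=lambda c: text_relevance(article, c), reverse=True)
```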
In the embodiment of the invention, the first text and the second text are respectively subjected to vector coding through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text; determining the similarity of the first text and the second text based on a first text vector and a second text vector obtained by each text vector model respectively to obtain at least two similarities; determining the relevance of the first text and the second text according to the obtained at least two similarities; therefore, the text is coded through at least two text vector models, and text vectors used for representing different characteristics of the text can be obtained, so that the loss of information is reduced, and the accuracy of the determined correlation is improved.
Taking ranking of comments in an article as an example to describe the text relevance determination method of the present invention, fig. 8 is an optional flow diagram of the text relevance determination method provided in the embodiment of the present invention, and referring to fig. 8, the text relevance determination method provided in the embodiment of the present invention is cooperatively implemented by an application client and a server.
Step 501: the client sends a request for viewing comments of the target article.
Step 502: the server obtains a plurality of comments of the target article and the target article.
Step 503: the server obtains word vectors of all words in a paragraph where the words are located through a first text vector model for all words in the comments and the target article, and a word vector set of the paragraph is obtained.
Step 504: the server divides the set of word vectors into a first subset and a second subset.
Step 505: the server takes the word vectors in the first subset as context word vectors of the words, and generates paragraph vectors of paragraphs according to the word vectors in the second subset.
Step 506: the server splices context word vectors of the words and paragraph vectors of the paragraphs to obtain word feature vectors of the words.
Step 507: the server sums word feature vectors of all words in the comments to obtain a first text vector for representing the comments; and summing the word feature vectors of all the words in the target article to obtain a second text vector for representing the target article.
Step 508: the server respectively calculates the similarity between the first text vector representing each comment and the second text vector of the target article, to obtain the first similarity between each comment and the target article.
Step 509: the server takes sentences as units and splits the comments and the target articles respectively to obtain at least two sentences included by the comments and at least two sentences included by the target articles.
Step 510: and the server respectively encodes the obtained word sequences corresponding to the sentences through an encoder in the second text vector model to obtain sentence vectors for representing the sentences.
Step 511: and for each comment, the server combines sentence vectors of at least two sentences included in the comment to obtain a third text vector for representing each comment.
Step 512: the server combines sentence vectors of at least two sentences included in the target article to obtain a fourth text vector for representing the target article.
Step 513: the server respectively calculates the similarity between the third text vector representing each comment and the fourth text vector of the target article, to obtain the second similarity between each comment and the target article.
Step 514: and the server acquires the average value of the first similarity and the second similarity, and the acquired average value is used as the relevance of each comment and the target article.
Step 515: and the server returns the relevance of each comment and the target article to the application client.
Step 516: and the application client ranks the comments of the target article according to the relevance between each comment and the target article, and presents the comment of the target article according to a ranking result.
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described. In practical implementation, the method for determining text relevance provided by the embodiment of the present invention includes: the method comprises the steps of performing vector representation on two texts to be compared (including a first text and a second text) through a document-to-vector (doc2vecc) model and a sequence-to-sequence (seq2seq) model to obtain text vectors for representing the texts, then calculating the similarity between the text vectors of the two texts to be compared obtained by doc2vecc and seq2seq respectively to obtain two similarities, and using the average value of the two similarities to represent the correlation between the two texts to be compared.
The doc2vecc model is explained below.
The doc2vecc model is similar to a common word2vec model, and the goal of the doc2vecc model is to obtain a word feature vector of each word, and then to sum the word feature vectors of each word in the text to obtain a text vector for representing the text. The calculation speed of the text vector obtained through the doc2vecc model is high, and the method is suitable for being applied to a big data scene.
Compared with the word2vec model, doc2vecc introduces paragraph vectors in the training process, so that during the training of each word both the surrounding words and more distant context are taken into account; this solves the problem that the word2vec model relies too much on local content (i.e., the words within the designated window during training) and ignores the rest of the text.
For example, referring to FIG. 5, for the t-th word, the word vectors of the word before it and of the two words after it are used as its context word vectors, the paragraph vector is generated from the word vectors of the other words in the paragraph where the word is located, and the several word vectors included in the context word vectors are concatenated/averaged with the paragraph vector to obtain the word feature vector used to predict the t-th word.
For example, for the paragraph "Warmly celebrate the complete success of the Zhongshan Marathon", when the word feature vector of "celebrate" is needed, the word vectors of "warmly", "Zhongshan" and "marathon" are used as context word vectors, the paragraph vector is generated from the word vectors of "race" and "success", and the several word vectors included in the context word vectors are concatenated/averaged with the paragraph vector to obtain the word feature vector of "celebrate".
In some embodiments, a dynamic weighting technique may be introduced into the doc2vecc model. In general, low-frequency words have a larger influence on sentence semantics, while high-frequency words, stop words, and the like have a smaller influence on paragraph semantics. Accordingly, the modulus of the word feature vector of a low-frequency word should be larger, so that the text vector subsequently generated from the word feature vectors highlights the key information in the paragraph.
In practical implementation, the frequency of a word in the corresponding text can be obtained; the modulus of the word feature vector of the word is then determined from the word frequency through the doc2vecc model, and the word feature vector of the word is determined jointly from the context word vectors of the word, the paragraph vector, and the modulus of the word feature vector.
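One plausible realisation of this dynamic weighting is sketched below; the smooth inverse-frequency weight alpha / (alpha + frequency) is an assumption introduced for illustration, not a formula given by the embodiment.

import numpy as np

def scale_by_frequency(word_vector, word_frequency, alpha=1e-3):
    # Normalise the vector, then give low-frequency words a larger
    # modulus so they dominate the later sum that forms the text vector.
    unit = word_vector / (np.linalg.norm(word_vector) + 1e-12)
    weight = alpha / (alpha + word_frequency)
    return weight * unit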
Next, the seq2seq model will be explained.
In the process of implementing the present invention, the inventors found that although doc2vecc can pay more attention to the information of low-frequency words when generating word feature vectors, information is still lost when the word feature vectors are summed. Therefore, the embodiment of the present invention also employs a seq2seq model, a deep model whose vectors are obtained by unsupervised training.
Fig. 9 is an optional structural schematic diagram of a seq2seq model according to the embodiment of the present invention, and referring to fig. 9, the seq2seq model includes two parts, where the first half is an encoder for encoding an input sequence to obtain a text vector for representing a text; the second half is a decoder used for decoding the text vector to obtain an output sequence. The seq2seq model can be used in the fields of automatic dialogue, machine translation, etc.
In practical implementation, two recurrent neural networks are used, one serving as the encoder and the other as the decoder.
It should be noted that a traditional recurrent neural network structure can be selected, or a long short-term memory (LSTM) model, a gated recurrent unit (GRU), or the like can be used; the two recurrent neural networks are trained jointly. After training is completed, only the encoder part is used to obtain the text vector of a text.
Here, the training process of the seq2seq model is explained. The word sequence of a sample text is encoded by the encoder to obtain a text vector of the sample text; the text vector of the sample text is decoded by the decoder to obtain an output word sequence; the difference between the output word sequence and the word sequence of the sample text is obtained, and the model parameters of the seq2seq model are updated based on the difference, thereby training the seq2seq model.
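By way of example, this unsupervised training loop can be sketched in PyTorch with GRU networks; the vocabulary size, the dimensions, and the use of teacher forcing in the decoder are illustrative assumptions.

import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def encode(self, tokens):
        # The final encoder state is the text vector used after training.
        _, state = self.encoder(self.embed(tokens))
        return state

    def forward(self, tokens):
        state = self.encode(tokens)
        # The decoder must reproduce the word sequence from the text vector.
        output, _ = self.decoder(self.embed(tokens), state)
        return self.out(output)

model = Seq2SeqAutoencoder()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()
batch = torch.randint(0, 10000, (8, 20))  # stand-in word-id sequences
optimizer.zero_grad()
logits = model(batch)
# The "difference" between output and input word sequences is measured
# here as cross-entropy over the predicted word distributions.
loss = loss_fn(logits.reshape(-1, 10000), batch.reshape(-1))
loss.backward()
optimizer.step()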
After training is completed, only the encoder part is used when the text vector of a text is to be obtained; that is, the intermediate vector serves as the text representation.
In some embodiments, a variational layer may be added between the encoder and the decoder, following the related technique of the variational autoencoder (VAE), to increase the randomness of the vector, avoid overfitting, and improve the reliability of the trained second text vector model.
Here, the training process of the seq2seq model including the variational layer is explained. The word sequence of a sample text is encoded to obtain a text vector of the sample text; a corresponding normal distribution is determined from the text vector of the sample text through the variational layer, and the normal distribution is sampled to obtain a sampling vector; the sampling vector is decoded by the decoder to obtain an output word sequence; the difference between the output word sequence and the word sequence of the sample text is obtained, and the model parameters of the seq2seq model are updated based on the difference, thereby training the seq2seq model.
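By way of example, such a variational layer can be sketched as follows; the layer sizes are illustrative, and the reparameterisation trick and KL regulariser follow standard VAE practice rather than details stated in the embodiment.

import torch
import torch.nn as nn

class VariationalLayer(nn.Module):
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.to_mean = nn.Linear(hidden_dim, hidden_dim)
        self.to_logvar = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, text_vector):
        # Determine the normal distribution corresponding to the text vector.
        mean = self.to_mean(text_vector)
        logvar = self.to_logvar(text_vector)
        # Reparameterised sampling keeps the sampling step differentiable.
        eps = torch.randn_like(mean)
        sample = mean + torch.exp(0.5 * logvar) * eps
        # KL term regularising the distribution towards N(0, I).
        kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
        return sample, kl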
In practical implementation, when a text is long, feeding its entire word sequence into the seq2seq model occupies a large amount of computing resources and harms operating efficiency. Therefore, the text is split into sentences, the word sequence of each sentence is input into the encoder of the seq2seq model to obtain a sentence vector for that sentence, and the sentence vectors are summed to obtain a text vector representing the text.
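By way of example, this sentence-wise encoding can be sketched as follows; the regular-expression sentence splitter is a simplifying assumption, and encode_sentence stands for the trained encoder.

import re
import numpy as np

def text_vector_by_sentence(text, encode_sentence):
    # Split the text into sentences on common terminators, encode each
    # sentence separately, and sum the sentence vectors.
    sentences = [s for s in re.split(r"[。！？.!?]+", text) if s.strip()]
    return np.sum([encode_sentence(s) for s in sentences], axis=0)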
Next, a method of calculating the similarity will be described.
In practical implementation, the similarity between two text vectors can be determined by cosine similarity, the Pearson correlation coefficient, the Euclidean distance, and the like.
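By way of example, the three measures can be sketched as follows; mapping the Euclidean distance into a similarity via 1 / (1 + d) is an illustrative choice.

import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_similarity(u, v):
    # Pearson correlation equals cosine similarity of mean-centred vectors.
    return cosine_similarity(u - u.mean(), v - v.mean())

def euclidean_similarity(u, v):
    # Convert the Euclidean distance into a similarity in (0, 1].
    return 1.0 / (1.0 + np.linalg.norm(u - v))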
Table 1 compares experimental data of the method provided by the embodiment of the present invention with methods provided by the related art. Referring to table 1, in an article comment ranking scenario, the average NDCG of the method provided by the embodiment of the present invention is significantly higher than that of the other methods; that is, the method provided by the embodiment of the present invention outperforms the methods of the related art.
TABLE 1 (average NDCG comparison; published as images in the original document)
Because the model architecture of the embodiment of the present invention is simple, it can run without being deployed in a GPU environment, thereby saving GPU resources.
In some embodiments, the way of calculating the similarity of the vectors may be changed: for example, the vectors obtained by the two models are spliced, and cosine similarity is then calculated on the spliced vectors; other similarity calculation means may also be used.
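By way of example, this splicing variant can be sketched as follows, reusing the assumptions of the earlier sketches (one doc2vecc vector and one seq2seq vector per text).

import numpy as np

def concatenated_similarity(v1_doc, v1_seq, v2_doc, v2_seq):
    # Splice each text's two model vectors, then compare once.
    u = np.concatenate([v1_doc, v1_seq])
    v = np.concatenate([v2_doc, v2_seq])
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))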
In some embodiments, the doc2vecc model may be replaced by another word-based paragraph representation method, such as doc2vec or word2vec.
In some embodiments, a more complex mechanism such as attention is introduced into the seq2seq model, or the network structure of the encoder is replaced, for example with TextCNN or another architecture.
Continuing with the exemplary structure of the text relevance determining apparatus 455 provided by the embodiment of the present invention implemented as software modules, fig. 10 is an optional structural schematic diagram of the text relevance determining apparatus. As shown in fig. 10, the text relevance determining apparatus 455 provided by the embodiment of the present invention includes:
an obtaining module 4551, configured to obtain at least two text vector models;
the encoding module 4552 is configured to perform vector encoding on the first text and the second text through each text vector model, so as to obtain a first text vector used for representing the first text and a second text vector used for representing the second text;
a first determining module 4553, configured to determine similarity between the first text and the second text based on a first text vector and a second text vector obtained by each text vector model, so as to obtain at least two similarities;
a second determining module 4554, configured to determine a correlation between the first text and the second text according to the obtained at least two similarities.
In some embodiments, the encoding module 4552 is further configured to perform vector encoding on the target article through each text vector model, so as to obtain a first text vector for characterizing the target article;
carrying out vector coding on the comments corresponding to the target article through each text vector model to obtain a second text vector for representing the comments;
the second determining module 4554 is further configured to determine the relevance between the target article and the comments according to the obtained at least two similarities, to perform priority ranking on the at least two comments corresponding to the target article according to the relevance to obtain a priority order, and to present the at least two comments corresponding to the target article according to the priority order.
In some embodiments, the at least two obtained text vector models comprise: a first text vector model and a second text vector model,
the encoding module 4552 is further configured to separately process, through the first text vector model, context word vectors of words in the first text and the second text, and paragraph vectors of paragraphs where the words are located, to obtain word feature vectors of the words in the first text and word feature vectors of the words in the second text;
generating a first text vector for representing the first text and a second text vector for representing the second text according to the word feature vector of each word in the first text and the word feature vector of each word in the second text respectively;
and respectively encoding the word sequence of the first text and the word sequence of the second text by using an encoder in the second text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text.
In some embodiments, the encoding module 4552 is further configured to perform the following operations for each word in the first text and the second text, respectively:
obtaining word vectors of all words in the paragraph where the words are located to obtain a word vector set of the paragraph;
dividing the set of word vectors into a first subset and a second subset;
taking the word vectors in the first subset as context word vectors of the words, and generating paragraph vectors of the paragraphs according to the word vectors in the second subset;
and generating a word feature vector of the word according to the context word vector of the word and the paragraph vector of the paragraph.
In some embodiments, the encoding module 4552 is further configured to obtain the word frequency of each word in the first text and the second text in the corresponding text;
determine the modulus of the word feature vector of each word according to the word frequency;
and obtain the word feature vectors of the words in the first text and the word feature vectors of the words in the second text respectively according to the context word vectors of the words in the first text and the second text, the paragraph vectors of the paragraphs where the words are located, and the moduli of the word feature vectors of the words.
In some embodiments, the encoding module 4552 is further configured to split the first text and the second text respectively by taking a sentence as a unit, so as to obtain at least two sentences included in the first text and at least two sentences included in the second text;
respectively encoding the obtained word sequence corresponding to each sentence through the encoder in the second text vector model to obtain a sentence vector for representing each sentence;
combining sentence vectors of at least two sentences included in the first text to obtain a first text vector for representing the first text,
and combining sentence vectors of at least two sentences included in the second text to obtain a second text vector for representing the second text.
In some embodiments, the second text vector model further comprises a decoder, the apparatus further comprising:
the first training module is used for acquiring a word sequence of a sample text;
encoding the word sequence of the sample text through the encoder to obtain a text vector of the sample text;
decoding the text vector of the sample text through the decoder to obtain an output word sequence;
and acquiring the difference between the output word sequence and the word sequence of the sample text, and updating the model parameters of the second text vector model based on the difference.
In some embodiments, the second text vector model further comprises a decoder and a variational layer, the apparatus further comprising:
the second training module is used for acquiring a word sequence of the sample text;
encoding the word sequence of the sample text through the encoder to obtain a text vector of the sample text;
generating, through the variational layer, a normal distribution corresponding to the text vector of the sample text;
sampling the normal distribution to obtain a sampling vector;
decoding the sampling vector through the decoder to obtain an output word sequence;
and acquiring the difference between the output word sequence and the word sequence of the sample text, and updating the model parameters of the second text vector model based on the difference.
In some embodiments, the first determining module 4553 is further configured to obtain cosine values of included angles between the first text vector and the second text vector obtained by each text vector model;
and taking the cosine value of the included angle as the similarity of the first text and the second text.
In some embodiments, the second determining module 4554 is further configured to determine an average value of the at least two obtained similarities;
and characterizing the relevance of the first text and the second text through the average value.
Embodiments of the present invention provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform a method for determining text relevance provided by embodiments of the present invention, for example, the method shown in fig. 4.
In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A method for determining text relevance, the method comprising:
obtaining at least two text vector models;
respectively carrying out vector coding on a first text and a second text through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text;
determining the similarity of the first text and the second text based on a first text vector and a second text vector obtained by each text vector model respectively to obtain at least two similarities;
and determining the relevance of the first text and the second text according to the obtained at least two similarities.
2. The method of claim 1, wherein the vector-coding the first text and the second text by the text vector models respectively to obtain a first text vector for characterizing the first text and a second text vector for characterizing the second text comprises:
respectively carrying out vector coding on a target article through each text vector model to obtain a first text vector for representing the target article;
carrying out vector coding on the comments corresponding to the target article through each text vector model to obtain a second text vector for representing the comments;
determining the relevance of the first text and the second text according to the obtained at least two similarities, including:
and determining the relevance between the target article and the comments according to the obtained at least two similarities, performing priority ranking on the at least two comments corresponding to the target article according to the relevance to obtain a priority order, and presenting the at least two comments corresponding to the target article according to the priority order.
3. The method of claim 1, wherein the at least two text vector models comprise: a first text vector model and a second text vector model,
the vector coding is respectively performed on the first text and the second text through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text, and the method comprises the following steps:
respectively processing context word vectors of each word in the first text and the second text and paragraph vectors of a paragraph where the word is located through the first text vector model to obtain word feature vectors of each word in the first text and word feature vectors of each word in the second text;
generating a first text vector for representing the first text and a second text vector for representing the second text according to the word feature vector of each word in the first text and the word feature vector of each word in the second text respectively;
and respectively coding the word sequence of the first text and the word sequence of the second text by using an encoder in the second text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text.
4. The method of claim 3, wherein the processing context word vectors of each word in the first text and the second text and the paragraph vector of the paragraph in which the word is located to obtain the word feature vector of each word in the first text and the word feature vector of each word in the second text comprises:
for each word in the first text and the second text, performing the following operation:
obtaining word vectors of all words in the paragraph where the words are located to obtain a word vector set of the paragraph;
dividing the set of word vectors into a first subset and a second subset;
taking the word vectors in the first subset as context word vectors of the words, and generating paragraph vectors of the paragraphs according to the word vectors in the second subset;
and generating a word feature vector of the word according to the context word vector of the word and the paragraph vector of the paragraph.
5. The method of claim 3, wherein the processing context word vectors of each word in the first text and the second text and the paragraph vector of the paragraph in which the word is located to obtain the word feature vector of each word in the first text and the word feature vector of each word in the second text comprises:
acquiring the word frequency of each word in the first text and the second text in the corresponding text;
determining the modulus of the word feature vector of each word according to the word frequency;
and processing according to the context word vectors of the words in the first text and the second text, the paragraph vectors of the paragraphs where the words are located, and the moduli of the word feature vectors of the words to obtain the word feature vectors of the words in the first text and the word feature vectors of the words in the second text.
6. The method of claim 3, wherein the encoding, by an encoder in the second text vector model, a word sequence of a first text and a word sequence of a second text to obtain a first text vector for characterizing the first text and a second text vector for characterizing the second text, respectively, comprises:
splitting the first text and the second text respectively by taking sentences as units to obtain at least two sentences included in the first text and at least two sentences included in the second text;
respectively coding the obtained word sequence corresponding to each sentence through the encoder in the second text vector model to obtain a sentence vector for representing each sentence;
combining sentence vectors of at least two sentences included in the first text to obtain a first text vector for representing the first text,
and combining sentence vectors of at least two sentences included in the second text to obtain a second text vector for representing the second text.
7. The method of claim 3, wherein the second text vector model further comprises a decoder, the method further comprising:
acquiring a word sequence of a sample text;
coding the word sequence of the sample text through the encoder to obtain a text vector of the sample text;
decoding the text vector of the sample text through the decoder to obtain an output word sequence;
and acquiring the difference between the output word sequence and the word sequence of the sample text, and updating the model parameters of the second text vector model based on the difference.
8. The method of claim 3, wherein the second text vector model further comprises a decoder and a variational layer, the method further comprising:
acquiring a word sequence of a sample text;
coding the word sequence of the sample text through the encoder to obtain a text vector of the sample text;
generating, through the variational layer, a normal distribution corresponding to the text vector of the sample text;
sampling the normal distribution to obtain a sampling vector;
decoding the sampling vector through the decoder to obtain an output word sequence;
and acquiring the difference between the output word sequence and the word sequence of the sample text, and updating the model parameters of the second text vector model based on the difference.
9. The method of claim 1, wherein said determining the relevance of the first text to the second text based on the obtained at least two similarities comprises:
determining an average value of the at least two similarity degrees;
and characterizing the relevance of the first text and the second text through the average value.
10. An apparatus for determining text relevance, the apparatus comprising:
the acquisition module is used for acquiring at least two text vector models;
the encoding module is used for respectively carrying out vector encoding on the first text and the second text through each text vector model to obtain a first text vector for representing the first text and a second text vector for representing the second text;
a first determining module, configured to determine similarity between the first text and the second text based on a first text vector and a second text vector obtained by each text vector model, respectively, so as to obtain at least two similarities;
and the second determining module is used for determining the correlation between the first text and the second text according to the obtained at least two similarities.
CN202010201255.XA 2020-03-20 2020-03-20 Text relevance determining method and device Active CN111382563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010201255.XA CN111382563B (en) 2020-03-20 2020-03-20 Text relevance determining method and device

Publications (2)

Publication Number Publication Date
CN111382563A 2020-07-07
CN111382563B CN111382563B (en) 2023-09-08

Family

ID=71217329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010201255.XA Active CN111382563B (en) 2020-03-20 2020-03-20 Text relevance determining method and device

Country Status (1)

Country Link
CN (1) CN111382563B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254038A (en) * 2011-08-11 2011-11-23 武汉安问科技发展有限责任公司 System and method for analyzing network comment relevance
WO2019072166A1 (en) * 2017-10-10 2019-04-18 腾讯科技(深圳)有限公司 Semantic analysis method, device, and storage medium
CN108304480A (en) * 2017-12-29 2018-07-20 东软集团股份有限公司 A kind of text similarity determines method, apparatus and equipment
CN108304378A (en) * 2018-01-12 2018-07-20 深圳壹账通智能科技有限公司 Text similarity computing method, apparatus, computer equipment and storage medium
WO2019214149A1 (en) * 2018-05-11 2019-11-14 平安科技(深圳)有限公司 Text key information identification method, electronic device, and readable storage medium
CN109885657A (en) * 2019-02-18 2019-06-14 武汉瓯越网视有限公司 A kind of calculation method of text similarity, device and storage medium
CN110162613A (en) * 2019-05-27 2019-08-23 腾讯科技(深圳)有限公司 A kind of problem generation method, device, equipment and storage medium
CN110222154A (en) * 2019-06-10 2019-09-10 武汉斗鱼鱼乐网络科技有限公司 Similarity calculating method, server and storage medium based on text and semanteme
CN110362684A (en) * 2019-06-27 2019-10-22 腾讯科技(深圳)有限公司 A kind of file classification method, device and computer equipment
CN110427483A (en) * 2019-08-05 2019-11-08 腾讯科技(深圳)有限公司 Text snippet evaluating method, device, system and evaluation and test server

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986007A (en) * 2020-10-26 2020-11-24 北京值得买科技股份有限公司 Method for commodity aggregation and similarity calculation
CN112328751A (en) * 2020-12-03 2021-02-05 三星电子(中国)研发中心 Method and device for processing text
CN112580325A (en) * 2020-12-25 2021-03-30 建信金融科技有限责任公司 Rapid text matching method and device
CN112580325B (en) * 2020-12-25 2023-04-07 建信金融科技有限责任公司 Rapid text matching method and device

Also Published As

Publication number Publication date
CN111382563B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
KR102577514B1 (en) Method, apparatus for text generation, device and storage medium
US11327978B2 (en) Content authoring
US10733197B2 (en) Method and apparatus for providing information based on artificial intelligence
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN109840287A (en) A kind of cross-module state information retrieval method neural network based and device
KR101754473B1 (en) Method and system for automatically summarizing documents to images and providing the image-based contents
US11423234B2 (en) Content generation using target content derived modeling and unsupervised language modeling
CN111026861B (en) Text abstract generation method, training device, training equipment and medium
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN109196496A (en) The translater of unknown word fallout predictor and content integration
CN111382563B (en) Text relevance determining method and device
CN111753167B (en) Search processing method, device, computer equipment and medium
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN111563158B (en) Text ranking method, ranking apparatus, server and computer-readable storage medium
JP2023017921A (en) Content recommendation and sorting model training method, apparatus, and device and computer program
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
Yao et al. Non-deterministic and emotional chatting machine: learning emotional conversation generation using conditional variational autoencoders
CN114330483A (en) Data processing method, model training method, device, equipment and storage medium
CN116861913A (en) Position detection method based on GPT large model and related equipment
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
CN111459959B (en) Method and apparatus for updating event sets
WO2022246162A1 (en) Content generation using target content derived modeling and unsupervised language modeling
CN112926295A (en) Model recommendation method and device
CN112307198B (en) Method and related device for determining abstract of single text
CN117725153A (en) Text matching method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant