CN115630142A - Multi-language long text similarity retrieval and classification tool - Google Patents


Info

Publication number
CN115630142A
CN115630142A (application CN202211568520.3A; granted as CN115630142B)
Authority
CN
China
Prior art keywords
similarity
text
news
long
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211568520.3A
Other languages
Chinese (zh)
Other versions
CN115630142B (en)
Inventor
吴林
周亭
吴治伟
王士奇
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China
Priority to CN202211568520.3A
Publication of CN115630142A
Application granted
Publication of CN115630142B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multilingual long-text similarity retrieval and classification tool in the technical field of natural language processing. It comprises a text acquisition module, a text preprocessing module, a text classification prediction module, and a text classification result output module. The text acquisition module acquires long texts in several different languages. The text preprocessing module preprocesses each long text into a corpus, embeds the corpus into a vector space, and performs semantic coding with sentences as units to form sentence vectors. The text classification prediction module uses a multi-language space mapping model to predict the mapped target-language vectors, and determines the similarity between the long texts in different languages from a joint loss function over different target-language vectors, where the joint loss function combines an infoNCE loss with a mutual information loss. The text classification result output module outputs the classification result according to the similarity between the long texts, yielding more accurate matching results.

Description

Multi-language long text similarity retrieval and classification tool
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a multilingual long text similarity retrieval and classification tool.
Background
As local cultures around the world have developed, different historical environments have given rise to the diversity and uniqueness of each country's languages. In today's international communication, establishing shared semantic relations that cross language barriers and promote cultural exchange between countries remains an open challenge. According to available statistics, more than five thousand languages are currently in use worldwide, so the world remains linguistically diverse. With the rapid development of the internet, the volume of information people generate keeps growing, carried by diverse media including video, images, and text. This information connects the whole world, the cultures of different countries collide and merge intensely, and language processing technology continues to advance.
In the master's thesis "Research on cross-language text similarity comparison based on multi-language embedding", the author Wang Kai converts one-dimensional text data into two-dimensional matrix data through a cross-language similarity matrix construction algorithm, performs hierarchical feature extraction on the interaction structure between cross-language sentences, and verifies the model's applicability to different language structures in experiments involving multiple languages such as Chinese, English, and German.
To address this technical problem, the invention provides a multilingual long text similarity retrieval and classification tool.
Disclosure of Invention
In order to realize the purpose of the invention, the invention adopts the following technical scheme:
according to one aspect of the invention, a multi-language long text similarity retrieval and classification tool is provided.
A multilingual long text similarity retrieval and classification tool specifically comprises:
the system comprises a text acquisition module, a text preprocessing module, a text classification prediction module and a text classification result output module;
the text acquisition module is responsible for acquiring a plurality of long texts in different languages and transmitting the long texts to the text preprocessing module;
the text preprocessing module is responsible for preprocessing the long texts into a corpus, embedding the corpus into a vector space, performing semantic coding with sentences as units to form sentence vectors, and transmitting the sentence vectors to the text classification prediction module;
the text classification prediction module predicts the mapped target-language vectors from the sentence vectors using a Transformer-based multi-language space mapping model, and determines the similarity between the long texts in the different languages from a joint loss function over different target-language vectors, where the joint loss function combines an infoNCE loss with a mutual information loss;
and the text classification result output module classifies the long texts in different languages according to the similarity among the long texts in the different languages and outputs a classification result.
The tool first uses the text acquisition module to obtain long texts in several different languages. The long texts are preprocessed into corpora, the corpora are embedded into a vector space, and semantic coding is performed with sentences as units to form sentence vectors. A Transformer-based multi-language space mapping model then predicts the mapped target-language vectors; the similarity between the long texts in different languages is determined from a joint loss function over different target-language vectors; and the long texts are classified and the classification result output according to that similarity. Because the joint loss function combines an infoNCE loss with a mutual information loss, the tool overcomes the shortcomings of earlier methods that considered the similarity between vectors only from the single angle of distance in space, ignoring differences in information content between sentences, which led to poor matching and low accuracy when retrieving, comparing, and recommending user texts. The prediction results therefore become more accurate, and the computed similarities more faithful.
The similarity between the long texts in different languages is determined from a joint loss function over different target-language vectors, combining an infoNCE loss with a mutual information loss. Similarity is thus considered in two ways: closeness in distance space, and correlation of information content. This resolves the poor matching and low accuracy that arise when similarity between vectors is judged only by distance while differences in information content between sentences are ignored; the prediction results become more accurate, the computed similarities more credible, and sentences in different languages with the same meaning and comparable information content can be found accurately.
The further technical scheme is that the preprocessing comprises word segmentation, stop-word removal, and stem extraction to obtain an operable corpus.
Preprocessing the long sentences extracts the word stems and eliminates interference items such as stop words, which improves the accuracy of the final similarity evaluation to a certain degree.
The further technical scheme is that the Transformer-based multi-language space mapping model first encodes the sentence vectors and then feeds them to a decoder to obtain the mapped target-language vectors.
To establish an effective shared space mapping model across the multi-language space, the main model in this product is a Transformer-based encoder and decoder: the source language is encoded by the encoder and then fed into the decoder to obtain the mapped target language. To make training effective, the loss function design combines a contrastive learning framework with reinforcement learning, and a pre-training model based on time-series variation is added so that the model parameters converge stably during training. A word-level coding training process is also added; a flow-based learning mode is chosen during training, so that the coding of words in the dictionary can be confined to a fixed distribution space.
The further technical scheme is that the joint loss function is calculated as:

L_CL = L_NCE + L_I

where L_CL is the joint loss function, L_NCE is the infoNCE loss function, and L_I is the mutual information loss function.

So that synonymous sentences in different languages obtain similar codes, the loss function of this product is designed for contrastive learning. Similarity is considered in two ways: the first is closeness in distance space, the second is a closer correlation of information content. Weighing the two together yields the joint loss function, which introduces the infoNCE loss and the mutual information loss, denoted L_NCE and L_I respectively. Take the i-th sample x_i as an example, and denote the positive samples (sentences with the same meaning in other languages) x_i+ and the negative samples (sentences with different meanings in other languages) x_i-. From these, L_NCE can be calculated by the infoNCE formula. At the same time, x_i together with x_i+ and x_i- is fed into a mutual information estimator to calculate the corresponding mutual information I(x_i; x_i+) and I(x_i; x_i-). The goal of the mutual information term is simply that sentences with the same meaning in different languages should have a large amount of mutual information, while sentences with different meanings in different languages should have a small amount.
The further technical scheme is that the specific steps of calculating the news similarity are as follows:
S21, based on the multi-language long text similarity retrieval and classification tool, the news titles of news in multiple languages are converted into multiple mapped target-language vectors;
S22, based on the multiple mapped target-language vectors, an LDA topic model is adopted to obtain the similarity of the news headlines;
S23, a topic similarity matrix is constructed, and the topic similarity of the multi-language news is judged;
S24, the news similarity is constructed based on the similarity of the news headlines and the topic similarity of the news.
The similarity of the news titles is obtained from the titles of the news items, then the topic similarity matrix is constructed and the topic similarity of the news obtained, so that the news similarity is evaluated from multiple angles and the similarity evaluation becomes more accurate.
In a possible embodiment, the topic similarity of the news is evaluated if and only if the similarity of the news headlines is greater than a second similarity threshold, where the second similarity threshold is determined by the number of multi-language news items; this also greatly increases the efficiency of the overall news similarity evaluation.
The further technical scheme is that the news similarity is calculated as:

Sim(A, B) = α · Sim_title(A, B) + β · Sim_topic(A, B)

where Sim_title(A, B) is the similarity between the titles of news A and news B, Sim_topic(A, B) is the similarity between the topics of news A and news B, and α and β are constants.
The further technical scheme is that when the news similarity is greater than a first similarity threshold, the news items are clustered using an improved SinglePass incremental clustering algorithm based on the news time dimension, where the first similarity threshold is determined by the number of multi-language news items.
The further technical scheme is that the specific steps of clustering the news are as follows:
S31, the average of the news items in a cluster (the average of the title vectors and the average of the topic probability distributions) is defined as the cluster center, and the distance between a news item and the cluster is calculated;
S32, when event-level fine-grained news is clustered, a news release time parameter is added, and the modified distance between the news item and the cluster is determined;
S33, a cluster merging threshold is specified, and if the inter-cluster distance is smaller than the cluster merging threshold, the two clusters are merged.
A traditional static clustering algorithm must re-cluster all samples every time a sample is added, at excessive time cost. SinglePass, as an incremental clustering algorithm for streaming data, has greater potential in real-time settings, but for long-text clustering it still suffers from excessive time complexity and low accuracy. The invention improves the SinglePass algorithm: by adding a news release time parameter when clustering event-level fine-grained news, among other measures, it effectively reduces similar clusters, improves clustering accuracy, and lowers the time overhead of clustering long texts in real time.
The further technical scheme is that the multi-language mapping space model uses a stochastic distributed optimization algorithm, Flex-SADMM, to optimize the search for the multi-language mapping.
The method combines variance-reduced first-order information with approximate second-order information to solve the subproblems of the stochastic alternating direction method of multipliers (ADMM), aiming at stable convergence and improving the efficiency, computability, and precision of the search direction. The method only requires that each compute node update its corresponding variable at least once per T iterations. To introduce SVRG, this product divides the ADMM procedure into two stages: the overall gradient is calculated in the first stage, and the second stage requires T iterations. In this way, each compute node can update its variables at least once in the second stage.
Drawings
The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 is a block diagram of a multilingual long-text similarity search and classification tool according to example 1.
Fig. 2 is a flowchart of the specific steps of the news similarity calculation according to embodiment 1.
Fig. 3 is a flowchart of specific steps of clustering the news according to embodiment 1.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus their detailed description will be omitted.
The terms "a," "an," "the," "said" are used to indicate the presence of one or more elements/components/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.
Example 1
To solve the above problem, according to one aspect of the present invention, as shown in fig. 1, a multi-language long text similarity retrieval and classification tool is provided.
A multi-language long text similarity retrieval and classification tool specifically comprises:
the system comprises a text acquisition module, a text preprocessing module, a text classification prediction module and a text classification result output module;
the text acquisition module is responsible for acquiring a plurality of long texts in different languages and transmitting the long texts to the text preprocessing module;
the text preprocessing module is responsible for preprocessing the long texts into a corpus, embedding the corpus into a vector space, performing semantic coding with sentences as units to form sentence vectors, and transmitting the sentence vectors to the text classification prediction module;
the text classification prediction module predicts the mapped target-language vectors from the sentence vectors using a Transformer-based multi-language space mapping model, and determines the similarity between the long texts in the different languages from a joint loss function over different target-language vectors, where the joint loss function combines an infoNCE loss with a mutual information loss;
and the text classification result output module classifies the long texts in different languages according to the similarity among the long texts in the different languages and outputs a classification result.
The tool first uses the text acquisition module to obtain long texts in several different languages. The long texts are preprocessed into corpora, the corpora are embedded into a vector space, and semantic coding is performed with sentences as units to form sentence vectors. A Transformer-based multi-language space mapping model then predicts the mapped target-language vectors; the similarity between the long texts in different languages is determined from a joint loss function over different target-language vectors; and the long texts are classified and the classification result output according to that similarity. Because the joint loss function combines an infoNCE loss with a mutual information loss, the tool overcomes the shortcomings of earlier methods that considered the similarity between vectors only from the single angle of distance in space, ignoring differences in information content between sentences, which led to poor matching and low accuracy when retrieving, comparing, and recommending user texts. The prediction results therefore become more accurate, and the computed similarities more faithful.
The similarity between the long texts in different languages is determined from a joint loss function over different target-language vectors, combining an infoNCE loss with a mutual information loss. Similarity is thus considered in two ways: closeness in distance space, and correlation of information content. This resolves the poor matching and low accuracy that arise when similarity between vectors is judged only by distance while differences in information content between sentences are ignored; the prediction results become more accurate, the computed similarities more credible, and sentences in different languages with the same meaning and comparable information content can be found accurately.
Specifically, for example, a multi-language model obtained by consistent mapping of the multi-language feature vector spaces can compare the similarity of long texts in different languages and return search results in several languages at once, supporting Chinese, English, French, and Arabic.
In another possible embodiment, the preprocessing comprises word segmentation, stop-word removal, and stem extraction to obtain an operable corpus.
Preprocessing the long sentences extracts the word stems and eliminates interference items such as stop words, which improves the accuracy of the final similarity evaluation to a certain degree.
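A minimal sketch of this preprocessing pipeline follows; the toy stop list, the naive suffix-stripping rules, and the example sentence are illustrative stand-ins for a real toolkit (such as NLTK for English or jieba for Chinese), not part of the patent.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "in"}  # toy list
SUFFIXES = ("ing", "ed", "es", "s")  # naive stemming rules, tried in order

def stem(word: str) -> str:
    # strip the first matching suffix, keeping at least a 3-letter stem
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", text.lower())         # word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stem extraction

corpus = preprocess("The encoders are mapping the embedded sentences.")
```

The resulting token list is the "operable corpus" that is then embedded into the vector space.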
In another possible embodiment, the Transformer-based multi-language space mapping model first encodes the sentence vectors and then feeds them to a decoder to obtain the mapped target-language vectors.
To establish an effective shared space mapping model across the multi-language space, the main model in this product is a Transformer-based encoder and decoder: the source language is encoded by the encoder and then fed into the decoder to obtain the mapped target language. To make training effective, the loss function design combines a contrastive learning framework with reinforcement learning, and a pre-training model based on time-series variation is added so that the model parameters converge stably during training. A word-level coding training process is also added; a flow-based learning mode is chosen during training, so that the coding of words in the dictionary can be confined to a fixed distribution space.
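The encode-then-decode data flow can be sketched at the shape level as follows; the random matrices stand in for trained Transformer encoder and decoder weights, and the dimensions are arbitrary assumptions chosen only to show how source sentence vectors move into the shared target space.

```python
import numpy as np

rng = np.random.default_rng(1)
d_src, d_hidden, d_tgt = 8, 16, 8
W_enc = rng.normal(scale=0.1, size=(d_src, d_hidden))   # stand-in encoder weights
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_tgt))   # stand-in decoder weights

def map_to_target(sentence_vecs: np.ndarray) -> np.ndarray:
    hidden = np.tanh(sentence_vecs @ W_enc)  # "encode" the source-language vectors
    return np.tanh(hidden @ W_dec)           # "decode" into the target-language space

src_sentences = rng.normal(size=(3, d_src))  # three source-language sentence vectors
tgt_vectors = map_to_target(src_sentences)   # their mapped target-language vectors
```

In the real model the two linear maps would be full Transformer encoder and decoder stacks; only the interface (sentence vectors in, target-language vectors out) is taken from the text.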
In another possible embodiment, the joint loss function is calculated as:

L_CL = L_NCE + L_I

where L_CL is the joint loss function, L_NCE is the infoNCE loss function, and L_I is the mutual information loss function.

So that synonymous sentences in different languages obtain similar codes, the loss function of this product is designed for contrastive learning. Similarity is considered in two ways: the first is closeness in distance space, the second is a closer correlation of information content. Weighing the two together yields the joint loss function, which introduces the infoNCE loss and the mutual information loss, denoted L_NCE and L_I respectively. Take the i-th sample x_i as an example, and denote the positive samples (sentences with the same meaning in other languages) x_i+ and the negative samples (sentences with different meanings in other languages) x_i-. From these, L_NCE can be calculated by the infoNCE formula. At the same time, x_i together with x_i+ and x_i- is fed into a mutual information estimator to calculate the corresponding mutual information I(x_i; x_i+) and I(x_i; x_i-). The goal of the mutual information term is simply that sentences with the same meaning in different languages should have a large amount of mutual information, while sentences with different meanings in different languages should have a small amount.
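The loss design above can be sketched numerically. The infoNCE term below uses cosine similarity with a temperature, which is the standard form; the mutual information term is replaced by a simple similarity-gap surrogate, since the patent does not spell out its estimator. The temperature value and the toy vectors are illustrative assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(x, pos, negs, tau=0.1):
    # L_NCE = -log( exp(sim(x, x+)/tau) / sum over x+ and all x- )
    sims = np.array([cosine(x, pos)] + [cosine(x, n) for n in negs]) / tau
    return float(-sims[0] + np.log(np.exp(sims).sum()))

def mi_surrogate(x, pos, negs):
    # stands in for the L_I term: push I(x; x+) up and I(x; x-) down
    return float(np.mean([cosine(x, n) for n in negs]) - cosine(x, pos))

x_i = np.array([1.0, 0.0])       # the i-th sentence vector
x_pos = np.array([0.9, 0.1])     # same meaning, other language
x_negs = [np.array([0.0, 1.0])]  # different meaning, other language

loss_cl = info_nce(x_i, x_pos, x_negs) + mi_surrogate(x_i, x_pos, x_negs)
```

A matched positive pair yields a much smaller infoNCE loss than a mismatched one, which is exactly the pressure that pulls synonymous sentences in different languages toward similar codes.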
In another possible embodiment, the specific steps of calculating the news similarity are as follows:
S21, based on the multi-language long text similarity retrieval and classification tool, the news titles of news in multiple languages are converted into multiple mapped target-language vectors;
S22, based on the multiple mapped target-language vectors, an LDA topic model is adopted to obtain the similarity of the news headlines;
S23, a topic similarity matrix is constructed, and the topic similarity of the multi-language news is judged;
S24, the news similarity is constructed based on the similarity of the news headlines and the topic similarity of the news.
The similarity of the news titles is obtained from the titles of the news items, then the topic similarity matrix is constructed and the topic similarity of the news obtained, so that the news similarity is evaluated from multiple angles and the similarity evaluation becomes more accurate.
In a possible embodiment, the topic similarity of the news is evaluated if and only if the similarity of the news headlines is greater than a second similarity threshold, where the second similarity threshold is determined by the number of multi-language news items; this greatly increases the overall efficiency of the news similarity evaluation.
Specifically, for example, the second similarity threshold is determined by analyzing the number of multi-language news items and constructing an empirical formula.
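A toy sketch of this gating step follows. The fixed threshold value and the Hellinger-based comparison of topic probability distributions (a common choice for LDA outputs) are illustrative assumptions; the patent derives the threshold from an empirical formula it does not spell out.

```python
import numpy as np

def hellinger(p, q) -> float:
    # distance between two topic probability distributions (e.g. from LDA)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))

def gated_topic_similarity(title_sim, topics_a, topics_b, threshold=0.5):
    if title_sim <= threshold:
        return None  # headline gate failed: skip the costly topic comparison
    return 1.0 - hellinger(topics_a, topics_b)  # similarity in [0, 1]
```

Only headline-similar pairs pay for the topic comparison, which is where the efficiency gain in the paragraph above comes from.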
In another possible embodiment, the news similarity is calculated as:

Sim(A, B) = α · Sim_title(A, B) + β · Sim_topic(A, B)

where Sim_title(A, B) is the similarity between the titles of news A and news B, Sim_topic(A, B) is the similarity between the topics of news A and news B, and α and β are constants.
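Under this weighted form, the combination step is a one-liner; the weight values 0.6 and 0.4 are hypothetical defaults, since the patent only states that α and β are constants.

```python
def news_similarity(sim_title: float, sim_topic: float,
                    alpha: float = 0.6, beta: float = 0.4) -> float:
    # Sim(A, B) = alpha * Sim_title(A, B) + beta * Sim_topic(A, B)
    return alpha * sim_title + beta * sim_topic
```

With α + β = 1, the result stays on the same [0, 1] scale as its two inputs, which makes the first similarity threshold below directly comparable.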
In another possible embodiment, when the news similarity is greater than a first similarity threshold, the news items are clustered using an improved SinglePass incremental clustering algorithm based on the news time dimension, where the first similarity threshold is determined by the number of multi-language news items.
Specifically, for example, the first similarity threshold is determined by analyzing the number of multi-language news items and constructing an empirical formula.
In another possible embodiment, the specific steps of clustering the news are as follows:
S31, the average of the news items in a cluster (the average of the title vectors and the average of the topic probability distributions) is defined as the cluster center, and the distance between a news item and the cluster is calculated;
S32, when event-level fine-grained news is clustered, a news release time parameter is added, and the modified distance between the news item and the cluster is determined;
S33, a cluster merging threshold is specified, and if the inter-cluster distance is smaller than the cluster merging threshold, the two clusters are merged.
A traditional static clustering algorithm must re-cluster all samples every time a sample is added, at excessive time cost. SinglePass, as an incremental clustering algorithm for streaming data, has greater potential in real-time settings, but for long-text clustering it still suffers from excessive time complexity and low accuracy. The invention improves the SinglePass algorithm: by adding a news release time parameter when clustering event-level fine-grained news, among other measures, it effectively reduces similar clusters, improves clustering accuracy, and lowers the time overhead of clustering long texts in real time.
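Steps S31 and S32 can be sketched as a single time-aware pass; the exponential decay on the publish-time gap and all parameter values are illustrative assumptions standing in for the patent's unspecified modified distance.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def single_pass(items, sim_threshold=0.8, time_scale=3.0):
    """items: (title_vector, publish_time) pairs, processed in one pass."""
    clusters = []  # each cluster: center (mean title vector), times, members
    for vec, t in items:
        best, best_sim = None, -1.0
        for c in clusters:
            # time penalty: far-apart publish dates suppress the similarity
            decay = np.exp(-abs(t - np.mean(c["times"])) / time_scale)
            sim = cosine(vec, c["center"]) * decay
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= sim_threshold:
            best["members"].append(vec)
            best["times"].append(t)
            best["center"] = np.mean(best["members"], axis=0)  # S31: mean center
        else:
            clusters.append({"center": vec, "times": [t], "members": [vec]})
    return clusters
```

Two near-identical headlines published on the same day land in one cluster, while the same headline a month later starts a new event-level cluster, which is the behavior the time parameter is meant to buy.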
In another possible embodiment, the multi-language mapping space model optimizes the search and mapping of the multi-language mapping by using a stochastic-optimization distributed algorithm, flex-SADMM.
The method combines first-order information with variance-reduced approximate second-order information to solve the subproblems of the stochastic alternating direction method of multipliers (ADMM), aiming at stable convergence and improving the efficiency, computability, and accuracy of the search direction. It requires only that each compute node update its corresponding variable at least once every T iterations. To incorporate SVRG, the ADMM process is divided into two stages: the full gradient is computed in the first stage, and the second stage runs for T iterations. In this way, each compute node can update its variables at least once during the second stage.
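The two-stage SVRG pattern described above (a full gradient at a snapshot in stage one, then T variance-reduced stochastic updates in stage two) can be sketched on a toy least-squares objective. This is a generic SVRG illustration, not the flex-SADMM algorithm itself; the step size, epoch count, and problem are all assumptions:

```python
import random

def svrg(grad_i, n, w0, step=0.1, epochs=3, T=20):
    # Stage 1: compute the full gradient at a snapshot of the iterate.
    # Stage 2: run T stochastic updates, each corrected by the snapshot
    # gradient so that the update's variance is reduced.
    w = list(w0)
    dim = len(w)
    for _ in range(epochs):
        snapshot = list(w)
        full = [0.0] * dim
        for i in range(n):
            g = grad_i(i, snapshot)
            full = [f + gi / n for f, gi in zip(full, g)]
        for _ in range(T):
            i = random.randrange(n)
            gw = grad_i(i, w)          # stochastic gradient at current point
            gs = grad_i(i, snapshot)   # same component at the snapshot
            w = [wj - step * (a - b + f)
                 for wj, a, b, f in zip(w, gw, gs, full)]
    return w

# Toy problem: minimize (1/n) * sum_i (w - x_i)^2 over a scalar w.
data = [1.0, 2.0, 3.0, 4.0]

def grad(i, w):
    # gradient of the i-th term (w - x_i)^2
    return [2.0 * (w[0] - data[i])]

w_opt = svrg(grad, len(data), [0.0])  # converges to the data mean, 2.5
```

On this quadratic the per-sample correction cancels exactly, so the iterate contracts deterministically toward the mean; on general objectives the correction only reduces, rather than eliminates, the gradient noise.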
In embodiments of the present invention, the term "plurality" means two or more unless explicitly defined otherwise. The terms "mounted," "connected," "secured," and the like are to be construed broadly, and for example, "connected" may be a fixed connection, a removable connection, or an integral connection. Specific meanings of the above terms in the embodiments of the present invention can be understood by those of ordinary skill in the art according to specific situations.
In the description of the embodiments of the present invention, it should be understood that terms such as "upper" and "lower" indicate orientations or positional relationships based on those shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the referenced devices or units must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the embodiments of the present invention.
In the description herein, the appearances of the phrase "one embodiment," "a preferred embodiment," or the like, are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the present embodiment by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present invention should be included in the protection scope of the embodiments of the present invention.

Claims (10)

1. A multilingual long-text similarity retrieval and classification tool is characterized by specifically comprising:
the system comprises a text acquisition module, a text preprocessing module, a text classification prediction module and a text classification result output module;
the text acquisition module is responsible for acquiring a plurality of long texts in different languages and transmitting the long texts to the text preprocessing module;
the text preprocessing module is responsible for preprocessing the long text to obtain a preprocessed text, obtaining a corpus, embedding the corpus into a vector space, performing semantic coding on sentences as units to form sentence vectors, and transmitting the sentence vectors to the text classification prediction module;
the text classification prediction module predicts a mapped target language vector by adopting a Transformer-based multi-language space mapping model according to the sentence vector, and determines the similarity among the long texts of the different languages according to a joint loss function among the different target language vectors, wherein the joint loss function adopts an infoNCE loss and a mutual information loss;
and the text classification result output module classifies the long texts in different languages according to the similarity between the long texts in the different languages and outputs a classification result.
2. The multilingual long-text-similarity search and classification tool of claim 1, wherein the preprocessing comprises word segmentation, stop-word removal, and stemming to obtain an operable corpus.
3. The multi-lingual long text similarity retrieval and classification tool of claim 1, wherein the Transformer-based multi-lingual space mapping model first encodes the sentence vectors and then feeds the sentence vectors to a decoder to obtain the mapped target language vectors.
4. The multilingual long-text similarity search and classification tool of claim 1, wherein the joint loss function is calculated by the formula:

L_CL = L_NEC + L_I

wherein L_CL is the joint loss function, L_NEC is the infoNCE loss function, and L_I is the mutual information loss function.
5. The multilingual long-text similarity search and classification tool of claim 1, wherein the news similarity is calculated by the steps of:
s21, based on the multilingual long text similarity retrieval and classification tool, converting news titles of the multilingual news into a plurality of mapped target language vectors;
s22, based on the plurality of mapped target language vectors, adopting an LDA topic model to obtain the similarity of the news headlines;
s23, constructing a theme similarity matrix and judging the theme similarity of the multi-language news;
and S24, constructing news similarity based on the similarity of the news headlines and the topic similarity of the news.
6. The multi-language long text similarity retrieval and classification tool of claim 5, wherein the topic similarity of the news is judged if and only if the similarity of the news headlines is greater than a second similarity threshold, wherein the second similarity threshold is determined according to the number of the news in the multiple languages.
7. The multilingual long-text-similarity search and classification tool of claim 5, wherein the news similarity is calculated by the formula:

Sim(A, B) = α · Sim_title(A, B) + β · Sim_topic(A, B)

wherein Sim_title(A, B) is the similarity of the title of news A and the title of news B, Sim_topic(A, B) is the similarity of the topic of news A and the topic of news B, and α and β are constants.
8. The multi-language long text similarity retrieval and classification tool of claim 6, wherein the news is clustered using a modified SinglePass incremental clustering algorithm based on a news time dimension when the news similarity is greater than a first similarity threshold, wherein the first similarity threshold is determined based on the number of news in the multiple languages.
9. The multilingual long-text-similarity search and classification tool of claim 5, wherein clustering the news comprises:
s31, defining the average value of the news in the cluster as a cluster center, and calculating the distance between the news and the cluster;
s32, adding a news release time parameter when event-level fine-grained news is clustered, and determining the distance between the modified news and the cluster;
and S33, designating a cluster merging threshold, and merging the two clusters if the inter-cluster distance is smaller than the cluster merging threshold.
10. The multi-language long text similarity retrieval and classification tool of claim 1, wherein the multi-language mapping space model optimizes the search and mapping of the multi-language mapping using a stochastic optimization distributed algorithm, flex-SADMM.
CN202211568520.3A 2022-12-08 2022-12-08 Multi-language long text similarity retrieval and classification tool Active CN115630142B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211568520.3A CN115630142B (en) 2022-12-08 2022-12-08 Multi-language long text similarity retrieval and classification tool


Publications (2)

Publication Number Publication Date
CN115630142A true CN115630142A (en) 2023-01-20
CN115630142B CN115630142B (en) 2023-03-14

Family

ID=84910843


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930952A (en) * 2020-09-21 2020-11-13 杭州识度科技有限公司 Method, system, equipment and storage medium for long text cascade classification
US20210383064A1 (en) * 2020-06-03 2021-12-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Text recognition method, electronic device, and storage medium
CN114707516A (en) * 2022-03-29 2022-07-05 北京理工大学 Long text semantic similarity calculation method based on contrast learning
CN115115002A (en) * 2022-07-22 2022-09-27 宁波牛信网络科技有限公司 Text similarity calculation model generation method, device, equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant