CN115630142A - Multi-language long text similarity retrieval and classification tool - Google Patents


Info

Publication number
CN115630142A
CN115630142A (application CN202211568520.3A; granted as CN115630142B)
Authority
CN
China
Prior art keywords
similarity
text
news
long
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211568520.3A
Other languages
Chinese (zh)
Other versions
CN115630142B (en)
Inventor
吴林
周亭
吴治伟
王士奇
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China
Priority to CN202211568520.3A
Publication of CN115630142A
Application granted
Publication of CN115630142B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multilingual long-text similarity retrieval and classification tool in the technical field of natural language processing. It comprises a text acquisition module, a text preprocessing module, a text classification prediction module, and a text classification result output module. The text acquisition module acquires long texts in several different languages. The text preprocessing module preprocesses each long text into a corpus, embeds the corpus into a vector space, and performs semantic coding with sentences as units to form sentence vectors. The text classification prediction module uses a multi-language space mapping model to predict the mapped target-language vectors, and determines the similarity between the long texts in different languages from a joint loss function over different target-language vectors, where the joint loss function combines an infoNCE loss with a mutual information loss. The text classification result output module outputs the classification result according to the similarity between the long texts, yielding more accurate matching results.

Description

Multi-language long text similarity retrieval and classification tool
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a multilingual long text similarity retrieval and classification tool.
Background
As local cultures around the world have developed, different historical environments have given rise to the diversity and uniqueness of each country's languages. In today's international communication, establishing shared semantic relations that cross language barriers and promote cultural exchange between countries remains an open challenge. According to available statistics, more than five thousand languages are currently in use worldwide, so the world remains linguistically diverse. With the rapid development of the internet, the volume of information people generate keeps growing, carried by diverse media including video, images, and text. This information connects the whole world, the cultures of different countries collide and merge intensely, and language processing technology continues to advance.
In the master's thesis "Research on cross-language text similarity comparison based on multi-language embedding", the author Wang Kai converts one-dimensional text data into two-dimensional matrix data through a cross-language similarity matrix construction algorithm, performs hierarchical feature extraction on the interaction structure between cross-language sentences, and verifies the model's applicability to different language structures in experiments involving multiple languages such as Chinese, English, and German.
To address this technical problem, the invention provides a multilingual long text similarity retrieval and classification tool.
Disclosure of Invention
In order to realize the purpose of the invention, the invention adopts the following technical scheme:
according to one aspect of the invention, a multi-language long text similarity retrieval and classification tool is provided.
A multilingual long text similarity retrieval and classification tool specifically comprises:
the system comprises a text acquisition module, a text preprocessing module, a text classification prediction module and a text classification result output module;
the text acquisition module is responsible for acquiring a plurality of long texts in different languages and transmitting the long texts to the text preprocessing module;
the text preprocessing module is responsible for preprocessing the long texts into a corpus, embedding the corpus into a vector space, performing semantic coding with sentences as units to form sentence vectors, and transmitting the sentence vectors to the text classification prediction module;
the text classification prediction module predicts the mapped target-language vectors from the sentence vectors using a Transformer-based multi-language space mapping model, and determines the similarity between the long texts in the different languages from a joint loss function over different target-language vectors, where the joint loss function combines an infoNCE loss with a mutual information loss;
and the text classification result output module classifies the long texts in different languages according to the similarity among the long texts in the different languages and outputs a classification result.
The tool first uses the text acquisition module to obtain long texts in several different languages. The long texts are preprocessed into corpora, the corpora are embedded into a vector space, and semantic coding is performed with sentences as units to form sentence vectors. A Transformer-based multi-language space mapping model then predicts the mapped target-language vectors; the similarity between the long texts in different languages is determined from a joint loss function over different target-language vectors; and the long texts are classified and the classification result output according to that similarity. Because the joint loss function combines an infoNCE loss with a mutual information loss, the tool overcomes the shortcomings of earlier methods that considered the similarity between vectors only from the single angle of distance in space, ignoring differences in information content between sentences, which led to poor matching and low accuracy when retrieving, comparing, and recommending user texts. The prediction results therefore become more accurate, and the computed similarities more faithful.
The similarity between the long texts in different languages is determined from a joint loss function over different target-language vectors, combining an infoNCE loss with a mutual information loss. Similarity is thus considered in two ways: closeness in distance space, and correlation of information content. This resolves the poor matching and low accuracy that arise when similarity between vectors is judged only by distance while differences in information content between sentences are ignored; the prediction results become more accurate, the computed similarities more credible, and sentences in different languages with the same meaning and comparable information content can be found accurately.
The further technical scheme is that the preprocessing comprises word segmentation, stop-word removal, and stem extraction to obtain an operable corpus.
Preprocessing the long sentences extracts the word stems and eliminates interference items such as stop words, which improves the accuracy of the final similarity evaluation to a certain degree.
The further technical scheme is that the Transformer-based multi-language space mapping model first encodes the sentence vectors and then feeds them to a decoder to obtain the mapped target-language vectors.
To establish an effective shared space mapping model across the multi-language space, the main model in this product is a Transformer-based encoder and decoder: the source language is encoded by the encoder and then fed into the decoder to obtain the mapped target language. To make training effective, the loss function design combines a contrastive learning framework with reinforcement learning, and a pre-training model based on time-series variation is added so that the model parameters converge stably during training. A word-level coding training process is also added; a flow-based learning mode is chosen during training, so that the coding of words in the dictionary can be confined to a fixed distribution space.
The further technical scheme is that the joint loss function is calculated as:

L_CL = L_NCE + L_I

where L_CL is the joint loss function, L_NCE is the infoNCE loss function, and L_I is the mutual information loss function.

So that synonymous sentences in different languages obtain similar codes, the loss function of this product is designed for contrastive learning. Similarity is considered in two ways: the first is closeness in distance space, the second is a closer correlation of information content. Weighing the two together yields the joint loss function, which introduces the infoNCE loss and the mutual information loss, denoted L_NCE and L_I respectively. Take the i-th sample x_i as an example, and denote the positive samples (sentences with the same meaning in other languages) x_i+ and the negative samples (sentences with different meanings in other languages) x_i-. From these, L_NCE can be calculated by the infoNCE formula. At the same time, x_i together with x_i+ and x_i- is fed into a mutual information estimator to calculate the corresponding mutual information I(x_i; x_i+) and I(x_i; x_i-). The goal of the mutual information term is simply that sentences with the same meaning in different languages should have a large amount of mutual information, while sentences with different meanings in different languages should have a small amount.
The further technical scheme is that the specific steps of calculating the news similarity are as follows:
S21, based on the multi-language long text similarity retrieval and classification tool, the news titles of news in multiple languages are converted into multiple mapped target-language vectors;
S22, based on the multiple mapped target-language vectors, an LDA topic model is adopted to obtain the similarity of the news headlines;
S23, a topic similarity matrix is constructed, and the topic similarity of the multi-language news is judged;
S24, the news similarity is constructed based on the similarity of the news headlines and the topic similarity of the news.
The similarity of the news titles is obtained from the titles of the news items, then the topic similarity matrix is constructed and the topic similarity of the news obtained, so that the news similarity is evaluated from multiple angles and the similarity evaluation becomes more accurate.
In a possible embodiment, the topic similarity of the news is evaluated if and only if the similarity of the news headlines is greater than a second similarity threshold, where the second similarity threshold is determined by the number of multi-language news items; this also greatly increases the efficiency of the overall news similarity evaluation.
The further technical scheme is that the news similarity is calculated as:

Sim(A, B) = α · Sim_title(A, B) + β · Sim_topic(A, B)

where Sim_title(A, B) is the similarity between the titles of news A and news B, Sim_topic(A, B) is the similarity between the topics of news A and news B, and α and β are constants.
The further technical scheme is that when the news similarity is greater than a first similarity threshold, the news items are clustered using an improved SinglePass incremental clustering algorithm based on the news time dimension, where the first similarity threshold is determined by the number of multi-language news items.
The further technical scheme is that the specific steps of clustering the news are as follows:
S31, the average of the news items in a cluster (the average of the title vectors and the average of the topic probability distributions) is defined as the cluster center, and the distance between a news item and the cluster is calculated;
S32, when event-level fine-grained news is clustered, a news release time parameter is added, and the modified distance between the news item and the cluster is determined;
S33, a cluster merging threshold is specified, and if the inter-cluster distance is smaller than the cluster merging threshold, the two clusters are merged.
A traditional static clustering algorithm must re-cluster all samples every time a sample is added, at excessive time cost. SinglePass, as an incremental clustering algorithm for streaming data, has greater potential in real-time settings, but for long-text clustering it still suffers from excessive time complexity and low accuracy. The invention improves the SinglePass algorithm: by adding a news release time parameter when clustering event-level fine-grained news, among other measures, it effectively reduces similar clusters, improves clustering accuracy, and lowers the time overhead of clustering long texts in real time.
The further technical scheme is that the multi-language mapping space model uses a stochastic distributed optimization algorithm, Flex-SADMM, to optimize the search for the multi-language mapping.
The method combines variance-reduced first-order information with approximate second-order information to solve the subproblems of the stochastic alternating direction method of multipliers (ADMM), aiming at stable convergence and improving the efficiency, computability, and precision of the search direction. The method only requires that each compute node update its corresponding variable at least once per T iterations. To introduce SVRG, this product divides the ADMM procedure into two stages: the overall gradient is calculated in the first stage, and the second stage requires T iterations. In this way, each compute node can update its variables at least once in the second stage.
Drawings
The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 is a block diagram of a multilingual long-text similarity search and classification tool according to example 1.
Fig. 2 is a flowchart of the specific steps of the news similarity calculation according to embodiment 1.
Fig. 3 is a flowchart of specific steps of clustering the news according to embodiment 1.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus their detailed description will be omitted.
The terms "a," "an," "the," "said" are used to indicate the presence of one or more elements/components/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.
Example 1
To solve the above problem, according to one aspect of the present invention, as shown in fig. 1, a multi-language long text similarity retrieval and classification tool is provided.
A multi-language long text similarity retrieval and classification tool specifically comprises:
the system comprises a text acquisition module, a text preprocessing module, a text classification prediction module and a text classification result output module;
the text acquisition module is responsible for acquiring a plurality of long texts in different languages and transmitting the long texts to the text preprocessing module;
the text preprocessing module is responsible for preprocessing the long texts into a corpus, embedding the corpus into a vector space, performing semantic coding with sentences as units to form sentence vectors, and transmitting the sentence vectors to the text classification prediction module;
the text classification prediction module predicts the mapped target-language vectors from the sentence vectors using a Transformer-based multi-language space mapping model, and determines the similarity between the long texts in the different languages from a joint loss function over different target-language vectors, where the joint loss function combines an infoNCE loss with a mutual information loss;
and the text classification result output module classifies the long texts in different languages according to the similarity among the long texts in the different languages and outputs a classification result.
The tool first uses the text acquisition module to obtain long texts in several different languages. The long texts are preprocessed into corpora, the corpora are embedded into a vector space, and semantic coding is performed with sentences as units to form sentence vectors. A Transformer-based multi-language space mapping model then predicts the mapped target-language vectors; the similarity between the long texts in different languages is determined from a joint loss function over different target-language vectors; and the long texts are classified and the classification result output according to that similarity. Because the joint loss function combines an infoNCE loss with a mutual information loss, the tool overcomes the shortcomings of earlier methods that considered the similarity between vectors only from the single angle of distance in space, ignoring differences in information content between sentences, which led to poor matching and low accuracy when retrieving, comparing, and recommending user texts. The prediction results therefore become more accurate, and the computed similarities more faithful.
The similarity between the long texts in different languages is determined from a joint loss function over different target-language vectors, combining an infoNCE loss with a mutual information loss. Similarity is thus considered in two ways: closeness in distance space, and correlation of information content. This resolves the poor matching and low accuracy that arise when similarity between vectors is judged only by distance while differences in information content between sentences are ignored; the prediction results become more accurate, the computed similarities more credible, and sentences in different languages with the same meaning and comparable information content can be found accurately.
Specifically, for example, a multi-language model obtained by consistent mapping of the multi-language feature vector spaces can compare the similarity of long texts in different languages and return search results in several languages at once, supporting Chinese, English, French, and Arabic.
In another possible embodiment, the preprocessing comprises word segmentation, stop-word removal, and stem extraction to obtain an operable corpus.
Preprocessing the long sentences extracts the word stems and eliminates interference items such as stop words, which improves the accuracy of the final similarity evaluation to a certain degree.
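A minimal sketch of this preprocessing pipeline follows; the toy stop list, the naive suffix-stripping rules, and the example sentence are illustrative stand-ins for a real toolkit (such as NLTK for English or jieba for Chinese), not part of the patent.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "in"}  # toy list
SUFFIXES = ("ing", "ed", "es", "s")  # naive stemming rules, tried in order

def stem(word: str) -> str:
    # strip the first matching suffix, keeping at least a 3-letter stem
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", text.lower())         # word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stem extraction

corpus = preprocess("The encoders are mapping the embedded sentences.")
```

The resulting token list is the "operable corpus" that is then embedded into the vector space.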
In another possible embodiment, the Transformer-based multi-language space mapping model first encodes the sentence vectors and then feeds them to a decoder to obtain the mapped target-language vectors.
To establish an effective shared space mapping model across the multi-language space, the main model in this product is a Transformer-based encoder and decoder: the source language is encoded by the encoder and then fed into the decoder to obtain the mapped target language. To make training effective, the loss function design combines a contrastive learning framework with reinforcement learning, and a pre-training model based on time-series variation is added so that the model parameters converge stably during training. A word-level coding training process is also added; a flow-based learning mode is chosen during training, so that the coding of words in the dictionary can be confined to a fixed distribution space.
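The encode-then-decode data flow can be sketched at the shape level as follows; the random matrices stand in for trained Transformer encoder and decoder weights, and the dimensions are arbitrary assumptions chosen only to show how source sentence vectors move into the shared target space.

```python
import numpy as np

rng = np.random.default_rng(1)
d_src, d_hidden, d_tgt = 8, 16, 8
W_enc = rng.normal(scale=0.1, size=(d_src, d_hidden))   # stand-in encoder weights
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_tgt))   # stand-in decoder weights

def map_to_target(sentence_vecs: np.ndarray) -> np.ndarray:
    hidden = np.tanh(sentence_vecs @ W_enc)  # "encode" the source-language vectors
    return np.tanh(hidden @ W_dec)           # "decode" into the target-language space

src_sentences = rng.normal(size=(3, d_src))  # three source-language sentence vectors
tgt_vectors = map_to_target(src_sentences)   # their mapped target-language vectors
```

In the real model the two linear maps would be full Transformer encoder and decoder stacks; only the interface (sentence vectors in, target-language vectors out) is taken from the text.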
In another possible embodiment, the joint loss function is calculated as:

L_CL = L_NCE + L_I

where L_CL is the joint loss function, L_NCE is the infoNCE loss function, and L_I is the mutual information loss function.

So that synonymous sentences in different languages obtain similar codes, the loss function of this product is designed for contrastive learning. Similarity is considered in two ways: the first is closeness in distance space, the second is a closer correlation of information content. Weighing the two together yields the joint loss function, which introduces the infoNCE loss and the mutual information loss, denoted L_NCE and L_I respectively. Take the i-th sample x_i as an example, and denote the positive samples (sentences with the same meaning in other languages) x_i+ and the negative samples (sentences with different meanings in other languages) x_i-. From these, L_NCE can be calculated by the infoNCE formula. At the same time, x_i together with x_i+ and x_i- is fed into a mutual information estimator to calculate the corresponding mutual information I(x_i; x_i+) and I(x_i; x_i-). The goal of the mutual information term is simply that sentences with the same meaning in different languages should have a large amount of mutual information, while sentences with different meanings in different languages should have a small amount.
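The loss design above can be sketched numerically. The infoNCE term below uses cosine similarity with a temperature, which is the standard form; the mutual information term is replaced by a simple similarity-gap surrogate, since the patent does not spell out its estimator. The temperature value and the toy vectors are illustrative assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(x, pos, negs, tau=0.1):
    # L_NCE = -log( exp(sim(x, x+)/tau) / sum over x+ and all x- )
    sims = np.array([cosine(x, pos)] + [cosine(x, n) for n in negs]) / tau
    return float(-sims[0] + np.log(np.exp(sims).sum()))

def mi_surrogate(x, pos, negs):
    # stands in for the L_I term: push I(x; x+) up and I(x; x-) down
    return float(np.mean([cosine(x, n) for n in negs]) - cosine(x, pos))

x_i = np.array([1.0, 0.0])       # the i-th sentence vector
x_pos = np.array([0.9, 0.1])     # same meaning, other language
x_negs = [np.array([0.0, 1.0])]  # different meaning, other language

loss_cl = info_nce(x_i, x_pos, x_negs) + mi_surrogate(x_i, x_pos, x_negs)
```

A matched positive pair yields a much smaller infoNCE loss than a mismatched one, which is exactly the pressure that pulls synonymous sentences in different languages toward similar codes.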
In another possible embodiment, the specific steps of calculating the news similarity are as follows:
S21, based on the multi-language long text similarity retrieval and classification tool, the news titles of news in multiple languages are converted into multiple mapped target-language vectors;
S22, based on the multiple mapped target-language vectors, an LDA topic model is adopted to obtain the similarity of the news headlines;
S23, a topic similarity matrix is constructed, and the topic similarity of the multi-language news is judged;
S24, the news similarity is constructed based on the similarity of the news headlines and the topic similarity of the news.
The similarity of the news titles is obtained from the titles of the news items, then the topic similarity matrix is constructed and the topic similarity of the news obtained, so that the news similarity is evaluated from multiple angles and the similarity evaluation becomes more accurate.
In a possible embodiment, the topic similarity of the news is evaluated if and only if the similarity of the news headlines is greater than a second similarity threshold, where the second similarity threshold is determined by the number of multi-language news items; this greatly increases the overall efficiency of the news similarity evaluation.
Specifically, for example, the second similarity threshold is determined by analyzing the number of multi-language news items and constructing an empirical formula.
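A toy sketch of this gating step follows. The fixed threshold value and the Hellinger-based comparison of topic probability distributions (a common choice for LDA outputs) are illustrative assumptions; the patent derives the threshold from an empirical formula it does not spell out.

```python
import numpy as np

def hellinger(p, q) -> float:
    # distance between two topic probability distributions (e.g. from LDA)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))

def gated_topic_similarity(title_sim, topics_a, topics_b, threshold=0.5):
    if title_sim <= threshold:
        return None  # headline gate failed: skip the costly topic comparison
    return 1.0 - hellinger(topics_a, topics_b)  # similarity in [0, 1]
```

Only headline-similar pairs pay for the topic comparison, which is where the efficiency gain in the paragraph above comes from.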
In another possible embodiment, the news similarity is calculated as:

Sim(A, B) = α · Sim_title(A, B) + β · Sim_topic(A, B)

where Sim_title(A, B) is the similarity between the titles of news A and news B, Sim_topic(A, B) is the similarity between the topics of news A and news B, and α and β are constants.
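Under this weighted form, the combination step is a one-liner; the weight values 0.6 and 0.4 are hypothetical defaults, since the patent only states that α and β are constants.

```python
def news_similarity(sim_title: float, sim_topic: float,
                    alpha: float = 0.6, beta: float = 0.4) -> float:
    # Sim(A, B) = alpha * Sim_title(A, B) + beta * Sim_topic(A, B)
    return alpha * sim_title + beta * sim_topic
```

With α + β = 1, the result stays on the same [0, 1] scale as its two inputs, which makes the first similarity threshold below directly comparable.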
In another possible embodiment, when the news similarity is greater than a first similarity threshold, the news items are clustered using an improved SinglePass incremental clustering algorithm based on the news time dimension, where the first similarity threshold is determined by the number of multi-language news items.
Specifically, for example, the first similarity threshold is determined by analyzing the number of multi-language news items and constructing an empirical formula.
In another possible embodiment, the specific steps of clustering the news are as follows:
S31, the average of the news items in a cluster (the average of the title vectors and the average of the topic probability distributions) is defined as the cluster center, and the distance between a news item and the cluster is calculated;
S32, when event-level fine-grained news is clustered, a news release time parameter is added, and the modified distance between the news item and the cluster is determined;
S33, a cluster merging threshold is specified, and if the inter-cluster distance is smaller than the cluster merging threshold, the two clusters are merged.
A traditional static clustering algorithm must re-cluster all samples every time a sample is added, at excessive time cost. SinglePass, as an incremental clustering algorithm for streaming data, has greater potential in real-time settings, but for long-text clustering it still suffers from excessive time complexity and low accuracy. The invention improves the SinglePass algorithm: by adding a news release time parameter when clustering event-level fine-grained news, among other measures, it effectively reduces similar clusters, improves clustering accuracy, and lowers the time overhead of clustering long texts in real time.
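Steps S31 and S32 can be sketched as a single time-aware pass; the exponential decay on the publish-time gap and all parameter values are illustrative assumptions standing in for the patent's unspecified modified distance.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def single_pass(items, sim_threshold=0.8, time_scale=3.0):
    """items: (title_vector, publish_time) pairs, processed in one pass."""
    clusters = []  # each cluster: center (mean title vector), times, members
    for vec, t in items:
        best, best_sim = None, -1.0
        for c in clusters:
            # time penalty: far-apart publish dates suppress the similarity
            decay = np.exp(-abs(t - np.mean(c["times"])) / time_scale)
            sim = cosine(vec, c["center"]) * decay
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= sim_threshold:
            best["members"].append(vec)
            best["times"].append(t)
            best["center"] = np.mean(best["members"], axis=0)  # S31: mean center
        else:
            clusters.append({"center": vec, "times": [t], "members": [vec]})
    return clusters
```

Two near-identical headlines published on the same day land in one cluster, while the same headline a month later starts a new event-level cluster, which is the behavior the time parameter is meant to buy.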
In another possible embodiment, the multi-language mapping space model optimizes the search and mapping of the multi-language mapping by using a stochastic-optimization distributed algorithm, flex-SADMM.
The method combines first-order information with variance-reduced approximate second-order information to solve the subproblems of the stochastic alternating direction method of multipliers (ADMM), aiming at stable convergence and improving the efficiency, computability, and accuracy of the search direction. It requires only that each compute node update its corresponding variable at least once every T iterations. To incorporate SVRG, the ADMM process is divided into two stages: the full gradient is computed in the first stage, and the second stage runs for T iterations. In this way, each compute node can update its variables at least once during the second stage.
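The two-stage SVRG pattern described above (a full gradient at a snapshot in stage one, then T variance-reduced stochastic updates in stage two) can be sketched on a toy least-squares objective. This is a generic SVRG illustration, not the flex-SADMM algorithm itself; the step size, epoch count, and problem are all assumptions:

```python
import random

def svrg(grad_i, n, w0, step=0.1, epochs=3, T=20):
    # Stage 1: compute the full gradient at a snapshot of the iterate.
    # Stage 2: run T stochastic updates, each corrected by the snapshot
    # gradient so that the update's variance is reduced.
    w = list(w0)
    dim = len(w)
    for _ in range(epochs):
        snapshot = list(w)
        full = [0.0] * dim
        for i in range(n):
            g = grad_i(i, snapshot)
            full = [f + gi / n for f, gi in zip(full, g)]
        for _ in range(T):
            i = random.randrange(n)
            gw = grad_i(i, w)          # stochastic gradient at current point
            gs = grad_i(i, snapshot)   # same component at the snapshot
            w = [wj - step * (a - b + f)
                 for wj, a, b, f in zip(w, gw, gs, full)]
    return w

# Toy problem: minimize (1/n) * sum_i (w - x_i)^2 over a scalar w.
data = [1.0, 2.0, 3.0, 4.0]

def grad(i, w):
    # gradient of the i-th term (w - x_i)^2
    return [2.0 * (w[0] - data[i])]

w_opt = svrg(grad, len(data), [0.0])  # converges to the data mean, 2.5
```

On this quadratic the per-sample correction cancels exactly, so the iterate contracts deterministically toward the mean; on general objectives the correction only reduces, rather than eliminates, the gradient noise.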
In embodiments of the present invention, the term "plurality" means two or more unless explicitly defined otherwise. The terms "mounted," "connected," "secured," and the like are to be construed broadly, and for example, "connected" may be a fixed connection, a removable connection, or an integral connection. Specific meanings of the above terms in the embodiments of the present invention can be understood by those of ordinary skill in the art according to specific situations.
In the description of the embodiments of the present invention, it should be understood that terms such as "upper" and "lower" indicate orientations or positional relationships based on those shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the referenced devices or units must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the embodiments of the present invention.
In the description herein, the appearances of the phrase "one embodiment," "a preferred embodiment," or the like, are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the present embodiment by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present invention should be included in the protection scope of the embodiments of the present invention.

Claims (10)

1. A multilingual long-text similarity retrieval and classification tool is characterized by specifically comprising:
the system comprises a text acquisition module, a text preprocessing module, a text classification prediction module and a text classification result output module;
the text acquisition module is responsible for acquiring a plurality of long texts in different languages and transmitting the long texts to the text preprocessing module;
the text preprocessing module is responsible for preprocessing the long text to obtain a preprocessed text, obtaining a corpus, embedding the corpus into a vector space, performing semantic coding on sentences as units to form sentence vectors, and transmitting the sentence vectors to the text classification prediction module;
the text classification prediction module predicts a mapped target language vector by adopting a Transformer-based multi-language space mapping model according to the sentence vector, and determines the similarity among the long texts of the different languages according to a joint loss function among the different target language vectors, wherein the joint loss function adopts an infoNCE loss and a mutual information loss;
and the text classification result output module classifies the long texts in different languages according to the similarity between the long texts in the different languages and outputs a classification result.
2. The multilingual long-text-similarity search and classification tool of claim 1, wherein the preprocessing comprises word segmentation, stop-word removal, and stemming to obtain an operable corpus.
3. The multi-lingual long text similarity retrieval and classification tool of claim 1, wherein the Transformer-based multi-lingual space mapping model first encodes the sentence vectors and then feeds the sentence vectors to a decoder to obtain the mapped target language vectors.
4. The multilingual long-text similarity search and classification tool of claim 1, wherein the joint loss function is calculated by the formula:

L_CL = L_NEC + L_I

wherein L_CL is the joint loss function, L_NEC is the infoNCE loss function, and L_I is the mutual information loss function.
5. The multilingual long-text similarity search and classification tool of claim 1, wherein the news similarity is calculated by the steps of:
s21, based on the multilingual long text similarity retrieval and classification tool, converting news titles of the multilingual news into a plurality of mapped target language vectors;
s22, based on the plurality of mapped target language vectors, adopting an LDA topic model to obtain the similarity of the news headlines;
s23, constructing a theme similarity matrix and judging the theme similarity of the multi-language news;
and S24, constructing news similarity based on the similarity of the news headlines and the topic similarity of the news.
6. The multi-language long text similarity retrieval and classification tool of claim 5, wherein the topic similarity of the news is judged if and only if the similarity of the news headlines is greater than a second similarity threshold, wherein the second similarity threshold is determined according to the number of the news in the multiple languages.
7. The multilingual long-text-similarity search and classification tool of claim 5, wherein the news similarity is calculated by the formula:

Sim(A, B) = α · Sim_title(A, B) + β · Sim_topic(A, B)

wherein Sim_title(A, B) is the similarity of the title of news A and the title of news B, Sim_topic(A, B) is the similarity of the topic of news A and the topic of news B, and α and β are constants.
8. The multi-language long text similarity retrieval and classification tool of claim 6, wherein the news is clustered using a modified SinglePass incremental clustering algorithm based on a news time dimension when the news similarity is greater than a first similarity threshold, wherein the first similarity threshold is determined based on the number of news in the multiple languages.
9. The multilingual long-text-similarity search and classification tool of claim 5, wherein clustering the news comprises:
s31, defining the average value of the news in the cluster as a cluster center, and calculating the distance between the news and the cluster;
s32, adding a news release time parameter when event-level fine-grained news is clustered, and determining the distance between the modified news and the cluster;
and S33, designating a cluster merging threshold, and merging the two clusters if the inter-cluster distance is smaller than the cluster merging threshold.
10. The multi-language long text similarity retrieval and classification tool of claim 1, wherein the multi-language mapping space model optimizes the search and mapping of the multi-language mapping using a stochastic optimization distributed algorithm, flex-SADMM.
CN202211568520.3A 2022-12-08 2022-12-08 Multi-language long text similarity retrieval and classification tool Active CN115630142B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211568520.3A CN115630142B (en) 2022-12-08 2022-12-08 Multi-language long text similarity retrieval and classification tool


Publications (2)

Publication Number Publication Date
CN115630142A true CN115630142A (en) 2023-01-20
CN115630142B CN115630142B (en) 2023-03-14

Family

ID=84910843


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930952A (en) * 2020-09-21 2020-11-13 杭州识度科技有限公司 Method, system, equipment and storage medium for long text cascade classification
US20210383064A1 (en) * 2020-06-03 2021-12-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Text recognition method, electronic device, and storage medium
CN114707516A (en) * 2022-03-29 2022-07-05 北京理工大学 Long text semantic similarity calculation method based on contrast learning
CN115115002A (en) * 2022-07-22 2022-09-27 宁波牛信网络科技有限公司 Text similarity calculation model generation method, device, equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant