CN114282511A - Text duplicate removal method and device, electronic equipment and storage medium


Info

Publication number: CN114282511A
Application number: CN202111246050.4A
Authority: CN (China)
Prior art keywords: text, sub, string, deduplicated, duplicated
Other languages: Chinese (zh)
Inventor: 石志林
Applicant and current assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of computers, and in particular to a text deduplication method and device, an electronic device and a storage medium, which are used to improve the accuracy and efficiency of text deduplication. The method comprises the following steps: intercepting sub-text strings from each text to be deduplicated in a text set to obtain a sub-text string set corresponding to each text to be deduplicated; determining the target weight corresponding to each sub-text string based on the obtained feature information of each sub-text string contained in each sub-text string set; screening out, from the sub-text strings, those whose target weight is not lower than a target threshold as target sub-text strings; and deduplicating each text to be deduplicated based on the inclusion relation between each text to be deduplicated and each target sub-text string. Because deduplication is performed through the inclusion relation between the texts to be deduplicated and the target sub-text strings, the accuracy and efficiency of deduplication can be effectively improved.

Description

Text duplicate removal method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text deduplication method and apparatus, an electronic device, and a storage medium.
Background
In the internet era, information grows explosively: the internet is filled with massive amounts of text, much of it repeated. For example, a piece of news may be downloaded, modified and edited by various media outlets, resulting in many similar news articles. Large amounts of repeated text on the internet reduce overall text quality on the one hand and waste large amounts of storage resources on the other. Text therefore needs to be deduplicated.
Text deduplication technology identifies similar, repeated information. Deduplication methods in the related art mainly compare the texts to be deduplicated pairwise based on, for example, the similarity of text feature vectors or the Hamming distance between text word-segmentation results, and deduplicate according to the comparison results.
However, for massive-text deduplication tasks, both the accuracy and the efficiency of these methods are mediocre. How to improve the accuracy and efficiency of text deduplication is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides a text duplicate removal method and device, electronic equipment and a storage medium, and aims to improve accuracy and efficiency of text duplicate removal.
The text deduplication method provided by the embodiment of the application comprises the following steps:
intercepting sub-text strings from each text to be deduplicated in a text set to obtain a sub-text string set corresponding to each text to be deduplicated;
determining the target weight corresponding to each sub-text string based on the obtained feature information of each sub-text string contained in each sub-text string set;
screening out, from the sub-text strings, those whose target weight is not lower than a target threshold as target sub-text strings;
and deduplicating each text to be deduplicated based on the inclusion relation between each text to be deduplicated and each target sub-text string.
The text duplicate removal device provided by the embodiment of the application comprises:
the text interception unit is used for intercepting sub-text strings from each text to be deduplicated in the text set to obtain the sub-text string set corresponding to each text to be deduplicated;
the weight determining unit is used for respectively determining the target weight corresponding to each sub text string based on the obtained characteristic information of each sub text string contained in each sub text string set;
a screening unit, configured to screen out, from the respective sub-text strings, a sub-text string whose target weight is not lower than a target threshold as a target sub-text string;
and the duplication removing unit is used for respectively removing duplication of each text to be duplicated based on the inclusion relationship between each text to be duplicated and each target sub-text string.
Optionally, the deduplication unit is specifically configured to:
dividing the texts to be deduplicated into a plurality of candidate text sets based on the inclusion relation between each text to be deduplicated and each target sub-text string;
performing preliminary deduplication on the texts to be deduplicated in each candidate text set to obtain the remaining texts to be deduplicated;
and performing secondary deduplication on each remaining text to be deduplicated according to its hash value.
Optionally, the deduplication unit is specifically configured to:
for each candidate text set, the following operations are respectively performed:
acquiring text similarity between every two texts to be deduplicated in a candidate text set;
and based on the text similarity of every two texts to be deduplicated, performing deduplication on the texts to be deduplicated in the candidate text set.
Optionally, the text interception unit is specifically configured to:
take the minimum of a preset length reference value and the maximum text length over all texts to be deduplicated as the intercepting length of the sub-text strings;
slidingly intercept a plurality of sub-text strings from each text to be deduplicated according to the intercepting length;
and take the set formed by the sub-text strings intercepted from the same text to be deduplicated as the sub-text string set corresponding to that text.
An electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory stores program code which, when executed by the processor, causes the processor to execute the steps of any one of the text deduplication methods described above.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps of any of the text deduplication methods described above.
An embodiment of the present application provides a computer-readable storage medium including program code which, when run on an electronic device, causes the electronic device to perform the steps of any one of the text deduplication methods described above.
The beneficial effects of this application are as follows:
The embodiments of the present application provide a text deduplication method and device, an electronic device, and a storage medium that intercept sub-text strings from texts while learning a weight for each sub-text string; sub-text strings whose weight is below a specific threshold are directly discarded, and only the higher-weighted ones are kept, reducing the noise introduced by meaningless sub-text strings. In addition, because the target sub-text strings are screened out through interception and weight calculation before deduplication is performed on the basis of the inclusion relation between each text to be deduplicated and the target sub-text strings, deduplication efficiency is effectively improved, a better deduplication effect is obtained for short texts, and the accuracy is higher.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is an alternative schematic diagram of an application scenario in an embodiment of the present application;
fig. 2 is a schematic flowchart of a text deduplication method in an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for intercepting a sub-text string according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a weight calculation method in an embodiment of the present application;
FIG. 5 is a flowchart illustrating another text deduplication method in an embodiment of the present application;
FIG. 6A is a diagram of an encoding matrix according to an embodiment of the present application;
FIG. 6B is a diagram of another encoding matrix in the embodiment of the present application;
FIG. 7 is a diagram illustrating a method for parallel processing of multiple nodes according to an embodiment of the present disclosure;
FIG. 8A is a complete timing diagram of a text deduplication method in an embodiment of the present application;
FIG. 8B is a schematic diagram of a deduplication process corresponding to FIG. 8A in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a text deduplication apparatus in an embodiment of the present application;
fig. 10 is a schematic diagram of a hardware component of an electronic device to which an embodiment of the present application is applied;
fig. 11 is a schematic diagram of a hardware component structure of another electronic device to which the embodiment of the present application is applied.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
Some concepts related to the embodiments of the present application are described below.
Sub-text string and intercepting length: a sub-text string is obtained by intercepting part of a text. In the embodiments of the present application, a text is split into several sub-text strings by a sliding-window method, forming a sub-text string set. The intercepting length is the length of the sub-text strings to be intercepted. In the embodiments of the present application, the length of a sub-text string is the number of elements it contains; an element may be a single character, an English word, a phrase, etc. The description mainly takes single characters as the example.
Feature information of a sub-text string: mainly includes the first embedded feature information obtained by embedding the sub-text string as a whole, the second embedded feature information obtained by embedding each element in the sub-text string, and the offset features representing the position of each element within the sub-text string.
Hamming distance: over a set of valid bit codes, the two bit strings are XORed and the number of 1s in the result is counted. Equivalently, the Hamming distance of two strings of equal length is the number of positions at which their characters differ.
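As a quick illustration of this definition (a minimal sketch; the sample values are arbitrary):

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance of two equal-length bit strings (given as integers):
    XOR them, then count the 1s in the result."""
    return bin(a ^ b).count("1")

# 0b1011 vs 0b1001 differ in exactly one bit position
assert hamming_distance(0b1011, 0b1001) == 1
```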
SimHash algorithm: with a traditional hash algorithm, two original contents that differ by only a few bytes may produce hash values that differ greatly; with the SimHash algorithm, if the original contents differ little, the hash values also differ little.
N-Gram: an algorithm based on statistical language models. Its basic idea is to slide a window of size N over the text content byte by byte, forming a sequence of byte fragments of length N.
The embodiments of the present application relate to artificial intelligence (AI), natural language processing (NLP) and machine learning (ML), and are designed based on natural language processing and machine learning technologies in artificial intelligence.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence.
Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning and decision-making. Artificial intelligence technology mainly comprises directions such as computer vision, natural language processing, machine learning/deep learning, automatic driving and intelligent transportation. With the research and progress of artificial intelligence technology, artificial intelligence is studied and applied in many fields, such as smart homes, smart customer service, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, robots, smart healthcare and the like.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. Natural language processing technology typically includes text processing, semantic understanding, machine translation, robot question answering, knowledge graphs and other techniques. The text deduplication method in this application mainly belongs to text processing.
Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Compared with data mining, which looks for mutual characteristics in big data, machine learning focuses on algorithm design, enabling a computer to learn rules from data automatically and use those rules to predict unknown data.
Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like. The weight prediction model in the embodiment of the application is obtained by training through a machine learning or deep learning technology.
The text deduplication method provided in the embodiments of the present application mainly comprises a model training part, which relates to the field of machine learning: a weight prediction model is trained through machine learning technology. After model training is completed, the target weight corresponding to each sub-text string can be calculated using the model obtained in this way.
The following briefly introduces the design concept of the embodiments of the present application:
In the internet age, information grows explosively: the internet is filled with massive amounts of text containing a large amount of repeated content. Taking recommendation services as an example, repeated articles, advertisements, news and other text data need to be removed in an actual recommendation service; how to deduplicate massive texts quickly and efficiently is therefore an urgent problem to be solved.
In the related art, the SimHash algorithm is usually used to obtain a locality-sensitive hash of a text. Its main idea is dimensionality reduction: a high-dimensional feature vector is converted into a fixed-length hash value, and the similarity of two texts is determined by computing the Hamming distance between the two hash values; the smaller the Hamming distance, the higher the similarity. A Hamming distance less than 3 is commonly taken to indicate that two articles are the same. However, computing text similarity via SimHash-based Hamming distance performs poorly on short texts, and the recall rate is too low: since the text length is usually short, many similar texts do not satisfy the condition that the Hamming distance is less than 3. Moreover, the accuracy and recall of this method are generally only around 80%, i.e., the similarity of 20% of texts may be misjudged.
In view of this, embodiments of the present application provide a text deduplication method and apparatus, an electronic device, and a storage medium. In the embodiments of the present application, sub-text strings are intercepted from the text while the weight of each sub-text string is learned; sub-text strings whose weight is below a specific threshold are directly discarded, and only the higher-weighted sub-text strings are kept, reducing the noise introduced by meaningless sub-text strings. In addition, because the target sub-text strings are screened out through interception and weight calculation before deduplication is performed on the basis of the inclusion relation between each text to be deduplicated and the target sub-text strings, deduplication efficiency is effectively improved, a better deduplication effect is obtained for short texts, and the accuracy is higher.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it should be understood that the preferred embodiments described herein are merely for illustrating and explaining the present application, and are not intended to limit the present application, and that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 1 is a schematic view of an application scenario in the embodiment of the present application. The application scenario diagram includes a terminal device 110 and a server 120. The terminal device 110 in the embodiment of the present application may have a text-related application installed thereon. The application may be social software, such as instant messaging software and short video software, and may also be an applet, a web page, and the like, which are not limited in this respect. The server 120 is a server corresponding to software, a web page, an applet, or the like.
It should be noted that the application in the embodiments of the present application may also be any of various content recommendation applications usable in a vehicle, such as advertising, education, messaging, travel and audiobook applications; correspondingly, the texts to be deduplicated may be text data such as advertisements, information-flow messages, and news, books or guides related to education and travel, which is not specifically limited here.
It should be noted that the text deduplication method in the embodiments of the present application may be executed by the server 120 or the terminal device 110 alone, or by the server 120 and the terminal device 110 together. The description mainly takes the case where the server 120 executes the method alone as an example; specifically, the method may be executed by a single server 120 or by a plurality of servers 120 in parallel, which is not limited here.
In the embodiment of the present application, the weight prediction model for calculating the association weight may be deployed on the terminal device 110 for training, or may be deployed on the server 120 for training. A large number of training samples may be stored in the server 120 for training the model. Optionally, after the model is trained based on the method in the embodiment of the present application, the trained model may be directly deployed on the server 120 or the terminal device 110. The model is typically deployed directly on the server 120.
In an alternative embodiment, terminal device 110 and server 120 may communicate via a communication network.
In an alternative embodiment, the communication network is a wired network or a wireless network.
In the embodiment of the present application, the terminal device 110 is a computer device used by a user, and the computer device includes, but is not limited to, a personal computer, a mobile phone, a tablet computer, a notebook, an electronic book reader, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, and the like. Each terminal device 110 is connected to a server 120 through a wireless network, and the server 120 is a server or a server cluster or a cloud computing center formed by a plurality of servers, or is a virtualization platform.
It should be noted that fig. 1 is only an example, and the number of the terminal devices and the servers is not limited in practice, and is not specifically limited in the embodiment of the present application.
In addition, the text deduplication method provided in the embodiment of the present application may be applied to various application scenarios including a text classification task, a text recommendation task, a text search task, a text deduplication task, and the like, including but not limited to cloud technology, artificial intelligence, smart traffic, driving assistance, and the like, and training samples used in different scenarios are different and are not listed here.
Hereinafter, the text deduplication method provided in the embodiment of the present application is mainly used in the field of advertisement recommendation as an example to remove duplicate data from text data such as massive information and advertisements. In the embodiment of the application, important service scenes such as advertisements and user portraits can be oriented, repeated recommendation of similar contents is avoided by removing duplicate of short texts, so that the user experience of a recommendation system is improved, and the click rate, the conversion rate and the like of the recommended contents are improved.
The text deduplication method provided by the exemplary embodiments of the present application is described below with reference to the accompanying drawings in conjunction with the application scenarios described above. It should be noted that the above application scenarios are shown only for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect.
Referring to fig. 2, a flowchart of an implementation of a text deduplication method according to an embodiment of the present application is shown, where fig. 2 illustrates an implementation of the method by taking an execution subject as a server, and the specific implementation flow of the method is as follows:
S21: the server intercepts sub-text strings from each text to be deduplicated in the text set to obtain the sub-text string set corresponding to each text to be deduplicated;
the text set may include at least two texts to be deduplicated, such as texts to be deduplicated S1, S2, …, SN, where N ≥ 2 and N is a positive integer.
In the embodiments of the present application, a text to be deduplicated is a text that requires text deduplication and may include content such as characters; in addition, it may also include pictures and the like, which is not specifically limited here. Text deduplication means removing similar or identical texts from the text set.
This application builds on the observation that any two texts that can be judged similar must coincide exactly on one or more sub-text strings; therefore, when deduplicating a text set, sub-text strings are first intercepted from the texts to be deduplicated.
Optionally, before the sub-text strings are intercepted, each text to be deduplicated may be preprocessed. For example, text data on the Hadoop Distributed File System (HDFS) is loaded through the compute engine Spark and the text content is parsed through a Map operator; each text to be deduplicated is word-segmented, a stop-word dictionary is loaded, and stop words are removed from the text.
Here, Spark is a fast, general-purpose computing engine designed for big-data processing. HDFS is a highly fault-tolerant, scalable distributed file system and an important component of the Hadoop system. A BitMap file (BitMap) uses a single bit to mark the value corresponding to an element, with the element as the key; because data is stored at the granularity of bits, storage space can be greatly saved.
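A minimal PySpark sketch of this preprocessing step is given below. The HDFS paths, the jieba segmentation library, the stop-word file and the reduction of "parsing" to a strip are illustrative assumptions, not details fixed by the patent:

```python
import jieba  # assumed Chinese word-segmentation library
from pyspark import SparkContext

sc = SparkContext(appName="text-dedup-preprocess")

# Illustrative paths; the patent only says the data lives on HDFS.
stopwords = set(sc.textFile("hdfs:///dict/stopwords.txt").collect())

texts = (
    sc.textFile("hdfs:///data/texts_to_dedup")       # load text data from HDFS
      .map(lambda line: line.strip())                # parse text content via a Map operator
      .map(lambda text: [w for w in jieba.cut(text)  # word segmentation
                         if w not in stopwords])     # remove stop words
)
```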
In the process of intercepting the sub-text strings, an optional implementation is to carry out S21 according to the flowchart shown in fig. 3, a flowchart of the sub-text string interception method in the embodiment of the present application, which comprises the following steps:
S301: the server takes the minimum of the preset length reference value and the maximum text length over all texts to be deduplicated as the intercepting length of the sub-text strings;
and the intercepting length of the sub-text string is the length of the sub-text string needing to be intercepted. In the embodiment of the present application, the length of the sub-text string refers to: the number of elements contained in the sub-text string.
For example, when an element is a single character, the length of a sub-text string is the number of characters it contains; for instance, the sub-text string 'minimum unit' (a four-character Chinese phrase) contains 4 characters, so its length is 4, i.e., the intercepting length is 4.
It should be noted that the elements may also be phrases, English words, etc.; the above are only examples, no specific limitation is imposed here, and the choice can be made flexibly according to the actual situation.
S302: the server respectively intercepts a plurality of sub text strings from each text to be deduplicated in a sliding manner according to the interception length;
s303: and the server takes a set formed by the sub-text strings intercepted on the basis of the same text to be deduplicated as a sub-text string set corresponding to the same text to be deduplicated.
Specifically, for each text to be deduplicated, a corresponding sub-text string set, which may also be referred to as an N-gram set, is selected by a sliding window method in the present application, and the sub-text strings in each N-gram set may also be referred to as N-gram sub-text strings.
When considering the convergence of the set, the length n of the N-gram sliding window must be determined. The smaller n is, the better the text deduplication effect (the deduplication rate rises), but the higher the corresponding time complexity; the larger n is, the worse the deduplication of similar texts (some similar texts are never compared), but the lower the time complexity. That is, if n is chosen too large, n exceeds the length of many short texts, so those short texts are no longer judged to be similar: the deduplication effect drops, and the time complexity also drops because fewer comparisons are made. Conversely, as n decreases, more short texts are assigned to the same index: the deduplication effect improves, and the computation time complexity rises. In practical applications, the intercepting length of the sub-text strings is therefore determined by weighing the deduplication rate against the time complexity. In the embodiments of the present application, the minimum of the preset length reference value and the maximum text length over all texts to be deduplicated is used as the intercepting length.
Suppose the length of the i-th text to be deduplicated is l_i. The maximum text length over all texts to be deduplicated is then

$$l_{\max} = \max_{1 \le i \le N} l_i$$

(where N is the total number of texts to be deduplicated in the text set).

In practical calculation, too small an n increases the computational complexity; generally speaking, taking n no smaller than 4 achieves the deduplication effect while reducing the computation time complexity to an acceptable range. The intercepting length n can therefore be expressed as the minimum of the preset length reference value 4 and the maximum text length l_max:

$$n = \min(4,\ l_{\max})$$
after determining the truncation length n of the sub-text string based on the above manner, when executing step S302, the following two cases may be specifically adopted according to the size relationship between the length of the text to be deduplicated and the truncation length:
the length of the text to be deduplicated is the number of elements contained in the text to be deduplicated, for example, the number of characters or phrases contained in the text to be deduplicated. For example, when the text to be deduplicated contains 80 words, the length of the text to be deduplicated is 80.
Suppose the i-th text to be deduplicated is S_i, with length l_i. Depending on l_i and n, there are the following two cases:
(a) When the intercepting length is less than or equal to the length of the text to be deduplicated, a plurality of sub-text strings are slidingly intercepted from the text according to the intercepting length; the length of each sub-text string equals the intercepting length.
That is, when n ≤ l_i, N-gram sub-text strings can be intercepted according to the size of n to form an N-gram set, whose size is l_i − n + 1.
Defining L_i as the set of intercepted N-gram sub-text strings, L_i = {S_i1, S_i2, S_i3, …, S_ik}, where S_ik denotes a sub-text string intercepted from the text to be deduplicated S_i.
(b) When the intercepting length is greater than the length of the text to be deduplicated, the text to be deduplicated is itself taken as a sub-text string, i.e., the whole text is intercepted as one sub-text string.
That is, when n > l_i, no sub-text string of length n can be intercepted; the entire text is therefore added to the sub-text string set as a whole, i.e., L_i = {S_i}.
In the embodiment of the present application, the implementation manner of sliding and intercepting the sub text string is as follows:
and constructing a sliding window with the length of the intercepting length, and then sliding the sliding window in the text to be deduplicated according to the preset sliding direction and the preset sliding length to intercept a plurality of sub-text strings.
The predetermined sliding direction is a sliding direction of the sliding window in the text, and is set according to actual requirements, for example, the predetermined sliding direction may be a direction from a first element of the text to a last element of the text. The predetermined sliding length is the length or step of each sliding of the sliding window, and the length is the number of elements which the sliding window needs to slide each time. For example, when 1 element needs to be slid each time, the sliding length is 1.
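Under the conventions above (elements are single characters, sliding step 1), steps S301-S303 can be sketched as follows; the function names and sample texts are illustrative assumptions:

```python
def intercepting_length(texts, reference=4):
    """n = min(preset length reference value, maximum text length over all texts)."""
    return min(reference, max(len(t) for t in texts))

def sub_text_strings(text, n):
    """Case (a): slide a window of length n over the text with step 1.
    Case (b): a text shorter than n is kept whole as its own sub-text string."""
    if n > len(text):
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]

texts = ["最小单元测试", "单元"]
n = intercepting_length(texts)                  # n = min(4, 6) = 4
sets = {t: sub_text_strings(t, n) for t in texts}
# "最小单元测试" yields 6 - 4 + 1 = 3 sub-text strings; "单元" is kept whole
```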
S22: the server respectively determines the target weight corresponding to each sub text string based on the obtained characteristic information of each sub text string contained in each sub text string set;
s23: the server screens out the subfile strings with the target weight not lower than the target threshold value from each subfile string as target subfile strings;
s24: and the server performs deduplication on each text to be deduplicated based on the inclusion relationship between each text to be deduplicated and each target sub-text string.
In the above embodiment, by intercepting sub-text strings from the text and learning the weight of each sub-text string, the sub-text strings with weights below a specific threshold are directly discarded and only the higher-weighted sub-text strings are kept, reducing the noise introduced by meaningless sub-text strings. In addition, because the target sub-text strings are screened out through interception and weight calculation before deduplication is performed on the basis of the inclusion relation between each text to be deduplicated and the target sub-text strings, deduplication efficiency is effectively improved, a better deduplication effect is obtained for short texts, and the accuracy is higher.
Steps S22-S24 are described in detail below.
In an alternative implementation manner, step S22 may be implemented according to the flowchart shown in fig. 4, which is a flowchart of a weight calculation method in the embodiment of the present application, and includes the following steps:
for each sub-text string, the following operations are performed:
S401: the server acquires the first embedded feature information of a sub-text string, the second embedded feature information of each element contained in the sub-text string, and the offset feature information corresponding to each element in the sub-text string;
the elements contained in the sub-text string may be a word, an english word, etc.
For example, a subfile string is a 'minimum unit', and the elements included in the subfile string include 'minimum', 'single', 'element'.
In the embodiment of the present application, the offset feature information corresponding to an element in a sub-text string is determined based on the position of the element in the sub-text string, and may also be referred to as a position feature.
Specifically, 'S' denotes the beginning of the element in the sub-text string, M denotes the middle of the element in the sub-text string, and E denotes the end of the element in the sub-text string.
For example, the offset characteristic information corresponding to the 'most', 'little', 'single', 'element' respectively is: 'S', 'M', and 'E'.
S402: the server extracts attention characteristics of each element based on the first embedded characteristic information, the second embedded characteristic information and the offset characteristic information to obtain the association weight among the elements in one subfile string;
s403: the server takes the associated weight as a target weight corresponding to one sub-text string.
Optionally, steps S401 and S402 may be implemented on the basis of the weight prediction model in the embodiments of the present application, whose input may be all the N-gram sub-text strings together with the elements in each N-gram sub-text string, and whose output is the weight of each N-gram sub-text string.
Specifically, the weight prediction model may include a BERT (Bidirectional Encoder Representations from Transformers) network, a normalization (LayerNorm) layer and an Attention layer. In the embodiments of the present application, for the N-gram set L corresponding to each text to be deduplicated, each N-gram sub-text string is input as a whole word into the embedding (Embedding) layer of the BERT network to obtain the corresponding first embedded feature, and each word in each N-gram sub-text string is input into the embedding layer of the model to obtain the corresponding second embedded feature. Meanwhile, the offset features of the words are added to the model; attention features are extracted for each element based on these features, the association weight characterizing the relation between the elements in a sub-text string is obtained from the attention features, and this association weight is then taken as the target weight corresponding to the sub-text string.
In the embodiment of the present application, when calculating the association weight, the method may specifically be implemented as follows:
S4021: the server semantically represents a sub-text string based on the first embedded feature information, the second embedded feature information and the offset feature information to obtain the semantic feature vector of the sub-text string and the semantic feature vector of each element in the sub-text string;
S4022: the server performs the following operations for each element: normalizing based on the element's semantic feature vector and the sub-text string's semantic feature vector to obtain the element's normalized value; and taking the ratio of the exponential of the element's normalized value to the sum of the exponentials of the normalized values of all elements in the sub-text string as the element's attention weight;
S4023: the server performs a weighted summation of each element's normalized value with its attention weight to obtain the association weight.
For example, taking the N-gram sub-text string 'minimum unit', the input features of the model are 'minimum unit', 'most', 'small', 'single', 'element', together with the position features 'S', 'M', 'M', 'E'. The output feature of the last layer of the BERT network for the N-gram sub-text string is O1(minimum unit), i.e., the semantic feature vector of the sub-text string; the output features for the elements are O1(most), O1(small), O1(single), O1(element), i.e., the semantic feature vectors of the elements.
Meanwhile, a LayerNorm layer and an Attention layer are attached after the BERT network: O1(minimum unit), O1(most), O1(small), O1(single), O1(element) are input into the LayerNorm layer, which learns the normalized value of each element; the normalized values are then input into the Attention layer, which learns the association weight of each word with the other words in the same N-gram sub-text string, taken as the target weight of that N-gram sub-text string.
Specifically, the normalization values corresponding to the elements in the sub-text strings are calculated respectively based on the following formulas:
O2(most) = LayerNorm(O1(most) + W · O1(minimum unit));
O2(small) = LayerNorm(O1(small) + W · O1(minimum unit));
O2(single) = LayerNorm(O1(single) + W · O1(minimum unit));
O2(element) = LayerNorm(O1(element) + W · O1(minimum unit)).
In the formulas above, LayerNorm denotes the layer-normalization operator of the model, computed as

$$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma}}$$

where μ denotes the mean of each layer's input values and σ their variance.
Based on the above process, the normalized value of each element is obtained, namely: O2(most), O2(small), O2(single), O2(element) are, in turn, the normalized values of the elements 'most', 'small', 'single', 'element'.
After the normalized value of each element is obtained from the LayerNorm layer, the attention weight of each element can be learned by the Attention layer, and the association weight is obtained by the weighted summation of each element's normalized value with its attention weight. The Attention layer is the module with which the weight prediction model computes the weight; the corresponding formulas are:

W(minimum unit) = Attention(O2(most), O2(small), O2(single), O2(element));

$$\mathrm{Attention} = \sum_i \alpha_i \cdot O_i$$

$$\alpha_i = \frac{\exp(f(O_i))}{\sum_{j \in N} \exp(f(O_j))}$$

Here i indexes the elements of a sub-text string; taking the sub-text string 'minimum unit' as an example, i can take any value from 1 to 4, and for each value of i, α_i denotes the attention weight of the corresponding element and O_i its normalized value, i.e., O2(most), O2(small), O2(single), O2(element) above. The index j, like i, labels the elements of the sub-text string, and j ∈ N means that j likewise ranges from 1 to 4. exp(f(O_i)) is the exponential of the scored normalized value of one element, and the denominator is the sum of those exponentials over all elements in the sub-text string.
Finally, the weight prediction model learns the weight W of each N-gram sub-text string and filters out of the corresponding sub-text string set L those sub-text strings whose target weight is below a specific threshold (i.e., the target threshold), obtaining a new sub-text string set L'. After all sub-text string sets have been filtered in this way, the sub-text strings in the resulting new sets L' are the target sub-text strings.
In the embodiments of the present application, by learning the weight of each sub-text string, the sub-text strings whose weight is below a specific threshold are directly discarded when constructing the index, and only the high-weight sub-text strings are used, which effectively reduces the noise of meaningless sub-text strings.
On the basis of the above, after the intercepted sub-text strings have been screened, text deduplication can be performed based on the finally screened target sub-text strings and, further, on the inclusion relation between the texts to be deduplicated and the target sub-text strings.
In an alternative implementation manner, step S24 may be implemented according to the flowchart shown in fig. 5, which is a flowchart of another text deduplication method in the embodiment of the present application, and includes the following steps:
S501: the server divides the texts to be deduplicated into a plurality of candidate text sets based on the inclusion relation between each text to be deduplicated and each target sub-text string;
optionally, the method may further include the following steps of preliminarily screening approximate texts through an improved Hash algorithm, and dividing each text to be deduplicated into a plurality of candidate text sets:
S5011: the server determines the encoding vector of each text to be deduplicated according to whether it contains each target sub-text string;
S5012: the server obtains the minimum hash vector of each text to be deduplicated based on the encoding matrix formed by the encoding vectors;
in step S501, based on the inclusion relationship between each text to be deduplicated and each target sub-text string, a One-Hot (One-Hot) matrix M, that is, the coding matrix in step S5012, may be generated for all the texts to be deduplicated in the text set, where the coding matrix is composed of the coding vectors corresponding to each text to be deduplicated.
Specifically, the row of the matrix M is an identifier (id) of the text to be deduplicated, the column is an id of the N-gram sub-text string, the matrix element indicates whether the text to be deduplicated of the current row contains the N-gram sub-text string of the current column, if the text to be deduplicated corresponding to row a contains the N-gram sub-text string corresponding to column B, the row and column value is 1, otherwise, the row and column value is 0. Thus, each row in the matrix is the One-Hot vectorized representation of the corresponding text to be deduplicated, i.e., the encoding vector in step S5011.
For example, suppose there are five texts to be deduplicated in the text set, S1, S2, S3, S4 and S5, and the target sub-text strings screened out by the above method are a, b, c, d, e and f, where S1 = {a, d, f}, S2 = {d, e, f}, S3 = {c, d}, S4 = {a, c, f}, and S5 = {a, b, d, e}. Based on the above method, the 0-1 matrix shown in fig. 6A, a schematic diagram of an encoding matrix in the embodiment of the present application, can be constructed.
Through this process, each text can be represented as a One-Hot vector. However, as the text set scales up, these vectors become increasingly high-dimensional and increasingly sparse.
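Reproducing this example, the construction of the encoding matrix of fig. 6A can be sketched as follows (ids and column ordering are illustrative):

```python
texts = {
    "S1": {"a", "d", "f"},
    "S2": {"d", "e", "f"},
    "S3": {"c", "d"},
    "S4": {"a", "c", "f"},
    "S5": {"a", "b", "d", "e"},
}
columns = ["a", "b", "c", "d", "e", "f"]  # ids of the target sub-text strings

# Row = text to be deduplicated, column = target sub-text string;
# the entry is 1 iff the row's text contains the column's sub-text string.
matrix = {tid: [1 if c in grams else 0 for c in columns]
          for tid, grams in texts.items()}
# matrix["S1"] == [1, 0, 0, 1, 0, 1] -- the One-Hot vector of S1
```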
In order to reduce the amount of calculation, in the embodiment of the present application, the vectors are mapped to a low-dimensional space by an improved Hash algorithm, and the original similarity between the text vectors can be maintained, which is specifically implemented as follows:
firstly, the matrix M is rearranged randomly according to columns, that is, the order of rows is randomly scrambled, and for each row (that is, each text to be deduplicated), the result of the minimum hash (minHash) is the number of columns of the first non-zero column counted from top to bottom, that is, the column number where the column with the first value of 1 of the corresponding row is located.
As shown in fig. 6B, which is one of the matrices listed in the embodiments of the present application, another matrix obtained after rearranging the matrix shown in fig. 6A is obtained. Based on the matrix shown in FIG. 6B, the minHash results for S1-S4 are: 1,1,1,3,1.
In the embodiment of the present application, after each text to be deduplicated is subjected to minHash, the following results can be obtained:
P(minHash(text1)=minHash(text2))=Jaccard(text1,text2);
wherein minHash (text1) represents the value of text1 after minHash, and minHash (text2) represents the value of text2 after minHash.
Taking text1 as the text to be deduplicated S1 and text2 as the text to be deduplicated S2 as an example, as shown in fig. 6B, minHash(S1) = 1 and minHash(S2) = 1; that is, P(minHash(S1) = minHash(S2)) = Jaccard(S1, S2): the probability that any single minHash component of two texts to be deduplicated is equal equals the Jaccard similarity of the two texts.
Using m minHash functions, an m-dimensional vector can be obtained:

[minHash_1(text), minHash_2(text), …, minHash_m(text)];
based on the above process, each text vector may be compressed into a minHash vector of m dimensions, i.e., the minimum hash vector in step S5012, and step S5013 may be performed.
S5013: and the server segments the minimum hash vector of each text to be deduplicated, and performs hash bucket division on each text to be deduplicated based on the segmentation result to obtain a plurality of candidate text sets.
Specifically, after the m-dimensional minHash vector of each text to be deduplicated is obtained in the previous step, the minHash vector is divided into b segments, each segment of size r (the number of columns it contains), so that m = b × r. Each segment of each text's minHash vector is then hash-bucketed; texts to be deduplicated that fall into the same bucket on any segment are treated as similar texts and added to the same candidate text set, so that the similar texts of every text can be found by computing similarities only within the candidate text sets.
For example, the minHash vectors of the text to be de-duplicated 1 and the text to be de-duplicated 2 respectively have 10 segments, and if the Hash value of the first segment of the text to be de-duplicated 1 is the same as the Hash value of the first segment of the text to be de-duplicated 2, the text to be de-duplicated 1 and the text to be de-duplicated 2 can be put into the same similar text set. For example, the first segment of the vector of the text to be deduplicated 1 is 0101001, the Hash value is 12, the first segment of the vector of the text to be deduplicated 2 is 1011010, and the Hash value is 12, so that the two texts can be divided into the same candidate text set.
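Continuing the sketch, the band-and-bucket division of S5013 can be outlined as follows (again illustrative; here a segment's raw content serves as its bucket key in place of an explicit hash function):

```python
from collections import defaultdict

def candidate_sets(sigs, b=20, r=6):
    """Split each m = b*r dimensional signature into b segments of size r and
    hash-bucket per segment; texts sharing a bucket on any segment form a
    candidate text set."""
    buckets = defaultdict(set)
    for tid, sig in sigs.items():
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))  # segment id + content
            buckets[key].add(tid)
    return [ids for ids in buckets.values() if len(ids) > 1]
```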
In the above embodiment, the minHash vector is divided into b segments, each segment of size r (the number of columns it contains). Suppose the Jaccard similarity of two text vectors is s. Since, as shown above, the probability that any single minHash component agrees equals the Jaccard similarity s, the probability that the two texts to be deduplicated are assigned to the same candidate set is

$$1 - (1 - s^r)^b$$

Generally, b = 20 and r = 6.
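As a numeric check of this formula with the stated defaults b = 20 and r = 6 (a sketch; the similarity values are arbitrary):

```python
def same_candidate_probability(s, r=6, b=20):
    """Probability that two texts with Jaccard similarity s land in the same
    candidate set: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b

print(round(same_candidate_probability(0.8), 3))  # 0.998: similar texts almost surely collide
print(round(same_candidate_probability(0.3), 3))  # 0.014: dissimilar texts rarely do
```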
In the above embodiment, segmentation is performed on the basis of minHash: the vector of each text to be deduplicated is hash-bucketed per segment, and the texts of the same candidate text set are then aggregated onto the same Reduce operator node, instead of directly computing the pairwise similarity of documents from minHash. In this way, the id of each segment can serve as the key and the texts containing the same segment as the value, combined into key-value pairs; grouping by index key value aggregates all texts under the same index key, which realizes the division into candidate text sets and thereby the deduplication logic, specifically the two deduplication passes of steps S502 and S503.
S502: the server respectively carries out preliminary de-duplication on the text to be de-duplicated in each candidate text set to obtain the residual text to be de-duplicated;
optionally, the preliminary deduplication of each candidate text set may be performed in a manner of parallel processing by multiple nodes.
Fig. 7 is a schematic diagram of a method for parallel processing by multiple nodes in this embodiment. Specifically, the candidate text set is distributed to each node through Map operators, for example, the candidate text set 1 is distributed to Reduce operator node 1, the candidate text set 2 is distributed to Reduce operator node 2, the candidate text set 3 is distributed to Reduce operator nodes 3, …, and the candidate text set m is distributed to Reduce operator node m.
In the embodiment of the application, the original text set is stored on the nodes of the cluster in HDFS form. After each text is read and parsed through a Map operator and assigned to its corresponding indexes according to the method above, the id of each text segment can be used as a key, and all texts to be deduplicated are distributed to the corresponding Reduce operator nodes by key value, so that texts to be deduplicated in the same candidate text set are assigned to the same Reduce operator node and multiple Reduce operator nodes can process in parallel.
In the above embodiment, the whole process amounts to a coarse-grained aggregation of potentially duplicated texts: non-duplicate texts are fully separated, while similar texts are necessarily assigned to the same node, so each Reduce operator node is responsible only for its own deduplication work.
In an optional implementation manner, when performing preliminary deduplication based on Reduce operator nodes, for each candidate text set, the following operations are respectively performed:
S5021: the server obtains the text similarity between every two texts to be deduplicated in a candidate text set;
S5022: the server deduplicates the texts in the candidate text set based on the pairwise text similarities.
The server here may specifically be the Reduce operator node corresponding to the candidate text set. The text similarity between two texts to be deduplicated may be the Jaccard similarity listed above, a cosine similarity, a Hamming distance or the like, which is not specifically limited here; the following mainly takes the Jaccard similarity as an example.
On a Reduce operator node, any two texts in the candidate text set sharing the same segment id can be compared pairwise to achieve text deduplication. For each candidate text set S, let S_1 and S_2 be any two texts in S; the text similarity of S_1 and S_2 is computed as their Jaccard similarity.
The specific method is as follows: a result set R is dynamically maintained. In the initial state, a text Si to be deduplicated is randomly selected from the set S as the seed text and added to R. The remaining texts Sj to be deduplicated in the candidate text set S are then traversed in turn: if the similarity between Sj and some text already in R exceeds a certain threshold, a text similar to Sj already exists, so Sj is discarded; if the similarity between Sj and every text in R is below the threshold, no text similar to Sj exists in R, so Sj is added to the result set, and the traversal of the other texts in the set S continues. Finally, the output of each Reduce operator node is the preliminarily deduplicated texts.
For example, suppose the candidate text set under a certain Reduce operator node is {S1, S2, S3, S4}. S3 can be randomly selected as the seed text and added to an empty deduplication result set, so that the deduplication result set is {S3}; the set {S1, S2, S3, S4} under this Reduce operator node is then traversed. When S1 is traversed, whether S1 is similar to S3 in the deduplication result set is judged; if not, S1 is added to the deduplication result set, which becomes {S1, S3}, and S2 is traversed next; if so, S1 is discarded and S2 is traversed directly.
Assume that S1 and S3 are not similar texts, so the deduplication result set is {S1, S3}. When S2 is traversed, if S2 is similar to either S1 or S3 in the deduplication result set, S2 is discarded and the next text is traversed; if S2 is similar to neither, S2 is added and the deduplication result set becomes {S1, S2, S3}. When S3 itself is traversed, it is obviously identical to the S3 already in the deduplication result set and is skipped. Finally, when S4 is traversed, if S4 is similar to any of S1, S2 and S3 in the deduplication result set, S4 is discarded and the traversal ends; if S4 is similar to none of them, S4 is added, the deduplication result set becomes {S1, S2, S3, S4}, and the traversal ends. After the traversal is finished, the deduplication result set is taken as the deduplicated text set corresponding to this Reduce operator node; for example, {S1, S2, S3, S4} is the deduplicated text set corresponding to the Reduce operator node.
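The traversal just described can be summarized in a short Python sketch; whitespace tokenization and the threshold of 0.8 are assumptions, since the patent leaves both unspecified.

```python
# A minimal sketch of the greedy per-node preliminary deduplication:
# keep a text only if it is not similar to anything already kept.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def greedy_dedup(candidate_set: list[str], threshold: float = 0.8) -> list[str]:
    result: list[str] = []           # the dynamically maintained result set R
    for text in candidate_set:       # the first text kept acts as the seed
        tokens = set(text.split())   # whitespace tokenization is an assumption
        if all(jaccard(tokens, set(kept.split())) < threshold for kept in result):
            result.append(text)      # no similar text in R, so keep this one
    return result
```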
The preceding step deduplicates the texts within each Reduce operator node, but the same text may still be distributed on different Reduce operator nodes, so secondary deduplication needs to be performed.
S503: and the server performs secondary duplication elimination on each residual text to be duplicated according to the hash value of each residual text to be duplicated.
In this step, the texts located on different Reduce operator nodes are hashed again and distributed to Reduce operator nodes according to the hash values, so that identical texts are necessarily distributed to the same Reduce operator node, where the repeated texts are filtered and only one copy of each is retained in the final result. Finally, the text deduplication result on the Reduce operator nodes is output to an HDFS directory for downstream use.
Alternatively, secondary deduplication may be performed directly on the preliminary result of each Reduce operator node. For example, if the deduplicated text set of Reduce operator node 1 is {S1, S2} and the deduplicated text set of Reduce operator node 2 is {S1, S2, S3}, then since S1 and S2 exist in both sets, one copy of S1 and one copy of S2 can be removed, finally yielding the result text set {S1, S2, S3}.
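A minimal sketch of this second round, under the assumption that the hash value of each remaining text is an ordinary content hash (MD5 here): texts with equal digests collapse to a single copy.

```python
# Secondary deduplication: rehash the remaining texts so that identical
# texts share a key, then keep exactly one copy per key.
import hashlib

def secondary_dedup(remaining_texts: list[str]) -> list[str]:
    kept: dict[str, str] = {}
    for text in remaining_texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        kept.setdefault(digest, text)  # the first copy wins; duplicates are dropped
    return list(kept.values())
```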
Referring to fig. 8A, a complete timing chart of a text deduplication method in the embodiment of the present application specifically includes the following steps:
s801: reading and analyzing a plurality of text data stored on the HDFS through Spark to form a text set;
s802: performing word segmentation and removal of stop words on the text to be deduplicated in the acquired text set;
s803: acquiring a N-gram subfile string set corresponding to each text to be deduplicated;
s804: learning the target weight of the N-gram sub-text string of each text to be de-weighted through a weight prediction model, and screening out the target sub-text string based on the target weight;
s805: acquiring minHash vectors corresponding to the texts to be deduplicated based on the inclusion relationship between the texts to be deduplicated and the target sub-text strings;
step S806: constructing minHash segmentation of the text to be de-duplicated and an index of the text to be de-duplicated through a Map operator so as to divide the text to be de-duplicated into a plurality of candidate text sets and further distribute the candidate text sets to Reduce operator nodes;
s807: acquiring the similarity of any two texts to be de-duplicated in a corresponding candidate text set through Reduce operator nodes;
s808: judging whether each sub text string set has repeated text to be deduplicated according to the similarity of the text to be deduplicated, and if so, removing the text to be deduplicated;
s809: and finally performing final aggregation and de-duplication on the texts to be de-duplicated in all Reduce operator nodes, and writing the final result into the HDFS directory.
Fig. 8B is a schematic diagram of a deduplication process corresponding to fig. 8A in the embodiment of the present application.
In step S802, the text set includes N texts to be deduplicated, S1, S2, S3, …, SN; these are the related texts matched according to a user's request when a text recommendation is made to the user. After the sub-text strings of each text to be deduplicated are intercepted in step S803, the target sub-text strings can be screened out in step S804. Further, based on steps S805 and S806, the texts to be deduplicated can be divided into a plurality of candidate text sets; for example, the candidate text sets in fig. 8B are {S1, S2, S3}, {S1, S4, S5}, {S6, S7, S8}, …, {S6, SN}. Since the division of candidate text sets is based on minHash segmentation, the same text to be deduplicated may be distributed into different candidate text sets; for example, {S1, S2, S3} and {S1, S4, S5} both contain S1, and {S6, S7, S8} and {S6, SN} both contain S6. Therefore, after the preliminary deduplication is performed on the Reduce operator nodes, the preliminary result shown in fig. 8B is obtained, and secondary deduplication then yields the final result {S1, S2, S3, S4, S6, S8}. That is, after the N texts in the text set are deduplicated, the remaining texts are S1, S2, S3, S4, S6 and S8, and these remaining texts are recommended to the user.
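Putting steps S801 to S809 together, the following is a condensed, hypothetical PySpark skeleton of the fig. 8A pipeline (Spark and HDFS are named in step S801); band_signature() and greedy_dedup() stand for the banding and per-set deduplication sketched earlier in this section, and all paths and parameters are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext(appName="text-dedup")
texts = sc.textFile("hdfs:///input/texts")        # S801: read the raw texts

# S802-S806: each Map task emits (segment id, text) pairs from the minHash
# bands; band_signature() is the hypothetical per-text banding helper.
pairs = texts.flatMap(lambda t: [(b, t) for b in band_signature(t)])

# S806: grouping by segment id routes each candidate set to one Reduce task.
candidates = pairs.groupByKey()

# S807-S808: greedy similarity-based deduplication inside every candidate set.
local = candidates.flatMap(lambda kv: greedy_dedup(list(kv[1])))

# S809: second pass: identical texts share a key, so one copy each survives.
final = local.map(lambda t: (t, None)).reduceByKey(lambda a, b: a).keys()
final.saveAsTextFile("hdfs:///output/deduped")    # write the result to HDFS
```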
For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.
Having described the text deduplication method according to an exemplary embodiment of the present application, a text deduplication apparatus according to another exemplary embodiment of the present application is described next.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
Based on the same inventive concept, the embodiment of the application also provides a text duplicate removal device. Fig. 9 is a schematic diagram of a text deduplication apparatus 900 in an embodiment of the present application, which includes:
a text intercepting unit 901, configured to perform sub-text string interception on each to-be-deduplicated text in the text set respectively, to obtain a sub-text string set corresponding to each to-be-deduplicated text;
a weight determining unit 902, configured to determine, based on the obtained feature information of each sub-text string included in each sub-text string set, a target weight corresponding to each sub-text string;
a screening unit 903, configured to screen out, from each sub-text string, a sub-text string whose target weight is not lower than a target threshold as a target sub-text string;
and a deduplication unit 904, configured to perform deduplication on each text to be deduplicated based on an inclusion relationship between each text to be deduplicated and each target sub-text string, respectively.
Optionally, the weight determining unit 902 is specifically configured to:
for each sub-text string, the following operations are performed:
acquiring first embedded characteristic information of a sub-text string, second embedded characteristic information of each element contained in the sub-text string and offset characteristic information corresponding to each element in the sub-text string;
based on the first embedded characteristic information, the second embedded characteristic information and the offset characteristic information, performing attention feature extraction on each element to obtain an association weight between elements in the sub-text string;
and taking the associated weight as a target weight corresponding to one sub-text string.
Optionally, the weight determining unit 902 is specifically configured to:
performing semantic representation on a sub-text string based on the first embedded feature information, the second embedded feature information and the offset feature information to obtain a semantic feature vector corresponding to the sub-text string and a semantic feature vector corresponding to each element in the sub-text string;
for each element, the following operations are performed: performing normalization processing based on the semantic feature vector corresponding to the element and the semantic feature vector corresponding to the sub-text string to obtain a normalized value corresponding to the element; taking the ratio of the exponential power corresponding to the normalized value of the element to the sum of the exponential powers corresponding to the normalized values of all elements in the sub-text string as the attention weight corresponding to the element;
and respectively carrying out weighted summation on the respective normalized values of the elements and the corresponding attention weights to obtain the associated weights.
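The computation carried out by the weight determining unit 902 can be made concrete with a small NumPy sketch; the scaled dot product used here as the "normalization processing" is an assumption, since the patent does not pin that operator down.

```python
import numpy as np

def association_weight(elem_vecs: np.ndarray, string_vec: np.ndarray) -> float:
    """elem_vecs: (n, d) semantic feature vectors of the n elements;
    string_vec: (d,) semantic feature vector of the whole sub-text string."""
    # Normalized value per element, scored against the string vector
    # (a scaled dot product is assumed here).
    norm_vals = elem_vecs @ string_vec / np.sqrt(elem_vecs.shape[1])
    # Attention weight: ratio of each value's exponential power to the sum
    # of exponential powers over all elements (a numerically stable softmax).
    exp_vals = np.exp(norm_vals - norm_vals.max())
    attn = exp_vals / exp_vals.sum()
    # Association weight: weighted sum of normalized values and attention weights.
    return float(np.dot(norm_vals, attn))
```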
Optionally, the deduplication unit 904 is specifically configured to:
dividing each text to be de-duplicated to obtain a plurality of candidate text sets respectively based on the inclusion relation between each text to be de-duplicated and each target sub-text string;
respectively carrying out preliminary de-duplication on the text to be de-duplicated in each candidate text set to obtain the residual text to be de-duplicated;
and carrying out secondary duplication removal on each residual text to be duplicated according to the hash value of each residual text to be duplicated.
Optionally, the deduplication unit 904 is specifically configured to:
determining a coding vector corresponding to each text to be deduplicated based on whether each text to be deduplicated contains each target sub-text string;
acquiring a minimum hash vector corresponding to each text to be deduplicated based on a coding matrix formed by each coding vector;
and carrying out segmentation on the minimum hash vector of each text to be deduplicated, and carrying out hash bucket division on each text to be deduplicated on the basis of a segmentation result to obtain a plurality of candidate text sets.
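For illustration, a compact sketch of the minimum hash vector and its segmentation, assuming each text is reduced to the set of target sub-text strings it contains (the 1-entries of its coding vector); the hash count of 64, the band width of 4, and the MD5-seeded hash family are all illustrative choices.

```python
import hashlib

def minhash_signature(contained_strings: set[str], num_hashes: int = 64) -> list[int]:
    # For each seeded hash function, take the minimum hash value over the
    # target sub-text strings the text contains (assumed non-empty).
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in contained_strings)
            for seed in range(num_hashes)]

def band_ids(signature: list[int], band_width: int = 4) -> list[tuple]:
    # Segment the signature; each (position, segment value) pair is a band id,
    # and texts sharing a band id fall into the same hash bucket.
    return [(i, tuple(signature[i:i + band_width]))
            for i in range(0, len(signature), band_width)]
```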
Optionally, the deduplication unit 904 is specifically configured to:
for each candidate text set, the following operations are respectively performed:
acquiring text similarity between every two texts to be deduplicated in a candidate text set;
and based on the text similarity of every two texts to be deduplicated, performing deduplication on the texts to be deduplicated in the candidate text set.
Optionally, the text intercepting unit 901 is specifically configured to:
taking the minimum value of the preset length reference value and the maximum text lengths corresponding to all texts to be deduplicated as the intercepting length of the sub text string;
respectively intercepting a plurality of sub text strings from each text to be de-duplicated in a sliding manner according to the intercepting length;
and taking a set formed by the sub-text strings intercepted on the basis of the same text to be deduplicated as a sub-text string set corresponding to the same text to be deduplicated.
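The interception rule of the text intercepting unit 901 can be sketched as follows; the reference value of 5 and the fallback for texts shorter than the cut length are assumptions made for the example.

```python
def intercept_substrings(texts: list[str], length_ref: int = 5) -> list[set[str]]:
    # Cut length: the minimum of the preset reference value and the longest text.
    cut_len = min(length_ref, max(len(t) for t in texts))
    # Slide a window of cut_len over each text; a text shorter than cut_len
    # contributes itself as its only sub-text string (an assumption).
    return [{t[i:i + cut_len] for i in range(len(t) - cut_len + 1)} or {t}
            for t in texts]
```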
Based on the same inventive concept as the method embodiments, an embodiment of the present application further provides an electronic device. In one embodiment, the electronic device may be a server, such as the server 120 shown in fig. 1. In this embodiment, the structure of the electronic device may be as shown in fig. 10, including a memory 1001, a communication module 1003, and one or more processors 1002.
A memory 1001 for storing computer programs executed by the processor 1002. The memory 1001 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 1001 may be a volatile memory, such as a random-access memory (RAM); the memory 1001 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1001 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1001 may also be a combination of the above memories.
The processor 1002 may include one or more Central Processing Units (CPUs), a digital processing unit, and the like. The processor 1002 is configured to implement the text deduplication method when the computer program stored in the memory 1001 is called.
The communication module 1003 is used for communicating with the terminal device and other servers.
In the embodiment of the present application, the specific connection medium among the memory 1001, the communication module 1003 and the processor 1002 is not limited. In fig. 10, the memory 1001 and the processor 1002 are connected through the bus 1004, which is depicted by a thick line; the connection manner between the other components is merely illustrative and not limiting. The bus 1004 may be divided into an address bus, a data bus, a control bus and the like. For ease of description, only one thick line is depicted in fig. 10, but this does not mean that there is only one bus or only one type of bus.
The memory 1001 stores therein a computer storage medium, and the computer storage medium stores therein computer-executable instructions for implementing the text deduplication method according to the embodiment of the present application. The processor 1002 is configured to execute the text deduplication method described above, as shown in fig. 2.
In another embodiment, the electronic device may also be other electronic devices, such as the terminal device 110 shown in fig. 1. In this embodiment, the structure of the electronic device may be as shown in fig. 11, including: communications component 1110, memory 1120, display unit 1130, camera 1140, sensor 1150, audio circuit 1160, bluetooth module 1170, processor 1180, and the like.
The communication component 1110 is configured to communicate with a server. In some embodiments, a wireless fidelity (WiFi) module may be included; WiFi is a short-range wireless transmission technology, through which the electronic device can assist the user in transmitting and receiving information.
The memory 1120 may be used to store software programs and data. The processor 1180 performs various functions of the terminal device 110 and data processing by executing software programs or data stored in the memory 1120. The memory 1120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. The memory 1120 stores an operating system that enables the terminal device 110 to operate. The memory 1120 may store an operating system and various application programs, and may also store codes for performing the text deduplication method according to the embodiment of the present application.
The display unit 1130 may be used to display information input by the user or information provided to the user and a graphical user interface (GUI) of the various menus of the terminal device 110. Specifically, the display unit 1130 may include a display screen 1132 disposed on the front surface of the terminal device 110. The display screen 1132 may be configured in the form of a liquid crystal display, a light-emitting diode, or the like. The display unit 1130 may be used to display the texts to be deduplicated and the like in the embodiment of the present application.
The display unit 1130 may also be used to receive input numeric or character information and generate signal input related to user settings and function control of the terminal apparatus 110, and specifically, the display unit 1130 may include a touch screen 1131 disposed on the front surface of the terminal apparatus 110 and may collect touch operations of a user thereon or nearby, such as clicking a button, dragging a scroll box, and the like.
The touch screen 1131 may be covered on the display screen 1132, or the touch screen 1131 and the display screen 1132 may be integrated to implement the input and output functions of the terminal device 110, and after the integration, the touch screen may be referred to as a touch display screen for short. The display unit 1130 in the present application may display the application programs and the corresponding operation steps.
Camera 1140 may be used to capture still images and a user may post comments on the images captured by camera 1140 through an application. The number of the cameras 1140 may be one or more. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing elements convert the light signals into electrical signals, which are then passed to the processor 1180 for conversion into digital image signals.
The terminal device may further comprise at least one sensor 1150, such as an acceleration sensor 1151, a distance sensor 1152, a fingerprint sensor 1153, a temperature sensor 1154. The terminal device may also be configured with other sensors such as a gyroscope, barometer, hygrometer, thermometer, infrared sensor, light sensor, motion sensor, and the like.
Audio circuitry 1160, speakers 1161, and microphone 1162 may provide an audio interface between a user and terminal device 110. The audio circuit 1160 may transmit the electrical signal converted from the received audio data to the speaker 1161, and convert the electrical signal into a sound signal for output by the speaker 1161. Terminal device 110 may also be configured with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 1162 converts the collected sound signals into electrical signals, which are received by the audio circuit 1160 and converted into audio data, which is then output to the communication assembly 1110 for transmission to, for example, another terminal device 110, or to the memory 1120 for further processing.
The bluetooth module 1170 is used for performing information interaction with other bluetooth devices having bluetooth modules through a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that is also equipped with a bluetooth module via the bluetooth module 1170, so as to perform data interaction.
The processor 1180 is a control center of the terminal device, connects various parts of the entire terminal device using various interfaces and lines, and performs various functions of the terminal device and processes data by running or executing software programs stored in the memory 1120 and calling data stored in the memory 1120. In some embodiments, processor 1180 may include one or more processing units; the processor 1180 may also integrate an application processor, which primarily handles operating systems, user interfaces, application programs, and the like, and a baseband processor, which primarily handles wireless communications. It will be appreciated that the baseband processor described above may not be integrated into the processor 1180. In the present application, the processor 1180 may run an operating system, an application program, a user interface display, a touch response, and the text deduplication method according to the embodiment of the present application. Additionally, the processor 1180 is coupled to the display unit 1130.
In some possible embodiments, various aspects of the text deduplication method provided in the present application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps in the text deduplication method according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device, for example, the computer device may perform the steps as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (15)

1. A method for text deduplication, the method comprising:
respectively intercepting sub-text strings of each text to be deduplicated in a text set to obtain a sub-text string set corresponding to each text to be deduplicated;
respectively determining target weights corresponding to the sub text strings based on the obtained characteristic information of the sub text strings contained in the sub text string sets;
screening out sub text strings with target weight not lower than a target threshold value from each sub text string as target sub text strings;
and respectively carrying out duplication removal on each text to be duplicated based on the inclusion relationship between each text to be duplicated and each target sub-text string.
2. The method according to claim 1, wherein when determining the respective target weights corresponding to the respective sub-text strings respectively based on the obtained feature information of the respective sub-text strings respectively contained in the respective sub-text string sets, for each sub-text string, the following operations are respectively performed:
acquiring first embedded characteristic information of one sub-text string, second embedded characteristic information of each element contained in the one sub-text string, and offset characteristic information corresponding to each element in the one sub-text string;
performing attention feature extraction on each element based on the first embedded feature information, the second embedded feature information and the offset feature information to obtain an association weight between elements in the one sub-text string;
and taking the associated weight as a target weight corresponding to the sub-text string.
3. The method of claim 2, wherein said performing attention feature extraction on each element based on the first embedded feature information, the second embedded feature information, and the offset feature information to obtain the association weight between elements in the one sub-text string comprises:
performing semantic representation on the sub-text string based on the first embedded feature information, the second embedded feature information and the offset feature information to obtain a semantic feature vector corresponding to the sub-text string and a semantic feature vector corresponding to each element in the sub-text string;
for each element, the following operations are performed: based on the semantic feature vector corresponding to one element and the semantic feature vector corresponding to the sub-text string, carrying out normalization processing to obtain a normalization value corresponding to the element; taking the ratio of the exponential power corresponding to the normalized value of the element to the sum of the exponential powers corresponding to the normalized values of all elements in the one sub-text string as the attention weight corresponding to the element;
and respectively carrying out weighted summation on the respective normalized values of the elements and the corresponding attention weights to obtain the associated weights.
4. The method according to claim 1, wherein said deduplicating the respective texts to be deduplicated based on the inclusion relationship between the respective texts to be deduplicated and the respective target sub-text strings respectively comprises:
dividing each text to be de-duplicated to obtain a plurality of candidate text sets respectively based on the inclusion relation between each text to be de-duplicated and each target sub-text string;
respectively carrying out preliminary de-duplication on the text to be de-duplicated in each candidate text set to obtain the residual text to be de-duplicated;
and carrying out secondary duplication removal on each residual text to be duplicated according to the hash value of each residual text to be duplicated.
5. The method of claim 4, wherein the dividing the text to be de-duplicated into a plurality of candidate text sets based on the inclusion relationship between the text to be de-duplicated and the target sub-text strings respectively comprises:
determining a coding vector corresponding to each text to be deduplicated based on whether each text to be deduplicated contains each target sub-text string;
acquiring a minimum hash vector corresponding to each text to be deduplicated based on a coding matrix formed by the coding vectors;
and carrying out segmentation on the minimum hash vector of each text to be deduplicated, and carrying out hash bucket division on each text to be deduplicated on the basis of a segmentation result to obtain a plurality of candidate text sets.
6. The method according to claim 4, wherein when the text to be de-duplicated in each candidate text set is subjected to preliminary de-duplication, the following operations are respectively performed for each candidate text set:
acquiring text similarity between every two texts to be deduplicated in a candidate text set;
and based on the text similarity of every two texts to be deduplicated, performing deduplication on the texts to be deduplicated in the candidate text set.
7. The method according to any one of claims 1 to 6, wherein the performing sub-text string interception on each text to be deduplicated in a text set respectively to obtain the sub-text string set corresponding to each text to be deduplicated comprises:
taking the minimum value of the preset length reference value and the maximum text lengths corresponding to all texts to be deduplicated as the intercepting length of the sub text string;
respectively intercepting a plurality of sub text strings from each text to be deduplicated in a sliding manner according to the interception length;
and taking a set formed by the sub-text strings intercepted on the basis of the same text to be deduplicated as the set of the sub-text strings corresponding to the same text to be deduplicated.
8. A text deduplication apparatus, comprising:
the text intercepting unit is used for respectively intercepting sub-text strings of each text to be deduplicated in the text set to obtain the sub-text string set corresponding to each text to be deduplicated;
the weight determining unit is used for respectively determining the target weight corresponding to each sub text string based on the obtained characteristic information of each sub text string contained in each sub text string set;
a screening unit, configured to screen out, from the respective sub-text strings, a sub-text string whose target weight is not lower than a target threshold as a target sub-text string;
and the duplication removing unit is used for respectively removing duplication of each text to be duplicated based on the inclusion relationship between each text to be duplicated and each target sub-text string.
9. The apparatus of claim 8, wherein the weight determination unit is specifically configured to:
for each sub-text string, the following operations are performed:
acquiring first embedded characteristic information of one sub-text string, second embedded characteristic information of each element contained in the one sub-text string, and offset characteristic information corresponding to each element in the one sub-text string;
performing attention feature extraction on each element based on the first embedded feature information, the second embedded feature information and the offset feature information to obtain an association weight between elements in the one sub-text string;
and taking the associated weight as a target weight corresponding to the sub-text string.
10. The apparatus of claim 9, wherein the weight determination unit is specifically configured to:
performing semantic representation on the sub-text string based on the first embedded feature information, the second embedded feature information and the offset feature information to obtain a semantic feature vector corresponding to the sub-text string and a semantic feature vector corresponding to each element in the sub-text string;
for each element, the following operations are performed: based on the semantic feature vector corresponding to one element and the semantic feature vector corresponding to the sub-text string, carrying out normalization processing to obtain a normalization value corresponding to the element; taking the ratio of the exponential power corresponding to the normalized value of the element to the sum of the exponential powers corresponding to the normalized values of all elements in the one sub-text string as the attention weight corresponding to the element;
and respectively carrying out weighted summation on the respective normalized values of the elements and the corresponding attention weights to obtain the associated weights.
11. The apparatus of claim 8, wherein the deduplication unit is specifically configured to:
dividing each text to be de-duplicated to obtain a plurality of candidate text sets respectively based on the inclusion relation between each text to be de-duplicated and each target sub-text string;
respectively carrying out preliminary de-duplication on the text to be de-duplicated in each candidate text set to obtain the residual text to be de-duplicated;
and carrying out secondary duplication removal on each residual text to be duplicated according to the hash value of each residual text to be duplicated.
12. The apparatus of claim 11, wherein the deduplication unit is specifically configured to:
determining a coding vector corresponding to each text to be deduplicated based on whether each text to be deduplicated contains each target sub-text string;
acquiring a minimum hash vector corresponding to each text to be deduplicated based on a coding matrix formed by the coding vectors;
and carrying out segmentation on the minimum hash vector of each text to be deduplicated, and carrying out hash bucket division on each text to be deduplicated on the basis of a segmentation result to obtain a plurality of candidate text sets.
13. An electronic device, comprising a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 7.
14. A computer-readable storage medium, characterized in that it comprises program code for causing an electronic device to carry out the steps of the method according to any one of claims 1 to 7, when said storage medium is run on said electronic device.
15. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 7.


