Image-text deduplication method and device (CN114328884B)

Publication number: CN114328884B (grant); other version: CN114328884A (application)
Application number: CN202111466812.1A
Authority: CN (China)
Legal status: Active (granted)
Inventors: 安涵, 陈祥, 唐伟, 黄展鹏, 封盛, 赵博, 林民龙
Assignee (original and current): Tencent Technology (Shenzhen) Co., Ltd.
Other languages: Chinese (zh)
Prior art keywords: text, image, recall, target, repeated
Classification: Information Retrieval, DB Structures and FS Structures Therefor

Abstract

The application relates to the field of computer technology and provides an image-text deduplication method and device that can compute the similarity between two image-text articles efficiently and accurately. Through text, image, and multimodal recall, the similarity of the images and text within an article is measured more comprehensively, balancing recall rate against accuracy. Coarse-filtering the articles in the recall stage reduces the computational load of the calibration stage and improves deduplication efficiency. The coarse-filtered results are then calibrated at two levels: at the similarity level, edit distance calibrates the coarse-filtered articles, which is more interpretable and improves deduplication accuracy; at the template level, the similarity between template articles is corrected using entity nouns, further improving accuracy. The method addresses the low recall, low efficiency, and low accuracy of current schemes.

Description

Image-text deduplication method and device
Technical Field
The application relates to the field of computer technology, and in particular to an image-text deduplication method and device.
Background
In the internet age, information grows explosively. Networks are filled with massive numbers of image-text articles, many of which are duplicates; for example, one article may be reposted, modified, and edited by various media outlets, yielding many similar articles.
Duplicated articles differ in format because of their different editing, so storing them repeatedly occupies a large amount of storage resources and wastes them to some extent; deduplication, i.e., identifying similar duplicate articles, is therefore necessary.
In the related art, image-text deduplication usually recalls candidates through two separate channels, text and image, each comparing on a single-dimension feature. Recall rate and accuracy are then hard to balance: raising the recall rate lowers accuracy, and raising accuracy lowers the recall rate.
Disclosure of Invention
The embodiments of the application provide an image-text deduplication method and device to improve the recall rate, accuracy, and efficiency of image-text deduplication.
In one aspect, the image-text deduplication method provided by an embodiment of the application comprises the following steps:
in response to a deduplication request for a target image-text, extracting an image-text feature set of the target image-text, where the image-text feature set consists of the text features and image features of the target image-text;
based on the image-text feature set, performing multi-stage recall on the target image-text to obtain a recall image-text set for each stage, where the multi-stage recall includes at least a multimodal recall based on the text features and the image features;
based on the keyword sets between each recalled image-text in each recall image-text set and the target image-text, determining the initial duplicate image-text set corresponding to each recall image-text set;
determining a target duplicate image-text set based on the edit distance between each initial duplicate image-text in each initial duplicate image-text set and the target image-text.
In another aspect, the image-text deduplication device provided by an embodiment of the application comprises:
a feature extraction module, configured to extract an image-text feature set of a target image-text in response to a deduplication request for the target image-text, where the image-text feature set consists of the text features and image features of the target image-text;
a multi-stage recall module, configured to perform multi-stage recall on the target image-text based on the image-text feature set to obtain a recall image-text set for each stage, where the multi-stage recall includes at least a multimodal recall based on the text features and the image features;
a coarse deduplication module, configured to determine, based on the keyword sets between each recalled image-text in each recall image-text set and the target image-text, the initial duplicate image-text set corresponding to each recall image-text set;
a fine deduplication module, configured to determine a target duplicate image-text set based on the edit distance between each initial duplicate image-text in each initial duplicate image-text set and the target image-text.
Optionally, the multi-stage recall module is specifically configured to:
fuse the text features and image features in the image-text feature set to obtain multimodal features;
based on the multimodal features, obtain the first similarity to the multimodal features of each comparison image-text in a preset comparison image-text data set;
perform multimodal recall on the target image-text based on each first similarity to obtain the recall image-text set corresponding to the multimodal stage.
Optionally, the multi-stage recall module is specifically configured to:
perform text recall on the target image-text based on the second similarity between the text features in the image-text feature set and the text features of each comparison image-text in the preset comparison image-text data set, obtaining the recall image-text set corresponding to the text stage; and
perform picture recall on the target image-text based on the third similarity between the image features in the image-text feature set and the text features of each comparison image-text in the preset comparison image-text data set, obtaining the recall image-text set corresponding to the picture stage.
Optionally, the coarse deduplication module is specifically configured to perform the following operations for each recall image-text set:
extract the first keyword set of each recalled image-text in the recall image-text set, and extract the second keyword set of the target image-text;
determine the intersection-over-union ratio between each first keyword set and the second keyword set;
sort the recalled image-texts by intersection-over-union ratio;
screen at least one recalled image-text based on the sorting result to obtain the initial duplicate image-text set corresponding to the recall image-text set.
Optionally, the coarse deduplication module is specifically configured to perform the following operations for each first keyword set:
determine the keyword intersection and keyword union of the first keyword set and the second keyword set;
take the ratio of the number of keywords in the keyword intersection to the number of keywords in the keyword union as the intersection-over-union ratio between the two keyword sets.
Optionally, the fine deduplication module is specifically configured to perform the following operations for each initial duplicate image-text:
determine the text similarity between the initial duplicate image-text and the target image-text according to the edit distance between them;
if the text similarity meets the threshold requirement, take the initial duplicate image-text as a target duplicate image-text.
Optionally, the text similarity is negatively correlated with the edit distance and positively correlated with the maximum text length of the initial duplicate image-text and the target image-text.
Optionally, the fine deduplication module is further configured to perform the following operations for each target duplicate image-text in the target duplicate image-text set:
extract a first entity noun set from a preset field of the target duplicate image-text, and extract a second entity noun set from the same preset field of the target image-text;
format the first entity noun set and the second entity noun set according to preset rules;
determine a fourth similarity between the target duplicate image-text and the target image-text based on the formatted first and second entity noun sets;
if the fourth similarity meets a preset condition, retain the target duplicate image-text.
Optionally, the feature extraction module is specifically configured to:
determine the weight of each keyword contained in the target image-text based on a preset keyword lexicon;
weight the word vector of each keyword by the determined weight to obtain the text features of the target image-text.
Optionally, the feature extraction module is specifically configured to:
extract features from each picture contained in the target image-text to obtain an initial feature vector set;
aggregate the initial feature vectors in the initial feature vector set into a feature vector of preset length and reduce its dimensionality to obtain the image features of the target image-text.
In another aspect, an embodiment of the application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the image-text deduplication method when executing the computer program.
In another aspect, an embodiment of the application provides a computer-readable storage medium storing computer instructions that, when executed on a computer, cause the computer to perform the image-text deduplication method.
In another aspect, the application provides a computer program product comprising a computer program that, when executed by a processor, implements the image-text deduplication method.
The embodiments of the application provide an image-text deduplication method and device that perform multi-stage recall on a target image-text using the image-text feature set extracted from it. For the recall image-text set obtained in each recall stage, an initial duplicate image-text set is determined based on the keyword sets of each recalled image-text and the target image-text, filtering out the image-texts with low duplication from the recall set, reducing the computation of subsequent deduplication and improving efficiency. The filtered initial duplicate image-text sets are then further deduplicated by the edit distance between each initial duplicate image-text and the target image-text, achieving a better deduplication effect with higher accuracy.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application.
Fig. 1 is a schematic diagram of an application scenario in an embodiment of the present application;
Fig. 2 is a schematic diagram of an image-text deduplication system according to an embodiment of the application;
Fig. 3 is a schematic diagram of an image-text deduplication process according to an embodiment of the application;
Fig. 4 is a flowchart of an image-text deduplication method according to an embodiment of the application;
Fig. 5 is a schematic diagram of the process by which CLIP extracts image features according to an embodiment of the present application;
Fig. 6 is a flowchart of a multi-stage recall method provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of the recall results at each stage provided by an embodiment of the present application;
Fig. 8 is a flowchart of a coarse deduplication method according to an embodiment of the present application;
Fig. 9 is a flowchart of a fine deduplication method according to an embodiment of the present application;
Fig. 10 is a schematic diagram of an image-text deduplication result provided by an embodiment of the application;
Fig. 11 is a schematic diagram of the composition of an image-text deduplication device in an embodiment of the present application;
Fig. 12 is a schematic diagram of the hardware composition of an electronic device to which an embodiment of the present application is applied;
Fig. 13 is a schematic diagram of the hardware composition of another electronic device to which an embodiment of the present application is applied.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the present application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without creative effort, based on the embodiments described in this document, fall within the protection scope of the technical solutions of the present application.
Some of the concepts involved in the embodiments of the present application are described below.
Image-text: an article containing pictures.
Hamming distance: the number of positions at which two equal-length bit strings (or strings) differ; equivalently, XOR the two bit strings and count the 1s in the result.
Edit distance: a quantification of the difference between two strings; it can be understood as the minimum number of single-character editing operations (insertion, deletion, substitution) required to convert one text into the other.
Word vector (Word2Vec) model: in natural language processing, the finest granularity is the word; words form sentences, and sentences form paragraphs, chapters, and documents. Words are symbolic (e.g., Chinese, English, or Latin characters), and Word2Vec converts them into numerical form, i.e., embeds the words into a mathematical space.
Term Frequency-Inverse Document Frequency (TF-IDF): evaluates the importance of a term to a document in a corpus. The importance is positively correlated with the term's frequency in the document and negatively correlated with its frequency across the corpus.
Contrastive Language-Image Pre-training (CLIP): trained on 400 million image-text pairs collected from the web, using the text as the label of the image.
The image-text deduplication method provided by the embodiments of the application mainly involves Artificial Intelligence (AI), Natural Language Processing (NLP), and Machine Learning (ML), and is designed based on computer vision technology and machine learning within artificial intelligence.
The following briefly describes the design concept of the embodiment of the present application.
At present, image-text deduplication mainly comprises text recall and picture recall. Text recall is mainly performed by the SimHash algorithm, cosine similarity algorithms, and the like; picture recall is mainly performed by the traditional pHash algorithm, the SIFT algorithm, and deep neural network algorithms based on MobileNet, among others.
(I) Text recall
The SimHash algorithm reduces a long text to a document represented by a few keywords. First, each keyword is hashed into a fixed-length (e.g., 64-bit) binary string; then each bit string is weighted by the keyword's weight (w = hash, weight), and the weighted results of all keywords are accumulated position-wise; next, each position with a negative accumulated weight is set to 0 and each position with a positive weight is set to 1, giving the SimHash of the text. The 64-bit binary string is divided equally into 4 blocks; by the pigeonhole principle, if the Hamming distance of two texts is within 3, at least 1 block is completely identical between them, so each of the 4 blocks is used as a 16-bit index key, an inverted index is built, and the Hamming distance between candidate texts is computed. Finally, the similarity of two texts is determined by the Hamming distance: the smaller the Hamming distance, the higher the similarity. This block-index acceleration keeps the computation small and the retrieval fast.
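A minimal sketch of the SimHash construction just described, assuming keyword weights are already computed; Python's built-in hash stands in for a stable 64-bit hash, which a production system would replace:

```python
def simhash(weighted_keywords: dict[str, float], bits: int = 64) -> int:
    """Build a SimHash fingerprint from {keyword: weight}."""
    acc = [0.0] * bits
    for word, weight in weighted_keywords.items():
        h = hash(word) & ((1 << bits) - 1)  # keyword -> 64-bit hash
        for i in range(bits):
            # add the weight where the bit is 1, subtract where it is 0
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, v in enumerate(acc):
        if v > 0:               # positive accumulated weight -> 1, else 0
            fingerprint |= 1 << i
    return fingerprint

def simhash_hamming(a: int, b: int) -> int:
    """Hamming distance between two fingerprints."""
    return bin(a ^ b).count("1")
```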
The cosine similarity algorithm represents a text as a vector and then compares vectors by cosine similarity, recalling texts similar in content to the query text.
(II) Picture recall
Conventional picture recall algorithms (e.g., the pHash algorithm) generally convert a picture to a gray-scale image, apply a discrete cosine transform (DCT) to decompose it into frequency components and keep the low-frequency part, compute the mean of the retained DCT coefficients, record each coefficient larger than the mean as 1 and otherwise as 0, and finally compute the Hamming distance between the fingerprints of two pictures: the smaller the Hamming distance, the more similar the pictures. Such algorithms have strict requirements on how similar pictures must be to match.
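A compact pHash sketch under the usual assumptions (32x32 gray-scale resize, 8x8 low-frequency DCT block; these parameters are conventional, not stated in the patent). Requires numpy, scipy, and Pillow:

```python
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def phash(path: str) -> int:
    """64-bit perceptual hash of an image file."""
    img = np.asarray(Image.open(path).convert("L").resize((32, 32)), dtype=float)
    coeffs = dct(dct(img, axis=0, norm="ortho"), axis=1, norm="ortho")
    low = coeffs[:8, :8]                  # keep the low-frequency block
    bits = (low > low.mean()).flatten()   # threshold each coefficient at the mean
    return int("".join("1" if b else "0" for b in bits), 2)
```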
A deep neural network algorithm trains a supervised picture classification model, uses the last embedding layer of the model as the picture's feature vector, and recalls similar image-texts using that feature vector.
However, in the related art, text recall and picture recall each compare on a single-dimension feature, and the comparison thresholds are hard to set, so the similarities the two models produce are inaccurate and recall precision is poor; moreover, with two separate recall channels for text and picture, recall rate and accuracy are hard to balance.
For example, for text recall, the SimHash algorithm is fast and precise but recalls few candidates, while a cosine similarity algorithm fine-tuned from a BERT pre-trained model recalls many candidates but is computationally heavy, slow for online prediction, and less precise.
Meanwhile, one pain point of current recall schemes on the product side is template articles, such as financial reports and weather reports, which usually contain content the template requires; for example, weather articles typically contain the date, temperature, wind direction, icons, and so on. Their similarity is therefore especially high, yet they are not duplicate articles; they are misidentified as duplicates, and recall precision suffers.
In view of this, the embodiments of the application provide an image-text deduplication method and device that identify duplicate image-texts by performing multi-stage recall over visual and text features and applying coarse and fine deduplication to the recall results. Specifically, in the text recall stage, text features are extracted from the keyword set of the target image-text, and image-texts whose text similarity to the target exceeds a threshold are recalled from the comparison image-text data set; in the picture recall stage, image features are extracted from at least one picture contained in the target image-text, and image-texts whose similarity exceeds a threshold are recalled; in the multimodal recall stage, both the text features and the image features of the target image-text are used to recall image-texts whose similarity exceeds a threshold. Then, for the image-texts recalled at each stage, those duplicating the target are preliminarily screened by the intersection-over-union ratio of keywords between each recalled image-text and the target, and the duplicate similarity of the screened image-texts is computed via their edit distance to the target, giving a target duplicate image-text set of high duplication. The multi-stage recall fully exploits features of multiple dimensions (visual, textual, and multimodal), and the stages complement one another, balancing recall rate and accuracy well; filtering duplicates by keyword intersection-over-union reduces the computational pressure of the subsequent fine deduplication stage and improves efficiency; and calibrating the recalled duplicates with edit distance makes the relation between duplicates more interpretable, improving deduplication accuracy.
Meanwhile, for template image-texts, the embodiments of the application calibrate with extracted entity nouns, improving the accuracy of the similarity between template image-texts and thus of deduplication.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and embodiments of the present application and features of the embodiments may be combined with each other without conflict.
Fig. 1 is a schematic diagram of an application scenario according to an embodiment of the present application. The diagram includes a terminal device 110 and a server 120. The terminal device 110 in the embodiments of the present application may be installed with an image-text-related application, which may be an applet, a web page, etc., without limitation here. The server 120 is the server corresponding to that software, web page, or applet.
It should be noted that the image-text deduplication method in the embodiments of the present application may be executed by the server 120 or the terminal device 110 alone, or jointly by the two. This disclosure mainly takes execution by the server 120 alone as an example; the method may be executed by one server 120 or by multiple servers 120 in parallel, without limitation here.
In the embodiments of the present application, the similarity-based deduplication model may be deployed and trained on the terminal device 110 or on the server 120. The server 120 may store a number of training samples for training the model. Optionally, after the model is trained based on the method in the embodiments of the present application, the trained model may be deployed on the server 120 or the terminal device 110.
In an alternative embodiment, the terminal device 110 and the server 120 may communicate via a communication network.
In an alternative embodiment, the communication network is a wired network or a wireless network.
In an embodiment of the present application, the terminal device 110 includes, but is not limited to, a personal computer, a mobile phone, a tablet computer, a notebook, an electronic book reader, a smart home appliance, a vehicle-mounted device, and the like. Each terminal device 110 is connected to a server 120 through a wireless network, and the server 120 is a server cluster or a cloud computing center formed by one server or a plurality of servers, or is a virtualization platform.
It should be noted that, the number of terminal devices and servers shown in fig. 1 is merely illustrative, and the number of terminal devices and servers is not limited in practice, and is not particularly limited in the embodiment of the present application.
In addition, it should be noted that the image-text deduplication method provided by the embodiments of the application can be applied to a variety of scenarios, including image-text recommendation, image-text search, and image-text deduplication tasks, in fields including but not limited to cloud technology, artificial intelligence, intelligent transportation, and assisted driving; the training samples used differ by scenario and are not listed one by one here.
The image-text deduplication method provided by the exemplary embodiment of the present application will be described below with reference to the accompanying drawings in conjunction with the application scenario described above, and it should be noted that the application scenario is only shown for the convenience of understanding the spirit and principle of the present application, and the embodiment of the present application is not limited in any way in this respect.
Referring to fig. 2, a schematic diagram of an image-text deduplication system according to an embodiment of the present application is shown. As shown in fig. 2, the system mainly comprises a media pool fusion layer, a storage index layer, a feature calculation layer, a feature fusion layer, a recall layer, a coarse-ranking layer, and a fine-ranking layer. The media pool fusion layer mainly crawls the comparison image-text data sets inside or outside the site; the storage index layer preprocesses the crawled comparison image-text data set; the feature calculation layer extracts image features from the picture set contained in an image-text via the CLIP algorithm and text features from its keyword set; the feature fusion layer fuses the image features of the pictures in the picture set, fuses the text features of the keywords in the keyword set, and fuses the image and text features into multimodal features; the recall layer recalls duplicate image-texts based on the extracted text, image, and multimodal features; the coarse-ranking layer preliminarily screens the recalled duplicates, retaining those whose similarity meets the threshold requirement; and the fine-ranking layer further deduplicates the preliminarily screened duplicates.
As shown in fig. 2, the image-text deduplication system provided by the embodiment of the application can serve pipeline deduplication, original-content identification, content curation, and CP (content provider) handling services, and the recall results of these services can be fed back to the fine-ranking and coarse-ranking layers for calibration.
The image-text deduplication method provided by the embodiments of the application focuses on the recall, coarse-ranking, and fine-ranking layers. As shown in fig. 3, the recall layer performs multi-stage recall based on the extracted text, image, and multimodal features to balance recall rate and accuracy; for each recall stage, the coarse-ranking layer preliminarily screens the recall results based on the keywords between the recalled image-texts and the target image-text, improving deduplication efficiency; the fine-ranking layer determines the edit distance between two texts and determines their similarity from the edit distance, calibrating the deduplication result with the entity nouns in the texts.
Referring to fig. 4, a flowchart of an image-text deduplication method according to an embodiment of the present application is shown, taking the server as the executing body as an example. The specific flow of the method is as follows:
S401: the server responds to a deduplication request for the target image-text and extracts the image-text feature set of the target image-text.
When executing S401, after receiving the image-text deduplication request, the server extracts text features from the keywords contained in the target image-text and image features from the pictures contained in it, obtaining an image-text feature set composed of the text features and image features of the target image-text.
In the embodiments of the application, text features are extracted as follows: based on a preset keyword lexicon, the TF-IDF algorithm determines the weight of each keyword contained in the target image-text, and the Word2Vec algorithm converts each keyword into a word vector; each word vector is then weighted by its TF-IDF weight, and the weighted word vectors yield a text feature of preset dimension (e.g., 128 dimensions).
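A minimal sketch of this weighting scheme; the word-vector table and IDF dictionary are hypothetical stand-ins for the preset lexicon and the trained Word2Vec model:

```python
import numpy as np

def text_feature(keywords: list[str],
                 word_vectors: dict[str, np.ndarray],  # keyword -> 128-dim vector
                 idf: dict[str, float]) -> np.ndarray:
    """TF-IDF-weighted combination of keyword word vectors."""
    counts = {w: keywords.count(w) for w in set(keywords)}
    total = len(keywords)
    acc = np.zeros(128)
    for word, count in counts.items():
        if word in word_vectors:
            tf_idf = (count / total) * idf.get(word, 0.0)
            acc += tf_idf * word_vectors[word]
    norm = np.linalg.norm(acc)
    return acc / norm if norm > 0 else acc
```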
In the embodiments of the application, image features are extracted as follows: features are extracted from each picture contained in the target image-text to obtain an initial feature vector set; the initial feature vectors are aggregated into a feature vector of preset length, whose dimensionality is then reduced to give the image features. Because the dimensionality of the image features is reduced, recall speed improves greatly.
For example, assuming the target image-text contains n (n >= 1) pictures and the initial feature vector of each picture has 512 dimensions, the initial feature vector set is n x 512-dimensional; the NextVLAD algorithm aggregates it into a fixed-length feature vector of k x 512 dimensions, which Principal Component Analysis (PCA) then compresses to 128 dimensions (matching the text feature dimension) to obtain the image features.
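A sketch of the aggregate-then-compress step. Mean pooling is substituted here for NextVLAD, whose implementation is out of scope, and scikit-learn's PCA is assumed; the 512 and 128 dimensions follow the example above:

```python
import numpy as np
from sklearn.decomposition import PCA

def image_feature(per_picture_vectors: np.ndarray, pca: PCA) -> np.ndarray:
    """Aggregate an (n, 512) set of picture vectors and reduce to 128 dims.

    Mean pooling stands in for NextVLAD here; `pca` is assumed to have been
    fitted offline on aggregated vectors from the comparison data set.
    """
    pooled = per_picture_vectors.mean(axis=0)        # (512,)
    return pca.transform(pooled.reshape(1, -1))[0]   # (128,)

# Offline: fit the PCA once on a corpus of pooled vectors (illustrative data).
corpus = np.random.randn(1000, 512)
pca = PCA(n_components=128).fit(corpus)
feat = image_feature(np.random.randn(5, 512), pca)   # one article, 5 pictures
```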
Alternatively, the CLIP algorithm may be used to extract the image features of the target image-text. CLIP works as follows: a text encoder and an image encoder are trained by contrastive learning to predict which pictures pair with which texts in the training image-text data set; CLIP is then turned into a zero-shot classifier using natural language as a flexible prediction space, i.e., with texts as the labels of pictures, which yields generalization and transfer.
Fig. 5 is a schematic diagram of the training process of the CLIP model provided by an embodiment of the present application. As shown in fig. 5, it mainly comprises three stages: contrastive pre-training, creating a dataset classifier from label text, and zero-shot prediction using the label text corresponding to an image.
The training image-text data set used by the CLIP model consists of 400 million picture-text pairs covering 500,000 query texts (category labels), each label corresponding to roughly 20,000 pictures, with batch size N = 2^15 = 32768. Each batch contains N² picture-text pairs, of which N are positive sample pairs and the remaining N² - N are negative sample pairs. The similarity (matching degree) of a picture-text pair can be characterized by the distance between embeddings: for positive pairs, the closer the embeddings the better, maximizing the similarity between picture and text; for negative pairs, the farther the embeddings the better, minimizing it. During pre-training, an image encoder (e.g., a ResNet or Vision Transformer network) and a text encoder (e.g., a Transformer network) turn the picture classification task into a picture-text matching task: contrastive learning predicts which pictures match which texts in the training batch (the diagonal pairs of the similarity matrix), a contrastive loss is computed from the matching result, and training stops when the loss meets the threshold requirement, giving a trained CLIP model. The image features of each picture are then extracted by fine-tuning the trained CLIP model.
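A schematic of the symmetric contrastive objective described above, in the spirit of the CLIP paper's pseudocode; the shapes and the temperature value are illustrative assumptions:

```python
import numpy as np

def clip_contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the N x N image-text similarity matrix.

    img_emb, txt_emb: (N, d) L2-normalized embeddings; the N diagonal
    entries are the positive pairs, the off-diagonal N^2 - N are negatives.
    """
    logits = img_emb @ txt_emb.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))              # the diagonal is positive

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```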
S402: the server performs multi-stage recall on the target image-text based on the image-text feature set, obtaining the recall image-text set for each stage.
In the embodiments of the application, the multi-stage recall includes at least multimodal recall, plus text recall and picture recall, so that features of multiple dimensions are fully exploited and recall rate and accuracy are well balanced.
It should be noted that, for each comparison image-text in the preset comparison image-text data set to be recalled against, the text, image, and multimodal features are extracted in the same way as for the target image-text, which is not repeated here.
Fig. 6 is a schematic diagram of the multi-stage recall process according to an embodiment of the present application. The recall of each stage may be performed in parallel, mainly comprising the following steps:
S4021: the server fuses the text features and image features in the image-text feature set into multimodal features; based on these, it obtains the first similarity to the multimodal features of each comparison image-text in the preset comparison image-text data set, and performs multimodal recall on the target image-text based on each first similarity, obtaining the recall image-text set for the multimodal stage.
In the embodiments of the application, the image features of the target image-text are reduced to the same dimension as its text features, so the two can be fused into the multimodal features of the target image-text. Then, for each comparison image-text in the preset comparison image-text data set, the first similarity between the multimodal features of the target image-text and those of the comparison image-text is computed. The first similarity is compared with a preset first threshold; if it exceeds the first threshold, the comparison image-text is similar to the target image-text and is taken as a recalled image-text of the multimodal recall. Traversing the preset comparison image-text data set yields the recall image-text set for the multimodal stage.
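A sketch of this recall step, assuming concatenation as the (unspecified) fusion operation and cosine similarity as the first similarity; both are assumptions made for illustration, and a real system would replace the linear scan with an approximate nearest-neighbor index:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def multimodal_recall(target_text: np.ndarray, target_img: np.ndarray,
                      corpus: dict[str, tuple[np.ndarray, np.ndarray]],
                      threshold: float = 0.9) -> list[str]:
    """Return ids of comparison image-texts whose fused-feature similarity
    to the target exceeds the first threshold."""
    target_mm = np.concatenate([target_text, target_img])   # assumed fusion
    recalled = []
    for doc_id, (text_feat, img_feat) in corpus.items():
        mm = np.concatenate([text_feat, img_feat])
        if cosine(target_mm, mm) > threshold:
            recalled.append(doc_id)
    return recalled
```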
For example, as shown in fig. 7, the comparison image-texts in the preset comparison data set are {S1, S2, S3, …, SM}. The first similarity between the comparison image-texts S1, S4, S5, S12, S34, S35, S67, S72 and the target image-text meets the threshold requirement, so the recall image-text set for the multimodal stage (denoted Q1) is {S1, S4, S5, S12, S34, S35, S67, S72}.
S4022: the server performs text recall on the target image-text based on the second similarity between the text features in the image-text feature set and the text features of each comparison image-text in the preset comparison image-text data set, obtaining the recall image-text set for the text stage.
For each comparison image-text in the preset comparison image-text data set, the second similarity between the text features of the target image-text and those of the comparison image-text is computed. The second similarity is compared with a preset second threshold; if it exceeds the second threshold, the comparison image-text is similar to the target image-text and is taken as a recalled image-text of the text recall. Traversing the preset comparison image-text data set yields the recall image-text set for the text stage.
For example, as shown in fig. 7, the second similarity between the comparison image-texts S2, S4, S10, S12, S26, S45 and the target image-text meets the threshold requirement, so the recall image-text set for the text stage (denoted Q2) is {S2, S4, S10, S12, S26, S45}.
Optionally, the second similarity is a cosine similarity.
S4023: the server performs picture recall on the target image-text based on the third similarity between the image features in the image-text feature set and the text features of each comparison image-text in the preset comparison image-text data set, obtaining the recall image-text set for the picture stage.
For each comparison image-text in the preset comparison image-text data set, the third similarity (matching degree) between the image features of the target image-text and the text features of the comparison image-text is computed. The third similarity is compared with a preset third threshold; if it exceeds the third threshold, the comparison image-text is similar to the target image-text and is taken as a recalled image-text of the picture recall. Traversing the preset comparison image-text data set yields the recall image-text set for the picture stage.
For example, as shown in fig. 7, the third similarity between the comparison image-texts S3, S4, S17, S34, S35, S59 and the target image-text meets the threshold requirement, so the recall image-text set for the picture stage (denoted Q3) is {S3, S4, S17, S34, S35, S59}.
The application performs multi-stage recall over text, image, and multimodal features. Compared with two-channel picture-and-text recall, on the one hand, multimodal recall is a multi-dimensional feature recall that compensates for the dimensions missing from either single channel, greatly improving the model's recall rate; on the other hand, the similarity between multimodal features measures the similarity of two image-texts from the perspective of the whole article, which, relative to single-channel measurement, effectively improves recall precision.
S403: the server determines the initial duplicate image-text set corresponding to each recall image-text set, based on the keyword sets between each recalled image-text in the set and the target image-text.
Taking one recall image-text set as an example, the initial duplicate image-text set is filtered out by coarse deduplication, which, referring to fig. 8, mainly comprises the following steps:
S4031: the server extracts the first keyword set of each recalled image-text in the recall image-text set, and extracts the second keyword set of the target image-text.
The first keyword sets and the second keyword set may contain the same keywords or different keywords.
Taking any recalled image-text as an example, suppose the first keyword set extracted from it is U1 = {C1, C2, C3} and the second keyword set extracted from the target image-text is U2 = {C1, C3, C4}; the two contain the same keywords C1 and C3 and the different keywords C2 and C4.
S4032: the server determines the intersection-over-union ratio between each first keyword set and the second keyword set.
For each first keyword set, the keyword intersection and keyword union of the first keyword set and the second keyword set are determined, and the ratio of the number of keywords in the intersection to the number of keywords in the union is taken as the intersection-over-union ratio between the two keyword sets.
For example, the keyword intersection of the first keyword set U1 and the second keyword set U2 is V1 = {C1, C3} and their keyword union is W1 = {C1, C2, C3, C4}, so the intersection-over-union ratio q1 between U1 and U2 is 2/4 = 50%.
S4033: the recalled image-texts are sorted by intersection-over-union ratio.
Optionally, assuming one recall image-text set contains 100 recalled image-texts whose intersection-over-union ratios to the target are q1, q2, q3, …, q100, the ratios are sorted in descending order.
S4034: the server screens at least one recalled image-text based on the sorting result, obtaining the initial duplicate image-text set corresponding to the recall image-text set.
In S4034, based on the sorted intersection-over-union ratios, the recalled image-texts with the top K (e.g., K = 20) ratios are selected as the initial duplicate image-text set corresponding to the recall image-text set; a sketch of the whole coarse deduplication step follows.
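A minimal sketch of S4031-S4034, assuming the keyword sets have already been extracted (the keyword extractor itself is not specified here):

```python
def keyword_iou(first: set[str], second: set[str]) -> float:
    """Intersection-over-union ratio between two keyword sets."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0

def coarse_dedup(recalled: dict[str, set[str]],
                 target_keywords: set[str], k: int = 20) -> list[str]:
    """Keep the K recalled image-texts whose keyword sets overlap
    the target's the most (S4032-S4034)."""
    ranked = sorted(recalled,
                    key=lambda doc_id: keyword_iou(recalled[doc_id], target_keywords),
                    reverse=True)
    return ranked[:k]

# e.g. U1 vs U2 from the example above: |{C1,C3}| / |{C1,C2,C3,C4}| = 0.5
assert keyword_iou({"C1", "C2", "C3"}, {"C1", "C3", "C4"}) == 0.5
```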
For example, as shown in fig. 10, for the recall image-text set Q1 of the multimodal stage, coarse deduplication by keyword intersection-over-union yields the initial duplicate image-text set Q1' = {S1, S4, S12, S34, S35, S72}; for the recall image-text set Q2 of the text stage, it yields Q2' = {S4, S10, S12, S45}; and for the recall image-text set Q3 of the picture stage, it yields Q3' = {S4, S17, S34, S35, S59}.
In the embodiments of the application, performing coarse deduplication on the recall image-text set of each recall stage avoids the deduplication pressure brought by the large number of image-texts recalled at each stage, accelerating the deduplication of the whole system while the recall rate is improved.
S404: the server determines the target duplicate image-text set based on the edit distance between each initial duplicate image-text in each initial duplicate image-text set and the target image-text.
Taking any initial duplicate image-text as an example, the text similarity between the initial duplicate image-text and the target image-text is determined from their edit distance using formula 1; the text similarity is compared with a preset similarity threshold, and if it exceeds the threshold, the content of the initial duplicate image-text and the target image-text largely coincides and the initial duplicate image-text is taken as a target duplicate image-text. Traversing each initial duplicate image-text set yields the target duplicate image-text set. The similarity is computed as:
sim = 1 - dist / max(len(A), len(B))    (Formula 1)
where sim is the text similarity between the initial duplicate image-text and the target image-text, dist is the edit distance between them, len(A) is the text length of the initial duplicate image-text, and len(B) is the text length of the target image-text.
As formula 1 shows, the text similarity between the initial duplicate image-text and the target image-text is negatively correlated with their edit distance and positively correlated with their maximum text length.
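Formula 1 and the threshold test of S404 in code, reusing edit_distance from the earlier sketch; the 0.8 threshold is an illustrative assumption:

```python
def text_similarity(a: str, b: str) -> float:
    """Formula 1: normalized edit-distance similarity."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest  # edit_distance defined earlier

def fine_dedup(candidates: dict[str, str], target_text: str,
               threshold: float = 0.8) -> list[str]:
    """Keep initial duplicates whose similarity to the target exceeds the threshold."""
    return [doc_id for doc_id, text in candidates.items()
            if text_similarity(text, target_text) > threshold]
```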
For example, as shown in fig. 10, fine deduplication of the coarsely deduplicated initial duplicate image-text sets Q1', Q2', Q3' by inter-text edit distance yields the target duplicate image-text set Q = {S4, S34, S35}.
In the embodiments of the application, edit distance is used to finely deduplicate the recalled initial duplicates; because edit distance makes the relation between duplicates more interpretable, the accuracy of the similarity between each recalled initial duplicate and the target text improves, and deduplication accuracy improves with it.
In some embodiments, many template image-texts appear in the business, such as sports, finance, and weather templates, any two of which contain some content the template requires; for example, weather image-texts generally contain the date, temperature, wind direction, icons, and so on. Although template image-texts therefore have high similarity, they are not duplicate articles and must be exempted to further improve deduplication accuracy.
Therefore, after the target duplicate image-text set is determined, it is updated to remove the template image-texts misidentified as duplicates. For any one target duplicate image-text in the set, the template exemption process is shown in fig. 9 and mainly comprises the following steps:
S4051: the server extracts a first entity noun set from a preset field of the target duplicate image-text, and a second entity noun set from the same preset field of the target image-text.
In an optional embodiment, for template duplicates such as sports, finance, and weather image-texts, the entity nouns (key nouns such as dates, place names, and stock codes) are extracted from the first 500 words to obtain the first entity noun set, and likewise from the first 500 words of the target image-text to obtain the second entity noun set.
Optionally, the embodiments of the application place no limitation on the entity-noun extraction algorithm; for example, a Named Entity Recognition (NER) algorithm may be used.
S4052: the server formats the first and second entity noun sets according to preset rules.
For example, if the first and second entity noun sets contain "date" entity nouns, the formatted form is yyyymmdd, where yyyy is the year, mm the month, and dd the day.
As another example, if the sets contain "stock code" entity nouns, the formatted form is xx plus 6 digits, where xx denotes the stock exchange where the stock is listed.
Optionally, if an entity noun set contains entity nouns of multiple categories, the corresponding text keeps one entity-noun list per category, each list storing the formatted entity nouns of that category.
S4053: the server determines the fourth similarity between the target duplicate image-text and the target image-text based on the formatted first and second entity noun sets.
In the embodiments of the application, after the entity nouns in the first and second sets are formatted, those in the first set are compared one by one with those in the second set, and the fourth similarity between the target duplicate image-text and the target image-text is determined from the comparison result.
Optionally, the more entity nouns differ between the first and second sets, the smaller the fourth similarity between the target duplicate image-text and the target image-text.
S4054: the server judges whether the fourth similarity satisfies the preset condition; if so, S4055 is executed, otherwise S4056 is executed.
In the embodiments of the application, if the fourth similarity is greater than a preset threshold, the target duplicate image-text and the target image-text are genuine duplicates and S4055 is executed; otherwise they are not duplicates and S4056 is executed.
S4055: the server retains the target duplicate image-text in the target duplicate image-text set.
S4056: the server deletes the target duplicate image-text from the target duplicate image-text set. A sketch of the exemption flow follows.
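A sketch of the exemption flow S4051-S4056 under stated assumptions: NER is abstracted behind the inputs, only the date formatting rule is shown, and the fourth similarity is computed as intersection-over-union over formatted entity nouns, which is one reasonable reading of the one-by-one comparison rule:

```python
import re

def format_entity(category: str, value: str) -> str:
    """Normalize an entity noun per the preset rules (only the date rule shown)."""
    if category == "date":
        nums = re.findall(r"\d+", value)     # assumes value holds year, month, day
        y, m, d = (int(n) for n in nums[:3])
        return f"{y:04d}{m:02d}{d:02d}"      # e.g. "2021年6月14日" -> "20210614"
    return value.strip().lower()

def entity_similarity(first: dict[str, set[str]],
                      second: dict[str, set[str]]) -> float:
    """Fourth similarity: IoU over formatted (category, value) pairs."""
    a = {(c, format_entity(c, v)) for c, vs in first.items() for v in vs}
    b = {(c, format_entity(c, v)) for c, vs in second.items() for v in vs}
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def exempt_templates(duplicates: dict[str, dict[str, set[str]]],
                     target_entities: dict[str, set[str]],
                     threshold: float = 0.5) -> list[str]:
    """Keep only the duplicates whose entity similarity exceeds the threshold."""
    return [doc_id for doc_id, ents in duplicates.items()
            if entity_similarity(ents, target_entities) > threshold]
```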
For example, as shown in fig. 10, the target duplicate image-text set determined by edit distance is Q = {S4, S34, S35}. Assuming the fourth similarity between the target duplicate image-text S34 and the target image-text is below the preset threshold, S34 and the target image-text are not duplicates, and after fine template deduplication by entity nouns the updated target duplicate image-text set is Q' = {S4, S35}.
It should be noted that the embodiments provided by the application may be used alone or in combination. Table 1 compares the recall rate and accuracy of combinations of the different embodiments against the current online baseline indicator.
Table 1. Comparison of the effects of different methods

Method                                      Accuracy    Recall
Baseline                                    78%         57%
Baseline + multi-way recall                 78%         80%
Baseline + multi-way recall + calibration   98%         75%
As Table 1 shows, after the three-way text, image, and multimodal recall is adopted, the recall rate rises to 80% with accuracy unchanged; adding the calibration layer on top of multi-way recall raises the overall accuracy to 98%.
The embodiments of the application compute the similarity between two image-texts efficiently and accurately, effectively addressing the low recall, low efficiency, and low accuracy of prior image-text deduplication. Overall, the image-text deduplication method of the embodiments of the application has three stages. The first is the recall stage, comprising text recall, picture recall, and multimodal recall; multi-way recall measures the similarity of the images and text within an article more comprehensively, balancing recall rate against accuracy. The second is the coarse deduplication stage, which coarse-filters the image-texts from the recall stage to reduce the computational pressure of the calibration stage and improve deduplication efficiency. The third is the calibration stage: for similarity calibration, edit distance calibrates the coarse-filtered image-texts, which is more interpretable and improves deduplication accuracy; for template calibration, entity nouns such as times, places, and stock codes refine the similarity between template image-texts, further improving accuracy. The algorithm thus better addresses the low recall, low efficiency, and low accuracy of current schemes.
The image-text duplication elimination method provided by the embodiment of the application can be applied to various services.
For example, the image-text duplication eliminating method provided by the embodiment of the application is applied to auditing business, and the trigger flow of first-sending and later-auditing is realized by repeatedly judging the content of the new image-text by using a machine. On one hand, the manpower for auditing the system is reduced by 40%, and the labor cost is saved; on the other hand, the average auditing consumption is reduced by about 90 percent, and the whole auditing process is quickened.
The processing pipeline of the image-text deduplication service provided by the embodiment of the application carries basic information of the published image-text, such as its title and publication number. The pipeline also carries the platform number (Cmsid) of each recalled repeated image-text with high similarity to the target image-text, for example 20210614A025J00.
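For illustration, one plausible shape for a single pipeline record is sketched below; the field names are assumptions, and only the title, publication number and recalled Cmsid are given by the text above.

```python
# A hypothetical shape for one record carried by the deduplication
# pipeline. All field names are assumptions for illustration; only the
# title, publication number and recalled Cmsid come from the text above.
record = {
    "title": "...",                    # title of the published image-text
    "publish_id": "...",               # publication number
    "duplicates": [
        {"cmsid": "20210614A025J00"},  # platform number of a recalled duplicate
    ],
}
```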
In other downstream business scenarios, the similarity that the image-text deduplication of the embodiment of the application relies on can be applied to various links, including but not limited to original-content identification at the content dimension, rights identification, and content-transfer identification at the account dimension.
For convenience of description, the above parts are divided into modules (or units) by function and described separately. Of course, when implementing the present application, the functions of each module (or unit) may be implemented in one or more pieces of software or hardware.
Having described the image-text deduplication method according to an exemplary embodiment of the application, an image-text deduplication apparatus according to another exemplary embodiment of the application is described next.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit", a "module", or a "system".
Based on the same inventive concept, the embodiment of the application also provides an image-text deduplication device. Fig. 11 is a schematic diagram of an image-text deduplication device according to an embodiment of the application, where the device includes:
The feature extraction module 1301 is configured to extract an image-text feature set of a target image-text in response to a deduplication request for the target image-text; wherein the image-text feature set comprises: text features and image features of the target image-text;
The multi-stage recall module 1302 is configured to perform multi-stage recall on the target image-text based on the image-text feature set, to obtain a recall image-text set corresponding to each stage; wherein the multi-stage recall comprises at least: multi-modal recall based on the text features and the image features;
The primary deduplication module 1303 is configured to determine, based on the keyword sets between each recall image-text in each recall image-text set and the target image-text, an initial repeated image-text set corresponding to each recall image-text set respectively;
The deduplication module 1304 is configured to determine a target repeated image-text set based on the edit distance between each initial repeated image-text in each initial repeated image-text set and the target image-text.
Optionally, the multi-stage recall module 1302 is specifically configured to:
Fusing text features and image features in the image-text feature set to obtain multi-modal features;
Based on the multi-modal features, respectively obtaining a first similarity between the multi-modal features and the multi-modal features of each contrast image-text in a preset contrast image-text data set;
And performing multi-modal recall on the target image-text based on each first similarity, to obtain a recall image-text set corresponding to the multi-modal stage.
Optionally, the multi-stage recall module 1302 is specifically configured to:
Performing text recall on the target image-text based on second similarities between the text features in the image-text feature set and the text features of each contrast image-text in the preset contrast image-text data set, to obtain a recall image-text set corresponding to the text stage; and
Performing image recall on the target image-text based on third similarities between the image features in the image-text feature set and the image features of each contrast image-text in the preset contrast image-text data set, to obtain a recall image-text set corresponding to the image stage.
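A minimal sketch of the recall step is given below. It assumes each of the first, second and third similarities is a cosine similarity over the corresponding feature vectors, and that multi-modal fusion is normalize-then-concatenate; the patent fixes neither choice, so both are illustrative assumptions.

```python
import numpy as np

def top_k_by_cosine(query_vec, corpus_vecs, k=50):
    """Return indices of the k corpus vectors most similar to the query.

    A generic recall step, reusable for the text, image and multi-modal
    paths (yielding second, third and first similarities respectively).
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q                       # one similarity score per contrast image-text
    return np.argsort(-sims)[:k]

def multimodal_feature(text_vec, image_vec):
    # One plausible fusion: L2-normalize each modality, then concatenate.
    # The text only says the features are "fused"; concatenation is an
    # assumption for illustration.
    t = text_vec / np.linalg.norm(text_vec)
    v = image_vec / np.linalg.norm(image_vec)
    return np.concatenate([t, v])
```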
Optionally, the primary deduplication module 1303 is specifically configured to:
For each recall image-text set, the following operations are respectively executed:
extracting a first keyword set corresponding to each recall image-text in one recall image-text set, and extracting a second keyword set corresponding to the target image-text;
determining the intersection-over-union ratio between each first keyword set and the second keyword set respectively;
sorting the recall image-texts according to the intersection-over-union ratios;
and screening at least one recall image-text from the recall image-texts based on the sorting result, to obtain an initial repeated image-text set corresponding to the one recall image-text set.
Optionally, the primary deduplication module 1303 is specifically configured to:
For each first keyword set, the following operations are respectively executed:
determining a keyword intersection and a keyword union of one first keyword set and the second keyword set;
and taking the ratio of the number of keywords contained in the keyword intersection to the number of keywords contained in the keyword union as the intersection-over-union ratio between the one first keyword set and the second keyword set.
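The following sketch implements the coarse deduplication described above: the intersection-over-union ratio of two keyword sets, followed by a top-n screening rule. The top-n cutoff is an assumption; the text only requires that at least one recall image-text be kept.

```python
def keyword_iou(first_set, second_set):
    # Intersection-over-union of two keyword sets, as described above:
    # |intersection| / |union|.
    inter = first_set & second_set
    union = first_set | second_set
    return len(inter) / len(union) if union else 0.0

def coarse_dedup(target_keywords, recalled, top_n=20):
    """Rank recalled image-texts by keyword IoU with the target and keep
    the best top_n as the initial repeated image-text set.

    `recalled` maps an image-text id to its keyword set; `top_n` is an
    assumed screening rule -- the text only says "at least one" is kept.
    """
    scored = sorted(recalled.items(),
                    key=lambda kv: keyword_iou(kv[1], target_keywords),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_n]]
```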
Optionally, the deduplication module 1304 is specifically configured to:
For each initial repeated image-text, the following operations are respectively executed:
Determining the text similarity between one initial repeated image-text and the target image-text according to the edit distance between the one initial repeated image-text and the target image-text;
and if the text similarity meets the threshold requirement, taking the one initial repeated image-text as a target repeated image-text.
Optionally, the text similarity is inversely related to the edit distance and positively related to the maximum text length of the one initial repeated image-text and the target image-text.
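A sketch of the similarity calibration is given below, using the standard Levenshtein edit distance and the formula 1 - distance / max_length, which satisfies the stated relations (inverse in the edit distance, positive in the maximum text length); the exact formula is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def text_similarity(a: str, b: str) -> float:
    # One formula consistent with the stated relations (an assumption):
    # inversely related to the edit distance and positively related to
    # the maximum text length of the two texts.
    m = max(len(a), len(b))
    if m == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / m
```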
Optionally, the deduplication module 1304 is further configured to:
for each target repeated image-text in the target repeated image-text set, the following operations are respectively executed:
extracting a first entity noun set from a preset field of one target repeated image-text, and extracting a second entity noun set from the preset field of the target image-text;
formatting the first entity noun set and the second entity noun set according to a preset rule;
Determining a fourth similarity between the one target repeated image-text and the target image-text based on the formatted first entity noun set and the formatted second entity noun set;
If the fourth similarity meets the preset condition, retaining the one target repeated image-text.
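The template calibration can be sketched as follows. The formatting rule (lowercase, strip punctuation) and the use of set overlap as the fourth similarity are assumptions; the text specifies only that the entity noun sets are formatted by a preset rule and then compared.

```python
import re

def format_entity_nouns(nouns):
    # Normalize per a preset rule -- here (an assumption) lowercase and
    # strip punctuation/whitespace, so "JD.com" and "jd com" compare equal.
    return {re.sub(r"[\W_]+", "", n.lower()) for n in nouns}

def entity_noun_similarity(repeat_nouns, target_nouns):
    # Fourth similarity between the formatted entity noun sets; Jaccard
    # overlap is an assumption, the text does not fix the measure.
    a = format_entity_nouns(repeat_nouns)
    b = format_entity_nouns(target_nouns)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def refine_templates(target_nouns, repeat_map, threshold=0.8):
    """Keep only the target repeated image-texts whose fourth similarity
    meets the (assumed) threshold; the rest are removed, as in the
    Q -> Q' example above. `repeat_map` maps an id to its noun set."""
    return {doc_id for doc_id, nouns in repeat_map.items()
            if entity_noun_similarity(nouns, target_nouns) >= threshold}
```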
Optionally, the feature extraction module 1301 is specifically configured to:
Determining the weight of each keyword contained in the target image-text based on a preset keyword lexicon;
And weighting the word vectors of the keywords according to the determined weights, to obtain the text features corresponding to the target image-text.
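A minimal sketch of the text feature extraction is given below; treating the weighting as a weighted average of keyword word vectors is an assumption, since the text only says the word vectors are weighted by the looked-up weights.

```python
import numpy as np

def text_feature(keywords, word_vectors, keyword_weights):
    """Weighted average of keyword word vectors (averaging is an assumption).

    `word_vectors` maps a keyword to its embedding; `keyword_weights` maps
    a keyword to the weight looked up in the preset keyword lexicon.
    """
    vecs, weights = [], []
    for kw in keywords:
        if kw in word_vectors:
            vecs.append(word_vectors[kw])
            weights.append(keyword_weights.get(kw, 1.0))
    if not vecs:
        return None
    w = np.asarray(weights)[:, None]
    return (np.asarray(vecs) * w).sum(axis=0) / w.sum()
```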
Optionally, the feature extraction module 1301 is specifically configured to:
extracting features of each picture contained in the target image-text to obtain an initial feature vector set;
And aggregating all the initial feature vectors in the initial feature vector set into a feature vector with a preset length, and performing dimension reduction to obtain the image features corresponding to the target image-text.
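The image feature extraction can be sketched as follows, assuming each picture's initial vector already has the preset length, mean-pooling as the aggregation, and PCA as the dimension reduction; the text names none of these concrete choices.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_reducer(reference_vectors, out_dim=128):
    # Fit a PCA reducer on aggregated vectors of a reference image-text
    # corpus; PCA is an illustrative choice for the dimension reduction.
    return PCA(n_components=out_dim).fit(np.asarray(reference_vectors))

def image_feature(picture_vectors, reducer):
    """Aggregate the per-picture vectors of one image-text into a single
    fixed-length vector, then reduce its dimension.

    Mean-pooling is an assumption; each picture vector is assumed to
    already have the preset length, so the mean over a variable number
    of pictures is itself fixed-length.
    """
    agg = np.mean(np.asarray(picture_vectors, dtype=np.float32), axis=0)
    return reducer.transform(agg[None, :])[0]
```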
Based on the same conception as the method embodiments, an embodiment of the application also provides an electronic device. In one embodiment, the electronic device may be a server, such as the server 120 shown in fig. 1. In this embodiment, the electronic device may be configured as shown in fig. 12, comprising a memory 1401, a communication module 1403, and one or more processors 1402.
A memory 1401 for storing a computer program executed by the processor 1402. The memory 1401 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant communication function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 1401 may be a volatile memory, such as a random-access memory (RAM); the memory 1401 may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1401 may be any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1401 may also be a combination of the above memories.
The processor 1402 may include one or more central processing units (CPU), digital processing units, or the like. The processor 1402 is configured to implement the above image-text deduplication method when invoking a computer program stored in the memory 1401.
The communication module 1403 is used for communicating with the terminal device and other servers.
The specific connection medium between the memory 1401, the communication module 1403 and the processor 1402 is not limited in the embodiments of the present application. In fig. 12, the memory 1401 and the processor 1402 are connected via a bus 1404, which is depicted by a bold line; the connection manner of the other components is merely illustrative and not limiting. The bus 1404 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one bold line is depicted in fig. 12, but this does not mean that there is only one bus or only one type of bus.
The memory 1401 stores a computer storage medium in which computer-executable instructions are stored for implementing the image-text deduplication method according to the embodiment of the present application. The processor 1402 is configured to perform the above image-text deduplication method, as shown in fig. 4.
In another embodiment, the electronic device may also be another electronic device, such as the terminal device 110 shown in fig. 1. In this embodiment, the structure of the electronic device may include, as shown in fig. 13: a communication component 1510, a memory 1520, a display unit 1530, a camera 1540, a sensor 1550, an audio circuit 1560, a bluetooth module 1570, a processor 1580, and the like.
The communication component 1510 is used for communicating with a server. In some embodiments, it may include a wireless fidelity (WiFi) module; the WiFi module belongs to a short-range wireless transmission technology, and the electronic device may help the user send and receive information through the WiFi module.
The memory 1520 may be used to store software programs and data. The processor 1580 performs various functions and data processing of the terminal device 110 by running software programs or data stored in the memory 1520. The memory 1520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The memory 1520 stores an operating system that enables the terminal device 110 to operate. In the present application, the memory 1520 may store the operating system and various applications, as well as code for performing the image-text deduplication method of the embodiments of the present application.
The display unit 1530 may be used to display information input by a user or information provided to the user, as well as the graphical user interface (GUI) of various menus of the terminal device 110. Specifically, the display unit 1530 may include a display screen 1532 disposed on the front side of the terminal device 110. The display screen 1532 may be configured in the form of a liquid crystal display, light-emitting diodes, or the like. In the embodiment of the present application, the display unit 1530 may be used to display the image-texts to be deduplicated, and the like.
The display unit 1530 may also be used to receive input numeric or character information and generate signal inputs related to user settings and function control of the terminal device 110. Specifically, the display unit 1530 may include a touch screen 1531 disposed on the front of the terminal device 110, which may collect touch operations by the user on or near it, such as clicking a button or dragging a scroll box.
The touch screen 1531 may cover the display screen 1532, or the touch screen 1531 may be integrated with the display screen 1532 to implement input and output functions of the terminal device 110, and the integrated touch screen may be simply referred to as a touch screen. The display unit 1530 may display an application program and a corresponding operation procedure in the present application.
The camera 1540 may be used to capture still images, and a user may post comments on the images captured by the camera 1540 through the application. The camera 1540 may be one camera or a plurality of cameras. The object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the processor 1580 to be converted into a digital image signal.
The terminal device may also include at least one sensor 1550, such as an acceleration sensor 1551, a distance sensor 1552, a fingerprint sensor 1553, a temperature sensor 1554. The terminal device may also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors, and the like.
The audio circuit 1560, speaker 1561 and microphone 1562 may provide an audio interface between the user and the terminal device 110. The audio circuit 1560 may transmit the electrical signal converted from received audio data to the speaker 1561, which converts it into a sound signal for output. The terminal device 110 may also be configured with a volume button for adjusting the volume of the sound signal. Conversely, the microphone 1562 converts collected sound signals into electrical signals, which are received by the audio circuit 1560 and converted into audio data; the audio data are then output to the communication component 1510 to be sent, for example, to another terminal device 110, or output to the memory 1520 for further processing.
The bluetooth module 1570 is used for exchanging information with other bluetooth devices having a bluetooth module through a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that also has a bluetooth module through bluetooth module 1570, thereby performing data interaction.
The processor 1580 is the control center of the terminal device; it connects the various parts of the entire terminal using various interfaces and lines, and performs the various functions of the terminal device and processes data by running or executing software programs stored in the memory 1520 and calling data stored in the memory 1520. In some embodiments, the processor 1580 may include one or more processing units; the processor 1580 may also integrate an application processor, which mainly handles the operating system, user interface, applications, etc., and a baseband processor, which mainly handles wireless communication. It is to be appreciated that the baseband processor may also not be integrated into the processor 1580. In the present application, the processor 1580 may run the operating system and applications, display the user interface, respond to touches, and perform the image-text deduplication method according to the embodiments of the present application. In addition, the processor 1580 is coupled to the display unit 1530.
In some possible embodiments, aspects of the image-text deduplication method provided by the application may also be implemented in the form of a program product comprising a computer program; when the program product is run on an electronic device, the computer program causes the electronic device to perform the steps of the image-text deduplication method according to the various exemplary embodiments of the application described in this specification; for example, the electronic device may perform the steps shown in fig. 4.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM) containing a computer program, and may run on a computing device. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which a computer-readable program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer program may execute entirely on the user's computing device, partly on the user's computing device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. An image-text deduplication method, characterized by comprising the following steps:
Responding to a deduplication request for a target image-text, extracting an image-text feature set of the target image-text; wherein the image-text feature set comprises: text features and image features of the target image-text;
Based on the image-text feature set, performing multi-stage recall on the target image-text to obtain a recall image-text set corresponding to each stage; wherein the multi-stage recall comprises at least: multi-modal recall based on the text features and the image features;
Determining, based on the keyword sets between each recall image-text in each recall image-text set and the target image-text, an initial repeated image-text set corresponding to each recall image-text set respectively;
Determining a target repeated image-text set based on the edit distance between each initial repeated image-text in each initial repeated image-text set and the target image-text;
For each target repeated image-text in the target repeated image-text set, respectively executing the following operations:
extracting a first entity noun set from a preset field of one target repeated image-text, and extracting a second entity noun set from the preset field of the target image-text;
formatting the first entity noun set and the second entity noun set according to a preset rule;
determining a fourth similarity between the one target repeated image-text and the target image-text based on the formatted first entity noun set and the formatted second entity noun set;
And if the fourth similarity meets a preset condition, retaining the one target repeated image-text.
2. The method of claim 1, wherein the performing multi-stage recall on the target image-text based on the image-text feature set to obtain the recall image-text set corresponding to each stage includes at least:
Fusing the text features and the image features in the image-text feature set to obtain multi-modal features;
Based on the multi-modal features, respectively obtaining a first similarity between the multi-modal features and the multi-modal features of each contrast image-text in a preset contrast image-text data set;
and performing multi-modal recall on the target image-text based on each first similarity, to obtain a recall image-text set corresponding to the multi-modal stage.
3. The method of claim 2, wherein the performing multi-stage recall on the target image-text based on the image-text feature set to obtain the recall image-text set corresponding to each stage further comprises at least one of:
Performing text recall on the target image-text based on second similarities between the text features in the image-text feature set and the text features of each contrast image-text in the preset contrast image-text data set, to obtain a recall image-text set corresponding to a text stage; and
Performing image recall on the target image-text based on third similarities between the image features in the image-text feature set and the image features of each contrast image-text in the preset contrast image-text data set, to obtain a recall image-text set corresponding to an image stage.
4. The method of claim 1, wherein the determining, based on the keyword sets between each recall image-text in each recall image-text set and the target image-text, the initial repeated image-text set corresponding to each recall image-text set respectively comprises:
For each recall image-text set, the following operations are respectively executed:
extracting a first keyword set corresponding to each recall image-text in one recall image-text set, and extracting a second keyword set corresponding to the target image-text;
determining the intersection-over-union ratio between each first keyword set and the second keyword set respectively;
sorting the recall image-texts according to the intersection-over-union ratios;
And screening at least one recall image-text from the recall image-texts based on the sorting result, to obtain an initial repeated image-text set corresponding to the one recall image-text set.
5. The method of claim 4, wherein the determining the intersection-over-union ratio between each first keyword set and the second keyword set respectively comprises:
for each first keyword set, respectively executing the following operations:
Determining a keyword intersection and a keyword union of one first keyword set and the second keyword set;
And taking the ratio of the number of keywords contained in the keyword intersection to the number of keywords contained in the keyword union as the intersection-over-union ratio between the one first keyword set and the second keyword set.
6. The method of claim 1, wherein the determining the target repeated image-text set based on the edit distance between each initial repeated image-text in each initial repeated image-text set and the target image-text comprises:
for each initial repeated image-text, respectively executing the following operations:
Determining the text similarity between one initial repeated image-text and the target image-text according to the edit distance between the one initial repeated image-text and the target image-text;
And if the text similarity meets the threshold requirement, taking the one initial repeated image-text as a target repeated image-text.
7. The method of claim 6, wherein the text similarity is inversely related to the edit distance and positively related to the maximum text length of the one initial repeated image-text and the target image-text.
8. The method according to any one of claims 1-7, wherein the text features are extracted by:
determining the weight of each keyword contained in the target image-text based on a preset keyword lexicon;
And weighting the word vectors of the keywords according to the determined weights, to obtain the text features corresponding to the target image-text.
9. The method according to any one of claims 1-7, wherein the image features are extracted by:
extracting features of each picture contained in the target image-text to obtain an initial feature vector set;
And aggregating all the initial feature vectors in the initial feature vector set into a feature vector with a preset length, and performing dimension reduction to obtain the image features corresponding to the target image-text.
10. An image-text deduplication device, characterized by comprising:
A feature extraction module, configured to extract an image-text feature set of a target image-text in response to a deduplication request for the target image-text; wherein the image-text feature set comprises: text features and image features of the target image-text;
A multi-stage recall module, configured to perform multi-stage recall on the target image-text based on the image-text feature set, to obtain a recall image-text set corresponding to each stage; wherein the multi-stage recall comprises at least: multi-modal recall based on the text features and the image features;
A primary deduplication module, configured to determine, based on the keyword sets between each recall image-text in each recall image-text set and the target image-text, an initial repeated image-text set corresponding to each recall image-text set respectively;
A fine deduplication module, configured to: determine a target repeated image-text set based on the edit distance between each initial repeated image-text in each initial repeated image-text set and the target image-text; and, for each target repeated image-text in the target repeated image-text set, respectively execute the following operations: extracting a first entity noun set from a preset field of one target repeated image-text, and extracting a second entity noun set from the preset field of the target image-text; formatting the first entity noun set and the second entity noun set according to a preset rule; determining a fourth similarity between the one target repeated image-text and the target image-text based on the formatted first entity noun set and the formatted second entity noun set; and if the fourth similarity meets a preset condition, retaining the one target repeated image-text.
CN202111466812.1A 2021-12-03 2021-12-03 Image-text duplication removing method and device Active CN114328884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111466812.1A CN114328884B (en) 2021-12-03 2021-12-03 Image-text duplication removing method and device

Publications (2)

Publication Number Publication Date
CN114328884A CN114328884A (en) 2022-04-12
CN114328884B true CN114328884B (en) 2024-07-09

Family

ID=81048849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111466812.1A Active CN114328884B (en) 2021-12-03 2021-12-03 Image-text duplication removing method and device

Country Status (1)

Country Link
CN (1) CN114328884B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738368A (en) * 2023-06-25 2023-09-12 上海任意门科技有限公司 Method and system for extracting single-mode characteristics and method for extracting post characteristics

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469152A (en) * 2021-09-03 2021-10-01 腾讯科技(深圳)有限公司 Similar video detection method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005149323A (en) * 2003-11-18 2005-06-09 Canon Inc Image processing system, image processing apparatus, and image processing method
JP6173754B2 (en) * 2013-04-18 2017-08-02 株式会社日立製作所 Image search system, image search apparatus, and image search method
CN110929002B (en) * 2018-09-03 2022-10-11 优视科技(中国)有限公司 Similar article duplicate removal method, device, terminal and computer readable storage medium
EP3772014A1 (en) * 2019-07-29 2021-02-03 TripEye Limited Identity document validation method, system and computer program
CN110956038B (en) * 2019-10-16 2022-07-05 厦门美柚股份有限公司 Method and device for repeatedly judging image-text content
CN110956037B (en) * 2019-10-16 2022-07-08 厦门美柚股份有限公司 Multimedia content repeated judgment method and device
CN110909725B (en) * 2019-10-18 2023-09-19 平安科技(深圳)有限公司 Method, device, equipment and storage medium for recognizing text
CN111680173B (en) * 2020-05-31 2024-02-23 西南电子技术研究所(中国电子科技集团公司第十研究所) CMR model for unified searching cross-media information


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant