CN114328884A - Image-text duplication removing method and device - Google Patents

Image-text duplication removing method and device

Info

Publication number
CN114328884A
CN114328884A (application CN202111466812.1A)
Authority
CN
China
Prior art keywords
text
image
recall
target
features
Prior art date
Legal status
Granted
Application number
CN202111466812.1A
Other languages
Chinese (zh)
Other versions
CN114328884B (en)
Inventor
安涵
陈祥
唐伟
黄展鹏
封盛
赵博
林民龙
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111466812.1A
Publication of CN114328884A
Application granted
Publication of CN114328884B
Status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of computers and provides an image-text deduplication method and device that can efficiently and accurately calculate the similarity between two image-texts. Through text, image and multi-modal recall, the similarity of the text and images within an image-text can be measured more comprehensively, balancing recall rate and accuracy. Coarsely filtering the image-texts in the recall stage reduces the computation pressure of the calibration stage and improves deduplication efficiency. The coarse filtering result is further calibrated at the similarity level and at the template-type image-text level: at the similarity level, the coarsely filtered image-texts are calibrated using the editing distance, which is more interpretable and improves the deduplication accuracy; at the template-type image-text level, the similarity between template-type image-texts is optimized according to entity nouns, further improving the deduplication accuracy. The method can better solve the problems of few recalls, low efficiency and low accuracy in current schemes.

Description

Image-text duplication removing method and device
Technical Field
The application relates to the technical field of computers, in particular to a method and a device for removing image-text duplication.
Background
In the internet era, information grows explosively, and the network is flooded with massive image-texts, among which a large number are duplicates; for example, a single image-text is reposted, modified and edited by various media, producing multiple similar image-texts.
Because a large number of repeated image-texts exist in the network, presented in different forms due to different editing, repeatedly storing them occupies a large amount of storage resources and wastes them to a certain extent; the image-texts therefore need deduplication processing, that is, similar and repeated image-texts must be identified.
In the related art, when image-text deduplication is performed, images and texts are generally recalled through two separate channels, each channel comparing single-dimensional features, so the recall rate and accuracy of the image-texts are difficult to balance: increasing the recall rate may reduce the accuracy, and increasing the accuracy may reduce the recall rate.
Disclosure of Invention
The embodiment of the application provides an image-text duplicate removal method and device, which are used for improving recall rate, accuracy and efficiency of image-text duplicate removal.
On one hand, an image-text de-duplication method provided by the embodiment of the application comprises the following steps:
responding to a deduplication request for a target image-text, and extracting an image-text feature set of the target image-text; wherein the image-text feature set is: text features and image features of the target image-text;
based on the image-text feature set, multi-stage recalling is carried out on the target image-text to obtain a recall image-text set corresponding to each stage; wherein the multi-stage recall comprises at least: a multimodal recall based on the text features and the image features;
respectively determining initial repeated image-text sets corresponding to the recall image-text sets based on the recall image-texts in the recall image-text sets and keyword sets between the recall image-texts and the target image-texts;
and determining a target repeated image-text set based on each initial repeated image-text in each initial repeated image-text set and the editing distance between each initial repeated image-text and the target image-text.
On the other hand, an image-text de-duplication device provided by the embodiment of the application includes:
the feature extraction module is used for responding to a deduplication request for the target image-text and extracting an image-text feature set of the target image-text; wherein the image-text feature set is: text features and image features of the target image-text;
the multi-stage recall module is used for performing multi-stage recall on the target image-text based on the image-text feature set to obtain a recall image-text set corresponding to each stage; wherein the multi-stage recall comprises at least: a multimodal recall based on the text features and the image features;
the coarse deduplication module is used for respectively determining an initial repeated image-text set corresponding to each recall image-text set based on each recall image-text in each recall image-text set and the keyword sets between each recall image-text and the target image-text;
and the fine deduplication module is used for determining the target repeated image-text set based on the editing distance between each initial repeated image-text in each initial repeated image-text set and the target image-text.
Optionally, the multi-stage recall module is specifically configured to:
fusing the text features and the image features in the image-text feature set to obtain multi-modal features;
based on the multi-modal features, respectively obtaining first similarities between the multi-modal features and the multi-modal features of each comparison image-text in a preset comparison image-text data set;
and performing multi-modal recall on the target image-text based on each first similarity to obtain the recall image-text set corresponding to the multi-modal stage.
Optionally, the multi-stage recall module is specifically configured to:
performing text recall on the target image-text based on second similarities between the text features in the image-text feature set and the text features of each comparison image-text in a preset comparison image-text data set, to obtain a recall image-text set corresponding to a text stage; and
performing image recall on the target image-text based on third similarities between the image features in the image-text feature set and the image features of each comparison image-text in the preset comparison image-text data set, to obtain a recall image-text set corresponding to an image stage.
Optionally, the coarse deduplication module is specifically configured to:
and respectively executing the following operations for each recall image-text set:
extracting a first keyword set corresponding to each recall image-text in the recall image-text set, and extracting a second keyword set corresponding to the target image-text;
respectively determining the intersection ratio between each first keyword set and the second keyword set;
sorting the recall image-texts according to the intersection ratios;
and screening at least one recall image-text from the recall image-texts based on the sorting result to obtain an initial repeated image-text set corresponding to the recall image-text set.
Optionally, the coarse deduplication module is specifically configured to:
for each first keyword set, respectively executing the following operations:
determining a keyword intersection and a keyword union of the first keyword set and the second keyword set;
and taking the ratio of the number of the keywords contained in the intersection of the keywords to the number of the keywords contained in the union of the keywords as the intersection ratio of the keyword sets.
Optionally, the fine deduplication module is specifically configured to:
and respectively executing the following operations for each initial repeated image-text:
determining text similarity between an initial repeated image-text and the target image-text according to the editing distance between the initial repeated image-text and the target image-text;
and if the text similarity meets the threshold requirement, taking the initial repeated image-text as a target repeated image-text.
Optionally, the text similarity is negatively correlated with the edit distance, and positively correlated with the maximum text length in the initial repeated image-text and the target image-text.
Optionally, the fine deduplication module is further configured to:
and respectively executing the following operations for each target repeated image-text in the target repeated image-text set:
extracting a first entity noun set from a preset field of a target repeated image-text, and extracting a second entity noun set from the preset field of the target image-text;
formatting the first entity noun set and the second entity noun set respectively according to a preset rule;
determining a fourth similarity between the target repeated image-text and the target image-text based on the formatted first entity noun set and the formatted second entity noun set;
and if the fourth similarity meets a preset condition, reserving the target repeated image-text.
Optionally, the feature extraction module is specifically configured to:
determining the weight of each keyword contained in the target image-text based on a preset keyword lexicon;
and weighting the word vectors of the keywords according to the determined weight to obtain the text characteristics corresponding to the target image-text.
Optionally, the feature extraction module is specifically configured to:
extracting the features of each picture contained in the target image-text to obtain an initial feature vector set;
and aggregating each initial feature vector in the initial feature vector set into a feature vector with a preset fixed length, and performing dimension reduction to obtain the image features corresponding to the target image-text.
In another aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the image-text deduplication method when executing the computer program.
In another aspect, an embodiment of the present application provides a computer-readable storage medium storing computer instructions which, when executed on a computer, cause the computer to execute the image-text deduplication method.
In another aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the image-text deduplication method.
The embodiments of the present application provide an image-text deduplication method and device: the image-text feature set extracted from a target image-text is used for multi-stage recall of the target image-text, and since the multi-stage recall includes a multi-modal recall based on both text features and image features, the recall rate and accuracy for repeated image-texts are improved. In addition, for the recall image-text set obtained at each recall stage, an initial repeated image-text set is determined based on the keyword sets of each recall image-text and of the target image-text, filtering out image-texts with low repetition degree from the recall image-text set, which reduces the computation of subsequent deduplication and improves deduplication efficiency. Finally, the filtered initial repeated image-text sets are further deduplicated according to the editing distance between each initial repeated image-text and the target image-text, achieving a better deduplication effect with higher accuracy.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application.
Fig. 1 is a schematic view of an application scenario in an embodiment of the present application;
fig. 2 is an architecture diagram of an image-text deduplication system provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an image-text deduplication process provided in an embodiment of the present application;
fig. 4 is a flowchart of an image-text deduplication method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a process of extracting image features by CLIP according to an embodiment of the present application;
FIG. 6 is a flowchart of a multi-stage recall method provided by an embodiment of the present application;
FIG. 7 is a diagram illustrating the results of various recall stages provided by an embodiment of the present application;
FIG. 8 is a flowchart of a coarse deduplication method provided by an embodiment of the present application;
FIG. 9 is a flow chart of a fine deduplication method provided by an embodiment of the present application;
fig. 10 is a schematic diagram of an image-text deduplication result provided in the embodiment of the present application;
fig. 11 is a schematic view of a processing pipeline for image-text deduplication according to an embodiment of the present disclosure;
fig. 12 is a schematic diagram of an image-text de-duplication effect provided in the embodiment of the present application;
fig. 13 is a schematic structural diagram illustrating a composition of an image-text deduplication apparatus according to an embodiment of the present application;
fig. 14 is a schematic diagram of a hardware component structure of an electronic device to which an embodiment of the present application is applied;
fig. 15 is a schematic diagram of a hardware component structure of another electronic device to which the embodiment of the present application is applied.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
Some concepts related to the embodiments of the present application are described below.
Image-text: refers to an article that contains pictures.
Hamming distance: in a set of valid bit codes, the Hamming distance is obtained by XOR-ing two bit strings and counting the number of 1s in the result; equivalently, the Hamming distance of two equal-length strings is the number of positions at which the characters differ.
Editing distance (edit distance): quantifies the degree of difference between two strings; it can be understood as the minimum number of single-character editing operations (including insertion, deletion and replacement) required to convert one text into another.
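Both definitions can be sketched in a few lines of Python; the following is an illustrative textbook implementation, not code taken from the patent:

```python
def hamming_distance(a: int, b: int) -> int:
    """XOR the two bit strings and count the 1s in the result."""
    return bin(a ^ b).count("1")

def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and
    replacements needed to turn s into t (Levenshtein distance)."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # replacement
    return dp[m][n]
```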
Word vector (Word2Vec) model: in natural language processing, the finest granularity is the word; words compose sentences, and sentences compose paragraphs, chapters and documents. Words exist in symbolic form (e.g., Chinese, English, Latin characters), and Word2Vec converts them into numeric form, i.e., embeds the words into a mathematical space.
Term Frequency-Inverse Document Frequency (TF-IDF): used to evaluate how important a word is to a document in a corpus. The importance is positively correlated with the frequency of the word in the document, but negatively correlated with its frequency across the corpus.
Contrastive Language-Image Pre-training (CLIP): a model trained on 400 million image-text pairs collected from the web, with the text used as the label of the image.
The image-text de-duplication method provided in the embodiment of the application mainly relates to Artificial Intelligence (AI), Natural Language Processing (NLP) and Machine Learning (ML), and is designed based on computer vision technology and machine learning in the artificial intelligence.
The following briefly introduces the design concept of the embodiments of the present application.
At present, image-text deduplication mainly comprises text recall and picture recall. Text recall mainly uses the SimHash algorithm, the cosine similarity algorithm and the like; picture recall mainly uses the traditional pHash algorithm, the SIFT algorithm, and deep neural network algorithms based on MobileNet and the like.
(I) Text recall
The SimHash algorithm mainly reduces a long text to a representation by a few keywords. First, each keyword is encoded into a fixed-length (e.g., 64-bit) binary string composed of hash values; then the string is weighted (W being the hash weight), and the weighted results of the keywords are merged and accumulated; further, according to the accumulated result, negative weights are set to 0 and positive weights to 1, giving the SimHash of the text. The 64-bit binary string is evenly divided into 4 blocks; by the drawer (pigeonhole) principle, if the Hamming distance between two texts is less than 3, at least 1 block is completely identical between them, so each of the 4 blocks is used in turn as a 16-bit prefix for lookup, an inverted index is built, and the Hamming distance between the two texts is calculated; finally, the similarity of the two texts is determined from the Hamming distance: the smaller the Hamming distance, the higher the similarity. Through drawer-principle acceleration, the SimHash algorithm keeps the computation small and retrieval fast.
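As an illustration of the flow just described, the following Python sketch builds a SimHash fingerprint from weighted keywords; the use of MD5 as the per-keyword hash function is an assumption made for the example, not something fixed by the algorithm description above:

```python
import hashlib

def simhash(weighted_keywords: dict, bits: int = 64) -> int:
    """Accumulate +weight for 1-bits and -weight for 0-bits of each
    keyword's hash, then keep the sign of each accumulated position."""
    acc = [0.0] * bits
    for word, weight in weighted_keywords.items():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if acc[i] > 0:        # positive weight -> 1, negative -> 0
            fingerprint |= 1 << i
    return fingerprint

# Two texts are duplicate candidates when the Hamming distance between
# their fingerprints is small (the scheme above uses < 3, accelerated
# by indexing each of the four 16-bit blocks).
```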
The cosine similarity algorithm represents texts as vectors, compares the vectors by cosine similarity, and recalls texts whose content is similar to that of the query text.
(II) Picture recall
A conventional picture recall algorithm (e.g., the pHash algorithm) generally converts a picture into a gray-scale image, applies a Discrete Cosine Transform (DCT) to decompose the picture into frequency components and keep the low frequencies, calculates the average value of all picture pixel points after the DCT, records each point larger than the average as 1 and otherwise as 0, and finally calculates the Hamming distance between two pictures: the smaller the Hamming distance, the more similar the pictures. Such algorithms impose a high similarity requirement, so they mainly match near-identical pictures.
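The pHash flow above can be sketched as follows, assuming Pillow, NumPy and SciPy are available; the 32x32 resize and 8x8 low-frequency block are conventional choices for this family of algorithms rather than values fixed by the description:

```python
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def phash(path: str, hash_size: int = 8, img_size: int = 32) -> int:
    """Gray-scale -> 2-D DCT -> compare low-frequency coefficients
    with their mean -> bit string, as outlined above."""
    img = Image.open(path).convert("L").resize((img_size, img_size))
    pixels = np.asarray(img, dtype=np.float64)
    freq = dct(dct(pixels, axis=0), axis=1)     # 2-D DCT
    low = freq[:hash_size, :hash_size]          # keep low frequencies
    bits = (low > low.mean()).flatten()         # > mean -> 1, else 0
    return int("".join("1" if b else "0" for b in bits), 2)

# hamming_distance(phash(p1), phash(p2)) is then small for similar pictures.
```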
The deep neural network algorithm trains a supervised picture classification model, takes the last embedding layer of the model as the feature vector of a picture, and recalls similar image-texts using that feature vector.
However, in the related art, text recall and picture recall run as two separate channels, each comparing single-dimensional features, and the comparison threshold is difficult to determine, so the similarity determined by either channel has low accuracy and the accuracy of image-text recall is poor; moreover, with texts and pictures recalled in two separate channels, the recall rate and accuracy are difficult to balance.
For example, for text recall, the SimHash algorithm is fast and fairly accurate but recalls few items; a cosine similarity method fine-tuned from a BERT pre-trained model recalls many items, but the computation is heavy, online prediction is slow, and the recall accuracy is low.
Meanwhile, one pain point of current recall schemes on the product side is that template-type image-texts, such as finance and weather image-texts, usually contain the content required by the template; for example, weather image-texts usually include the date, temperature, wind direction, icons and other required content. Their similarity is therefore particularly high, yet they do not belong to the category of repeated articles and are misidentified, so the recall accuracy is poor.
In view of this, an embodiment of the present application provides an image-text deduplication method and apparatus, which perform multi-stage recall over the text and image features of image-texts and apply coarse and fine deduplication to the recall results to identify repeated image-texts. Specifically, in the text recall stage, text features are extracted through the keyword set of the target image-text, and image-texts whose text similarity with the target exceeds a threshold are recalled from the comparison image-text data set; in the picture recall stage, image features are extracted from at least one picture contained in the target image-text, and image-texts whose image similarity exceeds a threshold are recalled from the comparison image-text data set; in the multi-modal recall stage, image-texts whose multi-modal similarity exceeds a threshold are recalled using both the text features and the image features of the target image-text. Further, for the image-texts recalled at each stage, those repeating the target image-text are preliminarily screened by the intersection ratio of keywords between each recalled image-text and the target, and the repetition similarity is then calculated through the editing distance between each screened image-text and the target, giving a target repeated image-text set of high repetition degree. The multi-stage recall makes full use of visual, textual and multi-modal features, and the stages complement each other to balance recall rate and accuracy; the preliminary filtering by keyword intersection ratio reduces the computation pressure of the subsequent fine deduplication stage and improves deduplication efficiency; and during fine deduplication, the recalled repeated image-texts are calibrated with the editing distance, which is more interpretable and improves the deduplication accuracy.
Meanwhile, for template-type image-texts, entity nouns are extracted for calibration, which improves the accuracy of the similarity between template-type image-texts and thus further improves the deduplication accuracy.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it should be understood that the preferred embodiments described herein are merely for illustrating and explaining the present application, and are not intended to limit the present application, and that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 1 is a schematic view of an application scenario in the embodiment of the present application. The application scenario diagram includes a terminal device 110 and a server 120. The terminal device 110 in the embodiment of the present application may be installed with an application related to graphics, where the application may be an applet, a web page, and the like, and is not limited specifically herein. The server 120 is a server corresponding to software, web pages, applets, etc.
It should be noted that the image-text deduplication method in the embodiment of the present application may be executed by the server 120 or the terminal device 110 alone, or jointly by the server 120 and the terminal device 110. The present disclosure mainly takes separate execution by the server 120 as an example; specifically, the method may be executed by one server 120 or by a plurality of servers 120 in parallel, which is not limited here.
In the embodiment of the present application, the similarity-based deduplication model may be deployed on the terminal device 110 for training, or may be deployed on the server 120 for training. A large number of training samples may be stored in the server 120 for training the model. Optionally, after the model is trained based on the method in the embodiment of the present application, the trained model may be deployed on the server 120 or the terminal device 110.
In an alternative embodiment, terminal device 110 and server 120 may communicate via a communication network.
In an alternative embodiment, the communication network is a wired network or a wireless network.
In the embodiment of the present application, the terminal device 110 includes, but is not limited to, a personal computer, a mobile phone, a tablet computer, a notebook, an e-book reader, a smart appliance, a vehicle-mounted device, and the like. Each terminal device 110 is connected to a server 120 through a wireless network, and the server 120 is a server or a server cluster or a cloud computing center formed by a plurality of servers, or is a virtualization platform.
It should be noted that fig. 1 is only an example, and the number of the terminal devices and the servers is not limited in practice, and is not specifically limited in the embodiment of the present application.
In addition, the image-text deduplication method provided in the embodiment of the present application may be applied to various application scenarios including an image-text recommendation task, an image-text search task, an image-text deduplication task, and the like, including but not limited to cloud technology, artificial intelligence, smart traffic, driving assistance, and the like, and training samples used in different scenarios are different and are not listed here.
The image-text deduplication method provided by the exemplary embodiments of the present application is described below with reference to the drawings in conjunction with the application scenarios described above; it should be noted that the application scenarios are shown only for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect.
Referring to fig. 2, an architecture diagram of an image-text deduplication system provided in the embodiment of the present application is shown; as shown in fig. 2, the image-text deduplication system mainly includes a media pool fusion layer, a storage index layer, a feature calculation layer, a feature fusion layer, a recall layer, a coarse ranking layer, and a fine ranking layer. The media pool fusion layer mainly fuses comparison image-text data sets crawled inside and outside the platform; the storage index layer preprocesses the crawled comparison image-text data sets; the feature calculation layer extracts image features from the picture set contained in an image-text through the CLIP algorithm, and extracts text features from the keyword set contained in the image-text; the feature fusion layer fuses the image features of all pictures in the picture set, fuses the text features of all keywords in the keyword set, and fuses the image features and the text features into multi-modal features; the recall layer recalls repeated image-texts based on the extracted text, image and multi-modal features; the coarse ranking layer preliminarily screens the recalled repeated image-texts, keeping the repeated image-texts whose similarity meets the threshold requirement; and the fine ranking layer further deduplicates the preliminarily screened repeated image-texts.
As shown in fig. 2, the image-text deduplication system provided in the embodiment of the present application can be applied to a pipeline deduplication service, an original-content identification service, a content rights-protection service, and a CP transport identification service, and the recall results of each service can be fed back to the fine ranking layer and the coarse ranking layer for calibration.
The image-text deduplication method provided by the embodiment of the present application is mainly embodied in the recall layer, the coarse ranking layer and the fine ranking layer. As shown in fig. 3, the recall layer performs multi-stage recall based on the extracted text, image and multi-modal features to balance recall rate and accuracy; the coarse ranking layer preliminarily screens the recall results of each recall stage based on the keywords between each recalled image-text and the target image-text, improving deduplication efficiency; and the fine ranking layer determines the editing distance between two image-texts, determines their similarity based on the editing distance, and uses the entity nouns in the texts to calibrate the deduplication result.
Fig. 4 is a flowchart of an image-text deduplication method provided in the embodiment of the present application; fig. 4 takes a server as the execution subject, and the specific implementation flow of the method is as follows:
S401: The server responds to a deduplication request for the target image-text and extracts the image-text feature set of the target image-text.
When S401 is executed, after the image-text deduplication request is received, text features are extracted from the keywords contained in the target image-text, and image features are extracted from the pictures contained in the target image-text, giving an image-text feature set composed of the text features and image features of the target image-text.
In the embodiment of the present application, text features are extracted as follows: based on a preset keyword lexicon, the weight of each keyword contained in the target image-text is determined with the TF-IDF algorithm; meanwhile, each keyword is converted into a word vector with the Word2Vec algorithm; the word vectors are then weighted with the TF-IDF weights, and text features of a preset dimension (e.g., 128 dimensions) are obtained from the weighted word vectors.
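A minimal sketch of this text-feature step, assuming a TF-IDF weight table and a trained Word2Vec vector table are already available (both are stand-ins for the preset keyword lexicon and word-vector model, whose details the description leaves open):

```python
import numpy as np

def text_feature(keywords, idf_weights, word_vectors, dim=128):
    """Weight each keyword's Word2Vec vector by its TF-IDF weight and
    pool the weighted vectors into one fixed-dimension text feature.
    Mean pooling is an assumption; the description only says the text
    feature is obtained from the weighted word vectors."""
    vecs = [idf_weights[w] * np.asarray(word_vectors[w])
            for w in keywords if w in idf_weights and w in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```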
In the embodiment of the present application, image features are extracted as follows: features are extracted from each picture contained in the target image-text to obtain an initial feature vector set; the initial feature vectors are then aggregated into a feature vector of preset fixed length and reduced in dimension to obtain the image features. Because the image features are dimension-reduced, the recall speed can be greatly improved.
For example, if the target image-text contains n (n ≥ 1) pictures and the initial feature vector of each picture is 512-dimensional, the initial feature vector set is n × 512-dimensional; the set is aggregated into a fixed-length feature vector of k × 512 dimensions with the NeXtVLAD algorithm, and the fixed-length vector is then compressed to 128 dimensions (matching the text feature dimension) by Principal Component Analysis (PCA), giving the image features.
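Structurally, the aggregate-then-compress step might look as follows; for brevity the NeXtVLAD aggregation is replaced here by simple mean pooling, so this is a sketch of the data flow rather than the named algorithm:

```python
import numpy as np
from sklearn.decomposition import PCA

def image_feature(frame_vectors: np.ndarray, pca: PCA) -> np.ndarray:
    """frame_vectors is the n x 512 initial feature vector set; a PCA
    fitted offline stands in for the final compression to 128 dims."""
    pooled = frame_vectors.mean(axis=0, keepdims=True)   # 1 x 512
    return pca.transform(pooled)[0]                      # 128-dim feature

# The PCA would be fitted once on aggregated vectors from the comparison
# image-text data set, e.g.:
# pca = PCA(n_components=128).fit(pooled_training_matrix)
```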
Optionally, the CLIP algorithm may be used to extract the image features of the target image-text. The CLIP flow is as follows: with a pre-trained Image Encoder and Text Encoder, contrastive learning predicts which pictures in a training image-text data set are paired with which texts; CLIP is then converted into a zero-shot classifier that uses natural language as a flexible prediction space (i.e., texts serve as labels of the pictures), thereby achieving generalization and transfer.
Fig. 5 is a schematic diagram of the training process of the CLIP model provided in the embodiment of the present application; as shown in fig. 5, it mainly includes three processes: contrastive pre-training, creating a dataset classifier from label text, and zero-shot prediction of the label text corresponding to an image.
The training image-text data set used by the CLIP model consists of 400 million "picture-text" pairs and contains 500,000 query texts (category labels), each category label corresponding to about 20,000 pictures; the batch size N is 2^15 = 32768. Each batch contains N^2 "picture-text" pairs, of which N are positive sample pairs and the remaining N^2 - N are negative sample pairs. The similarity (matching degree) of a "picture-text" pair can be represented by the embedding distance: for a positive sample pair, the closer the embedding distance the better, so as to maximize the similarity between the picture and the text; for a negative sample pair, the farther the embedding distance the better, so as to minimize that similarity. In the pre-training process, a pre-trained image encoder (such as a ResNet or Vision Transformer network) and text encoder (such as a Transformer network) convert the image classification task into an image-text matching task; contrastive learning predicts which pictures in the training data set are paired with which texts (the pairs lie on the diagonal), a contrastive loss value is calculated from the pairing result, and training stops when the loss meets the threshold requirement, giving the trained CLIP model. Further, the trained CLIP model is fine-tuned to extract the image features of each picture.
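The symmetric contrastive objective described above is commonly written as the following PyTorch-style sketch, in which the diagonal of the N x N similarity matrix holds the positive pairs; the temperature value is illustrative, not taken from the description:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """In an N-sized batch the N diagonal image-text pairs are positives
    and the remaining N^2 - N pairs are negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # N x N similarities
    labels = torch.arange(logits.size(0))            # positives on diagonal
    loss_i = F.cross_entropy(logits, labels)         # image -> text
    loss_t = F.cross_entropy(logits.t(), labels)     # text -> image
    return (loss_i + loss_t) / 2
```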
S402: and the server recalls the target image-text in multiple stages based on the image-text characteristic set to obtain a recall image-text set corresponding to each stage.
In the embodiment of the application, the multi-stage recall at least comprises a multi-modal recall, and in addition, the multi-stage recall also comprises a text recall and a picture recall, so that the characteristics of multiple dimensions can be fully utilized, and the recall rate and the accuracy rate can be well balanced.
It should be noted that, for each comparison image-text in the preset comparison image-text data set to be recalled, its text features, image features and multi-modal features are extracted in the same manner as the text features and image features of the target image-text, so the description is not repeated.
Fig. 6 is a schematic diagram of the multi-stage recall process provided in an embodiment of the present application; the recall of each stage can be executed in parallel, and the process mainly includes the following steps:
S4021: The server fuses the text features and the image features in the image-text feature set to obtain multi-modal features, respectively obtains first similarities between the multi-modal features and the multi-modal features of each comparison image-text in the preset comparison image-text data set, and performs multi-modal recall on the target image-text based on the first similarities to obtain the recall image-text set corresponding to the multi-modal stage.
In the embodiment of the application, the image features of the target image-text are reduced in dimension so that the text features and image features have consistent dimensionality and can be fused into the multi-modal features of the target image-text. Then, for each comparison image-text in the preset comparison image-text data set, a first similarity between the multi-modal features of the target image-text and the multi-modal features of the comparison image-text is calculated. The first similarity is compared with a preset first threshold; if it is greater than the first threshold, the comparison image-text is similar to the target image-text and is taken as a recall image-text of the multi-modal recall. Traversing the preset comparison image-text data set yields the recall image-text set corresponding to the multi-modal stage.
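A hedged sketch of this multi-modal recall step: the fusion operator is assumed here to be concatenation of the equal-dimension features (the description does not fix the fusion operation), and cosine similarity stands in for the first similarity:

```python
import numpy as np

def multimodal_recall(target_text_f, target_image_f, corpus, threshold):
    """corpus maps an image-text id to its (text_feature, image_feature)
    pair; every comparison image-text whose cosine similarity with the
    fused target feature exceeds the first threshold is recalled."""
    target = np.concatenate([target_text_f, target_image_f])
    target = target / np.linalg.norm(target)
    recalled = []
    for doc_id, (text_f, image_f) in corpus.items():
        cand = np.concatenate([text_f, image_f])
        cand = cand / np.linalg.norm(cand)
        if float(target @ cand) > threshold:   # first similarity test
            recalled.append(doc_id)
    return recalled
```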
For example, as shown in fig. 7, the comparison image-texts in the preset comparison data set are {S1, S2, S3, …, SM}; the first similarities between the comparison image-texts S1, S4, S5, S12, S34, S35, S67, S72 and the target image-text meet the threshold requirement, so the recall image-text set corresponding to the multi-modal stage (denoted Q1) is {S1, S4, S5, S12, S34, S35, S67, S72}.
S4022: and the server recalls the text of the target image-text based on the second similarity between the text features of the image-text in the image-text feature set and the text features of the contrast images-texts in the preset contrast image-text data set to obtain a recalled image-text set corresponding to the text stage.
And calculating a second similarity between the text characteristics of the target image-text and the text characteristics of the contrast image-text based on the text characteristics of the target image-text and the text characteristics of the contrast image-text aiming at each contrast image-text in the preset contrast image-text data set. And further, comparing the second similarity with a preset second threshold, and if the second similarity is greater than the second threshold, indicating that the comparison image-text is similar to the target image-text, taking the comparison image-text as a recall image-text of the text recall. And traversing the preset image-text comparison data set to obtain a recall image-text set corresponding to the text stage.
For example, as shown in fig. 7, the comparison image-texts in the preset comparison data set are {S1, S2, S3, …, SM}; the second similarities between the comparison image-texts S2, S4, S10, S12, S26, S45 and the target image-text meet the threshold requirement, so the recall image-text set corresponding to the text stage (denoted Q2) is {S2, S4, S10, S12, S26, S45}.
Optionally, the second similarity is a cosine similarity.
S4023: and the server recalls the target image-text based on the third similarity between the image features in the image-text feature set and the text features of the contrast images-texts in the preset contrast image-text data set to obtain a recalled image-text set corresponding to the image stage.
And calculating a third similarity (matching degree) between the image characteristics of the target image and the text characteristics of the contrast image based on the image characteristics of the target image and the text characteristics of the contrast image in the preset contrast image and text data set. And further, comparing the third similarity with a preset third threshold, and if the third similarity is greater than the third threshold, indicating that the comparison image-text is similar to the target image-text, taking the comparison image-text as a recall image-text of the image recall. And traversing the preset comparison image-text data set to obtain a recall image-text set corresponding to the image stage.
For example, as shown in fig. 7, the comparison image-texts in the preset comparison data set are {S1, S2, S3, …, SM}; the third similarities between the comparison image-texts S3, S4, S17, S34, S35, S59 and the target image-text meet the threshold requirement, so the recall image-text set corresponding to the picture stage (denoted Q3) is {S3, S4, S17, S34, S35, S59}.
In the embodiment of the present application, multi-stage recall through text features, image features and multi-modal features has two advantages over two-way recall of pictures and texts. On the one hand, multi-modal recall is recall over a multi-dimensional feature, which remedies the loss of other dimensional features in any single-channel picture or text recall and greatly improves the recall rate of the model. On the other hand, multi-modal recall uses the similarity between multi-modal features, so the similarity of two image-texts is measured more comprehensively from the perspective of the image-text as a whole rather than a single channel, which effectively improves the accuracy of model recall.
S403: and the server respectively determines an initial repeated image-text set corresponding to each recall image-text set based on each recall image-text in each recall image-text set and a keyword set between each recall image-text and the target image-text.
Taking one recall image-text set among the recall image-text sets as an example, the initial repeated image-text set is filtered out through coarse deduplication; referring to fig. 8, the method mainly includes the following steps:
S4031: The server extracts a first keyword set corresponding to each recall image-text in the recall image-text set, and extracts a second keyword set corresponding to the target image-text.
Each of the first keyword set and the second keyword set may include the same keyword or different keywords.
Taking any one of the recall image-texts as an example, assume the first keyword set extracted from the recall image-text is U1 = {C1, C2, C3} and the second keyword set extracted from the target image-text is U2 = {C1, C3, C4}; the two sets contain the same keywords C1 and C3, and the different keywords C2 and C4.
S4032: and the server respectively determines the intersection ratio between each first keyword set and each second keyword set.
For any one first keyword set, the keyword intersection and the keyword union of the first keyword set and the second keyword set are determined, and the ratio of the number of keywords contained in the keyword intersection to the number of keywords contained in the keyword union is taken as the intersection ratio between the keyword sets.
For example, the keyword intersection V1 between the first keyword set U1 and the second keyword set U2 is {C1, C3}, the keyword union V2 is {C1, C2, C3, C4}, and the intersection ratio q1 between U1 and U2 is 2/4 = 50%.
S4033: and sequencing the recalling graphics and texts according to the intersection ratios.
Optionally, assuming there are 100 recall image-texts in the recall image-text set, the intersection ratios between each first keyword set and the second keyword set are q1, q2, q3, …, q100, which are sorted in descending order.
S4034: and the server screens at least one recall image-text from each recall image-text based on the sequencing result to obtain an initial repeated image-text set corresponding to one recall image-text set.
In S4034, based on the ranking of the intersection ratios, the recall image-texts with the top K (e.g., K = 20) ratios are selected as the initial repeated image-texts, giving the initial repeated image-text set corresponding to the recall image-text set.
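Steps S4031 to S4034 amount to a Jaccard (intersection-over-union) ranking over keyword sets; a minimal sketch, with the keyword sets assumed to be Python sets:

```python
def coarse_dedup(recall_keywords, target_keywords, top_k=20):
    """recall_keywords maps each recall image-text id to its first
    keyword set; rank by intersection ratio with the target's second
    keyword set and keep the top K as initial repeated image-texts."""
    scored = []
    for doc_id, words in recall_keywords.items():
        union = len(words | target_keywords)
        iou = len(words & target_keywords) / union if union else 0.0
        scored.append((iou, doc_id))
    scored.sort(reverse=True)                  # descending by ratio
    return [doc_id for _, doc_id in scored[:top_k]]
```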
For example, as shown in fig. 10, for the recall image-text set Q1 corresponding to the multi-modal stage, the initial repeated image-text set obtained by keyword intersection-ratio coarse deduplication is Q1' = {S1, S4, S12, S34, S35, S72}; for the recall image-text set Q2 corresponding to the text stage, it is Q2' = {S4, S10, S12, S45}; and for the recall image-text set Q3 corresponding to the picture stage, it is Q3' = {S4, S17, S34, S35, S59}.
In the embodiment of the present application, performing coarse deduplication separately on the recall image-text set of each recall stage avoids the deduplication computing pressure that the large number of image-texts recalled at each stage would otherwise cause, so the recall rate is improved while the deduplication efficiency of the whole system is accelerated.
S404: and the server determines a target repeated image-text set based on each initial repeated image-text in each initial repeated image-text set and the editing distance between each initial repeated image-text and the target image-text.
Taking any one initial repeated image-text in each initial repeated image-text set as an example, the text similarity between the initial repeated image-text and the target image-text is determined with Formula 1 from the editing distance between them; the text similarity is compared with a preset similarity threshold, and if it is greater than the similarity threshold, indicating a high degree of content repetition between the initial repeated image-text and the target image-text, the initial repeated image-text is taken as a target repeated image-text. Traversing each initial repeated image-text set yields the target repeated image-text set. The similarity is calculated as follows:
sim = 1 - dist / max(len(a), len(b))    (Formula 1)
where sim represents the text similarity between the initial repeated image-text and the target image-text, dist represents the editing distance between them, len(a) represents the text length of the initial repeated image-text, and len(b) represents the text length of the target image-text.
As can be seen from formula 1, the text similarity between the initial repeated image-text and the target image-text is in negative correlation with the editing distance between the initial repeated image-text and the target image-text, and is in positive correlation with the maximum text length between the initial repeated image-text and the target image-text.
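Formula 1 and the threshold test can be sketched directly, reusing the edit_distance function from the concept definitions above; note that the formula as written here is reconstructed from the correlations the description states:

```python
def text_similarity(a: str, b: str) -> float:
    """Falls as the editing distance grows; normalised by the longer
    of the two text lengths (Formula 1 as reconstructed above)."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# An initial repeated image-text becomes a target repeated image-text
# when text_similarity(...) exceeds the preset similarity threshold.
```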
For example, as shown in fig. 10, for the coarsely deduplicated initial repeated image-text sets Q1', Q2' and Q3', the target repeated image-text set obtained after fine deduplication according to the editing distance between the texts is Q = {S4, S34, S35}.
In the embodiment of the present application, fine deduplication of the recalled initial repeated image-texts uses the editing distance, which makes the relation between repeated image-texts more interpretable, improving the accuracy of the similarity between each recalled initial repeated image-text and the target image-text and thus the deduplication accuracy.
In some embodiments, many template-type image-texts appear in the service, such as sports, finance and weather image-texts, and two such image-texts may contain the same content required by the template; for example, weather image-texts usually include the date, temperature, wind direction, icons and the like. Although such template-type image-texts therefore have high similarity, they do not in themselves belong to the category of repeated articles, and exemptions need to be made for them to further improve the deduplication accuracy.
Therefore, after the target repeated image-text set is determined, it is updated to remove the template-type image-texts misidentified as repeated image-texts. For any one target repeated image-text in the target repeated image-text set, the template-type image-text exemption process is shown in fig. 9 and mainly includes the following steps:
S4051: The server extracts a first entity noun set from a preset field of the target repeated image-text, and extracts a second entity noun set from the same preset field of the target image-text.
In an alternative embodiment, for template-type image-texts of sports, finance, weather and the like, the entity nouns (such as the key nouns of date, place name and stock code) are extracted from the first 500 characters of the target repeated image-text to obtain the first entity noun set, and from the first 500 characters of the target image-text to obtain the second entity noun set.
Optionally, the embodiment of the present application does not limit the entity noun extraction algorithm; for example, a Named Entity Recognition (NER) algorithm may be used.
S4052: the server formats the first entity noun set and the second entity noun set respectively according to preset rules.
For example, assuming the first entity noun set and the second entity noun set contain "date" nouns, the formatted form is yyyymmdd, where yyyy denotes the year, mm the month, and dd the day.
For another example, assuming the first entity noun set and the second entity noun set contain "stock code" nouns, the formatted form is xx + 6 digits, where xx denotes the stock exchange on which the stock is listed.
Optionally, if an entity noun set includes entity nouns of multiple categories, the image-text corresponding to that entity noun set has an entity noun list for each category, and each entity noun list stores the formatted entity nouns of one category.
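A sketch of the formatting step under the two example rules above; the category-detection heuristics here are illustrative assumptions, not the patent's rule set:

```python
import re

def format_entities(entities):
    """Bucket each extracted entity noun by category and normalise it
    (dates to yyyymmdd, stock codes to 'xx' + 6 digits), one list of
    formatted nouns per category."""
    formatted = {"date": set(), "stock": set(), "other": set()}
    for ent in entities:
        digits = re.sub(r"\D", "", ent)
        compact = ent.replace(" ", "").lower()
        if len(digits) == 8:                       # e.g. "2021-06-14" -> "20210614"
            formatted["date"].add(digits)
        elif re.fullmatch(r"[a-z]{2}\d{6}", compact):
            formatted["stock"].add(compact)        # xx + 6 digits
        else:
            formatted["other"].add(ent)
    return formatted
```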
S4053: the server determines a fourth similarity between the target repeated image and the target image based on the formatted first entity noun set and the formatted second entity noun set.
In an embodiment of the application, after each entity noun in the first entity noun set and the second entity noun set is formatted, the entity nouns in the first entity noun set are compared one by one with those in the second entity noun set, and the fourth similarity between the target repeated image-text and the target image-text is determined from the comparison result.
Optionally, the more entity nouns differ between the first entity noun set and the second entity noun set, the smaller the fourth similarity between the target repeated image-text and the target image-text.
S4054: the server determines whether the fourth similarity satisfies a preset condition, if so, executes S4055, otherwise, executes S4056.
In the embodiment of the present application, if the fourth similarity is greater than the preset threshold, it indicates that the target repeated image-text and the target image-text are repeated image-texts, then S4055 is executed, otherwise, it indicates that the target repeated image-text and the target image-text are not repeated image-texts, and S4056 is executed.
S4055: the server keeps the target repeated graphics in the target repeated graphics set.
S4056: the server deletes the target repeated graphics in the target repeated graphics set.
For example, as shown in fig. 10, the target repeated image-text set determined by the editing distance is Q = {S4, S34, S35}; if the fourth similarity between the target repeated image-text S34 and the target image-text is smaller than the preset threshold, indicating that S34 and the target image-text are not repeated image-texts, then after fine deduplication of the template-type image-texts by entity nouns, the updated target repeated image-text set is Q' = {S4, S35}.
It should be noted that the embodiments provided in the present application may be used alone or in combination. Taking the current online baseline index as the reference, the recall rate and accuracy of different combinations of embodiments are shown in Table 1.
Table 1. Comparison of the effects of the different methods

Method                                     Accuracy   Recall rate
Baseline                                   78%        57%
Baseline + multi-way recall                78%        80%
Baseline + multi-way recall + calibration  98%        75%
As can be seen from Table 1, after the three recalls of text, image and multi-modality are adopted, the recall rate rises to 80% with the accuracy unchanged; after the calibration layer is added on top of the multi-way recall, the overall accuracy rises to 98%.
The image-text deduplication method and device of the embodiments of the present application can efficiently and accurately calculate the similarity between two image-texts, effectively addressing the few recalls, low efficiency and low accuracy of image-text deduplication in the prior art. Overall, the method is divided into three stages. The first is the recall stage, which mainly comprises text recall, picture recall and multi-modal recall; multi-way recall measures the similarity of the text and pictures within image-texts more comprehensively, balancing recall rate and accuracy. The second is the coarse deduplication stage, which coarsely filters the image-texts from the recall stage to reduce the computation pressure of the calibration stage and improve deduplication efficiency. The third is the calibration stage: for similarity calibration, the coarsely filtered image-texts are calibrated with the editing distance, which is more interpretable and improves the deduplication accuracy; for template calibration, the similarity between template-type image-texts is optimized with entity nouns such as time, place and stock code, further improving the deduplication accuracy. The algorithm can better solve the problems of few recalls, low efficiency and low accuracy in current schemes.
The image-text duplicate removal method provided by the embodiment of the application can be applied to various services.
For example, when the image-text de-duplication method provided by the embodiment of the application is applied to an auditing service, a machine judges whether newly published image-text content is a repeat, enabling a publish-first, review-later workflow. On one hand, the manpower needed for auditing is reduced by 40%, saving labor cost; on the other hand, the average auditing time is reduced by about 90%, accelerating the whole auditing process.
Fig. 11 is a schematic view of a processing pipeline of the image-text de-duplication service provided in the embodiment of the present application. As shown in fig. 11, the pipeline carries the basic information of the published image-text, such as its title and publication number, as well as the platform number (Cmsid) of each recalled repeated image-text with high similarity to the target image-text, such as 20210614A025J00.
Fig. 12 is a schematic diagram of a de-duplication result of the image-text de-duplication service provided in the embodiment of the application. As shown in fig. 12, (a) is a newly published image-text and (b) is an existing comparison image-text from the preset comparison image-text data set; as can be seen from fig. 12, the contents of the two image-texts are highly similar.
In other downstream business scenarios, a business party can apply the similarity underlying the image-text de-duplication provided by the embodiment of the application to various links, including but not limited to originality identification and rights identification in the content dimension, and repost (content-transfer) identification in the account dimension.
For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.
Having described the image-text de-duplication method according to an exemplary embodiment of the present application, the image-text de-duplication apparatus and electronic device according to further exemplary embodiments of the present application are described next.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
Based on the same inventive concept, the embodiment of the application also provides an image-text de-duplication apparatus. Fig. 13 is a schematic diagram of an image-text de-duplication apparatus 1300 according to an embodiment of the present disclosure, the apparatus including:
a feature extraction module 1301, configured to extract an image-text feature set of a target image-text in response to a de-duplication request for the target image-text, wherein the image-text feature set comprises: text features and image features of the target image-text;
a multi-way recall module 1302, configured to perform multi-way recall on the target image-text based on the image-text feature set to obtain a recall image-text set corresponding to each way, wherein the multi-way recall at least comprises: a multi-modal recall based on the text features and the image features;
an initial de-duplication module 1303, configured to determine the initial repeated image-text set corresponding to each recall image-text set respectively, based on each recall image-text in each recall image-text set and the keyword sets respectively corresponding to each recall image-text and to the target image-text;
and a fine de-duplication module 1304, configured to determine the target repeated image-text set based on the edit distance between each initial repeated image-text in each initial repeated image-text set and the target image-text.
Optionally, the multi-way recall module 1302 is specifically configured to:
fuse the text features and the image features in the image-text feature set to obtain multi-modal features;
obtain, based on the multi-modal features, the first similarity with the multi-modal features of each comparison image-text in the preset comparison image-text data set, respectively;
and perform multi-modal recall on the target image-text based on each first similarity, to obtain the recall image-text set corresponding to the multi-modal way, as in the sketch below.
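The following sketch assumes concatenation as the fusion step, cosine similarity as the first similarity, and a hypothetical top_k cut-off; the embodiment fixes none of these choices.

```python
# Multi-modal recall sketch; fusion and similarity choices are assumptions.
import numpy as np

def fuse(text_vec: np.ndarray, image_vec: np.ndarray) -> np.ndarray:
    """Fuse text and image features into one multi-modal feature."""
    v = np.concatenate([text_vec, image_vec])
    return v / (np.linalg.norm(v) + 1e-12)    # L2-normalise so dot product = cosine

def multimodal_recall(target_mm: np.ndarray,
                      corpus_mm: np.ndarray,  # one fused row per comparison image-text
                      top_k: int = 50) -> np.ndarray:
    """Return the indices of the top_k most similar comparison image-texts."""
    sims = corpus_mm @ target_mm              # first similarity per comparison image-text
    return np.argsort(-sims)[:top_k]
```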
Optionally, the multi-way recall module 1302 is further specifically configured to:
perform text recall on the target image-text based on the second similarities between the text features in the image-text feature set and the text features of the comparison image-texts in the preset comparison image-text data set, to obtain the recall image-text set corresponding to the text way; and
perform image recall on the target image-text based on the third similarities between the image features in the image-text feature set and the image features of the comparison image-texts in the preset comparison image-text data set, to obtain the recall image-text set corresponding to the image way.
Optionally, the initial de-duplication module 1303 is specifically configured to perform the following operations for each recall image-text set:
extract the first keyword set corresponding to each recall image-text in the recall image-text set, and extract the second keyword set corresponding to the target image-text;
determine the intersection-over-union ratio between each first keyword set and the second keyword set, respectively;
sort the recall image-texts according to the intersection-over-union ratios;
and screen at least one recall image-text from the recall image-texts based on the sorting result, to obtain the initial repeated image-text set corresponding to the recall image-text set.
Optionally, the initial de-duplication module 1303 is specifically configured to perform the following operations for each first keyword set:
determine the keyword intersection and the keyword union of the first keyword set and the second keyword set;
and take the ratio of the number of keywords contained in the keyword intersection to the number of keywords contained in the keyword union as the intersection-over-union ratio of the two keyword sets, as in the sketch below.
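The following sketch implements the intersection-over-union ratio defined above; keyword extraction is mocked and the top_k cut-off is an assumption.

```python
# Keyword intersection-over-union coarse filter; top_k is an assumption.

def keyword_iou(first: set[str], second: set[str]) -> float:
    """Ratio of shared keywords to all keywords across both sets."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0

def coarse_filter(target_keywords: set[str],
                  recalled: dict[str, set[str]],   # doc id -> first keyword set
                  top_k: int = 100) -> list[str]:
    """Rank the recalled image-texts by IoU and keep the top_k."""
    ranked = sorted(recalled,
                    key=lambda doc_id: keyword_iou(recalled[doc_id], target_keywords),
                    reverse=True)
    return ranked[:top_k]
```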
Optionally, the fine de-duplication module 1304 is specifically configured to perform the following operations for each initial repeated image-text:
determine the text similarity between the initial repeated image-text and the target image-text according to the edit distance between them;
and if the text similarity meets a threshold requirement, take the initial repeated image-text as a target repeated image-text.
Optionally, the text similarity is negatively correlated with the edit distance, and positively correlated with the maximum text length of the initial repeated image-text and the target image-text.
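A minimal sketch consistent with these correlations follows; the normalization 1 - distance / max_length is one common formulation, not necessarily the exact expression used by the embodiment.

```python
# Edit-distance calibration sketch; the similarity formula is an assumption.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def text_similarity(a: str, b: str) -> float:
    """Falls as edit distance grows; rises with the maximum text length."""
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / max_len

print(text_similarity("stock rose 3% on Monday",
                      "stock rose 5% on Friday"))       # ~0.83
```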
Optionally, the fine de-duplication module 1304 is further configured to perform the following operations for each target repeated image-text in the target repeated image-text set:
extract a first entity noun set from a preset field of the target repeated image-text, and extract a second entity noun set from the preset field of the target image-text;
format the first entity noun set and the second entity noun set respectively according to a preset rule;
determine the fourth similarity between the target repeated image-text and the target image-text based on the formatted first entity noun set and the formatted second entity noun set;
and if the fourth similarity meets the preset condition, retain the target repeated image-text.
Optionally, the feature extraction module 1301 is specifically configured to:
determine the weight of each keyword contained in the target image-text based on a preset keyword lexicon;
and weight the word vectors of the keywords according to the determined weights, to obtain the text features corresponding to the target image-text, as in the sketch below.
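The following sketch uses a toy weight table and random word vectors as placeholders for the preset keyword lexicon and a trained embedding table.

```python
# Weighted-word-vector text features; lexicon and embeddings are mocked.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
word_vectors = {w: rng.standard_normal(DIM)            # stand-in embedding table
                for w in ["stock", "rose", "market", "monday"]}
keyword_weights = {"stock": 2.0, "market": 1.5, "rose": 1.0, "monday": 0.5}

def text_feature(keywords: list[str]) -> np.ndarray:
    """Weight each keyword's word vector, then average."""
    vecs, weights = [], []
    for w in keywords:
        if w in word_vectors:
            vecs.append(word_vectors[w])
            weights.append(keyword_weights.get(w, 1.0))
    if not vecs:
        return np.zeros(DIM)
    return np.average(vecs, axis=0, weights=weights)

print(text_feature(["stock", "rose", "monday"]).shape)  # (8,)
```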
Optionally, the feature extraction module 1301 is further specifically configured to:
extract features from each picture contained in the target image-text to obtain an initial feature vector set;
and aggregate the initial feature vectors in the initial feature vector set into a feature vector of a preset fixed length and reduce its dimension, to obtain the image features corresponding to the target image-text, as in the sketch below.
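The following sketch uses mean pooling and a random linear projection as stand-ins for the fixed-length aggregation and the dimensionality reduction, whose concrete forms the embodiment does not specify; the per-picture vectors are mocked.

```python
# Picture-feature aggregation and reduction; pooling/projection are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def image_feature(per_picture: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """per_picture: (n_pictures, dim) initial feature vectors, one per picture."""
    pooled = per_picture.mean(axis=0)        # fixed length however many pictures
    return projection @ pooled               # linear dimensionality reduction

dim, out_dim = 512, 64
projection = rng.standard_normal((out_dim, dim)) / np.sqrt(dim)  # random projection
pictures = rng.standard_normal((3, dim))     # mock per-picture embeddings
print(image_feature(pictures, projection).shape)                 # (64,)
```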
Based on the same inventive concept as the method embodiments, the embodiment of the application further provides an electronic device. In one embodiment, the electronic device may be a server, such as the server 120 shown in fig. 1. In this embodiment, the structure of the electronic device may be as shown in fig. 14, including a memory 1401, a communication module 1403 and one or more processors 1402.
A memory 1401 for storing computer programs executed by the processor 1402. The memory 1401 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 1401 may be a volatile memory, such as a random-access memory (RAM); the memory 1401 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1401 may be any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, without limitation. The memory 1401 may also be a combination of the above memories.
The processor 1402 may include one or more Central Processing Units (CPUs), or be a digital processing unit, etc. A processor 1402 for implementing the above-described teletext deduplication method when invoking the computer program stored in the memory 1401.
The communication module 1403 is used for communicating with the terminal device and other servers.
The embodiment of the present application does not limit the specific connection medium among the memory 1401, the communication module 1403 and the processor 1402. In the embodiment of the present application, the memory 1401 and the processor 1402 are connected through the bus 1404 in fig. 14, and the bus 1404 is depicted by a thick line in fig. 14; the connection manner between the other components is merely illustrative and not limiting. The bus 1404 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in fig. 14, but this does not mean that there is only one bus or only one type of bus.
The memory 1401, as a computer storage medium, stores computer-executable instructions for implementing the image-text de-duplication method of the embodiment of the present application. The processor 1402 is configured to perform the image-text de-duplication method described above, as shown in fig. 4.
In another embodiment, the electronic device may also be other electronic devices, such as the terminal device 110 shown in fig. 1. In this embodiment, the structure of the electronic device may be as shown in fig. 15, including: communications component 1510, memory 1520, display unit 1530, camera 1540, sensors 1550, audio circuitry 1560, bluetooth module 1570, processor 1580, and the like.
The communication component 1510 is used to communicate with a server. In some embodiments, a wireless fidelity (WiFi) module may be included; WiFi is a short-range wireless transmission technology, through which the electronic device can help the user send and receive information.
The memory 1520 may be used to store software programs and data. The processor 1580 performs the various functions of the terminal device 110 and processes data by running the software programs or data stored in the memory 1520. The memory 1520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In addition to an operating system that enables the terminal device 110 to operate and various application programs, the memory 1520 may also store code for performing the image-text de-duplication method of the embodiment of the present application.
The display unit 1530 is used to display information input by the user or provided to the user, as well as the graphical user interface (GUI) of the various menus of the terminal device 110. Specifically, the display unit 1530 may include a display screen 1532 disposed on the front surface of the terminal device 110. The display screen 1532 may be configured in the form of a liquid crystal display, a light emitting diode, or the like. The display unit 1530 may be used to display the image-texts to be de-duplicated and the like in the embodiment of the present application.
The display unit 1530 may also be used to receive input numeric or character information and generate signal inputs related to user settings and function control of the terminal device 110. Specifically, the display unit 1530 may include a touch screen 1531 disposed on the front surface of the terminal device 110, which may collect touch operations of the user thereon or nearby, such as clicking a button or dragging a scroll box.
The touch screen 1531 may cover the display screen 1532, or the touch screen 1531 and the display screen 1532 may be integrated to implement the input and output functions of the terminal device 110, and after the integration, the touch screen may be referred to as a touch display screen for short. The display unit 1530 in this application may display the application programs and the corresponding operation steps.
Camera 1540 may be used to capture still images, and the user may post comments on the images captured by camera 1540 through the application. The number of cameras 1540 may be one or more. The object generates an optical image through the lens and projects it onto the photosensitive element, which may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the processor 1580 for conversion into a digital image signal.
The terminal device may further comprise at least one sensor 1550, such as an acceleration sensor 1551, a distance sensor 1552, a fingerprint sensor 1553, a temperature sensor 1554. The terminal device may also be configured with other sensors such as a gyroscope, barometer, hygrometer, thermometer, infrared sensor, light sensor, motion sensor, and the like.
Audio circuitry 1560, speaker 1561 and microphone 1562 may provide an audio interface between the user and terminal device 110. The audio circuit 1560 may transmit the electrical signal converted from received audio data to the speaker 1561, which converts it into a sound signal for output. Terminal device 110 may also be configured with a volume button for adjusting the volume of the sound signal. In the other direction, the microphone 1562 converts collected sound signals into electrical signals, which are received by the audio circuit 1560 and converted into audio data; the audio data is then output to the communication component 1510 for transmission to, for example, another terminal device 110, or output to the memory 1520 for further processing.
The bluetooth module 1570 is configured to perform information interaction with other bluetooth devices having a bluetooth module through a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that is also equipped with a bluetooth module via the bluetooth module 1570, so as to perform data interaction.
The processor 1580 is the control center of the terminal device: it connects the various parts of the entire terminal device using various interfaces and lines, and performs the various functions of the terminal device and processes data by running or executing software programs stored in the memory 1520 and calling data stored in the memory 1520. In some embodiments, the processor 1580 may include one or more processing units; the processor 1580 may also integrate an application processor, which mainly handles the operating system, user interface, application programs and the like, and a baseband processor, which mainly handles wireless communication. It is to be appreciated that the baseband processor may also not be integrated into the processor 1580. In the present application, the processor 1580 may run the operating system, application programs, user interface display and touch response, as well as the image-text de-duplication method of the embodiment of the present application. Further, the processor 1580 is coupled with the display unit 1530.
In some possible embodiments, the aspects of the image-text de-duplication method provided herein may also be implemented in the form of a program product comprising a computer program; when the program product is run on an electronic device, the computer program causes the electronic device to perform the steps of the image-text de-duplication method according to the various exemplary embodiments of the present application described above in this specification; for example, the electronic device may perform the steps shown in fig. 4.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include a computer program, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with a readable computer program embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
The computer program embodied on the readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer program may execute entirely on the user computing device, partly on the user computing device as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (11)

1. An image-text de-duplication method, characterized in that the method comprises:
extracting an image-text feature set of a target image-text in response to a de-duplication request for the target image-text, wherein the image-text feature set comprises: text features and image features of the target image-text;
performing multi-way recall on the target image-text based on the image-text feature set, to obtain a recall image-text set corresponding to each way, wherein the multi-way recall at least comprises: a multi-modal recall based on the text features and the image features;
determining an initial repeated image-text set corresponding to each recall image-text set respectively, based on each recall image-text in each recall image-text set and the keyword sets respectively corresponding to each recall image-text and to the target image-text;
and determining a target repeated image-text set based on the edit distance between each initial repeated image-text in each initial repeated image-text set and the target image-text.
2. The method of claim 1, wherein performing multi-way recall on the target image-text based on the image-text feature set to obtain the recall image-text set corresponding to each way at least comprises:
fusing the text features and the image features in the image-text feature set to obtain multi-modal features;
obtaining, based on the multi-modal features, a first similarity with the multi-modal features of each comparison image-text in a preset comparison image-text data set, respectively;
and performing multi-modal recall on the target image-text based on each first similarity, to obtain the recall image-text set corresponding to the multi-modal way.
3. The method of claim 2, wherein performing multi-way recall on the target image-text based on the image-text feature set to obtain the recall image-text set corresponding to each way further comprises at least one of:
performing text recall on the target image-text based on second similarities between the text features in the image-text feature set and the text features of the comparison image-texts in the preset comparison image-text data set, to obtain the recall image-text set corresponding to the text way; and
performing image recall on the target image-text based on third similarities between the image features in the image-text feature set and the image features of the comparison image-texts in the preset comparison image-text data set, to obtain the recall image-text set corresponding to the image way.
4. The method of claim 1, wherein determining the initial repeated image-text set corresponding to each recall image-text set respectively, based on each recall image-text in each recall image-text set and the keyword sets respectively corresponding to each recall image-text and to the target image-text, comprises:
performing the following operations for each recall image-text set:
extracting a first keyword set corresponding to each recall image-text in the recall image-text set, and extracting a second keyword set corresponding to the target image-text;
determining the intersection-over-union ratio between each first keyword set and the second keyword set, respectively;
sorting the recall image-texts according to the intersection-over-union ratios;
and screening at least one recall image-text from the recall image-texts based on the sorting result, to obtain the initial repeated image-text set corresponding to the recall image-text set.
5. The method of claim 4, wherein determining the intersection-over-union ratio between each first keyword set and the second keyword set respectively comprises:
performing the following operations for each first keyword set:
determining the keyword intersection and the keyword union of the first keyword set and the second keyword set;
and taking the ratio of the number of keywords contained in the keyword intersection to the number of keywords contained in the keyword union as the intersection-over-union ratio of the two keyword sets.
6. The method of claim 1, wherein determining the target repeated image-text set based on the edit distance between each initial repeated image-text in each initial repeated image-text set and the target image-text comprises:
performing the following operations for each initial repeated image-text:
determining the text similarity between the initial repeated image-text and the target image-text according to the edit distance between the initial repeated image-text and the target image-text;
and if the text similarity meets a threshold requirement, taking the initial repeated image-text as a target repeated image-text.
7. The method of claim 6, wherein the text similarity is negatively correlated with the edit distance, and positively correlated with the maximum text length of the initial repeated image-text and the target image-text.
8. The method of claim 1, wherein after determining the target repeated image-text set, the method further comprises:
performing the following operations for each target repeated image-text in the target repeated image-text set:
extracting a first entity noun set from a preset field of the target repeated image-text, and extracting a second entity noun set from the preset field of the target image-text;
formatting the first entity noun set and the second entity noun set respectively according to a preset rule;
determining a fourth similarity between the target repeated image-text and the target image-text based on the formatted first entity noun set and the formatted second entity noun set;
and if the fourth similarity meets a preset condition, retaining the target repeated image-text.
9. The method of any one of claims 1-8, wherein the text features are extracted by:
determining the weight of each keyword contained in the target image-text based on a preset keyword lexicon;
and weighting the word vectors of the keywords according to the determined weights, to obtain the text features corresponding to the target image-text.
10. The method of any one of claims 1-8, wherein the image features are extracted by:
extracting features from each picture contained in the target image-text to obtain an initial feature vector set;
and aggregating the initial feature vectors in the initial feature vector set into a feature vector of a preset fixed length and performing dimension reduction, to obtain the image features corresponding to the target image-text.
11. An image-text de-duplication apparatus, characterized in that the apparatus comprises:
a feature extraction module, configured to extract an image-text feature set of a target image-text in response to a de-duplication request for the target image-text, wherein the image-text feature set comprises: text features and image features of the target image-text;
a multi-way recall module, configured to perform multi-way recall on the target image-text based on the image-text feature set, to obtain a recall image-text set corresponding to each way, wherein the multi-way recall at least comprises: a multi-modal recall based on the text features and the image features;
an initial de-duplication module, configured to determine the initial repeated image-text set corresponding to each recall image-text set respectively, based on each recall image-text in each recall image-text set and the keyword sets respectively corresponding to each recall image-text and to the target image-text;
and a fine de-duplication module, configured to determine the target repeated image-text set based on the edit distance between each initial repeated image-text in each initial repeated image-text set and the target image-text.
CN202111466812.1A 2021-12-03 Image-text duplication removing method and device Active CN114328884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111466812.1A CN114328884B (en) 2021-12-03 Image-text duplication removing method and device

Publications (2)

Publication Number Publication Date
CN114328884A true CN114328884A (en) 2022-04-12
CN114328884B CN114328884B (en) 2024-07-09

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005149323A (en) * 2003-11-18 2005-06-09 Canon Inc Image processing system, image processing apparatus, and image processing method
JP2014211730A (en) * 2013-04-18 2014-11-13 株式会社日立製作所 Image searching system, image searching device, and image searching method
CN110929002A (en) * 2018-09-03 2020-03-27 广州神马移动信息科技有限公司 Similar article duplicate removal method, device, terminal and computer readable storage medium
EP3772014A1 (en) * 2019-07-29 2021-02-03 TripEye Limited Identity document validation method, system and computer program
CN110956038A (en) * 2019-10-16 2020-04-03 厦门美柚股份有限公司 Repeated image-text content judgment method and device
CN110956037A (en) * 2019-10-16 2020-04-03 厦门美柚股份有限公司 Multimedia content repeated judgment method and device
WO2021072885A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Method and apparatus for recognizing text, device and storage medium
CN111680173A (en) * 2020-05-31 2020-09-18 西南电子技术研究所(中国电子科技集团公司第十研究所) CMR model for uniformly retrieving cross-media information
CN113469152A (en) * 2021-09-03 2021-10-01 腾讯科技(深圳)有限公司 Similar video detection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DING Zeya; ZHANG Quan: "A Web Page Deduplication Strategy Based on Edit Distance", Network New Media Technology, No. 06, 15 November 2013 (2013-11-15) *
JIANG Zongli; YUAN Yuan: "Video Deduplication by Comparing Embedded Subtitles", Computing Technology and Automation, No. 01, 15 March 2015 (2015-03-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738368A (en) * 2023-06-25 2023-09-12 上海任意门科技有限公司 Method and system for extracting single-mode characteristics and method for extracting post characteristics

Similar Documents

Publication Publication Date Title
GB2547068B (en) Semantic natural language vector space
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
AU2016256764A1 (en) Semantic natural language vector space for image captioning
CN114612759B (en) Video processing method, video query method, model training method and model training device
CN112163428A (en) Semantic tag acquisition method and device, node equipment and storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN111159409B (en) Text classification method, device, equipment and medium based on artificial intelligence
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN111898675B (en) Credit wind control model generation method and device, scoring card generation method, machine readable medium and equipment
CN111639228B (en) Video retrieval method, device, equipment and storage medium
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN114661861B (en) Text matching method and device, storage medium and terminal
CN114282511A (en) Text duplicate removal method and device, electronic equipment and storage medium
CN113806588A (en) Method and device for searching video
CN111625715A (en) Information extraction method and device, electronic equipment and storage medium
JP6172332B2 (en) Information processing method and information processing apparatus
CN113919361A (en) Text classification method and device
CN116186197A (en) Topic recommendation method, device, electronic equipment and storage medium
CN111988668B (en) Video recommendation method and device, computer equipment and storage medium
CN113297525A (en) Webpage classification method and device, electronic equipment and storage medium
CN114328884B (en) Image-text duplication removing method and device
CN114168715A (en) Method, device and equipment for generating target data set and storage medium
CN114328884A (en) Image-text duplication removing method and device
CN114579876A (en) False information detection method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant