CN112926700B - Class identification method and device for target image - Google Patents

Class identification method and device for target image

Info

Publication number
CN112926700B
Authority
CN
China
Prior art keywords
text
image
encoder
feature vector
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110460794.XA
Other languages
Chinese (zh)
Other versions
CN112926700A (en)
Inventor
暨凯祥
刘家佳
曾小英
胡圻圻
冯力国
王剑
陈景东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110460794.XA
Publication of CN112926700A
Application granted
Publication of CN112926700B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the specification provides a category identification method and device for a target image, wherein the target image comprises a text, and the method comprises the following steps: recognizing text content in the target image to obtain a first text recognition result; inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder; inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder; inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder; and determining the target category of the target image according to the first global feature vector. The accuracy of class identification for the target image can be improved.

Description

Class identification method and device for target image
Technical Field
One or more embodiments of the present specification relate to the field of computers, and in particular, to a method and apparatus for class identification for a target image.
Background
Category identification for a target image is often required, for example, identifying whether the animal in an image is a dog or a cat, or whether the merchant in an image is a restaurant or a retail store. Performing such identification manually is slow and costly, so an intelligent category identification scheme for the target image is needed.
In the prior art, a typical class identification process for a target image identifies its class automatically from the image alone, in an image classification manner. In many cases, however, the recognition result is not accurate enough because the available information is not fully utilized.
Accordingly, improved solutions are desired that can improve the accuracy of class identification for target images.
Disclosure of Invention
One or more embodiments of the present specification describe a method and an apparatus for class identification for a target image, which can improve the accuracy of class identification for the target image.
In a first aspect, a method for identifying a category of a target image is provided, where the target image includes text, and the method includes:
recognizing text content in the target image to obtain a first text recognition result;
inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder;
inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder;
and determining the target category of the target image according to the first global feature vector.
In one possible embodiment, the recognizing the text content in the target image includes:
performing text detection on the target image to obtain a first text region in the target image;
and performing character recognition on the first text area to obtain the first text recognition result.
Further, the target image is a merchant storefront photo, the storefront has a signboard, and the signboard bears text; the target category is a merchant category;
the text detection of the target image comprises:
inputting the merchant storefront photo into a signboard detector, and outputting a signboard region through the signboard detector;
inputting the signboard region into a text detector, and outputting the first text region through the text detector.
In one possible embodiment, the image encoder, the text encoder or the multimodal fusion encoder is trained on sample images obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
In one possible embodiment, the multimode fusion encoder is trained by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
determining a multi-modal classification loss according to the first prediction category and the category label;
and training the multimode fusion encoder according to the multimode classification loss.
Further, the image encoder is trained by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
determining image classification loss according to the second prediction category and the category label;
and training the image encoder according to the image classification loss.
Further, the text encoder is trained by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
determining text classification loss according to the third prediction category and the category label;
and training the text encoder according to the text classification loss.
Further, the image encoder is trained by:
determining an image distillation loss in distillation learning according to a first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity;
training the image encoder with a goal of reducing the image distillation loss.
Further, the text encoder is trained by:
determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity;
training the text encoder with a goal of reducing the text distillation loss.
In one possible embodiment, the target category includes: a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
In a second aspect, an apparatus for identifying a category of a target image is provided, where the target image includes text, the apparatus including:
the identification unit is used for identifying the text content in the target image to obtain a first text identification result;
the image coding unit is used for inputting the target image into an image coder and outputting a first image semantic feature vector corresponding to the target image through the image coder;
the text coding unit is used for inputting the first text recognition result obtained by the recognition unit into a text encoder and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
the fusion unit is used for inputting the first image semantic feature vector obtained by the image coding unit and the first text semantic feature vector obtained by the text coding unit into a multimode fusion encoder and outputting a first global feature vector through the multimode fusion encoder;
and the determining unit is used for determining the target category of the target image according to the first global feature vector obtained by the fusing unit.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and the device provided by the embodiment of the specification, the target image comprises the text, firstly, the text content in the target image is identified, and a first text identification result is obtained; then inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder; then inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder; inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder; and finally, determining the target category of the target image according to the first global feature vector. As can be seen from the above, in the embodiments of the present specification, information of an image modality and information of a text modality are used simultaneously, and the two modalities are complementary, so that the problem of incomplete information of a single modality is effectively avoided, the information is fully utilized, and the accuracy of class identification for a target image can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a flow diagram of a method for class identification for a target image, according to one embodiment;
FIG. 3 illustrates an identification flow diagram for determining merchant categories based on merchant storefront photos, according to one embodiment;
FIG. 4 illustrates a training flow diagram for determining merchant categories based on merchant storefront photos, according to one embodiment;
fig. 5 shows a schematic block diagram of a class identification apparatus for a target image according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves class recognition for a target image that includes text. Referring to fig. 1, the target image 11 is a merchant storefront photo; the storefront has a signboard 12, and the signboard 12 bears text. The text may be, for example, "XX Noodle Restaurant" as shown in fig. 1, or "XX Department Store", "XX Supermarket", and so on. The corresponding merchant category is identified from the target image 11; specifically, the merchant may be identified as a catering category or a department-store category at the first level, or as Chinese food or Western food at the second level.
Modality: a modality is a particular form in which information is represented, for example pictures, text, or sound.
In the embodiment of the specification, information of an image modality and information of a text modality are simultaneously used, and the two modalities are complementary, so that the problem of incomplete information of a single modality is effectively solved, the information is fully utilized, and the accuracy of class identification of a target image can be improved.
In addition, it should be noted that, in the embodiments of the present specification, the target image is not limited to storefront photos; the method applies to any scenario that involves category identification for a target image containing text. For example, in another possible implementation scenario, the target image may be a photo of a car that includes characters such as a license plate number or a model designation, and the corresponding model or model year is identified from the target image.
Fig. 2 shows a flowchart of a class identification method for a target image including text, according to an embodiment, which may be based on the implementation scenario shown in fig. 1. As shown in fig. 2, the class identification method for the target image in this embodiment includes the following steps: step 21, recognizing text content in the target image to obtain a first text recognition result; step 22, inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder; step 23, inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder; step 24, inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder; and step 25, determining the target category of the target image according to the first global feature vector. Specific execution modes of these steps are described below.
First, in step 21, the text content in the target image is recognized, and a first text recognition result is obtained. It is understood that the text content may include various types of words or characters, such as chinese characters, pinyin, english letters, arabic numerals, and the like.
In one example, the identifying text content in the target image includes:
performing text detection on the target image to obtain a first text region in the target image;
and performing character recognition on the first text area to obtain the first text recognition result.
It can be understood that the text in the target image is usually concentrated in a certain region, rather than scattered everywhere in the target image, and therefore, the efficiency of character recognition can be improved by determining the first text region in the target image and then performing character recognition on the first text region. The character recognition may be performed by an Optical Character Recognition (OCR) algorithm.
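The following is a minimal sketch of this two-stage flow in Python; `detector` and `ocr` are hypothetical stand-ins for any text-detection model and OCR algorithm, since the embodiment does not prescribe concrete implementations:

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class TextRegion:
    x: int  # left
    y: int  # top
    w: int  # width
    h: int  # height

def crop(image: np.ndarray, r: TextRegion) -> np.ndarray:
    # The image is assumed to be an H x W x C array.
    return image[r.y:r.y + r.h, r.x:r.x + r.w]

def recognize_text(image: np.ndarray, detector, ocr) -> str:
    """Two-stage recognition: locate text regions first, then read characters.
    `detector.detect` and `ocr.recognize` are hypothetical interfaces."""
    regions: List[TextRegion] = detector.detect(image)        # text detection
    texts = [ocr.recognize(crop(image, r)) for r in regions]  # character recognition
    return " ".join(texts)
```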
Further, the target image is a merchant storefront photo, the storefront has a signboard, and the signboard bears text; the target category is a merchant category;
the text detection of the target image comprises:
inputting the merchant storefront photo into a signboard detector, and outputting a signboard region through the signboard detector;
inputting the signboard region into a text detector, and outputting the first text region through the text detector.
In the embodiments of the present specification, the signboard detector is trained on storefront photos annotated with signboard regions, and is used to locate the signboard within the storefront photo.
Then, in step 22, the target image is input into an image encoder, and a first image semantic feature vector corresponding to the target image is output through the image encoder. The category of the target image could be identified from the first image semantic feature vector alone; however, since the image is only one modality, the information would not be fully utilized and the recognition result would not be accurate enough.
The image encoder may adopt a general image classification network for extracting high-order semantic features of the image, such as a convolution-based network (ResNet, VGG, DenseNet) or a Transformer-based network (ViT, T2T-ViT).
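As an illustrative sketch only, a ResNet-50 backbone from torchvision could serve as such an image encoder by replacing its classification layer, so that the forward pass yields a semantic feature vector; the model choice, the legacy `pretrained` flag, and the input size are assumptions:

```python
import torch
import torchvision.models as models

# Replace the classification head with identity so the network outputs
# a semantic feature vector instead of class logits (2048-dim for ResNet-50).
image_encoder = models.resnet50(pretrained=True)  # legacy pretrained-weights flag
image_encoder.fc = torch.nn.Identity()
image_encoder.eval()

dummy_image = torch.randn(1, 3, 224, 224)       # a preprocessed target image
with torch.no_grad():
    image_feat = image_encoder(dummy_image)     # image semantic feature vector, shape (1, 2048)
```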
Then, in step 23, the first text recognition result is input into a text encoder, and a first text semantic feature vector corresponding to the first text recognition result is output by the text encoder. Likewise, the category of the target image could be identified from the first text semantic feature vector alone; however, since the text is only one modality, the information would not be fully utilized and the recognition result would not be accurate enough.
The text encoder may adopt a Transformer-based network such as BERT or RoBERTa, or an RNN-based model such as LSTM or Bi-LSTM, for extracting high-order semantic features of the text.
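As an illustrative sketch, a BERT encoder from the HuggingFace transformers library could extract the text semantic feature vector; the checkpoint name, the sample string, and the use of the [CLS] hidden state are assumptions:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
text_encoder = BertModel.from_pretrained("bert-base-chinese")

# In the method, the input would be the first text recognition result from step 21.
inputs = tokenizer("XX Noodle Restaurant", return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**inputs)
text_feat = outputs.last_hidden_state[:, 0]   # [CLS] vector as the text feature, shape (1, 768)
```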
Next, in step 24, the first image semantic feature vector and the first text semantic feature vector are input into a multimode fusion encoder, and a first global feature vector is output through the multimode fusion encoder. The first global feature vector is used to identify the category of the target image, yielding a recognition result with high accuracy.
The multimode fusion encoder takes the high-order image semantic features and the high-order text semantic features as input, fuses them interactively inside the encoder, fully mines the information of the two modalities, and outputs a representative global feature. Interacting at the level of high-order semantic features effectively avoids interference from low-order features and improves computational efficiency. Specifically, the multimode fusion encoder may fuse the two modalities by concatenation (splicing), or, to let the two modalities interact more richly, may use a Cross-Transformer structured network.
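The following sketch, continuing the examples above, illustrates one possible fusion encoder of the splicing kind, using standard Transformer self-attention layers; the dimensions, the two-token formulation, and the mean pooling are assumptions, not the patent's prescribed design:

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Projects both semantic vectors into a shared space and lets
    self-attention layers mix them; a stand-in for the splicing or
    Cross-Transformer variants mentioned above."""
    def __init__(self, img_dim=2048, txt_dim=768, dim=512, num_layers=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image_feat, text_feat):
        # Each modality contributes one token of a two-token sequence.
        tokens = torch.stack([self.img_proj(image_feat), self.txt_proj(text_feat)], dim=1)
        fused = self.encoder(tokens)   # (batch, 2, dim)
        return fused.mean(dim=1)       # global feature vector, (batch, dim)

fusion_encoder = FusionEncoder()
global_feat = fusion_encoder(image_feat, text_feat)   # using the vectors from the sketches above
```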
Finally, in step 25, a target class of the target image is determined based on the first global feature vector. It is understood that a plurality of selectable categories are preset, at least one selectable category is selected from the plurality of selectable categories according to the first global feature vector, and the selectable category is determined as the target category of the target image.
In one example, the target category includes: a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
In one example, the determining the target category of the target image according to the first global feature vector includes:
inputting the first global feature vector into a first fully-connected layer, and determining the first-level target category of the target image;
and inputting the first global feature vector into a second fully-connected layer, and determining the second-level target category of the target image, as sketched below.
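A minimal sketch of the two non-weight-sharing heads, continuing the sketches above; the numbers of categories per level are assumed for illustration:

```python
import torch.nn as nn

num_l1, num_l2 = 10, 100   # assumed counts of first- and second-level categories

# Two classification heads that do not share weights, one per hierarchy level.
head_l1 = nn.Linear(512, num_l1)
head_l2 = nn.Linear(512, num_l2)

logits_l1 = head_l1(global_feat)   # scores for the first-level target category
logits_l2 = head_l2(global_feat)   # scores for the second-level target category
```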
FIG. 3 illustrates an identification flow diagram for determining merchant categories based on merchant storefront photos, according to one embodiment. Referring to fig. 3, a merchant storefront photo is input, either collected and uploaded by field staff or taken and uploaded by the merchant. A text region in the photo is located by text detection, the located text region is input into an OCR algorithm, and a text recognition result is output. The image encoder extracts the high-order image semantic features of the storefront; the text encoder extracts the high-order text semantic features of the text recognition result; and the multimode fusion encoder takes both as input and performs interactive fusion to obtain the corresponding global feature vector. The merchant category is subsequently determined from the global feature vector.
In the embodiments of the present specification, the aforementioned image encoder, text encoder, and multimode fusion encoder are obtained through pre-training. This falls under multi-modal machine learning (MMML), which aims to give machines the capability of processing and understanding information from multiple source modalities through machine learning methods.
In one example, the image encoder, the text encoder, or the multimodal fusion encoder is trained on sample images obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
It can be understood that, besides original photographs, perturbed images can also be used as sample images, which improves the robustness of the trained model; one possible realization is sketched below.
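A possible realization of this sample construction, assuming a PIL input image and illustrative perturbation parameters; shuffling the recognized tokens realizes the text sequence adjustment:

```python
import random
import torch
import torchvision.transforms as T

def make_sample(original_image, ocr_tokens):
    """Derives one training sample by applying the perturbations listed above.
    `original_image` is assumed to be a PIL image; `ocr_tokens` is the
    recognized text split into tokens. All parameter values are illustrative."""
    augment = T.Compose([
        T.Resize((256, 256)),           # image size conversion
        T.RandomRotation(degrees=15),   # image angle conversion
        T.ToTensor(),
    ])
    img = augment(original_image)
    img = img + 0.05 * torch.randn_like(img)   # white noise addition
    tokens = list(ocr_tokens)
    random.shuffle(tokens)                     # text sequence adjustment
    return img, " ".join(tokens)
```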
In one example, the multimode fusion encoder is trained by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
determining a multi-modal classification loss according to the first prediction category and the category label;
and training the multimode fusion encoder according to the multi-modal classification loss (one such training step is sketched below).
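A sketch of one training step along these lines, reusing the encoder sketches above; the linear classification head, the optimizer, and the precomputation of the text feature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fusion_training_step(sample_image, text_feat, label,
                         image_encoder, fusion_encoder, head, optimizer):
    """One training step for the fusion encoder. `text_feat` is the second
    text semantic feature vector, assumed precomputed by the text encoder;
    `head` is a linear layer over global features."""
    image_feat = image_encoder(sample_image)             # second image semantic feature vector
    global_feat = fusion_encoder(image_feat, text_feat)  # second global feature vector
    logits = head(global_feat)                           # first prediction category (as logits)
    loss = F.cross_entropy(logits, label)                # multi-modal classification loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```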
Optionally, the image encoder or the text encoder may be trained according to the multi-modal classification loss.
Further, the image encoder may also be trained by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
determining image classification loss according to the second prediction category and the category label;
and training the image encoder according to the image classification loss.
Further, the text encoder may also be trained by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
determining text classification loss according to the third prediction category and the category label;
and training the text encoder according to the text classification loss.
Further, the image encoder may also be trained by:
determining an image distillation loss in distillation learning according to a first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity;
training the image encoder with a goal of reducing the image distillation loss.
Further, the text encoder may also be trained by:
determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity;
training the text encoder with a goal of reducing the text distillation loss.
In the embodiments of the present specification, a general distillation-learning framework is designed. It adopts deep supervision, using the deeper network to guide the shallower networks, which improves the discriminability and robustness of the intermediate-layer features and thereby effectively improves classification performance. The classification losses may use cross entropy, and the distillation losses may use KL divergence.
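A minimal sketch of such a KL-divergence distillation term; converting feature vectors into distributions via softmax, sharing a common dimension, and detaching the teacher (global) feature are assumptions of this sketch:

```python
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat, temperature=1.0):
    """KL-divergence distillation: the global feature vector (teacher) guides a
    single-modality feature vector (student). The more similar the two
    features, the smaller the loss, matching the negative correlation above."""
    teacher_p = F.softmax(teacher_feat.detach() / temperature, dim=-1)
    student_logp = F.log_softmax(student_feat / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean")
```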
Embodiments of the present specification may combine the aforementioned losses to jointly train the image encoder, the text encoder, and the multimode fusion encoder.
FIG. 4 illustrates a training flow diagram for determining merchant categories based on merchant storefront photos, according to one embodiment. Referring to fig. 4, a merchant storefront photo serving as a sample image is input; a text region in the photo is located by text detection, and the located text region is input into an OCR algorithm, which outputs a text recognition result. The image encoder extracts the high-order image semantic features of the storefront, the text encoder extracts the high-order text semantic features of the text recognition result, and the multimode fusion encoder takes both as input and performs interactive fusion to obtain the corresponding global feature vector. Subsequently, the multi-modal classification loss is determined according to the output of the multimode fusion encoder and the class label; the image classification loss is determined according to the output of the image encoder and the class label; the text classification loss is determined according to the output of the text encoder and the class label; the image distillation loss is determined according to the output of the image encoder and the output of the multimode fusion encoder; and the text distillation loss is determined according to the output of the text encoder and the output of the multimode fusion encoder. A total loss is then determined according to the multi-modal classification loss, the image classification loss, the text classification loss, the image distillation loss and the text distillation loss, and the image encoder, the text encoder and the multimode fusion encoder are jointly trained with minimizing the total loss as the training target; a sketch of one such combination follows below.
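One way to combine the five losses; the weights are assumptions, since the patent only states that the total loss is determined according to the five terms:

```python
def total_loss(mm_cls, img_cls, txt_cls, img_distill, txt_distill,
               weights=(1.0, 0.5, 0.5, 0.1, 0.1)):
    """Weighted sum of the five losses described above; the weight values
    are illustrative assumptions, not specified by the patent."""
    w = weights
    return (w[0] * mm_cls + w[1] * img_cls + w[2] * txt_cls
            + w[3] * img_distill + w[4] * txt_distill)
```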
In addition, in order to further improve the performance of the model, a hierarchical classification component or a key region matching component can be added.
Hierarchical classification component: since the merchant category system has a two-level hierarchical structure (for example, "catering" at the first level and "Chinese food" at the second), the model exploits this characteristic: the features output by the multimode fusion encoder pass through two fully-connected layers that do not share weights, yielding two levels of classification results, supervised by the two levels of category labels.
Key region matching component: when the multimode fusion encoder adopts an attention mechanism, and since the image and the text may be correlated to some degree, the most important feature blocks of the text and of the image are selected from the features output by the multimode fusion encoder in descending order of attention score, and a KL-divergence constraint is applied to the two selected region features so that their feature representations agree; a sketch follows below.
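A sketch of this component under assumed tensor shapes; the top-k selection by attention score and the KL constraint follow the description above, while the shapes, the shared feature dimension, and the value of k are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def key_region_matching_loss(text_tokens, image_tokens,
                             text_attn, image_attn, k=4):
    """Selects the k feature blocks with the highest attention scores from each
    modality and pulls their distributions together with KL divergence.
    Assumed shapes: tokens (batch, seq, dim) with equal dim for both
    modalities; attention scores (batch, seq)."""
    d = text_tokens.size(-1)
    idx_t = text_attn.topk(k, dim=1).indices.unsqueeze(-1).expand(-1, -1, d)
    idx_i = image_attn.topk(k, dim=1).indices.unsqueeze(-1).expand(-1, -1, d)
    key_text = torch.gather(text_tokens, 1, idx_t)    # (batch, k, dim)
    key_image = torch.gather(image_tokens, 1, idx_i)  # (batch, k, dim)
    return F.kl_div(F.log_softmax(key_text, dim=-1),
                    F.softmax(key_image, dim=-1),
                    reduction="batchmean")
```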
According to the method provided by the embodiment of the specification, the target image comprises the text, firstly, the text content in the target image is identified, and a first text identification result is obtained; then inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder; then inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder; inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder; and finally, determining the target category of the target image according to the first global feature vector. As can be seen from the above, in the embodiments of the present specification, information of an image modality and information of a text modality are used simultaneously, and the two modalities are complementary, so that the problem of incomplete information of a single modality is effectively avoided, the information is fully utilized, and the accuracy of class identification for a target image can be improved.
According to another aspect of embodiments, there is also provided a class identification apparatus for a target image, the target image including text, the apparatus being configured to execute the class identification method for the target image provided in the embodiments of the present specification. Fig. 5 shows a schematic block diagram of a class identification apparatus for a target image according to one embodiment. As shown in fig. 5, the apparatus 500 includes:
the recognition unit 51 is used for recognizing the text content in the target image to obtain a first text recognition result;
an image encoding unit 52, configured to input the target image into an image encoder, and output a first image semantic feature vector corresponding to the target image through the image encoder;
a text encoding unit 53, configured to input the first text recognition result obtained by the recognition unit 51 into a text encoder, and output a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
a fusion unit 54, configured to input the first image semantic feature vector obtained by the image encoding unit 52 and the first text semantic feature vector obtained by the text encoding unit 53 into a multimode fusion encoder, and output a first global feature vector through the multimode fusion encoder;
a determining unit 55, configured to determine a target category of the target image according to the first global feature vector obtained by the fusing unit 54.
Optionally, as an embodiment, the identifying unit 51 includes:
the text detection subunit is used for performing text detection on the target image to obtain a first text region in the target image;
and the character recognition subunit is used for performing character recognition on the first text region obtained by the text detection subunit to obtain the first text recognition result.
Further, the target image is a merchant storefront photo, the storefront has a signboard, and the signboard bears text; the target category is a merchant category;
the text detection subunit includes:
the signboard detection module is used for inputting the merchant storefront photo into a signboard detector and outputting a signboard region through the signboard detector;
and the text detection module is used for inputting the signboard region obtained by the signboard detection module into a text detector and outputting the first text region through the text detector.
Optionally, as an embodiment, the image encoder, the text encoder, or the multi-mode fusion encoder is trained based on sample images, and the sample images are obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
Optionally, as an embodiment, the multimode fusion encoder is trained by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
determining a multi-modal classification loss according to the first prediction category and the category label;
and training the multimode fusion encoder according to the multimode classification loss.
Further, the image encoder is trained by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
determining image classification loss according to the second prediction category and the category label;
and training the image encoder according to the image classification loss.
Further, the text encoder is trained by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
determining text classification loss according to the third prediction category and the category label;
and training the text encoder according to the text classification loss.
Further, the image encoder is trained by:
determining an image distillation loss in distillation learning according to a first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity;
training the image encoder with a goal of reducing the image distillation loss.
Further, the text encoder is trained by:
determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity;
training the text encoder with a goal of reducing the text distillation loss.
Optionally, as an embodiment, the target category includes:
a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
With the apparatus provided by the embodiment of the present specification, the target image includes a text, and first, the identifying unit 51 identifies text content in the target image to obtain a first text identification result; then, the image encoding unit 52 inputs the target image into an image encoder, and outputs a first image semantic feature vector corresponding to the target image through the image encoder; then, the text encoding unit 53 inputs the first text recognition result into a text encoder, and outputs a first text semantic feature vector corresponding to the first text recognition result through the text encoder; the fusion unit 54 inputs the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputs a first global feature vector through the multimode fusion encoder; finally, the determining unit 55 determines the target category of the target image according to the first global feature vector. As can be seen from the above, in the embodiments of the present specification, information of an image modality and information of a text modality are used simultaneously, and the two modalities are complementary, so that the problem of incomplete information of a single modality is effectively avoided, the information is fully utilized, and the accuracy of class identification for a target image can be improved.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (22)

1. A method of class identification for a target image, the target image including text therein, the method comprising:
recognizing text content in the target image to obtain a first text recognition result;
inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder;
inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder;
determining a target category of the target image according to the first global feature vector;
wherein the image encoder, the text encoder, and the multimode fusion encoder are trained by:
determining a total loss according to a multi-modal classification loss, an image classification loss, a text classification loss, an image distillation loss and a text distillation loss; wherein the multi-modal classification loss is determined according to an output of the multimode fusion encoder and a category label, the image classification loss is determined according to an output of the image encoder and the category label, the text classification loss is determined according to an output of the text encoder and the category label, the image distillation loss is determined according to the output of the image encoder and the output of the multimode fusion encoder, and the text distillation loss is determined according to the output of the text encoder and the output of the multimode fusion encoder;
and jointly training the image encoder, the text encoder and the multimode fusion encoder by taking the minimized total loss as a training target.
2. The method of claim 1, wherein the identifying text content in the target image comprises:
performing text detection on the target image to obtain a first text region in the target image;
and performing character recognition on the first text area to obtain the first text recognition result.
3. The method of claim 2, wherein the target image is a merchant storefront photo, the storefront having a signboard, the signboard bearing text; and the target category is a merchant category;
the text detection of the target image comprises:
inputting the merchant storefront photo into a signboard detector, and outputting a signboard region through the signboard detector;
inputting the signboard region into a text detector, and outputting the first text region through the text detector.
4. The method of claim 1, wherein the image encoder, the text encoder, or the multimodal fusion encoder is trained based on sample images obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
5. The method of claim 1, wherein the multi-modal classification penalty is determined by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
and determining the multi-mode classification loss according to the first prediction class and the class label.
6. The method of claim 5, wherein the image classification penalty is determined by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
and determining the image classification loss according to the second prediction category and the category label.
7. The method of claim 5, wherein the text classification penalty is determined by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
and determining text classification loss according to the third prediction category and the category label.
8. The method of claim 5, wherein the image distillation loss is determined by:
and determining the image distillation loss in distillation learning according to the first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity.
9. The method of claim 5, wherein the text distillation loss is determined by:
and determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity.
10. The method of claim 1, wherein the target category includes: a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
11. A category identification apparatus for a target image including text therein, the apparatus comprising:
the identification unit is used for identifying the text content in the target image to obtain a first text identification result;
the image coding unit is used for inputting the target image into an image coder and outputting a first image semantic feature vector corresponding to the target image through the image coder;
the text coding unit is used for inputting the first text recognition result obtained by the recognition unit into a text encoder and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
the fusion unit is used for inputting the first image semantic feature vector obtained by the image coding unit and the first text semantic feature vector obtained by the text coding unit into a multimode fusion encoder and outputting a first global feature vector through the multimode fusion encoder;
the determining unit is used for determining the target category of the target image according to the first global feature vector obtained by the fusing unit;
wherein the image encoder, the text encoder, and the multimode fusion encoder are trained by:
determining a total loss according to a multi-modal classification loss, an image classification loss, a text classification loss, an image distillation loss and a text distillation loss; wherein the multi-modal classification loss is determined according to an output of the multimode fusion encoder and a category label, the image classification loss is determined according to an output of the image encoder and the category label, the text classification loss is determined according to an output of the text encoder and the category label, the image distillation loss is determined according to the output of the image encoder and the output of the multimode fusion encoder, and the text distillation loss is determined according to the output of the text encoder and the output of the multimode fusion encoder;
and jointly training the image encoder, the text encoder and the multimode fusion encoder by taking the minimized total loss as a training target.
12. The apparatus of claim 11, wherein the identifying unit comprises:
the text detection subunit is used for performing text detection on the target image to obtain a first text region in the target image;
and the character recognition subunit is used for performing character recognition on the first text region obtained by the text detection subunit to obtain the first text recognition result.
13. The apparatus of claim 12, wherein the target image is a merchant storefront photo, the storefront having a signboard, the signboard bearing text; and the target category is a merchant category;
the text detection subunit includes:
the signboard detection module is used for inputting the merchant storefront photo into a signboard detector and outputting a signboard region through the signboard detector;
and the text detection module is used for inputting the signboard region obtained by the signboard detection module into a text detector and outputting the first text region through the text detector.
14. The apparatus of claim 11, wherein the image encoder, the text encoder, or the multimodal fusion encoder is trained based on sample images obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
15. The apparatus of claim 11, wherein the multi-modal classification penalty is determined by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
and determining the multi-mode classification loss according to the first prediction class and the class label.
16. The apparatus of claim 15, wherein the image classification penalty is determined by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
and determining the image classification loss according to the second prediction category and the category label.
17. The apparatus of claim 15, wherein the text classification penalty is determined by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
and determining text classification loss according to the third prediction category and the category label.
18. The apparatus of claim 15, wherein the image distillation loss is determined by:
and determining the image distillation loss in distillation learning according to the first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity.
19. The apparatus of claim 15, wherein the text distillation loss is determined by:
and determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity.
20. The apparatus of claim 11, wherein the target category comprises: a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
21. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-10.
22. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-10.
CN202110460794.XA 2021-04-27 2021-04-27 Class identification method and device for target image Active CN112926700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110460794.XA CN112926700B (en) 2021-04-27 2021-04-27 Class identification method and device for target image

Publications (2)

Publication Number Publication Date
CN112926700A (en) 2021-06-08
CN112926700B (en) 2022-04-12

Family

ID=76174748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110460794.XA Active CN112926700B (en) 2021-04-27 2021-04-27 Class identification method and device for target image

Country Status (1)

Country Link
CN (1) CN112926700B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378792B (en) * 2021-07-09 2022-08-02 合肥工业大学 Weak supervision cervical cell image analysis method fusing global and local information
CN113591770B (en) * 2021-08-10 2023-07-18 中国科学院深圳先进技术研究院 Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170098573A (en) * 2016-02-22 2017-08-30 에스케이텔레콤 주식회사 Multi-modal learning device and multi-modal learning method
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN112613303A (en) * 2021-01-07 2021-04-06 福州大学 Knowledge distillation-based cross-modal image aesthetic quality evaluation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10460201B2 (en) * 2015-12-31 2019-10-29 Microsoft Technology Licensing, Llc Structure and training for image classification
CN110705460B (en) * 2019-09-29 2023-06-20 北京百度网讯科技有限公司 Image category identification method and device
CN112699265A (en) * 2019-10-22 2021-04-23 商汤国际私人有限公司 Image processing method and device, processor and storage medium
CN112541530B (en) * 2020-12-06 2023-06-20 支付宝(杭州)信息技术有限公司 Data preprocessing method and device for clustering model
CN112633380A (en) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 Interest point feature extraction method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170098573A (en) * 2016-02-22 2017-08-30 에스케이텔레콤 주식회사 Multi-modal learning device and multi-modal learning method
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN112613303A (en) * 2021-01-07 2021-04-06 福州大学 Knowledge distillation-based cross-modal image aesthetic quality evaluation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multistructure-Based Collaborative Online Distillation; Liang Gao et al.; Entropy; 2019-04-02; pp. 1-15 *
A Semantic-Level Text-Collaborative Image Recognition Method; Duan Xiping et al.; Journal of Harbin Institute of Technology; 2014-03-31; Vol. 46, No. 3; pp. 49-53 *

Also Published As

Publication number Publication date
CN112926700A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
US11301732B2 (en) Processing image-bearing electronic documents using a multimodal fusion framework
CN108959482B (en) Single-round dialogue data classification method and device based on deep learning and electronic equipment
US9002066B2 (en) Methods, systems and processor-readable media for designing a license plate overlay decal having infrared annotation marks
Zeng et al. Beyond ocr+ vqa: involving ocr into the flow for robust and accurate textvqa
CN111738016A (en) Multi-intention recognition method and related equipment
CN111680490A (en) Cross-modal document processing method and device and electronic equipment
CN112926700B (en) Class identification method and device for target image
CN114092707A (en) Image text visual question answering method, system and storage medium
CN115658955B (en) Cross-media retrieval and model training method, device, equipment and menu retrieval system
Tong et al. MA-CRNN: a multi-scale attention CRNN for Chinese text line recognition in natural scenes
US20230298630A1 (en) Apparatuses and methods for selectively inserting text into a video resume
CN115017911A (en) Cross-modal processing for vision and language
Sharma et al. A comprehensive survey on image captioning: From handcrafted to deep learning-based techniques, a taxonomy and open research issues
CN114328934A (en) Attention mechanism-based multi-label text classification method and system
CN112464927B (en) Information extraction method, device and system
CN116993446A (en) Logistics distribution management system and method for electronic commerce
CN117093687A (en) Question answering method and device, electronic equipment and storage medium
CN115004261A (en) Text line detection
Wang et al. AMRE: An Attention-Based CRNN for Manchu Word Recognition on a Woodblock-Printed Dataset
US20240193976A1 (en) Machine learning-based diagram label recognition
Praneel et al. Gated Dual Adaptive Attention Mechanism with Semantic Reasoning, Character Awareness, and Visual-Semantic Ensemble Fusion Decoder for Text Recognition in Natural Scene Images
CN117333868A (en) Method, device and storage medium for identifying object
Wu et al. A Multimodal Text Block Segmentation Framework for Photo Translation
Sheng Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant