CN112926700B - Class identification method and device for target image - Google Patents

Class identification method and device for target image

Info

Publication number
CN112926700B
Authority
CN
China
Prior art keywords
text
image
encoder
feature vector
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110460794.XA
Other languages
Chinese (zh)
Other versions
CN112926700A (en)
Inventor
暨凯祥
刘家佳
曾小英
胡圻圻
冯力国
王剑
陈景东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110460794.XA
Publication of CN112926700A
Application granted
Publication of CN112926700B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the specification provides a category identification method and device for a target image, wherein the target image comprises a text, and the method comprises the following steps: recognizing text content in the target image to obtain a first text recognition result; inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder; inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder; inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder; and determining the target category of the target image according to the first global feature vector. The accuracy of class identification for the target image can be improved.

Description

Class identification method and device for target image
Technical Field
One or more embodiments of the present specification relate to the field of computers, and in particular, to a method and apparatus for class identification for a target image.
Background
Category identification for a target image is often required, for example, identifying whether the animal in an image is a dog or a cat, or whether the merchant in an image is a restaurant or a retail store. Performing such identification manually is slow and costly, so an intelligent category identification scheme for the target image is needed.
In the prior art, a typical class identification process for a target image identifies its class automatically from the image alone, in an image classification manner. In many cases, however, the recognition result is not accurate enough because the available information is not fully utilized.
Accordingly, improved solutions are desired that can improve the accuracy of class identification for target images.
Disclosure of Invention
One or more embodiments of the present specification describe a method and an apparatus for class identification for a target image, which can improve the accuracy of class identification for the target image.
In a first aspect, a method for identifying a category of a target image is provided, where the target image includes text, and the method includes:
recognizing text content in the target image to obtain a first text recognition result;
inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder;
inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder;
and determining the target category of the target image according to the first global feature vector.
In one possible embodiment, the recognizing the text content in the target image includes:
performing text detection on the target image to obtain a first text region in the target image;
and performing character recognition on the first text area to obtain the first text recognition result.
Further, the target image is a merchant storefront photo, the storefront has a signboard, and the signboard bears text; the target category is a merchant category;
the text detection of the target image comprises:
inputting the merchant storefront photo into a signboard detector, and outputting a signboard region through the signboard detector;
inputting the signboard region into a text detector, and outputting the first text region through the text detector.
In one possible embodiment, the image encoder, the text encoder or the multimodal fusion encoder is trained on sample images obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
In one possible embodiment, the multimode fusion encoder is trained by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
determining a multi-modal classification loss according to the first prediction category and the category label;
and training the multimode fusion encoder according to the multimode classification loss.
Further, the image encoder is trained by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
determining image classification loss according to the second prediction category and the category label;
and training the image encoder according to the image classification loss.
Further, the text encoder is trained by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
determining text classification loss according to the third prediction category and the category label;
and training the text encoder according to the text classification loss.
Further, the image encoder is trained by:
determining an image distillation loss in distillation learning according to a first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity;
training the image encoder with a goal of reducing the image distillation loss.
Further, the text encoder is trained by:
determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity;
training the text encoder with a goal of reducing the text distillation loss.
In one possible embodiment, the target category includes: a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
In a second aspect, an apparatus for identifying a category of a target image is provided, where the target image includes text, the apparatus including:
the identification unit is used for identifying the text content in the target image to obtain a first text identification result;
the image coding unit is used for inputting the target image into an image coder and outputting a first image semantic feature vector corresponding to the target image through the image coder;
the text coding unit is used for inputting the first text recognition result obtained by the recognition unit into a text encoder and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
the fusion unit is used for inputting the first image semantic feature vector obtained by the image coding unit and the first text semantic feature vector obtained by the text coding unit into a multimode fusion encoder and outputting a first global feature vector through the multimode fusion encoder;
and the determining unit is used for determining the target category of the target image according to the first global feature vector obtained by the fusing unit.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and the device provided by the embodiment of the specification, the target image comprises the text, firstly, the text content in the target image is identified, and a first text identification result is obtained; then inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder; then inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder; inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder; and finally, determining the target category of the target image according to the first global feature vector. As can be seen from the above, in the embodiments of the present specification, information of an image modality and information of a text modality are used simultaneously, and the two modalities are complementary, so that the problem of incomplete information of a single modality is effectively avoided, the information is fully utilized, and the accuracy of class identification for a target image can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a flow diagram of a method for class identification for a target image, according to one embodiment;
FIG. 3 illustrates an identification flow diagram for determining merchant categories based on merchant storefront photos, according to one embodiment;
FIG. 4 illustrates a training flow diagram for determining merchant categories based on merchant storefront photos, according to one embodiment;
fig. 5 shows a schematic block diagram of a class identification apparatus for a target image according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves class recognition for a target image that includes text. Referring to fig. 1, the target image 11 is a merchant storefront photo; the storefront has a signboard 12, and the signboard 12 bears text. The text may be, for example, "XX Noodle Restaurant" as shown in fig. 1, or "XX Department Store", "XX Supermarket", and so on. The corresponding merchant category is identified from the target image 11; specifically, the merchant may be identified as a catering category or a department-store category at the first level, or as Chinese food or Western food at the second level.
Modality: a modality is a particular form in which information is represented, for example pictures, text, or sound.
In the embodiment of the specification, information of an image modality and information of a text modality are simultaneously used, and the two modalities are complementary, so that the problem of incomplete information of a single modality is effectively solved, the information is fully utilized, and the accuracy of class identification of a target image can be improved.
In addition, it should be noted that, in the embodiments of the present specification, the target image is not limited to storefront photos; the method applies to any scenario that involves category identification for a target image containing text. For example, in another possible implementation scenario, the target image may be a photo of a car that includes characters such as a license plate number or a model designation, and the corresponding model or model year is identified from the target image.
Fig. 2 shows a flowchart of a class identification method for a target image including text, according to an embodiment, which may be based on the implementation scenario shown in fig. 1. As shown in fig. 2, the class identification method for the target image in this embodiment includes the following steps: step 21, recognizing text content in the target image to obtain a first text recognition result; step 22, inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder; step 23, inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder; step 24, inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder; and step 25, determining the target category of the target image according to the first global feature vector. Specific execution modes of these steps are described below.
First, in step 21, the text content in the target image is recognized, and a first text recognition result is obtained. It is understood that the text content may include various types of words or characters, such as chinese characters, pinyin, english letters, arabic numerals, and the like.
In one example, the identifying text content in the target image includes:
performing text detection on the target image to obtain a first text region in the target image;
and performing character recognition on the first text area to obtain the first text recognition result.
It can be understood that the text in the target image is usually concentrated in a certain region, rather than scattered everywhere in the target image, and therefore, the efficiency of character recognition can be improved by determining the first text region in the target image and then performing character recognition on the first text region. The character recognition may be performed by an Optical Character Recognition (OCR) algorithm.
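The following is a minimal sketch of this two-stage flow in Python; `detector` and `ocr` are hypothetical stand-ins for any text-detection model and OCR algorithm, since the embodiment does not prescribe concrete implementations:

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class TextRegion:
    x: int  # left
    y: int  # top
    w: int  # width
    h: int  # height

def crop(image: np.ndarray, r: TextRegion) -> np.ndarray:
    # The image is assumed to be an H x W x C array.
    return image[r.y:r.y + r.h, r.x:r.x + r.w]

def recognize_text(image: np.ndarray, detector, ocr) -> str:
    """Two-stage recognition: locate text regions first, then read characters.
    `detector.detect` and `ocr.recognize` are hypothetical interfaces."""
    regions: List[TextRegion] = detector.detect(image)        # text detection
    texts = [ocr.recognize(crop(image, r)) for r in regions]  # character recognition
    return " ".join(texts)
```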
Further, the target image is a merchant storefront photo, the storefront has a signboard, and the signboard bears text; the target category is a merchant category;
the text detection of the target image comprises:
inputting the merchant storefront photo into a signboard detector, and outputting a signboard region through the signboard detector;
inputting the signboard region into a text detector, and outputting the first text region through the text detector.
In the embodiments of the present specification, the signboard detector is trained on storefront photos annotated with signboard regions, and is used to locate the signboard within the storefront photo.
Then, in step 22, the target image is input into an image encoder, and a first image semantic feature vector corresponding to the target image is output through the image encoder. The category of the target image could be identified from the first image semantic feature vector alone; however, since the image is only one modality, the information would not be fully utilized and the recognition result would not be accurate enough.
The image encoder may adopt a general image classification network for extracting high-order semantic features of the image, such as a convolution-based network (ResNet, VGG, DenseNet) or a Transformer-based network (ViT, T2T-ViT).
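As an illustrative sketch only, a ResNet-50 backbone from torchvision could serve as such an image encoder by replacing its classification layer, so that the forward pass yields a semantic feature vector; the model choice, the legacy `pretrained` flag, and the input size are assumptions:

```python
import torch
import torchvision.models as models

# Replace the classification head with identity so the network outputs
# a semantic feature vector instead of class logits (2048-dim for ResNet-50).
image_encoder = models.resnet50(pretrained=True)  # legacy pretrained-weights flag
image_encoder.fc = torch.nn.Identity()
image_encoder.eval()

dummy_image = torch.randn(1, 3, 224, 224)       # a preprocessed target image
with torch.no_grad():
    image_feat = image_encoder(dummy_image)     # image semantic feature vector, shape (1, 2048)
```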
Then, in step 23, the first text recognition result is input into a text encoder, and a first text semantic feature vector corresponding to the first text recognition result is output by the text encoder. Likewise, the category of the target image could be identified from the first text semantic feature vector alone; however, since the text is only one modality, the information would not be fully utilized and the recognition result would not be accurate enough.
The text encoder may adopt a Transformer-based network such as BERT or RoBERTa, or an RNN-based model such as LSTM or Bi-LSTM, for extracting high-order semantic features of the text.
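As an illustrative sketch, a BERT encoder from the HuggingFace transformers library could extract the text semantic feature vector; the checkpoint name, the sample string, and the use of the [CLS] hidden state are assumptions:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
text_encoder = BertModel.from_pretrained("bert-base-chinese")

# In the method, the input would be the first text recognition result from step 21.
inputs = tokenizer("XX Noodle Restaurant", return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**inputs)
text_feat = outputs.last_hidden_state[:, 0]   # [CLS] vector as the text feature, shape (1, 768)
```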
Next, in step 24, the first image semantic feature vector and the first text semantic feature vector are input into a multimode fusion encoder, and a first global feature vector is output through the multimode fusion encoder. The first global feature vector is used to identify the category of the target image, yielding a recognition result with high accuracy.
The multimode fusion encoder takes the high-order image semantic features and the high-order text semantic features as input, fuses them interactively inside the encoder, fully mines the information of the two modalities, and outputs a representative global feature. Interacting at the level of high-order semantic features effectively avoids interference from low-order features and improves computational efficiency. Specifically, the multimode fusion encoder may fuse the two modalities by concatenation (splicing), or, to let the two modalities interact more richly, may use a Cross-Transformer structured network.
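The following sketch, continuing the examples above, illustrates one possible fusion encoder of the splicing kind, using standard Transformer self-attention layers; the dimensions, the two-token formulation, and the mean pooling are assumptions, not the patent's prescribed design:

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Projects both semantic vectors into a shared space and lets
    self-attention layers mix them; a stand-in for the splicing or
    Cross-Transformer variants mentioned above."""
    def __init__(self, img_dim=2048, txt_dim=768, dim=512, num_layers=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image_feat, text_feat):
        # Each modality contributes one token of a two-token sequence.
        tokens = torch.stack([self.img_proj(image_feat), self.txt_proj(text_feat)], dim=1)
        fused = self.encoder(tokens)   # (batch, 2, dim)
        return fused.mean(dim=1)       # global feature vector, (batch, dim)

fusion_encoder = FusionEncoder()
global_feat = fusion_encoder(image_feat, text_feat)   # using the vectors from the sketches above
```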
Finally, in step 25, a target class of the target image is determined based on the first global feature vector. It is understood that a plurality of selectable categories are preset, at least one selectable category is selected from the plurality of selectable categories according to the first global feature vector, and the selectable category is determined as the target category of the target image.
In one example, the target category includes: a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
In one example, the determining the target category of the target image according to the first global feature vector includes:
inputting the first global feature vector into a first fully-connected layer, and determining the first-level target category of the target image;
and inputting the first global feature vector into a second fully-connected layer, and determining the second-level target category of the target image, as sketched below.
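A minimal sketch of the two non-weight-sharing heads, continuing the sketches above; the numbers of categories per level are assumed for illustration:

```python
import torch.nn as nn

num_l1, num_l2 = 10, 100   # assumed counts of first- and second-level categories

# Two classification heads that do not share weights, one per hierarchy level.
head_l1 = nn.Linear(512, num_l1)
head_l2 = nn.Linear(512, num_l2)

logits_l1 = head_l1(global_feat)   # scores for the first-level target category
logits_l2 = head_l2(global_feat)   # scores for the second-level target category
```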
FIG. 3 illustrates an identification flow diagram for determining merchant categories based on merchant storefront photos, according to one embodiment. Referring to fig. 3, a merchant storefront photo is input, either collected and uploaded by field staff or taken and uploaded by the merchant. A text region in the photo is located by text detection, the located text region is input into an OCR algorithm, and a text recognition result is output. The image encoder extracts the high-order image semantic features of the storefront; the text encoder extracts the high-order text semantic features of the text recognition result; and the multimode fusion encoder takes both as input and performs interactive fusion to obtain the corresponding global feature vector. The merchant category is subsequently determined from the global feature vector.
In the embodiments of the present specification, the aforementioned image encoder, text encoder, and multimode fusion encoder are obtained through pre-training. This falls under multi-modal machine learning (MMML), which aims to give machines the capability of processing and understanding information from multiple source modalities through machine learning methods.
In one example, the image encoder, the text encoder, or the multimodal fusion encoder is trained on sample images obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
It can be understood that, besides original photographs, perturbed images can also be used as sample images, which improves the robustness of the trained model; one possible realization is sketched below.
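A possible realization of this sample construction, assuming a PIL input image and illustrative perturbation parameters; shuffling the recognized tokens realizes the text sequence adjustment:

```python
import random
import torch
import torchvision.transforms as T

def make_sample(original_image, ocr_tokens):
    """Derives one training sample by applying the perturbations listed above.
    `original_image` is assumed to be a PIL image; `ocr_tokens` is the
    recognized text split into tokens. All parameter values are illustrative."""
    augment = T.Compose([
        T.Resize((256, 256)),           # image size conversion
        T.RandomRotation(degrees=15),   # image angle conversion
        T.ToTensor(),
    ])
    img = augment(original_image)
    img = img + 0.05 * torch.randn_like(img)   # white noise addition
    tokens = list(ocr_tokens)
    random.shuffle(tokens)                     # text sequence adjustment
    return img, " ".join(tokens)
```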
In one example, the multimode fusion encoder is trained by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
determining a multi-modal classification loss according to the first prediction category and the category label;
and training the multimode fusion encoder according to the multi-modal classification loss (one such training step is sketched below).
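A sketch of one training step along these lines, reusing the encoder sketches above; the linear classification head, the optimizer, and the precomputation of the text feature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fusion_training_step(sample_image, text_feat, label,
                         image_encoder, fusion_encoder, head, optimizer):
    """One training step for the fusion encoder. `text_feat` is the second
    text semantic feature vector, assumed precomputed by the text encoder;
    `head` is a linear layer over global features."""
    image_feat = image_encoder(sample_image)             # second image semantic feature vector
    global_feat = fusion_encoder(image_feat, text_feat)  # second global feature vector
    logits = head(global_feat)                           # first prediction category (as logits)
    loss = F.cross_entropy(logits, label)                # multi-modal classification loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```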
Optionally, the image encoder or the text encoder may be trained according to the multi-modal classification loss.
Further, the image encoder may also be trained by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
determining image classification loss according to the second prediction category and the category label;
and training the image encoder according to the image classification loss.
Further, the text encoder may also be trained by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
determining text classification loss according to the third prediction category and the category label;
and training the text encoder according to the text classification loss.
Further, the image encoder may also be trained by:
determining an image distillation loss in distillation learning according to a first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity;
training the image encoder with a goal of reducing the image distillation loss.
Further, the text encoder may also be trained by:
determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity;
training the text encoder with a goal of reducing the text distillation loss.
In the embodiments of the present specification, a general distillation-learning framework is designed. It adopts deep supervision, using the deeper network to guide the shallower networks, which improves the discriminability and robustness of the intermediate-layer features and thereby effectively improves classification performance. The classification losses may use cross entropy, and the distillation losses may use KL divergence.
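A minimal sketch of such a KL-divergence distillation term; converting feature vectors into distributions via softmax, sharing a common dimension, and detaching the teacher (global) feature are assumptions of this sketch:

```python
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat, temperature=1.0):
    """KL-divergence distillation: the global feature vector (teacher) guides a
    single-modality feature vector (student). The more similar the two
    features, the smaller the loss, matching the negative correlation above."""
    teacher_p = F.softmax(teacher_feat.detach() / temperature, dim=-1)
    student_logp = F.log_softmax(student_feat / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean")
```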
Embodiments of the present specification may combine the aforementioned losses to jointly train the image encoder, the text encoder, and the multimode fusion encoder.
FIG. 4 illustrates a training flow diagram for determining merchant categories based on merchant storefront photos, according to one embodiment. Referring to fig. 4, a merchant storefront photo serving as a sample image is input; a text region in the photo is located by text detection, and the located text region is input into an OCR algorithm, which outputs a text recognition result. The image encoder extracts the high-order image semantic features of the storefront, the text encoder extracts the high-order text semantic features of the text recognition result, and the multimode fusion encoder takes both as input and performs interactive fusion to obtain the corresponding global feature vector. Subsequently, the multi-modal classification loss is determined according to the output of the multimode fusion encoder and the class label; the image classification loss is determined according to the output of the image encoder and the class label; the text classification loss is determined according to the output of the text encoder and the class label; the image distillation loss is determined according to the output of the image encoder and the output of the multimode fusion encoder; and the text distillation loss is determined according to the output of the text encoder and the output of the multimode fusion encoder. A total loss is then determined according to the multi-modal classification loss, the image classification loss, the text classification loss, the image distillation loss and the text distillation loss, and the image encoder, the text encoder and the multimode fusion encoder are jointly trained with minimizing the total loss as the training target; a sketch of one such combination follows below.
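One way to combine the five losses; the weights are assumptions, since the patent only states that the total loss is determined according to the five terms:

```python
def total_loss(mm_cls, img_cls, txt_cls, img_distill, txt_distill,
               weights=(1.0, 0.5, 0.5, 0.1, 0.1)):
    """Weighted sum of the five losses described above; the weight values
    are illustrative assumptions, not specified by the patent."""
    w = weights
    return (w[0] * mm_cls + w[1] * img_cls + w[2] * txt_cls
            + w[3] * img_distill + w[4] * txt_distill)
```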
In addition, in order to further improve the performance of the model, a hierarchical classification component or a key region matching component can be added.
Hierarchical classification component: since the merchant category system has a two-level hierarchical structure (for example, "catering" at the first level and "Chinese food" at the second), the model exploits this characteristic: the features output by the multimode fusion encoder pass through two fully-connected layers that do not share weights, yielding two levels of classification results, supervised by the two levels of category labels.
Key region matching component: when the multimode fusion encoder adopts an attention mechanism, and since the image and the text may be correlated to some degree, the most important feature blocks of the text and of the image are selected from the features output by the multimode fusion encoder in descending order of attention score, and a KL-divergence constraint is applied to the two selected region features so that their feature representations agree; a sketch follows below.
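A sketch of this component under assumed tensor shapes; the top-k selection by attention score and the KL constraint follow the description above, while the shapes, the shared feature dimension, and the value of k are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def key_region_matching_loss(text_tokens, image_tokens,
                             text_attn, image_attn, k=4):
    """Selects the k feature blocks with the highest attention scores from each
    modality and pulls their distributions together with KL divergence.
    Assumed shapes: tokens (batch, seq, dim) with equal dim for both
    modalities; attention scores (batch, seq)."""
    d = text_tokens.size(-1)
    idx_t = text_attn.topk(k, dim=1).indices.unsqueeze(-1).expand(-1, -1, d)
    idx_i = image_attn.topk(k, dim=1).indices.unsqueeze(-1).expand(-1, -1, d)
    key_text = torch.gather(text_tokens, 1, idx_t)    # (batch, k, dim)
    key_image = torch.gather(image_tokens, 1, idx_i)  # (batch, k, dim)
    return F.kl_div(F.log_softmax(key_text, dim=-1),
                    F.softmax(key_image, dim=-1),
                    reduction="batchmean")
```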
According to the method provided by the embodiment of the specification, the target image comprises the text, firstly, the text content in the target image is identified, and a first text identification result is obtained; then inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder; then inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder; inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder; and finally, determining the target category of the target image according to the first global feature vector. As can be seen from the above, in the embodiments of the present specification, information of an image modality and information of a text modality are used simultaneously, and the two modalities are complementary, so that the problem of incomplete information of a single modality is effectively avoided, the information is fully utilized, and the accuracy of class identification for a target image can be improved.
According to another aspect of embodiments, there is also provided a class identification apparatus for a target image, the target image including text, the apparatus being configured to execute the class identification method for the target image provided in the embodiments of the present specification. Fig. 5 shows a schematic block diagram of a class identification apparatus for a target image according to one embodiment. As shown in fig. 5, the apparatus 500 includes:
the recognition unit 51 is used for recognizing the text content in the target image to obtain a first text recognition result;
an image encoding unit 52, configured to input the target image into an image encoder, and output a first image semantic feature vector corresponding to the target image through the image encoder;
a text encoding unit 53, configured to input the first text recognition result obtained by the recognition unit 51 into a text encoder, and output a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
a fusion unit 54, configured to input the first image semantic feature vector obtained by the image encoding unit 52 and the first text semantic feature vector obtained by the text encoding unit 53 into a multimode fusion encoder, and output a first global feature vector through the multimode fusion encoder;
a determining unit 55, configured to determine a target category of the target image according to the first global feature vector obtained by the fusing unit 54.
Optionally, as an embodiment, the identifying unit 51 includes:
the text detection subunit is used for performing text detection on the target image to obtain a first text region in the target image;
and the character recognition subunit is used for performing character recognition on the first text region obtained by the text detection subunit to obtain the first text recognition result.
Further, the target image is a merchant storefront photo, the storefront has a signboard, and the signboard bears text; the target category is a merchant category;
the text detection subunit includes:
the signboard detection module is used for inputting the merchant storefront photo into a signboard detector and outputting a signboard region through the signboard detector;
and the text detection module is used for inputting the signboard region obtained by the signboard detection module into a text detector and outputting the first text region through the text detector.
Optionally, as an embodiment, the image encoder, the text encoder, or the multi-mode fusion encoder is trained based on sample images, and the sample images are obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
Optionally, as an embodiment, the multimode fusion encoder is trained by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
determining a multi-modal classification loss according to the first prediction category and the category label;
and training the multimode fusion encoder according to the multimode classification loss.
Further, the image encoder is trained by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
determining image classification loss according to the second prediction category and the category label;
and training the image encoder according to the image classification loss.
Further, the text encoder is trained by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
determining text classification loss according to the third prediction category and the category label;
and training the text encoder according to the text classification loss.
Further, the image encoder is trained by:
determining an image distillation loss in distillation learning according to a first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity;
training the image encoder with a goal of reducing the image distillation loss.
Further, the text encoder is trained by:
determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity;
training the text encoder with a goal of reducing the text distillation loss.
Optionally, as an embodiment, the target category includes:
a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
With the apparatus provided by the embodiment of the present specification, the target image includes a text, and first, the identifying unit 51 identifies text content in the target image to obtain a first text identification result; then, the image encoding unit 52 inputs the target image into an image encoder, and outputs a first image semantic feature vector corresponding to the target image through the image encoder; then, the text encoding unit 53 inputs the first text recognition result into a text encoder, and outputs a first text semantic feature vector corresponding to the first text recognition result through the text encoder; the fusion unit 54 inputs the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputs a first global feature vector through the multimode fusion encoder; finally, the determining unit 55 determines the target category of the target image according to the first global feature vector. As can be seen from the above, in the embodiments of the present specification, information of an image modality and information of a text modality are used simultaneously, and the two modalities are complementary, so that the problem of incomplete information of a single modality is effectively avoided, the information is fully utilized, and the accuracy of class identification for a target image can be improved.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (22)

1. A method of class identification for a target image, the target image including text therein, the method comprising:
recognizing text content in the target image to obtain a first text recognition result;
inputting the target image into an image encoder, and outputting a first image semantic feature vector corresponding to the target image through the image encoder;
inputting the first text recognition result into a text encoder, and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
inputting the first image semantic feature vector and the first text semantic feature vector into a multimode fusion encoder, and outputting a first global feature vector through the multimode fusion encoder;
determining a target category of the target image according to the first global feature vector;
wherein the image encoder, the text encoder, and the multimode fusion encoder are trained by:
determining a total loss according to a multi-modal classification loss, an image classification loss, a text classification loss, an image distillation loss and a text distillation loss; wherein the multi-modal classification loss is determined according to an output of the multimode fusion encoder and a category label, the image classification loss is determined according to an output of the image encoder and the category label, the text classification loss is determined according to an output of the text encoder and the category label, the image distillation loss is determined according to the output of the image encoder and the output of the multimode fusion encoder, and the text distillation loss is determined according to the output of the text encoder and the output of the multimode fusion encoder;
and jointly training the image encoder, the text encoder and the multimode fusion encoder by taking the minimized total loss as a training target.
2. The method of claim 1, wherein the identifying text content in the target image comprises:
performing text detection on the target image to obtain a first text region in the target image;
and performing character recognition on the first text area to obtain the first text recognition result.
3. The method of claim 2, wherein the target image is a merchant storefront photo, the storefront having a signboard, the signboard bearing text; and the target category is a merchant category;
the text detection of the target image comprises:
inputting the merchant storefront photo into a signboard detector, and outputting a signboard region through the signboard detector;
inputting the signboard region into a text detector, and outputting the first text region through the text detector.
4. The method of claim 1, wherein the image encoder, the text encoder, or the multimodal fusion encoder is trained based on sample images obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
5. The method of claim 1, wherein the multi-modal classification penalty is determined by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
and determining the multi-mode classification loss according to the first prediction class and the class label.
6. The method of claim 5, wherein the image classification penalty is determined by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
and determining the image classification loss according to the second prediction category and the category label.
7. The method of claim 5, wherein the text classification penalty is determined by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
and determining text classification loss according to the third prediction category and the category label.
8. The method of claim 5, wherein the image distillation loss is determined by:
and determining the image distillation loss in distillation learning according to the first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity.
9. The method of claim 5, wherein the text distillation loss is determined by:
and determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity.
10. The method of claim 1, wherein the target category includes: a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
11. A category identification apparatus for a target image including text therein, the apparatus comprising:
the identification unit is used for identifying the text content in the target image to obtain a first text identification result;
the image coding unit is used for inputting the target image into an image coder and outputting a first image semantic feature vector corresponding to the target image through the image coder;
the text coding unit is used for inputting the first text recognition result obtained by the recognition unit into a text encoder and outputting a first text semantic feature vector corresponding to the first text recognition result through the text encoder;
the fusion unit is used for inputting the first image semantic feature vector obtained by the image coding unit and the first text semantic feature vector obtained by the text coding unit into a multimode fusion encoder and outputting a first global feature vector through the multimode fusion encoder;
the determining unit is used for determining the target category of the target image according to the first global feature vector obtained by the fusing unit;
wherein the image encoder, the text encoder, and the multimode fusion encoder are trained by:
determining a total loss according to a multi-modal classification loss, an image classification loss, a text classification loss, an image distillation loss and a text distillation loss; wherein the multi-modal classification loss is determined according to an output of the multimode fusion encoder and a category label, the image classification loss is determined according to an output of the image encoder and the category label, the text classification loss is determined according to an output of the text encoder and the category label, the image distillation loss is determined according to the output of the image encoder and the output of the multimode fusion encoder, and the text distillation loss is determined according to the output of the text encoder and the output of the multimode fusion encoder;
and jointly training the image encoder, the text encoder and the multimode fusion encoder by taking the minimized total loss as a training target.
12. The apparatus of claim 11, wherein the identifying unit comprises:
the text detection subunit is used for performing text detection on the target image to obtain a first text region in the target image;
and the character recognition subunit is used for performing character recognition on the first text region obtained by the text detection subunit to obtain the first text recognition result.
13. The apparatus of claim 12, wherein the target image is a merchant storefront photo, the storefront having a signboard, the signboard bearing text; and the target category is a merchant category;
the text detection subunit includes:
the signboard detection module is used for inputting the merchant storefront photo into a signboard detector and outputting a signboard region through the signboard detector;
and the text detection module is used for inputting the signboard region obtained by the signboard detection module into a text detector and outputting the first text region through the text detector.
14. The apparatus of claim 11, wherein the image encoder, the text encoder, or the multimodal fusion encoder is trained based on sample images obtained by:
and carrying out at least one of image size conversion, image angle conversion, white noise addition and text sequence adjustment on the original image obtained by photographing to obtain the sample image.
15. The apparatus of claim 11, wherein the multi-modal classification penalty is determined by:
acquiring a sample image and a category label corresponding to the sample image;
recognizing text content in the sample image to obtain a second text recognition result;
inputting the sample image into an image encoder, and outputting a second image semantic feature vector corresponding to the sample image through the image encoder;
inputting the second text recognition result into a text encoder, and outputting a second text semantic feature vector corresponding to the second text recognition result through the text encoder;
inputting the second image semantic feature vector and the second text semantic feature vector into a multimode fusion encoder, and outputting a second global feature vector through the multimode fusion encoder;
determining a first prediction category of the sample image according to the second global feature vector;
and determining the multi-mode classification loss according to the first prediction class and the class label.
16. The apparatus of claim 15, wherein the image classification penalty is determined by:
determining a second prediction category of the sample image according to the second image semantic feature vector;
and determining the image classification loss according to the second prediction category and the category label.
17. The apparatus of claim 15, wherein the text classification penalty is determined by:
determining a third prediction category of the sample image according to the second text semantic feature vector;
and determining text classification loss according to the third prediction category and the category label.
18. The apparatus of claim 15, wherein the image distillation loss is determined by:
and determining the image distillation loss in distillation learning according to the first similarity between the second image semantic feature vector and the second global feature vector, wherein the image distillation loss is in negative correlation with the first similarity.
19. The apparatus of claim 15, wherein the text distillation loss is determined by:
and determining text distillation loss in distillation learning according to a second similarity between the second text semantic feature vector and the second global feature vector, wherein the text distillation loss is negatively correlated with the second similarity.
20. The apparatus of claim 11, wherein the target category comprises: a target category of a first hierarchy and a target category of a second hierarchy; any category of the first hierarchy has a plurality of categories of the second hierarchy.
21. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-10.
22. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-10.
CN202110460794.XA 2021-04-27 2021-04-27 Class identification method and device for target image Active CN112926700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110460794.XA CN112926700B (en) 2021-04-27 2021-04-27 Class identification method and device for target image

Publications (2)

Publication Number Publication Date
CN112926700A (en) 2021-06-08
CN112926700B (en) 2022-04-12

Family

ID=76174748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110460794.XA Active CN112926700B (en) 2021-04-27 2021-04-27 Class identification method and device for target image

Country Status (1)

Country Link
CN (1) CN112926700B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378792B (en) * 2021-07-09 2022-08-02 合肥工业大学 Weak supervision cervical cell image analysis method fusing global and local information
CN113591770B (en) * 2021-08-10 2023-07-18 中国科学院深圳先进技术研究院 Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170098573A (en) * 2016-02-22 2017-08-30 에스케이텔레콤 주식회사 Multi-modal learning device and multi-modal learning method
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN112613303A (en) * 2021-01-07 2021-04-06 福州大学 Knowledge distillation-based cross-modal image aesthetic quality evaluation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10460201B2 (en) * 2015-12-31 2019-10-29 Microsoft Technology Licensing, Llc Structure and training for image classification
CN110705460B (en) * 2019-09-29 2023-06-20 北京百度网讯科技有限公司 Image category identification method and device
CN112699265A (en) * 2019-10-22 2021-04-23 商汤国际私人有限公司 Image processing method and device, processor and storage medium
CN112541530B (en) * 2020-12-06 2023-06-20 支付宝(杭州)信息技术有限公司 Data preprocessing method and device for clustering model
CN112633380A (en) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 Interest point feature extraction method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170098573A (en) * 2016-02-22 2017-08-30 에스케이텔레콤 주식회사 Multi-modal learning device and multi-modal learning method
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN112613303A (en) * 2021-01-07 2021-04-06 福州大学 Knowledge distillation-based cross-modal image aesthetic quality evaluation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multistructure-Based Collaborative Online Distillation; Liang Gao et al.; Entropy; 2019-04-02; pp. 1-15 *
A Semantic-Level Text-Collaborative Image Recognition Method; Duan Xiping et al.; Journal of Harbin Institute of Technology; 2014-03-31; Vol. 46, No. 3; pp. 49-53 *

Also Published As

Publication number Publication date
CN112926700A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
US11301732B2 (en) Processing image-bearing electronic documents using a multimodal fusion framework
CN108959482B (en) Single-round dialogue data classification method and device based on deep learning and electronic equipment
US9002066B2 (en) Methods, systems and processor-readable media for designing a license plate overlay decal having infrared annotation marks
Zeng et al. Beyond ocr+ vqa: involving ocr into the flow for robust and accurate textvqa
CN111738016A (en) Multi-intention recognition method and related equipment
CN111680490A (en) Cross-modal document processing method and device and electronic equipment
CN112926700B (en) Class identification method and device for target image
CN114092707A (en) Image text visual question answering method, system and storage medium
CN115658955B (en) Cross-media retrieval and model training method, device, equipment and menu retrieval system
Tong et al. MA-CRNN: a multi-scale attention CRNN for Chinese text line recognition in natural scenes
US20230298630A1 (en) Apparatuses and methods for selectively inserting text into a video resume
CN115017911A (en) Cross-modal processing for vision and language
Sharma et al. A comprehensive survey on image captioning: From handcrafted to deep learning-based techniques, a taxonomy and open research issues
CN114328934A (en) Attention mechanism-based multi-label text classification method and system
CN112464927B (en) Information extraction method, device and system
CN116993446A (en) Logistics distribution management system and method for electronic commerce
CN117093687A (en) Question answering method and device, electronic equipment and storage medium
CN115004261A (en) Text line detection
Wang et al. AMRE: An Attention-Based CRNN for Manchu Word Recognition on a Woodblock-Printed Dataset
US20240193976A1 (en) Machine learning-based diagram label recognition
Praneel et al. Gated Dual Adaptive Attention Mechanism with Semantic Reasoning, Character Awareness, and Visual-Semantic Ensemble Fusion Decoder for Text Recognition in Natural Scene Images
CN117333868A (en) Method, device and storage medium for identifying object
Wu et al. A Multimodal Text Block Segmentation Framework for Photo Translation
Sheng Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant