CN103714178B - Automatic image marking method based on word correlation - Google Patents

Automatic image marking method based on word correlation Download PDF

Info

Publication number
CN103714178B
CN103714178B CN201410008553.1A
Authority
CN
China
Prior art keywords
image
word
words
training set
annotation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410008553.1A
Other languages
Chinese (zh)
Other versions
CN103714178A (en)
Inventor
安震 (An Zhen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201410008553.1A priority Critical patent/CN103714178B/en
Publication of CN103714178A publication Critical patent/CN103714178A/en
Application granted granted Critical
Publication of CN103714178B publication Critical patent/CN103714178B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an automatic image annotation method based on word correlation. A training set T contains l images, each image of T is annotated with n annotation words, each image has corresponding visual lemmas, and the image to be annotated is I. The method comprises: calculating a semantic vector for each annotation word w according to a formula, so that w is represented in vector form w = <v1, v2, …, vm>, where the ci are context-related words and there are m context-related words in total; calculating the semantic similarity between annotation words according to a formula involving the vector norm; calculating p(A) according to a formula, where A is an annotation phrase {w1, w2, …, wn} and n is the number of annotation words in the phrase; calculating the conditional probability p(I|wi) according to a formula; and calculating the annotation phrase A of the image I to be annotated according to the formula A = argmax_A p(I|A)·p(A).

Description

Automatic image labeling method based on inter-word correlation
Technical Field
The invention relates to the field of image processing, in particular to an automatic image annotation method based on interword correlation.
Background
With the rapid development of multimedia and Internet technologies, daily life and work depend increasingly on multimedia information such as images. Semantic-based image retrieval can accurately express a user's retrieval intention and is convenient to use, so it has become both an important form of image retrieval and a research hotspot pursued by researchers.
Automatic image annotation is an important and challenging task in semantic image retrieval. It aims to automatically acquire the semantic information contained in an image's visual content, attempting to build a bridge between low-level visual features and high-level semantics so as to support retrieval at the semantic level. Research on automatic annotation algorithms based on image semantics has therefore become a very active branch and key technology in the image retrieval field, with good application prospects and research value.
Automatic image annotation means that a computer automatically adds semantic keywords reflecting the image content to an unannotated image. Using an annotated image set or other available information, it trains a model relating the semantic concept space to the visual feature space, and uses the model to annotate images with unknown semantics. By establishing a mapping between high-level semantic information and low-level image features, it alleviates the semantic gap problem to a certain extent.
The joint media correlation model is currently the most widely applied annotation algorithm among image annotation methods based on generative models, and has been widely studied. Its basic idea is to use probability and statistics to establish a probabilistic correlation between the image visual feature space and the semantic concept space: by statistically learning the joint probability distribution between the two spaces, it finds a group of semantic annotation words that maximizes the joint probability with the image content, and uses that group of words as the final annotation of the image to be annotated.
However, the joint media correlation model is a probabilistic model that is biased toward annotation words with high occurrence frequency. Moreover, it assumes during annotation that different candidate annotation words are mutually independent, so the correlations between annotation words are not fully exploited. In fact, within the same image, different annotation words exhibit various associations such as co-occurrence, hierarchy, or spatial relations.
For example, consider an image containing semantic objects such as "sun", "sky", "crowd", "mountain", and "tree". From the visual content it can be seen that there is a certain spatial correlation between the "sun" and "sky" objects: "sun" cannot exist independently of the "sky" object. Similarly, for the "mountain" and "tree" objects in the image content, the "tree" exists with the "mountain" as its visual background; the two are inseparable in the image's visual content, and it cannot be assumed that the two annotation words are assigned independently of each other. Therefore, treating different candidate annotation words as mutually independent during annotation is a defect of the joint media correlation model algorithm, and ignoring inter-word correlation can lead to semantic inconsistency among the annotation words in the annotation result.
Disclosure of Invention
In view of the above, the present invention provides an automatic image annotation method based on inter-word correlation, so as to overcome the defect that the joint media correlation model's automatic annotation algorithm treats different candidate annotation words as mutually independent, and to solve the resulting problem of semantic inconsistency among the annotation words in the annotation result. The technical scheme provided by the invention is as follows:
An automatic image annotation method based on word correlation, wherein a training set T contains l images, the l images forming an image set P = [p1 p2 … pl]; each image of the training set T is annotated with n annotation words, and all annotation words in the training set T form an annotation word set W = [w1 w2 … ws]; each image in the training set T has corresponding visual lemmas, and all visual lemmas in the training set T form a visual lemma set B = [b1 b2 … by]; the image to be annotated is I. The method comprises the following steps:
A. According to the formula vi = p(ci|w)/p(ci), calculate a semantic vector for each annotation word w in the training set T, expressing the annotation word w in vector form w = <v1, v2, …, vm>, where ci is a context-related word and there are m context-related words in total; p(ci) is the overall distribution probability of ci, and p(ci|w) is the ratio of the number of co-occurrences of ci and the annotation word w in the training set T to the total number of occurrences of w in T, i.e. p(ci|w) = count(ci, w)/count(w); the context-related words are annotation words in the training set T;
B. According to the formula sim(wi, wj) = wi·wj/(‖wi‖·‖wj‖), calculate the semantic similarity between annotation words, where ‖·‖ is the vector norm and wi·wj is the vector dot product;
C. According to the formula p(A) ∝ (1/(n−1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} sim(wi, wj), calculate p(A), where A is an annotation phrase {w1, w2, …, wn} and n is the number of annotation words in the phrase;
D. According to the formula p(I|wi) = p(wi, b1, …, bn)/p(wi), calculate the conditional probability p(I|wi), where p(wi) is the ratio of the number of occurrences of the annotation word wi in the training set T to the total number of occurrences of all annotation words in T, i.e. p(wi) = |wi|/Σ_{wk∈T}|wk|;
The calculation method of p(wi, b1, …, bn) is:

p(wi, b1, …, bn) = Σ_{J∈T} p(J)·p(wi|J)·Π_{k=1..n} p(bk|J),

where p(J) is the probability of randomly drawing a training image J from the image set P; p(wi|J) is the posterior probability that the annotation word wi appears in training image J; and p(bk|J) is the posterior probability that the visual lemma bk appears in training image J;
E. According to p(I|A) ≈ Π_{wi∈A} p(I|wi), calculate p(I|A);
F. According to the formula A = argmax_A p(I|A)·p(A), calculate the annotation phrase A of the image I to be annotated.
In the above scheme, the calculation methods of p(wi|J) and p(bk|J) in step D are respectively:
p(wi|J) = (1 − αJ)·#(wi, J)/|J| + αJ·#(wi, T)/|T|    (1)

p(bk|J) = (1 − βJ)·#(bk, J)/|J| + βJ·#(bk, T)/|T|    (2)
where αJ and βJ are smoothing parameters set empirically;
#(wi, J) indicates whether the annotation word wi appears in training image J: if so, #(wi, J) = 1, otherwise #(wi, J) = 0;
#(wi, T) indicates whether the annotation word wi appears in the training set T: if so, #(wi, T) = 1, otherwise #(wi, T) = 0;
#(bk, J) indicates whether the visual lemma bk appears in training image J: if so, #(bk, J) = 1, otherwise #(bk, J) = 0;
|J| is the total number of annotation words and visual lemmas in training image J; and |T| is the total number of annotation words and visual lemmas in the training set T.
In summary, the technical scheme provided by the invention converts the joint-probability calculation of annotation words and images in the joint media correlation model into computing the probability of the image conditioned on the annotation words together with the prior probability of the annotation phrase. This greatly reduces the influence of high-frequency candidate annotation words on the probabilistic model, lets non-high-frequency candidate annotation words play a greater role, and improves their recall and precision. At the same time, a semantically similar language model is introduced into the joint media correlation model to estimate the prior probability of a group of annotation words, so that a group of annotation words with stronger semantic correlation is more likely to be generated, thereby improving the overall annotation quality.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
The technical scheme of the invention is as follows:
A. According to the formula vi = p(ci|w)/p(ci), calculate a semantic vector for each annotation word w in the training set T, expressing the annotation word w in vector form w = <v1, v2, …, vm>, where ci is a context-related word and there are m context-related words in total; p(ci) is the overall distribution probability of ci, and p(ci|w) is the ratio of the number of co-occurrences of ci and the annotation word w in the training set T to the total number of occurrences of w in T, i.e. p(ci|w) = count(ci, w)/count(w); the context-related words are annotation words in the training set T;
B. According to the formula sim(wi, wj) = wi·wj/(‖wi‖·‖wj‖), calculate the semantic similarity between annotation words, where ‖·‖ is the vector norm and wi·wj is the vector dot product;
C. According to the formula p(A) ∝ (1/(n−1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} sim(wi, wj), calculate p(A), where A is an annotation phrase {w1, w2, …, wn} and n is the number of annotation words in the phrase;
D. According to the formula p(I|wi) = p(wi, b1, …, bn)/p(wi), calculate the conditional probability p(I|wi), where p(wi) is the ratio of the number of occurrences of the annotation word wi in the training set T to the total number of occurrences of all annotation words in T, i.e. p(wi) = |wi|/Σ_{wk∈T}|wk|;
The calculation method of p(wi, b1, …, bn) is:

p(wi, b1, …, bn) = Σ_{J∈T} p(J)·p(wi|J)·Π_{k=1..n} p(bk|J),

where p(J) is the probability of randomly drawing a training image J from the image set P; p(wi|J) is the posterior probability that the annotation word wi appears in training image J; and p(bk|J) is the posterior probability that the visual lemma bk appears in training image J;
E. According to p(I|A) ≈ Π_{wi∈A} p(I|wi), calculate p(I|A);
F. According to the formula A = argmax_A p(I|A)·p(A), calculate the annotation phrase A of the image I to be annotated.
The image annotation problem can be defined as follows: given a training set T comprising an image set P and an annotation word set W, where each image pi is annotated with words from W, how should a group of annotation words A be selected from W to annotate a new image I?
The image annotation method of the invention adopts a probabilistic model. The goal is to find the annotation phrase A that maximizes the conditional probability p(A|I), namely:

A = argmax_A p(A|I)    (3)
where A is an annotation phrase {w1, w2, …, wn} and the image I is represented by a group of visual lemmas {b1, b2, …, bm}, obtained by preprocessing the image I (image segmentation, feature extraction, feature-value normalization, etc.) and classifying its block regions. p(A|I) can be rewritten as follows:
p(A|I) = p(A, I) / p(I)    (4)
Since the prior probability of an image is usually assumed to be uniformly distributed, p(I) can be treated as a constant, and

p(A, I) = p(I|A)·p(A)    (5)
Simplifying formula (3) with formulas (4) and (5) gives:

A = argmax_A p(I|A)·p(A)    (6)
The optimal annotation phrase A is found by maximizing the product of the two probabilities p(I|A) and p(A). p(I|A) can be obtained from the original image annotation model, and p(A) from the language model. The influence of the original image model and the language model on the final annotation result is expressed by giving the two probabilities different weights:
A = argmax_A p(I|A)^λ1 · p(A)^λ2    (7)
Taking logarithms, this is converted into the following form:

A = argmax_A (λ1·log p(I|A) + λ2·log p(A))    (8)
the annotation phrase A can be obtained by calculating p (A) and p (I/A). Wherein λ is1And λ2The method is determined in the machine learning and model building process of a training image set, and two constants are used in the automatic labeling process of the image to be detected.
The technical solution of the present invention is described below using a training set T containing l images as an example, with I the image to be annotated. The l images of T form an image set P = [p1 p2 … pl]; each image of T is annotated with n annotation words, and all annotation words in T form an annotation word set W = [w1 w2 … ws]; each image in T has corresponding visual lemmas, and all visual lemmas in T form a visual lemma set B = [b1 b2 … by].
Fig. 1 is a flowchart of the present embodiment, and as shown in fig. 1, the method includes the following steps:
step 101: and carrying out image preprocessing and block region classification operation on the image I to be annotated.
In this step, image preprocessing (image segmentation, feature extraction, feature-value normalization, etc.) is performed on the image I to be annotated, followed by block-region classification: each image block region is classified using a clustering algorithm, and the visual content of the image is represented by a combination of visual lemmas: I = {i1 i2 … if}. The method for obtaining visual lemmas is prior art and is not described in detail here.
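Step 101 can be sketched as follows, assuming the clustering step has already produced a set of cluster centroids. The nearest-centroid assignment is an assumption: the patent only states that a clustering algorithm classifies the block regions.

```python
def assign_visual_lemmas(block_features, centroids):
    """Map each preprocessed image block region (a feature vector) to the
    index of its nearest cluster centroid; the resulting indices i1..if
    serve as the image's visual lemmas."""
    def dist2(a, b):
        # squared Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: dist2(f, centroids[k]))
            for f in block_features]
```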
Step 102: calculate p(A) with the semantically similar language model.
To introduce correlation information between annotation words into the similarity between them, the invention represents each annotation word w with a semantic vector model. Let the context-related word set be C = [c1 c2 … cm], where each element ci is one context-related word and there are m context-related words in total; all annotation words in the annotation word set W of the training set T can be chosen as context-related words, i.e. C = W. Each annotation word w is then represented by the vector of context-related words associated with it, i.e. w = <v1, v2, …, vm>, where each semantic component vi is defined as the ratio of the conditional probability of the context-related word ci given the annotation word w to the overall probability of ci:

vi = p(ci|w) / p(ci)    (9)
where p(ci) is the overall distribution probability of the context-related word ci, assumed to be uniform. The conditional probability p(ci|w) is the ratio of the number of times ci co-occurs with the annotation word w in the annotations of all images in the image set P of the training set T to the total number of times w appears in those annotations:

p(ci|w) = count(ci, w) / count(w)    (10)
p(ci|w) reflects the strength of co-occurrence between the annotation word w and the context-related words; dividing by the overall probability of each context-related word prevents the semantic vector w = <v1, v2, …, vm> from being dominated by context-related words with high occurrence frequency, since high-frequency related words tend to have large conditional probabilities. In Table 1, "sky", "sun", "clouds" and "town" are a group of context-related words, "tree", "building" and "river" are a group of annotation words, and the semantic vectors of the annotation words are shown.
TABLE 1

          sky    sun    clouds  town
tree      2.56   0.91   0.74    0.63
building  5.01   0.57   2.41    21.19
river     2.57   2.57   1.12    5.72
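The semantic-vector construction of equations (9)-(10) can be sketched as follows. The uniform p(ci) = 1/m follows the text; the toy data layout (one annotation-word set per image) and the exclusion of a word's co-occurrence with itself are illustrative assumptions.

```python
from collections import Counter

def semantic_vectors(annotations, context_words):
    """Build w = <v1, ..., vm> with vi = p(ci|w)/p(ci) (equations 9-10).
    annotations: one set of annotation words per training image.
    p(ci) is taken as uniform over the m context words, as in the text."""
    p_c = 1.0 / len(context_words)     # uniform overall distribution p(ci)
    count_w = Counter()                # count(w): images annotated with w
    cooc = Counter()                   # count(ci, w): co-occurrence counts
    for words in annotations:
        for w in words:
            count_w[w] += 1
            for c in context_words:
                if c in words and c != w:  # self-co-occurrence excluded (assumption)
                    cooc[(c, w)] += 1
    return {w: [(cooc[(c, w)] / count_w[w]) / p_c for c in context_words]
            for w in count_w}
```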
The semantic similarity between annotation words is then calculated as shown in equation (11):

sim(wi, wj) = wi·wj / (‖wi‖·‖wj‖)    (11)

where ‖·‖ is the vector norm.
The dot product wi·wj is calculated as shown in equation (12):

wi·wj = Σ_{k=1..m} v_{wi,k}·v_{wj,k} = Σ_{k=1..m} (p(ck|wi)/p(ck)) · (p(ck|wj)/p(ck))    (12)

where ck is a context-related word. The semantic similarities between the annotation words are shown in Table 2. The similarity value ranges from 0 to 1; the higher the value, the more similar two annotation words are, and the more likely they are to appear in the same image.
TABLE 2
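The cosine similarity of equation (11) over two semantic vectors can be sketched directly; the zero-vector convention is an illustrative choice, not specified by the patent.

```python
import math

def cosine_sim(wi, wj):
    """sim(wi, wj) = wi.wj / (||wi||*||wj||), equations (11)-(12);
    wi and wj are semantic vectors over the same m context words."""
    dot = sum(a * b for a, b in zip(wi, wj))
    norm_i = math.sqrt(sum(a * a for a in wi))
    norm_j = math.sqrt(sum(b * b for b in wj))
    if norm_i == 0.0 or norm_j == 0.0:  # zero vector: no shared context (convention)
        return 0.0
    return dot / (norm_i * norm_j)
```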
Assuming that within the same annotation each annotation word is semantically related to its context-related words, the probability p(A) of a group of annotation words A = {w1, w2, …, wn} can be obtained by calculating the similarity between each annotation word and the other annotation words:

p(A) ∝ (1/(n−1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} sim(wi, wj)    (13)
Substituting equations (10), (11) and (12) into equation (13), the probability p(A) of the annotation phrase can be calculated as:

p(A) ∝ (1/(n−1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} [ Σ_{k=1..m} (count(ck, wi)/(count(wi)·p(ck))) · (count(ck, wj)/(count(wj)·p(ck))) ] / (‖wi‖·‖wj‖)    (14)
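Equation (13) scores a candidate phrase by its average pairwise similarity; a minimal sketch, assuming a pairwise similarity function sim on two semantic vectors is supplied by the caller:

```python
def phrase_prior(A, vectors, sim):
    """Score p(A) up to proportionality via equation (13):
    (1/(n-1)) * sum over ordered pairs (wi, wj), i != j, of sim(wi, wj).
    A: list of annotation words; vectors: word -> semantic vector;
    sim: pairwise similarity function on two vectors."""
    n = len(A)
    if n < 2:
        return 0.0
    total = sum(sim(vectors[wi], vectors[wj])
                for wi in A for wj in A if wi != wj)
    return total / (n - 1)
```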
step 103: p (I/A) is calculated by the joint media correlation model.
In this step, the conditional probability p(I|wi) is first calculated according to the formula p(I|wi) = p(wi, b1, …, bn)/p(wi), where:

The prior probability p(wi) of the annotation word wi is the ratio of the number of its occurrences in the training set T to the total number of occurrences of all annotation words:

p(wi) = |wi| / Σ_{wk∈T} |wk|    (15)
The calculation method of p(wi, b1, …, bn) is:

p(wi, b1, b2, …, bn) = Σ_{J∈T} p(J)·p(wi|J)·Π_{k=1..n} p(bk|J)    (16)
p(J) is the probability of randomly drawing a training image J from the image set P, generally assumed to be uniform; p(wi|J) is the posterior probability that the annotation word wi appears in training image J; and p(bk|J) is the posterior probability that the visual lemma bk appears in training image J. Each probability is estimated as follows:
p(wi|J) = (1 − αJ)·#(wi, J)/|J| + αJ·#(wi, T)/|T|    (17)

p(bk|J) = (1 − βJ)·#(bk, J)/|J| + βJ·#(bk, T)/|T|    (18)
where αJ and βJ are smoothing parameters set empirically; #(wi, J) indicates whether the annotation word wi appears in training image J: if so, #(wi, J) = 1, otherwise #(wi, J) = 0; #(wi, T) indicates whether wi appears in the training set T: if so, #(wi, T) = 1, otherwise #(wi, T) = 0; #(bk, J) indicates whether the visual lemma bk appears in training image J: if so, #(bk, J) = 1, otherwise #(bk, J) = 0; |J| is the total number of annotation words and visual lemmas in training image J; and |T| is the total number of annotation words and visual lemmas in the training set T.
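The smoothed estimates (17)-(18) and the joint probability (16) can be sketched together. The α = β = 0.1 defaults and the data layout (per-image word and lemma sets plus a size |J|) are illustrative assumptions, since the patent only says the smoothing parameters are set empirically.

```python
def smoothed_prob(in_image, in_train, size_J, size_T, smooth):
    """Equations (17)-(18): (1-s)*#(item,J)/|J| + s*#(item,T)/|T|,
    where #(.,.) is a 0/1 presence indicator."""
    return ((1 - smooth) * (1.0 if in_image else 0.0) / size_J
            + smooth * (1.0 if in_train else 0.0) / size_T)

def joint_prob(wi, lemmas, training_images, size_T, alpha=0.1, beta=0.1):
    """Equation (16): p(wi, b1..bn) = sum_J p(J) p(wi|J) prod_k p(bk|J),
    with p(J) uniform. training_images: (word_set, lemma_set, size_J) triples."""
    p_J = 1.0 / len(training_images)
    train_items = set().union(*(ws | ls for ws, ls, _ in training_images))
    total = 0.0
    for ws, ls, size_J in training_images:
        p = smoothed_prob(wi in ws, wi in train_items, size_J, size_T, alpha)
        for b in lemmas:
            p *= smoothed_prob(b in ls, b in train_items, size_J, size_T, beta)
        total += p_J * p
    return total
```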
p(I|A) can then be approximated as p(I|A) ≈ Π_{wi∈A} p(I|wi).
Step 104: calculate the annotation phrase of the image I to be annotated.
With p(A) and p(I|A) solved as above, the annotation phrase A of the image I is calculated according to A = argmax_A (λ1·log p(I|A) + λ2·log p(A)).
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (2)

1. An automatic image annotation method based on inter-word correlation, characterized in that a training set T contains l images, the l images forming an image set P = [p1 p2 … pl]; each image of the training set T is annotated with n annotation words, and all annotation words in the training set T form an annotation word set W = [w1 w2 … ws]; each image in the training set T has corresponding visual lemmas, and all visual lemmas in the training set T form a visual lemma set B = [b1 b2 … by]; the image to be annotated is I; the method comprises the following steps:

A. according to the formula vi = p(ci|w)/p(ci), calculating a semantic vector of each annotation word w in the training set T, and expressing the annotation word w in vector form w = <v1, v2, …, vm>, where ci is a context-related word and there are m context-related words in total; p(ci) is the overall distribution probability of ci, and p(ci|w) is the ratio of the number of co-occurrences of ci and the annotation word w in the training set T to the total number of occurrences of w in T, i.e. p(ci|w) = count(ci, w)/count(w); the context-related words are annotation words in the training set T;

B. according to the formula sim(wi, wj) = wi·wj/(‖wi‖·‖wj‖), calculating the semantic similarity between annotation words, where ‖·‖ is the vector norm and wi·wj is the vector dot product;

C. according to the formula p(A) ∝ (1/(n−1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} sim(wi, wj), calculating p(A), where A is an annotation phrase {w1, w2, …, wn} and n is the number of annotation words in the phrase;

D. according to the formula p(I|wi) = p(wi, b1, …, bn)/p(wi), calculating the conditional probability p(I|wi), where p(wi) is the ratio of the number of occurrences of the annotation word wi in the training set T to the total number of occurrences of all annotation words in T, i.e. p(wi) = |wi|/Σ_{wk∈T}|wk|;

the calculation method of p(wi, b1, …, bn) being:

p(wi, b1, …, bn) = Σ_{J∈T} p(J)·p(wi|J)·Π_{k=1..n} p(bk|J),

where p(J) is the probability of randomly drawing a training image J from the image set P, p(wi|J) is the posterior probability that the annotation word wi appears in training image J, and p(bk|J) is the posterior probability that the visual lemma bk appears in training image J;

E. according to p(I|A) ≈ Π_{wi∈A} p(I|wi), calculating p(I|A);

F. according to the formula A = argmax_A p(I|A)·p(A), calculating the annotation phrase A of the image I to be annotated.
2. The method of claim 1, characterized in that the calculation methods of p(wi|J) and p(bk|J) in step D are respectively:

p(wi|J) = (1 − αJ)·#(wi, J)/|J| + αJ·#(wi, T)/|T|    (1)

p(bk|J) = (1 − βJ)·#(bk, J)/|J| + βJ·#(bk, T)/|T|    (2)

where αJ and βJ are smoothing parameters set empirically;

#(wi, J) indicates whether the annotation word wi appears in training image J: if so, #(wi, J) = 1, otherwise #(wi, J) = 0;

#(wi, T) indicates whether the annotation word wi appears in the training set T: if so, #(wi, T) = 1, otherwise #(wi, T) = 0;

#(bk, J) indicates whether the visual lemma bk appears in training image J: if so, #(bk, J) = 1, otherwise #(bk, J) = 0;

|J| is the total number of annotation words and visual lemmas in training image J; and |T| is the total number of annotation words and visual lemmas in the training set T.
CN201410008553.1A 2014-01-08 2014-01-08 Automatic image marking method based on word correlation Active CN103714178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410008553.1A CN103714178B (en) 2014-01-08 2014-01-08 Automatic image marking method based on word correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410008553.1A CN103714178B (en) 2014-01-08 2014-01-08 Automatic image marking method based on word correlation

Publications (2)

Publication Number Publication Date
CN103714178A CN103714178A (en) 2014-04-09
CN103714178B true CN103714178B (en) 2017-01-25

Family

ID=50407153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410008553.1A Active CN103714178B (en) 2014-01-08 2014-01-08 Automatic image marking method based on word correlation

Country Status (1)

Country Link
CN (1) CN103714178B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794183A (en) * 2015-04-10 2015-07-22 浙江大学 Picture labeling method based on multiple views and multiple labels
CN108604902A (en) * 2016-02-08 2018-09-28 皇家飞利浦有限公司 Determine the device and method of cluster
CN108268875B (en) * 2016-12-30 2020-12-08 广东精点数据科技股份有限公司 Image semantic automatic labeling method and device based on data smoothing
CN110162644B (en) * 2018-10-10 2022-12-20 腾讯科技(深圳)有限公司 Image set establishing method, device and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN1920820A (en) * 2006-09-14 2007-02-28 浙江大学 Image meaning automatic marking method based on marking significance sequence
CN101620615A (en) * 2009-08-04 2010-01-06 西南交通大学 Automatic image annotation and translation method based on decision tree learning
CN101685464A (en) * 2009-06-18 2010-03-31 浙江大学 Method for automatically labeling images based on community potential subject excavation
CN102298606A (en) * 2011-06-01 2011-12-28 清华大学 Random walking image automatic annotation method and device based on label graph model
CN102542067A (en) * 2012-01-06 2012-07-04 上海交通大学 Automatic image semantic annotation method based on scale learning and correlated label dissemination

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN101587478B (en) * 2008-05-20 2013-07-24 株式会社理光 Methods and devices for training, automatically labeling and searching images

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN1920820A (en) * 2006-09-14 2007-02-28 浙江大学 Image meaning automatic marking method based on marking significance sequence
CN101685464A (en) * 2009-06-18 2010-03-31 浙江大学 Method for automatically labeling images based on community potential subject excavation
CN101620615A (en) * 2009-08-04 2010-01-06 西南交通大学 Automatic image annotation and translation method based on decision tree learning
CN102298606A (en) * 2011-06-01 2011-12-28 清华大学 Random walking image automatic annotation method and device based on label graph model
CN102542067A (en) * 2012-01-06 2012-07-04 上海交通大学 Automatic image semantic annotation method based on scale learning and correlated label dissemination

Non-Patent Citations (3)

Title
"Hidden-Concept Driven Multilabel Image Annotation and Label Ranking";Bing-Kun Bao et al.;《IEEE Transactions on Multimedia》;20120229;第14卷(第1期);第199-210页 *
"Latent Semantic Analysis-based Image Auto Annotation";Mahdia Bakalem et al.;《IEEE Conf. on Machine and Web Intelligence》;20101231;第2010年卷;第460-463页 *
"词间相关性的CMRM图像标注方法";刘咏梅等;《智能***学报》;20110831;第6卷(第4期);第350-354页 *

Also Published As

Publication number Publication date
CN103714178A (en) 2014-04-09

Similar Documents

Publication Publication Date Title
CN107861939B (en) Domain entity disambiguation method fusing word vector and topic model
Quan et al. Unsupervised product feature extraction for feature-oriented opinion determination
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN105320960B (en) Voting-based cross-language subjective and objective emotion classification method
US10042896B2 (en) Providing search recommendation
WO2017166912A1 (en) Method and device for extracting core words from commodity short text
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN105808530B (en) Interpretation method and device in a kind of statistical machine translation
CN107590219A (en) Webpage personage subject correlation message extracting method
CN107391565B (en) Matching method of cross-language hierarchical classification system based on topic model
CN106778878B (en) Character relation classification method and device
CN103714178B (en) Automatic image marking method based on word correlation
US9262400B2 (en) Non-transitory computer readable medium and information processing apparatus and method for classifying multilingual documents
CN104199845B (en) Line Evaluation based on agent model discusses sensibility classification method
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN115630640A (en) Intelligent writing method, device, equipment and medium
CN111046171B (en) Emotion discrimination method based on fine-grained labeled data
CN106339371B (en) A kind of English-Chinese meaning of a word mapping method and device based on term vector
CN112101014A (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
Gong et al. A semantic similarity language model to improve automatic image annotation
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
CN108182443B (en) Automatic image labeling method and device based on decision tree
Xu et al. A classification of questions using SVM and semantic similarity analysis
CN109002540B (en) Method for automatically generating Chinese announcement document question answer pairs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant