CN104021222A

CN104021222A - Labeling algorithm for biomedical images based on a latent Dirichlet model

Info

Publication number: CN104021222A
Application number: CN201410289320.3A
Authority: CN
Inventors: 盛建强; 张运生; 李华忠
Original assignee: Shenzhen Institute of Information Technology
Current assignee: Shenzhen Institute of Information Technology
Priority date: 2014-06-26
Filing date: 2014-06-26
Publication date: 2014-09-03

Abstract

The invention provides a labeling algorithm for biomedical images based on LDA (Latent Dirichlet Allocation). It targets the labeling of biomedical images and exploits a particular property of biomedical image corpora: each image has a corresponding text file. LDA is first used to extract topic words from the caption of an image; a context is then extracted from the image's text file according to these topic words; finally, LDA is used to model the context, and the resulting topic words serve as the final labels of the biomedical image. The beneficial effects are that the captions and text files associated with the images in the data set are fully used to mine label words, accuracy is high, and several label words can be generated at once. Once biomedical images are accurately labeled, related images can be found through a keyword index, which is convenient and fast and matches people's text retrieval habits.

Description

A labeling algorithm for biomedical images based on the latent Dirichlet model

Technical field

The present invention relates to an image labeling algorithm, and in particular to a labeling algorithm for biomedical images based on the latent Dirichlet model.

Background technology

With the development of digital imaging technology and the growing popularity of devices capable of taking pictures, such as digital cameras, the number of images of all kinds is growing geometrically. At the same time, the rapid development of the Internet has made image dissemination and sharing faster. Image retrieval technology emerged to organize, query, and browse image resources of this scale effectively, and it has become a research focus in the field of computer vision.

Existing image retrieval methods fall mainly into two kinds: content-based image retrieval (CBIR) and text-based image retrieval. CBIR requires the user to provide an image as the query. The system extracts low-level visual features of the image, such as color, texture, and shape, builds a visual index for the images, and then finds matches according to the visual similarity between the images in the database and the query. Because low-level visual features are inconsistent with high-level semantic concepts, the so-called "semantic gap", the performance of CBIR is unsatisfactory. Text-based image retrieval requires a text index to be built for the images in advance. At search time the user only submits text as the query, and the system finds and returns similar images according to text matching, so that image retrieval is converted into retrieval of text keywords.

Compared with CBIR, text-based image retrieval only requires the user to submit text keywords. It is convenient and fast, is preferred by users, and has therefore become the principal approach of mainstream commercial image search engines. However, this approach requires a text index to be built for the images, that is, the images must be semantically labeled, which is a challenging task and the central problem of text-based image retrieval. One traditional approach is manual labeling, but it is time-consuming and laborious and clearly cannot cope with large-scale network images. It is therefore very important to remove manual intervention and label images semantically in an automatic, fast, and efficient way.

To label images automatically, one existing method is to classify the images and use the classification result as the label. Specifically, each semantic keyword is regarded as a class label, a number of classifiers are trained on these classes, and the classifiers are then applied to unlabeled images; the classes assigned to an image are its labels. Many mature classification algorithms exist, for example the support vector machine (SVM) [1] and the hidden Markov model (HMM) [2].

Adopt the method for classification to carry out image labeling, depend on the accuracy of sorting algorithm, although current sorting algorithm accuracy is higher, but still have certain error.In addition, existing sorting algorithm is binary classification device mostly, and for example support vector machine, so for the image that has multiple mark, just need to design multiple sorters, and image is repeatedly classified, and efficiency is not high yet.

Summary of the invention

The object of the invention is to overcome the following deficiencies: manual labeling is time-consuming and laborious and, in particular, cannot cope with large-scale network images; labeling by classification introduces errors; and an image with multiple labels requires multiple classifiers to be designed and the image to be classified repeatedly, which is inefficient. The invention is aimed mainly at labeling biomedical images. In a biomedical image corpus, every image has a corresponding text file. Exploiting this property, the invention proposes a labeling algorithm for biomedical images based on LDA (latent Dirichlet allocation): LDA is used to extract topic words from the caption of an image; a context is then extracted from the image's text file according to these topic words; finally, LDA is used again to model the context, and the resulting topic words serve as the final labels of the biomedical image.

The present invention is achieved through the following technical solution: a labeling algorithm for biomedical images based on the latent Dirichlet model, comprising:

A training set construction module: the data set of the LDA model consists of the captions of all biomedical images. The content of the caption node is extracted from the text file corresponding to each biomedical image; this content is the caption of the image. The captions of all images together form the training sample set of the LDA model. At the same time, the number of topics and the Dirichlet prior parameters of the document-topic distribution and of the topic-word distribution are set to empirical values. The text file is generally in XML format;

An LDA training module, which trains the LDA model on the training sample set produced by the training set construction module to generate the document-topic distribution and the topic-word distribution;

A topic word extraction module, which applies LDA modeling to the caption of each biomedical image and extracts all topic words from the resulting model (topic distribution and word distribution). For an unlabeled image, the LDA model produced by the LDA training module is used to model the caption of the image, and all words in the modeling result (topic distribution and word distribution) are extracted as topic words of the image and added to the topic word set;

A topic word refinement module, which optimizes the topic word set produced by the extraction module to obtain the most compact and most effective topic word set. In the result of modeling the caption with the LDA model, if the probability of a topic word in the topic-word distribution is zero, the word is removed from the topic word set; if the caption of the image does not contain a topic word, the word is removed from the topic word set; if the topic word set contains repeated words, the repeats are removed and only one is kept. These operations yield a more refined topic word set;

A context sentence indexing module, which retrieves from the text file of the image the sets of sentences closely associated with the topic words in the refined topic word set. The module uses Lucene as the retrieval tool: each word in the refined topic word set is used as a query to retrieve all sentences containing that topic word. When indexing is complete, each topic word has an associated sentence set;

A context generation module, which chooses the most closely related sentence from the sentence set of each topic word and gathers all such sentences to form the context of the image. The core task of this module is to choose the most closely related sentence for each topic word; the set formed by all these sentences is the context;

A label generation module, which again uses the LDA model obtained by the LDA training module to model the context of the image, yielding the topic distribution and word distribution of the image. The probability of each word in the topic-word distribution is then multiplied by the probability of the corresponding topic, and the result is taken as the weight of the word. All words are sorted by weight in descending order, and the first several words are chosen as the label words of the biomedical image.
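The refinement and label generation modules described above are simple enough to sketch in a few lines of Python. This is only an illustrative sketch: the function names `refine_topic_words` and `rank_labels`, the dictionary representation of the distributions, and the rule that a word appearing under several topics keeps its largest weight are assumptions, since the text does not specify them.

```python
from typing import Dict, List, Tuple


def refine_topic_words(candidates: List[str],
                       word_prob: Dict[str, float],
                       caption_tokens: List[str]) -> List[str]:
    """Drop zero-probability words, words absent from the caption, and repeats."""
    caption_vocab = set(caption_tokens)
    seen = set()
    refined: List[str] = []
    for word in candidates:
        if word_prob.get(word, 0.0) == 0.0:
            continue  # probability in the topic-word distribution is zero
        if word not in caption_vocab:
            continue  # word does not occur in the caption
        if word in seen:
            continue  # keep only one copy of a repeated word
        seen.add(word)
        refined.append(word)
    return refined


def rank_labels(theta: List[float],
                phi: List[Dict[str, float]],
                top_n: int) -> List[Tuple[str, float]]:
    """Weight each word by p(word | topic) * p(topic | context); keep the top_n."""
    weights: Dict[str, float] = {}
    for k, topic_prob in enumerate(theta):
        for word, p in phi[k].items():
            weight = p * topic_prob
            # Assumption: a word listed under several topics keeps its best weight.
            if weight > weights.get(word, 0.0):
                weights[word] = weight
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

Here `theta` stands for the topic distribution of the context and `phi[k]` for the word distribution of topic `k`; how many words to keep (`top_n`) is left open by the text ("the first several words").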

The concrete steps of the labeling algorithm for biomedical images based on the latent Dirichlet model are as follows:

(a) start;

(b) training set construction module: choose a subset of the biomedical images as the training set and extract the caption in the caption node from the text file of each image to form the training data set of the LDA model; at the same time, specify the number of topics, the prior parameter of the document-topic distribution, and the prior parameter of the topic-word distribution;

(c) LDA training module: train the LDA model with the Gibbs sampling algorithm; first sample the topic assigned to each word, then compute the document-topic distribution and the topic-word distribution;

(d) topic word extraction module: for an unlabeled image, model its caption with the trained LDA model and choose all topic words to form the topic word set;

(e) topic word refinement module: optimize the topic word set by removing repeated words, words whose probability is zero, and words that do not occur in the caption, thereby obtaining the refined topic word set;

(f) context sentence indexing module: for one topic word, use Lucene to retrieve from the text file of the image all sentences containing that word; these form a sentence set, denoted the sentence set of the topic word;

(g) if all topic words have corresponding sentence sets, go to (h); otherwise return to (f);

(h) context generation module: choose the most closely related sentence from the sentence set of each topic word to form the context of the image;

(i) use the LDA model trained in (c) to model the context, then multiply the probability of each word in the topic-word distribution by the probability of the corresponding topic and take the result as the weight of the word; sort all words in descending order of weight and choose the first several as the final labels of the image;

(j) if all unlabeled images have been labeled, go to (k); otherwise jump to (d);

(k) end.
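Step (c) names Gibbs sampling but gives no further detail. One common way to realise it is a collapsed Gibbs sampler, sketched below in plain Python. The function name `train_lda_gibbs`, the encoding of each caption as a list of integer word ids, and the fixed iteration count are assumptions made for illustration, not details taken from the text.

```python
import random
from typing import List, Tuple


def train_lda_gibbs(docs: List[List[int]], vocab_size: int, num_topics: int,
                    alpha: float, beta: float, iterations: int, seed: int = 0
                    ) -> Tuple[List[List[float]], List[List[float]]]:
    """Collapsed Gibbs sampling for LDA; returns (document-topic, topic-word)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, v)
    topic_total = [0] * num_topics                              # n(k)
    assignments: List[List[int]] = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_d = []
        for v in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][v] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                probs = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=probs)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Turn the final counts into smoothed distributions.
    theta = [[(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
              for t in range(num_topics)] for d, doc in enumerate(docs)]
    phi = [[(topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
            for v in range(vocab_size)] for t in range(num_topics)]
    return theta, phi
```

The sampler first draws a topic for every word, as step (c) describes, and only afterwards derives the document-topic distribution (`theta`) and the topic-word distribution (`phi`) from the counts. A production system would more likely rely on an existing LDA library than on a hand-written loop like this one.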

Further, the context generation module chooses the most closely related sentence from the sentence set of each topic word. The algorithm for choosing this sentence is:

1) repeat

2) for each word w in the refined topic word set

3) find the associated sentence set of w, denoted SS

4) define an integer array VOTE whose length equals the length N of SS; each element of VOTE holds the number of votes received by the corresponding sentence in SS

5) repeat

6) for i = 0 to N-1

7) for each word c in the refined topic word set

8) if sentence SS[i] contains word c

9) increase the vote count VOTE[i] of this sentence by 1

10) until all sentences in SS have been traversed

11) find the index j of the maximum vote count in VOTE; SS[j] is then the most closely related sentence for topic word w

12) until all topic words have been traversed.
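The voting procedure above translates almost directly into Python. The sketch below is a minimal version under stated assumptions: sentences are split on whitespace and compared case-insensitively, ties go to the first sentence with the maximum vote count, and the names `closest_sentences` and `build_context` are illustrative only.

```python
from typing import Dict, List


def closest_sentences(topic_words: List[str],
                      sentence_sets: Dict[str, List[str]]) -> Dict[str, str]:
    """For each topic word w, pick the sentence in SS that contains the most topic words."""
    closest: Dict[str, str] = {}
    for w in topic_words:
        ss = sentence_sets.get(w, [])
        if not ss:
            continue  # no sentence was retrieved for this topic word
        votes = [0] * len(ss)  # VOTE[i] counts topic words found in SS[i]
        for i, sentence in enumerate(ss):
            tokens = set(sentence.lower().split())
            for c in topic_words:
                if c.lower() in tokens:
                    votes[i] += 1
        j = votes.index(max(votes))  # first index with the maximum vote count
        closest[w] = ss[j]
    return closest


def build_context(topic_words: List[str], closest: Dict[str, str]) -> List[str]:
    """Gather the chosen sentences into the context, skipping repeats."""
    context: List[str] = []
    for w in topic_words:
        sentence = closest.get(w)
        if sentence is not None and sentence not in context:
            context.append(sentence)
    return context
```

Because every sentence in SS already contains w, each sentence starts with at least one vote; the sentence that also mentions the most other topic words wins.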

The beneficial effects of the present invention are:

For labeling biomedical images, the captions and text files associated with the images in the data set are fully used to mine the label words of each image; accuracy is high, and multiple label words can be generated at once. Once biomedical images are accurately labeled, related images can be found through a keyword index, which is convenient and fast and better matches people's text retrieval habits.

Brief description of the drawings

Fig. 1 is a flow chart of the biomedical image labeling algorithm based on latent Dirichlet allocation according to the present invention.

Embodiment

The present invention is further described below with reference to the drawing and a specific embodiment:

A labeling algorithm for biomedical images based on the latent Dirichlet model, comprising:

As shown in Fig. 1, the concrete steps of the labeling algorithm for biomedical images based on the latent Dirichlet model are as follows:

(a) start;

(j) if all unlabeled images have been labeled, go to (k); otherwise jump to (d);

(k) end.

1) repeat

2) for each word w in the refined topic word set

3) find the associated sentence set of w, denoted SS

5) repeat

6) for i = 0 to N-1

7) for each word c in the refined topic word set

8) if sentence SS[i] contains word c

9) increase the vote count VOTE[i] of this sentence by 1

10) until all sentences in SS have been traversed

12) until all topic words have been traversed.

Based on the disclosure and teaching of the above description, those skilled in the art to which the invention pertains may make suitable changes and modifications to the above embodiment. The invention is therefore not limited to the specific embodiment disclosed and described above, and certain modifications and changes to the invention should also fall within the scope of protection of the claims. In addition, although some specific terms are used in this specification, they are used only for convenience of description and do not limit the invention in any way.

Claims

1. A labeling algorithm for biomedical images based on the latent Dirichlet model, characterized in that it comprises:

A label generation module, which again uses the LDA model obtained by the LDA training module to model the context of the image, yielding the topic distribution and word distribution of the image; the probability of each word in the topic-word distribution is then multiplied by the probability of the corresponding topic, and the result is taken as the weight of the word;

All words are sorted by weight in descending order, and the first several words are chosen as the label words of the biomedical image;

(a) start;

(j) if all unlabeled images have been labeled, go to (k); otherwise jump to (d);

(k) end.

2. The labeling algorithm for biomedical images based on the latent Dirichlet model according to claim 1, characterized in that the context generation module chooses the most closely related sentence from the sentence set of each topic word, and the algorithm for choosing this sentence is:

1) repeat

2) for each word w in the refined topic word set

3) find the associated sentence set of w, denoted SS

4) define an integer array VOTE whose length equals the length N of SS; each element of VOTE holds the number of votes received by the corresponding sentence in SS;

5) repeat

6) for i = 0 to N-1

7) for each word c in the refined topic word set

8) if sentence SS[i] contains word c

9) increase the vote count VOTE[i] of this sentence by 1

10) until all sentences in SS have been traversed

12) until all topic words have been traversed.