CN111209948A - Image processing method and device and electronic equipment - Google Patents

Image processing method and device and electronic equipment

Info

Publication number
CN111209948A
CN111209948A (application number CN201911426049.2A)
Authority
CN
China
Prior art keywords
image
region
super
feature
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911426049.2A
Other languages
Chinese (zh)
Inventor
王扬斌
张鹿鸣
王泽鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Fubo Technology Co Ltd
Original Assignee
Hangzhou Fubo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Fubo Technology Co Ltd filed Critical Hangzhou Fubo Technology Co Ltd
Priority to CN201911426049.2A priority Critical patent/CN111209948A/en
Publication of CN111209948A publication Critical patent/CN111209948A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image processing method and device and electronic equipment. An image processing method comprising: dividing an image to be detected into a plurality of super pixel areas; acquiring semantic labels corresponding to the semantic features according to the semantic features of the super-pixel region; embedding the semantic tags into the super-pixel region to generate a saliency region; extracting depth features of the saliency areas and generating an image kernel of the image to be detected; and processing the image kernel by using a vector classifier, and carrying out scene classification on the image to be detected.

Description

Image processing method and device and electronic equipment
Technical Field
The present application relates to the field of image processing, and in particular, to an image processing method and apparatus, and an electronic device.
Background
Image scene classification has important applications in computer vision and intelligent systems, such as image understanding and automatic driving. The technique aims to automatically classify images into different categories based on key information such as objects, regions and context. Existing deep-learning-based methods have a black-box training stage, which does not accord with human visual perception of image scenes. In the training stage, existing methods also require a large number of region- or pixel-level semantic labels, which poses a great challenge for manual labeling.
Disclosure of Invention
The embodiment of the application aims to provide an image processing method and device and an electronic device.
In a first aspect, an embodiment provides an image processing method, including: dividing an image to be detected into a plurality of super pixel areas; acquiring semantic labels corresponding to the semantic features according to the semantic features of the super-pixel region; embedding the semantic tags into the super-pixel region to generate a saliency region; extracting depth features of the saliency areas and generating an image kernel of the image to be detected; and processing the image kernel by using a vector classifier, and carrying out scene classification on the image to be detected.
In an optional embodiment, after the image to be measured is segmented into a plurality of super pixel regions, the method further includes: and removing the super pixel regions with the size smaller than a preset value or the super pixel fraction lower than a threshold value.
In an alternative embodiment, embedding semantic tags into the superpixel region, generating a saliency region, comprises: embedding the semantic tags into the super-pixel region by utilizing a manifold learning algorithm; acquiring a base matrix and a sparse matrix from an original matrix of the super-pixel region according to the semantic label; correspondingly acquiring a saliency region from the super-pixel region according to the base matrix; wherein the base matrix represents a feature matrix with semantic labels, and the sparse matrix represents a feature matrix without labels.
In an optional embodiment, after embedding the semantic tag into the super-pixel region and generating the saliency region, the method further includes: calculating a significance score of the significance region according to the sparse coding norm of the significance region; and sequencing the salient regions according to the salient scores to generate a generalized sequence pattern set.
In an optional embodiment, extracting depth features of a salient region and generating an image kernel of an image to be detected includes: acquiring depth features of a salient region corresponding to the generalized sequence pattern set according to a neural network architecture; acquiring a feature vector of the depth feature; and acquiring an image kernel from the salient region according to the Euclidean distance between the feature vectors.
In an optional embodiment, processing an image kernel by using a vector classifier, and performing scene classification on an image to be detected includes: training a multi-class support vector machine classifier based on the image kernel; and utilizing a support vector machine classifier to correspond the image to be detected to different scene categories according to the feature vector of the image to be detected.
In a second aspect, an embodiment provides an image processing apparatus, including: the image segmentation module is used for segmenting an image to be detected into a plurality of super pixel areas; the label acquisition module is used for acquiring semantic labels corresponding to the semantic features according to the semantic features of the super pixel region; the tag embedding module is used for embedding the semantic tags into the super-pixel region to generate a saliency region; the feature extraction module is used for extracting the depth features of the salient region and generating an image kernel of the image to be detected; and the scene classification module is used for processing the image kernel by using the vector classifier and carrying out scene classification on the image to be detected.
In an alternative embodiment, the tag embedding module is configured to: embedding the semantic tags into the super-pixel region by utilizing a manifold learning algorithm; acquiring a base matrix and a sparse matrix from an original matrix of the super-pixel region according to the semantic label; correspondingly acquiring a saliency region from the super-pixel region according to the base matrix; wherein the base matrix represents a feature matrix with semantic labels, and the sparse matrix represents a feature matrix without labels.
In an alternative embodiment, the feature extraction module is configured to: acquiring depth features of a salient region corresponding to the generalized sequence pattern set according to a neural network architecture; acquiring a feature vector of the depth feature; and acquiring an image kernel from the salient region according to the Euclidean distance between the feature vectors.
In an alternative embodiment, the scene classification module is configured to: training a multi-class support vector machine classifier based on the image kernel; and utilizing a support vector machine classifier to correspond the image to be detected to different scene categories according to the feature vector of the image to be detected.
In a third aspect, an embodiment provides an electronic device, including: a memory for storing a computer program; a processor for performing the method of any one of the preceding embodiments.
The beneficial effects brought by the technical solutions provided in the present application are:
1. The weakly supervised image scene classification method improves the accuracy of image scene classification by incorporating human visual perception.
2. By using the manifold learning algorithm, the embodiments of the application only need image-level semantic labels in the training stage, which greatly reduces the amount of manual labeling.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 2 is a schematic view of an interactive scene provided in an embodiment of the present application;
fig. 3 is a flowchart of an image processing method according to an embodiment of the present application;
FIG. 4 is a flowchart of another image processing method provided in the embodiments of the present application;
fig. 5 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
Icon: the system comprises an electronic device 1, a bus 10, a processor 11, a memory 12, a user terminal 100, a server 200, an image segmentation module 501, a label acquisition module 502, a label embedding module 503, a feature extraction module 504 and a scene classification module 505.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
As shown in fig. 1, the present embodiment provides an electronic apparatus 1 including: at least one processor 11 and a memory 12, one processor being exemplified in fig. 1. The processor 11 and the memory 12 are connected by a bus 10, and the memory 12 stores instructions executable by the processor 11 and the instructions are executed by the processor 11.
In an embodiment, the electronic device 1 may obtain raw image data stored in the memory, process the raw image data into an image kernel according to semantic features, and perform scene classification on the raw image data according to the image kernel according to a vector classifier.
Fig. 2 is a schematic view of an application scenario of the image processing method according to this embodiment. As shown in fig. 2, the application scenario may include the user terminal 100, which may be a smartphone or a tablet computer with a photographing function. The user terminal 100 may execute the image processing method provided by the present application to perform scene classification according to the captured image.
According to the requirement, the application scenario may further include a server 200, and the server 200 may be a server, a server cluster, or a cloud computing center. The server 200 may receive the image uploaded by the user terminal 100, execute the image processing method provided by the present application, and perform scene classification according to the captured image.
Please refer to fig. 3, which shows an image processing method provided in this embodiment; the method can be executed by the electronic device 1 shown in fig. 1 and used in the interaction scenario shown in fig. 2. The method comprises the following steps:
step 301: and dividing the image to be measured into a plurality of super pixel areas.
In this step, the image to be measured may be image data stored in the memory or image data collected by the user terminal. In the field of computer vision, image segmentation refers to the process of subdividing a digital image into a plurality of image sub-regions (sets of pixels), also called superpixels. A superpixel region is a small region formed by a series of pixels that are adjacent in position and similar in color, brightness, texture and other characteristics. Most of these small regions retain effective information for further image segmentation and generally do not destroy the boundary information of objects in the image.
In one embodiment, the image may be segmented into superpixels by the SLIC algorithm, using three segmentation parameters (0.5A, 0.2A and 0.1A, where A denotes the smaller of the image width and height) to obtain a series of superpixel regions. SLIC is a superpixel segmentation algorithm: it clusters similar pixels together with K-means clustering and restricts the K-means search range to 2S, where S is determined by the expected superpixel size. The search range is therefore greatly reduced and the computational efficiency improved.
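A minimal sketch of this multi-scale SLIC segmentation using scikit-image is given below; the way the three scale parameters are converted into a superpixel count is an assumption for illustration, not the patent's exact procedure.

```python
# Sketch only: derive a superpixel count from the scale parameters 0.5A, 0.2A, 0.1A,
# where A is the smaller of the image width and height (assumed interpretation).
import numpy as np
from skimage import io
from skimage.segmentation import slic

def multiscale_superpixels(image_path):
    image = io.imread(image_path)
    h, w = image.shape[:2]
    a = min(h, w)                          # A: the smaller of width and height
    labels_per_scale = []
    for scale in (0.5, 0.2, 0.1):
        region_size = scale * a            # target superpixel diameter (assumed)
        n_segments = max(1, int((h * w) / (region_size ** 2)))
        labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
        labels_per_scale.append(labels)
    return labels_per_scale                # one label map per scale

# labels = multiscale_superpixels("scene.jpg")
```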
In one embodiment, bottom-level features of each image region are extracted, comprising 9-dimensional color-moment features (color moments) and a 128-dimensional gradient histogram feature. Meanwhile, a linear discriminant analysis algorithm is used to learn a linear mapping matrix that maps the 137-dimensional features of a large number of candidate regions into two categories, well-segmented regions and poorly segmented regions, and the poorly segmented regions are removed to reduce their influence on the final classification result.
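The following Python sketch illustrates one way to build the 137-dimensional region descriptor (9 color-moment values plus a 128-bin gradient-orientation histogram) and feed it to a linear discriminant analysis model; the exact moment definitions, histogram binning and training data are assumptions rather than the patent's specification.

```python
# Hedged illustration: 9-dim color moments + 128-bin gradient histogram = 137-dim feature,
# then LDA to separate well-segmented from poorly segmented regions.
import numpy as np
from scipy.stats import skew
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def color_moments(region_pixels):
    """region_pixels: (n, 3) RGB values of one superpixel -> 9-dim vector."""
    moments = []
    for c in range(3):
        ch = region_pixels[:, c].astype(float)
        moments += [ch.mean(), ch.std(), skew(ch)]   # mean, std, skewness per channel
    return np.array(moments)

def gradient_histogram(gray_region, bins=128):
    """128-bin histogram of gradient orientations inside the region (assumed binning)."""
    gy, gx = np.gradient(gray_region.astype(float))
    angles = np.arctan2(gy, gx).ravel()
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist / (hist.sum() + 1e-8)

# X: (n_regions, 137) stacked descriptors, y: 1 = good segmentation, 0 = poor (hypothetical data)
# lda = LinearDiscriminantAnalysis().fit(X, y)
# keep = lda.predict(X_new) == 1          # discard poorly segmented regions
```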
Step 302: and acquiring semantic labels corresponding to the semantic features according to the semantic features of the super-pixel region.
In this step, the semantic features of an image can be divided into a visual layer, an object layer and a concept layer. The visual layer is the commonly understood bottom layer, namely color, texture, shape and so on; these features are called bottom-layer feature semantics. The object layer, i.e. the middle layer, usually contains attribute features, that is, the state of a certain object at a certain time. The concept layer is the high layer: what the image represents, closest to human understanding. For example, for an image containing sand, blue sky and seawater, the visual layer is the division into blocks, the object layer is the sand, blue sky and seawater, and the concept layer is the beach, which is the semantics the image expresses.
In one embodiment, the label for an image may have: buildings, pedestrians, cars, sky, etc.
Step 303: and embedding the semantic label into the super-pixel area to generate a saliency area.
In this step, a person is usually interested in only part of an image, and that part conveys the main content of the image. The salient region is the region of an image that most arouses the user's interest and best expresses the image content.
In one embodiment, a non-negative matrix decomposition is used to extract a salient region from an image to be detected, an original feature matrix is obtained, and the original feature matrix is decomposed into a base matrix and a sparse matrix.
Step 304: and extracting the depth features of the salient region to generate an image kernel of the image to be detected.
In this step, the salient regions are ordered by their saliency scores to form a generalized sequence pattern, simulating human visual perception, and a spatially preserving non-negative matrix factorization is constructed.
In one embodiment, the saliency score is a semantic/visual saliency parameter measured by the sparse coding norm of each region.
Step 305: and processing the image kernel by using a vector classifier, and carrying out scene classification on the image to be detected.
In this step, a multi-class vector classifier is trained based on the obtained image kernel features; in one embodiment, the vector classifier may be a Support Vector Machine (SVM).
In one embodiment, scene classification is performed based on image kernel features of a test image according to a trained binary SVM classifier.
Please refer to fig. 4, which shows another image processing method provided in this embodiment; the method can be executed by the electronic device 1 shown in fig. 1 and used in the interaction scenario shown in fig. 2. The method comprises the following steps:
step 401: and dividing the image to be measured into a plurality of super pixel areas. Please refer to the above embodiment for the description of step 301.
Step 402: and removing the super pixel regions with the size smaller than a preset value or the super pixel fraction lower than a threshold value.
In this step, superpixel regions with segmentation damage need to be removed after image segmentation. In an embodiment, superpixel regions smaller than 0.01·w·l pixels (w and l being the image width and height) and superpixel regions whose segmentation score is below a threshold are removed, while superpixel regions whose score is greater than or equal to the threshold are retained. In an embodiment, 177 labeled image classes in the ImageNet dataset are used to extract color moments and histogram features forming 137-dimensional features, and a mapping matrix is then obtained by training an LDA (Linear Discriminant Analysis) model. The SLIC segmentation score is measured using this mapping matrix. For example, the segmentation score interval is set to [0, 1], where 1 denotes good segmentation and 0 denotes poor segmentation, and the segmentation threshold is set to 0.4.
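A minimal sketch of this filtering rule follows, assuming the label map and per-superpixel segmentation scores are already available (the score itself would come from the LDA mapping described above):

```python
# Sketch: drop superpixels smaller than 0.01*w*l pixels or scoring below 0.4 (values from the text).
import numpy as np

def filter_superpixels(labels, scores, threshold=0.4):
    """labels: (h, w) superpixel label map; scores: dict {label: score in [0, 1]}."""
    h, w = labels.shape
    min_size = 0.01 * h * w
    kept = []
    for sp in np.unique(labels):
        size = int((labels == sp).sum())
        if size >= min_size and scores.get(sp, 0.0) >= threshold:
            kept.append(sp)
    return kept
```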
Step 403: and acquiring semantic labels corresponding to the semantic features according to the semantic features of the super-pixel region. Please refer to the above embodiments for the description of step 302.
Step 404: and embedding the semantic labels into the super-pixel region by using a manifold learning algorithm.
In this step, the image-level semantic labels are embedded into the image regions by manifold learning, using a weakly supervised learning algorithm. The embedding is obtained by minimizing an objective over Y = [y_1, y_2, …, y_N] ∈ R^(d×N), where y_i is a d-dimensional vector representing the i-th embedded region and y_j likewise represents the j-th embedded region, l_s(i, j) denotes the similarity of image regions i and j, and l_d(i, j) denotes their dissimilarity. Semantic labels are transferred to specific regions of the image by the manifold learning method, and the objective expresses that the proximity of regions i and j in the feature space should be consistent with the semantic labels of the image.
Image-level labels are passed into the various regions of the image. For example, if the image-level labels are sky, grass and house, the objective expresses the transfer of these three labels into three regions of the image. In practical applications, 32 object concepts are selected in advance, for example: sky, street lights, pedestrians, vehicles, trees, animals, grass, rivers, mountains, and so forth. If an object in a new test image does not belong to these 32 classes, it is not important for classification and can be ignored.
In one embodiment, the objective is minimized over Y = [y_1, y_2, …, y_N], where each y_i is a d-dimensional vector representing an image region. If 32 object concepts are specified, then d = 32. In the 32-dimensional vector, each element is 0 or 1: 0 indicates that the object is absent from the image and 1 indicates that it is present. Semantic information of the image can thus be represented by the 32-dimensional vector.
l_s(i, j) represents the similarity of image regions i and j, and l_d(i, j) represents their dissimilarity. If two regions are similar, they can be merged into one larger region block, i.e. they share the same image-level label.
In one embodiment, d in the d-dimensional vector is a variable set by the user; different values of d affect the dimension of the feature vector and thereby the final result. The purpose of this process is to automatically pass image-level labels into specific pixel regions of the image.
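The Python sketch below illustrates a manifold-style embedding of this kind under simplifying assumptions: the exact objective, the plain gradient-descent optimizer and the trade-off weight lam are not the patent's formulation, only an illustration of pulling similar regions together and pushing dissimilar regions apart.

```python
# Simplified sketch (assumed objective): regions with high similarity l_s are attracted,
# regions with high dissimilarity l_d are repelled, in a d-dimensional embedding Y.
import numpy as np

def embed_labels(Ls, Ld, d=32, lam=0.1, lr=0.01, iters=200, seed=0):
    """Ls, Ld: (N, N) similarity / dissimilarity matrices between regions."""
    rng = np.random.default_rng(seed)
    n = Ls.shape[0]
    Y = rng.normal(size=(n, d))               # one d-dim embedding per region
    for _ in range(iters):
        grad = np.zeros_like(Y)
        for i in range(n):
            diff = Y[i] - Y                    # (N, d) pairwise differences
            w = (Ls[i] - lam * Ld[i])[:, None]  # attract vs. repel weights
            grad[i] = 2.0 * (w * diff).sum(axis=0)
        Y -= lr * grad
    return Y                                   # rows: embedded region representations
```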
Step 405: and acquiring a base matrix and a sparse matrix from the original matrix of the super-pixel region according to the semantic label.
In this step, let X = [X_l, X_u] denote the feature matrix of the N scene images, where X_l represents the feature matrix with labels and X_u represents the feature matrix without labels. Non-negative matrix factorization is used to decompose X into the product of a basis matrix and a sparse matrix, X ≈ PQ, with a sparsity regularization term γ‖I ⊙ Q‖, where P ∈ R^((137+d)×t) denotes the basis matrix, Q ∈ R^(t×N) denotes the sparse matrix, ‖I ⊙ Q‖ is the regularization term, I is the indication matrix, whose zero blocks (of size MK × U, with M the dimension of each subspace) form a block-diagonal structure, ⊙ denotes element-wise multiplication, and γ is the regularization parameter.
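As a rough illustration of this factorization step, the sketch below uses scikit-learn's NMF with an L1 penalty as a stand-in for the indication-matrix regularizer γ‖I ⊙ Q‖; the penalty form, the number of components t and the solver settings are assumptions, not the patent's exact formulation.

```python
# Hedged sketch: X (non-negative region features) is factorized as X ~ P @ Q,
# with an L1 penalty on Q standing in for the patent's indication-matrix regularizer.
import numpy as np
from sklearn.decomposition import NMF

def factorize_regions(X, t=32, gamma=0.1):
    """X: (137 + d, N) non-negative feature matrix -> (P, Q) with X ~ P @ Q."""
    model = NMF(n_components=t, init="nndsvda", solver="mu",
                alpha_W=0.0, alpha_H=gamma, l1_ratio=1.0, max_iter=500)
    P = model.fit_transform(X)          # (137 + d, t) basis matrix
    Q = model.components_               # (t, N) sparse coefficient matrix
    return P, Q
```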
Step 406: and correspondingly acquiring a saliency area from the super-pixel area according to the base matrix.
In this step, the base matrix corresponds to the feature matrix with semantic labels, and the salient regions corresponding to the base matrix are obtained from the superpixel regions for subsequent calculation.
Step 407: and calculating the significance score of the significance region according to the sparse coding norm of the significance region.
In this step, the optimal sparse matrix Q* is used to calculate the saliency score of each scene region r_i. For non-negative matrix factorization, a global solution cannot be obtained because the objective is non-convex in the two matrices P and Q, so an iterative method is adopted to obtain the optimal matrices. The saliency score of region r_i is then calculated from the sparse coding norm of its corresponding column of Q*: since the optimal sparse matrix Q* represents the salient regions in the image, the saliency score corresponds to the degree of saliency of the corresponding salient region.
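A minimal sketch of this scoring is given below, assuming the L1 norm as the sparse coding norm (the text does not fix which norm is used):

```python
# Sketch: score each region r_i by the norm of its column in the optimal sparse matrix Q*.
import numpy as np

def saliency_scores(Q_star):
    """Q_star: (t, N) optimal sparse matrix -> (N,) saliency score per region."""
    return np.linalg.norm(Q_star, ord=1, axis=0)

def rank_regions(Q_star):
    """Return region indices sorted from most to least salient."""
    return np.argsort(-saliency_scores(Q_star))
```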
Step 408: and sequencing the salient regions according to the salient scores to generate a generalized sequence pattern set.
In this step, the saliency score of each scene region r_i is calculated from the optimal sparse matrix Q*, and the salient regions are sorted by these scores.
In one embodiment, the GSP algorithm can be divided into three phases, candidate set generation, candidate set counting and extension/classification, similar to the association rule (Apriori) algorithm. Compared with the AprioriAll algorithm used for association rules, the GSP algorithm counts fewer candidate sets and does not need to compute frequent sets in advance during data conversion. The Apriori algorithm is the first and most classical association rule mining algorithm. It finds the relations between itemsets in a database by an iterative, level-wise search to form rules; the process consists of join and prune steps, where the join is a matrix-like combination operation and pruning removes unnecessary intermediate results. An itemset in the algorithm is simply a set of items: a set containing K items is a K-itemset, and the frequency of an itemset is the number of transactions containing it. An itemset that meets the minimum support is called a frequent itemset.
In one embodiment, the join phase works as follows: if the sequence obtained by removing the first item of sequence pattern S1 is the same as the sequence obtained by removing the last item of sequence pattern S2, S1 and S2 can be joined, i.e. the last item of S2 is appended to S1.
Pruning: if any subsequence of a candidate sequence pattern is not itself a sequence pattern, the candidate cannot be a sequence pattern and is deleted from the candidate set.
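A simplified sketch of the GSP join and prune steps on flat item sequences is shown below; itemsets within sequence elements and time constraints, which full GSP supports, are omitted, and the contiguous-subsequence prune is a simplification.

```python
# Simplified GSP candidate generation: join length-k patterns, then prune candidates
# containing an infrequent length-k contiguous subsequence.
def gsp_join(patterns):
    """patterns: set of length-k tuples -> set of length-(k+1) candidate tuples."""
    candidates = set()
    for s1 in patterns:
        for s2 in patterns:
            # join if dropping s1's first item equals dropping s2's last item
            if s1[1:] == s2[:-1]:
                candidates.add(s1 + (s2[-1],))
    return candidates

def gsp_prune(candidates, frequent_k):
    """Keep only candidates whose length-k contiguous subsequences are all frequent."""
    kept = set()
    k = len(next(iter(frequent_k))) if frequent_k else 0
    for c in candidates:
        subs = [c[i:i + k] for i in range(len(c) - k + 1)]
        if all(s in frequent_k for s in subs):
            kept.add(c)
    return kept
```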
Step 409: and acquiring depth features of the salient regions corresponding to the generalized sequence pattern set according to the neural network architecture.
In this step, depth features are integrated using a statistical algorithm: a neural network is used to extract an A-dimensional depth feature from each region constituting the GSP, and a statistical algorithm is then used to fuse the region features. Specifically, let Θ = {θ_1, θ_2, …, θ_B}, where θ_b denotes the depth feature of the b-th region in a GSP and θ_b(m) denotes the m-th component of θ_b. Let F = {min, max, mean, median} denote the statistical methods, which integrate the region features in the GSP into an S-dimensional vector ω = W · [k_1(Θ); k_2(Θ); k_3(Θ); k_4(Θ)], where W ∈ R^(S×4A) is the aggregation parameter of the fully connected layer, the bracketed term indicates that the four A-dimensional statistic vectors are concatenated into one 4A-dimensional feature vector, and k_u denotes the u-th statistical method. For example, a GSP composed of three salient regions with depth features θ_1 = {3, 5, 2}, θ_2 = {2, 6, 1} and θ_3 = {1, 4, 3} is extracted from a scene image, where the feature dimension A is 3; using the statistical methods F, ω = {1, 2, 3, 3, 6, 5, 2, 4, 3, 2, 4, 3}.
Step 410: and acquiring a feature vector of the depth feature.
In this step, suppose the three superpixels are characterized by A-dimensional feature vectors stacked column-wise into a matrix. The statistical algorithm F = {min, max, mean, median} is applied row by row: first the minimum of each row is selected, giving 1, 3 and 2; then the maximum of each row is selected, giving 3, 7 and 4; the mean and the median of each row are obtained in the same way, finally yielding the internal aggregation {1, 3, 2, 3, 7, 4, 2, 5, 3, 2, 5, 3}. External aggregation is simply performed after internal aggregation by splicing the internally aggregated vectors together.
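The sketch below reproduces this internal aggregation; the 3×3 feature matrix is an assumed arrangement consistent with the row-wise minima 1, 3, 2 and maxima 3, 7, 4 quoted above, since the original feature values are given only as a figure.

```python
# Sketch: row-wise min/max/mean/median aggregation of per-region depth features.
import numpy as np

def aggregate_regions(features):
    """features: (A, B) matrix, one column per region -> (4A,) aggregated vector."""
    stats = [features.min(axis=1), features.max(axis=1),
             features.mean(axis=1), np.median(features, axis=1)]
    return np.concatenate(stats)

# Hypothetical matrix consistent with the figures in the text.
features = np.array([[1, 2, 3],
                     [3, 5, 7],
                     [2, 3, 4]], dtype=float)
print(aggregate_regions(features))
# -> [1. 3. 2. 3. 7. 4. 2. 5. 3. 2. 5. 3.], matching the internal aggregation above
```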
Step 411: and acquiring an image kernel from the salient region according to the Euclidean distance between the feature vectors.
In this step, the image kernel mechanism depends on the distance between scene images, and its calculation depends on the extracted GSP features. In particular, a given scene image is mapped through its GSP to an N-dimensional vector whose elements are computed from δ(·, ·), the Euclidean distance between two GSP feature vectors, where ω denotes the depth feature extracted from each GSP, N denotes the number of training scene images, and B(P*) denotes the number of salient regions in the GSP P*.
In one embodiment, P* denotes the GSP of a given scene image. The dimensionality of the features can be further reduced by the image kernel mechanism.
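A hedged sketch of this kernel construction is shown below: each image is mapped to an N-dimensional vector of similarities to the training images, computed from Euclidean distances between GSP feature vectors. The Gaussian transform of the distance is an assumption; the text only states that the elements derive from δ.

```python
# Sketch: build an N-dimensional kernel vector per image from Euclidean distances delta(., .).
import numpy as np

def image_kernel_vector(omega, train_omegas, sigma=1.0):
    """omega: (S,) GSP feature of one image; train_omegas: (N, S) training features."""
    dists = np.linalg.norm(train_omegas - omega[None, :], axis=1)   # delta(., .)
    return np.exp(-(dists ** 2) / (2.0 * sigma ** 2))               # (N,) kernel vector

def image_kernel_matrix(omegas, train_omegas, sigma=1.0):
    """Stack kernel vectors for a batch of images -> (n_images, N) kernel matrix."""
    return np.stack([image_kernel_vector(w, train_omegas, sigma) for w in omegas])
```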
Step 412: and training a multi-class support vector machine classifier based on the image kernel.
In this step, a multi-class SVM classifier is trained based on the obtained image kernel features. To distinguish the p-th and q-th scene classes, a binary SVM classifier is trained, where β_i ∈ R^N denotes the feature vector corresponding to the i-th training scene image and l_i is its class label, i.e. l_i = 1 for class p and l_i = -1 for class q; α denotes the hyperplane distinguishing the p-th class from the q-th class, and C > 0 balances the complexity of the image kernel mechanism against the number of misclassified scene images. N_pq denotes the number of training images in the p-th or q-th class. Assuming R scene classes in total, R(R-1)/2 SVM classifiers need to be trained.
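As an illustration, scikit-learn's SVC with a precomputed kernel trains one-vs-one binary classifiers internally, i.e. R(R-1)/2 classifiers for R classes, which mirrors the scheme described above; the value of C and the kernel matrices here are placeholders, not values from the patent.

```python
# Sketch: one-vs-one multi-class SVM on a precomputed image-kernel matrix.
from sklearn.svm import SVC

def train_scene_classifier(K_train, labels, C=1.0):
    """K_train: (N, N) image-kernel matrix between training images; labels: (N,) class ids."""
    clf = SVC(C=C, kernel="precomputed", decision_function_shape="ovo")
    clf.fit(K_train, labels)
    return clf

# K_test: (n_test, N) kernel values between test and training images (hypothetical)
# predicted_scenes = train_scene_classifier(K_train, y_train).predict(K_test)
```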
Step 413: and utilizing a support vector machine classifier to correspond the image to be detected to different scene categories according to the feature vector of the image to be detected.
In this step, the image to be detected is mapped to different scene categories by using a support vector machine classifier according to the feature vector of the image to be detected, and finally the processing process of image scene classification is completed.
Please refer to fig. 5, which is an image processing apparatus 500 according to an embodiment of the present disclosure, where the image processing apparatus 500 may be applied to the electronic device 1 shown in fig. 1 and may be applied to the interactive scene shown in fig. 2 to obtain original image data stored in a memory, process the original image data into an image kernel according to semantic features, and perform scene classification on the original image data according to the image kernel according to a vector classifier. The device includes: the system comprises an image segmentation module 501, a label acquisition module 502, a label embedding module 503, a feature extraction module 504 and a scene classification module 505.
An image segmentation module 501 is configured to segment the image to be measured into a plurality of super pixel regions. Please refer to the description of step 301 in the above embodiments.
A tag obtaining module 502, configured to obtain a semantic tag corresponding to the semantic feature according to the semantic feature of the super pixel region. Please refer to the description of step 302 in the above embodiment.
And a tag embedding module 503, configured to embed the semantic tag into the super-pixel region to generate a saliency region. Please refer to the description of step 303 in the above embodiments.
The feature extraction module 504 is configured to extract depth features of the significant region, and generate an image kernel of the image to be detected. Please refer to the description of step 304 in the above embodiment.
And a scene classification module 505, configured to process the image kernel by using a vector classifier, and perform scene classification on the image to be detected. Please refer to the description of step 305 in the above embodiment.
In one embodiment, the tag embedding module 503 is configured to: embedding the semantic tags into the super-pixel region by utilizing a manifold learning algorithm; acquiring a base matrix and a sparse matrix from an original matrix of the super-pixel region according to the semantic label; correspondingly acquiring a saliency region from the super-pixel region according to the base matrix; wherein the base matrix represents a feature matrix with semantic labels, and the sparse matrix represents a feature matrix without labels.
In one embodiment, the feature extraction module 504 is configured to: acquiring depth features of a salient region corresponding to the generalized sequence pattern set according to a neural network architecture; acquiring a feature vector of the depth feature; and acquiring an image kernel from the salient region according to the Euclidean distance between the feature vectors.
In one embodiment, the scene classification module 505 is configured to: training a multi-class support vector machine classifier based on the image kernel; and utilizing a support vector machine classifier to correspond the image to be detected to different scene categories according to the feature vector of the image to be detected.
In an embodiment, the image processing apparatus 500 may further include a data filtering module for removing a super-pixel region having a size smaller than a predetermined value or a super-pixel fraction lower than a threshold value.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
It should be noted that the functions, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. An image processing method, comprising:
dividing an image to be detected into a plurality of super pixel areas;
obtaining a semantic label corresponding to the semantic feature according to the semantic feature of the super pixel region;
embedding the semantic label into the super-pixel region to generate a saliency region;
extracting the depth features of the salient region and generating an image kernel of the image to be detected;
and processing the image kernel by using a vector classifier, and carrying out scene classification on the image to be detected.
2. The method of claim 1, further comprising, after said segmenting the image under test into a plurality of super-pixel regions:
and removing the super pixel regions with the size smaller than a preset value or the super pixel fraction lower than a threshold value.
3. The method of claim 1, wherein said embedding said semantic tags into said superpixel region, generating a saliency region, comprises:
embedding the semantic label into the super-pixel region by utilizing a manifold learning algorithm;
acquiring a base matrix and a sparse matrix from an original matrix of the super-pixel region according to the semantic label;
correspondingly acquiring the saliency areas from the super-pixel areas according to the base matrix;
wherein the base matrix represents a feature matrix with semantic tags and the sparse matrix represents a feature matrix without tags.
4. The method of claim 3, further comprising, after said embedding said semantic tag into said superpixel region, generating a saliency region:
calculating a significance score of the significance region according to the sparse coding norm of the significance region;
and sequencing the significance regions according to the significance scores to generate a generalized sequence pattern set.
5. The method according to claim 4, wherein the extracting the depth feature of the salient region and generating an image kernel of the image to be detected comprises:
acquiring depth features of the salient region corresponding to the generalized sequence pattern set according to a neural network architecture;
acquiring a feature vector of the depth feature;
and acquiring the image kernel from the salient region according to the Euclidean distance between the feature vectors.
6. The method of claim 5, wherein the processing the image kernel with the vector classifier to perform scene classification on the image under test comprises:
training a multi-class support vector machine classifier based on the image kernel;
and utilizing the support vector machine classifier to correspond the image to be detected to different scene categories according to the feature vector of the image to be detected.
7. An image processing apparatus characterized by comprising:
the image segmentation module is used for segmenting an image to be detected into a plurality of super pixel areas;
the label acquisition module is used for acquiring a semantic label corresponding to the semantic feature according to the semantic feature of the super pixel region;
the label embedding module is used for embedding the semantic label into the super-pixel region to generate a saliency region;
the feature extraction module is used for extracting the depth features of the salient region and generating an image kernel of the image to be detected;
and the scene classification module is used for processing the image kernel by using a vector classifier and carrying out scene classification on the image to be detected.
8. The apparatus of claim 7, wherein the tag embedding module is to:
embedding the semantic label into the super-pixel region by utilizing a manifold learning algorithm;
acquiring a base matrix and a sparse matrix from an original matrix of the super-pixel region according to the semantic label;
correspondingly acquiring the saliency areas from the super-pixel areas according to the base matrix;
wherein the base matrix represents a feature matrix with semantic tags and the sparse matrix represents a feature matrix without tags.
9. The apparatus of claim 7, wherein the feature extraction module is configured to:
acquiring depth features of the salient region corresponding to the generalized sequence pattern set according to a neural network architecture;
acquiring a feature vector of the depth feature;
and acquiring the image kernel from the salient region according to the Euclidean distance between the feature vectors.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for performing the method of any one of claims 1 to 6.
CN201911426049.2A 2019-12-31 2019-12-31 Image processing method and device and electronic equipment Withdrawn CN111209948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911426049.2A CN111209948A (en) 2019-12-31 2019-12-31 Image processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911426049.2A CN111209948A (en) 2019-12-31 2019-12-31 Image processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111209948A true CN111209948A (en) 2020-05-29

Family

ID=70789480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911426049.2A Withdrawn CN111209948A (en) 2019-12-31 2019-12-31 Image processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111209948A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052247A (en) * 2021-03-31 2021-06-29 清华苏州环境创新研究院 Garbage classification method and garbage classifier based on multi-label image recognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20200529)