CN111833360A - Image processing method, device, equipment and computer readable storage medium - Google Patents

Image processing method, device, equipment and computer readable storage medium Download PDF

Info

Publication number
CN111833360A
Authority
CN
China
Prior art keywords
image
point
convolution
feature
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010674414.8A
Other languages
Chinese (zh)
Other versions
CN111833360B (en)
Inventor
徐昊 (Xu Hao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010674414.8A priority Critical patent/CN111833360B/en
Publication of CN111833360A publication Critical patent/CN111833360A/en
Application granted granted Critical
Publication of CN111833360B publication Critical patent/CN111833360B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Abstract

The embodiment of the application discloses an image processing method, an image processing apparatus, an image processing device and a computer-readable storage medium. The method includes the following steps: acquiring an image, inputting the image into an image segmentation model, and generating at least two image feature matrices of the image, where the image segmentation model includes a convolution splicing module, the convolution splicing module includes at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels, and the features output by the convolution splicing module have a target channel number; convolving the at least two image feature matrices through each point-by-point convolution kernel respectively to generate at least two intermediate feature matrices; convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced; and performing feature splicing processing on the at least two intermediate feature matrices and the at least two feature matrices to be spliced to obtain at least two target feature matrices. By adopting the method and the apparatus, the number of parameters of the model can be reduced, and the running speed of the model can thereby be improved.

Description

Image processing method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to an image processing method, an image processing apparatus, an image processing device, and a computer-readable storage medium.
Background
With the rapid popularization of deep learning technology and the improvement of computing power, the performance of semantic segmentation technology has been greatly improved.
Portrait segmentation is one of the basic topics of semantic segmentation and has received wide attention in both academia and industry. In applications such as video conferencing and live chat, portrait segmentation can enrich the functions of the application. However, deep-learning-based portrait segmentation suffers from a large number of parameters, which leads to a large amount of model computation: if the hardware capability of the terminal running the application is not strong enough, the running speed is slow, resource consumption is excessive, and the terminal may be unable to run an application that relies on portrait segmentation.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing apparatus, an image processing device and a computer-readable storage medium, which can reduce the number of parameters of a model while implementing portrait segmentation, and can thereby improve the running speed of the model.
An embodiment of the present application provides an image processing method, including:
acquiring an image, inputting the image into an image segmentation model, and generating at least two image feature matrices of the image; the image segmentation model includes a convolution splicing module, the convolution splicing module includes at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels, and the features output by the convolution splicing module have a target channel number;
convolving the at least two image feature matrices through each point-by-point convolution kernel respectively to generate at least two intermediate feature matrices; the number of convolution kernels of the at least two point-by-point convolution kernels is smaller than the target channel number;
convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced;
performing feature splicing processing on the at least two intermediate feature matrices and the at least two feature matrices to be spliced to obtain at least two target feature matrices, and performing image identification processing on the image according to the at least two target feature matrices; the number of the at least two target feature matrices is equal to the target channel number.
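For readability, the following is a minimal sketch of the claimed convolution-splicing operation, written in PyTorch. The framework, the module name, the channel counts (64 in, 64 out, 32 intermediate) and the 3x3 depth-by-depth kernel size are illustrative assumptions taken from the example in the description, not requirements of the claims.

```python
import torch
import torch.nn as nn

class ConvSpliceModule(nn.Module):
    """Illustrative sketch: pointwise convolution -> depthwise convolution ->
    channel concatenation (feature splicing). Channel counts are assumptions."""
    def __init__(self, in_channels=64, target_channels=64):
        super().__init__()
        mid = target_channels // 2              # fewer pointwise kernels than target channels
        # at least two 1x1 point-by-point convolution kernels
        self.pointwise = nn.Conv2d(in_channels, mid, kernel_size=1, bias=False)
        # at least two 3x3 depth-by-depth convolution kernels (groups=mid)
        self.depthwise = nn.Conv2d(mid, mid, kernel_size=3, padding=1,
                                   groups=mid, bias=False)
        self.act = nn.ReLU(inplace=True)        # nonlinear activation after each step

    def forward(self, x):
        intermediate = self.act(self.pointwise(x))          # intermediate feature matrices
        to_splice = self.act(self.depthwise(intermediate))  # feature matrices to be spliced
        # feature splicing: concatenate along the channel axis so the output
        # has the target channel number (mid + mid)
        return torch.cat([intermediate, to_splice], dim=1)

# usage: ConvSpliceModule()(torch.randn(1, 64, 56, 56)).shape -> (1, 64, 56, 56)
```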
An aspect of an embodiment of the present application provides an image processing apparatus, including:
a first acquisition module, configured to acquire an image, input the image into an image segmentation model, and generate at least two image feature matrices of the image; the image segmentation model includes a convolution splicing module, the convolution splicing module includes at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels, and the features output by the convolution splicing module have a target channel number;
a first generation module, configured to convolve the at least two image feature matrices through each point-by-point convolution kernel respectively to generate at least two intermediate feature matrices; the number of convolution kernels of the at least two point-by-point convolution kernels is smaller than the target channel number;
the first generation module is further configured to convolve the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced;
a feature splicing module, configured to perform feature splicing processing on the at least two intermediate feature matrices and the at least two feature matrices to be spliced to obtain at least two target feature matrices, and perform image identification processing on the image according to the at least two target feature matrices; the number of the at least two target feature matrices is equal to the target channel number.
Wherein the feature splicing module includes:
a first processing unit, configured to input the at least two target feature matrices into a transposed convolution module, and perform deconvolution processing on the at least two target feature matrices through the transposed convolution module to obtain a feature segmentation matrix;
a second processing unit, configured to determine the feature segmentation matrix as a semantic segmentation image of the image, and perform image identification processing on the image according to the semantic segmentation image.
Wherein the second processing unit includes:
a first assignment subunit, configured to perform first assignment processing on the semantic segmentation image to obtain a first assigned value image; the first assigned value image is used for extracting a foreground object in the image;
a second assignment subunit, configured to acquire a material image and perform second assignment processing on the semantic segmentation image to obtain a second assigned value image; the second assigned value image is used for extracting a background object in the material image;
a target generation subunit, configured to generate a target image from the material image, the first assigned value image and the second assigned value image; the target image includes the foreground object and the background object.
The target generation subunit is specifically configured to obtain a first assignment matrix of the first assigned value image, obtain an original matrix of the image, and perform matrix adjustment on the original matrix according to the first assignment matrix to obtain a first target matrix; the first target matrix is used for representing the foreground object in the image;
the target generation subunit is specifically configured to obtain a second assignment matrix of the second assigned value image, obtain a material matrix of the material image, and perform matrix adjustment on the material matrix according to the second assignment matrix to obtain a second target matrix; the second target matrix is used for representing the background object in the material image;
the target generation subunit is specifically configured to perform matrix addition on the first target matrix and the second target matrix to obtain the target image.
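A hedged sketch of the composition performed by the target generation subunit, written in NumPy purely for illustration; the array layout and the function name are assumptions, not part of the claims.

```python
import numpy as np

def compose_target_image(image, material_image, segmentation_mask):
    """image, material_image: HxWx3 arrays of the same size;
    segmentation_mask: HxW array with 1 for foreground (portrait)
    and 0 for background, i.e. the semantic segmentation image."""
    first_assigned = segmentation_mask[..., None].astype(np.float32)   # first assigned value image
    second_assigned = 1.0 - first_assigned                             # second assigned value image
    first_target = image * first_assigned              # foreground object from the image
    second_target = material_image * second_assigned   # background object from the material image
    return (first_target + second_target).astype(image.dtype)  # matrix addition
```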
The first generation module is specifically configured to convolve the at least two intermediate feature matrices according to at least two first depth-by-depth convolution kernels respectively, to generate at least two first feature matrices to be spliced;
the feature splicing module includes:
a third processing unit, configured to perform feature splicing processing on the at least two intermediate feature matrices and the at least two first feature matrices to be spliced to obtain at least two feature matrices to be determined;
a first generation unit, configured to convolve the at least two feature matrices to be determined according to at least two second depth-by-depth convolution kernels respectively, to generate at least two second feature matrices to be spliced;
a fourth processing unit, configured to perform feature splicing processing on the at least two feature matrices to be determined and the at least two second feature matrices to be spliced to obtain the at least two target feature matrices.
Wherein the first generation module includes:
a second generation unit, configured to convolve each image feature matrix through each point-by-point convolution channel of the point-by-point convolution kernel Ki to generate at least two intermediate feature sub-matrices Zi, and fuse the at least two intermediate feature sub-matrices Zi to obtain an intermediate feature matrix Li; one point-by-point convolution channel of the point-by-point convolution kernel Ki corresponds to one image feature matrix of the at least two image feature matrices;
the second generation unit is further configured to convolve each image feature matrix through each point-by-point convolution channel of the point-by-point convolution kernel Ki+1 to generate at least two intermediate feature sub-matrices Zi+1, and fuse the at least two intermediate feature sub-matrices Zi+1 to obtain an intermediate feature matrix Li+1; one point-by-point convolution channel of the point-by-point convolution kernel Ki+1 corresponds to one image feature matrix of the at least two image feature matrices;
a first determination unit, configured to determine the intermediate feature matrix Li and the intermediate feature matrix Li+1 as the at least two intermediate feature matrices.
Wherein the first generation module includes:
a third generation unit, configured to convolve the intermediate feature matrix Li through the depth-by-depth convolution kernel Pi to generate a feature matrix to be spliced Ji;
the third generation unit is further configured to convolve the intermediate feature matrix Li+1 through the depth-by-depth convolution kernel Pi+1 to generate a feature matrix to be spliced Ji+1;
a second determination unit, configured to determine the feature matrix to be spliced Ji and the feature matrix to be spliced Ji+1 as the at least two feature matrices to be spliced.
Wherein, the first generation module comprises:
the fourth generation unit is used for respectively convolving the at least two image feature matrices through each point-by-point convolution kernel to generate at least two intermediate feature matrices to be activated;
and the fourth generating unit is further used for mapping the at least two intermediate feature matrices to be activated to the first activation function to generate at least two intermediate feature matrices.
Wherein the first generation module includes:
a fifth generation unit, configured to convolve the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively, to generate at least two to-be-activated feature matrices for splicing;
the fifth generation unit is further configured to map the at least two to-be-activated feature matrices for splicing to a second activation function, to generate the at least two feature matrices to be spliced.
Wherein the image processing apparatus further includes:
the second acquisition module is used for acquiring a training sample set; the training sample set comprises a sample image and a label image, and the label image is used for representing an object class label to which each pixel point in the sample image belongs;
a second generation module, configured to input the training sample set into a sample image segmentation model and generate a sample image feature matrix of the sample image; the sample image segmentation model includes a sample convolution splicing module; the sample image segmentation model includes at least two object class labels;
a feature extraction module, configured to extract, through the sample convolution splicing module, a predicted image feature matrix associated with each class label from the sample image feature matrix;
a model adjustment module, configured to adjust the sample image segmentation model according to the predicted image feature matrix and the label image to obtain the image segmentation model including the convolution splicing module.
Wherein the model adjustment module includes:
a sixth generation unit, configured to generate a label image feature matrix of the label image, and generate a model loss value according to the label image feature matrix and the predicted image feature matrix;
a third determination unit, configured to adjust the parameter weights in the sample image segmentation model according to the model loss value, and, when the model loss value satisfies the convergence condition, determine the adjusted sample image segmentation model containing the convolution splicing module as the image segmentation model.
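A minimal training-loop sketch for the training procedure described above, assuming a PyTorch model and a per-pixel cross-entropy loss; the optimizer, learning rate and data-loader interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_segmentation_model(model, loader, epochs=10, lr=1e-3):
    """loader yields (sample_image, label_image) pairs, where label_image
    holds the object-class label of each pixel (e.g. 0 = background,
    1 = portrait) as a LongTensor of shape (N, H, W)."""
    criterion = nn.CrossEntropyLoss()                 # produces the model loss value
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for sample_image, label_image in loader:
            predicted = model(sample_image)           # predicted feature matrix (N, classes, H, W)
            loss = criterion(predicted, label_image)  # compare with the label image
            optimizer.zero_grad()
            loss.backward()                           # adjust the parameter weights
            optimizer.step()
    return model                                      # image segmentation model after convergence
```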
One aspect of the present application provides a computer device, comprising: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing a computer program, and the processor is used for calling the computer program to execute the method in the embodiment of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, perform the method in the embodiments of the present application.
An aspect of an embodiment of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium; the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method in the embodiment of the present application.
According to the embodiment of the application, an image is acquired and input into an image segmentation model. The image segmentation model includes a convolution module and a convolution splicing module, and at least two image feature matrices of the image are obtained through the convolution module; the convolution splicing module includes at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels, and the features output by the convolution splicing module have a target channel number. The at least two image feature matrices are then convolved through each point-by-point convolution kernel respectively, that is, the at least two image feature matrices are mapped to each point-by-point convolution kernel respectively, to generate at least two intermediate feature matrices; evidently, the number of the at least two intermediate feature matrices equals the number of the at least two point-by-point convolution kernels. Since the number of point-by-point convolution kernels is smaller than the target channel number, the number of the at least two intermediate feature matrices is smaller than the target channel number. The at least two intermediate feature matrices are then convolved according to the at least two depth-by-depth convolution kernels respectively, to generate at least two feature matrices to be spliced. Feature splicing processing is performed on the at least two intermediate feature matrices and the at least two feature matrices to be spliced to obtain at least two target feature matrices, where the number of the at least two target feature matrices equals the target channel number, and finally image identification processing is performed on the image according to the at least two target feature matrices. In this way, because feature splicing processing is performed on the at least two intermediate feature matrices and the at least two feature matrices to be spliced, the number of parameters in the convolution splicing module can be greatly reduced, and the running speed of the image segmentation model can thereby be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a system architecture diagram according to an embodiment of the present application;
FIG. 2 is a schematic view of a scene of image processing provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 5a is a schematic structural diagram of a convolution splicing module according to an embodiment of the present application;
fig. 5b is a schematic structural diagram of an image segmentation model provided in an embodiment of the present application;
fig. 5c is a schematic network structure diagram of a convolution splicing module according to an embodiment of the present application;
FIG. 6 is a schematic view of a scene of image processing provided by an embodiment of the present application;
fig. 7 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer Vision technology (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as identification, tracking and measurement on a target, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence, deep learning technology and other technologies, and the specific process is explained by the following embodiment.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present disclosure. As shown in fig. 1, the system may include a server 10a and a user terminal cluster, and the user terminal cluster may include a user terminal 10b, a user terminal 10c and a user terminal 10d. The server 10a and the user terminals may be connected directly or indirectly through wired or wireless communication, or in other manners, which is not limited herein.
The server 10a provides services for the user terminal cluster through the communication connection. When a user terminal (which may be the user terminal 10b, the user terminal 10c or the user terminal 10d) acquires an image and needs to process the image, for example, to replace the background of the image, the user terminal may send the image to the server 10a. After receiving the image sent by the user terminal, the server 10a performs semantic segmentation on the image based on the image segmentation model trained in advance to obtain a semantic segmentation image corresponding to the image, and the server 10a then obtains a target image based on the material image, the image and the semantic segmentation image, where the target image includes both the foreground object in the image, i.e., the target object (which may be a human figure, an animal, a vehicle, etc.), and the background object in the material image. Subsequently, the server 10a may transmit the generated target image to the user terminal, and store the image, the semantic segmentation image and the material image in association in a database. After receiving the target image transmitted by the server 10a, the user terminal may display the target image on its screen.
Optionally, the server 10a may send the semantic segmentation image to a user terminal, and the user terminal obtains a target image based on the material image, the image, and the semantic segmentation image; if the trained image segmentation model is locally stored in the user terminal, the image can be processed into a semantic segmentation image locally at the user terminal, and then the semantic segmentation image is subjected to subsequent processing. Since the training of the image segmentation model involves a large amount of off-line computation, the image segmentation model local to the user terminal may be sent to the user terminal after being trained by the server 10 a.
It is understood that the methods provided by the embodiments of the present application may be performed by a computer device, including but not limited to a terminal or a server. The server 10a in the embodiment of the present application may be a computer device. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The server 10a, the user terminal 10b, the user terminal 10c, and the user terminal 10d in fig. 1 may include a mobile phone, a tablet computer, a notebook computer, a palm computer, a smart audio, a Mobile Internet Device (MID), a POS (Point Of Sales) machine, a wearable device (e.g., a smart watch, a smart band, etc.), and the like.
In the following, replacing the background of a portrait image is taken as an example (the processing may be performed in the server 10a or in the user terminal); please refer to fig. 2, which is a scene schematic diagram of image processing provided in an embodiment of the present application. The purpose of the embodiment of the present application is to realize portrait segmentation quickly without affecting segmentation accuracy. As shown in fig. 2, the image 20a includes a canteen, an escalator and a portrait (a girl); that is, the portrait is the foreground object, and the rest (including the escalator and the canteen) is the background object. Before the image 20a is input into the image segmentation model 20c, the image 20a is subjected to image preprocessing; in image analysis, the quality of the image directly affects the precision and effect of the segmentation algorithm, so image preprocessing is required before image analysis. The main purpose of image preprocessing is to eliminate irrelevant information (e.g. background objects) in the image 20a, recover useful real information, enhance the detectability of relevant information (e.g. foreground objects), and simplify the data to the maximum extent, thereby improving the reliability of feature extraction, image segmentation, matching and recognition. Methods of image preprocessing include, but are not limited to, the following:
a. and (5) carrying out image normalization processing.
Image normalization is a widely used technique in the fields of computer vision, pattern recognition, and the like. By image normalization, an original image to be processed (e.g., image 20a in fig. 2) is transformed into a corresponding unique standard form (which has invariant characteristics to affine transformations such as translation, rotation, scaling, etc.) through a series of transformations. The basic working principle of the image normalization technology is that firstly, a matrix which has invariance to affine transformation in an image is used for determining parameters of a transformation function, and then the original image is transformed into a standard-form image (the image is not related to the affine transformation) by using the transformation function determined by the parameters.
Reasons for normalization in neural networks: 1) normalization accelerates the convergence of the training network; 2) normalization unifies the statistical distribution of the samples; whether for modeling or calculation, a common basic unit of measurement is needed, the neural network is trained and makes predictions based on the statistical probability of the samples, and normalization maps values to a common statistical probability distribution between 0 and 1; 3) when the input signals of all samples are positive, the weights connected to the first hidden-layer neurons can only increase or decrease simultaneously, which makes learning very slow; to avoid this situation and speed up network learning, the input signals can be normalized so that the mean of the input signals of all samples is close to 0 or very small compared with their mean square error; 4) normalization also normalizes the output of the samples, because the value of the sigmoid function lies between 0 and 1, and so does the output of the last node of the network.
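A small sketch of a typical normalization step, assuming per-channel mean/standard-deviation normalization; the statistics used below are the common ImageNet values and are only an illustrative assumption.

```python
import numpy as np

def normalize_image(image, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Scale an HxWx3 uint8 image to [0, 1] and centre it channel-wise,
    so the network inputs are roughly zero-mean."""
    image = image.astype(np.float32) / 255.0
    return (image - np.array(mean, dtype=np.float32)) / np.array(std, dtype=np.float32)
```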
b. Geometric transformation.
The geometric transformation of an image is also called image space transformation. The acquired image is processed through geometric transformations such as translation, transposition, mirroring, rotation and scaling, which are used to correct systematic errors of the image acquisition system and random errors of the instrument position (imaging angle, perspective relationship, and even the lens). Furthermore, a gray-level interpolation algorithm is also needed, because pixels of the output image may be mapped onto non-integer coordinates of the input image according to this transformation relationship. Commonly used interpolation methods are nearest-neighbor interpolation, bilinear interpolation and bicubic interpolation.
c. Image enhancement.
Image enhancement emphasizes the useful information in an image and can be a distorting process. Its purpose is to improve the visual effect of the image: for a given application, the overall or local characteristics of the image are emphasized purposefully, an originally unclear image is made clear or certain features of interest are emphasized, the differences between the features of different objects in the image are enlarged, uninteresting features are suppressed, the image quality and information content are improved, image interpretation and recognition are enhanced, and the requirements of some special analyses are met. Commonly used image enhancement methods are mean filtering, Gaussian/low-pass filtering and median filtering.
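The filters named above are available, for example, in OpenCV; a brief sketch, with kernel sizes chosen only for illustration:

```python
import cv2

def smooth_image(image):
    """Apply the smoothing filters mentioned above to a uint8 image."""
    mean_filtered = cv2.blur(image, (3, 3))                  # mean filtering
    gaussian_filtered = cv2.GaussianBlur(image, (3, 3), 0)   # Gaussian / low-pass filtering
    median_filtered = cv2.medianBlur(image, 3)               # median filtering
    return mean_filtered, gaussian_filtered, median_filtered
```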
It can be understood that the image may be classified into an optical image, a radar image, and the like, or a grayscale image, a color image, and the like, so that the method of image preprocessing may be determined according to practical applications, and is not limited herein.
Referring to fig. 2 again, after the image 20a is preprocessed, an image 20b is generated, and the image 20b is input into an image segmentation model 20c, where the image segmentation model 20c includes an input module, a general convolution module (hereinafter referred to as a convolution module), a novel convolution splicing module 200c proposed in the present application, a deconvolution module (also referred to as a transposed convolution module), and an output module. The parameter size of the input module is equal to the size of the image 20b after the size adjustment, when the image 20b is input to an input layer (i.e., the input module) of the image segmentation model 20c, an original image matrix corresponding to the image 20b is generated, and then the original image matrix enters the convolution module, it needs to be noted that the convolution module in the embodiment of the present application may include a general convolution layer and a general pooling layer, which is not separately exemplified here, and the convolution module learns some feature information from the original image matrix, that is, performs convolution operation on the feature information in the original image matrix, so as to obtain the most significant feature information on different pixel points of the image 20 b. After the convolution operation is completed, the feature information of the image 20b is already extracted, but the number of features extracted only through the convolution operation is large, in order to reduce the calculation amount, pooling operation is needed, that is, the feature information extracted through the convolution operation from the image 20b is transmitted to a pooling layer, aggregation statistics is carried out on the extracted feature information, the order of magnitude of the statistical feature information is far lower than that of the feature information extracted through the convolution operation, and meanwhile, the segmentation effect is improved. Common pooling methods include, but are not limited to, an average pooling method and a maximum pooling method. The average pooling operation method is that an average characteristic information is calculated in a characteristic information set to represent the characteristics of the characteristic information set; the maximum pooling operation is to extract the maximum feature information from a feature information set to represent the features of the feature information set.
Through the convolution processing and pooling processing of the convolution module, at least two image feature matrices 20d corresponding to the image 20b can be extracted, and it can be understood that there may be only one convolution layer or a plurality of convolution layers in the convolution module, and similarly, there may be only one pooling layer or a plurality of pooling layers.
Referring to fig. 2 again, the at least two image feature matrices 20d are output from the convolution module and input to the convolution splicing module 200c. Before and after performing depth-by-depth (depthwise) convolution, a general neural network needs to perform dimension increase and dimension reduction of the feature channels by using 1x1 point-by-point (pointwise) convolutions, but the large number of parameters and the large amount of calculation caused by the 1x1 point-by-point convolutions are often ignored. Assume that the number of matrices of the at least two image feature matrices 20d is 64, that is, the number of feature maps input into the convolution splicing module 200c is 64, which can also be understood as the number of feature channels input into the convolution splicing module 200c being 64; assume that the number of output matrices of the convolution splicing module 200c is 64, that is, the number of output feature maps of the convolution splicing module 200c is 64, which can also be understood as the number of output feature channels of the convolution splicing module 200c being 64; and assume that the size of a depthwise convolution kernel is 3x3. According to a general network operation, first, 64 convolution kernels of 3x3 are required to perform depthwise convolution on each feature matrix respectively, where the number of parameters is 64x3x3 = 576; then, pointwise convolution kernels of 1x1 are used to perform the fusion operation between feature matrices, where the number of parameters is 64x64x1x1 = 4096. It can be seen that the number of parameters of the 1x1 pointwise convolution kernels is very large.
To solve this problem, the embodiment of the present application adopts a convolution splicing manner to reduce the number of parameters. Referring to fig. 2 again, the steps are as follows: first, the 64 feature matrices (namely the at least two image feature matrices 20d) are convolved with 32 point-by-point convolution kernels of 1x1, where the depth of each point-by-point convolution kernel is 64, to generate at least two intermediate feature matrices 20e; the number of the at least two intermediate feature matrices 20e is 32, that is, after the 32 point-by-point convolution kernels of 1x1, the number of output feature maps is 32, which can also be understood as the number of feature channels being 32, and the number of parameters of this step is 64x32x1x1 = 2048. Then, the at least two intermediate feature matrices 20e are convolved with depthwise convolution kernels of 3x3; because the number of input feature channels is 32, 32 depthwise convolution kernels of 3x3 are needed, and at least two feature matrices 200f to be spliced are generated, where the number of matrices of the at least two feature matrices 200f to be spliced is 32, that is, the number of feature maps is 32, which can also be understood as the number of feature channels being 32, and the number of parameters of this step is 32x3x3 = 288.
Before outputting the at least two feature matrices 200f to be spliced, feature splicing is performed on the at least two feature matrices 200f to be spliced and the at least two intermediate feature matrices 20e. For example, the at least two intermediate feature matrices 20e include intermediate feature matrix 1, intermediate feature matrix 2, …, intermediate feature matrix 31 and intermediate feature matrix 32, and the at least two feature matrices 200f to be spliced include feature matrix 1 to be spliced, feature matrix 2 to be spliced, …, feature matrix 31 to be spliced and feature matrix 32 to be spliced. Feature matrix 1 to be spliced is taken as the 33rd feature matrix and spliced behind intermediate feature matrix 1, intermediate feature matrix 2, …, intermediate feature matrix 31 and intermediate feature matrix 32; feature matrix 2 to be spliced is taken as the 34th feature matrix and spliced behind them; …; feature matrix 31 to be spliced is taken as the 63rd feature matrix; and feature matrix 32 to be spliced is taken as the 64th feature matrix, so that the at least two target feature matrices 20g are obtained. Optionally, the order may be reversed: intermediate feature matrix 1 is taken as the 33rd feature matrix and spliced behind feature matrix 1 to be spliced, feature matrix 2 to be spliced, …, feature matrix 31 to be spliced and feature matrix 32 to be spliced; intermediate feature matrix 2 is taken as the 34th feature matrix; …; intermediate feature matrix 31 is taken as the 63rd feature matrix; and intermediate feature matrix 32 is taken as the 64th feature matrix, so that the at least two target feature matrices 20g are obtained. The splicing order is not limited here.
As can be seen from the above description, the number of matrices of the at least two target feature matrices 20g is 64, that is, the number of output feature maps is 64, which can also be understood as the number of feature channels being 64; however, the total number of parameters of the convolution splicing module 200c is 2048 + 288 = 2336, which is almost half of the original number of parameters 4096.
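The parameter counts quoted above can be checked with a few lines of arithmetic; the channel numbers are those of the example, not fixed values of the scheme.

```python
# Ordinary depthwise + pointwise combination from the example above
depthwise_params = 64 * 3 * 3              # 576
pointwise_params = 64 * 64 * 1 * 1         # 4096

# Convolution splicing module: 32 pointwise kernels, then 32 depthwise kernels, then concatenation
splice_pointwise = 64 * 32 * 1 * 1         # 2048
splice_depthwise = 32 * 3 * 3              # 288
print(splice_pointwise + splice_depthwise) # 2336, roughly half of 4096
```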
Besides reducing the number of parameters, the convolution splicing module 200c has the following advantages. First, the 3x3 depth-by-depth convolution increases the receptive field of the network, which is beneficial for segmenting larger portraits. Second, a plain 1x1 pointwise convolution offers only one opportunity for nonlinear activation, whereas in the convolution splicing module 200c nonlinear activation can be added after both the 1x1 pointwise convolution and the 3x3 depthwise convolution, which enhances the nonlinear expression capability of the network.
It is understood that there may be one or more convolution splicing modules 200c in the image segmentation model 20c; one convolution splicing module 200c is taken as an example here, and if there are multiple convolution splicing modules 200c, the parameters can be calculated by referring to the above process.
The at least two target feature matrices 20g are input into the transposed convolution module. The transposed convolution module performs the reverse operation of the convolution module and the convolution splicing module: the size of the feature map or feature matrix changes from small to large. The transposed convolution module performs deconvolution processing on the at least two target feature matrices 20g to obtain a feature segmentation matrix, and the feature segmentation matrix is determined as the semantic segmentation image 20h of the image 20b. It can be understood that the embodiment of the present application is a two-class classification: the image segmentation model 20c has two labels, a portrait label and a background label. Each pixel in the image 20b is identified; if pixel A is identified as the portrait label, pixel A is marked white, and if pixel B is identified as the background label, pixel B is marked black. As shown in the semantic segmentation image 20h, the portrait area is white and the background area is black.
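A sketch of the per-pixel labelling step that turns the model output into the black-and-white semantic segmentation image; the tensor layout and the use of PyTorch are assumptions for illustration.

```python
import torch

def logits_to_mask(logits):
    """logits: (1, 2, H, W) output of the image segmentation model,
    channel 0 = background label, channel 1 = portrait label.
    Returns an HxW image: 255 (white) for portrait pixels, 0 (black) for background."""
    labels = logits.argmax(dim=1)[0]          # per-pixel class label
    return (labels * 255).to(torch.uint8)     # white portrait area, black background
```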
There may be rough places at the segmentation edge of the semantic segmentation image 20h, for example, 200h at the segmentation edge, so that the prediction region (i.e., the semantic segmentation image 20h) may be subjected to image post-processing, and the image post-processing method includes, but is not limited to, mean filtering, gaussian high/low pass filtering, and median filtering, so as to obtain the semantic segmentation image 20 i. Performing first assignment processing on the semantic segmentation image 20i, for example, assigning a white area in the semantic segmentation image 20i to 1 and assigning a black area to 0 to obtain a first assigned value image, where the first assigned value image is used to extract a foreground object, i.e., a portrait, in the image 20 b; acquiring a material image 20j, obviously, the object in the material image 20j is inconsistent with the object in the image 20b, the material image 20j comprises a house, a tree and a small grass, and performing second assignment processing on the semantic segmentation image 20i, for example, assigning a white area in the semantic segmentation image 20i to be 0 and assigning a black area to be 1 to obtain a second assigned value image, and the second assigned value image is used for extracting a background object in the material image 20 j; acquiring a first assignment matrix of the first assignment image, acquiring an original matrix of the image 20b, and performing matrix adjustment on the original matrix according to the first assignment matrix to obtain a first target matrix, wherein the first target matrix is used for representing a foreground object in the image 20 b; acquiring a second assignment matrix of the second assignment image, acquiring a material matrix of the material image 20j, and performing matrix adjustment on the material matrix according to the second assignment matrix to obtain a second target matrix, wherein the second target matrix is used for representing a background object in the material image 20 j; the first target matrix and the second target matrix are subjected to matrix addition to obtain a target image 20k, and it is obvious that the target image 20k includes a portrait in the image 20b and a background object in the material image 20 j.
In summary, to realize fast and accurate portrait segmentation, please refer to fig. 3, which is a schematic flowchart of an image processing method provided by an embodiment of the present application, including the following steps:
step 1: acquiring an image may include acquiring an original image, i.e., image 20a and material image 20j in fig. 2.
Step 2: image preprocessing, which performs processing such as normalization and filtering on the image 20a in step 1, generates an image 20b to be input to the image segmentation model 20 c.
Step 3: the image segmentation model 20c predicts the portrait foreground region. The image 20b from step 2 is input into the general convolution module in the image segmentation model 20c to generate at least two image feature matrices 20d. This is where the inventive point of the embodiment of the present application lies: the present application provides a novel convolution module, namely the convolution splicing module 200c in fig. 2. The convolution splicing module 200c performs feature splicing processing on the at least two intermediate feature matrices 20e and the at least two feature matrices to be spliced 20f generated from them, and generates at least two target feature matrices 20g whose number equals the target channel number, so that the number of parameters in the image segmentation model 20c can be greatly reduced and the running speed of the image segmentation model 20c can thereby be improved.
Step 4: prediction region post-processing. The portrait foreground region predicted by the image segmentation model 20c (i.e., the semantic segmentation image 20h) is filtered to obtain a portrait segmentation image with smooth edges, i.e., the semantic segmentation image 20i in fig. 2.
Step 5: target image generation. This embodiment is described by taking replacing the background object in the image 20b as an example; in an actual scene, the semantic segmentation image 20i generated in step 4 can also be used to reconstruct a three-dimensional image or to analyze the target object in the image.
Further, please refer to fig. 4, where fig. 4 is a schematic flowchart of an image processing method according to an embodiment of the present application. As shown in fig. 4, the image processing process includes the steps of:
step S101, acquiring an image, inputting the image into an image segmentation model, and generating at least two image characteristic matrixes of the image; the image segmentation model comprises a convolution splicing module, wherein the convolution splicing module comprises at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels; the features output by the convolution splicing module have a target channel number.
Specifically, in order to realize fast and high-precision deep-learning semantic segmentation, the embodiment of the application designs a novel convolution layer, the convolution splicing layer, constructs a convolution splicing module based on the convolution splicing layer, and finally constructs, based on the convolution splicing module, an image segmentation model with a small number of parameters and strong nonlinear expression capability to segment the image.
Please refer to fig. 5a, which is a schematic structural diagram of a convolution splicing module according to an embodiment of the present application. As shown in fig. 5a, the structure of the convolution splicing module includes an input layer, a convolution splicing layer, a batch normalization (BatchNorm, BN) layer, an activation layer and a convolution layer. When deep learning is used for image classification or object segmentation, the image first needs to be preprocessed. The two most common image preprocessing methods are normal whitening, also called image standardization, and normalization. Image standardization centers the data by removing the mean; according to convex optimization theory and knowledge of data probability distributions, centered data better conforms to the data distribution, so a good generalization effect after training is easier to obtain, the input values of the nonlinear transformation function fall into a region sensitive to the input, and the vanishing-gradient problem is avoided.
Referring to fig. 5a again, since a linear model has insufficient expression capability, an activation function (the activation layer) is needed to add a nonlinear factor. Commonly used activation functions include the Sigmoid function, the Tanh function, the Rectified Linear Unit (ReLU) function, and the like, where the ReLU function has the following advantages:
(1) the ReLU function alleviates the vanishing-gradient problem: at least when the input falls in the positive region, the neurons do not saturate;
(2) due to the linear, non-saturating form of ReLU, fast convergence is possible under stochastic gradient descent (SGD);
(3) the computation is much faster: the ReLU function only involves a linear relation and needs no exponential calculation, so it is faster than the sigmoid and tanh functions in both forward propagation and backward propagation.
The convolutional layer in fig. 5a may include a pooling layer. In a neural network, features are extracted from the input image through a series of consecutive convolutional layers and pooling layers, gradually turning low-level features into high-level features; successive convolution and pooling (sub-sampling) operations increase the receptive field of the deeper layers of the network, so that more context information can be captured.
Please refer to fig. 5b, wherein fig. 5b is a schematic structural diagram of an image segmentation model according to an embodiment of the present application. As shown in fig. 5b, the basic structure of the image segmentation model includes an input module, a general convolution module (hereinafter referred to as a convolution module), a convolution splicing module, a transposed convolution module, and an output module. In this structure, an encoder is composed of the convolution module 1, the convolution splicing module 2, the convolution splicing module 3, and the convolution splicing module 4, and performs downsampling feature extraction on an image to obtain an encoded graph with a small size and rich semantic information. A decoder is then composed of 4 transposed convolution (deconvolution) modules (i.e., the transposed convolution module 5, the transposed convolution module 6, the transposed convolution module 7, and the transposed convolution module 8 in fig. 5b) using skip connections (skip), and the encoded image (feature matrix) obtained by the encoder is upsampled to obtain a semantic segmentation image (e.g., the semantic segmentation image 20h in fig. 2) whose size is consistent with that of the original image (e.g., the image 20b in fig. 2). The input of the transposed convolution module 5 is the output of the convolution splicing module 4; the input of the transposed convolution module 6 is the sum of the output of the convolution splicing module 3 and the output of the transposed convolution module 5, that is, the output of the convolution splicing module 3 and the output of the transposed convolution module 5 are taken as the input of the transposed convolution module 6 through a skip connection; the output of the convolution splicing module 2 and the output of the transposed convolution module 6 are taken as the input of the transposed convolution module 7 through a skip connection; and the output of the convolution module 1 and the output of the transposed convolution module 7 are taken as the input of the transposed convolution module 8 through a skip connection. Assuming that the output of the convolution splicing module 3 is a 2 x 2 matrix such as {[2,3], [4,5]} and the output of the transposed convolution module 5 is a 2 x 2 matrix such as {[1,2], [2,4]}, the input of the transposed convolution module 6 can be {[3,5], [6,9]}. It should be noted that the above numbers are only an example of the input of a transposed convolution module being the output of one or two preceding modules (a transposed convolution module together with a convolution splicing module or convolution module).
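The encoder-decoder wiring described above can be pictured with the following sketch. It is only an illustrative skeleton, assuming a PyTorch-style interface; the submodules passed to the constructor are hypothetical stand-ins for the convolution module, convolution splicing modules, and transposed convolution modules of fig. 5b, and the skip connections add the corresponding encoder and decoder outputs element-wise, as in the numeric example above.

```python
import torch.nn as nn

class SegmentationModel(nn.Module):
    """Skeleton of the encoder-decoder structure of fig. 5b (illustrative only)."""

    def __init__(self, conv1, splice2, splice3, splice4, up5, up6, up7, up8):
        super().__init__()
        # Encoder: one ordinary convolution module followed by three
        # convolution splicing modules (downsampling feature extraction).
        self.conv1, self.splice2, self.splice3, self.splice4 = conv1, splice2, splice3, splice4
        # Decoder: four transposed convolution modules (upsampling).
        self.up5, self.up6, self.up7, self.up8 = up5, up6, up7, up8

    def forward(self, x):
        e1 = self.conv1(x)
        e2 = self.splice2(e1)
        e3 = self.splice3(e2)
        e4 = self.splice4(e3)
        d5 = self.up5(e4)          # input of module 5 is the output of module 4
        d6 = self.up6(e3 + d5)     # skip connection: element-wise sum
        d7 = self.up7(e2 + d6)
        d8 = self.up8(e1 + d7)
        return d8                  # semantic segmentation image, same size as the input
```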
Referring to fig. 2 again, the image 20a is obtained, the image 20a is subjected to image preprocessing to obtain an image 20b, and the image 20b is input into a convolution module in the image segmentation model 20c to generate at least two image feature matrices 20d of the image 20b. The image segmentation model 20c includes a convolution splicing module 200c; in the embodiment of the present application, the input and output process of the convolution splicing module 2 is taken as an example for description, and other convolution splicing modules can be understood according to the input and output process of the convolution splicing module 2, that is, the convolution splicing module 200c is the convolution splicing module 2. The convolution splicing module 200c includes a convolution splicing layer, and the convolution splicing layer includes at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels; the features output by the convolution splicing module 200c have a target channel number. It is assumed that the target channel number is 64 and that the matrix number of the at least two image feature matrices 20d is 64.
At least two image feature matrices 20d are input into the convolution splicing module 200c, please refer to fig. 5c together, and fig. 5c is a schematic network structure diagram of the convolution splicing module according to an embodiment of the present disclosure.
Step S102, respectively convolving at least two image feature matrixes through each point-by-point convolution kernel to generate at least two intermediate feature matrixes; the number of convolution kernels of the at least two point-by-point convolution kernels is less than the number of target channels.
Specifically, each point-by-point convolution channel of the point-by-point convolution kernel Ki convolves each image feature matrix to generate at least two intermediate feature sub-matrices Zi, and the at least two intermediate feature sub-matrices Zi are fused to obtain an intermediate feature matrix Li; one point-by-point convolution channel in the point-by-point convolution kernel Ki corresponds to one image feature matrix in the at least two image feature matrices. Similarly, each point-by-point convolution channel of the point-by-point convolution kernel Ki+1 convolves each image feature matrix to generate at least two intermediate feature sub-matrices Zi+1, and the at least two intermediate feature sub-matrices Zi+1 are fused to obtain an intermediate feature matrix Li+1; one point-by-point convolution channel in the point-by-point convolution kernel Ki+1 corresponds to one image feature matrix in the at least two image feature matrices. The intermediate feature matrix Li and the intermediate feature matrix Li+1 are determined as the at least two intermediate feature matrices.
Respectively convolving at least two image feature matrixes through each point-by-point convolution kernel to generate at least two intermediate feature matrixes to be activated; and mapping at least two intermediate feature matrixes to be activated to the first activation function to generate at least two intermediate feature matrixes.
As shown in fig. 5c, the number of the at least two image feature matrices 20d is 64, and the size of each image feature matrix is H × W. In the embodiment of the present application, the 64 image feature matrices of size H × W are reduced in dimension: first, the 64 image feature matrices (i.e., the at least two image feature matrices 20d) are convolved by using 32 point-by-point convolution kernels of 1 × 1. The depth of each point-by-point convolution kernel is 64. Assume that the 64 image feature matrices are image feature matrix 1a, image feature matrix 2a, …, image feature matrix 63a, and image feature matrix 64a, and that the 32 point-by-point convolution kernels of 1 × 1 are point-by-point convolution kernel K1, point-by-point convolution kernel K2, …, point-by-point convolution kernel K31, and point-by-point convolution kernel K32. The point-by-point convolution channels in the point-by-point convolution kernel K1 are K1-1, K1-2, …, K1-63, and K1-64; the point-by-point convolution channels in the point-by-point convolution kernel K2 are K2-1, K2-2, …, K2-63, and K2-64; …; the point-by-point convolution channels in the point-by-point convolution kernel K31 are K31-1, K31-2, …, K31-63, and K31-64; and the point-by-point convolution channels in the point-by-point convolution kernel K32 are K32-1, K32-2, …, K32-63, and K32-64.
As shown in fig. 5c, the 64 image feature matrices of size H × W are mapped to each point-by-point convolution kernel, specifically: the image feature matrix 1a is mapped to the point-by-point convolution channel K1-1 of the point-by-point convolution kernel K1 to generate an intermediate feature sub-matrix Z1-1; the image feature matrix 2a is mapped to the point-by-point convolution channel K1-2 to generate an intermediate feature sub-matrix Z1-2; …; the image feature matrix 63a is mapped to the point-by-point convolution channel K1-63 to generate an intermediate feature sub-matrix Z1-63; and the image feature matrix 64a is mapped to the point-by-point convolution channel K1-64 to generate an intermediate feature sub-matrix Z1-64. The intermediate feature sub-matrix Z1-1, the intermediate feature sub-matrix Z1-2, …, the intermediate feature sub-matrix Z1-63, and the intermediate feature sub-matrix Z1-64 are then fused to obtain an intermediate feature matrix L1. Similarly, the image feature matrix 1a is mapped to the point-by-point convolution channel K2-1 of the point-by-point convolution kernel K2 to generate an intermediate feature sub-matrix Z2-1; the image feature matrix 2a is mapped to the point-by-point convolution channel K2-2 to generate an intermediate feature sub-matrix Z2-2; …; the image feature matrix 63a is mapped to the point-by-point convolution channel K2-63 to generate an intermediate feature sub-matrix Z2-63; and the image feature matrix 64a is mapped to the point-by-point convolution channel K2-64 to generate an intermediate feature sub-matrix Z2-64. The intermediate feature sub-matrix Z2-1, the intermediate feature sub-matrix Z2-2, …, the intermediate feature sub-matrix Z2-63, and the intermediate feature sub-matrix Z2-64 are then fused to obtain an intermediate feature matrix L2. The 64 image feature matrices are convolved with the remaining point-by-point convolution kernels (including the point-by-point convolution kernel K3, the point-by-point convolution kernel K4, …, the point-by-point convolution kernel K31, and the point-by-point convolution kernel K32) in the same way; for the mapping process, reference may be made to the above description, and details are not repeated here.
As can be seen from the above, 32 intermediate feature matrices, namely the intermediate feature matrix L1, the intermediate feature matrix L2, …, the intermediate feature matrix L31, and the intermediate feature matrix L32, can be generated from the 32 point-by-point convolution kernels. Note that the size of each intermediate feature matrix in fig. 5c is H × W, because the size of each convolution kernel is 1 × 1 and the sliding step size is set to 1; the parameter number of this step is 64 × 32 × 1 × 1 = 2048.
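As a hedged sketch of this dimension-reduction step, assuming PyTorch, the 32 point-by-point (1 × 1) convolution kernels over 64 input channels can be expressed with a single 1 × 1 convolution; the parameter count matches the 64 × 32 × 1 × 1 = 2048 figure above (ignoring any bias term, which the embodiment does not mention):

```python
import torch
import torch.nn as nn

# 64 input feature matrices of size H x W -> 32 intermediate feature matrices.
pointwise = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=1,
                      stride=1, bias=False)

x = torch.randn(1, 64, 56, 56)          # example input with H = W = 56 (assumed size)
intermediate = pointwise(x)             # shape (1, 32, 56, 56): H x W preserved
print(sum(p.numel() for p in pointwise.parameters()))   # 2048 = 64 * 32 * 1 * 1
```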
Optionally, the 32 intermediate feature matrices obtained by the above process (namely, the intermediate feature matrix L1, the intermediate feature matrix L2, …, the intermediate feature matrix L31, and the intermediate feature matrix L32) are first called intermediate feature matrices to be activated; the 32 intermediate feature matrices to be activated are mapped to a first activation function to generate the 32 intermediate feature matrices. This process is used to enhance the nonlinear expression capability of the model.
Alternatively, 16 point-by-point convolution kernels of 1 × 1 are used to convolve the 64 image feature matrices. The depth of each point-by-point convolution kernel is still 64. Assume that the 64 image feature matrices are image feature matrix 1a, image feature matrix 2a, …, image feature matrix 63a, and image feature matrix 64a, and that the 16 point-by-point convolution kernels of 1 × 1 are point-by-point convolution kernel K1, point-by-point convolution kernel K2, …, point-by-point convolution kernel K15, and point-by-point convolution kernel K16. The point-by-point convolution channels in the point-by-point convolution kernel K1 are K1-1, K1-2, …, K1-63, and K1-64; the point-by-point convolution channels in the point-by-point convolution kernel K2 are K2-1, K2-2, …, K2-63, and K2-64; …; the point-by-point convolution channels in the point-by-point convolution kernel K15 are K15-1, K15-2, …, K15-63, and K15-64; and the point-by-point convolution channels in the point-by-point convolution kernel K16 are K16-1, K16-2, …, K16-63, and K16-64.
The 64 image feature matrices of size H × W are mapped to each point-by-point convolution kernel; the specific mapping process may be as described above and is not repeated here. According to the 16 point-by-point convolution kernels, 16 intermediate feature matrices can be generated, namely the intermediate feature matrix L1, the intermediate feature matrix L2, …, the intermediate feature matrix L15, and the intermediate feature matrix L16; the parameter number of this step is 64 × 16 × 1 × 1 = 1024.
Step S103, convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced.
Specifically, the intermediate feature matrix Li is convolved by the depth-by-depth convolution kernel Pi to generate a feature matrix Ji to be spliced; the intermediate feature matrix Li+1 is convolved by the depth-by-depth convolution kernel Pi+1 to generate a feature matrix Ji+1 to be spliced; and the feature matrix Ji to be spliced and the feature matrix Ji+1 to be spliced are determined as the at least two feature matrices to be spliced.
Respectively convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels to generate at least two splicing feature matrices to be activated; and mapping the at least two splicing feature matrices to be activated to a second activation function to generate the at least two feature matrices to be spliced.
Referring to fig. 5c again, the number of the at least two intermediate feature matrices is 32, and the size of each intermediate feature matrix is H × W. As shown in fig. 5c, the 32 intermediate feature matrices of size H × W are depth-wise convolved by using 32 depth-by-depth convolution kernels of 3 × 3. Assume that the 32 intermediate feature matrices are the intermediate feature matrix L1, the intermediate feature matrix L2, …, the intermediate feature matrix L31, and the intermediate feature matrix L32, and that the 32 depth-by-depth convolution kernels of 3 × 3 are depth-by-depth convolution kernel P1, depth-by-depth convolution kernel P2, …, depth-by-depth convolution kernel P31, and depth-by-depth convolution kernel P32.
As shown in fig. 5c, the 32 intermediate feature matrices of size H × W are mapped one-to-one to the depth-by-depth convolution kernels, specifically: the intermediate feature matrix L1 is mapped to the depth-by-depth convolution kernel P1 to generate a feature matrix J1 to be spliced; the intermediate feature matrix L2 is mapped to the depth-by-depth convolution kernel P2 to generate a feature matrix J2 to be spliced; …; the intermediate feature matrix L31 is mapped to the depth-by-depth convolution kernel P31 to generate a feature matrix J31 to be spliced; and the intermediate feature matrix L32 is mapped to the depth-by-depth convolution kernel P32 to generate a feature matrix J32 to be spliced. It should be noted that the size of each feature matrix to be spliced in fig. 5c is H × W, because the sliding step is set to 1 and the 32 intermediate feature matrices can be padded: if the padding size is 1, the sizes of the input and output feature matrices remain consistent; if no padding is performed, the 32 generated feature matrices to be spliced are smaller than the intermediate feature matrices, and the size of the target feature matrices generated by the subsequent feature splicing processing also needs to be adjusted to keep the sizes consistent. The parameter number of the above process is 32 × 3 × 3 = 288.
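A corresponding sketch of the depth-by-depth step, again assuming PyTorch: a 3 × 3 convolution whose group count equals the channel count convolves each intermediate feature matrix with its own kernel, padding of 1 keeps the H × W size, and the parameter count matches 32 × 3 × 3 = 288 (ignoring bias):

```python
import torch
import torch.nn as nn

# One 3x3 kernel per channel: groups == in_channels == out_channels == 32.
depthwise = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3,
                      stride=1, padding=1, groups=32, bias=False)

intermediate = torch.randn(1, 32, 56, 56)   # 32 intermediate feature matrices (assumed H = W = 56)
to_be_spliced = depthwise(intermediate)     # shape (1, 32, 56, 56): size preserved by padding 1
print(sum(p.numel() for p in depthwise.parameters()))   # 288 = 32 * 3 * 3
```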
Optionally, the 32 feature matrices to be spliced obtained in the above process (namely, the feature matrix J1 to be spliced, the feature matrix J2 to be spliced, …, the feature matrix J31 to be spliced, and the feature matrix J32 to be spliced) are first called splicing feature matrices to be activated; the 32 splicing feature matrices to be activated are mapped to a second activation function to generate the 32 feature matrices to be spliced.
It should be understood that the first activation function in step S102 or the second activation function in step S103 may be a common activation function.
Optionally, the at least two intermediate feature matrices are convolved respectively according to the at least two first depth-wise convolution kernels, so as to generate at least two first feature matrices to be spliced.
If the 64 image feature matrices are convolved in step S102 using 16 point-by-point convolution kernels of 1 × 1, 16 intermediate feature matrices are generated, namely the intermediate feature matrix L1, the intermediate feature matrix L2, …, the intermediate feature matrix L15, and the intermediate feature matrix L16; the parameter number of that step is 64 × 16 × 1 × 1 = 1024. In this case, the process performs depth-by-depth convolution on the 16 intermediate feature matrices of size H × W using 16 depth-by-depth convolution kernels of 3 × 3. Assume that the 16 intermediate feature matrices are the intermediate feature matrix L1, the intermediate feature matrix L2, …, the intermediate feature matrix L15, and the intermediate feature matrix L16, and that the 16 depth-by-depth convolution kernels of 3 × 3 are depth-by-depth convolution kernel P1, depth-by-depth convolution kernel P2, …, depth-by-depth convolution kernel P15, and depth-by-depth convolution kernel P16. The 16 intermediate feature matrices of size H × W are mapped one-to-one to the depth-by-depth convolution kernels; the specific mapping manner can refer to the above and is not repeated here. The feature matrix J1 to be spliced, the feature matrix J2 to be spliced, …, the feature matrix J15 to be spliced, and the feature matrix J16 to be spliced are generated; in this case, the parameter number is 16 × 3 × 3 = 144.
Step S104, performing characteristic splicing processing on the at least two intermediate characteristic matrixes and the at least two characteristic matrixes to be spliced to obtain at least two target characteristic matrixes, and performing image identification processing on the image according to the at least two target characteristic matrixes; the number of the at least two target feature matrices is equal to the number of the target channels.
Specifically, at least two target feature matrices are input into a transposed convolution module, and deconvolution processing is performed on the at least two target feature matrices through the transposed convolution module to obtain a feature segmentation matrix; and determining the feature segmentation matrix as a semantic segmentation image of the image, and performing image identification processing on the image according to the semantic segmentation image.
The number of output target channels of the convolution splicing module is 64, and the feature matrices obtained in step S103 are 32 feature matrices to be spliced. Therefore, before the 32 feature matrices to be spliced are output, feature splicing is performed on the 32 feature matrices to be spliced and the 32 intermediate feature matrices. As can be seen from step S102, the 32 intermediate feature matrices are the intermediate feature matrix L1, the intermediate feature matrix L2, …, the intermediate feature matrix L31, and the intermediate feature matrix L32; as can be seen from step S103, the 32 feature matrices to be spliced are the feature matrix J1 to be spliced, the feature matrix J2 to be spliced, …, the feature matrix J31 to be spliced, and the feature matrix J32 to be spliced. The matrix order of the splicing is not limited: the 32 feature matrices to be spliced may be ordered before the 32 intermediate feature matrices, i.e., J1, J2, …, J31, J32, L1, L2, …, L31, L32; the 32 intermediate feature matrices may be ordered before the 32 feature matrices to be spliced, i.e., L1, L2, …, L31, L32, J1, J2, …, J31, J32; they may also be cross-ordered, e.g., L1, J1, L2, J2, …, L31, J31, L32, J32; or the 64 target feature matrices may be generated in a random order.
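Putting the three steps together, the convolution splicing layer can be sketched as below; this is an illustration under the same PyTorch assumption, not the patented implementation itself, and the ReLU activations stand in for the optional first and second activation functions. Channel concatenation of the 32 intermediate feature matrices and the 32 feature matrices to be spliced yields the 64 target channels, and, for comparison, the total parameter count 2048 + 288 = 2336 is far smaller than the 64 × 64 × 3 × 3 = 36864 parameters of an ordinary 3 × 3 convolution with the same input and output channel numbers:

```python
import torch
import torch.nn as nn

class ConvSpliceLayer(nn.Module):
    """Illustrative sketch: pointwise reduction, depthwise convolution,
    then channel concatenation back to the target channel number."""

    def __init__(self, channels: int = 64, reduction: int = 2):
        super().__init__()
        mid = channels // reduction                              # e.g. 64 // 2 = 32
        self.pointwise = nn.Conv2d(channels, mid, 1, bias=False)                 # 2048 params
        self.depthwise = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)  # 288 params
        self.act = nn.ReLU(inplace=True)                         # stands in for the activations

    def forward(self, x):
        intermediate = self.act(self.pointwise(x))               # 64 -> 32 intermediate feature matrices
        to_splice = self.act(self.depthwise(intermediate))       # 32 feature matrices to be spliced
        # Feature splicing: 32 + 32 = 64 target feature matrices.
        return torch.cat([intermediate, to_splice], dim=1)

layer = ConvSpliceLayer()
print(sum(p.numel() for p in layer.parameters()))                # 2336 = 2048 + 288
```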
The 64 target feature matrices are input into the transposed convolution module; the transposed convolution module is the reverse operation of the convolution module and the convolution splicing module, changing the size of the feature map (feature matrix) from small to large. Deconvolution processing is performed on the 64 target feature matrices through the transposed convolution module to obtain a feature segmentation matrix, and the feature segmentation matrix is determined as the semantic segmentation image of the image.
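The upsampling performed by the transposed convolution module can be illustrated as follows (a sketch assuming PyTorch; the kernel size, stride, and output channel number are example choices, not values fixed by the present application):

```python
import torch
import torch.nn as nn

# Doubles the spatial size of the 64-channel target feature matrices.
upsample = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                              kernel_size=2, stride=2)

target = torch.randn(1, 64, 28, 28)   # assumed encoded size
upsampled = upsample(target)          # shape (1, 32, 56, 56): feature map grows from small to large
```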
Optionally, if 16 point-by-point convolution kernels of 1 × 1 are used in step S102 to convolve the 64 image feature matrices, then 16 intermediate feature matrices are generated, namely the intermediate feature matrix L1, the intermediate feature matrix L2, …, the intermediate feature matrix L15, and the intermediate feature matrix L16. Then, in step S103, 16 depth-by-depth convolution kernels of 3 × 3 are used to perform depth-by-depth convolution on the 16 intermediate feature matrices of size H × W to generate 16 first feature matrices to be spliced, and feature splicing processing is performed on the 16 intermediate feature matrices and the 16 first feature matrices to be spliced to obtain 32 feature matrices to be determined. Since the number of output target channels of the convolution splicing module is 64, the 32 feature matrices to be determined are respectively convolved with 32 second depth-by-depth convolution kernels (of size 3 × 3) to generate 32 second feature matrices to be spliced, and feature splicing processing is performed on the 32 feature matrices to be determined and the 32 second feature matrices to be spliced to obtain 64 target feature matrices. The embodiment of the present application describes the feature matrix splicing process by taking 32 point-by-point convolution kernels (the number of point-by-point convolution kernels may be 1/2 of the number of input channels) and 32 depth-by-depth convolution kernels as an example, or 16 point-by-point convolution kernels (the number of point-by-point convolution kernels may be 1/4 of the number of input channels), 16 first depth-by-depth convolution kernels, and 32 second depth-by-depth convolution kernels as an example; in actual application, the number of point-by-point convolution kernels may be 1/2, 1/4, 1/8, etc. of the number of input channels.
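For the 1/4-reduction variant just described, a corresponding sketch (under the same PyTorch assumption, with the optional activations omitted for brevity) looks like this; the parameter count 1024 + 144 + 288 = 1456 follows from the figures given above:

```python
import torch
import torch.nn as nn

class ConvSpliceLayerQuarter(nn.Module):
    """Illustrative sketch of the 1/4-reduction variant: 16 point-by-point kernels,
    16 first depth-by-depth kernels, then 32 second depth-by-depth kernels,
    splicing back to 64 target channels."""

    def __init__(self, channels: int = 64):
        super().__init__()
        mid = channels // 4                                       # 16
        self.pointwise = nn.Conv2d(channels, mid, 1, bias=False)                      # 1024 params
        self.first_dw = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)     # 144 params
        self.second_dw = nn.Conv2d(2 * mid, 2 * mid, 3, padding=1,
                                   groups=2 * mid, bias=False)                        # 288 params

    def forward(self, x):
        intermediate = self.pointwise(x)                          # 16 intermediate feature matrices
        first = self.first_dw(intermediate)                       # 16 first feature matrices to be spliced
        undetermined = torch.cat([intermediate, first], dim=1)    # 32 feature matrices to be determined
        second = self.second_dw(undetermined)                     # 32 second feature matrices to be spliced
        return torch.cat([undetermined, second], dim=1)           # 64 target feature matrices

layer = ConvSpliceLayerQuarter()
print(sum(p.numel() for p in layer.parameters()))                 # 1456 = 1024 + 144 + 288
```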
Optionally, performing first assignment processing on the semantic segmentation image to obtain a first assigned value image, where the first assigned value image is used to extract a foreground object in the image; acquiring a material image, and performing second assignment processing on the semantic segmentation image to obtain a second assigned value image, wherein the second assigned value image is used for extracting a background object in the material image; acquiring a first assignment matrix of a first assignment image, acquiring an original matrix of the image, and performing matrix adjustment on the original matrix according to the first assignment matrix to obtain a first target matrix, wherein the first target matrix is used for representing a foreground object in the image; acquiring a second assignment matrix of the second assignment image, acquiring a material matrix of the material image, and performing matrix adjustment on the material matrix according to the second assignment matrix to obtain a second target matrix, wherein the second target matrix is used for representing a background object in the material image; and performing matrix addition on the first target matrix and the second target matrix to obtain a target image.
Referring to fig. 6, fig. 6 is a schematic view of a scene of image processing according to an embodiment of the present disclosure. As shown in fig. 6, an original semantic matrix 30a of the semantic segmentation image 20i is obtained, and since a foreground object, that is, a portrait object, in the image 20b is to be extracted, pixel points corresponding to the portrait object in the semantic segmentation image 20i are to be retained, and according to prior knowledge and matrix numerical value distribution of the original semantic matrix 30a, values smaller than 30 in the original semantic matrix 30a are all assigned to 0, and values greater than or equal to 30 are all assigned to 1, so that a first assignment matrix 30b is obtained. Similarly, since the background object in the material image 20j is to be extracted, the pixel points corresponding to the background object in the semantic segmentation image 20i are to be retained, and according to the priori knowledge and the matrix numerical value distribution of the original semantic matrix 30a, the numerical values smaller than 30 in the original semantic matrix 30a are all assigned as 1, and the numerical values greater than or equal to 30 are all assigned as 0, so as to obtain the second assignment matrix 30c, obviously, the numerical values in the second assignment matrix 30c are opposite to the numerical values in the first assignment matrix 30 b.
An original matrix 30d of the image 20b is acquired, and matrix adjustment is performed on the original matrix 30d according to the first assignment matrix 30b, namely the numerical values at the same positions of the two matrices are multiplied, to obtain a first target matrix 30e; only the foreground object in the image 20b is retained in the first target matrix 30e. A material matrix 30f of the material image 20j is acquired, and matrix adjustment is performed on the material matrix 30f according to the second assignment matrix 30c, namely the numerical values at the same positions of the two matrices are multiplied, to obtain a second target matrix 30g; only the background object in the material image 20j is retained in the second target matrix 30g. As shown in fig. 6, the first target matrix 30e and the second target matrix 30g are subjected to matrix addition, that is, the numerical values at the same positions of the two matrices are added, to obtain a target matrix 30h; obviously, the target matrix 30h includes the foreground object in the image 20b and the background object in the material image 20j, and finally the target image 20k is obtained from the target matrix 30h.
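The matrix operations of fig. 6 amount to mask-based compositing. A minimal NumPy sketch is given below; the threshold of 30 follows the example above, while the function name, the H × W × C layout of the images, and the H × W layout of the semantic matrix are assumptions made only for illustration:

```python
import numpy as np

def composite(original, material, semantic, threshold=30):
    """Keep the foreground of `original` and the background of `material`,
    as in the matrix adjustment and matrix addition of fig. 6."""
    original = np.asarray(original, dtype=np.float32)      # original matrix 30d
    material = np.asarray(material, dtype=np.float32)      # material matrix 30f
    semantic = np.asarray(semantic, dtype=np.float32)      # original semantic matrix 30a
    mask_fg = (semantic >= threshold).astype(np.float32)   # first assignment matrix 30b
    mask_bg = 1.0 - mask_fg                                 # second assignment matrix 30c
    foreground = original * mask_fg[..., None]              # first target matrix 30e
    background = material * mask_bg[..., None]              # second target matrix 30g
    return foreground + background                          # target matrix 30h -> target image 20k
```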
According to the method, the image is input into an image segmentation model by acquiring the image, the image segmentation model comprises a convolution module and a convolution splicing module, and at least two image characteristic matrixes of the image are obtained through the convolution module; the convolution splicing module comprises at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels, and the features output by the convolution splicing module have the number of target channels; then, performing convolution on at least two image characteristic matrixes through each point-by-point convolution kernel respectively, namely mapping the at least two image characteristic matrixes to each point-by-point convolution kernel respectively to generate at least two intermediate characteristic matrixes, wherein obviously, the number of the matrixes of the at least two intermediate characteristic matrixes is the same as that of the convolution kernels of the at least two point-by-point convolution kernels; the number of convolution kernels of the at least two point-by-point convolution kernels is smaller than the number of target channels, so that the number of matrixes of the at least two intermediate feature matrixes is smaller than the number of target channels; and then, convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced. The method comprises the steps of performing characteristic splicing processing on at least two intermediate characteristic matrixes and at least two characteristic matrixes to be spliced to obtain at least two target characteristic matrixes, wherein the number of the at least two target characteristic matrixes is equal to the number of target channels, and finally performing image identification processing on an image according to the at least two target characteristic matrixes. In this way, the feature splicing processing is performed on the at least two intermediate feature matrices and the at least two feature matrices to be spliced, so that the parameter quantity can be greatly reduced in the convolution splicing module, and the running speed of the image segmentation model can be further improved.
Further, please refer to fig. 7, and fig. 7 is a flowchart illustrating an image processing method according to an embodiment of the present application. As shown in fig. 7, the image processing process includes the steps of:
step S201, acquiring a training sample set; the training sample set comprises a sample image and a label image, wherein the label image is used for representing an object class label to which each pixel point in the sample image belongs.
Specifically, in the embodiment corresponding to fig. 4, the semantic segmentation of the portrait image is mainly performed, and when the embodiment is actually applied, the segmentation of other objects, such as animals, vehicles, sky, and the like, can be realized only by replacing the training sample set.
Step S202, inputting a training sample set into a sample image segmentation model to generate a sample image characteristic matrix of a sample image; the sample image segmentation model comprises a sample convolution splicing module; the sample image segmentation model includes at least two object class labels.
Specifically, in order to achieve a good training effect of the sample image segmentation model, the sample images in the training sample set may be preprocessed first, where the image preprocessing method includes, but is not limited to, image normalization, nearest neighbor interpolation, bilinear interpolation, bicubic interpolation, and filtering the sample images, such as mean filtering, gaussian high/low pass filtering, and median filtering.
For a specific process of inputting the preprocessed sample image and the label image into the sample image segmentation model and generating the sample image feature matrix of the sample image, refer to step S101 in the embodiment corresponding to fig. 4, which is not described herein again.
Step S203, extracting a predicted image characteristic matrix associated with each category label in the sample image characteristic matrix through a sample convolution splicing module.
For a specific process, please refer to step S102-step S104 in the embodiment corresponding to fig. 4, which is not described herein again.
Step S204, adjusting the sample image segmentation model according to the predicted image feature matrix and the label image to obtain an image segmentation model containing the convolution splicing module.
Specifically, a tag image feature matrix of the tag image is generated, and a model loss value is generated according to the tag image feature matrix and the predicted image feature matrix; and adjusting the weight of parameters in the sample image segmentation model according to the model loss value, and determining the adjusted image segmentation model containing the convolution splicing module as the image segmentation model when the model loss value meets the convergence condition.
It can be understood that, image features extracted by the sample image segmentation model at the initial stage are incomplete, and therefore, an error exists between the predicted image feature matrix and the tag image feature matrix corresponding to the tag image, so that a large model loss value exists in the sample image segmentation model, and it is necessary to continuously adjust the weight of parameters in the sample image segmentation model until the model loss value converges, and determine the adjusted image segmentation model including the convolution stitching module as the image segmentation model.
Optionally, the number of training iterations is set, for example, the number of training iterations is set to 100, and when the number of training iterations reaches 100, the model loss value is not considered any more, and the image segmentation model is determined.
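A hedged sketch of this training procedure is shown below, assuming a PyTorch-style setup. The cross-entropy loss, the SGD optimizer, and the iteration count of 100 are illustrative choices, not requirements of the present application, which only requires that a model loss value be generated from the label and predicted features and that training stop on convergence or after the set number of iterations:

```python
import torch
import torch.nn as nn

def train(model, loader, num_iterations=100, lr=1e-2):
    """Adjust the weights of the sample image segmentation model until the loss
    converges or the iteration budget is spent (illustrative only)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()          # compares prediction with the label image
    for step, (sample, label) in zip(range(num_iterations), loader):
        prediction = model(sample)             # per-pixel class scores, shape (N, C, H, W)
        loss = criterion(prediction, label)    # label: per-pixel class indices, shape (N, H, W)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                       # adjust parameter weights by the model loss value
    return model
```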
According to the method, the image is input into an image segmentation model by acquiring the image, the image segmentation model comprises a convolution module and a convolution splicing module, and at least two image characteristic matrixes of the image are obtained through the convolution module; the convolution splicing module comprises at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels, and the features output by the convolution splicing module have the number of target channels; then, performing convolution on at least two image characteristic matrixes through each point-by-point convolution kernel respectively, namely mapping the at least two image characteristic matrixes to each point-by-point convolution kernel respectively to generate at least two intermediate characteristic matrixes, wherein obviously, the number of the matrixes of the at least two intermediate characteristic matrixes is the same as that of the convolution kernels of the at least two point-by-point convolution kernels; the number of convolution kernels of the at least two point-by-point convolution kernels is smaller than the number of target channels, so that the number of matrixes of the at least two intermediate feature matrixes is smaller than the number of target channels; and then, convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced. The method comprises the steps of performing characteristic splicing processing on at least two intermediate characteristic matrixes and at least two characteristic matrixes to be spliced to obtain at least two target characteristic matrixes, wherein the number of the at least two target characteristic matrixes is equal to the number of target channels, and finally performing image identification processing on an image according to the at least two target characteristic matrixes. In this way, the feature splicing processing is performed on the at least two intermediate feature matrices and the at least two feature matrices to be spliced, so that the parameter quantity can be greatly reduced in the convolution splicing module, and the running speed of the image segmentation model can be further improved.
Further, please refer to fig. 8, where fig. 8 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. The image processing apparatus may be a computer program (including program code) running on a computer device, for example, the image processing apparatus is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 8, the image processing apparatus 1 may include: a first acquisition module 11, a first generation module 12 and a stitching feature module 13.
The first acquisition module 11 is configured to acquire an image, input the image into an image segmentation model, and generate at least two image feature matrices of the image; the image segmentation model comprises a convolution splicing module, wherein the convolution splicing module comprises at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels; the features output by the convolution splicing module have a target channel number;
a first generation module 12, configured to convolve at least two image feature matrices with each point-by-point convolution kernel, respectively, and generate at least two intermediate feature matrices; the number of convolution kernels of the at least two point-by-point convolution kernels is smaller than the number of target channels;
the first generating module 12 is further configured to convolve the at least two intermediate feature matrices according to the at least two depth-wise convolution kernels, respectively, and generate at least two feature matrices to be spliced;
the splicing characteristic module 13 is configured to perform characteristic splicing processing on the at least two intermediate characteristic matrices and the at least two characteristic matrices to be spliced to obtain at least two target characteristic matrices, and perform image identification processing on the image according to the at least two target characteristic matrices; the number of the at least two target feature matrices is equal to the number of the target channels.
For specific functional implementation manners of the first obtaining module 11, the first generating module 12 and the splicing feature module 13, reference may be made to steps S101 to S104 in the embodiment corresponding to fig. 4, which is not described herein again.
Referring again to fig. 8, the splice feature module 13 may include: a first processing unit 131 and a second processing unit 132.
The first processing unit 131 is configured to input the at least two target feature matrices into a transposed convolution module, and perform deconvolution processing on the at least two target feature matrices through the transposed convolution module to obtain a feature segmentation matrix;
and a second processing unit 132, configured to determine the feature segmentation matrix as a semantic segmentation image of the image, and perform image recognition processing on the image according to the semantic segmentation image.
For specific functional implementation of the first processing unit 131 and the second processing unit 132, reference may be made to step S104 in the embodiment corresponding to fig. 4, which is not described herein again.
Referring to fig. 8 again, the second processing unit 132 may include: a first assignment subunit 1321, a second assignment subunit 1322, and a generate target subunit 1323.
A first assignment subunit 1321, configured to perform a first assignment process on the semantic segmentation image to obtain a first assignment image; the first value-assigned image is used for extracting a foreground object in the image;
a second assignment subunit 1322, configured to obtain a material image, and perform second assignment processing on the semantic segmentation image to obtain a second assignment image; the second assigned value image is used for extracting a background object in the material image;
a generation target subunit 1323 configured to generate a target image from the material image, the first assigned value image, and the second assigned value image; the target image includes a foreground object and a background object.
For specific functional implementation manners of the first assignment subunit 1321, the second assignment subunit 1322 and the generation target subunit 1323, reference may be made to step S104 in the embodiment corresponding to fig. 4, which is not described herein again.
Referring to fig. 8 again, a target subunit 1323 is generated, which may be specifically configured to obtain a first assignment matrix of the first assignment image, obtain an original matrix of the image, and perform matrix adjustment on the original matrix according to the first assignment matrix to obtain a first target matrix; the first target matrix is used for representing a foreground object in the image;
generating a target subunit, and specifically, obtaining a second assignment matrix of a second assignment image, obtaining a material matrix of the material image, and performing matrix adjustment on the material matrix according to the second assignment matrix to obtain a second target matrix; the second target matrix is used for representing a background object in the material image;
and generating a target subunit, and specifically, performing matrix addition on the first target matrix and the second target matrix to obtain a target image.
For a specific function implementation manner of the generation target subunit 1323, reference may be made to step S104 in the embodiment corresponding to fig. 4, which is not described herein again.
Referring to fig. 8 again, the first generating module 12 may be specifically configured to convolve the at least two intermediate feature matrices according to the at least two first depth-wise convolution kernels, respectively, to generate at least two first feature matrices to be spliced;
the splice feature module 13 may include: a third processing unit 133, a first generating unit 134, and a fourth processing unit 135.
The third processing unit 133 is configured to perform feature splicing processing on the at least two intermediate feature matrices and the at least two first feature matrices to be spliced to obtain at least two feature matrices to be determined;
the first generating unit 134 is configured to convolve the at least two feature matrices to be determined respectively according to the at least two second depth-wise convolution kernels, and generate at least two second feature matrices to be spliced;
the fourth processing unit 135 is configured to perform feature splicing processing on the at least two feature matrices to be determined and the at least two second feature matrices to be spliced to obtain at least two target feature matrices.
For specific functional implementation manners of the third processing unit 133, the first generating unit 134, and the fourth processing unit 135, reference may be made to step S103 to step S104 in the embodiment corresponding to fig. 4, which is not described herein again.
Referring again to fig. 8, the first generating module 12 may include: a second generating unit 121 and a first determining unit 122.
a second generating unit 121, configured to convolve each image feature matrix through each point-by-point convolution channel of the point-by-point convolution kernel Ki to generate at least two intermediate feature sub-matrices Zi, and fuse the at least two intermediate feature sub-matrices Zi to obtain an intermediate feature matrix Li; one point-by-point convolution channel in the point-by-point convolution kernel Ki corresponds to one image feature matrix in the at least two image feature matrices;
the second generating unit 121 is further configured to convolve each image feature matrix through each point-by-point convolution channel of the point-by-point convolution kernel Ki+1 to generate at least two intermediate feature sub-matrices Zi+1, and fuse the at least two intermediate feature sub-matrices Zi+1 to obtain an intermediate feature matrix Li+1; one point-by-point convolution channel in the point-by-point convolution kernel Ki+1 corresponds to one image feature matrix in the at least two image feature matrices;
a first determination unit 122 for determining the intermediate feature matrix LiAnd an intermediate feature matrix Li+1At least two intermediate feature matrices are determined.
For specific functional implementation manners of the second generating unit 121 and the first determining unit 122, reference may be made to step S102 in the embodiment corresponding to fig. 4, which is not described herein again.
Referring again to fig. 8, the first generating module 12 may include: a third generating unit 123 and a second determining unit 124.
a third generating unit 123, configured to convolve the intermediate feature matrix Li with the depth-by-depth convolution kernel Pi to generate a feature matrix Ji to be spliced;
the third generating unit 123 is further configured to convolve the intermediate feature matrix Li+1 with the depth-by-depth convolution kernel Pi+1 to generate a feature matrix Ji+1 to be spliced;
a second determining unit 124, configured to determine the feature matrix Ji to be spliced and the feature matrix Ji+1 to be spliced as the at least two feature matrices to be spliced.
For specific functional implementation manners of the third generating unit 123 and the second determining unit 124, reference may be made to step S103 in the embodiment corresponding to fig. 4, which is not described herein again.
Referring again to fig. 8, the first generating module 12 may include: a fourth generation unit 125.
A fourth generating unit 125, configured to convolve at least two image feature matrices with each point-by-point convolution kernel, respectively, and generate at least two intermediate feature matrices to be activated;
the fourth generating unit 125 is further configured to map the at least two intermediate feature matrices to be activated to the first activation function, so as to generate at least two intermediate feature matrices.
The specific functional implementation manner of the fourth generating unit 125 may refer to step S102 in the embodiment corresponding to fig. 4, which is not described herein again.
Referring again to fig. 8, the first generating module 12 may include: a fifth generating unit 126.
A fifth generating unit 126, configured to convolve the at least two intermediate feature matrices according to the at least two depth-wise convolution kernels, respectively, and generate at least two to-be-activated splicing feature matrices;
the fifth generating unit 126 is further configured to map the at least two splicing feature matrices to be activated to the second activation function, so as to generate the at least two feature matrices to be spliced.
The specific functional implementation manner of the fifth generating unit 126 may refer to step S103 in the embodiment corresponding to fig. 4, which is not described herein again.
Referring again to fig. 8, the image processing apparatus 1 may further include: a second acquisition module 14, a second generation module 15, an extract features module 16, and an adjustment model module 17.
A second obtaining module 14, configured to obtain a training sample set; the training sample set comprises a sample image and a label image, and the label image is used for representing an object class label to which each pixel point in the sample image belongs;
the second generating module 15 is configured to input the training sample set into the sample image segmentation model, and generate a sample image feature matrix of the sample image; the sample image segmentation model comprises a sample convolution splicing module; the sample image segmentation model comprises at least two object class labels;
the extraction characteristic module 16 is used for extracting a predicted image characteristic matrix associated with each category label in the sample image characteristic matrix through the sample convolution splicing module;
and the adjusting model module 17 is configured to adjust the sample image segmentation model according to the predicted image feature matrix and the tag image, so as to obtain an image segmentation model including the convolution splicing module.
The specific functional implementation manners of the second obtaining module 14, the second generating module 15, the feature extracting module 16, and the model adjusting module 17 may refer to steps S201 to S204 in the embodiment corresponding to fig. 7, which are not described herein again.
Referring again to fig. 8, the adjustment model module 17 may include: a sixth generating unit 171 and a third determining unit 172.
A sixth generating unit 171, configured to generate a tag image feature matrix of the tag image, and generate a model loss value according to the tag image feature matrix and the predicted image feature matrix;
and a third determining unit 172, configured to adjust a weight of a parameter in the sample image segmentation model according to the model loss value, and determine the adjusted image segmentation model including the convolution splicing module as the image segmentation model when the model loss value satisfies a convergence condition.
For specific functional implementation of the sixth generating unit 171 and the third determining unit 172, reference may be made to step S204 in the embodiment corresponding to fig. 7, which is not described herein again.
According to the method, the image is input into an image segmentation model by acquiring the image, the image segmentation model comprises a convolution module and a convolution splicing module, and at least two image characteristic matrixes of the image are obtained through the convolution module; the convolution splicing module comprises at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels, and the features output by the convolution splicing module have the number of target channels; then, performing convolution on at least two image characteristic matrixes through each point-by-point convolution kernel respectively, namely mapping the at least two image characteristic matrixes to each point-by-point convolution kernel respectively to generate at least two intermediate characteristic matrixes, wherein obviously, the number of the matrixes of the at least two intermediate characteristic matrixes is the same as that of the convolution kernels of the at least two point-by-point convolution kernels; the number of convolution kernels of the at least two point-by-point convolution kernels is smaller than the number of target channels, so that the number of matrixes of the at least two intermediate feature matrixes is smaller than the number of target channels; and then, convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced. The method comprises the steps of performing characteristic splicing processing on at least two intermediate characteristic matrixes and at least two characteristic matrixes to be spliced to obtain at least two target characteristic matrixes, wherein the number of the at least two target characteristic matrixes is equal to the number of target channels, and finally performing image identification processing on an image according to the at least two target characteristic matrixes. In this way, the feature splicing processing is performed on the at least two intermediate feature matrices and the at least two feature matrices to be spliced, so that the parameter quantity can be greatly reduced in the convolution splicing module, and the running speed of the image segmentation model can be further improved.
Further, please refer to fig. 9, where fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the computer apparatus 1000 may include: the processor 1001, the network interface 1004, and the memory 1005, and the computer apparatus 1000 may further include: a user interface 1003, and at least one communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). The memory 1005 may optionally be at least one memory device located remotely from the processor 1001. As shown in fig. 9, a memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 9, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring an image, inputting the image into an image segmentation model, and generating at least two image characteristic matrixes of the image; the image segmentation model comprises a convolution splicing module, wherein the convolution splicing module comprises at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels; the features output by the convolution splicing module have a target channel number;
respectively convolving at least two image feature matrixes through each point-by-point convolution kernel to generate at least two intermediate feature matrixes; the number of convolution kernels of the at least two point-by-point convolution kernels is smaller than the number of target channels;
convolving at least two intermediate feature matrices according to at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced;
performing feature splicing processing on the at least two intermediate feature matrices and the at least two feature matrices to be spliced to obtain at least two target feature matrices, and performing image identification processing on the image according to the at least two target feature matrices; the number of the at least two target feature matrices is equal to the number of the target channels.
In one embodiment, an image segmentation model includes a transposed convolution module;
when the processor 1001 executes image recognition processing on an image according to at least two target feature matrices, the following steps are specifically executed:
inputting at least two target feature matrixes into a transposition convolution module, and performing deconvolution processing on the at least two target feature matrixes through the transposition convolution module to obtain feature segmentation matrixes;
and determining the feature segmentation matrix as a semantic segmentation image of the image, and performing image identification processing on the image according to the semantic segmentation image.
In one embodiment, when the processor 1001 performs the image recognition processing on the image according to the semantic segmentation image, the following steps are specifically performed:
performing first assignment processing on the semantic segmentation image to obtain a first assignment image; the first value-assigned image is used for extracting a foreground object in the image;
acquiring a material image, and performing second assignment processing on the semantic segmentation image to obtain a second assignment image; the second assigned value image is used for extracting a background object in the material image;
generating a target image according to the material image, the first assigned value image and the second assigned value image; the target image includes a foreground object and a background object.
In one embodiment, when the processor 1001 generates the target image according to the material image, the first assigned value image, and the second assigned value image, the processor specifically performs the following steps:
acquiring a first assignment matrix of a first assignment image, acquiring an original matrix of the image, and performing matrix adjustment on the original matrix according to the first assignment matrix to obtain a first target matrix; the first target matrix is used for representing a foreground object in the image;
acquiring a second assignment matrix of the second assignment image, acquiring a material matrix of the material image, and performing matrix adjustment on the material matrix according to the second assignment matrix to obtain a second target matrix; the second target matrix is used for representing a background object in the material image;
and performing matrix addition on the first target matrix and the second target matrix to obtain a target image.
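The matrix adjustment and matrix addition described above can be pictured with the following minimal NumPy sketch, which assumes a binary segmentation mask in which 1 marks foreground pixels; the variable names and sizes are illustrative only.

    import numpy as np

    h, w = 4, 4
    image = np.random.rand(h, w, 3)               # original matrix of the image
    material = np.random.rand(h, w, 3)            # material matrix (new background)
    mask = np.random.randint(0, 2, (h, w, 1))     # semantic segmentation, 1 = foreground

    first_assignment = mask                        # first assignment matrix (keeps foreground)
    second_assignment = 1 - mask                   # second assignment matrix (keeps background)

    first_target = image * first_assignment        # foreground object of the image
    second_target = material * second_assignment   # background object of the material image
    target_image = first_target + second_target    # matrix addition -> target image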
In one embodiment, the at least two depth-wise convolution kernels include at least two first depth-wise convolution kernels and at least two second depth-wise convolution kernels;
when the processor 1001 performs convolution on at least two intermediate feature matrices according to at least two depth-by-depth convolution kernels to generate at least two feature matrices to be spliced, the following steps are specifically performed:
convolving the at least two intermediate feature matrices according to the at least two first depth-by-depth convolution kernels respectively to generate at least two first feature matrices to be spliced;
performing feature splicing processing on the at least two intermediate feature matrices and the at least two feature matrices to be spliced to obtain at least two target feature matrices, including:
performing feature splicing processing on the at least two intermediate feature matrices and the at least two first feature matrices to be spliced to obtain at least two feature matrices to be determined;
convolving the at least two feature matrices to be determined respectively according to the at least two second depth-by-depth convolution kernels to generate at least two second feature matrices to be spliced;
and performing feature splicing processing on the at least two feature matrices to be determined and the at least two second feature matrices to be spliced to obtain the at least two target feature matrices.
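A minimal sketch of this two-stage variant, again assuming PyTorch and illustrative channel counts (here the target channel number works out to four times the intermediate channel number), might look as follows.

    import torch
    import torch.nn as nn

    mid = 12                                                              # intermediate channels
    dw_first = nn.Conv2d(mid, mid, 3, padding=1, groups=mid)              # first depth-by-depth kernels
    dw_second = nn.Conv2d(2 * mid, 2 * mid, 3, padding=1, groups=2 * mid) # second depth-by-depth kernels

    intermediate = torch.randn(1, mid, 64, 64)                            # intermediate feature matrices
    first_to_splice = dw_first(intermediate)                              # first feature matrices to be spliced
    to_determine = torch.cat([intermediate, first_to_splice], dim=1)      # feature matrices to be determined
    second_to_splice = dw_second(to_determine)                            # second feature matrices to be spliced
    target = torch.cat([to_determine, second_to_splice], dim=1)           # target feature matrices (4 * mid)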
In one embodiment, the at least two point-by-point convolution kernels comprise a point-by-point convolution kernel Kᵢ and a point-by-point convolution kernel Kᵢ₊₁, where i is a positive integer; each point-by-point convolution kernel comprises at least two point-by-point convolution channels, and the number of the at least two point-by-point convolution channels is equal to the number of the at least two image feature matrices;
when the processor 1001 performs convolution on at least two image feature matrices through each point-by-point convolution kernel to generate at least two intermediate feature matrices, the following steps are specifically performed:
convolving each image feature matrix through each point-by-point convolution channel of the point-by-point convolution kernel Kᵢ to generate at least two intermediate feature sub-matrices Zᵢ, and fusing the at least two intermediate feature sub-matrices Zᵢ to obtain an intermediate feature matrix Lᵢ; one point-by-point convolution channel of the point-by-point convolution kernel Kᵢ corresponds to one image feature matrix of the at least two image feature matrices;
convolving each image feature matrix through each point-by-point convolution channel of the point-by-point convolution kernel Kᵢ₊₁ to generate at least two intermediate feature sub-matrices Zᵢ₊₁, and fusing the at least two intermediate feature sub-matrices Zᵢ₊₁ to obtain an intermediate feature matrix Lᵢ₊₁; one point-by-point convolution channel of the point-by-point convolution kernel Kᵢ₊₁ corresponds to one image feature matrix of the at least two image feature matrices;
and determining the intermediate feature matrix Lᵢ and the intermediate feature matrix Lᵢ₊₁ as the at least two intermediate feature matrices.
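The per-kernel behaviour can be written out explicitly: for a 1x1 point-by-point kernel, each channel scales one image feature matrix, and the resulting sub-matrices are summed (fused) into a single intermediate matrix. The sketch below, assuming PyTorch tensors and illustrative sizes, shows this for one kernel Kᵢ.

    import torch

    num_maps = 3
    feature_maps = torch.randn(num_maps, 32, 32)   # the at least two image feature matrices
    kernel_i = torch.randn(num_maps, 1, 1)         # one 1x1 weight per point-by-point channel

    sub_matrices = feature_maps * kernel_i         # intermediate feature sub-matrices Z_i
    intermediate_i = sub_matrices.sum(dim=0)       # fused intermediate feature matrix L_i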
In one embodiment, the at least two depth-by-depth convolution kernels comprise a depth-by-depth convolution kernel Pᵢ and a depth-by-depth convolution kernel Pᵢ₊₁, where i is a positive integer, and the number of the at least two depth-by-depth convolution kernels is equal to the number of the at least two intermediate feature matrices;
when the processor 1001 performs convolution on at least two intermediate feature matrices according to at least two depth-by-depth convolution kernels to generate at least two feature matrices to be spliced, the following steps are specifically performed:
convolving the intermediate feature matrix Lᵢ through the depth-by-depth convolution kernel Pᵢ to generate a feature matrix Jᵢ to be spliced;
convolving the intermediate feature matrix Lᵢ₊₁ through the depth-by-depth convolution kernel Pᵢ₊₁ to generate a feature matrix Jᵢ₊₁ to be spliced;
and determining the feature matrix Jᵢ to be spliced and the feature matrix Jᵢ₊₁ to be spliced as the at least two feature matrices to be spliced.
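By contrast, a depth-by-depth kernel touches only one matrix: Pᵢ slides over Lᵢ alone and produces Jᵢ, with no summation across channels. A minimal sketch, assuming PyTorch and a 3x3 kernel padded so the spatial size is preserved:

    import torch
    import torch.nn.functional as F

    L_i = torch.randn(1, 1, 32, 32)        # one intermediate feature matrix
    P_i = torch.randn(1, 1, 3, 3)          # one depth-by-depth convolution kernel
    J_i = F.conv2d(L_i, P_i, padding=1)    # feature matrix to be spliced, same spatial size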
In an embodiment, when the processor 1001 performs convolution on at least two image feature matrices through each point-by-point convolution kernel to generate at least two intermediate feature matrices, the following steps are specifically performed:
respectively convolving at least two image feature matrixes through each point-by-point convolution kernel to generate at least two intermediate feature matrixes to be activated;
and mapping at least two intermediate feature matrixes to be activated to the first activation function to generate at least two intermediate feature matrixes.
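The activation step can be sketched as follows, assuming PyTorch and ReLU as the (otherwise unspecified) first activation function; the second activation function after the depth-by-depth stage can be applied in the same way.

    import torch
    import torch.nn as nn

    pointwise = nn.Conv2d(16, 24, kernel_size=1)
    first_activation = nn.ReLU()

    x = torch.randn(1, 16, 64, 64)                 # image feature matrices
    to_activate = pointwise(x)                     # intermediate feature matrices to be activated
    intermediate = first_activation(to_activate)   # intermediate feature matrices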
In an embodiment, when the processor 1001 performs convolution on at least two intermediate feature matrices according to at least two depth-wise convolution kernels respectively to generate at least two feature matrices to be spliced, the following steps are specifically performed:
respectively convolving at least two intermediate feature matrices according to at least two depth-by-depth convolution kernels to generate at least two splicing feature matrices to be activated;
and mapping the at least two splicing feature matrices to be activated to a second activation function to generate the at least two feature matrices to be spliced.
In an embodiment, the processor 1001 further specifically executes the following steps:
acquiring a training sample set, wherein the training sample set comprises a sample image and a label image; the label image is used for representing an object class label to which each pixel point in the sample image belongs;
inputting a training sample set into a sample image segmentation model to generate a sample image characteristic matrix of a sample image; the sample image segmentation model comprises a sample convolution splicing module; the sample image segmentation model comprises at least two object class labels;
extracting a predicted image feature matrix associated with each category label in the sample image feature matrix through a sample convolution splicing module;
and adjusting the sample image segmentation model according to the predicted image feature matrix and the label image to obtain an image segmentation model containing the convolution splicing module.
In an embodiment, when the processor 1001 adjusts the sample image segmentation model according to the predicted image feature matrix and the tag image to obtain an image segmentation model including a convolution concatenation module, the following steps are specifically performed:
generating a tag image characteristic matrix of the tag image, and generating a model loss value according to the tag image characteristic matrix and the predicted image characteristic matrix;
and adjusting the weight of parameters in the sample image segmentation model according to the model loss value, and determining the adjusted image segmentation model containing the convolution splicing module as the image segmentation model when the model loss value meets the convergence condition.
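For illustration, a minimal training sketch is given below, assuming PyTorch, a per-pixel cross-entropy loss as the model loss value, a tiny stand-in network in place of the sample image segmentation model, and an arbitrary loss threshold as the convergence condition; all of these choices are assumptions, not details of the embodiment.

    import torch
    import torch.nn as nn

    # tiny stand-in for the sample image segmentation model (illustrative only)
    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 2, kernel_size=1))        # 2 object class labels
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = nn.CrossEntropyLoss()                            # per-pixel loss vs. the label image

    sample_image = torch.randn(1, 3, 64, 64)                     # sample image
    label_image = torch.randint(0, 2, (1, 64, 64))               # object class label per pixel

    for step in range(100):
        predicted = model(sample_image)                          # predicted image feature matrix (logits)
        loss = criterion(predicted, label_image)                 # model loss value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                         # adjust parameter weights
        if loss.item() < 0.05:                                   # illustrative convergence condition
            break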
According to the method, an image is acquired and input into an image segmentation model, where the image segmentation model comprises a convolution module and a convolution splicing module, and at least two image feature matrices of the image are obtained through the convolution module. The convolution splicing module comprises at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels, and the features output by the convolution splicing module have a target channel number. The at least two image feature matrices are then convolved through each point-by-point convolution kernel respectively, that is, the at least two image feature matrices are respectively mapped to each point-by-point convolution kernel to generate at least two intermediate feature matrices, so the number of the at least two intermediate feature matrices is the same as the number of the at least two point-by-point convolution kernels. Because the number of the at least two point-by-point convolution kernels is smaller than the target channel number, the number of the at least two intermediate feature matrices is also smaller than the target channel number. Next, the at least two intermediate feature matrices are convolved according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced. Feature splicing processing is performed on the at least two intermediate feature matrices and the at least two feature matrices to be spliced to obtain at least two target feature matrices, where the number of the at least two target feature matrices is equal to the target channel number, and finally image identification processing is performed on the image according to the at least two target feature matrices. In this way, because the feature splicing processing reuses the at least two intermediate feature matrices together with the at least two feature matrices to be spliced, the parameter quantity of the convolution splicing module can be greatly reduced, and the running speed of the image segmentation model can be further improved.
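To see why the parameter quantity drops, consider a back-of-the-envelope comparison with illustrative channel counts (24 input channels, 48 target channels, 3x3 kernels); the numbers are assumptions chosen only to make the arithmetic concrete.

    in_ch, mid_ch, target_ch, k = 24, 24, 48, 3

    standard = in_ch * target_ch * k * k                    # ordinary 3x3 convolution: 10368 weights
    spliced = in_ch * mid_ch * 1 * 1 + mid_ch * k * k       # pointwise + depthwise: 576 + 216 = 792 weights
    print(standard, spliced, round(standard / spliced, 1))  # roughly a 13x reduction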
An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and the computer program includes program instructions. When the program instructions are executed by a processor, the image processing method provided in each step in fig. 4 and fig. 7 is implemented; for details, reference may be made to the implementation manners provided in the steps in fig. 4 and fig. 7, which are not described herein again.
The computer-readable storage medium may be the image processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash card (flash card), and the like, provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
The terms "first," "second," and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure describes only preferred embodiments of the present application and certainly is not intended to limit the scope of the present application; equivalent variations and modifications made in accordance with the present application still fall within the scope covered by the present application.

Claims (14)

1. An image processing method, comprising:
acquiring an image, inputting the image into an image segmentation model, and generating at least two image characteristic matrixes of the image; the image segmentation model comprises a convolution splicing module, wherein the convolution splicing module comprises at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels; the features output by the convolution splicing module have a target channel number;
respectively convolving the at least two image feature matrices through each point-by-point convolution kernel to generate at least two intermediate feature matrices; the number of convolution kernels of the at least two point-by-point convolution kernels is smaller than the number of target channels;
convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced;
performing feature splicing processing on the at least two intermediate feature matrices and the at least two feature matrices to be spliced to obtain at least two target feature matrices, and performing image identification processing on the image according to the at least two target feature matrices; the number of the at least two target feature matrices is equal to the number of the target channels.
2. The method of claim 1, wherein the image segmentation model comprises a transposed convolution module;
the image recognition processing of the image according to the at least two target feature matrices includes:
inputting the at least two target feature matrixes into the transposed convolution module, and performing deconvolution processing on the at least two target feature matrixes through the transposed convolution module to obtain feature segmentation matrixes;
and determining the characteristic segmentation matrix as a semantic segmentation image of the image, and carrying out image identification processing on the image according to the semantic segmentation image.
3. The method of claim 2, wherein the image recognition processing of the image according to the semantically segmented image comprises:
performing first assignment processing on the semantic segmentation image to obtain a first assignment image; the first assigned value image is used for extracting a foreground object in the image;
acquiring a material image, and performing second assignment processing on the semantic segmentation image to obtain a second assignment image; the second assigned value image is used for extracting a background object in the material image;
generating a target image according to the material image, the first assigned value image and the second assigned value image; the target image includes the foreground object and the background object.
4. The method according to claim 3, wherein the generating a target image from the material image, the first assigned value image, and the second assigned value image includes:
acquiring a first assignment matrix of the first assignment image, acquiring an original matrix of the image, and performing matrix adjustment on the original matrix according to the first assignment matrix to obtain a first target matrix; the first target matrix is used for representing a foreground object in the image;
acquiring a second assignment matrix of the second assignment image, acquiring a material matrix of the material image, and performing matrix adjustment on the material matrix according to the second assignment matrix to obtain a second target matrix; the second target matrix is used for representing a background object in the material image;
and performing matrix addition on the first target matrix and the second target matrix to obtain the target image.
5. The method of claim 1, wherein the at least two depth-wise convolution kernels comprise at least two first depth-wise convolution kernels and at least two second depth-wise convolution kernels;
the convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced includes:
convolving the at least two intermediate feature matrices according to the at least two first depth-by-depth convolution kernels respectively to generate at least two first feature matrices to be spliced;
performing feature splicing processing on the at least two intermediate feature matrices and the at least two feature matrices to be spliced to obtain at least two target feature matrices, including:
performing feature splicing processing on the at least two intermediate feature matrices and the at least two first feature matrices to be spliced to obtain at least two feature matrices to be determined;
convolving the at least two feature matrixes to be determined respectively according to the at least two second depth-by-depth convolution kernels to generate at least two second feature matrixes to be spliced;
and performing feature splicing processing on the at least two feature matrices to be determined and the at least two second feature matrices to be spliced to obtain the at least two target feature matrices.
6. The method of claim 1, wherein the at least two point-by-point convolution kernels comprise a point-by-point convolution kernel Kᵢ and a point-by-point convolution kernel Kᵢ₊₁, where i is a positive integer; each point-by-point convolution kernel comprises at least two point-by-point convolution channels, and the number of the at least two point-by-point convolution channels is equal to the number of the at least two image feature matrices;
the convolving the at least two image feature matrices by each point-by-point convolution kernel to generate at least two intermediate feature matrices includes:
convolving each image feature matrix through each point-by-point convolution channel of the point-by-point convolution kernel Kᵢ to generate at least two intermediate feature sub-matrices Zᵢ, and fusing the at least two intermediate feature sub-matrices Zᵢ to obtain an intermediate feature matrix Lᵢ; one point-by-point convolution channel of the point-by-point convolution kernel Kᵢ corresponds to one image feature matrix of the at least two image feature matrices;
convolving each image feature matrix through each point-by-point convolution channel of the point-by-point convolution kernel Kᵢ₊₁ to generate at least two intermediate feature sub-matrices Zᵢ₊₁, and fusing the at least two intermediate feature sub-matrices Zᵢ₊₁ to obtain an intermediate feature matrix Lᵢ₊₁; one point-by-point convolution channel of the point-by-point convolution kernel Kᵢ₊₁ corresponds to one image feature matrix of the at least two image feature matrices;
and determining the intermediate feature matrix Lᵢ and the intermediate feature matrix Lᵢ₊₁ as the at least two intermediate feature matrices.
7. The method of claim 6, wherein the at least two depth-by-depth convolution kernels comprise a depth-by-depth convolution kernel Pᵢ and a depth-by-depth convolution kernel Pᵢ₊₁, where i is a positive integer, and the number of the at least two depth-by-depth convolution kernels is equal to the number of the at least two intermediate feature matrices;
the convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced includes:
convolving the intermediate feature matrix Lᵢ through the depth-by-depth convolution kernel Pᵢ to generate a feature matrix Jᵢ to be spliced;
convolving the intermediate feature matrix Lᵢ₊₁ through the depth-by-depth convolution kernel Pᵢ₊₁ to generate a feature matrix Jᵢ₊₁ to be spliced;
and determining the feature matrix Jᵢ to be spliced and the feature matrix Jᵢ₊₁ to be spliced as the at least two feature matrices to be spliced.
8. The method of claim 1, wherein the convolving the at least two image feature matrices with each point-by-point convolution kernel to generate at least two intermediate feature matrices comprises:
respectively convolving the at least two image feature matrixes through each point-by-point convolution kernel to generate at least two intermediate feature matrixes to be activated;
and mapping the at least two intermediate feature matrixes to be activated to a first activation function to generate the at least two intermediate feature matrixes.
9. The method according to claim 1, wherein the convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two feature matrices to be spliced comprises:
convolving the at least two intermediate feature matrices according to the at least two depth-by-depth convolution kernels respectively to generate at least two splicing feature matrices to be activated;
and mapping the at least two splicing feature matrices to be activated to a second activation function to generate the at least two feature matrices to be spliced.
10. The method of claim 1, further comprising:
acquiring a training sample set; the training sample set comprises a sample image and a label image, wherein the label image is used for representing an object class label to which each pixel point in the sample image belongs;
inputting the training sample set into a sample image segmentation model to generate a sample image feature matrix of the sample image; the sample image segmentation model comprises a sample convolution splicing module; the sample image segmentation model comprises at least two object class labels;
extracting a predicted image feature matrix associated with each category label in the sample image feature matrix through the sample convolution splicing module;
and adjusting the sample image segmentation model according to the predicted image feature matrix and the label image to obtain the image segmentation model comprising the convolution splicing module.
11. The method according to claim 10, wherein the adjusting the sample image segmentation model according to the predicted image feature matrix and the label image to obtain the image segmentation model including the convolution concatenation module comprises:
generating a tag image feature matrix of the tag image, and generating a model loss value according to the tag image feature matrix and the predicted image feature matrix;
and adjusting the weight of parameters in the sample image segmentation model according to the model loss value, and determining the adjusted image segmentation model containing the convolution splicing module as the image segmentation model when the model loss value meets a convergence condition.
12. An image processing apparatus characterized by comprising:
the first acquisition module is used for acquiring an image, inputting the image into an image segmentation model and generating at least two image characteristic matrixes of the image; the image segmentation model comprises a convolution splicing module, wherein the convolution splicing module comprises at least two point-by-point convolution kernels and at least two depth-by-depth convolution kernels; the features output by the convolution splicing module have a target channel number;
the first generation module is used for respectively convolving the at least two image feature matrixes through each point-by-point convolution kernel to generate at least two intermediate feature matrixes; the number of convolution kernels of the at least two point-by-point convolution kernels is smaller than the number of target channels;
the first generating module is further configured to convolve the at least two intermediate feature matrices according to the at least two depth-wise convolution kernels, so as to generate at least two feature matrices to be spliced;
the splicing characteristic module is used for performing characteristic splicing processing on the at least two intermediate characteristic matrixes and the at least two characteristic matrixes to be spliced to obtain at least two target characteristic matrixes, and performing image identification processing on the image according to the at least two target characteristic matrixes; the number of the at least two target feature matrices is equal to the number of the target channels.
13. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide data communication functions, the memory is configured to store program code, and the processor is configured to call the program code to perform the steps of the method according to any one of claims 1 to 11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 11.
CN202010674414.8A 2020-07-14 2020-07-14 Image processing method, device, equipment and computer readable storage medium Active CN111833360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010674414.8A CN111833360B (en) 2020-07-14 2020-07-14 Image processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010674414.8A CN111833360B (en) 2020-07-14 2020-07-14 Image processing method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111833360A true CN111833360A (en) 2020-10-27
CN111833360B CN111833360B (en) 2024-03-26

Family

ID=72923081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010674414.8A Active CN111833360B (en) 2020-07-14 2020-07-14 Image processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111833360B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190220709A1 (en) * 2018-01-18 2019-07-18 Aptiv Technologies Limited Device and a method for image classification using a convolutional neural network
CN108805866A (en) * 2018-05-23 2018-11-13 兰州理工大学 The image method for viewing points detecting known based on quaternion wavelet transformed depth visual sense
GB201906560D0 (en) * 2018-08-24 2019-06-26 Petrochina Co Ltd Method and apparatus for automatically extracting image features of electrical imaging well logging
CN110473137A (en) * 2019-04-24 2019-11-19 华为技术有限公司 Image processing method and device
CN110298346A (en) * 2019-05-23 2019-10-01 平安科技(深圳)有限公司 Image-recognizing method, device and computer equipment based on divisible convolutional network
CN110287864A (en) * 2019-06-24 2019-09-27 火石信科(广州)科技有限公司 A kind of intelligent identification of read-write scene read-write element
CN110532859A (en) * 2019-07-18 2019-12-03 西安电子科技大学 Remote Sensing Target detection method based on depth evolution beta pruning convolution net
CN111222564A (en) * 2020-01-02 2020-06-02 中国科学院自动化研究所 Image identification system, method and device based on image channel correlation

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990279A (en) * 2021-02-26 2021-06-18 西安电子科技大学 Radar high-resolution range profile library outside target rejection method based on automatic encoder
CN113379624A (en) * 2021-05-31 2021-09-10 北京达佳互联信息技术有限公司 Image generation method, training method, device and equipment of image generation model
CN113379624B (en) * 2021-05-31 2024-07-02 北京达佳互联信息技术有限公司 Image generation method, training method, device and equipment of image generation model
CN113409201A (en) * 2021-06-01 2021-09-17 平安科技(深圳)有限公司 Image enhancement processing method, device, equipment and medium
CN113409201B (en) * 2021-06-01 2024-03-19 平安科技(深圳)有限公司 Image enhancement processing method, device, equipment and medium
CN114494036A (en) * 2021-12-28 2022-05-13 东莞市金锐显数码科技有限公司 Image saturation adjusting method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111833360B (en) 2024-03-26

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40030883)
SE01 Entry into force of request for substantive examination
GR01 Patent grant