CN114005138A - Image processing method, image processing apparatus, electronic device, and medium - Google Patents
- Publication number
- CN114005138A (application number CN202111274692.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- feature map
- target object
- image
- location
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The present disclosure provides an image processing method, an image processing apparatus, an electronic device, and a medium, which relate to the field of artificial intelligence, specifically to computer vision and deep learning technologies, and are specifically applicable to smart city and intelligent traffic scenarios. The implementation scheme is as follows: acquiring a semantic feature map of a target image; determining, based on the semantic feature map, at least one target feature map respectively for at least one target object in the target image; and for each of the at least one target object, determining an attention matrix for the target object based on the target feature map of the target object, and determining a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to computer vision and deep learning technologies, which can be used in smart cities and smart traffic scenes, and in particular to an image processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Feature points such as human skeleton keypoints are important for describing human posture and predicting human behavior. Therefore, many computer vision tasks involve the detection of feature points.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides an image processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided an image processing method including: acquiring a semantic feature map of a target image; determining, based on the semantic feature map, at least one target feature map respectively for at least one target object in the target image; and for each of the at least one target object, determining an attention matrix for the target object based on the target feature map of the target object, and determining a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
According to another aspect of the present disclosure, there is provided a computer-implemented neural network configured to detect at least one target object present in a target image, the neural network comprising: a backbone network for processing the target image to obtain a semantic feature map of the target image; a branch network for determining, based on the semantic feature map, at least one target feature map respectively for the at least one target object in the target image; and a feature fusion layer comprising at least one feature fusion branch respectively for the at least one target object, wherein each feature fusion branch is configured to determine an attention matrix for the corresponding target object based on the target feature map of that target object, and to determine a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
According to another aspect of the present disclosure, there is provided a computer-implemented training method for the neural network described above, the training method including: acquiring a sample image and the real locations of at least one target object present in the sample image; inputting the sample image into the neural network and acquiring at least one location feature map, respectively for the at least one target object, output by the neural network; determining predicted locations of the at least one target object in the sample image based on the at least one location feature map; calculating a loss function based on the real locations and the predicted locations; and adjusting parameters of the neural network based on the loss function.
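As an illustrative sketch of one ingredient of such a training method, the real location of a keypoint is commonly rendered as a Gaussian-peaked ground-truth heatmap, and the loss compares the predicted map against it. The Gaussian rendering, the mean-squared-error loss, and all sizes below are illustrative assumptions, not details claimed by this disclosure:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render an h x w ground-truth heatmap with a Gaussian peak at
    column cx, row cy (one common way to encode a keypoint's real location)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def mse_loss(pred, target):
    """Mean-squared-error loss between a predicted and a ground-truth heatmap."""
    return float(np.mean((pred - target) ** 2))

H, W = 16, 16
target = gaussian_heatmap(H, W, cx=5, cy=9)       # keypoint at row 9, col 5
perfect = mse_loss(target, target)                # exact prediction -> zero loss
off = mse_loss(np.zeros((H, W)), target)          # wrong prediction -> positive loss
```

During training, such a loss would be computed for every target object's location feature map and its gradient used to adjust the network parameters.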
According to another aspect of the present disclosure, there is provided an image processing apparatus including: a semantic feature acquisition unit configured to acquire a semantic feature map of a target image; a target feature acquisition unit configured to determine, based on the semantic feature map, at least one target feature map respectively for at least one target object in the target image; and a target object detection unit configured to, for each of the at least one target object, determine an attention matrix for the target object based on the target feature map of the target object, and determine a location feature map for the target object based on the target feature map and the attention matrix, wherein the location feature map indicates a location of the target object in the target image.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the aforementioned method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes the aforementioned method when executed by a processor.
According to one or more embodiments of the present disclosure, corresponding attention matrices may be respectively determined for different target objects existing in a target image, so that detection accuracy for each target object can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows an exemplary flow diagram of an image processing method according to an embodiment of the present disclosure;
FIG. 3 shows a schematic block diagram of a neural network for implementing the image processing method shown in FIG. 2, in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates an example of a neural network for identifying human keypoints, according to an embodiment of the present disclosure;
FIG. 5 illustrates a method of training a neural network according to an embodiment of the present disclosure;
fig. 6 shows an exemplary block diagram of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable the execution of the image processing method according to the present disclosure.
In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
A user may use client devices 101, 102, 103, 104, 105, and/or 106 to obtain images and perform corresponding image processing. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below. For example, the server 120 may acquire images captured by a client and implement image processing according to embodiments of the present disclosure.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to the command.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
In the related art, to detect the positions of feature points (e.g., human body keypoints) in an image, an image region containing the human body may first be obtained through a human body detection model, and keypoint information may then be extracted through a keypoint model. The human keypoint model may extract spatial and semantic image features through a backbone network with good performance (e.g., HRNet), obtain detection features through a deconvolution layer, and directly regress heat maps (heatmaps) of the n human keypoints from those detection features, where n is the preset number of human keypoints. A heat map may indicate the location of a human keypoint in the image. It can be seen that, in the related art, the heat maps for different human keypoints are obtained by direct regression from the same detection features; an attention regression mechanism for individual keypoints is lacking, and the prediction performance is therefore limited.
In order to solve the above problem, the present disclosure provides a method of separately acquiring an attention matrix for different target objects.
Fig. 2 illustrates an exemplary flowchart of an image processing method according to an embodiment of the present disclosure. The image processing method 200 shown in FIG. 2 can be performed by the clients 101-106 or the server 120 shown in FIG. 1.
As shown in fig. 2, in step S202, a semantic feature map of the target image may be acquired. In step S204, at least one target feature map, respectively for at least one target object in the target image, may be determined based on the semantic feature map. In step S206, for each of the at least one target object, an attention matrix for the target object may be determined based on the target feature map of the target object, and a location feature map for the target object may be determined based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
With the above image processing method provided by the embodiments of the present disclosure, for the at least one target object present in the target image, an attention matrix may be acquired separately for each target object, so as to determine the position of each target object in the target image.
The principle of the embodiments of the present disclosure will be described in detail below.
In step S202, a semantic feature map of the target image may be acquired.
In some embodiments, a trained backbone network may be utilized to process the target image, and the output of the backbone network may be used as the semantic feature map of the target image. For example, an HRNet network may be used as the backbone network to acquire the semantic feature map of the target image. Using a backbone network, semantic features at all scales in the target image can be effectively extracted, so that the accuracy of the subsequent detection process can be improved.
In other embodiments, the trained backbone network may likewise be used to process the target image, with other neural network units further processing the output of the backbone network to obtain the semantic feature map of the target image. In some implementations, the output of the backbone network can be processed with at least one deconvolution layer to obtain the semantic feature map of the target image. That is, the target image may be processed using the backbone network and the at least one deconvolution layer to obtain the semantic feature map. The deconvolution layer can adjust the scale of the feature map output by the backbone network to the scale required for the location features of the at least one target object present in the target image. In other implementations, the neural network units used to further process the output of the backbone network may be implemented in any other form.
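The scale adjustment performed by the deconvolution layer can be pictured with a minimal NumPy sketch. A real deconvolution layer is a learned transposed convolution; here, purely for illustration, zero-order upsampling stands in for it to show only the change of spatial scale (all names and sizes are hypothetical):

```python
import numpy as np

def deconv2x(feat):
    """Stand-in for a stride-2 deconvolution layer: doubles the spatial size
    of a C x H x W feature map. (A trained layer would also apply learned
    kernels; this sketch shows only the scale change.)"""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical backbone output: 32 channels over a 16 x 16 grid.
backbone_out = np.random.default_rng(0).random((32, 16, 16))

# Two upsampling steps bring the map to the scale required for localization.
semantic = deconv2x(deconv2x(backbone_out))   # -> 32 x 64 x 64
```

In a deep learning framework, each `deconv2x` step would be a transposed convolution with trainable weights rather than plain repetition.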
In step S204, at least one target feature map for at least one target object in the target image, respectively, may be determined based on the semantic feature maps.
In some embodiments, the target image may comprise a human body image and the at least one target object may comprise at least one human body keypoint. For example, the human keypoints may include the head, shoulders, hands, etc. of the human. In other embodiments, the target image may also include a portion of a human body (e.g., a hand), and the at least one target object may be at least one hand keypoint, such as a finger joint or the like. The content and the number of target objects involved in the image processing method can be defined by a person skilled in the art according to the actual application. The principles of the present disclosure will be described in the present disclosure by taking human keypoint detection as an example, however, it will be understood by those skilled in the art that the image processing method provided by the present disclosure may be used to detect any form of target object without departing from the principles of the present disclosure.
In some embodiments, for each of the at least one target object, the semantic feature map obtained in step S202 may be convolved with a convolution layer for the target object to obtain a target feature map for the target object. With the above method, a target feature map can be determined for each target object in the target image, so that the position of each target object in the target image can be determined based on different target feature maps.
The target feature map of each target object obtained in step S204 may be a single-channel map. Using single-channel target feature maps reduces the amount of computation in the subsequent steps, improving the computational efficiency of the image processing method.
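The per-object convolution described above can be sketched with a 1 × 1 convolution, which reduces to a per-pixel weighted sum over channels. The kernel values and sizes below are illustrative stand-ins, not trained parameters from this disclosure:

```python
import numpy as np

def conv1x1(feat, weights, bias=0.0):
    """1x1 convolution over a C x H x W map: a per-pixel weighted sum of the
    C input channels, yielding a single-channel H x W target feature map."""
    return np.tensordot(weights, feat, axes=([0], [0])) + bias

rng = np.random.default_rng(0)
C, H, W = 32, 64, 64
semantic = rng.random((C, H, W))          # shared semantic feature map

# One independent 1x1 kernel per target object (e.g. per human keypoint),
# so each target object gets its own single-channel target feature map.
n_keypoints = 3
kernels = rng.random((n_keypoints, C))
target_maps = np.stack([conv1x1(semantic, k) for k in kernels])
```

Because each target object has its own kernel, each branch can specialize during training on its particular keypoint.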
In step S206, for each of the at least one target object, an attention matrix for the target object may be determined based on the target feature map of the target object, and a location feature map for the target object may be determined based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
In some embodiments, determining the attention matrix for the target object based on the target feature map of the target object may include: processing the target feature map using at least one fully connected layer to obtain the attention matrix.
Each node in a fully connected layer is connected to all nodes in the previous layer, so the fully connected layer can realize global spatial feature fusion over the target feature map. Thus, each target object present in the target image may help predict the locations of the other target objects present in the target image. Taking human keypoint detection as an example, the position of the head can help predict the positions of other keypoints such as the neck and shoulders. Therefore, an attention matrix obtained through global spatial feature fusion can improve the accuracy of target object detection.
In some implementations, the at least one fully connected layer may include a first fully connected layer and a second fully connected layer. In some examples, an activation layer may also be disposed between the first and second fully connected layers, thereby introducing a nonlinear spatial feature fusion effect. Using two fully connected layers achieves effective global spatial feature fusion without significantly increasing the amount of computation in the image processing process. However, the scope of the present disclosure is not limited thereto; those skilled in the art may increase or decrease the number of fully connected layers according to the actual situation, for example using one fully connected layer or more than two fully connected layers.
In some embodiments, processing the target feature map using the at least one fully connected layer to obtain the attention matrix may include: determining a target feature vector corresponding to the target feature map; inputting the target feature vector into the at least one fully connected layer to obtain an output vector of the at least one fully connected layer; and determining, by a matrix dimension change, the matrix corresponding to the output vector as the attention matrix.
To facilitate processing by the fully connected layers, the target feature map may undergo a dimension transformation, and the target feature vector obtained after the transformation is used as the input to the fully connected layers. Taking a single-channel target feature map of size 1 × H × W as an example (where H and W are the height and width of the target feature map, respectively), the target feature map may be flattened into a vector of length H × W, used as the target feature vector, and input to the at least one fully connected layer. The output vector of the at least one fully connected layer may have the same size as the input vector, i.e., it is also a vector of length H × W. The output vector may then be transformed into a matrix of size 1 × H × W, i.e., the attention matrix, using a matrix dimension change (reshape).
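The flatten → fully-connected → reshape pipeline above can be sketched as follows. The two weight matrices and the ReLU activation are illustrative assumptions (the disclosure specifies only that an activation layer may sit between the two fully connected layers):

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 8, 8
target_map = rng.random((1, H, W))        # single-channel target feature map

# Flatten the 1 x H x W map into a target feature vector of length H*W.
vec = target_map.reshape(-1)

# Two fully connected layers with a ReLU in between (weights are random
# stand-ins for trained parameters).
W1 = rng.standard_normal((H * W, H * W)) * 0.01
W2 = rng.standard_normal((H * W, H * W)) * 0.01
hidden = np.maximum(0.0, W1 @ vec)        # first FC layer + activation
out = W2 @ hidden                         # second FC layer, same length as input

# Reshape the output vector back into a 1 x H x W attention matrix.
attention = out.reshape(1, H, W)
```

Because every output element depends on every input element, each entry of the attention matrix aggregates information from the entire target feature map, which is the global spatial feature fusion discussed above.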
In some embodiments, determining the location feature map for the target object based on the target feature map and the attention matrix may include: fusing elements at corresponding positions in the target feature map and the attention matrix to obtain the location feature map. In some implementations, the elements at corresponding positions in the target feature map and the attention matrix may be multiplied, and the product used as the value of the element at that position in the location feature map. In other implementations, any other mathematical operation may be performed on the elements at corresponding positions in the target feature map and the attention matrix to achieve the fusion.
Each element in the location feature map indicates the probability that the target object is located at the position in the target image corresponding to that element. The size of the location feature map may be the same as or different from the size of the target image; each element in the location feature map corresponds to a pixel (or an image block) in the target image. The position of the target object in the target image may then be determined based on the position of the element with the maximum value in the location feature map.
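The element-wise fusion and maximum-position lookup described above can be sketched as follows, assuming the multiplicative variant of the fusion; sizes and values are hypothetical.

```python
import numpy as np

H, W = 4, 3
rng = np.random.default_rng(1)
target_feature_map = rng.random((1, H, W))
attention_matrix = rng.random((1, H, W))

# Element-wise fusion: multiply elements at corresponding positions.
location_feature_map = target_feature_map * attention_matrix  # shape (1, H, W)

# The predicted location is the position of the maximum-valued element.
flat_index = int(np.argmax(location_feature_map))
row, col = divmod(flat_index, W)
```

`(row, col)` then indexes the pixel (or image block) of the target image at which the target object is most likely located.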
Fig. 3 shows a schematic block diagram of a neural network for implementing the image processing method shown in fig. 2, according to an embodiment of the present disclosure. Among other things, the neural network 300 shown in fig. 3 may be computer-implemented. The neural network 300 may be configured to detect at least one target object present in the target image.
As shown in fig. 3, the neural network 300 may include a backbone network 310, a branch network 320, and a feature fusion layer 330.
The backbone network 310 may be used to process the target image to obtain a semantic feature map of the target image. The backbone network may be implemented as an HRNet. It will be understood that the backbone network may also be implemented as any other neural network capable of extracting semantic features of an image, without departing from the principles of the present disclosure.
The branching network 320 may be configured to determine at least one target feature map for at least one target object in the target image, respectively, based on the semantic feature maps.
In some embodiments, the target image may comprise a human body image and the at least one target object may comprise at least one human body keypoint.
In some embodiments, the branch network 320 may include at least one convolutional layer, used to convolve the semantic feature map to obtain a target feature map for each of the at least one target object. That is, in the branch network 320, the shared semantic feature map may be processed by different convolutional layers to obtain target feature maps for different target objects. Different convolutional layers can therefore be trained for different target objects, improving the detection accuracy for each target object.
The target feature map may be single-channel, which reduces the amount of computation in the subsequent image processing.
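The per-object branches described above can be sketched with 1 × 1 convolutions, which reduce to a per-pixel linear mixing of channels; this is an illustrative assumption (the disclosure does not fix the kernel size), with hypothetical channel count C = 8 and n = 3 target objects.

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 8, 4, 3      # hypothetical semantic-feature size (channels, height, width)
n_objects = 3          # e.g. three human keypoints

semantic_feature_map = rng.standard_normal((C, H, W))

# One 1x1 convolution per target object: each branch linearly mixes the C
# input channels into a single-channel target feature map for that object.
branch_weights = rng.standard_normal((n_objects, C)) * 0.1

target_feature_maps = np.einsum("oc,chw->ohw", branch_weights, semantic_feature_map)
target_feature_maps = target_feature_maps[:, None, :, :]      # n x 1 x H x W
```

Each of the n slices is the single-channel 1 × H × W target feature map for one target object, and each branch's weights can be trained independently.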
The feature fusion layer 330 may include at least one feature fusion branch for at least one target object, respectively, wherein each feature fusion branch is configured to determine an attention matrix for the target object based on a target feature map of the target object, and determine a location feature map for the target object based on the target feature map and the attention matrix, wherein the location feature map indicates a location of the target object in the target image.
The feature fusion branch may include at least one fully connected layer, and the at least one fully connected layer may be used to process the target feature map to obtain an attention matrix for a corresponding target object.
Each node in a fully connected layer is connected to all nodes in the previous layer, so a fully connected layer can be used to realize global spatial feature fusion over the target feature map. Thus, each target object present in the target image may help predict the locations of the other target objects present in the target image. Taking human keypoint detection as an example, the position of the head can help predict the positions of other keypoints such as the neck and the shoulders. The attention matrix obtained through global spatial feature fusion can therefore improve the accuracy of target object detection.
The at least one fully connected layer may be configured to process the target feature vector corresponding to the target feature map to obtain an output vector of the at least one fully connected layer, where the attention matrix is the matrix corresponding to the output vector determined using a matrix dimension change. Taking as an example a target feature map that is a single-channel feature of size 1 × H × W (where H and W are the dimensions of the target feature map in the vertical and horizontal directions, respectively), the target feature map may be flattened (flat) into a vector of length H × W, which serves as the target feature vector and is input to the at least one fully connected layer. The output vector of the at least one fully connected layer may be the same size as the input vector, i.e. also a vector of length H × W. The output vector may then be transformed into a matrix of size 1 × H × W, i.e. the attention matrix, using a matrix dimension change (reshape).
In some implementations, the at least one fully connected layer may include a first fully connected layer and a second fully connected layer. In some examples, an activation layer may also be disposed between the first fully connected layer and the second fully connected layer, thereby introducing a non-linear spatial feature fusion effect. Using two fully connected layers achieves effective fusion of global spatial features without significantly increasing the amount of computation in the image processing process. However, the scope of the present disclosure is not limited thereto, and those skilled in the art may increase or decrease the number of fully connected layers according to the actual situation, for example using one fully connected layer or more than two fully connected layers.
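The two-fully-connected-layer variant with an intermediate activation can be sketched as follows, assuming a ReLU activation (the disclosure does not name the activation function) and random weights in place of trained parameters.

```python
import numpy as np

def attention_branch(feature_vector, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU activation in between."""
    hidden = np.maximum(0.0, w1 @ feature_vector + b1)   # non-linear spatial fusion
    return w2 @ hidden + b2

H, W = 4, 3
d = H * W
rng = np.random.default_rng(3)
target_feature_vector = rng.standard_normal(d)
w1, b1 = rng.standard_normal((d, d)) * 0.1, np.zeros(d)
w2, b2 = rng.standard_normal((d, d)) * 0.1, np.zeros(d)

attention_matrix = attention_branch(target_feature_vector, w1, b1, w2, b2).reshape(1, H, W)
```

With both layers of size (H × W) × (H × W), the branch adds only two small matrix multiplications per target object while still letting every spatial position attend to every other position.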
The feature fusion branch may further include a fusion unit, which may be configured to fuse elements at corresponding positions in the target feature map and the attention matrix to obtain the location feature map. In some implementations, the elements at corresponding positions in the target feature map and the attention matrix may be multiplied, and the product is used as the value of the element at that position in the location feature map. In other implementations, any other mathematical operation may be applied to the elements at corresponding positions in the target feature map and the attention matrix to achieve the fusion.
Each element in the location feature map indicates the probability that the target object is located at the position in the target image corresponding to that element. The size of the location feature map may be the same as or different from the size of the target image; each element in the location feature map corresponds to a pixel (or an image block) in the target image. The position of the target object in the target image may then be determined based on the position of the element with the maximum value in the location feature map.
With the neural network provided by the present disclosure, a separate branch can be provided for each target object and a different attention matrix can be trained for each target object, so that the detection accuracy for each individual target object can be improved. It will be understood that the neural network provided by the embodiments of the present disclosure is not limited to the form described in conjunction with fig. 3; those skilled in the art can add or remove neural network units on the basis of the neural network shown in fig. 3 according to the practical application, without departing from the principles of the present disclosure, to adapt to images with different sizes and channel numbers in different application scenarios.
Fig. 4 illustrates an example of a neural network for identifying human keypoints, according to an embodiment of the present disclosure. The neural network 300 described in fig. 3 may be implemented using the neural network 400 shown in fig. 4.
As shown in fig. 4, the neural network 400 may include a backbone network 410, which may be used to process the target image to obtain the semantic feature map of the target image.
The neural network 400 may also include a branch network 420. In the example shown in fig. 4, the branch network may include a deconvolution layer 421 and convolutional layers 422-1 through 422-n, where n is a positive integer greater than 1. The deconvolution layer 421 may be used to further process the semantic feature map output by the backbone network to obtain extended features of the target image. The convolutional layers 422-1 through 422-n correspond to the n target objects respectively, and may process the extended features to obtain a target feature map for each target object. The size of the extended features may be k × H × W, where k is any positive integer, and each target feature map may be a single-channel feature of size 1 × H × W.
The neural network 400 may also include a feature fusion layer 430. The feature fusion layer 430 may include feature fusion branches 430-1-430-n respectively connected in series to the convolutional layers 422-1-422-n, where each feature fusion branch is configured to determine an attention matrix for the target object based on a target feature map of the target object, and determine a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object.
As shown in fig. 4, each feature fusion branch may include a first fully connected layer 431 and a second fully connected layer 432, with an activation layer 433 disposed between them. Taking the first branch as an example, the first feature fusion branch may include a first fully connected layer 431-1 and a second fully connected layer 432-1, with an activation layer 433-1 disposed between the first fully connected layer 431-1 and the second fully connected layer 432-1. The second fully connected layer 432-1 may be used to output the attention matrix for the corresponding target object. Further, each feature fusion branch may include a fusion unit 433 (e.g., fusion units 433-1 to 433-n), which may be configured to fuse elements at corresponding positions in the target feature map of the target object and the attention matrix of the target object to obtain the location feature map of the target object. In some implementations, the elements at corresponding positions in the target feature map and the attention matrix may be multiplied, and the product is used as the value of the element at that position in the location feature map. In other implementations, any other mathematical operation may be applied to the elements at corresponding positions to achieve the fusion.
Fig. 5 illustrates a training method of a neural network according to an embodiment of the present disclosure. The neural network described in connection with fig. 3, 4 may be trained using the training method 500 shown in fig. 5.
In step S502, a sample image and the true position of at least one target object present in the sample image may be acquired. The sample image may comprise a human body image, and the target objects may be human keypoints. It should be noted that the human body images in this embodiment come from a public data set.
In step S504, the sample image is input into the neural network to be trained, and at least one location feature map output by the neural network to be trained, one for each of the at least one target object, is obtained. The initial parameters of the neural network to be trained may be randomly generated or obtained through pre-training.
In step S506, a predicted position of the at least one target object in the sample image may be determined based on the at least one location feature map.
In step S508, a loss function may be calculated based on the true position acquired in step S502 and the predicted position obtained in step S506. In some examples, a loss function such as mean squared error (MSE) may be used.
In step S510, parameters of the neural network may be adjusted based on the loss function calculated in step S508, so that the neural network learns the features of each target object to be detected, thereby improving the accuracy of target object detection.
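The loss computation of steps S508-S510 can be sketched as follows. This toy example, with hypothetical keypoint coordinates, shows the MSE loss and one gradient step applied directly to the predictions; in the actual method the gradient would instead flow back into the network parameters.

```python
import numpy as np

def mse_loss(predicted, target):
    """Mean squared error between predicted and annotated keypoint positions."""
    return float(np.mean((predicted - target) ** 2))

# Annotated (row, col) positions of three keypoints vs. the network's predictions.
true_positions = np.array([[10.0, 12.0], [20.0, 8.0], [15.0, 30.0]])
pred_positions = np.array([[11.0, 12.0], [19.0, 9.0], [15.0, 31.0]])

loss_before = mse_loss(pred_positions, true_positions)

# One gradient-descent step; the gradient of MSE w.r.t. the predictions is
# 2 * (predicted - target) / N.
lr = 0.5
grad = 2.0 * (pred_positions - true_positions) / pred_positions.size
pred_positions = pred_positions - lr * grad
loss_after = mse_loss(pred_positions, true_positions)
```

Each update moves the predictions toward the annotated positions, so the loss decreases, which is the behaviour the training loop of fig. 5 relies on.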
Fig. 6 illustrates an exemplary block diagram of an image processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the image processing apparatus 600 may include a semantic feature acquisition unit 610, a target feature acquisition unit 620, and a target object detection unit 630.
The semantic feature acquisition unit 610 may be configured to acquire a semantic feature map of the target image. The target feature acquisition unit 620 may be configured to determine, based on the semantic feature map, at least one target feature map for at least one target object in the target image, respectively. The target object detection unit 630 may be configured to, for each of the at least one target object, determine an attention matrix for the target object based on the target feature map of the target object, and determine a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates the location of the target object in the target image.
The operations of the units 610 to 630 of the image processing apparatus 600 are similar to those of steps S202 to S206 described above and are not repeated here.
According to an embodiment of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described in connection with fig. 2.
There is also provided, in accordance with an embodiment of the present disclosure, a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method described in connection with fig. 2.
There is also provided, in accordance with an embodiment of the present disclosure, a computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method described in connection with fig. 2.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with the provisions of the relevant laws and regulations and do not violate public order and good customs.
Referring to fig. 7, a block diagram of an electronic device 700, which may be a server or a client of the present disclosure, will now be described; it is an example of a hardware device to which aspects of the present disclosure may be applied. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, magnetic or optical disks. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure, and various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements appearing after the present disclosure.
Claims (25)
1. An image processing method comprising:
acquiring a semantic feature map of a target image;
determining at least one target feature map for at least one target object in a target image, respectively, based on the semantic feature map; and
for each of the at least one target object,
determining an attention matrix for the target object based on a target feature map of the target object; and
determining a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
2. The image processing method of claim 1, wherein determining at least one target feature map for at least one target object in the target image, respectively, comprises:
for each target object in the at least one target object, convolving the semantic feature map using a convolutional layer for the target object to obtain a target feature map for the target object.
3. The image processing method of claim 2, wherein the target feature map is single-channel.
4. The image processing method of any of claims 1 to 3, wherein determining an attention matrix for the target object based on the target feature map of the target object comprises:
processing the target feature map using at least one fully connected layer to obtain the attention matrix.
5. The image processing method of claim 4, wherein the at least one fully-connected layer comprises a first fully-connected layer and a second fully-connected layer.
6. The image processing method of claim 4, wherein processing the target feature map with at least one fully connected layer to obtain the attention matrix comprises:
determining a target feature vector corresponding to the target feature map;
inputting the target feature vector into the at least one fully-connected layer to obtain an output vector of the at least one fully-connected layer;
determining the matrix corresponding to the output vector, obtained using a matrix dimension change, as the attention matrix.
7. The image processing method of claim 1, wherein determining a location feature map for the target object based on the target feature map and the attention matrix comprises:
fusing elements at corresponding positions in the target feature map and the attention matrix to obtain the location feature map.
8. The image processing method of claim 7, wherein each element in the position feature map indicates a probability that the target object is located at a position in the target image corresponding to the element.
9. The image processing method of claim 1, wherein obtaining the semantic feature map of the target image comprises:
processing the target image using a backbone network to obtain the semantic feature map.
10. The image processing method of claim 1, wherein the at least one target object is at least one human keypoint.
11. A computer-implemented neural network configured to detect at least one target object present in a target image, the neural network comprising:
the backbone network is used for processing the target image to obtain a semantic feature map of the target image;
a branching network for determining at least one target feature map for at least one target object in a target image, respectively, based on the semantic feature maps;
a feature fusion layer comprising at least one feature fusion branch for the at least one target object, respectively, wherein each feature fusion branch is configured to determine an attention matrix for the target object based on a target feature map of the target object and to determine a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
12. A neural network as claimed in claim 11, wherein the branch network comprises at least one convolutional layer for respectively convolving the semantic feature maps to obtain a target feature map for each of the at least one target object respectively.
13. The neural network of claim 12, wherein the target feature map is single-channel.
14. A neural network as claimed in any one of claims 11 to 13, wherein each feature fusion branch comprises at least one fully-connected layer for processing the target feature map to derive the attention matrix.
15. The neural network of claim 14, wherein the at least one fully-connected layer includes a first fully-connected layer and a second fully-connected layer.
16. The neural network of claim 14, wherein the at least one fully-connected layer is to:
processing a target feature vector corresponding to the target feature map to obtain an output vector of the at least one fully connected layer;
wherein the attention matrix is the matrix corresponding to the output vector determined using a matrix dimension change.
17. A neural network as claimed in claim 11, wherein the feature fusion branch further comprises a fusion unit for fusing elements at corresponding positions in the target feature map and the attention matrix to obtain the location feature map.
18. The neural network of claim 17, wherein each element in the location feature map indicates a probability that the target object is located at a location in the target image corresponding to the element.
19. A neural network as claimed in claim 11, wherein the backbone network is a HRNet.
20. A neural network as claimed in claim 11, wherein the at least one target object is at least one human keypoint.
21. A training method for a computer-implemented neural network, wherein the neural network is the neural network of any one of claims 11-20, the training method comprising:
acquiring a sample image and the true position of at least one target object present in the sample image;
inputting the sample image into the neural network, and acquiring at least one position feature map respectively used for the at least one target object and output by the neural network;
determining a predicted location of the at least one target object in the sample image based on the at least one location feature map;
calculating a loss function based on the true position and the predicted position; and
adjusting a parameter in the neural network based on the loss function.
22. An image processing apparatus comprising:
a semantic feature acquisition unit configured to acquire a semantic feature map of a target image;
a target feature acquisition unit configured to determine at least one target feature map for at least one target object in a target image, respectively, based on the semantic feature maps; and
a target object detection unit configured to, for each of the at least one target object,
determining an attention matrix for the target object based on a target feature map of the target object;
determining a location feature map for the target object based on the target feature map and the attention matrix, wherein the location feature map indicates a location of the target object in the target image.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-21.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-21.
25. A computer program product comprising a computer program, wherein the computer program realizes the method of any one of claims 1-21 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111274692.5A CN114005138A (en) | 2021-10-29 | 2021-10-29 | Image processing method, image processing apparatus, electronic device, and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111274692.5A CN114005138A (en) | 2021-10-29 | 2021-10-29 | Image processing method, image processing apparatus, electronic device, and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114005138A true CN114005138A (en) | 2022-02-01 |
Family
ID=79925417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111274692.5A Pending CN114005138A (en) | 2021-10-29 | 2021-10-29 | Image processing method, image processing apparatus, electronic device, and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114005138A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112017198A (en) * | 2020-10-16 | 2020-12-01 | 湖南师范大学 | Right ventricle segmentation method and device based on self-attention mechanism multi-scale features |
CN112560698A (en) * | 2020-12-18 | 2021-03-26 | 北京百度网讯科技有限公司 | Image processing method, apparatus, device and medium |
CN113449561A (en) * | 2020-03-26 | 2021-09-28 | 华为技术有限公司 | Motion detection method and device |
US20210304413A1 (en) * | 2020-12-18 | 2021-09-30 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Image Processing Method and Device, and Electronic Device |
Non-Patent Citations (2)
Title |
---|
LIU Jianwei; DING Xihao; LUO Xionglin: "A Survey of Multimodal Deep Learning" (多模态深度学习综述), Application Research of Computers (计算机应用研究), no. 06, 31 December 2020 (2020-12-31) * |
MA Senquan; ZHOU Ke: "Improved Small-Object Detection Algorithm Based on Attention Mechanism and Feature Fusion" (基于注意力机制和特征融合改进的小目标检测算法), Computer Applications and Software (计算机应用与软件), no. 05, 12 May 2020 (2020-05-12) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114511758A (en) | Image recognition method and device, electronic device and medium | |
CN112857268B (en) | Object area measuring method, device, electronic equipment and storage medium | |
CN114743196B (en) | Text recognition method and device and neural network training method | |
CN112749758A (en) | Image processing method, neural network training method, device, equipment and medium | |
CN114972958B (en) | Key point detection method, neural network training method, device and equipment | |
CN115511779B (en) | Image detection method, device, electronic equipment and storage medium | |
CN114445667A (en) | Image detection method and method for training image detection model | |
CN114723949A (en) | Three-dimensional scene segmentation method and method for training segmentation model | |
CN114821581A (en) | Image recognition method and method for training image recognition model | |
CN114550313A (en) | Image processing method, neural network, and training method, device, and medium thereof | |
CN114547252A (en) | Text recognition method and device, electronic equipment and medium | |
CN116152607A (en) | Target detection method, method and device for training target detection model | |
CN115797660A (en) | Image detection method, image detection device, electronic equipment and storage medium | |
CN114429678A (en) | Model training method and device, electronic device and medium | |
CN115578501A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN115601555A (en) | Image processing method and apparatus, device and medium | |
CN115359309A (en) | Training method, device, equipment and medium of target detection model | |
CN114494797A (en) | Method and apparatus for training image detection model | |
CN114092556A (en) | Method, apparatus, electronic device, medium for determining human body posture | |
CN114005138A (en) | Image processing method, image processing apparatus, electronic device, and medium | |
CN112579587A (en) | Data cleaning method and device, equipment and storage medium | |
CN114117046B (en) | Data processing method, device, electronic equipment and medium | |
CN114882331A (en) | Image processing method, apparatus, device and medium | |
CN115601561A (en) | High-precision map target detection method, device, equipment and medium | |
CN113920304A (en) | Sample image processing method, sample image processing device, electronic device, and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||