CN114005138A - Image processing method, image processing apparatus, electronic device, and medium - Google Patents
- Publication number
- CN114005138A (application number CN202111274692.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- feature map
- target object
- image
- location
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The present disclosure provides an image processing method, an image processing apparatus, an electronic device, and a medium, which relate to the field of artificial intelligence, specifically to computer vision and deep learning technologies, and are specifically applicable to smart city and intelligent traffic scenarios. The implementation scheme is as follows: acquiring a semantic feature map of a target image; determining, based on the semantic feature map, at least one target feature map respectively for at least one target object in the target image; and for each of the at least one target object, determining an attention matrix for the target object based on the target feature map of the target object, and determining a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to computer vision and deep learning technologies, which can be used in smart cities and smart traffic scenes, and in particular to an image processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Feature points such as human skeleton keypoints are important for describing human posture and predicting human behavior. Therefore, many computer vision tasks involve the detection of feature points.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides an image processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided an image processing method including: acquiring a semantic feature map of a target image; determining, based on the semantic feature map, at least one target feature map respectively for at least one target object in the target image; and for each of the at least one target object, determining an attention matrix for the target object based on the target feature map of the target object, and determining a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
According to another aspect of the present disclosure, there is provided a computer-implemented neural network configured to detect at least one target object present in a target image, the neural network comprising: a backbone network for processing the target image to obtain a semantic feature map of the target image; a branch network for determining, based on the semantic feature map, at least one target feature map respectively for the at least one target object in the target image; and a feature fusion layer comprising at least one feature fusion branch respectively for the at least one target object, wherein each feature fusion branch is configured to determine an attention matrix for the corresponding target object based on the target feature map of that target object, and to determine a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
According to another aspect of the present disclosure, there is provided a computer-implemented training method for the neural network described above, the training method including: acquiring a sample image and the real locations of at least one target object present in the sample image; inputting the sample image into the neural network and acquiring at least one location feature map, respectively for the at least one target object, output by the neural network; determining predicted locations of the at least one target object in the sample image based on the at least one location feature map; calculating a loss function based on the real locations and the predicted locations; and adjusting parameters of the neural network based on the loss function.
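As an illustrative sketch of one ingredient of such a training method, the real location of a keypoint is commonly rendered as a Gaussian-peaked ground-truth heatmap, and the loss compares the predicted map against it. The Gaussian rendering, the mean-squared-error loss, and all sizes below are illustrative assumptions, not details claimed by this disclosure:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render an h x w ground-truth heatmap with a Gaussian peak at
    column cx, row cy (one common way to encode a keypoint's real location)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def mse_loss(pred, target):
    """Mean-squared-error loss between a predicted and a ground-truth heatmap."""
    return float(np.mean((pred - target) ** 2))

H, W = 16, 16
target = gaussian_heatmap(H, W, cx=5, cy=9)       # keypoint at row 9, col 5
perfect = mse_loss(target, target)                # exact prediction -> zero loss
off = mse_loss(np.zeros((H, W)), target)          # wrong prediction -> positive loss
```

During training, such a loss would be computed for every target object's location feature map and its gradient used to adjust the network parameters.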
According to another aspect of the present disclosure, there is provided an image processing apparatus including: a semantic feature acquisition unit configured to acquire a semantic feature map of a target image; a target feature acquisition unit configured to determine, based on the semantic feature map, at least one target feature map respectively for at least one target object in the target image; and a target object detection unit configured to, for each of the at least one target object, determine an attention matrix for the target object based on the target feature map of the target object, and determine a location feature map for the target object based on the target feature map and the attention matrix, wherein the location feature map indicates a location of the target object in the target image.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the aforementioned method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes the aforementioned method when executed by a processor.
According to one or more embodiments of the present disclosure, corresponding attention matrices may be respectively determined for different target objects existing in a target image, so that detection accuracy for each target object can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows an exemplary flow diagram of an image processing method according to an embodiment of the present disclosure;
FIG. 3 shows a schematic block diagram of a neural network for implementing the image processing method shown in FIG. 2, in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates an example of a neural network for identifying human keypoints, according to an embodiment of the present disclosure;
FIG. 5 illustrates a method of training a neural network according to an embodiment of the present disclosure;
fig. 6 shows an exemplary block diagram of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable the execution of the image processing method according to the present disclosure.
In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
A user may use client devices 101, 102, 103, 104, 105, and/or 106 to obtain images and perform corresponding image processing. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below. For example, the server 120 may acquire images captured by a client and implement image processing according to embodiments of the present disclosure.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to the command.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
In the related art, to detect the positions of feature points (e.g., human body keypoints) in an image, an image region containing the human body may first be obtained through a human body detection model, and keypoint information may then be extracted through a keypoint model. The human keypoint model may extract spatial and semantic image features through a backbone network with good performance (e.g., HRNet), obtain detection features through a deconvolution layer, and directly regress heat maps (heatmaps) of the n human keypoints from those detection features, where n is the preset number of human keypoints. A heat map may indicate the location of a human keypoint in the image. It can be seen that, in the related art, the heat maps for different human keypoints are obtained by direct regression from the same detection features; an attention regression mechanism for individual keypoints is lacking, and the prediction performance is therefore limited.
In order to solve the above problem, the present disclosure provides a method of separately acquiring an attention matrix for different target objects.
Fig. 2 illustrates an exemplary flowchart of an image processing method according to an embodiment of the present disclosure. The image processing method 200 shown in FIG. 2 can be performed by the clients 101-106 or the server 120 shown in FIG. 1.
As shown in fig. 2, in step S202, a semantic feature map of the target image may be acquired. In step S204, at least one target feature map, respectively for at least one target object in the target image, may be determined based on the semantic feature map. In step S206, for each of the at least one target object, an attention matrix for the target object may be determined based on the target feature map of the target object, and a location feature map for the target object may be determined based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
With the above image processing method provided by the embodiments of the present disclosure, for the at least one target object present in the target image, an attention matrix may be acquired separately for each target object, so as to determine the position of each target object in the target image.
The principle of the embodiments of the present disclosure will be described in detail below.
In step S202, a semantic feature map of the target image may be acquired.
In some embodiments, a trained backbone network may be utilized to process the target image, and the output of the backbone network may be used as the semantic feature map of the target image. For example, an HRNet network may be used as the backbone network to acquire the semantic feature map of the target image. Using a backbone network, semantic features at all scales in the target image can be effectively extracted, so that the accuracy of the subsequent detection process can be improved.
In other embodiments, the trained backbone network may likewise be used to process the target image, with other neural network units further processing the output of the backbone network to obtain the semantic feature map of the target image. In some implementations, the output of the backbone network can be processed with at least one deconvolution layer to obtain the semantic feature map of the target image. That is, the target image may be processed using the backbone network and the at least one deconvolution layer to obtain the semantic feature map. The deconvolution layer can adjust the scale of the feature map output by the backbone network to the scale required for the location features of the at least one target object present in the target image. In other implementations, the neural network units used to further process the output of the backbone network may be implemented in any other form.
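The scale adjustment performed by the deconvolution layer can be pictured with a minimal NumPy sketch. A real deconvolution layer is a learned transposed convolution; here, purely for illustration, zero-order upsampling stands in for it to show only the change of spatial scale (all names and sizes are hypothetical):

```python
import numpy as np

def deconv2x(feat):
    """Stand-in for a stride-2 deconvolution layer: doubles the spatial size
    of a C x H x W feature map. (A trained layer would also apply learned
    kernels; this sketch shows only the scale change.)"""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical backbone output: 32 channels over a 16 x 16 grid.
backbone_out = np.random.default_rng(0).random((32, 16, 16))

# Two upsampling steps bring the map to the scale required for localization.
semantic = deconv2x(deconv2x(backbone_out))   # -> 32 x 64 x 64
```

In a deep learning framework, each `deconv2x` step would be a transposed convolution with trainable weights rather than plain repetition.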
In step S204, at least one target feature map for at least one target object in the target image, respectively, may be determined based on the semantic feature maps.
In some embodiments, the target image may comprise a human body image and the at least one target object may comprise at least one human body keypoint. For example, the human keypoints may include the head, shoulders, hands, etc. of the human. In other embodiments, the target image may also include a portion of a human body (e.g., a hand), and the at least one target object may be at least one hand keypoint, such as a finger joint or the like. The content and the number of target objects involved in the image processing method can be defined by a person skilled in the art according to the actual application. The principles of the present disclosure will be described in the present disclosure by taking human keypoint detection as an example, however, it will be understood by those skilled in the art that the image processing method provided by the present disclosure may be used to detect any form of target object without departing from the principles of the present disclosure.
In some embodiments, for each of the at least one target object, the semantic feature map obtained in step S202 may be convolved with a convolution layer for the target object to obtain a target feature map for the target object. With the above method, a target feature map can be determined for each target object in the target image, so that the position of each target object in the target image can be determined based on different target feature maps.
The target feature map of each target object obtained in step S204 may be a single-channel map. Using single-channel target feature maps reduces the amount of computation in the subsequent steps, improving the computational efficiency of the image processing method.
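The per-object convolution described above can be sketched with a 1 × 1 convolution, which reduces to a per-pixel weighted sum over channels. The kernel values and sizes below are illustrative stand-ins, not trained parameters from this disclosure:

```python
import numpy as np

def conv1x1(feat, weights, bias=0.0):
    """1x1 convolution over a C x H x W map: a per-pixel weighted sum of the
    C input channels, yielding a single-channel H x W target feature map."""
    return np.tensordot(weights, feat, axes=([0], [0])) + bias

rng = np.random.default_rng(0)
C, H, W = 32, 64, 64
semantic = rng.random((C, H, W))          # shared semantic feature map

# One independent 1x1 kernel per target object (e.g. per human keypoint),
# so each target object gets its own single-channel target feature map.
n_keypoints = 3
kernels = rng.random((n_keypoints, C))
target_maps = np.stack([conv1x1(semantic, k) for k in kernels])
```

Because each target object has its own kernel, each branch can specialize during training on its particular keypoint.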
In step S206, for each of the at least one target object, an attention matrix for the target object may be determined based on the target feature map of the target object, and a location feature map for the target object may be determined based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
In some embodiments, determining the attention matrix for the target object based on the target feature map of the target object may include: processing the target feature map using at least one fully connected layer to obtain the attention matrix.
Each node in a fully connected layer is connected to all nodes in the previous layer, so the fully connected layer can realize global spatial feature fusion over the target feature map. Thus, each target object present in the target image may help predict the locations of the other target objects present in the target image. Taking human keypoint detection as an example, the position of the head can help predict the positions of other keypoints such as the neck and shoulders. Therefore, an attention matrix obtained through global spatial feature fusion can improve the accuracy of target object detection.
In some implementations, the at least one fully connected layer may include a first fully connected layer and a second fully connected layer. In some examples, an activation layer may also be disposed between the first and second fully connected layers, thereby introducing a nonlinear spatial feature fusion effect. Using two fully connected layers achieves effective global spatial feature fusion without significantly increasing the amount of computation in the image processing process. However, the scope of the present disclosure is not limited thereto; those skilled in the art may increase or decrease the number of fully connected layers according to the actual situation, for example using one fully connected layer or more than two fully connected layers.
In some embodiments, processing the target feature map using the at least one fully connected layer to obtain the attention matrix may include: determining a target feature vector corresponding to the target feature map; inputting the target feature vector into the at least one fully connected layer to obtain an output vector of the at least one fully connected layer; and determining, by a matrix dimension change, the matrix corresponding to the output vector as the attention matrix.
To facilitate processing by the fully connected layers, the target feature map may undergo a dimension transformation, and the target feature vector obtained after the transformation is used as the input to the fully connected layers. Taking a single-channel target feature map of size 1 × H × W as an example (where H and W are the height and width of the target feature map, respectively), the target feature map may be flattened into a vector of length H × W, used as the target feature vector, and input to the at least one fully connected layer. The output vector of the at least one fully connected layer may have the same size as the input vector, i.e., it is also a vector of length H × W. The output vector may then be transformed into a matrix of size 1 × H × W, i.e., the attention matrix, using a matrix dimension change (reshape).
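The flatten → fully-connected → reshape pipeline above can be sketched as follows. The two weight matrices and the ReLU activation are illustrative assumptions (the disclosure specifies only that an activation layer may sit between the two fully connected layers):

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 8, 8
target_map = rng.random((1, H, W))        # single-channel target feature map

# Flatten the 1 x H x W map into a target feature vector of length H*W.
vec = target_map.reshape(-1)

# Two fully connected layers with a ReLU in between (weights are random
# stand-ins for trained parameters).
W1 = rng.standard_normal((H * W, H * W)) * 0.01
W2 = rng.standard_normal((H * W, H * W)) * 0.01
hidden = np.maximum(0.0, W1 @ vec)        # first FC layer + activation
out = W2 @ hidden                         # second FC layer, same length as input

# Reshape the output vector back into a 1 x H x W attention matrix.
attention = out.reshape(1, H, W)
```

Because every output element depends on every input element, each entry of the attention matrix aggregates information from the entire target feature map, which is the global spatial feature fusion discussed above.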
In some embodiments, determining the location feature map for the target object based on the target feature map and the attention matrix may include: fusing elements at corresponding positions in the target feature map and the attention matrix to obtain the location feature map. In some implementations, the elements at corresponding positions in the target feature map and the attention matrix may be multiplied, and the product used as the value of the element at that position in the location feature map. In other implementations, any other mathematical operation may be performed on the elements at corresponding positions in the target feature map and the attention matrix to achieve the fusion.
Each element in the location feature map indicates the probability that the target object is located at the position in the target image corresponding to that element. The size of the location feature map may be the same as or different from the size of the target image; each element in the location feature map corresponds to a pixel (or an image block) in the target image. The position of the target object in the target image may then be determined based on the position of the element with the maximum value in the location feature map.
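The element-wise fusion and maximum-position lookup described above can be sketched as follows, assuming the multiplicative variant of the fusion; sizes and values are hypothetical.

```python
import numpy as np

H, W = 4, 3
rng = np.random.default_rng(1)
target_feature_map = rng.random((1, H, W))
attention_matrix = rng.random((1, H, W))

# Element-wise fusion: multiply elements at corresponding positions.
location_feature_map = target_feature_map * attention_matrix  # shape (1, H, W)

# The predicted location is the position of the maximum-valued element.
flat_index = int(np.argmax(location_feature_map))
row, col = divmod(flat_index, W)
```

`(row, col)` then indexes the pixel (or image block) of the target image at which the target object is most likely located.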
Fig. 3 shows a schematic block diagram of a neural network for implementing the image processing method shown in fig. 2, according to an embodiment of the present disclosure. Among other things, the neural network 300 shown in fig. 3 may be computer-implemented. The neural network 300 may be configured to detect at least one target object present in the target image.
As shown in fig. 3, the neural network 300 may include a backbone network 310, a branch network 320, and a feature fusion layer 330.
The backbone network 310 may be used to process the target image to obtain a semantic feature map of the target image. The backbone network may be implemented as an HRNet. It will be understood that the backbone network may also be implemented as any other neural network capable of extracting semantic features of an image, without departing from the principles of the present disclosure.
The branching network 320 may be configured to determine at least one target feature map for at least one target object in the target image, respectively, based on the semantic feature maps.
In some embodiments, the target image may comprise a human body image and the at least one target object may comprise at least one human body keypoint.
In some embodiments, the branch network 320 may include at least one convolutional layer, used to convolve the semantic feature map to obtain a target feature map for each of the at least one target object. That is, in the branch network 320, the shared semantic feature map may be processed by different convolutional layers to obtain target feature maps for different target objects. Different convolutional layers can therefore be trained for different target objects, improving the detection accuracy for each target object.
The target feature map may be single-channel, which reduces the amount of computation in the subsequent image processing.
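The per-object branches described above can be sketched with 1 × 1 convolutions, which reduce to a per-pixel linear mixing of channels; this is an illustrative assumption (the disclosure does not fix the kernel size), with hypothetical channel count C = 8 and n = 3 target objects.

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 8, 4, 3      # hypothetical semantic-feature size (channels, height, width)
n_objects = 3          # e.g. three human keypoints

semantic_feature_map = rng.standard_normal((C, H, W))

# One 1x1 convolution per target object: each branch linearly mixes the C
# input channels into a single-channel target feature map for that object.
branch_weights = rng.standard_normal((n_objects, C)) * 0.1

target_feature_maps = np.einsum("oc,chw->ohw", branch_weights, semantic_feature_map)
target_feature_maps = target_feature_maps[:, None, :, :]      # n x 1 x H x W
```

Each of the n slices is the single-channel 1 × H × W target feature map for one target object, and each branch's weights can be trained independently.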
The feature fusion layer 330 may include at least one feature fusion branch for at least one target object, respectively, wherein each feature fusion branch is configured to determine an attention matrix for the target object based on a target feature map of the target object, and determine a location feature map for the target object based on the target feature map and the attention matrix, wherein the location feature map indicates a location of the target object in the target image.
The feature fusion branch may include at least one fully connected layer, and the at least one fully connected layer may be used to process the target feature map to obtain an attention matrix for a corresponding target object.
Each node in a fully connected layer is connected to all nodes in the previous layer, so a fully connected layer can be used to realize global spatial feature fusion over the target feature map. Thus, each target object present in the target image may help predict the locations of the other target objects present in the target image. Taking human keypoint detection as an example, the position of the head can help predict the positions of other keypoints such as the neck and the shoulders. The attention matrix obtained through global spatial feature fusion can therefore improve the accuracy of target object detection.
The at least one fully connected layer may be configured to process the target feature vector corresponding to the target feature map to obtain an output vector of the at least one fully connected layer, where the attention matrix is the matrix corresponding to the output vector determined using a matrix dimension change. Taking as an example a target feature map that is a single-channel feature of size 1 × H × W (where H and W are the dimensions of the target feature map in the vertical and horizontal directions, respectively), the target feature map may be flattened (flat) into a vector of length H × W, which serves as the target feature vector and is input to the at least one fully connected layer. The output vector of the at least one fully connected layer may be the same size as the input vector, i.e. also a vector of length H × W. The output vector may then be transformed into a matrix of size 1 × H × W, i.e. the attention matrix, using a matrix dimension change (reshape).
In some implementations, the at least one fully connected layer may include a first fully connected layer and a second fully connected layer. In some examples, an activation layer may also be disposed between the first fully connected layer and the second fully connected layer, thereby introducing a non-linear spatial feature fusion effect. Using two fully connected layers achieves effective fusion of global spatial features without significantly increasing the amount of computation in the image processing process. However, the scope of the present disclosure is not limited thereto, and those skilled in the art may increase or decrease the number of fully connected layers according to the actual situation, for example using one fully connected layer or more than two fully connected layers.
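The two-fully-connected-layer variant with an intermediate activation can be sketched as follows, assuming a ReLU activation (the disclosure does not name the activation function) and random weights in place of trained parameters.

```python
import numpy as np

def attention_branch(feature_vector, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU activation in between."""
    hidden = np.maximum(0.0, w1 @ feature_vector + b1)   # non-linear spatial fusion
    return w2 @ hidden + b2

H, W = 4, 3
d = H * W
rng = np.random.default_rng(3)
target_feature_vector = rng.standard_normal(d)
w1, b1 = rng.standard_normal((d, d)) * 0.1, np.zeros(d)
w2, b2 = rng.standard_normal((d, d)) * 0.1, np.zeros(d)

attention_matrix = attention_branch(target_feature_vector, w1, b1, w2, b2).reshape(1, H, W)
```

With both layers of size (H × W) × (H × W), the branch adds only two small matrix multiplications per target object while still letting every spatial position attend to every other position.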
The feature fusion branch may further include a fusion unit, which may be configured to fuse elements at corresponding positions in the target feature map and the attention matrix to obtain the location feature map. In some implementations, the elements at corresponding positions in the target feature map and the attention matrix may be multiplied, and the product is used as the value of the element at that position in the location feature map. In other implementations, any other mathematical operation may be applied to the elements at corresponding positions in the target feature map and the attention matrix to achieve the fusion.
Each element in the location feature map indicates the probability that the target object is located at the position in the target image corresponding to that element. The size of the location feature map may be the same as or different from the size of the target image; each element in the location feature map corresponds to a pixel (or an image block) in the target image. The position of the target object in the target image may then be determined based on the position of the element with the maximum value in the location feature map.
With the neural network provided by the present disclosure, a separate branch can be provided for each target object and a different attention matrix can be trained for each target object, so that the detection accuracy for each individual target object can be improved. It will be understood that the neural network provided by the embodiments of the present disclosure is not limited to the form described in conjunction with fig. 3; those skilled in the art can add or remove neural network units on the basis of the neural network shown in fig. 3 according to the practical application, without departing from the principles of the present disclosure, to adapt to images with different sizes and channel numbers in different application scenarios.
Fig. 4 illustrates an example of a neural network for identifying human keypoints, according to an embodiment of the present disclosure. The neural network 300 described in fig. 3 may be implemented using the neural network 400 shown in fig. 4.
As shown in fig. 4, the neural network 400 may include a backbone network 410, which may be used to process the target image to obtain the semantic feature map of the target image.
The neural network 400 may also include a branch network 420. In the example shown in fig. 4, the branch network may include a deconvolution layer 421 and convolutional layers 422-1 through 422-n, where n is a positive integer greater than 1. The deconvolution layer 421 may be used to further process the semantic feature map output by the backbone network to obtain extended features of the target image. The convolutional layers 422-1 through 422-n correspond to the n target objects respectively, and may process the extended features to obtain a target feature map for each target object. The size of the extended features may be k × H × W, where k is any positive integer, and each target feature map may be a single-channel feature of size 1 × H × W.
The neural network 400 may also include a feature fusion layer 430. The feature fusion layer 430 may include feature fusion branches 430-1-430-n respectively connected in series to the convolutional layers 422-1-422-n, where each feature fusion branch is configured to determine an attention matrix for the target object based on a target feature map of the target object, and determine a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object.
As shown in fig. 4, each feature fusion branch may include a first fully connected layer 431 and a second fully connected layer 432, with an activation layer 433 disposed between them. Taking the first branch as an example, the first feature fusion branch may include a first fully connected layer 431-1 and a second fully connected layer 432-1, with an activation layer 433-1 disposed between the first fully connected layer 431-1 and the second fully connected layer 432-1. The second fully connected layer 432-1 may be used to output the attention matrix for the corresponding target object. Further, each feature fusion branch may include a fusion unit 433 (e.g., fusion units 433-1 to 433-n), which may be configured to fuse elements at corresponding positions in the target feature map of the target object and the attention matrix of the target object to obtain the location feature map of the target object. In some implementations, the elements at corresponding positions in the target feature map and the attention matrix may be multiplied, and the product is used as the value of the element at that position in the location feature map. In other implementations, any other mathematical operation may be applied to the elements at corresponding positions to achieve the fusion.
Fig. 5 illustrates a training method of a neural network according to an embodiment of the present disclosure. The neural network described in connection with fig. 3, 4 may be trained using the training method 500 shown in fig. 5.
In step S502, a sample image and the true position of at least one target object present in the sample image may be acquired. The sample image may comprise a human body image, and the target objects may be human keypoints. It should be noted that the human body images in this embodiment come from a public data set.
In step S504, the sample image is input into the neural network to be trained, and at least one location feature map output by the neural network to be trained, one for each of the at least one target object, is obtained. The initial parameters of the neural network to be trained may be randomly generated or obtained through pre-training.
In step S506, a predicted position of the at least one target object in the sample image may be determined based on the at least one location feature map.
In step S508, a loss function may be calculated based on the true position acquired in step S502 and the predicted position obtained in step S506. In some examples, a loss function such as mean squared error (MSE) may be used.
In step S510, parameters of the neural network may be adjusted based on the loss function calculated in step S508, so that the neural network learns the features of each target object to be detected, thereby improving the accuracy of target object detection.
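The loss computation of steps S508-S510 can be sketched as follows. This toy example, with hypothetical keypoint coordinates, shows the MSE loss and one gradient step applied directly to the predictions; in the actual method the gradient would instead flow back into the network parameters.

```python
import numpy as np

def mse_loss(predicted, target):
    """Mean squared error between predicted and annotated keypoint positions."""
    return float(np.mean((predicted - target) ** 2))

# Annotated (row, col) positions of three keypoints vs. the network's predictions.
true_positions = np.array([[10.0, 12.0], [20.0, 8.0], [15.0, 30.0]])
pred_positions = np.array([[11.0, 12.0], [19.0, 9.0], [15.0, 31.0]])

loss_before = mse_loss(pred_positions, true_positions)

# One gradient-descent step; the gradient of MSE w.r.t. the predictions is
# 2 * (predicted - target) / N.
lr = 0.5
grad = 2.0 * (pred_positions - true_positions) / pred_positions.size
pred_positions = pred_positions - lr * grad
loss_after = mse_loss(pred_positions, true_positions)
```

Each update moves the predictions toward the annotated positions, so the loss decreases, which is the behaviour the training loop of fig. 5 relies on.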
Fig. 6 illustrates an exemplary block diagram of an image processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the image processing apparatus 600 may include a semantic feature acquisition unit 610, a target feature acquisition unit 620, and a target object detection unit 630.
The semantic feature acquisition unit 610 may be configured to acquire a semantic feature map of the target image. The target feature acquisition unit 620 may be configured to determine, based on the semantic feature map, at least one target feature map for at least one target object in the target image, respectively. The target object detection unit 630 may be configured to, for each of the at least one target object, determine an attention matrix for the target object based on the target feature map of the target object, and determine a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates the location of the target object in the target image.
The operations of the units 610 to 630 of the image processing apparatus 600 are similar to those of steps S202 to S206 described above and are not repeated here.
According to an embodiment of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described in connection with fig. 2.
There is also provided, in accordance with an embodiment of the present disclosure, a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method described in connection with fig. 2.
There is also provided, in accordance with an embodiment of the present disclosure, a computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method described in connection with fig. 2.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with the provisions of the relevant laws and regulations and do not violate public order and good customs.
Referring to fig. 7, a block diagram of an electronic device 700, which may be a server or a client of the present disclosure, will now be described; it is an example of a hardware device to which aspects of the present disclosure may be applied. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, magnetic or optical disks. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure, and various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements appearing after the present disclosure.
Claims (25)
1. An image processing method comprising:
acquiring a semantic feature map of a target image;
determining at least one target feature map for at least one target object in a target image, respectively, based on the semantic feature map; and
for each of the at least one target object,
determining an attention matrix for the target object based on a target feature map of the target object; and
determining a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
2. The image processing method of claim 1, wherein determining at least one target feature map for at least one target object in the target image, respectively, comprises:
for each target object in the at least one target object, convolving the semantic feature map using a convolutional layer for the target object to obtain a target feature map for the target object.
3. The image processing method of claim 2, wherein the target feature map is single-channel.
4. The image processing method of any of claims 1 to 3, wherein determining an attention matrix for the target object based on the target feature map of the target object comprises:
processing the target feature map using at least one fully connected layer to obtain the attention matrix.
5. The image processing method of claim 4, wherein the at least one fully-connected layer comprises a first fully-connected layer and a second fully-connected layer.
6. The image processing method of claim 4, wherein processing the target feature map with at least one fully connected layer to obtain the attention matrix comprises:
determining a target feature vector corresponding to the target feature map;
inputting the target feature vector into the at least one fully-connected layer to obtain an output vector of the at least one fully-connected layer;
determining the matrix corresponding to the output vector, obtained using a matrix dimension change, as the attention matrix.
7. The image processing method of claim 1, wherein determining a location feature map for the target object based on the target feature map and the attention matrix comprises:
fusing elements at corresponding positions in the target feature map and the attention matrix to obtain the location feature map.
8. The image processing method of claim 7, wherein each element in the position feature map indicates a probability that the target object is located at a position in the target image corresponding to the element.
9. The image processing method of claim 1, wherein obtaining the semantic feature map of the target image comprises:
processing the target image using a backbone network to obtain the semantic feature map.
10. The image processing method of claim 1, wherein the at least one target object is at least one human keypoint.
11. A computer-implemented neural network configured to detect at least one target object present in a target image, the neural network comprising:
the backbone network is used for processing the target image to obtain a semantic feature map of the target image;
a branching network for determining at least one target feature map for at least one target object in a target image, respectively, based on the semantic feature maps;
a feature fusion layer comprising at least one feature fusion branch for the at least one target object, respectively, wherein each feature fusion branch is configured to determine an attention matrix for the target object based on a target feature map of the target object and to determine a location feature map for the target object based on the target feature map of the target object and the attention matrix of the target object, wherein the location feature map indicates a location of the target object in the target image.
12. A neural network as claimed in claim 11, wherein the branch network comprises at least one convolutional layer for respectively convolving the semantic feature maps to obtain a target feature map for each of the at least one target object respectively.
13. The neural network of claim 12, wherein the target feature map is single-channel.
14. A neural network as claimed in any one of claims 11 to 13, wherein each feature fusion branch comprises at least one fully-connected layer for processing the target feature map to derive the attention matrix.
15. The neural network of claim 14, wherein the at least one fully-connected layer includes a first fully-connected layer and a second fully-connected layer.
16. The neural network of claim 14, wherein the at least one fully-connected layer is to:
processing a target feature vector corresponding to the target feature map to obtain an output vector of the at least one fully connected layer;
wherein the attention matrix is the matrix corresponding to the output vector determined using a matrix dimension change.
17. A neural network as claimed in claim 11, wherein the feature fusion branch further comprises a fusion unit for fusing elements at corresponding positions in the target feature map and the attention matrix to obtain the location feature map.
18. The neural network of claim 17, wherein each element in the location feature map indicates a probability that the target object is located at a location in the target image corresponding to the element.
19. A neural network as claimed in claim 11, wherein the backbone network is a HRNet.
20. A neural network as claimed in claim 11, wherein the at least one target object is at least one human keypoint.
21. A training method for a computer-implemented neural network, wherein the neural network is the neural network of any one of claims 11-20, the training method comprising:
acquiring a sample image and the true position of at least one target object present in the sample image;
inputting the sample image into the neural network, and acquiring at least one position feature map respectively used for the at least one target object and output by the neural network;
determining a predicted location of the at least one target object in the sample image based on the at least one location feature map;
calculating a loss function based on the true position and the predicted position; and
adjusting a parameter in the neural network based on the loss function.
22. An image processing apparatus comprising:
a semantic feature acquisition unit configured to acquire a semantic feature map of a target image;
a target feature acquisition unit configured to determine at least one target feature map for at least one target object in a target image, respectively, based on the semantic feature maps; and
a target object detection unit configured to, for each of the at least one target object,
determining an attention matrix for the target object based on a target feature map of the target object;
determining a location feature map for the target object based on the target feature map and the attention matrix, wherein the location feature map indicates a location of the target object in the target image.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-21.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-21.
25. A computer program product comprising a computer program, wherein the computer program realizes the method of any one of claims 1-21 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111274692.5A CN114005138A (en) | 2021-10-29 | 2021-10-29 | Image processing method, image processing apparatus, electronic device, and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111274692.5A CN114005138A (en) | 2021-10-29 | 2021-10-29 | Image processing method, image processing apparatus, electronic device, and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114005138A true CN114005138A (en) | 2022-02-01 |
Family
ID=79925417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111274692.5A Pending CN114005138A (en) | 2021-10-29 | 2021-10-29 | Image processing method, image processing apparatus, electronic device, and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114005138A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112017198A (en) * | 2020-10-16 | 2020-12-01 | 湖南师范大学 | Right ventricle segmentation method and device based on self-attention mechanism multi-scale features |
CN112560698A (en) * | 2020-12-18 | 2021-03-26 | 北京百度网讯科技有限公司 | Image processing method, apparatus, device and medium |
CN113449561A (en) * | 2020-03-26 | 2021-09-28 | 华为技术有限公司 | Motion detection method and device |
US20210304413A1 (en) * | 2020-12-18 | 2021-09-30 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Image Processing Method and Device, and Electronic Device |
Non-Patent Citations (2)
Title |
---|
LIU Jianwei; DING Xihao; LUO Xionglin: "A Survey of Multimodal Deep Learning" (多模态深度学习综述), Application Research of Computers (计算机应用研究), no. 06, 31 December 2020 (2020-12-31) * |
MA Senquan; ZHOU Ke: "Improved Small-Object Detection Algorithm Based on Attention Mechanism and Feature Fusion" (基于注意力机制和特征融合改进的小目标检测算法), Computer Applications and Software (计算机应用与软件), no. 05, 12 May 2020 (2020-05-12) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114511758A (en) | Image recognition method and device, electronic device and medium | |
CN112857268B (en) | Object area measuring method, device, electronic equipment and storage medium | |
CN114743196B (en) | Text recognition method and device and neural network training method | |
CN112749758A (en) | Image processing method, neural network training method, device, equipment and medium | |
CN114972958B (en) | Key point detection method, neural network training method, device and equipment | |
CN115511779B (en) | Image detection method, device, electronic equipment and storage medium | |
CN114445667A (en) | Image detection method and method for training image detection model | |
CN114723949A (en) | Three-dimensional scene segmentation method and method for training segmentation model | |
CN114821581A (en) | Image recognition method and method for training image recognition model | |
CN114550313A (en) | Image processing method, neural network, and training method, device, and medium thereof | |
CN114547252A (en) | Text recognition method and device, electronic equipment and medium | |
CN116152607A (en) | Target detection method, method and device for training target detection model | |
CN115797660A (en) | Image detection method, image detection device, electronic equipment and storage medium | |
CN114429678A (en) | Model training method and device, electronic device and medium | |
CN115578501A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN115601555A (en) | Image processing method and apparatus, device and medium | |
CN115359309A (en) | Training method, device, equipment and medium of target detection model | |
CN114494797A (en) | Method and apparatus for training image detection model | |
CN114092556A (en) | Method, apparatus, electronic device, medium for determining human body posture | |
CN114005138A (en) | Image processing method, image processing apparatus, electronic device, and medium | |
CN112579587A (en) | Data cleaning method and device, equipment and storage medium | |
CN114117046B (en) | Data processing method, device, electronic equipment and medium | |
CN114882331A (en) | Image processing method, apparatus, device and medium | |
CN115601561A (en) | High-precision map target detection method, device, equipment and medium | |
CN113920304A (en) | Sample image processing method, sample image processing device, electronic device, and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||