CN115423752A - Image processing method, electronic device and readable storage medium

Info

Publication number
CN115423752A
Authority
CN
China
Prior art keywords
image
target
person
posture
electronic device
Prior art date
Legal status
Granted
Application number
CN202210927878.4A
Other languages
Chinese (zh)
Other versions
CN115423752B (en)
Inventor
姚洋
高崇军
史廓
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202210927878.4A
Publication of CN115423752A
Application granted
Publication of CN115423752B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0004 Industrial image inspection
    • G06T 7/001 Industrial image inspection using an image reference approach
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence


Abstract

The application provides an image processing method, an electronic device and a readable storage medium, relates to the technical field of image processing, and can solve the problems of an empty area at the photographer's hand and an unnatural self-timer posture. The method comprises the following steps: the electronic device displays a first target image, where the first target image comprises an image of a first person holding a first object in a first target posture; the electronic device selects a first reference image from M reference images, where the first reference image comprises an image of a corresponding person in a second target posture, and the second target posture is different from the first target posture; and the electronic device performs posture migration on the first target image by using the first reference image to obtain a second target image, where the second target image comprises an image of the first person in the second target posture and does not comprise an image of the first person holding the first object.

Description

Image processing method, electronic device and readable storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method, an electronic device, and a readable storage medium.
Background
Smart phones have developed to the point where photographing and video recording have become among their most important features. With the increasingly powerful photographing functions of smart phones, more and more people use a smart phone instead of a camera to take pictures. To achieve a wider shooting angle, the smart phone is usually fixed on a telescopic selfie stick, and multi-angle self-timer shooting is achieved by freely adjusting the extension of the telescopic rod. However, when a selfie stick is used for self-timer shooting, part of the selfie stick may be captured, that is, the selfie stick may appear in the shot picture or video, which affects the user experience.
In existing schemes, the selfie stick can be removed from the image to improve the photographer's shooting experience. However, after the selfie stick is removed, an empty area appears at the photographer's hand, which makes the photographer's self-timer posture look unnatural. Therefore, a solution is needed for the problems of the empty hand area and the unnatural self-timer posture of the photographer.
Disclosure of Invention
The embodiments of the present application provide an image processing method, an electronic device and a readable storage medium, which are used to solve the problems of an empty area at the photographer's hand and an unnatural self-timer posture.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
In a first aspect, an image processing method is provided, which is applied to an electronic device, wherein M reference images are stored in the electronic device, the M reference images include images of at least one person in multiple postures, the M reference images are not images in which the person holds a first object, and M is an integer greater than 1. The method comprises the following steps: the electronic device displays a first target image, where the first target image comprises an image of a first person holding the first object in a first target posture; the electronic device selects a first reference image from the M reference images, where the first reference image comprises an image of a corresponding person in a second target posture, and the second target posture is different from the first target posture; and the electronic device performs posture migration on the first target image by using the first reference image to obtain a second target image, where the second target image comprises an image of the first person in the second target posture and does not comprise an image of the first person holding the first object.
Based on the first aspect, in the embodiment of the present application, the electronic device stores M reference images, and none of the M reference images is an image in which the corresponding person holds the first object. When the electronic device displays the first target image, because the first target image includes an image of a first person holding a first object in a first target posture, the electronic device can select a first reference image from the M reference images and perform posture migration on the first target image by using the first reference image to obtain a second target image. Because the first reference image includes an image of the corresponding person in a second target posture and the second target posture is different from the first target posture, the second target image includes an image of the first person in the second target posture and does not include an image of the first person holding the first object. In other words, the second target image generated after processing no longer shows the first person holding the first object, and the second target posture of the first person in the second target image is different from the first target posture, thereby solving the problems of the empty area at the photographer's hand and the unnatural self-timer posture.
In one implementation of the first aspect, the first object comprises a selfie stick.
In this implementation, since the first object includes a selfie stick, the problem that the selfie stick appears in a self-timer image when the photographer holds the selfie stick to take a self-timer photo can be solved; in addition, the problems that, after merely removing the selfie stick, an empty area appears at the photographer's hand and the self-timer posture looks unnatural can also be solved.
In one implementation of the first aspect, the second target gesture is a default gesture preset in the electronic device; or, the electronic device selects a first reference image from the M reference images, including: the electronic equipment displays a first interface; the first interface comprises a plurality of gesture selection items, and each gesture selection item in the plurality of gesture selection items corresponds to one target gesture; the electronic equipment responds to the operation of a first posture selection item in the plurality of posture selection items, and selects a reference image under the target posture corresponding to the first posture selection item from the M reference images as a first reference image; and the target posture corresponding to the first posture selection item is a second target posture.
In this implementation, the electronic device may display a plurality of gesture options, and since each gesture option of the plurality of gesture options corresponds to one target gesture, the user may select one target gesture from the plurality of gesture options, and the electronic device may process the first target image according to the target gesture selected by the user to generate a second target image; therefore, the second target posture of the first person in the second target image generated by the electronic equipment is the target posture selected by the user, the problems that the hand of the user is empty and the self-timer posture is unnatural are solved, and meanwhile, the improvement of user experience is facilitated.
In an implementation manner of the first aspect, the M reference images include: n reference images are reference images under the target posture corresponding to the first posture selection item, N is an integer larger than 1, and N is smaller than or equal to M; selecting a reference image under a target posture corresponding to a first posture option from the M reference images as a first reference image, wherein the method comprises the following steps: the electronic equipment selects a reference image with the maximum similarity between the target pose and the first target pose from the N reference images as a first reference image.
In this implementation manner, since the reference image in the target posture corresponding to the first posture selection item includes N reference images, on this basis, the electronic device may select, as the first reference image, the reference image in which the similarity between the target posture and the first target posture is the greatest from the N reference images, which is beneficial to improving the image effect.
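As an illustration of this selection step, the following is a minimal sketch, assuming each posture is represented as an array of key-point coordinates and that cosine similarity over centered key points is used as the similarity measure (this application does not specify the measure); the function names are illustrative only.

```python
# Sketch of choosing the first reference image: the reference whose target pose is
# most similar to the first target pose. Cosine similarity over centered key-point
# coordinates is an assumed similarity measure, not this application's definition.
import numpy as np

def pose_similarity(pose_a: np.ndarray, pose_b: np.ndarray) -> float:
    """pose_a, pose_b: (K, 2) arrays of key-point coordinates for the same K key points."""
    a = (pose_a - pose_a.mean(axis=0)).ravel()
    b = (pose_b - pose_b.mean(axis=0)).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_first_reference(first_target_pose: np.ndarray, reference_poses: list) -> int:
    """Return the index of the reference image with the maximum pose similarity."""
    scores = [pose_similarity(first_target_pose, ref) for ref in reference_poses]
    return int(np.argmax(scores))
```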
In one implementation of the first aspect, the first pose selection item is used to customize the target pose; selecting a reference image under a target posture corresponding to a first posture option from the M reference images as a first reference image, wherein the method comprises the following steps: the electronic equipment displays a second interface; the second interface is used for indicating a user to draw a target gesture; the electronic equipment receives a target gesture drawn by a user on the second interface, and selects a reference image with the maximum similarity between the target gesture and the drawn target gesture from the M reference images as a first reference image.
In this implementation, when the first gesture selection item selected by the user is used for customizing the target gesture, the electronic device may further display a second interface to facilitate the user to draw the target gesture; after the electronic equipment receives the target posture drawn by the user on the second interface, the electronic equipment selects the reference image with the maximum similarity between the target posture and the drawn target posture from the M reference images as the first reference image, and user experience is further improved.
In one implementation of the first aspect, the first target image is acquired by the electronic device in response to a capture instruction; wherein the electronic device selects a first reference image from the M reference images, comprising: in response to the photographing instruction, the electronic device determines that the first target image includes an image of a first person holding the selfie stick in a first target pose, and selects a first reference image from the M reference images.
In this implementation, since the first target image is captured by the electronic device in response to the photographing instruction, after the electronic device completes photographing, when the electronic device determines that the first target image includes an image of a first person holding the selfie stick in the first target pose, the first reference image is selected from the M reference images; after that, the electronic device may perform gesture migration on the first target image by using the first reference image to generate a second target image, that is, the electronic device may process the first target image after the shooting is completed, thereby improving user experience.
In one implementation manner of the first aspect, an electronic device displays a first target image, including: the electronic equipment displays a first target image in the gallery application, or the electronic equipment displays the first target image in the instant messaging application; wherein the electronic device selects a first reference image from the M reference images, comprising: in response to a preset editing operation of a user on the first target image, the electronic equipment selects a first reference image from the M reference images.
In the implementation manner, the electronic device can also process the first target image stored in the gallery application and the first target image received by the instant messaging application, so that the user experience is further improved.
In an implementation manner of the first aspect, the electronic device performing posture migration on the first target image by using the first reference image to obtain the second target image includes: the electronic device performs object segmentation on the first target image to remove the image of the first object in the first target image and identify a first image of the first person in the first target image; the electronic device performs portrait segmentation on the first reference image to obtain a second image of the first person in the first reference image; the electronic device performs first posture estimation on the first image of the first person and the second image of the first person to obtain a first posture UV map of the first person, a first image UV map of the first person, a second posture UV map of the first person and a second image UV map of the first person; the electronic device performs posture migration on the first posture UV map of the first person and the first image UV map of the first person by using the second posture UV map of the first person and the second image UV map of the first person to obtain a third posture UV map of the first person and a third image UV map of the first person; the electronic device performs second posture estimation on the third posture UV map of the first person and the third image UV map of the first person to obtain a third image of the first person; and the electronic device fuses the third image of the first person with the background image in the first target image to obtain the second target image.
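These steps can be read as a processing pipeline. The sketch below is only an outline under the assumption that each step is available as a callable; all parameter names (segment_object, segment_portrait, estimate_pose_uv, migrate_uv, render_from_uv, fuse) are hypothetical stand-ins for the segmentation, pose-estimation, posture-migration and fusion operations described in this implementation.

```python
# Outline of the posture-migration implementation described above. The callables
# passed in are hypothetical stand-ins for the models/steps named in this application.
def migrate_pose_of_target(first_target_image, first_reference_image,
                           segment_object, segment_portrait,
                           estimate_pose_uv, migrate_uv, render_from_uv, fuse):
    # 1. Object segmentation: remove the first object (e.g. the selfie stick) and
    #    isolate the first image of the first person plus the remaining background.
    person_img_1, background = segment_object(first_target_image)
    # 2. Portrait segmentation: obtain the second image of the person from the
    #    first reference image.
    person_img_2 = segment_portrait(first_reference_image)
    # 3. First pose estimation: a pose UV map and an image (texture) UV map for each.
    pose_uv_1, image_uv_1 = estimate_pose_uv(person_img_1)
    pose_uv_2, image_uv_2 = estimate_pose_uv(person_img_2)
    # 4. Posture migration in UV space: obtain the third pose UV map and image UV map.
    pose_uv_3, image_uv_3 = migrate_uv(pose_uv_1, image_uv_1, pose_uv_2, image_uv_2)
    # 5. Second pose estimation (UV to image): the third image of the first person.
    person_img_3 = render_from_uv(pose_uv_3, image_uv_3)
    # 6. Fusion: composite the re-posed person onto the background of the target image.
    return fuse(person_img_3, background)
```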
In a second aspect, an electronic device is provided, which has the function of implementing the first aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
In a third aspect, an electronic device is provided, where M reference images are stored, where the M reference images include images of at least one person in multiple poses, the M reference images are not images of a first object held by the person, and M is an integer greater than 1; the electronic device includes a display screen, a memory, and one or more processors; the display screen, the memory and the processor are coupled; the memory is for storing computer program code, the computer program code comprising computer instructions; the computer instructions, when executed by the processor, cause the electronic device to perform the steps of: the electronic equipment displays a first target image; the first target image comprises an image of a first person holding a first object in a first target posture; the electronic equipment selects a first reference image from the M reference images; the first reference image comprises an image of a corresponding person in a second target posture, and the second target posture is different from the first target posture; the electronic equipment performs attitude migration on the first target image by adopting the first reference image to obtain a second target image; and the second target image comprises an image of the first person in the second target posture, and the second target image does not comprise an image of the first person holding the first object.
In one implementation of the third aspect, the first object comprises a selfie stick.
In one implementation of the third aspect, the second target gesture is a default gesture preset in the electronic device; alternatively, when the computer instructions are executed by the processor, the electronic device is specifically caused to perform the following steps: the electronic equipment displays a first interface; the first interface comprises a plurality of gesture selection items, and each gesture selection item in the plurality of gesture selection items corresponds to one target gesture; the electronic equipment responds to the operation of a first posture selection item in the plurality of posture selection items, and selects a reference image under the target posture corresponding to the first posture selection item from the M reference images as a first reference image; and the target posture corresponding to the first posture selection item is a second target posture.
In one implementation manner of the third aspect, the M reference images include: n reference images are reference images under the target posture corresponding to the first posture selection item, N is an integer larger than 1, and N is smaller than or equal to M; when the computer instructions are executed by the processor, the electronic device is enabled to specifically execute the following steps: the electronic equipment selects a reference image with the maximum similarity between the target pose and the first target pose from the N reference images as a first reference image.
In one implementation of the third aspect, the first gesture selection is used to customize the target gesture; when the computer instructions are executed by the processor, the electronic device is enabled to specifically execute the following steps: the electronic equipment displays a second interface; the second interface is used for indicating a user to draw a target gesture; the electronic equipment receives a target gesture drawn by a user on the second interface, and selects a reference image with the maximum similarity between the target gesture and the drawn target gesture from the M reference images as a first reference image.
In one implementation form of the third aspect, the first target image is acquired by the electronic device in response to a shooting instruction; when executed by a processor, the computer instructions cause the electronic device to perform in particular the steps of: in response to the photographing instruction, the electronic device determines that the first target image includes an image of a first person holding the selfie stick in a first target pose, and selects a first reference image from the M reference images.
In one implementation of the third aspect, the computer instructions, when executed by the processor, cause the electronic device to perform the following steps: the electronic equipment displays a first target image in the gallery application, or the electronic equipment displays the first target image in the instant messaging application; in response to a preset editing operation of a user on the first target image, the electronic equipment selects a first reference image from the M reference images.
In one implementation of the third aspect, the computer instructions, when executed by the processor, cause the electronic device to perform the following steps: the electronic device performs object segmentation on the first target image to remove the image of the first object in the first target image and identify a first image of the first person in the first target image; the electronic device performs portrait segmentation on the first reference image to obtain a second image of the first person in the first reference image; the electronic device performs first posture estimation on the first image of the first person and the second image of the first person to obtain a first posture UV map of the first person, a first image UV map of the first person, a second posture UV map of the first person and a second image UV map of the first person; the electronic device performs posture migration on the first posture UV map of the first person and the first image UV map of the first person by using the second posture UV map of the first person and the second image UV map of the first person to obtain a third posture UV map of the first person and a third image UV map of the first person; the electronic device performs second posture estimation on the third posture UV map of the first person and the third image UV map of the first person to obtain a third image of the first person; and the electronic device fuses the third image of the first person with the background image in the first target image to obtain the second target image.
In a fourth aspect, a computer-readable storage medium is provided, in which computer instructions are stored, which, when run on a computer, cause the computer to perform the image processing method of any of the above first aspects.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the image processing method of any of the first aspects above.
For technical effects brought by any one of the design manners in the second aspect to the fifth aspect, reference may be made to technical effects brought by different design manners in the first aspect, and details are not described herein.
Drawings
Fig. 1 is a schematic structural diagram of a UV texture color value according to an embodiment of the present disclosure;
fig. 2 is a first flowchart illustrating an image processing method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 4 is a first schematic interface diagram of an image processing method according to an embodiment of the present disclosure;
fig. 5 is a second interface schematic diagram of an image processing method according to an embodiment of the present application;
fig. 6 is a third schematic interface diagram of an image processing method according to an embodiment of the present application;
fig. 7 is a flowchart illustrating a second image processing method according to an embodiment of the present application;
fig. 8 is a third schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a semantic segmentation model according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a semantic segmentation model provided in an embodiment of the present application;
fig. 11 is a schematic structural diagram of a synthetic network according to an embodiment of the present application;
fig. 12 is a fourth schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 13 is a fourth interface schematic diagram of an image processing method according to an embodiment of the present application;
fig. 14 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a chip system according to an embodiment of the present application.
Detailed Description
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present embodiment, "a plurality" means two or more unless otherwise specified.
In order to facilitate understanding of the schemes provided in the embodiments of the present application, some terms referred to in the embodiments of the present application will be explained below.
Pose estimation: pose estimation is performed on an input image to be processed (such as an RGB image) based on a dense pose estimation (DensePose) model. In practical implementation, pose estimation of the image to be processed is usually a prediction of key points (or labeled points) of the image, that is, the position coordinates of each key point of the human body are predicted first, and then the spatial position relationship between the key points is determined, so as to obtain a predicted human skeleton. The key points may be joint points or other points on the human body, which is not specifically limited in the embodiments of the present application.
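For illustration only, the sketch below shows one way the predicted key points and their spatial relationships could be represented as a skeleton; the key-point names and skeleton edges are assumptions, not definitions from this application.

```python
# Illustrative key-point and skeleton representation for pose estimation.
# Key-point names and edges are assumed; real models define their own sets.
from dataclasses import dataclass

@dataclass
class Keypoint:
    name: str
    x: float  # predicted horizontal position in the image
    y: float  # predicted vertical position in the image

# Spatial relationships between key points (a small assumed subset of a skeleton).
SKELETON_EDGES = [
    ("head", "neck"),
    ("neck", "left_shoulder"), ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("neck", "right_shoulder"), ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
]

def build_skeleton(keypoints: dict) -> list:
    """Connect predicted key points into skeleton segments (pairs of Keypoints)."""
    return [(keypoints[a], keypoints[b]) for a, b in SKELETON_EDGES
            if a in keypoints and b in keypoints]
```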
In general, the DensePose model performs pose estimation on an input image to be processed, which includes two tasks, one is a classification task and the other is a regression task. The classification task can classify and mark the input image to be processed based on the human body region to which each key point belongs. Illustratively, the body regions are grouped into 24 categories, such as including head, left hand, right hand, arms, shoulders, etc.; on the basis, the human body regions to which the key points belong in the image to be processed are marked based on different classification results. In some embodiments, the classification task may also implement classification and labeling of the background in the image to be processed, so that the classification task may classify the image to be processed into 25 classes.
The regression task may be based on a skinned multi-person linear (SMPL) model, and maps the three-dimensional coordinates (i.e., X, Y and Z coordinates) corresponding to each pixel in the image to be processed into UV coordinates to obtain a UV texture map. The SMPL model is a vertex-based three-dimensional human body model that can accurately represent different shapes and postures of the human body; the SMPL model includes 6890 vertices and 24 body-region classes. The SMPL model involves two important coordinate systems: one is the position (X, Y, Z) coordinates of the vertices, and the other is the UV coordinates. Here, U refers to the coordinate of the picture in the horizontal direction of the display, and V refers to the coordinate of the picture in the vertical direction of the display; in other words, U represents the u-th pixel in the horizontal direction (i.e., along the picture width), and V represents the v-th pixel in the vertical direction (i.e., along the picture height). Generally, U and V are in the range [0, 1].
It should be noted that, based on the SMPL model, the three-dimensional coordinates corresponding to each pixel in the image to be processed can be mapped to UV coordinates to obtain a UV texture map; correspondingly, the UV coordinates can be mapped back to three-dimensional coordinates to obtain a processed image. The image to be processed and the processed image are both images in image space (or I images); that is, in the embodiment of the present application, both the conversion from an I image to a UV image (i.e., I2UV) and the conversion from a UV image to an I image (i.e., UV2I) can be implemented.
The following relationship between three-dimensional coordinates and UV coordinates is used as an example to illustrate the I2UV and UV2I conversions.
v: a1, b1, c1, 0;
v: a2, b2, c2, 0;
v: a3, b3, c3, 0;
v: a4, b4, c4, 0;
vt: A1, B1;
vt: A2, B2;
vt: A3, B3;
vt: A4, B4.
Here, v represents a three-dimensional coordinate, and vt represents a UV coordinate. Taking v: a1, b1, c1, 0 as an example, a1, b1 and c1 correspond to the X axis, Y axis and Z axis of the three-dimensional coordinate respectively, and 0 represents the classification of the body region corresponding to the key point; for example, 0 may represent the head, 1 may represent the neck, 2 may represent a shoulder, and so on, which will not be described in detail here. Taking vt: A1, B1 as an example, A1 represents the coordinate of the picture in the horizontal direction of the display (i.e., along the picture width), and B1 represents the coordinate of the picture in the vertical direction of the display (i.e., along the picture height). For other examples of v and vt, reference may be made to the above description, and details are not repeated here.
Further, the position of each vertex of the SMPL model may be expressed as f = v/vt, where f represents the position of a vertex in the SMPL model, and v/vt represents the position value of that vertex. For example, f1 represents the position of the first vertex of the 6890 vertices, and v1/vt1 represents the position value of the first vertex; similarly, f2 represents the position of the second vertex, and v2/vt2 represents the position value of the second vertex; f3 represents the position of the third vertex, and v3/vt3 represents the position value of the third vertex. In combination with the relationship between the three-dimensional coordinates and the UV coordinates shown above, the positions f of the vertices of the SMPL model are related to v and vt as follows, for example: f: 4/1, 2/2, 1/3.
With reference to the three-dimensional coordinates and UV coordinates shown above, 4 in 4/1 represents the fourth v coordinate and 1 represents the first vt coordinate; similarly, 2 in 2/2 represents the second v coordinate and 2 represents the second vt coordinate; and 1 in 1/3 represents the first v coordinate and 3 represents the third vt coordinate.
In combination with the above embodiment, the texture color values of the vertices in the SMPL model can be found in the UV coordinates, and the obtained texture color values are rendered to the positions of the vertices, so as to obtain the UV texture image. Exemplarily, as shown in fig. 1, a schematic diagram of a correspondence relationship between UV coordinates and texture color values is shown, and as can be seen from fig. 1, the texture color values include RGBY four colors; wherein R represents red (red), G represents green (green), B represents blue (blue), and Y represents yellow (yellow).
Illustratively, taking the above f: 4/1 as an example, the vertex position corresponds to the UV coordinate (0.8, 0.4), and the texture color corresponding to this vertex position is blue. By analogy, the texture color corresponding to each vertex position can be obtained, and each texture color is rendered to the corresponding vertex position, thereby achieving the conversion from the I image to the UV image. Correspondingly, according to the relationship between v and vt, the coordinate corresponding to vt is mapped back to the v coordinate, and the conversion from the UV image to the I image can be obtained. For the conversion from the UV image to the I image, reference may be made to the above conversion from the I image to the UV image, and details are not repeated here.
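To make the f = v/vt notation concrete, the following sketch hard-codes the small example above (f: 4/1, 2/2, 1/3) and shows how a vertex's vt index is used to look up a texture color; the numeric vertex and UV values and the nearest-texel lookup are illustrative assumptions, not values from this application.

```python
# Illustrative f = v/vt lookup: each face entry (v index, vt index) links a vertex to a
# UV coordinate, whose texel color is then fetched. Values below are assumptions.
V  = [(0.10, 0.25, 0.30, 0), (0.40, 0.55, 0.60, 1),
      (0.70, 0.85, 0.90, 2), (0.15, 0.35, 0.50, 2)]    # (x, y, z, body-region class)
VT = [(0.8, 0.4), (0.2, 0.6), (0.5, 0.5), (0.9, 0.1)]  # (U, V), both in [0, 1]
F  = [(4, 1), (2, 2), (1, 3)]                          # f: 4/1, 2/2, 1/3 (1-based indices)

def texel(texture, u, v, tex_size):
    """Nearest-texel lookup of a square UV texture for (U, V) in [0, 1]."""
    return texture[int(v * (tex_size - 1))][int(u * (tex_size - 1))]

def vertex_colors(texture, tex_size=256):
    # For each f entry, take the vt index, fetch its (U, V), and sample the texture,
    # giving the color to render at the corresponding vertex position.
    return {v_idx: texel(texture, *VT[vt_idx - 1], tex_size) for v_idx, vt_idx in F}
```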
In summary, in the embodiment of the present application, after the image to be processed is input to the DensePose model for pose estimation, since the DensePose model includes a classification task and a regression task, the output of the DensePose model may include a pose-estimation UV map and a UV texture image. The pose-estimation UV map is obtained by converting each key point of the human body posture in the image to be processed into UV space; the UV texture image is the image obtained after UV conversion of the image to be processed (i.e., the RGB image).
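As a concrete but simplified illustration of the I2UV direction, the sketch below scatters the colors of the person pixels of an RGB image into a UV texture, assuming a DensePose-style per-pixel output with a body-part index channel and U, V channels in [0, 1]; a single shared texture is used here for brevity, whereas DensePose implementations typically keep one texture per body part.

```python
# Simplified I2UV sketch: build a UV texture image from an RGB image using per-pixel
# part indices (0 = background) and U, V coordinates in [0, 1]. Layout is assumed.
import numpy as np

def image_to_uv_texture(rgb, part_map, u_map, v_map, tex_size=256):
    """rgb: (H, W, 3); part_map: (H, W) int; u_map, v_map: (H, W) float in [0, 1]."""
    texture = np.zeros((tex_size, tex_size, 3), dtype=rgb.dtype)
    ys, xs = np.nonzero(part_map)                          # person (non-background) pixels
    tu = np.clip((u_map[ys, xs] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    tv = np.clip((v_map[ys, xs] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    texture[tv, tu] = rgb[ys, xs]                          # scatter image colors into UV space
    return texture
```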
It can be understood that a selfie stick is a rod of adjustable length: one end of the rod is a handle held by the photographer to control photographing, and the other end of the rod is a fixing part on which the electronic device is mounted, with the length of the rod being adjustable. In actual operation, the photographer can capture self-timer images from different angles by adjusting the length of the selfie stick. The longer the selfie stick, the wider the image background of the captured self-timer image.
However, the longer the selfie stick, the longer the portion of the stick that appears in the self-timer image, and the larger the area of the image that is affected. In addition, if the selfie stick alone is simply removed, an empty area appears at the photographer's hand, which makes the photographer's posture look unnatural and affects the shooting effect.
The embodiment of the present application provides an image processing method, which is applied to an electronic device and can simultaneously solve the problems of the empty area at the photographer's hand and the unnatural self-timer posture, thereby improving the shooting effect. Specifically, the electronic device first locates the selfie stick in a self-timer image that contains the selfie stick and then removes it; then, the electronic device performs image restoration on the image from which the selfie stick has been removed, to obtain a self-timer image with an improved shooting effect.
Illustratively, as shown in fig. 2, (1) in fig. 2 is a self-timer image containing a selfie stick taken by the electronic device. The electronic device first locates the selfie stick in the self-timer image and then removes it, obtaining the self-timer image with the selfie stick removed shown in (2) in fig. 2. Subsequently, the electronic device performs image restoration on the image shown in (2) in fig. 2, and obtains the self-timer image with the self-timer effect shown in (3) in fig. 2.
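A rough sketch of the locate-remove-restore flow of fig. 2 is given below; the stick mask is assumed to come from some segmentation model (not shown), and plain OpenCV inpainting stands in for the restoration step (the embodiments below use reference-image posture migration rather than simple inpainting).

```python
# Sketch of the locate -> remove -> restore flow illustrated in fig. 2. The selfie-stick
# mask is assumed to come from a segmentation model; plain inpainting is used here only
# as a stand-in for the restoration step described in the embodiments.
import cv2
import numpy as np

def remove_selfie_stick(selfie_bgr: np.ndarray, stick_mask: np.ndarray) -> np.ndarray:
    """selfie_bgr: (H, W, 3) uint8 image; stick_mask: (H, W) uint8, 255 where the stick is."""
    # Slightly dilate the mask so the stick's edges are fully covered before restoration.
    mask = cv2.dilate(stick_mask, np.ones((5, 5), np.uint8), iterations=1)
    # Fill the removed region from surrounding pixels (inpaint radius of 5 pixels).
    return cv2.inpaint(selfie_bgr, mask, 5, cv2.INPAINT_TELEA)
```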
The image processing method provided by the embodiment of the application will be described in detail below with reference to the drawings in the specification.
For example, the electronic device in the embodiment of the present application may be an electronic device having a shooting function, such as a mobile phone, an action camera (for example, a GoPro), a digital camera, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, a vehicle-mounted device, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), or an augmented reality (AR)/virtual reality (VR) device; the embodiment of the present application does not specifically limit the specific form of the electronic device.
Fig. 3 is a schematic structural diagram of the electronic device 100. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a positioning module 181, a key 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identity module (SIM) card interface 195, and the like.
It is to be understood that the illustrated structure of the present embodiment does not constitute a specific limitation to the electronic apparatus 100. In other embodiments, electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be a neural center and a command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
It should be understood that the connection relationship between the modules illustrated in this embodiment is only an exemplary illustration, and does not limit the structure of the electronic device. In other embodiments, the electronic device may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In some other embodiments, the power management module 141 may also be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may adopt a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (active-matrix organic light-emitting diode, AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and the like.
The electronic device 100 may implement a photographing function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, and the application processor, etc.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV and other formats. In some embodiments, the electronic device may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device performs frequency bin selection, the digital signal processor is used to perform a Fourier transform and the like on the frequency bin energy.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device can play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can realize applications such as intelligent cognition of electronic equipment, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
The electronic device 100 may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110. The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. The microphone 170C, also referred to as a "microphone," is used to convert sound signals into electrical signals.
The earphone interface 170D is used to connect a wired earphone. The earphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as audio, video, etc. are saved in the external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the electronic device and data processing by executing instructions stored in the internal memory 121. For example, in the embodiment of the present application, the processor 110 may execute instructions stored in the internal memory 121, and the internal memory 121 may include a program storage area and a data storage area.
The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The data storage area can store data (such as audio data, phone book and the like) created in the using process of the electronic device. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys. Or may be touch keys. The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration cues, as well as for touch vibration feedback. Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc. The SIM card interface 195 is used to connect a SIM card. The SIM card can be attached to and detached from the electronic device by being inserted into the SIM card interface 195 or being pulled out of the SIM card interface 195. The electronic equipment can support 1 or N SIM card interfaces, and N is a positive integer greater than 1. The SIM card interface 195 may support a Nano SIM card, a Micro SIM card, a SIM card, etc.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 100. In other embodiments of the present application, an electronic device may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components may be used. The illustrated components may be implemented in hardware, software, or a combination of hardware and software.
The methods in the following embodiments may be implemented in the electronic device 100 having the above-described hardware structure. In the following embodiments, the electronic device 100 is taken as a mobile phone as an example, and the technical solutions provided by the embodiments of the present application are specifically described.
The image processing method provided by the embodiment of the present application can be applied to various self-timer scenarios. For example, a photographer can fix a mobile phone at one end of a selfie stick and hold the other end of the selfie stick by hand to carry out single-person shooting, multi-person shooting, scenery shooting, high-altitude shooting, skiing shooting, surfing shooting, parachute-jumping shooting and other self-timer scenarios. The following takes a scenario in which a photographer holds a selfie stick to take a single-person self-timer photo as an example to illustrate the technical solution provided by the embodiment of the present application.
Illustratively, a camera application (or an application having a shooting function) may be installed in the mobile phone. Taking the example of installing the camera application in the mobile phone, the mobile phone may run the camera application by receiving an operation input by the first person. The operation may be, for example, one of touch operation, key operation, gesture operation, voice operation, or the like; the touch operation may be, for example, a click operation or a slide operation.
Here, the first person may be a photographer or a subject. When a first person carries out self-shooting through a front camera of the mobile phone, the first person is a photographer and a photographed person at the same time; when the first person shoots through the rear camera, the first person is a photographer. In the following embodiment, a first person is taken as an example of a photographer when the first person performs self-shooting through a front camera of a mobile phone.
In some embodiments, as shown in fig. 4 (1), the cell phone displays an interface 220 as shown in fig. 4 (2) in response to the photographer operating an icon 210 of the camera application in the cell phone home screen interface. The interface 220 is a preview interface when the mobile phone takes a picture, and the preview interface includes a preview image. Illustratively, the interface 220 further includes a portrait mode, a video recording mode, a movie mode, and a professional mode, and the photographer can select any one of the modes for shooting.
In some embodiments, as also shown in fig. 4 (2), the interface 220 further includes an Artificial Intelligence (AI) control 230, and the cell phone opens an AI function in response to a photographer operating the AI control 230. The AI function is used for identifying whether a selfie stick is included in the preview image. When the mobile phone recognizes that the selfie stick is included in the preview image, the mobile phone displays an interface 240 (or first interface) as shown in (1) in fig. 5. The interface 240 includes a plurality of gesture selection items; wherein each of the plurality of pose selection items corresponds to a target pose of the first person.
Illustratively, still as shown in the interface 240 in (1) in fig. 5, the plurality of gesture selection items include a default gesture selection item, a gesture 1 (pose 1) selection item, a gesture 2 (pose 2) selection item, ..., a custom gesture selection item, and the like. Considering that the display area in the interface 240 is limited, the plurality of gesture selection items further include a "more" option, which can provide more gesture selection items for the photographer. The default gesture is a target gesture preset in the mobile phone; after the mobile phone enables the AI function, if the mobile phone recognizes that the preview image includes a selfie stick and the photographer does not select any other gesture selection item, the mobile phone may perform gesture migration on a target image (e.g., the first target image) generated by the mobile phone using the default gesture.
In addition, the custom gesture is a target gesture defined by the photographer; for example, the photographer can manually draw a target gesture. Gesture 1 and gesture 2 are target gestures that the photographer can select according to actual needs; for example, gesture 1 may be "the first person with one hand on the hip and the other hand making a 'Yeah' (V) sign", and gesture 2 may be "the first person with both hands on the hips".
Illustratively, the first gesture selection item is pose 1. In response to the selection operation on pose 1 in the interface 240, the mobile phone displays an interface 250 as shown in (2) in fig. 5, where pose 1 is selected in the interface 250; for example, the pose 1 selection box is darkened. Then, in response to an operation on the shooting key, the mobile phone displays an interface 260 as shown in (3) in fig. 5, where the interface 260 includes the first target image; the first target image is generated from an image frame acquired by the mobile phone through the front camera. Then, the mobile phone performs gesture migration on the first target image according to the target gesture corresponding to pose 1 previously selected by the photographer, and displays an interface 270 as shown in (4) in fig. 5, where the interface 270 includes the second target image. The interface 270 further includes a "save" control and a "cancel" control; in response to an operation on the "save" control, the mobile phone saves the second target image in the gallery application of the mobile phone, and in response to an operation on the "cancel" control, the mobile phone saves the first target image in the gallery application of the mobile phone.
Taking the first gesture selection item as the custom gesture selection item as an example, as shown in (1) in fig. 6, for example, in response to the operation on the custom gesture selection item, the mobile phone displays an interface 280 (or referred to as a second interface) as shown in (2) in fig. 6, where the interface 280 is used for instructing the photographer to draw the target gesture. And then, the mobile phone performs posture migration on the first target image according to the target posture drawn by the photographer to obtain a second target image. For an example of the second target image, reference may be made to the second target image shown in (4) in fig. 5, which is not described in detail here.
In the embodiment of the application, the second target image is generated after the mobile phone performs gesture migration on the first target image according to the target gesture corresponding to the gesture selection item. Comparing the first target image and the second target image shown in (3) and (4) in fig. 5, it can be seen that the target gesture (or first target gesture) of the photographer in the first target image is different from the target gesture (or second target gesture) of the photographer in the second target image. For example, in the first target image, one hand of the photographer holds the selfie stick and the other hand crosses the waist; in the second target image, one hand of the photographer makes a "Yeah" (V-sign) gesture and the other hand crosses the waist. Thus, when the photographer takes a picture while holding the selfie stick, the mobile phone performs gesture migration on the first target image with the preset target gesture, so that the selfie stick in the first target image can be removed and a second target image with the effect of having been taken by another person is generated; this solves the problems that, if the selfie stick is merely removed, an empty area is left near the photographer's hand and the photographer's selfie pose looks unnatural.
In some embodiments, in combination with the interface diagrams shown in fig. 4 and 5, as shown in fig. 7, the mobile phone enters a photographing mode in response to an operation of the photographer running the camera application, and displays a preview image on the preview interface. Then, the mobile phone starts an AI function and starts to identify whether a preview image comprises a selfie stick or not; when the mobile phone identifies that the preview image comprises the selfie stick, the mobile phone prompts a user whether the selfie stick needs to be removed or not through first prompt information. Illustratively, when the mobile phone receives an operation of removing the selfie stick by the user, the mobile phone responds to the operation and displays a plurality of gesture selection items on the preview interface. Such as a default gesture selection, a pos 1 gesture selection, a pos 2 gesture selection, a custom gesture selection, and the like. The first prompt message may be, for example, a voice prompt message or a text prompt message, which is not limited in this embodiment of the application.
After a user selects a first gesture selection item in the gesture selection items, the mobile phone responds to the operation of a shooting key to generate a first target image, and gesture migration is carried out on the first target image by adopting a target gesture corresponding to the first gesture selection item to obtain a second target image.
With reference to fig. 8, a specific implementation process of performing gesture migration on the first target image by using the target gesture corresponding to the first gesture option by the mobile phone to obtain the second target image is illustrated.
Illustratively, M reference images are stored in the mobile phone in advance, where the M reference images include images of at least one person in multiple poses, and M is an integer greater than 1. For example, on the basis of the interface 240 shown in (1) in fig. 5, in response to a photographer operating a first posture selection item (e.g., pos 1) among the plurality of posture selection items, the mobile phone selects a reference image in a target posture corresponding to the first posture selection item from the M reference images as a first reference image; subsequently, the mobile phone can perform posture migration on the first target image by using the first reference image to obtain a second target image.
In some embodiments, each pose selection item of the plurality of pose selection items comprises N reference images of the M reference images at the target pose, i.e., the target pose corresponding to one target pose selection item comprises N reference images; wherein N is an integer greater than 1, and N is less than or equal to M. Illustratively, in response to an operation on a first posture option in the multiple posture options, the mobile phone determines N reference images in a target posture corresponding to the first posture option; then, the mobile phone selects a first reference image from the N reference images. For example, the mobile phone may select, as the first reference image, a reference image with the largest similarity between the target pose and the first target pose from the N reference images. It should be understood that the first target pose is a target pose of a first person holding the selfie stick in the first target image.
For example, the mobile phone may calculate the similarity between the target pose corresponding to each of the N reference images and the first target pose, and select the reference image with the maximum similarity as the first reference image. It should be noted that, for an example that the mobile phone calculates the similarity between the target pose corresponding to each of the N reference images and the first target pose, reference may be made to the following embodiments, which are not described herein again.
As described in connection with the above embodiments, the plurality of gesture selections includes a default gesture selection, a custom gesture selection, and the like. On the basis, when the first posture selection item is the default posture selection item, the target posture corresponding to the first posture selection item is the default posture; and when the first posture selection item is the self-defined posture selection item, the target posture corresponding to the first posture selection item is the self-defined target posture.
In some embodiments, after the mobile phone receives the operation of the shooting key, the mobile phone generates a first target image through an image frame acquired by a front camera; then, the mobile phone performs object segmentation on the first target image to remove a selfie stick in the first target image to obtain a third target image, and identifies the first image of the first person in the third target image.
For example, as shown in fig. 8, the mobile phone may use an object segmentation algorithm to remove the selfie stick in the first target image to obtain a third target image, and identify the first image of the first person in the third target image. In the embodiment of the present application, the object segmentation algorithm may be, for example, a semantic segmentation algorithm (or semantic segmentation model). The semantic segmentation algorithm includes an encoder and a decoder: the encoder compresses the data and is used to extract features of the input image; the decoder is the inverse of the encoder and decodes the deep feature space back to the image space.
For example, as shown in fig. 9, the semantic segmentation algorithm can separate the selfie stick 310 and the person image 320 (or the first image of the first person) in the first target image. It should be understood that if the selfie stick and the person image exist in the first target image, after the semantic segmentation algorithm processing, the mobile phone may output a corresponding mask region (i.e. output a selfie stick category 1 and a person image category 2); if the first target image does not contain the selfie stick and the figure image, after the semantic segmentation algorithm processing, the mobile phone outputs the category 0, which indicates that the target object is not detected at this time, namely that the selfie stick and the figure image are not detected.
Subsequently, after the mobile phone separates the selfie stick 310 and the person image 320 in the first target image by using a semantic segmentation algorithm, the mobile phone may remove the selfie stick 310 to obtain the first image of the first person in the first target image, i.e. obtain the person image 320.
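As an illustrative sketch (not part of the published embodiment), the per-pixel mask produced by the semantic segmentation model could be applied as follows to separate the selfie stick and the person image; the function name, class ids, and NumPy representation are assumptions:

```python
import numpy as np

# Assumed class ids following the description above:
# 0 = background / nothing detected, 1 = selfie stick, 2 = person image.
STICK_CLASS, PERSON_CLASS = 1, 2

def split_stick_and_person(first_target_image: np.ndarray, mask: np.ndarray):
    """Separate the selfie stick and the person using the segmentation mask.

    first_target_image: H x W x 3 RGB image.
    mask: H x W per-pixel class map output by the semantic segmentation model.
    Returns (third_target_image, person_image): the image with the selfie stick
    pixels cleared (to be filled in later) and the isolated person image.
    """
    stick_region = mask == STICK_CLASS
    person_region = mask == PERSON_CLASS

    third_target_image = first_target_image.copy()
    third_target_image[stick_region] = 0            # remove the selfie stick pixels

    person_image = np.zeros_like(first_target_image)
    person_image[person_region] = first_target_image[person_region]
    return third_target_image, person_image
```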
In some embodiments, the semantic segmentation model may be pre-trained in an end-to-end manner based on the portrait segmentation dataset and then finetuned based on the selfie stick dataset, so as to obtain a network model capable of performing both selfie stick segmentation and portrait segmentation. In the embodiment of the present application, the semantic segmentation model may be trained on an NVIDIA RTX A5000 graphics card using the Adam optimizer, with a learning rate (Lr) of 0.001 and 100 iterations.
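A minimal sketch of this two-stage training setup (pre-train on portrait data, then finetune on selfie stick data) is given below, assuming PyTorch and hypothetical seg_model, portrait_dataset, and selfie_stick_dataset objects; only the Adam optimizer and the 0.001 learning rate are taken from the description above:

```python
import torch
from torch.utils.data import DataLoader

def train_stage(model, dataset, epochs, lr=1e-3, batch_size=8):
    """One training stage with the Adam optimizer and per-pixel cross entropy."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:               # labels: per-pixel class ids
            logits = model(images)                  # N x C x H x W class scores
            loss = torch.nn.functional.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: pre-train on the portrait segmentation dataset.
# Stage 2: finetune the same model on the selfie stick dataset.
# The epoch counts below are placeholders, not values from the text.
# train_stage(seg_model, portrait_dataset, epochs=100)
# train_stage(seg_model, selfie_stick_dataset, epochs=100)
```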
For example, a selfie stick dataset may be collected by using different types of selfie sticks on a mobile electronic device (e.g., a mobile phone). For example, selfie sticks of different colors (e.g., black, white, blue) and of lengths of 1 meter, 2 meters, and 3 meters may be selected. A collection method for the selfie stick dataset is provided below. Step 1: hold a 1-meter selfie stick and extend its length from 0 to 1 meter in unit steps of n (e.g., 10 cm), acquiring images at each length; specifically, at each unit length, images are acquired while holding the selfie stick with the left hand only, the right hand only, and both hands, respectively. For example, m (e.g., 33) images may be acquired. Step 2: hold a 2-meter selfie stick and collect 2m images in the manner of step 1. Step 3: hold a 3-meter selfie stick and collect 3m images in the manner of step 1. Step 4: manually perform semantic segmentation on all images collected in steps 1 to 3, and annotate the positions and category attributes of the portrait and the selfie stick. The category attributes may include, for example, background, portrait, and selfie stick.
Further, the portrait segmentation data may include two parts: one part is the positions and category attributes of the portraits manually segmented when constructing the selfie stick dataset; the other part is an open-source portrait dataset. For the open-source portrait dataset, reference may be made to the related art, and details are not described here.
Illustratively, as shown in fig. 9, the semantic segmentation model includes an encoder for downsampling and a decoder for upsampling. The input data of the semantic segmentation model are the constructed selfie stick dataset and portrait segmentation dataset. In some embodiments, assume that the original image has a resolution of 5000 × 4000; to ensure that the image is processed without loss, the receptive field may be expanded by 32 pixels through a mirroring operation, so that the resolution of the original image is expanded to 5032 × 4032. As shown in fig. 10, the encoder performs five downsampling operations on the original image, obtaining pool1, pool2, pool3, pool4, and pool5, respectively. The numbers of convolution channels at the five stages are 64, 128, 256, 384, and 256, respectively; the convolution kernel is 3 × 3 and the activation function is ReLU. As also shown in fig. 10, after the encoder obtains pool5 from the five downsampling operations, pool5 is magnified 2 times by upsampling to obtain upsampled1. Then, upsampled1 is added to the fourth downsampling result pool4 to obtain upsampled2, and upsampled2 is magnified 2 times by upsampling to obtain upsampled3. Upsampled3 is then added to the third downsampling result pool3 to obtain upsampled4. Finally, upsampled4 is connected to a normalized exponential function (softmax) to obtain the segmentation result of the pixel points in the original image.
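A loose PyTorch sketch of this FCN-style encoder-decoder follows. The 1 × 1 "score" convolutions are an added assumption needed to make the element-wise additions between the upsampled results and the pooling results dimensionally valid, and the three output classes (background, portrait, selfie stick) follow the category attributes described earlier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def down_stage(in_ch, out_ch):
    # 3x3 convolution + ReLU followed by 2x2 max pooling (one downsampling step)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class SegEncoderDecoder(nn.Module):
    """Sketch of the described encoder-decoder; not the patented implementation."""

    def __init__(self, num_classes=3):
        super().__init__()
        chans = [64, 128, 256, 384, 256]            # per-stage channel counts from the text
        ins = [3] + chans[:-1]
        self.stages = nn.ModuleList(down_stage(i, o) for i, o in zip(ins, chans))
        # 1x1 convolutions mapping pool3/pool4/pool5 to class scores (assumption).
        self.score5 = nn.Conv2d(chans[4], num_classes, 1)
        self.score4 = nn.Conv2d(chans[3], num_classes, 1)
        self.score3 = nn.Conv2d(chans[2], num_classes, 1)

    def forward(self, x):
        size_in = x.shape[-2:]
        pools = []
        for stage in self.stages:                   # pool1 .. pool5
            x = stage(x)
            pools.append(x)
        up1 = F.interpolate(self.score5(pools[4]), size=pools[3].shape[-2:])  # 2x upsampled1
        up2 = up1 + self.score4(pools[3])                                     # add pool4
        up3 = F.interpolate(up2, size=pools[2].shape[-2:])                    # upsampled3
        up4 = up3 + self.score3(pools[2])                                     # add pool3
        logits = F.interpolate(up4, size=size_in)                             # back to input size
        return F.softmax(logits, dim=1)             # per-pixel segmentation result
```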
When the softmax function is used as the activation function of the output node, cross entropy is typically used as the loss function. In the embodiment of the application, the cross entropy may be calculated between the segmentation result and the ground-truth map (GT); the GT serves as the supervision information, meaning that the classification result is obtained pixel by pixel, with each pixel point taken as a unit.
It should be noted that the object segmentation algorithm used in the embodiment of the present application is not limited to the semantic segmentation model described in the above embodiment; other object segmentation algorithms may also be used, such as feature-encoding-based VGGNet or ResNet, region-selection-based region-based convolutional neural networks (R-CNN), conventional region-growing algorithms, and edge-detection-based segmentation algorithms, which is not limited in the present application.
For example, as shown in fig. 8, the mobile phone may perform a portrait segmentation on the first reference image by using a portrait segmentation algorithm to obtain a second image of the first person in the first reference image.
It should be noted that, in the embodiment of the present application, the portrait segmentation algorithm may be, for example, a semantic segmentation algorithm, and the semantic segmentation algorithm may be obtained through the semantic segmentation model training described above. It should be understood that, when a semantic segmentation algorithm capable of performing portrait segmentation is obtained by training the semantic segmentation model, the input data are the portrait segmentation data described above. For the training manner, reference may be made to the above embodiments, and details are not repeated here.
In some embodiments, the reference images in the target pose corresponding to the first pose selection item include N reference images of the M reference images, i.e., the target pose corresponding to the first pose selection item corresponds to N reference images. In this case, the mobile phone may perform portrait segmentation on each of the N reference images by using the portrait segmentation algorithm to obtain the second image of the first person in each reference image.
Still as shown in fig. 8, the mobile phone performs first pose estimation on the first image of the first person in the first target image to obtain a first pose UV map of the first person and a first image UV map of the first person. It should be understood that the pose UV map refers to a UV map obtained by performing UV processing on each key point of the first person, and the image UV map refers to a UV map obtained by performing UV processing on the first image (i.e., the RGB image) of the first person.
Correspondingly, aiming at each reference image, the mobile phone carries out first posture estimation on the second image of the first person in each reference image respectively to obtain a second posture UV image of the first person in each reference image and a second image UV image of the first person.
Subsequently, the mobile phone can respectively calculate the similarity between the second posture UV image and the first posture UV image in each reference image, so that the reference image with the maximum similarity between the target posture and the first target posture is selected from the N reference images to serve as the first reference image.
In some embodiments, the similarity between the second-posture UV map and the first-posture UV map may be represented by a distance (euclidean distance) between respective key points in the second-posture UV map and the first-posture UV map. For example, the smaller the distance between each key point in the second posture UV map and the first posture UV map is, the higher the similarity between the second posture UV map and the first posture UV map is; conversely, the greater the distance between the respective key points in the second posture UV map and the first posture UV map, the lower the similarity between the second posture UV map and the first posture UV map.
For example, the distance between the second pose UV map and the first pose UV map in the reference image satisfies the following formula:

d(x, y) = \sqrt{\sum_{i=1}^{z} (x_i - y_i)^2}

wherein d(x, y) represents the distance between the second pose UV map and the first pose UV map; x_i represents the i-th dense point of the first pose UV map; y_i represents the i-th dense point of the second pose UV map; and z represents the number of dense points.
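A small sketch of this similarity computation and of selecting the first reference image, assuming the dense points of each pose UV map are available as NumPy arrays of shape (z, 2):

```python
import numpy as np

def pose_distance(first_pose_uv: np.ndarray, second_pose_uv: np.ndarray) -> float:
    """Euclidean distance between corresponding dense points of two pose UV maps.

    Both inputs are assumed to have shape (z, 2); a smaller distance means a
    higher similarity between the two poses.
    """
    return float(np.sqrt(np.sum((first_pose_uv - second_pose_uv) ** 2)))

def select_first_reference(first_pose_uv, reference_pose_uvs):
    """Return the index of the reference image whose pose is most similar."""
    distances = [pose_distance(first_pose_uv, ref) for ref in reference_pose_uvs]
    return int(np.argmin(distances))    # minimum distance = maximum similarity
```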
In some embodiments, as also shown in fig. 8, the mobile phone performs pose migration on the first pose UV map of the first person in the first target image and the first image UV map of the first person by using the second pose UV map of the first person in the first reference image and the second image UV map of the first person to obtain a third pose UV map of the first person and a third image UV map of the first person.
For example, the mobile phone may input the second pose UV map of the first person and the second image UV map of the first person in the first reference image, together with the first pose UV map of the first person and the first image UV map of the first person in the first target image, into the pose migration model, so as to obtain the third pose UV map of the first person and the third image UV map of the first person. The pose migration model is obtained by training a synthesis network. The synthesis network may be, for example, VGGNet, ResNet, a convolutional neural network (CNN), or a recurrent neural network (RNN), which is not limited in this embodiment of the application.
In some embodiments, the synthesis network is trained based on the input data and the output data to obtain the pose migration model. The input data are the person pose UV map and the person image UV map in the selfie image, and the output data are the person pose UV map and the person image UV map in the other-shot image. In the embodiment of the application, when the pose migration model is trained on the input data and the output data, the Adam optimizer is used with a learning rate (Lr) of 0.001, the learning rate is reduced by a factor of 10 every 20 rounds (epochs), 60 epochs are iterated in total, and the training is performed on an NVIDIA RTX A5000 graphics card. An epoch represents one full forward and backward pass of the data through the synthesis network.
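A minimal training-loop sketch reflecting these hyperparameters (Adam, Lr 0.001, learning rate divided by 10 every 20 epochs, 60 epochs in total); the model call signature, data loader, and loss function are hypothetical placeholders:

```python
import torch

def train_pose_migration(model, loader, loss_fn, epochs=60):
    """Train the pose migration model with the schedule described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    for _ in range(epochs):                      # one epoch = one full pass of the data
        for selfie_img_uv, selfie_pose_uv, other_img_uv, other_pose_uv in loader:
            pred_pose_uv, pred_img_uv = model(selfie_img_uv, selfie_pose_uv)
            loss = loss_fn(pred_pose_uv, pred_img_uv, other_pose_uv, other_img_uv)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                         # StepLR: lr divided by 10 every 20 epochs
```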
It should be noted that the selfie image is an image captured by the photographer while holding the selfie stick, and the selfie image includes part of the selfie stick; the other-shot image is an image of the photographer captured without holding the selfie stick, and the selfie stick is not included in the other-shot image. In the embodiment of the application, the pose of the person in the selfie image is holding the selfie stick with one hand and crossing the waist with the other hand; the pose of the person in the other-shot image is making a "Yeah" gesture with one hand and crossing the waist with the other hand.
For example, the pose migration dataset (including selfie images and other-shot images) may be collected on a mobile electronic device (such as a mobile phone) by using different types of selfie sticks to capture images of different persons in different scenes. A selfie image and its corresponding other-shot image form a data pair; in each data pair, the scene, the person, the selfie stick type, and the selfie stick length are the same; the difference is that the selfie image includes the selfie stick while the other-shot image does not, and the person poses are different.
For example, a tripod and a selfie stick may be combined so that the selfie image and the other-shot image are captured at the same angle and the same position. A manner of acquiring the pose migration dataset is provided below. Step 1: in the same scene, hold the selfie stick with one hand (switching among the left hand, the right hand, and both hands) and change the gesture of the other hand (such as behind the back, a "Yeah" gesture, a five-finger gesture, and the like) to collect selfie images; then rotate the selfie stick to one side, keep the position of the person unchanged, and change the state of both hands (such as both hands at the sides of the body, both hands crossing the waist, or one hand crossing the waist and one hand making a "Yeah" gesture) to collect other-shot images. Step 2: change the length of the selfie stick in the scene of step 1, and collect selfie images and other-shot images in the manner of step 1. Step 3: change the scene and the type of the selfie stick, and collect selfie images and other-shot images under different scenes and different selfie stick lengths in the manners of step 1 and step 2. Step 4: change the person, and collect selfie images and other-shot images of different persons, different scenes, and different selfie stick lengths in the manners of steps 1 to 3, thereby completing the collection of all pose migration data.
Illustratively, as shown in FIG. 11, the synthesis network includes an encoder for downsampling and a decoder for upsampling. The input data of the synthesis network are the selfie data in the constructed pose migration dataset, and the output data are the other-shot data in the constructed pose migration dataset. In some embodiments, the encoder includes six convolutional layers, the convolution kernel is 2 × 2, the max pooling (maxpool) is 2 × 2, and the activation function is ReLU. The numbers of channels of the encoder's convolutional layers are 64, 128, 256, 512, and 1024, respectively, and the third convolutional layer in the encoder has a skip connection to the fifth convolutional layer. The input of the decoder is the output of the encoder; the decoder includes five convolutional layers, the convolution kernel is 2 × 2, and the activation function is ReLU. The numbers of channels of the decoder's convolutional layers are 512, 256, 128, 64, and 2, respectively. The output of the decoder is the person pose UV map and the person image UV map in the other-shot image.
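The following is a loose sketch of such a synthesis network. Several details not fixed by the description are filled in as assumptions: the input channel count, the channel count of the sixth encoder layer (the text lists five values for six layers), how the third-to-fifth-layer skip connection is realised (here, a resized channel concatenation), and the use of 2× upsampling before each decoder convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynthesisNetSketch(nn.Module):
    """Rough sketch of the described synthesis network, not the patented model.

    The last decoder layer has 2 output channels, standing for the predicted
    person pose UV map and person image UV map of the other-shot image.
    """

    def __init__(self, in_ch=6):                      # input channel count is an assumption
        super().__init__()
        enc_ch = [64, 128, 256, 512, 1024, 1024]      # six encoder layers (last value repeated)
        dec_ch = [512, 256, 128, 64, 2]               # five decoder layers
        self.enc = nn.ModuleList()
        prev = in_ch
        for i, ch in enumerate(enc_ch):
            if i == 4:                                # the fifth layer also receives the skip features
                prev += enc_ch[2]
            self.enc.append(nn.Conv2d(prev, ch, kernel_size=2))
            prev = ch
        self.dec = nn.ModuleList(nn.Conv2d(i, o, kernel_size=2)
                                 for i, o in zip([prev] + dec_ch[:-1], dec_ch))

    def forward(self, x):
        feats = []
        for i, conv in enumerate(self.enc):
            if i == 4:                                # skip connection from the third encoder layer
                skip = F.interpolate(feats[2], size=x.shape[-2:])
                x = torch.cat([x, skip], dim=1)
            x = F.max_pool2d(F.relu(conv(x)), 2)      # 2x2 conv + ReLU + 2x2 max pooling
            feats.append(x)
        for j, conv in enumerate(self.dec):
            x = F.interpolate(x, scale_factor=2)      # upsampling decoder
            x = conv(x)
            if j < len(self.dec) - 1:
                x = F.relu(x)
        return x                                      # 2-channel pose UV / image UV prediction
```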
It should be noted that, when the synthesis network is VGGNet, the synthesis network extracts the features output by the third, fourth, and fifth convolutional layers as guidance features for the other-shot image and inputs them into the sixth convolutional layer of the encoder.
In some embodiments, a loss function may be configured in the synthesis network; the loss function is used to evaluate the degree of inconsistency between the predicted value output by the synthesis network for a single sample and the true value. The loss function is a non-negative real-valued function: during model training, the smaller the error, the smaller the function value of the loss function and the faster the model converges. The function value of the loss function directly affects the prediction performance of the model, and the smaller the function value of the loss function, the better the prediction performance of the model.
Illustratively, the loss functions configured in the synthesis network include an identity loss function Lidt, a reconstruction loss function L1, and a perceptual loss function Lp; wherein Psrc represents the person pose UV map in the selfie image, Ptar represents the person pose UV map in the other-shot image, Isrc represents the person image UV map in the selfie image, and Itar represents the person image UV map in the other-shot image. The perceptual loss is computed on the combined features output by the fourth, eighth, and 27th convolutional layers.
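The formula images in the published text are not reproduced here; the sketch below shows one common way such identity, reconstruction, and perceptual losses could be computed, consistent with the symbol definitions above. The synthesis-network call signature and the feature extractor (standing for the combined features of the 4th, 8th, and 27th convolutional layers) are assumptions:

```python
import torch.nn.functional as F

def synthesis_losses(model, feat_extractor, P_src, I_src, P_tar, I_tar,
                     w_idt=1.0, w_rec=1.0, w_per=1.0):
    """Plausible identity / reconstruction / perceptual losses (assumed forms).

    model(image_uv, source_pose_uv, target_pose_uv) is assumed to return the
    synthesised person image UV map in the target pose.
    """
    I_idt = model(I_src, P_src, P_src)       # re-synthesise the selfie image in its own pose
    I_gen = model(I_src, P_src, P_tar)       # synthesise the other-shot image in the target pose

    L_idt = F.l1_loss(I_idt, I_src)                                   # identity loss Lidt
    L_rec = F.l1_loss(I_gen, I_tar)                                   # reconstruction loss L1
    L_per = F.l1_loss(feat_extractor(I_gen), feat_extractor(I_tar))   # perceptual loss Lp
    return w_idt * L_idt + w_rec * L_rec + w_per * L_per
```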
It should be noted that, in the embodiment of the present application, the electronic device 100 (such as a mobile phone) described in the embodiment of the present application may be used to train the semantic segmentation model and the pose migration model, or may also be used to train the semantic segmentation model and the pose migration model by using other devices, servers, and the like with a model training function, which is not limited in the embodiment of the present application.
With reference to the foregoing embodiment, as shown in fig. 8, the mobile phone performs second pose estimation on the UV map of the third pose of the first person and the UV map of the third image of the first person to obtain a third image of the first person. Illustratively, the second pose estimate is used for performing a backward calculation based on the pose UV map and the image UV map, and outputting a third image of the first person.
In order to ensure that the image output after pose migration fits well with the first target image, in some embodiments, the mobile phone may further fuse the third image of the first person with the background image in the first target image to obtain the second target image. It should be understood that the selfie stick is not included in the second target image, and that the second target pose of the first person in the second target image is different from the first target pose of the first person in the first target image. For example, the first target pose is that the first person holds the selfie stick with one hand and crosses the waist with the other hand, while the second target pose is that the first person makes a "Yeah" gesture with one hand and crosses the waist with the other hand. Thus, by adopting the image processing method of the embodiment of the application, the problems of an empty hand area and an unnatural selfie pose after the selfie stick is removed can be solved.
Further, in order to increase the algorithm processing speed and reduce the power consumption of the device, in some embodiments, as shown in fig. 8, after the mobile phone performs object segmentation on the first target image, the mobile phone separates the head image of the first person in the first target image, where the first image of the first person in the first target image recognized by the mobile phone does not include the head image. Correspondingly, the mobile phone carries out first posture estimation on the first image of the first person, and the obtained first posture UV image of the first person and the first image UV image of the first person in the first target image do not comprise head images; subsequently, the mobile phone inputs the first pose UV image of the first person and the first image UV image of the first person in the first target image in the pose migration model, and the first image UV image of the first person does not include the head image. In this way, the third image of the first person finally obtained by the mobile phone is an image without the head of the first person.
On this basis, as shown in fig. 12, the mobile phone may perform fusion processing on the third image of the first person, the background image in the first target image, and the head image of the first person, so as to obtain the second target image.
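A minimal compositing sketch of this fusion step, assuming aligned boolean masks for the body and head regions; boundary blending and hole filling, which a real implementation would need, are omitted:

```python
import numpy as np

def fuse_second_target_image(body_image, body_mask, head_image, head_mask, background):
    """Composite the pose-migrated body, the original head, and the background.

    All images are assumed to be H x W x 3 arrays and the masks H x W boolean
    arrays aligned with the first target image.
    """
    result = background.copy()
    result[body_mask] = body_image[body_mask]   # third image of the first person (no head)
    result[head_mask] = head_image[head_mask]   # original head image of the first person
    return result
```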
It should be noted that, in the above embodiment, an example is given in which after the mobile phone captures the first target image in the photographing mode of the camera application, the first reference image is adopted to perform posture migration on the first target image, and the second target image is obtained. In the embodiment of the application, the mobile phone can also perform posture migration on the first target image stored in the mobile phone by using the first reference image to obtain the second target image.
For example, the mobile phone may perform gesture migration on an image stored in a gallery application or an image received in an instant messaging application; or, the mobile phone may also perform pose migration on images acquired in other scenes, which is not limited in the embodiment of the present application.
Taking the case where the mobile phone performs gesture migration on an image stored in the gallery application as an example, as shown in (1) in fig. 13, the mobile phone displays an interface 410, where the interface 410 is an interface of the gallery application of the mobile phone displaying the first target image, and the interface 410 further includes a preset editing control 420. In response to the operation of the editing control in the interface 410, the mobile phone displays an interface 420 (or second interface) as shown in (2) in fig. 13, which includes a plurality of gesture selection items, such as a default gesture selection item, a pos 1 gesture selection item, a pos 2 gesture selection item, and a custom gesture selection item. For the illustration of the gesture selection items, reference may be made to the above embodiments, and details are not repeated here.
On this basis, in response to the selection operation of the first posture selection item (e.g., pos 1), the mobile phone performs posture migration on the first target image by using the target posture corresponding to the pos 1, and displays an interface 430 as shown in (3) in fig. 13, where the interface 430 includes the second target image. In some embodiments, the interface 430 further includes a "save" control and a "cancel" control, and the cell phone saves the second target image in the gallery application of the cell phone in response to an operation on the "save" control; and the mobile phone responds to the operation of the cancel control, and stores the first target image in the gallery application of the mobile phone.
In some embodiments, in combination with the interface diagram shown in fig. 13, as shown in fig. 14, the mobile phone starts to remove the selfie stick function in response to the operation on the preset editing control; then, when the mobile phone recognizes that the first target image comprises the selfie stick, the mobile phone displays a plurality of gesture options. After the user selects a first posture selection item in the posture selection items, the mobile phone performs posture migration on the first target image by adopting a target posture corresponding to the first posture selection item to obtain a second target image.
Further, when the mobile phone identifies that the first target image does not include the selfie stick, the mobile phone can prompt the user to select an image with the selfie stick through the second prompt message. The second prompt message may be a voice prompt message or a text prompt message.
It should be noted that, for the specific implementation process of performing the gesture migration on the first target image by using the target gesture corresponding to the first gesture selection item for the mobile phone to obtain the second target image, reference may be made to fig. 8 and the foregoing embodiments, and details are not repeated here.
In addition, in the above embodiment, it is exemplified that the mobile phone displays a plurality of gesture selection items, and the user selects a target gesture corresponding to the first gesture selection item from the plurality of gesture selection items to perform gesture transition on the first target image. Correspondingly, during actual implementation, the mobile phone can also receive semantic information input by a user to perform posture migration on the first target image. Such as: the semantic information input by the user is 'remove the selfie stick and cross the waist with both hands'. In this case, the specific implementation manner of the method may refer to the illustration in the foregoing embodiments, and details are not repeated here.
Further, in the embodiment of the present application, the first object is described by taking a selfie stick as an example; of course, the first object may also be a book, a newspaper, a water cup, and the like, and these objects may likewise affect the image effect of the selfie image. On this basis, the contents described in the embodiments of the present application may be used to explain the technical solutions in the other embodiments of the present application, and the technical features described in the embodiments may also be applied to the other embodiments and combined with the technical features of the other embodiments to form new solutions.
The embodiment of the application provides an electronic device, which can comprise a display screen, a memory and one or more processors; the display screen is used for displaying images acquired by the plurality of cameras or images generated by the processor; the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the functions or steps performed by the handset in the embodiments described above. The structure of the electronic device may refer to the structure of the electronic device 100 shown in fig. 4.
An embodiment of the present application further provides a chip system, as shown in fig. 15, the chip system 1800 includes at least one processor 1801 and at least one interface circuit 1802. The processor 1801 may be the processor 110 shown in fig. 4 in the foregoing embodiment. The interface circuit 1802 may be, for example, an interface circuit between the processor 110 and an external memory; or an interface circuit between the processor 110 and the internal memory 121.
The processor 1801 and the interface circuit 1802 may be interconnected by wires. For example, the interface circuit 1802 may be used to receive signals from other devices (e.g., a memory of an electronic device). Also for example, the interface circuit 1802 may be used to send signals to other devices, such as the processor 1801. Illustratively, the interface circuit 1802 may read instructions stored in the memory and send the instructions to the processor 1801. The instructions, when executed by the processor 1801, may cause the electronic device to perform the steps performed by the mobile phone in the embodiments described above. Of course, the chip system may further include other discrete devices, which is not specifically limited in this embodiment of the present application.
Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium includes computer instructions, and when the computer instructions are executed on an electronic device, the electronic device is caused to perform various functions or steps performed by a mobile phone in the foregoing method embodiments.
The embodiment of the present application further provides a computer program product, which when running on a computer, causes the computer to execute each function or step executed by the mobile phone in the above method embodiments.
Through the description of the above embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or multiple physical units, that is, may be located in one place, or may be distributed in multiple different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributed to by the prior art, or all or part of the technical solutions may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a variety of media that can store program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An image processing method is applied to an electronic device, wherein M reference images are stored in the electronic device, the M reference images comprise images of at least one person in multiple postures, the M reference images are not images shot by a first object held by the corresponding person, and M is an integer greater than 1; the method comprises the following steps:
the electronic equipment displays a first target image; wherein the first target image comprises an image of a first person holding the first object in a first target pose;
the electronic device selects a first reference image from the M reference images; wherein the first reference image comprises an image of a corresponding person in a second target pose, the second target pose being different from the first target pose;
the electronic equipment performs attitude migration on the first target image by adopting the first reference image to obtain a second target image; wherein the second target image includes an image of the first person in the second target pose, the second target image not including an image of the first person holding the first object.
2. The method of claim 1, wherein the first object comprises a selfie stick.
3. The method according to claim 1 or 2, wherein the second target gesture is a default gesture preset in the electronic device; or,
the electronic device selects a first reference image from the M reference images, including:
the electronic equipment displays a first interface; the first interface comprises a plurality of gesture selection items, and each gesture selection item in the plurality of gesture selection items corresponds to a target gesture;
the electronic equipment responds to operation of a first posture selection item in the plurality of posture selection items, and selects a reference image corresponding to a target posture of the first posture selection item from the M reference images as the first reference image;
and the target posture corresponding to the first posture selection item is the second target posture.
4. The method of claim 3, wherein the M reference pictures comprise: n reference images, wherein the N reference images are reference images under the target posture corresponding to the first posture selection item, N is an integer larger than 1, and N is smaller than or equal to M;
the selecting, from the M reference images, a reference image at a target pose corresponding to the first pose selection item as the first reference image includes:
the electronic equipment selects a reference image with the maximum similarity between a target pose and the first target pose from the N reference images as the first reference image.
5. The method of claim 3, wherein the first pose selection item is used to customize a target pose;
the selecting, from the M reference images, a reference image at a target pose corresponding to the first pose selection item as the first reference image includes:
the electronic equipment displays a second interface; the second interface is used for indicating a user to draw a target gesture;
and the electronic equipment receives a target gesture drawn by a user on the second interface, and selects a reference image with the maximum similarity between the target gesture and the drawn target gesture from the M reference images as the first reference image.
6. The method of any of claims 1-5, wherein the first target image is acquired by the electronic device in response to a capture instruction;
wherein the electronic device selects a first reference image from the M reference images, comprising:
in response to the photographing instruction, the electronic device selects the first reference image from the M reference images if it is determined that the first target image includes an image of the first person held in the first target pose with a selfie stick.
7. The method of any of claims 1-5, wherein the electronic device displays a first target image, comprising:
the electronic equipment displays the first target image in a gallery application, or the electronic equipment displays the first target image in an instant messaging application;
wherein the electronic device selects a first reference image from the M reference images, comprising:
in response to a preset editing operation of the first target image by a user, the electronic equipment selects the first reference image from the M reference images.
8. The method of any one of claims 1-7, wherein the electronic device performs pose migration on the first target image using the first reference image to obtain a second target image, comprising:
the electronic equipment performs object segmentation on the first target image to remove an image of the first object in the first target image and identify a first image of the first person in the first target image;
the electronic equipment carries out portrait segmentation on the first reference image so as to obtain a second image of the first person in the first reference image;
the electronic equipment carries out first posture estimation on the first image of the first person and the second image of the first person to obtain a first posture UV picture of the first person, a first image UV picture of the first person, a second posture UV picture of the first person and a second image UV picture of the first person;
the electronic equipment performs posture migration on the first posture UV image of the first person and the first image UV image of the first person by adopting the second posture UV image of the first person and the second image UV image of the first person to obtain a third posture UV image of the first person and a third image UV image of the first person;
the electronic equipment carries out second posture estimation on the third posture UV image of the first person and the third image UV image of the first person to obtain a third image of the first person;
and the electronic equipment performs fusion processing on the third image of the first person and the background image in the first target image to obtain the second target image.
9. An electronic device, comprising: a display screen, a memory, and one or more processors; the display screen, the memory, and the processor are coupled;
the display screen is used for displaying the image generated by the processor; the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any of claims 1-8.
10. A computer-readable storage medium comprising computer instructions; the computer instructions, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-8.
CN202210927878.4A 2022-08-03 2022-08-03 Image processing method, electronic equipment and readable storage medium Active CN115423752B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210927878.4A CN115423752B (en) 2022-08-03 2022-08-03 Image processing method, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210927878.4A CN115423752B (en) 2022-08-03 2022-08-03 Image processing method, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN115423752A true CN115423752A (en) 2022-12-02
CN115423752B CN115423752B (en) 2023-07-07

Family

ID=84195895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210927878.4A Active CN115423752B (en) 2022-08-03 2022-08-03 Image processing method, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115423752B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105578037A (en) * 2015-12-16 2016-05-11 天津优爱创科技有限公司 Photographing device with auxiliary posture function based on wireless communication
CN108234888A (en) * 2018-03-14 2018-06-29 维沃移动通信有限公司 A kind of image processing method and mobile terminal
US20210065418A1 (en) * 2019-08-27 2021-03-04 Shenzhen Malong Technologies Co., Ltd. Appearance-flow-based image generation
CN113364971A (en) * 2020-03-07 2021-09-07 华为技术有限公司 Image processing method and device
US20220159189A1 (en) * 2018-11-28 2022-05-19 SZ DJI Technology Co., Ltd. Handheld gimbal and shooting control method for handheld gimbal
CN114638744A (en) * 2022-03-03 2022-06-17 厦门大学 Human body posture migration method and device
WO2022148379A1 (en) * 2021-01-05 2022-07-14 百果园技术(新加坡)有限公司 Image processing method and apparatus, electronic device, and readable storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105578037A (en) * 2015-12-16 2016-05-11 天津优爱创科技有限公司 Photographing device with auxiliary posture function based on wireless communication
CN108234888A (en) * 2018-03-14 2018-06-29 维沃移动通信有限公司 A kind of image processing method and mobile terminal
US20220159189A1 (en) * 2018-11-28 2022-05-19 SZ DJI Technology Co., Ltd. Handheld gimbal and shooting control method for handheld gimbal
US20210065418A1 (en) * 2019-08-27 2021-03-04 Shenzhen Malong Technologies Co., Ltd. Appearance-flow-based image generation
CN113364971A (en) * 2020-03-07 2021-09-07 华为技术有限公司 Image processing method and device
WO2021179773A1 (en) * 2020-03-07 2021-09-16 华为技术有限公司 Image processing method and device
WO2022148379A1 (en) * 2021-01-05 2022-07-14 百果园技术(新加坡)有限公司 Image processing method and apparatus, electronic device, and readable storage medium
CN114638744A (en) * 2022-03-03 2022-06-17 厦门大学 Human body posture migration method and device

Also Published As

Publication number Publication date
CN115423752B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN112529784B (en) Image distortion correction method and device
WO2022042776A1 (en) Photographing method and terminal
CN110035141A (en) A kind of image pickup method and equipment
WO2022001806A1 (en) Image transformation method and apparatus
CN112562019A (en) Image color adjusting method and device, computer readable medium and electronic equipment
CN107580209A (en) Take pictures imaging method and the device of a kind of mobile terminal
CN111385514B (en) Portrait processing method and device and terminal
CN105427369A (en) Mobile terminal and method for generating three-dimensional image of mobile terminal
WO2021179804A1 (en) Image processing method, image processing device, storage medium, and electronic apparatus
CN112085647B (en) Face correction method and electronic equipment
CN112954251B (en) Video processing method, video processing device, storage medium and electronic equipment
CN113705665B (en) Training method of image transformation network model and electronic equipment
CN114429495B (en) Three-dimensional scene reconstruction method and electronic equipment
CN113965694A (en) Video recording method, electronic device and computer readable storage medium
CN113325948B (en) Air-isolated gesture adjusting method and terminal
CN115526983A (en) Three-dimensional reconstruction method and related equipment
WO2022083118A1 (en) Data processing method and related device
CN112528760B (en) Image processing method, device, computer equipment and medium
CN111107357A (en) Image processing method, device and system
CN113850709A (en) Image transformation method and device
CN110135329B (en) Method, device, equipment and storage medium for extracting gestures from video
CN112381749A (en) Image processing method, image processing device and electronic equipment
CN113536834A (en) Pouch detection method and device
CN117132515A (en) Image processing method and electronic equipment
CN115423752B (en) Image processing method, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant