CN113099150A - Image processing method, device and system


Publication number: CN113099150A
Authority: CN (China)
Prior art keywords: image, user, images, frame, data packet
Legal status: Granted
Application number: CN202010018738.6A
Other languages: Chinese (zh)
Other versions: CN113099150B (en)
Inventors: 梁运恺, 高扬, 叶威威
Current Assignee: Huawei Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202010018738.6A (granted as CN113099150B)
Priority to PCT/CN2021/070579 (published as WO2021139706A1)
Publication of CN113099150A
Application granted; publication of CN113099150B
Legal status: Active

Classifications

    • H04N7/14 Television systems; Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
    • H04N21/4307 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/4788 Supplemental services, e.g. displaying phone caller identification, shopping application, communicating with other users, e.g. chatting
    • H04N21/8547 Content authoring involving timestamps for synchronizing content


Abstract

The present application provides an image processing method, device, and system. The method includes: acquiring a first frame face image of a user, the first frame face image including a plurality of facial organ images; acquiring a plurality of first images matching the plurality of facial organ images; and sending a data packet of the user's first frame face image to a receiving end, where the data packet includes indexes of the plurality of first images and the indexes are used to acquire the plurality of first images. This reduces the requirement on network bandwidth; that is, a good video effect can still be ensured even when the network transmission bandwidth is limited.

Description

Image processing method, device and system
Technical Field
The present application relates to the field of video technologies, and in particular, to a method, a device, and a system for image processing.
Background
At present, a video call is a more effective remote communication mode than a voice call: in addition to voice, it conveys information such as body movements and facial expressions, enabling deeper communication between the two parties.
The traditional video mode is a live-action video mode: the local terminal uses a camera to capture, in real time, picture frames of the participants, the background, and so on, generates a video stream, and transmits the video stream over the network to the remote terminal for display. However, a high-resolution video stream places high demands on network transmission bandwidth, making real-time, high-quality video calls difficult in the traditional mode. In a poor network environment, the video picture may suffer packet loss, screen corruption, and similar artifacts. In short, when network transmission bandwidth is limited, the traditional video mode yields a poor video call effect and degrades the user experience.
Disclosure of Invention
The present application provides an image processing method, device, and system to reduce the requirement on network transmission bandwidth and thereby improve the video call effect and the user experience.
In a first aspect, the present application provides an image processing method, including: acquiring a first frame face image of a user, the first frame face image including a plurality of facial organ images; acquiring a plurality of first images matching the plurality of facial organ images; and sending a data packet of the user's first frame face image to a receiving end, where the data packet includes indexes of the plurality of first images and the indexes are used to acquire the plurality of first images.
In the present application, the sending end does not need to send the user's first frame face image to the receiving end; it only needs to send a data packet containing the indexes of the plurality of first images. This reduces the requirement on network bandwidth; that is, a good video effect can still be ensured even when the network transmission bandwidth is limited.
Optionally, the plurality of facial organ images are images of the user's real facial organs, and the plurality of first images are images of virtual facial organs for the user. Because the first images depict virtual rather than real facial organs, the user's personal privacy is protected, which broadens the range of scenarios to which the technical solution of the present application applies.
Optionally, acquiring the plurality of first images matching the plurality of facial organ images includes: for each of the plurality of facial organ images, comparing the facial organ image with a standard organ image corresponding to that facial organ image to determine a first difference value; and acquiring, according to the first difference value, a first image matching the facial organ image, where the second difference value between that first image and the standard organ image and the first difference value satisfy a first condition. In this way, the plurality of first images matching the plurality of facial organ images can be acquired effectively.
Optionally, the method further includes sending, by the sending end, at least one audio data packet to the receiving end, where the timestamps of the audio data packets match the timestamp of the data packet of the user's first frame face image. On this basis, the user experiences synchronized hearing and vision.
Optionally, the method further includes: acquiring a second frame face image of the user, where the second frame face image is earlier than the first frame face image; acquiring a plurality of second images matching the plurality of facial organ images of the second frame face image; and sending a data packet of the second frame face image to the receiving end, where the data packet includes indexes of the plurality of second images and the indexes are used to acquire the plurality of second images. The sending end does not need to send the second frame face image itself, only the data packet containing the indexes of the plurality of second images, which again reduces the requirement on network bandwidth; that is, a good video effect can still be ensured even when the network transmission bandwidth is limited.
Optionally, the method further includes: receiving indication information sent by the receiving end, the indication information instructing the sending end to send a face image earlier than the user's first frame face image, that is, to send a data packet of such an earlier face image. In other words, the sending end does not send earlier face images in all cases; it does so only when instructed, thereby reducing the consumption of communication resources.
In a second aspect, the present application provides an image processing method, including: receiving, from a sending end, a data packet of a first frame face image of a user, the data packet including indexes of a plurality of first images, the first frame face image including a plurality of facial organ images, and the plurality of first images matching the plurality of facial organ images; acquiring the plurality of first images; and generating a receiving-end first frame face image according to the plurality of first images. The sending end does not need to send the user's first frame face image itself, only the data packet containing the indexes of the plurality of first images, which reduces the requirement on network bandwidth; that is, a good video effect can still be ensured even when the network transmission bandwidth is limited.
Optionally, the method further includes: receiving at least one audio data packet from the sending end, where the timestamps of the audio data packets match the timestamp of the data packet of the user's first frame face image. On this basis, the user experiences synchronized hearing and vision.
Optionally, the method further includes: receiving, from the sending end, a data packet of a second frame face image of the user, where the second frame face image is earlier than the first frame face image, the data packet includes indexes of a plurality of second images, and the plurality of second images match the plurality of facial organ images contained in the second frame face image. The sending end does not need to send the second frame face image itself, only the data packet containing the indexes, which again reduces the requirement on network bandwidth.
Optionally, the method further includes: sending indication information to the sending end, the indication information instructing the sending end to send a face image earlier than the user's first frame face image. That is, the sending end sends earlier face images only upon receiving the indication information, rather than in all cases, thereby reducing the consumption of communication resources.
Optionally, the method further includes: if the receiving-end first frame face image has already been generated, discarding the data packet of the user's second frame face image without generating a receiving-end second frame face image, thereby reducing the power consumption of the receiving end.
Optionally, the method further includes: if a receiving-end third frame face image corresponding to the user's third frame face image has not been generated, where the user's third frame face image is earlier than the user's second frame face image, generating a receiving-end second frame face image according to the data packet of the user's second frame face image.
Optionally, when the user at the receiving end is in a video call with users at multiple sending ends simultaneously, the receiving end generates a video background image through AR/VR technology, so that the receiving-end first frame face images of the multiple users can be fused into one background scene, improving the user experience and interactivity.
The image processing apparatuses, terminal device, system, storage medium, and computer program product described below have the same effects as the methods above; these effects are not described repeatedly.
In a third aspect, the present application provides an image processing apparatus, including a first acquisition module, a second acquisition module, and a first sending module. The first acquisition module is configured to acquire a first frame face image of a user, the first frame face image including a plurality of facial organ images. The second acquisition module is configured to acquire a plurality of first images matching the plurality of facial organ images. The first sending module is configured to send a data packet of the user's first frame face image to a receiving end, where the data packet includes indexes of the plurality of first images and the indexes are used to acquire the plurality of first images.
In a fourth aspect, the present application provides an image processing apparatus, including a first receiving module, a first acquisition module, and a first generating module. The first receiving module is configured to receive, from a sending end, a data packet of a first frame face image of a user, the data packet including indexes of a plurality of first images, the first frame face image including a plurality of facial organ images, and the plurality of first images matching the plurality of facial organ images. The first acquisition module is configured to acquire the plurality of first images. The first generating module is configured to generate a receiving-end first frame face image according to the plurality of first images.
In a fifth aspect, the present application provides a terminal device, including a memory and a processor. The memory stores instructions executable by the processor to enable the processor to perform the method of any one of the first aspect, the second aspect, or their optional implementations.
In a sixth aspect, the present application provides a computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the first aspect, the second aspect, or their optional implementations.
In a seventh aspect, the present application provides a computer program product storing computer instructions for causing a computer to perform the method of any one of the first aspect, the second aspect, or their optional implementations.
In summary, the present application provides an image processing method, device, and system in which an image sample library is configured at both the sending end and the receiving end, and images are transmitted between the two ends by transmitting image indexes into the sample library. This reduces the bandwidth requirement on network transmission and thereby improves the video call effect and the user experience. Furthermore, the video scene is built on AR or VR technology, and rich expression and posture information is conveyed using virtual characters and the virtual video scene, which protects the user's personal privacy. Furthermore, when the user at the receiving end is in a video call with users at multiple sending ends simultaneously, the receiving end generates a video background image through AR/VR technology, so that the receiving-end first frame face images of the multiple users can be fused into one background scene, improving the user experience and interactivity.
Drawings
FIG. 1 is a diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a flowchart of an image processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an image processing process provided in an embodiment of the present application;
FIG. 4 is a flowchart of an image processing method according to another embodiment of the present application;
FIG. 5 is a schematic diagram of an audio packet sequence and a face-image packet sequence provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a first data packet and a first buffer queue according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for a receiving end to process data packets of face images according to an embodiment of the present application;
FIG. 8 is a schematic diagram of image processing according to an embodiment of the present application;
FIG. 9 is a schematic diagram of image processing according to another embodiment of the present application;
FIG. 10 is a schematic diagram of image processing according to yet another embodiment of the present application;
FIG. 11 is a schematic diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 12 is a schematic diagram of an image processing apparatus according to another embodiment of the present application;
FIG. 13 is a schematic diagram of a terminal device according to an embodiment of the present application;
FIG. 14 is a schematic diagram of an image processing system according to an embodiment of the present application.
Detailed Description
The traditional video mode is a live-action video mode: the local terminal uses a camera to capture, in real time, picture frames of the participants, the background, and so on, generates a video stream, and transmits the video stream over the network to the remote terminal for display. However, under limited network transmission bandwidth, the traditional video mode yields a poor video call effect and degrades the user experience. Further, the traditional video mode easily exposes personal privacy, such as clothing and appearance, the user's location, or mental state, so its application range is narrow.
To solve the above problems, the present application provides an image processing method, device, and system. The main idea of the present application is as follows: an image sample library is configured at both the sending end and the receiving end, and images are transmitted between the two ends by transmitting image indexes into the sample library, which reduces the bandwidth requirement on network transmission. Further, the video scene is built on Augmented Reality (AR) or Virtual Reality (VR) technology, and rich expression and posture information is conveyed using virtual characters and the virtual video scene.
To make the objects, technical solutions, and advantages of the present application clearer, the technical solutions of the embodiments of the present application are described below with reference to the accompanying drawings.
The technical solution of the embodiment of the present application can be applied to various communication systems, such as a third Generation (3G) mobile communication system, a fourth Generation (4G) mobile communication system, a fifth Generation (5G) mobile communication system, a New Radio (NR) or a Wireless Fidelity (WiFi) network.
For example, FIG. 1 is a system architecture diagram provided in an embodiment of the present application. As shown in FIG. 1, both the sending end 11 and the receiving end 12 are provided with cameras capable of image acquisition. The signaling plane between the sending end 11 and the receiving end 12 uses the Session Initiation Protocol (SIP), and the media plane uses the Real-time Transport Protocol (RTP) or the Real-time Transport Control Protocol (RTCP), so the sending end 11 sends data packets of face images to the receiving end 12 over RTP or RTCP. The sending end 11 may call a Real-Time Network (RTN) Software Development Kit (SDK) to send a data packet of a face image to the server 13 through the RTN, and the server 13 forwards the data packet to the receiving end 12. The receiving end 12 calls the RTN SDK to receive the data packet of the face image, parses it according to the RTP packet format, and implements 3D rendering of the image through a Graphics Processing Unit (GPU) or a Network Processing Unit (NPU) according to the parsed data. As shown in FIG. 1, the dotted box around the GPU/NPU indicates that the GPU/NPU resides inside the terminal device rather than being shown on its display screen. The terminal device may be a mobile phone or an AR/VR device, such as a VR headset or AR glasses.
It should be noted that the sending end and the receiving end may also transmit data without going through the server; that is, they may be directly connected. For example, the sending end calls the RTN SDK to send the data packet of the face image to the receiving end through the RTN, and the receiving end calls the RTN SDK to receive the data packet, parses it according to the RTP packet format, and implements 3D rendering of the image through the GPU or NPU according to the parsed data.
The technical scheme of the application is explained in detail as follows:
fig. 2 is a flowchart of an image processing method provided in an embodiment of the present application, where the method relates to a sending end and a receiving end, where the sending end and the receiving end may be two different terminal devices, for example, two different mobile phones, or the sending end is a mobile phone and the receiving end is an AR/VR device, or the sending end is an AR/VR device and the receiving end is a mobile phone, and the present application does not limit this. As shown in fig. 2, the method comprises the steps of:
step S201: the sending end acquires a first frame face image of a user, and the first frame face image of the user comprises a plurality of facial organ images.
Step S202: the transmitting end acquires a plurality of first images matched with the plurality of facial organ images.
Step S203: the sending end sends a data packet of a first frame face image of a user to the receiving end, wherein the data packet comprises indexes of a plurality of first images, and the indexes of the plurality of first images are used for acquiring the plurality of first images.
Step S204: the receiving end acquires a plurality of first images.
Step S205: the receiving end generates a first frame face image of the receiving end according to the plurality of first images.
The following description is made with reference to steps S201 to S203:
In a video call scenario, the sending end captures pictures of the user through its camera, such as a front-facing camera, obtaining multiple frames of face images. The "first frame face image of the user" here denotes the current frame face image, which may or may not be the literal first frame; "first" merely distinguishes it from the second frame face image mentioned below and has no further meaning. The plurality of facial organ images included in the user's first frame face image are images of the user's real facial organs. The facial organs may be coarse-grained, such as eyes, nose, mouth, and ears, or finer-grained, such as eyeballs, whites of the eyes, eyelashes, left and right nasal wings, and the bridge of the nose.
For a facial organ image, a first image matching it means that the facial organ features presented by the first image are similar to those presented by the facial organ image. For example, the first image may satisfy one of the following conditions: the difference between the first image and the facial organ image is minimal, or the absolute value of that difference is less than a preset threshold. Alternatively, let the difference between the facial organ image and its corresponding standard organ image be a first difference value, and the difference between the first image and the standard organ image be a second difference value; then either the difference between the second and first difference values is minimal, or its absolute value is less than a preset threshold. The standard organ image corresponding to a facial organ image is a standard image of that facial organ; for example, if the organ is an eye, the corresponding standard organ image is the standard image of an eye.
Optionally, the first image in this application is an image of a virtual facial organ for the user, i.e., a virtual facial organ image, which may be understood as, for example, an image of a cartoon character's facial organ or a celebrity's facial organ.
Optionally, the plurality of first images are acquired as follows. For each of the plurality of facial organ images, a first image is acquired whose difference from the facial organ image is minimal, or whose difference from the facial organ image has an absolute value smaller than a preset threshold that can be set according to the actual situation. For example, if the user's first frame face image is a picture of the user laughing, the plurality of facial organ images include images of eyebrows, squinted eyes, a nose, a raised mouth, and ears; for the squinted eyes, the image is compared with the eye images in the sample library, and the eye image with the smallest difference (or with a difference whose absolute value is below the preset threshold) is taken as the first image.
Alternatively, for each of the plurality of facial organ images, the facial organ image is compared with its corresponding standard organ image to determine a first difference value; a first image is then acquired according to the first difference value, such that the difference between the second difference value (between the first image and the standard organ image) and the first difference value is minimal, or its absolute value is smaller than a preset threshold that can be set according to the actual situation. Taking the laughing picture again as an example: for the image of the squinted eyes, a first difference value between that image and the standard eye image is determined, second difference values between the eye images in the sample library and the standard eye image are determined, and the eye image whose second difference value is closest to the first difference value (or within the preset threshold) is taken as the first image.
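To make the matching concrete, the following is a minimal Python sketch; the function names, the representation of images as NumPy arrays, and the use of the pixel-difference metrics described below are illustrative assumptions rather than the implementation prescribed by the present application:

    import numpy as np

    def pixel_difference(a, b):
        # Sum of absolute differences between corresponding pixels
        # (option 1 below; option 2 would use a sum of squares instead).
        return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

    def match_first_image(organ_image, candidates, standard_image=None):
        # Returns the position in `candidates` of the best-matching first image.
        if standard_image is None:
            # Variant 1: smallest difference from the facial organ image itself.
            scores = [pixel_difference(organ_image, c) for c in candidates]
        else:
            # Variant 2: second difference value closest to the first difference value.
            first_diff = pixel_difference(organ_image, standard_image)
            scores = [abs(pixel_difference(c, standard_image) - first_diff)
                      for c in candidates]
        return int(np.argmin(scores))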
The sending end may determine the first difference value between a facial organ image and its corresponding standard organ image in the following ways, but is not limited to them:
Option 1: the sending end obtains the pixel values of a plurality of first pixel points in the facial organ image and the pixel values of a plurality of second pixel points in each standard organ image in the sample library, where the first pixel points and the second pixel points correspond one to one. Then, for each standard organ image, the sending end computes the absolute value of the difference between the pixel values of each pair of corresponding first and second pixel points and adds all the absolute values to obtain the first difference value.
Option 2: the sending end obtains the pixel values of the first and second pixel points in the same way as in option 1. Then, for each standard organ image, the sending end computes the absolute value of the difference between the pixel values of each pair of corresponding first and second pixel points and computes the sum of squares of all the absolute values to obtain the first difference value.
Similarly, the sending end calculates the second difference value in the same way as the first difference value; details are not repeated here.
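A sketch of the two options, assuming each image is a NumPy array of pixel values whose corresponding pixel points occupy the same positions:

    import numpy as np

    def first_difference_option1(organ_image, standard_image):
        # Option 1: add the absolute values of all per-pixel differences.
        d = organ_image.astype(np.int64) - standard_image.astype(np.int64)
        return int(np.abs(d).sum())

    def first_difference_option2(organ_image, standard_image):
        # Option 2: sum of squares of the absolute values of the per-pixel differences.
        d = organ_image.astype(np.int64) - standard_image.astype(np.int64)
        return int((np.abs(d) ** 2).sum())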
The standard organ images and/or the first images may be in a local sample library at a transmitting end or in a sample library at a cloud end, which is not limited in the present application.
The indexes of the plurality of first images correspond one to one to the plurality of first images. Optionally, each index is a floating-point value, and the number of indexes of the plurality of first images is in the range [70, 312]. Optionally, each index is an integer value. Through an index, the receiving end can obtain the first image corresponding to that index from the sample library.
It should be noted that the first images may be stored in the sample library in the form of facial organ feature values. If the receiving end stores the feature values of the first images, the receiving end generates the receiving-end first frame face image according to the feature values corresponding to the plurality of first images.
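For illustration only, a payload carrying the indexes might be serialized as follows; the field layout is an assumption made for this sketch, not the packet format defined by the present application, which carries such payloads over RTP:

    import struct

    def pack_index_payload(timestamp, indexes):
        # 32-bit timestamp, 16-bit index count, then one 32-bit float per index
        # (the text above notes that each index may be a floating-point value).
        payload = struct.pack("!IH", timestamp, len(indexes))
        payload += b"".join(struct.pack("!f", idx) for idx in indexes)
        return payload

    def unpack_index_payload(payload):
        timestamp, count = struct.unpack_from("!IH", payload, 0)
        indexes = [struct.unpack_from("!f", payload, 6 + 4 * i)[0]
                   for i in range(count)]
        return timestamp, indexes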
The following will be explained with respect to step S204 and step S205:
FIG. 3 is a schematic diagram of an image processing process provided in an embodiment of the present application. As shown in FIG. 3, the receiving end stores, in a local sample library or a cloud sample library, the first images of the respective facial organs (e.g., eyes, mouth, nose, cheeks) together with their indexes (index 1, index 2, ..., index 70 in FIG. 3; the numerals do not mean the indexes are numbers, they merely distinguish the 70 indexes). On this basis, the receiving end can determine each first image from its index. For example, if the receiving end receives the index of the first image corresponding to squinted eyes, it determines the first image of the squinted eyes according to that index.
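A minimal sketch of this lookup, assuming the sample library is a mapping from index to the stored first image (or to its facial-organ feature values):

    def resolve_first_images(organ_indexes, sample_library):
        # organ_indexes: {organ name -> received index};
        # returns {organ name -> first image from the sample library}.
        return {organ: sample_library[idx] for organ, idx in organ_indexes.items()}

    # Hypothetical usage: the index received for the squinted eyes selects
    # the squinted-eye first image stored under that index.
    # face_parts = resolve_first_images({"eyes": 12, "mouth": 3}, sample_library)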
Option 1: after acquiring the plurality of first images, the receiving end renders them through a 3D model to generate the receiving-end first frame face image, which is a virtual image.
Option 2: the data packet of the user's first frame face image may fail to include the indexes of all facial organs, or some of its indexes may be lost in transmission. To handle this, the receiving end may also obtain data packets of at least one other frame face image of the user (the user's second frame face image is taken as an example below). The data packet of the second frame face image includes indexes of a plurality of second images of the facial organs; the second images, which are also virtual images, can be determined from those indexes. On this basis, the receiving end can combine the data packets of the user's first and second frame face images to generate the receiving-end first frame face image. "Combining" the two data packets means: if the received data packet of the first frame face image contains the index for a facial organ, the first image corresponding to that organ is obtained through the index and used as a component of the receiving-end first frame face image; if that data packet lacks the index for some facial organ but the data packet of the second frame face image contains it, the receiving end obtains the image of that organ through the latter index and uses it as a component of the receiving-end first frame face image.
Alternatively, following the order in which the data packets of face images were received: if the earliest-received of the data packets of the at least one other frame face image contains an index for a facial organ, the image of that organ is obtained through that index and used as a component of the receiving-end first frame face image; if it does not, but a later data packet or the data packet of the user's first frame face image contains the index, the receiving end obtains the organ image through that index and uses it as a component of the receiving-end first frame face image.
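A sketch of the merging rule, assuming each data packet is a mapping from facial organ to index; the packet list is ordered by priority (the current frame first for the first strategy, or oldest-received first for the alternative above):

    def assemble_face(packets, required_organs):
        # A missing organ in a higher-priority packet is filled from a later one.
        merged = {}
        for packet in packets:
            for organ, index in packet.items():
                merged.setdefault(organ, index)  # keep the highest-priority index
        return {organ: merged.get(organ) for organ in required_organs}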
Optionally, the receiving end generates a video background image through AR/VR technology. For example, when the user at the receiving end is in a video call with users at multiple sending ends simultaneously, the receiving end generates a video background image through AR/VR technology so that the receiving-end first frame face images of the multiple users can be fused into one background scene.
Optionally, the receiving end may select a video background image suited to the receiving-end first frame face image. For example, if the receiving-end first frame face image consists of cartoon-character facial organs, the receiving end selects a cartoon background image; if it consists of a celebrity's facial organs, the receiving end selects, as the video background, a poster of a film or television work featuring that celebrity. The receiving-end first frame face image and the video background image have a correspondence that may be one-to-one, one-to-many, many-to-one, or many-to-many. For example, in a two-party video call, where the display currently shows the receiving-end first frame face image of one user, that image may correspond to one video background image or to several; when several correspond, the receiving end may pick one arbitrarily or according to a preset rule. When three or more users are in the call, i.e., the display currently shows the receiving-end first frame face images of several users, those images may likewise correspond to one or several video background images, selected in the same way.
Optionally, the receiving end may further rotate, scale, or otherwise transform the receiving-end first frame face image, and may add expression or gesture special effects to it, to make the call more engaging.
In summary, the present application provides an image processing method in which the sending end does not need to send the user's first frame face image to the receiving end; it only needs to send a data packet containing the indexes of the plurality of first images. This reduces the requirement on network bandwidth; that is, a good video effect can still be ensured when the network transmission bandwidth is limited. For example, conventional video occupies considerable bandwidth for high-definition, high-frame-rate pictures. In general, to present a 2K-quality video picture on the receiving end, the conventional mode must transmit a 2K video stream; at 30 frames per second (FPS) with H.264 encoding, the transmission requires a bandwidth of about 8 megabits per second (Mbps). With the image processing method provided here, where the sending end only sends data packets containing the index of each facial organ, the bandwidth occupied by the data packets of the user's face images for the same 2K-quality picture is approximately:

bandwidth = frame rate x number of indexes per data packet x bits per floating point / 1024 / text compression rate

Assuming a frame rate of 30 FPS, 70 indexes in the data packet of the user's first frame face image, 32 bits per floating-point value, and a text compression rate of 10, the bandwidth works out to 30 x 70 x 32 / 1024 / 10, about 6.56 kilobits per second (kbps), roughly 1/1250 of the bandwidth occupied by the conventional video mode. Consequently, data packets of face images can be acquired at frame rates of 60 FPS, 90 FPS, or even above 500 FPS, making the video picture smoother and finer.
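The arithmetic can be checked directly (values as assumed above):

    frame_rate = 30            # FPS
    indexes_per_packet = 70    # indexes in one face-image data packet
    bits_per_index = 32        # one 32-bit floating-point value per index
    compression_rate = 10      # text compression rate

    bandwidth_kbps = (frame_rate * indexes_per_packet * bits_per_index
                      / 1024 / compression_rate)
    print(round(bandwidth_kbps, 2))          # 6.56 (kbps)
    print(round(8 * 1024 / bandwidth_kbps))  # 1248, i.e. about 1/1250 of 8 Mbps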
Second, the image processing method provided here does not expose personal privacy such as clothing and appearance, the user's location, or mental state, which broadens the application range of the technical solution of the present application.
Finally, when the user at the receiving end is in a video call with users at multiple sending ends simultaneously, the receiving end generates a video background image through AR/VR technology, so that the receiving-end first frame face images of the multiple users can be fused into one background scene, improving the user experience and interactivity.
Building on the previous embodiment, the sending end also sends audio data packets to the receiving end so that the user experiences synchronized hearing and vision. The receiving end therefore needs to synchronize the receiving-end first frame face image with the at least one audio data packet. Specifically, FIG. 4 is a flowchart of an image processing method according to another embodiment of the present application. As shown in FIG. 4, the method includes the following steps:
step S401: the sending end acquires a first frame face image of a user, and the first frame face image of the user comprises a plurality of facial organ images.
Step S402: the transmitting end acquires a plurality of first images matched with the plurality of facial organ images.
Step S403: the sending end sends a data packet of a first frame face image of a user to the receiving end, wherein the data packet comprises indexes of a plurality of first images, and the indexes of the plurality of first images are used for acquiring the plurality of first images.
Step S404: the receiving end acquires a plurality of first images.
Step S405: the receiving end generates a first frame face image of the receiving end according to the plurality of first images.
Step S406: the sending end sends at least one audio data packet to the receiving end.
Step S407: the receiving end displays the receiving-end first frame face image and synchronously plays the at least one audio data packet.
Steps S401 to S405 are the same as steps S201 to S205; refer to the descriptions of steps S201 to S205, which are not repeated here.
Step S406 is explained as follows. The timestamps of the at least one audio data packet match the timestamp of the data packet of the user's first frame face image, meaning: the timestamp of each of the audio data packets is greater than or equal to the timestamp of the data packet of the user's first frame face image, and less than the timestamp of the next face-image data packet. For example, if the timestamp of the data packet of the user's first frame face image is n and the timestamp of the next face-image data packet is n + 3000, the timestamps of the matching audio data packets are n, n + 160, n + 320, ..., n + 2880.
The timestamp in any audio data packet or face-image data packet reflects the sampling instant of the first octet of that data packet. In RTP, the timestamp occupies 32 bits.
Within one video call, the sending end can set the initial value of the timestamp arbitrarily, e.g., to n. Assuming that the data packet of the user's first frame face image is the data packet of the first face image in this call, its timestamp is n, and the timestamp of the first of the at least one audio data packet is also n.
The sending end obtains audio data packets at the audio collection frequency and face-image data packets at the face-image collection frequency. For example, if audio is collected at 8 kilohertz (kHz) and one audio data packet is packed every 0.02 seconds, the timestamp increment between adjacent audio data packets is 0.02 x 8000 = 160. If the face-image clock rate is 90 kHz and one face-image data packet is packed every 1/30 second, the timestamp increment between adjacent face-image data packets is (1/30) x 90000 = 3000. FIG. 5 is a schematic diagram of an audio packet sequence and a face-image packet sequence provided in an embodiment of the present application. As shown in FIG. 5, the first row is the audio packet sequence formed by a plurality of audio data packets, and the second row is the face-image packet sequence formed by the data packets of multiple frames of face images. In the audio packet sequence, the timestamp of the T-th audio packet is n, that of the (T+1)-th is n + 160, ..., that of the (T+18)-th is n + 2880, that of the (T+19)-th is n + 3040, ..., and that of the (T+38)-th is n + 6080. In the face-image packet sequence, the timestamp of the data packet of the T-th frame face image is n, that of the (T+1)-th frame is n + 3000, and that of the (T+2)-th frame is n + 6000.
Step S407 is explained as follows. When the receiving end generates the receiving-end first frame face image, it also generates a timestamp for it, which may be the timestamp of the data packet of the user's first frame face image. The receiving end then applies the same criterion as the sending end to match audio data packets with receiving-end face images by timestamp. For example, the receiving-end first frame face image with timestamp n matches the audio data packets with timestamps n, n + 160, n + 320, ..., n + 2880.
The receiving-end first frame face image and the at least one audio data packet must be synchronized so that the terminal device plays the content of the audio data packets while displaying the image. For example, the audio data packets with timestamps n, n + 160, n + 320, ..., n + 2880 are played synchronously while the receiving-end first frame face image is displayed.
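A sketch of this matching criterion under the clock rates used in the example above:

    AUDIO_TS_STEP = 160     # 0.02 s per packet x 8000 Hz audio clock
    VIDEO_TS_STEP = 3000    # (1/30) s per packet x 90 kHz face-image clock

    def matching_audio_timestamps(frame_ts, audio_ts_list):
        # An audio packet matches a face-image packet when its timestamp is at
        # least the face-image timestamp and below that of the next face image.
        return [ts for ts in audio_ts_list
                if frame_ts <= ts < frame_ts + VIDEO_TS_STEP]

    n = 0
    audio_ts = [n + k * AUDIO_TS_STEP for k in range(40)]
    print(matching_audio_timestamps(n, audio_ts))   # n, n+160, ..., n+2880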
It should be noted that part of step S406 may be performed simultaneously with step S403, while the rest of step S406 is performed after step S403. For example, the first of the at least one audio data packet must be sent to the receiving end together with the data packet of the user's first frame face image, while the remaining audio data packets are sent after that data packet.
In summary, in the present application, the receiving end can synchronously play the at least one audio data packet matching the receiving-end first frame face image while displaying that image, so that the user experiences synchronized hearing and vision.
Optionally, the receiving end also receives, from the sending end, a data packet of a second frame face image of the user, where the second frame face image is earlier than the first frame face image, i.e., its generation time is earlier. The data packet of the second frame face image includes indexes of a plurality of second images; the second frame face image includes a plurality of facial organ images, and the plurality of second images match those facial organ images. The sending end may send the two face images to the receiving end separately, e.g., first the data packet of the user's first frame face image and then that of the second frame face image. Alternatively, the sending end may send them together; for example, it may send a first data packet that contains both the data packet of the user's first frame face image and that of the second frame face image. Note that sending a face image may also be understood as sending the data packet of that face image.
The receiving end may send indication information to the sending end, the indication information instructing the sending end to send a face image earlier than the user's first frame face image. The sending end then sends the data packet of the user's second frame face image to the receiving end according to the indication information.
Further, the indication information may instruct that the user's first frame face image be sent together with a face image earlier than it. Since always bundling an earlier face image with the first frame face image would increase the sending end's transmission load, the receiving end may send the indication information only when it has repeatedly failed to receive data packets of consecutive face images.
For the receiving end, however, there are cases where the data packet of the user's second frame face image is not needed. For example, if the receiving end has already generated the receiving-end first frame face image from the data packet of the user's first frame face image, it does not need to generate a receiving-end second frame face image, and it discards the data packet of the user's second frame face image.
Conversely, if the receiving end has not yet generated the receiving-end third frame face image corresponding to the user's third frame face image, the receiving end may generate the receiving-end second frame face image from the data packet of the user's second frame face image, where the data packet of the user's third frame face image was generated earlier than that of the second frame face image.
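The following sketch captures one plausible reading of these two rules; the frame-number comparison is an interpretation for illustration, not the algorithm stated by the present application:

    def handle_face_packet(frame_number, newest_rendered_frame):
        # Discard a packet for a frame no newer than the newest receiving-end
        # frame already generated; otherwise render it (e.g. the second frame
        # is rendered when neither the first nor the third has been generated).
        if newest_rendered_frame is not None and frame_number <= newest_rendered_frame:
            return "discard"   # saves receiving-end power
        return "render"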
When the sending end sends data packets of multiple face images to the receiving end separately, some of them may arrive late due to poor network conditions. The receiving end may therefore extend the synchronization wait time, i.e., the time it waits for late face-image data packets; this may be 20 milliseconds, 30 milliseconds, or another value, which the present application does not limit.
To guard against the loss of face-image data packets, the sending end may send the data packet of the user's first frame face image together with that of the second frame face image, where the two are temporally consecutive. For example, FIG. 6 is a schematic diagram of a first data packet and a first buffer queue according to an embodiment of the present application. As shown in FIG. 6, the receiving end's first buffer queue holds the received data packets of the (T-7)-th to (T-3)-th frame face images, but the data packets of the (T-2)-th and (T-1)-th frames were lost and are therefore absent from the queue. The first data packet contains the data packets of the T-th, (T-1)-th, (T-2)-th, and (T-3)-th frame face images; the T-th may be the data packet of the user's first frame face image, and the (T-1)-th that of the user's second frame face image. The receiving end adds the data packets of the (T-1)-th and (T-2)-th frame face images to the first buffer queue, solving the packet-loss problem.
To reduce the transmission load of the transmitting end, the receiving end may send the indication information, which indicates that a face image earlier than the first frame face image of the user should be carried, only when it has failed to receive data packets of consecutive face images several times in a row. That is, when the sending end receives the indication information, it carries both the data packet of the first frame face image of the user and the data packet of the second frame face image of the user in the first data packet; when it has not received the indication information, it does not carry the second frame face image of the user when sending the first frame face image of the user. The receiving end may maintain a network state variable S with an initial value of 0. Each time the receiving end receives a data packet of a face image, it judges whether that data packet and the previously received data packet of a face image are consecutive: if so, it sets S = S + 1; if not, it sets S = S - 1. Once S reaches -(N+1), that is, the receiving end has received non-consecutive data packets of face images N+1 times in a row, the receiving end sends the transmitting end indication information indicating that a face image earlier than the first frame face image of the user should be carried when the first frame face image of the user is transmitted, and resets S to 0. In addition, when a first data packet received by the receiving end includes both the first frame face image of the user and the second frame face image of the user, the receiving end selectively places the data packet of the second frame face image of the user into the first buffer queue. Conversely, once S reaches N+1, that is, the receiving end has received consecutive data packets of face images N+1 times in a row, the receiving end sends the transmitting end another piece of indication information indicating that it is no longer necessary to carry a face image earlier than the first frame face image of the user, and resets S to 0. For convenience, the indication information indicating that a face image earlier than the current face image should be carried when the current face image is transmitted is referred to as first indication information, and the indication information indicating that this is not necessary is referred to as second indication information. It should be noted that the first indication information may instead be understood as indicating that the carrying of face images earlier than the current face image should be increased when the current face image is transmitted, and the second indication information as indicating that such carrying should be reduced.
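A minimal sketch of the network state variable S follows; the NetworkStateTracker class and the send_indication callback are hypothetical names, and treating the very first packet as consecutive is an assumption, since the text does not specify it:

```python
from typing import Callable, Optional

class NetworkStateTracker:
    """Maintains the network state variable S described above."""

    def __init__(self, n: int, send_indication: Callable[[str], None]) -> None:
        self.n = n
        self.s = 0
        self.last_frame_no: Optional[int] = None
        self.send_indication = send_indication   # delivers indication info to the sender

    def on_packet(self, frame_no: int) -> None:
        # A packet is "consecutive" if it directly follows the previous one.
        consecutive = (self.last_frame_no is None
                       or frame_no == self.last_frame_no + 1)
        self.last_frame_no = frame_no
        self.s += 1 if consecutive else -1
        if self.s == self.n + 1:          # N+1 consecutive packets in a row
            self.send_indication("second")   # stop bundling earlier frames
            self.s = 0
        elif self.s == -(self.n + 1):     # N+1 discontinuities in a row
            self.send_indication("first")    # start bundling earlier frames
            self.s = 0
```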
Specifically, fig. 7 is a flowchart of a method for a receiving end to process data packets of face images according to an embodiment of the present application. As shown in fig. 7, the method is executed by the receiving end and includes the following steps:
Step S701: receive a data packet of a first frame face image of a user.
Step S702: determine whether the data packet of the first frame face image of the user and the previously received data packet of a face image of the user are consecutive data packets. If they are consecutive, execute step S703; otherwise, execute step S707.
Step S703: set S = S + 1.
Step S704: determine whether S has reached N + 1. If so, execute step S705; if not, execute step S706.
Step S705: send the second indication information to the sending end, and set S = 0.
Step S706: buffer the data packet of the first frame face image of the user into the first buffer queue.
If the data packet of the first frame face image of the user and the data packet of the second frame face image of the user were packaged together in the first data packet, the receiving end takes the data packet of the first frame face image of the user out of the first data packet and buffers only it into the first buffer queue. For example, if the first frame face image of the user is the T th frame face image, the second frame face image of the user is the T-1 th frame face image, and the data packets of the T th, T-1 th and T-2 th frame face images are packaged in the first data packet, the receiving end stores only the data packet of the T th frame face image into the first buffer queue.
Step S707: set S = S - 1.
Step S708: determine whether S has reached -(N + 1). If so, execute step S709; if not, execute step S710.
Step S709: send the first indication information to the sending end, and set S = 0.
Step S710: determine whether the first data packet includes both the data packet of the first frame face image of the user and the data packet of the second frame face image of the user. If so, execute step S711; if not, execute step S714.
Step S711: determine whether the face image generated at the earliest time in the first data packet is later than the face image generated at the latest time in the first buffer queue. If so, execute step S712; if not, execute step S713.
Step S712: add the data packets of all the face images in the first data packet to the first buffer queue.
Assuming that the first frame face image of the user is the T th frame face image and the second frame face image of the user is the T-1 th frame face image, the first data packet includes the data packet of the T th frame face image, the data packet of the T-1 th frame face image and the data packet of the T-2 th frame face image. If the face image generated at the latest time in the first buffer queue is the T-3 th frame face image, the T-2 th frame face image is later than the T-3 th frame face image, so the receiving end adds the data packets of the T th, T-1 th and T-2 th frame face images to the first buffer queue.
Step S713: add to the first buffer queue only those data packets in the first data packet whose face images are later than the face image with the latest generation time in the first buffer queue.
Assuming that the first frame face image of the user is the T th frame face image, the second frame face image of the user is the T-1 th frame face image, and the first data packet includes the data packets of the T th, T-1 th, T-2 th and T-3 th frame face images: if the face image generated at the latest time in the first buffer queue is the T-3 th frame face image, the receiving end adds the data packets of the T th, T-1 th and T-2 th frame face images to the first buffer queue and discards the data packet of the T-3 th frame face image in the first data packet.
Step S714: determine whether the first frame face image of the user is earlier than the face image with the latest generation time in the first buffer queue. If so, execute step S715; otherwise, execute step S716.
Step S715: discard the data packet of the first frame face image of the user.
Step S716: buffer the data packet of the first frame face image of the user into the first buffer queue.
Finally, the receiving end may select the data packets of 2 to 3 frames of face images from the first buffer queue and buffer them into the second buffer queue for rendering.
For example, fig. 8 is a schematic diagram of image processing according to an embodiment of the present application. As shown in fig. 8, the receiving end receives the data packet of the T th frame face image but has not yet stored it into the first buffer queue, which currently stores the data packets of the T-1 th to T-7 th frame face images. When the receiving end generates the receiving-end first frame face image, it schedules only the data packets of the T th to T-2 th frame face images, stores these 3 frames of data packets into the second buffer queue, and clears the data packets of the T-7 th to T-3 th frame face images from the first buffer queue. The rendering module in the receiving end may start rendering from the T-2 th frame face image and proceed frame by frame; after the data packets of the 3 frames of face images in the second buffer queue have been rendered, the second buffer queue continues to obtain data packets of face images from the first buffer queue. The refresh frequency of the second buffer queue at the receiving end may be 30 frames per second, as long as the rendering module can obtain the data packets of 2 to 3 frames of face images each time.
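One possible refresh cycle for the two buffer queues, as a sketch under assumed names (refresh_second_buffer_queue and render are hypothetical; FramePacket is the hypothetical type from the earlier sketch):

```python
import time
from typing import Callable, List

def refresh_second_buffer_queue(first_queue: List["FramePacket"],
                                second_queue: List["FramePacket"],
                                render: Callable[["FramePacket"], None],
                                refresh_hz: int = 30) -> None:
    """One refresh cycle of the fig. 8 scheme: move the newest 2-3 frame
    packets into the second buffer queue, drop stale ones, then render."""
    if first_queue:
        batch = first_queue[-3:]        # e.g. T-2, T-1 and T in fig. 8
        first_queue.clear()             # clear stale packets (T-7 .. T-3)
        second_queue.extend(batch)
    while second_queue:
        render(second_queue.pop(0))     # render oldest first (from T-2 onward)
    time.sleep(1.0 / refresh_hz)        # about 30 refreshes per second
```

Clearing the first buffer queue on each cycle drops stale frames, so rendering always tracks the most recently received data.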
In summary, in the present application, the data packet of the first frame face image of the user and the data packet of the second frame face image of the user may be carried in one data packet. Since the second frame face image of the user is temporally consecutive with the first frame face image of the user, the loss of a face image packet can be compensated for, and the quality of the receiving-end first frame face image can be improved on this basis. In addition, the receiving end may send indication information to the sending end when it has failed to receive data packets of consecutive face images several times in a row, to instruct the sending end to carry a face image earlier than the first frame face image of the user when sending the first frame face image of the user. That is, when the sending end receives the indication information, it sends the second frame face image of the user together with the first frame face image of the user; when it has not received the indication information, it does not carry the second frame face image of the user when sending the first frame face image of the user, which reduces the transmission burden of the sending end.
If the first frame face image of the user is not consecutive with the face image generated at the latest time in the first buffer queue, and the second frame face image of the user, which is consecutive with the first frame face image of the user, is received after the first frame face image of the user, then the face image of the user generated at the later time is received by the receiving end first and the face image of the user generated at the earlier time is received afterwards, that is, the packets arrive out of order. In this situation, the receiving end may either discard the second frame face image of the user and buffer only the data packet of the first frame face image of the user into the first buffer queue, or buffer both the data packet of the second frame face image of the user and the data packet of the first frame face image of the user into the first buffer queue.
For example, fig. 9 is a schematic diagram of image processing according to another embodiment of the present application; in the case shown in fig. 9, the receiving end discards the second frame face image of the user. As shown in fig. 9, the receiving end first receives the data packet of the T th frame face image and buffers the T th frame into the second buffer queue for rendering, and only then receives the data packets of the T-1 th and T-2 th frame face images. To prevent the data packets of the face images in the first buffer queue from becoming out of order, the receiving end discards the data packets of the T-1 th and T-2 th frame face images. As a result, the rendering module obtains frame-skipped data, namely the data packets of the T th, T-3 th and T-4 th frame face images; because the refresh frequency of the second buffer queue at the receiving end is high, this does not affect the appearance of the receiving end during the video call.
For example, fig. 10 is a schematic diagram of image processing according to yet another embodiment of the present application; in the case shown in fig. 10, the receiving end adds the second frame face image of the user to the first buffer queue. As shown in fig. 10, the receiving end first receives the data packet of the T th frame face image but has not yet buffered it into the second buffer queue for rendering, and then receives the data packets of the T-1 th and T-2 th frame face images. To ensure the continuity of the data packets of the face images in the first buffer queue, the receiving end adds the data packets of the T-1 th and T-2 th frame face images to the first buffer queue. On this basis, the rendering module can subsequently obtain the data packets of the T th, T-1 th and T-2 th frame face images, which ensures the continuity of the rendered receiving-end face images.
That is, in the present application, an out-of-order situation is one in which the second frame face image of the user, which should have been received before the first frame face image of the user, arrives after it due to delay. If the first frame face image of the user has already been used to generate the receiving-end first frame face image, the second frame face image of the user is discarded. If the third frame face image of the user, which is earlier than the second frame face image of the user, has not yet been used to generate the receiving-end third frame face image, the second frame face image of the user is added to the first buffer queue, that is, the receiving-end second frame face image is generated from the second frame face image of the user.
It should be noted that the above describes the case where the receiving end generates a receiving-end face image from the data packet of one frame of face image at a time. However, as described in the second alternative of step S205, the receiving end may also generate the receiving-end first frame face image by combining the data packet of the first frame face image of the user with the data packet of at least one other frame face image of the user. The number of frames of face image data packets from which a receiving-end face image is generated is not limited in this application.
Fig. 11 is a schematic diagram of an image processing apparatus according to an embodiment of the present application, where the image processing apparatus is a part or all of the transmitting end, and as shown in fig. 11, the apparatus includes:
a first acquiring module 1101, configured to acquire a first frame face image of a user, where the first frame face image of the user includes a plurality of facial organ images.
A second obtaining module 1102 for obtaining a plurality of first images matching the plurality of facial organ images.
The first sending module 1103 is configured to send, to the receiving end, a data packet of the first frame face image of the user, where the data packet of the first frame face image of the user includes indexes of the plurality of first images, and the indexes of the plurality of first images are used to obtain the plurality of first images.
Optionally, the plurality of facial organ images are images of real facial organs of the user, and the plurality of first images are images of facial organs that are virtual for the user.
Optionally, the second obtaining module 1102 is specifically configured to: for each of the plurality of facial organ images, compare the facial organ image with the standard organ image corresponding to it to determine a first difference value; and obtain, according to the first difference value, a first image matching the facial organ image, where the second difference value, between the first image matching the facial organ image and the standard organ image, and the first difference value satisfy a first condition.
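For illustration, a Python sketch of this matching rule follows. The diff metric and the interpretation of the first condition (here, picking the candidate whose second difference value is closest to the first difference value) are assumptions, since the application does not fix them:

```python
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")

def match_first_image(organ_image: Image,
                      standard_image: Image,
                      candidates: Sequence[Image],
                      diff: Callable[[Image, Image], float]) -> Image:
    """Compute the first difference value between the captured organ image
    and its standard organ image, then return the candidate whose second
    difference value (to the same standard image) is closest to it."""
    first_diff = diff(organ_image, standard_image)
    return min(candidates,
               key=lambda c: abs(diff(c, standard_image) - first_diff))
```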
Optionally, the apparatus further comprises: a second sending module 1104, configured to send at least one audio data packet to the receiving end, where the timestamp of the audio data packet matches the timestamp of the data packet of the first frame face image of the user.
Optionally, the apparatus further comprises:
a third obtaining module 1105, configured to obtain a second frame face image of the user, where the second frame face image of the user is earlier than the first frame face image of the user.
A fourth obtaining module 1106, configured to obtain a plurality of second images matching the plurality of facial organ images of the second frame facial image of the user.
A third sending module 1107, configured to send a data packet of the second frame face image of the user to the receiving end, where the data packet of the second frame face image of the user includes indexes of multiple second images, and the indexes of the multiple second images are used to obtain the multiple second images.
Optionally, the apparatus further comprises: a receiving module 1108, configured to receive indication information sent by the receiving end, where the indication information is used to indicate that a face image earlier than the first frame face image of the user is sent.
The image processing apparatus provided in the present application may be configured to execute the image processing method corresponding to the sending end, and the content and the effect of the image processing apparatus may refer to the method embodiment section, which is not described again.
Fig. 12 is a schematic diagram of an image processing apparatus according to another embodiment of the present application, where the image processing apparatus is a part or all of the receiving end, and as shown in fig. 12, the apparatus includes:
a first receiving module 1201, configured to receive a packet of a first frame of facial images of a user from a transmitting end, where the packet of the first frame of facial images of the user includes indexes of a plurality of first images, the first frame of facial images of the user includes a plurality of facial organ images, and the plurality of first images match the plurality of facial organ images.
A first obtaining module 1202 for obtaining a plurality of first images.
A first generating module 1203 is configured to generate a receiving-end first frame face image according to the plurality of first images.
Optionally, the plurality of facial organ images are images of real facial organs of the user, and the plurality of first images are images of facial organs that are virtual for the user.
Optionally, the apparatus further comprises: a second receiving module 1204, configured to receive at least one audio data packet from the sending end, where a timestamp of the audio data packet matches a timestamp of a data packet of the first frame face image of the user.
Optionally, the apparatus further comprises: a third receiving module 1205 is configured to receive, from the sending end, a data packet of a second frame image of the user, the second frame image of the user being earlier than the first frame image of the user, the data packet of the second frame image of the user including indexes of a plurality of second images, the plurality of second images being matched with the plurality of facial organ images included in the second frame image of the user.
Optionally, the apparatus further comprises: a sending module 1206, configured to send, to the sending end, indication information indicating that a face image earlier than the first frame face image of the user is sent.
Optionally, the apparatus further comprises: a discarding module 1207, configured to discard the data packet of the second frame face image of the user if the receiving-end first frame face image has been generated.
Optionally, the apparatus further comprises: a second generating module 1208, configured to generate a receiving-end second frame image according to the data packet of the user's second frame image if the receiving-end third frame image corresponding to the user's third frame image is not generated yet, where the user's third frame image is earlier than the user's second frame image.
The image processing apparatus provided in the present application may be configured to execute the image processing method corresponding to the receiving end, and the content and effect of the image processing apparatus may refer to the method embodiment section, which is not described again.
Fig. 13 is a schematic diagram of a terminal device according to an embodiment of the present application, where the terminal device may be the sending end or the receiving end, and as shown in fig. 13, the terminal device includes: memory 1301, processor 1302, and transceiver 1303. The memory 1301 stores instructions executable by the processor, so that the processor 1302 can execute the image processing method corresponding to the transmitting end or the receiving end. The transceiver 1303 is used for implementing data transmission between terminal devices.
The terminal device may include one or more processors 1302. The memory 1301 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The terminal device may further include one or more of the following components: power components, multimedia components, audio components, interfaces for input/output (I/O), sensor components.
The power supply component provides power to the various components of the terminal. The power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia component comprises a touch-sensitive display screen providing an output interface between the terminal device and a user. In some embodiments, the touch display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. In some embodiments, the multimedia component includes a front camera and/or a rear camera. When the terminal device is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component is configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the terminal device is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
The I/O interface provides an interface between the processor and a peripheral interface module, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly includes one or more sensors, which may include a light sensor, such as at least one of a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly may further include at least one of an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The terminal device provided by the present application may be configured to execute the image processing method corresponding to the sending end or the receiving end, and the content and the effect of the method may refer to the method embodiment section, which is not described in detail herein.
Fig. 14 is a schematic diagram of an image processing system 1400 provided in an embodiment of the present application. As shown in fig. 14, the system includes a sending end 1401 and a receiving end 1402, which may be connected directly or through an intermediate device, such as a server. The sending end 1401 is configured to execute the image processing method corresponding to the sending end, and the receiving end 1402 is configured to execute the image processing method corresponding to the receiving end; for their content and effects, refer to the method embodiments, which are not repeated here.
The present application also provides a computer-readable storage medium. The computer-readable storage medium stores computer instructions for causing a computer to execute the image processing method provided by the present application.
The computer-readable storage medium may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store the computer instructions implementing the image processing method described above. The computer-readable storage medium is itself a memory, which may be a high-speed random access memory or a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device.
The present application further provides a computer program product storing computer instructions for causing a computer to execute the image processing method; for its content and effects, refer to the method embodiments, which are not repeated here.

Claims (29)

1. An image processing method, comprising:
acquiring a first frame face image of a user, wherein the first frame face image of the user comprises a plurality of facial organ images;
acquiring a plurality of first images matched with the plurality of facial organ images;
and sending a data packet of the first frame face image of the user to a receiving end, wherein the data packet of the first frame face image of the user comprises indexes of the plurality of first images, and the indexes of the plurality of first images are used for acquiring the plurality of first images.
2. The method of claim 1,
the plurality of facial organ images are images of real facial organs of the user, and the plurality of first images are images of facial organs that are virtual for the user.
3. The method of claim 1 or 2, wherein said obtaining a plurality of first images matching the plurality of facial organ images comprises:
for each facial organ image in the plurality of facial organ images, comparing the facial organ image with a standard organ image corresponding to the facial organ image to determine a first difference value;
and acquiring a first image matched with the facial organ image according to the first difference value, wherein the second difference value and the first difference value of the first image matched with the facial organ image and the standard organ image meet a first condition.
4. The method according to any one of claims 1-3, further comprising:
and sending at least one audio data packet to the receiving end, wherein the time stamp of the audio data packet is matched with the time stamp of the data packet of the first frame face image of the user.
5. The method according to any one of claims 1-4, further comprising:
acquiring a second frame face image of the user, wherein the second frame face image of the user is earlier than the first frame face image of the user;
acquiring a plurality of second images matched with a plurality of facial organ images of the second frame face image of the user;
and sending a data packet of a second frame face image of the user to the receiving end, wherein the data packet of the second frame face image of the user comprises indexes of the plurality of second images, and the indexes of the plurality of second images are used for acquiring the plurality of second images.
6. The method of claim 5, further comprising:
and receiving indication information sent by the receiving end, wherein the indication information is used for indicating that a face image earlier than a first frame face image of the user is sent.
7. An image processing method, comprising:
receiving a packet of a first frame of facial images of a user from a transmitting end, the packet of the first frame of facial images of the user including an index of a plurality of first images, the first frame of facial images of the user including a plurality of facial organ images, the plurality of first images matching the plurality of facial organ images;
acquiring the plurality of first images;
and generating a receiving end first frame face image according to the plurality of first images.
8. The method of claim 7,
the plurality of facial organ images are images of real facial organs of the user, and the plurality of first images are images of facial organs that are virtual for the user.
9. The method of claim 7 or 8, further comprising:
and receiving at least one audio data packet from the transmitting end, wherein the time stamp of the audio data packet is matched with the time stamp of the data packet of the first frame face image of the user.
10. The method according to any one of claims 7-9, further comprising:
receiving, from the transmitting end, a packet of a second frame image of the user, the second frame image of the user being earlier than the first frame image of the user, the packet of the second frame image of the user including indexes of a plurality of second images that match a plurality of facial organ images included in the second frame image of the user.
11. The method of claim 10, further comprising:
and sending indication information to the sending end, wherein the indication information is used for indicating that a face image earlier than a first frame face image of the user is sent.
12. The method of claim 10 or 11, further comprising:
and if the first frame image of the receiving end is generated, discarding the data packet of the second frame image of the user.
13. The method of claim 10 or 11, further comprising:
and if the receiving end third frame face image corresponding to the user third frame face image is not generated, wherein the user third frame face image is earlier than the user second frame face image, generating a receiving end second frame face image according to a data packet of the user second frame face image.
14. An image processing apparatus characterized by comprising:
a first acquisition module, configured to acquire a first frame of facial images of a user, where the first frame of facial images of the user includes a plurality of facial organ images;
a second acquisition module for acquiring a plurality of first images matched with the plurality of facial organ images;
a first sending module, configured to send a data packet of a first frame image of the user to a receiving end, where the data packet of the first frame image of the user includes indexes of the multiple first images, and the indexes of the multiple first images are used to obtain the multiple first images.
15. The apparatus of claim 14,
the plurality of facial organ images are images of real facial organs of the user, and the plurality of first images are images of facial organs that are virtual for the user.
16. The apparatus according to claim 14 or 15, wherein the second obtaining module is specifically configured to:
for each facial organ image in the plurality of facial organ images, comparing the facial organ image with a standard organ image corresponding to the facial organ image to determine a first difference value;
and acquiring a first image matched with the facial organ image according to the first difference value, wherein the second difference value and the first difference value of the first image matched with the facial organ image and the standard organ image meet a first condition.
17. The apparatus of any one of claims 14-16, further comprising:
and the second sending module is used for sending at least one audio data packet to the receiving end, and the time stamp of the audio data packet is matched with the time stamp of the data packet of the first frame face image of the user.
18. The apparatus of any one of claims 14-17, further comprising:
a third obtaining module, configured to obtain a second frame image of a user, where the second frame image of the user is earlier than the first frame image of the user;
a fourth acquisition module for acquiring a plurality of second images matched with a plurality of facial organ images of a second frame of facial images of the user;
a third sending module, configured to send a data packet of a second frame image of the user to the receiving end, where the data packet of the second frame image of the user includes indexes of the plurality of second images, and the indexes of the plurality of second images are used to obtain the plurality of second images.
19. The apparatus of claim 18, further comprising:
and the receiving module is used for receiving indication information sent by the receiving end, wherein the indication information is used for indicating that a face image earlier than a first frame face image of the user is sent.
20. An image processing apparatus characterized by comprising:
a first receiving module, configured to receive, from a sending end, a data packet of a first frame of facial images of a user, where the data packet of the first frame of facial images of the user includes indexes of a plurality of first images, the first frame of facial images of the user includes a plurality of facial organ images, and the plurality of first images match the plurality of facial organ images;
a first obtaining module, configured to obtain the plurality of first images;
and the first generating module is used for generating a receiving end first frame face image according to the plurality of first images.
21. The apparatus of claim 20,
the plurality of facial organ images are images of real facial organs of the user, and the plurality of first images are images of facial organs that are virtual for the user.
22. The apparatus of claim 20 or 21, further comprising:
and the second receiving module is used for receiving at least one audio data packet from the transmitting end, and the time stamp of the audio data packet is matched with the time stamp of the data packet of the first frame face image of the user.
23. The apparatus of any one of claims 20-22, further comprising:
and a third receiving module, configured to receive, from the sending end, a packet of a second frame image of the user, where the second frame image of the user is earlier than the first frame image of the user, and the packet of the second frame image of the user includes indexes of a plurality of second images that match with a plurality of facial organ images included in the second frame image of the user.
24. The apparatus of claim 23, further comprising:
and the sending module is used for sending indication information to the sending end, wherein the indication information is used for indicating that a face image earlier than a first frame face image of the user is sent.
25. The apparatus of claim 23 or 24, further comprising:
and the discarding module is used for discarding the data packet of the second frame face image of the user if the first frame face image of the receiving end is generated.
26. The apparatus of claim 23 or 24, further comprising:
and the second generation module is used for generating a receiving end second frame image according to a data packet of the user second frame image if the receiving end third frame image corresponding to the user third frame image is not generated yet, wherein the user third frame image is earlier than the user second frame image.
27. A terminal device, comprising: a memory and a processor;
the memory stores instructions executable by the processor to enable the processor to perform the method of any one of claims 1-12.
28. An image processing system, comprising: a transmitting end for performing the method of any one of claims 1 to 6 and a receiving end for performing the method of any one of claims 7 to 12.
29. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-12.