CN115243106A - Intelligent visual interaction method and device - Google Patents

Intelligent visual interaction method and device

Info

Publication number
CN115243106A
Authority
CN
China
Prior art keywords
video
user
gesture
action
terminal
Prior art date
Legal status
Pending
Application number
CN202110444150.1A
Other languages
Chinese (zh)
Inventor
赵杰
黄磊
马春晖
刘小蒙
郁心迪
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110444150.1A
Publication of CN115243106A
Legal status: Pending (current)

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47: End-user applications
    • H04N21/472: End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/70: Determining position or orientation of objects or cameras
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41: Structure of client; Structure of client peripherals
    • H04N21/422: Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4223: Cameras
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442: Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213: Monitoring of end-user related data
    • H04N21/44218: Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; Image sequence
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30244: Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides an intelligent visual interaction method and device, and relates to the field of communication technology. The method includes: a terminal plays a first video in which the orientation of an action presenter demonstrating a first action is a first orientation, recognizes the gesture of a user, and plays, according to the gesture of the user, a second video in which the orientation of the action presenter demonstrating the first action is a second orientation, so that the user's gesture can be recognized intelligently.

Description

Intelligent visual interaction method and device
Technical Field
The application relates to the field of artificial intelligence, and in particular to an intelligent visual interaction method and device.
Background
With the rise of fitness exercise, many people choose to learn fitness movements on their own with a smart terminal: a smart television, mobile phone, or similar device plays an action demonstration video, and the user imitates the actions shown on the device. Because this learning mode lacks in-person guidance from the action presenter in the video (hereinafter referred to as the coach), if the user's orientation is inconsistent with the coach's orientation while the user follows the video, the user's movement differs from the coach's movement, which affects the learning effect and comfort.
Most existing solutions prompt the user, through voice alone or voice plus text, to adjust an action or point out that an action is wrong, and the user then adjusts the action according to those prompts.
Disclosure of Invention
The application provides an intelligent visual interaction method and device, intended to solve the problem that, when a user learns actions by following an action presenter, guidance given from a single viewing angle combined with voice and text is ambiguous.
In order to achieve the above purpose, the embodiments of the present application provide the following technical solutions:
in a first aspect, an intelligent visual interaction method is provided, including: a terminal plays a first video in which the orientation of the action presenter demonstrating a first action is a first orientation, recognizes the gesture of a user, and plays, according to the gesture of the user, a second video in which the orientation of the action presenter demonstrating the first action is a second orientation. The method provided in the first aspect can intelligently recognize the user's gesture and, when the user's gesture differs from the action presenter's gesture or an action error occurs, intelligently adjust the orientation in which the action presenter demonstrates the first action according to the user's gesture, thereby improving the user's learning effect and comfort.
In a possible implementation, the terminal playing the second video according to the gesture of the user includes: the terminal determines the orientation of the user according to the user's gesture, and plays the second video according to that orientation. In this implementation, the orientation of the action presenter demonstrating the first action is adjusted intelligently according to the user's orientation, which makes learning easier for the user.
In a possible implementation, before the terminal plays the second video according to the orientation of the user, the method further includes: the terminal determines, according to the orientation of the user, that the video among N videos whose orientation is the same as the user's is the second video, where the N videos correspond to N orientations, the action presenter demonstrates the first action in the N orientations in those videos, and N is an integer greater than 1. In this implementation, the second video is a pre-recorded video corresponding to one of multiple orientations of the first action, so intelligent video switching can be achieved by direct matching; the switching takes little time and is efficient, and the second video can be played to the user quickly.
In a possible implementation, the terminal playing the second video according to the gesture of the user includes: the terminal determines a wrong skeleton point of the user's action according to the user's gesture, and plays a second video according to the wrong skeleton point, where the action presenter in the second video shows that wrong skeleton point in the first action. In this implementation, the played second video is adjusted by determining the wrong skeleton point of the user's action so that the wrong skeleton point is shown, which helps the user correct the wrong action.
In a possible implementation, before the terminal plays the second video according to the wrong skeleton point, the method further includes: the terminal determines, according to the wrong skeleton point, that the video among N videos that shows the wrong skeleton point in the first action is the second video, where the N videos correspond to N orientations, the action presenter demonstrates the first action in the N orientations in those videos, and N is an integer greater than 1. In this implementation, the second video is a pre-recorded video corresponding to one of multiple orientations of the first action, so intelligent video switching can be achieved by direct matching; the switching takes little time and is efficient, and the second video can be played to the user quickly.
In a possible implementation, the gesture of the user is a 3D gesture, and the terminal playing the second video according to the gesture of the user includes: the terminal obtains the 3D gesture of the action presenter from the first video; the terminal rotates and translates the 3D gesture of the action presenter to align it with the 3D gesture of the user, obtaining a target 3D gesture of the action presenter; and the terminal generates and plays the second video according to the target 3D gesture and the first video. In this implementation, the terminal can generate the second video without a multi-orientation video of the first action having been shot offline in advance, which reduces the storage pressure of multi-orientation videos.
In a possible implementation, the gesture of the user is a 2D gesture, and the terminal playing the second video according to the gesture of the user includes: the terminal obtains the 3D gesture of the action presenter from the first video; the terminal rotates, translates, and scale-transforms the 3D gesture of the action presenter to align it with the 2D gesture of the user, obtaining a target 2D gesture of the action presenter; and the terminal generates and plays the second video according to the target 2D gesture and the first video. In this implementation, the terminal can generate the second video without a multi-orientation video of the first action having been shot offline in advance, which reduces the storage pressure of multi-orientation videos.
In a second aspect, a terminal device is provided, including functional units for executing any one of the methods provided in the first aspect, where the actions performed by the functional units are implemented by hardware or by hardware executing corresponding software. For example, the terminal device may include a playing unit and a recognition unit: the playing unit is configured to play a first video in which the orientation of the action presenter demonstrating a first action is a first orientation; the recognition unit is configured to recognize the gesture of a user; and the playing unit is further configured to play, according to the recognized gesture of the user, a second video in which the orientation of the action presenter demonstrating the first action is a second orientation.
In a third aspect, a terminal device is provided, including a processor. The processor is connected to a memory, the memory stores computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory to implement any one of the methods provided in the first aspect. The memory and the processor may be integrated together or may be separate devices; in the latter case, the memory may be located inside or outside the terminal device.
In a fourth aspect, a terminal device is provided, including a processor and an interface circuit; the interface circuit is configured to receive code instructions and transmit them to the processor, and the processor is configured to execute the code instructions to perform any one of the methods provided in the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, which comprises computer-executable instructions, which, when executed on a computer, cause the computer to perform any one of the methods provided in the first aspect.
In a sixth aspect, there is provided a computer program product comprising computer executable instructions which, when executed on a computer, cause the computer to perform any one of the methods provided in the first aspect.
For the technical effects of any implementation of the second to sixth aspects, reference may be made to the technical effects of the corresponding implementation of the first aspect; they are not repeated here.
Drawings
Fig. 1 is a schematic hardware structure diagram of a terminal according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a human bone point provided in an embodiment of the present application;
fig. 3 is a flowchart of an intelligent visual interaction method provided in an embodiment of the present application;
fig. 4 is a schematic diagram illustrating human body orientation division according to an embodiment of the present application;
FIG. 5 is a flowchart of an intelligent visual interaction method provided in an embodiment of the present application;
fig. 6 is a schematic diagram illustrating switching videos according to user orientations according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for intelligent visual interaction according to an embodiment of the present application;
fig. 8 is a flowchart illustrating switching a video according to a 3D gesture of a user according to an embodiment of the present application;
fig. 9 is a flowchart illustrating switching a video according to a 2D gesture of a user according to an embodiment of the present application;
FIG. 10 is a flow chart illustrating switching video according to a user's wrong skeletal point according to an embodiment of the present disclosure;
fig. 11 is a schematic diagram illustrating actions exhibited by an action presenter before and after video switching according to an embodiment of the present application;
fig. 12 is a schematic composition diagram of a terminal device according to an embodiment of the present application;
fig. 13 is a schematic hardware structure diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the description of this application, "/" means "or" unless otherwise stated; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. Further, "at least one" means one or more, and "a plurality of" means two or more. The terms "first", "second", and the like are used to distinguish between different objects and do not limit the number or the execution order of the objects.
It is noted that the words "exemplary" or "such as" are used herein to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments or designs; rather, use of these words is intended to present related concepts in a concrete fashion.
The embodiment of the application provides an intelligent visual interaction method which can be applied to a terminal, wherein the terminal is used for providing one or more of voice service and data connectivity service for a user. A terminal can also be called a User Equipment (UE), a terminal device, an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent, or a user equipment. The terminal may be a mobile phone, a television, an Augmented Reality (AR) device, a Virtual Reality (VR) device, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and of course, in the following embodiments, the specific form of the terminal is not limited at all.
An exemplary hardware structure of the terminal in the embodiment of the present application can be seen in fig. 1, which includes: a processor, a camera, a memory, and a display screen, and may also include other devices such as Radio Frequency (RF) circuitry, bluetooth devices, one or more sensors, a touch pad, a pointing device, audio circuitry, peripheral interfaces, and a power system, which may communicate via one or more communication buses or signal lines (not shown in fig. 1). Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 does not constitute a limitation of the terminal, which may include more or fewer components than shown, or a combination of certain components, or a different arrangement of components.
The processor is the control center of the terminal and uses various interfaces and lines to connect the parts of the terminal. In some embodiments, the processor may include one or more processing units, for example one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a Network Processing Unit (NPU). The memory stores application programs and data, and the processor performs the terminal's functions and data processing by running the application programs and data stored in the memory. The camera is used to capture still images or video. In some embodiments, the terminal may include 1 or N cameras, where N is a positive integer greater than 1. The cameras may include a two-dimensional (2D) camera and/or a three-dimensional (3D) camera: the 2D camera acquires 2D images of the user, and the 3D camera acquires 2D or 3D images of the user; for example, an ordinary color (RGB) camera can acquire a 2D image, and a red-green-blue-depth (RGB-D) camera can acquire both 2D and 3D images. The display screen (also referred to as the display) may be used to display information input by or provided to the user and the various menus of the terminal, and may be a liquid crystal display, an organic light-emitting diode display, or the like. In the embodiments of this application, the display screen is used to play a video (e.g., a fitness video, a rehabilitation training video, a dance teaching video, etc.).
In order to make the embodiments of the present application clearer, concepts and parts related to the embodiments of the present application are briefly introduced below.
1. Orientation of
Orientation refers to the direction corresponding to a certain face or part of the body of the user or coach; for example, the direction the user's face points, the direction the user's back points, or the direction the user's chest points. Taking the terminal that plays the video as the reference point: if the chest of the user or coach faces the display screen of the terminal, the orientation of the user or coach can be considered to be front (in this case, for the coach, the coach's back is what is shown in the video played by the terminal), and if the chest of the user or coach faces the left side of the terminal, the orientation is considered to be left. For convenience of description, the method provided by this application is described below using this convention as an example.
2. Bone point identification
Skeletal points may also be referred to as human skeletal points, key skeletal points, or key points. Skeletal point recognition is an important part of human image processing. When processing a human body image (a 2D or 3D image), it is generally necessary to recognize the skeletal points in the image and use the recognized skeletal points for subsequent processing; for example, the motion in the image can be recognized from the position information of the skeletal points. Illustratively, referring to fig. 2, the skeletal points are numbered and distinguished between left and right, e.g., 3 is the right shoulder and 6 is the left shoulder, so the orientation of the user can be determined from the skeletal points: for example, when the positions of the left and right skeletal points are reversed, the user's orientation is considered to be the back (i.e., the user faces away from the terminal). It should be noted that fig. 2 is only an illustration; the number of skeletal points may be larger or smaller, as long as the posture of the human body can be described.
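As a rough illustration of the left/right check just described, the following sketch (not part of the patent; the joint names, coordinate convention, and single-pair heuristic are assumptions) flags a back-facing user from two shoulder keypoints.

```python
# Hypothetical keypoint layout: each entry is an (x, y) image coordinate.
# Joint names and the single shoulder-pair heuristic are illustrative
# assumptions, not the patent's exact numbering or decision rule.
def faces_away_from_terminal(keypoints: dict) -> bool:
    """Return True if the user likely has their back to the display.

    In a non-mirrored camera image of a person facing the camera, the
    person's right shoulder appears on the image's left side (smaller x).
    If it appears on the right instead, the left/right skeleton points look
    "swapped" and the pose is treated as back-facing.
    """
    right_x = keypoints["right_shoulder"][0]
    left_x = keypoints["left_shoulder"][0]
    return right_x > left_x

pose = {"right_shoulder": (420.0, 200.0), "left_shoulder": (260.0, 200.0)}
print(faces_away_from_terminal(pose))  # True: back toward the terminal
```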
3. Posture
A gesture in this application includes limb movement and orientation. Specifically, the gesture can be determined by locating the skeletal points of the human body. Gestures can be divided into 2D gestures and 3D gestures.
A 2D gesture is a user gesture containing the user's 2D information. The 2D gesture can be obtained by performing skeletal point recognition on a 2D image of the user. The user's skeletal points can be obtained with a deep learning method, for example OpenPose, which obtains the user's skeletal points by recognizing a single RGB frame.
A 3D gesture is a user gesture containing the user's 3D information. The 3D gesture can be predicted by performing skeletal point recognition on a 2D image of the user, or on a 3D image. The 3D gesture can also be acquired offline by a motion capture system, which detects human body marker points with high precision and therefore yields a more accurate 3D gesture. The 3D gesture can further be obtained with motion sensors: sub-sensors are placed at the joints of the body, each sub-sensor obtains the gesture of its joint, and the sub-sensors form a sensor network, so that a precise 3D gesture of the user can be collected.
The 2D or 3D gesture of the user can be acquired by placing a 2D or 3D camera in front of the user so that the camera's field of view covers the user and capturing a 2D or 3D image of the user. The 2D or 3D gesture may be obtained offline or online, which is not limited in this application.
The foregoing is a brief introduction to some of the concepts related to this application.
In the prior art, the terminal has the following problems:
1. When a user learns from a video displayed by the terminal, the content of the video is an action recorded by a coach in advance, and the coach's orientation in that action may be inconsistent with the user's orientation, which affects the user's learning effect and comfort.
2. When the user learns an action shown in the video and performs it inaccurately, correcting the action only through prompts such as voice and text has little effect, so the user encounters obstacles in the learning process and the learning effect suffers.
To solve these problems, this application provides an intelligent visual interaction method. In the method, while the user imitates the coach's action in a video, a camera in the terminal captures images of the user, the terminal recognizes the user's gesture, and the coach's video is switched according to the user's gesture, thereby providing a better teaching viewing angle and improving the user's learning effect and comfort.
As shown in fig. 3, the method includes:
301. the terminal plays a first video, and the orientation of an action presenter who presents a first action in the first video is a first orientation.
The first video may be pre-recorded and stored either in the terminal or in a cloud server. In the former case, the user invokes the first video through interaction with the terminal and plays it; in the latter case, the user, through interaction with the terminal, has the terminal download or obtain the first video online and play it.
In the first video, the action presenter may demonstrate one or more actions. If there is one action, the first action is that action; if there are several, the first action may be any one of them, which is not limited in this application. It should be noted that the method provided by this application is described using the first action as an example, but the method can be executed for any action shown in the first video; the specific process is similar to that for the first action and can be understood by analogy.
302. The terminal recognizes the gesture of the user.
Recognizing the user's gesture may specifically mean recognizing the user's orientation and/or limb movement. Specifically, the terminal may determine the user's skeletal points from an image of the user and determine the user's orientation and/or limb movement from those skeletal points. The user's orientation refers to the orientation corresponding to a certain body part of the user (for example, the chest), and that body part can be determined from the user's skeletal points.
303. And the terminal plays a second video according to the gesture of the user, and the orientation of the action presenter who presents the first action in the second video is a second orientation.
The second orientation and the first orientation may or may not be the same.
The method provided in this embodiment changes the played video according to the user's gesture and visually shows the user the first action in different orientations for learning. It solves the problem of poor learning effect and comfort caused by the action in the video being inconsistent with the user's orientation, as well as the limitation of correcting actions only through prompts such as voice and text, thereby improving learning efficiency and comfort.
Since the user's limb movement and orientation can be determined by recognizing the user's gesture, step 303 can be implemented in the following mode 1 or mode 2.
Mode 1: the terminal determines the orientation of the user (denoted as a third orientation) according to the gesture of the user, and plays the second video according to the third orientation.
In mode 1, before the terminal plays the second video according to the third orientation, the method further includes: the terminal determines, according to the orientation of the user, that the video among N videos whose orientation is the same as the user's is the second video, where the N videos correspond to N orientations, the action presenter demonstrates the first action in the N orientations in those videos, and N is an integer greater than 1.
It should be noted that the orientation corresponding to a video refers to the orientation of the action presenter in that video. The user (or action presenter) may correspond to several orientations, each orientation corresponding to an angle range centered on the user (or action presenter). The user and the action presenter are considered to have the same orientation when the angle range in which the user's chest lies is the same as the angle range in which the action presenter's chest lies. For example, referring to fig. 4, the user (or action presenter) corresponds to 8 orientations, whose angle ranges are listed in Table 1. If the user's chest lies in the range (0°, 45°] and the action presenter's chest also lies in (0°, 45°], both have orientation 1, i.e., the same orientation, and the video matching the user's orientation is the one in which the action presenter's orientation is orientation 1.
TABLE 1
Orientation        Corresponding angle range
Orientation 1      (0°, 45°]
Orientation 2      (45°, 90°]
Orientation 3      (90°, 135°]
Orientation 4      (135°, 180°]
Orientation 5      (180°, 225°]
Orientation 6      (225°, 270°]
Orientation 7      (270°, 315°]
Orientation 8      (315°, 360°]
It should be noted that the orientation division in Table 1 is only illustrative; other divisions are possible in practice, for example dividing into front, back, left, and right, for which example angle ranges are given in Table 2. For convenience of description, the method provided by this application is described below with the orientation divided into front, back, left, and right.
TABLE 2
Orientation        Corresponding angle range (0° is the direction toward the terminal's display screen)
Left               (45°, 135°]
Back               (135°, 225°]
Right              (225°, 315°]
Front              (315°, 45°]
Referring to fig. 5, in a specific implementation of mode 1, the N videos may be stored in the cloud server or locally in the terminal. After determining the orientation of the user from the user's gesture, the terminal matches the user's orientation against the orientations of the N videos, takes the video whose orientation matches successfully as the second video, and plays it.
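The matching step can be pictured with the short sketch below (an illustration only: the bin boundaries follow Table 2, while the angle estimate, the file names, and the keep-current-video rule for a back-facing user, described in the next paragraphs, are assumptions).

```python
# A minimal sketch of mode 1, assuming the user's chest orientation has already
# been estimated as an angle in [0, 360) relative to the display (0 deg points
# toward the screen, as in Table 2). Video file names are placeholders.
FRONT, LEFT, BACK, RIGHT = "front", "left", "back", "right"

def angle_to_orientation(angle_deg: float) -> str:
    """Map an angle to the four orientation bins of Table 2."""
    a = angle_deg % 360.0
    if 45.0 < a <= 135.0:
        return LEFT
    if 135.0 < a <= 225.0:
        return BACK
    if 225.0 < a <= 315.0:
        return RIGHT
    return FRONT  # (315 deg, 360) together with [0, 45]

# N pre-recorded videos of the same first action, one per orientation.
videos_by_orientation = {
    FRONT: "action1_front.mp4",
    LEFT: "action1_left.mp4",
    BACK: "action1_back.mp4",
    RIGHT: "action1_right.mp4",
}

def select_second_video(user_angle_deg: float, current_video: str) -> str:
    user_orientation = angle_to_orientation(user_angle_deg)
    if user_orientation == BACK:
        # User faces away from the screen: keep playing the current video.
        return current_video
    return videos_by_orientation[user_orientation]

print(select_second_video(100.0, "action1_right.mp4"))  # -> action1_left.mp4
```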
It should be noted that the second orientation and the first orientation may be the same when it is determined that the third orientation is the same as the first orientation or the third orientation indicates that the user faces away from the display screen of the terminal. In other cases, the second orientation and the first orientation may be different.
When the third orientation indicates that the user faces away from the display screen, the terminal detects that the user is not watching the video, so an orientation adjustment would go unnoticed; the terminal therefore does not adjust and continues playing the action demonstration video in the first orientation, and the orientation of the action presenter in the second video is the same as in the first video. When the user is not facing away from the terminal, as shown in fig. 6, the display screen is currently playing the first video, in which the action presenter's orientation (the first orientation) is to the right; if the third orientation determined from the user's gesture is to the left, a second video in which the action presenter faces left is played, and in this case the second orientation is the same as the third orientation and different from the first orientation.
Mode 2: the terminal determines a wrong skeleton point of the user's action according to the user's posture and plays a second video according to the wrong skeleton point, where the action presenter in the second video shows the wrong skeleton point in the first action.
In mode 2, before the terminal plays the second video according to the wrong skeleton point, the method further includes: the terminal determines, according to the wrong skeleton point, that the video among the N videos that shows the wrong skeleton point in the first action is the second video. For the description of the N videos, see above; it is not repeated here.
For the specific implementation of mode 2, see the process shown in fig. 10 below.
In mode 1 and mode 2 above, the N videos showing multiple orientations of a set of actions may be called a multi-orientation (or multi-angle) video of that set of actions, and it includes the multi-orientation video of the first action in the set. The multi-orientation video can be shot offline in advance and then uploaded to the cloud server or stored locally for use by the terminal. Taking the first action as an example, its multi-orientation video can be obtained by shooting the first action from multiple angles. As shown in fig. 5, the shooting can be done with multiple cameras, camera 1, camera 2, ..., camera N, where N is an integer greater than 1; the N cameras produce N videos, each corresponding to the first action. The cameras may shoot from different angles at the same time, or from different angles at different times. If they shoot simultaneously, the multi-orientation video of the first action can be generated directly. If they do not shoot simultaneously, the video frames of the N videos need to be aligned. For example, with N = 2, suppose camera 1 shoots a video of the first action in a first orientation at a first time, and camera 2 shoots a video of the first action in a second orientation at a second time; through video frame alignment, the first-orientation video and the second-orientation video of the first action are obtained and the multi-orientation video of the first action can be generated.
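One simple way to do the frame alignment, sketched below under the assumption that each recording carries per-frame timestamps measured from the start of the action (the patent only states that alignment is needed, not how):

```python
def align_frames(frames_a, frames_b):
    """Pair each frame of video A with the closest-in-phase frame of video B.

    frames_a, frames_b: non-empty lists of (t_since_action_start, frame)
    tuples, both sorted by timestamp. Returns (frame_a, frame_b) pairs that
    together form one multi-orientation sequence of the action.
    """
    pairs = []
    j = 0
    for t_a, frame_a in frames_a:
        # Advance j while the next frame of B is closer in time to t_a.
        while j + 1 < len(frames_b) and abs(frames_b[j + 1][0] - t_a) < abs(frames_b[j][0] - t_a):
            j += 1
        pairs.append((frame_a, frames_b[j][1]))
    return pairs
```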
For a video in a given orientation, the camera can be placed within the angle range corresponding to that orientation to shoot the action demonstrated by the action presenter. For example, for the left-oriented video in Table 2 the camera may be placed at 90°, but other angles such as 60°, 80°, or 120° are also possible.
Because the second video is a pre-recorded video corresponding to one of multiple orientations of the first action, intelligent video switching can be achieved by direct matching; the switching takes little time and is efficient, so the second video can be played to the user quickly.
With mode 1, the orientation of the action presenter of the first action in the video is adjusted intelligently according to the user's orientation, so the user does not need to adjust their own orientation during learning, which improves learning efficiency and user experience. Mode 2 identifies the wrong skeleton point, determines the orientation of the finally displayed action according to the user's wrong skeleton point, and shows the wrong skeleton point in multiple orientations to help the user correct the wrong action. That is, the wrong skeleton point is presented to the user in multiple orientations by switching the viewing angle, so that the user can correct the wrong action.
The overall scheme realizing the two modes is shown in fig. 7: by recognizing the user's gesture, two ways of adjusting the played video are obtained.
In mode 1 and mode 2 above, the terminal acquires the second video only after determining the user's limb movement or orientation. Alternatively, the terminal may not determine the user's limb movement or orientation and instead generate the second video directly from the user's posture and the first video, which can be implemented by the following mode 3 or mode 4. The user's posture in mode 3 is a 3D posture, and in mode 4 a 2D posture.
In a specific implementation, mode 3 includes:
11) The terminal obtains the 3D gesture of the action presenter from the first video.
If the gesture of the action presenter demonstrating the first action in the first video is a 2D gesture, the presenter's 3D gesture can be obtained by converting the 2D gesture. If it is a 3D gesture, the image frames of the first video can be recognized directly to obtain the presenter's 3D gesture.
12) The terminal rotates and translates the 3D gesture of the action presenter to align it with the 3D gesture of the user, obtaining a target 3D gesture of the action presenter.
13) The terminal generates and plays the second video according to the target 3D gesture and the first video.
Referring to fig. 8, mode 3, when implemented, may include steps 801-805.
801. The terminal identifies a 3D gesture of an action presenter presenting a first action in a first video.
In the methods shown in fig. 8 and in figs. 9 and 10 below, the gesture of the action presenter demonstrating the first action in the first video is taken to be a 3D gesture as an example; in actual implementation it may be a 2D gesture, in which case the presenter's 2D gesture can be converted to obtain the presenter's 3D gesture.
802. The terminal recognizes the 3D gesture of the user.
803. The terminal determines a rotation matrix R and a translation vector T corresponding to the 3D posture of the action presenter according to the 3D posture of the user and the 3D posture of the action presenter.
Specifically, the user's 3D posture and the action presenter's 3D posture may be used as inputs to an Iterative Closest Point (ICP) algorithm, whose output is the rotation matrix R and translation vector T corresponding to the presenter's 3D posture.
804. The terminal determines the target 3D posture of the action presenter according to the rotation matrix R and translation vector T corresponding to the presenter's 3D posture, together with the presenter's 3D posture itself.
For example, if the presenter's 3D posture is denoted M and the target 3D posture M', the target 3D posture can be computed as M' = R × M + T. Through step 804, the target 3D posture is aligned with the user's 3D posture, so the presenter orientation corresponding to the target 3D posture is the same as the user's orientation.
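For illustration, the sketch below shows one way such an alignment step could produce R and T when the joint correspondences between the two skeletons are known: it uses the closed-form Kabsch/Procrustes solution that a single ICP iteration reduces to in that case. This is a simplified assumption-based example, not the patent's exact algorithm, and the joint count is made up.

```python
import numpy as np

def rigid_align(source: np.ndarray, target: np.ndarray):
    """Estimate R (3x3) and T (3x1) such that R @ source + T ~= target.

    source, target: 3xJ matrices of corresponding 3D skeleton points
    (action presenter and user). With known joint correspondences this is
    the closed-form Kabsch/Procrustes step that one ICP iteration performs.
    """
    src_c = source.mean(axis=1, keepdims=True)
    tgt_c = target.mean(axis=1, keepdims=True)
    H = (source - src_c) @ (target - tgt_c).T
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    T = tgt_c - R @ src_c
    return R, T

# Illustrative use: 17 joints is an arbitrary choice.
M = np.random.rand(3, 17)        # presenter's 3D posture
user_3d = np.random.rand(3, 17)  # user's 3D posture, same joint order
R, T = rigid_align(M, user_3d)
M_target = R @ M + T             # target 3D posture, as in M' = R x M + T
```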
805. And the terminal generates and plays the second video according to the target 3D posture and the first video.
In a specific implementation of step 805, a first possible way is to generate the second video directly through an action-transfer algorithm, for example a generative adversarial network (GAN), which learns the clothing and appearance information of the action presenter in the first video together with the skeleton point information of the specified viewing angle (i.e., the target 3D gesture). The second video generated by the GAN automatically contains the background and the person of the first video.
In a second possible implementation, the terminal can compute a 2D projection of the target 3D pose in a given direction through a perspective transformation algorithm. Since the presenter orientation corresponding to the target 3D pose is the same as the user's orientation, the 2D projection determined from the target 3D pose and the first video can then be learned by an action-transfer algorithm, for example a GAN, to generate the second video.
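A minimal sketch of the projection step, assuming the target 3D pose is already expressed in the camera coordinate frame and using a made-up pinhole intrinsic matrix (the patent only names a perspective transformation algorithm, so the details here are assumptions):

```python
import numpy as np

def project_to_2d(points_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Pinhole projection of Jx3 camera-frame points to Jx2 pixel coordinates."""
    uvw = (K @ points_3d.T).T        # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # divide by depth

# Illustrative intrinsics; focal length and principal point are assumptions.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
target_3d = np.array([[0.1, -0.2, 2.0],  # two example joints, in metres
                      [0.0,  0.0, 2.1]])
print(project_to_2d(target_3d, K))
```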
In a third possible implementation, after step 804, i.e., after the target 3D pose is determined, the terminal may further generate a 3D mesh (which can be regarded as a more precise target 3D pose) through a human parametric model, for example the Skinned Multi-Person Linear (SMPL) model, and determine the corresponding 2D projection from the generated 3D mesh. Because the 3D mesh contains only the morphological features of the human body and no background, this 2D projection contains no background; it can therefore be superimposed on the background of the first video to generate the second video. The SMPL model is a parameterized human body model in which changes in body shape are controlled by inputting parameters for each part of the body. Here, the human body parameters determined from the target 3D pose and the first video are input into a base human body model to generate a 3D mesh; the 3D mesh represents the surface appearance of the human body and, compared with a target 3D pose made up of skeleton points alone, shows the body's morphological features more accurately.
With mode 3, when the user's gesture is a 3D gesture, the gesture of the action presenter in the currently playing video can be adjusted intelligently from the user's 3D gesture and the presenter's 3D gesture, removing the inconvenience of the user having to adjust their own posture. Because generating the second video requires only the originally played first video and the user's current gesture, a multi-orientation video of the first action does not need to be shot offline in advance; a second video convenient for the user to learn from is generated by the algorithm, which reduces the storage pressure of multi-orientation videos and makes learning more intelligent.
In a specific implementation, mode 4 includes:
21) The terminal obtains the 3D posture of the action presenter from the first video. For a description of this step, see step 11) of mode 3 above; it is not repeated here.
22) The terminal rotates, translates, and scale-transforms the 3D posture of the action presenter to align it with the 2D posture of the user, obtaining the target 2D posture of the action presenter.
23) The terminal generates and plays the second video according to the target 2D posture and the first video.
Referring to fig. 9, mode 4, when implemented, may include steps 901-905.
901. As in step 801.
902. The terminal acquires the 2D gesture of the user.
903. The terminal determines a rotation matrix R, a translation vector T, and a projection scale factor s corresponding to the 3D posture of the action presenter according to the 2D posture of the user, the intrinsic parameters of the camera, and the 3D posture of the action presenter.
Specifically, the user's 2D posture, the camera's intrinsic parameters, and the presenter's 3D posture may be used as inputs to a perspective-n-point (PnP) algorithm, whose output is the rotation matrix R, translation vector T, and projection scale factor s corresponding to the presenter's 3D posture.
904. The terminal determines the target 2D posture of the action presenter according to the rotation matrix R, translation vector T, and projection scale factor s corresponding to the presenter's 3D posture, together with the presenter's 3D posture itself.
For example, if the presenter's 3D posture is denoted L and the target 2D posture N', the target 2D posture can be computed as N' = s(R × L + T).
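As a sketch of applying that formula (with the assumption that the x and y components of the scaled result are taken as the 2D keypoints, a weak-perspective reading the patent does not spell out):

```python
import numpy as np

def target_2d_pose(L: np.ndarray, R: np.ndarray, T: np.ndarray, s: float) -> np.ndarray:
    """Compute N' = s * (R @ L + T) and keep the x, y rows as the 2D posture.

    L: 3xJ presenter 3D skeleton, R: 3x3 rotation, T: 3x1 translation,
    s: projection scale factor, as produced by the PnP-style step 903.
    """
    aligned = R @ L + T          # rotate and translate the 3D posture
    return (s * aligned)[:2, :]  # scaled orthographic (weak-perspective) projection
```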
905. And the terminal generates and plays the second video according to the target 2D posture and the first video.
Illustratively, the second video is generated by combining the first video with the target 2D pose via the GAN algorithm described above.
With mode 4, when the user's posture is a 2D posture, the posture of the action presenter in the currently playing video can be adjusted intelligently from the user's 2D posture and the presenter's 3D posture. According to the type of posture currently acquired from the user, a second video based on the user's viewing angle is generated by the algorithm directly from the original video (i.e., the first video), which makes learning convenient for the user without multi-orientation videos having to be recorded and stored in advance.
Referring to fig. 10, the above-described mode 2 may include steps 1001-1011 in a specific implementation.
1001-1002, same as steps 801 and 802, respectively, or same as steps 901 and 902, respectively.
1003. The terminal matches the user's action against an action template and identifies the user's wrong skeleton point.
In a specific implementation of step 1003, the terminal can identify the user's wrong skeleton points through an action recognition algorithm. The action recognition algorithm may be an action template matching algorithm, in which case the terminal completes recognition by computing the similarity between each limb of the body and the corresponding template limb. It may also be a multi-frame skeleton information matching algorithm: the multi-frame skeleton information is divided into several classes, each class of skeleton information corresponds to an action number, and each action number corresponds to an action; the terminal matches the user's action information against the multi-frame skeleton information with a deep learning method, determines the action number, and thus determines the action.
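The template-matching variant can be pictured with the sketch below. The limb list, the cosine-similarity measure, the threshold, and the rule of marking the limb's end joint as the wrong skeleton point are illustrative assumptions; the patent only says that per-limb similarity to the template is computed.

```python
import numpy as np

# Assumed limb definitions as (parent joint, end joint) keypoint names.
LIMBS = [("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
         ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
         ("left_hip", "left_knee"), ("left_knee", "left_ankle"),
         ("right_hip", "right_knee"), ("right_knee", "right_ankle")]

def limb_vector(pose: dict, limb) -> np.ndarray:
    a, b = limb
    return np.asarray(pose[b], dtype=float) - np.asarray(pose[a], dtype=float)

def wrong_skeleton_points(user_pose: dict, template_pose: dict, threshold: float = 0.9):
    """Return end joints of limbs whose direction deviates from the template.

    Similarity is the cosine of the angle between the user's limb vector and
    the template's; a limb below the threshold marks its end joint as a wrong
    skeleton point.
    """
    wrong = []
    for limb in LIMBS:
        u = limb_vector(user_pose, limb)
        t = limb_vector(template_pose, limb)
        cos_sim = float(np.dot(u, t)) / (np.linalg.norm(u) * np.linalg.norm(t) + 1e-9)
        if cos_sim < threshold:
            wrong.append(limb[1])
    return wrong
```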
If the terminal determines that the user has a wrong skeleton point, it executes step 1004; if there is no wrong skeleton point, it continues playing the first video.
1004. The terminal judges, according to the wrong skeleton point, whether the corresponding skeleton point in the first action shown by the action presenter in the first video is occluded.
In a specific implementation of step 1004, the terminal determines, according to the specific position of the wrong skeleton point, whether that point is occluded in the currently playing first video; if it is, the user cannot currently observe all the details of the first action shown by the action presenter.
It should be noted that whether the wrong skeleton point in the first action shown by the action presenter is occluded means judging, from the user's viewing angle, whether that information is hidden. Illustratively, referring to fig. 11: the user's wrong skeleton point is identified through step 1003 and its index is determined (e.g., the index of the user's foot position in fig. 11); a 3D mesh of the skeleton points other than the wrong skeleton point in the first action shown by the action presenter is computed; a 2D projection area of that 3D mesh is computed in the positive direction of the Z axis; and if the projection of the wrong skeleton point falls exactly within this 2D projection area and the Z coordinate of the wrong skeleton point is greater than or equal to the Z coordinates of all skeleton points contained in the 3D mesh, the wrong skeleton point is occluded. In that case, a plane XY perpendicular to the Z axis can be defined through the point with the smallest Z coordinate in the 3D mesh (point A in the figure), and the target 3D posture of the action presenter (i.e., the posture after the viewing angle is switched) is obtained by mirroring, across this XY plane, the whole of the first action shown by the presenter before the switch.
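A much-simplified sketch of the occlusion test follows. It approximates the 2D projection area by the XY bounding box of the other skeleton points rather than by a full 3D mesh, and it assumes larger Z means farther from the viewer; both are assumptions made only for illustration.

```python
import numpy as np

def is_occluded(wrong_point: np.ndarray, other_points: np.ndarray) -> bool:
    """Simplified occlusion check for a wrong skeleton point (viewer along +Z).

    wrong_point: (3,) array; other_points: (J, 3) array of the remaining
    skeleton points. The point counts as occluded if its XY projection falls
    inside the XY bounding box of the other points and its Z coordinate is
    greater than or equal to all of theirs.
    """
    xy_min = other_points[:, :2].min(axis=0)
    xy_max = other_points[:, :2].max(axis=0)
    inside = bool(np.all(wrong_point[:2] >= xy_min) and np.all(wrong_point[:2] <= xy_max))
    behind = bool(wrong_point[2] >= other_points[:, 2].max())
    return inside and behind
```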
If the point is not occluded, step 1005 is executed; if it is occluded, steps 1006-1007 or step 1008 are executed, and after step 1007 or step 1008, steps 1009-1011 are executed.
1005. The terminal prompts the user to correct the wrong action through a voice prompt, a text prompt, or a locally magnified display of the wrong skeleton point.
1006. The terminal generates the target 3D posture or the target 2D posture of the motion presenter.
The generation process of the target 3D pose or the target 2D pose may refer to the above, and is not described in detail.
1007. The terminal obtains the second video according to the target 3D posture or the target 2D posture and the first video.
The specific implementation of step 1007 can be referred to above, and is not described in detail.
1008. The terminal determines, according to the wrong skeleton point, the video among the N multi-orientation videos that shows the wrong skeleton point in the first action as the second video.
1009. The terminal plays the second video.
It should be noted that if the wrong skeleton point in the first action shown by the action presenter in the first video is occluded, the user cannot currently observe all the details of the first action; the viewing angle therefore needs to be switched so that the currently occluded wrong skeleton point can be shown and the user can correct the wrong action.
1010. The terminal judges whether i is less than or equal to n.
If so, it sets i = i + 1 and continues executing steps 1001-1009; if not, it executes step 1011.
Here the initial value of i is 0, and n is an integer greater than 0, which may be a preset value.
1011. The terminal plays the first video or determines not to switch videos.
In subsequent processes, the terminal may continue to identify the user's wrong skeletal points after a certain period of time.
In the method shown in fig. 10, during the user's learning process, the user's wrong skeleton point is identified and the second video in which the action presenter shows the first action is determined according to that wrong skeleton point, either by switching among the existing multi-orientation videos or by generating a new viewing-angle video in real time with the algorithm, so that the wrong skeleton point is displayed. This solves the prior-art problem that, because of occlusion relations between body limbs and the presenter's shooting angle, the user may be unable to perceive the correct action from a single viewing angle and the video viewing angle cannot be switched in a targeted way, giving a relatively poor experience.
In addition, in steps 1001 to 1011 above, the terminal may also play several videos in different orientations that show the wrong skeleton point of the first action, so that the user can correct the wrong action using the videos in multiple orientations.
In the embodiments of this application, the video played by the terminal may be a fitness video, a rehabilitation training video, a dance teaching video, or the like; this application is not limited in this respect.
The method provided in the embodiments of this application solves the problem that, when a user learns actions by following an action presenter, guidance given from a single viewing angle combined with voice, text, and the like is ambiguous. The user's gesture can be recognized intelligently; when there is a viewing-angle difference or an action error between the user's gesture and the presenter's gesture, the presenter's orientation is adjusted intelligently according to the user's gesture, or the wrong skeleton point is presented more clearly by means such as error-point projection, thereby improving the user's learning effect and comfort.
It should be noted that the actions performed by the terminal may also be performed by two or more devices, for example an image-capturing device and a video-processing device, where the image-capturing device captures the user's gesture and the video-processing device acquires and plays the second video according to the user's gesture.
The above description has presented the embodiments of the present application primarily from a method perspective. It is to be understood that each module, for example, the terminal, includes at least one of a hardware structure and a software module corresponding to each function in order to implement the above-described functions. Those of skill in the art would readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the terminal may be divided into the functional units according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Exemplarily, fig. 12 shows a schematic diagram of a possible structure of a terminal device (denoted as a terminal device 120) according to the foregoing embodiment, where the terminal device 120 includes a playing unit 1201 and an identifying unit 1202.
A playing unit 1201, configured to play a first video, where an orientation of an action presenter who presents a first action in the first video is a first orientation;
a recognition unit 1202 for recognizing a posture of a user;
the playing unit 1201 is further configured to play a second video according to the recognized gesture of the user, where an orientation of the motion presenter who presents the first motion in the second video is a second orientation.
Optionally, the identifying unit 1202 is further configured to determine an orientation of the user according to the gesture of the user; the playing unit 1201 is specifically configured to play the second video according to the identified orientation of the user.
Optionally, the terminal device further includes: a determining unit 1203, configured to determine, according to the orientation of the user, that a video in the N videos whose orientation is the same as the orientation of the user is the second video, where the N videos correspond to N orientations, the action presenter shows the first action in the N orientations, and N is an integer greater than 1.
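As a hedged illustration of selecting, among N pre-recorded orientations, the one that matches the user, the snippet below quantizes an estimated user yaw to the nearest recorded orientation. The eight-orientation layout, the file-naming scheme, and the assumption that the user's yaw has already been estimated are hypothetical details introduced for this example.

```python
def select_oriented_video(user_yaw_deg, oriented_videos):
    """Pick the pre-recorded video whose orientation is closest to the user's.

    oriented_videos maps an orientation in degrees (e.g. 0, 45, ..., 315)
    to a video identifier or file path.
    """
    def angular_distance(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    best = min(oriented_videos, key=lambda o: angular_distance(user_yaw_deg, o))
    return oriented_videos[best]

# Hypothetical usage with N = 8 recordings of the same first action:
videos = {o: f"action01_{o:03d}.mp4" for o in range(0, 360, 45)}
second_video = select_oriented_video(user_yaw_deg=100.0, oriented_videos=videos)
# second_video == "action01_090.mp4"
```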
Optionally, the identifying unit 1202 is further configured to determine an erroneous skeletal point of the user's motion according to the user's gesture; the playing unit 1201 is specifically configured to play a second video according to the identified erroneous bone point, where the action presenter in the second video shows the erroneous bone point in the first action.
Optionally, the terminal device further includes: the determining unit 1203, configured to determine, according to the wrong skeleton point, that a video showing the wrong skeleton point in the first action among the N videos is the second video, where the N videos correspond to N orientations, the action presenter in the N videos shows the first action in the N orientations, and N is an integer greater than 1.
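The sketch below illustrates one plausible realization of this pair of steps: flagging erroneous skeleton points by per-joint comparison against the reference pose, and then choosing, among candidate orientations, the one in which those joints are spread widest across the image plane. The visibility proxy (lateral spread after a yaw rotation) and the normalization assumptions are illustrative choices, not details taken from the embodiments.

```python
import numpy as np

def wrong_skeleton_points(user_pose3d, ref_pose3d, tol=0.15):
    """Indices of joints whose deviation from the reference pose exceeds tol.
    Both poses are (J, 3), assumed root-centered and scale-normalized."""
    err = np.linalg.norm(user_pose3d - ref_pose3d, axis=1)
    return np.where(err > tol)[0]

def best_orientation_for_joints(ref_pose3d, wrong_joints, orientations_deg):
    """Among candidate yaw orientations, pick the one that spreads the wrong
    joints widest across the image plane (a simple visibility proxy)."""
    if len(wrong_joints) == 0:
        return orientations_deg[0]   # nothing to highlight; keep the default view

    def rotate_yaw(points, deg):
        a = np.deg2rad(deg)
        R = np.array([[np.cos(a), 0.0, np.sin(a)],
                      [0.0,       1.0, 0.0],
                      [-np.sin(a), 0.0, np.cos(a)]])
        return points @ R.T

    def lateral_spread(deg):
        # Horizontal extent of the wrong joints after rotating the reference pose;
        # with a single wrong joint this degenerates and the first candidate wins.
        return float(np.ptp(rotate_yaw(ref_pose3d[wrong_joints], deg)[:, 0]))

    return max(orientations_deg, key=lateral_spread)
```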
Optionally, the gesture of the user is a 3D gesture, and the terminal device further includes a generating unit 1204; the recognition unit 1202 is further configured to obtain a 3D gesture of the action presenter according to the first video; the generating unit 1204 is configured to rotate and translate the 3D gesture of the action presenter and the 3D gesture of the user, align the two gestures to obtain a target 3D gesture of the action presenter, and generate a second video according to the target 3D gesture and the first video; the playing unit 1201 is specifically configured to play the second video.
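The rotate-translate-align step for two 3D gestures can be realized with the standard Kabsch (orthogonal Procrustes) algorithm; the generic implementation below is offered only as an illustration of the kind of computation involved, and the (J, 3) joint layout and the function name are assumptions of this example.

```python
import numpy as np

def align_rigid_3d(presenter3d, user3d):
    """Rigidly align the action presenter's 3D pose to the user's 3D pose.

    presenter3d and user3d are (J, 3) arrays of corresponding joints.
    Returns the target pose R @ presenter + t that is closest to the user's
    pose in the least-squares sense (Kabsch algorithm).
    """
    mu_p = presenter3d.mean(axis=0)
    mu_u = user3d.mean(axis=0)
    P = presenter3d - mu_p
    U = user3d - mu_u

    H = P.T @ U                              # 3x3 cross-covariance
    V, _, Wt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Wt.T @ V.T))   # guard against reflections
    R = Wt.T @ np.diag([1.0, 1.0, d]) @ V.T  # proper rotation, presenter -> user

    t = mu_u - R @ mu_p
    target_pose3d = presenter3d @ R.T + t
    return target_pose3d, R, t
```

The reflection guard keeps R a proper rotation, so the re-oriented presenter pose is never mirrored relative to the user.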
Optionally, the gesture of the user is a 2D gesture, and the terminal device further includes a generating unit 1204; the recognition unit 1202 is further configured to obtain a 3D gesture of the action presenter according to the first video; the generating unit 1204 is configured to rotate and translate the 3D gesture of the action presenter and the 2D gesture of the user, perform scale transformation and then align the two gestures to obtain a target 2D gesture of the action presenter, and generate a second video according to the target 2D gesture and the first video; the playing unit 1201 is specifically configured to play the second video.
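For the 2D case, one plausible reading of rotating, translating, scale-transforming and then aligning is an orthographic projection of the presenter's 3D pose followed by a 2D similarity (Umeyama/Procrustes) fit to the user's 2D keypoints. The sketch below implements that reading under the assumption of corresponding, orthographically projectable joints; it should not be taken as the exact transformation used in the embodiments.

```python
import numpy as np

def align_similarity_2d(presenter3d, user2d):
    """Fit the presenter's pose to the user's 2D keypoints with a 2D
    similarity transform (scale s, rotation R, translation t).

    presenter3d is (J, 3); user2d is (J, 2) with corresponding joints.
    Returns the target 2D pose s * (P2 @ R.T) + t.
    """
    P2 = presenter3d[:, :2]                  # simple orthographic projection
    mu_p, mu_u = P2.mean(axis=0), user2d.mean(axis=0)
    P = P2 - mu_p
    U = user2d - mu_u

    H = P.T @ U                              # 2x2 cross-covariance
    V, S, Wt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Wt.T @ V.T))
    D = np.diag([1.0, d])
    R = Wt.T @ D @ V.T                       # proper 2D rotation

    s = float((S * np.diag(D)).sum() / (P ** 2).sum())   # closed-form optimal scale
    t = mu_u - s * (R @ mu_p)
    target_pose2d = s * (P2 @ R.T) + t
    return target_pose2d, s, R, t
```

The scale term is the standard closed-form optimum of the similarity Procrustes problem, so a taller or shorter user does not penalize the alignment.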
Optionally, the terminal device 120 further includes a storage unit 1205. The storage unit 1205 is used to store computer execution instructions, and other units in the terminal device may execute corresponding actions according to the computer execution instructions stored in the storage unit 1205.
The terminal device 120 may be a single device, or may be a chip or a chip system.
The integrated unit in fig. 12, if implemented in the form of a software functional module and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be implemented in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
An embodiment of this application further provides a schematic diagram of a hardware structure of a terminal device. Referring to fig. 13, the terminal device includes a processor 1301 and, optionally, a memory 1302 connected to the processor 1301. The processor 1301 and the memory 1302 are connected by a bus.
The processor 1301 may include a CPU, GPU, NPU, microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in accordance with the present disclosure. The processor 1301 may also include multiple CPUs, and the processor 1301 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, or processing cores that process data (e.g., computer program instructions).
The memory 1302 may be a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including a compact disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, or the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, which is not limited herein. The memory 1302 may be separate from or integrated with the processor 1301. The memory 1302 may contain computer program code. The processor 1301 is configured to execute the computer program code stored in the memory 1302, thereby implementing the methods provided in the embodiments of this application.
Processor 1301 is configured to control and manage actions of the terminal, for example, processor 1301 is configured to perform steps 301 to 303 in fig. 3, steps performed by the terminal in fig. 5, steps performed by the terminal in fig. 7, steps 801 to 805 in fig. 8, steps 901 to 905 in fig. 9, steps 1001 to 1011 in fig. 10, and/or actions performed by the terminal in other procedures described in this embodiment of the present application. The memory 1302 is used for storing program codes and data of the terminal.
In implementation, the steps of the method provided by this embodiment may be implemented by hardware integrated logic circuits in a processor or instructions in the form of software. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor.
Embodiments of the present application also provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform any of the above methods.
Embodiments of the present application also provide a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
In the foregoing embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When a software program is used for implementation, the embodiments may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (SSD)).
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the present application has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and figures are merely exemplary of the present application as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (17)

1. An intelligent visual interaction method, comprising:
the terminal plays a first video, wherein the orientation of an action presenter who presents a first action in the first video is a first orientation;
the terminal identifies the gesture of a user;
and the terminal plays a second video according to the gesture of the user, wherein the orientation of the action presenter who presents the first action in the second video is a second orientation.
2. The method according to claim 1, wherein the terminal plays a second video according to the gesture of the user, comprising:
the terminal determines the orientation of the user according to the posture of the user;
and the terminal plays the second video according to the orientation of the user.
3. The method of claim 2, wherein before the terminal plays the second video according to the orientation of the user, the method further comprises:
the terminal determines, according to the orientation of the user, that a video in the N videos whose orientation is the same as the orientation of the user is the second video, where the N videos correspond to N orientations, the action presenter shows the first action in the N orientations in the N videos, and N is an integer greater than 1.
4. The method according to claim 1, wherein the terminal plays a second video according to the gesture of the user, comprising:
the terminal determines an error skeleton point of the user action according to the posture of the user;
and the terminal plays a second video according to the wrong bone point, and the action presenter in the second video displays the wrong bone point in the first action.
5. The method of claim 4, wherein before the terminal plays the second video according to the wrong bone point, the method further comprises:
the terminal determines, according to the wrong bone point, that a video showing the wrong bone point in the first action in N videos is the second video, where the N videos correspond to N orientations, the action presenter shows the first action in the N orientations, and N is an integer greater than 1.
6. The method according to claim 1, wherein the gesture of the user is a 3-dimensional (3D) gesture, and the terminal plays the second video according to the gesture of the user, comprising:
the terminal acquires the 3D posture of the action presenter according to the first video;
the terminal aligns the 3D gesture of the action presenter and the 3D gesture of the user after rotating and translating to obtain a target 3D gesture of the action presenter;
and the terminal generates and plays the second video according to the target 3D posture and the first video.
7. The method of claim 1, wherein the gesture of the user is a 2-dimensional (2D) gesture, and wherein the terminal plays the second video according to the gesture of the user, comprising:
the terminal acquires the 3D posture of the action presenter according to the first video;
the terminal rotates and translates the 3D gesture of the action presenter and the 2D gesture of the user, and aligns the two gestures after scale transformation to obtain a target 2D gesture of the action presenter;
and the terminal generates and plays the second video according to the target 2D posture and the first video.
8. A terminal device, comprising:
the playing unit is used for playing a first video, and the orientation of an action presenter who presents a first action in the first video is a first orientation;
a recognition unit for recognizing a posture of a user;
the playing unit is further configured to play a second video according to the recognized gesture of the user, where an orientation of the action presenter who presents the first action in the second video is a second orientation.
9. The terminal device according to claim 8, wherein
the recognition unit is further used for determining the orientation of the user according to the gesture of the user;
the playing unit is specifically configured to play the second video according to the identified orientation of the user.
10. The terminal device according to claim 9, wherein the terminal device further comprises:
a determining unit, configured to determine, according to the orientation of the user, that a video in N videos whose orientation is the same as the orientation of the user is the second video, where the N videos correspond to N orientations, the action presenter shows the first action in the N orientations, and N is an integer greater than 1.
11. The terminal device according to claim 8, wherein
the recognition unit is further used for determining an error skeleton point of the action of the user according to the gesture of the user;
the playing unit is specifically configured to play a second video according to the identified erroneous bone point, where the action presenter in the second video presents the erroneous bone point in the first action.
12. The terminal device according to claim 11, wherein the terminal device further comprises:
a determining unit, configured to determine, according to the erroneous bone point, that a video showing the erroneous bone point in the first action in N videos is the second video, where the N videos correspond to N orientations, the action presenter shows the first action in the N orientations, and N is an integer greater than 1.
13. The terminal device according to claim 8, wherein the gesture of the user is a 3-dimensional (3D) gesture, the terminal device further comprising a generating unit;
the identification unit is further used for acquiring the 3D gesture of the action presenter according to the first video;
the generating unit is used for aligning the 3D gesture of the action presenter and the 3D gesture of the user after rotating and translating to obtain a target 3D gesture of the action presenter and generating the second video according to the target 3D gesture and the first video;
the playing unit is specifically configured to play the second video.
14. The terminal device according to claim 8, wherein the gesture of the user is a 2-dimensional (2D) gesture, the terminal device further comprising a generating unit;
the identification unit is further used for acquiring the 3D gesture of the action presenter according to the first video;
the generating unit is used for rotating and translating the 3D gesture of the action presenter and the 2D gesture of the user, performing scale transformation and then aligning to obtain a target 2D gesture of the action presenter, and generating the second video according to the target 2D gesture and the first video;
the playing unit is specifically configured to play the second video.
15. A terminal device, comprising: a processor;
the processor is coupled to a memory, the memory configured to store computer-executable instructions, the processor executing the computer-executable instructions stored by the memory to cause the terminal device to implement the method of any one of claims 1-7.
16. A computer-readable storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising computer instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-7.
CN202110444150.1A (priority date 2021-04-23, filing date 2021-04-23): Intelligent visual interaction method and device. Publication: CN115243106A (Pending).

Priority Applications (1)

Application Number: CN202110444150.1A | Priority Date: 2021-04-23 | Filing Date: 2021-04-23 | Title: Intelligent visual interaction method and device

Applications Claiming Priority (1)

Application Number: CN202110444150.1A | Priority Date: 2021-04-23 | Filing Date: 2021-04-23 | Title: Intelligent visual interaction method and device

Publications (1)

Publication Number: CN115243106A | Publication Date: 2022-10-25

Family

ID=83665886

Family Applications (1)

Application Number: CN202110444150.1A | Title: Intelligent visual interaction method and device | Status: Pending

Country Status (1)

Country: CN | Publication: CN115243106A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination