CN110189364B - Method and device for generating information, and target tracking method and device


Info

Publication number: CN110189364B (granted publication of application CN201910480692.7A)
Authority: CN (China)
Prior art keywords: video frame, position information, palm, wrist, target video
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN110189364A
Inventor: 卢艺帆
Current Assignee: Beijing ByteDance Network Technology Co Ltd
Original Assignee: Beijing ByteDance Network Technology Co Ltd
Events: application filed by Beijing ByteDance Network Technology Co Ltd; priority to CN201910480692.7A; publication of CN110189364A; application granted; publication of CN110189364B; active legal status; anticipated expiration

Classifications

    • G06T 7/246 (Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments)
    • G06T 7/73 (Image analysis; determining position or orientation of objects or cameras using feature-based methods)
    • G06T 2207/10016 (Indexing scheme for image analysis or image enhancement; image acquisition modality; video; image sequence)
    • G06T 2207/20081 (Indexing scheme for image analysis or image enhancement; special algorithmic details; training; learning)
    • G06T 2207/20084 (Indexing scheme for image analysis or image enhancement; special algorithmic details; artificial neural networks [ANN])
    • G06T 2207/30196 (Indexing scheme for image analysis or image enhancement; subject of image; context of image processing; human being; person)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present disclosure disclose a method and an apparatus for generating information, and a target tracking method and apparatus. One embodiment of the method for generating information comprises: acquiring a target video; selecting a video frame from the target video; determining wrist position information of a wrist object in the video frame; and generating, based on the wrist position information, palm position information of a palm object in a subsequent video frame of the video frame. This embodiment determines the palm position information based on the wrist position information, which enriches the ways in which palm position information can be determined and helps improve the accuracy of the determined palm position.

Description

Method and device for generating information, and target tracking method and device
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for generating information, and a target tracking method and an apparatus.
Background
Target tracking is a technique for locating a target object in a video. It is a problem of widespread interest and research in the field of image and video processing. In target tracking, a first video frame of the video is initialized, that is, the target to be tracked is determined in the first video frame, and the position of that target then needs to be determined in each subsequent video frame of the target video.
Existing target tracking algorithms are mainly classified into the following two categories:
Generative model: a target model is built through online learning, and the model is then used to search for the image region with the minimum reconstruction error, thereby completing target localization.
Discriminative model: target tracking is treated as a binary classification problem; target and background information are extracted simultaneously to train a classifier that separates the target from the background of the image sequence, thereby obtaining the target position in the current frame.
Disclosure of Invention
The present disclosure proposes a method and apparatus for generating information, and a target tracking method and apparatus.
In a first aspect, an embodiment of the present disclosure provides a method for generating information, the method including: acquiring a target video; selecting a video frame from a target video; determining wrist position information of a wrist object in a video frame; palm position information of a palm object in a subsequent video frame of the video frames is generated based on the wrist position information.
In some embodiments, determining wrist position information for a wrist object in a video frame comprises: and inputting the video frame into a pre-trained joint positioning model to obtain wrist position information of the wrist object in the video frame, wherein the joint positioning model is used for determining the position of human body joint points, and the human body joint points comprise wrists.
In some embodiments, generating palm position information for a palm object in a subsequent video frame of the video frame based on the wrist position information comprises: inputting an image area corresponding to the wrist position information in a subsequent video frame of the video frame into a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area.
In some embodiments, generating palm position information for a palm object in a subsequent video frame of the video frame based on the wrist position information comprises: amplifying an image area corresponding to the wrist position information in a subsequent video frame of the video frame to obtain an amplified image area; inputting the amplified image area to a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area; responding to the palm position information to indicate that the amplified image area does not comprise a palm object, and inputting a subsequent video frame into the joint positioning model to obtain wrist position information of the wrist object; and inputting an image area corresponding to the wrist position information in the subsequent video frame into the tracking model to obtain the palm position information of the palm object.
In some embodiments, the target video is a video currently being captured and presented.
In a second aspect, an embodiment of the present disclosure provides a target tracking method, including: acquiring a current shot and presented target video; selecting a video frame from the target video as a first target video frame, and executing the following tracking steps: inputting the first target video frame into a pre-trained joint positioning model to obtain wrist position information of a wrist object in the first target video frame, wherein the joint positioning model is used for determining the position of a human body joint point, the human body joint point comprises a wrist, and the following tracking substep is executed: selecting a subsequent video frame of the first target video frame from the target video as a second target video frame: determining an image area corresponding to wrist position information of the wrist object in the first target video frame in the second target video frame; inputting an image area in a second target video frame into a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area; in response to the palm position information of the palm object in the second target video frame indicating that the palm object is contained in the image region, palm position information of the palm object in the second target video frame is determined based on the palm position information of the palm object in the image region.
In some embodiments, the method further comprises: in response to the second target video frame not being the last frame in the target video, taking the second target video frame as the first target video frame and continuing to perform the tracking substep.
In some embodiments, the method further comprises: and in response to the palm position information of the palm object in the second target video frame indicating that the palm object is not contained in the image area, taking the second target video frame as the first target video frame, and continuing to execute the tracking step.
In a third aspect, an embodiment of the present disclosure provides an apparatus for generating information, the apparatus including: a first acquisition unit configured to acquire a target video; a first selecting unit configured to select a video frame from a target video; a determining unit configured to determine wrist position information of a wrist object in a video frame; a generating unit configured to generate palm position information of a palm object in a subsequent video frame of the video frames based on the wrist position information.
In some embodiments, the determining unit comprises: the first input module is configured to input the video frame to a pre-trained joint positioning model to obtain wrist position information of a wrist object in the video frame, wherein the joint positioning model is used for determining positions of human body joint points, and the human body joint points comprise wrists.
In some embodiments, the generating unit comprises: and the second input module is configured to input an image area corresponding to the wrist position information in a subsequent video frame of the video frames into a pre-trained tracking model to obtain the palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area.
In some embodiments, the generating unit comprises: the amplification processing module is configured to amplify an image area corresponding to the wrist position information in a subsequent video frame of the video frame to obtain an amplified image area; a third input module configured to input the enlarged image region to a pre-trained tracking model, resulting in palm position information, wherein the tracking model is used for determining the position of a palm object in the input image region; a fourth input module configured to input a subsequent video frame to the joint positioning model in response to the palm position information indicating that the palm object is not included in the magnified image area, resulting in wrist position information for the wrist object; and the fifth input module is configured to input the image area corresponding to the wrist position information in the subsequent video frame into the tracking model, so as to obtain the palm position information of the palm object.
In some embodiments, the target video is a video currently being captured and presented.
In a fourth aspect, an embodiment of the present disclosure provides a target tracking apparatus, including: a second acquisition unit configured to acquire a target video currently captured and presented; a second selecting unit configured to select a video frame from the target video as a first target video frame, and perform the following tracking steps: inputting the first target video frame into a pre-trained joint positioning model to obtain wrist position information of a wrist object in the first target video frame, wherein the joint positioning model is used for determining the position of a human body joint point, the human body joint point comprises a wrist, and the following tracking substep is executed: selecting a subsequent video frame of the first target video frame from the target video as a second target video frame: determining an image area corresponding to wrist position information of the wrist object in the first target video frame in the second target video frame; inputting an image area in a second target video frame into a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area; in response to the palm position information of the palm object in the second target video frame indicating that the palm object is contained in the image region, palm position information of the palm object in the second target video frame is determined based on the palm position information of the palm object in the image region.
In some embodiments, the apparatus further comprises: a first execution unit configured to continue to execute the tracking sub-step with the second target video frame as the first target video frame in response to the second target video frame not being the last frame in the target video.
In some embodiments, the apparatus further comprises: a second execution unit configured to continue executing the tracking step with the second target video frame as the first target video frame in response to the palm position information of the palm object in the second target video frame indicating that the palm object is not contained in the image region.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method for generating information as in the first aspect above or the method of any embodiment of the target tracking method in the second aspect above.
In a sixth aspect, embodiments of the present disclosure provide a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method for generating information as in the first aspect or the method of any embodiment of the target tracking method in the second aspect.
According to the method and the device for generating information and the target tracking method and the device, the target video is obtained, then the video frame is selected from the target video, then the wrist position information of the wrist object in the video frame is determined, and finally the palm position information of the palm object in the subsequent video frame of the video frame is generated based on the wrist position information, so that the palm position information is determined based on the wrist position information, the determination mode of the palm position information is enriched, and the accuracy of the determined palm position is improved.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for generating information, according to the present disclosure;
FIGS. 3A-3C are schematic diagrams of one application scenario of a method for generating information according to the present disclosure;
FIG. 4 is a flow diagram for one embodiment of a target tracking method according to the present disclosure;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for generating information according to the present disclosure;
FIG. 6 is a schematic block diagram of one embodiment of a target tracking device according to the present disclosure;
FIG. 7 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the method for generating information or the apparatus for generating information, or the target tracking method or the target tracking apparatus, of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or transmit data or the like. The terminal devices 101, 102, 103 may have various client applications installed thereon, such as video playing software, news information applications, image processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting video playback, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. They may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. No specific limitation is made herein.
The server 105 may be a server that provides various services, such as a background server that processes video transmitted by the terminal devices 101, 102, 103. The background server may perform processing such as analysis on the received video, and obtain a processing result (e.g., palm position information of a palm object in the video frame). By way of example, the server 105 may be a virtual server or a physical server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be further noted that the method for generating information provided by the embodiments of the present disclosure may be executed by a server, may also be executed by a terminal device, and may also be executed by the server and the terminal device in cooperation with each other. Accordingly, each part (for example, each unit, sub-unit, module, sub-module) included in the apparatus for generating information may be entirely provided in the server, may be entirely provided in the terminal device, and may be provided in the server and the terminal device, respectively. In addition, the target tracking method provided by the embodiment of the disclosure can be executed by the server, the terminal device, or the server and the terminal device in cooperation with each other. Accordingly, each part (for example, each unit, sub-unit, module, sub-module) included in the target tracking device may be entirely disposed in the server, may be entirely disposed in the terminal device, and may be disposed in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the system architecture may only include the electronic device (e.g., server or terminal device) on which the target tracking method operates, when the electronic device on which the method for generating information operates does not require data transfer with other electronic devices.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating information in accordance with the present disclosure is shown. The method for generating information comprises the following steps:
step 201, acquiring a target video.
In this embodiment, an execution subject (for example, a terminal device or a server shown in fig. 1) of the method for generating information may acquire the target video from other electronic devices or locally through a wired connection manner or a wireless connection manner.
The target video may be any video. As an example, the target video may be a video obtained by shooting a hand (including a wrist and a palm) of a person. It is understood that, when the target video is a video obtained by shooting the hand, all or part of the video frames included in the target video may include the palm object and/or the wrist object. Here, the palm object may be an image of a palm presented in the video frame. The wrist object may be an image of the wrist presented in a video frame.
Step 202, selecting a video frame from the target video.
In this embodiment, the execution subject may select a video frame from the target video acquired in step 201.
As an example, the execution subject may randomly select a video frame from the target video, or may select a video frame meeting a preset condition from the target video. For example, the preset conditions may include: the selected video frame is the first video frame in the target video, or the selected video frame is the video frame of the target video currently presented by the target terminal. When the execution subject is a terminal device, the target terminal may be the execution subject itself; when the execution subject is a server, the target terminal may be a terminal device communicatively connected to the execution subject.
In step 203, wrist position information of the wrist object in the video frame is determined.
In this embodiment, the execution subject may determine the wrist position information of the wrist object in the video frame selected in step 202.
Here, the wrist position information may be used to indicate the position of the wrist object in the video frame. The wrist position information may be represented by a rectangular frame containing the wrist object in the video frame (for example, a minimum bounding rectangle containing the wrist object in the video frame), a circle, or a contour line of the wrist object. It may also be represented by coordinates. As an example, the coordinates may be the coordinates of the center point or centroid of the wrist object in the video frame, or the coordinates of a rectangular frame containing the wrist object in the video frame. For example, the coordinates may be "(x, y, w, h)", where x is the abscissa of the upper-left corner of the rectangular frame containing the wrist object in the coordinate system determined for the video frame, y is the ordinate of that corner in the same coordinate system, w is the width of the rectangular frame, and h is the height of the rectangular frame.
For example, the coordinate system determined for the video frame may be a coordinate system with a pixel point located at the upper left corner of the video frame as an origin and two vertical edges of the video frame as an x-axis and a y-axis, respectively.
Optionally, the wrist position information may also represent that "no wrist object is included in the video frame". As an example, in this scenario, the wrist position information may be "null".
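As an illustrative sketch (not part of the original disclosure), the "(x, y, w, h)" convention and the corresponding image-region crop can be expressed as follows, assuming the video frame is a NumPy array in (height, width, channel) order:

```python
import numpy as np

def crop_box(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Extract the image region described by an (x, y, w, h) box.

    x and y are the coordinates of the upper-left corner of the rectangular
    frame in the coordinate system whose origin is the upper-left pixel of the
    video frame; w and h are the rectangle's width and height in pixels.
    """
    x, y, w, h = box
    # NumPy indexes rows (the y axis) first, then columns (the x axis).
    return frame[y:y + h, x:x + w]

# Example: a wrist box located at (100, 100) with a size of 100 x 100 pixels.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
wrist_region = crop_box(frame, (100, 100, 100, 100))
print(wrist_region.shape)  # (100, 100, 3)
```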
As an example, the executing main body may execute the step 203 in the following manner:
inputting the video frame selected in step 202 into a pre-trained wrist positioning model to obtain wrist position information of the wrist object in the video frame selected in step 202. Among other things, the wrist-location model may be used to determine wrist-location information for wrist objects in video frames.
As an example, the wrist positioning model may be a deep neural network trained based on a training sample set by using a deep learning algorithm. The training samples in the training sample set may include the sample video frame and the wrist position information of the wrist object in the sample video frame. It can be understood that the wrist position information of the wrist object in the sample video frame can be obtained by pre-labeling by a labeling person or a device with a labeling function.
Alternatively, the wrist positioning model may be a two-dimensional table or a database obtained by a technician through a large number of statistics and storing wrist position information of the video frames and the wrist objects in the video frames in an associated manner.
In some optional implementations of this embodiment, the executing main body may also execute the step 203 in the following manner:
and inputting the video frame into a joint positioning model trained in advance to obtain the wrist position information of the wrist object in the video frame. The joint positioning model is used for determining the position of a human body joint point, and the human body joint point comprises a wrist.
Here, the joint positioning model may be used only for determining the position of the wrist (and not for determining the positions of other joint points other than the wrist). In this scenario, the joint localization model may be a deep learning model trained using a machine learning algorithm based on a set of training samples including a video frame and pre-labeled wrist position information indicating a position of a wrist in the video frame.
Alternatively, the joint positioning model described above may also be used to determine the positions of a plurality of joint points of the human body. For example, the joint positioning model may be used to determine the locations of the following joint points: shoulder, elbow, wrist, hip, knee, ankle. In this scenario, the joint positioning model may be a deep learning model obtained by training, using a machine learning algorithm, based on a set of training samples including video frames and pre-labeled joint position information indicating the positions of these joint points in the video frames.
It will be appreciated that the above-described joint localization model may be a human pose estimation model, such as: DensePose, OpenPose, Realtime Multi-Person Pose Estimation, and the like.
It should be understood that when the joint positioning model is used to determine the positions of a plurality of joint points of the human body, the relative positions of these joint points follow certain regularities, so the wrist position information determined in this scenario is more accurate than wrist position information determined in other manners.
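A minimal sketch of how such a pre-trained joint positioning model might be queried for the wrist position information; the `pose_model` callable and its keypoint layout are hypothetical placeholders, since the disclosure does not fix a concrete model or interface:

```python
import numpy as np

def locate_wrist(pose_model, frame: np.ndarray, box_size: int = 100):
    """Return wrist position information as an (x, y, w, h) box, or None.

    `pose_model` is assumed to map a video frame to a dict of joint name to
    (x, y) pixel coordinate, e.g. {"wrist": (240, 360), "elbow": ...}; joints
    that are not detected are absent from the dict.
    """
    keypoints = pose_model(frame)
    wrist = keypoints.get("wrist")
    if wrist is None:
        return None  # corresponds to "null" wrist position information
    cx, cy = wrist
    # Build a box of a predetermined size centred on the detected wrist joint.
    x = max(0, int(cx - box_size / 2))
    y = max(0, int(cy - box_size / 2))
    return (x, y, box_size, box_size)
```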
And step 204, generating palm position information of the palm object in the subsequent video frame of the video frame based on the wrist position information.
In this embodiment, the execution body described above may generate palm position information of a palm object in a video frame subsequent to the video frame based on the wrist position information.
The subsequent video frame may be a video frame in the target video, which is adjacent to the video frame selected in step 202 (hereinafter, "the video frame selected in step 202" is referred to as a reference video frame) and is located after the reference video frame, or may be a video frame in the target video, which is separated from the reference video frame by a preset number (for example, 5, 1, and the like) of video frames and is located after the reference video frame.
The palm position information may be used to indicate the position of the palm object in the video frame. The palm position information may be represented by a rectangular frame containing the palm object in the video frame (for example, a minimum bounding rectangle containing the palm object in the video frame), a circle, or a contour line of the palm object. It may also be represented by coordinates. As an example, the coordinates may be the coordinates of the center point or centroid of the palm object in the video frame, or the coordinates of a rectangular frame containing the palm object in the video frame. For example, the coordinates may be "(x, y, w, h)", where x is the abscissa of the upper-left corner of the rectangular frame containing the palm object in the coordinate system determined for the video frame, y is the ordinate of that corner in the same coordinate system, w is the width of the rectangular frame, and h is the height of the rectangular frame.
For example, the coordinate system determined for the video frame may be a coordinate system with a pixel point located at the upper left corner of the video frame as an origin and two vertical edges of the video frame as an x-axis and a y-axis, respectively.
Optionally, the palm position information may also represent that "no palm object is included in the video frame". As an example, in this scenario, the palm position information may be "null".
In some optional implementations of this embodiment, the executing main body may execute the step 204 in the following manner:
and inputting an image area corresponding to the wrist position information in a subsequent video frame of the video frame into a pre-trained tracking model to obtain palm position information. Wherein the tracking model is used to determine the position of the palm object in the input image region.
Here, the position of the image area corresponding to the wrist position information in the subsequent video frame may be the same as the position of the image area indicated by the wrist position information in the reference video frame. It is understood that if the wrist position information is "(100,100,100,100)", the wrist position information may represent that the horizontal and vertical coordinates of the corner point on the upper left of the rectangular frame containing the wrist object in the coordinate system determined for the video frame are both 100 pixels, and the length and width of the rectangular frame of the wrist object are both 100 pixels. Then, in this scenario, the image region corresponding to the wrist position information may be the image region located at (100,100,100,100) in the video frame subsequent to the reference video frame.
It is understood that the size of the image area indicated by the wrist position information may be predetermined. For example, the size of the image area indicated by the wrist position information may be "100 pixels × 100 pixels", or may be determined based on the size of the wrist object in the video frame. The image area indicated by the wrist position information may or may not include a palm object.
It should be understood that, since the tracking model determines the position information of the palm object within an image area rather than over the whole video frame, the technical solution of this optional implementation can reduce the amount of computation performed by the execution subject and increase the speed at which the palm position information is generated, compared with determining the palm position information of the palm object directly from the entire video frame.
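A sketch of this optional implementation is shown below; `tracking_model` is assumed to take a cropped image region and return the palm box in the region's local coordinates, or None when no palm object is present (the concrete interface is not specified by the disclosure):

```python
def palm_from_wrist_box(tracking_model, next_frame, wrist_box):
    """Generate palm position information in the subsequent video frame.

    The image area at the same (x, y, w, h) location as the wrist box in the
    reference video frame is cropped from the subsequent video frame and passed
    to the tracking model, which only has to search this small area.
    """
    x, y, w, h = wrist_box
    region = next_frame[y:y + h, x:x + w]
    local_box = tracking_model(region)  # (px, py, pw, ph) in region coordinates, or None
    if local_box is None:
        return None
    px, py, pw, ph = local_box
    # Translate the palm box back into full-frame coordinates.
    return (x + px, y + py, pw, ph)
```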
In some optional implementations of this embodiment, the executing main body may also execute the step 204 in the following manner:
the method comprises the following steps of firstly, carrying out amplification processing on an image area corresponding to wrist position information in a subsequent video frame of a video frame to obtain an amplified image area.
Here, the position of the image area corresponding to the wrist position information in the subsequent video frame may be the same as the position of the image area indicated by the wrist position information in the reference video frame. It is understood that if the wrist position information is "(100,100,100,100)", the wrist position information may represent that the horizontal and vertical coordinates of the corner point on the upper left of the rectangular frame containing the wrist object in the coordinate system determined for the video frame are both 100 pixels, and the length and width of the rectangular frame of the wrist object are both 100 pixels. Then, in this scenario, the image region corresponding to the wrist position information may be the image region located at (100,100,100,100) in the video frame subsequent to the reference video frame.
The enlarged image area obtained by the enlargement processing may contain the image area indicated by the wrist position information before enlargement. As an example, the area of the enlarged image area, or the number of pixels it contains, may be 1.2 times, 1.5 times, or the like, that of the image area indicated by the wrist position information before enlargement.
And secondly, inputting the amplified image area into a pre-trained tracking model to obtain palm position information. Wherein the tracking model is used to determine the position of the palm object in the input image region.
As an example, the tracking model may be trained as follows:
first, a training sample set is obtained. Wherein the training sample comprises an image area and predetermined position information of the palm object in the image area.
Then, the initial model is trained using a machine learning algorithm, with the image regions included in the training samples of the training sample set as input data of the initial model and the position information corresponding to the input image regions as expected output data of the initial model. The initial model obtained when training is completed is determined as the trained tracking model (a minimal training-loop sketch is given after the fourth step below).
Here, a training completion condition may be set in advance to determine whether the initial model has completed training. The training completion condition may include, but is not limited to, at least one of the following: the number of training iterations exceeds a preset number, the training time exceeds a preset duration, or the value calculated from a predetermined loss function falls below a preset threshold.
The initial model may be an untrained model or a model that is trained but does not satisfy the training completion condition (e.g., a convolutional neural network).
Alternatively, the tracking model may be a two-dimensional table or database obtained by a technician through a large number of statistics and storing an image region and predetermined position information of the palm object in the image region in an associated manner.
And thirdly, responding to the palm position information indicating that the amplified image area does not comprise the palm object, and inputting the subsequent video frame into the joint positioning model to obtain the wrist position information of the wrist object.
And fourthly, inputting an image area corresponding to the wrist position information in the subsequent video frame into the tracking model to obtain the palm position information of the palm object.
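A minimal PyTorch-style sketch of the training procedure for the tracking model described above, assuming the training samples are batched (image region, palm box) tensors and the initial model is a network regressing the box coordinates; the optimizer, loss function, batch size, and thresholds are illustrative assumptions rather than part of the disclosure:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_tracking_model(initial_model: nn.Module, dataset,
                         max_epochs: int = 50, loss_threshold: float = 1e-3,
                         lr: float = 1e-4) -> nn.Module:
    """Train an initial model into the tracking model.

    Each training sample is assumed to be a pair (image_region, palm_box):
    the image region is the input data and the pre-annotated palm position
    information "(x, y, w, h)" is the expected output data.
    """
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(initial_model.parameters(), lr=lr)
    criterion = nn.SmoothL1Loss()  # regression loss on the box coordinates

    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for regions, palm_boxes in loader:
            optimizer.zero_grad()
            predicted_boxes = initial_model(regions)
            loss = criterion(predicted_boxes, palm_boxes)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Training completion condition: average loss below a preset threshold
        # (the epoch budget above is the other completion condition).
        if epoch_loss / len(loader) < loss_threshold:
            break
    return initial_model
```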
It will be appreciated that the joint positioning model determines the position information of the wrist object over the whole video frame, whereas the tracking model determines the position information of the palm object only within an image area contained in the video frame. Thus, for a single input, the joint positioning model requires more computation than the tracking model. Therefore, when the palm position information indicates that the enlarged image area does not contain a palm object, this optional implementation obtains the wrist position information of the wrist object based on the joint positioning model and then determines the palm position information; when the palm position information indicates that the enlarged image area does contain a palm object, the palm position information can be obtained directly. This ensures the accuracy of locating the palm object in the video frame while improving the locating speed.
In some cases, when the palm position information obtained from the tracking model indicates that the image region does not contain a palm object, other image regions of the video frame may still contain the palm object. In this scenario, inputting the subsequent video frame to the joint positioning model yields the position information of the wrist object in the subsequent video frame, from which the palm position information can then be obtained. Therefore, compared with a technical scheme that concludes the subsequent video frame contains no palm object as soon as the position information obtained from the tracking model indicates that the image region contains no palm object, this optional implementation can improve the accuracy of palm positioning. On the other hand, compared with a technical scheme of inputting every video frame to the joint positioning model to determine the plurality of joint points, this optional implementation reduces the amount of computation performed by the execution subject and increases the computation speed.
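Putting the four steps above together, the enlargement and the fallback from the lightweight tracking model to the heavier joint positioning model might be sketched as follows, reusing the hypothetical helpers `locate_wrist` and `palm_from_wrist_box` from the earlier sketches; the 1.5x scale factor and the clamping to the frame boundary are assumptions:

```python
def enlarge_box(box, frame_shape, scale: float = 1.5):
    """Enlarge an (x, y, w, h) box about its centre and clamp it to the frame."""
    x, y, w, h = box
    frame_h, frame_w = frame_shape[:2]
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale, h * scale
    new_x = max(0, int(cx - new_w / 2.0))
    new_y = max(0, int(cy - new_h / 2.0))
    new_w = int(min(new_w, frame_w - new_x))
    new_h = int(min(new_h, frame_h - new_y))
    return (new_x, new_y, new_w, new_h)

def palm_in_subsequent_frame(joint_model, tracking_model, wrist_box, next_frame):
    """Generate palm position information with a fallback to the joint positioning model."""
    # Steps one and two: search only the enlarged image area (cheap).
    enlarged = enlarge_box(wrist_box, next_frame.shape)
    palm_box = palm_from_wrist_box(tracking_model, next_frame, enlarged)
    if palm_box is not None:
        return palm_box
    # Steps three and four: the enlarged area contains no palm object, so re-locate
    # the wrist in the whole subsequent frame with the joint positioning model,
    # then run the tracking model again on the new wrist region.
    new_wrist_box = locate_wrist(joint_model, next_frame)
    if new_wrist_box is None:
        return None
    return palm_from_wrist_box(tracking_model, next_frame, new_wrist_box)
```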
In some optional implementations of the embodiment, the target video is a video currently captured and presented.
Here, the execution subject may be a terminal device. It will be appreciated that this optional implementation can determine the position of the palm object in the video currently presented by the terminal device in real time. After the position of the palm object is determined, the execution subject can render a preset image or add a preset special effect at a target position relative to the palm object, thereby enriching the way images are presented. The target position may be a position determined in advance relative to the position of the palm.
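For instance, a simple highlight could be rendered at the located palm position with OpenCV; this is an illustrative sketch, and the concrete effect is not mandated by the disclosure:

```python
import cv2

def draw_palm_effect(frame, palm_box):
    """Draw a simple highlight at the determined palm position of the current frame."""
    if palm_box is None:
        return frame
    x, y, w, h = palm_box
    # A box around the palm object and a filled dot at the target position
    # (here simply the centre of the palm box).
    cv2.rectangle(frame, (x, y), (x + w, y + h), color=(0, 255, 0), thickness=2)
    cv2.circle(frame, (x + w // 2, y + h // 2), radius=10, color=(0, 0, 255), thickness=-1)
    return frame
```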
With continuing reference to FIGS. 3A-3C, FIGS. 3A-3C are schematic diagrams of an application scenario of the method for generating information according to the present embodiment. In FIG. 3A, the mobile phone first acquires a target video (e.g., a video captured in real time by an image acquisition device of the mobile phone). The handset then selects a video frame 301 from the target video (e.g., the video frame currently being presented by the handset). Thereafter, referring to FIG. 3B, the handset determines the wrist position information 302 of the wrist object in the video frame 301 (here, the wrist position information 302 is represented by a rectangular box containing the wrist object in the video frame 301). Finally, referring to FIG. 3C, the handset generates palm position information 304 of the palm object in a subsequent video frame 303 of the video frame 301 based on the wrist position information 302 (here, the palm position information 304 is represented by a rectangular box containing the palm object in the video frame 303).
At present, existing schemes for locating a palm object in a video frame often use a model trained with a machine learning algorithm to determine the position of the palm object in a reference video frame and in the video frames following the reference video frame. Such schemes typically need to determine the position of the palm object from the entire video frame, which takes a long time and is computation-intensive. Moreover, since the human palm is flexible and moves quickly, the position of the palm may differ greatly between video frames of a video containing a palm object, which makes accurate positioning difficult for such schemes.
According to the method provided by the above embodiment of the disclosure, the target video is acquired, then the video frame is selected from the target video, then the wrist position information of the wrist object in the video frame is determined, and finally the palm position information of the palm object in the subsequent video frame of the video frame is generated based on the wrist position information, so that the position of the palm is determined based on the position of the wrist with lower flexibility relative to the palm, the palm object in the video frame can be positioned based on the image area corresponding to the wrist position information rather than the whole video frame, the amount of calculation consumed for generating the palm position information is reduced, the determination mode of the palm position information is enriched, and the accuracy of the determined palm position is improved.
With further reference to FIG. 4, a flow 400 of one embodiment of a target tracking method is shown. The process 400 of the target tracking method includes the following steps:
step 401, obtaining a currently shot and presented target video.
In this embodiment, an executing subject (for example, the terminal device shown in fig. 1) of the target tracking method may acquire a target video currently captured and presented.
Thereafter, the execution body may perform step 402.
Here, the execution subject may be a terminal device having a video shooting function. Thus, the target video may be a video currently captured by the execution subject. While shooting a video, the execution subject may present the video (i.e., the target video) in real time.
As an example, the target video may be a video obtained by photographing a hand (e.g., a palm and a wrist). It is understood that, when the target video is a video obtained by shooting the hand, all or part of the video frames included in the target video may include the palm object and the wrist object. Here, the palm object may be an image of a palm presented in the video frame. The wrist object may be an image of the wrist presented in a video frame.
Step 402, selecting a video frame from a target video as a first target video frame.
In this embodiment, the execution subject may select an arbitrary video frame from the target video as the first target video frame.
Thereafter, the execution agent may perform a tracking step. Wherein the tracking step comprises steps 403-408.
Step 403, inputting the first target video frame into a joint positioning model trained in advance, and obtaining wrist position information of the wrist object in the first target video frame.
In this embodiment, the executing body may input the first target video frame to a joint positioning model trained in advance, so as to obtain wrist position information of the wrist object in the first target video frame. The joint positioning model is used for determining the position of a human body joint point, and the human body joint point comprises a wrist.
Thereafter, the execution agent may perform a trace substep. Wherein the tracking sub-step comprises steps 404-408.
Step 404, selecting a subsequent video frame of the first target video frame from the target video as a second target video frame.
In this embodiment, the executing entity may select a video frame subsequent to the first target video frame from the target video as the second target video frame.
Thereafter, the execution agent may perform step 405.
Step 405, determining an image area corresponding to the wrist position information of the wrist object in the first target video frame in the second target video frame.
In this embodiment, the execution body described above may determine, in the second target video frame, an image area corresponding to wrist position information of the wrist object in the first target video frame.
Thereafter, the execution body may perform step 406.
Step 406, inputting the image area in the second target video frame to a pre-trained tracking model to obtain palm position information.
In this embodiment, the executing body may input the image region in the second target video frame to a pre-trained tracking model to obtain palm position information. Wherein the tracking model is used to determine the position of the palm object in the input image region.
Thereafter, the executing agent may execute step 407.
Step 407 determines whether the palm position information of the palm object in the second target video frame indicates that the image region contains a palm object.
In this embodiment, the execution body described above may determine whether the palm position information of the palm object in the second target video frame indicates that the palm object is contained in the image region.
Thereafter, if the palm position information of the palm object in the second target video frame indicates that the image area contains the palm object, the execution subject may execute step 408.
In some optional implementations of this embodiment, if the palm position information of the palm object in the second target video frame indicates that the image area does not include the palm object, the executing body may further perform step 410 of "regarding the second target video frame as the first target video frame", and perform step 403.
Here, after performing step 410, the execution subject may treat the second target video frame as a new first target video frame. It is understood that after step 410 is performed, the first target video frame in a subsequent step refers to the same video frame as the second target video frame before step 410 is performed.
Step 408, determining palm position information of the palm object in the second target video frame based on the palm position information of the palm object in the image area.
In this embodiment, the execution body may determine the palm position information of the palm object in the second target video frame based on the palm position information of the palm object in the image region.
In some optional implementations of this embodiment, after performing step 408, the executing main body may further perform step 409: it is determined whether the second target video frame is the last frame in the target video. Thereafter, if the second target video frame is not the last frame in the target video, the executing entity may execute step 410 "treat the second target video frame as the first target video frame" and step 404.
Here, after performing step 410, the execution subject may treat the second target video frame as a new first target video frame. It is understood that after step 410 is performed, the first target video frame in a subsequent step refers to the same video frame as the second target video frame before step 410 is performed.
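A consolidated sketch of flow 400 (steps 401 to 410), again reusing the hypothetical helpers `locate_wrist` and `palm_from_wrist_box` from the earlier sketches; carrying the same search region over between frames until the palm object is lost is one reading of the tracking sub-step, not the only possible implementation:

```python
def track_palm(joint_model, tracking_model, video_frames):
    """Yield (frame_index, palm_box) for each subsequent frame of a captured video."""
    if not video_frames:
        return
    # Steps 401-403: take a frame as the first target video frame and locate
    # the wrist object with the joint positioning model.
    search_box = locate_wrist(joint_model, video_frames[0])
    for idx in range(1, len(video_frames)):  # tracking sub-step, steps 404-410
        frame = video_frames[idx]
        palm_box = None
        if search_box is not None:
            # Steps 405-406: crop the corresponding image area of the second
            # target video frame and run the tracking model on it.
            palm_box = palm_from_wrist_box(tracking_model, frame, search_box)
        if palm_box is None:
            # Steps 407, 410, 403: no palm object in the area, so treat this frame
            # as the new first target video frame and re-run the joint model.
            search_box = locate_wrist(joint_model, frame)
            if search_box is not None:
                palm_box = palm_from_wrist_box(tracking_model, frame, search_box)
        yield idx, palm_box  # step 408
```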
It should be noted that, besides the above-mentioned contents, the embodiment of the present application may further include the same or similar features as the embodiment corresponding to fig. 2, and the same beneficial effects as the embodiment corresponding to fig. 2 are produced, and therefore, the description thereof is omitted.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the target tracking method in this embodiment locates the palm object for each video frame in the currently captured and presented target video, so that the position of the palm object in the currently presented video can be determined in real time.
With further reference to fig. 5, as an implementation of the method shown in fig. 2 described above, the present disclosure provides an embodiment of an apparatus for generating information, the apparatus embodiment corresponding to the method embodiment shown in fig. 2, which may include the same or corresponding features as the method embodiment shown in fig. 2 and produce the same or corresponding effects as the method embodiment shown in fig. 2, in addition to the features described below. The device can be applied to various electronic equipment.
As shown in fig. 5, the apparatus 500 for generating information of the present embodiment includes: a first acquisition unit 501 configured to acquire a target video; a first selecting unit 502 configured to select a video frame from the target video; a determining unit 503 configured to determine wrist position information of a wrist object in the video frame; and a generating unit 504 configured to generate palm position information of a palm object in a subsequent video frame of the video frame based on the wrist position information.
In this embodiment, the first obtaining unit 501 of the apparatus 500 for generating information may obtain the target video from other electronic devices or locally through a wired connection manner or a wireless connection manner. The target video may be any video. As an example, the target video may be a video shot of a palm of a person.
In this embodiment, the first selecting unit 502 may select a video frame from the target video acquired by the first acquiring unit 501.
In this embodiment, the determining unit 503 may determine the wrist position information of the wrist object in the video frame selected by the first selecting unit 502. Here, the wrist position information may be used to indicate the position of the wrist object in the video frame.
In the present embodiment, the above-described generating unit 504 may generate palm position information of a palm object in a video frame subsequent to the video frame based on the wrist position information obtained by the determining unit 503. The palm position information may be used to indicate the position of the palm object in the video frame.
In some optional implementations of this embodiment, the determining unit 503 may include: a first input module (not shown in the figures) is configured to input the video frame to a pre-trained joint location model, resulting in wrist position information of a wrist object in the video frame, wherein the joint location model is used to determine positions of human joint points, including a wrist.
In some optional implementations of this embodiment, the generating unit 504 may include: a second input module (not shown in the figure) is configured to input an image region corresponding to the wrist position information in a subsequent video frame of the video frames to a pre-trained tracking model, resulting in palm position information, wherein the tracking model is used to determine the position of the palm object in the input image region.
In some optional implementations of this embodiment, the generating unit 504 may also include: the enlargement processing module (not shown in the figure) is configured to enlarge an image area corresponding to the wrist position information in a subsequent video frame of the video frame, so as to obtain an enlarged image area. A third input module (not shown in the figures) is configured to input the enlarged image region to a pre-trained tracking model, resulting in palm position information, wherein the tracking model is used to determine the position of the palm object in the input image region. A fourth input module (not shown in the figures) is configured to input a subsequent video frame to the joint localization model, resulting in wrist position information of the wrist object, in response to the palm position information indicating that the palm object is not included in the magnified image area. A fifth input module (not shown in the figure) is configured to input an image region corresponding to the wrist position information in the subsequent video frame to the tracking model, so as to obtain palm position information of the palm object.
In some optional implementations of this embodiment, the target video is a video currently being captured and presented.
In the apparatus provided by the foregoing embodiment of the present disclosure, the first obtaining unit 501 obtains the target video, then the first selecting unit 502 selects a video frame from the target video, then the determining unit 503 determines the wrist position information of the wrist object in the video frame, and finally the generating unit 504 generates the palm position information of the palm object in the subsequent video frame of the video frame based on the wrist position information, thereby determining the palm position information based on the wrist position information, enriching the determination manner of the palm position information, and being helpful for improving the accuracy of the determined palm position.
Referring next to fig. 6, as an implementation of the method shown in fig. 4, the present disclosure provides an embodiment of an object tracking apparatus, which corresponds to the embodiment of the method shown in fig. 4, and which may include the same or corresponding features as the embodiment of the method shown in fig. 4 and produce the same or corresponding effects as the embodiment of the method shown in fig. 4, in addition to the features described below. The device can be applied to various electronic equipment.
As shown in fig. 6, the target tracking apparatus 600 of the present embodiment includes: a second acquisition unit 601 configured to acquire a target video currently captured and presented; a second selecting unit 602 configured to select a video frame from the target video as the first target video frame, and perform the following tracking steps: inputting the first target video frame into a pre-trained joint positioning model to obtain wrist position information of a wrist object in the first target video frame, wherein the joint positioning model is used for determining the position of a human body joint point, the human body joint point comprises a wrist, and the following tracking substep is executed: selecting a subsequent video frame of the first target video frame from the target video as a second target video frame: determining an image area corresponding to wrist position information of the wrist object in the first target video frame in the second target video frame; inputting an image area in a second target video frame into a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area; in response to the palm position information of the palm object in the second target video frame indicating that the palm object is contained in the image region, palm position information of the palm object in the second target video frame is determined based on the palm position information of the palm object in the image region.
In this embodiment, the second acquisition unit 601 of the target tracking apparatus 600 may acquire a target video currently captured and presented.
In this embodiment, the second selecting unit 602 may select a video frame from the target video acquired by the second acquisition unit 601 as the first target video frame and perform the following tracking step: inputting the first target video frame into the pre-trained joint positioning model to obtain wrist position information of a wrist object in the first target video frame, wherein the joint positioning model is used for determining positions of human body joint points, the human body joint points including a wrist, and performing the following tracking sub-step: selecting a subsequent video frame of the first target video frame from the target video as a second target video frame; determining, in the second target video frame, an image area corresponding to the wrist position information of the wrist object in the first target video frame; inputting the image area in the second target video frame into the pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in an input image area; and, in response to the palm position information indicating that the palm object is contained in the image area, determining palm position information of the palm object in the second target video frame based on the palm position information of the palm object in the image area.
In some optional implementations of this embodiment, the apparatus 600 further includes: a first execution unit (not shown in the figures), configured to take the second target video frame as the first target video frame and continue performing the tracking sub-step, in response to the second target video frame not being the last frame in the target video.
In some optional implementations of this embodiment, the apparatus 600 further includes: a second execution unit (not shown in the figures), configured to take the second target video frame as the first target video frame and continue performing the tracking step, in response to the palm position information of the palm object in the second target video frame indicating that the palm object is not contained in the image area.
In the apparatus provided by the foregoing embodiment of the present disclosure, the second acquisition unit 601 acquires the target video currently captured and presented, and the second selecting unit 602 selects a video frame from the target video as the first target video frame and performs the tracking step described above: inputting the first target video frame into the pre-trained joint positioning model to obtain wrist position information of the wrist object, and then, in the tracking sub-step, selecting a subsequent video frame as the second target video frame, determining the image area in the second target video frame that corresponds to the wrist position information, inputting that image area into the pre-trained tracking model to obtain palm position information, and, in response to the palm position information indicating that the palm object is contained in the image area, determining the palm position information of the palm object in the second target video frame based on the palm position information of the palm object in the image area. The position of the palm object in the currently presented video can thereby be determined in real time.
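The tracking step and sub-step described above can be pictured as a simple loop over the incoming frames. Below is a minimal Python sketch under one possible reading of this embodiment, not the patented implementation: `joint_localization_model` and `tracking_model` are hypothetical callables, frames are assumed to be numpy-style arrays, and the tracked palm box is reused as the search region when the sub-step continues with the next frame.

```python
# Simplified sketch of the tracking step / sub-step loop (assumed names and behavior).
from typing import Callable, Iterable, Iterator, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in frame coordinates


def track_palm(
    frames: Iterable,
    joint_localization_model: Callable[[object], Box],
    tracking_model: Callable[[object, Box], Optional[Box]],
) -> Iterator[Optional[Box]]:
    """Yield the palm position for each subsequent frame of the target video."""
    frame_iter = iter(frames)
    first_frame = next(frame_iter, None)
    if first_frame is None:
        return
    # Tracking step: locate the wrist joint in the first target video frame.
    search_box = joint_localization_model(first_frame)
    for second_frame in frame_iter:
        # Tracking sub-step: run the tracking model on the image area of the
        # second target video frame that corresponds to the current search box.
        palm_box = tracking_model(second_frame, search_box)
        if palm_box is not None:
            # Palm found: report it and continue the sub-step, treating this
            # frame as the new first target video frame.
            search_box = palm_box
            yield palm_box
        else:
            # Palm lost: restart the tracking step by re-locating the wrist
            # joint in this frame, then continue with the next frame.
            search_box = joint_localization_model(second_frame)
            yield None
```

In use, `frames` could be the frames of the video currently being captured and presented, with each yielded box driving the real-time display of the palm position.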
Referring now to fig. 7, a schematic diagram of an electronic device 700 (e.g., the server or terminal device of fig. 1) suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a vehicle-mounted terminal (e.g., a car navigation terminal), and fixed terminals such as a digital TV and a desktop computer. The terminal device/server shown in fig. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 702 or a program loaded from a storage device 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; output devices 707 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 708 including, for example, a magnetic tape, a hard disk, and the like; and a communication device 709. The communication device 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or multiple devices, as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 709, installed from the storage device 708, or installed from the ROM 702. When executed by the processing device 701, the computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), or any suitable combination of the foregoing.
The computer-readable medium may be included in the electronic device, or it may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a target video; select a video frame from the target video; determine wrist position information of a wrist object in the video frame; and generate palm position information of a palm object in a subsequent video frame of the video frame based on the wrist position information. Alternatively, the programs cause the electronic device to: acquire a target video currently captured and presented; and select a video frame from the target video as a first target video frame and perform the following tracking step: inputting the first target video frame into a pre-trained joint positioning model to obtain wrist position information of a wrist object in the first target video frame, wherein the joint positioning model is used for determining positions of human body joint points, the human body joint points including a wrist, and performing the following tracking sub-step: selecting a subsequent video frame of the first target video frame from the target video as a second target video frame; determining, in the second target video frame, an image area corresponding to the wrist position information of the wrist object in the first target video frame; inputting the image area in the second target video frame into a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in an input image area; and, in response to the palm position information indicating that the palm object is contained in the image area, determining palm position information of the palm object in the second target video frame based on the palm position information of the palm object in the image area.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including a first acquisition unit, a first selection unit, a determination unit, and a generation unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the first acquisition unit may also be described as "a unit that acquires a target video".
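For illustration only, a software composition of these units might look like the following Python sketch; the class name, attribute names, and signatures are hypothetical, and the units may equally be implemented in hardware.

```python
# Illustrative composition of the described units in software (assumed names).
from dataclasses import dataclass
from typing import Callable


@dataclass
class InformationGenerationApparatus:
    first_acquisition_unit: Callable[[], object]         # acquires the target video
    first_selection_unit: Callable[[object], object]     # selects a video frame
    determination_unit: Callable[[object], tuple]        # determines wrist position info
    generation_unit: Callable[[object, tuple], tuple]    # generates palm position info

    def run(self):
        video = self.first_acquisition_unit()
        frame = self.first_selection_unit(video)
        wrist_info = self.determination_unit(frame)
        return self.generation_unit(video, wrist_info)
```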
The foregoing description is only a description of preferred embodiments of the present disclosure and of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features with similar functions disclosed in the present disclosure.

Claims (11)

1. A method for generating information, comprising:
acquiring a target video;
selecting a video frame from the target video;
determining wrist position information of a wrist object in the video frame;
generating palm position information of a palm object in a subsequent video frame of the video frames based on the wrist position information;
wherein the generating palm position information of a palm object in a subsequent video frame of the video frames based on the wrist position information comprises:
enlarging an image area corresponding to the wrist position information in a subsequent video frame of the video frame to obtain an enlarged image area;
inputting the enlarged image area to a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area;
in response to the palm position information indicating that the enlarged image area does not include a palm object, inputting the subsequent video frame to a pre-trained joint positioning model to obtain wrist position information of the wrist object, wherein the joint positioning model is used for determining positions of human joint points, and the human joint points include wrists;
and inputting the image area corresponding to the wrist position information in the subsequent video frame into the tracking model to obtain the palm position information of the palm object.
2. The method of claim 1, wherein said determining wrist position information of a wrist object in said video frame comprises:
and inputting the video frame into a pre-trained joint positioning model to obtain wrist position information of the wrist object in the video frame.
3. The method of claim 1, wherein the generating palm position information for a palm object in a subsequent video frame of the video frame based on the wrist position information comprises:
and inputting an image area corresponding to the wrist position information in a subsequent video frame of the video frames to a pre-trained tracking model to obtain palm position information.
4. The method according to one of claims 1-3, wherein the target video is a currently captured and presented video.
5. A target tracking method, comprising:
acquiring a current shot and presented target video;
selecting a video frame from the target video as a first target video frame, and executing the following tracking steps:
inputting the first target video frame into a pre-trained joint positioning model to obtain wrist position information of a wrist object in the first target video frame, wherein the joint positioning model is used for determining the position of a human body joint point, the human body joint point comprises a wrist, and the following tracking substep is executed:
selecting a subsequent video frame of the first target video frame from the target video as a second target video frame:
determining an image area corresponding to wrist position information of the wrist object in the first target video frame in the second target video frame;
inputting an image area in a second target video frame into a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area;
in response to the palm position information of the palm object in the second target video frame indicating that the palm object is contained in the image region, palm position information of the palm object in the second target video frame is determined based on the palm position information of the palm object in the image region.
6. The method of claim 5, wherein the method further comprises:
and in response to the second target video frame not being the last frame in the target video, taking the second target video frame as the first target video frame, and continuing to perform the tracking substep.
7. The method of claim 5 or 6, wherein the method further comprises:
and in response to the palm position information of the palm object in the second target video frame indicating that the palm object is not contained in the image area, taking the second target video frame as the first target video frame, and continuing to execute the tracking step.
8. An apparatus for generating information, comprising:
a first acquisition unit configured to acquire a target video;
a first selecting unit configured to select a video frame from the target video;
a determining unit configured to determine wrist position information of a wrist object in the video frame;
a generating unit configured to generate palm position information of a palm object in a subsequent video frame of the video frames based on the wrist position information;
wherein the generating palm position information of a palm object in a subsequent video frame of the video frames based on the wrist position information comprises:
enlarging an image area corresponding to the wrist position information in a subsequent video frame of the video frame to obtain an enlarged image area;
inputting the enlarged image area to a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area;
in response to the palm position information indicating that the enlarged image area does not include a palm object, inputting the subsequent video frame to a pre-trained joint positioning model to obtain wrist position information of the wrist object, wherein the joint positioning model is used for determining positions of human joint points, and the human joint points include wrists;
and inputting the image area corresponding to the wrist position information in the subsequent video frame into the tracking model to obtain the palm position information of the palm object.
9. An object tracking device, comprising:
a second acquisition unit configured to acquire a target video currently captured and presented;
a second selecting unit configured to select a video frame from the target video as a first target video frame, and perform the following tracking steps:
inputting the first target video frame into a pre-trained joint positioning model to obtain wrist position information of a wrist object in the first target video frame, wherein the joint positioning model is used for determining the position of a human body joint point, the human body joint point comprises a wrist, and the following tracking substep is executed:
selecting a subsequent video frame of the first target video frame from the target video as a second target video frame:
determining an image area corresponding to wrist position information of the wrist object in the first target video frame in the second target video frame;
inputting an image area in a second target video frame into a pre-trained tracking model to obtain palm position information, wherein the tracking model is used for determining the position of a palm object in the input image area;
in response to the palm position information of the palm object in the second target video frame indicating that the palm object is contained in the image region, palm position information of the palm object in the second target video frame is determined based on the palm position information of the palm object in the image region.
10. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
11. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN201910480692.7A 2019-06-04 2019-06-04 Method and device for generating information, and target tracking method and device Active CN110189364B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910480692.7A CN110189364B (en) 2019-06-04 2019-06-04 Method and device for generating information, and target tracking method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910480692.7A CN110189364B (en) 2019-06-04 2019-06-04 Method and device for generating information, and target tracking method and device

Publications (2)

Publication Number Publication Date
CN110189364A CN110189364A (en) 2019-08-30
CN110189364B true CN110189364B (en) 2022-04-01

Family

ID=67720133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910480692.7A Active CN110189364B (en) 2019-06-04 2019-06-04 Method and device for generating information, and target tracking method and device

Country Status (1)

Country Link
CN (1) CN110189364B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688992B (en) * 2019-12-09 2020-08-04 中智行科技有限公司 Traffic signal identification method and device, vehicle navigation equipment and unmanned vehicle
CN113849687B (en) * 2020-11-23 2022-10-28 阿里巴巴集团控股有限公司 Video processing method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102356398B (en) * 2009-02-02 2016-11-23 视力移动技术有限公司 Object identifying in video flowing and the system and method for tracking
CN103065312B (en) * 2012-12-26 2015-05-13 四川虹微技术有限公司 Foreground extraction method in gesture tracking process
KR101700817B1 (en) * 2014-01-10 2017-02-13 한국전자통신연구원 Apparatus and method for multiple armas and hands detection and traking using 3d image

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101447082A (en) * 2008-12-05 2009-06-03 华中科技大学 Detection method of moving target on a real-time basis
CN107274433A (en) * 2017-06-21 2017-10-20 吉林大学 Method for tracking target, device and storage medium based on deep learning
CN109447996A (en) * 2017-08-28 2019-03-08 英特尔公司 Hand Segmentation in 3-D image
CN108399367A (en) * 2018-01-31 2018-08-14 深圳市阿西莫夫科技有限公司 Hand motion recognition method, apparatus, computer equipment and readable storage medium storing program for executing
CN108564596A (en) * 2018-03-01 2018-09-21 南京邮电大学 A kind of the intelligence comparison analysis system and method for golf video
CN108961315A (en) * 2018-08-01 2018-12-07 腾讯科技(深圳)有限公司 Method for tracking target, device, computer equipment and storage medium
CN109636828A (en) * 2018-11-20 2019-04-16 北京京东尚科信息技术有限公司 Object tracking methods and device based on video image
CN109525891A (en) * 2018-11-29 2019-03-26 北京字节跳动网络技术有限公司 Multi-user's special video effect adding method, device, terminal device and storage medium

Also Published As

Publication number Publication date
CN110189364A (en) 2019-08-30

Similar Documents

Publication Publication Date Title
CN110188719B (en) Target tracking method and device
CN109584276B (en) Key point detection method, device, equipment and readable medium
CN110058685B (en) Virtual object display method and device, electronic equipment and computer-readable storage medium
CN109993150B (en) Method and device for identifying age
CN109829432B (en) Method and apparatus for generating information
CN109600559B (en) Video special effect adding method and device, terminal equipment and storage medium
CN110059623B (en) Method and apparatus for generating information
CN109754464B (en) Method and apparatus for generating information
CN110009059B (en) Method and apparatus for generating a model
CN110033423B (en) Method and apparatus for processing image
CN110059624B (en) Method and apparatus for detecting living body
CN111107278B (en) Image processing method and device, electronic equipment and readable storage medium
CN111783662B (en) Attitude estimation method, estimation model training method, device, medium and equipment
CN110189364B (en) Method and device for generating information, and target tracking method and device
CN111402122A (en) Image mapping processing method and device, readable medium and electronic equipment
CN111314620B (en) Photographing method and apparatus
CN111652675A (en) Display method and device and electronic equipment
CN112270242B (en) Track display method and device, readable medium and electronic equipment
CN111833459B (en) Image processing method and device, electronic equipment and storage medium
CN111104827A (en) Image processing method and device, electronic equipment and readable storage medium
CN111445499B (en) Method and device for identifying target information
CN111447379B (en) Method and device for generating information
CN111027495A (en) Method and device for detecting key points of human body
CN110084306B (en) Method and apparatus for generating dynamic image
CN109816791B (en) Method and apparatus for generating information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant