CN113949891A - Video processing method and device, server and client


Info

Publication number
CN113949891A
CN113949891A
Authority
CN
China
Prior art keywords
user
video
target
user video
live
Legal status (assumption, not a legal conclusion)
Granted
Application number
CN202111191080.XA
Other languages
Chinese (zh)
Other versions
CN113949891B (en)
Inventor
李立锋
Current Assignee (the listed assignees may be inaccurate)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Application filed by Migu Cultural Technology Co Ltd and China Mobile Communications Group Co Ltd
Priority to CN202111191080.XA
Publication of CN113949891A
Application granted
Publication of CN113949891B
Legal status: Active

Classifications

    • H04N 21/2187: Live feed
    • H04N 21/23424: Splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H04N 21/2343: Reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/25866: Management of end-user data
    • H04N 21/44016: Splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N 21/4402: Reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/44218: Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • H04N 21/4508: Management of client data or end-user data
    • H04N 21/4788: Supplemental services communicating with other users, e.g. chatting

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Social Psychology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Computer Graphics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Engineering & Computer Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a video processing method and apparatus, a server and a client, relates to the technical field of video processing, and aims to solve the problem that, in the related art, users can interact through electronic devices in only a limited way. The method comprises the following steps: acquiring a user video, wherein the user video is obtained by a first client playing a live video capturing video of a watching user; identifying the user video, and determining whether a user in the user video is in a target state; loading the user video into the live video if it is determined that the user in the user video is in the target state; and sending the live video loaded with the user video to a second client. In this way, users can interact through video while watching the live video, which makes the interaction more varied and engaging.

Description

Video processing method and device, server and client
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video processing method, an apparatus, a server, and a client.
Background
With the development of the mobile internet, people often use electronic devices such as mobile phones to handle daily affairs, and users can interact with one another through these devices. However, in the related art, users typically interact through electronic devices only by voice and text, for example by posting bullet-screen comments while watching a video, so the interaction mode is limited.
Disclosure of Invention
The embodiments of the present invention provide a video processing method and apparatus, a server, and a client, aiming to solve the problem in the related art that users can interact through electronic devices in only a limited way.
In a first aspect, an embodiment of the present invention provides a video processing method, which is applied to a server, and the method includes:
acquiring a user video, wherein the user video is obtained by performing video acquisition on a watching user by a first client playing a live video;
identifying the user video, and determining whether a user in the user video is in a target state;
loading the user video into the live video if it is determined that the user in the user video is in the target state;
and sending the live video loaded with the user video to a second client.
Optionally, the identifying the user video and determining whether a user in the user video is in a target state includes:
and identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotional state.
Optionally, the target emotional state comprises a cheering state;
the identifying the action amplitude and/or the facial expression of the user in the user video and determining whether the user in the user video is in the target emotional state comprise:
determining an included angle between the arm and the trunk of the user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
Optionally, the acquiring the user video includes:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
the loading the user video into the live video under the condition that the user in the user video is determined to be in the target state comprises:
determining the motion change amplitude of the user in each user video under the condition that the user in at least two user videos is determined to be in the target state;
determining a target user video from the at least two user videos, wherein the action change amplitude of the user in the target user video is the largest;
and loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is positioned in the middle of the preset area.
Optionally, other user videos than the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in a middle position of the preset area in a second size, where the second size is larger than the first size.
Optionally, after determining the target user video from the at least two user videos, the method further includes:
identifying the outline of the action core part of the user in the target user video;
based on the outline, intercepting a dynamic image of the action core part from the target user video;
and loading the dynamic image of the action core part at a position associated with the target user video in a playing picture of the live video by a third size, wherein the third size is larger than a second size, and the second size is the display size of the target user video.
Optionally, after loading the dynamic image of the action core part in a third size at a position associated with the target user video in the playing frame of the live video, the method further includes:
and identifying the interactive action of the dynamic image of the action core part on a first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action, wherein the first user video is at least one of the other user videos.
Optionally, the identifying an interactive action of the dynamic image of the action core portion on the first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action includes:
acquiring position information of the dynamic image of the action core part in a playing picture of the live video;
and under the condition that the dynamic image of the action core part is detected to be overlapped with the position information of the first user video, generating a collision effect graph of the dynamic image of the action core part and the first user video, and loading the collision effect graph into the live video.
Optionally, the generating a collision effect map of the dynamic image of the motion core part and the first user video includes:
retracting the first user video to a first direction of a playing picture of the live video, wherein the first direction is related to the action direction of the dynamic image of the action core part;
or, determining a target motion speed and a target motion direction of the first user video after the first user video collides with the dynamic image of the motion core part based on the motion speed and the motion direction of the motion core part in the dynamic image of the motion core part; and generating a pop-up effect picture of the first user video in a playing picture of the live video according to the target motion speed and the target motion direction.
Optionally, the identifying an interactive action of the dynamic image of the action core portion on the first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action includes:
performing gesture recognition on the target user video;
under the condition that a grabbing gesture of a user in the target user video is recognized, determining a second user video overlapped with the position information of the dynamic image of the action core part, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with the end position of the moving track.
Optionally, in a case that it is determined that a user in the user video is in the target state, loading the user video into the live video includes at least one of:
under the condition that the user in the user video is determined to be in the target state, performing background segmentation on the user video based on the figure image outline in the user video, and determining a background area in the user video; filling a background area in the user video by using a preset color; loading the processed user video into the live video;
under the condition that the user in the user video is determined to be in the target state, carrying out face recognition on the user video, and determining the face position in the user video; based on the face position in the user video, cutting the user video; and loading the cut user video into the live video.
Optionally, the acquiring the user video includes:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
after the obtaining of the plurality of user videos, the method further includes:
identifying and classifying the sounds in the plurality of user videos;
determining a proportion of sounds of a target category in the plurality of user videos;
and processing the sound volume of the target category under the condition that the proportion is greater than a preset value.
Optionally, the processing the sound volume of the target category includes:
determining the sound volume of the target category based on the proportion, wherein the sound volume of the target category is positively correlated with the proportion;
and superposing the sound of the target category in the plurality of user videos according to the sound volume of the target category.
Optionally, the loading the user video into the live video when it is determined that the user in the user video is in the target state includes:
under the condition that the user in the user video is determined to be in the target state, carrying out face recognition on the user video, and determining the face position in the user video;
based on the face position in the user video, cutting the user video;
and loading the cut user video into the live video.
In a second aspect, an embodiment of the present invention further provides a video processing method, which is applied to a client, where the method includes:
receiving a live video loaded with a user video and transmitted by a server;
and playing the live video loaded with the user video.
Optionally, the method further comprises:
in the process of playing the live video, video acquisition is carried out on a watching user to obtain a third user video;
and uploading the third user video to the server, so that the server identifies the third user video.
In a third aspect, an embodiment of the present invention further provides a video processing apparatus, which is applied to a server, where the video processing apparatus includes:
the system comprises a first acquisition module, a second acquisition module and a first display module, wherein the first acquisition module is used for acquiring a user video, and the user video is obtained by performing video acquisition on a watching user by a first client playing a live video;
the first identification module is used for identifying the user video and determining whether a user in the user video is in a target state;
the first processing module is used for loading the user video into the live video under the condition that the user in the user video is determined to be in the target state;
and the sending module is used for sending the live video loaded with the user video to a second client.
Optionally, the first recognition module is configured to recognize the motion amplitude and/or facial expression of the user in the user video, and determine whether the user in the user video is in a target emotional state.
Optionally, the first identification module includes:
the first determining unit is used for determining an included angle between an arm and a trunk of the user in each frame of user image in the user video;
and the second determining unit is used for determining that the user in the user video is in the cheering state under the condition that the included angle between the arm and the trunk of the user in the target frame user image is determined to be larger than a preset angle, wherein the target frame user image is any one frame image in the user video.
Optionally, the first obtaining module is configured to obtain a plurality of user videos, where the user videos are obtained by respectively performing video acquisition on respective watching users by different clients playing live videos;
the first processing module comprises:
a third determining unit, configured to determine a motion variation amplitude of the user in each user video when determining that the user in the at least two user videos is in the target state;
a fourth determining unit, configured to determine a target user video from the at least two user videos, where a motion variation amplitude of a user in the target user video is largest;
the first processing unit is used for loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is located in the middle of the preset area.
Optionally, other user videos than the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in a middle position of the preset area in a second size, where the second size is larger than the first size.
Optionally, the video processing apparatus further includes:
the second identification module is used for identifying the outline of the action core part of the user in the target user video;
the intercepting module is used for intercepting a dynamic image of the action core part from the target user video based on the outline;
and the second processing module is used for loading the dynamic image of the action core part at a position, associated with the target user video, in a playing picture of the live video in a third size, wherein the third size is larger than a second size, and the second size is the display size of the target user video.
Optionally, the video processing apparatus further includes:
and the seventh processing module is configured to identify an interactive action of the dynamic image of the action core part on a first user video, and generate an interactive display effect of the first user video in a playing picture of the live video based on the interactive action, where the first user video is at least one of the other user videos.
Optionally, the video processing apparatus further includes:
the second acquisition module is used for acquiring the position information of the dynamic image of the action core part in the playing picture of the live video;
and the third processing module is used for generating a collision effect graph of the dynamic image of the action core part and the first user video under the condition that the dynamic image of the action core part is detected to be overlapped with the position information of the first user video, and loading the collision effect graph into the live video.
Optionally, the third processing module is configured to indent the first user video in a first direction of a playing frame of the live video, where the first direction is related to an action direction of a dynamic image of the action core portion;
or, the third processing module is configured to determine a target motion speed and a target motion direction of the first user video after collision with the dynamic image of the motion core part, based on the motion speed and the motion direction of the motion core part in the dynamic image of the motion core part; and generating a pop-up effect picture of the first user video in a playing picture of the live video according to the target motion speed and the target motion direction.
Optionally, the video processing apparatus further includes:
the third identification module is used for carrying out gesture identification on the target user video;
a first determining module, configured to determine, when a grabbing gesture of a user in the target user video is identified, a second user video that overlaps with the position information of the dynamic image of the action core portion, where the second user video is one of the other user videos;
and the fourth processing module is used for moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with the end position of the moving track.
Optionally, the first processing module includes:
the second processing unit is used for carrying out background segmentation on the user video based on the figure image outline in the user video and determining a background area in the user video under the condition that the user in the user video is determined to be in the target state;
the third processing unit is used for filling a background area in the user video by using a preset color;
the fourth processing unit is used for loading the processed user video into the live video;
and/or, the first processing module comprises:
the identification unit is used for carrying out face identification on the user video under the condition that the user in the user video is determined to be in the target state, and determining the face position in the user video;
the sixth processing unit is used for cutting the user video based on the face position in the user video;
and the seventh processing unit is used for loading the cut user video into the live video.
Optionally, the first obtaining module is configured to obtain a plurality of user videos, where the user videos are obtained by respectively performing video acquisition on respective watching users by different clients playing live videos;
the video processing apparatus further includes:
the fifth processing module is used for identifying and carrying out classified statistics on the sound in the user videos;
a second determining module, configured to determine a proportion of sounds of a target category in the plurality of user videos;
and the sixth processing module is used for processing the sound volume of the target category under the condition that the proportion is greater than a preset value.
Optionally, the sixth processing module includes:
a fifth determining unit, configured to determine the sound volume of the target category based on the proportion, where the sound volume of the target category is positively correlated with the proportion;
and the fifth processing unit is used for superposing the sound of the target category in the plurality of user videos according to the sound volume of the target category.
In a fourth aspect, an embodiment of the present invention further provides a video processing apparatus, which is applied to a client, where the video processing apparatus includes:
the receiving module is used for receiving the live video loaded with the user video and transmitted by the server;
and the playing module is used for playing the live video loaded with the user video.
Optionally, the video processing apparatus further includes:
the acquisition module is used for acquiring videos of watching users in the process of playing the live videos to obtain a third user video;
and the uploading module is used for uploading the third user video to the server so that the server identifies the third user video.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including: a transceiver, a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the video processing method according to the first aspect as described above or implementing the steps of the video processing method according to the second aspect as described above when executing the computer program.
In a sixth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the video processing method according to the first aspect; or implementing the steps in the video processing method as described in the second aspect above.
In the embodiment of the invention, a user video is obtained, wherein the user video is obtained by a first client playing a live video capturing video of a watching user; the user video is identified, and whether a user in the user video is in a target state is determined; the user video is loaded into the live video if it is determined that the user in the user video is in the target state; and the live video loaded with the user video is sent to a second client. In this way, users can interact through video while watching the live video, which makes the interaction more varied and engaging.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Clearly, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a video processing method applied to a server according to an embodiment of the present invention;
fig. 2 is a flowchart of audio and video data acquisition of a client according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of user action amplitude recognition provided by an embodiment of the present invention;
fig. 4 is a schematic view of a live video frame after a user video is loaded according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a target user video after being subjected to an amplification process according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating interaction effects between a user and other user videos in a target user video according to an embodiment of the present invention;
fig. 7 is a second schematic diagram illustrating interaction effects between a user and other user videos in a target user video according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating clipping of a user video based on a face of the user according to an embodiment of the present invention;
fig. 9 is a flowchart of a video processing method applied to a client according to an embodiment of the present invention;
fig. 10 is one of the structural diagrams of a video processing apparatus provided by the embodiment of the present invention;
fig. 11 is a second block diagram of a video processing apparatus according to an embodiment of the present invention;
FIG. 12 is a block diagram of a server according to an embodiment of the present invention;
fig. 13 is a structural diagram of a client according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a video processing method provided by an embodiment of the present invention, and is applied to a server, as shown in fig. 1, the method includes the following steps:
step 101, obtaining a user video, wherein the user video is obtained by performing video acquisition on a watching user by a first client playing a live video.
In the embodiment of the invention, the user video may be obtained by the first client turning on a camera and a microphone to capture video of a user watching the live video while the live video is being played, and the first client may be a client that is playing the live video and on which the user has chosen to join the live interaction. The live video may be of any type, and in particular may be a live video with strong interactivity, such as a live sports event or a live e-sports match.
For example, when a client user selects to access a live interaction of a sporting event, the client enables a camera and a microphone to capture video of the user.
The acquiring of the user video may be receiving the user video uploaded by the first client.
When a client collects a user video, timestamp verification may be performed on the collected audio data and video data in order to keep them synchronized. A specific process may be as shown in fig. 2: audio data is collected through a microphone, video data is collected through a camera, and the collection timestamp is recorded, so that audio data and video data carrying timestamps are obtained. The timestamp of each audio packet can then be checked against the timestamp of the corresponding video frame; when the timestamps are out of sync, that is, when the video frame timestamp is ahead of or behind the audio packet timestamp, a frame-dropping or frame-copying operation is performed on the video frames. The verified audio/video data, i.e., the timestamp-synchronized audio/video data, can then be transmitted to the server through Web Real-Time Communication (WebRTC) or in other ways.
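By way of illustration only, the following Python sketch shows one possible form of the frame-drop/frame-copy decision described above; the 40 ms tolerance, the VideoFrame structure and the function name are assumptions made for the example and are not specified by this embodiment.

```python
# Minimal sketch of the frame-drop / frame-copy decision (assumed details noted above).
from dataclasses import dataclass
from typing import Optional

SYNC_TOLERANCE_MS = 40  # assumed tolerance, roughly one video frame at 25 fps

@dataclass
class VideoFrame:
    timestamp_ms: int
    data: bytes

def align_video_to_audio(frame: VideoFrame, audio_ts_ms: int,
                         last_shown: VideoFrame) -> Optional[VideoFrame]:
    """Compare a video frame's timestamp with the current audio timestamp and
    decide whether to play, drop, or repeat a frame."""
    drift = frame.timestamp_ms - audio_ts_ms
    if drift < -SYNC_TOLERANCE_MS:
        return None          # video lags audio: drop this frame to catch up
    if drift > SYNC_TOLERANCE_MS:
        return last_shown    # video runs ahead: repeat the last frame and wait
    return frame             # within tolerance: play the frame as-is
```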
Step 102, identifying the user video, and determining whether a user in the user video is in a target state.
After acquiring the user video, the server may identify the interaction information in the user video, that is, identify the user video to determine whether the user in the user video is in a target state, where the target state may be a state determined according to actual needs, for example, a cheering state, an excited state, an angry state, a sad state, and the like. When the user in the user video is in the target state, it can be considered that effective user interaction information exists in the user video.
Specifically, according to the target state to be recognized, motion recognition, emotion recognition and the like may be performed on the user image in the user video, semantic understanding may also be performed on the user voice in the user video, and whether the user is in the target state may also be determined by combining recognition of the user image and recognition of the user voice.
Optionally, the identifying the user video and determining whether a user in the user video is in a target state includes:
and identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotional state.
In one embodiment, the target state may be a target emotional state, for example cheering, sadness, excitement, or anger, and the target emotional state may be set according to actual needs. In this embodiment, whether the user in the user video is in the target emotional state may be determined by recognizing the motion amplitude and/or facial expression of the user in the user video. Specifically, the motion amplitude of a target part of the user, such as an arm, a leg, or the lips, may be recognized; whether the user performs a target action, such as waving an arm, jumping, or shouting, may be recognized; and whether the user shows a target expression, such as sadness, anger, or happiness, may also be recognized. When the user performs the target action or shows the target expression in the user video, the user may be determined to be in the target emotional state.
Therefore, the target emotion state of the user can be accurately identified by identifying the action amplitude and/or facial expression of the user in the user video, and the quality of the user interactive video displayed in the live video is further ensured.
Optionally, the target emotional state comprises a cheering state;
the identifying the action amplitude and/or the facial expression of the user in the user video and determining whether the user in the user video is in the target emotional state comprise:
determining an included angle between the arm and the trunk of the user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
In one embodiment, the target emotional state may include a cheering state. To recognize the motion amplitude and/or facial expression of the user in the user video, the included angle between an arm and the trunk of the user is identified in each frame of user image in the user video; when the included angle between the arm and the trunk of the user in a certain frame is identified as being greater than a preset angle, the user in the user video may be considered to be in the cheering state. For example, a user usually raises the arms when cheering, so the user may be determined to be in the cheering state when the included angle between the arm and the trunk is greater than 90 degrees, 120 degrees, or the like.
Specifically, the angle between the arm and the trunk of the user can be calculated with a bone-point angle calculation method. For example, referring to fig. 3, the key points of the left shoulder, the left arm and the left trunk are a, b and c, respectively, and the key points of the right shoulder, the right arm and the right trunk are a', b' and c', respectively. The angles between the arms and the trunk are then ∠abc and ∠a'b'c', and when ∠abc or ∠a'b'c' is greater than a preset angle, the user is considered to be in a cheering state.
Thus, the embodiment can accurately identify the swing action of the user, and further identify the cheering state of the user.
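As a non-limiting illustration of the bone-point angle calculation, the following sketch computes the angle ∠abc from 2D keypoints and compares it against a preset angle; the keypoint source (any pose-estimation model returning 2D joint coordinates) and the 90-degree default are assumptions of the example.

```python
# Illustrative bone-point angle check for the cheering state (assumptions noted above).
import math

def angle_deg(a, b, c):
    """Angle ∠abc at vertex b, in degrees, from 2D keypoints given as (x, y)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def is_cheering(shoulder, arm, trunk, preset_angle=90.0):
    """The user is treated as cheering when the arm-trunk angle exceeds the preset angle."""
    return angle_deg(shoulder, arm, trunk) > preset_angle
```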
It should be noted that the interaction information, i.e., the target state, is identified and extracted from the user video data. The identification may be performed at the server, which processes each path of video data (the video data uploaded by the different clients), or it may be performed at each client. Unified processing at the server side offers better compatibility; identification at the client places certain demands on the client's processing capacity, but can greatly reduce the load on the server.
Step 103, loading the user video into the live video under the condition that the user in the user video is determined to be in the target state.
In the embodiment of the present invention, when step 102 recognizes that the user in the user video is in the target state, it may be determined that the user video contains valid interaction information, and the user video may then be loaded into the live video, specifically by displaying the user video in the live video. To prevent the user video from blocking the live video and affecting the viewing experience, the user video may be displayed as a thumbnail, i.e., reduced to a certain size, at a corner position of the live video picture, such as the bottom, top, lower-left corner, or lower-right corner. In this way, users watching the live video can see their own video or the videos of other users in the live picture, which realizes user interaction within the live video. A user can also act out the target state while watching the live video so that the server displays that user's video in the live picture, which makes the interaction more engaging.
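As a rough illustration of loading a user video into the live picture as a thumbnail, the following sketch composites a reduced user-video frame into a slot along the bottom edge of the live frame; the use of OpenCV/NumPy arrays, the 160x120 thumbnail size and the slot layout are assumptions of the example.

```python
# Compositing sketch (OpenCV/NumPy and the 160x120 thumbnail size are assumptions).
import cv2
import numpy as np

def overlay_user_video(live_frame: np.ndarray, user_frame: np.ndarray,
                       slot_index: int, thumb_w: int = 160, thumb_h: int = 120) -> np.ndarray:
    """Paste a shrunken user-video frame into a slot along the bottom edge of the
    live picture so it does not block the main content."""
    thumb = cv2.resize(user_frame, (thumb_w, thumb_h))
    h, w = live_frame.shape[:2]
    x = slot_index * thumb_w
    y = h - thumb_h
    if x + thumb_w > w:
        return live_frame        # slot falls outside the picture: leave frame unchanged
    out = live_frame.copy()
    out[y:y + thumb_h, x:x + thumb_w] = thumb
    return out
```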
Optionally, the step 101 includes:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
the step 103 comprises:
determining the motion change amplitude of the user in each user video under the condition that the user in at least two user videos is determined to be in the target state;
determining a target user video from the at least two user videos, wherein the action change amplitude of the user in the target user video is the largest;
and loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is positioned in the middle of the preset area.
In an implementation manner, the server may receive a plurality of user videos sent by a plurality of clients, and the plurality of clients playing the live videos may respectively perform video acquisition on respective watching users and upload the user videos acquired by the respective watching users to the server.
The server can identify the user state in each user video to determine whether the user in each user video is in the target state. When it is determined that the users in at least two of the user videos are in the target state, the motion change amplitude, i.e., the motion amplitude, of the user in each of the at least two user videos may be determined separately. Specifically, the motion change amplitude may be determined by identifying the difference between the user's actions in different frames of the user video over a period of time. For example, for the cheering state, the difference between the arm-trunk angles of the user in two frames at different times can be calculated: if the angle between the user's arm and trunk was 90 degrees in one second and became 150 degrees in the next second, the user's arm motion amplitude can be determined to be 60 degrees per second.
Therefore, after the action change amplitude of the user in each user video is calculated, the at least two user videos can be sequenced according to the action change amplitude, and the user video with the largest action change amplitude is selected as the target user video.
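A minimal sketch of ranking the user videos by motion change amplitude, reusing the arm-trunk angle series from the recognition step, might look as follows; the per-video angle series, the frame rate and the function names are illustrative assumptions.

```python
# Sketch of ranking user videos by motion change amplitude (assumptions noted above).
from typing import Dict, List

def motion_amplitude(angles_deg: List[float], fps: float) -> float:
    """Largest arm-trunk angle change between consecutive frames, scaled to degrees/second."""
    if len(angles_deg) < 2:
        return 0.0
    max_step = max(abs(b - a) for a, b in zip(angles_deg, angles_deg[1:]))
    return max_step * fps

def pick_target_video(angle_series: Dict[str, List[float]], fps: float = 25.0) -> str:
    """Return the id of the user video whose motion change amplitude is largest."""
    return max(angle_series, key=lambda vid: motion_amplitude(angle_series[vid], fps))
```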
The at least two user videos may be loaded in a preset area of the playing picture of the live video, such as the bottom, top, left side, or right side of the picture, where they have little impact on the viewing experience, and the user videos may be displayed between the two opposite ends of the area. The target user video, being the user video with the largest motion amplitude, may be displayed at the middle position of the preset area, for example at the bottom center of the live picture, i.e., at the center of all the user videos. Because the motion amplitude of the user in each user video changes in real time, the motion change amplitude of each user video may be monitored in real time or periodically, and the user video with the largest motion amplitude is quickly moved to the bottom-center position of the live picture while the display order of the other user videos remains unchanged (unless a user video is added or removed).
For example, as shown in fig. 4, a plurality of user videos 41 that have joined the interaction may be displayed at the bottom of a playing picture 40 of the live video, and the target user video 42 with the largest motion amplitude may be displayed at the center-most position at the bottom of the picture. Because the user video with the largest motion amplitude changes over time, the user video display area may shift to the left or right; since the row of user videos can only fill one screen width, a circular display mode may be adopted: when the display area moves to the left, the leftmost user video is moved to the rightmost position for display.
Thus, this embodiment ensures that the video of the user with the largest motion amplitude (the most active interaction) is displayed at the center of the live picture, which improves users' enthusiasm for interacting and encourages viewers to interact actively while watching in order to take the center position.
Optionally, other user videos than the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in a middle position of the preset area in a second size, where the second size is larger than the first size.
That is, in one embodiment, the user video with the largest motion amplitude may be displayed in an enlarged manner to highlight the interactive effect of the user video. Specifically, the target user video may be displayed in a second size at a middle position of the preset area, for example, in the middle of the bottom of a playing picture of the live video, and other user videos may be displayed in the preset area in a first size, for example, in the bottom of the playing picture of the live video, where the second size is larger than the first size, the second size may be 1.2 to 2 times of the first size, and may be specifically adjusted by a visual display effect, and the first size may be a default size, and may be specifically determined according to an experimental effect of displaying in different sizes, so as to ensure that a user video display area is not too small or too large.
For example, as shown in FIG. 5, the target user video 42 in the bottom center of the live video frame 40 may be displayed in an enlarged scale, with the other user videos 41 remaining displayed in a default size.
Therefore, the video of the user with the largest motion amplitude is amplified and displayed, so that the user with good interactive performance can be highlighted, and the user is stimulated to carry out active interaction in the watching process.
Optionally, after determining the target user video from the at least two user videos, the method further includes:
identifying the outline of the action core part of the user in the target user video;
based on the outline, intercepting a dynamic image of the action core part from the target user video;
and loading the dynamic image of the action core part at a position associated with the target user video in a playing picture of the live video by a third size, wherein the third size is larger than a second size, and the second size is the display size of the target user video.
In another embodiment, other highlighting effects may be applied to the video of the user with the largest motion amplitude. Specifically, the outline of the action core part of the user in the target user video may be identified, where the action core part may be a part that can represent the user's target action, such as an arm or the face. Taking arm contour identification as an example, the two hands of the user in the target user video can be identified with a limb recognition algorithm, and the contours of the user's two hands are then detected with a contour detection algorithm. Next, based on the identified arm contour, a dynamic image of the arm is captured from the target user video, enlarged, and placed at a position associated with the target user video in the playing picture of the live video, for example above the target user video. The size of the enlarged arm image may be 2 to 5 times that of the target user video below it, i.e., the third size may be 2 to 5 times the second size, but the arm image should not extend beyond the live video picture.
For example, as shown in fig. 5, the dynamic image of the arm of the user in the target user video 42 may be captured and then displayed in an enlarged manner on the target user video 42.
Therefore, more interesting interaction effect can be generated, and the interaction interest and the enthusiasm of the user are improved.
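For illustration, the following sketch crops the action core part (here an arm) from a frame of the target user video and enlarges it; the binary arm mask is assumed to come from a separate limb-segmentation step, which this embodiment does not prescribe.

```python
# Sketch of cropping and enlarging the action core part (the arm mask is an assumed input).
import cv2
import numpy as np

def crop_action_part(frame: np.ndarray, arm_mask: np.ndarray, scale: float = 3.0) -> np.ndarray:
    """Find the largest contour in the arm mask, crop its bounding box from the
    frame, and enlarge it (e.g. 2-5x) for display above the target user video."""
    contours, _ = cv2.findContours(arm_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return frame
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    crop = frame[y:y + h, x:x + w]
    return cv2.resize(crop, (int(w * scale), int(h * scale)))
```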
Optionally, after loading the dynamic image of the action core part in a third size at a position associated with the target user video in the playing frame of the live video, the method further includes:
and identifying the interactive action of the dynamic image of the action core part on a first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action, wherein the first user video is at least one of the other user videos.
That is, in an implementation, the user in the target user video may also be supported in interacting with the other user videos loaded in the live video through interactive actions of the user's action core part. Specifically, when the user in the target user video performs an interactive action on another user video (referred to as a first user video) with the action core part, such as touching, grabbing, or flicking it, the server may recognize the interactive action and, based on it, generate the corresponding interactive display effect of the first user video in the playing picture of the live video. For example, when the interactive action is a touch, an effect of the first user video being touched may be generated; when the interactive action is a grab, an effect of the first user video being grabbed and having its display position switched may be generated; and when the interactive action is a finger flick, an effect of the first user video being flicked away may be generated.
Therefore, through the implementation mode, the user interaction effect in live broadcast can be further enhanced, the interaction interest is improved, and the enthusiasm of the user for participating in live broadcast interaction is improved.
Optionally, the identifying an interactive action of the dynamic image of the action core portion on the first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action includes:
acquiring position information of the dynamic image of the action core part in a playing picture of the live video;
and under the condition that the dynamic image of the action core part is detected to be overlapped with the position information of the first user video, generating a collision effect graph of the dynamic image of the action core part and the first user video, and loading the collision effect graph into the live video.
In other words, in one embodiment, the user corresponding to the target user video may perform interesting interaction with other user videos through a specific action, for example, the user may perform touch, ejection, and the like on the other user videos by controlling an arm action, so as to achieve a physical collision effect.
The dynamic image of the action core part may be a dynamic image generated by capturing the action core part of the user in the target user video in real time. For example, when the user moves an arm in the target user video, the arm motion is displayed correspondingly above the target user video, so the user can use the arm to touch the display areas of other user videos (each such area may be called a user video avatar), flick other avatars away, and so on.
Still taking the arm as the action core part as an example, in order to recognize these actions and generate the corresponding interactive effects, the server may obtain, in real time, the position information of the dynamic arm image in the playing picture of the live video, such as the coordinates of the points of the arm image in the playing picture, as well as the coordinates of each of the at least two user videos in the playing picture. It can then detect whether the coordinate ranges of the arm image and of a user video overlap; an overlap indicates that the user video is touched by the user's arm, that is, the arm collides with the user video, and a collision effect graph of the arm image and the first user video is generated. For example, when the user swings an arm onto the avatars of the user videos to the left and right, those avatars can show the physical effect of being hit, and a collided user video can recover after a certain period, such as 1-3 s.
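The overlap test that triggers the collision effect could, for example, be implemented as a simple bounding-box intersection check; the rectangle representation and the data layout below are assumptions of the example.

```python
# Bounding-box overlap test between the arm image and the user video avatars (layout assumed).
from typing import Dict, List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in play-picture coordinates

def rects_overlap(a: Rect, b: Rect) -> bool:
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def find_hit_avatars(arm_rect: Rect, avatar_rects: Dict[str, Rect]) -> List[str]:
    """Return the ids of the user-video avatars whose position overlaps the arm image."""
    return [vid for vid, rect in avatar_rects.items() if rects_overlap(arm_rect, rect)]
```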
In the same way, the user corresponding to the target user video can also flick the avatars of other user videos away or pat them on the head; accordingly, the server can generate an effect graph of the touched user video avatar being flicked away or being patted.
Therefore, through this implementation, the user occupying the center position can interact playfully with the avatars of other users' videos, which makes watching the live video more interactive and entertaining.
Optionally, the generating a collision effect map of the dynamic image of the motion core part and the first user video includes:
retracting the first user video to a first direction of a playing picture of the live video, wherein the first direction is related to the action direction of the dynamic image of the action core part;
or, determining a target motion speed and a target motion direction of the first user video after the first user video collides with the dynamic image of the motion core part based on the motion speed and the motion direction of the motion core part in the dynamic image of the motion core part; and generating a pop-up effect picture of the first user video in a playing picture of the live video according to the target motion speed and the target motion direction.
That is, in one embodiment, the generating of the collision effect map between the dynamic image of the action core part and the first user video may be combined with action recognition of the action core part in that dynamic image, such as the arm in the arm dynamic image. When it is recognized that the user's arm reaches toward a certain user video head portrait, an effect picture of that head portrait being pressed down is generated; specifically, the touched user video is retracted toward a first direction of the playing picture of the live video, such as toward the bottom. When the action core part, such as the arm, touches a plurality of user video head portraits, a downward-retracting effect can be generated for each of them, and retracting effect graphs of different depths can be generated according to the distance between each user video head portrait and the user's arm and fingers. For example, as shown in fig. 6, when the user touches the left user video head portraits 43, these head portraits 43 may present a physical collision effect of being pressed downward.
The generating of the collision effect map between the action core part and the first user video may also be combined with action recognition of the action core part in the dynamic image; for example, when it is recognized that a user video head portrait is flicked away by the user's finger, an effect graph of that head portrait being ejected is generated. Specifically, taking the arm dynamic image as an example, the motion speed and motion direction of the arm in the arm dynamic image may be obtained: the number of pixels the arm moves per second in the playing picture may be used to determine the arm motion speed, and the arm motion direction may be determined from the pixel positions of the arm before and after it moves in the playing picture.
Then, the target motion speed and target motion direction of the ejected user video head portrait after colliding with the arm dynamic image can be calculated according to the laws of an elastic collision (conservation of momentum and kinetic energy). For example, it can be assumed that the mass of the first user video head portrait is m1 and the mass of the user's arm is m2, with m1 = 1 and m2 equal to the area ratio of the arm dynamic image in the target user video, that is, m2 is the area of the arm dynamic image divided by the area of the target user video. Assuming that the speed of the first user video head portrait before the collision is v1 and the speed of the user's arm is v2, then from m1·v1 + m2·v2 = m1·v1' + m2·v2' the elastic collision result gives v1' = [(m1 − m2)·v1 + 2·m2·v2]/(m1 + m2) and v2' = [(m2 − m1)·v2 + 2·m1·v1]/(m1 + m2), where v1' is the post-collision speed in the direction opposite to v1 and v2' is the post-collision speed in the direction opposite to v2; v1 and v2 may be given initial values or may be the actually obtained speeds. In this way, the target motion speed and target motion direction v1' of the first user video after colliding with the arm dynamic image can be calculated, so that a pop-up effect picture of the first user video in the playing picture of the live video can be generated according to the target motion speed and target motion direction, that is, an effect picture of the user video head portrait, after being flicked by the user's finger, moving at the target motion speed along the target motion direction.
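The post-collision speed calculation above can be expressed directly in code. The sketch below follows the formulas and the mass convention stated in this paragraph; the concrete speed and area values in the example call are assumptions for illustration only.

```python
# Sketch of the post-collision speed calculation: m1 = 1 for the head portrait,
# m2 = area(arm dynamic image) / area(target user video), standard 1-D elastic
# collision formulas as given in the description.
def post_collision_speeds(v1, v2, arm_area, target_video_area):
    m1 = 1.0
    m2 = arm_area / target_video_area
    v1_new = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)  # head portrait after impact
    v2_new = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)  # arm after impact
    return v1_new, v2_new

# Example with assumed values: head portrait at rest, arm moving at 400 px/s,
# arm dynamic image covering 30% of the target user video area.
print(post_collision_speeds(v1=0.0, v2=400.0, arm_area=0.3, target_video_area=1.0))
```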
In addition, the influence of gravity on the moving speed of the user video head portrait may be ignored, and a global deceleration a may be set instead. With a smaller than 1, the current speed of the user video head portrait becomes v1' × a at each update; once the speed falls below a certain threshold, the pop-up effect disappears and the user video head portrait returns to its original position in a straight line at a fixed speed. Regarding the flying direction of the user video head portrait, the four sides of the playing picture of the live video can be treated as mirrors: when the head portrait reaches one of the four sides, its moving direction is treated as a ray of light and it continues along the mirror-reflected track. When a head-pat action by the user is recognized, the value of the global deceleration a can be reduced in the physical collision, so that the speed is constrained within a certain range; that is, if a is small enough, the user video head portrait moves only a small distance after the collision.
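As a rough illustration of this motion model, the Python sketch below applies the global deceleration each frame, ends the pop-up effect below a speed threshold, and reflects the head portrait at the picture edges; the frame rate, stop threshold and function names are assumptions, not values from the embodiment.

```python
# Per-frame motion update: damp the speed by the global deceleration a, stop the
# pop-up effect once the speed is small, and mirror-reflect the velocity when the
# head portrait reaches an edge of the playing picture.
def step(pos, vel, a, width, height, stop_speed=5.0, dt=1.0 / 30):
    x, y = pos
    vx, vy = vel
    x, y = x + vx * dt, y + vy * dt
    if x < 0 or x > width:           # hit left/right edge: reflect horizontally
        vx = -vx
        x = min(max(x, 0), width)
    if y < 0 or y > height:          # hit top/bottom edge: reflect vertically
        vy = -vy
        y = min(max(y, 0), height)
    vx, vy = vx * a, vy * a          # apply the global deceleration every frame
    speed = (vx ** 2 + vy ** 2) ** 0.5
    flying = speed >= stop_speed     # when False, the avatar returns to its slot
    return (x, y), (vx, vy), flying

pos, vel, flying = (100.0, 100.0), (300.0, -120.0), True
while flying:
    pos, vel, flying = step(pos, vel, a=0.9, width=1280, height=720)
print(pos)   # final position before the avatar glides back to its original slot
```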
Thus, according to the embodiment, the physical collision effect between the dynamic image of the user action core part and the user video can be generated, and the quality of the effect graph can be ensured.
Optionally, the identifying an interactive action of the dynamic image of the action core portion on the first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action includes:
performing gesture recognition on the target user video;
under the condition that a grabbing gesture of a user in the target user video is recognized, determining a second user video overlapped with the position information of the dynamic image of the action core part, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with the end position of the moving track.
In other words, in one implementation manner, the user corresponding to the target user video may further adjust the display positions of the user video head portraits in the preset area of the live video picture by controlling arm actions; for example, the user may grab a certain user video head portrait with a grabbing gesture of the fingers, and the display order of the user videos is then changed accordingly.
Specifically, the action core part may be an arm, and the server may identify a grabbing action of the user in the target user video through a gesture recognition algorithm. When the user in the target user video performs the grabbing action, a finger in the arm dynamic image displayed above the target user video coincides with the coordinates of a certain user video, so that when it is identified that the position information of that user video overlaps with the arm dynamic image, the ID of that user video can be obtained. Based on the movement track of the arm dynamic image in the playing picture of the live video, that user video can then be moved to a target position associated with the end point of the movement track. Specifically, after the user's finger grabs a user video head portrait and releases it, the adjusted order of the grabbed user video head portrait can be determined from the center point coordinates of the grabbed video and the center point coordinates of the other nearby user videos.
For example, as shown in fig. 7, when the user video A is grabbed to a certain position between the user video B and the user video C, it may be determined to insert the user video A between the user video B and the user video C by identifying the current center point coordinate position of the user video A and the center point coordinate positions of the user video B and the user video C.
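A minimal sketch of this reordering rule is given below; it assumes the head portraits are laid out in a single row and compares only the horizontal center coordinates, which is an assumption made for the example rather than a requirement of the embodiment.

```python
# Sketch of the reordering rule illustrated by fig. 7: the released head portrait
# is inserted according to its current center point relative to the center points
# of the other head portraits in the row.
def reorder_after_drop(order, grabbed_id, centers):
    """order: current left-to-right list of video ids; centers: id -> center x."""
    rest = [vid for vid in order if vid != grabbed_id]
    drop_x = centers[grabbed_id]
    insert_at = sum(1 for vid in rest if centers[vid] < drop_x)
    return rest[:insert_at] + [grabbed_id] + rest[insert_at:]

# Example: A is released between B and C.
print(reorder_after_drop(["A", "B", "C", "D"], "A",
                         {"A": 250, "B": 150, "C": 350, "D": 450}))
# -> ['B', 'A', 'C', 'D']
```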
Therefore, the users occupying the central position can perform interesting interaction in various modes on the video head portraits of other users, and the interactive interest of the users in watching live videos is improved.
Optionally, the step 103 comprises at least one of:
under the condition that the user in the user video is determined to be in the target state, performing background segmentation on the user video based on the figure image outline in the user video, and determining a background area in the user video; filling a background area in the user video by using a preset color; loading the processed user video into the live video;
under the condition that the user in the user video is determined to be in the target state, carrying out face recognition on the user video, and determining the face position in the user video; based on the face position in the user video, cutting the user video; and loading the cut user video into the live video.
In one implementation, the background in the user video can be segmented and filled with color, so as to unify the backgrounds of the user interaction videos and avoid background interference.
Specifically, background segmentation may be used to separate the person and the background of the user video: the contour of the person image in the user video is identified, background segmentation is performed on the user video based on that contour to determine the background area, and the background area is then filled with a preset color, that is, the background may be filled with a single-color scheme. When there are multiple user videos, different colors can be used to fill the background areas of different user videos, which prevents the backgrounds from looking too monotonous.
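Only the color-filling step is sketched below; the person/background mask is assumed to come from some portrait-segmentation model that is not shown, and the palette values are placeholders chosen for illustration.

```python
# Sketch of the background filling step: mask == 1 marks person pixels, 0 marks
# background pixels; every background pixel is painted with a preset color.
import numpy as np

def fill_background(frame_bgr: np.ndarray, person_mask: np.ndarray, color=(40, 160, 90)):
    out = frame_bgr.copy()
    out[person_mask == 0] = color   # paint the background a single preset color
    return out

def pick_color(video_index: int):
    # Use a different preset color per user video so the row is not monotonous.
    palette = [(40, 160, 90), (200, 120, 40), (90, 60, 200), (30, 30, 30)]
    return palette[video_index % len(palette)]
```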
Therefore, the problem that the picture is not clear or the watching experience is poor due to the fact that the background is too complex can be solved by performing color filling processing on the background area in the user video.
In another embodiment, the user video may be cropped to ensure that the face area of the user is mainly displayed in the playing picture of the live video. Specifically, face recognition may be performed on the user video to determine the coordinate position of the user's face, and then each frame of user image in the user video is cropped at a certain aspect ratio, for example 1:1, based on the recognized face position. The cropping rule may be as shown in fig. 8: the user face 80 is kept horizontally centered in the picture and located at roughly the upper third of the picture vertically, where the frame line in fig. 8 is the cropping area. Finally, the cropped user video may be loaded into the live video; for example, as shown in fig. 4, the cropped user video 41 may be displayed at the bottom of the playing picture 40 of the live video.
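The sketch below computes such a square crop from a face bounding box. The crop size relative to the face, the exact vertical placement and the example coordinates are all assumptions; only the general rule (square crop, face horizontally centered and near the upper third) follows the description.

```python
# Sketch of the 1:1 cropping rule: the crop is square, horizontally centered on
# the face, and placed so the face center sits around the upper third of the crop.
def square_crop_around_face(frame_w, frame_h, face_x, face_y, face_w, face_h, scale=3.0):
    side = int(min(max(face_w, face_h) * scale, frame_w, frame_h))
    face_cx = face_x + face_w / 2
    left = int(face_cx - side / 2)              # face horizontally centered
    top = int(face_y + face_h / 2 - side / 3)   # face center near the upper third
    left = max(0, min(left, frame_w - side))    # clamp the crop inside the frame
    top = max(0, min(top, frame_h - side))
    return left, top, side, side

# Example with an assumed face box in a 1280x720 frame.
print(square_crop_around_face(1280, 720, face_x=600, face_y=200, face_w=120, face_h=140))
```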
Therefore, by cropping the user video, the interactive video displayed in the live video can be ensured to highlight the head portrait characteristics of the interactive user, and the display effect of the interactive video is ensured.
Optionally, the step 101 includes:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
after the step 101, the method further comprises:
identifying and classifying the sounds in the plurality of user videos;
determining a number fraction of sounds of a target category in the plurality of user videos;
and processing the sound volume of the target category under the condition that the number ratio is greater than a preset value.
In an implementation manner, the server may receive a plurality of user videos sent by a plurality of clients, and the plurality of clients playing the live videos may respectively perform video acquisition on respective watching users and upload the user videos acquired by the respective watching users to the server.
The server may obtain the total number of users currently connected to the video interaction for this live session, denoted u, that is, the number of acquired user videos; for example, if 100 users are currently watching a match in the live broadcast room and have joined the interaction, u is 100.
The server can also identify and classify the user sound in each user video and determine the sound category of the user in each user video, such as cheering, speaking, crying, sneezing, or non-human sound. Specifically, a sound classification algorithm based on deep learning or the like can analyze, in real time, all the user sounds connected to the live video interaction and determine the sound category of each user, and routine sounds, such as sneezing and non-human sounds, can be filtered out without being counted. The numbers of the other sound categories are then counted; for example, it may be determined that 20 users are currently silent, 45 users are speaking in a low voice, and 35 users are shouting loudly, that is, cheering. The number of sounds of the target category can thus be determined, where the target category may be chosen according to the user state to be identified; for example, the sounds of the target category may be cheering, crying, and the like. If the number of sounds of the current target category is denoted S and the target category is cheering, then S is 35, and the number proportion of the sounds of the target category is S/u.
In a case where the number proportion of the sounds of the target category is larger than a preset value, the current atmosphere may be identified as the target atmosphere, for example, in a case where it is identified that the number proportion of cheering sounds is larger than a preset value, the current atmosphere may be determined as the cheering atmosphere. The preset value may be different values based on different types of live videos, for example, for a world cup live video, the preset value may be higher, such as 0.8, and for a common live video, the preset value may be lower, such as 0.3.
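The counting and threshold comparison described in the two paragraphs above can be summarized as follows. The classifier producing the per-user labels is assumed to exist elsewhere; the label names, the routine-sound set and the example thresholds (0.8 for a world cup live, 0.3 otherwise) follow the figures quoted in the text, while everything else is illustrative.

```python
# Sketch of the statistics: count the target-category sounds among the connected
# users, compute the share S/u, and compare it with a preset value that depends
# on the type of live video.
from collections import Counter

ROUTINE = {"sneeze", "non_human"}            # filtered out, not counted

def target_share(labels, target="cheer"):
    u = len(labels)                           # total users connected to the interaction
    counted = Counter(l for l in labels if l not in ROUTINE)
    s = counted[target]
    return s / u if u else 0.0

def is_target_atmosphere(share, live_type):
    preset = {"world_cup": 0.8}.get(live_type, 0.3)   # thresholds from the example
    return share > preset

labels = ["cheer"] * 35 + ["speak"] * 45 + ["silent"] * 20
share = target_share(labels)                  # 35 / 100 = 0.35
print(share, is_target_atmosphere(share, "ordinary"))   # 0.35 True
```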
In the case where it is determined that the current atmosphere is the target atmosphere, the sound volume of the target category may be processed, for example, the sound volume of the target category may be set to match the current atmosphere, the cheering volume may be set to be larger for cheering atmospheres, the crying sound volume may be set to be smaller for sad atmospheres, and so on.
Therefore, the sound in the user video accessed to the interaction is identified and classified, the sound volume of the target category is determined, the playing effect of the user interaction video can be matched with the current atmosphere, and the interaction effect in the live broadcast video is guaranteed.
Optionally, the processing the sound volume of the target category includes:
determining the sound volume of the target category based on the number ratio, wherein the sound volume of the target category is positively correlated with the number ratio;
and superposing the sound of the target category in the plurality of user videos according to the sound volume of the target category.
That is, in one embodiment, the sound volume of the target category may be determined according to the number proportion of the sounds of the target category: the higher the proportion, the greater the volume. The determined volume of the target category may then be superimposed on the plurality of user videos, so that the volume of the target-category sound gradually increases as the number of such sounds increases; for example, as more users start cheering, the cheering played from the user videos becomes louder when the client plays the live video.
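A small sketch of this mapping is shown below: the volume grows with the number share, and an extra smoothing step illustrates the gradual "buffering" of the volume mentioned next. The linear mapping, the smoothing constant and the sample shares are assumptions, not values from the embodiment.

```python
# Volume positively correlated with the number share, with a smoothing step so
# the cheer swells gradually instead of jumping abruptly between updates.
def target_volume(share, max_gain=1.0):
    return max_gain * share                   # higher share -> louder target sound

def smoothed_volume(previous, current, alpha=0.2):
    # Move only part of the way toward the new volume each update.
    return previous + alpha * (current - previous)

vol = 0.0
for share in (0.1, 0.3, 0.6):                 # share of cheering users rising over time
    vol = smoothed_volume(vol, target_volume(share))
    print(round(vol, 3))                      # 0.02, 0.076, 0.181
```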
Therefore, the target-category sound is prevented from being amplified abruptly and startling users who are not psychologically prepared; a gradual buffering of the sound volume is achieved, and the interaction effect for the user is ensured.
And 104, sending the live video loaded with the user video to a second client.
The server may send the live video loaded with the user video to a second client, where the second client may include the first client or may include clients other than the first client. That is, a user who does not participate in the interaction can also see the interactive videos of other users in the live video; of course, a user who does not participate in the interaction may also choose not to receive the interactive videos of other users and only watch the live video.
After receiving the live video loaded with the user video, the second client can display the user video in a live video picture, that is, the second client user can watch the live video and simultaneously watch the interactive video of the second client user or other users.
The video processing method of the embodiment of the invention includes: acquiring a user video, wherein the user video is obtained by a first client playing a live video performing video acquisition on a watching user; identifying the user video and determining whether the user in the user video is in a target state; loading the user video into the live video if it is determined that the user in the user video is in the target state; and sending the live video loaded with the user video to a second client. In this way, the user can interact through video while watching the live video, and the interaction mode is more interesting and rich.
Referring to fig. 9, fig. 9 is a flowchart of a video processing method provided by an embodiment of the present invention, and is applied to a client, as shown in fig. 9, the method includes the following steps:
and step 901, receiving the live video loaded with the user video and transmitted by the server.
And step 902, playing the live video loaded with the user video.
The user video is obtained by performing video acquisition on a watching user by a first client playing the live video, and the user in the user video is in a target state.
Optionally, the method further comprises:
in the process of playing the live video, video acquisition is carried out on a watching user to obtain a third user video;
and uploading the third user video to the server, so that the server identifies the third user video.
It should be noted that, this embodiment is used as an implementation of the client side corresponding to the embodiment shown in fig. 1, and specific implementation thereof may refer to the relevant description in the embodiment shown in fig. 1, and is not described herein again to avoid repetition.
The video processing method of the embodiment of the invention receives a live video, loaded with a user video, sent by a server, and plays the live video loaded with the user video. In this way, the user can interact through video while watching the live video, and the interaction mode is more interesting and rich.
The embodiment of the invention also provides a video processing device. Referring to fig. 10, fig. 10 is a structural diagram of a video processing apparatus according to an embodiment of the present invention, applied to a server. Since the principle of the video processing apparatus for solving the problem is similar to the video processing method in the embodiment of the present invention, the implementation of the video processing apparatus can refer to the implementation of the method, and repeated details are not repeated.
As shown in fig. 10, the video processing apparatus 1000 includes:
a first obtaining module 1001, configured to obtain a user video, where the user video is obtained by a first client that plays a live video and performing video acquisition on a watching user;
a first identifying module 1002, configured to identify the user video, and determine whether a user in the user video is in a target state;
a first processing module 1003, configured to, when it is determined that a user in the user video is in the target state, load the user video into the live video;
a sending module 1004, configured to send the live video loaded with the user video to the second client.
Optionally, the first identifying module 1002 is configured to identify the motion amplitude and/or facial expression of the user in the user video, and determine whether the user in the user video is in the target emotional state.
Optionally, the first identifying module 1002 includes:
the first determining unit is used for determining an included angle between an arm and a trunk of the user in each frame of user image in the user video;
and the second determining unit is used for determining that the user in the user video is in the cheering state under the condition that the included angle between the arm and the trunk of the user in the target frame user image is determined to be larger than a preset angle, wherein the target frame user image is any one frame image in the user video.
Optionally, the first obtaining module 1001 is configured to obtain a plurality of user videos, where the user videos are obtained by respectively performing video acquisition on respective watching users by different clients playing a live video;
the first processing module 1003 includes:
a third determining unit, configured to determine a motion variation amplitude of the user in each user video when determining that the user in the at least two user videos is in the target state;
a fourth determining unit, configured to determine a target user video from the at least two user videos, where a motion variation amplitude of a user in the target user video is largest;
the first processing unit is used for loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is located in the middle of the preset area.
Optionally, other user videos than the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in a middle position of the preset area in a second size, where the second size is larger than the first size.
Optionally, the video processing apparatus 1000 further includes:
the second identification module is used for identifying the outline of the action core part of the user in the target user video;
the intercepting module is used for intercepting a dynamic image of the action core part from the target user video based on the outline;
and the second processing module is used for loading the dynamic image of the action core part at a position, associated with the target user video, in a playing picture of the live video in a third size, wherein the third size is larger than a second size, and the second size is the display size of the target user video.
Optionally, the video processing apparatus 1000 further includes:
and the seventh processing module is configured to identify an interactive action of the dynamic image of the action core part on a first user video, and generate an interactive display effect of the first user video in a playing picture of the live video based on the interactive action, where the first user video is at least one of the other user videos.
Optionally, the video processing apparatus 1000 further includes:
the second acquisition module is used for acquiring the position information of the dynamic image of the action core part in the playing picture of the live video;
and the third processing module is used for generating a collision effect graph of the dynamic image of the action core part and the first user video under the condition that the dynamic image of the action core part is detected to be overlapped with the position information of the first user video, and loading the collision effect graph into the live video.
Optionally, the third processing module is configured to indent the first user video in a first direction of a playing frame of the live video, where the first direction is related to an action direction of a dynamic image of the action core portion;
or, the third processing module is configured to determine a target motion speed and a target motion direction of the first user video after collision with the dynamic image of the motion core part, based on the motion speed and the motion direction of the motion core part in the dynamic image of the motion core part; and generating a pop-up effect picture of the first user video in a playing picture of the live video according to the target motion speed and the target motion direction.
Optionally, the video processing apparatus 1000 further includes:
the third identification module is used for carrying out gesture identification on the target user video;
a first determining module, configured to determine, when a grabbing gesture of a user in the target user video is identified, a second user video that overlaps with the position information of the dynamic image of the action core portion, where the second user video is one of the other user videos;
and the fourth processing module is used for moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with the end position of the moving track.
Optionally, the first processing module 1003 includes:
the second processing unit is used for carrying out background segmentation on the user video based on the figure image outline in the user video and determining a background area in the user video under the condition that the user in the user video is determined to be in the target state;
the third processing unit is used for filling a background area in the user video by using a preset color;
the fourth processing unit is used for loading the processed user video into the live video;
and/or, the first processing module 1003 includes:
the identification unit is used for carrying out face identification on the user video under the condition that the user in the user video is determined to be in the target state, and determining the face position in the user video;
the sixth processing unit is used for cutting the user video based on the face position in the user video;
and the seventh processing unit is used for loading the cut user video into the live video.
Optionally, the first obtaining module 1001 is configured to obtain a plurality of user videos, where the user videos are obtained by respectively performing video acquisition on respective watching users by different clients playing a live video;
the video processing apparatus 1000 further includes:
the fifth processing module is used for identifying and carrying out classified statistics on the sound in the user videos;
a second determining module, configured to determine a number fraction of sounds in a target category in the plurality of user videos;
and the sixth processing module is used for processing the sound volume of the target category under the condition that the number ratio is greater than a preset value.
Optionally, the sixth processing module includes:
a fifth determining unit, configured to determine the sound volume of the target category based on the number ratio, where the sound volume of the target category is positively correlated with the number ratio;
and the fifth processing unit is used for superposing the sound of the target category in the plurality of user videos according to the sound volume of the target category.
The video processing apparatus provided in the embodiment of the present invention may implement the method embodiments described above, and the implementation principle and the technical effect are similar, which are not described herein again.
The video processing apparatus 1000 of the embodiment of the invention acquires a user video, wherein the user video is obtained by a first client playing a live video performing video acquisition on a watching user; identifies the user video and determines whether the user in the user video is in a target state; loads the user video into the live video if it is determined that the user in the user video is in the target state; and sends the live video loaded with the user video to a second client. In this way, the user can interact through video while watching the live video, and the interaction mode is more interesting and rich.
The embodiment of the invention also provides a video processing device. Referring to fig. 11, fig. 11 is a structural diagram of a video processing apparatus according to an embodiment of the present invention, which is applied to a client. Since the principle of the video processing apparatus for solving the problem is similar to the video processing method in the embodiment of the present invention, the implementation of the video processing apparatus can refer to the implementation of the method, and repeated details are not repeated.
As shown in fig. 11, the video processing apparatus 1100 includes:
the receiving module 1101 is configured to receive a live video loaded with a user video and sent by a server;
a playing module 1102, configured to play the live video loaded with the user video.
The user video is obtained by performing video acquisition on a watching user by a first client playing the live video, and the user in the user video is in a target state.
Optionally, the video processing apparatus 1100 further comprises:
the acquisition module is used for acquiring videos of watching users in the process of playing the live videos to obtain a third user video;
and the uploading module is used for uploading the third user video to the server so that the server identifies the third user video.
The video processing apparatus provided in the embodiment of the present invention may implement the method embodiments described above, and the implementation principle and the technical effect are similar, which are not described herein again.
The video processing apparatus 1100 of the embodiment of the invention receives a live video, loaded with a user video, sent by a server, and plays the live video loaded with the user video. In this way, the user can interact through video while watching the live video, and the interaction mode is more interesting and rich.
The embodiment of the invention also provides an electronic device. Because the principle by which the electronic device solves the problem is similar to the video processing method in the embodiment of the present invention, the implementation of the electronic device may refer to the implementation of the method, and repeated details are not repeated. In one embodiment, the electronic device may be a server; as shown in fig. 12, the server includes:
a processor 1200 for reading the program in the memory 1220 and executing the following processes:
acquiring a user video, wherein the user video is obtained by performing video acquisition on a watching user by a first client playing a live video;
identifying the user video, and determining whether a user in the user video is in a target state;
loading the user video into the live video if it is determined that the user in the user video is in the target state;
and transmitting the live video loaded with the user video to the second client through the transceiver 1210.
A transceiver 1210 for receiving and transmitting data under the control of the processor 1200.
Where in fig. 12, the bus architecture may include any number of interconnected buses and bridges, with various circuits of one or more processors represented by processor 1200 and memory represented by memory 1220 being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 1210 may be a number of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 1200 is responsible for managing the bus architecture and general processing, and the memory 1220 may store data used by the processor 1200 in performing operations.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
and identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotional state.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
determining an included angle between the arm and the trunk of the user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
determining the motion change amplitude of the user in each user video under the condition that the user in at least two user videos is determined to be in the target state;
determining a target user video from the at least two user videos, wherein the action change amplitude of the user in the target user video is the largest;
and loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is positioned in the middle of the preset area.
Optionally, other user videos than the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in a middle position of the preset area in a second size, where the second size is larger than the first size.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
identifying the outline of the action core part of the user in the target user video;
based on the outline, intercepting a dynamic image of the action core part from the target user video;
and loading the dynamic image of the action core part at a position associated with the target user video in a playing picture of the live video by a third size, wherein the third size is larger than a second size, and the second size is the display size of the target user video.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
and identifying the interactive action of the dynamic image of the action core part on a first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action, wherein the first user video is at least one of the other user videos.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
acquiring position information of the dynamic image of the action core part in a playing picture of the live video;
and under the condition that the dynamic image of the action core part is detected to be overlapped with the position information of the first user video, generating a collision effect graph of the dynamic image of the action core part and the first user video, and loading the collision effect graph into the live video.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
retracting the first user video to a first direction of a playing picture of the live video, wherein the first direction is related to the action direction of the dynamic image of the action core part;
or, determining a target motion speed and a target motion direction of the first user video after the first user video collides with the dynamic image of the motion core part based on the motion speed and the motion direction of the motion core part in the dynamic image of the motion core part; and generating a pop-up effect picture of the first user video in a playing picture of the live video according to the target motion speed and the target motion direction.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
performing gesture recognition on the target user video;
under the condition that a grabbing gesture of a user in the target user video is recognized, determining a second user video overlapped with the position information of the dynamic image of the action core part, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with the end position of the moving track.
Optionally, the processor 1200 is further configured to read a program in the memory 1220, and execute at least one of the following steps:
under the condition that the user in the user video is determined to be in the target state, performing background segmentation on the user video based on the figure image outline in the user video, and determining a background area in the user video; filling a background area in the user video by using a preset color; loading the processed user video into the live video;
under the condition that the user in the user video is determined to be in the target state, carrying out face recognition on the user video, and determining the face position in the user video; based on the face position in the user video, cutting the user video; and loading the cut user video into the live video.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
identifying and classifying the sounds in the plurality of user videos;
determining a number fraction of sounds of a target category in the plurality of user videos;
and processing the sound volume of the target category under the condition that the number ratio is greater than a preset value.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and execute the following steps:
determining the sound volume of the target category based on the number ratio, wherein the sound volume of the target category is positively correlated with the number ratio;
and superposing the sound of the target category in the plurality of user videos according to the sound volume of the target category.
When the electronic device provided by the embodiment of the present invention is used as a server, the method embodiment shown in fig. 1 may be executed, which has similar implementation principles and technical effects, and this embodiment is not described herein again.
In another embodiment, the electronic device may be a client, as shown in fig. 13, the client includes:
a processor 1300, for reading the program in the memory 1320, for executing the following processes:
receiving live video loaded with user video transmitted by a server through a transceiver 1310;
and playing the live video loaded with the user video.
The user video is obtained by performing video acquisition on a watching user by a first client playing the live video, and the user in the user video is in a target state.
A transceiver 1310 for receiving and transmitting data under the control of the processor 1300.
In fig. 13, among other things, the bus architecture may include any number of interconnected buses and bridges with various circuits being linked together, particularly one or more processors represented by processor 1300 and memory represented by memory 1320. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface. The transceiver 1310 may be a number of elements including a transmitter and a receiver that provide a means for communicating with various other apparatus over a transmission medium. User interface 1330 may also be an interface capable of interfacing with a desired device for different user devices, including but not limited to a keypad, display, speaker, microphone, joystick, etc.
The processor 1300 is responsible for managing the bus architecture and general processing, and the memory 1320 may store data used by the processor 1300 in performing operations.
Optionally, the processor 1300 is further configured to read the program in the memory 1320, and execute the following steps:
in the process of playing the live video, video acquisition is carried out on a watching user to obtain a third user video;
uploading the third user video to the server via the transceiver 1310 to enable the server to identify the third user video.
When the electronic device provided by the embodiment of the present invention is used as a client, the method embodiment shown in fig. 9 may be executed, which has similar implementation principles and technical effects, and this embodiment is not described herein again.
Furthermore, the computer-readable storage medium of the embodiment of the present invention is used for storing a computer program, and in one implementation, the computer program can be executed by a processor to implement the following steps:
acquiring a user video, wherein the user video is obtained by performing video acquisition on a watching user by a first client playing a live video;
identifying the user video, and determining whether a user in the user video is in a target state;
loading the user video into the live video if it is determined that the user in the user video is in the target state;
and sending the live video loaded with the user video to a second client.
Optionally, the identifying the user video and determining whether a user in the user video is in a target state includes:
and identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotional state.
Optionally, the recognizing the motion amplitude and/or facial expression of the user in the user video and determining whether the user in the user video is in the target emotional state includes:
determining an included angle between the arm and the trunk of the user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
Optionally, the acquiring the user video includes:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
the loading the user video into the live video under the condition that the user in the user video is determined to be in the target state comprises:
determining the motion change amplitude of the user in each user video under the condition that the user in at least two user videos is determined to be in the target state;
determining a target user video from the at least two user videos, wherein the action change amplitude of the user in the target user video is the largest;
and loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is positioned in the middle of the preset area.
Optionally, other user videos than the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in a middle position of the preset area in a second size, where the second size is larger than the first size.
Optionally, after determining the target user video from the at least two user videos, the method further includes:
identifying the outline of the action core part of the user in the target user video;
based on the outline, intercepting a dynamic image of the action core part from the target user video;
and loading the dynamic image of the action core part at a position associated with the target user video in a playing picture of the live video by a third size, wherein the third size is larger than a second size, and the second size is the display size of the target user video.
Optionally, after loading the dynamic image of the action core part in a third size at a position associated with the target user video in the playing frame of the live video, the method further includes:
and identifying the interactive action of the dynamic image of the action core part on a first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action, wherein the first user video is at least one of the other user videos.
Optionally, the identifying an interactive action of the dynamic image of the action core portion on the first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action includes:
acquiring position information of the dynamic image of the action core part in a playing picture of the live video;
and under the condition that the dynamic image of the action core part is detected to be overlapped with the position information of the first user video, generating a collision effect graph of the dynamic image of the action core part and the first user video, and loading the collision effect graph into the live video.
Optionally, the generating a collision effect map of the dynamic image of the motion core part and the first user video includes:
retracting the first user video to a first direction of a playing picture of the live video, wherein the first direction is related to the action direction of the dynamic image of the action core part;
or, determining a target motion speed and a target motion direction of the first user video after the first user video collides with the dynamic image of the motion core part based on the motion speed and the motion direction of the motion core part in the dynamic image of the motion core part; and generating a pop-up effect picture of the first user video in a playing picture of the live video according to the target motion speed and the target motion direction.
Optionally, the identifying an interactive action of the dynamic image of the action core portion on the first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action includes:
performing gesture recognition on the target user video;
under the condition that a grabbing gesture of a user in the target user video is recognized, determining a second user video overlapped with the position information of the dynamic image of the action core part, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with the end position of the moving track.
Optionally, in a case that it is determined that a user in the user video is in the target state, loading the user video into the live video includes at least one of:
under the condition that the user in the user video is determined to be in the target state, performing background segmentation on the user video based on the figure image outline in the user video, and determining a background area in the user video; filling a background area in the user video by using a preset color; loading the processed user video into the live video;
under the condition that the user in the user video is determined to be in the target state, carrying out face recognition on the user video, and determining the face position in the user video; based on the face position in the user video, cutting the user video; and loading the cut user video into the live video.
Optionally, the acquiring the user video includes:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
after the obtaining of the plurality of user videos, the method further includes:
identifying and classifying the sounds in the plurality of user videos;
determining a number fraction of sounds of a target category in the plurality of user videos;
and processing the sound volume of the target category under the condition that the number ratio is greater than a preset value.
Optionally, the processing the sound volume of the target category includes:
determining the sound volume of the target category based on the number ratio, wherein the sound volume of the target category is positively correlated with the number ratio;
and superposing the sound of the target category in the plurality of user videos according to the sound volume of the target category.
In another embodiment, the computer program is executable by a processor to perform the steps of:
receiving a live video loaded with a user video and transmitted by a server;
playing the live video loaded with the user video;
the user video is obtained by performing video acquisition on a watching user by a first client playing the live video, and the user in the user video is in a target state.
Optionally, the method further comprises:
in the process of playing the live video, video acquisition is carried out on a watching user to obtain a third user video;
and uploading the third user video to the server, so that the server identifies the third user video.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute some steps of the transceiving method according to various embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (18)

1. A video processing method is applied to a server side, and is characterized in that the method comprises the following steps:
acquiring a user video, wherein the user video is obtained by performing video acquisition on a watching user by a first client playing a live video;
identifying the user video, and determining whether a user in the user video is in a target state;
loading the user video into the live video if it is determined that the user in the user video is in the target state;
and sending the live video loaded with the user video to a second client.
2. The method of claim 1, wherein the identifying the user video and determining whether the user in the user video is in the target state comprises:
and identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotional state.
3. The method of claim 2, wherein the target emotional state comprises a cheering state;
the identifying the action amplitude and/or the facial expression of the user in the user video and determining whether the user in the user video is in the target emotional state comprise:
determining an included angle between the arm and the trunk of the user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
4. The method of claim 1, wherein the obtaining the user video comprises:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
the loading the user video into the live video under the condition that the user in the user video is determined to be in the target state comprises:
determining the motion change amplitude of the user in each user video under the condition that the user in at least two user videos is determined to be in the target state;
determining a target user video from the at least two user videos, wherein the action change amplitude of the user in the target user video is the largest;
and loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is positioned in the middle of the preset area.
5. The method of claim 4, wherein after determining the target user video from the at least two user videos, the method further comprises:
identifying the outline of the action core part of the user in the target user video;
based on the outline, intercepting a dynamic image of the action core part from the target user video;
and loading the dynamic image of the action core part at a position associated with the target user video in a playing picture of the live video by a third size, wherein the third size is larger than a second size, and the second size is the display size of the target user video.
6. The method of claim 5, wherein after the loading of the dynamic image of the action core part, at the third size, at the position associated with the target user video in the playing picture of the live video, the method further comprises:
and identifying the interactive action of the dynamic image of the action core part on a first user video, and generating an interactive display effect of the first user video in a playing picture of the live video based on the interactive action, wherein the first user video is at least one of other user videos except the target user video in the at least two user videos.
7. The method of claim 6, wherein the identifying of the interactive action of the dynamic image of the action core part on the first user video and the generating of the interactive display effect of the first user video in the playing picture of the live video based on the interactive action comprise:
acquiring position information of the dynamic image of the action core part in a playing picture of the live video;
and under the condition that the position information of the dynamic image of the action core part is detected to overlap with that of the first user video, generating a collision effect graph of the dynamic image of the action core part and the first user video, and loading the collision effect graph into the live video.
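A small sketch of the overlap detection in claim 7, assuming the compositing layer tracks each element's position as an (x, y, w, h) rectangle in the playing picture; the rectangle representation and the callback name are assumptions.

```python
def rects_overlap(a, b):
    """Axis-aligned overlap test between two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def maybe_render_collision(core_rect, first_video_rect, render_collision):
    """When the action-core dynamic image overlaps the first user video,
    generate the collision effect graph and load it into the live video."""
    if rects_overlap(core_rect, first_video_rect):
        render_collision(core_rect, first_video_rect)
```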
8. The method of claim 7, wherein the generating the collision effect graph of the dynamic image of the action core part and the first user video comprises:
retracting the first user video toward a first direction of the playing picture of the live video, wherein the first direction is related to the action direction of the dynamic image of the action core part;
or, determining a target motion speed and a target motion direction of the first user video after the first user video collides with the dynamic image of the action core part, based on the motion speed and the motion direction of the action core part in the dynamic image of the action core part; and generating a pop-up effect picture of the first user video in the playing picture of the live video according to the target motion speed and the target motion direction.
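For the second alternative of claim 8, the target motion speed and direction could, for example, be derived from the core part's velocity as in the sketch below; the restitution factor and the linear momentum-transfer model are assumptions, since the claim only requires that the result be based on the core part's motion.

```python
import numpy as np

def bounce_velocity(core_velocity, restitution=0.8):
    """Target motion speed and direction of the first user video after it
    collides with the dynamic image of the action core part."""
    v = np.asarray(core_velocity, dtype=float)
    speed = float(np.linalg.norm(v)) * restitution   # assumed damping factor
    direction = v / (np.linalg.norm(v) + 1e-9)       # unit direction vector
    return speed, direction  # drives the pop-up effect animation
```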
9. The method of claim 6, wherein the identifying of the interactive action of the dynamic image of the action core part on the first user video and the generating of the interactive display effect of the first user video in the playing picture of the live video based on the interactive action comprise:
performing gesture recognition on the target user video;
under the condition that a grabbing gesture of a user in the target user video is recognized, determining a second user video whose position information overlaps with that of the dynamic image of the action core part, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with the end position of the moving track.
10. The method of claim 1, wherein the loading the user video into the live video in the case that the user in the user video is determined to be in the target state comprises at least one of:
under the condition that the user in the user video is determined to be in the target state, performing background segmentation on the user video based on the person image outline in the user video, and determining a background area in the user video; filling the background area in the user video with a preset color; and loading the processed user video into the live video;
under the condition that the user in the user video is determined to be in the target state, performing face recognition on the user video, and determining a face position in the user video; cropping the user video based on the face position in the user video; and loading the cropped user video into the live video.
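The two alternatives of claim 10 could be prototyped roughly as follows; the Haar-cascade face detector, the green fill colour and the crop margin are assumptions for illustration, and a production system would likely use dedicated person-segmentation and face-detection models.

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def fill_background(frame, person_mask, color=(0, 255, 0)):
    """Fill the background area (everything outside the person outline)
    with a preset colour."""
    out = frame.copy()
    out[person_mask == 0] = color
    return out

def crop_around_face(frame, margin=0.5):
    """Crop the user video frame around the first detected face position."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return frame
    x, y, w, h = faces[0]
    mx, my = int(w * margin), int(h * margin)
    H, W = frame.shape[:2]
    return frame[max(0, y - my):min(H, y + h + my),
                 max(0, x - mx):min(W, x + w + mx)]
```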
11. The method of claim 1, wherein the obtaining the user video comprises:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively carrying out video acquisition on respective watching users by different clients playing live videos;
after the obtaining of the plurality of user videos, the method further includes:
identifying and classifying the sounds in the plurality of user videos;
determining a number ratio of sounds of a target category in the plurality of user videos;
and processing the sound volume of the target category under the condition that the number ratio is greater than a preset value.
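A sketch of the number-ratio check in claim 11, assuming an upstream classifier has already assigned one sound label per user video; the label values and the preset threshold are placeholders.

```python
from collections import Counter

def target_category_ratio(sound_labels, target_category):
    """Fraction of the user videos whose classified sound falls in the
    target category (e.g. cheering)."""
    if not sound_labels:
        return 0.0
    return Counter(sound_labels)[target_category] / len(sound_labels)

# Process the target category's volume only when the ratio exceeds a preset value.
labels = ["cheer", "cheer", "talk", "cheer"]          # assumed classifier output
should_process = target_category_ratio(labels, "cheer") > 0.5
```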
12. The method of claim 11, wherein the processing the sound volume of the target category comprises:
determining the sound volume of the target category based on the number ratio, wherein the sound volume of the target category is positively correlated with the number ratio;
and superimposing the sounds of the target category in the plurality of user videos according to the sound volume of the target category.
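Claim 12 only requires the volume to be positively correlated with the number ratio; a linear gain, as sketched below, is one simple choice. The mono float-array representation of the tracks and the clipping range are assumptions.

```python
import numpy as np

def mix_target_sounds(target_tracks, number_ratio, base_gain=1.0):
    """Superimpose the target-category sounds from the user videos, with a
    volume that grows linearly with the number ratio."""
    gain = base_gain * number_ratio
    mixed = np.sum(np.stack(target_tracks), axis=0) * gain
    return np.clip(mixed, -1.0, 1.0)  # keep within a normalised audio range
```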
13. A video processing method, applied to a client, wherein the method comprises the following steps:
receiving a live video loaded with a user video and transmitted by a server;
and playing the live video loaded with the user video.
14. The method of claim 13, further comprising:
performing video acquisition on a watching user during the playing of the live video to obtain a third user video;
and uploading the third user video to the server, so that the server identifies the third user video.
15. A video processing apparatus applied to a server, the video processing apparatus comprising:
a first acquisition module, configured to acquire a user video, wherein the user video is obtained by performing video acquisition on a watching user by a first client playing a live video;
a first identification module, configured to identify the user video and determine whether a user in the user video is in a target state;
a first processing module, configured to load the user video into the live video under the condition that the user in the user video is determined to be in the target state;
and a sending module, configured to send the live video loaded with the user video to a second client.
16. A video processing apparatus applied to a client, the video processing apparatus comprising:
a receiving module, configured to receive a live video loaded with a user video and sent by a server;
and a playing module, configured to play the live video loaded with the user video.
17. An electronic device, comprising: a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor; wherein the processor is configured to read the program in the memory to implement the steps in the video processing method according to any one of claims 1 to 12, or to implement the steps in the video processing method according to claim 13 or 14.
18. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps in the video processing method according to any one of claims 1 to 12; or implementing the steps in a video processing method according to claim 13 or 14.
CN202111191080.XA 2021-10-13 2021-10-13 Video processing method and device, server and client Active CN113949891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111191080.XA CN113949891B (en) 2021-10-13 2021-10-13 Video processing method and device, server and client

Publications (2)

Publication Number Publication Date
CN113949891A 2022-01-18
CN113949891B (en) 2023-12-08

Family

ID=79330366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111191080.XA Active CN113949891B (en) 2021-10-13 2021-10-13 Video processing method and device, server and client

Country Status (1)

Country Link
CN (1) CN113949891B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180041552A1 (en) * 2016-08-02 2018-02-08 Facebook, Inc. Systems and methods for shared broadcasting
CN107566911A (en) * 2017-09-08 2018-01-09 广州华多网络科技有限公司 A kind of live broadcasting method, device, system and electronic equipment
CN107613310A (en) * 2017-09-08 2018-01-19 广州华多网络科技有限公司 A kind of live broadcasting method, device and electronic equipment
WO2019100757A1 (en) * 2017-11-23 2019-05-31 乐蜜有限公司 Video generation method and device, and electronic apparatus
CN108271056A (en) * 2018-02-02 2018-07-10 优酷网络技术(北京)有限公司 Video interaction method, subscription client, server and storage medium
CN109218761A (en) * 2018-08-07 2019-01-15 邓德雄 Method and system for switching between live video and video
CN113038287A (en) * 2019-12-09 2021-06-25 上海幻电信息科技有限公司 Method and device for realizing multi-user video live broadcast service and computer equipment
CN111711859A (en) * 2020-06-28 2020-09-25 北京奇艺世纪科技有限公司 Video image processing method, system and terminal equipment
CN111935442A (en) * 2020-07-31 2020-11-13 北京字节跳动网络技术有限公司 Information display method and device and electronic equipment
CN112887746A (en) * 2021-01-22 2021-06-01 维沃移动通信(深圳)有限公司 Live broadcast interaction method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827664A (en) * 2022-04-27 2022-07-29 咪咕文化科技有限公司 Multi-channel live broadcast mixed flow method, server, terminal equipment, system and storage medium
CN114827664B (en) * 2022-04-27 2023-10-20 咪咕文化科技有限公司 Multi-path live broadcast mixed stream method, server, terminal equipment, system and storage medium

Also Published As

Publication number Publication date
CN113949891B (en) 2023-12-08

Similar Documents

Publication Publication Date Title
US20220004765A1 (en) Image processing method and apparatus, and storage medium
US11094131B2 (en) Augmented reality apparatus and method
US9210372B2 (en) Communication method and device for video simulation image
CN116648729A (en) Head portrait display device, head portrait generation device, and program
CN108712603B (en) Image processing method and mobile terminal
CN110418095B (en) Virtual scene processing method and device, electronic equipment and storage medium
CN108198130B (en) Image processing method, image processing device, storage medium and electronic equipment
KR102363554B1 (en) Method and program for producing multi reactive video, and generate meta data to make multi reactive video, and analyze into interaction data to understand human act
CN110401810B (en) Virtual picture processing method, device and system, electronic equipment and storage medium
KR20120018479A (en) Server and method for providing avatar using facial expression and gesture recognition
CN113840158B (en) Virtual image generation method, device, server and storage medium
TW202008143A (en) Man-machine interaction method and apparatus
CN111580652A (en) Control method and device for video playing, augmented reality equipment and storage medium
CN108876878B (en) Head portrait generation method and device
US20230047858A1 (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program product for video communication
CN111491208B (en) Video processing method and device, electronic equipment and computer readable medium
CN109150690B (en) Interactive data processing method and device, computer equipment and storage medium
CN109413152B (en) Image processing method, image processing device, storage medium and electronic equipment
CN113949891A (en) Video processing method and device, server and client
CN115396390A (en) Interaction method, system and device based on video chat and electronic equipment
CN112637692B (en) Interaction method, device and equipment
CN112749357A (en) Interaction method and device based on shared content and computer equipment
JP2021086474A (en) Avatar control system
US11706503B2 (en) Method and program for producing and providing reactive video
JP7442107B2 (en) Video playback device, video playback method, and video distribution system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant