CN112380396B - Video processing method and device, computer readable storage medium and electronic equipment - Google Patents

Video processing method and device, computer readable storage medium and electronic equipment

Info

Publication number
CN112380396B
CN112380396B (application CN202011253155.8A)
Authority
CN
China
Prior art keywords
video
audio
information
ratio
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011253155.8A
Other languages
Chinese (zh)
Other versions
CN112380396A (en)
Inventor
何重龙
孙静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202011253155.8A priority Critical patent/CN112380396B/en
Publication of CN112380396A publication Critical patent/CN112380396A/en
Application granted granted Critical
Publication of CN112380396B publication Critical patent/CN112380396B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a video processing method and device, a computer readable storage medium and electronic equipment, and relates to the technical field of video processing. The video processing method comprises the following steps: acquiring object state information of an object in a video to be processed; acquiring audio data and determining audio characteristic information of the audio data; and generating target audio and video data according to the object state information and the audio characteristic information. The method and the device automatically adjust the video playing speed according to the music rhythm and improve the accuracy with which the video content matches the climax points of the music rhythm.

Description

Video processing method and device, computer readable storage medium and electronic equipment
Technical Field
The disclosure relates to the technical field of video processing, and in particular relates to a video processing method and device, a computer readable storage medium and electronic equipment.
Background
With the rapid development of the mobile internet, short video has entered a stage of vigorous development. Suited to the fragmented way content spreads on mobile social media, short-video content keeps being innovated, and the emergence of the stuck point video (a video whose picture switching is synchronized with the beat of the music) has been favored by more and more people. Stuck point video generation is a video technique for producing pictures that match the rhythm of the music and switch smoothly at the rhythm points of the music. A stuck point video is generally produced with music that has a strong sense of rhythm, and the rhythm of the music must be grasped well so that it stays consistent with the rhythm of picture switching. Such video production methods are often used for producing short videos on platforms such as Douyin (TikTok).
In the prior art, there are mainly two methods for generating a stuck point video. The first method is to match the uploaded videos or photos to an existing music template and generate the stuck point video with one key. The second method is to make the stuck point video manually using video editing software.
The first method is fast and convenient, but the music cannot be chosen freely, the number of video segments or pictures is fixed, and personalized customization of the stuck point video cannot be realized. The drawback of the second method is that manual production makes the generation of the stuck point video extremely inefficient, and because the positions of the music rhythm points are determined entirely by hand, the accuracy of the determined positions is poor.
Disclosure of Invention
The disclosure aims to provide a video processing method and device, a computer readable storage medium and an electronic device, so as to overcome, at least to a certain extent, the problems of manual production and the poor accuracy with which music rhythm points are matched to video content that are caused by the limitations and defects of the related art.
According to a first aspect of the present disclosure, there is provided a video processing method, including: acquiring object state information of an object in a video to be processed; acquiring audio data and determining audio characteristic information of the audio data; and generating target audio and video data according to the object state information and the audio characteristic information.
Optionally, acquiring state information of an object in the video to be processed includes: acquiring video data and depth data of a video to be processed; object state information is determined from the video data and the depth data.
Optionally, determining object state information according to the video data and the depth data includes: determining the area where the object is located in each frame of picture of the video to be processed based on the video data to obtain object area information; determining depth data corresponding to the region where the object is located in each frame of picture of the video to be processed based on the region where the object is located and the depth data, and obtaining object depth information; object state information is determined based on the object area information and the object depth information.
Optionally, determining, based on the video data, an area of the object in each frame of the video to be processed includes: inputting each frame of picture of the video to be processed into a trained image recognition model; the output of the image recognition model is the region where the object is located in each frame of picture of the video to be processed; obtaining a result output by the image recognition model as an area where the object is located; and determining the area of the object in each frame of picture of the video to be processed according to the area of the object.
Optionally, determining the object state information based on the object area information and the object depth information includes: calculating the ratio of the object area information and the corresponding object depth information in each frame of the video to be processed to obtain ratio information of each frame of the video; obtaining a first ratio change curve according to the ratio information; the abscissa of the first ratio change curve is a time point corresponding to each frame of picture in the video to be processed, and the ordinate is ratio information; and determining object state information according to the first ratio change curve.
Optionally, under the condition that the number of the objects is multiple, calculating a ratio of the object area information in each frame of the video to be processed to the corresponding object depth information to obtain ratio information of each frame of the video to be processed, including: calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information respectively to obtain a plurality of intermediate ratios; and calculating the average value of the plurality of intermediate ratios as ratio information of each frame of picture.
Optionally, under the condition that the number of the objects is multiple, calculating a ratio of the object area information in each frame of the video to be processed to the corresponding object depth information to obtain ratio information of each frame of the video to be processed, and further including: calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information respectively to obtain a plurality of intermediate ratios; and weighting the plurality of intermediate ratios to obtain ratio information of each frame of picture.
Optionally, acquiring audio data and determining audio feature information includes: acquiring audio data and determining frequency spectrum characteristic information corresponding to the audio data; according to the frequency spectrum characteristic information, a first audio characteristic curve is obtained; wherein, the abscissa of the first audio characteristic curve is a time point corresponding to the audio data, and the ordinate is the frequency spectrum characteristic information; and determining the audio characteristic information according to the first audio characteristic curve.
Optionally, generating the target audio-video data according to the object state information and the audio feature information includes: determining the number of wave crests contained in a first ratio change curve according to the first ratio change curve in the object state information; determining the number of wave peaks contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of wave peaks contained in the first ratio change curve and the number of wave peaks contained in the first audio characteristic curve.
Optionally, generating the target audio and video data according to the number of peaks included in the first ratio change curve and the number of peaks included in the first audio feature curve includes: if the number of wave crests contained in the first ratio change curve is different from that of wave crests contained in the first audio frequency characteristic curve, filtering out partial wave crests in the first ratio change curve and wave crests in the first audio frequency characteristic curve according to a preset threshold value to obtain a second ratio change curve and a second audio frequency characteristic curve; the number of wave peaks contained in the second ratio change curve is the same as that of wave peaks contained in the second audio frequency characteristic curve; and generating target audio and video data according to the second ratio change curve and the second audio characteristic curve.
Optionally, generating the target audio-video data according to the second ratio variation curve and the second audio characteristic curve includes: determining the corresponding positions of the wave crests in the second ratio change curve in the video data according to the second ratio change curve to obtain ratio wave crest positions; determining the corresponding positions of the peaks in the second audio characteristic curve in the audio data according to the second audio characteristic curve to obtain the positions of the peaks in the audio data; and generating target audio and video data according to the ratio peak position and the audio peak position.
Optionally, generating the target audio-video data according to the ratio peak position and the audio peak position includes: and if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed, and generating target audio and video data.
Optionally, generating the target audio-video data according to the ratio peak position and the audio peak position, further includes: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, cutting the video to be processed, and generating target audio and video data.
Optionally, generating the target audio-video data according to the ratio peak position and the audio peak position, further includes: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, the playing speed of the video to be processed is adjusted, the video to be processed is cut, and target audio and video data are generated.
According to a second aspect of the present disclosure, there is provided a video processing apparatus comprising: the system comprises a state information acquisition module, an audio information acquisition module and a target data generation module.
Specifically, the state information acquisition module may be configured to acquire object state information of an object in a video to be processed; the audio information acquisition module can be used for acquiring audio data and determining audio characteristic information of the audio data; the target data generating module can be used for generating target audio and video data according to the object state information and the audio characteristic information.
Alternatively, the status information acquisition module may be configured to perform: acquiring video data and depth data of a video to be processed; object state information is determined from the video data and the depth data.
Alternatively, the status information acquisition module may be configured to perform: determining the area where the object is located in each frame of picture of the video to be processed based on the video data to obtain object area information; determining depth data corresponding to the region where the object is located in each frame of picture of the video to be processed based on the region where the object is located and the depth data, and obtaining object depth information; object state information is determined based on the object area information and the object depth information.
Alternatively, the status information acquisition module may be configured to perform: inputting each frame of picture of the video to be processed into a trained image recognition model; the output of the image recognition model is the region where the object is located in each frame of picture of the video to be processed; obtaining a result output by the image recognition model as an area where the object is located; and determining the area of the object in each frame of picture of the video to be processed according to the area of the object.
Alternatively, the status information acquisition module may be configured to perform: calculating the ratio of the object area information and the corresponding object depth information in each frame of the video to be processed to obtain ratio information of each frame of the video; obtaining a first ratio change curve according to the ratio information; the abscissa of the first ratio change curve is a time point corresponding to each frame of picture in the video to be processed, and the ordinate is ratio information; and determining object state information according to the first ratio change curve.
Alternatively, the status information acquisition module may be configured to perform: calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information respectively to obtain a plurality of intermediate ratios; and calculating the average value of the plurality of intermediate ratios as ratio information of each frame of picture.
Alternatively, the status information acquisition module may be configured to perform: calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information respectively to obtain a plurality of intermediate ratios; and weighting the plurality of intermediate ratios to obtain ratio information of each frame of picture.
Alternatively, the audio information acquisition module may be configured to perform: acquiring audio data and determining frequency spectrum characteristic information corresponding to the audio data; according to the frequency spectrum characteristic information, a first audio characteristic curve is obtained; wherein, the abscissa of the first audio characteristic curve is a time point corresponding to the audio data, and the ordinate is the frequency spectrum characteristic information; and determining the audio characteristic information according to the first audio characteristic curve.
Alternatively, the target data generation module may be configured to perform: determining the number of wave crests contained in a first ratio change curve according to the first ratio change curve in the object state information; determining the number of wave peaks contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of wave peaks contained in the first ratio change curve and the number of wave peaks contained in the first audio characteristic curve.
Alternatively, the target data generation module may be configured to perform: if the number of wave crests contained in the first ratio change curve is different from that of wave crests contained in the first audio frequency characteristic curve, filtering out partial wave crests in the first ratio change curve and wave crests in the first audio frequency characteristic curve according to a preset threshold value to obtain a second ratio change curve and a second audio frequency characteristic curve; the number of wave peaks contained in the second ratio change curve is the same as that of wave peaks contained in the second audio frequency characteristic curve; and generating target audio and video data according to the second ratio change curve and the second audio characteristic curve.
Alternatively, the target data generation module may be configured to perform: determining the corresponding positions of the wave crests in the second ratio change curve in the video data according to the second ratio change curve to obtain ratio wave crest positions; determining the corresponding positions of the peaks in the second audio characteristic curve in the audio data according to the second audio characteristic curve to obtain the positions of the peaks in the audio data; and generating target audio and video data according to the ratio peak position and the audio peak position.
Alternatively, the target data generation module may be configured to perform: and if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed, and generating target audio and video data.
Alternatively, the target data generation module may be configured to perform: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, cutting the video to be processed, and generating target audio and video data.
Alternatively, the target data generation module may be configured to perform: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, the playing speed of the video to be processed is adjusted, the video to be processed is cut, and target audio and video data are generated.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the video processing methods described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the video processing methods described above via execution of the executable instructions.
In some embodiments of the present disclosure, object state information of an object in a video to be processed is first obtained; acquiring audio data and determining audio characteristic information of the audio data; and generating target audio and video data according to the object state information and the audio characteristic information. The generated target audio-video data is the processed video which is matched with the determined music rhythm climax point of the audio. According to the video processing method, automatic generation of the stuck point video is achieved, the playing speed of the video can be automatically adjusted according to the action of objects in the video and the music content, the video is cut and spliced without manual operation, convenience in generating the stuck point video is improved, and meanwhile accuracy in matching the video content with the music climax is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
FIG. 1 schematically illustrates a flow chart for generating a current stuck point video;
Fig. 2 schematically illustrates a flow chart of a video processing method according to an exemplary embodiment of the present disclosure;
Fig. 3 schematically illustrates an effect diagram of judging the kind and number of objects according to a video to be processed according to an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a recognition flow chart of the YOLO algorithm according to an exemplary embodiment of the present disclosure;
Fig. 5 schematically illustrates an object detection result of a certain frame of a picture obtained using a YOLO algorithm according to an exemplary embodiment of the present disclosure;
Fig. 6 schematically illustrates an object detection result diagram in another frame picture obtained using the YOLO algorithm according to an exemplary embodiment of the present disclosure;
FIG. 7 schematically illustrates a first ratio change graph plotted against K value according to an exemplary embodiment of the present disclosure;
fig. 8 schematically illustrates a first audio feature curve drawn according to feature information of volume, spectrum, etc. according to an exemplary embodiment of the present disclosure;
FIG. 9 schematically illustrates a second ratio variation graph after filtering out partial peaks according to a threshold value according to an exemplary embodiment of the present disclosure;
FIG. 10 schematically illustrates a second audio signature graph after partial peak filtering according to a threshold value in accordance with an exemplary embodiment of the present disclosure;
FIG. 11 schematically illustrates a comparison of a first ratio profile with a second ratio profile in accordance with an exemplary embodiment of the present disclosure;
FIG. 12 schematically illustrates a comparison of a first audio feature curve with a second audio feature curve in accordance with an exemplary embodiment of the present disclosure;
FIG. 13 schematically illustrates a comparison of a second ratio variation curve with a second audio feature curve in accordance with an exemplary embodiment of the present disclosure;
fig. 14 schematically illustrates a block diagram of a video processing apparatus of an exemplary embodiment of the present disclosure;
fig. 15 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only and not necessarily all steps are included. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
With the rapid development of the mobile internet, short video enters a stage of vigorous development. More and more people begin to watch and make small videos.
A video in which picture switching is synchronized with the music beats is commonly called a "stuck point video". A stuck point video is generally produced with music that has a strong sense of rhythm, and the rhythm of the music must be grasped well so that it stays consistent with the rhythm of picture switching. Such video production methods are often used for producing short videos on platforms such as Douyin (TikTok). Fig. 1 schematically shows a flow chart of how a stuck point video is currently generated: first, the videos to be spliced are selected on the mobile phone, then a piece of background music is selected, and finally the videos are spliced manually according to the rhythm climax of the music so that the splicing points match the rhythm climax.
At present, most stuck point videos are produced manually with non-linear editing software on a PC or a mobile phone: the producer identifies the beats of the music by ear and switches to visually matching pictures until the music and the picture-switching rhythm are consistent. Production software such as the Douyin App or PC-side tools provides various templates in which the music beats and the picture-switching rhythm have been aligned manually in advance and saved as project files; the user then manually replaces the pictures to achieve the stuck point effect and generate a new video.
However, when a stuck point video is generated by fusing video and music through manual operation, whether the music rhythm matches the picture content must be judged by hand, and making the time of the music rhythm points coincide with the time of picture switching relies on subjective feeling. This approach is time-consuming and labor-intensive, the resulting stuck point video is often of poor quality, and the accuracy with which the video content matches the music rhythm points is easily compromised. In view of this, a new video processing method is required.
The various steps of the video processing method of the exemplary embodiments of this disclosure may generally be performed by a mobile phone. However, aspects of the present disclosure may also be implemented with a server or other terminal device, which may include, but is not limited to, a tablet, a personal computer, etc.
Fig. 2 schematically shows a flowchart of a video processing method of an exemplary embodiment of the present disclosure. Referring to fig. 2, the video processing method may include the steps of:
S22, acquiring object state information of an object in the video to be processed.
In an exemplary embodiment of the present disclosure, the video to be processed may be a video shot by the user in real time or a previously shot video; there may be one or more objects in the video to be processed, and an object may be a person, a person's hand, head or body, or various things or animals, such as a fan or a puppy; the object state information includes the kind of the object, its area, its distance from the camera, the ratio of area to distance, and the like. The above information can be calculated on the basis of the acquired original video data and depth data.
In an embodiment of the present disclosure, a method for acquiring object state information of an object in a video to be processed may be to acquire video data and depth data of the video to be processed; object state information is determined from the video data and the depth data. The video data is original video data, the depth data is data contained in a depth map, and the depth map can be acquired by a monocular camera, a binocular camera, a Time of Flight (TOF) camera, a structured light camera and other devices.
In embodiments of the present disclosure, the depth data of the original video may be acquired by the TOF camera built into the mobile phone. When shooting a video, the TOF function is turned on, so that a depth map of each frame of the video is obtained while the video is shot with an ordinary camera. The working principle of TOF is as follows: light is emitted toward the target object, and the distance between the object and the lens is determined by measuring the time the light takes to travel between the lens and the object. From these measurements the distance of every object in the picture is known and a depth map is obtained, in which the gray value of each pixel can be used to represent the distance between the corresponding point in the picture and the camera.
In the embodiment of the disclosure, the method for determining the object state information according to the video data and the depth data may be that, based on the video data, determining an area where an object is located in each frame of picture of the video to be processed, to obtain object area information; determining depth data corresponding to the region where the object is located in each frame of picture of the video to be processed based on the region where the object is located and the depth data, and obtaining object depth information; object state information is determined based on the object area information and the object depth information. The object area information is the area occupied by the object in the picture, the area can be calculated according to the area where the object is located in the picture, and the area can be obtained by an image recognition method; the depth data is the data contained in the depth map, the depth map can be acquired by a monocular camera, a binocular camera, a TOF camera, a structured light camera and other devices, the data can reflect the distance between a certain point in a picture and the camera, and the object depth information is the depth data corresponding to the area where the object is located.
In an exemplary embodiment of the present disclosure, a method for determining an area of an object in each frame of a video to be processed based on video data may be to input each frame of the video to be processed into a trained image recognition model; the image recognition model can output the area where the object is located in each frame of picture of the video to be processed; obtaining a result output by the image recognition model as an area where the object is located; and determining the area of the object in each frame of picture of the video to be processed according to the area of the object.
In an exemplary embodiment of the present disclosure, fig. 3 schematically shows an effect diagram of determining the kind and number of objects from video data. As shown in fig. 3, the acquired original video data 31 is subjected to image recognition, and the kind and number of objects contained in the video can be detected. The number of objects to be detected can be set according to the requirements of users, and can be automatically selected according to video content. When only one object needs to be detected, a detection frame 33 of a target object can be obtained on the basis of the original image 32 by an image recognition method, and the detected human body area is represented; when a plurality of objects need to be detected, a plurality of target object detection frames 35, 36, 37 respectively representing a head region, a body region, and a hand region can be obtained on the basis of the original image 34 by an image recognition method.
In an exemplary embodiment of the present disclosure, the image recognition method may use Fast-RCNN (Fast Region Convolutional Neural Networks), SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once), etc. Taking the YOLO target detection method as an example, a data set is first established for training a YOLO model; after several rounds of iterative training, a model whose recognition accuracy meets the practical application requirements is obtained, and this model is then called to detect the pictures or videos that require target detection. Specifically, fig. 4 schematically illustrates the recognition flow of the YOLO algorithm. In one embodiment of the present invention, the original video data is first input into the YOLO algorithm model; after each frame of the original video passes through a plurality of convolution layers and residual layers, the algorithm detects 3 candidate boxes for each target object, and the detection box closest to the real target object area is then obtained using non-maximum suppression.
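As an illustration only, the following Python sketch outlines the per-frame detection step described above; the yolo_model object and its detect() method are hypothetical stand-ins for any trained YOLO-style detector and are not part of the present disclosure.

```python
# Illustrative sketch only: "yolo_model" and its detect() method are hypothetical
# stand-ins for any trained YOLO-style detector.
import cv2

def detect_regions(video_path, yolo_model):
    """Return, for each frame, a list of (label, x, y, w, h) detection boxes."""
    cap = cv2.VideoCapture(video_path)
    all_boxes = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # The detector is assumed to return one box per target object,
        # already filtered by non-maximum suppression as described above.
        boxes = yolo_model.detect(frame)   # hypothetical API
        all_boxes.append(boxes)
    cap.release()
    return all_boxes
```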
In an exemplary embodiment of the present disclosure, fig. 5 schematically illustrates the object detection result of a certain frame obtained using the YOLO algorithm, where 51 is the original picture and 52, 53, 54 are the target boxes of the detected head region, body region and hand region, respectively. Fig. 6 likewise schematically shows the object detection result in another frame obtained using the YOLO algorithm, where 61 is the original picture and 62, 63, 64 are the target boxes of the detected head region, body region and hand region, respectively. The area of the region where each object is located can be calculated from the detected target box. It can be seen that the hand region 54 in fig. 5 is farther from the camera than the hand region 64 in fig. 6 and occupies a smaller area in the picture, whereas the head region 52 in fig. 5 is closer to the camera than the head region 62 in fig. 6 and occupies a larger area in the picture, so the video can be processed according to both the area the object occupies in the picture and its distance from the camera.
In an exemplary embodiment of the present disclosure, the method for determining object state information based on object area information and object depth information may be to calculate a ratio of the object area information to corresponding object depth information in each frame of a video to be processed, to obtain ratio information of each frame; obtaining a first ratio change curve according to the ratio information; wherein, the abscissa of the first ratio change curve is the time point corresponding to each frame of picture in the video to be processed, and the ordinate is the ratio information; and determining object state information according to the first ratio change curve.
In an exemplary embodiment of the present disclosure, the distance of the object from the camera may be determined from the depth data, and the depth data of the original video may be acquired by the mobile phone's own TOF camera. After the kinds and numbers of the objects and the areas of their regions are obtained, the distance between the region where each object is located and the camera is calculated. This distance may be the average distance between the region where the object is located and the camera, the distance between the center point of the region and the camera, or the average of the distances between two randomly selected points in the region and the camera.
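A minimal sketch of the distance estimation, assuming the depth map is a 2D array aligned with the video frame in which each value encodes the distance from the camera; averaging over the detection box is only one of the options listed above.

```python
import numpy as np

def object_depth(depth_map, box):
    """Average depth inside a detection box (x, y, w, h).

    Assumes depth_map is aligned with the video frame and that each value is
    the distance of that pixel from the camera (e.g. from a TOF sensor).
    Using the centre pixel or a few sampled points would work similarly.
    """
    x, y, w, h = box
    region = depth_map[y:y + h, x:x + w]
    return float(np.mean(region))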
In an exemplary embodiment of the present disclosure, in a case where the number of objects is plural, a method for calculating a ratio of object area information to corresponding object depth information in each frame of a video to be processed to obtain ratio information of each frame may be to calculate a ratio of object area information to corresponding object depth information of each object in each frame of the video to be processed, respectively, to obtain plural intermediate ratios; and calculating the average value of the plurality of intermediate ratios as ratio information of each frame of picture. Where the intermediate ratio refers to the ratio that needs to be calculated before the final calculation result is obtained. It should be noted that, the calculation method for obtaining the final ratio information may be calculating an average value of a plurality of intermediate ratios, or may be taking a median value or directly using a sum of the intermediate ratios as the ratio information, which all belong to the protection scope of the present disclosure.
In an exemplary embodiment of the present disclosure, after the method for calculating the area of the object and the distance between the object and the camera is determined, the area of the object in each frame of the original video (denoted as S) and the distance between the object and the camera (denoted as L) can be calculated, and the ratio of the two is then computed to obtain the ratio data (denoted as K) of each frame of the original video, where the calculation formula is:

K = S / L
When there are multiple objects in the frame, the area of the region where each object is located is denoted as S1, S2, S3, S4, ..., Sn, the distance between the corresponding object and the camera is denoted as L1, L2, L3, L4, ..., Ln, and the K value is then the average of the intermediate ratios:

K = (S1/L1 + S2/L2 + S3/L3 + ... + Sn/Ln) / n
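The per-frame K value can then be computed, for example, as sketched below; the function name and input format are illustrative only.

```python
def frame_ratio(objects):
    """Per-frame ratio K described above.

    `objects` is a list of (area, distance) pairs, one per detected object:
    area S is the pixel area of the object's box and distance L is its depth
    from the camera (e.g. as returned by the object_depth sketch above).
    With one object K = S / L; with several objects K is the mean of the
    intermediate ratios S_i / L_i.
    """
    if not objects:
        return 0.0
    ratios = [s / l for (s, l) in objects]
    return sum(ratios) / len(ratios)
```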
Fig. 7 schematically shows a first ratio change curve plotted from the K values, which is the original ratio change curve drawn from the original video data. The abscissa is the time point T corresponding to the original video and the ordinate is the value of K; the smaller the K value, the farther the object in the picture is from the camera, and the larger the K value, the closer the object in the picture is to the camera. A wave crest therefore indicates that the object in the picture first approaches the camera and then moves away from it. The first ratio change curve can represent the object state information of the object.
In an exemplary embodiment of the present disclosure, in a case where the number of objects is plural, calculating a ratio of object area information to corresponding object depth information in each frame of a video to be processed, to obtain ratio information of each frame of the video may further be to calculate a ratio of object area information to corresponding object depth information of each object in each frame of the video to be processed, respectively, to obtain plural intermediate ratios; and weighting the plurality of intermediate ratios to obtain ratio information of each frame of picture. Wherein the weighting process assigns different weights to different objects, and the K value is calculated in combination with the weights.
For example, when the objects detected in the video include a head, a hand and a leg with areas of 20, 15 and 10 respectively, the weight of the head may be preset to 0.5, the weight of the hand to 0.3 and the weight of the leg to 0.2; the weighted area of the head is then 20 x 0.5 = 10, and the weighted areas of the hand and the leg, calculated in the same way, are 4.5 and 2 respectively. The final ratio information can then be calculated from the weighted areas.
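A sketch of the weighted variant under the same assumptions; summing the weighted intermediate ratios (rather than averaging them) is an illustrative choice, given that the preset weights already sum to 1 in the example above.

```python
def frame_ratio_weighted(objects, weights):
    """Weighted variant of the per-frame ratio.

    `objects` is a list of (label, area, distance) tuples and `weights` maps
    labels to preset weights, e.g. {"head": 0.5, "hand": 0.3, "leg": 0.2}.
    Each intermediate ratio S_i / L_i is weighted before the results are
    combined; objects with no preset weight contribute nothing.
    """
    return sum(weights.get(label, 0.0) * (s / l) for (label, s, l) in objects)
```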
It should be noted that, the calculation method of the K value in the present exemplary embodiment is only one of the calculation methods included in the present disclosure, including but not limited to only one calculation method, and any method involving video processing using the object area and the distance from the camera should fall within the protection scope of the present disclosure.
S24, acquiring audio data and determining audio characteristic information of the audio data.
In an exemplary embodiment of the present disclosure, the audio data may be the background music to be fused with the original video; the audio data may be selected by the user or selected automatically. The audio feature information may be information such as the volume and spectral features of the music; it may be prepared in advance, so that each piece of music carries its corresponding audio feature information, or it may be extracted in real time.
In an exemplary embodiment of the present disclosure, audio data is acquired, and spectral feature information corresponding to the audio data is determined; according to the frequency spectrum characteristic information, a first audio characteristic curve is obtained; wherein, the abscissa of the first audio characteristic curve is a time point corresponding to the audio data, and the ordinate is the frequency spectrum characteristic information; and determining the audio characteristic information according to the first audio characteristic curve.
Specifically, a first audio feature curve is obtained according to feature information such as the volume and frequency spectrum of the music; this curve is drawn from the original audio data. For a given song, the parts where the musical rhythm reaches a climax correspond to the peak positions of the curve. Fig. 8 schematically shows a first audio feature curve drawn according to feature information such as volume and spectrum, where the abscissa represents the time T corresponding to the audio data, the ordinate represents the audio feature value M, and the peaks represent the climax parts of the audio data.
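The disclosure does not prescribe a particular tool for extracting volume or spectral features; purely as an illustration, the following sketch uses the librosa library to compute a short-time energy envelope that can play the role of the audio feature value M in fig. 8.

```python
# A minimal sketch, assuming the librosa library as one way to obtain a
# volume/spectral-energy envelope; the disclosure does not name a library.
import librosa
import numpy as np

def audio_feature_curve(audio_path, hop_length=512):
    y, sr = librosa.load(audio_path, sr=None)
    # Short-time RMS energy serves as a simple "audio feature value M" over time.
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop_length)
    return times, rms   # abscissa T and ordinate M of the first audio feature curve
```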
S26, generating target audio and video data according to the object state information and the audio characteristic information.
In an exemplary embodiment of the present disclosure, the object state information may be a first ratio change curve, drawn from the ratio of the area of the object in the picture to its distance from the camera; the audio feature information may be a first audio feature curve, drawn from audio feature information such as the music volume and spectral features. The target audio and video data is the processed video data, i.e. a stuck point video in which the video content matches the music rhythm (climax). Such a stuck point video is generated according to the first ratio change curve and the first audio feature curve.
In an exemplary embodiment of the present disclosure, the number of peaks included in a first ratio change curve is determined according to the first ratio change curve in the object state information; determining the number of wave peaks contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of wave peaks contained in the first ratio change curve and the number of wave peaks contained in the first audio characteristic curve.
In an exemplary embodiment of the present disclosure, in the first ratio change curve, when an object in a video is closer to a device used for capturing the video, the K value is larger, and the number of peaks included in the first ratio change curve may reflect the number of times that the object in the video is closer to the device used for capturing the video; the number of wave peaks contained in the first audio characteristic curve can reflect the number of music rhythm points or music climax points in audio, and according to the number of times that an object in video approaches equipment used for shooting video and the number of music rhythm points or music climax points in audio, a stuck point video in which the time point when the object in video approaches the equipment used for shooting video is matched with the time point where the music rhythm points or climax points are located is generated.
In an exemplary embodiment of the present disclosure, if the number of peaks included in the first ratio change curve is different from the number of peaks included in the first audio feature curve, filtering out a portion of the peaks in the first ratio change curve and the peaks in the first audio feature curve according to a preset threshold value to obtain a second ratio change curve and a second audio feature curve; the number of wave peaks contained in the second ratio change curve is the same as that of wave peaks contained in the second audio frequency characteristic curve; and generating target audio and video data according to the second ratio change curve and the second audio characteristic curve.
In an exemplary embodiment of the disclosure, the target audio-video data is obtained by matching a peak in the first ratio change curve with a peak in the first audio feature curve, and if the number of peaks in the first ratio change curve is different from the number of peaks in the first audio feature curve, matching cannot be completed, or the effect of the target audio-video data generated after matching is poor. Therefore, when the number of the wave peaks in the first ratio change curve is different from the number of the wave peaks in the first audio frequency characteristic curve, the wave peaks in part of the first ratio change curve and the wave peaks in the first audio frequency characteristic curve can be filtered according to a preset threshold value, and a second ratio change curve and a second audio frequency characteristic curve with the same number of the wave peaks are obtained.
In an exemplary embodiment of the present disclosure, according to the first ratio change curve in fig. 7, the number of peaks included in the curve may be determined to be 6, that is, the number of times that an object in a video can be obtained to approach a camera may be 6 times. From the first audio characteristic curve in fig. 8, it can be determined that the number of peaks contained in the curve is 7, that is, 7 places where the tempo climax in music can be obtained. It is not difficult to find that the number of times that an object in the video approaches the camera is not equal to the number of the climax of the music rhythm, that is to say, the matching of the video content and the music content cannot be completed at the moment, therefore, the peak value threshold value of a first ratio change curve and the peak value threshold value of a first audio characteristic curve can be preset to filter out partial unobvious peaks, so that the number of peaks in the first ratio change curve is the same as the number of peaks in the first audio characteristic curve, the matching of the video content and the music content can be conveniently completed, and the matching degree of the content can be improved. It should be noted that, the threshold value may be preset manually or may be set automatically by the mobile phone.
In an exemplary embodiment of the present disclosure, after part of the insignificant peaks are filtered out according to the thresholds, two new curves with the same number of peaks are obtained. Fig. 9 schematically shows the second ratio change curve after part of the peaks are filtered out according to a threshold, and fig. 10 schematically shows the second audio feature curve after part of the peaks are filtered out according to a threshold. Fig. 11 compares the first ratio change curve with the second ratio change curve, and fig. 12 compares the first audio feature curve with the second audio feature curve. It can be seen that, compared with the curves before processing, some of the lower peaks have been filtered out, the remaining peaks are more obvious, and the number of peaks in the second ratio change curve is the same as the number of peaks in the second audio feature curve, namely 3.
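Purely as an illustration of the peak filtering step, the sketch below uses scipy.signal.find_peaks and raises the height threshold on whichever curve currently has more peaks until the two counts match; this concrete thresholding strategy is an assumption, since the disclosure only requires preset thresholds that leave the same number of peaks in both curves.

```python
import numpy as np
from scipy.signal import find_peaks

def match_peak_counts(ratio_curve, audio_curve, step=0.05):
    """Filter peaks until both curves contain the same number of peaks."""
    r_thr, a_thr = 0.0, 0.0
    r_peaks, _ = find_peaks(ratio_curve, height=r_thr)
    a_peaks, _ = find_peaks(audio_curve, height=a_thr)
    while len(r_peaks) != len(a_peaks) and len(r_peaks) > 0 and len(a_peaks) > 0:
        if len(r_peaks) > len(a_peaks):
            r_thr += step * np.max(ratio_curve)          # raise ratio-curve threshold
            r_peaks, _ = find_peaks(ratio_curve, height=r_thr)
        else:
            a_thr += step * np.max(audio_curve)          # raise audio-curve threshold
            a_peaks, _ = find_peaks(audio_curve, height=a_thr)
    return r_peaks, a_peaks   # peak indices of the second ratio / audio curves
```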
In an exemplary embodiment of the disclosure, the position of each peak of the second ratio change curve in the video data is determined according to the second ratio change curve to obtain the ratio peak positions; the position of each peak of the second audio feature curve in the audio data is determined according to the second audio feature curve to obtain the audio peak positions; and the target audio and video data is generated according to the ratio peak positions and the audio peak positions, that is, according to the time point to which each ratio peak corresponds in the video data and the time point to which each audio peak corresponds in the audio data.
In an exemplary embodiment of the present disclosure, if a time point corresponding to a ratio peak position is different from a time point corresponding to an audio peak position, a playing speed of a video to be processed is adjusted to generate target audio/video data. Fig. 13 schematically shows a comparison of the second ratio variation curve and the second audio characteristic curve, and after the second ratio variation curve and the second audio characteristic curve are obtained, it can be found by comparison that the positions of the horizontal axes corresponding to the peaks in the two curves are different, that is, the corresponding time points are different. If the time points corresponding to the wave peaks in the two curves are different, that is to say, the time point of the object in the video, which is close to the camera, is not matched with the time point of the music climax, the playing speed of the video needs to be adjusted, so that the time point of the object in the video, which is close to the camera, is the same as the time point of the music climax.
In an exemplary embodiment of the present disclosure, the video content may be divided into a plurality of intervals according to the positions of the peaks of the second ratio change curve. For example, the second ratio change curve in the exemplary embodiments of the present disclosure has 3 peaks, so the video may be divided into 4 intervals. The playing speed of the video in each interval is adjusted separately so that the time point corresponding to each peak of the second ratio change curve becomes the same as the time point corresponding to the matching peak of the second audio feature curve, yielding the target audio and video data, i.e. the processed video. The processed video is then fused with the music to obtain a video with background music whose content matches the climax of the music.
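One possible realization of the per-interval speed adjustment is sketched below; treating t = 0 and the full video/audio durations as the outer interval boundaries is an illustrative assumption.

```python
def segment_speed_factors(ratio_peak_times, audio_peak_times, video_duration, audio_duration):
    """Playback-speed factor for each interval between consecutive ratio peaks.

    After adjustment the n-th ratio peak falls on the n-th audio peak; a factor
    greater than 1 speeds the segment up, a factor smaller than 1 slows it down.
    With 3 peaks there are 4 intervals, as in the example above.
    """
    v_bounds = [0.0] + list(ratio_peak_times) + [video_duration]
    a_bounds = [0.0] + list(audio_peak_times) + [audio_duration]
    factors = []
    for i in range(len(v_bounds) - 1):
        source_len = v_bounds[i + 1] - v_bounds[i]   # current length of the video interval
        target_len = a_bounds[i + 1] - a_bounds[i]   # length needed to land on the audio peak
        factors.append(source_len / target_len)      # e.g. 2.0 means play twice as fast
    return factors
```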
In another exemplary embodiment of the present disclosure, the method for generating the target audio/video data may further be: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, cutting the video to be processed, and generating target audio and video data.
Referring to fig. 13, fig. 13 schematically shows a comparison graph of the second ratio variation curve and the second audio characteristic curve, and after the second ratio variation curve and the second audio characteristic curve are obtained, it can be found by comparing that the positions of the horizontal axes corresponding to the peaks in the two curves are different, that is, the corresponding time points are different. If the time points corresponding to the wave peaks in the two curves are different, that is to say, the time point of the object in the video, which is close to the camera, is not matched with the time point of the music climax, at the moment, the time point of the object in the video, which is close to the camera, is identical with the time point of the music climax by clipping the video content.
In an exemplary embodiment of the present disclosure, the video content may be divided into a plurality of intervals according to the positions of the peaks of the second ratio change curve. For example, the second ratio change curve in the exemplary embodiments of the present disclosure has 3 peaks, so the video may be divided into 4 intervals. If every time point corresponding to a peak position of the second ratio change curve lies after the time point corresponding to the matching peak position of the second audio feature curve, the video content in each interval is cut so that the time point corresponding to each peak of the second ratio change curve becomes the same as the time point corresponding to the matching peak of the second audio feature curve, yielding the target audio and video data, i.e. the processed video. The processed video is then fused with the music to obtain a video with background music whose content matches the climax of the music.
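The clipping variant can be sketched in the same way; each returned value is the duration to cut from the corresponding interval, under the same illustrative boundary assumption as above.

```python
def segment_trim_durations(ratio_peak_times, audio_peak_times, video_duration, audio_duration):
    """Duration to cut from each interval so that ratio peaks land on audio peaks.

    Applicable to the case described above, where every ratio peak occurs later
    than its corresponding audio peak; boundary handling is illustrative only.
    """
    v_bounds = [0.0] + list(ratio_peak_times) + [video_duration]
    a_bounds = [0.0] + list(audio_peak_times) + [audio_duration]
    return [(v_bounds[i + 1] - v_bounds[i]) - (a_bounds[i + 1] - a_bounds[i])
            for i in range(len(v_bounds) - 1)]
```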
In another exemplary embodiment of the present disclosure, the target audio/video data may also be generated as follows: if the time point corresponding to the ratio peak position differs from the time point corresponding to the audio peak position, the playing speed of the video to be processed is adjusted and the video to be processed is cut, thereby generating the target audio/video data.
Referring to fig. 13, which schematically shows a comparison of the second ratio variation curve and the second audio characteristic curve, a comparison of the two curves again reveals that their peaks fall at different positions on the horizontal axis, that is, at different time points. In that case, the time point at which the object in the video approaches the camera does not match the time point of the music climax, and the two time points are made to coincide by both adjusting the playing speed of the video and cutting the video content.
In an exemplary embodiment of the present disclosure, the video content may be divided into a plurality of intervals according to the positions of the peaks of the second ratio variation curve. For example, if the second ratio variation curve in the exemplary embodiments of the present disclosure has 3 peaks, the video may be divided into 4 intervals. The video in intervals that run too long may be cut or played faster, and the video in intervals that run too short may be slowed down, so that the time point corresponding to each peak in the second ratio variation curve coincides with the time point corresponding to the matching peak of the second audio characteristic curve, yielding the target audio/video data, i.e. the processed video. The processed video is then fused with the music to obtain a video with background music whose content is matched to the climax of the music.
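The combined strategy can be pictured with the following sketch, which chooses per interval between cutting, speeding up, and slowing down; the speed-up limit and the helper names are assumptions introduced only for this example and are not prescribed by the disclosure.

```python
# Minimal sketch (hypothetical names): choose, per interval, between cutting,
# speeding up, or slowing down, so that interval lengths match the gaps
# between audio peaks.

MAX_SPEED_UP = 1.5  # assumed limit beyond which cutting is preferred

def interval_actions(ratio_peak_times, audio_peak_times,
                     video_duration, audio_duration):
    src = [0.0] + list(ratio_peak_times) + [video_duration]
    dst = [0.0] + list(audio_peak_times) + [audio_duration]
    actions = []
    for i in range(len(src) - 1):
        src_len, dst_len = src[i + 1] - src[i], dst[i + 1] - dst[i]
        if src_len > dst_len * MAX_SPEED_UP:
            actions.append(("cut", src_len - dst_len))        # drop the excess seconds
        elif src_len > dst_len:
            actions.append(("speed_up", src_len / dst_len))   # moderate speed-up
        else:
            actions.append(("slow_down", src_len / dst_len))  # factor < 1
    return actions

print(interval_actions([2.0, 5.0, 9.0], [2.5, 6.0, 8.0], 12.0, 11.0))
```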
In another exemplary embodiment of the present disclosure, the distance between the object and the camera may also be estimated from the area occupied by the object in the video frame alone. When the object is far from the camera, it occupies a smaller area in the shot picture; when the object approaches the camera, it occupies a larger area. The distance between the object and the camera can therefore be estimated from the occupied area alone.
In another exemplary embodiment of the present disclosure, each frame of the video to be processed is input into a trained image recognition model, which outputs the area where the object is located in each frame of the video to be processed. The result output by the image recognition model is obtained as the area where the object is located, and the area of the object in each frame of picture is determined according to the area where the object is located. After the area of the object in each frame of picture is obtained, the corresponding values can be plotted in a coordinate system and connected into a curve to obtain a first object area curve, and the target audio/video data is then generated from the first object area curve and the first audio characteristic curve.
The target audio/video data is generated from the first object area curve and the first audio characteristic curve in the same way as it is generated from the first ratio change curve and the first audio characteristic curve: first, according to a preset threshold value, some of the peaks in the first object area curve and the first audio characteristic curve are filtered out to obtain a second object area curve and a second audio characteristic curve containing the same number of peaks; then, according to the time points corresponding to the peak positions in each curve, the video to be processed is cut and/or its playing speed is adjusted to generate the target audio/video data.
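For the area-only variant, a minimal sketch is given below; the segmentation helper is only a placeholder for the trained image recognition model, and its name and interface are hypothetical.

```python
# Minimal sketch (hypothetical `segment_object` model): build the first object
# area curve from per-frame masks, using area alone as a proxy for the
# distance between the object and the camera.

import numpy as np

def segment_object(frame):
    """Placeholder for a trained image recognition model; it is assumed to
    return a boolean mask marking the area where the object is located."""
    raise NotImplementedError

def object_area_curve(frames, fps):
    times, areas = [], []
    for idx, frame in enumerate(frames):
        mask = segment_object(frame)
        times.append(idx / fps)            # abscissa: time of the frame
        areas.append(float(np.sum(mask)))  # ordinate: pixels covered by the object
    return np.array(times), np.array(areas)
```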
It should be noted that although the steps of the methods in the present disclosure are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps, etc.
Further, a video processing apparatus is also provided in this example embodiment.
Fig. 14 schematically shows a block diagram of a video processing apparatus of an exemplary embodiment of the present disclosure. Referring to fig. 14, the video processing apparatus 14 according to an exemplary embodiment of the present disclosure may include a status information acquisition module 141, an audio information acquisition module 143, and a target data generation module 145.
Specifically, the state information obtaining module 141 may be configured to obtain object state information of an object in the video to be processed; the audio information obtaining module 143 may be configured to obtain audio data and determine audio feature information of the audio data; the target data generating module 145 may be configured to generate target audio-video data according to the object state information and the audio feature information.
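A compact sketch of how the three modules could be composed in code is given below; the class and method names are hypothetical and do not appear in the disclosure.

```python
# Minimal sketch (hypothetical names) of the three-module apparatus.

class StateInfoModule:
    def get_object_state(self, video):        # -> object state information (ratio curve)
        ...

class AudioInfoModule:
    def get_audio_features(self, audio):      # -> audio characteristic information
        ...

class TargetDataModule:
    def generate(self, object_state, audio_features):  # -> target audio/video data
        ...

class VideoProcessor:
    def __init__(self):
        self.state_info = StateInfoModule()
        self.audio_info = AudioInfoModule()
        self.target_data = TargetDataModule()

    def run(self, video, audio):
        state = self.state_info.get_object_state(video)
        features = self.audio_info.get_audio_features(audio)
        return self.target_data.generate(state, features)
```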
In an exemplary embodiment of the present disclosure, the status information acquisition module 141 may be configured to perform: acquiring video data and depth data of a video to be processed; object state information is determined from the video data and the depth data.
In an exemplary embodiment of the present disclosure, the status information acquisition module 141 may be configured to perform: determining the area where the object is located in each frame of picture of the video to be processed based on the video data to obtain object area information; determining depth data corresponding to the region where the object is located in each frame of picture of the video to be processed based on the region where the object is located and the depth data, and obtaining object depth information; object state information is determined based on the object area information and the object depth information.
In an exemplary embodiment of the present disclosure, the status information acquisition module 141 may be configured to perform: inputting each frame of picture of the video to be processed into a trained image recognition model, wherein the output of the image recognition model is the area where the object is located in each frame of picture of the video to be processed; obtaining the result output by the image recognition model as the area where the object is located; and determining the area of the object in each frame of picture of the video to be processed according to the area where the object is located.
In an exemplary embodiment of the present disclosure, the status information acquisition module 141 may be configured to perform: calculating the ratio of the object area information and the corresponding object depth information in each frame of the video to be processed to obtain ratio information of each frame of the video; obtaining a first ratio change curve according to the ratio information; the abscissa of the first ratio change curve is a time point corresponding to each frame of picture in the video to be processed, and the ordinate is ratio information; and determining object state information according to the first ratio change curve.
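As an illustration of this module's ratio computation, the following sketch (hypothetical names, not part of the disclosure) divides the per-frame object area by the per-frame object depth and attaches a time axis to form the first ratio change curve.

```python
# Minimal sketch (hypothetical names): per-frame ratio of object area to object
# depth, plotted against frame time to form the first ratio change curve.

import numpy as np

def first_ratio_curve(object_areas, object_depths, fps):
    """object_areas / object_depths hold one value per frame.
    Returns (times, ratios): abscissa is the frame time, ordinate the ratio."""
    areas = np.asarray(object_areas, dtype=float)
    depths = np.asarray(object_depths, dtype=float)
    ratios = areas / np.maximum(depths, 1e-6)   # guard against zero depth
    times = np.arange(len(ratios)) / fps
    return times, ratios
```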
In an exemplary embodiment of the present disclosure, the status information acquisition module 141 may be configured to perform: calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information respectively to obtain a plurality of intermediate ratios; and calculating the average value of the plurality of intermediate ratios as ratio information of each frame of picture.
In an exemplary embodiment of the present disclosure, the status information acquisition module 141 may be configured to perform: calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information respectively to obtain a plurality of intermediate ratios; and weighting the plurality of intermediate ratios to obtain ratio information of each frame of picture.
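A minimal sketch covering both multi-object combinations, the plain mean and the weighted combination, is shown below; the example weights are assumptions, since the disclosure does not fix how the weights are chosen.

```python
# Minimal sketch (hypothetical names): combine the intermediate ratios of
# several objects in one frame, either by averaging or by weighting.

import numpy as np

def frame_ratio_mean(areas, depths):
    intermediate = np.asarray(areas, float) / np.asarray(depths, float)
    return float(intermediate.mean())

def frame_ratio_weighted(areas, depths, weights):
    intermediate = np.asarray(areas, float) / np.asarray(depths, float)
    w = np.asarray(weights, float)
    return float(np.sum(intermediate * w) / np.sum(w))

# Example: two objects, the first weighted more heavily (assumed weights).
print(frame_ratio_mean([1200, 300], [2.0, 4.0]))
print(frame_ratio_weighted([1200, 300], [2.0, 4.0], [0.7, 0.3]))
```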
In an exemplary embodiment of the present disclosure, the audio information acquisition module 143 may be configured to perform: acquiring audio data and determining frequency spectrum characteristic information corresponding to the audio data; obtaining a first audio characteristic curve according to the frequency spectrum characteristic information, wherein the abscissa of the first audio characteristic curve is a time point corresponding to the audio data and the ordinate is the frequency spectrum characteristic information; and determining the audio characteristic information according to the first audio characteristic curve.
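One possible realisation of the spectrum-based feature is sketched below; the concrete feature (summed short-time FFT magnitude) and the window parameters are assumptions made only for the example and are not prescribed by the disclosure.

```python
# Minimal sketch (assumed feature): short-time spectral energy of the audio,
# one value per analysis frame, forming the first audio characteristic curve.

import numpy as np

def first_audio_curve(samples, sample_rate, frame_len=2048, hop=512):
    samples = np.asarray(samples, dtype=float)
    times, energies = [], []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len)))
        times.append(start / sample_rate)       # abscissa: time in the audio
        energies.append(float(spectrum.sum()))  # ordinate: spectral feature value
    return np.array(times), np.array(energies)
```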
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: determining the number of peaks contained in a first ratio change curve according to the first ratio change curve in the object state information; determining the number of peaks contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of peaks contained in the first ratio change curve and the number of peaks contained in the first audio characteristic curve.
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: if the number of peaks contained in the first ratio change curve is different from the number of peaks contained in the first audio characteristic curve, filtering out some of the peaks in the first ratio change curve and in the first audio characteristic curve according to a preset threshold value to obtain a second ratio change curve and a second audio characteristic curve, wherein the second ratio change curve and the second audio characteristic curve contain the same number of peaks; and generating target audio and video data according to the second ratio change curve and the second audio characteristic curve.
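The threshold-based peak filtering could be realised as in the following sketch; the use of scipy's peak finder and the height criterion are assumptions, not requirements of the disclosure, and the thresholds are presumed to be preset so that the two peak counts come out equal.

```python
# Minimal sketch (assumed criterion): keep only peaks above a preset threshold
# so that both curves end up with the same number of peaks.

import numpy as np
from scipy.signal import find_peaks

def filter_peaks(values, threshold):
    """Indices of peaks whose height is at least `threshold`."""
    peaks, _ = find_peaks(np.asarray(values, float), height=threshold)
    return peaks

def equalize_peak_counts(ratio_values, audio_values, ratio_thr, audio_thr):
    ratio_peaks = filter_peaks(ratio_values, ratio_thr)
    audio_peaks = filter_peaks(audio_values, audio_thr)
    # In the embodiment the preset thresholds are chosen so the counts match.
    assert len(ratio_peaks) == len(audio_peaks), "adjust the preset thresholds"
    return ratio_peaks, audio_peaks
```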
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: determining, according to the second ratio change curve, the positions in the video data corresponding to the peaks in the second ratio change curve to obtain ratio peak positions; determining, according to the second audio characteristic curve, the positions in the audio data corresponding to the peaks in the second audio characteristic curve to obtain audio peak positions; and generating target audio and video data according to the ratio peak position and the audio peak position.
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: and if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed, and generating target audio and video data.
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, cutting the video to be processed, and generating target audio and video data.
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, the playing speed of the video to be processed is adjusted, the video to be processed is cut, and target audio and video data are generated.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.
The program product for implementing the above-described method according to an embodiment of the present invention may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical disk, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 1500 according to such an embodiment of the invention is described below with reference to fig. 15. The electronic device 1500 shown in fig. 15 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 15, the electronic device 1500 is embodied in the form of a general purpose computing device. The components of electronic device 1500 may include, but are not limited to: the at least one processing unit 1510, the at least one storage unit 1520, a bus 1530 connecting the different system components (including the storage unit 1520 and the processing unit 1510), and a display unit 1540.
Wherein the storage unit stores program code that is executable by the processing unit 1510 such that the processing unit 1510 performs steps according to various exemplary embodiments of the present invention described in the above section of the "exemplary method" of the present specification.
The storage unit 1520 may include readable media in the form of volatile memory units such as Random Access Memory (RAM) 15201 and/or cache memory 15202, and may further include Read Only Memory (ROM) 15203.
The storage unit 1520 may also include a program/utility 15204 having a set (at least one) of program modules 15205, such program modules 15205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 1530 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The electronic device 1500 may also communicate with one or more external devices 1600 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 1500, and/or any device (e.g., router, modem, etc.) that enables the electronic device 1500 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1550. Also, the electronic device 1500 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, for example, the Internet, through a network adapter 1560. As shown, the network adapter 1560 communicates with other modules of the electronic device 1500 over the bus 1530. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 1500, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A video processing method, comprising:
Acquiring video data and depth data of a video to be processed, and determining an area where an object is located in each frame of picture of the video to be processed based on the video data to obtain object area information; determining depth data corresponding to the area where the object is located in each frame of the video to be processed based on the area where the object is located and the depth data, and obtaining object depth information; determining object state information based on the object area information and the object depth information;
Acquiring audio data and determining audio characteristic information of the audio data;
Determining the number of peaks contained in a first ratio change curve according to the first ratio change curve in the object state information; determining the number of peaks contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of peaks contained in the first ratio change curve and the number of peaks contained in the first audio characteristic curve.
2. The video processing method according to claim 1, wherein determining the area where the object is located in each frame of picture of the video to be processed based on the video data comprises:
inputting each frame of picture of the video to be processed into a trained image recognition model; the image recognition model can output the area where the object is located in each frame of picture of the video to be processed;
obtaining a result output by the image recognition model as an area where the object is located;
And determining the area of the object in each frame of picture of the video to be processed according to the area where the object is located.
3. The video processing method according to claim 1, wherein determining the object state information based on the object area information and the object depth information comprises:
calculating the ratio of the object area information to the corresponding object depth information in each frame of the video to be processed to obtain ratio information of each frame of the video to be processed;
Obtaining a first ratio change curve according to the ratio information; wherein, the abscissa of the first ratio change curve is the time point corresponding to each frame of picture in the video to be processed, and the ordinate is the ratio information;
and determining the object state information according to the first ratio change curve.
4. The video processing method according to claim 3, wherein, in the case where the number of the objects is plural, calculating the ratio of the object area information to the corresponding object depth information in each frame of the video to be processed to obtain ratio information of each frame comprises:
Calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information respectively to obtain a plurality of intermediate ratios;
And calculating the average value of the plurality of intermediate ratios as the ratio information of each frame of picture.
5. The video processing method according to claim 3, wherein in the case where the number of the objects is plural, calculating a ratio of the object area information in each frame of the video to be processed to the corresponding object depth information to obtain ratio information of each frame of the video to be processed, further comprises:
calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information respectively to obtain a plurality of intermediate ratios;
And weighting the plurality of intermediate ratios to obtain ratio information of each frame of picture.
6. The video processing method according to claim 3, wherein acquiring the audio data and determining the audio characteristic information comprises:
Acquiring the audio data and determining frequency spectrum characteristic information corresponding to the audio data;
According to the frequency spectrum characteristic information, a first audio characteristic curve is obtained; wherein, the abscissa of the first audio characteristic curve is the time point corresponding to the audio data, and the ordinate is the frequency spectrum characteristic information;
And determining the audio characteristic information according to the first audio characteristic curve.
7. The video processing method according to claim 1, wherein generating the target audio-video data according to the number of peaks contained in the first ratio change curve and the number of peaks contained in the first audio characteristic curve includes:
If the number of peaks contained in the first ratio change curve is different from the number of peaks contained in the first audio characteristic curve, filtering out some of the peaks in the first ratio change curve and in the first audio characteristic curve according to a preset threshold value to obtain a second ratio change curve and a second audio characteristic curve; wherein the number of peaks contained in the second ratio change curve and the second audio characteristic curve is the same;
and generating the target audio and video data according to the second ratio change curve and the second audio characteristic curve.
8. The video processing method of claim 7, wherein generating the target audio-video data from the second ratio change curve and the second audio characteristic curve comprises:
Determining, according to the second ratio change curve, the positions in the video data corresponding to the peaks in the second ratio change curve to obtain ratio peak positions;
Determining, according to the second audio characteristic curve, the positions in the audio data corresponding to the peaks in the second audio characteristic curve to obtain audio peak positions;
And generating the target audio and video data according to the ratio peak position and the audio peak position.
9. The video processing method of claim 8, wherein generating the target audio-video data from the ratio peak position and the audio peak position comprises:
And if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed, and generating the target audio and video data.
10. The video processing method of claim 8, wherein generating the target audio-video data based on the ratio peak position and the audio peak position, further comprises:
And if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, cutting the video to be processed, and generating the target audio-video data.
11. The video processing method of claim 8, wherein generating the target audio-video data based on the ratio peak position and the audio peak position, further comprises:
And if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed and cutting the video to be processed to generate the target audio and video data.
12. A video processing apparatus, comprising:
The state information acquisition module is used for acquiring video data and depth data of a video to be processed, determining an area where an object is located in each frame of picture of the video to be processed based on the video data, and obtaining object area information; determining depth data corresponding to the area where the object is located in each frame of the video to be processed based on the area where the object is located and the depth data, and obtaining object depth information; determining object state information based on the object area information and the object depth information;
The audio information acquisition module is used for acquiring audio data and determining audio characteristic information of the audio data;
The target data generation module is used for determining the number of wave crests contained in the first ratio change curve according to the first ratio change curve in the object state information; determining the number of wave peaks contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of wave crests contained in the first ratio change curve and the number of wave crests contained in the first audio characteristic curve.
13. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the video processing method of any of claims 1-11.
14. An electronic device, comprising:
A processor; and
A memory for storing executable instructions of the processor;
Wherein the processor is configured to perform the video processing method of any of claims 1-11 via execution of the executable instructions.
CN202011253155.8A 2020-11-11 2020-11-11 Video processing method and device, computer readable storage medium and electronic equipment Active CN112380396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011253155.8A CN112380396B (en) 2020-11-11 2020-11-11 Video processing method and device, computer readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011253155.8A CN112380396B (en) 2020-11-11 2020-11-11 Video processing method and device, computer readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112380396A CN112380396A (en) 2021-02-19
CN112380396B true CN112380396B (en) 2024-04-26

Family

ID=74582117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011253155.8A Active CN112380396B (en) 2020-11-11 2020-11-11 Video processing method and device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112380396B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804665B (en) * 2018-06-08 2022-09-27 上海掌门科技有限公司 Method and device for pushing and receiving information
CN113365132B (en) * 2021-05-27 2022-04-08 网易有道信息技术(江苏)有限公司 Image processing method and device, electronic equipment and storage medium
CN114630180A (en) * 2022-03-18 2022-06-14 上海哔哩哔哩科技有限公司 Video generation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4254145A3 (en) * 2015-09-16 2023-11-01 Magic Leap, Inc. Head pose mixing of audio files
US10699150B2 (en) * 2018-10-23 2020-06-30 Polarr, Inc. Machine guided photo and video composition
US10825221B1 (en) * 2019-04-23 2020-11-03 Adobe Inc. Music driven human dancing video synthesis

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107124624A (en) * 2017-04-21 2017-09-01 腾讯科技(深圳)有限公司 The method and apparatus of video data generation
CN107682740A (en) * 2017-09-11 2018-02-09 广东欧珀移动通信有限公司 Composite tone method and electronic installation in video
CN109413563A (en) * 2018-10-25 2019-03-01 Oppo广东移动通信有限公司 The sound effect treatment method and Related product of video
CN110677711A (en) * 2019-10-17 2020-01-10 北京字节跳动网络技术有限公司 Video dubbing method and device, electronic equipment and computer readable medium
CN110933487A (en) * 2019-12-18 2020-03-27 北京百度网讯科技有限公司 Method, device and equipment for generating click video and storage medium
CN111508456A (en) * 2020-07-01 2020-08-07 北京美摄网络科技有限公司 Audio data processing method and device, electronic equipment and storage medium
CN111741233A (en) * 2020-07-16 2020-10-02 腾讯科技(深圳)有限公司 Video dubbing method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of an Android-based audio-video synchronization algorithm; Wang Hui; Tian Penghui; Industrial Instrumentation and Automation; 2012-08-05 (04); 25-28 *

Also Published As

Publication number Publication date
CN112380396A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN112380396B (en) Video processing method and device, computer readable storage medium and electronic equipment
JP7089106B2 (en) Image processing methods and equipment, electronic devices, computer-readable storage media and computer programs
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
TWI732338B (en) Recognition method of text sequence, electronic equipment and computer readable storage medium
CN107205125B (en) A kind of image processing method, device, terminal and computer readable storage medium
KR20210018850A (en) Video restoration method and device, electronic device and storage medium
WO2021035812A1 (en) Image processing method and apparatus, electronic device and storage medium
US10970909B2 (en) Method and apparatus for eye movement synthesis
CN109920016B (en) Image generation method and device, electronic equipment and storage medium
CN107801096A (en) Control method, device, terminal device and the storage medium of video playback
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
CN113421547A (en) Voice processing method and related equipment
CN115039169A (en) Voice instruction recognition method, electronic device and non-transitory computer readable storage medium
US20170099432A1 (en) Image context based camera configuration
CN116127062A (en) Training method of pre-training language model, text emotion classification method and device
CN115547308A (en) Audio recognition model training method, audio recognition device, electronic equipment and storage medium
CN113255421A (en) Image detection method, system, device and medium
CN116152233B (en) Image processing method, intelligent terminal and storage medium
CN115866332B (en) Processing method, device and processing equipment for video frame insertion model
CN117150066B (en) Intelligent drawing method and device in automobile media field
US11463652B2 (en) Write-a-movie: visualize your story from script
EP3073747A1 (en) Method and device for adapting an audio level of a video
CN117391975A (en) Efficient real-time underwater image enhancement method and model building method thereof
CN114333863A (en) Voice enhancement method and device, electronic equipment and computer readable storage medium
CN117198327A (en) Abnormal sound detection method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant