CN118055292A - Video processing method, device, equipment and storage medium - Google Patents

Video processing method, device, equipment and storage medium

Info

Publication number
CN118055292A
Authority
CN
China
Prior art keywords
video
materials
keyword
server
material group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410179053.8A
Other languages
Chinese (zh)
Inventor
单文睿
郑程
王正宜
奉伟
孙卫亮
郭永惠
卜琴
郭毅
秦志伟
张晶
邱亚可
赖欣
范晋豪
吴悦
王博智
郭志冠
程宏愿
王腾飞
贾增义
李鹏飞
廖孪斐
张凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202410179053.8A priority Critical patent/CN118055292A/en
Publication of CN118055292A publication Critical patent/CN118055292A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present disclosure provides a video processing method, apparatus, device, and storage medium, and relates to the field of artificial intelligence, in particular to material searching, material recommendation, video editing, intelligent interaction, and the like. The specific implementation scheme is as follows: uploading subtitles of a first video in response to receiving a material recommendation trigger operation for the first video; receiving at least one material group returned by a server based on the subtitles of the first video; determining a target material group based on the at least one material group; and adding the materials in the target material group to the first video to generate a second video. According to this technical scheme, materials can be added to a video automatically, improving video editing efficiency. In addition, compared with uploading the first video to the server, uploading the audio of the first video makes full use of the server's powerful computing resources to quickly recognize the subtitles of the first video, which improves the subtitle acquisition speed and makes it convenient to quickly obtain material groups from the server.

Description

Video processing method, device, equipment and storage medium
Cross Reference to Related Applications
The present application is a divisional application of Chinese patent application No. 202210553116.2, filed on May 20, 2022 and entitled "Video processing method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the fields of material searching, material recommendation, video editing, intelligent interaction, and the like.
Background
With the rapid development of the video industry, new types of video, such as talking-head videos, have emerged. Such videos are mainly characterized by a person facing the camera and continuously delivering content. Because the picture mainly shows a person speaking, it lacks variation and can appear monotonous. To address this, materials are often added to such videos, and editing them in this way makes the edited videos more interesting. However, manually adding materials is slow, making video editing inefficient.
Disclosure of Invention
The present disclosure provides a video processing method, apparatus, device, and storage medium.
According to a first aspect of the present disclosure, there is provided a video processing method, applied to a terminal, including:
Uploading subtitles of a first video in response to receiving a material recommendation trigger operation for the first video;
receiving at least one material group returned by the server based on the caption of the first video;
determining a target material group based on the at least one material group;
and adding the materials in the target material group into the first video to generate a second video.
According to a second aspect of the present disclosure, there is provided a video processing method, applied to a server, including:
receiving a caption of a first video, wherein the caption of the first video is uploaded by a terminal under the condition of receiving a material recommendation triggering operation aiming at the first video;
Identifying at least one keyword of the first video from subtitles of the first video;
determining at least one material group for the first video based on the at least one keyword;
The at least one material group is transmitted, the at least one material group being used to indicate materials available to be added to the first video.
According to a third aspect of the present disclosure, there is provided a video processing apparatus applied to a terminal, including:
The first sending module is used for responding to the received material recommendation triggering operation aiming at the first video and uploading the caption of the first video;
The first receiving module is used for receiving at least one material group returned by the server based on the caption of the first video;
a first determining module, configured to determine a target material group based on the at least one material group;
and the generation module is used for adding the materials in the target material group into the first video and generating a second video.
According to a fourth aspect of the present disclosure, there is provided a video processing apparatus applied to a server, including:
the second receiving module is used for receiving the subtitles of the first video, wherein the subtitles of the first video are uploaded by the terminal under the condition of receiving the material recommendation triggering operation aiming at the first video;
the first identification module is used for identifying at least one keyword of the first video from the subtitles of the first video;
A second determining module, configured to determine at least one material group for the first video based on the at least one keyword;
and the second sending module is used for sending the at least one material group, where the at least one material group is used to indicate materials available to be added to the first video.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods provided in the first and second aspects above.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the methods provided in the first and second aspects above.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the methods provided by the first and second aspects described above.
According to the technical scheme, materials can be automatically added for the video, and video editing efficiency is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a video processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of displaying material in a video image according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a comparison of video images before and after adding material in accordance with an embodiment of the present disclosure;
FIG. 4 is a second flow chart of a video processing method according to an embodiment of the disclosure;
FIG. 5 is a schematic diagram of an interaction flow of a terminal with a server according to an embodiment of the disclosure;
fig. 6 is a schematic diagram of a video processing apparatus according to an embodiment of the present disclosure;
fig. 7 is a second schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic view of a scenario of video processing according to an embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device for implementing a video processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terms "first", "second", "third" and the like in the description, claims, and drawings are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. Furthermore, the terms "comprise" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion: a process, method, system, article, or apparatus that comprises a series of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the related art, to make a video in which a person is mainly speaking, such as a talking-head video, more vivid, interesting, and attractive, two approaches are commonly used: first, adding variety-show-style decorative text and stickers, paired with sound effects, to make the video more interesting; second, adding explanatory videos or pictures to vary the picture content. However, the user generally has to perform a series of operations such as searching, screening, downloading, adding, and adjusting, and the whole process is tedious and time-consuming. In addition, it is difficult for novice users in the self-media industry to decide at what time point to add material and what material to add. Some clients provide material packages, but users still need to search for the materials to be added to the video by themselves, and the operation process remains complex. Some clients provide ready-made video templates, but such templates are largely homogeneous and are unsuitable for videos whose content differs from episode to episode.
The embodiment of the disclosure provides a video processing method, which can be applied to a terminal, in particular to a client installed on the terminal, where the client has a video editing function and supports video importing, material recommendation, video editing, video generation, and the like. In practical applications, the terminal includes, but is not limited to, a mobile phone, a tablet computer, a wearable device, a personal computer, or the like. As shown in fig. 1, the video processing method may include:
S101: uploading subtitles of a first video in response to receiving a material recommendation trigger operation for the first video;
S102: receiving at least one material group returned by the server based on the caption of the first video;
S103: determining a target material group based on the at least one material group;
S104: and adding the materials in the target material group into the first video to generate a second video.
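For illustration only, the following Python sketch walks through S101-S104 on the terminal side. It is a minimal sketch under assumptions: the data classes, the StubServer, and all function names are hypothetical stand-ins, not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Material:
    keyword: str   # keyword the material was recalled for
    content: str   # e.g. text, sticker path, or sound-effect id

@dataclass
class MaterialGroup:
    materials: List[Material] = field(default_factory=list)

class StubServer:
    """Stand-in for the real server; returns canned recommendations."""
    def upload_subtitles(self, subtitles: str) -> None:
        self.subtitles = subtitles
    def fetch_material_groups(self) -> List[MaterialGroup]:
        return [MaterialGroup([Material("bookshelf", "big bookshelf")])]

def process_video(subtitles: str, server: StubServer) -> List[Material]:
    server.upload_subtitles(subtitles)        # S101: upload subtitles on trigger
    groups = server.fetch_material_groups()   # S102: receive material groups
    target = groups[0]                        # S103: e.g. take the first group
    return target.materials                   # S104: to be composited into video

print(process_video("we should prepare a big wall of bookshelves", StubServer()))
```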
In the embodiment of the disclosure, the first video is a video to be subjected to material addition processing. Here, the first video may be a video just recorded by the user, a newly imported video recorded by the user earlier, or a locally stored video.
Here, the client may be provided with a material recommendation function. In some embodiments, when receiving an instruction which is input by a user through voice and triggers a material recommendation function, the terminal determines that a material recommendation triggering operation aiming at the first video is received. In other embodiments, the terminal detects an operation of the material recommendation trigger key, and determines that the material recommendation trigger operation for the first video is received. In still other embodiments, when detecting that the recording completion time of the first video exceeds the preset time value, the terminal automatically triggers and starts the material recommendation function, and determines that a material recommendation triggering operation for the first video is received. The present disclosure does not limit the triggering manner of the material recommendation triggering operation.
In the embodiment of the disclosure, the subtitles of the first video are the subtitles corresponding to the audio in the first video. The subtitles may be recognized by the terminal from the audio of the first video, or recognized by the server from that audio. The present disclosure does not limit how the subtitles are obtained from the audio of the first video, nor which side performs the recognition.
In the embodiment of the present disclosure, the number of material groups is not limited; it can be set or adjusted according to the user's needs. Illustratively, the terminal instructs the server to return a certain number of material groups, such as 3. When the server returns a plurality of material groups, the terminal displays them for selection, which not only provides diverse material groups for the first video but also enriches the editing possibilities.
In the embodiment of the disclosure, the target material group is a material group specified by a user to be added to the first video. When a plurality of material groups are received, the target material group may be one of the plurality of material groups. When only one material group is received, the target material group may be the material group, or may be a new material group obtained by adding or deleting materials in the material group.
In some implementations, determining the target material group based on the at least one material group includes: receiving a first operation, where the first operation is used to indicate the selected material group; and determining the target material group from the at least one material group based on the first operation. In other embodiments, determining the target material group based on the at least one material group includes: determining the target material group from the at least one material group according to the receiving order or recommendation order of each material group. For example, the material group with the earliest reception time is set as the target material group. For another example, the top-ranked material group is set as the target material group. The above are merely examples; the possible ways of determining the target material group are not limited thereto, and are not exhaustively listed here.
In the embodiment of the disclosure, the second video is a video to which a material is added. The second video is more varied, vivid and interesting in presentation form than the first video.
In the embodiment of the disclosure, the types of the materials in the material group are not limited. Illustratively, the material is classified into non-text type material and text type material according to the presence or absence of text. Also exemplary, the materials are classified into non-sound effect type materials and sound effect type materials according to the presence or absence of sound effects.
According to the above technical scheme, the terminal uploads the subtitles of a first video in response to receiving a material recommendation trigger operation for the first video; receives at least one material group returned by the server based on the subtitles of the first video; determines a target material group based on the at least one material group; and adds the materials in the target material group to the first video to generate a second video. Compared with sending the whole first video to the server, sending only its subtitles reduces the amount of data to be transmitted and increases the transmission speed, and makes it convenient for the server to quickly screen out material groups suited to the first video, so that the terminal obtains them quickly, improving video editing efficiency. In addition, compared with editing the first video on the server side, having the terminal determine the target material group from the at least one material group reduces the server's load, strengthens the terminal's autonomy in editing the video, allows materials to be added to the video automatically, and improves video editing efficiency.
In some embodiments, before S101, the video processing method may further include: uploading the audio of the first video to a server under the condition that the first video is received; receiving subtitles of a first video returned by a server based on the audio of the first video; the subtitles of the first video are saved.
Here, the present disclosure does not limit how audio is acquired from the first video. For example, the terminal is provided with an audio and image separation function, and after receiving the first video, the terminal obtains the audio of the first video by utilizing the function. For another example, the terminal extracts audio from the first video through an audio extraction technique.
Here, the present disclosure does not limit a storage location of subtitles of a first video.
Here, the subtitle of the first video can be used as a basis for requesting the material group from the server, and can be used as a subtitle of the generated second video.
Therefore, after the first video is imported, the audio of the first video is uploaded to the server, so that powerful computing resources of the server can be fully utilized, subtitles of the first video can be quickly identified and obtained, and further preparation is made for a subsequent request of a terminal for a material group. Compared with the processing mode of uploading the first video to the server to acquire the subtitles, the method improves the subtitle acquisition speed, provides subtitle basis for the subsequent request of the material group to the server, and is convenient for quickly acquiring the material group from the server, thereby being beneficial to improving the video editing efficiency.
In some embodiments, adding material in the target material group to the first video includes: determining the appearance time of each material in the target material group in the first video; and adding materials corresponding to the corresponding appearance time on the time axis of the first video.
In some embodiments, determining the time of occurrence of each material in the target set of materials at the first video includes: and searching the keywords included in the target material group from the subtitles of the first video according to the corresponding relation between the materials in the target material group and the keywords, and taking the appearance time of the keywords in the first video as the appearance time of the materials corresponding to the keywords.
Illustratively, the material group includes m materials, which are respectively denoted as m1, m2, …, and mm; the material m1 corresponds to the keyword c1, the material m2 corresponds to the keyword c2, …, and the material mm corresponds to the keyword cm; if the time corresponding to the keyword c1 is t1, the time corresponding to the keyword c2 is t2 and …, the time corresponding to the keyword cm is tm, the time corresponding to the material m1 is t1, the time corresponding to the material m2 is t2 and …, and the time corresponding to the material mm is tm.
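The keyword-to-time matching described above can be sketched as follows, assuming subtitle cues carry start times; match_occurrence_times and the cue format are hypothetical illustrations, not the disclosure's implementation.

```python
from typing import Dict, List, Tuple

def match_occurrence_times(
    materials: Dict[str, str],              # material id -> its keyword, e.g. {"m1": "c1"}
    subtitle_cues: List[Tuple[float, str]]  # (start time in seconds, cue text)
) -> Dict[str, float]:
    times: Dict[str, float] = {}
    for material_id, keyword in materials.items():
        for start, text in subtitle_cues:
            if keyword in text:
                times[material_id] = start  # time of the keyword becomes time of the material
                break
    return times

cues = [(0.0, "hello everyone"), (3.2, "prepare a big bookshelf")]
print(match_occurrence_times({"m1": "bookshelf"}, cues))  # {'m1': 3.2}
```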
Therefore, the occurrence time of the materials in the first video can be automatically matched, and the intelligence of video editing is improved.
In some embodiments, adding material in the target material group to the first video, generating a second video, includes: determining display information of each material in the target material group; each material is added to an image in the first video based on the display information of each material.
Wherein the display information includes, but is not limited to: display position, display angle, and display size.
Here, the display position is a position of a material in a video image or picture, including display coordinates.
Here, the display angle is an angle of a material with respect to a horizontal line or a vertical line in a video image or picture. The horizontal and vertical lines herein may be with respect to the terminal display.
Here, the display size is the size of a material in a video image or picture.
FIG. 2 shows a schematic diagram of displaying materials in a video image. As shown in FIG. 2, when the anchor mentions "we should prepare a big wall of bookshelves for the children", picture-type material, such as a wall with a bookshelf as background, is displayed behind the anchor, together with text-type materials such as "Whole wall!" and "big bookshelf". As can be seen from FIG. 2, "Whole wall!" is displayed at a slight tilt, at an angle to the horizontal, which makes the whole picture livelier, while "big bookshelf" is displayed parallel to the horizontal. The two texts use different font sizes, and the background wall covers the background of the whole picture, so that the content of the image is richer and its forms of expression are more varied.
In this way, by adjusting display information such as the display position, display size, and display angle of each material, the display forms of the materials can be expanded and their accentuating effect improved.
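As a rough illustration of the display information described above, the following sketch models a material's display position, angle, and size; the field names and the bounds check are assumptions, not the disclosure's implementation.

```python
from dataclasses import dataclass

@dataclass
class DisplayInfo:
    x: int            # display position, pixels from the left of the frame
    y: int            # display position, pixels from the top of the frame
    angle_deg: float  # display angle relative to the horizontal
    scale: float      # display size as a fraction of the frame width

def fits_frame(info: DisplayInfo, frame_w: int, frame_h: int) -> bool:
    # crude check that the material's anchor point lies inside the frame
    return 0 <= info.x < frame_w and 0 <= info.y < frame_h

tilted_text = DisplayInfo(x=820, y=140, angle_deg=-8.0, scale=0.25)  # e.g. "Whole wall!"
print(fits_frame(tilted_text, 1280, 720))  # True
```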
In some embodiments, adding material in the target material group to the first video, generating a second video, includes: under the condition that a first material exists in the target material group, determining the volume of the first material according to a preset volume ratio; the volume of the first material is added to the audio in the first video.
The preset volume ratio is equal to the ratio between the material volume and the video volume.
Here, the first material is an audio-effect-like material such as wind sound, rain sound, tsunami sound, bird song, or the like.
Here, the preset volume ratio corresponding to the different first materials may be different. The preset volume ratio may be determined according to a content attribute of the first material. For example, the preset volume ratio with the content attribute of tsunami may be greater than the preset volume ratio with the content attribute of bird song. For another example, the preset volume ratio with the content attribute of rain sound may be smaller than the preset volume ratio with the content attribute of wind sound.
Generally, when the original volume of a sound-effect material is high, a volume ratio may be set to prevent the sound effect from being excessively prominent.
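A minimal sketch of the preset volume ratio follows; the per-attribute ratio values and the default are illustrative assumptions chosen only to match the ordering described above (tsunami greater than birdsong, rain smaller than wind).

```python
# assumed per-content-attribute ratios of material volume to video volume
PRESET_VOLUME_RATIOS = {"tsunami": 0.6, "wind": 0.4, "rain": 0.3, "birdsong": 0.2}

def material_volume(video_volume: float, content_attribute: str) -> float:
    ratio = PRESET_VOLUME_RATIOS.get(content_attribute, 0.3)  # default is assumed
    return video_volume * ratio  # keeps the effect from overwhelming the speech

print(material_volume(1.0, "tsunami"))   # 0.6
print(material_volume(1.0, "birdsong"))  # 0.2
```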
In this way, the volume of the material is adapted to the volume of the first video, so that the material plays a better accentuating role without affecting the volume of the first video.
In some embodiments, adding material in the target material group to the first video, generating a second video, includes: and under the condition that the second material exists in the target material group, identifying a target image in the first video corresponding to the second material, and adding the second material to a corresponding preset position in the target image.
Here, the second material is a material for decorating a target object in the target image. Target objects include, but are not limited to, humans, animals, plants, and the like.
Here, the second material mainly includes decorative or decorative material. Illustratively, the second material is material related to the video anchor. For example, the second material includes, but is not limited to, blush, red eye, eye shadow, red lip, dimple, etc. As another example, the second material includes, but is not limited to, bracelets, rings, necklaces, clothing, and the like.
Taking blush as an example of the second material, the target image is the current video image, and the preset positions are the positions of the anchor's two cheeks in the current video image.
Taking red eyes as an example of the second material, the target image is the current video image, and the preset positions are the positions of the anchor's two eyes in the current video image.
Taking a ring as an example of the second material, the target image is the current video image, and the preset position is the position of the anchor's ring finger in the current video image.
The embodiments of the present disclosure do not limit how the preset position of the target object in the target image is identified. For example, the positions of the facial features can be identified by face recognition technology. For another example, the hand position is identified by a hand detection technology.
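The following sketch illustrates one way the preset position could be resolved, assuming a detector that returns named landmarks; detect_landmarks and the ATTACH_POINTS mapping are hypothetical placeholders, since the disclosure does not mandate a particular recognition technique.

```python
from typing import Dict, Tuple

def detect_landmarks(frame) -> Dict[str, Tuple[int, int]]:
    # placeholder: a real implementation would run face/hand detection on the frame
    return {"left_cheek": (400, 300), "right_cheek": (520, 300)}

# which landmark each second-material type attaches to (assumed mapping)
ATTACH_POINTS = {"blush": ("left_cheek", "right_cheek"), "ring": ("ring_finger",)}

def place_second_material(frame, material_type: str):
    landmarks = detect_landmarks(frame)
    points = [landmarks[name] for name in ATTACH_POINTS.get(material_type, ())
              if name in landmarks]
    return points  # preset positions where the material would be composited

print(place_second_material(None, "blush"))  # [(400, 300), (520, 300)]
```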
In this way, adding the second material at the corresponding preset position in the target image places it at the designated position, which enriches the diversity of the video image and saves the user the time cost of manually adjusting the image, such as retouching.
Fig. 3 shows a schematic diagram of a comparison of video images before and after adding material. As shown in the left-hand diagram of fig. 3, the anchor in the image says "first we can turn on the prompter" and there is no material in the image. After adding the material to the image shown in the left graph of fig. 3, the adding effect is shown in the right graph of fig. 3, specifically, the anchor in the image says "first we can open the prompter", 2 materials appear in the image, including: "first" and "try one go". Obviously, the display effect after the material is added is obviously better than the display effect before the material is added.
The embodiment of the disclosure provides a video processing method, which can be applied to a server, wherein the server has a material recommending function and supports material searching, material screening, material group generation and the like. In practical applications, the server includes, but is not limited to, a general server, a cloud server, and the like. As shown in fig. 4, the video processing method may include:
S401: receiving a caption of a first video, wherein the caption of the first video is uploaded by a terminal under the condition of receiving a material recommendation triggering operation aiming at the first video;
S402: identifying at least one keyword of the first video from subtitles of the first video;
S403: determining at least one material group for the first video based on the at least one keyword;
S404: the at least one material group is transmitted, the at least one material group being used to indicate materials available to be added to the first video.
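For illustration only, a minimal server-side sketch of S402-S404 follows; extract_keywords, recall_materials, and the round-robin grouping policy are assumptions, not the disclosure's method.

```python
from typing import Callable, List

def handle_subtitles(
    subtitles: str,
    extract_keywords: Callable[[str], List[str]],
    recall_materials: Callable[[str], List[str]],
    group_count: int = 3,
) -> List[List[str]]:
    keywords = extract_keywords(subtitles)                       # S402
    per_keyword = {kw: recall_materials(kw) for kw in keywords}  # S403
    groups = []
    for i in range(group_count):
        # the i-th group takes the i-th candidate (cycled) for each keyword
        group = [cands[i % len(cands)] for cands in per_keyword.values() if cands]
        groups.append(group)
    return groups                                                # S404

demo = handle_subtitles(
    "prepare a big bookshelf",
    extract_keywords=lambda s: ["bookshelf"],
    recall_materials=lambda kw: [f"{kw}-sticker", f"{kw}-text"],
)
print(demo)  # [['bookshelf-sticker'], ['bookshelf-text'], ['bookshelf-sticker']]
```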
In the embodiment of the disclosure, the material group includes at least one material. The present disclosure does not limit the number of materials included in a material group. The number of materials in the material group generally depends on the number of keywords identified from the subtitles. Taking a material group as an example, if a material is allocated to each keyword, the number of materials in the material group may be equal to the number of keywords in the subtitle. If a plurality of materials are distributed for part of keywords, the number of the materials in the material group is larger than the number of the keywords in the subtitles.
In the embodiment of the disclosure, the materials in a material group match the information represented by the keywords. For example, when the keyword is "beach", the materials matched for it are all beach-related, such as beach pictures. When the keyword is "bookshelf", the matched materials are bookshelf-related, such as the text "bookshelf" or a bookshelf picture.
The embodiments of the present disclosure are not limited to the source of the material. For example, the material may be derived from a material database, from material actively uploaded by the terminal, or from material obtained from a third party, such as a website.
In this way, the server can rapidly screen out material groups suited to the first video based on its subtitles, so that the terminal can rapidly acquire at least one material group. Having the server provide material groups to the terminal increases the terminal's autonomy in editing video and helps improve video editing efficiency on the terminal side. In addition, compared with editing the first video on the server, this reduces the server's load, allowing it to provide material recommendation services to more terminals at the same time.
In some embodiments, the video processing method may further include: under the condition that the audio of the first video is received, the audio of the first video is identified, and subtitles of the first video are obtained; and transmitting the caption of the first video. Here, the audio of the first video may be uploaded by the terminal in case of receiving the first video. For example, after the terminal imports the first video, the audio of the first video is uploaded to the server.
In some embodiments, identifying the audio of the first video to obtain its subtitles includes: the server converts the audio into text using speech recognition technology, and generates subtitles based on the text. In other embodiments, the audio is converted into text by an audio converter, and subtitles are generated based on the text. The present disclosure does not limit how subtitles are specifically obtained by recognizing the audio.
Therefore, the terminal can be provided with the subtitles rapidly by utilizing powerful computing resources of the server, the time consumed by the terminal for generating the subtitles is saved, and the speed of acquiring the subtitles at the terminal side can be improved.
In some embodiments, identifying at least one keyword of the first video from the subtitles of the first video includes: splitting the subtitles of the first video to obtain a plurality of first target words of the first video; searching for the plurality of first target words in a preset vocabulary library; and taking at least one first target word found in the preset vocabulary library as at least one keyword of the first video.
Here, a first target word is a word obtained by splitting the subtitles. Illustratively, splitting the subtitle "please open Baidu Map" yields the first target words "please", "open", and "Baidu Map".
Here, the preset vocabulary library stores a large number of words.
Here, the sources of the words in the preset vocabulary library include, but are not limited to: (1) dictionaries, such as a general dictionary and common entries provided by Baidu Baike; (2) numeric words, blessing words, and the like, such as "thirteen hundred million" and "Dajidali" (a common blessing phrase); (3) self-built mood words: a set of interaction words and chapter words, such as "follow me" and "first", determined by means such as sampling self-media videos and in-team assessment.
It should be noted that words in the preset vocabulary library may be added or removed. For example, new words are added to the self-built mood words; for another example, outdated blessing words are pruned.
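A minimal sketch of the vocabulary lookup follows, assuming a naive whitespace tokenizer in place of a real word segmenter; the vocabulary entries are illustrative only.

```python
PRESET_VOCABULARY = {"first", "bookshelf", "beach"}  # illustrative entries

def keywords_from_vocabulary(subtitles: str) -> list:
    words = subtitles.split()  # stand-in for a real word-splitting method
    # first target words found in the vocabulary become keywords
    return [w for w in words if w in PRESET_VOCABULARY]

print(keywords_from_vocabulary("first we prepare a bookshelf"))  # ['first', 'bookshelf']
```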
In this way, determining the keywords in the subtitles based on the preset vocabulary library improves the speed of keyword determination, which facilitates the rapid generation of material groups for the first video.
In some embodiments, identifying at least one keyword of the first video from subtitles of the first video includes: extracting a plurality of second target words in the subtitles of the first video through a semantic recognition algorithm; combining at least two second target words in the plurality of second target words to obtain at least one combined word; at least one combined word is used as at least one keyword of the first video.
Here, a second target word is a word recognized by the semantic recognition algorithm. Illustratively, a second target word is a word carrying a certain amount of information, such as "Baidu Baike", "birthday", "happy", or "first".
Here, a combined word is a word formed by combining at least two second target words. Illustratively, "birthday" and "happy" are combined into the combined word "happy birthday". Likewise, "first" and "search purpose" are combined into the combined word "first, search purpose", and "second" and "search results" are combined into "second, search results". By combining keywords in this way, more complex keywords are formed, which are better suited to clearly structured chapter titles.
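The combination step can be sketched as follows, under the assumption that combinable adjacent word patterns are known in advance; the COMBINABLE table is hypothetical, as the disclosure only states that at least two second target words are combined.

```python
# assumed table of adjacent word pairs that merge into one combined word
COMBINABLE = {("birthday", "happy"): "happy birthday",
              ("first", "search purpose"): "first, search purpose"}

def combine_words(second_target_words: list) -> list:
    combined = []
    i = 0
    while i < len(second_target_words):
        pair = tuple(second_target_words[i:i + 2])
        if pair in COMBINABLE:
            combined.append(COMBINABLE[pair])  # merge the two words
            i += 2
        else:
            i += 1
    return combined

print(combine_words(["first", "search purpose"]))  # ['first, search purpose']
```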
In this way, combining keywords makes the determined keywords more consistent with the semantics and context, which improves the rationality of the selected keywords and thus the suitability of the material groups.
In some embodiments, determining at least one material group for the first video based on the at least one keyword comprises: recalling a plurality of materials of at least one material type from the material database for each of the at least one keyword; determining a target number of materials for each keyword from a plurality of materials of at least one material type corresponding to each keyword; and determining at least one material group for the first video according to the target number of materials corresponding to each keyword.
Here, a large amount of material is stored in the material database. The present disclosure does not limit the number of material databases. In practical applications, different types of materials may be stored in one material database, or in different material databases with each database storing one type of material; for example, material database 1 stores text-type materials, material database 2 stores sticker-type materials, material database 3 stores sound-type materials, and material database 4 stores animation-type materials.
Here, the target number may be specified by the terminal side or may be determined by the server side. For example, the target number=5, i.e., 5 materials are selected for each keyword.
Therefore, a plurality of material groups can be provided for the first video, so that abundant material support can be provided for the terminal side, and support can be provided for video clips of different styles on the terminal side.
In some embodiments, determining a target number of materials for each keyword from a plurality of materials of at least one material type corresponding to each keyword includes: the priority ranking is carried out on a plurality of materials of the same type of each keyword respectively; selecting different types of materials to be recommended for each keyword according to the priority ordering condition of a plurality of materials of the same type of each keyword; and determining a target number of materials for each keyword according to the different types of materials to be recommended of each keyword.
Here, selecting different types of materials to be recommended for each keyword may include: and selecting materials which are ranked in the top of the priority and meet the expected number for each keyword as materials to be recommended.
It should be noted that the expected number corresponding to different types of materials may be different.
For example, keyword 1, keyword 2, and keyword 3 are identified from the subtitles of the first video. Materials S11 and S12 of a first type, material S21 of a second type, and material S31 of a third type are selected for keyword 1; similarly, material S13 of the first type, material S22 of the second type, and material S32 of the third type are selected for keyword 2; and materials S14, S15, and S16 of the first type, material S23 of the second type, and material S33 of the third type are selected for keyword 3. Then, a plurality of material groups may be generated, such as material group 1 = {S11, S13, S14}, material group 2 = {S12, S13, S15}, material group 3 = {S21, S22, S33}, and material group 4 = {S31, S22, S33}; they are not exhaustively listed here.
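The per-keyword selection in the worked example above can be sketched as below; the priority scores and the expected counts per type are assumed values, not from the disclosure.

```python
from typing import Dict, List, Tuple

EXPECTED_PER_TYPE = {"text": 2, "sticker": 1, "sound": 1}  # assumed expected counts

def select_for_keyword(
    candidates: Dict[str, List[Tuple[str, float]]]  # material type -> [(material, score)]
) -> List[str]:
    chosen: List[str] = []
    for mtype, scored in candidates.items():
        ranked = sorted(scored, key=lambda m: m[1], reverse=True)  # priority ranking
        keep = EXPECTED_PER_TYPE.get(mtype, 1)
        chosen.extend(material for material, _ in ranked[:keep])   # top-ranked only
    return chosen  # the keyword's target number of materials

demo = select_for_keyword({
    "text": [("S11", 0.9), ("S12", 0.7), ("S13", 0.4)],
    "sticker": [("S21", 0.8)],
})
print(demo)  # ['S11', 'S12', 'S21']
```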
Therefore, the target number of the materials of each keyword can be determined according to the materials to be recommended corresponding to different types of each keyword, the diversity of the material groups which can be generated is improved, and the requirement of frequently changing the content and the form of the materials at the terminal side can be further met.
In some embodiments, after generating the material group, the method may further include: and under the condition that two or more than two identical keywords exist in the material group, carrying out de-duplication processing on the materials corresponding to the identical keywords.
For example, a material group includes a material of keyword 1, a material of keyword 2, a material of keyword 3, and a material of keyword 4. If keyword 1 and keyword 3 are the same keyword, the material of keyword 1 and the material of keyword 3 are subjected to de-duplication processing, and the material of keyword 1 or the material of keyword 3 is reserved. Note that although keyword 1 and keyword 3 are the same keyword, the material of keyword 1 and the material of keyword 3 may be the same or different.
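A minimal sketch of this same-keyword de-duplication, assuming the group is a list of (keyword, material) pairs and the first occurrence is the one reserved:

```python
def dedupe_by_keyword(group):  # group: list of (keyword, material) pairs
    seen = set()
    deduped = []
    for keyword, material in group:
        if keyword in seen:
            continue  # e.g. keyword 3 repeating keyword 1 is dropped
        seen.add(keyword)
        deduped.append((keyword, material))
    return deduped

group = [("k1", "m1"), ("k2", "m2"), ("k1", "m3"), ("k4", "m4")]
print(dedupe_by_keyword(group))  # [('k1', 'm1'), ('k2', 'm2'), ('k4', 'm4')]
```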
In this way, de-duplicating materials with the same keyword within the same video reduces the occurrence of large numbers of repeated materials in one video, and improves the accentuating effect of the materials.
In some embodiments, after generating a material group, the method may further include: in a case where a plurality of materials of a target type exist in the material group, performing recommendation-density de-duplication processing on the plurality of materials of the target type.
Here, the target type may include one or more material types. For example, the target type may be material that enhances the atmosphere, such as falling flowers and animations that fly in and out.
Here, the frequency deduplication process includes: if there are multiple materials of the target type in the preset time period, one material in the preset time period is reserved, for example, the first material is reserved.
Here, the preset time period may be set or adjusted according to the baking effect. The preset time period may be set to 1 second or 10 seconds, for example.
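A minimal sketch of this density filtering, assuming target-type materials carry their appearance times and the first material in each window is the one reserved:

```python
def density_filter(timed_materials, window_s: float = 10.0):
    # timed_materials: list of (time_s, material); window_s is an example value above
    kept, last_kept = [], None
    for t, material in sorted(timed_materials):
        if last_kept is None or t - last_kept >= window_s:
            kept.append((t, material))
            last_kept = t  # later materials inside the same window are dropped
    return kept

effects = [(1.0, "confetti"), (4.0, "fly-in"), (15.0, "confetti")]
print(density_filter(effects))  # [(1.0, 'confetti'), (15.0, 'confetti')]
```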
In this way, de-duplicating by recommendation density improves the accentuating effect of the materials.
Fig. 5 shows a schematic interaction diagram of the terminal and the server. As shown in fig. 5, the interaction flow includes: the terminal client imports the video and uploads its audio to the server; the server recognizes subtitles from the audio and returns them to the terminal client. When the terminal client receives a trigger operation of the material recommendation function, it uploads the subtitles to the server, so that the server extracts keywords from the subtitles, screens the keywords, recalls materials, and performs priority ranking, density filtering, and the like on the materials to obtain a plurality of material groups. The server returns the recommended material groups to the terminal client, and finally the terminal client adds the materials in a material group according to the material-type strategy.
In this way, the complex material-adding process is shortened and video editing efficiency is improved; the threshold for video production is lowered, so that novice users can easily produce interesting videos; and points of interest overlooked by the user can be discovered actively, creating more possibilities and further improving the product's popularity.
It should be understood that the interaction diagram shown in fig. 5 is merely exemplary and not limiting, and that it is scalable, and that a person skilled in the art may make various obvious changes and/or substitutions based on the example of fig. 5, while still falling within the scope of the disclosure of the embodiments of the present disclosure.
The embodiment of the disclosure provides a video processing device, which is applied to a terminal, as shown in fig. 6, and may include: a first sending module 601, configured to upload subtitles of a first video in response to receiving a material recommendation triggering operation for the first video; a first receiving module 602, configured to receive at least one material group returned by a server based on a subtitle of the first video; a first determining module 603, configured to determine a target material group based on the at least one material group; and the generating module 604 is configured to add materials in the target material group to the first video, and generate a second video.
In some embodiments, the first sending module 601 is further configured to upload, if the first video is received, audio of the first video to a server; the first receiving module 602 is further configured to receive a subtitle of the first video returned by the server based on the audio of the first video.
In some embodiments, the video processing apparatus may further include: a saving module 605 (not shown in the figure) is configured to save subtitles of the first video.
In some embodiments, the generating module 604 includes: a first determining submodule, configured to determine the appearance time of each material in the target material group in the first video; and a first adding submodule, configured to add each material at its corresponding appearance time on the time axis of the first video.
In some embodiments, the generating module 604 includes: the second determining submodule is used for determining display information of each material in the target material group; and the second adding sub-module is used for adding each material in the target material group into the image of the first video based on the display information of each material in the target material group.
In some embodiments, the generating module 604 includes: the third determining submodule is used for determining the volume of the first material according to a preset volume ratio under the condition that the first material exists in the target material group; and the third adding sub-module is used for adding the volume of the first material to the audio of the first video.
In some embodiments, the generating module 604 includes: the first identification sub-module is used for identifying a target image in the first video corresponding to the second material under the condition that the second material exists in the target material group; and the fourth adding sub-module is used for adding the second material to the preset position of the target image.
It should be understood by those skilled in the art that the functions of each processing module in the video processing apparatus of the embodiments of the present disclosure may be understood by referring to the foregoing description of the video processing method applied to the terminal. Each processing module may be implemented by an analog circuit that implements the functions described in the embodiments of the present disclosure, or by running, on an electronic device, software that executes those functions.
The video processing apparatus of the embodiments of the present disclosure can automatically add materials to a video and improve video editing efficiency.
The embodiment of the present disclosure provides a video processing apparatus applied to a server, as shown in fig. 7, which may include: a second receiving module 701, configured to receive a caption of a first video, where the caption of the first video is uploaded by a terminal when receiving a material recommendation triggering operation for the first video; a first identifying module 702, configured to identify at least one keyword of the first video from subtitles of the first video; a second determining module 703, configured to determine at least one material group for the first video based on the at least one keyword; a second sending module 704, configured to send at least one material group, where the at least one material group is used to indicate materials available for the first video to add.
In some embodiments, the second receiving module 701 is further configured to receive audio of the first video, where the audio of the first video is uploaded by the terminal when the first video is received.
In some embodiments, the video processing apparatus further comprises: a second recognition module 705 (not shown in the figure) is configured to recognize the audio of the first video, and obtain a subtitle of the first video. Correspondingly, the second sending module 704 is further configured to send subtitles of the first video.
In some embodiments, the first identification module 702 includes: the splitting module is used for splitting the subtitles of the first video to obtain a plurality of first target words of the first video; the second recognition sub-module is used for taking at least one first target word which can be found in the preset word list library as at least one keyword of the first video.
In some embodiments, the first identification module 702 includes: the extraction sub-module is used for extracting a plurality of second target words in the subtitles of the first video through a semantic recognition algorithm; the combination sub-module is used for combining at least two second target words in the plurality of second target words to obtain at least one combined word; and the third recognition sub-module is used for taking at least one combined word as at least one keyword of the first video.
In some embodiments, the second determining module 703 includes: a recall submodule, configured to recall, from the material database, a plurality of materials of at least one material type for each of the at least one keyword; a fourth determining submodule, configured to determine a target number of materials for each keyword from the plurality of materials of at least one material type corresponding to each keyword; and a fifth determining submodule, configured to determine at least one material group for the first video according to the target number of materials corresponding to each keyword.
In some embodiments, the fourth determination submodule is configured to: the priority ranking is carried out on a plurality of materials of the same type of each keyword respectively; selecting different types of materials to be recommended for each keyword according to the priority ordering condition of a plurality of materials of the same type of each keyword; and determining a target number of materials for each keyword according to the different types of materials to be recommended of each keyword.
In some embodiments, the video processing apparatus may further include: a first deduplication module 706 (not shown in the figure) is configured to perform deduplication processing on materials corresponding to a plurality of identical keywords when the plurality of identical keywords are included in one material group.
In some embodiments, the video processing apparatus may further include: a second deduplication module 707 (not shown in the figure), configured to perform recommendation-density de-duplication processing on a plurality of materials of a target type in a case where the plurality of materials of the target type are included in one material group.
It should be understood by those skilled in the art that the functions of each processing module in the video processing apparatus of the embodiments of the present disclosure may be understood by referring to the foregoing description of the video processing method applied to the server. Each processing module may be implemented by an analog circuit that implements the functions described in the embodiments of the present disclosure, or by running, on an electronic device, software that executes those functions.
The video processing apparatus of the embodiments of the present disclosure can quickly screen out material groups suited to the first video based on its subtitles, so that the terminal can quickly acquire at least one material group. Having the server provide material groups to the terminal increases the terminal's autonomy in editing video and improves video editing efficiency on the terminal side.
Fig. 8 shows a schematic view of a video processing scenario, and as can be seen from fig. 8, after receiving the audio uploaded from each terminal, the electronic device, such as a cloud server, generates and returns corresponding subtitles for each audio. After receiving the subtitles uploaded by each terminal, the electronic equipment identifies the subtitles based on a preset vocabulary library to obtain keywords; and recalling the materials from the material database based on the keywords, generating a material group, returning the corresponding material group to the terminal, and editing the video based on the material group by the terminal to generate the video added with the materials.
Several video editing scenarios are listed below. For example, after a user records a talking-head video, the video is imported into an editing client on the terminal, and the editing client automatically edits it into a video with materials added. For another example, after the user records a video through the editing client on the terminal, the user triggers the material recommendation function, selects one of the material groups returned by the server as the target material group, and the editing client generates a new video based on the target material group.
It should be understood that the scenario diagram shown in fig. 8 is merely illustrative and not restrictive. Various obvious changes and/or substitutions may be made by those skilled in the art based on the example of fig. 8, and the resulting technical solutions still fall within the scope of the embodiments of the present disclosure.
In the technical scheme of the present disclosure, the acquisition, storage, and application of the personal information of users involved all comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 902 or loaded from a storage unit 908 into a random access memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, or wireless communication transceiver. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, for example, a video processing method. For example, in some embodiments, the video processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the video processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the video processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above can be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a local area network (LAN), a wide area network (WAN), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved, which is not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (25)

1. A video processing method, comprising:
uploading audio of a first video from a terminal to a server;
receiving, by the terminal, subtitles of the first video returned based on the audio from the server;
uploading the subtitles of the first video from the terminal to the server in response to receiving a material recommendation trigger operation for the first video;
receiving, by the terminal, a material group returned based on the subtitles of the first video from the server; and
generating a second video by adding material in the material group to the first video.
2. The method of claim 1, wherein generating the second video by adding material in the material group to the first video comprises:
determining an occurrence time of each material in the material group in the first video; and
adding each material at its corresponding occurrence time on the time axis of the first video.
3. The method of claim 2, wherein determining the occurrence time of each material in the material group in the first video comprises:
searching the subtitles of the first video for the keywords included in the material group according to the correspondence between the materials in the material group and the keywords; and
determining the occurrence time of each keyword in the first video as the occurrence time of the material corresponding to that keyword.
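By way of illustration only, the timing step of claims 2-3 might be sketched as below, under assumed data shapes: each material is placed at the time its keyword first appears in the subtitles.

def occurrence_times(material_group, timed_subtitles):
    # timed_subtitles: list of (word, start_time) pairs on the video time axis.
    first_seen = {}
    for word, t in timed_subtitles:
        first_seen.setdefault(word, t)  # keep the first appearance only
    return {
        m["id"]: first_seen[m["keyword"]]
        for m in material_group
        if m["keyword"] in first_seen
    }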
4. The method of claim 1, wherein generating the second video by adding material in the material group to the first video further comprises:
determining display information of the materials in the material group; and
adding each material to an image in the first video based on the display information of the material.
5. The method of claim 4, wherein the display information includes a display position, a display angle, and a display size, the display position being display coordinates of the material in an image of the first video, the display angle being an angle of the material in the image relative to a horizontal or vertical line, and the display size being a size of the material in the image.
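As an illustrative reading of claim 5 only, the display information could be modeled as a small structure; the field names and units are assumptions of the example.

from dataclasses import dataclass

@dataclass
class DisplayInfo:
    x: int        # display coordinate of the material in the frame (pixels)
    y: int
    angle: float  # rotation relative to the horizontal line, in degrees
    width: int    # display size of the material in the frame
    height: int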
6. The method of claim 1, wherein generating the second video by adding material in the material group to the first video further comprises:
in a case where a first material of a sound-effect type exists in the material group, determining a volume of the first material according to a preset volume ratio; and
adding the first material, at the determined volume, to the audio of the first video.
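One illustrative sketch of claim 6 follows; peak-based scaling, the ratio value, and mixing from the start of the audio are assumptions (non-silent sample lists are assumed), not disclosed details.

def mix_sound_effect(video_audio, effect, preset_ratio=0.3):
    # Scale the effect so its peak sits at preset_ratio of the video audio's peak.
    video_peak = max(abs(s) for s in video_audio)
    effect_peak = max(abs(s) for s in effect)
    gain = preset_ratio * video_peak / effect_peak
    # Mix the scaled effect into the video audio (here, from time zero).
    mixed = list(video_audio)
    for i, sample in enumerate(effect[:len(mixed)]):
        mixed[i] += gain * sample
    return mixed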
7. The method of claim 1, wherein generating the second video by adding material in the material group to the first video further comprises:
in a case where a second material of a decoration type exists in the material group, identifying a target image in the first video corresponding to the second material; and
adding the second material at a corresponding preset position in the target image.
8. The method of claim 1, wherein receiving, by the terminal, the material group returned based on the subtitles of the first video from the server comprises:
receiving a plurality of material groups returned by the server based on the subtitles; and
determining the material group from the plurality of material groups.
9. The method of claim 1, wherein, within a preset time period, the material group does not include a plurality of materials of a target type.
10. The method of claim 1, wherein the materials corresponding to the same keyword in the material group are different.
11. The method of claim 1, wherein the group of materials includes text-type materials, decal-type materials, sound-type materials, and animation-type materials.
12. A video processing method, comprising:
receiving, by a server, audio of a first video from a terminal;
deriving subtitles of the first video based on the audio;
transmitting the subtitles of the first video from the server to the terminal;
receiving, by the server, the subtitles of the first video from the terminal, the subtitles being uploaded by the terminal upon receiving a material recommendation trigger operation for the first video;
determining a material group for the first video based on the subtitles; and
sending the material group from the server to the terminal, wherein the material group indicates materials that can be added to the first video.
13. The method of claim 12, wherein determining the material group for the first video based on the subtitles comprises:
identifying at least one keyword from the subtitles; and
determining the material group for the first video based on the at least one keyword.
14. The method of claim 13, wherein identifying the at least one keyword from the subtitles comprises:
splitting the subtitles of the first video to obtain a plurality of first target words of the first video; and
taking at least one first target word found in a preset vocabulary library as the at least one keyword of the first video.
15. The method of claim 14, wherein identifying the at least one keyword from the subtitles further comprises:
extracting a plurality of second target words from the subtitles through a semantic recognition algorithm;
combining at least two of the plurality of second target words to obtain at least one combined word; and
using the at least one combined word as the at least one keyword.
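By way of illustration only, claims 14-15 might be sketched as follows; the whitespace split and the combine hook stand in for the word segmentation and semantic recognition algorithms, which the disclosure does not name.

def extract_keywords(subtitle_text, vocabulary, combine=None):
    # Split the subtitles into candidate first target words.
    first_target_words = subtitle_text.split()
    # Keep the words found in the preset vocabulary library.
    keywords = [w for w in first_target_words if w in vocabulary]
    # Optionally add combined words produced by a semantic step,
    # e.g. combine(["birthday", "cake"]) -> ["birthday cake"].
    if combine is not None:
        keywords.extend(combine(first_target_words))
    return keywords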
16. The method of claim 13, wherein determining the material group for the first video based on the at least one keyword comprises:
recalling a plurality of materials of at least one type for each of the at least one keyword from a material database;
determining a target number of materials for each keyword based on the plurality of materials of the at least one type corresponding to that keyword; and
determining the material group for the first video according to the target number of materials corresponding to each keyword.
17. The method of claim 16, wherein determining the target number of materials for each keyword based on the plurality of materials of the at least one type corresponding to each keyword comprises:
ranking the plurality of materials of the same type for each keyword by priority;
selecting materials of different types to be recommended for each keyword according to the priority ranking of the same-type materials; and
determining the target number of materials for each keyword according to the different types of materials to be recommended for that keyword.
18. The method of claim 12, further comprising:
in a case where one material group includes a plurality of identical keywords, performing deduplication processing on the materials corresponding to the identical keywords.
19. The method of claim 12, further comprising:
in a case where a plurality of materials of a target type are included in one material group, performing deduplication processing on the plurality of materials of the target type.
20. The method of claim 12, wherein the group of materials includes text-type materials, decal-type materials, sound-type materials, and animation-type materials.
21. A terminal for video processing, comprising:
an audio uploading module configured to upload audio of a first video to a server;
a subtitle receiving module configured to receive, from the server, subtitles of the first video returned based on the audio;
a first sending module configured to upload the subtitles of the first video to the server in response to receiving a material recommendation trigger operation for the first video;
a first receiving module configured to receive, from the server, a material group returned based on the subtitles of the first video; and
a generation module configured to generate a second video by adding material in the material group to the first video.
22. A server for video processing, comprising:
an audio receiving module configured to receive audio of a first video from a terminal;
a second recognition module configured to obtain subtitles of the first video based on the audio;
a subtitle transmission module configured to transmit the subtitles of the first video to the terminal;
a second receiving module configured to receive the subtitles of the first video from the terminal, the subtitles being uploaded by the terminal upon receiving a material recommendation trigger operation for the first video;
a second determining module configured to determine a material group for the first video based on the subtitles; and
a second sending module configured to send the material group to the terminal, wherein the material group indicates materials that can be added to the first video.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor,
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-20.
24. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-20.
25. A computer program product tangibly stored on a non-transitory computer readable medium and comprising computer instructions that, when executed, cause a machine to implement the method of any one of claims 1-20.
CN202410179053.8A 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium Pending CN118055292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410179053.8A CN118055292A (en) 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202410179053.8A CN118055292A (en) 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium
CN202210553116.2A CN115022712B (en) 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202210553116.2A Division CN115022712B (en) 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118055292A true CN118055292A (en) 2024-05-17

Family

ID=83069388

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202410179051.9A Pending CN118055291A (en) 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium
CN202210553116.2A Active CN115022712B (en) 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium
CN202410179053.8A Pending CN118055292A (en) 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN202410179051.9A Pending CN118055291A (en) 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium
CN202210553116.2A Active CN115022712B (en) 2022-05-20 2022-05-20 Video processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (3) CN118055291A (en)

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157407A1 (en) * 2007-12-12 2009-06-18 Nokia Corporation Methods, Apparatuses, and Computer Program Products for Semantic Media Conversion From Source Files to Audio/Video Files
JP4937218B2 (en) * 2008-09-12 2012-05-23 株式会社東芝 Metadata editing apparatus and metadata generation method
CN105868176A (en) * 2016-03-02 2016-08-17 北京同尘世纪科技有限公司 Text based video synthesis method and system
CN107172485B (en) * 2017-04-25 2020-01-31 北京百度网讯科技有限公司 method and device for generating short video and input equipment
CN107944397A (en) * 2017-11-27 2018-04-20 腾讯音乐娱乐科技(深圳)有限公司 Video recording method, device and computer-readable recording medium
CN110324676B (en) * 2018-03-28 2021-11-19 腾讯科技(深圳)有限公司 Data processing method, media content delivery method, device and storage medium
CN108600773B (en) * 2018-04-25 2021-08-10 腾讯科技(深圳)有限公司 Subtitle data pushing method, subtitle display method, device, equipment and medium
WO2020151008A1 (en) * 2019-01-25 2020-07-30 Microsoft Technology Licensing, Llc Automatically adding sound effects into audio files
CN110381371B (en) * 2019-07-30 2021-08-31 维沃移动通信有限公司 Video editing method and electronic equipment
CN112040263A (en) * 2020-08-31 2020-12-04 腾讯科技(深圳)有限公司 Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN112004138A (en) * 2020-09-01 2020-11-27 天脉聚源(杭州)传媒科技有限公司 Intelligent video material searching and matching method and device
CN114422844B (en) * 2021-12-22 2023-08-18 北京百度网讯科技有限公司 Barrage material generation method, recommendation method, device, equipment, medium and product

Also Published As

Publication number Publication date
CN115022712A (en) 2022-09-06
CN115022712B (en) 2023-12-29
CN118055291A (en) 2024-05-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination